Re: Please test/review my Regex to locate hyperlink in text [message #172486 is a reply to message #172485]
Mon, 21 February 2011 08:21
alvaro.NOSPAMTHANX
Messages: 277 Registered: September 2010
On 21/02/2011 7:06, Simon wrote:
> My requirements are as follows:
>
> 1) Find all hyperlinks in a given text document...
> 2) Parse for certain links and replace them if need be, (in another
> parser function).
> 3) Any attributes given in the hyperlink can be ignored and will be
> faithfully returned in the matches.
> 4) All JavaScript and the like is pre-stripped, so the text can be
> assumed to be 'safe' (if it is not safe, then it is not the job of this
> regex to handle it).
>
> // -----------------------------------
> // my pattern...
> $pattern = '/<a (.*?)href=[\"\']??(.*?)\/\/(.*?)[\s\"\'](.*?)>(.*?)<\/a>/i';
>
> // the call back function
> $body = preg_replace_callback($pattern, 'my_parser', $body);
>
> // -----------------------------------
>
> The way I see it this should work for...
>
> - <a href='example.com'>some text</a>
> - <a href="example.com">some text</a>
> - <a href=example.com>some text</a>
> - <a href='http://example.com'>some text</a>
> - <a href="http://example.com">some text</a>
> - <a href=http://example.com>some text</a>
>
> - <a href='example.com' target=_blank>some text</a>
> - <a href="example.com" target=_blank>some text</a>
> - <a href=example.com target=_blank>some text</a>
> - <a href='http://example.com' target=_blank>some text</a>
> - <a href="http://example.com" target=_blank>some text</a>
> - <a href=http://example.com target=_blank>some text</a>
>
> Can you poke holes in my regex, please? :)
> Any suggestions/better regexes?
If you are looking for <a> tags, then it isn't a plain text document;
it's an HTML document. Unless this is just an exercise to learn how to use
regular expressions, you can simply let an HTML parser (DOMDocument) find
the links, with something like this:
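<?php
$url = 'http://www.google.com';
$html = file_get_contents($url);
if ($html === false) {
    die("Could not fetch $url" . PHP_EOL);
}

$doc = new DOMDocument;
// Real-world HTML is rarely valid: collect parser warnings instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Print the text and the target of every <a> element.
foreach ($doc->getElementsByTagName('a') as $a) {
    echo $a->nodeValue . ': ' . $a->getAttribute('href') . PHP_EOL;
}
?>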
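For the replacement step in the quoted requirements, the same approach can probably be pointed at a string such as $body instead of a remote page. What follows is only a minimal sketch under some assumptions: rewrite_links() and its callback are hypothetical names (not part of PHP or of the original post), the input is an HTML fragment in plain ASCII, and PHP is at least 5.3.6 so that saveHTML() accepts a node.

```php
<?php
// Hypothetical helper: run every href in an HTML fragment through $rewrite.
// $rewrite is an assumed callback: it receives the old href, returns the new one.
function rewrite_links($body, $rewrite)
{
    $doc = new DOMDocument;
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its top-level nodes are easy to find again.
    $doc->loadHTML('<div>' . $body . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $a->setAttribute('href', $rewrite($a->getAttribute('href')));
        }
    }

    // Serialise only the children of the wrapper <div>,
    // not the <html>/<body> that loadHTML() adds around it.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Example: add a scheme to links that have none.
$body = '<a href=example.com target=_blank>some text</a>';
echo rewrite_links($body, function ($href) {
    return strpos($href, '//') === false ? 'http://' . $href : $href;
}) . PHP_EOL;
```

Note that the serialiser normalises the markup (attribute values come back quoted, for instance), so other attributes are preserved in meaning but not byte for byte.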
Afterwards, you can analyse URLs with parse_url():
http://es.php.net/manual/en/function.parse-url.php
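As a rough illustration of that last step, here is a small sketch of pulling the host out of an href. link_host() is a hypothetical helper name, not a PHP function; it only relies on parse_url() with the PHP_URL_HOST component flag. Note that an href without a scheme, such as example.com, is treated by parse_url() as a path, so it has no host.

```php
<?php
// Hypothetical helper: return the lower-cased host of an href,
// or null when the href has no host part (or cannot be parsed).
function link_host($href)
{
    // With a component flag, parse_url() returns a string, null (component
    // missing) or false (seriously malformed URL).
    $host = parse_url($href, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

foreach (array('http://example.com/path?x=1', 'example.com') as $href) {
    echo $href . ' => ' . var_export(link_host($href), true) . PHP_EOL;
}
```

A parser function could use a check like this to decide which links need replacing.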
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humour site: http://www.demogracia.com
--