Please test/review my Regex to locate hyperlink in text [message #172485]
Mon, 21 February 2011 06:06
Simon
Hi,
My requirements are as follows:
1) Find all hyperlinks in a given text document.
2) Check for certain links and replace them if need be (in a separate
parser function).
3) Any attributes given in the hyperlink can be ignored; they will be
returned faithfully in the matches.
4) All JavaScript and the like is pre-stripped, so the text can be
assumed to be 'safe' (if it is not safe, it is not the job of this
regex to handle it).
// -----------------------------------
// my pattern
$pattern = '/<a (.*?)href=[\"\']??(.*?)\/\/(.*?)[\s\"\'](.*?)>(.*?)<\/a>/i';
// run the callback function my_parser on every match
$body = preg_replace_callback($pattern, 'my_parser', $body);
// -----------------------------------
The way I see it, this should work for:
- <a href='example.com'>some text</a>
- <a href="example.com">some text</a>
- <a href=example.com>some text</a>
- <a href='http://example.com'>some text</a>
- <a href="http://example.com">some text</a>
- <a href=http://example.com>some text</a>
- <a href='example.com' target=_blank>some text</a>
- <a href="example.com" target=_blank>some text</a>
- <a href=example.com target=_blank>some text</a>
- <a href='http://example.com' target=_blank>some text</a>
- <a href="http://example.com" target=_blank>some text</a>
- <a href=http://example.com target=_blank>some text</a>
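A quick way to check a list like this is to run the pattern against each case on its own. The following is only a minimal sketch: the pattern is the one given above, unchanged, while the $cases selection, the $results array and the output format are assumptions made for illustration.

```php
<?php
// The pattern from the post, unchanged.
$pattern = '/<a (.*?)href=[\"\']??(.*?)\/\/(.*?)[\s\"\'](.*?)>(.*?)<\/a>/i';

// A few of the listed cases (illustrative selection, each tested in isolation).
$cases = array(
    "<a href='example.com'>some text</a>",
    '<a href="http://example.com">some text</a>',
    '<a href=http://example.com>some text</a>',
    '<a href=http://example.com target=_blank>some text</a>',
);

$results = array();
foreach ($cases as $case) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $results[$case] = preg_match($pattern, $case);
    echo ($results[$case] ? 'match   ' : 'NO MATCH') . '  ' . $case . PHP_EOL;
}
```

Because the pattern requires a literal `//`, an href without one cannot match when tested in isolation. Inside a longer document the lazy `(.*?)` groups may instead run on to a later `//` on the same line, so isolated tests and whole-document tests can give different results.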
Can you poke holes in my regex, please? :)
Any suggestions or better regexes?
Many thanks
Simon

Re: Please test/review my Regex to locate hyperlink in text [message #172486 is a reply to message #172485]
Mon, 21 February 2011 08:21
alvaro.NOSPAMTHANX
If you are looking for <a> tags then it isn't a plain-text document,
it's an HTML document. Unless this is just an exercise in learning
regular expressions, you can simply use an HTML parser, something like this:
<?php
$url = 'http://www.google.com';
$html = file_get_contents($url);
$doc = new DOMDocument;
// Collect parser warnings internally instead of printing them
// (real-world HTML is rarely valid)
libxml_use_internal_errors(TRUE);
$doc->loadHTML($html);
libxml_use_internal_errors(FALSE);
// Print the text and the href of every <a> element
$links = $doc->getElementsByTagName('a');
foreach($links as $a){
    echo $a->nodeValue . ': ' . $a->getAttribute('href') . PHP_EOL;
}
?>
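The same DOM approach can probably also cover requirement 2 (replacing certain links). A minimal sketch, assuming the HTML is already in a string; rewrite_links() and the rewriting rule in the callback are hypothetical names and choices made up for this example, not part of any library:

```php
<?php
// Hypothetical helper: passes every href through $callback and
// returns the modified document as an HTML string.
function rewrite_links($html, $callback)
{
    $doc = new DOMDocument;
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            // Other attributes (target, class, ...) are left untouched.
            $a->setAttribute('href', $callback($a->getAttribute('href')));
        }
    }
    return $doc->saveHTML();
}

$html = '<p><a href="http://example.com" target="_blank">some text</a></p>';
$out = rewrite_links($html, function ($href) {
    // Example rule (an assumption): switch example.com links to https.
    return preg_replace('#^http://example\.com#i', 'https://example.com', $href);
});
echo $out;
```

Note that loadHTML() parses a full document, so the string returned by saveHTML() includes a doctype and html/body wrappers around the original fragment.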
Afterwards, you can analyse URLs with parse_url():
http://es.php.net/manual/en/function.parse-url.php
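For instance, parse_url() makes it easy to tell an absolute URL from a bare example.com, which matters for the cases listed in the question. A minimal sketch; the URLs and variable names are illustrative assumptions:

```php
<?php
// Hrefs in two of the forms listed in the question (illustrative values).
$with_scheme    = parse_url('http://example.com/path?x=1');
$without_scheme = parse_url('example.com');

print_r($with_scheme);     // scheme, host, path and query are reported separately
print_r($without_scheme);  // no scheme, so the whole string is reported as a path
```

In other words, a link written as href=example.com is treated as a relative path rather than as a host, so a check on the 'host' key alone would skip it.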
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humour site: http://www.demogracia.com
--