preg_match() oddities and question [message #176061]
Tue, 22 November 2011 11:21
Sandman
Messages: 32 Registered: August 2011
So I have this regexp:
if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)) {
    $streetname   = uc_words($m[1]);
    $streetnumber = trim($m[2]);
    $streetletter = strtoupper($m[3]);
    $search = trim($streetname . SPACE . $streetnumber . $streetletter);
}
The desired result is to take the input ($search) and split it into
its parts as an address. $search can be, for example, "foo
street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
So, if I print_r($m) with different input I get:
Array
(
[0] => foo street 34
[1] => foo street
[2] => 34
[3] =>
)
Array
(
[0] => longstreet 45b
[1] => longstreet
[2] => 45
[3] => b
)
Array
(
[0] => longstreet 45 b
[1] => longstreet
[2] => 45
[3] => b
)
You get the idea. But problems arise when I search for the streetname
alone:
Array
(
[0] => longstreet
[1] =>
[2] =>
[3] => longstreet
)
As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
search term since there are no digits and the first group is
non-greedy. And if I make the first group greedy, "longstreet" is
matched correctly, but it also catches the entire "longstreet 45b"
when searching for that.
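A minimal sketch of one possible way around this, not a definitive fix: keep the first group non-greedy, but require at least one digit for the number and wrap the number and letter in an optional group, so a bare street name has nowhere to go but group 1. The function name parse_address is hypothetical, the letter class is narrowed to [A-Za-z] as an assumption, and the ?? operator assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: split "street [number[letter]]" into its parts.
// Returns [street, number, letter]; absent parts come back as "".
function parse_address(string $search): array
{
    // (.*?)                        street name, non-greedy
    // (?:(\d+)\s*([A-Za-z]*))?     optional tail: mandatory digits, then optional letters
    if (!preg_match('/^(.*?)\s*(?:(\d+)\s*([A-Za-z]*))?$/', trim($search), $m)) {
        return [trim($search), '', ''];
    }
    // When the optional tail did not match, $m[2] and $m[3] are not set at all.
    return [$m[1], $m[2] ?? '', $m[3] ?? ''];
}
```

With the sample inputs above, parse_address("longstreet") should give the street name in the first slot and empty strings for the rest, while "longstreet 45b" and "longstreet 45 b" should both give "longstreet", "45", "b".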
Also, when searching for a term with Swedish characters, I get this:
Array
(
[0] => vikavägen
[1] => vikavä
[2] =>
[3] => gen
)
Which is quite odd to me: why isn't "vikavägen" matched the same
(undesired) way as "longstreet"? I have tried the /u modifier, and
made sure that the input was UTF-8 encoded, but it didn't make a
difference (the incoming encoding is ISO 8859-1).
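A small sketch that may help narrow down the difference, under the assumption that the input really arrives as ISO 8859-1 bytes: there "ä" is the single byte 0xE4, which falls outside an ASCII class such as [A-Za-z-], so the last group can only reach back as far as "gen". The variable names are illustrative, and iconv() assumes the iconv extension is available.

```php
<?php
$latin1 = "vikav\xE4gen";                         // "vikavägen" as ISO 8859-1 bytes
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);  // same word re-encoded as UTF-8

// ASCII-only class, byte mode: 0xE4 is not in the class, so no full match.
$asciiOnly = preg_match('/^[A-Za-z-]+$/', $latin1);

// Unicode letter class with /u on valid UTF-8 input: "ä" counts as a letter.
$unicodeAware = preg_match('/^[\p{L}-]+$/u', $utf8);

// /u on the raw ISO 8859-1 bytes: the subject is not valid UTF-8,
// so preg_match() reports an error (false) instead of a match count.
$rawWithU = preg_match('/^[\p{L}-]+$/u', $latin1);
```

If this holds, $asciiOnly would be 0, $unicodeAware would be 1, and $rawWithU would be false, which suggests that /u only helps once the subject has actually been converted to UTF-8 and the class uses something like \p{L} rather than A-Z and a-z.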
Why the difference, and how do I correctly parse out parts as needed?
Any help is appreciated.
--
Sandman[.net]