Re: preg_match() oddities and question [message #176064 is a reply to message #176061]
Tue, 22 November 2011 11:47
tony
In article <mr-5B96D1(dot)12212022112011(at)News(dot)Individual(dot)NET>,
Sandman <mr(at)sandman(dot)net> wrote:
> So I have this regexp:
>
> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
You don't need the commas in the character class, unless you want to
match a literal comma, in which case you only need it once.
> $streetname = uc_words($m[1]);
> $streetnumber = trim($m[2]);
> $streetletter = strtoupper($m[3]);
> $search = trim($streetname . SPACE . $streetnumber .
> $streetletter);
> }
>
> The desired result is taking the input ($search) and split it into
> its parts as an address, right? $search can be, for example, "foo
> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
What about "foo street"? (i.e. with a space, but no number)
> So, if I print_r($m) with different input I get:
>
> Array
> (
> [0] => foo street 34
> [1] => foo street
> [2] => 34
> [3] =>
> )
> Array
> (
> [0] => longstreet 45b
> [1] => longstreet
> [2] => 45
> [3] => b
> )
> Array
> (
> [0] => longstreet 45 b
> [1] => longstreet
> [2] => 45
> [3] => b
> )
>
> You get the idea. But problems arise when I search for the streetname
> alone:
>
> Array
> (
> [0] => longstreet
> [1] =>
> [2] =>
> [3] => longstreet
> )
And you would also get:
Array
(
[0] => foo street
[1] => foo
[2] =>
[3] => street
)
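Both ambiguous cases can be reproduced with a short script. This is only a sketch: it uses your pattern with the redundant commas dropped from the character class, and the two inputs are just the examples discussed here.

```php
<?php
// Original pattern, minus the commas in the character class.
$pattern = '/^(.*?)\s*(\d*?)\s*([A-Za-z-]*?)$/';

foreach (['longstreet', 'foo street'] as $search) {
    // Every group is optional, so the lazy first group gives up as
    // little as possible and the last group takes the trailing word.
    preg_match($pattern, $search, $m);
    print_r($m);
}
```

Because every group may match the empty string, the engine is free to put a trailing word into the "letter" group, which is why the pattern alone cannot tell a street name from a street letter.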
> As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
> search term since there are no digits and the first group is
> non-greedy. And if I make the first group greedy, "longstreet" is
> matched correctly, but it also catches the entire "longstreet 45b"
> when searching for that.
Yes, you need to define your rules more closely. Not at the regex level,
but actually at the logic/decision level. If you can make rules that
can unambiguously specify how all kinds of input should be parsed,
then you can look at how to represent that in regexes. You might need
some additional logic to operate on the parsed result.
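As an illustration of "rules first, regex second", here is a minimal sketch under one possible set of rules. The rules are my assumption, not something you stated: the street name is mandatory, the number is optional, and a single letter is only accepted after a number. The function name parse_address() and the returned keys are hypothetical.

```php
<?php
// Hypothetical helper: splits "name [number[letter]]".
// Assumed rules: the number is optional, and one letter is accepted
// only after a number (optionally separated by whitespace).
function parse_address(string $search): ?array
{
    $pattern = '/^(.+?)(?:\s+(\d+)\s*([A-Za-z])?)?$/';
    if (!preg_match($pattern, trim($search), $m)) {
        return null; // e.g. empty input
    }
    // Groups that did not take part in the match are absent from $m,
    // hence the ?? defaults.
    return [
        'street' => $m[1],
        'number' => $m[2] ?? '',
        'letter' => strtoupper($m[3] ?? ''),
    ];
}

print_r(parse_address('longstreet 45 b'));
print_r(parse_address('foo street'));
```

With the letter tied to the number, "longstreet" and "foo street" can no longer end up in the letter group. If your real addresses allow other forms, the rules (and so the pattern) would need adjusting.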
> Also, when searching for a term in swedish characters, I get this:
>
> Array
> (
> [0] => vikavägen
> [1] => vikavä
> [2] =>
> [3] => gen
> )
>
> Which is quite odd to me, why isn't "vikavägen" matched the same
> (undesired) way that "longstreet". I have tried the /u modifier, and
> made sure that it was utf8-encoded, but it didn't make a difference
> (incoming encoding is ISO 8859-1).
>
> Why the difference, and how do I correctly parse out parts as needed?
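That's because ä is not in the set A-Za-z, so the last group can only
match the trailing "gen". If you want a character class that properly
recognises locale-specific letters, you need to change your character
class above to this:
[[:alpha:]\-]
(For single-byte input such as ISO 8859-1 this depends on the locale in
effect, so you may need to call setlocale() first.)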
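If the input is UTF-8 rather than ISO 8859-1, an alternative is the Unicode property \p{L} together with the /u modifier. The sketch below assumes UTF-8 input (ISO 8859-1 data would have to be converted first) and compares the ASCII-only class with \p{L} on your example.

```php
<?php
// Assumption: $search is UTF-8. ISO 8859-1 input would need converting,
// e.g. with mb_convert_encoding($search, 'UTF-8', 'ISO-8859-1').
$search = "vikav\u{E4}gen"; // "vikavägen"

// ASCII-only class: "ä" is outside A-Za-z, so the last group can only
// take the letters after it.
preg_match('/^(.*?)\s*(\d*?)\s*([A-Za-z\-]*?)$/u', $search, $ascii);

// \p{L} matches any Unicode letter when /u is in effect, so the whole
// word now lands in the last group, just as "longstreet" does.
preg_match('/^(.*?)\s*(\d*?)\s*([\p{L}\-]*?)$/u', $search, $unicode);

print_r($ascii);
print_r($unicode);
```

Note that this only makes "vikavägen" behave like "longstreet"; the ambiguity between street name and street letter is still there and has to be dealt with separately.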
Hope this helps!
Tony
--
Tony Mountifield
Work: tony(at)softins(dot)co(dot)uk - http://www.softins.co.uk
Play: tony(at)mountifield(dot)org - http://tony.mountifield.org