Re: preg_match() oddities and question [message #176100 is a reply to message #176098]
Wed, 23 November 2011 18:54
The Natural Philosoph
Messages: 993 Registered: September 2010
Sandman wrote:
> In article <slrnjcpunb(dot)85q(dot)hellsop(at)nibelheim(dot)ninehells(dot)com>,
> "Peter H. Coffin" <hellsop(at)ninehells(dot)com> wrote:
>
>> On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:
>>
>>> In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>, Thomas 'PointedEars'
>>> Lahn <PointedEars(at)web(dot)de> wrote:
>>>
>>>> >> "10 East 42nd Street, New York, NY 10017, USA".
>>>> > That wouldn't be a normal Swedish address, no. :)
>>>> You had not limited the country or the language of your street
>>>> addresses.
>>> Well, in my defense, the subject line was "preg_match() and swedish
>>> characters" until I changed it. I hadn't changed it when I wrote my
>>> examples.
>>>
>>>> My point is that parsing a street name and a house number from a
>>>> street address is a hard problem that cannot be solved only by
>>>> applying one regular expression.
>>> Right, but your example is not a valid argument for that conclusion.
>>> My examples contained the variations of addresses that I wanted
>>> to match. Or are you saying that there is no way to use regular
>>> expressions to catch the examples I gave? Because I have a hard time
>>> believing that.
>> Address-matching is a hard task. I did that for a decade professionally
>> (as part of a job, not the sole function), and it's not easy to do well
>> for even one postal system, and trying to write a generalized one is
>> basically impossible to manage in one lifetime. The best *simple* way
>> to manage it is to take a field, blow it out into individual words,
>> standardize all the words you can find without trying to sort out
>> what they are (which is the Very Hard part of that task), throw the
>> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
>> postal code or city code or province, then pick the item(s) that have
>> the greatest number of matches between incoming and loose-match record
>> of the numeric and nysiis-encoded alphabetical elements. If you weight
>> things like "numeric match = 1, plaintext that's in a dictionary that
>> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
>> and do that for NAME as well as ADDRESS, you get about as good as you
>> can get without buying someone else's work. And that's STILL a lot of
>> effort to write. Regexp alone for address matching is a snipe-hunt. It
>> looks obviously right and you can spend a lot of time playing with it,
>> but it ends up being a dead end.
>
> I thank you for your input, but I still maintain that my examples
> could be parsed with a regular expression, and I will not admit
> otherwise unless shown explicit counterexamples :-D
>
> No offense, though.
>
>
After three days, you could have done the data conversion by hand...
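
For anyone who wants to see what the word-level matching Peter describes in the quoted text might look like, here is a rough sketch. It is in Python rather than PHP, and it swaps in a plain Soundex for NYSIIS to stay short. The abbreviation table, the dictionary, the record layout (`address`, `postcode`) and every function name are made up for illustration; only the weights (numeric = 1, dictionary word = 2, non-dictionary word = 3) come from his post.

```python
import re
import unicodedata

# Hypothetical lookup tables; a real system would need far larger ones.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}
DICTIONARY = {"STREET", "AVENUE", "ROAD", "EAST", "WEST", "NORTH", "SOUTH"}

SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def fold(text):
    """Upper-case the text and drop accents, so that e.g. 'å' becomes 'A'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()


def soundex(word):
    """Classic four-character Soundex code (used here instead of NYSIIS)."""
    letters = [c for c in fold(word) if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]


def features(address):
    """Map each word of an address to a match key and the weight it earns."""
    feats = {}
    for tok in re.findall(r"[A-Z0-9]+", fold(address)):
        if tok[0].isdigit():
            # Numeric element: keep only the leading digits ("42ND" -> "42").
            feats["#" + re.match(r"\d+", tok).group()] = 1
        else:
            tok = ABBREVIATIONS.get(tok, tok)
            feats[soundex(tok)] = 2 if tok in DICTIONARY else 3
    return feats


def score(incoming, candidate):
    """Sum the weights of incoming elements that also occur in the candidate."""
    theirs = features(candidate)
    return sum(w for key, w in features(incoming).items() if key in theirs)


def best_matches(incoming, records, prefix_len=3):
    """Loose-match on a postcode prefix, then keep the top-scoring records."""
    prefix = incoming["postcode"][:prefix_len]
    block = [r for r in records if r["postcode"][:prefix_len] == prefix]
    scored = [(score(incoming["address"], r["address"]), r) for r in block]
    top = max((s for s, _ in scored), default=0)
    return [r for s, r in scored if s == top and s > 0]


records = [
    {"address": "10 East 42nd Street", "postcode": "10017"},
    {"address": "10 West 42nd Street", "postcode": "10036"},
    {"address": "12 Lexington Ave", "postcode": "10017"},
    {"address": "10 East 42nd Street", "postcode": "94105"},
]
incoming = {"address": "10 Eest 42nd St", "postcode": "10017"}
print(best_matches(incoming, records))
```

Note that nothing here tries to decide which word is the street name and which is the house number. It only counts how many elements agree, which is the point of the approach: the misspelled "Eest" and the abbreviated "St" still line up with the first record.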