Re: preg_match() oddities and question [message #176102 is a reply to message #176100]
Wed, 23 November 2011 19:23
Sandman
Messages: 32 Registered: August 2011
In article <jajfi0$pmj$3(at)news(dot)albasani(dot)net>,
The Natural Philosopher <tnp(at)invalid(dot)invalid> wrote:
>>> Address-matching is a hard task. I did that for a decade professionally
>>> (as part of a job, not the sole function), and it's not easy to do well
>>> for even one postal system, and trying to write a generalized one is
>>> basically impossible to manage in one lifetime. The best *simple* way
>>> to manage it is to take a field, blow it out into individual words,
>>> standardize all the words you can find without trying to sort out
>>> what they are (which is the Very Hard part of that task), throw the
>>> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
>>> postal code or city code or province, then pick the item(s) that have
>>> the greatest number of matches between incoming and loose-match record
>>> of the numeric and nysiis-encoded alphabetical elements. If you weight
>>> things like "numeric match = 1, plaintext that's in a dictionary that
>>> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
>>> and do that for NAME as well as ADDRESS, you get about as good as you
>>> can get without buying someone else's work. And that's STILL a lot of
>>> effort to write. Regexp alone for address matching is a snipe-hunt. It
>>> looks obviously right and you can spend a lot of time playing with it,
>>> but it ends up being a dead end.
>>
>> I thank you for your input, but I still maintain that my examples
>> could be parsed using a regular expression, and only when shown
>> concrete counterexamples will I admit otherwise :-D
>>
>> No offense, though.
>
> after three days, you could have done the data conversion by hand...
Huh? What data conversion? The data to be searched does not need to be
converted into anything. It is neatly separated and also kept in a
combined form. It's the *in-data* (i.e. the search terms provided by
visitors to the sites) that I need to massage :)
And three days wouldn't have gotten me far: the database contains
over 600,000 address records :)
But, as I said, the database is neatly formatted.
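
For reference, the word-by-word scoring approach quoted at the top of this
post could be sketched roughly as below. This is only an illustration under
several assumptions, not anyone's actual implementation: it uses classic
Soundex in place of nysiis, the ABBREVIATIONS and DICTIONARY tables are
made-up stand-ins, the function names are hypothetical, and the candidate
records are assumed to have been narrowed down already by postal code or
city. The weights follow the quoted suggestion (numeric = 1, dictionary
word = 2, non-dictionary word = 3).

```python
import re

# Hypothetical lookup tables; real ones would be far larger.
ABBREVIATIONS = {"st": "street", "str": "street", "rd": "road", "ave": "avenue"}
DICTIONARY = {"street", "road", "avenue"}

# Soundex digit for each consonant group.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Classic four-character Soundex code for an alphabetic word."""
    word = word.lower()
    code = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def weighted_tokens(text):
    """Map each standardized token of the text to its match weight."""
    result = {}
    for word in re.findall(r"\w+", text.lower()):
        word = ABBREVIATIONS.get(word, word)
        if word.isalpha():
            # Alphabetic words are compared by their phonetic code.
            result[soundex(word)] = 2 if word in DICTIONARY else 3
        else:
            # House numbers, postal code chunks etc. are compared literally.
            result[word] = 1
    return result


def score(query, record):
    """Sum the weights of the query tokens that also occur in the record."""
    q, r = weighted_tokens(query), weighted_tokens(record)
    return sum(weight for key, weight in q.items() if key in r)


def best_matches(query, records):
    """Return the record(s) with the highest non-zero score."""
    scored = [(score(query, rec), rec) for rec in records]
    top = max((s for s, _ in scored), default=0)
    return [rec for s, rec in scored if s == top and s > 0]
```

With made-up records such as "Storgatan 12, 111 22 Stockholm" and
"Storgatan 21, 111 22 Stockholm", a misspelled search term like
"storgaten 12 stokholm" would pick the first one, since both words match
phonetically and only that record also matches on the number.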
--
Sandman[.net]