L18N and search engine [message #4631] |
Thu, 01 August 2002 15:37 |
vidaubannais
Messages: 23 Registered: July 2002
In French, a lazy "patriot" will most of the time neglect the accents on letters, either because he is really lazy or because he is not too sure of the spelling.
My point is that most search engines (Google and Altavista, but not only Web search engines) are able to analyse the characters that could be accented, run the corresponding accented searches alongside the original search requested by the user, and then return all the results together.
Example:
Télévision:
If the user types "TelevISIon", the search engine would then be able to find:
Telévision
Telévision
Télevision
Télévision (the one that is grammatically correct)
TELEVISION
TELevision
and so forth
(Note that the search engine is not case sensitive either. I don't know whether that is the case in FUD with its MySQL database, whether MySQL can be configured not to be case sensitive, or whether a little something can be added to the SELECT statement to make it so.)
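To make the idea concrete, here is a minimal sketch in Python of accent- and case-insensitive matching. It is only an illustration of the principle, not how FUD or any particular search engine does it; the function name fold is my own invention. It decomposes each accented letter into a base letter plus a combining accent, drops the accents, and lowercases the result before comparing.

```python
import unicodedata

def fold(text):
    """Return an accent-free, case-insensitive form of text for comparison."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Spellings taken from the example above.
variants = ["Telévision", "Télevision", "Télévision", "TELEVISION", "TELevision"]
query = "TelevISIon"

# Every variant matches once both sides are folded.
matches = [v for v in variants if fold(v) == fold(query)]
```

In a real forum the folded form would presumably be stored or indexed next to the original text, so that the folding is not redone for every search.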
This was my first observation. It would be EXTREMELY useful in French as well as in German, Spanish, etc. Basically, it is useful for ANY alphabet-based language.
My second request is more complicated:
Plural and singular forms, and feminine and masculine genders (plus neuter in German, for instance), should be handled in a much cleverer way by the search engine:
Let's say LE PATIENT (in French, it means "the patient" in English).
Depending on the sex of the patient (masculine or feminine), it would be spelled:
Le Patient (masculine)
La Patiente (feminine)
Basically, the guy who runs the search knows that he is looking for an unknown patient. As far as he is concerned, he does not know whether this patient is a girl or a boy, hence he does not know whether he should look for "Le Patient" or "La Patiente".
A real search engine should be able to ignore the gender and return both forms, unless the search string is explicitly protected by two magic characters: double quotes (").
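As a rough sketch of that behaviour in Python: bare words are expanded into their masculine and feminine forms, while anything between double quotes is kept exactly as typed. The rule in french_variants (add or remove a final "e") is a deliberately naive assumption of mine that only covers cases like patient/patiente; real French needs many more rules, and both function names are hypothetical.

```python
import re

def french_variants(word):
    """Naive, hypothetical rule: the feminine form adds a trailing 'e'."""
    word = word.lower()
    if word.endswith("e"):
        return {word, word[:-1]}
    return {word, word + "e"}

def expand_query(query):
    """Return one set of acceptable forms per search term.

    Terms inside double quotes are kept as a single exact phrase.
    """
    groups = []
    # Each match is either a quoted phrase (first group) or a bare word (second).
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if word:
            groups.append(french_variants(word))
        elif phrase:
            groups.append({phrase})
    return groups
```

With this, expand_query('patient') accepts both "patient" and "patiente", whereas expand_query('"La Patiente"') only accepts the exact phrase.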
This is even trickier because those rules apply differently depending on the language (the locale).
For instance, in German, masculine and feminine are even more complicated (words ending in 'e' or 'en', and too many rules to go through them all in this post [I am not even quite sure that anyone knows all of them, really]).
I would like to have your feeling on this question. Keep in mind that if the user (the genuine, lazy one [but not necessarily FRENCH]) does not get any results from the search engine on the first or second go, he is not going to make much more effort trying different combinations of masculine, feminine and so forth. He is simply going to post a NEW thread about a question that has already been posted by someone else. Your forum is soon going to be full of duplicated data, where you will find plenty of questions and hardly any answers unless you try really hard. Plus, think about all the space lost.
Maybe it would be possible to create a special PARSER in charge of ANALYZING the user's search requests and generating as many QUERIES as needed (5 or 6 should be the average, I believe) to match answers more efficiently?
I am not saying that this is a FUD problem here, I am just saying that something should be done ... Maybe a generic set of parsers (one per language) that could be plugged in between the DB and the client program (e.g. the search engine). Maybe in another GNU project? ... But I am just thinking out loud ... Maybe such a project already exists?
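To show what I mean by pluggable parsers, here is a small Python sketch. Everything in it is an assumption made for illustration: the PARSERS table, the function names and the one-line French rule are invented, and a real parser per language would need far richer rules. The point is only the shape: one rule set per locale, a fallback for unknown locales, and a list of concrete queries as output.

```python
from itertools import product

def french_forms(word):
    """Hypothetical toy rule: toggle a trailing 'e' (patient/patiente)."""
    if word.endswith("e"):
        return [word, word[:-1]]
    return [word, word + "e"]

def identity_forms(word):
    """Fallback for locales without a parser: keep the word unchanged."""
    return [word]

# One pluggable entry per locale; other languages would be added here.
PARSERS = {"fr": french_forms}

def generate_queries(request, locale):
    """Turn one user request into the list of queries to send to the DB."""
    forms = PARSERS.get(locale, identity_forms)
    per_word = [forms(w) for w in request.lower().split()]
    # Combine every form of every word into a complete query string.
    return [" ".join(combo) for combo in product(*per_word)]
```

Note that the number of queries doubles with each word under this toy rule, so a real parser would have to limit which words it expands.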
Let me know.
Disclaimer: No offense to French and German people in this post - I am French myself ... and live in Germany.