FUDforum
Fast Uncompromising Discussions. FUDforum will get your users talking.

L18N and search engine [message #4631] Thu, 01 August 2002 15:37
vidaubannais (Germany)
Messages: 23
Registered: July 2002
Karma: 0
Junior Member
In French, a lazy "patriot" will usually leave the accents off letters, either because he really is lazy or because he is not too sure of the spelling.
My point is that most search engines (Google and AltaVista, but not only Web search engines) are able to analyze which characters could be accented, run the corresponding searches alongside the query the user actually typed, and then return all the results together.
Example:

Télévision:

If the user types "TelevISIon", the search engine would then be able to find:
Telévision
Telévision
Télevision
Télévision (the one that is grammatically correct)
TELEVISION
TELevision
and so forth
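
The matching described in this example can be sketched in a few lines. This is only an illustration, not how Google or FUDforum work: the helper name `fold` is made up, and it assumes that stripping every accent is acceptable, which suits French but is debatable for German, where "ü" is often written "ue".

```python
import unicodedata

def fold(text: str) -> str:
    """Return a case- and accent-insensitive key for text (hypothetical helper)."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger lower() intended for caseless comparison.
    return stripped.casefold()

variants = ["Telévision", "Télevision", "Télévision", "TELEVISION", "TELevision"]
query = "TelevISIon"

# Compare folded keys instead of the raw strings.
matches = [v for v in variants if fold(v) == fold(query)]
```

With this approach every spelling in `variants` ends up in `matches`, because both sides are reduced to the same key before comparison.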

(Note that the search engine is not case sensitive either. I don't know whether that is the case on FUD with its MySQL database, whether MySQL can be configured not to be case sensitive, or whether a little something can be added to the SELECT statement to make it so.)

This was my first observation - it would be EXTREMELY useful in French as well as in German, Spanish, etc. Basically, it would be useful for ANY alphabet-based language.

My second request is more complicated:
plural and singular forms, and feminine and masculine genders (plus neuter in German, for instance), should be handled in a much cleverer way by the search engine:

Take LE PATIENT (French for "the patient").
Depending on the sex of the patient (masculine or feminine), it would be spelled:
Le Patient (masculine)
La Patiente (feminine)

Basically, the guy who runs the search knows only that he is looking for an unknown patient. He does not know whether this patient is a girl or a boy, hence he does not know whether he should look for "Le Patient" or "La Patiente".
A real search engine should be able to ignore the gender and return both, unless the search string is explicitly protected by the 2 magic characters <">.
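
A rough sketch of that behaviour, assuming French only and a deliberately crude rule set: the function names, the article list, and the suffix rules are all illustrative, and real stemmers are far more careful. Quoted queries are kept verbatim; unquoted ones are reduced to a gender- and number-neutral key.

```python
import re

# Small, incomplete list of French articles to ignore (assumption).
ARTICLES = {"le", "la", "les", "l", "un", "une", "des"}

def light_stem(word: str) -> str:
    """Crude French normalization: drop a plural -s/-x, then a feminine -e."""
    w = word.lower()
    if len(w) > 3 and w[-1] in "sx":
        w = w[:-1]
    if len(w) > 3 and w.endswith("e"):
        w = w[:-1]
    return w

def search_key(query: str):
    """Return ("exact", phrase) for a quoted query, else ("loose", stems)."""
    q = query.strip()
    if len(q) >= 2 and q[0] == '"' and q[-1] == '"':
        # Quoted: the user asked for this exact spelling.
        return ("exact", q[1:-1].lower())
    words = re.findall(r"\w+", q.lower())
    stems = tuple(light_stem(w) for w in words if w not in ARTICLES)
    return ("loose", stems)

masculine = search_key("Le Patient")
feminine = search_key("La Patiente")
quoted = search_key('"La Patiente"')
```

Here `masculine` and `feminine` come out as the same key, while `quoted` stays an exact phrase. The rules over-stem many words (they apply the same cut to every word ending in "e"), which is harmless only as long as the index and the query go through the same function.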

This is even trickier because those rules apply differently depending on the language (the locale).
For instance, in German, masculine and feminine forms are even more complicated (words ending in 'e' or 'en', and too many rules to go through them all in this post [I am not even sure that anyone knows all of them, really]).

I would like to have your feeling on this question. Keep in mind that if the user (the genuine, lazy one, though not necessarily FRENCH) does not get any results from the search engine on the first or second go, he is not going to put much more effort into trying different combinations of masculine, feminine and so forth - he is simply going to post a NEW thread about a question that someone else has already posted. Your forum is soon going to be a heap of duplicated data where you find plenty of questions and hardly any responses unless you try really hard - plus think about all the wasted space.

Maybe it would be possible to create a special PARSER in charge of ANALYZING the user's search requests and generating as many QUERIES as needed (5 or 6 should be the average, I believe) to match answers more efficiently?
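
As a sketch of what such a parser might do at query time: expand one term into a handful of variants and search for all of them at once. The suffix rules are naive French-only guesses, and the `search_index(word, msg_id)` table is hypothetical, not FUDforum's actual schema.

```python
def expand_term(term: str) -> list[str]:
    """Generate naive French gender/number variants of one search term."""
    base = term.lower()
    # Reduce to a rough masculine singular first.
    for suffix in ("es", "e", "s"):
        if base.endswith(suffix) and len(base) - len(suffix) >= 3:
            base = base[: -len(suffix)]
            break
    return [base, base + "e", base + "s", base + "es"]

def build_query(term: str) -> tuple[str, list[str]]:
    """Build one parameterized query covering every variant of the term."""
    variants = expand_term(term)
    placeholders = ", ".join("?" for _ in variants)
    # search_index(word, msg_id) is a hypothetical word-to-message table.
    sql = f"SELECT DISTINCT msg_id FROM search_index WHERE word IN ({placeholders})"
    return sql, variants

sql, params = build_query("Patiente")
```

Folding the variants into a single `IN (...)` list keeps it to one round trip to the database instead of 5 or 6 separate queries, which matters on a slow or shared server.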

I am not saying that this is a FUD problem, just that something should be done ... Maybe a generic set of parsers (one per language) that could be plugged in between the DB and the client program (e.g. the search engine). Maybe in another GNU project? ... But I am just thinking out loud ... Maybe such a project already exists???

Let me know what you think.

Disclaimer: no offense intended to French or German people in this post - I am French myself ... and I live in Germany

Re: L18N and search engine [message #4639 is a reply to message #4631] Thu, 01 August 2002 16:39
Ilia (Canada)
Messages: 13241
Registered: January 2002
Karma: 0
Senior Member
Administrator
Core Developer
MySQL by nature is case-insensitive, and our PostgreSQL searching code is case-insensitive as well. So, searching for FOoBaR will find FOOBAR as well as foobar and all other wonderful combinations thereof.
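
For a database or collation that does compare case-sensitively, the "little something in the SELECT" is usually a case fold on both sides of the comparison. The sketch below uses an in-memory SQLite database purely as a stand-in (its `=` operator is case-sensitive by default); the `messages` table and its columns are made up and are not FUDforum's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO messages (subject) VALUES (?)",
    [("FOOBAR",), ("foobar",), ("FOoBaR",), ("bazqux",)],
)

def find(term: str) -> list[str]:
    """Case-insensitive lookup: lower-case both the column and the term."""
    rows = conn.execute(
        "SELECT subject FROM messages WHERE LOWER(subject) = LOWER(?) ORDER BY id",
        (term,),
    )
    return [row[0] for row in rows]

found = find("FOoBaR")
```

Note that wrapping the column in `LOWER()` generally prevents the database from using a plain index on it, so this convenience has a cost on large tables.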
As for special handling of é-like characters, plural/singular forms and gender prefixes, that is unlikely to become part of FUDforum.
There is a good reason for this: Google and all the other major search engines have massive server farms for searching and indexing data, which lets them do all kinds of neat things as far as searching goes.
FUDforum, on the other hand, in most cases runs on shared servers (vhosts) where CPU time is limited and speed is of the essence. Therefore it is impossible to do 'smart searching' without making the search insanely slow. Heck, if you look around you'll notice that many web forums, especially large ones, turn off their search systems completely, since those cause a large load when used frequently. FUDforum's search is pretty fast in comparison to other search systems, but even then it is relatively slow; making it even slower is not an option.

Do not despair, however: you can write your own search system with 'smartness' or use an existing search product to index FUDforum's messages. I am, however, unfamiliar with any open-source search projects that support the level of 'smartness' you mentioned.
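
To give an idea of what a home-grown external index could look like, here is a minimal in-memory sketch. It is not FUDforum code: the message IDs and bodies are invented, the function names are hypothetical, and the only 'smartness' it has is case and accent folding.

```python
import re
import unicodedata
from collections import defaultdict

def fold(text: str) -> str:
    """Strip accents and case so that 'Télévision' and 'television' match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def build_index(messages: dict[int, str]) -> dict[str, set[int]]:
    """Map each folded word to the set of message IDs containing it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for token in re.findall(r"\w+", fold(body)):
            index[token].add(msg_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the IDs of messages containing every word of the query."""
    hits = [index.get(token, set()) for token in re.findall(r"\w+", fold(query))]
    return set.intersection(*hits) if hits else set()

# Invented sample data standing in for exported forum messages.
messages = {1: "La télévision du patient", 2: "Television is fast", 3: "Le Patient dort"}
index = build_index(messages)
```

All the folding work happens once, when the index is built, so a lookup is just a few dictionary reads and a set intersection; that is the usual way to keep this kind of search cheap on a shared host.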

Keep in mind that Advanced Internet Designs Inc., the company which develops FUDforum, does do commercial development, so if you are really interested in this feature we can help you integrate FUDforum data into an existing search engine product or write a custom search engine for your purposes.


FUDforum Core Developer