Re: I Need to search over 100 largeish text documents efficiently. What's the best approach? [message #184744 is a reply to message #184743]
Mon, 27 January 2014 12:23 |
Denis McMahon
On Mon, 27 Jan 2014 10:58:42 +0100, Arno Welzel wrote:
> On 27.01.2014 02:43, Denis McMahon wrote:
>
>> On Sun, 26 Jan 2014 05:34:21 -0800, rob.bradford2805 wrote:
>>
>>> What is the best/fastest approach to scan 100+ largish text files for
>>> word strings
>>
>> A quick googling finds:
>>
>> http://sourceforge.net/projects/php-grep/
>> http://net-wrench.com/download-tools/php-grep.php
>>
>> Claims to be able to search 1000 files in under 10 secs
>
> Under ideal conditions - maybe. But if each file is more than 1 MB, it
> is barely possible to even read this amount of data in just 10 seconds
> (assuming around 80 MB/s and 1000 MB of data to be searched).
>
> Even using a simple word index (word plus the name of the file(s) and
> the position(s) where the word is located) would be the better solution.
Indeed, the fastest solution would be to index each file when it changes,
and keep the indexes in a db.
Perhaps there are common words you wouldn't index; in English these might
include:

a the in on an this that then ....
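
Roughly what I mean, as an untested PHP sketch. It assumes SQLite via PDO
and a made-up "word_index" table holding word, file and position; adjust
the schema and stop word list to taste.

<?php
// Rough indexing sketch (untested): tokenise each file, drop the common
// words, and store the remaining words with their positions in SQLite.
// "word_index" is a made-up table name, not anything standard.

$stopWords = ['a', 'the', 'in', 'on', 'an', 'this', 'that', 'then'];

$db = new PDO('sqlite:index.db');
$db->exec('CREATE TABLE IF NOT EXISTS word_index
           (word TEXT, file TEXT, position INTEGER)');

function indexFile(PDO $db, array $stopWords, $file)
{
    // Re-indexing a changed file: throw away its old rows first.
    $db->prepare('DELETE FROM word_index WHERE file = ?')->execute([$file]);

    $words = preg_split('/\W+/', strtolower(file_get_contents($file)),
                        -1, PREG_SPLIT_NO_EMPTY);

    $ins = $db->prepare(
        'INSERT INTO word_index (word, file, position) VALUES (?, ?, ?)');

    $db->beginTransaction();
    foreach ($words as $pos => $word) {
        if (!in_array($word, $stopWords, true)) {
            $ins->execute([$word, $file, $pos]);
        }
    }
    $db->commit();
}

foreach (glob('docs/*.txt') as $file) {
    indexFile($db, $stopWords, $file);
}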
Then, if you have a search phrase, remove the common words from it and look
for the remaining uncommon words in close proximity to each other.
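
The lookup side might look something like this (again just an untested
sketch against the same made-up word_index table; the 10-word proximity
window is an arbitrary choice):

<?php
// Rough lookup sketch (untested): drop the common words from the phrase,
// pull positions of the remaining words per file, and keep files where
// they all occur within a short window of each other.

$stopWords = ['a', 'the', 'in', 'on', 'an', 'this', 'that', 'then'];
$db = new PDO('sqlite:index.db');

function search(PDO $db, array $stopWords, $phrase, $window = 10)
{
    // Tokenise the phrase and drop the common words.
    $terms = array_values(array_unique(array_diff(
        preg_split('/\W+/', strtolower($phrase), -1, PREG_SPLIT_NO_EMPTY),
        $stopWords)));
    if (!$terms) {
        return [];
    }

    $marks = implode(',', array_fill(0, count($terms), '?'));
    $stmt = $db->prepare("SELECT file, word, position FROM word_index
                          WHERE word IN ($marks) ORDER BY file, position");
    $stmt->execute($terms);

    // Group hits by file, then look for any $window-word stretch that
    // contains every remaining search term at least once.
    $byFile = [];
    foreach ($stmt as $row) {
        $byFile[$row['file']][(int)$row['position']] = $row['word'];
    }

    $matches = [];
    foreach ($byFile as $file => $positions) {
        ksort($positions);
        foreach (array_keys($positions) as $start) {
            $seen = [];
            foreach ($positions as $pos => $word) {
                if ($pos >= $start && $pos < $start + $window) {
                    $seen[$word] = true;
                }
            }
            if (count($seen) === count($terms)) {
                $matches[] = $file;
                break;
            }
        }
    }
    return $matches;
}

print_r(search($db, $stopWords, 'the quick brown fox'));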
It might also help to know more about the search itself: is it using complex
regexps, or is it a simple string search done externally using grep?
--
Denis McMahon, denismfmcmahon(at)gmail(dot)com