FUDforum: comp.lang.php » Tokenize an HTML page.

Tokenize an HTML page. [message #179710]

Mon, 26 November 2012 07:56

Simon
Messages: 29
Registered: February 2011


Hi,

I would like to index a whole bunch of HTML documents on my site to speed
up my internal searches (I currently use 'LIKE "%...%"', which is not
very efficient).

My understanding is that I would need to:
1) Remove the HTML markup (with strip_tags( ... ))
2) Walk the string and, every time I come across a stop character
(<space>,',",?,! etc...), count the text before it as a word.
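
For reference, the two steps above could be sketched roughly as follows. This is only a minimal illustration, not a recommended solution: naive_tokenize() is a made-up helper name, the list of stop characters is an arbitrary choice, and mb_strtolower() assumes the mbstring extension is available.

```php
<?php
// Naive tokenizer: strip the tags, then split on a fixed set of stop characters.
// naive_tokenize() is a hypothetical helper name, not part of any library.
function naive_tokenize(string $html): array
{
    // 1) Remove the markup; strip_tags() keeps only the text between tags.
    //    Note that adjacent blocks such as "<p>a</p><p>b</p>" end up glued together.
    $text = strip_tags($html);

    // Decode entities such as &amp; so they are not split into fragments.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2) Split on whitespace and a few stop characters, dropping empty pieces.
    //    preg_split() returns false on invalid UTF-8, hence the fallback to [].
    $words = preg_split('/[\s\'"?!.,;:()]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Count how often each lower-cased word occurs.
    return array_count_values(array_map('mb_strtolower', $words));
}

$counts = naive_tokenize('<p>Hello world! Hello "PHP"?</p>');
// $counts now maps each word to its number of occurrences.
```

The resulting word => count map is the kind of data that could be stored in an index table instead of running LIKE over the raw pages.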

The above solution is overly simplistic, as it does not work for many
languages (Hebrew, for example, uses the single quote as part of the word).
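
One possible way around the quote problem is to match words instead of splitting on stop characters. The sketch below uses PCRE Unicode properties with the /u modifier; unicode_words() is a hypothetical helper, and the rule "an apostrophe only counts when it sits between two letters" is an assumption that will still miss cases such as a quote at the very end of a word.

```php
<?php
// unicode_words() is a hypothetical helper: a word is a run of letters,
// combining marks and digits, optionally joined by single apostrophes.
function unicode_words(string $text): array
{
    // \p{L} = any letter, \p{M} = combining mark, \p{N} = any digit.
    // The joiner class accepts the ASCII apostrophe, the typographic
    // apostrophe and the Hebrew geresh (U+05F3), but only inside a word.
    $pattern = '/[\p{L}\p{M}\p{N}]+(?:[\'’\x{05F3}][\p{L}\p{M}\p{N}]+)*/u';

    // On invalid UTF-8 preg_match_all() fails; return an empty list then.
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }
    return $matches[0];
}

$english = unicode_words("don't stop");
// A Hebrew word with an embedded single quote, written with \u{} escapes.
$hebrew = unicode_words("\u{05E6}'\u{05D9}\u{05E4}\u{05E1}");
```

With this rule, the embedded quote stays part of the word in both examples, while quotes surrounding a word are dropped.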

Also, stripping HTML assumes that it is properly formatted, something I
cannot really guarantee (and in any case, I might want to keep certain
items, such as the URLs inside href='' attributes).
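
A rough sketch of one alternative to strip_tags() is PHP's DOM extension, whose HTML parser tolerates badly formed markup and gives access to attributes. extract_text_and_links() is a hypothetical helper name; the sketch assumes the dom/libxml extensions are enabled and ignores character-encoding issues (without a charset hint, loadHTML() may not treat the input as UTF-8).

```php
<?php
// extract_text_and_links() is a hypothetical helper built on PHP's DOM extension.
function extract_text_and_links(string $html): array
{
    // Collect libxml warnings internally instead of emitting them
    // for every piece of malformed markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Keep the link targets that strip_tags() would throw away.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }

    // textContent concatenates every text node below <body>
    // (including the contents of inline <script>/<style>, if any).
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? trim($body->textContent) : '';

    return ['text' => $text, 'links' => $links];
}

// Deliberately malformed input: unclosed <a> and <p> elements.
$result = extract_text_and_links('<p>See <a href="http://example.com/">this site<p>unclosed');
```

The returned text can then be passed to whatever word splitter is used, while the links can be indexed separately.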

So, before I re-invent the wheel, can someone suggest a
script/class/code that is able to tokenize html content?

Any suggestions?

Many thanks

Simon
