Spiders and Bots [message #165940] |
Sun, 28 August 2011 22:32 |
The Witcher
Messages: 675 Registered: May 2009 Location: USA
Karma: 3
|
Senior Member |
I'm interested in the new Spider Manager, but there are a few things I don't quite understand (surprise, surprise).
"Useragent:
Spider's useragent string (partial matches are accepted)."????
I found this reference for user agent string, so I assume that these would be user agent strings for:
Bing: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Google: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
NerdByNature: Mozilla/5.0 (compatible; NerdByNature.Bot; http://www.nerdbynature.net/bot) Etc.
I copied those out of my server's stats/requests page, so are these examples of the strings that need to be input? I'm not seeing anything associated with spiders or bots anywhere else except in these! Is this correct?
"IP Addresses:
Comma separated list of IP Addresses used by the spider."????
I understand IP addresses well enough; I just don't see how to associate an IP with the bots actually crawling the site, other than checking them one at a time.
In the past I've always used a "robots.txt" file, so this is a new development for me. However, in the last few weeks one particular IP has shown up repeatedly in my log (hundreds of times) as replying, browsing, or causing errors, apparently requesting access to files or functions that do not exist or are not enabled.
The ISP information lists this as "JPNIC", using a range of IPs from 119.63.192.0 - 119.63.199.255. So far I have copied perhaps 30 or so of the specific IPs from within this range.
So obviously I can input those 30 IP addresses separated by commas, but is there a way to input the entire range used by this or any other bot/spider without entering a hundred or more individual IPs from that range?
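For what it's worth, the range quoted above (119.63.192.0 - 119.63.199.255) is 2048 addresses, which is the single CIDR block 119.63.192.0/21. Whether the Spider Manager field accepts a range or CIDR notation is not stated here, so the following is only a minimal PHP sketch of how a range check could work. `ip_in_range()` is a hypothetical helper, not a FUDforum function, and the sample addresses are arbitrary.

```php
<?php
// Hypothetical helper (not part of FUDforum): true if $ip lies between
// $first and $last, inclusive. IPv4 only.
function ip_in_range(string $ip, string $first, string $last): bool
{
    // ip2long() converts dotted-quad notation to an integer, or false on bad input.
    $n  = ip2long($ip);
    $lo = ip2long($first);
    $hi = ip2long($last);
    if ($n === false || $lo === false || $hi === false) {
        return false;
    }
    return $n >= $lo && $n <= $hi;
}

// Range taken from the post above; the tested addresses are arbitrary samples.
var_dump(ip_in_range('119.63.196.17', '119.63.192.0', '119.63.199.255')); // bool(true)
var_dump(ip_in_range('119.63.200.1', '119.63.192.0', '119.63.199.255'));  // bool(false)
```

The integer comparison assumes both ends of the range fall on the same side of 128.0.0.0, which holds here; on 32-bit PHP builds, addresses above that boundary come back from ip2long() as negative numbers.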
"I'm a Witcher, I solve human problems; not always using a sword!"
Re: Spiders and Bots [message #165988 is a reply to message #165961] |
Fri, 02 September 2011 20:34 |
The Witcher
Messages: 675 Registered: May 2009 Location: USA
Karma: 3
|
Senior Member |
I guess you missed my point: I wasn't sure what a user agent string is, and from what I found I was uncertain which portion of it to use! Nor was I certain how to identify spiders and bots beyond individual searches or online lists.
As for "JPNIC", I could not identify them as a bot or spider, or explain all the reply or page-not-found FUD errors they were generating, and I didn't want to ban a range of IPs without knowing, since the forum's subject has an international user base.
But I think this HERE explains it.
As for "Spiders and Bots"
I defined Google as a test on a backup. Today Google is the newest registered user; on logging out I see:
Warning: preg_match() [function.preg-match]: Unknown modifier '5' in /home/user/domain.name/index.php on line 309
309 if (preg_match('/^'. $spider['useragent'] .'/i', $_SERVER['HTTP_USER_AGENT'])) {
if (empty($spider['bot_ip'])) {
Quote: Defined spider:
Google: GoogLe:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) : xx.xxx.xx.xxx
I deleted the "/5.0" from the spider definition and checked again:
Warning: preg_match() [function.preg-match]: Unknown modifier '2' in /home/user/domain.name/index.php on line 309
I deleted "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)" from the spider definition.
No further warnings yet! So it appears that you just need the user name and browser type.
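A likely explanation for both warnings, judging only from the line 309 quoted above: the stored string is pasted between `/` delimiters, so the first `/` inside it ends the pattern and whatever follows is read as a modifier ('5' from "Mozilla/5.0", then '2' from "Googlebot/2.1"). Here is a minimal sketch of how escaping would avoid that, assuming nothing about FUDforum beyond that one line. `spider_matches()` is a hypothetical name.

```php
<?php
// Hypothetical helper mirroring the quoted line 309, with the stored string escaped.
function spider_matches(string $stored, string $userAgent): bool
{
    // preg_quote() escapes regex metacharacters; passing '/' as the second
    // argument also escapes the delimiter, so "Mozilla/5.0" no longer ends the pattern.
    $pattern = '/^' . preg_quote($stored, '/') . '/i';
    return preg_match($pattern, $userAgent) === 1;
}

$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

var_dump(spider_matches('Mozilla/5.0 (compatible; Googlebot', $ua)); // bool(true)
var_dump(spider_matches('Googlebot', $ua));                          // bool(false)
```

The second call returns false because the sketch keeps the `^` anchor from line 309: a stored value only matches when the visitor's string starts with it.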
Re: Spiders and Bots [message #166014 is a reply to message #165961] |
Sat, 03 September 2011 18:16 |
The Witcher
Messages: 675 Registered: May 2009 Location: USA
Karma: 3
|
Senior Member |
As usual I completely missed the obvious and was reading too much into the simple instructions!
On a fresh 3.0.3RC2 install, three bots/spiders are already listed, providing a clear example of the form the name and user agent string need to take. The only thing missing is the IP address, which is easy enough to find online.
Bot Name    Useragent
Bing        msnbot
Google      Googlebot
Yahoo!      Slurp
So obviously the browser type isn't required either, which makes sense, seeing as there are so many different browsers in use.
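To see what short tokens like these would and would not catch, here is a rough PHP sketch. It assumes "partial matches" means a case-insensitive substring test, which the quoted line 309 does not confirm (that line anchors the match at the start of the string). `identify_spider()` is a hypothetical helper, and the two test strings are the ones copied from the server stats in the first post.

```php
<?php
// Tokens copied from the default list above; the lookup itself is hypothetical.
$spiders = ['Bing' => 'msnbot', 'Google' => 'Googlebot', 'Yahoo!' => 'Slurp'];

// Returns the name of the first spider whose token occurs anywhere in the
// user agent string (case-insensitive), or null if none does.
function identify_spider(array $spiders, string $userAgent): ?string
{
    foreach ($spiders as $name => $token) {
        if (stripos($userAgent, $token) !== false) {
            return $name;
        }
    }
    return null;
}

var_dump(identify_spider($spiders, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')); // string(6) "Google"
var_dump(identify_spider($spiders, 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'));  // NULL
```

Note the second result: the default "msnbot" token does not occur in the bingbot string from the first post, so under this assumption that crawler would need its own entry (for example "bingbot").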
Re: Spiders and Bots [message #166440 is a reply to message #165940] |
Mon, 12 December 2011 04:53 |
Rocksteve
Messages: 3 Registered: December 2011
Karma: 0
|
Junior Member |
In the past I've always used a "robots.txt" file, so this is a new development for me. However, in the last few weeks I have had one particular IP that shows repeatedly as replying, browsing, or as errors in my log (hundreds of times), apparently requesting access to files or functions that do not exist or are not enabled. The ISP information lists this as "JPNIC", using a range of IPs from 119.63.192.0 - 119.63.199.255; so far I have copied perhaps 30 or so of the specific IPs from within this range.