This Time Self-Hosted

Me and crawlbots

From time to time I look at the access statistics for my blog, not for anything in particular, but because they give me a good idea of whether something in my setup is working or not. Today I noticed an excessive amount of traffic on this blog yesterday, and I’m not really sure what that’s about.

Since the last time something like this happened it was the DDoS that brought Midas down, I checked the logs for anything suspicious, but honestly I can’t find anything out of the ordinary, so I guess the traffic was actually related to the previous post, which got more readers than usual.

But reading the logs, especially to find bots, is always pretty interesting, because it really shows that some people don’t seem to understand what the net is about. For instance, my Apache server is set up to refuse requests from clients that send no User-Agent header at all; the HTTP specification expects clients to identify themselves, and you should at least state who you are. I don’t usually block read requests from browsers even when the agent strings are notoriously fake, since those don’t really bother me, but this particular rule is also helpful to let people know they should do their damn homework if they want to be good net citizens.
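Just to give an idea of what such a rule can look like (this is a minimal sketch with mod_rewrite, not the exact configuration I’m running), refusing requests with an empty or missing User-Agent is a matter of a couple of directives:

    RewriteEngine On
    # Refuse any request that arrives with an empty (or missing) User-Agent header
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule .* - [F]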

This, for instance, stops Ohloh (now Open Hub) from accessing my feeds, even with the tag feeds configured in the various projects; I already pointed this out to Jason, but they don’t seem to care; their loss. Similar services with similar bugs are really not that important to me. I would have preferred it if Ohloh had fixed the problem by providing their own User-Agent signature, but, alas, it seems too late for that.

But that is the most blatantly stupid case, I have to say, because the rest are much sneakier: there are tons of robots from smaller search engines that don’t seem to be very useful at all, tools declared to be “linguistic” that download stuff at random, and most of all “marketing research bots”. Now, as I said in a past post over at Axant, I don’t like bots that aren’t useful to me wasting my bandwidth, so yes, I keep a list of bots to be blocked.

Now, while almost all of the “Webmaster pages” for these bots (when they are listed, obviously) claim that their bot abides by the robots exclusion protocol (an overcomplicated name for robots.txt), quite a few of them never request it. And some, even when explicitly forbidden from accessing something, still do (they probably fetch robots.txt just so they can’t be found guilty, I guess). For this reason, my actual blacklist is not in the (multiple) robots.txt files (which I still use to keep good robots from hitting pages they shouldn’t) but rather in a single mod_security rules file, which I plan on releasing together with the antispam ones.
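To sketch the split (the bot name and path below are made up, and my actual rules file is, as I said, not released yet): robots.txt only keeps the well-behaved crawlers away, while a mod_security rule is enforced for everybody, whether they bothered to read robots.txt or not:

    # robots.txt: only honoured by well-behaved robots
    User-agent: *
    Disallow: /private/

    # mod_security rules file: enforced for everyone
    SecRule REQUEST_HEADERS:User-Agent "HypotheticalMarketingBot" \
        "phase:1,log,deny,status:403,msg:'Blacklisted crawler'"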

In addition to specific crawlers, I also started blocking the user agents of most HTTP access libraries for the various languages, at least when they don’t specify who on Earth they are (as soon as another word besides the library name and version is present, the access is allowed; if you write software that accesses HTTP, you should add your own identification to the library’s user agent, if not replace it entirely!), and of generic search engine software like Apache Nutch, which is probably meant to be run on your own site, not on others’. I really don’t get the point of all this; just bothering people because they can?
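As a rough illustration (the library names here are just examples, and this is not my actual rule), a mod_security rule can match a user agent that consists of nothing but a library name and a version, with no further identification after it:

    # Block bare "library/version" user agents that carry no extra identification
    SecRule REQUEST_HEADERS:User-Agent \
        "^(?:libwww-perl|Python-urllib|Java|PycURL)/\S+$" \
        "phase:1,log,deny,status:403,msg:'Unidentified HTTP library'"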

It’s fun, because you can often spot the actual robots by the fact that they don’t follow redirections or check error statuses. That makes it pretty easy in my case, since my site redirects you right away when you enter it (and both it and my blog have lots of redirections for moved pages; I don’t like breaking links).

Besides, one note on language-specific search engines: I do get the need for those, but it’d be nice if they didn’t start scanning pages in other languages, don’t you think? And if you’re going to be a generalist, please translate your robot’s description into English. I haven’t banned a single search engine of that kind yet, but some could really use at least an English summary!

Oh well, more work for mod_security I suppose.

Comments 6
  1. FYI, there seems to be some issue with the comments feed, as I cannot get either Firefox (3.5.1) or Akregator (4.3.61+svn) to read it properly. This seems to have occurred since Friday.

  2. I was viewing a page, but when I tried to download a local copy to edit and send some suggestions… I got the blog, not the page I requested, with ‘links’. It’s much easier to download and edit a page in ‘elvis’, but the read request was denied. Frankly, if it’s a ‘web’ page it should be readable by anyone; otherwise it is not a web page.

  3. @ABCD strange, I see the feed for the comments both with Firefox (3.5.1) and with Google Reader; which URL are you trying? @user99 I’m not sure what you’re referring to; which page are you trying to download, and with what software? If it’s about what you mailed me (I’ve seen the mail, just haven’t had time to reply yet, sorry), downloading those pages is not really what you should do, since the original sources are on git. But ‘links’ should be able to open my website just as well; only crawlers get filtered.

  4. Never mind the frustration. I have a Gitorious account now; let me know what you think when you have the time… I am off from work for the next 4 days 😉 I’ll keep checking.

  5. @ABCD: thanks! I’ve edited the comment above so that it’s now generated correctly, I hope (but the feedback will soon be pushed off anyway).
