Me and crawlbots

From time to time I look at the access statistics for my blog, not for anything in particular, but for the fact that it gives me a good impression on whether something is working or not in my setup. Today I noticed an excessive amount of traffic on this blog yesterday and I’m not really sure what’s that about.

Since the last time something like this happening it was the DDoS that brought Midas down, I checked the logs for something suspicious but sincerely I can’t seem to find anything different over it, so I guess the traffic was actually related to the past post that got more readers than usual.

But the read of the logs, especially checking to find bots, is always pretty interesting, because it really shows that sometimes people really don’t seem to understand what the net is about. For instance, my Apache server is set so that it refuses requests from clients that have no User-agent header sent; this is a violation of the protocol, and you should state who you are at least. I don’t usually kill read requests from browsers even when the agent strings are notoriously fake, since they usually don’t bother me, but this particular rule is also helpful to let people know that they should do their damn homework if they want to be good net citizens.

This for instance stops Ohloh Open Hub from accessing my feeds, with the tag feeds configured in the various projects; I already said that to Jason, but they don’t seem to care; their loss. Similar services with similar bugs are really not that important to me, I would have preferred if Ohloh fixed the problem by providing their own User-Agent signature, but, alas, that’s too late it seems.

But this is the most stupid case I have to say, because the rest are much sneakier; here are tons of robots of smaller search engines that don’t seem to be very useful at all, tools declared to be “linguistic” that download stuff randomly, and most of all “marketing research bots”. Now as I said in a past post over at Axant I don’t like when bots that are not useful to me waste my bandwidth so yeah I keep a log of the bots to be killed.

Now, while almost all the “Webmaster pages” for the bots (when they are listed, obviously) report that their bot abides to the robots autoexclusion protocol (an overcomplicated name to call robots.txt), there are quite a few of them that never request it. And some that even if they are explicitly forbidden to access something, still do (they probably hit robots.txt to not be found guilty I guess). For this reason, my actual blacklist is not in (multiple) robots.txt (that I still use to avoid good robots from hitting pages they shouldn’t) but rather in a single mod_security rules file, which I plan on releasing together with the antispam ones.

Additionally to specific crawler instances, I also started blocking user agents from most access libraries for various languages (those who don’t specify who on Earth they are — as soon as another word beside the library name and version are present the access is allowed, if you write a software that access HTTP you should add that to the user agent of the library, if you don’t replace it entirely!), and from generic search engine software like Apache nutch which are probably developed to be used on your own site, and not on others’. I really don’t get the point of all this, just bothering people because they can?

It’s fun because you can often find the actual robots because they don’t check redirections or error statuses. Which makes it kinda funny because my site redirects you right away when you enter (and both that and my blog have lots of redirections for moved pages, I don’t like breaking links).

Beside, one note on language-specific search engines; I do get the need of those, but it’d be nice if you didn’t start scanning for pages in other languages, don’t you think? And if you’re going to be generalist please translate your robot description to English. I have not banned a single one of those kind of search engines yet, but some could really at least have an English summary!

Oh well, more work for mod_security I suppose.