Size matters for crawlers too!

No, I’m not talking about the Cuil search engine, whose crawler insisted that “size matters” and was thus hammering a high number of websites, including some of mine.

Some time ago I wrote a list of good practices that crawler authors should at least try to follow. Not following them is, for me, a good reason to disallow access to my own websites. Play by the rules and I’ll do almost anything for you; try to be a smartass and, if I find you out, it’s my blacklist you’re going to hit. Unfortunately, sometimes I can only clean up after somebody has already wasted my bandwidth.

Yesterday I noticed in my web log statistics a spike in the bandwidth used by my website that I couldn’t ascribe to the advertising I put up while looking for a job. A quick check showed that a new experimental crawler from the Italian CNR had decided to download the whole of my website, including the full Autotools Mythbuster, all at once, without any delay between requests and, more importantly, without using compression!

Now, luckily my website is not excessively big: downloading it in full without compression only accounts for about 30MB of data. But using compression is, generally speaking, a good idea; what about all the other websites they are crawling? Crawlers like these abuse the network all over the globe, and they should learn their lesson.
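For reference, serving compressed responses is a one-off setup on the server side; on Apache it boils down to a few mod_deflate directives along these lines (a minimal sketch, not my actual configuration; the module path and the MIME type list are just examples):

# Compress textual responses for clients that advertise gzip/deflate
# support in Accept-Encoding. Adjust the module path to your installation.
LoadModule deflate_module modules/mod_deflate.so

AddOutputFilterByType DEFLATE text/html text/plain text/css
AddOutputFilterByType DEFLATE text/xml application/xml application/xhtml+xml
AddOutputFilterByType DEFLATE application/javascript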

So I devised some tricks with ModSecurity. They are almost surely not perfect, but the one I’m going to talk about should help quite a bit:

# If robots.txt is requested, flag the client IP as a robot for three
# days (259200 seconds), then skip the remaining checks so the file is
# always served, even without Accept-Encoding.
SecRule REQUEST_URI "@streq /robots.txt" \
    "phase:1,pass,setvar:ip.is_robot=1,expirevar:ip.is_robot=259200,skipAfter:END_ROBOT_CHECKS"

# Clients that accept compressed responses are always let through.
SecRule REQUEST_HEADERS:Accept-Encoding \
    "@pm deflate gzip" "phase:1,pass,skipAfter:END_ROBOT_CHECKS"

# Anything left at this point does not accept compression; if it has
# been flagged as a robot, refuse to serve it.
SecRule IP:IS_ROBOT "@eq 1" \
    "phase:1,deny,status:406,msg:'Robot at %{REMOTE_ADDR} is not supporting compressed responses'"
SecMarker END_ROBOT_CHECKS

Update: I changed the code above to allow fetching robots.txt even without Accept-Encoding; interestingly, it seemed to kill msnbot on the spot, which looks pretty… wrong. So either I made a mistake in my rules, or the Microsoft bot is, well, broken.

What does this do? First of all I have to identify whether a request is coming from a bot at all; there are a number of ways to do that, but I settled for a half-decent heuristic: if you hit my website’s robots.txt, I assume you are a robot, and I flag every further request coming from the same IP. It might be a bit too greedy, but it should cover it. I also cache the flag for three days, since some crawlers have the bad habit of not requesting robots.txt before sending out a bunch of page requests.
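One detail the snippet above takes for granted: the ip collection has to be initialised before those rules run, otherwise the ip.is_robot variable has nowhere to live. In my configuration that happens elsewhere; if you’re starting from scratch, something along these lines should do (the data directory path is just an example):

# Persistent storage for ModSecurity collections; pick a directory the
# web server can actually write to.
SecDataDir /var/cache/mod_security

# Initialise the per-client ip collection, keyed on the remote address.
SecAction "phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"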

Then I check whether either the deflate or gzip algorithm is supported in the Accept-Encoding header; if it is, I let the request pass through and be responded to (compressed). Otherwise, I check whether the client is a flagged robot, and if so I reply with a “Not Acceptable” answer; this is intended to signify that the data is there, but it’s not available to that crawler because it does not support compressed responses.

I’ve now set this up as an experiment on my own website; I’ll report back if there are any serious false positives, though I don’t think there will be many, if any at all.
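If you’d rather watch for false positives before turning anybody away, a cautious variant is to run the final rule in detection-only mode for a while, logging the would-be rejections instead of denying them (a sketch; swap it back to the deny rule once you trust the results):

# Detection-only variant of the last rule: log the hit but let the
# request through, so any false positives show up in the logs first.
SecRule IP:IS_ROBOT "@eq 1" \
    "phase:1,pass,log,msg:'Robot at %{REMOTE_ADDR} would have been denied (no compression support)'"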
