Size matters for crawlers too!

No, I’m not talking about the Cuil search engine, whose crawler insisted that “size matters” and was therefore hammering a large number of websites, including some of mine.

Some time ago I wrote a list of good practices that crawler authors should at least try to follow. Not following them is, for me, a good reason to disallow access to my own websites. Play by the rules and you can do almost anything with them; try to be a smartass, and if I find you out, it’s my blacklist you’re going to hit. Sometimes, though, I can only clean up after somebody has already wasted my bandwidth.

Yesterday I noticed in my web log statistics a spike in the bandwidth used by my website that I couldn’t ascribe to the advertising I put up for my job search. A quick check revealed that a new experimental crawler from the Italian CNR had decided to download the whole of my website, including the full Autotools Mythbuster, all at once, without any delay between requests and, more importantly, without using compression!

Luckily my website is not excessively big: downloading all of it without compression only accounts for about 30MB of data. But using compression is generally a good idea, and what about all the other websites they are crawling? Crawlers like this waste network resources all over the globe, and they should learn their lesson.
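To get a feel for how much bandwidth compression can save, here is a minimal Python sketch that gzips a made-up HTML page and compares the sizes. The sample page is purely hypothetical and far more repetitive than real content, so it compresses better than an actual website would; treat it as an illustration rather than a measurement of my site.

```python
import gzip

# Hypothetical sample page: repetitive markup standing in for a static HTML page.
page = (
    "<html><body>\n"
    + "<p>Some paragraph of text on a static web page.</p>\n" * 500
    + "</body></html>\n"
).encode("utf-8")

# What a server would send to a client advertising "Accept-Encoding: gzip".
compressed = gzip.compress(page)

ratio = len(compressed) / len(page)
print(f"plain: {len(page)} bytes, gzip: {len(compressed)} bytes ({ratio:.1%} of original)")
```

Running the same comparison against real pages from a site gives a more honest number for what a non-compressing crawler wastes.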

So I devised some tricks with ModSecurity. They are almost surely not perfect, but the one I’m going to talk about should help quite a bit:

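# Assumes the per-IP collection is already initialised, e.g. with initcol:ip=%{REMOTE_ADDR}.
# Anyone fetching robots.txt is flagged as a robot for three days (259200 s) and let through.
SecRule REQUEST_URI "@streq /robots.txt" \
    "phase:1,pass,setvar:ip.is_robot=1,expirevar:ip.is_robot=259200,skipAfter:END_ROBOT_CHECKS"

# Clients advertising deflate or gzip skip the remaining check.
SecRule REQUEST_HEADERS:Accept-Encoding \
    "@pm deflate gzip" "phase:1,pass,skipAfter:END_ROBOT_CHECKS"

# Flagged robots that did not advertise compression get a 406 Not Acceptable.
SecRule IP:IS_ROBOT "@eq 1" \
    "phase:1,deny,status:406,msg:'Robot at %{REMOTE_ADDR} is not supporting compressed responses'"
SecMarker END_ROBOT_CHECKS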

Update: I changed the code above to allow fetching robots.txt even without Accept-Encoding; interestingly, the earlier version seemed to kill msnbot on the spot, which looks pretty… wrong. So either I made a mistake in my rules, or the Microsoft bot is, well, broken.

What does this do? First of all, I have to identify whether a request is coming from a bot at all. While there are a number of ways to do that, I settled on a half-decent heuristic: if you’re hitting my website’s robots.txt, I assume you are a robot, and I treat every request coming from the same IP as a robot’s. It might be a bit too greedy, but it should cover it. I also keep that flag for three days, since some crawlers have the bad habit of not requesting robots.txt before every batch of page requests.

Then I check whether the Accept-Encoding header lists either the deflate or the gzip algorithm; if it does, I let the request through and it gets a (compressed) response. Otherwise, I check whether the client is a robot, and if so I reply with a “Not Acceptable” (406) answer. This is intended to signify that the data is there, but it’s not available to that crawler because it does not support compressed responses.

I’ve now set this up as an experiment on my own website. I’ll report back if there are any serious false positives; I don’t think there will be many, if any at all.

4 thoughts on “Size matters for crawlers too!”

  1. Sounds interesting, I host a mirror of an old website which is heavy-flash-only, I could use this trick maybe, I’ll stay tuned on this thread.

  2. What do you think about storing static web pages in gzipped form and serving them as-is using sendfile (or equivalent) instead of storing the HTML and compressing on the fly each time? More data fits in block cache that way. If need be, gunzipping from disk to socket is cheaper than gzipping from disk to socket. For large files and/or a large number of requests I think refusing to serve uncompressed makes lots of sense. Same with bots that don’t send If-Modified-Since (if I know they’ve crawled me before). I’m willing to humor an ancient interactive UA, but not stupid batch crawlers.

  3. Zeev, I already planned on checking that out before, but I never went on to make sure that Apache would actually use it. It would definitely make sense, especially since my website is _all_ static webpages (and the same goes for a few of my friends’). Also, it seems like my original rule didn’t work out that well, as it kept msnbot/2.0b out, while I’m pretty sure it properly supports Accept-Encoding… it might not send it when accessing robots.txt for compatibility though, so I’ll be looking at that.

  4. Okay, I can confirm it now: msnbot does request robots.txt without Accept-Encoding, but the rest of the website with it, probably as a safety measure. So the code above, which always allows robots.txt to be fetched, is the good one!
