Since I’ve been playing some more with ModSecurity after reviewing the book, I’ve decided to implement one thing: an idea I had been toying with for some time but never got around to implementing. But let’s start with some background.
I’ve had quite a few pet peeves with crawlers and generic bots. The main problem I have is the sheer number of them. Once upon a time you would have a fairly limited number of bots floating around, but nowadays you get quite a few of them together; some of them are the usual search engines, others are more “amateurish” things, and “of course” there are the usual spam crawlers, but the ones that really upset me are the marketing-focused crawlers. I’ll split my post to focus on each of those types, in reverse order.
Marketing crawlers are those deployed by companies that sell services like analysis of blog posts to find bad publicity and stuff like that. I definitely hate these crawlers: they keep downloading my posts for their fancy analysis, when they could use some search engine’s data already. Since most of them also seem to focus only on profiting, instead of developing their technology first, they tend to ignore the robots exclusion protocol and the HTTP/1.1 features that avoid wasting extra bandwidth, and they do not leave any delay between requests.
I usually want to kill these bots outright: some don’t look for the robots.txt file, let alone respect it, and maintaining a huge robots.txt file would be impractical. So I use a list of known bot names and a simple ModSecurity rule that denies them access. Again, thanks to Magnus’s book, I’ve been able to make the rule much faster by using the pmFromFile matcher instead of the previous regular expression.
The second category is the spam crawlers, something that nowadays we’re unfortunately quite used to seeing. In this category you actually have more than one type: those that crawl your site to find email addresses (which is what Project Honey Pot tries to look out for), those that send requests to your site to spam the referrer counter (to gain extra credit if your referrer statistics are public, which is one more reason why my statistics are secured by obscurity and by a shallow password), and those that use your feeds to get content to post on their own site, without a link but with lots of advertising.
These are nasty, but are more difficult to kill, and I’ll get to that later.
The third category is the amateurish crawlers: new technologies being developed, “experimental” search engines and the like. I understand that it would be quite difficult for the developers to have something to work with if we all were to block them. But on the other hand, they really should start by respecting protocols and conventions, as well as by describing their work and where the heck they are trying to get with it.
One funny thought here: if a start-up wanted to develop new crawler technology, heavily distributing rules and documentation to filter its requests out would probably be a quite evil way to kill the company off. To give a suggestion to those who might find themselves in that situation: try getting a number of affiliates who will let you crawl their sites. To do that you need to either show a lot of good intent, or bribe them. It’s up to you what you decide to do, but lacking both, it’s likely going to be hard to get your stuff together.
The last category is search engine crawlers: Googlebot, msnbot, Yahoo! Slurp. The whole bunch is usually just controlled through robots.txt and there is nothing to say about them in general. The only reason to talk about them here is that the crawlers in the previous categories sometimes try to pass themselves off as one of the more famous crawlers in order to be let in. For this reason, all of the search engines suggest that you check the crawler’s identity through double resolution of the IP address: take the IP address of the request, reverse-resolve it to a hostname (checking that it falls in the right domain; for Googlebot it’s simply .googlebot.com), and then resolve the hostname to ensure it still maps to the same address.
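The double-resolution check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what Apache does internally: the function name `is_genuine_crawler` and its parameters are made up for this example, the two resolver arguments exist only so the logic can be exercised without real DNS, and `socket.gethostbyname_ex` covers IPv4 only.

```python
import socket


def is_genuine_crawler(ip, allowed_suffixes,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (hypothetical helper).

    Returns True only if `ip` reverse-resolves to a hostname ending in one
    of `allowed_suffixes` AND that hostname resolves back to `ip`.
    """
    try:
        # Step 1: reverse lookup (PTR record) of the requesting address.
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no reverse record at all

    # Step 2: the name must fall inside the crawler's own domain.
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False

    try:
        # Step 3: forward lookup of that name; it must map back to the IP.
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

With the default resolvers, a call would look like `is_genuine_crawler(remote_ip, [".googlebot.com"])`; a requester that only fakes its reverse record fails at the third step.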
The double resolution matters because whoever controls the reverse DNS zone for an address can make it resolve to any name they like, including one in the right domain; only the forward lookup, which they do not control, gives the fake away. Luckily, Apache already has code to handle this properly for its host-based authorization: you just need to set HostnameLookups to Double. Once that’s enabled, the REMOTE_HOST variable becomes available to ModSecurity. The result is the following snippet of Apache configuration:
HostnameLookups Double

# Each chain denies a request whose User-Agent claims to be a known crawler
# but whose double-resolved hostname is outside that crawler's domain.
SecRule REQUEST_HEADERS:User-Agent "@contains googlebot" "chain,deny,status:403,msg:'Fake Googlebot crawler.'"
    SecRule REMOTE_HOST "!@endsWith .googlebot.com"

SecRule REQUEST_HEADERS:User-Agent "@contains feedfetcher-google" "chain,deny,status:403,msg:'Fake Google feed fetcher.'"
    SecRule REMOTE_HOST "!@endsWith .google.com"

# msnbot hosts have no dedicated domain, so match the full hostname;
# the dots are escaped and the pattern anchored so that nothing can be
# prepended or appended to a matching name.
SecRule REQUEST_HEADERS:User-Agent "@contains msnbot" "chain,deny,status:403,msg:'Fake msnbot crawler.'"
    SecRule REMOTE_HOST "!@rx ^msnbot-[0-9]+-[0-9]+-[0-9]+\.search\.msn\.com$"

SecRule REQUEST_HEADERS:User-Agent "@contains yahoo! slurp" "chain,deny,status:403,msg:'Fake Yahoo! Slurp crawler.'"
    SecRule REMOTE_HOST "!@endsWith .crawl.yahoo.net"
At that point, any request claiming to be one of the three main bots really does come from that bot’s own hosts. You might notice that the rules use a regular expression, rather than a simple suffix match, to validate the Microsoft bot. The reason is that Google and Yahoo!, to be safe, give their crawling hosts a dedicated (sub)domain, while Microsoft and Ask Jeeves do not. (I only checked Ask Jeeves quickly and haven’t implemented the test for it, since it doesn’t get as many hits as the rest; its regexp would be crawler[0-9]+\.ask\.com.) And of course changing that is going to be tricky for them because many people are already validating them this way. So, learn from their mistakes.
Hopefully, the extra rules I’m loading ModSecurity with are actually saving me bandwidth rather than wasting it; given that some fake bots seem to make hundreds of requests a day, that’s quite likely. Also, thankfully, I have nscd running (so that Apache does not have to send every lookup to the DNS server), and the DNS server is within the local network (so the bandwidth used to contact it is not as precious as the bandwidth used to send data out).
My next step is probably going to be optimising the rules, although I’m not sure how to go about it; I’ll get to that when I push this to an actual repository for a real project, though.