Antibiotics for the Internet, or why I’m blocking Semalt crawlers

As I noted earlier, I’ve been doing some more housecleaning of bad HTTP crawlers and feed readers. While it matters very little for me and my blog (I don’t pay for bandwidth), I find it’s a good exercise and, since I do publish my ModSecurity rules, a public service for many.

For those who think that I may be losing real readership by doing this: the number of visits on my site as seen by Analytics increased (because I shared the links to that post on Twitter and G+, as well as in the GitHub issues and the complaint email I sent to the FeedMyInbox guys), yet the daily traffic was cut in half. I think this is what is called a win-win.

But one thing that became clear from both AWStats and Analytics is that there was one more crawler I had not stopped yet. The crawler’s name is Semalt, and I’m not doing them the favour of linking to their website. Those of you who follow me on Twitter have probably seen what they categorized as “free PR” for them while I was ranting about them. I first called them a cancer for the Internet, then realized that the right categorization would be bacteria.

If you look around, you’ll find unflattering reviews and multiple instructions to remove them from your website.

Funnily enough, once I tweeted about my commit, one of their people, who I assume is in their PR department rather than engineering given the blatant stupidity of their answers, told me that it’s “easy” to opt out of their scanner… you just have to go to their website and tell them your websites! Sure, sounds like a plan, right?

But why on earth am I spending my time attacking one particular company that, to be honest, is not wasting that much of my bandwidth to begin with? Well, as you can imagine from me comparing them to Shigella bacteria, I do have a problem with their business idea. And given that on Twitter they completely missed my point (when I pointed out the three spammy techniques they use, their answer was “people don’t complain about Google or Bing” — well, yes, because neither of the two uses any of those spammy techniques!), it’s difficult for me to consider this a mistake. They are doing it on purpose.

Let’s start with the technicalities, although that’s not why I noticed them to begin with. As I said earlier, the way to “opt out” of their service is to go to their website and fill in a form. They completely ignore robots.txt; they don’t even fetch it. For an automated crawler, that’s bad enough.

The second problem is that they don’t advertise themselves in the User-Agent header. Instead, all their fetches report Chrome/35 — and given that they can pass through my ruleset, they probably use a real browser driven by something like WebDriver. So you have no real way to identify their requests among the others, which is not how a good crawler should operate.

The third and most important point is the reason why I consider them plain spammers, and so, judging from the links I posted earlier, do others. Instead of using the user agent field to advertise themselves, they subvert the Referer header. This means that all their requests, even those that have been 301’d and 302’d around, report their website as the referrer. And if you know how AWStats works, you know it doesn’t take that many crawls for them to become one of the “top referrers” for your website, and thus appear prominently in your stats, whether those are public or not.
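
My ruleset takes care of this, but if you want something standalone, a minimal ModSecurity rule along these lines should be enough; this is a sketch rather than the exact rule I publish, and it assumes the spammed referrer always contains the string “semalt”:

SecRule REQUEST_HEADERS:Referer "@contains semalt" \
    "phase:1,t:lowercase,deny,status:403,msg:'Semalt referrer spam crawler'"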

At this point it would be easy to say that they are clueless and not doing this on purpose, but then there is the other important part. Their crawler executes JavaScript, which means it gets tracked by Google Analytics too! Analytics has no access to the server logs, so for their website to show up as a referrer there (as reported by the people looking to filter it out), the crawler has to make a deliberate effort. Again, this could easily be a mistake, given that they are using something like WebDriver, right?

The problem is that whatever they use, it does not fetch either images or CSS. But it does fetch the Analytics JavaScript and execute it, as I said. And the only reason I can think of for them to do that is to spam the referrer list in there as well.

As their Twitter person thanked me for my “free PR”, I wanted to expand further on it, in the hope that people will get to know them, and avoid them. As I said, my ModSecurity ruleset is already set up to filter them out; other solutions, for those who don’t want to use ModSecurity, are linked above.

More HTTP misbehaviours

Today I have been having some fun: while looking at the backlog on IRCCloud, I found out that it auto-linked Makefile.am, which I promptly decided to register with Gandi — unfortunately I couldn’t get Makefile.in or configure.ac as they are both already registered. After that I decided to set up Google Analytics to report how many referrals arrive at my websites through some of the many vanity domains I have registered over time.

After doing that, I spent some time staring at the web server logs to make sure that everything was okay, and I found out some more interesting things: it looks like a lot of people have been fetching my blog’s Atom feed through very bad feed readers. This is my forecast from last year, when Google Reader was shut down, becoming reality.

Some of the fetchers are open source, so I ended up opening issues for them, but that is not the case for all of them. And even when they are open source, sometimes they don’t even accept pull requests implementing the feature, for whatever reason.

So this post is a bit of a name-and-shame, which can be positive for open-source projects that can fix things, or negative for closed-source services that are trying to replace Google Reader while failing to implement HTTP properly. It will also serve as a warning for those of my readers who use these services: they’ll stop being able to fetch my feed pretty soon, as I’ll update my ModSecurity rules to stop these people from fetching my blog.

As I noted above, both Stringer and Feedbin fail to request compressed responses (gzip compression), which means they fetch over 90KiB on every pass instead of just 25KiB. The Stringer devs have already reacted and seem to be looking into fixing this very soon. I have no answer from Feedbin yet (though it hasn’t been long), but it worries me for another reason too: it does not do any caching at all. And somebody set up a Feedbin instance at the Prague University that fetches my feed, without compression, without caching, every two minutes. I’m going to blacklist it soon.

Gwene still has not replied to the pull request I sent in October 2012 but, on the bright side, it has not fetched my blog in a long time. Feedzirra (now Feedjira), used by IFTTT, still does not enable compressed responses by default, even though it seems to support the option (Stringer is also based on it, it seems).

It’s not just plain feed readers that fail at implementing HTTP. The distributed social network Friendica – which aims at doing a better job than Diaspora at that – also seems to forget about implementing either compressed responses or caching. At least it seems to fetch my feed only every twelve hours. On the other hand, it also seems to pull someone’s timeline from Twitter, and when it encounters a link to my blog it first sends a HEAD request, then fetches the page. Three times. Also uncompressed.

On the side of non-open-source services, FeedWrangler has probably one of the worst implementations of HTTP I’ve ever seen: it does not support compressed responses (so the full 90KiB feed), it does no caching (a full fetch every time!), and while it does fetch at one-hour intervals, it does not understand that a 301 is a permanent redirection, so it keeps around two feed IDs for /articles.rss and /articles.atom (each with one subscriber) when there is no point in doing so. That’s 90KiB × 24 fetches × 2 feeds, or around 4MiB a day, which is about 2% of the bandwidth my website serves. While this is not a significant amount, and I have no limit on the server’s egress, it seems silly that 2% of my bandwidth is consumed by two subscribers, when the site has over a thousand visitors a day.

But what takes the biscuit is definitely FeedMyInbox: while it fetches only every six hours, it implements neither caching nor compression. And I found it only when looking into the requests coming from bots without a User-Agent header. The requests come from 216.198.247.46, which is svr.feedmyinbox.com. I’m going to blacklist this one too, until they stop being douches and provide a valid user agent string.
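
For reference, the kind of rules I mean look more or less like this; a sketch only, with the address taken from the logs above, so treat it as an example rather than a maintained blacklist:

# No User-Agent header at all: refuse the request outright.
SecRule &REQUEST_HEADERS:User-Agent "@eq 0" \
    "phase:1,deny,status:403,msg:'Missing User-Agent header'"

# The FeedMyInbox fetcher seen above.
SecRule REMOTE_ADDR "@streq 216.198.247.46" \
    "phase:1,deny,status:403,msg:'FeedMyInbox fetcher without a valid User-Agent'"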

They are far from the only ones, though; there is another bot that fetches my feed every three hours and will soon follow the same destiny. But this one does not have an obvious service attached to it, so if whatever you’re using to read my blog tells you it can’t fetch it anymore, try to figure out whether you’re using a douchereader.

Please remember that software on the net should be implemented for collaboration between client and server, not for exploitation. Everybody’s bandwidth gets worse when you heavily use a service that is not doing its job at optimizing bandwidth usage.

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s decision to shut down the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing it at this point. On the other hand, I can talk once again about my ModSecurity ruleset, and in particular about the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) don’t seem to be able to try it out yet because their service – I’m not going to call them servers, as it seems they at least partially use AWS for hosting – fails to scale.

I’m pretty sure they are receiving an exceptional amount of load right now, as everybody and their droid are trying to register for the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database – subscriptions to my blog’s feed went from 5 to 23 in a matter of hours! – but there are still a few things I can infer from the way it behaves that make me think somebody overlooked the need for a strong HTTP implementation.

First of all, what happened was that I got a report on Twitter that NewsBlur was getting a 403 when fetching my blog, which was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agents, which suggests they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

The third is the most conspicuous string, because it’s very minimal and does not follow the average format, using a dash as separator instead of adding the URL in parentheses next to the fetcher name (and version, more on that later).

The other two strings have been derived from the string reported by Safari on OS X — but interestingly enough from two different Safari versions, and one of the two has actually been partially stripped as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari HTML might be a bad idea if it’s then displayed by the user in a different browser.

The code that fetches feeds and the code that fetches pages are likely quite different, as can be seen from the full requests. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is fairly sophisticated fetching code: not only does it properly support compressed responses (Accept-Encoding header), it also uses the If-None-Match and If-Modified-Since headers to avoid re-fetching unmodified content. The fact that it points to November 1st of two years ago is likely because my ModSecurity ruleset has refused to speak with this fetcher since then, on account of the fake User-Agent string. It also includes a proper Accept header that lists the feed types it prefers over generic XML and other formats.

The A-Im header is not a fake or a bug; it’s actually part of RFC 3229, Delta encoding in HTTP, and stands for Accept-Instance-Manipulation. I had never seen it before, but a quick search turned it up, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” manipulator, even though it seems to be used in the wild, and I couldn’t find proper formal documentation of how it should work. The theory, from what I can tell, is that the blog engine would be able to use the If-Modified-Since header to produce, on the spot, a custom feed for the fetcher that only includes entries modified since that date. Cool idea; too bad it lacks a standard, as I said.

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

So we can tell two things from the comparison: this code is older (an earlier version of Safari is being used), and it has not received the same care as the feed fetcher (which at least dropped the Safari identifier itself). It’s more than likely that, if libraries are used to send the requests, a completely different one is used here, as this request declares support for the compress encoding, which the feed fetcher does not (and which, as far as I can tell, is never actually used). It is also much less choosy about the formats it receives, as it accepts whatever you want to give it.

*For the Italian readers: yes, I intentionally picked the word choosy. While I find Fornero an idiot as much as the next guy, I grew tired of copy-pasted statuses on Facebook and comments saying she should have said picky. Know your English, instead of complaining about idiocies.*

The lack of If-Modified-Since here does not really mean much, because it’s also possible that they were never able to fetch the page, and they might have introduced the feature later (even though the code is likely older). But the Content-Length header sticks out like a sore thumb, and I expect it to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the most naïve of the three, and possibly the code that needs the most cleaning up:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

Here we start with near protocol violations, by not providing an Accept header — especially facepalm-worthy considering that this is where a static list of MIME types would be the most useful, to restrict the image formats that will be handled properly! But what trips my rules is that the Accept-Encoding there is not suitable for a bot at all: since the fetcher does not accept any compressed response, my ruleset now responds with a 406 Not Acceptable status code instead of providing the icon.

I can understand that a compressed icon is more than likely not useful (indeed, most images should not be compressed at all when sent over HTTP), but why explicitly refuse it? Especially when the other two fetchers support reasonably sophisticated HTTP?

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, and with different levels of care. It might not be the best of times for them to look at their HTTP implementation, but I would still suggest it. Pipelining the three requests they need over a single connection – instead of using Connection: close – could easily reduce the number of connections to blogs, which would be very welcome to all the bloggers out there. And using the same HTTP code for all three fetchers would make it easier for people like me to handle NewsBlur properly.

I would also like to have a way to validate that a given request really comes from NewsBlur — like we do with GoogleBot and other crawlers. Unfortunately this is not really possible, because they use multiple servers, both on standard hosting and on AWS, both on IPv4 and (possibly, at one point) IPv6, so using FcRDNS is not an option.

Oh well, let’s see how this thing pans out.

World IPv6 Launch and their bots

You probably remember that I spent quite a bit of time working on a ModSecurity ruleset that would help me filter out marketing bots and spammers. This has paid off quite well as the number of spam comments I receive is infinitesimal compared to other blogs, even those using Akismet and various captchas.

Well, in the past few days I started noticing one more bot, which is now properly identified and rejected, scouring my website and this blog, though not any other address on my systems. The interesting part is that it tries (badly) to pass as a proper browser: "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))" (sic). Pay attention to the capital I, the version of the OS, and the two closing parentheses at the end of the string.
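
Since there is no such thing as Windows NT 9.0, the string itself is enough to identify the bot; the rule that now rejects it is, more or less, a variation on this sketch (the actual rule in my ruleset may differ):

SecRule REQUEST_HEADERS:User-Agent "@contains windows nt 9.0" \
    "phase:1,t:lowercase,deny,status:403,msg:'Fake browser claiming Windows NT 9.0'"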

Okay, so this is not the brightest puppy in the litter, but there is something else: it’s distributed. By this I mean that I’m not getting the same request twice from the same IP address. This is usually common for services using Amazon EC2 instances, as the IP addresses there are ever-changing, but that is not the case here: the IP addresses all belong to Comcast.

You can probably see where this is going, given the title of the post, because there is another peculiarity to these requests: they come through a mix of IPv4 and IPv6, which is what tipped me off that something strange was going on; crawler bots usually prefer easily masked IPv4 addresses rather than very granular IPv6 ones.

Since the website, the blog and (I didn’t mention it before, but it’s also being hit) xine-project.org are listed on the World IPv6 Launch page, it’s easy to see that this is Comcast software trying to check whether the websites are really available. This is corroborated by the fact that the IPs all resolve to hosts within Comcast’s network throughout the United States, all starting with “sts02” in their reverse resolution.

Now, it wouldn’t be too bad and I wouldn’t be kicking them so hard if they played by the rules, but none of the requests went for robots.txt beforehand, and in a timeframe of 22 minutes they sent me 38 requests per host. The heck?

Now, this is not the sole bot requesting data for the World IPv6 Launch; there is another one I noticed, which was caught by an earlier rule: "Mozilla/5.0 (compatible; ISOCIPv6Bot; +http://www.worldipv6launch.org/measurements/".

Contrary to Comcast’s, this bot actually seems to only request a HEAD of the pages instead of going for a full-blown GET request. On the other hand, it still does not respect robots.txt. The requests are fewer… but they are still a lot: in the past week Comcast’s bot requested pages on my website almost 13K times – thirteen thousand – while this other bot did so “only” eighteen hundred times.

Interestingly, this bot doesn’t seem to be provider-specific: I see requests coming in from China, Brazil, Sweden, the UK and even Italy! Although, funnily enough, the requests coming from Italy arrive from a plain IPv4 address, huh? Okay, so they are probably trying to make sure that the people who signed up for the IPv6 launch are really ready to be reachable over IPv6, but they could at least follow the rules, and make a bit more sense, couldn’t they?

More crawlers hatred

After announcing it, I have not stopped working on my ModSecurity ruleset; even with a bit of clumsiness in handling changes, I’ve started using the repository as a way to document how new crawlers are discovered.

This led me to reflect on a number of things, besides the concerns about EC2 I have already posted. For instance, there is still the problem of discovering new bots, and of identifying more complex patterns besides the User-Agent and IP themselves. Analysing the requests one by one is definitely out of the question: it would require a huge amount of time, especially with the kind of traffic my website has. I guess the most useful thing to do here would be to apply the kind of testing that Project Honey Pot has been taking care of for the past years.

I did use Project Honey Pot myself in the past, but a couple of things caused trouble with it before and, at the end of the day, ModSecurity already filtered enough of those bots that it didn’t make much sense to submit my data. On the other hand, I guess it might provide more “pointed” data: only the most obnoxious of the crawlers would pass through my first line of defence in the first place. At any rate, I’m now considering the idea of setting up Project Honey Pot again on my website and seeing how it works out; maybe I’ll just add an auditlog call to the ModSecurity rules when the honeypot script is hit, and analyse those requests to find more common patterns that can help me solve the problem.

Speaking of Project Honey Pot, I’ll be masking for removal the http:BL access filter for Apache, because it turned out to be totally useless as it is; it’s probably outdated or worse — it hasn’t been touched in the past three years, which doesn’t look good. For those who don’t know it, http:BL is the service Project Honey Pot provides to its active users; it uses the collected data to export (via special DNS requests) information about a given IP address, for instance whether it is a known search engine crawler, a non-crawler or a spam crawler. The data itself is pretty valuable (I have seen very few false positives before), so it should be pretty helpful; unfortunately, the API to access it is not a standard DNSBL: the query has to be prefixed with your access key, and the reply encodes the days since last activity, a threat score and the visitor type into the octets of the returned address, so you cannot simply use the already-provided code in ModSecurity to check it.

What I’m thinking of, right now, is making good use of ModSecurity’s Lua extension capability. Unfortunately, I don’t have much experience with Lua as it is, and I don’t have much time to dedicate to this task at the moment (if you want, I can be hired to dedicate next week to it). The same Lua capabilities are likely to come into play if I decide to implement an open proxy check, or adapt one of the IRC servers’ checks, to work as antispam.

For the moment, I’ve updated the repository with a few more rules, and I’ll soon be pushing an optional/ directory containing my antispam configuration as well as a few extra rules that only apply in particular situations and should only be included within the vhosts or locations where they are needed.

In the meantime, if you like the ruleset and you use it, I have a Flattr widget on the ruleset’s website.

Amazon EC2 and old concepts

On Friday I updated my Autotools Mythbuster guide to add portability notes for 2.68 (all the releases between 2.65 and 2.67 have enough regressions to make them a very bad choice for general use — for some of them we’ve applied patches that make projects build nonetheless, but those releases should really just disappear from the face of the Earth). When I did so, I announced the change on my identi.ca and then watched the logs, as a little personal experiment.

In a matter of a couple of hours, I could see a number of bots coming my way; some simply declared themselves outright (such as StatusNet, which checked the link to produce the shortened version), while others tried more or less sophisticated ways to pass themselves off as something else. It is important to note that many times, when a bot declares itself to be something like a browser, it’s simply to get served what the browser would see, as browser-specific hacks are still way too common, but that’s a digression I don’t care about here.

This little experiment of mine was actually aimed at refining my ModSecurity ruleset since I had some extra free time; the results of it are actually already available on the GitHub repository in form of updated blacklists and improved rules. But it made me think about a few more complex problems.

Amazon’s “Elastic Compute Cloud” (or EC2) is an interesting idea for making the best use of all the processing power of modern server hardware; it makes what a colleague of mine said last year (“Recently we faced the return of clustered computing under the new brand of cloud computing, we faced the return of time sharing systems under the software as a service paradigm […]”) sound even more true, especially when you think of them introducing a “t1.micro” size for EBS-backed instances, for non-CPU-hungry tasks that can run with minimal CPU but need more storage space.

But at the same time, the very design of the EC2 system gets troublesome in many ways; earlier this year I ran into trouble with hostnames when calling back between different EC2 instances, which ended up being resolved by using a dynamic hostname, as we were all used to doing at the time of dynamic-IP connections such as home ADSL (which, for me, lasted basically until a couple of years ago). A very old technique, almost forgotten by many, but pretty much necessary here.

It’s not the only thing that EC2 brought back from the time of ADSL, though; any service based on it will lack proper FcRDNS verification, which is very important to make sure that a bot request hasn’t been forged (at least until somebody creates a RobotKeys standard similar to DomainKeys), leaving it possible for non-legit bots to pass as legit ones, unless you can find a way to tell the two apart with deep inspection of the requests. It also makes it very easy to pass for anything at all, since you can only judge by the User-Agent to find out who is making a request, the IP addresses being dynamic and variable.

This situation led to an obvious conclusion in the area of DNSBLs (DNS-based block lists): the whole AWS network block is marked down as a spam source and is thus mostly unable to send email (or, in the case of my blog, to post comments). Unfortunately this has a huge disadvantage: Amazon’s own internal network faces the Internet from the same netblock, which means that Amazon employees can’t post comments on my blog either.

But the problem doesn’t stop there. As it was, my ruleset cached the result of the robots analysis, keyed by IP, for a week. This covers the situation pretty nicely for most bots hosted on a “classic” system, but for those running on Amazon AWS the situation is quite different: the same IP address can change “owner” in a matter of minutes, leading to false positives as well as using up an enormous number of cache entries. To work around this problem, instead of hardcoding the expiration time of any given IP-bound test, I use a transaction variable, which defaults to a week but gets changed to an hour in the case of AWS.
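
A minimal sketch of the idea, not the ruleset’s actual code: it assumes reverse resolution is enabled (HostnameLookups) and that EC2 hosts can be recognised by their amazonaws.com reverse names, and it relies on expirevar expanding macros, which not every ModSecurity build does; if yours doesn’t, two separate branches with literal expiration values are needed.

# Default: cache the result of IP-bound robot tests for a week.
SecAction "phase:1,nolog,pass,setvar:tx.robot_ttl=604800"
# On EC2 the same address changes owner quickly: only cache for an hour.
SecRule REMOTE_HOST "@endsWith .amazonaws.com" \
    "phase:1,nolog,pass,setvar:tx.robot_ttl=3600"
# The cached flag then expires according to the transaction variable.
SecRule REQUEST_URI "@streq /robots.txt" \
    "phase:1,nolog,pass,setvar:ip.is_robot=1,expirevar:ip.is_robot=%{tx.robot_ttl}"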

Unfortunately, it seems like EC2 is bringing us back in time, to the days of “real-time block lists” that need to list individual IPs rather than whole netblocks. What’s next, am I going to see “under construction” signs on websites again?

Size matters for crawlers too!

No, I’m not talking about the Cuil search engine, whose crawler insisted that “size matters” — and was thus hammering a high number of websites, including some of mine.

Some time ago I wrote a list of good practices that crawler authors should at least try to follow. Not following them is, for me, a good reason to disallow access to my websites. Play by the rules and you can do almost anything with them; try to be a smartass, and if I find you out it’s my blacklist you’re going to hit. But sometimes I can only clean up after somebody has already wasted my bandwidth.

Yesterday, I noticed in my web log statistics a spike in the bandwidth used by my website that I couldn’t ascribe to the advertising I put up for my job hunting — a quick check showed that a new experimental crawler from the Italian CNR had decided to download the whole of my website, including the full Autotools Mythbuster, at once, without any delay between requests but, more importantly, without using compression!

Now, luckily my website is not excessively big, and downloading it in full without compression only accounts for about 30MB of data; but using compression is, generally speaking, a good idea: what about all the other websites they are crawling? Crawlers like this abuse the network all over the globe, and they should learn their lesson.

So I devised some tricks with ModSecurity — they might not be perfect, actually they almost surely are not, but the one I’m going to talk about should help quite a bit:

SecRule REQUEST_URI "@streq /robots.txt" \
    "phase:1,pass,setvar:ip.is_robot=1,expirevar:ip.previous_rbl_check=259200,skipAfter:END_ROBOT_CHECKS"
SecRule REQUEST_HEADERS:Accept-Encoding \
    "@pm deflate gzip" "phase:1,skipAfter:END_ROBOT_CHECKS"

SecRule IP:IS_ROBOT "@eq 1" \
    "phase:1,deny,status:406,msg:'Robot at %{REMOTE_ADDR} is not supporting compressed responses'"
SecMarker END_ROBOT_CHECKS
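
One assumption worth spelling out: the ip collection used above is not available by default, so a rule earlier in the configuration has to initialise it, typically with something along these lines:

SecAction "phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"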

Update: I changed the code above to allow fetching robots.txt even without Accept-Encoding; interestingly, it seemed to kill msnbot on the spot, which looks pretty… wrong. So either I made a mistake in my rules, or the Microsoft bot is, well, broken.

What does this do? Well, first of all I have to identify whether a request is coming from a bot at all; while there are a number of ways to do that, I settled on a half-decent heuristic: if you’re hitting my website’s robots.txt, I assume you are a robot, and I count all further requests coming from the same IP as robot requests; it might be a bit too greedy, but it should cover it. I also cache the flag for three days, since some crawlers have the bad habit of not requesting robots.txt before sending out a bunch of page requests.

Then I check whether either the deflate or gzip algorithm is listed in the Accept-Encoding header; if so, I let the request pass through and be answered (compressed). Otherwise, I check whether the client was flagged as a robot, and if so I reply with a “Not Acceptable” answer — this is intended to signify that the data is there, but it’s not available to that crawler because it does not support compressed responses.

I’ve now set this up as an experiment on my own website; I’ll report back if there are any serious false positives, though I don’t think there will be many, if any at all.

So You Think You Can Crawl

I have written about crawlers before, ranting about the bad behaviour of some of them, especially those of “marketing sites” that try to find out information about your site to resell to other companies (usually on the premise that they’ll find out who’s talking about their products).

Now, I don’t have a website with so many users that it can be taken down by crawlers, but I just don’t like wasting time, disk space and bandwidth on software that does neither me nor anybody else any good.

I don’t usually have trouble with newly-created crawlers and search engines, but I do have problems when their crawlers hit my websites without following at least some basic rules of decency:

  • give me a way to find out who the heck you are, like a link to a description of the bot — *in English, please*; like it or not, it is the international language;
  • let me know what your crawler is showing; give me a sample search or something; even if your business is reselling services, it shouldn’t be impaired by letting everybody run the same search;
  • if your description explicitly states that robots.txt is supported, make sure you’re actually fetching it; I had one crawler trying to fetch each and every article of my blog the other day, without ever having fetched robots.txt, while the crawler’s website stated that it supported it;
  • support deflate compression! XHTML is an inherently redundant language, and deflate compression works miracles on it; even better on pages that contain RDF information (as you’re likely to repeat the content of other tags in a semantic context); the crawler above claimed to be dedicated to fetching RDF information and yet didn’t support deflate;
  • don’t be a sore loser: if you state (again, that’s the case for the crawler above) that you always wait at least two seconds between requests, don’t start fetching without any delay at all when I start rejecting your requests with 403;
  • provide a support contact, for I might be interested in allowing your crawler, but want it to behave first;
  • support proper caching; too many feed fetchers seem to ignore the ETag and If-Modified-Since headers, which gets pretty nasty especially if you have a full-content feed; even worse, if you support neither these nor deflate, your software is likely getting blacklisted (see the sketch after this list);
  • make yourself verifiable via Forward-confirmed reverse DNS (FCrDNS); as I said in another post of mine, most search engine crawlers already follow this idea, it’s easy to implement with ModSecurity, and it’s something that even Google suggests webmasters do; now, a few people misunderstand this as a security protection for the websites, but that couldn’t be further from the real reason: by making your crawler verifiable, you won’t risk being hit by sanctions aimed at crawlers trying to pass for you; this is of course almost impossible if your crawler runs on EC2.
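
On the caching point, here is a hypothetical sketch of how ModSecurity could enforce it the same way the compression point can be enforced; nothing like this is in my published ruleset, and the feed path, threshold and variable names are made up for the example (the ip collection also needs to be initialised elsewhere with initcol):

# Count feed fetches that carry neither If-None-Match nor If-Modified-Since.
SecRule REQUEST_URI "@streq /articles.atom" "chain,phase:1,nolog,pass"
SecRule &REQUEST_HEADERS:If-None-Match "@eq 0" "chain"
SecRule &REQUEST_HEADERS:If-Modified-Since "@eq 0" \
    "setvar:ip.unconditional_fetches=+1,expirevar:ip.unconditional_fetches=86400"

# After a dozen unconditional fetches in a day, stop serving the feed to that IP.
SecRule REQUEST_URI "@streq /articles.atom" \
    "chain,phase:1,deny,status:403,msg:'Feed fetcher ignores HTTP caching'"
SecRule IP:UNCONDITIONAL_FETCHES "@gt 12"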

Maybe rather than SEOs – by the way, is it just me who dislikes this term and finds that most people describing themselves that way are just trying to put themselves on the same level as CEOs? – we should have Crawler Experts running around to fix all the crappy crawlers that people write for their own startups.

Do I hate bots too much?

Since I’ve been playing around some more with ModSecurity after reviewing the book, I’ve decided to implement one thing whose idea I had been toying with for some time but never got around to implementing. But let’s start with some background.

I’ve had quite a few pet peeves with crawlers and generic bots. The main problem I have is with their sheer number. Once upon a time you would have a fairly limited number of bots floating around, but nowadays you get quite a few of them at once; some are the usual search engines, others are more “amateurish” things, and “of course” there are the usual spam crawlers, but those that really upset me are the marketing-focused crawlers. I’ll split my post to focus on each of those types, in reverse order.

Marketing crawlers are those deployed by companies that sell services like analysing blog posts to find bad publicity and the like. I definitely hate these crawlers: they keep downloading my posts for their fancy analysis when they could use some search engine’s data already. Since most of them also seem to focus only on profit, instead of developing their technology first, they tend to ignore the robots exclusion protocol and the HTTP/1.1 features that avoid wasting extra bandwidth, and they also ignore having some delay between requests.

I usually want to kill these bots outright; some don’t look for, or even respect, the robots.txt file, and a huge robots.txt file would be impractical anyway. So for that, I use a list of known bot names and a simple rule in ModSecurity that denies them access. Again, thanks to Magnus’s book, I’ve been able to make the rule much faster by using the pmFromFile matcher instead of the previous regular expression.
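
The resulting rule is conceptually very simple; a sketch of it, assuming the blacklist of bot-name fragments lives in a text file next to the rules (the file name here is made up, and @pm matching is case-insensitive, so the list can be kept lowercase):

SecRule REQUEST_HEADERS:User-Agent "@pmFromFile bad_crawlers.txt" \
    "phase:1,deny,status:403,msg:'Blacklisted marketing crawler'"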

The second category is the spam crawlers, something we’re unfortunately quite used to seeing nowadays. In this category you actually have more than one type of target: those who crawl your site to find email addresses (which is what Project Honey Pot tries to look out for), those who send requests to your site to spam the referrer counter (to gain extra credit if your referrer statistics are public – one more reason why my statistics are secured by obscurity and by a shallow password), and those who make use of your feeds to get content to post on their site, without a link but with lots of advertising.

These are nasty, but are more difficult to kill, and I’ll get to that later.

The third category is that of the amateurish crawlers: new technologies being developed, “experimental” search engines and the like. I understand that it’s quite difficult for the developers to have something to work with if we all were to block them. But on the other hand, they really should start by respecting protocols and conventions, as well as by describing their work and where the heck they are trying to get with it.

One funny thought here: if there were a start-up that wanted to develop new crawler technology, heavily distributing rules and documentation to filter their requests out would probably be a quite evil way to kill the company off. To give a suggestion to those who might find themselves in that situation: try getting a number of affiliates who will let you crawl their sites. To do that you need to either show a lot of good intent, or bribe them. It’s up to you which you choose, but lacking both, it’s likely going to be hard to get your stuff together.

The last category is the search engine crawlers: Googlebot, msnbot, Yahoo! Slurp. The whole bunch is usually just disabled through robots.txt and there is nothing to say about them in general. The whole point of talking about them here is that, well, it happens that all the crawlers in the previous categories sometimes try to pass themselves off as one of the more famous crawlers to be let in. For this reason, all of them suggest that you check their identity through double resolution of the IP address: take the IP address of the request, reverse-resolve it to a hostname (checking that it falls within the right domain, for instance for Googlebot it’s simply .googlebot.com), and then resolve that hostname to ensure it still points to the same address.

The double resolution is needed because whoever runs a fake bot may control the reverse resolution of its addresses and point it into the correct domain; the forward lookup confirms whether that is legitimate. Luckily, Apache already has code to handle this properly, used for checking host-based authorizations: you just need to set HostnameLookups to Double. Once that’s enabled, the REMOTE_HOST variable becomes available to ModSecurity. The result is the following snippet of Apache configuration:

HostnameLookups Double

SecRule REQUEST_HEADERS:User-Agent "@contains googlebot" \
    "chain,deny,status:403,msg:'Fake Googlebot crawler.'"
SecRule REMOTE_HOST "!@endsWith .googlebot.com"

SecRule REQUEST_HEADERS:User-Agent "@contains feedfetcher-google" \
    "chain,deny,status:403,msg:'Fake Google feed fetcher.'"
SecRule REMOTE_HOST "!@endsWith .google.com"

SecRule REQUEST_HEADERS:User-Agent "@contains msnbot" \
    "chain,deny,status:403,msg:'Fake msnbot crawler.'"
SecRule REMOTE_HOST "!msnbot-[0-9]+-[0-9]+-[0-9]+.search.msn.com"

SecRule REQUEST_HEADERS:User-Agent "@contains yahoo! slurp" \
    "chain,deny,status:403,msg:'Fake Yahoo! Slurp crawler.'"
SecRule REMOTE_HOST "!@endsWith .crawl.yahoo.net"

At that point, any request passing as one of the three main bots is verified to actually come from them. You might notice that a more complex regular expression is used to validate the Microsoft bot. The reason is that both Google and Yahoo!, to be safe, give their crawling hosts their own (sub)domain, while Microsoft and (at a quick check, as I haven’t implemented the test for it, since it doesn’t get as many hits as the rest) Ask Jeeves don’t have special domains (the regexp for Ask Jeeves would be crawler[0-9]+.ask.com). And of course changing that is going to be tricky for them, because many people are already validating them this way. So, learn from their mistakes.
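
For completeness, the equivalent check for Ask Jeeves would presumably look like the sketch below; I haven’t deployed or tested it, and the User-Agent substring is an assumption on my part:

SecRule REQUEST_HEADERS:User-Agent "@contains ask jeeves" \
    "chain,deny,status:403,msg:'Fake Ask Jeeves crawler.'"
SecRule REMOTE_HOST "!@rx crawler[0-9]+\.ask\.com$"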

Hopefully, the extra rules I’m loading ModSecurity with are actually saving me bandwidth rather than wasting it; given that some fake bots seem to make hundreds of requests a day, that’s probably very likely. Also, thankfully, I have nscd running (so that Apache does not have to send all the requests to the DNS server), and the DNS server is within the local network (so the bandwidth used to contact it is not as precious as the one used to send data out).

My next step is probably going to be optimisation of the rules, although I’m not sure how to proceed with that; I’ll get to it when I push this to an actual repository for a real project, though.

Me and crawlbots

From time to time I look at the access statistics for my blog, not for anything in particular, but because they give me a good idea of whether something in my setup is working or not. Today I noticed an excessive amount of traffic on this blog yesterday, and I’m not really sure what that’s about.

Since the last time something like this happened it was the DDoS that brought Midas down, I checked the logs for something suspicious, but honestly I can’t find anything unusual in them, so I guess the traffic was actually related to the previous post, which got more readers than usual.

But reading the logs, especially checking for bots, is always pretty interesting, because it really shows that some people don’t seem to understand what the net is about. For instance, my Apache server is set up to refuse requests from clients that send no User-Agent header at all; the protocol says you should at least state who you are. I don’t usually kill read requests from browsers even when the agent strings are notoriously fake, since they usually don’t bother me, but this particular rule is also helpful to let people know that they should do their damn homework if they want to be good net citizens.

This, for instance, stops Ohloh Open Hub from accessing my feeds, with the tag feeds configured in the various projects; I already told Jason about it, but they don’t seem to care; their loss. Similar services with similar bugs are really not that important to me; I would have preferred it if Ohloh fixed the problem by providing their own User-Agent signature but, alas, it seems it’s too late for that.

But that is the most stupid case, I have to say, because the rest are much sneakier: there are tons of robots from smaller search engines that don’t seem to be very useful at all, tools declared to be “linguistic” that download stuff at random, and most of all “marketing research bots”. Now, as I said in a past post over at Axant, I don’t like bots that are of no use to me wasting my bandwidth, so yeah, I keep a log of the bots to be killed.

Now, while almost all the “webmaster pages” for the bots (when they are listed at all, obviously) report that their bot abides by the robots exclusion protocol (an overcomplicated name for robots.txt), there are quite a few of them that never request it. And some that, even when they are explicitly forbidden from accessing something, still do (they probably hit robots.txt just so they can’t be found guilty, I guess). For this reason, my actual blacklist is not in the (multiple) robots.txt files (which I still use to keep good robots away from pages they shouldn’t hit) but rather in a single mod_security rules file, which I plan on releasing together with the antispam ones.

In addition to specific crawler instances, I also started blocking the default user agents of most HTTP access libraries for various languages (those that don’t specify who on Earth they are — as soon as another word besides the library name and version is present, access is allowed; if you write software that accesses HTTP, you should add your own identifier to the library’s user agent string, if you don’t replace it entirely!), and generic search engine software like Apache Nutch, which is probably meant to be run on your own site, not on others’. I really don’t get the point of all this; just bothering people because they can?
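
A sketch of the kind of rule I mean; the list of library names is just an example of common offenders, not the actual blacklist I use:

SecRule REQUEST_HEADERS:User-Agent \
    "@rx ^(?:java|python-urllib|libwww-perl|lwp-trivial|mechanize|ruby)(?:[/ ][0-9][0-9._]*)?$" \
    "phase:1,t:lowercase,deny,status:403,msg:'Bare HTTP library User-Agent'"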

It’s fun because you can often spot the actual robots by the fact that they don’t follow redirections or check error statuses. Which is kind of funny, because my site redirects you right away when you enter (and both it and my blog have lots of redirections for moved pages; I don’t like breaking links).

Besides, one note on language-specific search engines: I do get the need for those, but it’d be nice if you didn’t start scanning pages in other languages, don’t you think? And if you’re going to be generalist, please translate your robot’s description into English. I have not banned a single one of that kind of search engine yet, but some could really at least have an English summary!

Oh well, more work for mod_security I suppose.