Antibiotics for the Internet, or why I’m blocking Semalt crawlers

As I noted earlier, I’ve been doing some more housecleaning of bad HTTP crawlers and feed readers. While it matters very little for me and my blog (I don’t pay for bandwidth), I find it’s a good exercise and, since I do publish my ModSecurity rules, it’s a public service for many.

For those who think I may be losing real readership with this: the number of visits on my site as seen by Analytics actually increased (thanks to me sharing the links to that post on Twitter and G+, as well as in the GitHub issues and in the complaint email I sent to the FeedMyInbox guys), yet the daily traffic was cut in half. I think this is what’s called a win-win.

But one thing that became clear from both AWStats and Analytics is that there was one more crawler I had not stopped yet. Its name is Semalt, and I’m not doing them the favour of linking to their website. Those of you who follow me on Twitter have probably seen what they categorized as “free PR” for them, while I was ranting about them. I first called them a cancer of the Internet, then realized that the more accurate categorization would be bacteria.

If you look around, you’ll find unflattering reviews and multiple instructions to remove them from your website.

Funnily enough, once I tweeted about my commit, one of their people (who I assume is in their PR department rather than engineering, given the blatant stupidity of their answers) told me that it’s “easy” to opt out of their scanner: you just have to go to their website and tell them your websites! Sure, sounds like a plan, right?

But why on earth am I spending my time attacking one particular company that, to be honest, is not wasting that much of my bandwidth to begin with? Well, as you can imagine from me comparing them to shigella bacteria, I do have a problem with their business idea. And given that on Twitter they completely missed my point (when I pointed out the three spammy techniques they use, their answer was “people don’t complain about Google or Bing” — well, yes, because neither of the two uses any of these spammy techniques!), it’s difficult for me to consider them merely mistaken. They are doing this on purpose.

Let’s start with the technicalities, although that’s not why I noticed them to begin with. The first problem, as I said, is that the only way to “opt out” of their service is to go to their website and fill in a form. They completely ignore robots.txt; they don’t even fetch it. For an automated crawler, that’s bad enough on its own.

The second is that they don’t advertise themselves in the User-Agent header. Instead, all their fetches report Chrome/35 — and given that they could pass through my ruleset, they probably use a real browser driven by something like WebDriver. So you have no reliable way to identify their requests among all the others, which is not how a good crawler should operate.

The third and most important point is the reason why I consider them plain spammers, and so do others, judging from the links I posted earlier. Instead of using the User-Agent field to advertise themselves, they subvert the Referer header. This means that all their requests, even those that have been 301’d and 302’d around, report their website as the referrer. And if you know how AWStats works, you know that it doesn’t take many crawls for them to become one of the “top referrers” of your website, and thus appear prominently in your stats, whether those are public or not.
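
To give an idea of what the filter looks like, here is a minimal sketch of such a rule in ModSecurity syntax; it is not the exact rule from my published ruleset, and the rule ID is just a placeholder:

# Reject any request whose Referer header points at Semalt.
# Placeholder ID; pick one from your own reserved range.
SecRule REQUEST_HEADERS:Referer "@contains semalt.com" \
    "id:9100001,phase:1,t:lowercase,deny,status:403,log,msg:'Semalt referrer spam'"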

At this point it would be easy to say that they are clueless and not doing this on purpose, but then there is the other important part. Their crawler executes JavaScript, which means that it gets tracked by Google Analytics, too! Analytics has no access to the server logs, so for their referrer to show up there (as reported by the people looking to filter it out), they have to go out of their way to execute the tracking code. Again, this could easily be a mistake, given that they are using something like WebDriver, right?

The problem is that whatever they use does not fetch images or CSS. But it does fetch the Analytics JavaScript and execute it, as I said. And the only reason I can think of for them to do so is to spam the referrer list in there as well.

Since their Twitter person thanked me for the “free PR”, I wanted to expand on it further, in the hope that people will get to know them, and avoid them. As I said, my ModSecurity ruleset is already set up to filter them out; other solutions, for those who don’t want to use ModSecurity, are linked above.

WebP and the effect on my antispam

When I was publishing my notes about WebP, I found out at posting time that I could not post on my blog any more. The reason was to be found in my own ModSecurity rules: quite a long time ago I had added to the antispam rules one that blocks POST requests if they include image/webp in the Accept header.

Unfortunately, for whatever reason, instead of adding image/webp only to image requests, the Chrome developers added it to every single request the browser makes, including the POSTs submitted from a form… It doesn’t entirely sound correct, to be honest, but there probably was a reason for it.

So I dropped the WebP check from my rules. Today I checked my comments and found four pieces of spam. It turns out that particular check was very effective, and it’s going to be a pain to let it go. On the other hand, the spam declares that it accepts image/x-bitmap while claiming to come from Firefox, two conditions that I expect are never met together by real-life browsers, so I can probably look into adding a rule for that.
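
A minimal sketch of that idea, assuming the fingerprint stays as described; the ID is a placeholder and the eventual rule in my ruleset may end up looking different:

# Comment spam fingerprint: a POST that claims to come from Firefox
# while accepting image/x-bitmap, a combination I never see from real browsers.
SecRule REQUEST_METHOD "@streq POST" \
    "id:9100002,phase:2,deny,status:403,log,msg:'Fake Firefox accepting image/x-bitmap',chain"
    SecRule REQUEST_HEADERS:Accept "@contains image/x-bitmap" "chain"
    SecRule REQUEST_HEADERS:User-Agent "@contains firefox" "t:lowercase"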

Another interesting rule I added recently, and haven’t discussed yet, relates to the fact that this blog is now only available over HTTPS. Most of the spam comments I receive are posted directly over HTTPS, but they report as referrer the original post’s URL over plain HTTP. Filter these out, and most of my spam is gone.
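
The idea, roughly, in ModSecurity terms; this is simplified compared to the published rule, the ID is a placeholder, and the hostname is obviously mine:

# The blog is only reachable over HTTPS, so a comment POSTed with a
# plain-HTTP referrer pointing back at it cannot come from a real reader.
SecRule REQUEST_METHOD "@streq POST" \
    "id:9100003,phase:2,deny,status:403,log,msg:'Plain-HTTP referrer on HTTPS-only site',chain"
    SecRule REQUEST_HEADERS:Referer "@beginsWith http://blog.flameeyes.eu/"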

Long live ModSecurity — the problem will come when HTTP/2 is out, as it’s a binary protocol and leaves much less room for request fingerprinting.

User-Agent strings and entropy

It was 2008 when I first got the idea to filter User-Agents as an antispam measure. It worked for quite a while on its own, but recently I had to add more sophisticated fingerprinting to my ruleset to catch spammers. It still works better than a captcha, but it has gotten a bit less effective.

One of the reasons why the User-Agent alone is no longer enough is that my filtering has been hindered by a more important project. EFF’s Panopticlick has shown that the uniqueness of User-Agent strings is an easy way to track a specific user across requests. This became important enough that Mozilla standardized their User-Agent strings starting with Firefox 4, to reduce their size and thus their entropy. Among other things, the “trail” component has been fixed to 20100101 on the desktop, and to the same version as Firefox itself on mobile.

_Unfortunately, Mozilla’s documentation lies on that page. Not only is the trail not fixed for Firefox Aurora (i.e. the alpha version), which meant that my first set of rules refused access to all users of that version, but their own Lightning extension for SeaMonkey also appends itself to the User-Agent, even though they say this isn’t supported anymore._

A number of spambots seem to get this wrong, by the way. My guess is that they have code that generates the User-Agent by concatenating a bunch of fragments and randomizing it, so you can’t just block one particular agent. Damn smart, if you ask me, and unfortunately so: ModSecurity keys its IP collection on the remote address plus the User-Agent, so if they cycle through different user agents, it’s harder for ModSecurity to recognize that it’s actually the same IP address.
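
For reference, this is roughly how such a collection is keyed, a pattern similar to what the OWASP Core Rule Set uses; placeholder IDs, and it requires SecDataDir to be configured for persistent collections:

# Hash the User-Agent and key the persistent "ip" collection on the
# client address plus that hash; a bot cycling User-Agents is then
# spread across several entries instead of accumulating a single score.
SecRule REQUEST_HEADERS:User-Agent "^.*$" \
    "id:9100004,phase:1,pass,nolog,t:none,t:sha1,t:hexEncode,setvar:tx.ua_hash=%{matched_var}"
SecAction "id:9100005,phase:1,pass,nolog,initcol:ip=%{remote_addr}_%{tx.ua_hash}"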

I do have some reservations about Mozilla’s handling of extension identification. First they say that extensions and plugins should not edit the agent string anymore – but Lightning does! – then they suggest that extensions can instead send an extra header to identify themselves. But that just means fingerprinting systems only need to start counting those headers on top of the generic ones that Panopticlick already considers.

On the other hand, other browsers don’t seem to have gotten the memo yet — indeed, both Safari’s and Chrome’s strings are long and include a bunch of almost-independent version numbers (AppleWebKit, Chrome, Safari, plus Mobile on the iOS versions). It gets worse on Android, as both the standard browser and Chrome provide a full build identifier, which differs not only from one device to the next, but also from one firmware version to the next. Given that each mobile provider ships its own builds, I would be very surprised if I could find two of my friends with the same identifier in their browsers. Firefox is a bit better on that front, but it sucks in other ways, so I’m not using it as my main browser there anymore.

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s decision to shut down the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing it at this point. On the other hand, I can talk once again about my ModSecurity ruleset, and in particular about the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people, it seems) can’t try it out yet because their service – I’m not going to call them their servers, as it seems they at least partially use AWS for hosting – fails to scale.

I’m pretty sure the load they are receiving right now is exceptional, as everybody and their droid is trying to register to the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database – subscriptions to my blog’s feed went from 5 to 23 in a matter of hours! Still, there are a few things I can infer from the way the service behaves that make me think somebody overlooked the need for a solid HTTP implementation.

First of all, what happened was that I got a report on Twitter that NewsBlur was getting a 403 when fetching my blog, which was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agents, which suggests they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

The third is the most conspicuous string: it’s very minimal and does not follow the usual format, using a dash as separator instead of putting the URL in parentheses next to the fetcher name (and version; more on that later).

The other two strings show that they have been taken from the string reported by Safari on OS X — but, interestingly enough, from two different Safari versions, and one of the two has also been partially stripped. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even fetching the Safari-targeted HTML might be a bad idea if it is then displayed to the user in a different browser.

The code that fetches feeds and the code that fetches pages are likely quite different, as can be seen from the full requests. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is fairly sophisticated fetching code: not only does it properly support compressed responses (Accept-Encoding header), it also uses the If-None-Match and If-Modified-Since headers to avoid re-fetching unmodified content. The fact that the date points to November 1st of two years ago is likely because my ModSecurity ruleset has refused to talk to this fetcher ever since, because of the fake User-Agent string. It also includes a proper Accept header that lists the feed types they prefer over generic XML and other formats.

The A-Im header is not fake, nor a bug; it’s actually part of RFC 3229, Delta encoding in HTTP, and stands for Accept-Instance-Manipulation. I had never seen it before, but a quick search turned it up, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” manipulator; it seems to be used in the wild anyway, and I couldn’t find proper formal documentation of how it should work. The theory, from what I can tell, is that the blog engine can use the If-Modified-Since header to produce on the spot a custom feed for the fetcher that only includes the entries modified since that date. Cool idea, too bad it lacks a standard, as I said.

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

We can tell two things from the comparison: this code is older (an earlier version of Safari is referenced), and it hasn’t received the same care as the feed fetcher (which at least dropped the Safari identifier itself). It’s more than likely that, if libraries are used to send the requests, a completely different one is used here, as this request declares support for the compress encoding, which the feed fetcher does not (and which, as far as I can tell, is never actually used). It’s also much less choosy about the formats it receives, as it accepts whatever you want to give it.

*For the Italian readers: yes, I picked the word choosy intentionally. While I may find Fornero an idiot as much as the next guy, I grew tired of the copy-pasted Facebook statuses and comments saying she should have said picky. Know your English, instead of complaining about idiocies.*

The lack of If-Modified-Since here does not really mean much, since it’s also possible that they were never able to fetch the page before, or that they introduced conditional fetching later (even though the code is likely older). But the Content-Length header on a GET request sticks out like a sore thumb, and I would expect it to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the most naïve of the three, and possibly the code that needs the most cleanup:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

Here we start with near-protocol-violations: no Accept header is provided — especially facepalm-worthy considering that this is exactly where a static list of MIME types would be most useful, to restrict the image formats that will be handled properly! But what trips my rules is that the Accept-Encoding there is not suitable for a bot at all: since the fetcher declares no support for compressed responses, my rules now respond with a 406 Not Acceptable status code instead of providing the icon.

I can understand that compressing an icon is more than likely not useful (indeed, most images should not be further compressed at all when sent over HTTP), but why explicitly refuse it? Especially when the other two fetchers properly support fairly sophisticated HTTP?
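
The gist of the check, much simplified compared to the actual ruleset; the ID is a placeholder and the User-Agent match here is only an example of how a fetcher might be singled out:

# An automated fetcher that does not accept any compressed encoding
# gets a 406 instead of the content.
SecRule REQUEST_HEADERS:User-Agent "@contains fetcher" \
    "id:9100006,phase:1,t:lowercase,deny,status:406,log,msg:'Fetcher refuses compressed responses',chain"
    SecRule REQUEST_HEADERS:Accept-Encoding "!@rx (?:gzip|deflate)" "t:lowercase"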

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, with varying levels of care. It might not be the best time for them to look at their HTTP implementation right now, but I would still suggest it. A single pipelined connection for the three requests they need – instead of using Connection: close on each – could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And using the same HTTP code for all three fetchers would make it easier for people like me to handle NewsBlur properly.

I would also like a way to validate that a given request really comes from NewsBlur — like we do with GoogleBot and other crawlers. Unfortunately this is not really possible, because they use multiple servers, both on standard hosting and on AWS, over IPv4 and (possibly, at some point) IPv6, so using FcRDNS is not an option.

Oh well, let’s see how this thing pans out.

Passive web log analysis. Replacing AWStats?

You probably don’t know this, but I analyse my blog’s Apache logs over and over again. This is especially useful at the start of the month, to identify referrer spam and similar issues, which in turn allows me to update my ModSecurity ruleset so that more spammers are caught and dealt with.

To do so I’ve been using AWStats for years at this point; it’s a Perl log analyzer, report generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It doesn’t recognize referring search engines as well as it used to (filtering out requests coming from Google still works, but newer search engines are not recognized), and most of the new “social bookmark” websites are not there at all — yes, it’s possible to keep adding them, but with upstream not moving, this is getting harder and harder.

Even more important for my ruleset work is the lack of identification of modern browsers. Things like Android versions and other fringe OSes would be extremely useful to me, but adding support for all of them is a pain, and I have enough things on my plate that it’s not something I’m looking forward to tackling myself. It’s even more bothersome when you consider that there is no way to re-process the already analyzed data if a new URL is later identified as a search engine, or a user agent as a bot.

One of the most obvious choices for this kind of work would be Google Analytics — unfortunately, it only works if it’s not blocked on the user side, and that excludes NoScript users and, of course, most of the spammers. So this is not a job for Analytics: it’s something that has to be done on the backend, on the logs side.

The obvious next step is to find something capable of importing the data out of my current AWStats datafiles, and of continuing to import data from the Apache log files. Ideally it would save the data in a PostgreSQL database (which is what I usually use); native support for per-vhost data, with the ability to collapse it into a single view, would also be nice.

If somebody knows of such a piece of software, I’d love to give it a try — something written in Ruby or Perl would be best for me (because I can hack on those), but I wouldn’t say no to Python or even Java (the latter if somebody helped me make sure the dependencies are all packaged properly). This will bring you better modsec rules, I’m sure!

ModSecurity and my ruleset, a release

After the recent Typo update I had some trouble with Akismet not properly marking comments as spam (at least the very few spam comments that could get past my ModSecurity ruleset), so a couple of days ago I set off to find out why.

Well, to be honest, I didn’t really want to focus on the why at first. The first thing I found out while looking at the way Typo uses Akismet is that it still bundled a hacked, ancient akismet library. Given that my API key was valid, I jumped to the conclusion, right or wrong, that the code was simply using an ancient API that had been retired, and decided to look around for a newer Akismet library; lo and behold, a 1.0.0 gem had been released not many months before.

After fiddling with it a bit, the new Akismet library worked like a charm, and spam comments getting past ModSecurity were again marked as such. A pull request and its comments later, I had a perfectly working Typo which marks comments as spam as well as before, with one less library bundled into it (and I also got the gem into Portage, so there’s no problem there).

But this left me with the problem that some spam comments were still getting through my filters! Why did that happen? Well, if you remember, my idea was to validate the User-Agent header content… and it turns out that the latest Firefox versions have such a short header that almost every spammer seems to be able to copy it just fine, so they weren’t killed off as intended. So, more digging into the requests.

Some work later, I was able to come up with two rules with which to validate Firefox and a bunch of other browsers: the first checks for the Connection: keep-alive header that Firefox always sends (I tried almost every possible combination), and the other checks the Content-Type of the POST request for a charset being declared: real browsers declare it, but whatever the spammers are using nowadays doesn’t.
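
In ModSecurity terms the two checks boil down to something like this; a simplified sketch with placeholder IDs, not the literal rules I published:

# Real Firefox always sends "Connection: keep-alive" on comment POSTs.
SecRule REQUEST_METHOD "@streq POST" \
    "id:9100010,phase:2,deny,status:403,log,msg:'Fake Firefox without keep-alive',chain"
    SecRule REQUEST_HEADERS:User-Agent "@contains firefox" "t:lowercase,chain"
    SecRule REQUEST_HEADERS:Connection "!@streq keep-alive" "t:lowercase"

# Real browsers declare a charset in the Content-Type of a form POST.
SecRule REQUEST_METHOD "@streq POST" \
    "id:9100011,phase:2,deny,status:403,log,msg:'POST without charset in Content-Type',chain"
    SecRule REQUEST_HEADERS:Content-Type "!@contains charset="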

Of course, the problem is that once I describe and publish the rules, spammers will just improve their tools so they don’t make these mistakes, but in the meantime I’ll have a calm, spamless blog. I still won’t give in to captchas!

At any rate, besides adding these validations, thanks to another round of testing I was able to fix things for Opera Turbo users (now they can comment just fine), and that led me to tag the ruleset and… release it! You can now download it from GitHub or, if you use Gentoo, just install it as www-apache/modsec-flameeyes — there’s also a live ebuild for the bravest.

What I’d like from my blog

My blog is, at this point, a vital part of my routine. I use it to write about my personal projects, about the non-restricted parts of my jobs, and about the work that goes into Gentoo Linux and the other projects I follow.

I have accumulated over 2100 posts over time, especially thanks to the recent import of my original blog hosted on Gentoo infrastructure. I don’t really know if that’s a lot, but sometimes Typo seems to struggle with it. Unfortunately I’m also running an older version of Typo, because I haven’t switched that virtual server to Ruby 1.9 yet, as one of my customers is running a version of Radiant that won’t work with it.

Said customer also bitched hard and screamed not to keep the site on my server, but as it happens the new webmasters who are supposed to take over the website (and who were supposed to be cheaper and faster than me) have been working on it since June and have still delivered nothing. Hopefully they’ll be done soon and I can kick said customer off the server.

Anyway, at this point there are a few things I’d like to get out of my blogging platform in the future, which might require me to fork Typo and create my own, likely stripped-down, version. There are many features in there that I really don’t care about, like the short URLs; those I might just export, as I think I used them at some point, but I would then handle them through mod_rewrite rather than on the Rails side.

So let’s see what I don’t like about the current Typo I’m using:

  • The database access is more than a bit messed up; it probably has to do with the fact that upstream only cares about MySQL, while I want to run it on PostgreSQL, and this causes more than a couple of problems — have you noticed that sometimes my posts end up password-protected? What happens is that the per-post settings are serialized to and de-serialized from YAML, but sometimes something goes wrong, the YAML becomes invalid, and the password protection kicks in. I know there is an ActiveRecord extension that allows key-value pairs to be stored in PostgreSQL-specific column types instead of having to (de)serialize them all the time, but again, this wouldn’t be something upstream would use.
  • Alternatively, I’ve been toying with the idea of using MongoDB as a backend. Even with the issues I have pointed out before, I think it might work well for a blog, especially since the comments would then be tied to the post itself, rather than living in the currently connected tables.
  • There is a problem with tag handling, again something upstream doesn’t seem to care about – at some point I remember reading that they were mostly interested in making every single word in a post a tag, to cross-connect posts sharing the same word; it’s one of the reasons why I’m not sure I want to update. If I change the title of one of the tags to make it more descriptive and then edit a post carrying that tag, it creates one more tag for each word in that title, instead of preserving the older tag. I really should clean up the tags I have right now.
  • I would also like the “new post” page to create the post right away and then drop me into editing it — this is important to me because sometimes, if I have to restart Chromium or suspend the laptop, something goes very wrong and multiple drafts of the same post get created. And cleaning them up is a long task.
  • A better implementation of new-post notifications, and integration with Flattr, would also be very good. While IFTTT makes it easy to post new entries to Twitter and LinkedIn, its lack of Flattr integration is a major pain, and the fact that right now, to use auto-submit, I have to duplicate part of the content in the HTML of the pages is also a problem. Being able to create a “Flattr thing” the moment I actually post something would be a major plus for me.
  • Since I’m actually quite paranoid, another thing I would like to have is two-factor authentication with Google Authenticator on a cellphone, or rather, in addition to it, certificate-based authentication for the admin interface. Having a safe way to make sure that I’m the only one logging in would let me remove some of the admin-interface rules in ModSecurity, which would in turn let me write posts from public WiFi networks, sidestepping the problem I posted about the other day.
  • Scheduled posting. This used to be supported, but it has been completely broken for years at this point; it was very useful to me a long time ago, since I would just write a bunch of posts and schedule them to be published one per day. I suppose it would now have to be changed so that scheduled posts are only actually published when a process runs and finds that their time has “elapsed”… but again, this is something I’d like to have, and you readers would probably enjoy it too, as it would likely make for more and better content overall.

I definitely do not want to go with WordPress; I just wish I had the time to write my own Typo fork and make it more usable for what I do, rather than hoping that upstream Typo development does not go in a direction I don’t like at all… Maybe somebody else has the same requirements and would like to join me in this project; if so, send me an email… maybe it’ll finally be the time I decide to start on the fork itself.

ModSecurity news, rules, and future

So the day started with me looking into getting a new version of ModSecurity into shape for a new stable ebuild in Gentoo, for bug #438724 (a security issue in ModSecurity 2.6.8 and earlier). Unfortunately this also meant that I had to get a new CRS in, and that required more testing than I was expecting.

The problem is that the ModSecurity 2.7 release is stricter about what it accepts in rules. In particular, rules are now required to have a unique ID, and that ID has to be purely numeric. That also means that if you publish your ruleset, like I do, you have to register a reserved ID range with the ModSecurity developers. I did, and I have my proper range. I had already developed a tool some time ago to validate my rules’ compliance with the new policy, but it turned out to require some tweaking anyway, as a few conditions weren’t reported properly.
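
To make the change concrete, here is what the new requirement looks like; the rule itself is just an illustrative example and the ID is a placeholder, not taken from my assigned range:

# Accepted by 2.6, rejected by 2.7 at startup (no id action):
#   SecRule REQUEST_HEADERS:User-Agent "@contains badbot" "phase:1,deny,status:403"
# ModSecurity 2.7 requires a unique, purely numeric id on every rule:
SecRule REQUEST_HEADERS:User-Agent "@contains badbot" \
    "id:9100020,phase:1,deny,status:403,log,msg:'Example rule with the now-mandatory numeric id'"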

Unfortunately the Core Rule Set (which is actually developed as a separate project by Ryan Barnett, whereas ModSecurity itself is maintained by Breno Silva) was not ready for this yet. Oh yes, the base rules, which are the only ones usually enabled on Gentoo, are fine; but the optional, experimental and newly introduced SpiderLabs Research rules are not. Some rules lack an ID, some IDs are duplicated, and some rules fall well outside the ranges designated for them.

I already pointed the guys at SpiderLabs/TrustWave to my script — hopefully we’ll soon get a 2.2.7 release that covers those issues. Until then we’ll have to make do with what we have. My rules have all been fixed to work properly with the new ModSecurity, though; this blog is using them already.

On a different note, I’ve been considering making my validation of browsers’ user agents stronger than before, as spammers and exploit tools are becoming more advanced and more capable. In particular, I’ve found Mozilla’s documentation as well as Microsoft’s, which includes a description for IE8 and one for IE9 (I haven’t looked up the one for IE10 yet, but I’m sure they have it). This should be enough to validate that there aren’t extraneous add-ons installed that could be the signature of a spambot.

In particular, it seems like many of the posters in the recent wave of spam I’ve been hit with, which otherwise looks exactly like a standard browser, report coming from Firefox with WebMoney Advisor installed. It turns out that WebMoney is one of the many anonymous electronic currencies so often used by spammers, carders, and the rest of the low-life scum that causes us so much grief as email users and bloggers. I wouldn’t be surprised if these were actually mechanical turks, paid through that service, who post spam by hand to bypass the various filters.
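
Blocking that particular signature is trivial; a sketch along these lines, with a placeholder ID, and the actual rule I end up shipping may be tuned differently:

# No legitimate commenter here browses with the WebMoney Advisor toolbar.
SecRule REQUEST_HEADERS:User-Agent "@contains WebMoney Advisor" \
    "id:9100021,phase:1,deny,status:403,log,msg:'User-Agent reports WebMoney Advisor'"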

Anyway, as usual, if you can’t comment on the blog please just send me an email; it shouldn’t happen, but sometimes I have been overly enthusiastic with the rules themselves. I’ve tested most of the browsers we have lined up here at the office and they are fine — we don’t use or support Opera, but that should be fine as well. The infamous Opera Turbo issues should be fixed now; it would have been nice if Opera actually sent the proper HTTP parameters required by the RFC when using that feature, but it’s okay.

The complexity of request validation

You might have read before that I use a complicated ModSecurity setup to prevent spam on this blog — I have written about it extensively, and I have also published my rules so that other sites can use them (VideoLAN’s forums are using them as well).

Well, maintaining this ruleset is not easy at all; the problem comes when new browsers are introduced into the mix, making them difficult to validate. This is what happened a few months ago when Google first published Chrome for ICS — which I still don’t have access to; I think I’ll get an HTC One X as soon as I get to California. Well, they did it again with the new Chrome for iOS.

There are three different identifiers Chrome can come in as: Chrome, CrMo (for Android) and CriOS (for iOS devices). This simply meant that any special case put in place for Chrome on Android didn’t automatically extend to the new Chrome on iOS — which is probably intentional, given that Chrome on iOS has to use Safari’s standard WebKit engine rather than bring its own; the only reason to use it is to have bookmarks synchronised with your computer.

Now, though, is when the problems start cropping up: the new Chrome on iOS has the same quirk as the one on ICS: it doesn’t send an Accept header, which almost every other browser, including the main desktop Chrome builds, customarily does. So it was a matter of adding CriOS to the list of special cases, together with CrMo.
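
The special-casing looks roughly like this; a simplified sketch with a placeholder ID rather than the exact published rule:

# Requests that look like a browser but carry no Accept header are
# rejected, except for Chrome on Android (CrMo) and on iOS (CriOS),
# which legitimately omit it.
SecRule &REQUEST_HEADERS:Accept "@eq 0" \
    "id:9100030,phase:1,deny,status:403,log,msg:'Browser-like request without Accept header',chain"
    SecRule REQUEST_HEADERS:User-Agent "@contains Mozilla/" "chain"
    SecRule REQUEST_HEADERS:User-Agent "!@rx (?:CrMo|CriOS)/"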

But there is one more issue: the Chrome for iOS interface has a feature that lets you switch to the so-called “desktop interface” of a site — assuming the website serves different interfaces depending on the User-Agent value. What you would expect at that point is for the application to report desktop Chrome as its user agent, but that’s not the case: what it reports is Safari. The problem is that it still exhibits some peculiarities that are generally limited to Chrome, including SDCH support, which is something I used for validation before.

So what I ended up doing was removing the validation based on sdch being listed as a supported encoding — although I kept the check that if a browser reports being Chrome, it has to advertise sdch (unless, of course, it’s going through a proxy). This still makes it possible to weed out most of the unsophisticated crawlers and tools that try to pass themselves off as a browser.
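
The surviving check looks roughly like this (simplified, placeholder ID); note that it deliberately lets requests with a Via header through, since proxies may strip the encoding:

# A User-Agent claiming to be Chrome, not going through a proxy, is
# expected to advertise sdch in Accept-Encoding.
SecRule REQUEST_HEADERS:User-Agent "@rx Chrome/" \
    "id:9100032,phase:1,deny,status:403,log,msg:'Chrome not advertising sdch',chain"
    SecRule &REQUEST_HEADERS:Via "@eq 0" "chain"
    SecRule REQUEST_HEADERS:Accept-Encoding "!@contains sdch"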

World IPv6 Launch and their bots

You probably remember that I spent quite a bit of time working on a ModSecurity ruleset that would help me filter out marketing bots and spammers. This has paid off quite well as the number of spam comments I receive is infinitesimal compared to other blogs, even those using Akismet and various captchas.

Well, in the past few days I started noticing one more bot, which is now properly identified and rejected, scouring my website and this blog, though not any other address on my systems. The interesting part is that it tries (badly) to pass as a proper browser: "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))" (sic). Pay attention to the capital I in “WIndows”, the version of the OS, and the two closing parentheses at the end of the string.
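
The botched string is at least trivial to match verbatim; a minimal sketch of how such a string can be caught (placeholder ID; my actual rule may differ):

# The exact broken string used by this checker, including the capital I
# and the doubled closing parenthesis.
SecRule REQUEST_HEADERS:User-Agent \
    "@streq Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))" \
    "id:9100040,phase:1,deny,status:403,log,msg:'Fake MSIE string from IPv6 availability checker'"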

Okay, so this is not the brightest puppy in the litter, but there is something else: it’s distributed. By which I mean that I’m not getting the same request twice from the same IP address. This is usually common for services running on Amazon EC2 instances, as the IP addresses there are ever-changing, but that’s not the case here: the IP addresses all belong to Comcast.

I guess you can probably see where this is going, given the title of the post: there is another peculiarity to these requests, namely that they come in through a mix of IPv4 and IPv6, which is what tipped me off that something strange was going on; crawler bots usually prefer easily masked IPv4 addresses, not very granular IPv6 ones.

Since the website, the blog and (I didn’t mention it before, but it’s also being hit) xine-project.org are all listed on the World IPv6 Launch page, it’s easy to see that this is Comcast software trying to check whether the websites are really available. This is corroborated by the fact that the IPs all resolve to hosts within Comcast’s network all over the States, all starting with “sts02” in their reverse resolution.

Now, it wouldn’t be too bad, and I wouldn’t be kicking them so hard, if they played by the rules; but none of the requests fetched robots.txt beforehand, and in a timeframe of 22 minutes they sent me 38 requests per host. The heck?

Now, this is not the only bot requesting data for the World IPv6 Launch; there is another one I noticed, which was caught by an earlier rule: "Mozilla/5.0 (compatible; ISOCIPv6Bot; +http://www.worldipv6launch.org/measurements/".

Contrary to Comcast’s, this bot seems to only request a HEAD of the pages instead of going for a full-blown GET. On the other hand, it still does not respect robots.txt. The requests are fewer… but they are still a lot: in the past week Comcast’s bot requested pages on my website almost 13K times – thirteen thousand – while this other bot did so “only” eighteen hundred times.

Interestingly, this bot doesn’t seem to be provider-specific: I see requests coming in from China, Brazil, Sweden, the UK and even Italy! Although, funnily enough, the requests coming from Italy arrive from a plain IPv4 address, huh? Okay, so they are probably trying to make sure that the people who signed up for the IPv6 launch are really reachable over IPv6, but they could at least follow the rules, and make a bit more sense, couldn’t they?