You probably remember that I spent quite a bit of time working on a ModSecurity ruleset that would help me filter out marketing bots and spammers. This has paid off quite well as the number of spam comments I receive is infinitesimal compared to other blogs, even those using Akismet and various captchas.
Well, in the past few days I started noticing one more bot, which is now properly identified and rejected, scouring my website and this blog, though not any other address on my systems. The interesting part is that it tries (badly) to pass as a proper browser:
"Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))" (sic). Pay attention to the capital I, the version of the OS, and the two closing parentheses at the end of the string.
Okay, so this is not the brightest puppy in the litter, but there is something else: it’s distributed. By this I mean that I’m not getting the same request twice from the same IP address. This is usually very common for services using Amazon EC2 instances, as the IP addresses there are ever-changing, but that is not the case here: the IP addresses all belong to Comcast.
I guess you can probably see where this is going, given the title of the post: there is another peculiarity to these requests. They actually come through a mix of IPv4 and IPv6, which is what tipped me off that something strange was going on. Crawler bots usually prefer much more easily masked IPs, not very granular IPv6 addresses.
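To make the IPv4/IPv6 mix concrete, here is a minimal Python sketch of how the client addresses from a log could be tallied by family. The helper name `count_by_family` and the sample addresses (taken from the documentation ranges) are made up for illustration; none of this is part of the ModSecurity ruleset.

```python
import ipaddress
from collections import Counter


def count_by_family(addresses):
    """Tally client addresses by IP version (hypothetical helper)."""
    counts = Counter()
    for addr in addresses:
        # ip_address() returns an IPv4Address or IPv6Address object;
        # its .version attribute is the integer 4 or 6.
        counts[f"IPv{ipaddress.ip_address(addr).version}"] += 1
    return counts


# Sample addresses from the documentation ranges, not real log data.
sample = ["192.0.2.10", "2001:db8::1", "2001:db8::2"]
print(count_by_family(sample))
```

A crawler that shows up with a sizeable share of IPv6 sources stands out quickly in a tally like this.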
Since the website, the blog and xine-project.org (I didn’t mention it before, but it’s also being hit) are listed on the World IPv6 Launch page, it’s easy to see that this is Comcast software trying to check whether the websites are really available. This is corroborated by the fact that the IPs all resolve to hosts within Comcast’s network, spread across the whole United States, all with reverse names starting with “sts02”.
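The two signals described so far, the malformed user-agent string and the “sts02” reverse name, can both be checked mechanically. What follows is only a rough Python sketch under my own assumptions, not the actual ModSecurity rule: the function names are hypothetical, and the claim that “Windows NT 9.0” is bogus rests on no Windows release reporting that version, as far as I know.

```python
import re
import socket
from typing import List, Optional


def user_agent_tells(user_agent: str) -> List[str]:
    """List the oddities found in a User-Agent string (hypothetical helper)."""
    tells = []
    # Real browsers spell it "Windows", with a lowercase i.
    if "WIndows" in user_agent:
        tells.append("capital I in 'WIndows'")
    # Assumption: no Windows release identifies itself as NT 9.0.
    if re.search(r"Windows NT 9\.0", user_agent, re.IGNORECASE):
        tells.append("Windows NT 9.0")
    # A well-formed string closes exactly as many parentheses as it opens.
    if user_agent.count("(") != user_agent.count(")"):
        tells.append("unbalanced parentheses")
    return tells


def reverse_name(ip: str) -> Optional[str]:
    """Reverse-resolve an IPv4 or IPv6 address; None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def has_probe_prefix(hostname: Optional[str], prefix: str = "sts02") -> bool:
    """True if the reverse name starts with the prefix seen in the logs."""
    return hostname is not None and hostname.lower().startswith(prefix)


BOT_UA = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))"
print(user_agent_tells(BOT_UA))
```

In practice one would call `has_probe_prefix(reverse_name(client_ip))` for each client address; the lookup needs working DNS, which is why the sketch keeps it separate from the string checks.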
Now it wouldn’t be too bad, and I wouldn’t be kicking them so hard, if they played by the rules; but none of the requests was preceded by a fetch of robots.txt, and in a timeframe of 22 minutes they sent me 38 requests per host. The heck?
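For the curious, this kind of misbehaviour is easy to spot in a log once it is reduced to (host, path) pairs in chronological order. The sketch below is a simplified Python illustration with a hypothetical helper name and made-up sample data; it is not the tooling I actually used.

```python
from collections import Counter


def hosts_skipping_robots(requests):
    """Count, per host, the requests made before that host fetched robots.txt.

    `requests` is an iterable of (host, path) pairs in chronological order.
    Hypothetical helper for illustration.
    """
    fetched_robots = set()
    offenders = Counter()
    for host, path in requests:
        if path == "/robots.txt":
            fetched_robots.add(host)
        elif host not in fetched_robots:
            offenders[host] += 1
    return offenders


# Made-up sample: host "a" never asks for robots.txt, host "b" does so first.
sample = [("a", "/"), ("a", "/blog"), ("b", "/robots.txt"), ("b", "/")]
print(hosts_skipping_robots(sample))

# The rate quoted above: 38 requests per host over 22 minutes.
requests_per_minute = 38 / 22  # a bit more than 1.7 requests per minute
```

A well-behaved crawler would end up with no entries at all in the returned counter.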
Now this is not the sole bot requesting data for the World IPv6 Launch; there is another one that I noticed, that was caught by an earlier rule:
"Mozilla/5.0 (compatible; ISOCIPv6Bot; +http://www.worldipv6launch.org/measurements/".
Contrary to Comcast’s, this bot actually seems to only request a HEAD of the pages instead of going for a full-blown GET request. On the other hand, it still does not respect robots.txt. The requests are fewer, but there are still a lot of them: in the past week Comcast’s bot requested pages on my website almost 13K times (thirteen thousand), while this other bot “only” did so eighteen hundred times.
Interestingly, this bot doesn’t seem to be provider-specific: I see requests coming in from China, Brazil, Sweden, the UK and even Italy! Although, funnily enough, the requests from Italy come from a standard IPv4 address, uh? Okay, so they are probably trying to make sure that the people who signed up for the IPv6 launch are really reachable over IPv6, but they could at least follow the rules, and make more sense, couldn’t they?