On my blog I have been seeing a lot of requests coming from a User-Agent string that definitely didn’t look right. I noticed it, of course, during my usual antispam analysis, where fake agents are usually a symptom of sloppy spammers who are easy to kill on sight. In this case, though, the spammer neither had nasty referrers nor was posting comments, which was definitely unusual. What did that mean? Let’s start from the beginning.
The User-Agent string I was seeing is the following:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
It isn’t, at first glance, anything special: simply a very old Internet Explorer version (6.0) on Windows XP. This configuration is already banned from posting comments on my blog, as nobody should really stick to that ancient MSIE version unless they are forced to, and if they are forced to for whatever reason, they had better comment on my blog from a different system (or get a sane browser).
What makes the string smell funny is the presence of a double space after the semicolon. No official User-Agent adds more than one space after the semicolon character, even less so when it comes to official MSIE strings. Of course it’s a nit, but it is precisely on such nits that my ModSecurity ruleset is built.
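To make the nit concrete, here is a minimal sketch of such a check in Python. It is not the actual ModSecurity rule from my ruleset; the function name and the sample strings are made up for illustration, and where the extra space sits in the sample is an assumption.

```python
import re

# Hypothetical pattern: a semicolon followed by two or more spaces.
# This only illustrates the idea; it is not the rule from the real ruleset.
DOUBLE_SPACE_AFTER_SEMICOLON = re.compile(r"; {2,}")


def has_suspicious_spacing(user_agent: str) -> bool:
    """Return True if any semicolon in the string is followed by 2+ spaces."""
    return DOUBLE_SPACE_AFTER_SEMICOLON.search(user_agent) is not None


# Made-up samples; the position of the double space is an assumption.
suspicious = "Mozilla/4.0 (compatible;  MSIE 6.0; Windows NT 5.1; SV1)"
ordinary = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

print(has_suspicious_spacing(suspicious))
print(has_suspicious_spacing(ordinary))
```

A check like this is deliberately narrow: it says nothing about whether the client is malicious, only that the string was not produced by the browser it claims to be.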
At first, I thought it was some spammer trying to feign coming from the MSN network so that the most basic filtering wouldn’t trigger (my ruleset documentation indeed suggests enabling double resolution of hostnames, so that you get a forward-confirmed reverse resolution). But a manual FCrDNS check confirmed that the IP address is indeed one of those assigned to msnbot, and a quick Whois lookup shows that the IP block is indeed assigned to Microsoft.
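For those unfamiliar with FCrDNS, the manual check can be sketched roughly as follows in Python: resolve the IP address to a hostname, make sure the hostname belongs to the expected domain, then resolve the hostname back and confirm that the original IP is among the results. The function names are my own, and the `.search.msn.com` suffix is an assumption about how msnbot hosts are named; adjust it to whatever the reverse lookup actually returns.

```python
import socket


def reverse_lookup(ip):
    """PTR lookup: gethostbyaddr returns (hostname, aliases, addresses)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Resolve the hostname and return the set of addresses it maps to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_forward_confirmed(ip, allowed_suffixes,
                         reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if ip -> hostname -> ip round-trips within a trusted domain.

    The resolvers are parameters so the logic can be exercised without
    touching the network.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        # Plain string comparison; IPv6 addresses may need normalising first.
        return ip in forward(hostname)
    except OSError:
        return False


# Assumed suffix for msnbot hosts; verify against real reverse lookups.
MSNBOT_SUFFIXES = (".search.msn.com",)
```

Calling `is_forward_confirmed(client_ip, MSNBOT_SUFFIXES)` with the address from the access log performs both lookups over the network. The reverse lookup alone proves nothing, since whoever controls the IP block also controls its PTR records; it is the forward confirmation that makes the hostname trustworthy.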
What’s going on with these requests? Googling (or Binging, erm) for the user agent string is obviously not going to respect the extra space, so I couldn’t find any existing explanation of why they decided to go this way. My most likely explanation is that they are trying to see which websites still support their old browser, or that they are trying to get a screenshot of the pages to show in the search results. What I can’t understand is why they chose such a blatantly false string rather than using a real one or making up a new one for the render bot itself.
At any rate, if you see those false strings you now know that you’re not alone.