It was 2008 when I first got the idea to filter User-Agents as an antispam measure. It worked for quite a while on its own, but recently my ruleset had to come up with more sophisticated fingerprinting to discover spammers. It still works better than a captcha, but it did worsen a bit.
One of the reasons why the User-Agent itself is not enough anymore is that my filtering has been hindered by a more important project. EFF’s Panopticlick has shown that the uniqueness of the strings used in User-Agent is actually an easy way to track a specific user across requests. This got so important, that Mozilla standardized their User-Agents starting with Firefox 4, to reduce their size and thus their entropy. Among other things, the “trail” component has been fixed on the desktop to 20100101 and to the same version as Firefox for the mobile version.
_Unfortunately, Mozilla lies on that page. Not only the trail is not fixed for Firefox Aurora (i.e. the alpha version), which means that my first set of rules was refusing access to all the users of that version, but also their own Lightning extension for SeaMonkey appends to the User-Agent, when they said that it wasn’t supported anymore._
A number of spambots seem to get this wrong, by the way. My guess is that they have some code that generates the User-Agent by adding a bunch of fragments, and make it randomize it, so you can’t just kick a particular agent. Damn smart if you ask me, unfortunately, as ModSecurity hashes the IP collection by remote address and user-agent, so if they cycle different user agents, it’s harder for ModSecurity to understand that it’s actually the same IP address.
I do have some reserves on Mozilla’s handling of identification of extensions. First they say that extensions and plugins should not edit the agent string anymore – but Lightning does! – then they suggest that instead they can send an extra header to identify themselves. But that just means that fingerprinting systems only need to start counting those headers as well as the generic ones that Panopticlick already considers.
On the other hand, other browsers don’t seem to have gotten the memo yet — indeed, both Safari’s and Chrome’s strings are long and include a bunch of almost-independent version numbers (AppleWebKit, Chrome, Safari — and Mobile on the iOS versions). It gets worse on Android, as both the standard browser and Chrome provide a full build identifier, which is not only different from one device to the next, but also from one firmware to the next. Given that each mobile provider has its own builds, I would be very surprised if among my friends I was able to find two with the same identifier in their browsers. Firefox is a bit better on that but it sucks in other ways so I’m not using it as my main browser anymore there.