So Sebastian posted recently about Panopticlick, but I’m afraid he has not grasped just how many subtleties are present when dealing with tracking by User-Agent and with the limitations of the tool as it is.
First of all, let’s take a moment to realize what «Your browser fingerprint appears to be unique among the 5,207,918 tested so far.» (emphasis mine) means. If I try the exact same request as Incognito, the message is «Within our dataset of several million visitors, only one in 2,603,994 browsers have the same fingerprint as yours.» (emphasis mine). I’m not sure why EFF does not expose the numbers in the second situation, hiding the five millions under the word “several”. I can’t tell how they identify further requests on the same browser not to be a new hit altogether. So I’m not sure what the number represents.
Understanding what the number represents is a major problem, too: if you count that even just in his post Sebastian tried at least three browsers; I tried twice just to write this post — so one thing that the number does not count is unique users. I would venture a guess that the number of users is well below the million, and that does count into play for multiple factors. Because Panopticlick was born in 2010, and if less than a million real users hit it, in five years, it might not be that statistically relevant.
Indeed, according to the current reading, just the Accept headers would be enough to boil me down to one in four sessions — that would be encoding and language. I doubt that is so clear-cut, as I’m most definitely not one of four people in the UKIE area speaking Italian. A lot of this has to do with the self-selection of “privacy conscious” people who use this tool from EFF.
The whole idea of browser fingerprinting is the ability to identify an user across a set of sessions — it responds to a similar threat model as Tor. While I already pointed out I disagree on the threat model, I would like to point out again that the kind of “surveillance” that this counters is ideally the one that is executed by an external entity able to monitor your communications from different source connections — if you don’t use Tor and you only use a desktop PC from the same connection, then it doesn’t really matter: you can just check for the IP address! And if you use different devices, then it also does not really matter, because you’re now using different profiles anyway; the power is in the correlation.
In particular, when trying to tweak User-Agent or other headers to make them “more common”, you’re now dealing with something that is more likely to backfire than not; as my ModSecurity Ruleset shows you very well, it’s not so difficult to tell apart a real Chrome request by Firefox masquerading as Chrome, or IE masquerading as Safari, they have different Accept-Encoding, and other differences in style of request headers, making it quite straightforward to check for them. And while you could mix up the Accept headers enough to “look the part” it’s more than likely that you’ll be served bad data (e.g. sdch to IE, or webp to Firefox) and that would make your browsing useless.
More importantly, the then-unique combination of, say, a Chrome User-Agent for an obviously IE-generated request would make it very obvious to follow a session aggregated across different websites with a similar fingerprint. The answer I got by Sebastian is not good either: even if you tried to use a “more common” version string, you could still, very easily, create unwanted unique fingerprints; take Firefox 37: it started supporting the alt-svc extension to use HTTP2 when available, if you were to report your browser as Firefox 28 and then it followed alt-svc, then it would clearly be a fake version string, and again an easy one to follow. Similar version-dependent request fingerprinting, paired with a modified User-Agent string would make you light up as a Christmas tree during Earth Day.
But even ignoring the fact that Panopticlick does not try to identify the set of installed extensions (finding Chrome’s Readability is trivial, as it injects content into the DOM, and so do a lot more), there is one more aspect that it almost entirely ignores: server-side fingerprinting. Beside not trying to correlate the purported User-Agent against the request fingerprint, it does not seem to use a custom server at all, so it does not leverage TLS handshake fingerprints! As can be seen through Qualys analysis, there are some almost-unique handshake sequences on a given server depending on the client used; while this does not add up much more data when matched against a vanilla User-Agent, a faked User-Agent and a somewhat more rare TLS handshake would be just as easy to track.
Finally, there is the problem with self-selection: Sebastian has blogged about this while using Firefox 37.0.1 which was just released, and testing with that; I assume he also had the latest Chrome. While Mozilla increased the rate of release of Firefox, Chrome has definitely a very hectic one with many people updating all the time. Most people wouldn’t go to Panopticlick every time they update their browser, so two entries that are exactly the same apart from the User-Agent version would be reported as unique… even though it’s most likely that the person who tried two months ago updated since, and now has the same fingerprint as the person who tried recently with the same browser and settings.
Now this is a double-edged sword: if you rely on the User-Agent to track someone across connections, a ephemeral User-Agent that changes every other day due to updates is going to disrupt your plans quickly; on the other hand lagging behind or jumping ahead on the update train for a browser would make it more likely for you to have a quite unique version number, even more so if you’re tracking beta or developer channels.
Interestingly, though, Mozilla has thought about this before, and their Gecko user agent string reference shows which restricted fields are used, and references the bugs that disallowed extensions and various software to inject into the User-Agent string — funnily enough I know of quite a few badware cases in which a unique identifier was injected into the User-Agent for fake ads and other similar websites to recognize a “referral”.
Indeed, especially on Mobile, I think that User-Agents are a bit too liberal with the information they push; not only they include the full build number of the mobile browser such as Chrome, but they usually include the model of the device and the build number of the operating system: do you want to figure out if a new build of Android is available for some random device out there? Make sure you have access to HTTP logs for big enough websites and look for new build IDs. I think that in this particular sub-topic, Chrome and Safari could help a lot more by reducing the amount of details of the engine version as well as the underlying operating system.
So, for my parting words, I would like to point out that Panopticlick is a nice proof-of-concept that shows how powerful browser fingerprinting is, without having to rely on tracking cookies. I think lots of people both underestimate the power of fingerprinting and overestimate the threat. From one side, because Panopticlick does not have enough current data to make it feasible to evaluate the current uniqueness of a session across the world; from the other, because you get the wrong impression that if Panopticlick can’t put you down as unique, you’re safe — you’re not, there are many more techniques that Panopticlick does not think of trying!
My personal advice is to stop worrying about the NSA and instead start safekeeping yourself: using click-to-play for Flash and Java is good prophylaxis for security, not just privacy, and NoScript can be useful too, in some cases, but don’t just kill everything on sight. Even using the Data Saver extension for non-HTTPS websites can help (unfortunately I know of more than a few blocking it, and then there is the problem with captive portals bringing it to be clear-text HTTP too).