EFF’s Panopticlick at Enigma 2016

One of the things I was most interested to hear about at Enigma 2016 was news about EFF’s Panopticlick. For context, here is the talk from Bill Budington:

I wrote before about the tool, but they have recently reworked and rebranded it as a platform for promoting their Privacy Badger, which I don’t particularly care for. Luckily for my purposes, they still provide the detailed information, and this time around they make it more prominent that they rely on the fingerprintjs2 library to gather it. Which means I could actually try and extend it.

I tried to bring up one of my concerns at the post-talk Q&A at the conference (the Q&A was not recorded), so I thought it would be nice to publish my few comments about the tool as it stands right now.

The first comment is this: neither Panopticlick nor Privacy Badger considers the idea of server-side tracking. I have said it before, and I will repeat it now: there are plenty of ways to identify a particular user, even across sites, just by tracking behaviours that can be observed passively on the server side. Bill Budington’s answer to this at the conference was that Privacy Badger allows cookies only if there is a policy in place from the site, and counts on this policy being binding for the site.

But this does not mean much — Privacy Badger may stop the server from setting a cookie, but there are plenty of behaviours that can be observed without the help of the browser, or even more interestingly, with the help of Privacy Badger, uBlock, and other similar “privacy conscious” extensions.

Indeed, not allowing cookies is, already, a piece of trackable information. And that’s where the problem with self-selection, which I already hinted at before, comes in: when I ran Panopticlick on my laptop earlier, it told me that one out of 1.42 browsers has cookies enabled. While I don’t have access to facts and statistics about that, I do not find it realistic that about 30% of browsers have cookies disabled.
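
To make that concrete: Panopticlick’s “one in N browsers” figures convert directly into bits of identifying information (surprisal, -log2 of the trait’s frequency). A minimal sketch of the arithmetic, using the cookie statistic quoted above:

```python
import math

def surprisal_bits(one_in_n: float) -> float:
    """Bits of identifying information carried by a trait
    shared by one in `one_in_n` visitors (-log2 of the frequency)."""
    return math.log2(one_in_n)

# The stat above: 1 out of 1.42 browsers has cookies enabled,
# i.e. roughly 70% of Panopticlick's (self-selected) visitors.
bits_enabled = surprisal_bits(1.42)      # ~0.5 bits
# The complementary trait is rarer, hence more identifying:
p_disabled = 1 - 1 / 1.42
bits_disabled = -math.log2(p_disabled)   # ~1.8 bits
print(f"cookies enabled: {bits_enabled:.2f} bits, "
      f"disabled: {bits_disabled:.2f} bits")
```

So in Panopticlick’s own skewed dataset, disabling cookies already says more about you than leaving them on.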

If you connect this to the comments NSA’s Rob Joyce made at the closing talk, which unfortunately I was not present for, you could say that the fact that Privacy Badger is installed, and fetches a given path from a server trying to set a cookie, is a good way to figure out information about a person, too.

The other problem is more interesting. In the talk, Budington briefly introduces the concept of Shannon entropy, although not by that name, and gives an example of the different amounts of entropy provided by knowing someone’s zodiac sign versus knowing their birthday. He also points out that these two pieces of information are not independent, so you cannot sum their entropies together, which is indeed correct. But there are two problems with that.

The first is that the Panopticlick interface does seem to treat all the information it gathers as at least partially independent, and indeed shows a number of entropy bits higher than its single highest entry. But it is definitely not the case that all entries are independent. Even leaving aside browser-specific things such as the type of images requested and so on, for many languages (though not English) there is a timezone correlation: the vast majority of Italian users will report the same timezone, either +1 or +2 depending on the time of the year; sure, there are expats and geeks, but they are definitely not as common.
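
The language/timezone correlation can be shown numerically: the entropy of the joint distribution is strictly less than the sum of the per-trait entropies whenever the traits share information. A sketch with a made-up toy population (the probabilities are illustrative assumptions, not real browser statistics):

```python
import math
from collections import Counter

def entropy(dist) -> float:
    """Shannon entropy, in bits, of a probability distribution
    given as a mapping from outcome to probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy joint distribution of (language, timezone). Italian speakers
# almost all sit in the +1/+2 timezone, so the traits are correlated.
joint = {
    ("en", "UTC+0"): 0.40, ("en", "UTC+1"): 0.10,
    ("it", "UTC+1"): 0.45, ("it", "UTC+0"): 0.05,
}
lang, tz = Counter(), Counter()
for (l, t), p in joint.items():
    lang[l] += p
    tz[t] += p

h_sum = entropy(lang) + entropy(tz)   # what naive summing claims
h_joint = entropy(joint)              # what the data actually carries
print(f"sum of marginals: {h_sum:.3f} bits, joint: {h_joint:.3f} bits")
```

The joint entropy comes out noticeably lower than the sum: summing per-trait bits, as the Panopticlick interface appears to do, overstates how unique a session is.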

The second problem is that there is a more interesting approach to take when you are given, through independent channels, key/value pairs of information that should not be independent. Going back to the example of date of birth and zodiac sign, the calculation of entropy in that example starts from facts, particularly ones people cannot lie about — though I’m sure that in any one database of registered users, January 1st is skewed, with many more than 1/365th of the users.

But what happens if the information is gathered separately? If you ask a user both their zodiac sign and their date of birth separately, they may lie. And when (not if) they do, you may get a more interesting piece of information. Because if you have a network of separate social sites/databases in which only one user ever selects being born on February 18th while being a Scorpio, you have a very strong signal that it might be the same user across them.
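
A cross-database correlator could sketch that consistency check like this (the sign cutoff dates below are approximate; real zodiac boundaries vary by a day):

```python
from datetime import date

# Approximate zodiac cutoffs: dates up to (month, day) map to the sign.
ZODIAC = [
    ((1, 19), "Capricorn"), ((2, 18), "Aquarius"), ((3, 20), "Pisces"),
    ((4, 19), "Aries"), ((5, 20), "Taurus"), ((6, 20), "Gemini"),
    ((7, 22), "Cancer"), ((8, 22), "Leo"), ((9, 22), "Virgo"),
    ((10, 22), "Libra"), ((11, 21), "Scorpio"), ((12, 21), "Sagittarius"),
    ((12, 31), "Capricorn"),
]

def zodiac_for(birthday: date) -> str:
    for cutoff, sign in ZODIAC:
        if (birthday.month, birthday.day) <= cutoff:
            return sign
    return "Capricorn"

def inconsistent(birthday: date, claimed_sign: str) -> bool:
    """An impossible birthday/sign pair is itself a rare,
    highly linkable trait across otherwise separate databases."""
    return zodiac_for(birthday) != claimed_sign

# February 18th is Aquarius; a user claiming Scorpio stands out.
print(inconsistent(date(2000, 2, 18), "Scorpio"))  # True
```

The lie does not hide the user; it manufactures a near-unique signature that two unrelated databases can match on.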

This is the same situation I described some time ago of people changing their User-Agent string to try to hide, but then creating unique (or nearly unique) signatures of their passage.

Also, while Panopticlick will tell you if the browser is doing anything to avoid fingerprinting (how?), it still does not seem to tell you whether any of your extensions are making you more unique. And since it’s hard to tell whether some bit of JavaScript is trying to load a higher-definition picture, or hide pieces of the UI for your small screen, versus telling the server about your browser setup, it is not like trackers care whether you disabled your cookies…

For a more proactive approach to improving users’ privacy, we should ask more browser vendors to do what Mozilla did six years ago and sanitize their User-Agent content. Currently, Android mobile browsers report both the device type and build number, which makes them much easier to track — even though the suggestion has been, up to now, to use mobile browsers because they look more like each other.

And we should start wondering how much a given browser extension adds to or subtracts from the uniqueness of a session. Because I think most of them are currently adding to the entropy, even those that are designed to “improve privacy.”

LOLprivacy, or Misunderstanding Panopticlick for the Worst

So Sebastian posted recently about Panopticlick, but I’m afraid he has not grasped just how many subtleties are involved in tracking by User-Agent, or the limitations of the tool as it is.

First of all, let’s take a moment to realize what «Your browser fingerprint appears to be unique among the 5,207,918 tested so far.» (emphasis mine) means. If I try the exact same request in Incognito, the message is «Within our dataset of several million visitors, only one in 2,603,994 browsers have the same fingerprint as yours.» (emphasis mine). I’m not sure why EFF does not expose the numbers in the second situation, hiding the five million under the word “several”. I also can’t tell how they identify further requests from the same browser as not being a new hit altogether. So I’m not sure what the number represents.
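
It helps to translate those “one in N” figures into bits, because the bits needed to be unique scale with the size of the population you are hiding in. A quick sketch (the world-population figure is a rough assumption for scale):

```python
import math

def bits_for_uniqueness(population: int) -> float:
    """Minimum identifying bits needed to single out one
    individual among `population` (log2 of the population size)."""
    return math.log2(population)

# Unique among Panopticlick's ~5.2 million tests:
print(f"{bits_for_uniqueness(5_207_918):.1f} bits")
# vs. unique among ~3 billion internet users (rough figure):
print(f"{bits_for_uniqueness(3_000_000_000):.1f} bits")
```

Being unique in Panopticlick’s dataset takes roughly 22 bits; being unique worldwide takes over 31. A fingerprint “unique among those tested so far” says much less than it first appears to.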

Understanding what the number represents is a major problem, too: consider that even just for his post Sebastian tried at least three browsers, and I tried twice just to write this one — so one thing the number does not count is unique users. I would venture a guess that the number of users is well below a million, and multiple factors play into that. Panopticlick was born in 2010, and if fewer than a million real users hit it in five years, it might not be that statistically relevant.

Indeed, according to the current reading, just the Accept headers would be enough to boil me down to one in four sessions — that would be encoding and language. I doubt it is that clear-cut, as I’m most definitely not one of four people in the UKIE area speaking Italian. A lot of this has to do with the self-selection of “privacy conscious” people who use this tool from EFF.

But what worries me is the reaction from Sebastian and, even more so, the first comment on his post. Suggesting that you can hide in the crowd by looking for a “more popular” User-Agent, or by using a random bunch of extensions and disabling JavaScript or blocking certain domains, is naïve to say the least, and most likely misses the point that Panopticlick tries to make.

The whole idea of browser fingerprinting is the ability to identify a user across a set of sessions — it responds to a similar threat model as Tor does. While I already pointed out that I disagree with the threat model, I would like to point out again that the kind of “surveillance” this counters is ideally one executed by an external entity able to monitor your communications across different source connections — if you don’t use Tor and you only use a desktop PC on the same connection, then it doesn’t really matter: you can just check the IP address! And if you use different devices, then it also does not really matter, because you’re now using different profiles anyway; the power is in the correlation.

In particular, when trying to tweak the User-Agent or other headers to make them “more common”, you’re doing something more likely to backfire than not. As my ModSecurity Ruleset shows very well, it’s not so difficult to tell apart a real Chrome request from Firefox masquerading as Chrome, or IE masquerading as Safari: they have different Accept-Encoding values, and other differences in the style of request headers, making it quite straightforward to check for them. And while you could mix up the Accept headers enough to “look the part”, it’s more than likely that you’ll be served bad data (e.g. sdch to IE, or webp to Firefox), and that would make your browsing useless.
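
A toy consistency check in the spirit of that ruleset might look like this. The expected header traits are illustrative assumptions for the browsers of the time (Chrome advertised SDCH support then), not an exhaustive or current ruleset:

```python
# Substring each browser family's Accept-Encoding was known to carry
# at the time (illustrative assumption, not a complete ruleset).
EXPECTED_ENCODING_TRAIT = {
    "Chrome": "sdch",
    "Firefox": "gzip, deflate",
}

def claimed_family(user_agent: str) -> str:
    if "Chrome" in user_agent:
        return "Chrome"
    if "Firefox" in user_agent:
        return "Firefox"
    return "unknown"

def looks_masqueraded(user_agent: str, accept_encoding: str) -> bool:
    """Flag requests whose User-Agent claim does not match the
    encoding support that family of browsers actually sends."""
    trait = EXPECTED_ENCODING_TRAIT.get(claimed_family(user_agent))
    return trait is not None and trait not in accept_encoding

# Firefox masquerading as Chrome, but sending Firefox's encodings:
print(looks_masqueraded(
    "Mozilla/5.0 ... Chrome/40.0.2214.115 Safari/537.36",
    "gzip, deflate"))  # True: claims Chrome but advertises no sdch
```

The point is not these specific strings, but that a faked User-Agent rarely comes with a consistently faked set of companion headers.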

More importantly, the then-unique combination of, say, a Chrome User-Agent on an obviously IE-generated request would make it very easy to follow a session across different websites sharing a similar fingerprint. The answer I got from Sebastian is not good either: even if you tried to use a “more common” version string, you could still, very easily, create unwanted unique fingerprints. Take Firefox 37: it started supporting the alt-svc extension to use HTTP2 when available. If you were to report your browser as Firefox 28 and then follow alt-svc, it would clearly be a fake version string, and again an easy one to follow. Similar version-dependent request fingerprinting, paired with a modified User-Agent string, would make you light up like a Christmas tree during Earth Day.

There are more problems, though; the suggestion of installing extensions such as AdBlock also adds to the fingerprint rather than shielding you from it. As long as JavaScript is allowed to run, it can detect AdBlock’s presence, and with a bit of work you can identify which of the different blocking lists is in use, too. You could use NoScript to avoid running JavaScript at all, but given that this is by far not something most users do, it will also add to the entropy of your browser’s fingerprint rather than remove from it, even if it prevents client-side fingerprinting from accessing things like the list of available plugins (which in my case is not that common, either!)

But even ignoring the fact that Panopticlick does not try to identify the set of installed extensions (finding Chrome’s Readability is trivial, as it injects content into the DOM, and so do many more), there is one more aspect it almost entirely ignores: server-side fingerprinting. Besides not trying to correlate the purported User-Agent against the request fingerprint, it does not seem to use a custom server at all, so it does not leverage TLS handshake fingerprints! As can be seen through Qualys’ analysis, there are some almost-unique handshake sequences on a given server depending on the client used; while this does not add much more data when matched against a vanilla User-Agent, a faked User-Agent paired with a somewhat rarer TLS handshake would be just as easy to track.
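
One common shape for such a server-side fingerprint is to hash the parameters a client offers in its TLS ClientHello, since different TLS stacks advertise different cipher and extension lists regardless of what the User-Agent claims. A minimal sketch, with made-up cipher/extension IDs standing in for real handshake data:

```python
import hashlib

def tls_client_fingerprint(version: int, ciphers: list,
                           extensions: list) -> str:
    """Hash the parameters a client offers in its TLS ClientHello.
    Different TLS stacks offer different cipher/extension lists,
    so the hash clusters requests by client software, regardless
    of the User-Agent header they send."""
    blob = ",".join([
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
    ])
    return hashlib.md5(blob.encode()).hexdigest()

# Made-up cipher/extension IDs for two different client stacks:
fp_a = tls_client_fingerprint(771, [49195, 49199, 156], [0, 10, 11])
fp_b = tls_client_fingerprint(771, [49200, 49196, 157], [0, 10, 11])
print(fp_a != fp_b)  # different stacks yield different fingerprints
```

A server logging this hash next to the User-Agent can spot the mismatch (say, a “Firefox” User-Agent arriving with a handshake no Firefox build ever produces) without any client-side JavaScript at all.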

Finally, there is the problem of self-selection: Sebastian blogged about this while using Firefox 37.0.1, which had just been released, and tested with that; I assume he also had the latest Chrome. While Mozilla has increased Firefox’s release cadence, Chrome definitely has a very hectic one, with many people updating all the time. Most people won’t go back to Panopticlick every time they update their browser, so two entries that are exactly the same apart from the User-Agent version are reported as unique… even though it’s most likely that the person who tried two months ago has updated since, and now has the same fingerprint as the person who tried recently with the same browser and settings.

Now this is a double-edged sword: if you rely on the User-Agent to track someone across connections, an ephemeral User-Agent that changes every other day due to updates is going to disrupt your plans quickly; on the other hand, lagging behind or jumping ahead of the update train for a browser makes it more likely that you have a quite unique version number, even more so if you’re tracking beta or developer channels.

Interestingly, though, Mozilla has thought about this before, and their Gecko user agent string reference shows which restricted fields are used, and references the bugs that disallowed extensions and various software from injecting into the User-Agent string — funnily enough, I know of quite a few badware cases in which a unique identifier was injected into the User-Agent so that fake ads and other similar websites could recognize a “referral”.

Indeed, especially on mobile, I think User-Agents are a bit too liberal with the information they push: not only do they include the full build number of the mobile browser, such as Chrome’s, but they usually include the model of the device and the build number of the operating system. Do you want to figure out if a new build of Android is available for some random device out there? Make sure you have access to the HTTP logs of big enough websites and look for new build IDs. In this particular sub-topic, Chrome and Safari could help a lot by reducing the amount of detail about the engine version as well as the underlying operating system.
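
To see how much sits right there in the string, here is a sketch that pulls the device model and OS build out of a typical Android Chrome User-Agent (the example string is illustrative, with made-up version values):

```python
import re

# A typical Android Chrome User-Agent (example string, values made up):
UA = ("Mozilla/5.0 (Linux; Android 5.1; Nexus 5 Build/LMY47I) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/40.0.2214.109 Mobile Safari/537.36")

# Device model and OS build number sit right in the parentheses:
match = re.search(r"Android [\d.]+; (?P<model>[^;)]+) "
                  r"Build/(?P<build>[^;)\s]+)", UA)
if match:
    print(match.group("model"))  # Nexus 5
    print(match.group("build"))  # LMY47I
```

One regular expression over a web server’s access logs is all it takes to enumerate device models and OS builds across a site’s visitors.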

So, for my parting words, I would like to point out that Panopticlick is a nice proof of concept showing how powerful browser fingerprinting is, without having to rely on tracking cookies. I think lots of people both underestimate the power of fingerprinting and overestimate the threat: on one side, because Panopticlick does not have enough current data to make it feasible to evaluate the current uniqueness of a session across the world; on the other, because you get the wrong impression that if Panopticlick can’t put you down as unique, you’re safe — you’re not; there are many more techniques that Panopticlick does not even try!

My personal advice is to stop worrying about the NSA and instead start looking after your own safety: using click-to-play for Flash and Java is good prophylaxis for security, not just privacy, and NoScript can be useful too, in some cases, but don’t just kill everything on sight. Even using the Data Saver extension for non-HTTPS websites can help (unfortunately I know of more than a few sites blocking it, and then there is the problem of captive portals forcing clear-text HTTP too).

More on browser fingerprinting, and Analytics

A few days ago I pointed out how it’s possible to use some Chrome extensions (and likely just as many Firefox ones) to gather extra entropy beyond what Panopticlick already knows about. But this is not the only source of identification that Panopticlick is not considering, and that can be used to track users.

I originally intended to write a full proof of concept for it, but since I’m currently in Mountain View, my time is pretty limited, so I’ll limit myself to a description. Panopticlick factors in the Accept header that the browser sends with the page request, but there is one thing it does not check, as it’s a bit more complex to do: the Accept header for images. Indeed, different browsers support different image formats, as I’ve found before, and even browsers that both support, for instance, WebP, such as Opera and Chrome, have widely different Accept headers.

What does this mean? Well, if you were trying to replace, let’s say, your Chrome user agent with a Firefox one, you’d now have a very unique combination: a Firefox user agent accepting WebP images. Your hope of hiding by muddying the waters just made you stand out much more easily. The same goes if you were trying to disable WebP requests to make your images’ Accept look more like Firefox’s: now you have a given version of Chrome that does not support WebP, and the likelihood of being unique is even higher.
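
A server-side check for that mismatch can be sketched in a few lines. The Accept values below are illustrative approximations of what the browsers of the time sent for image requests, not exact header captures:

```python
# Approximate image Accept headers by browser family at the time
# (illustrative values; real headers varied by version):
TYPICAL_IMAGE_ACCEPT = {
    "Chrome": "image/webp,*/*;q=0.8",
    "Firefox": "image/png,image/*;q=0.8,*/*;q=0.5",
}

def accept_mismatch(claimed_family: str, image_accept: str) -> bool:
    """A Firefox User-Agent whose image requests advertise WebP,
    or a Chrome one whose requests don't, betrays the masquerade."""
    wants_webp = "image/webp" in image_accept
    if claimed_family == "Firefox" and wants_webp:
        return True
    if claimed_family == "Chrome" and not wants_webp:
        return True
    return False

# Chrome pretending to be Firefox still asks for WebP images:
print(accept_mismatch("Firefox", TYPICAL_IMAGE_ACCEPT["Chrome"]))  # True
```

Either direction of the lie, claiming Firefox while requesting WebP or claiming Chrome while refusing it, produces a rarer combination than telling the truth.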

So why am I talking this much about browser fingerprinting again? Well, you may or may not have noticed, but both my blog and Autotools Mythbuster are now using Google Analytics. The reason is that, after my doubts on whether to keep running the blog or not, I want to know exactly how useful my blog is to people, and how many people end up reading it at a given time. I was originally a bit unsure whether this was going to be a problem for my readers, but seeing how easy it is to track people stealthily, tracking people explicitly shouldn’t be considered a problem — thus why I’m going to laugh at your expense if you start complaining about this being a “web bug”.

Browser fingerprinting

I posted some notes about browser fingerprinting back in March, and noted how easy it is to identify a given user across requests just by the few passive scans that are possible without even having Flash enabled. Indeed, EFF’s Panopticlick considers my browser unique even with Flash disabled.

But even though Panopticlick only counts my browser among the people who actually ran it, which means it covers just a fraction of all the possible users out there, it is also not exercising the full force of fingerprinting. In particular, it does not try to detect installed Chrome extensions, which is actually trivial to do in JavaScript for some of them. In my case, I can easily identify the presence of the Readability extension, because it injects an “indicator” as an iframe with a fixed ID. Similarly, it’s relatively easy to identify adblock users, as you have probably noticed on a bunch of different sites already that beg you to disable the adblocker so they can make some money with ads.

Given how paranoid some of my readers are, I’m looking forward to somebody adding Chrome and Firefox extension identification to Panopticlick; it’ll definitely be interesting going forward.