Obligatory disclaimer: this post is my opinion, and my opinion only. It does not represent the opinion of my current, past, and future employers. Indeed, as is always the case for my blog, this post is not signed off by anyone in my reporting chain. While there are some references to my work of the past decade, all of the information I’m talking about is open knowledge.
A few weeks ago I had a heated discussion on Twitter about User-Agent strings and browser fingerprinting, topics that I have been following, and taking part in, for over ten years now. Unfortunately, the discussion is still at the same level as it was a number of years ago, and I think this is to the detriment of users.
Let’s start with what fingerprinting is and isn’t. The name “fingerprint” suggests that you can uniquely identify a specific browser among the total population, but this is not what fingerprinting tends to be, among other things because it’s impossible to observe the total population. Instead, fingerprinting can be used to build connections between actions at different points in time, when a stronger pseudo-unique identifier is not available.
I’m going to explain this with a metaphor, because I think most people reading this would be familiar with TV crime dramas and police procedurals. Say that your investigators are looking for a car that was present at the scene of a crime. They don’t have a license plate, VIN, or other uniquely identifying information, but they may have the make and model, the colour of the paint, and maybe the description of an eyewitness who didn’t think of reading the plate but noticed something of it.
These details are not enough to identify a car out of the whole population of cars, in general: unless it’s a very specific classic car for which only one is known to be painted that certain colour, they will still have way too many cars that could possibly be the one they’re looking for. But if they also have a way to limit the population further, that can be a lot more useful. Say that the scene is behind a gate, and so instead of looking for any car of that make and model and colour, they’re looking for a car of that make, model, and colour, owned by someone who has a key to the gate. Now we’re talking!
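To make the narrowing-down concrete, here is a minimal Python sketch of the same reasoning. Everything in it is made up for illustration: the list of cars, the owners, and the set of people holding a key to the gate. It only shows that traits which are ambiguous across the whole population can become identifying once they are intersected with a smaller one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    owner: str
    make: str
    model: str
    colour: str


# Entirely made-up population, for illustration only.
CARS = [
    Car("alice", "Fiat", "Panda", "red"),
    Car("bob", "Fiat", "Panda", "red"),
    Car("carol", "Fiat", "Panda", "blue"),
    Car("dave", "Tesla", "Model 3", "red"),
    Car("erin", "Fiat", "Panda", "red"),
]

# Hypothetical side information: who holds a key to the gate.
GATE_KEY_HOLDERS = {"bob", "carol", "dave"}


def matching(cars, **traits):
    """Return the cars whose attributes equal every observed trait."""
    return [c for c in cars if all(getattr(c, k) == v for k, v in traits.items())]


# The observed traits alone leave several candidates...
candidates = matching(CARS, make="Fiat", model="Panda", colour="red")
# ...but intersecting them with a smaller population narrows the list down.
narrowed = [c for c in candidates if c.owner in GATE_KEY_HOLDERS]

print(len(candidates), "candidates overall,", len(narrowed), "with a gate key")
```

In browser terms, the "gate" could be anything that shrinks the population under consideration, such as the visitors of one specific site during one specific hour.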
Various characteristics that can be observed for a web browser metaphorically match the characteristics of a car, which is what EFF’s Panopticlick (now replaced by their Cover Your Tracks application) was trying to show people. Unfortunately there’s a significant difference between how Panopticlick could figure out how likely a certain configuration is and how you could do the same for a car: at least for the most prominent characteristics, law enforcement agencies have databases, so they could tell you how many other cars with that particular trait exist out there. Nobody has an equivalent registry of browsers since, as noted above, the total population cannot be observed.
The car analogy fits for another point about browser traits: while some of the traits can be changed, to cover your tracks, it’s quite possible that an attempt at disguise might make your browser stand out a lot more. I hope my old example is out of date, but it used to be that Firefox would not accept WebP images, which meant that if you received a request claiming to be from a Firefox browser, but including WebP in the list of accepted image formats, you knew it was a fake User-Agent string. That would be like taking a Tesla and putting a Fiat Panda badge on it: nobody would believe it, and likely they will remember seeing a very ironically modified car.
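A minimal sketch of that kind of consistency check, in Python. The function name and the version threshold are assumptions made for illustration: the code assumes that Firefox releases before version 65 did not advertise WebP in their Accept header, and it is in no way a description of how any real service detects spoofed strings.

```python
import re

# Assumption for illustration: Firefox releases before this major version
# did not list image/webp in their Accept header.
FIREFOX_WEBP_SINCE = 65


def looks_inconsistent(user_agent: str, accept: str) -> bool:
    """Flag a request whose Accept header contradicts its claimed Firefox version."""
    match = re.search(r"Firefox/(\d+)", user_agent)
    if match is None:
        return False  # not claiming to be Firefox, so this rule has nothing to say
    claims_old_firefox = int(match.group(1)) < FIREFOX_WEBP_SINCE
    accepts_webp = "image/webp" in accept.lower()
    return claims_old_firefox and accepts_webp


# Illustrative request: an old Firefox version string paired with WebP support.
spoofed_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
print(looks_inconsistent(spoofed_ua, "image/webp,*/*"))
```

The point is not this specific rule, but that every trait a browser exposes can be cross-checked against the others, so changing only one of them tends to add a signal rather than remove one.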
But how did this whole topic come up again? Well, turns out that at least some Linux distributions are still injecting their name and version in the User-Agent string of the browsers they package (including libraries used as reusable components), and at least some of their developers don’t see the harm this can cause their users. That suggests that more explanations are required.
Adding the distro name and version to User-Agent is the equivalent, in the metaphor of a car, of putting on a sticker of the dealership that sold said car. This turns out to be a fairly common thing to do as well! It wouldn’t be a very specific trait, unless it was for a dealership that sold very few of the cars under consideration, for instance because it’s from a different region, or because it went bust a few years before. And any Linux distribution would be something odd on a general audience website.
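For readers who have never looked at one of these strings, the following Python sketch shows what the "sticker" looks like and how trivially a server can read it. The two User-Agent strings are illustrative examples of the general shape rather than captures from a specific package, and the list of distribution names is a hypothetical, non-exhaustive one.

```python
import re

# Hypothetical list of distribution names to look for; not exhaustive.
KNOWN_DISTROS = ("Ubuntu", "Fedora", "Debian", "Gentoo", "Arch")


def distro_tokens(user_agent: str) -> list[str]:
    """Return the distribution names found in the parenthesised platform part."""
    platform = re.search(r"\(([^)]*)\)", user_agent)
    if platform is None:
        return []
    parts = [p.strip() for p in platform.group(1).split(";")]
    return [p for p in parts if p.split("/")[0] in KNOWN_DISTROS]


# Illustrative strings: the same browser with and without the "sticker".
branded = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
plain = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

print(distro_tokens(branded), distro_tokens(plain))
```

Anything that can be extracted this easily is available to every site, ad network, and log file the request passes through, not just to the operators the distribution hoped to impress.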
So what is the argument for adding this sticker? At least for the developer who kept insisting that this is not harmful to the users, the reason to have the name of the distribution (the “sticker”) is to show services that they should support the distro because they have a number of users using it. It’s an argument that I can sympathise with, but I don’t think it is reasonable.
First of all, we’re talking about an opt-out feature, that is, all of the requests that are being sent are effectively “branded” unless the user knows to turn this off. I don’t think that’s fair, because the vast majority of users are not developers, and wouldn’t know how User-Agent strings work. Unlike the dealership sticker, which a car owner would notice and decide to take off if they didn’t care for it, User-Agent strings are not shown to the user in day-to-day browsing, and users would likely be unaware of how much information they are providing unless they stumble across EFF’s Cover Your Tracks. This kind of opt-out feature raises absolute and righteous outrage when it is rolled out by the likes of Microsoft or Google, so I find it hypocritical to think it’s fair game to do the same for a Linux distribution.
The second point is that doing this to show the services that a specific Linux distribution is worth supporting feels myopic, for multiple reasons. Let’s start from the very obvious one: most analytics platforms do not care to give a breakdown of per-distro visits. AWStats did and probably still does, but Google Analytics definitely doesn’t. WordPress basic stats don’t even bother giving you a breakdown by operating system, let alone distribution. Most of the self-hosted analytics software doesn’t seem to care about it either. Since I don’t work on front-facing services, I don’t even know what the analytics software used in Big Tech would show, but I can take an informed guess that nobody would be digging into how many users are using Fedora versus Ubuntu versus Arch Linux, unless they specifically focus on Linux in the first place. And then they may as well just ask: I can’t remember the last time I met someone who uses Linux and wouldn’t talk about their favourite distro. I am told that MediaWiki might have a per-distribution breakdown, which makes sense particularly as many distributions use it for their own wiki, but… yeah I don’t think it represents a majority of users.
For the record, according to Google Analytics, only just over one thousand visitors of this blog in the past five months used Linux, compared with nearly three thousand using Windows, and two thousand each using iOS and Android. And this is a fairly biased view, since this blog is about tech and Free Software in the first place, and in this period Hacker News featured two posts of mine on their front page.
What I’m trying to say is that for most big, global services, the question is unlikely to be “Should we support Fedora?” but rather “Should we support Linux?” in the first place. Which is, by itself, the right question to ask in my opinion. Distributions being different for difference’s sake have been a plague for Linux as a whole, and when it comes to web services, the fact that browsers provide a well standardized platform is a great upside. The only providers that would have to care about the differences between Gentoo and Arch Linux are those providing services outside of web browsers, and I would venture a guess that they wouldn’t want to rely on web statistics, as you may use a different browsing device compared to the system you use the service from (take Tailscale, for example).
That’s what I call a marginal upside for the project that applied the sticker: the vast majority of operators won’t even notice, and even those who do might not care about the distribution as much as they care about the browser and its version. On the other hand, the “sticker” applies to an even smaller subset of an already small population, which makes identification of a single user interaction a lot easier.
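One common way to put a number on "a smaller subset of an already small population" is to express a trait as bits of identifying information, that is, the negative base-2 logarithm of the share of visitors who have it. The Python sketch below uses the rough Google Analytics figures quoted earlier (about one thousand Linux visitors out of roughly eight thousand) for the first step. The share of a single distribution among Linux visitors is a pure assumption, not a measurement.

```python
import math


def surprisal_bits(share: float) -> float:
    """Bits of identifying information revealed by a trait held by `share` of visitors."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return -math.log2(share)


# Rough figures from this blog's analytics: about 1,000 Linux visitors
# out of roughly 8,000 across Linux, Windows, iOS and Android.
linux_share = 1000 / 8000

# Assumption, not a measurement: one distribution holding a tenth of Linux visitors.
distro_share_within_linux = 0.10

linux_bits = surprisal_bits(linux_share)
branded_bits = surprisal_bits(linux_share * distro_share_within_linux)

print(f"Linux alone: {linux_bits:.2f} bits; with the distro sticker: {branded_bits:.2f} bits")
```

Each additional bit halves the crowd a visitor can hide in, and the rarer the distribution (or the more precise the version number attached to it), the more bits the sticker gives away.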
Now, those who know me know that I have a nuanced view of privacy, as I expressed before. This means I don’t generally feel like I need to hide myself from big organizations and law enforcement, but that does not mean the same applies to everyone! Particularly with the way the world is going, not everyone is playing on the lowest difficulty level, so I think it is important for people to make informed decisions — and for Free Software developers to make decisions that are kind to users. Which is why I wouldn’t have a problem if this branding was opt-in, and well explained at first installation, just like Windows does.
Requiring users to opt into lowered privacy, even if lowered by a negligible amount, is a difficult path to take, I grant you. I said so myself, that to be invisible to analytics platforms means your preferences are going to be ignored. Home Assistant makes a good case for why you should opt into the diagnostics statistics, as it helps them prioritize integrations that have the most users, but I don’t know how many people do actually opt into it right now. And Home Assistant is doing it “right” in my opinion, by making the anonymized statistics available to everyone, while a branded User-Agent would be distributing the statistics across many services, most of which will not be available to either the public or the brand owners!
Opt-in analytics are harder, also because they are largely transactional. A lot of time has passed since the advent of Clubcard, but most stores still build their analytics based on loyalty cards and signed-in discounts. Entire businesses exist to analyse spend across stores in exchange for single-digit percentage cashback. We don’t quite have pay-to-surf options anymore, but plenty of stores, financial institutions, and others (even Microsoft!) are happy to provide you with enticing discounts on online shopping if you install their extension that can provide anonymized insights on your online behaviour. Linux distributions rarely have any opportunity to offer this, but that does not exempt them from the same expectation of privacy that users are getting from other operators.
Finally, a common refrain for this is “But what about $vendor?” Whataboutism is a common problem in many fields, and Free Software is not immune to this. I personally would want to consider Free Software projects as more ethical than other vendors, but since I already said that “Kind Software” has only a partial overlap, I shouldn’t be surprised if instead of doing the best thing for the user, some projects would rather take the most value they can get away with.
Since I started looking at User-Agent strings and browser fingerprinting in general, we have had a significant number of wins for user privacy, as well as a number of regressions. Mozilla successfully reduced the variance of their User-Agent by freezing the Gecko trail, while both Apple and Google attempted freezing the whole User-Agent, with mixed results and a lot of conspiracy theories being thrown around because of it. Personally, I have at least successfully argued against providing the Android ROM version in the User-Agent string of Chrome for Android, which I’m very proud of: given how this version string changed across different providers even for the same model, it was a significant amount of entropy injected into the string!
User-Agent is, quite honestly, a legacy string by now. Chrome and Edge have been pushing for the usage of Client Hints to provide more details about the client platform in use, and that is even more of a fingerprinting issue, even though it does require active participation from the browser, rather than acting as a passive source. The fact that Cover Your Tracks does not seem to attempt showing those hints made me a bit sad. But being a passive source of information is a double-edged sword: in particular, it means that you can (possibly) look back and tie together sessions based on this string alone, even if not with the highest of confidence.
I’m not coming here with oven-ready solutions (that would anyway be thrown into a microwave), but rather with food for thought, and with the idea that we should be more considerate towards our users. People who are at risk should not have to learn which combination of common traits does not stand out, and should not have to be told “Actually, just use a non-Free platform to hide in the crowd.” But these are not new topics, as I have written before about how little the community as a whole appears to care about the hard yet impactful problems.
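To illustrate the difference in mechanism, here is a minimal Python sketch of the Client Hints exchange. It assumes the commonly documented behaviour of Chromium-based browsers: a small set of low-entropy hints (`Sec-CH-UA`, `Sec-CH-UA-Mobile`, `Sec-CH-UA-Platform`) is sent by default, while more detailed ones are only sent after the server asks for them with an `Accept-CH` response header. The two helper functions and the sample header values are hypothetical.

```python
# Hints assumed to be sent by default by Chromium-based browsers (low entropy).
DEFAULT_HINTS = {"sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform"}


def accept_ch_header(wanted: list[str]) -> tuple[str, str]:
    """Build the response header a server would use to ask for extra hints."""
    return ("Accept-CH", ", ".join(wanted))


def split_signals(request_headers: dict[str, str]) -> dict[str, list[str]]:
    """Separate headers sent unprompted from hints that needed the server to ask."""
    names = {name.lower() for name in request_headers}
    hints = {n for n in names if n.startswith("sec-ch-")}
    return {
        "sent_by_default": sorted((names & {"user-agent"}) | (hints & DEFAULT_HINTS)),
        "sent_on_request": sorted(hints - DEFAULT_HINTS),
    }


# The server asks for more detail on its response...
print(accept_ch_header(["Sec-CH-UA-Platform-Version", "Sec-CH-UA-Arch"]))

# ...and a later request (made-up values) may then carry the extra hints.
later_request = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "Sec-CH-UA-Platform": '"Linux"',
    "Sec-CH-UA-Arch": '"x86"',
}
print(split_signals(later_request))
```

The User-Agent string ends up in every access log whether or not anybody asked for it, which is what makes retroactive correlation possible, while the detailed hints at least leave a trace of the server having requested them.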