LOLprivacy, or Misunderstanding Panopticlick for the Worst

So Sebastian posted recently about Panopticlick, but I’m afraid he has not grasped just how many subtleties are involved in tracking by User-Agent, nor the limitations of the tool as it stands.

First of all, let’s take a moment to realize what «Your browser fingerprint appears to be unique among the 5,207,918 tested so far.» (emphasis mine) means. If I try the exact same request in Incognito, the message is «Within our dataset of several million visitors, only one in 2,603,994 browsers have the same fingerprint as yours.» (emphasis mine). I’m not sure why the EFF does not expose the numbers in the second situation, hiding the five million under the word “several”. I also can’t tell how they distinguish further requests from the same browser from entirely new hits. So I’m not sure what the number represents.

Understanding what the number represents is a major problem, too: consider that in his post alone Sebastian tried at least three browsers, and I tried twice just to write this post — so one thing the number does not count is unique users. I would venture a guess that the number of users is well below a million, and that matters for multiple reasons. Panopticlick was born in 2010, and if fewer than a million real users have hit it in five years, it might not be that statistically relevant.

Indeed, according to the current reading, just the Accept headers would be enough to narrow me down to one in four sessions — that would be encoding and language. I doubt it is that clear-cut, as I’m most definitely not one of only four people in the UKIE area speaking Italian. A lot of this has to do with the self-selection of the privacy-conscious people who use this tool from the EFF.

But what worries me is the reaction from Sebastian and, even more so, the first comment on his post. Suggesting that you can hide in the crowd by looking for a “more popular” User-Agent, or by using a random bunch of extensions and disabling JavaScript or blocking certain domains, is naïve to say the least, and most likely misses the point that Panopticlick tries to make.

The whole idea of browser fingerprinting is the ability to identify a user across a set of sessions — it responds to a similar threat model as Tor. While I already pointed out that I disagree on the threat model, I would like to point out again that the kind of “surveillance” this counters is, ideally, the one executed by an external entity able to monitor your communications across different source connections — if you don’t use Tor and you only use a desktop PC on the same connection, then it doesn’t really matter: you can just check the IP address! And if you use different devices, then it also does not really matter, because you’re now using different profiles anyway; the power is in the correlation.

In particular, when trying to tweak the User-Agent or other headers to make them “more common”, you’re dealing with something that is more likely to backfire than not; as my ModSecurity Ruleset shows very well, it’s not so difficult to tell a real Chrome request apart from Firefox masquerading as Chrome, or IE masquerading as Safari: the Accept-Encoding values differ, as do other details in the style of the request headers, making it quite straightforward to check for them. And while you could mix up the Accept headers enough to “look the part”, it’s more than likely that you’ll be served bad data (e.g. sdch to IE, or webp to Firefox), which would make your browsing useless.
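
To give a rough idea of the shape such a check can take (this is a simplified sketch rather than the actual rule from my ruleset, with an invented rule ID), a chained ModSecurity rule can compare the claimed User-Agent with the Accept-Encoding values a real Chrome of that era would send:

# Sketch only: a request claiming to be Chrome that does not offer sdch in
# Accept-Encoding is unlikely to come from a real Chrome build. A complete
# rule would also exempt proxied requests that declare a Via header.
SecRule REQUEST_HEADERS:User-Agent "@rx Chrome/\d+" \
    "id:9100001,phase:1,deny,status:403,log,\
    msg:'User-Agent claims Chrome but Accept-Encoding does not match',chain"
    SecRule REQUEST_HEADERS:Accept-Encoding "!@contains sdch"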

More importantly, the then-unique combination of, say, a Chrome User-Agent on an obviously IE-generated request would make it very easy to follow a session across different websites sharing a similar fingerprint. The answer I got from Sebastian is not good either: even if you tried to use a “more common” version string, you could still, very easily, create unwanted unique fingerprints; take Firefox 37: it started supporting the alt-svc extension to switch to HTTP/2 when available. If you were to report your browser as Firefox 28 and it then followed alt-svc, the version string would clearly be fake, and again an easy one to follow. Similar version-dependent request fingerprinting, paired with a modified User-Agent string, would make you light up like a Christmas tree during Earth Day.

There are more problems, though; the suggestion of installing extensions such as AdBlock also adds to the fingerprint rather than shielding you from it; as long as JavaScript is allowed to run, a page can detect AdBlock’s presence, and with a bit of work it can identify which of the various blocking lists is in use, too. You could use NoScript to avoid running JavaScript at all, but given that this is by far not something most users will do, it will also add to the entropy of your browser’s fingerprint rather than remove from it, even though it prevents client-side fingerprinting from accessing things like the list of available plugins (which in my case is not that common, either!)

But even ignoring the fact that Panopticlick does not try to identify the set of installed extensions (finding Chrome’s Readability is trivial, as it injects content into the DOM, and so do plenty of others), there is one more aspect that it almost entirely ignores: server-side fingerprinting. Besides not trying to correlate the purported User-Agent against the request fingerprint, it does not seem to use a custom server at all, so it does not leverage TLS handshake fingerprints! As can be seen through Qualys’ analysis, there are some almost-unique handshake sequences on a given server depending on the client used; while this does not add much more data when matched against a vanilla User-Agent, a faked User-Agent paired with a somewhat rarer TLS handshake would be just as easy to track.

Finally, there is the problem of self-selection: Sebastian blogged about this while using Firefox 37.0.1, which had just been released, and tested with that; I assume he also had the latest Chrome. While Mozilla has increased Firefox’s release cadence, Chrome definitely has a very hectic one, with many people updating all the time. Most people wouldn’t go to Panopticlick every time they update their browser, so two entries that are exactly the same apart from the User-Agent version would be reported as unique… even though it’s most likely that the person who tried two months ago has updated since, and now has the same fingerprint as the person who tried recently with the same browser and settings.

Now this is a double-edged sword: if you rely on the User-Agent to track someone across connections, an ephemeral User-Agent that changes every other day due to updates is going to disrupt your plans quickly; on the other hand, lagging behind or jumping ahead of the update train for a browser makes it more likely for you to have a quite unique version number, even more so if you’re tracking beta or developer channels.

Interestingly, though, Mozilla has thought about this before, and their Gecko user agent string reference shows which restricted fields are used, and references the bugs that disallowed extensions and various other software from injecting into the User-Agent string — funnily enough, I know of quite a few badware cases in which a unique identifier was injected into the User-Agent so that fake ads and other similar websites could recognize a “referral”.

Indeed, especially on mobile, I think that User-Agents are a bit too liberal with the information they push; not only do they include the full build number of the mobile browser, such as Chrome, but they usually include the model of the device and the build number of the operating system: do you want to figure out if a new build of Android is available for some random device out there? Make sure you have access to the HTTP logs of big enough websites and look for new build IDs. On this particular sub-topic, I think Chrome and Safari could help a lot by reducing the amount of detail about the engine version as well as the underlying operating system.

So, for my parting words, I would like to point out that Panopticlick is a nice proof of concept that shows how powerful browser fingerprinting is, without having to rely on tracking cookies. I think lots of people both overestimate the threat and underestimate the power of fingerprinting. The former, because Panopticlick does not have enough current data to make it feasible to evaluate the actual uniqueness of a session across the world; the latter, because you get the wrong impression that if Panopticlick can’t put you down as unique, you’re safe — you’re not; there are many more techniques that Panopticlick does not even try!

My personal advice is to stop worrying about the NSA and instead start looking after yourself: using click-to-play for Flash and Java is good prophylaxis for security, not just privacy, and NoScript can be useful too, in some cases, but don’t just kill everything on sight. Even using the Data Saver extension for non-HTTPS websites can help (unfortunately I know of more than a few sites blocking it, and then there is the problem of captive portals bringing it back to clear-text HTTP, too).

Browser fingerprinting

I’ve posted some notes about browser fingerprinting back in March, and noted how easy it is to identify a given user across requests just with the few passive scans that are possible without even needing Flash enabled. Indeed, EFF’s Panopticlick considers my browser unique even with Flash disabled.

But even though Panopticlick only counts the people who actually ran it (which means it covers just a fraction of all the possible users out there), it is also not exercising the full force of fingerprinting. In particular, it does not try to detect the installed Chrome extensions, which is actually trivial to do in JavaScript for some of them. In my case, for instance, I can easily identify the presence of the Readability extension, because it injects an “indicator” as an iframe with a fixed ID. Similarly, it’s relatively easy to identify adblock users, as you have probably noticed on a bunch of different sites already that beg you to disable the adblocker so that they can make some money with the ads.

Given how paranoid some of my readers are, I’m looking forward to somebody adding Chrome and Firefox extension identification to Panopticlick; it will definitely be interesting going forward.

Serving WebP images

A few months ago I tried experimenting with WebP so as to reduce the traffic on my blog without losing quality. In the end the results were negative, and I decided instead to drop the backgrounds on my blog and replace them with some CSS to provide gradients, which was not a lossless change, but was definitely easier to load.

After the VideoLan Dev Days 2013 (of which I still have to write a report), I went to speak with Pascal again, and he told me that the new version of Chrome finally fixed the HTTP Accept header, so that it finally prefers WebP over other image formats when available. I confirmed this, as Chrome 30 reports Accept: image/webp,*/*;q=0.8. The q= parameter is not actually needed for Apache, but it’s a good idea to have it there anyway.

Thanks to this change, and mod_negotiation’s MultiViews it’s possible to auto-select the format, between JPEG, PNG and WebP, for Chrome users. Indeed if you’re visiting my blog with Chrome 30 (not sure about 29), you’re going to be served mostly WebP images (the CC license logos are still provided as PNG because the lossless compression was worse, and the lossy one was not saving enough bytes to be worth it).

I started working on enabling this while waiting at Terminal 1 at CDG airport (this is the last time I’m flying Aer Lingus to Paris), and I was able to finish the thing before my flight boarded. What I did realize just before that, though, is that Apache would still prefer serving WebP; I’d venture a guess that it’s because the file is smaller. This is okay for Opera, Firefox and (obviously) Chrome, but not for Safari or IE.

Of course if the other browsers were to actually report the formats they supported it would be all fine, but that’s not the case. In particular, Firefox actually prefers image/png to anything else (Apache will add a low q= value to any glob request, just to be on the safe side, which is why I said earlier that q= is not needed for it), so that even if I don’t make any more changes, Firefox will still prefer PNG to WebP (but won’t do anything for JPEG, so if the web server prefers WebP to JPEG, it’s going to be fine).

So how to provide WebP without breaking other browsers? One solution would be to use PageSpeed to compress the images on the fly to WebP when requested by a compatible browser, but that is a bit of overkill, and is hard to package right, and, most importantly, requires browser detection logic on the server, which is not very safe.

At the end I decided to go with a safer option: provide WebP only to Chrome users and not to users of other browsers, at least until they decide to fix their Accept headers. But how to do that? Well, I needed to check Apache’s source code, because the documentation does not seem to explain it clearly and explicitly, but to decide which format to serve, Apache multiplies the q= parameter coming from the browser, or its implicit value (which gives image/* and */* a default of less than 0.1), by the qs= parameter passed when declaring the type:

AddType image/jpeg .jpeg .jpg .jpe
AddType image/png .png
AddType image/webp;qs=0.9 .webp

By giving the value 0.9 to WebP and leaving the default 1 to the other formats, I’m basically telling Apache that, all things being equal (as when the browser sends Accept: */*, Internet Explorer style), I prefer to serve PNG or JPEG to the users, rather than WebP. It will also prefer to serve JPEG to Firefox (which uses image/*). Chrome 30, on the other hand, explicitly prefers WebP over any other image format, so Apache will calculate the preference as 1.0*0.9 for WebP and 0.8*1.0 for PNG and JPEG. I have not checked what Opera does, but it looks like all the browsers on my phone would support WebP yet don’t prefer it, so they won’t be served it either.
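
Put together, the whole setup is tiny; this is a sketch of it in one place (the directory path is made up), restating the AddType lines above and adding the MultiViews option that lets an extensionless request be matched against the available variants:

# Sketch: with MultiViews enabled, a request for /images/foo is negotiated
# among foo.jpg, foo.png and foo.webp; the qs= weight keeps WebP slightly
# below the classic formats for browsers that just send Accept: */*.
<Directory /var/www/blog/images>
    Options +MultiViews
    AddType image/jpeg .jpeg .jpg .jpe
    AddType image/png .png
    AddType image/webp;qs=0.9 .webp
</Directory>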

So right now WebP images on my blog are exclusive to Chrome users; the win is relatively good, halving the size of the Autotools Mythbuster cover on the right and shaving a few bytes off the top image for the links. There are definitely more interesting ways to save bandwidth by re-compressing the images I used around the blog (many of which, re-compressed, end up taking half the space), but that will have to wait until I fix this bloody Typo, as right now editing posts is failing.

Another thing I will have to work on is a tool to handle the re-compression. Right now I’m doing it by hand, and it’s both a waste of time and prone to errors. I’ll have to come up with a good way to quickly describe images so that a tool can re-compress them and evaluate whether to keep them in WebP or not, and at the same time I need to find a way to store the originals at the highest quality. But that’s a topic for a different time.

The WebP experiment

You might have noticed over the last few days that my blog underwent some surgery, and in particular that even now, on some browsers, the home page does not really look all that good. In particular, I’ve removed all but one of the background images and replaced them with CSS3 linear gradients. Users browsing the site with the latest version of Chrome, or with Firefox, will have no problem and will see a “shinier” and faster website; others will see something “flatter”. I’m debating whether I want to provide them with a better-looking fallback or not; for now, not.

But this was also a plan B — the original plan was to leverage HTTP content negotiation to provide WebP variants of the images on the website. This was a win-win situation because, ludicrous as it seemed when WebP was announced, it turns out that with its dual mode, lossy and lossless, it can in one case or the other outperform both PNG and JPEG without a substantial loss of quality. In particular, lossless works like a charm with “art” images, such as the CC logos or my diagrams, while lossy works great for things like the Autotools Mythbuster cover you see on the sidebar, or the (previous) gradient images you’d see on backgrounds.

So my obvious instinct was to set up content negotiation — I’ve used it before for multiple-language websites, and I expected it to work for multiple types just as well, as it’s designed to… but after setting it all up, it turns out that most modern web browsers still do not support WebP *at all*… and they don’t handle content negotiation as intended. For this to work we would need either of two options.

The first, best option would be for browsers to only Accept the image formats they support, or at least prefer them — this is what Opera for Android does: Accept: text/html, application/xml;q=0.9, application/xhtml+xml, multipart/mixed, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1 but that seems to be the only browser doing it properly. In particular, in this listing you’ll see that it supports PNG, WebP, JPEG, GIF and bitmap — and then it accepts whatever else with a lower preference. If WebP were not in the list, even if the server preferred it, it would not be sent to the client. Unfortunately, this is not going to work, as most browsers send Accept: */* without explicitly providing the list of supported image formats. This includes Safari, Chrome, and MSIE.

Point of interest: Firefox does explicitly list one image format before the others: PNG.

The other alternative is for the server to default to the “classic” image formats (PNG, JPEG, GIF) and expect browsers that support WebP to prioritize it over the other image formats. Again, this is not the case: as shown above, Opera lists it but does not prioritize it, and Firefox prioritizes PNG over anything else, making no special exception for WebP.
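
For the record, the setup I had in mind can be expressed either through MultiViews or through an explicit mod_negotiation type map, mapped to .var files with AddHandler type-map .var; a sketch of such a map (the file and variant names are made up) looks like this:

# logo.var -- illustrative type map for the "logo" resource: a browser that
# explicitly accepts image/webp, like the Opera build above, gets the WebP
# variant, while an Accept: */* browser falls back to PNG.
URI: logo.webp
Content-Type: image/webp; qs=0.9

URI: logo.png
Content-Type: image/png; qs=1.0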

Issues are open at Chrome and Mozilla to improve the support, but the fixes haven’t reached mainstream yet. Google’s own suggested solution is to use mod_pagespeed instead — but this module – which I already named in passing in my post about unfriendly projects – does something else: it changes the provided content on the fly, based on the reported User-Agent.

Given that I’ve spent some time on user agents, I would say I have the experience to state that this is a huge Pandora’s box. If I already have trouble with some poorly developed browsers reporting themselves as Chrome to fake their way into sites that check the user agent field in JavaScript, you can guess how many of those are actually going to support the features that PageSpeed thinks they support.

I’m going to come back to PageSpeed in another post; for now I’ll just say that WebP has the numbers to become the next-generation format out there, but unless browser developers, as well as web app developers, start to get their act together, we’re going to have hacks over hacks over hacks for years to come… Currently, my blog is using a CSS3 feature with the standardized syntax — not all browsers understand it, and they’ll see a flat website without gradients; I don’t care and I won’t start adding workarounds for that just because (although I might use SCSS, which would fix it for Safari)… new browsers will fix the problem, so just upgrade, or use a sane browser.

Browsers on the Kindle Fire

A few days ago I talked about Puffin Browser, with the intent of discussing in more detail the situation with browsers on the Kindle Fire tablet I’m currently using.

You might remember that at the end of last year I decided to replace Amazon’s firmware with a CyanogenMod ROM so as to get something useful out of it. Besides the lack of access to Google Play, one of the problems I had with Amazon’s original firmware was that the browser it comes with is flaky to the point of uselessness.

While Amazon’s AppStore does include many of the apps I needed or wanted – including SwiftKey Tablet, which is my favourite keyboard for Android – they made it impossible to install them on their own firmware. I’ve been tempted to install their AppStore on the flashed Kindle Fire and see if they would allow me to install the apps then; it would be quite a laugh.

Unfortunately, while the CM10 firmware actually allows me to make very good use of the device, much more than I could ever have achieved with the original firmware, the browsing experience still sucks big time. I currently have a number of browsers installed: Android’s stock browser – with its non-compliant requests – Google Chrome, Firefox, Opera and the aforementioned Puffin. There is no real winner in the lot.

The Android browser has a terrible network implementation and takes way too much time requesting and rendering pages. Google Chrome is terrible on the whole, probably because the Fire is too underpowered to run it properly, which makes it totally useless as an app. I only keep it around for testing purposes, until I get a better Android tablet.

Firefox has the best navigation support, but every time I click on a field and SwiftKey has to be brought up, it takes a full minute. Whether this is a bug in SwiftKey or in Firefox, I have no idea. If someone has an idea of whom to complain to, I’d love to report it and see it fixed.

The best option you get, besides Firefox, is Opera. While slightly slower than Firefox at rendering, it does not suffer from the SwiftKey bug. I’m honestly not sure at this point whether the version of Opera I’m using right now renders with their own Presto engine or with WebKit, which they announced they are moving to — if it’s the latter, it’s going to be a loss for me, I guess, since the two certainly WebKit-based browsers are not behaving nicely for me here.

Now, from what I said about Puffin, you’d expect it to behave properly enough. Unfortunately that is not the case. I don’t know if it’s a problem with my local bandwidth being too limited, but in general the responsiveness is worse than Opera, although not as bad as Chrome. The end result is that even the server-side rendering does not make it usable.

More reviews of software running on the Fire will follow, I suppose, unless I decide to get a newer tablet in the next weeks.

Mobile App Review: Puffin Browser

While I was at my previous job, I was given the task of finding a way to use a Flash-based application on an iPad (Android was not on the mind of anybody but me). Among the applications I tried for that was Puffin Browser, which is available for both iOS and Android.

I haven’t written much about this before, because it was too related to work to talk about — we were trying to get in touch with them, as we had a few issues that needed to be addressed, but since that was about six months ago now, I guess that fell through and won’t be happening anyway. Nothing that I’ll be discussing here is related to that job anyway.

So what is Puffin, and how does it relate to running a Flash application? Well, mostly it’s a browser that follows the same idea that I remember being used at least by the first Opera browser on the iPad, if I’m not mistaken: a server of theirs downloads and renders the page, and the result is displayed on the device’s screen.

Unlike most other browsers I’ve seen, though, it also renders Flash in (near) real time, and proxies the taps as clicks. To make it nicer, it also provides a virtual mouse and keyboard, which allow you to perform operations like right clicks and drags. The result you get is not bad, but there are complications.

Mostly, the complications are for server admins — not even web developers, really just server admins. The problem is that CloudMosa, the company that develops and sells this application, while using a very limited pool of IP addresses to proxy the requests, does not provide forward-confirmed reverse DNS (FcRDNS) to ensure that what arrives declaring itself as Puffin is actually Puffin and can be trusted.

You can imagine that this causes no small problem with my ruleset, especially in regard to the open proxy handling. Unfortunately, and this was one of the things that caused the most problems at my previous position as well, they don’t really have a support system. They handle most of the feedback and discussion through their Facebook page. Which is to say, something very screwed up.

The complexity of request validation

You might have read before that I use a complicated setup with ModSecurity to prevent spam on this blog — I have written about it before extensively, and I also published my rules so that they can be used by other sites (Videolan’s forums are using them as well).

Well, maintaining this ruleset is not that easy, if at all; the problem comes when new browsers are introduced into the mix that make validation difficult. This is what happened a few months ago when Google first published Chrome for ICS — which I still don’t have access to; I think I’ll get an HTC One X as soon as I get to California. Well, they did it again with the new Chrome for iOS.

There are three different identifications Chrome can come in as: Chrome, CrMo (for Android) and CriOS (for iOS devices). This simply meant that any special case put in place for Chrome on Android didn’t automatically extend to the new Chrome on iOS — which is probably intended, given that Chrome on iOS has to use the standard WebKit engine of Safari rather than come up with its own — the only reason to use it is to have bookmarks synchronised with your computer.

Now, though, is when the problems start cropping up: the new Chrome on iOS has the same problem as the one on ICS: it doesn’t send an Accept header, which almost every other browser sends, including the main desktop Chrome builds. So it was a matter of adding CriOS to the list of special cases, together with CrMo.

But there is one more issue: there is a feature in the Chrome for iOS interface that allows you to switch to the so-called “desktop interface” — as long as the website decides to serve different interfaces depending on the User-Agent value. What you would expect at that point is for the application to report Chrome as its user agent, but that’s not the case. What it reports instead is Safari. The problem is that it still implements some particularities that are generally limited to Chrome, including SDCH, which is something I used for validation before.

So what I ended up doing was removing the validation of browsers advertising sdch as an encoding — although I kept the validation that if a request reports itself as Chrome, it has to have sdch (unless of course it’s passing through a proxy). This still makes it possible to weed out most of the unsophisticated crawlers/tools that try to pass as a browser.

The importance of HTTP request fingerprinting

I started looking at ModSecurity when I wanted to implement a User-Agent based antispam method, which has proven time and time again to work quite well, to the point that I started publishing the ruleset, which not only works as an antispam method but also keeps tons of bad crawlers from finding my email addresses and so on.

When I first proposed this kind of filtering I received quite a few complaints that the HTTP protocol didn’t define the User-Agent for such use, but thanks first to EFF’s Panopticlick – demonstrating clearly that the “anonymised” requests are not as anonymous as their perpetrators would expect them to be – and more recently to SpiderLabs’s work, I am now fully certain that I took the right road.

I’ve spent a bit more work on the rules this week, to make them more resilient against faked requests such as those coming from scriptkiddies’ tools like the HOIC tool described in the SpiderLabs blog post linked above. One of the most interesting detections I came up with is for real Chrome requests: while it seems to me that Google itself does not leverage it, Chrome as of version 18 still implements their own proposed Shared Dictionary Compression for HTTP, even though I don’t think it’ll ever be used in the real world. Since it is the only browser actually requesting such an encoding, I can safely assume a connection between the two — the only exception being Epiphany, which in its most recent versions declares itself to be Chrome… which means you then have a browser claiming to be another (Chrome), which in turn claims to be a third (Safari), which uses an engine (KHTML) claiming to be the same as another (Gecko), all the while declaring it’s all compatible with Mozilla/5.0.

One issue I found while doing this work had to do with Android. For both versions 2 and 3 (is somebody really hoping to use Android 1?), the (default, AOSP) browser sends a full-fledged HTTP request, which among other things includes an Accept header. This is what every browser I ever tried does, to the point that ModSecurity’s own Core Rule Set assigns negative points to requests coming without one; in my ruleset it’s tightened further by checking whether the request purports to be from a known browser, and if so rejecting it if it doesn’t include that header; this worked up to now — note that requests coming through a proxy, making that explicit through a Via header, are not validated against these checks, simply because many proxies are known to muck with the headers.

Anyway, as I was saying, this breaks down badly with Android 4 (up to 4.0.3, and CyanogenMod as well); it might have started as a way to minimise bandwidth usage, but for whatever reason, in this version the AOSP browser does not send an Accept header at all — actually it seems to have dropped most of the headers that it was sending before and that are not strictly necessary for the server to process the request. I could have sworn that Accept was mandatory for the HTTP protocol, but it seems that either I was totally mistaken, or it was only noted in some recommendation that never made it into the standard. The ruleset now exonerates Android 4 from that particular test, but I’m not really too happy about it.
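
To give an idea of what that exoneration looks like in practice (a simplified sketch with an invented rule ID, not the actual rule from the ruleset), it boils down to one more condition in the chain:

# Sketch only: reject requests that look like they come from a browser but
# carry no Accept header, unless they declare a proxy through Via or claim
# to be the Android 4 stock browser.
SecRule &REQUEST_HEADERS:Accept "@eq 0" \
    "id:9100010,phase:1,deny,status:403,log,\
    msg:'Browser-like request without an Accept header',chain"
    SecRule REQUEST_HEADERS:User-Agent "@rx ^Mozilla/" "chain"
    SecRule &REQUEST_HEADERS:Via "@eq 0" "chain"
    SecRule REQUEST_HEADERS:User-Agent "!@rx Android 4\."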

But that’s definitely not the only thing that is out of place with Android. Indeed, if you take an HTC Android device, the browser you open is not the AOSP one, but HTC’s own implementation. This version … does not declare itself as an Android device running a browser compatible with Mobile Safari. Instead, what it reports itself as is a full desktop Safari, and not in the way that Chrome does it, but by pretending it’s Mac OS X 10.6.3 running on an Intel Mac. Honestly, that’s just crazy.

There are a few more things that I hope to be able to handle in my ruleset to make it even tighter, without adding substantial false positives. This means not only fewer spam comments, but also fewer crawlers finding our email addresses, and fewer risks associated with Denial of Service attacks, distributed or not.

If you would like to help with the ruleset, you can find it on Flattr, where it’s depressingly stuck at only two clicks. If you would like to use the ruleset, you can find it on GitHub and you can use it for free, obviously.

Why I check your user agents

I’m one of the few Free Software activists who actually endorses the use of the User-Agent header, I’m afraid. The reason is that, while in general that header is used to implement various types of policies, it is often used as part of lock-in schemes (sometimes paper-thin lock-ins, by the way), and we all agree that lock-ins are never nice. It is a different discussion whether those lock-ins are something to simply attack, or something to comprehend and accept — I sincerely think that Apple has every right to limit access to their trailers to QuickTime, or at least to try to, as they are providing the service and it’s a platform for them to show off their software; on the other hand, BBC and RAI using it to lock in their public-service TV is something nasty!

So basically we have two reasons to use the User-Agent: policies and statistics. In the former category I also count the implementation of workarounds of various kinds. Statistics are mostly useful to decide what to focus on; policies can be used for good or evil: lock-ins are generally evil, but you can use policies to improve the quality of the service for users.

One of the most commonly used workarounds applied through user agent declarations relates to MSIE’s missing features; for instance, there is one to properly handle serving XHTML files through the application/xhtml+xml MIME type, which it doesn’t support:

RewriteCond %{REQUEST_URI} ^/[a-z_/]*$
RewriteCond %{HTTP_USER_AGENT} MSIE [OR]
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteRule ^/[a-z_/]*$ - [T=text/html]

Yes, this has one further check compared to most copies of the same snippet you find on the Internet; the reason is that I have experimentally noticed that Facebook does not handle XHTML properly: if you attach a link to a webpage that has images and is served as XHTML, it won’t fetch the title, nor will it let you choose an image to use for the link. This was true at least up to last December, and I assume the same is true now, which is why I have that extra line.

In a different situation, feng uses the User-Agent field to identify buggy software and implement specific workarounds (such as ignoring the RTSP/1.0 standard and seeking on subsequent PLAY requests without a PAUSE).

Stepping away from workarounds, policies that can be implemented this way include warning about insecure, unsupported browsers or trojan-infected systems, providing them with an informational message telling the user what to do to get something better or cleaner (I do that on a few websites to tell users that they are running something very broken — such as Internet Explorer 6). This is policy, and it’s generally good policy in my opinion. *On a different note, if somebody can suggest a way to use cookies to add a static way to bypass the check, I’d be happy.*
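
As a starting point, here is a minimal mod_rewrite sketch of the cookie idea (the cookie name, the warning page and the IE check are all made up for this example):

# Send old Internet Explorer versions to a warning page, unless the visitor
# already carries the bypass cookie; the warning page itself is excluded to
# avoid a redirect loop.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "MSIE [1-6]\."
RewriteCond %{HTTP_COOKIE} !browserwarning=skip
RewriteCond %{REQUEST_URI} !^/unsupported-browser\.html$
RewriteRule ^ /unsupported-browser.html [R=302,L]

The warning page would then only need a link or a small script that sets the browserwarning cookie, to let the visitor through anyway.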

There are many more things you can do with agent-specific policies, including providing lower-quality images to smartphones without implementing mobile-specific website vhosts, but I won’t go into further detail right now.

As far as statistics are concerned, they usually provide a way for developers and designers to focus on what’s really being used by the targets of their software. Again, some activists dislike this because it shows that it’s not worth considering non-Firefox, non-IE browsers for most websites, and sometimes not even Firefox, but extreme cases aside, statistics are, in the real working world, very important.

Some people feel smarter than the average programmer and want to throw the statistics off by claiming to use “Commodore 64” or “MS-DOS” as their operating system. They pretend to defend their privacy, to camouflage themselves in the bad, bad Internet. What they are actually doing is trying to hide on a plane by wearing a balaclava, which you might guess is pretty conspicuous. In fact, if you try EFF’s Panopticlick you can see that a unique, “novelty” User-Agent actually makes you stand out among Internet users. Which means that if you’re trying to hide in a crowd with a balaclava, you’re not smarter than anybody; you’re actually dumber than average.

Oh, and by the way, there is no way that faking being Googlebot will work out well for you; on my webserver, for instance, you’ll get 403 responses for all your requests… unless your reverse resolution properly forward-confirms that you’re coming from the Googlebot server farm…
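
The reverse half of that check can be sketched in ModSecurity like this (rule ID invented; it relies on HostnameLookups On so that REMOTE_HOST is populated, and the forward confirmation of the resolved name is left out of the sketch):

# Sketch only: deny requests claiming to be Googlebot whose reverse DNS does
# not end in googlebot.com or google.com.
SecRule REQUEST_HEADERS:User-Agent "@contains Googlebot" \
    "id:9100020,phase:1,deny,status:403,log,\
    msg:'Googlebot claim without a matching reverse DNS',chain"
    SecRule REMOTE_HOST "!@rx \.(googlebot|google)\.com$"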

Technology use and abuse

When I started working on my antispam filtering based on the user agent strings provided by browsers, I got quite a bit of feedback from people who complained that user agent strings weren’t meant to be used for that, that they should be used just for statistical purposes, and other stuff like that. Indeed, reading around free software planets, you find lots of people maintaining this position, that no code logic should be conditional on the user agent string; this usually involves people working on Debian-based systems, where Firefox is banned and Iceweasel is the way.

Now, I understand that what I’m doing is borderline valid for the protocol, and that discriminating users based on their user agent string is not ethically perfect; but let me say that the thing works pretty nicely: I look from time to time at the comments that get denied (mod_security can keep them in the log) and I haven’t found a single false positive; there are a few false negatives (that is, spam that passes through the mod_security filter and reaches the blog), but luckily the antispam features in Typo itself are good enough at that point. Lately this has happened because a few spambots started declaring themselves as some almost credible IE 7 on Windows XP or Vista; while IE8 has been released, I’d rather give it a few more months before starting to reject those, too.

These results make me wonder how much of what I’m doing is abuse and how much is use; there are some questionable reasons behind logic switching between Firefox and Iceweasel, but that does not involve me. At the same time, one would expect that stuff like this is bound to happen; both Apple and Google seem to have accepted that, and you can see that Safari still declares itself KHTML, and Google Chrome declares itself as Safari, too. Sure, most of the code that tries to identify one of the three of them should just match on WebKit (well, that is, if KDE were to finally decide to go with the one engine that is getting support out there), but at the same time they try to be pragmatic and accept that there is code logic based on user agents.

Back to my usage: since publishing the rules on this blog is starting to get messy, because of mod_security itself (funny!), I’m probably going to post them in a git repository or something in the next few days; I’ll also be adding the public-service rules that I’ve been using for a while now, at least for my friend’s site (and which actually found a couple of friends of his who had dialers on their systems and never noticed).

So maybe I’m using it for something it wasn’t designed for; on the other hand, it works, and it really does not differ much from using statistical analysis on the headers of email messages; and you know that your mail server, or client, or proxy, or whatever, is doing something like that with SpamAssassin!