Free Idea: a filtering HTTP proxy for securing web applications

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very rough proposal that comes with no design attached, but if you have time you’d like to spend learning something new and no idea what to work on, it may be a good fit for you.

Going back to a topic I wrote about previously, namely my attempt to set up a secure WordPress instance, I would like to throw out another idea that I won’t have time to implement myself any time soon.

When running complex web applications, such as WordPress, defense-in-depth is a good security practice. This means that in addition to locking down what the code itself can do to the state of the local machine, it also makes sense to limit what it can do to external state and the Internet at large. Indeed, even if an attacker cannot drop a shell on a remote server, there is value (negative for the world, positive for the attacker) in at least being able to use it for DDoS (e.g. through an amplification attack).

With that in mind, if your app does not require network access at all, or the network dependency can be sacrificed (like I did for Typo), just blocking the user from making outgoing connections with iptables would be enough. The --uid-owner option makes it very easy to figure out who’s trying to open new connections, and thus to stop a single user from transmitting unwanted traffic. Unfortunately, this does not always work, because sometimes the application really does need network access. In the case of WordPress, there is a definite need to contact the WordPress servers, both to install plugins and to check whether it should self-update.
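For reference, the iptables side of this is roughly the following; a minimal sketch, assuming the webapp runs as a dedicated wordpress user (a name of my choosing) and that loopback traffic, say to a local database or proxy, should stay allowed:

# Let the "wordpress" user keep talking over loopback (local proxy, database).
iptables -A OUTPUT -o lo -m owner --uid-owner wordpress -j ACCEPT

# Reject any other new outgoing connection initiated by that user.
iptables -A OUTPUT -m owner --uid-owner wordpress -m conntrack --ctstate NEW -j REJECT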

You could try to limit which hosts the user can access. But that’s not easy to implement correctly either. Sticking with WordPress as the example: if you wanted to limit access to the WordPress infrastructure, you would effectively have to allow it to access *.wordpress.org, and as far as I know this can’t really be done in iptables, since by the time the firewall sees them, those connections go to literal IP addresses. You could rely on FcRDNS to verify the connections, but that can be slow, and if an attacker manages to poison the server’s DNS cache, they’re effectively in control of this kind of ACL. I’m ignoring the option of just using “standard” reverse DNS resolution, because in that case you don’t even need to poison DNS: you can just decide what your IP will reverse-resolve to.

So what you need to do is filter at the connection-request level, which is what proxies are designed for. I’ll be assuming we want a non-terminating proxy (because terminating proxies are hard), but even then the proxy knows which (forward) hostname the client wants to connect to, so *.wordpress.org becomes a valid ACL to use. And this is something you can actually do relatively easily with Squid, for instance. Indeed, this is the whole point of tools such as ufdbGuard (which I used to maintain for Gentoo) and the ICP protocol. But Squid is primarily designed as a caching proxy: it’s not lightweight at all, and it can easily become a liability to have it in your server stack.
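To make this concrete, a minimal squid.conf sketch for this scenario might look like the following; the ACL names and the port are assumptions of mine:

# Only accept connections from the local machine.
http_port 127.0.0.1:3128
acl localapps src 127.0.0.1

# The forward hostname is known even for CONNECT requests, so a
# domain-based ACL works; the leading dot matches subdomains too.
acl wordpress dstdomain .wordpress.org

http_access allow localapps wordpress
http_access deny all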

Up to now, what I have used to reduce the attack surface of my webapps is to set them behind tinyproxy, which does not really allow for per-connection ACLs. This only provides isolation against random non-proxied connections, but it’s a starting point. And here is where I want to provide a free idea for anyone who has the time and would like to provide better security tools for server-side defense-in-depth.
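For completeness, the tinyproxy side of such a setup amounts to little more than binding to loopback and accepting only local clients; a rough sketch of the relevant tinyproxy.conf directives (the port number is arbitrary):

# Listen only on the loopback interface, on an arbitrary local port.
Port 8888
Listen 127.0.0.1

# Accept connections from the local machine only.
Allow 127.0.0.1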

A server-side proxy for this kind of security usage would have to provide ACLs, with both positive and negative lists. You may want to allow all access to *.wordpress.org, but at the same time block all non-TLS-encrypted traffic, to avoid the possibility of a downgrade (given that WordPress silently downgrades requests to api.wordpress.org, which I talked about before).
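Expressed in terms of what Squid can do today, the rule I have in mind would look roughly like this; a sketch only, with made-up ACL names:

acl wordpress dstdomain .wordpress.org
acl tls_port port 443
acl CONNECT method CONNECT

# Allow TLS tunnels to *.wordpress.org, refuse the same hosts over plain HTTP,
# so a silent downgrade to http:// gets stopped at the proxy.
http_access allow CONNECT wordpress tls_port
http_access deny wordpress
http_access deny all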

Even better, such a proxy should be able to distinguish ACLs based on which user (i.e. which webapp) is making the request. The obvious way would be to provide separate usernames to authenticate to the proxy (which, again, Squid can do), but Squid is designed for clients for which the validation of username and password actually matters. Indeed, for this target usage, I would ignore the password altogether and just take the username at face value, since the connection should always be local only. I would be even happier if, instead of pseudo-authenticating to the proxy, the proxy could figure out which (local) user the connection came from by inspecting the TCP socket, kind of like how querying the ident protocol used to work for IRC.
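This is less exotic than it sounds: for connections coming over loopback, the kernel already knows which UID owns the client socket. A sketch of the lookup, assuming Linux and an IPv4 loopback connection, is little more than a scan of /proc/net/tcp:

#!/bin/sh
# Given the source port of a connection just accepted from 127.0.0.1 (first
# argument), print the UID that owns the client side of that connection.
port_hex=$(printf '%04X' "$1")

# /proc/net/tcp stores 127.0.0.1 as 0100007F (hex, host byte order on x86);
# field 2 is the local address:port of each socket, field 8 its owning UID.
awk -v addr="0100007F:${port_hex}" '$2 == addr { print $8 }' /proc/net/tcp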

So to summarise, what I would like to have is an HTTP(S) proxy that focuses on securing server-side web applications. It does not have to support TLS transport (because it should only accept local connections), nor does it need to be a terminating proxy. It should support ACLs that allow/deny access to a subset of hosts, possibly per-user, without needing a user database of any sort; even better if it can tell by itself which user the connection came from. I’m more than happy if someone tells me this already exists, or, if not, if someone starts writing it… thank you!

That’s interesting…

Please turn away from this blog post if even discussing the existence of porn is a problem for you. You’ve been warned.

XKCD - Swimsuit Issue
Comic © Randall Munroe, CC-BY-NC

Last night I saw the giggling about the recent publication of YouPorn account and password data, and I started wondering what most of the fuss was about. Most of the sites are either copy-pasting the same statements or showing, well, some very strange ideas about porn altogether, I think.

Let’s be clear, I’m not going to advocate one way or another about it, but I’m not scandalised that porn exists and that people look for it. Heck, I can’t remember the last time I saw a TV series with characters in their late teens or twenties that never once discussed porn (okay, I’m wrong, there’s 7th Heaven, but I never liked that anyway). I’m neither saying this is how everybody is, nor whether this is right or wrong, but it should probably be taken for granted by now.

Considering this, the two claims about this publication of YouPorn accounts being detrimental to marital and work relationships seem… bogus, to me. Let’s start with the second of the two: work relationships. People expect that employers will just fire people because they registered on the YouPorn site. Why should they? If they had used a corporate email address, then it might be that the leak, more than the registration itself, is detrimental to the company’s public profile, but that seems to be a corner case. If the problem is that they were surfing for porn while they were supposed to be working… this leak provides no new information to any decently-run company.

Even when using HTTPS, corporate proxies know which hostname you’ve been looking for (after all, when you’re behind a proxy you’re not even letting DNS pass through, so no aliases can be resolved and no IP addresses are involved). So if corporate policy is “no porn at work”, the solution is not to hope for a leak of account information (let alone the fact that the site is perfectly usable without registering), but to set up a proxy system that either blocks navigation to those sites or warns the administrator about users attempting to connect to them.
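To give an idea, with a Squid proxy in place the “warn the administrator” part is little more than a log grep; a rough example, assuming the default native access.log format with authenticated usernames in the eighth field (the site name is a placeholder):

# List which users opened TLS tunnels towards a given site.
grep 'CONNECT .*example\.com' /var/log/squid/access.log | awk '{ print $8 }' | sort -u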

Furthermore, I would be surprised if many employers were so uptight that merely being registered on a porn site would make them fire an employee (of course it can always be used as a convenient excuse to fire a bad employee, or one you hate, but that’s beside the point here; I’m talking about it being the sole grounds). The reason this sounds strange to me is that my own customers talk to me about YouPorn. That’s how I learned of the site in the first place: from a customer of mine. And I don’t mean the man or boy who brings his computer here to be fixed after a herd of viruses made it unbootable; I mean corporate customers.

As for marital relationships, I admit I don’t have much experience (actually, I have no experience at all, given that I’m a twenty-six-year-old single virgin who has never kissed a girl; let’s not go there now), but I’d be very surprised if any spouse were concerned about a registration on a porn site… if the leak included the viewed videos, now that would be a different story altogether (and let’s remember that since the website is not HTTPS-protected, employers do know which videos are viewed; I know that because I manage a couple of the aforementioned corporate proxies and I’m asked to check for that kind of stuff from time to time).

After all this, though, there is another question on my mind: are we sure the leak is legitimate? It seems YouPorn themselves made a statement (safe for work, mostly) declaring that the breach involves a third-party service called YP Chat. The password list (which I have checked against my own customers, to warn them if I found their passwords) looks suspiciously neat, with similarly-named users over and over, and so many identical passwords shared between different users that it looks more like a textbook example than an actual password list (as pointed out in the comments on the Naked Security post). Also, I somehow doubt that YouPorn’s registered users number just in the thousands, even though the website is perfectly usable without an account, as spammers just love registering on websites.

With this in mind, my question would be: is it enough to post a pastebin full of usernames and passwords for a scandalous website to get someone you don’t like fired from their workplace? That would be tremendously stupid; at the same time, do websites such as Naked Security actually verify what they publish? I haven’t read any “we confirmed that the username/password combinations are valid” in the articles I found on this, and even that would have been a very shallow test. I’m honestly surprised at how much is being said about something that might not actually exist in the first place.

ufdbGuard and blacklists

If you run a filtering proxy with Squid, you probably know SquidGuard and ufdbGuard: they are two of the applications that implement content filtering based, mostly, on lists of known bad URLs. I currently have only one filtering proxy set up, at a customer of mine, whose main task is blocking “adult” sites in general; I set it up with ufdbGuard, which is developed by the same company as the URLfilter Database, and have used the demo version of it for a while.

Unfortunately my customer is not really keen on paying for a six-user license when his office only counts three people, and I really can’t blame him for that, even though the price is in itself a good one. So now that I no longer have access to the main blacklist, I took a suggestion from Aleister and tried Shalla instead.

Compared with SquidGuard, which as far as I can see is its ancestor, ufdbGuard boasts a faster matching engine; at the base of this is the fact that instead of relying on plain-text lists, it uses pre-generated binary database files in its own format. This, though, makes it a bit less trivial to run ufdbGuard with a list other than the one they provide themselves.

You can then use the ufdbGenTable command to generate the tables, passing it the blacklists of domains and URLs that will be transformed into something ufdbGuard can use. This is the main reason why ufdbGuard is so much faster: instead of having to read and split text files, it accesses optimized database files designed to make matching much faster. But not all lists are created equal, and ufdbGenTable also has an analysis task: it reports on a number of issues, such as the third-level www. prefix being present in blacklist entries (which is redundant at a minimum, and increases the size of the pattern to match) and overly long URLs (which could suggest the wrong match will be applied).
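As an example of the invocation involved (the table name and paths are placeholders of mine), this builds a single table out of plain-text domain and URL lists, with the -W flag addressing the www. prefix issue just mentioned:

ufdbGenTable -W -t adult -d /usr/local/share/blacklist/adult/domains \
    -u /usr/local/share/blacklist/adult/urls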

This kind of analysis is what undermined my confidence in the Shalla list. While it is sponsored by a company, and thus not a complete community effort, the list is supposedly maintained via third-party contributions, but either their review process is broken, or it is simply not as alive as it seemed at first glance. ufdbGenTable reports a number of warnings, mostly the www issue noted above (which can be solved with a simple -W option to the command), but also a lot of long URLs. This in turn shows that the list contains URLs to single documents within a website, and a number of URLs that include a session ID (a usually random hash that is never repeated, and thus could never be matched).

Looking at the submission list doesn’t help either: demonoid.me has been rejected for listing with the reason “Site never functional”, which sounds silly given that’s the site I got all the recent episodes of Real Time with Bill Maher from. The impression I get is that the list is handled amateurishly, and I thus cannot really rely on it in any way.

A more interesting list is the UT1 blacklist from the Université Toulouse 1, whose main target is adult websites (which is what I’m mostly interested in filtering in this case). This list is interesting in a number of ways: updates are available either as individually compressed lists or, even better in my case, as an rsync tree that does not require fetching the whole list each time. This simplifies my update script to something like this:

#!/bin/sh

rsync --delete --exclude 'domains.ufdb' --archive \
    rsync://ftp.univ-tlse1.fr/blacklist/dest/ /usr/local/share/ut1/

status=0

umask 0333

# Iterate over the per-category directories; a plain for loop (rather than
# piping find into a subshell) lets the failure flag survive the loop.
for dir in /usr/local/share/ut1/*/; do
    table=$(basename "${dir}")
    [ -f "${dir}domains" ] || continue

    myparams="-W -t ${table} -d ${dir}domains"
    [ -f "${dir}urls" ] && myparams="${myparams} -u ${dir}urls"

    ufdbGenTable ${myparams} || status=1
done

exit $status

UT1 also, contrary to Shalla, makes use of expression lists, which allow matching many URLs without needing to list all of them as domains, just like the original URLfilter list (judging from the configuration file). Unfortunately, using this list you cannot enforce HTTPS certificates, which most definitely was a cool feature of ufdbGuard when used with its original list. I wonder if it would be possible to generate an equivalent list from the ca-certificates package that we already ship in Gentoo.

I am of the opinion that it is possible to improve both the handling of blacklists, possibly through a more precise “crowdsourcing” method, and the code of ufdbGuard itself, which right now tends to be a bit rough around the edges. I’m just surprised that there are multiple lists, rather than a single, complete and optimized one.