ufdbGuard and blacklists

If you run a filtering proxy with Squid, you probably know SquidGuard and ufdbGuard: two of the applications that implement content filtering based, mostly, on lists of known bad URLs. I currently have just one filtering proxy set up, at a customer of mine, whose main task is blocking “adult” sites in general; I set it up with ufdbGuard, which is developed by the same company as the URLFilter Database, and used the demo version of that list for a while.

Unfortunately my customer is not really keen on paying for a six-user license when his office only has three people, and I really can’t blame him for that, even though the price itself is a fair one. So now that I no longer have access to the main blacklist, I took a suggestion from Aleister and tried Shalla instead.

Compared with SquidGuard, which as far as I can tell is its ancestor, ufdbGuard boasts a faster matching engine; at the base of this is the fact that, instead of relying on plain-text lists, it uses pre-generated binary database files in its own format. This, though, makes it a bit less trivial to run ufdbGuard with a list other than the one its developers provide.

You can then use the ufdbGenTable command to generate the tables, passing it the blacklists of domains and URLs to be transformed into the format ufdbGuard can use. This is the main reason why ufdbGuard is so much faster: instead of having to read and split text files, it accesses optimized database files designed to make matching much quicker. But not all lists are created equal, and ufdbGenTable also performs an analysis pass: it reports on a number of issues, such as third-level www. prefixes present in the blacklist (which are at a minimum redundant, and increase the size of the pattern to match) and overly long URLs (which could suggest that the wrong kind of match is being applied).
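To give an idea of the interface, generating a single table boils down to something like this (the category name and the paths are made up for the example; -W is the option, mentioned further down, that takes care of the redundant www. prefixes):

ufdbGenTable -W -t adult -d blacklists/adult/domains -u blacklists/adult/urls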

This kind of analysis is what undermined my confidence in the Shalla list. While it is sponsored by a company, and thus not a complete community effort, the list is supposedly maintained via third-party contributions; but either their review process is broken, or it is simply not as alive as it seemed at first glance. ufdbGenTable reports a number of warnings, mostly the www. issue noted above (which can be solved with a simple -W option to the command), but also a lot of long URLs. This in turn shows that the list contains URLs to single documents within a website, and a number of URLs that include a session ID (a usually random hash that is never repeated, and thus can never be matched).

Looking at the submission list doesn’t help either: demonoid.me has been rejected for listing with the reason “Site never functional”, which sounds silly given that’s the site where I got all the recent episodes of Real Time with Bill Maher. The impression I got is that the list is handled amateurishly, and I thus can’t really rely on it in any way.

A more interesting list is the UT1 blacklist from the Université Toulouse 1, whose main target is adult websites (which is what I’m mostly interested in filtering in this case). This list is interesting in a number of ways: updates are available either as individually compressed lists or, even better in my case, as an rsync tree, which saves me from fetching the whole list each time. This simplifies my update script to something like this:

#!/bin/sh

# mirror the UT1 tree; the generated domains.ufdb tables are excluded so --delete does not wipe them
rsync --delete --exclude 'domains.ufdb' --archive rsync://ftp.univ-tlse1.fr/blacklist/dest/ /usr/local/share/ut1/

exit=0

# generated tables end up read-only (umask 0333 gives mode 0444)
umask 0333

# loop over the category directories; a for loop (rather than piping
# find into while) keeps the exit variable in the current shell
for dir in /usr/local/share/ut1/*/; do
    table=$(basename "$dir")
    [ -f "${dir}domains" ] || continue

    myparams="-W -t $table -d ${dir}domains"
    [ -f "${dir}urls" ] && myparams="${myparams} -u ${dir}urls"

    ufdbGenTable ${myparams} || exit=1
done

exit $exit
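Since the rsync tree only transfers what has changed, the script is cheap enough to run from cron; a nightly root crontab entry along these lines (the path is, of course, made up) keeps the tables fresh:

# refresh the UT1 tables every night at 03:30 (hypothetical script path)
30 3 * * * /usr/local/sbin/update-ut1-blacklists.sh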

UT1 also, contrary to Shalla, makes use of expression lists, which allow matching many URLs without the need to list all of them as domains, just like the original URLFilter list does (judging from the configuration file). Unfortunately, with this list you cannot enforce HTTPS certificates, which was most definitely a cool feature of ufdbGuard when used with its original list. I wonder whether it would be possible to generate an equivalent list from the ca-certificates package that we already ship in Gentoo.

I think it is possible to improve both the handling of blacklists, possibly through a more precise “crowdsourcing” method, and the code of ufdbGuard itself, which right now tends to be a bit rough around the edges. I’m just surprised that there are multiple lists, rather than a single, complete and optimized one.

Sealed tinderbox

I’ve been pushing the tinderbox one notch stricter from time to time; a few weeks ago I set it up so that any network access, apart from the basic protocols (HTTP, HTTPS, FTP and RSYNC), was denied. The idea is that if an ebuild tries to access the network by itself, something is wrong: once the files are fetched, that should be enough. Incidentally, this is why live ebuilds should not be in the tree.
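The rules for that first stage, which sit on the router, looked roughly like the following sketch; the tinderbox address is an assumption for the example, and FTP passive-mode data ports are ignored for brevity:

# first stage (hypothetical address): the tinderbox may only speak
# HTTP, HTTPS, FTP and rsync towards the outside; anything else is rejected
for port in 80 443 21 873; do
    iptables -A FORWARD -s 192.168.0.20 -p tcp --dport $port -j ACCEPT
done
iptables -A FORWARD -s 192.168.0.20 -j REJECT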

Now, since I’ve received a request regarding the actual network traffic generated by the tinderbox, I decided to go one step further still, and make sure that, apart from the tasks that do require network access, the tinderbox does not connect to anything outside the local network. To do so, I set up a local rsync mirror, then added a Squid pass-through proxy that does not cache anything; at that point, rather than allowing a few protocols on the router for the tinderbox, I simply reject any Internet-bound traffic originating from it (I sketch that part further below). All the outgoing connections from the tinderbox go through Yamato, so I have something like this in my make.conf:

FETCHCOMMAND="/usr/bin/curl --location --proxy yamato.local:3128 --output \"\${DISTDIR}/\${FILE}\" \"\${URI}\""
RESUMECOMMAND="/usr/bin/curl --location --proxy yamato.local:3128 --continue-at - --output \"\${DISTDIR}/\${FILE}\" \"\${URI}\""

Note: while googling on how to set these two variables up in Gentoo to use curl, I did find some descriptions on the Gentoo Forums that provide most of the needed options; unfortunately, everything I found ignores the --location option, without which curl fails to fetch files from the SourceForge mirrors and any other mirroring system that answers with 302 redirects.
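For completeness, once the per-protocol allowances are gone, the router side of the sealing boils down to a blanket reject (the address is, again, just an example), while on Yamato disabling Squid’s cache entirely, for instance with a cache deny all rule in squid.conf, gives the non-caching pass-through behaviour described above:

# hypothetical final stage: the tinderbox may not reach the Internet at
# all any more; it can still talk to Yamato on the local network
iptables -A FORWARD -s 192.168.0.20 -j REJECT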

I also modified the bti-calling script so that the identi.ca dents are sent through the proxy as well. I didn’t set the http_proxy variable globally, because that would have made the sealing moot; instead, by setting the proxy explicitly for the fetch and dent commands only, any testsuite that tries to fetch something, even via plain HTTP, will still be denied.
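The trick is simply to scope the proxy to the single invocation rather than exporting it for everything; a hypothetical version of such a dent helper, relying on the fact that curl-based tools pick http_proxy up from the environment, could look like this:

#!/bin/sh
# send the message given as first argument as a dent; the proxy is set
# only for this one call, so the rest of the environment stays sealed
echo "${1}" | http_proxy="http://yamato.local:3128" bti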

But… why should it be a problem if testsuites were to access services on the network? Well, the answer is actually easy once you understand two rules of Gentoo: what is not in package.mask is supposed to work, and any bug found needs to be fixable; on top of that, testsuite results need to be reproducible, to make sure that the package works. When you rely on external infrastructure such as GIT repositories, you have no way to make sure that a problem, once found, can be fixed; and when your testsuite relies on remote network services, it might fail because of connection problems, and it will fail outright if the remote service is shut down entirely.

I’ve also been tempted to remove IPv4 connectivity from the tinderbox altogether; IPv6 should be more than enough, given that it only needs to connect to Yamato, and it would be behind NAT anyway.