If you run a filtering proxy with Squid, you probably know SquidGuard and ufdbGuard: they are two of the applications that implement content filtering based, mostly, on lists of known bad URLs. I currently have a single filtering proxy set up at a customer of mine, whose main task is blocking “adult” sites in general; I set it up with ufdbGuard, which is developed by the same company as the URLfilter Database, and I used the demo version of their list for a while.
Unfortunately my customer is not really keen on paying for a six-user license when his office has only three people, and I really can’t blame him, even though the price by itself is a fair one. So now that I no longer have access to the main blacklist, I took a suggestion from Aleister and tried Shalla’s list instead.
Compared with SquidGuard, which as far as I can tell is its ancestor, ufdbGuard boasts a faster matching engine; at the base of this is the fact that, instead of relying on plain-text lists, it uses pre-generated binary database files in its own format. This, though, makes it a bit less trivial to run ufdbGuard with a different list than the one they provide themselves.
You can then use the ufdbGenTable command to generate the tables, passing it the blacklists of domains and URLs, which get transformed into the format ufdbGuard can use. This is the main reason why ufdbGuard is so much faster: instead of having to read and split text files at runtime, it accesses optimized database files designed for quick matching. But not all lists are created equal, and ufdbGenTable also performs an analysis task: it reports a number of issues, such as a third-level www. part present in blacklisted domains (which is at a minimum redundant, and increases the size of the pattern to match) and overly long URLs (which could suggest that a wrong match would be applied).
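For instance, generating the table for a single category out of its plain-text lists comes down to a call like the following (the paths here are made up for the example):

# build the table for one category from its domains and urls files
ufdbGenTable -t porn \
    -d /usr/local/share/shalla/porn/domains \
    -u /usr/local/share/shalla/porn/urls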
This kind of analysis is what undermined my confidence in the Shalla list. While it is sponsored by a company, and thus not a complete community effort, the list is supposedly maintained through third-party contributions; but either their review process is broken, or it is simply not as alive as it seemed at first glance. ufdbGenTable reports a number of warnings, mostly the www issue noted above (which can be solved with a simple -W option to the command), but also a lot of long URLs. This in turn shows that the list contains URLs pointing to single documents within a website, and a number of URLs that include a session ID (a usually random hash that is never repeated, and thus could never be matched).
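Out of curiosity, a rough grep over the plain-text urls files gives an idea of how many such entries are in there; the pattern below is just a guess at common session-ID parameter names, and the path is again hypothetical:

# count URLs carrying likely session-ID parameters, per category file
grep -Ec '[?&](PHPSESSID|sid|sessionid)=' /usr/local/share/shalla/*/urls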
Looking at the submission list doesn’t help either: demonoid.me was rejected for listing with the reason “Site never functional”, which sounds silly given that’s the site I got all the recent episodes of Real Time with Bill Maher from. The impression I got is that the list is handled amateurishly, so I cannot really rely on it in any way.
A more interesting list is the UT1 blacklist from the Université Toulouse 1, whose main target is adult websites (which is what I’m mostly interested in filtering in this case). This list is interesting in a number of ways: updates are available either as individually compressed lists or, even better for my case, as an rsync tree, which saves me from fetching the whole list each time. This simplifies my update script to something like this:
#!/bin/sh
# mirror the UT1 tree; --exclude protects the generated .ufdb tables
# from being removed by --delete on the next run
rsync --delete --exclude 'domains.ufdb' --archive rsync://ftp.univ-tlse1.fr/blacklist/dest/ /usr/local/share/ut1/

exit=0
# generated tables end up read-only (mode 444)
umask 0333

# loop over the category directories; a glob is used instead of
# find | while so that the exit flag survives (no subshell involved)
for dir in /usr/local/share/ut1/*/; do
    table=$(basename "$dir")
    [ -f "${dir}domains" ] || continue
    myparams="-W -t $table -d ${dir}domains"
    [ -f "${dir}urls" ] && myparams="${myparams} -u ${dir}urls"
    ufdbGenTable ${myparams} || exit=1
done

exit $exit
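Since the whole point is keeping the tables current, the script lends itself to being run from cron; a hypothetical crontab entry (the script path is made up) would look like this:

# refresh the UT1 lists and regenerate the tables every night
30 3 * * * /usr/local/sbin/ut1-update.sh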
UT1 also, unlike Shalla, makes use of expression lists, which allow matching many URLs without listing all of them as domains, just like the original URLfilter list does (judging from the configuration file). Unfortunately, with this list you cannot enforce HTTPS certificates, which was most definitely a cool feature of ufdbGuard when used with its original list. I wonder whether it would be possible to generate an equivalent list from the ca-certificates package that we already ship in Gentoo.
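I haven’t tried it, but as a rough starting point, the subjects of the shipped CA certificates can at least be enumerated like this (assuming they are unpacked under /etc/ssl/certs, as they are on Gentoo):

# dump the subject of every CA certificate on the system
for cert in /etc/ssl/certs/*.pem; do
    openssl x509 -in "$cert" -noout -subject
done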
I am of the idea that it is possible to improve both the handling of blacklists, possibly through a more precise “crowdsourcing” method, and the code of ufdbGuard itself, which right now tends to be a bit rough around the edges. I’m just surprised that there are multiple lists out there, rather than a single, complete and optimized one.