If you run a filtering proxy with Squid, you probably know SquidGuard and ufdbGuard: two of the applications that implement content filtering based, mostly, on lists of known bad URLs. I currently maintain a single filtering proxy for a customer of mine, whose main task is blocking “adult” sites in general. I set it up with ufdbGuard, which is developed by the same company that sells the URLFilter Database, and used the demo version of that list for a while.
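For context, ufdbGuard hooks into Squid as a URL rewriter: Squid spawns a small client program that asks the ufdbGuard daemon whether each URL should pass. A minimal squid.conf sketch, assuming the default install prefix (your paths may well differ):

# Hand every request to ufdbgclient, which queries the ufdbGuard daemon;
# the install prefix below is an assumption, adjust to your package's layout.
url_rewrite_program /usr/local/ufdbguard/bin/ufdbgclient
# Number of helper processes Squid keeps running.
url_rewrite_children 16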
Unfortunately my customer is not really keen on paying for a six-user license when his office only has three people, and I can’t blame him for that, even though the price is a fair one by itself. So now that I no longer have access to the main blacklist, I took a suggestion from Aleister and tried Shalla instead.
Compared with SquidGuard, which as far as I can tell is its ancestor, ufdbGuard boasts a faster matching engine; the basis for this is that instead of relying on plain-text lists, it uses pre-generated binary database files in its own format. This, though, makes it a bit less trivial to run ufdbGuard with a different list than the one its developers provide.
You can then use the ufdbGenTable command to generate the tables, passing it the blacklists of domains and URLs to be transformed into the format ufdbGuard can use. This is the main reason why ufdbGuard is so much faster: instead of having to read and split text files, it accesses optimized database files designed to make matching much faster. But not all lists are created equal, and ufdbGenTable also performs some analysis: it reports on a number of issues, such as third-level www. parts present in the blacklist (which are at a minimum redundant, and increase the size of the pattern to match) and overly long URLs (which could suggest that the wrong kind of match is being applied).
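As a concrete example, this is the kind of invocation the update script below builds up, one table at a time (the table name and paths here are illustrative):

# Build the binary table "adult" from plain-text domain and URL lists;
# -W strips the redundant "www." prefixes that ufdbGenTable warns about.
ufdbGenTable -W -t adult -d /usr/local/share/ut1/adult/domains -u /usr/local/share/ut1/adult/urls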
This kind of analysis is what undermined my confidence in the Shalla list. While it is sponsored by a company, and thus not a complete community effort, the list is supposedly maintained via third-party contributions; but either their review process is broken, or it is simply not as alive as it seemed at first glance. ufdbGenTable reports a number of warnings, mostly the www issue noted above (which can be solved with a simple -W option to the command), but also a lot of long URLs. This in turn shows that the list contains URLs to single documents within a website, as well as a number of URLs that include a session ID (a usually random hash that is never repeated, and thus can never be matched).
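To make the session-ID problem concrete, an entry like the following (invented for illustration, not taken from the list) is dead weight: the hash is generated anew on every visit, so no future request will ever match it.

example.com/gallery/view.php?sid=9f8a7b6c5d4e3f2a1b0c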
Looking at the submission list doesn’t help either: demonoid.me was rejected for listing with the reason “Site never functional”, which sounds silly given that’s the site I got all the recent episodes of Real Time with Bill Maher from. The impression I got is that the list is handled in an amateurish way, and I thus can’t really rely on it in any way.
A more interesting list is the UT1 blacklist from the Université Toulouse 1, whose main target is adult websites (which is what I’m mostly interested in filtering in this case). This list is interesting in a number of ways: updates are available either as individually compressed lists or, even better for my case, as an rsync tree, which does not require me to fetch the whole list each time. This simplifies my update script to something like this:
#!/bin/sh
# Mirror the UT1 blacklist tree, leaving the locally generated *.ufdb files alone.
rsync --delete --exclude 'domains.ufdb' --archive rsync://ftp.univ-tlse1.fr/blacklist/dest/ /usr/local/share/ut1/

status=0
# Make the generated table files read-only.
umask 0333

# Iterate over the category directories; a for loop (rather than piping find
# into "while read") avoids a subshell, so the status variable survives the loop.
for table in $(find /usr/local/share/ut1 -mindepth 1 -maxdepth 1 -type d -printf '%f\n'); do
    [ -f /usr/local/share/ut1/${table}/domains ] || continue

    myparams="-W -t $table -d /usr/local/share/ut1/${table}/domains"
    [ -f /usr/local/share/ut1/${table}/urls ] && myparams="${myparams} -u /usr/local/share/ut1/${table}/urls"

    ufdbGenTable ${myparams} || status=1
done

exit $status
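To keep the tables fresh, a script like this could simply run nightly from cron; the schedule and script path below are just an assumption for illustration, not part of my actual setup:

# /etc/cron.d/ut1-update (hypothetical): refresh the UT1 tables every night.
30 3 * * * root /usr/local/sbin/update-ut1.sh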
UT1 also, unlike Shalla, makes use of expression lists, which allow matching many URLs without needing to list all of them as domains, just like the original URLFilter list does (judging from the configuration file). Unfortunately, when using this list you cannot enforce HTTPS certificate checks, which was most definitely a cool feature of ufdbGuard when used with its original list. I wonder whether it would be possible to generate an equivalent list from the ca-certificates package that we already ship in Gentoo.
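For the record, an expression list is just a file of regular expressions matched against the whole URL rather than against exact domains; the entries below are invented for illustration, not taken from UT1:

# One regular expression per line; a match on any of them flags the URL.
(^|[.-])sex[.-]
(porn|xxx)(tube|video)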
I’m of the opinion that it is possible to improve both the handling of blacklists, possibly through a more precise “crowdsourcing” method, and the code of ufdbGuard itself, which right now tends to be a bit rough around the edges. I’m just surprised that there are multiple lists, rather than a single, complete, optimized one.
Hey Diego, I spent the five years before I came to Genesi working at a content filtering company. UfdbGuard came out a while back and I always wanted to compare it to squidGuard but never got the chance, so I found the post very interesting. I think you might be mistaken, though, about sg being plaintext only: I know that you can generate sqlite databases from its text files. It’s still slower, from what I hear.
Maintaining those databases is always a rough job. One of the biggest issues you run into, at least here stateside, is child pornography. In the US you have to be registered with one of the federal government’s many three-letter-acronym agencies, as well as your state and local police departments, so that if you do visit one of those sites the door isn’t kicked in and such.
Crowdsourcing might be great, but you will always have differences of opinion on what is porn and what isn’t. We had many customers who would call up because a museum website, as an example, had a picture of some ancient Greek statue depicting a male or female nude/semi-nude. You also get people who try to slide in sites that they simply don’t agree with, and it causes overblocking and head-scratching, as well as complaints and eventually slashdotting (hello ACMA list! Our company was one of the providers for said blocking, although the sites were sent to us encrypted and we had to enter them into the category exactly as they presented them. I scratched my head more than once, but we were told to just do it.)
There really is no great way to maintain such lists without overblocking. Have you looked into ufdb’s handling of regexes or its proxy detection? I don’t recall if it handles “keywords (regex)” or not, but the proxy handling was what made me most interested in it, along with the automatic search engine “safe mode” handling.
Thanks for the great post!
I haven’t looked into squidGuard much, but most of the documentation I was looking at never referred to precompiled databases. On the other hand, I expect that even if it uses a general-purpose database approach (which is what sqlite is), it would still not have the same kind of optimization a tailored database would give.
I had no doubt that handling those blacklists in the States would be troublesome: I don’t think it is a coincidence that all the blacklists I found are maintained in Europe.
But I admit I wouldn’t have thought about the museum case… I would have expected that only to “happen on The Simpsons”:http://en.wikipedia.org/wik… but I guess I underestimate extremists… Or maybe it’s just the Mediterranean culture that makes porn well-defined enough…
Hello, I was just reading this excellent article about setting up a Squid proxy and thought I would take the time to write a short note to inform your readers that we offer blacklists tailored specifically for Squid proxy native ACL, as well as alternative formats for the most widely used third-party plugins. So we invite you all to check us out. We take a great deal of pride in the fact that our work offers a higher degree of quality than the freely available options. Our lists are also compatible with UrlFilterdb.
Quality Blacklists Tailored For Squid Proxy – http://www.squidblacklist.org
Hi, correct me if I’m wrong, but squidGuard uses Berkeley DB databases.