This Time Self-Hosted

Spring cleaning in your $HOME: spamassassin with SQL backend

This is going to be the first in a series of posts about «spring cleaning» your home directory. We’re also in the right season, so I’m not that off topic for now 🙂

Why do I care about having a clean home directory? It depends largely on my setup, but I think the situation is common enough to warrant some discussion. My /home lives on a partition that DM replicates across my two hard drives, giving me a basic RAID1 setup for that single partition; this keeps my important data (SSH and GPG keys, configuration files, mail and so on) relatively safe from a hard disk crash.

The problem is that everything written to my home directory has to be written to two disks, which is often a performance drawback; for this reason I tend to scatter non-essential data (like repository checkouts) across different partitions, as it also rarely needs backing up. This is also why I hate software that uses my home directory to store cache data: it ends up using RAID1 for disposable data that I don’t want backed up together with the really important data.

So this series of posts is going to explain how I try to keep my home directory clean of cache data, partly to help anyone else who might want to do the same, partly so that I remember how and why I did something 😉

One of the first services I thought of that keeps data in my home directory was SpamAssassin. The amount of spam I receive has decreased a lot since I left Gentoo (as I’m no longer on 10 aliases), but I still receive quite a bit, so I’m not yet ready to remove my local SpamAssassin filter; keeping it is probably a sane idea, especially since for xine-lib I’m going to repeat my email address over and over at every commit 😉

SpamAssassin saves some data in ~/.spamassassin, namely the Bayesian token database, the automatic whitelist and your extra preferences. As I don’t have extra per-user preferences (I use SpamAssassin in a single-user environment), I don’t need those, but I do need bayes and awl to work. Since I already have Amarok using PostgreSQL on this box, I decided to use PostgreSQL to store the SpamAssassin data as well.

Unfortunately, the ebuild as it stands does not let you easily add postgres support, but this is probably going to be fixed in the future; I have a better ebuild in my overlay and I’ll see about sending the changes to the Perl team now. In the meantime, there is not that much to change.

The documentation on setting up SpamAssassin with an SQL backend can be found on the SpamAssassin Wiki, and it applies to PostgreSQL as well as MySQL; a few things have to be changed around, but nothing major.

Throughout this post I’ll assume that both PostgreSQL and SpamAssassin are only reachable on localhost, and that you don’t need extra security measures such as a database password.

First of all, stop SpamAssassin (if your mail system is not mission critical) and start backing up the bayesian database:

% sudo /etc/init.d/spamd stop
% sa-learn --backup > sa-bayes-backup

This will create a sa-bayes-backup file containing the Bayesian tokens currently saved in a Berkeley DB file in your home directory.

After this, change the USE flags for mail-filter/spamassassin: disable berkdb and enable postgres. Ignore the warning the ebuild currently throws about the Bayesian filter needing the DB_File module; it works just as well with PostgreSQL as a backend, but you have to configure it. You might also want to enable the doc USE flag, as right now it unfortunately controls the installation of user-serviceable documentation; alternatively, just get an extracted copy of SpamAssassin’s tarball to use as a reference.
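The USE flag change can be recorded with a one-line package.use entry. The following is a minimal sketch, assuming package.use is a single file rather than a directory; the function name set_sa_useflags is made up for illustration, and the path is passed in so you can try it on a scratch file first.

```shell
# Append the USE flag overrides for SpamAssassin to a package.use file.
# set_sa_useflags is a hypothetical helper name, not part of Portage.
set_sa_useflags() {
    use_file="$1"
    # -berkdb disables the Berkeley DB backend, postgres enables PostgreSQL,
    # doc installs the documentation that ships the SQL schema files.
    printf '%s\n' 'mail-filter/spamassassin -berkdb postgres doc' >> "$use_file"
}
```

Calling it as root with /etc/portage/package.use as the argument and then running emerge --oneshot mail-filter/spamassassin should rebuild the package with the new flags.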

Now it’s time to create the user and the database to store the data in.

% sudo -u postgres -i
postgres % createuser spamassassin
postgres % createdb -O spamassassin spamassassin
postgres % bzcat /usr/share/doc/spamassassin-3*/sql/bayes_pg.sql.bz2 | 
  psql -U spamassassin spamassassin
postgres % bzcat /usr/share/doc/spamassassin-3*/sql/awl_pg.sql.bz2 | 
  psql -U spamassassin spamassassin

You could also use per-user preferences stored in the SQL backend if you really need them; as I don’t, I instead edited /etc/conf.d/spamd, replacing the -c option (which makes spamd create per-user configuration files if they are missing) with -x (which tells spamd to ignore per-user options), which is just what I need.
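If you prefer to script that edit, here is a minimal sketch. It assumes the options live on a line starting with SPAMD_OPTS= and that -c appears there as a standalone word; check your own /etc/conf.d/spamd before relying on it. The function name switch_spamd_userprefs is made up for this example.

```shell
# Replace the -c flag with -x on the SPAMD_OPTS line of a spamd conf.d file.
# Assumption: the flag is a standalone "-c" delimited by spaces or quotes.
switch_spamd_userprefs() {
    conf_file="$1"
    tmp_file="${conf_file}.new"
    # Only the SPAMD_OPTS line is touched; every other line passes through.
    sed -e '/^SPAMD_OPTS=/ s/\([" ]\)-c\([" ]\)/\1-x\2/' "$conf_file" > "$tmp_file" &&
        cat "$tmp_file" > "$conf_file" &&   # keeps the original owner and mode
        rm -f "$tmp_file"
}
```

Run it as root on /etc/conf.d/spamd, or on a copy first to see what it would change.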

Now it’s time to set up the database connection from SpamAssassin. The ebuild suggests configuring it in the secrets.cf file, which is not readable by users; but if you plan to run sa-learn as your own user, you might prefer to put the settings in a world-readable file instead, especially if you have no security concerns about the spamassassin PostgreSQL database. This is what I have done anyway:

bayes_store_module      Mail::SpamAssassin::BayesStore::PgSQL
bayes_sql_dsn           DBI:Pg:dbname=spamassassin;host=localhost
bayes_sql_username      spamassassin
bayes_sql_override_username     spamassassin

auto_whitelist_factory  Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn            DBI:Pg:dbname=spamassassin;host=localhost
user_awl_sql_username   spamassassin
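Before removing anything, it can be worth checking that every directive actually made it into the file. This is only a sketch of such a check: the function name missing_sa_sql_directives is hypothetical, and it looks at the presence of the directives, not at whether their values are right.

```shell
# Report which of the SQL-related directives are missing from a config file.
# Prints one missing directive per line; prints nothing when all are present.
missing_sa_sql_directives() {
    conf_file="$1"
    for key in bayes_store_module bayes_sql_dsn bayes_sql_username \
               auto_whitelist_factory user_awl_dsn user_awl_sql_username; do
        # A directive counts only if it starts a line (optional leading blanks),
        # so commented-out entries are reported as missing.
        grep -Eq "^[[:space:]]*${key}[[:space:]]" "$conf_file" || printf '%s\n' "$key"
    done
}
```

Point it at whichever file you put the settings in (/etc/mail/spamassassin/local.cf is my assumption for a typical location); spamassassin --lint will then tell you whether the configuration as a whole parses.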

At this point, SpamAssassin will only use PostgreSQL for its databases, so you can just remove your ~/.spamassassin directory; it will not be recreated. Then start PostgreSQL (or make sure it’s already running) and restore the Bayes database:

% sudo /etc/init.d/postgresql start
% sa-learn --restore sa-bayes-backup

Now you could restart spamd and have your system back already, but there is one problem with the current ebuild (the one in my overlay does not need this change): its init script does not depend on PostgreSQL. On the one hand that is correct: you might not be using a local PostgreSQL to store the data, in which case you don’t care whether spamd starts after postgresql. But with a local configuration you certainly don’t want spamd to start before the database is up, so edit the /etc/init.d/spamd script and add a simple use postgresql line to the depend() function; then add postgresql to your default runlevel, and that should be it.
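The edited function might end up looking roughly like this. Only the use postgresql line comes from the change described above; the other lines are my assumption about what a stock spamd init script contains, so keep whatever your own script already has.

```shell
# Sketch of the depend() function in /etc/init.d/spamd after the edit.
depend() {
    need net          # assumed to be present already in the stock script
    before mta        # assumed to be present already in the stock script
    use logger        # assumed to be present already in the stock script
    use postgresql    # new: start after postgresql when it is in the runlevel
}
```

Since use is a soft dependency, spamd still starts on machines where postgresql is not in the runlevel at all. To add it to the default runlevel, sudo rc-update add postgresql default should do.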

At this point you’re set: just restart spamd, and it won’t use your home directory to store cache data anymore!

% sudo /etc/init.d/spamd start
