Tarsnap and backup strategies

Last November I had a quite traumatic experience with a customer’s service running on one of the virtual servers I manage, and since then I have made sure to keep very thorough backups of all my systems. Unfortunately, they turned out to be a bit too thorough, so let me walk you through what went wrong.

First of all, the software I use to run the backups is tarsnap. You may or may not have heard of it, but it’s basically a very smart service: an open-source client based upon libarchive, plus a server system that stores the content de-duplicated, compressed and encrypted, with a very flexible key system. The author is a FreeBSD developer, and he charges an insanely small amount of money.

The most important thing to know when you use tarsnap is that you always create a new archive: it doesn’t really matter what changed, you just bundle everything together, and it automatically de-duplicates the content that didn’t change, so why bother doing anything smarter? My first dumb backup method, which is still running as I write this, is simply to dump a copy of the databases every two hours (one server runs PostgreSQL, the other MySQL; I no longer run MongoDB, though honestly I’m starting to wonder about it), and then use tarsnap to generate an archive of the whole of /etc, /var and a few more places where important stuff lives. Each archive is named after the date and time of the snapshot. And on most servers I haven’t deleted a single snapshot since I started.
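To make the "always create a new archive" approach concrete, here is a minimal Python sketch of how such a run could build its tarsnap invocation. The timestamp format and the path list are my assumptions for illustration (the real list includes "a few more places"); only the `-c` and `-f` options of tarsnap are relied upon.

```python
from datetime import datetime

# Assumed path list; the actual setup covers /etc, /var and a few more places.
BACKUP_PATHS = ["/etc", "/var"]


def archive_name(when):
    """Name the archive after the date and time of the snapshot.

    The exact format is an assumption, chosen so that names sort
    chronologically.
    """
    return when.strftime("%Y%m%d-%H%M")


def create_command(when, paths=BACKUP_PATHS):
    """Build the tarsnap command line that creates a fresh, full archive."""
    # -c creates a new archive and -f names it; tarsnap itself takes care
    # of de-duplicating whatever did not change since earlier archives.
    return ["tarsnap", "-c", "-f", archive_name(when)] + list(paths)
```

A cron job running every two hours could dump the databases first and then hand `create_command(datetime.now())` to `subprocess.run(..., check=True)`.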

It was a mistake.

When I went to recover the data from earhart (the host that still hosts this blog, a customer’s app, and a couple more sites, such as the assets for the blog and even Autotools Mythbuster; all the static content, being managed by git, is now also mirrored and served active-active from another server called pasteur), the time it took to extract the backup was unsustainable. The reason was obvious once I thought about it: since it had been de-duplicating for almost a year, it had to scan hundreds if not thousands of archives to collect all the small bits and pieces.

I still haven’t replaced this backup system, which is very bad of me, especially since deleting the older archives takes a long time even after extracting them. On the other hand, it’s probably also a matter of trading off expenses, as going through all the older archives to remove the old crap quickly drained my tarsnap credits. Since the data is de-duplicated and encrypted, an archive’s data needs to be downloaded and decrypted before it can be deleted.

My next step is going to be to set things up so that the script runs in different modes, keeping 24 archives over 48 hours (one every two hours), 14 over 14 days (daily), and 8 over two months (weekly). The hard part is doing the rotation properly with a script, but I’ll probably publish a Puppet module to take care of that, since it’s the easiest way for me to make sure it executes as intended.
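The rotation described above could be sketched roughly as follows. This is plain Python rather than the Puppet module, and it rests on assumptions of mine: archive names follow a `%Y%m%d-%H%M` timestamp format, "two months" is read as eight weeks, and the daily and weekly tiers keep the earliest archive of each day and of each ISO week. The only tarsnap options used are `-d` and `-f`, which delete a named archive.

```python
from datetime import datetime, timedelta

STAMP = "%Y%m%d-%H%M"  # assumed archive naming scheme


def archives_to_keep(names, now):
    """Return the archive names that at least one retention tier wants."""
    hourly, daily, weekly = set(), {}, {}
    for name in sorted(names):  # this format sorts chronologically
        taken = datetime.strptime(name, STAMP)
        age = now - taken
        if age <= timedelta(hours=48):
            hourly.add(name)  # everything from the last 48 hours
        if age <= timedelta(days=14):
            # first archive seen for each calendar day
            daily.setdefault(taken.date(), name)
        if age <= timedelta(weeks=8):
            # first archive seen for each (ISO year, ISO week) pair
            weekly.setdefault(tuple(taken.isocalendar())[:2], name)
    return hourly | set(daily.values()) | set(weekly.values())


def delete_commands(names, now):
    """Build one `tarsnap -d` command per archive that no tier keeps."""
    keep = archives_to_keep(names, now)
    return [["tarsnap", "-d", "-f", n] for n in sorted(names) if n not in keep]
```

The list of names would come from `tarsnap --list-archives`. Printing the commands first and only running them once the selection looks right is probably the safer way to try this out.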

The essence of this post is to warn you all that, no matter how cheap it is to keep every backup since the beginning of time, it’s still a good idea to rotate them, especially for content that does not change that often! Keep that in mind whenever you set up any kind of backup strategy.