Tarsnap and backup strategies

After having had a quite traumatic experience with a customer’s service running on one of the virtual servers I run last November, I made sure to have a very thorough backup for all my systems. Unfortunately, it turns out to be a bit too thorough, so let me explore with you what was going on.

First of all, the software I use to run the backup is tarsnap — you might have heard of it or not, but it’s basically a very smart service, that uses an open-source client, based upon libarchive, and then a server system that stores content (de-duplicated, compressed and encrypted with a very flexible key system). The author is a FreeBSD developer, and he’s charging an insanely small amount of money.

But the most important part to know when you use tarsnap is that you just always create a new archive: it doesn’t really matter what you changed, just get everything together, and it will automatically de-duplicate the content that didn’t change, so why bother? My first dumb method of backups, which is still running as of this time, is to simply, every two hours, dump a copy of the databases (one server runs PostgreSQL, the other MySQL — I no longer run MongoDB but I start to wonder about it, honestly), and then use tarsnap to generate an archive of the whole /etc, /var and a few more places where important stuff is. The archive is named after date and time of the snapshot. And I haven’t deleted any snapshot since I started, for most servers.

It was a mistake.

The moment when I went to recover the data out of earhart (the host that still hosts this blog, a customer’s app, and a couple more sites, like the assets for the blog and even Autotools Mythbuster — but all the static content, as it’s managed by git, is now also mirrored and served active-active from another server called pasteur), the time it took to extract the backup was unsustainable. The reason was obvious when I thought about it: since it has been de-duplicating for almost an year, it would have to scan hundreds if not thousands of archives to get all the small bits and pieces.

I still haven’t replaced this backup system, which is very bad for me, especially since it takes a long time to delete the older archives even after extracting them. On the other hand it’s probably a lot of a matter of tradeoff in the expenses as well, as going through all the older archives to remove the old crap drained my credits with tarsnap quickly. Since the data is de-duplicated and encrypted, the archives’ data needs to be downloaded to be decrypted, before it can be deleted.

My next preference is going to be to set it up so that the script is executed in different modes: 24 times in 48 hours (every two hours), 14 times in 14 days (daily), and 8 times in two months (weekly). The problem is actually doing the rotation properly with a script, but I’ll probably publish a Puppet module to take care of that, since it’s the easiest thing for me to do, to make sure it executes as intended.

The essence of this post is basically to warn you all that, no matter whether it’s cheap to keep around the whole set of backups since the start of time, it’s still a good idea to just rotate them.. especially for content that does not change that often! Think about it even when you set up any kind of backup strategy…

Reinventing the wheels for fun and….

So it’s Christmas day, and I’m skimming through the 32+ MB of logs generated by my elven script and I found some nice nuggets:

Symbol getopt_long@@ (32-bit UNIX System V ABI Intel 80386) present 35 times
  /usr/bin/graph
  /usr/bin/ebzipinfo
  /usr/bin/spline
  /usr/bin/double
  /usr/bin/uupick
  /usr/bin/autotrace
  /usr/bin/uustat
  /usr/sbin/ndtpd
  /usr/bin/ode
  /usr/sbin/ndtpcheck
  /usr/bin/ebrefile
  /usr/bin/uucp
  /usr/sbin/ndtpcontrol
  /usr/bin/uulog
  /usr/sbin/uuxqt
  /usr/bin/ebzip
  /usr/bin/faad
  /usr/bin/lha
  /usr/lib/libsox.so.1.0.0
  /usr/sbin/uucico
  /usr/bin/ebinfo
  /usr/sbin/uuconv
  /usr/bin/plot
  /usr/bin/tek2plot
  /usr/lib/libc.so.5
  /usr/bin/uuname
  /usr/bin/uux
  /usr/bin/ebstopcode
  /usr/bin/stklos
  /usr/bin/cu
  /usr/bin/plotfont
  /usr/bin/ebfont
  /usr/bin/ebunzip
  /usr/bin/pic2plot
  /usr/sbin/uuchk
Symbol strncasecmp@@ (32-bit UNIX System V ABI Intel 80386) present 5 times
  /usr/bin/xpilots
  /usr/bin/gargoyle-agility
  /usr/lib/libc.so.5
  /usr/bin/xedit
  /usr/bin/xpilot
Symbol strnlen@@ (32-bit UNIX System V ABI Intel 80386) present 5 times
  /usr/bin/tarsync
  /usr/lib/libCw.so.1.0.0
  /usr/lib/libmba.so.0.9.1
  /usr/lib/python2.5/site-packages/numarray/_chararray.so
  /usr/bin/linksys-tftp
Symbol stricmp@@ (32-bit UNIX System V ABI Intel 80386) present 5 times
  /usr/bin/qemacs
  /usr/lib/libraidutil.so.0.0.0
  /usr/lib/openbabel/2.2.0/inchiformat.so
  /usr/lib/libxerces-c-3.0.so
  /usr/lib/libIL.so.1.0.0

Now if you ignore the references to the old compatibility libc.so.5 you can still find that there are quite a few programs that reinvent the wheel, reimplementing some functions that the C library already provides. Now, this would be fine and dandy if the implementation was subtly different, or was done with some particular purpose in mind, like glib’s functions, but I really can’t find the reason for the situation above to happen.

These are usually smaller functions, but still there is no reason for them to be present since anyway it’s more than likely that most of their use is replaced away by the compiler itself; more interesting is their presence in shared objects, since that would interpose around other calls, although this most likely won’t happen on glibc based systems since they provide versioned symbols which wouldn’t then interpose by default. That’s still a problem for *BSD systems though.

This has a much lower priority in my list than identifying all libraries bundled by various packages (even when they cannot be fixed, because they are proprietary or something else), because these are unlikely to turn out being security issues. On the other hand, the idea of doing such pointless duplication of common functions might as well caus security issues to be hidden, for instance if a package was to reimplement mktemp, then it would most likely be a problem.

Anyway for those interested to find out what’s duplicating code in their system, the new hit parade, derived directly from the Bug shows SQLite3 trying to climb up the ladder, together with boehm-gc. Why people can’t understand that there are libraries in the system already?