My take on compression algorithms

Biancospino - Hawthorn

I just read Klausman’s entry comparing compression algorithms, and while I’m no expert at all in the field, I wanted to talk a bit about it myself, from a power user’s point of view.

Tobias’s benchmarks are quite interesting, although similar in nature to many others you can find out there comparing lzma to gzip and bzip2. One thing I appreciated him making explicit is that lzma pays off when you decompress more often than you compress. This is something a lot of people tend to skip over, sometimes with quite catastrophic (in my view) results.
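The compress-once/decompress-many tradeoff is easy to see for yourself. This is only a rough sketch using Python’s stdlib bindings (gzip, bz2, lzma) as stand-ins for the command-line tools, so the absolute numbers will differ from gzip(1), bzip2(1) and lzma(1), but the shape of the tradeoff is the same:

```python
# Sketch: pay the compression cost once, then decompress repeatedly,
# as a backup-restore or package-unpack workload would.
import bz2
import gzip
import lzma
import time

data = b"The quick brown fox jumps over the lazy dog. " * 20000  # ~900 KiB sample

for name, compress, decompress in (
    ("gzip", lambda d: gzip.compress(d, 9), gzip.decompress),
    ("bzip2", lambda d: bz2.compress(d, 9), bz2.decompress),
    ("lzma", lambda d: lzma.compress(d, preset=6), lzma.decompress),
):
    t0 = time.perf_counter()
    blob = compress(data)               # compress once...
    t1 = time.perf_counter()
    for _ in range(20):                 # ...decompress many times
        assert decompress(blob) == data
    t2 = time.perf_counter()
    print(f"{name:6s} compress {t1 - t0:.3f}s, 20x decompress {t2 - t1:.3f}s")
```

If decompressions dominate, lzma’s slow compression step amortizes away; if they don’t, you keep paying for it.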

Keeping this in mind, you can see that lzma is not really a good fit when you compress as many times as (or more than) you decompress. When that happens is the central point here. You certainly expect a backup system to compress a lot more than it decompresses: you take daily (or more frequent) backups, but the hope is never to need to restore one of them. For Gentoo users, another place where they compress more than they decompress is manpages and documentation. These are compressed every time you merge something, but you don’t tend to read all the manpages and all the documentation every day; I’m sure most users never read most of the documentation that gets compressed and installed. Additionally, lzma does not seem to perform as well on smaller files, so I don’t think it’s worth the extra time needed to compress the data.
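The small-file effect is easy to reproduce: lzma’s container carries a fixed overhead that a manpage-sized input can’t recover. A sketch, again using the Python stdlib modules as stand-ins for the command-line tools (the sample strings are made up for illustration):

```python
# Sketch: compressed sizes for a small, manpage-like snippet versus a
# larger, repetitive body of text. The lzma container overhead hits
# small inputs hardest; on the large input lzma pulls clearly ahead.
import bz2
import gzip
import lzma

small = b"tar - an archiving utility. The GNU version of the tar program."
large = b"The quick brown fox jumps over the lazy dog. " * 20000  # ~900 KiB

for label, data in (("small", small), ("large", large)):
    print(
        label, len(data), "->",
        "gzip:", len(gzip.compress(data, 9)),
        "bzip2:", len(bz2.compress(data, 9)),
        "lzma:", len(lzma.compress(data, preset=9)),
    )
```

On the tiny input, lzma’s output is typically no smaller than gzip’s, and often bigger, so the extra compression time buys nothing.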

One thing Tobias’s benchmark has in common with the other lzma benchmarks I’ve seen is that it doesn’t take memory usage much into consideration. Unfortunately valgrind removed the massif graph that gave you the exact memory footprint of a process; it would have been quite interesting to see those graphs. I’d expect lzma to use a lot more memory than bzip2 to be so quick at decompression. That would make it particularly bad on older systems and in embedded use cases, where one might be interested in saving flash (or disk) space.

As for GNU’s choice to stop providing bzip2 tarballs and offer only gzip- or lzma-compressed ones, I’m afraid the choice has been political as much as technical, if not more. Both zlib (for gzip) and bzip2 (with its libbz2) have very permissive licenses, and that makes them ideal even for proprietary software, or for free software under, in turn, permissive licenses like the BSD license. lzma-utils is still free software, but under a more restrictive license, LGPL-2.1.

While the LGPL still allows proprietary software to link the library dynamically, it is more restrictive, and will likely turn away some proprietary software developers. I suppose this is what the GNU project wants anyway, but I still find it a political choice, not a technical one. It also has an effect on users: to prepare a GNU-like environment on a proprietary system, like for instance Solaris, one has to either use the bigger gzip version or install lzma-utils as well.

I’m sincerely not convinced by lzma myself. It takes far too long during compression to be useful for backups, which are my main compression task, and I’m uncertain about its memory use. The fact that bsdtar doesn’t yet support it directly is also a bit of a turn-off for me, as I’ve grown used to not needing three processes to extract a tarball. Doug’s concerns about the on-disk format also make it unlikely that I’ll start using it.

Sincerely, I’m more concerned about the age of tar itself. While there are ways to bolt onto tar things it wasn’t originally designed for, the fact that changing an archive requires fully decompressing and then re-compressing it makes it pretty much impossible to use as a desktop compression method the way the rar, zip and ace formats (and 7z, somewhat; as far as I can see you cannot remove a file from an archive) are on Windows. I always found it strange that the only widespread archive format supporting Unix information (permissions, symlinks and so on) is the one that was used for magnetic tapes and is thus sequential by nature…
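The sequential-versus-indexed difference is visible even from scripting languages. A sketch with in-memory archives and made-up member names: zip keeps a central directory at the end of the file, so a single member can be located directly, while tar has no index at all and must be walked header by header until the right entry turns up:

```python
# Sketch: random access in zip (central directory) vs sequential scan in tar.
import io
import tarfile
import zipfile

members = {f"file{i:04d}.txt": f"contents of file {i}".encode() for i in range(1000)}

# Build a zip: the central directory written at the end indexes every member.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w") as zf:
    for name, payload in members.items():
        zf.writestr(name, payload)

# Build a tar: just a sequence of header+data blocks, with no index anywhere.
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w") as tf:
    for name, payload in members.items():
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

# zipfile seeks straight to the requested member via the directory...
with zipfile.ZipFile(io.BytesIO(zbuf.getvalue())) as zf:
    print(zf.read("file0500.txt"))

# ...while tarfile has to scan entries until it finds the matching name.
with tarfile.open(fileobj=io.BytesIO(tbuf.getvalue())) as tf:
    print(tf.extractfile("file0500.txt").read())
```

The same structural difference is why deleting or replacing one file in a tar archive means rewriting the whole thing, compressed or not.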

Granted, being sequential probably makes it more interesting for backing up to a flash card (and I should be doing that, by the way), but I don’t find it very useful for compressing a bunch of random data files… Sure, one of the most common uses for desktop compression has been shrinking Microsoft Office’s hugely bloated files, and both OpenOffice and, as far as I know, newer Office versions wrap their XML in zip files, but I can still see a couple of cases where I’d use a desktop compression tool from time to time…

4 thoughts on “My take on compression algorithms”

  1. You mention that valgrind removed the massif graph. They only removed the PostScript graph; the data is still there, in a more accessible format. Just use ms_print on the output file.


  2. Hmm, I’d forgotten about the documentation compression at install time. I sent a response to Tobias regarding his benchmarks; I figured that lzma would be okay for Gentoo users, since we do lots of unpacking as part of our installation process. I’d completely forgotten about the stuff that gets compressed when installed. And you’re right, lzma doesn’t scale particularly well to small files, not like gzip does. I guess it’s a tradeoff — you can unpack large archives like kernel sources very quickly, but you’ll get (a bit?) more time added to each merge operation if lzma is doing the postinstall compression.


  3. Well, you don’t have to use just one single compression method… we should be able to manage multiple compression algorithms so that each one is used where it’s better suited. We could use lzma for sources and keep bzip2 for ecompress. We could even make the compression algorithm used for binpkgs configurable: for instance, if you have a huge network to manage you can compress them with lzma, as a build will be unpacked 10 or more times, but if you only keep them as a backup you can use bzip2. But I’d rather wait until the on-disk format is stable. Christoph, the new graph data that is generated is not the same as what was used to build the graph before: the old graph represented the continuous behaviour of memory over time, while the new one samples the memory used at given times. It’s just not the same kind of graph, even if you plot it the way it was done before.


  4. Instead of going the lzma way, they should have started using pbzip2 to compress stuff. That way, people using multicore/multi-CPU systems would get a nice speedup compared to ‘plain’ bzip2 while still being compatible.

