Was Acronis True Image 2020 a mistake?

You may remember that a few months ago I complained about Acronis True Image 2020. I have since been mostly happy with the software, despite it still being fairly slow when uploading a sizable number of changed files, such as after shooting a bunch of pictures at home. This would have been significantly more noticeable if we had actually left the country since I started using it, as I usually shoot at least 32GB of new pictures on a trip (and sometimes twice as much), but with lockdown and all, that didn’t really happen.

But, besides that, the software worked well enough. Backups happened regularly, both to the external drive and to the Cloud, and I generally felt safe using it. Until a couple of weeks ago, when it suddenly stopped working and failed with Connection Timeout errors. These didn’t correlate with anything: I did upgrade to Windows 10 20H1, but that was a couple of weeks earlier, and backups went through fine until then. There was no change in my network, no change from my ISP, and so on.

So what gives? None of the tools available from Acronis reported errors, no ports were marked as blocked, and I was running the latest version of everything. I filed a ticket, and was called on the phone by one of their support people who actually seemed to know what he was doing — TeamViewer at hand, he checked once again for connectivity, and once again found that everything was alright; the only thing he found to change was the disabled True Image Mounter service, which is used to get quick access to the image files, and thus is not involved in the backup process. I had to disable that one because, years after Microsoft introduced WSL, enabling it breaks WSL filesystem access altogether, so you can’t actually install any Linux distro, change passwords in the ones you already installed, or run apt update on Debian.

This was a week ago. In the meantime, support asked me to scan the disks for errors, because their system report flagged one of the partitions as having issues (if I read their log correctly, that’s one of the recovery images, so it’s not at all related to the backup), and more recently to give them a Process Monitor log taken while running the backup. Since they don’t actually give you a list of processes to limit the capture to, I ended up having to kill most of the other running applications to take the log, as I didn’t want to leak more information than I was required to. It still captured a lot of information I’m not totally comfortable with having provided. And I still have no answer, at the time of writing.

That’s not all — the way you provide all these details to them is fairly clunky: you can’t just mail them, or attach them through their web support interface, as even their (compressed) system report is more than 25MB for my system. Instead, what they instruct you to do is take the compressed files and upload them through FTP, with a username/password pair they provide to you.

Let me repeat that. You upload compressed files, that include at the very least most of the filenames you’re backing up, and possibly even more details of your computer, with FTP. Unencrypted. Not SFTP, not FTPS, not HTTPS. FTP. In 2020.

This is probably the part that makes my blood boil. Acronis has clearly figured out that the easiest way for people to get support is to use something they can use very quickly. Indeed, you can still put an FTP URL in the location bar of your Windows 10 File Explorer, and it will let you upload and download files over it. But it does so in a totally unencrypted, plain-text manner. I wonder how much more complicated it would be to use at least FTPS, or to have an inbound-only, password-protected file upload system like Google Drive or Dropbox; after all, they are a cloud storage provider!

As for myself, I found a temporary workaround while waiting for the support folks to figure out what they have likely screwed up in their London datacenter: I’m backing up my Lightroom pictures to the datacenter they provide in Germany. It took three days to complete, but it at least gives me peace of mind that, if something goes horribly wrong, the most dear part of my backup is saved somewhere else.

And honestly, using a different backup policy for the photos than for the rest of the system is probably a good idea: I set it to “continuous backup”, because the photo library generally stays the same for long stretches, until I go and prepare another set to publish; then a lot of things change quickly, and then nothing again until the next time I get to it.

Also, I do have the local backup — that part is still working perfectly fine. I might actually want to use it soon, as I’m of two minds between copying my main OS drive over from its 1TB SSD to a 2TB SSD, and getting a 2TB SSD and installing everything anew onto it. If I do go that route, I’ll also reuse the 1TB SSD in my NUC, which right now is running with half SATA and half NVMe storage.

Conclusions? Well, compared to Amazon Glacier plus FastGlacier (which has not been updated in just over two years now, and still sports a Google+ logo and +1 button!), it’s still good value for money. I’m spending a fraction of what I used to spend with Amazon, and even in this half-broken state it’s backing up more data and has significantly faster access. The fact that you can set different policies for different parts of the backup is also a significant plus. I just wish there was a way to go from a “From Folders to Cloud” backup to a tiered “From Folders to External, plus Cloud” — or maybe I’ll bite the bullet and, if it’s really this broken, just go and re-configure the Lightroom backup to use the tiered option as well.

But Acronis, consider cleaning up your support act. It’s 2020: you can’t expect your customers to send you all their information over unencrypted protocols, for safety’s sake!

Update 2020-06-30: the case is now being escalated to the “development and cloud department” — and if this is at all in the same ballpark as the companies I have worked for, it means that something is totally messed up in their datacenter connectivity and I’m the first one to notice enough to report it to them. We’ll see.

Windows Backup Solutions: trying out Acronis True Image Backup 2020

One of my computers is my Gamestation which, to be honest, has not run a game in many months now. It runs Windows out of necessity, but also because honestly sometimes I just need something that works out of the box. The main use of that computer nowadays is Lightroom and Photoshop for my photography hobby.

Because of the photography usage, backups are a huge concern to me (particularly after movers stole my previous gamestation), and so I have been using a Windows tool called FastGlacier to store a copy of most of the important stuff on Amazon’s Glacier service, in addition to letting Windows 10 do its FileHistory magic on an external hard drive. Not a cheap option, but (I thought) a safe and stable one. Unfortunately, the software appears to no longer be developed, and with one of the more recent Windows 10 updates it stopped working (and since I had set it up as a scheduled operation, it failed silently, which is the worst thing that can happen!)

My original plan for last week (at the time of writing) was to work on pictures, as I have shots from a trip over three years ago that I have still not wandered through, rather than working on reverse engineering. But when I noticed the missing backups, I decided to put that on hold until the backup problem was solved. The first problem was finding a backup solution that would actually work, and that wouldn’t cost an arm and a leg. The second problem was that, of course, most of the people I know are tinkerers who like Rube Goldberg solutions such as running rclone on Windows with the task scheduler (no thanks, that’s how I failed the Glacier backups).

I didn’t have particularly high requirements: I wanted a backup solution that would do both local and cloud backups, because Microsoft has been reducing the featureset of their FileHistory solution, and so relying on it feels a bit flaky. I also wanted the ability to store more than a couple of terabytes in the cloud (I have over 1TB of RAW shots!), even at a premium. I was not too picky on price, as I know features and storage are expensive. And I wanted something that would just work out of the box. A few review reads later, I found myself trying Acronis True Image Backup. A week later, I regret it.

I guess my best lesson learnt from this is that Daniel is right, and it’s not just about VPNs: most review sites seem to give higher scores to the software they get more money from via affiliate links (you’ll notice that in this blog post there won’t be any!). So while a number of sites had great words for Acronis’s software, I found it sufficiently lacking that I’m ranting about it here.

So what’s going on with the Acronis software? First of all, while it does support both “full image” and “selected folders” modes, you definitely need to be aware that the backup is not usable as-is: you need the software to recover the data. Which is why it comes with bootable media, “survival kits”, and similar amenities. This is not a huge deal to me, but it’s still a bit annoying, when FileHistory used to allow direct access to the files. It also locks you into accessing the backup with their software, although Acronis makes the restore option available even after you let your subscription expire, which is at least honest.

Then the next thing that became clear to me was that speed is not the strongest suit of Acronis’s cloud backup. The original estimate for backing up the 2.2TB of data I expected to back up was on the mark at nearly six days. To be fair to Acronis, the process went extremely smoothly: it never got caught up, looped, crashed, or slowed down. The estimate was very accurate, and indeed, running this for about 144 hours was enough to have all the data backed up. Their backup status also shows the average speed of the process, which matched my estimate while the backup was running: 50Mbps.

The speed is the first focus of my regret. 50Mbps is not terribly slow, and for most people this might be enough to saturate their Internet uplink. But not for me. At home, my line is provided by Hyperoptic: a 1Gbps connection that can sustain at least 900Mbps upload. So seeing the backup bottlenecked like this was more than a bit annoying. And as far as I can tell, there’s no documentation of this limit on the Acronis website at the time of writing.

When I complained on Twitter about this, it was mostly out of frustration at having to wait, and I considered the 50Mbps speed at least reasonable (although I would have considered paying a premium for faster uploads!). The replies I got from support got me more upset than before. Their Twitter support people insisted that the problem was with my ISP, and sent me to their knowledgebase article on using the “Acronis Cloud Connection Verification Tool” — except that following the instructions showed I was supposed to be using their “EU4” datacenter, for which there is no tool. I was then advised to file a ticket about it. Since then, I appear to have moved back to “EU3” — maybe EU4 was not ready yet.

The reply to the ticket was even more of an absurdist mess. Besides a lot of words to explain that “speed is not our fault, your ISP may be limiting your upload” (fair, but I had already noted to them that I knew that was not the case), one of the steps they request you to follow is to go to one of their speedtest apps — which returns a 504 error from nginx, oops! Oh yeah, and you need to upload the logs via FTP. In 2020. Maybe I should call up Foone to help. (Windows 10, as it happens, still supports FTP write access via File Explorer, but it’s not very discoverable.)

Both support people also kept reminding me that the backup is incremental, so after the first cloud backup, everything else should be a relatively small amount of data to copy. Except I’m not sold on that either: 128GB of data (which is the amount of pictures I came back from Budapest with) would take nearly six hours to back up.
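
For the back-of-the-envelope math, the estimate is simply size divided by throughput; here is a minimal sketch of the arithmetic (ignoring any protocol overhead, and treating the 50Mbps as the average the backup status reported):

```python
def transfer_hours(size_gb: float, speed_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) at speed_mbps (megabits/s)."""
    bits = size_gb * 1e9 * 8
    return bits / (speed_mbps * 1e6) / 3600

# the 128GB of Budapest pictures at the ~50Mbps I was seeing:
print(f"{transfer_hours(128, 50):.1f} hours")  # ~5.7 hours, i.e. nearly six
```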

When I finally managed to get a reply that was not straight from a support script, they told me to run the speedtest against a different datacenter, EU2. As it turns out, this is their “Germany” datacenter. This was very clear from tracerouting the IP addresses of the two hosts: EU3 is connected directly to LINX, while EU2 goes back to AMS, then FRA (Frankfurt). The speedtest came out fairly reasonable (around 250Mbps download, 220Mbps upload), so I shared the data they requested in the ticket… and then wondered.

Since you can’t change the datacenter you back up to once you’ve started a backup, I tried something different: I used their “Archive” feature, and tried to archive a multi-gigabyte file, but to their Germany datacenter rather than the United Kingdom one (against their recommendation to «select the country that is nearest to your current location»). Instead of a 50Mbps peak, I got a 90Mbps peak, with a sustained 67Mbps. Now, this is still not particularly impressive, but it would have cut the six days down to three, and the five hours to around two. And clearly it sounds like their EU3 datacenter is… not good.

Anyway, let’s move on and look at local backups, which Acronis is supposed to take care of by itself. For this one, at first I wanted to use the full-image backup, rather than selecting folders as I did for the cloud copy, since it would be much cheaper and I have a 9TB external hard drive anyway… and when you do that, Acronis also suggests you create what they call the “Acronis Survival Kit” — which basically means making the external hard drive bootable, so that you can start up and restore the image straight from it.

The first time I tried setting it up that way, it formatted the drive, but it didn’t even manage to get Windows to mount the new filesystem. I got an error message linking me to a knowledgebase article that… did not exist. This was more than a bit annoying, but I decided to run a full SMART check on the drive to be safe (no errors to be found), and then try again after a reboot. Then it finally seemed to work, but here’s where things got even more hairy.

You see, I’ve been wanting to use my 9TB external drive for the backup. A full image of my system was estimated at 2.6TB. But after the Acronis Survival Kit got created, the amount of space available for the backup on that disk was… 2TB. Why? It turned out that the Kit creation caused the disk to be repartitioned as MBR, rather than the more modern GPT, and with MBR you can’t have a (boot) partition bigger than 2TB. Which means that the creation of the Survival Kit silently shrank my available space to roughly a fifth of the disk!
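
The 2TB ceiling is not arbitrary: MBR partition entries store sector counts in 32-bit fields, so with the usual 512-byte logical sectors the addressable size tops out at 2TiB. A quick sanity check:

```python
SECTOR_SIZE = 512        # bytes, the usual logical sector size
MAX_SECTORS = 2 ** 32    # MBR stores start LBA and length as 32-bit values

max_bytes = SECTOR_SIZE * MAX_SECTORS
print(max_bytes)             # 2199023255552 bytes
print(max_bytes / 2 ** 40)   # 2.0 TiB, i.e. the "2TB" Acronis left me with
```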

The reply from Acronis on Twitter? According to them, my Windows 10 was started in “BIOS mode”. Except it wasn’t: it’s set up with UEFI and Secure Boot. And unfortunately there doesn’t seem to be an easy way to figure out why the Acronis software thinks otherwise. Worse than that, the knowledgebase article says that I should have gotten a warning, which I never did.

So what is it going to be at the end of the day? I tested the restore from Acronis Cloud, and it works fine. Acronis has been in business for many years, so I don’t expect them to disappear next year, and the likelihood of me losing access to these backups is fairly low. I think I may just stick with them for the time being, and hope that someone in the Acronis engineering or product management teams reads this feedback, thinks about that speed issue, and maybe starts considering asking support people to refrain from engaging with other engineers on Twitter with fairly ridiculous scripts.

But to paraphrase a recent video by Techmoan, these are the kind of imperfections (particularly the mis-detected “BIOS booting” and the phantom warning) that I could excuse in a £50 software package, but that are much harder to excuse in a £150/yr subscription!

Any suggestions for good alternatives would be welcome, particularly before next year, when I’ll reconsider whether this was good enough for me, or whether a new service is needed. Suggestions that involve scripts, NAS, rclone, task scheduling, or self-hosted software will be marked as spam.

Tarsnap and backup strategies

After a quite traumatic experience last November with a customer’s service running on one of the virtual servers I run, I made sure to have very thorough backups of all my systems. Unfortunately, they turned out to be a bit too thorough, so let me explore with you what was going on.

First of all, the software I use to run the backups is tarsnap — you might have heard of it or not, but it’s basically a very smart service: an open-source client, based upon libarchive, and a server system that stores the content (de-duplicated, compressed, and encrypted with a very flexible key system). The author is a FreeBSD developer, and he’s charging an insanely small amount of money.

But the most important thing to know when you use tarsnap is that you always just create a new archive: it doesn’t really matter what you changed, you just throw everything in, and it will automatically de-duplicate the content that didn’t change, so why bother being selective? My first, dumb method of backups, which is still running as of this time, is simply, every two hours, to dump a copy of the databases (one server runs PostgreSQL, the other MySQL — I no longer run MongoDB, though honestly I’m starting to wonder about it), and then use tarsnap to generate an archive of the whole of /etc, /var and a few more places where important stuff lives. The archive is named after the date and time of the snapshot. And I haven’t deleted any snapshot since I started, for most servers.
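
To give an idea, this is roughly the shape of that dumb two-hourly job (a sketch only: the paths, the dump step and the archive naming are illustrative, not my exact script):

```python
#!/usr/bin/env python3
"""Two-hourly cron job: dump the databases, then hand everything to tarsnap.
Nothing ever gets deleted afterwards."""
import subprocess
from datetime import datetime

DUMP_DIR = "/var/backups/db"   # hypothetical location the database dumps land in

# the pg_dump/mysqldump step writing into DUMP_DIR would go here (omitted)

stamp = datetime.now().strftime("%Y%m%d-%H%M")
subprocess.run(
    ["tarsnap", "-c", "-f", f"snapshot-{stamp}", "/etc", "/var", DUMP_DIR],
    check=True,
)
```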

It was a mistake.

The moment I went to recover the data out of earhart (the host that still hosts this blog, a customer’s app, and a couple more sites, like the assets for the blog and even Autotools Mythbuster — though all the static content, being managed in git, is now also mirrored and served active-active from another server called pasteur), the time it took to extract the backup was unsustainable. The reason was obvious once I thought about it: since it had been de-duplicating for almost a year, it had to scan hundreds if not thousands of archives to gather all the small bits and pieces.

I still haven’t replaced this backup system, which is very bad of me, especially since it takes a long time to delete the older archives even after extracting them. On the other hand, it’s also largely a matter of tradeoffs in expense, as going through all the older archives to remove the old crap drained my tarsnap credits quickly. Since the data is de-duplicated and encrypted, the archives’ data needs to be downloaded to be decrypted, before it can be deleted.

My next plan is to set it up so that the script executes in different modes, keeping 24 archives over 48 hours (every two hours), 14 over 14 days (daily), and 8 over two months (weekly). The problem is actually doing the rotation properly with a script, but I’ll probably publish a Puppet module to take care of it, since that’s the easiest way for me to make sure it executes as intended.
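
Something along these lines is what I have in mind; a rough sketch, assuming archives are named with a parseable timestamp like the snapshot-YYYYMMDD-HHMM pattern above (and keeping in mind that deleting archives costs download traffic, as noted above):

```python
#!/usr/bin/env python3
"""Rotation sketch: keep every archive for 48 hours, one per day for two weeks,
one per week for two months, and delete the rest."""
import subprocess
from datetime import datetime, timedelta

now = datetime.now()

def keep(ts: datetime) -> bool:
    age = now - ts
    if age <= timedelta(hours=48):
        return True                                 # all the two-hourly snapshots
    if age <= timedelta(days=14):
        return ts.hour == 0                         # the midnight one, daily
    if age <= timedelta(days=60):
        return ts.hour == 0 and ts.weekday() == 6   # Sunday midnight, weekly
    return False

archives = subprocess.run(["tarsnap", "--list-archives"],
                          capture_output=True, text=True, check=True)

for name in archives.stdout.splitlines():
    try:
        ts = datetime.strptime(name, "snapshot-%Y%m%d-%H%M")
    except ValueError:
        continue                # leave anything we didn't create alone
    if not keep(ts):
        subprocess.run(["tarsnap", "-d", "-f", name], check=True)
```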

The essence of this post is basically to warn you all that, no matter how cheap it is to keep around the whole set of backups since the start of time, it’s still a good idea to rotate them… especially for content that does not change that often! Think about it whenever you set up any kind of backup strategy…

Backing up cloud data? Help request.

I’m very fond of backups, after the long series of issues I had before I started doing incremental backups… I’ve still got some backup DVDs around, some of which are almost unreadable, and at least one that is a xar archive in a format that is no longer supported, especially on 64-bit.

Right now, my backups are all managed through rsnapshot, with a few custom scripts on top to make sure that if a host is not online, the previous backup is kept. This works almost perfectly, if you exclude the problems with restored files and the fact that a rename causes files to double, as rsnapshot does not apply any data de-duplication (and fdupes and the like tend to be… a bit too slow to use on 922GB of data).

But there is one problem that rsnapshot does not really solve: backup of cloud data!

Don’t get me wrong: I do back up the (three) remote servers just fine, but that does not cover the data that lives in remote, “cloud” storage, such as the GitHub, Gitorious and BitBucket repositories, or delicious bookmarks, GMail messages, and so on and so forth.

Cloning the bare repositories and backing those up is relatively trivial: it’s a simple script to write. The problem starts with the less “programmatic” services, such as the aforementioned bookmarks and messages. Especially with GMail, as copying the whole 3GB of data from the server each time is unlikely to work well; it has to be done properly.
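
For the repositories, the “simple script” boils down to keeping bare mirrors around and refreshing them; a minimal sketch (the repository list and target directory are purely illustrative, and in practice the list would come from the hosting services):

```python
#!/usr/bin/env python3
"""Keep bare mirror clones of remote repositories up to date for backup purposes."""
import os
import subprocess

REPOS = [
    "https://github.com/example/project.git",   # illustrative entry
]
MIRROR_DIR = os.path.expanduser("~/backups/git")

for url in REPOS:
    name = url.rstrip("/").rsplit("/", 1)[-1]
    target = os.path.join(MIRROR_DIR, name)
    if os.path.isdir(target):
        # a --mirror clone tracks all refs, so updating the remote refreshes everything
        subprocess.run(["git", "--git-dir", target, "remote", "update", "--prune"],
                       check=True)
    else:
        os.makedirs(MIRROR_DIR, exist_ok=True)
        subprocess.run(["git", "clone", "--mirror", url, target], check=True)
```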

Has anybody got any pointers on the matter? Maybe there’s already a smart backup script, similar to tante’s smart pruning script, that can take care of copying the messages via IMAP, for instance…

Tightening security

I’m not sure why it is that I started being so paranoid about security; I’ve been changing quite a few things in my workflow lately, and even though I have kept my systems and network decently secure, I’m now going one step further.

Besides working on getting Kerberos-strengthened NFS working (and I would be trying libvirt too, if it wasn’t for gtk-vnc and the rest of that mess), I’m now considering something to strengthen the security of this laptop. Given what I’ve seen with pam_mount, it would also make sense to get that improved, fixed, maybe even integrated with pambase, as usual for me.

But besides not actually having an idea of how to configure that yet, it also made me think about the use case for it. Let’s say I actually encrypt the whole partition (I know there are a few options that avoid encrypting an entire partition, but since last I checked they all require patched kernels, I’d like to stay well away from those); it gets mounted when I log in (on GDM), and up to there it’s okay, but what happens when I close the laptop into suspend? It wouldn’t get unmounted, because all the processes are still there. And if somebody can get a new login at that point, well, you’re screwed, because the other sessions can see the mounted partition as well.

One option I can think of is an old friend of mine: pam_namespace. This module allows you to “split” the mount namespace of user login sessions at PAM login; placed before pam_mount, it would make the partition appear mounted only to the processes descending from the one that invoked the module. What this can actually achieve is that even if somebody has the root password and creates a new session with those credentials… the partition will not appear to be mounted at all. Cool, but pam_namespace breaks a bunch of things, such as HAL. It was almost exactly one year ago that I wrote about that.

Another option is to simply log out before suspending the laptop; this should also fix the graphics card reset problems: shut down X before suspending, reopen it with a new login afterwards. It takes a bit longer to reopen everything, of course, but that’s not the main problem — it wouldn’t be a problem at all if software actually restarted as intended, like if Gnome actually restored the session, Chromium tabs included.

Unfortunately, I have one good reason to think there is some trouble with this idea. With one of the first incarnations of the tinderbox I found out that it left some stray processes behind; and that was just from running a console-only chroot, as root, without any desktop software running. I’m quite sure that at least the GnuPG and SSH agents are kept running at the end of the session. Such stray processes would still make it impossible to unmount the partition.

Finally, the last remaining solution is to turn off the whole system, but as you probably already know, a cold start takes quite some time to get everything back up properly.

What options are there for these situations? Anybody have suggestions? I wouldn’t mind even just using an encrypted directory tree, mounted via FUSE, and encrypting with GnuPG (and thus, with my FSFe Fellowship Smartcard).

A similar task, lower-priority but maybe of even more long-term importance, is encrypting my backup data; in this case, I cannot be there to input the password over and over again, so I have to find a different solution. One thing I thought of is to make (sane) use of the “software protection hardware keys” that I remember from computer magazines of the late ’90s. There is actually a manufacturer of those not far from where I live, in the same city as my sister; I wouldn’t mind buying a sample USB key from them, and if they give out the details for communicating with it, implementing an open source library to talk to it and seeing whether I can use it as the encryption key for a whole partition.

At any rate, any suggestion so that I don’t have to reinvent, or at least redocument, the wheel, is as usual very welcome.

GMail backup, got any ideas?

I’ve been using GMail as my main email system for quite a few years now, and it works quite well; most of the time, at least. While there are some possible privacy concerns, I don’t really get most of them (really, any other company hosting my mail would raise similar concerns; I don’t have the time to manage my own server, and the end result is: if you don’t want anybody but me to read your mail, encrypt it!). Most of my problems with GMail are pretty technical.

For instance, for a while I struggled with cleaning up old cruft from the server: removing old mailing list messages, old promotional messages, or the announcements coming from services like Facebook, WordPress.com and similar. Thankfully, Jürgen wrote a script for me that takes care of the task. Now you might wonder why there is a need for a custom script to handle deletion of mail on GMail, given it uses the IMAP protocol… the answer is that even though it exposes an IMAP interface, the way the messages are stored makes it impossible to use the standard tools. The usual deletion scripts you can find for IMAP mailboxes set the deleted flag and then expunge the folder… but on GMail that just archives the messages; you have to move them to the Trash folder… whose path depends on the language you set in GMail’s interface.
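
To make the quirk concrete, here is a minimal sketch of what this kind of script has to do differently on GMail: copy the messages to the Trash folder instead of relying on the deleted flag alone. The label, the date, the credentials and the “[Gmail]/Trash” name are assumptions (the Trash path really does change with the interface language):

```python
#!/usr/bin/env python3
"""Really delete old messages from a GMail label: copy them to Trash, then expunge
the label. Flagging them \\Deleted and expunging alone would just archive them."""
import imaplib

TRASH = "[Gmail]/Trash"      # assumes an English-language account

imap = imaplib.IMAP4_SSL("imap.gmail.com")
imap.login("user@gmail.com", "password")         # placeholder credentials
imap.select("Old-Lists")                         # the label to clean up (illustrative)

typ, data = imap.search(None, "BEFORE", "01-Jan-2010")   # anything older than this date
for num in data[0].split():
    imap.copy(num, TRASH)                        # on GMail, this is what really deletes
    imap.store(num, "+FLAGS", "\\Deleted")       # then drop it from the current label
imap.expunge()
imap.logout()
```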

Now the same applies to the problem of backups: while I trust GMail will be available for a long time, I don’t like having no control whatsoever over the backup of my mail. I mean a complete backup. A backup of the almost 3GB of mail that GMail is currently storing for me. Using standard IMAP backup software would probably require something like 6GB of storage, and of transfer as well! The problem is that all the content available under the “Labels” (which are most definitely not folders) is duplicated in the “All Mail” folder.

A properly GMail-aware tool would fetch the content of the messages only from the “All Mail” folder, and then just use the message IDs to file the messages under the correct labels. An even better piece of software would allow me to convert the backed-up mess into something that can be served properly by an email client or an IMAP server, using Maildir structures.
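
GMail’s IMAP extensions actually make this feasible: each message in “All Mail” carries a stable X-GM-MSGID and its X-GM-LABELS, so a backup tool could download each body once and file it locally under every label. A rough sketch of the metadata side (folder name and credentials are placeholders, and a real tool would of course fetch bodies incrementally):

```python
#!/usr/bin/env python3
"""Walk "All Mail" and read GMail's per-message id and label list, so each body only
needs to be downloaded once and can then be filed under all its labels locally."""
import imaplib

imap = imaplib.IMAP4_SSL("imap.gmail.com")
imap.login("user@gmail.com", "password")            # placeholder credentials
imap.select('"[Gmail]/All Mail"', readonly=True)    # name assumes an English account

typ, data = imap.search(None, "ALL")
for num in data[0].split():
    typ, meta = imap.fetch(num, "(X-GM-MSGID X-GM-LABELS)")
    print(meta[0])   # the stable message id plus the labels attached to it
    # the body (RFC822) would be fetched only if this msgid is not already archived
imap.logout()
```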

It goes without saying that it should work incrementally: if I run it daily, I don’t want it to fetch 3GB of data every time; I can bear with it the first time, but that’s about it. And it should also rate-limit itself to avoid hitting the GMail lockdown for possible abuse!

As far as I can see, there is no software that does this, and I most definitely have no time to work on it… does anybody feel like writing it, or like finding me the software I’m looking for? No, in this case I most definitely don’t intend to use proprietary software, no matter how handy it is: it’s going to handle very sensitive information, like my GMail password, and that’s not something I’d hand over to software I can’t look at the sources of.

Health, accounting and backups

For those who said that I have anger management issues regarding last week’s post, I’d like to point out that what I had was actually a nervous breakdown, not strictly (but partly) related to Gentoo.

Since work, personal life, Gentoo and (the last straw) taxes all piled up this week, I ended up having to take a break from a lot of stuff; this included putting all kinds of work on hold for the week, and actually spending most of my time making sure I have proper accounting, both for my freelancing activity and for home expenses (this is getting particularly important because I’m almost living alone – even if I technically am not – and thus I have to make sure that everything fits into the budget). Thankfully, GnuCash provides almost all the features I need. I ended up entering all the accounting information I had available, dating back to January 1st 2009 (my credit card company’s customer service site hasn’t worked in the past two weeks — since it’s a subsidiary of my own bank, I was able to get the most recent statements through them, but not the full archive of statements since the cards were issued, which is a problem for me), and trying to get some data out of it.

Unfortunately, it seems that while GnuCash already provides a number of reports, it does not have the kind of reports I need, such as “How much money did the invoices from 2009 amount to?” (which is important for me to make sure I don’t go over the limit I’m given), or “How much money did I waste on credit card interest?”… I’ll have to check the documentation and learn whether I can build some customised reports that produce the kind of data I need. And maybe there’s a way to set the payment terms I have with a client of mine (30 days from the end of the month the invoice was issued in… which means that if I issue an invoice tomorrow, I’ll be paid on May 1st).

On a different note, picking up from Klausman’s post, I decided to also fix up my backup system, which used to be based on one-off snapshots of the system on external disks and USB sticks, and moved to a single rsnapshot setup that backs everything up to one external disk: the local system, the router, the iMac, the two remote servers, and so on. This worked out fine when I tried the previous eSATA controller again, but unfortunately it failed once more (d’oh!), so I fell back to FireWire 400, which is way too slow for rsnapshot to do a full backup hourly. I’m thus trying to find a new setup for the external disk. I’m unsure whether to look for a FireWire 800 card or a new eSATA controller. I’m not sure about Linux’s support for the former, though; I know that FireWire used to be not too well maintained, so I’m afraid it might just fall back to FireWire 400 speeds, which would be pointless. I’m not sure about eSATA because I’m afraid it might not be the controller’s fault, but rather a problem with the (three different kinds of) disks or the cables; and if the problem is in the controller, I’m afraid it’s about the chip on it: the one I have here is a JMicron-based controller, but with a memory chip that is not flashable with the JMicron-provided ROM (and I think there might be a fix in there for my problem) nor with flashrom as it is now.

So if you have an idea to suggest about this, I’d be happy to hear it; right now the only possibly interesting (price/feature-wise) card I found is from Alternate’s business-to-business shop, the “HighPoint RocketRAID 1742”, which is PCI-based (I have a free PCI slot right now, and if need be I can move it to a different box that has no PCI-E) and costs around €100. I’m not sure about driver support for it, though, so if somebody has experience with it, please let me know. Interestingly enough, my two main suppliers in Italy seem not to have any eSATA card at all, and of course high-grade, dependable controllers aren’t found at the nearest Saturn or Mediamarkt (actually, Mediaworld here, but it’s the very same thing).

Anyway, after this post I’m finally back to work on my job.

Stash your cache away

While I’m spending this week away from home (I’m at my sister’s family’s place, while she’s at the beach), I’ll still be working, writing blog posts, and maybe taking care of some smaller issues in Gentoo. I’m just a bit hindered because while I type on the keyboard I often click something away with the trackpad; I didn’t think of bringing a standalone keyboard. I guess if somebody wanted to send an Apple Bluetooth keyboard my way, I wouldn’t say no.

While finally setting up a weekly backup of my /home directory yesterday, I noticed quite a few issues with the way software makes use of it. The first thing, of course, was to find the right software for the job; I opted for a simple rsync in cron: after all, I don’t care much about having multiple incremental backups à la Time Machine, and having a single weekly copy of my basic data is good enough.

The second problem was that, some time ago, I found that a 4GB USB flash drive was enough to hold a copy of my home directory, but when I looked at it yesterday, I found it to be well over 5GB. How did that happen? Some baobab later, I found the culprits. On one side, my medical records (over 500 pages), scanned with a high-grade all-in-one laser printer (no, not by me at home), are too big. They might have been scanned as colour documents (they are photocopies, so that’s not really right) or at a huge resolution; I have to check, since having over half a gig of just printed records is a bit too much for me (I also have another full CD of CT scan data).

The second problem is that a lot of software misuses my home directory by writing cache and temporary files into it rather than into the proper locations. Let me explain: if you need to create a temporary file or a socket to communicate between different pieces of software on the same host, rather than writing it to my home, you should probably use TMPDIR (as a lot of software, fortunately, does). The same goes if you write cache data, and yes, I’m referring to you, Evolution and Firefox, but also to Adobe Flash, Sun JDK and IcedTea.

Indeed, the FreeDesktop specifications already provide an XDG_CACHE_HOME variable that can be used to change the place where cache data gets saved, defaulting to ~/.cache, and on my system set to /var/cache/users/flame. This way, all the (greedy) cache systems would be able to write as much data as they want, without either wasting space on my backup flash drive or forcing me to write the data to two disks (/var/cache is on a sort-of throwaway disk).
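
Honouring the spec is genuinely trivial for applications; this is all the lookup amounts to (a sketch, with the application name made up for the example):

```python
import os

def cache_dir(app_name: str) -> str:
    """Resolve the cache directory per the XDG Base Directory spec:
    use $XDG_CACHE_HOME if set, otherwise fall back to ~/.cache."""
    base = os.environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    path = os.path.join(base, app_name)
    os.makedirs(path, exist_ok=True)
    return path

# a browser, for instance, would keep its cache under cache_dir("mybrowser")
```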

For now I worked around it by making some symlinks, hoping they stay stable, and by creating a ~/.backup-ignore file, akin to .gitignore, with the paths to the stuff that I don’t want backed up. The only real problem I have is with Evolution, because it has so many subdirectories that I can’t really tell what I should back up and what not.
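
The weekly cron job itself then amounts to little more than this (a sketch; the destination is illustrative, and note that the ignore file is handed straight to rsync’s --exclude-from, so the patterns follow rsync’s syntax rather than git’s):

```python
#!/usr/bin/env python3
"""Weekly home backup: rsync $HOME to the flash drive, skipping whatever is listed
in ~/.backup-ignore."""
import os
import subprocess

HOME = os.path.expanduser("~")
DEST = "/media/backup-flash/home/"     # illustrative mount point

subprocess.run(
    ["rsync", "-a", "--delete",
     "--exclude-from", os.path.join(HOME, ".backup-ignore"),
     HOME + "/", DEST],
    check=True,
)
```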

Oh, and there are a few more problems there: a lot of software over the past two years migrated from the top level of the home directory to ~/.config, but the old files were kept around (nautilus is an example), and a few directories contained very, very old and dusty session data that was never cleaned up properly.

Providing too many configuration options to tell where stuff goes can definitely lead to bad problems, but using the right environment variable to decide where stuff should be written and looked up can definitely solve lots of them!

Questing for the guide

I was playing some Oblivion while on the phone with a friend, when something came to my mind related to my recent idea of an autotools guide. The idea came up by mixing Oblivion with something that Jürgen was saying this evening.

In the game you can acquire the most important magical items in four ways: you can find them around (rarely), you can build them yourself (by hunting creatures’ souls), you can pay for them with gold, or you can get them during quests. The last are usually the most powerful, but that’s not always true. At any rate, the “gold” option is rarely the one used, because gold is a somewhat scarce resource. You might start to wonder what this has to do with the autotools guide that I made public yesterday, but you might also have already seen where I’m going.

Since I’m the first to know that money, especially lately, is a scarce resource, and since I, for one, am the kind of person who’s glad to repay a favour with an effort whose market value is three or four times whatever money I could afford to part with, it would be reasonable for me to provide a way of “payment” through technical skills and effort.

So here is my alternative proposal: if you can get me a piece of code that I failed to find and don’t have time to write, release it under a FOSS license (GPLv2+ is strongly suggested; compatibility with the GPL is very important in any case), and maintain it until it’s almost “perfect”, I’ll exchange that for a comparable effort in extending the guide.

I’ll post these “quests” on the blog from time to time so you can see them and decide whether you think you can complete them; I’ll have to find a way to index them, though; for now it’s just a proposal, so I don’t think I need to do that right away. But I can drop two ideas here, if somebody has time and is willing to work on them; both of them relate to IMAP and e-mail messages, so you’ve been warned. I’m also quite picky when it comes to requirements.

The first is what Jürgen was looking at earlier: I need a way to delete the old messages from certain GMail labels every day. The idea is that I’d like to use GMail for my mailing list needs (so I have my messages always with me, and so on), but since keeping the whole archive is both pointless (there is gmane, Google Groups, and the respective archives) and expensive (in terms of space used in the GMail IMAP account and of bandwidth needed to sync “All Mail” via UMTS), I’d like to always keep just the last three weeks of messages. What I need, though, is something slightly more elaborate than just deleting the old messages. It has to be a script that I can run from a local cron job and that connects to the IMAP server. It has to delete the messages completely from GMail, which means dropping them into the Trash folder (just deleting them is not enough, that only removes the label) and emptying that too; it also has to be configurable on a per-label basis for how long to keep messages (I would empty the label with the release notifications every week rather than every three weeks), and hopefully it should be able to keep unread messages longer and treat flagged messages as protected. I don’t care much about the implementation language, but I’d frown upon anything “exotic” like OCaml, Smalltalk and similar, since it would require me to install their environment. Perl, Python and Ruby are all fine, and Java is too, since the thing would run just once a day and starting the JVM for that is not much of a slowdown. No X connection though.
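
Just to make the per-label part of the request concrete, here is a sketch of how the retention policy could be expressed and applied (labels, retention periods, Trash name and credentials are all illustrative assumptions; the copy-to-Trash step is the GMail-specific way of really deleting, and emptying the Trash itself is left out):

```python
#!/usr/bin/env python3
"""Per-label retention: each label keeps messages for its own number of days;
unread and flagged messages are left alone; expired ones are moved to Trash."""
import imaplib
from datetime import date, timedelta

RETENTION_DAYS = {"Lists": 21, "Release-Notifications": 7}   # per-label policy
TRASH = "[Gmail]/Trash"                                       # English-language account

imap = imaplib.IMAP4_SSL("imap.gmail.com")
imap.login("user@gmail.com", "password")                      # placeholder credentials

for label, days in RETENTION_DAYS.items():
    imap.select(label)
    cutoff = (date.today() - timedelta(days=days)).strftime("%d-%b-%Y")
    # only messages that are old, already read, and not flagged are up for deletion
    typ, data = imap.search(None, "SEEN", "UNFLAGGED", "BEFORE", cutoff)
    for num in data[0].split():
        imap.copy(num, TRASH)
        imap.store(num, "+FLAGS", "\\Deleted")
    imap.expunge()

imap.logout()
```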

The second is slightly simpler and could be coupled with the first: I send my database backups from the server to my GMail address, encrypted with GPG, compressed with bzip2, and then split into message-sized chunks. I need a way to download all the messages and reassemble the backups, once a week, and store them on a flash card, using tar directly on it as if it were a tape (not needing a filesystem should reduce the erase count). The email messages have the number of the chunk, the series of the backup (typo or bugzilla) and the date of the backup all encoded in the subject. More points if it can do something like Apple’s Time Machine and keep a backup per day for a week, one per week for a month (or two), and then one per month up to two years.

So if somebody has the skill to complete these tasks and would be interested in seeing the guide expanded, well, just go for it!

My take on compression algorithms

(Photo: Biancospino - Hawthorn)

I just read Klausman’s entry comparing compression algorithms, and while I’m no expert at all in the field, I wanted to talk a bit about it myself, from a power user’s point of view.

Tobias’s benchmarks are quite interesting, although quite similar in nature to many others you can find out there comparing lzma to gzip and bzip2. One thing I found nice is that he makes explicit that lzma is good when you decompress more often than you compress. This is something a lot of people tend to skip over, leading to some quite catastrophic (in my view) results.

Keeping this in mind, you can see that lzma is not really good when you compress as often as (or more often than) you decompress. When that happens is the central point here. You certainly expect a backup system to compress a lot more than it decompresses, as you want to take daily (or more frequent) backups, while hoping never to need to restore one of them. For Gentoo users, another place where they compress more than they decompress is man pages and documentation: they are compressed every time you merge something, but you don’t read all the man pages and all the documentation every day. I’m sure most users never read most of the documentation that gets compressed and installed. Additionally, lzma does not seem to perform as well on smaller files, so I don’t think it’s worth the extra time needed to compress the data.

One thing that Tobias’s benchmark has in common with the other lzma benchmarks I’ve seen is that it doesn’t take memory usage much into consideration. Alas, valgrind removed the massif graphs that gave you the exact memory footprint of a process; it would have been quite interesting to see them. I’d expect lzma to use a lot more memory than bzip2 to be so quick at decompression. This would make it particularly bad on older systems and in embedded use cases, where one might be interested in saving flash (or disk) space.

As for GNU’s choice of not providing bzip2 files anymore, and just providing gzip- or lzma-compressed tarballs, I’m afraid the choice has been political as much as technical, if not more. Both zlib (for gzip) and bzip2 (with its libbz2) have very permissive licenses, and that makes them ideal even for proprietary software, or for free software with, in turn, permissive licenses like the BSD license. lzma-utils is still free software, but with a more restrictive license, the LGPL-2.1.

While the LGPL still allows proprietary software to link the library dynamically, it is more restrictive, and will likely turn away some proprietary software developers. I suppose this is what the GNU project wants anyway, but I still find it a political choice, not a technical one. It also has an effect on users, as one has to either use the bigger gzip version or also install lzma-utils to be able to prepare a GNU-like environment on a proprietary system, like for instance Solaris.

I’m sincerely not convinced by lzma myself. It takes way too much time during compression to be useful for backups, which are my main compression task, and I’m uncertain about its memory use. The fact that bsdtar doesn’t support it directly yet is also a bit of a turn-off for me, as I’ve grown used to not needing three processes to extract a tarball. Doug’s concerns about the on-disk format also make it unlikely that I’ll start using it.

Sincerely, I’m more concerned with the age of tar itself. While there are ways to bolt onto tar things it wasn’t originally designed for, the fact that to change an archive you have to fully decompress it and then re-compress it makes it pretty much impossible to use as a desktop compression method the way the rar, zip and ace formats (and 7z, somewhat; as far as I can see you cannot remove a file from a 7z archive) are used on Windows. I always found it strange that the only widespread archive format supporting Unix information (permissions, symlinks and so on) is the one that was designed for magnetic tapes and is thus sequential by nature…

Well, being sequential probably makes it more interesting for backing up to a flash card (and I should be doing that, by the way), but I don’t see it as very useful for compressing a bunch of random files with data in them… Okay, one of the most common use cases for desktop compression has been compressing Microsoft Office’s hugely bloated files, and both OpenOffice and, as far as I know, newer Office versions use zip files to put their XML into, but I can still see a couple of things I could use a desktop compression tool for from time to time…