A smarter netcat?

Today I restored data from the hard disk of the previous laptop of a friend of mine, who had her laptop burn out (almost literally) a few months ago. To do so, my usual first task is to back up the whole content of the drive itself, so that even if I screw up with the data there is no data loss on the original drive. Having more than 2TB addressable on this workstation certainly helps.

Now, I’m really not sure whether the copy operation was network-bound or disk-bound, and if it was disk whether it was the reading or the writing, but it certainly wasn’t exactly blazing fast, around an hour to copy 120GB. Sure it’s an average of 2GB/min so it’s not really slow either, but still I see room for improvement.

The host reading the data (Enterprise living again) booted from SysRescueCD, so the CPU was not doing anything else but that, so certainly it’s not a problem with CPU power; Yamato was building, but still it wasn’t bound on the CPU, that seemed obvious. The source was a 2.5” SATA drive, while the destination was an ext4 partition in LVM, on one of my SATA-II disks. The medium, classic nc over Gigabit Ethernet point-to-point connection (I borrowed the connection I use for the laptop’s iSCSI to back it up with Time Machine).

Now, I admit this is not something I do that often, but it’s something I do from time to time nonetheless, so I think that spending a couple of hours thinking of a solution could help, especially since I can see similar situations happening in the future where I’m much tighter with time. The one thing that seemed obvious to me was a lack of a compression layer. Since I wasn’t using a tunnel of any sort, netcat was sending the data through the TCP connection as it was read from the disk, without any kind of compression.

While data deduplication, as was suggested to me on a previous related post would be cool, it might be a bit of overkill in this situation, especially considering that Partimage support for NTFS is still experimental (and as you might guess I tend to have this kind of needs with non-Linux systems, mostly Windows and much more rarely Mac OS X). But at least being able to compress the vast areas of “zeros” in the disk would have had a major gain.

I guess that even just hooking up a zlib filter could have reduced the amount of traffic, but I wonder why isn’t there a simple option to handle the compression/decompression transparently rather than via a pipe filter. While filters are cool when you have to do advanced mojo on your inputs/outputs, tar has shown how it’s much easier to just have an option to handle that. I guess I’ll have to investigate alternatives to netcat, although I’d like it if the alternative would just allow me to use a gzip filter to handle the stream with a standard non-smart nc command.

I guess I’ll try my filter idea next time I use the network to transfer an hard disk image, for now the next image I have to take care of is of my mother’s iBook, which supports the IEEE1394 (Firewire) target mode, which is so nice to have.

Locales, NLS, kernels and filesystems

One issue that is yet to be solved easily by most distribution (at least those not featuring extensive graphical configuration utilities, like Fedora and Ubuntu do), is most likely localisation.

There is an interesting blog post by Wouter Verhelst on Planet Debian that talks about setting locales variables. It’s a very interesting reading, as it clarifies pretty well the different variables.

One related issue seems to be understanding the meaning of the NLS settings that are available in the kernel configuration for Linux. Some users seem to think that you have to enable the codepages in there to be able to use a certain locale as system locale.

This is not the case, the NLS settings in there are basically only used for filesystems, and in particular only VFAT and NTFS filesystems. The reason of this lies in the fact that both filesystems are case-insensitive.

In usual Unix filesystems, like UFS, EXT2/3/4, XFS, JFS, ReiserFS and so on, file names are case sensitive, and they end up being just a string of arbitrary characters. On VFAT and NTFS, instead, the filenames are case *in*sensitive.

For case sensitivity, you need equivalence tables, and those are defined by different NLS values. For instance, for Western locale, the character ‘i’ and ‘I’ are equivalent, but in Turkish, they are not, as ‘i’ pairs with ‘İ’ and ‘I’ with ‘ı’ (if you wish to get more information about this, I’d refer you to Michael S. Kaplan’s blog on the subject).

So when you need to support VFAT or NTFS, you need to support the right NLS table, or your filesystem will end up corrupted (on Turkish charset, you can have two files called “FAIL” and “fail” as the two letters are not just the same). This is the reason why you find the NLS settings in the filesystems section.

Of course, one could say that HFS+ used by MacOS is also case-insensitive, so NLS settings should apply to that too, no? Well, no. I admit I don’t know much about historical HFS+ filesystems, as I only started using MacOS from version 10.3, but at least since then, the filenames are saved encoded in UTF-8, which has very well defined equivalence tables. So there is no need for option selections, the equivalence table is defined as part of the filesystem itself.

Knowing this, why VFAT does not work properly with UTF-8, as stated by the kernel when you mount it as iocharset=utf-8? The problem is that VFAT works on a per-character equivalence basis, and UTF-8 is a variable-size encoding, which does not suit well VFAT.

Unfortunately, make oldconfig and make xconfig seem to replace, at least on my system, the default charset with UTF-8 every time, maybe because UTF-8 is the system encoding I’m using. I guess I should look up to see if it’s worth to report a bug about this, or if I can fix it myself.

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.