Today I restored data from the hard disk of a friend's previous laptop, which burned out (almost literally) a few months ago. In these cases my usual first task is to back up the whole content of the drive itself, so that even if I screw up with the data there is no data loss on the original drive. Having more than 2TB addressable on this workstation certainly helps.
Now, I’m really not sure whether the copy operation was network-bound or disk-bound, and if it was the disk, whether the bottleneck was the reading or the writing; either way it certainly wasn’t blazing fast, around an hour to copy 120GB. Sure, that averages out to 2GB/min so it’s not really slow either, but I still see room for improvement.
The host reading the data (Enterprise living again) was booted from SysRescueCD, so its CPU was doing nothing else; it certainly wasn’t a problem of CPU power. Yamato was building at the time, but it wasn’t CPU-bound either, that much seemed obvious. The source was a 2.5” SATA drive, while the destination was an ext4 partition on LVM, on one of my SATA-II disks. The medium was the classic nc over a Gigabit Ethernet point-to-point connection (I borrowed the connection I normally use for the laptop’s iSCSI to back it up with Time Machine).
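For reference, the transfer itself is nothing fancy; a sketch of the kind of pipeline involved (the port number, address, device and paths here are made up for illustration, not the ones I actually used):

```shell
# Receiving side: listen on a TCP port and dump the raw stream to an
# image file. The -l -p syntax is the traditional netcat's; the OpenBSD
# variant wants just "nc -l 6666".
nc -l -p 6666 > /mnt/backup/laptop-disk.img

# Sending side (booted from SysRescueCD): read the whole drive and push
# it over the point-to-point link, byte for byte, unmodified.
dd if=/dev/sda bs=1M | nc 192.168.2.1 6666
```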
Now, I admit this is not something I do that often, but I do it from time to time nonetheless, so I think spending a couple of hours thinking about a solution could help, especially since I can see similar situations happening in the future where I’m much tighter on time. The one thing that seemed obvious to me was the lack of a compression layer: since I wasn’t using a tunnel of any sort, netcat was sending the data over the TCP connection exactly as it was read from the disk, without any kind of compression.
While data deduplication, as was suggested to me on a previous related post, would be cool, it might be a bit of overkill in this situation, especially considering that Partimage’s support for NTFS is still experimental (and as you might guess I tend to have this kind of need with non-Linux systems, mostly Windows and much more rarely Mac OS X). But at least being able to compress the vast runs of zeros on the disk would have been a major gain.
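As a quick sanity check of that intuition: a long run of zeros deflates to a tiny fraction of its size, so even naive compression pays off on a mostly-empty disk (exact numbers depend on the gzip version and level):

```shell
# 10 MiB of zeros shrink to roughly 10 KiB at gzip's default level,
# since DEFLATE tops out at around a 1030:1 ratio on constant input.
head -c 10485760 /dev/zero | gzip -c | wc -c
```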
I guess that even just hooking up a zlib filter could have reduced the amount of traffic, but I wonder why there isn’t a simple option to handle the compression/decompression transparently rather than via a pipe filter. While filters are cool when you have to do advanced mojo on your inputs/outputs, tar has shown how much easier it is to just have an option for that. I guess I’ll have to investigate alternatives to netcat, although I’d like it if the alternative would just let me use a gzip filter to handle the stream with a standard, non-smart netcat on the other side.
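The pipe-filter workaround is simple enough; a sketch, again with made-up port, address and device names, using a fast compression level since the goal is mostly to squeeze the empty areas rather than to compress hard:

```shell
# Receiving side: decompress the stream before writing the image.
nc -l -p 6666 | gunzip -c > /mnt/backup/laptop-disk.img

# Sending side: compress before the socket; -1 keeps the CPU cost low
# while still collapsing the large zeroed regions almost entirely.
dd if=/dev/sda bs=1M | gzip -c -1 | nc 192.168.2.1 6666
```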
I guess I’ll try my filter idea the next time I use the network to transfer a hard disk image; for now, the next image I have to take care of is my mother’s iBook, which supports IEEE1394 (FireWire) target mode, and that is so nice to have.