As I wrote before, I’m currently converting my video archive to a format compatible with iTunes, AppleTV and PlayStation3. To do so, I’m archiving it directly on my OSX-compatible partitions, so that iTunes can access it. I’m actually thinking about moving it to an NFS-shared XFS partition, but for now it’s just HFS+.
To do so, though, I had to repartition the external 1TB hard disk to give more space to the iTunes Library partition. For a series of reasons, I had to back up the contents of the disk first, and now I’m restoring the data. To do the backup, I used the DMG image format that is native to OSX.
The nice thing about DMG is that it transparently handles encrypted images, compressed images, read-only images and read-write images. The compressed image format, which is what I used, cut the space used by the partition in half.
Now of course I know DMG is just a glorified raw image format, and that there are a lot of alternatives for Linux, but this made me think a bit about what the average user knows about disk and partition images.
I know lots of people who think that a raw image taken with dd(1) is good enough to be stored and used for every purpose. I don’t agree with that, but I can see why it is hard to understand that a raw image taken with dd is not just “good enough”.
The problem is that a lot of users ignore the fact that a partition holds more than the actual files’ data: there is space occupied by the filesystem metadata and the journal, and of course the unused space is not always contiguous and not always zeroed out.
Let’s take a common and easy example that a lot of users who had to use Windows at one time will have no problem understanding: the FAT32 filesystem.
FAT filesystems tend to fragment a lot. Fragmentation does not only mean that files are scattered around the disk, but also that the free space is scattered between the fragments of files. Also, when you delete a file, its content is not removed from the disk, as you can guess if you ever used undelete-like tools to restore files that were deleted.
When you compress a file with tools like bzip2 or similar, the fact that it’s fragmented on disk does not get in the way of the compression: if the file contains the same data repeated over and over it will be compressed quite easily. When you compress a raw image of the partition instead, the same does not apply if the file is fragmented, because its fragments are interleaved with other data.
The fact that the free space is not zeroed out can cause lots of problems too: even on a perfectly defragmented partition with 40GB of unused space, the compressor cannot describe that space as “40GB of 0s”… unless.
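To put a rough number on this, here is a minimal sketch that compresses two stand-ins for the same amount of free space with Python’s bz2 module. The 4 MiB size and the use of random bytes to model leftovers from deleted files are my own assumptions; real leftovers usually compress somewhat better than random data, so take this as the worst case.

```python
import bz2
import os

MIB = 1024 * 1024


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after bzip2 compression."""
    return len(bz2.compress(data))


# Two stand-ins for 4 MiB of unused space on a partition.
# Assumption: random bytes model leftovers of deleted files (worst case).
zeroed_free_space = bytes(4 * MIB)
dirty_free_space = os.urandom(4 * MIB)

zeroed_size = compressed_size(zeroed_free_space)
dirty_size = compressed_size(dirty_free_space)

print(f"zeroed free space: {zeroed_size} bytes after bzip2")
print(f"dirty free space:  {dirty_size} bytes after bzip2")
```

The zeroed buffer shrinks to almost nothing, while the dirty one stays at about its original size, even though both hold no useful data at all.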
Unless the compression algorithm knows about the filesystem. If, when you take an image of the partition, you use a tool that knows about the format of the partition, it starts to be much more useful. For FAT, that would mean that a tool could just move all the files’ data to the start of the partition, put all the unused space at the end, and consider it empty, zeroed out. The result is no longer a 1:1 copy, but it’s probably good enough.
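A simplified sketch of that idea, under loud assumptions: it skips the relocation step entirely and only zeroes the clusters that are not in use, and it takes the set of used clusters as a plain argument instead of parsing a real FAT. The names zero_free_clusters, CLUSTER_SIZE and used_clusters are hypothetical, made up for this illustration.

```python
import bz2
import os

CLUSTER_SIZE = 4096  # assumed cluster size in bytes


def zero_free_clusters(image: bytes, used_clusters: set) -> bytes:
    """Return a copy of `image` with every cluster not in `used_clusters` zeroed.

    `used_clusters` is a hypothetical stand-in for the allocation map that a
    real tool would have to read from the filesystem itself.
    """
    cleaned = bytearray(image)
    for index in range(len(image) // CLUSTER_SIZE):
        if index not in used_clusters:
            start = index * CLUSTER_SIZE
            cleaned[start:start + CLUSTER_SIZE] = bytes(CLUSTER_SIZE)
    return bytes(cleaned)


# Toy "partition": 256 clusters full of data, of which only every fourth
# cluster belongs to a live file (a crude model of fragmentation).
image = os.urandom(256 * CLUSTER_SIZE)
used_clusters = set(range(0, 256, 4))

cleaned = zero_free_clusters(image, used_clusters)

raw_compressed = len(bz2.compress(image))
cleaned_compressed = len(bz2.compress(cleaned))
print(f"raw image: {raw_compressed} bytes, cleaned image: {cleaned_compressed} bytes")
```

The cleaned image has the same length as the original and the used clusters are untouched, so it can still be written back like any raw image; only the compressed size changes.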
Now of course an alternative is just to use an archiving tool that can actually get all the files’ metadata (attributes, permissions, and so on); then you don’t have to worry about unused space at all. But that assumes you can use custom tools both for creating the image and for restoring it. Creating an image of a partition could be quite easy to do with an already set-up complex tool, but there might be the need for the restore part of the code to be as lightweight as possible.
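As an illustration of the archiving route, this is a small sketch using Python’s tarfile module, which stores permissions next to the file contents and never sees the unused space. The file name notes.txt, the mode 0o640 and the backup prefix are arbitrary choices for the example, and it assumes a POSIX system where chmod behaves as expected.

```python
import io
import os
import stat
import tarfile
import tempfile


def archive_directory(root: str) -> bytes:
    """Pack `root` into an in-memory bzip2-compressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        # Store everything under the (arbitrary) prefix "backup".
        archive.add(root, arcname="backup")
    return buffer.getvalue()


# Build a tiny directory with one file and a specific permission mode.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "notes.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    os.chmod(path, 0o640)
    archive_bytes = archive_directory(root)

# Read the archive back and check that data and permissions survived.
with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:bz2") as archive:
    member = archive.getmember("backup/notes.txt")
    restored_mode = stat.S_IMODE(member.mode)
    restored_data = archive.extractfile(member).read()

print(oct(restored_mode), restored_data)
```

The catch is the one described above: restoring needs a tool that understands the archive format and a filesystem already created to extract into, which is more than a plain dd.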
Now, I’m no expert on filesystems myself, so I might be wrong, or there might be software doing this already out there. I don’t know. But I think it wouldn’t be too far-fetched to think that there might be software capable of taking a hard drive with an MBR partition table, with two FAT32 filesystems on it, both fragmented, with the unused space not zeroed out at all, and creating an image file that only needs bzip2 to be uncompressed, and dd to be written, but still providing a much smaller image file than a simple raw image compressed with bzip2, or maybe even rzip.
Such a tool could easily find duplicated files (something that happens a lot on Windows because of duplicated DLLs, but can easily happen on Linux too because of backup copies for instance), and put them one near the other so as to improve compression.
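Why placement matters can be shown at toy scale. Note that this sketch swaps in zlib instead of bzip2: zlib only looks back 32 KiB, so the effect is visible with a few kilobytes of data, while bzip2 works on blocks of up to 900 kB and would need much larger inputs to show the same thing. The sizes and the random “files” are assumptions made for the example.

```python
import os
import zlib

blob = os.urandom(16 * 1024)    # a 16 KiB "file" that exists twice on disk
filler = os.urandom(64 * 1024)  # unrelated data belonging to other files

# Same bytes in both layouts, only the order changes.
far_apart = blob + filler + blob  # the copies are 80 KiB apart
adjacent = blob + blob + filler   # the copies sit next to each other

far_size = len(zlib.compress(far_apart, 9))
adjacent_size = len(zlib.compress(adjacent, 9))

print(f"copies far apart: {far_size} bytes")
print(f"copies adjacent:  {adjacent_size} bytes")
```

When the copies are adjacent, the second one falls inside the compressor’s window and costs next to nothing; when they are far apart, it is stored all over again.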
I know this post is quite vague on details, and that’s because, as I said, I’m not much of an expert in the field. I just wanted to make some users reflect on the fact that a simple raw image is just not always the perfect solution if what you need is efficiency in storage space.