Same content, different file

Flameeyes

14 years ago

I’d first like to apologise for my previous post ending up on Planet Gentoo rather than just Gentoo Universe: a click too many. On the other hand, I would like to note that if one post gets mis-categorized, out of an average of 300 posts a year, I don’t think I deserve the words I received in the comments from one very lunatic user. Thanks to those of you who supported me, really!

You might remember that I have written a few posts about the Portage tree size, after which I tried, with the help of equilibrium from GeCHI, a few possible solutions to reduce the overhead. One of these was a systematic approach to what a few users suggested in my posts: SquashFS.

While the final results aren’t really that important for what I’m writing here, the test with SquashFS made it clear that there are indeed a lot of duplicated files in the tree, which is something I also said before. Unfortunately, it’s not really that easy to understand how duplicated files are supposed to work: some we don’t want, some we are forced to have, and some we actually need more of. Let me try to explain.

The duplicated files that we don’t want in the tree are those created by the insistence of developers on always using ${P} or ${PV} in their patch file names (or subdirectories of files/ even!). Instead of editing the ebuild so that it keeps referring to the patch by the previous version, on a bump the patches get duplicated. While I’m hopeful that rsync can handle the duplicated files efficiently, they waste on-disk space and entries in the VFS cache (which is probably even more important in some cases, when going through the tree).

The duplicated files that we have lots of, and that we’re forced to have in the tree, are mostly related to profiles: the parent file is usually either .. or a common relative path shared by quite a few different profiles. It’s bad, somehow, because they are very small files (<100 bytes) and yet each will end up eating at a minimum 512 bytes of space (with obvious exceptions).
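To put a rough number on that overhead, here is a minimal sketch in Python. It assumes a file system that hands out space in whole blocks, with no tail packing or inline storage (those would be the obvious exceptions); the function names and the example sizes are made up for illustration.

```python
def allocated_size(file_size: int, block_size: int = 512) -> int:
    """Bytes taken on disk when space is handed out in whole blocks."""
    if file_size <= 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def wasted_bytes(file_sizes, block_size: int = 512) -> int:
    """Total slack space for a collection of file sizes."""
    return sum(allocated_size(size, block_size) - size for size in file_sizes)


# A parent file holding just ".." and a newline is 3 bytes long,
# yet under the assumption above it still occupies a full block.
print(allocated_size(3), wasted_bytes([3, 20, 100]))
```

With a larger block size, say 4096 bytes, the slack per tiny file grows accordingly.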

The kind of duplicated files we need more of is yet another: metadata.xml files. There are quite a few “homograph” metadata.xml files: they contain the same herd/maintainer data, but they differ in indentation, whitespace, and other similarly trivial things. Why do I think we need more duplicated metadata files? Well, it’s simple: to improve possible data de-duplication. If the files have the same meaning but different literal content, they will always be two distinct files.
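As an illustration of what turning “homograph” files into literally identical ones could look like, here is a small Python sketch. It is only one possible normalisation, not an agreed Gentoo format: it strips whitespace around text nodes and re-serialises with the standard library’s ElementTree, which also drops comments and the DOCTYPE and would mangle mixed content, so take it as a demonstration of the idea. The function name and the sample herd are invented.

```python
import hashlib
import xml.etree.ElementTree as ET


def normalise_metadata(xml_text: str) -> bytes:
    """Re-serialise XML with insignificant whitespace stripped."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop leading/trailing whitespace inside and after each element.
        elem.text = (elem.text or "").strip() or None
        elem.tail = (elem.tail or "").strip() or None
    return ET.tostring(root, encoding="utf-8")


# Same meaning, different indentation and trailing whitespace.
spaces = "<pkgmetadata>\n  <herd>sound</herd>\n</pkgmetadata>\n"
tabs = "<pkgmetadata>\n\t<herd>sound</herd>\n\n</pkgmetadata>"

# Different bytes on disk...
assert hashlib.md5(spaces.encode()).digest() != hashlib.md5(tabs.encode()).digest()
# ...but identical once normalised.
assert normalise_metadata(spaces) == normalise_metadata(tabs)
```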

On the other hand, if two files have the same literal content (md5 is the same for instance), this helps data de-duplication in many different ways:

SquashFS will properly detect identical files, storing separate (inode) metadata for each but only a single copy of the file’s data; similar things can be done by other data de-duplication systems, which I think btrfs should have;
as I said, I’m hopeful (although I don’t know the protocol well enough) that rsync can optimise sending multiple copies of the same file;
even the good old tarball has one possible way to optimise multiple identical metadata.xml files, thanks to the fact that compressed tarballs are so-called “solid archives”: the compression is applied over the whole content of the archive, rather than on the files individually as happens with zip files. There is a problem related to the distance between the copies in the archive and the block size used by the compression algorithm, but I’m not really going to go that deep into the problem now.
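Finding which files are literally identical is easy enough to sketch. The Python below groups content by its md5 digest, as mentioned above; it is just an illustration, the helper names are hypothetical, and it reads every file fully into memory, which is only reasonable because the files in question are tiny.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_groups(blobs):
    """Map digest -> sorted names, for content that appears more than once."""
    by_digest = defaultdict(list)
    for name, data in blobs.items():
        by_digest[hashlib.md5(data).hexdigest()].append(name)
    return {d: sorted(names) for d, names in by_digest.items() if len(names) > 1}


def read_tree(root):
    """Load every regular file below root as {relative path: bytes}."""
    base = Path(root)
    return {
        str(path.relative_to(base)): path.read_bytes()
        for path in base.rglob("*")
        if path.is_file()
    }
```

Something like duplicate_groups(read_tree("/usr/portage")) would then list the candidates, assuming that is where your tree lives. Note that a file differing by a single space ends up in a different group, which is exactly the “homograph” problem.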

So what is the problem at this point? There are a few. The first is that nowadays we have added more information to the metadata.xml files, with the result that sometimes you can’t just use the same file as any other package (USE flag descriptions, for instance). This reduces the usefulness of data de-duplication, as the content of the files is no longer the same. Even then, it would still help to have a single standard for indentation, order, and format, as compressing the files together would allow the compression algorithm to share at least part of the dictionary between them.
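The dictionary-sharing effect can be seen with a toy comparison in Python. This is a sketch under loud assumptions: zlib stands in for whatever compressor is actually used, plain concatenation stands in for a tar archive, and the metadata snippets are invented rather than taken from the tree.

```python
import zlib

# Invented metadata-like files: same format, different USE flag description.
TEMPLATE = (
    "<pkgmetadata>\n"
    "\t<herd>sound</herd>\n"
    "\t<use>\n"
    "\t\t<flag name='{flag}'>Enable {flag} support</flag>\n"
    "\t</use>\n"
    "</pkgmetadata>\n"
)

files = [
    TEMPLATE.format(flag=flag).encode()
    for flag in ("alsa", "jack", "pulseaudio", "oss")
]

# Each file compressed on its own, as a zip file would do.
separate = sum(len(zlib.compress(data)) for data in files)
# All files compressed as one stream, as a compressed tarball would do.
solid = len(zlib.compress(b"".join(files)))

print(separate, solid)
```

The solid figure comes out smaller because the shared markup is only encoded once; the more the formatting differs between files, the less of it can be shared.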

The biggest problem is that trying to get an agreement, in Gentoo, over the proper formatting (and indenting) of a file is like trying to herd cats. Sigh.
