This Time Self-Hosted

Same content, different file

I’d first like to apologise for my previous post ending up on Planet Gentoo rather than just Gentoo Universe: one click too many. On the other hand, I would like to note that if one post gets mis-categorized, out of an average of 300 posts a year, I don’t think I deserve the words I received in the comments from one very lunatic user. Thanks to those of you who supported me, really!

You might remember that I have written a few posts about the Portage tree size, after which I tried, with the help of equilibrium from GeCHI, a few possible solutions to reduce the overhead. One of these was a systematic approach to what a few users suggested in my posts: SquashFS.

While the final results aren’t really that important for what I’m writing here, the test with SquashFS made it clear that there are indeed a lot of duplicated files in the tree, which is something I also said before. Unfortunately, it’s not really that easy to understand how they are supposed to work. Let me try to explain.

The duplicated files that we don’t want in the tree are those created by developers’ insistence on always using ${P} or ${PV} in their patch file names (or even in subdirectories of files/!). Instead of the ebuild being edited to refer to the patch by its previous version, the patch gets duplicated on each bump. While I’m hopeful that rsync can handle the duplicated files efficiently, they waste on-disk space and entries in the VFS cache (which is probably even more important in some cases, when going through the tree).

The duplicated files that we have lots of, and that we’re forced to keep in the tree, are mostly related to profiles: the parent file is usually either .. or a common relative path shared by quite a few different profiles. It’s bad, somehow, because they are very small files (<100 bytes) and yet each will end up eating at least 512 bytes of space (with obvious exceptions).
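To put a number on that overhead, here is a minimal sketch of the arithmetic, assuming space is allocated in whole blocks. The 512-byte default is the figure mentioned above; the function names are made up for illustration, and filesystems that pack small files together (the “obvious exceptions”) are not modelled.

```python
import math


def on_disk_size(file_size: int, block_size: int = 512) -> int:
    """Space a file occupies when allocation happens in whole blocks.

    Assumption: every non-empty file takes at least one full block.
    """
    if file_size == 0:
        return 0
    return math.ceil(file_size / block_size) * block_size


def wasted_bytes(file_size: int, block_size: int = 512) -> int:
    """Bytes allocated to the file but not holding any of its content."""
    return on_disk_size(file_size, block_size) - file_size


if __name__ == "__main__":
    # A parent file containing just ".." plus a newline is 3 bytes long.
    print(on_disk_size(3), wasted_bytes(3))
```

Passing a larger block_size shows how quickly the waste grows for the same tiny file.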

The kind of duplicated files we need more of is a different one: metadata.xml files. There are quite a few “homograph” metadata.xml files: they contain the same herd/maintainer data, but they differ in indentation, whitespace, and other similarly trivial things. Why do I think we need more duplicated metadata files? Well, it’s simple: to improve possible data de-duplication. If the files have the same meaning but different literal content, they will always be two distinct files.
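As a rough illustration of what bringing two such files to the same literal content could look like, here is a sketch using Python’s standard xml.etree.ElementTree (ET.indent needs Python 3.9 or later). The function name, the tab indentation, and the sample e-mail address are assumptions of mine, not an agreed Gentoo format; stripping text this way would also not be safe for elements with mixed content.

```python
import xml.etree.ElementTree as ET


def canonical_metadata(xml_text: str) -> str:
    """Re-serialise a metadata.xml-like document with one fixed layout.

    Assumption: whitespace around text is insignificant, which holds
    for simple herd/maintainer data but not for free-form descriptions.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop the original indentation and surrounding whitespace.
        if elem.text is not None:
            elem.text = elem.text.strip() or None
        if elem.tail is not None:
            elem.tail = elem.tail.strip() or None
    # Recreate the indentation uniformly (tabs are an arbitrary choice here).
    ET.indent(root, space="\t")
    return ET.tostring(root, encoding="unicode")


# Two hypothetical files: same meaning, different literal content.
compact = (
    "<pkgmetadata><herd>video</herd>"
    "<maintainer><email>someone@example.org</email></maintainer>"
    "</pkgmetadata>"
)
messy = (
    "<pkgmetadata>\n  <herd>video</herd>\n      <maintainer>\n"
    "<email>someone@example.org</email>\n  </maintainer>\n</pkgmetadata>\n"
)

if __name__ == "__main__":
    print(canonical_metadata(compact) == canonical_metadata(messy))
```

Once both files go through the same function, any byte-level de-duplication can treat them as one.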

On the other hand, if two files have the same literal content (md5 is the same for instance), this helps data de-duplication in many different ways:

  • SquashFS will properly detect identical files and store only the differing (inode) metadata, keeping a single copy of the file’s data; other data de-duplication systems can do similar things, as I think btrfs should;
  • as I said I’m hopeful (although I don’t know the protocol well enough) that rsync can optimise sending multiple copies of the same file;
  • even the good old tarball has one possible way to optimise multiple identical metadata.xml files, thanks to the fact that compressed tarballs are so-called “solid archives”: the compression is applied over the whole content of the archive rather than to the files individually, as happens with zip files. There is a problem related to the files’ distance in the archive and the block size used by the compression algorithm, but I’m not really going to go that deep into the problem now.
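To see how many files in a checkout already share the same literal content, something like the following sketch could be used. It is only an illustration: the function name is hypothetical, and it hashes with SHA-256 rather than the md5 mentioned above, simply because any content hash works for this comparison.

```python
import hashlib
import os
from collections import defaultdict


def group_identical_files(root: str, name: str = "metadata.xml") -> dict:
    """Map content digest -> sorted paths, for files found more than once.

    Only files called `name` under `root` are considered.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            groups[digest].append(path)
    # Keep only digests shared by at least two files.
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}
```

Calling group_identical_files() on the root of a tree returns one entry per set of byte-identical files; “homograph” files that differ only in whitespace will not show up together.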

So what is the problem at this point? There are a few. The first is that nowadays we have added more information to the metadata.xml files (USE flag descriptions, for instance), with the result that sometimes you can’t just reuse the same file as any other package. This reduces the usefulness of data de-duplication, as the content of the files is no longer the same. Even then, it would still help to have a single standard for indentation, order, and format, as compressing the files together would allow the compression algorithm to share at least part of the dictionary between them.
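The dictionary-sharing effect is easy to try out. The sketch below uses zlib from the Python standard library as a stand-in for whatever compressor is applied to the tarball; the sample files and helper names are invented for illustration, and no particular numbers should be read into it.

```python
import zlib


def make_metadata(herd: str) -> bytes:
    """Hypothetical metadata.xml-like file: same layout, different data."""
    return (
        "<pkgmetadata>\n"
        f"\t<herd>{herd}</herd>\n"
        "\t<maintainer>\n"
        f"\t\t<email>{herd}@example.org</email>\n"
        "\t</maintainer>\n"
        "</pkgmetadata>\n"
    ).encode()


def per_file_size(files: list) -> int:
    """Total size when each file is compressed on its own (zip-like)."""
    return sum(len(zlib.compress(data)) for data in files)


def solid_size(files: list) -> int:
    """Size when the files are concatenated and compressed as one stream."""
    return len(zlib.compress(b"".join(files)))


if __name__ == "__main__":
    herds = ["video", "sound", "kde", "gnome", "ruby", "perl"]
    files = [make_metadata(h) for h in herds]
    print(per_file_size(files), solid_size(files))
```

The more of the layout the files have in common, the more of each file the compressor can express as a back-reference to an earlier one.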

The biggest problem is that trying to get an agreement, in Gentoo, over the proper formatting (and indenting) of a file is like trying to herd cats. Sigh.

Comments 10
  1. Enforcing a whitespace-/indentation-rule for XML is arbitrary since XML ignores both. Therefore I’d rather see it as yet another artificial annoying rule. In addition: as soon as upstream metadata tags get used, even more metadata.xml files will differ.

  2. Most programming languages ignore whitespace and indentation. Is that a reason not to have conventions?

  3. @dev-zero thanks for proving my note about herding cats. Sure, XML ignores whitespace. But as I said there are *very good technical reasons* to have the same indenting format for all the files just for the sake of de-duplication and compression. But “yet another artificial annoying rule” (as in, the mindset that lets you write that) is the reason why Gentoo’s results are getting shittier by the year.

  4. @diego, for your purpose a tab-space indentation is the only viable solution: 1- with white-spaces, the files will be bigger in size than with tab-spaces; 2- white-spaces are more error prone than tab-spaces, nullifying the de-duplication purpose and requiring extra QA checks for the validation of the files; however I agree with you, the new Gentoo trend seems to be: “WTF! a new rule? why do we need it? the solution is to ignore the problem”

  5. @dev-zero We already have this “artificial rule” in ebuilds. There is no reason for indenting with tabs in bash scripts (apart from the technical one), and the same would just apply to metadata.xml. I guess you never had an issue with the forced style on ebuild files…

  6. Maybe I’m missing something, but standardizing indentation and whitespaces should be the easiest thing to do automatically – strip all indentation and whitespaces and recreate them again with some xml beautifier package. So why not just add some hook to the vcs that services the tree that would do that to every metadata.xml file?

  7. A simple step towards this goal is to put a vim modeline in skel.metadata. I’m surprised no one has done this yet. I just use whatever indenting the previous person did and if there was a modeline, even better.

  8. @Jeremy: then we’ll add a kate modeline, and an emacs modeline, and … Why not rather make the vim, emacs and whatever else plugins handle those properly?

  9. It shouldn’t be too hard to write some analogue of dev-util/indent for xml (probably such a thing even exists already), so one could simply require its use (repoman?). Concerning tarballs, it is not so important to have identical files: the differences cost just a few bytes in the compression, and for many files the differences themselves are similar to each other and thus cost even less. For squashfs, the situation is different, because for identical files the problem of a huge distance in the archive is avoided.

  10. If identical files are hardlinked together, perhaps by some periodic deduplication scan (since VCSes don’t seem to care), that would reduce both the blocks and the inodes used, right? I wonder how expensive for the rsync server it would be if all clients used --hard-links, and I wonder if that’s a complete enough solution.
