The size of the Gentoo tree

You might have noticed that I started working on cleaning up the tree (before I had a few problems with my system, but that’s for another day). Some people wondered whether that’s really going to make much difference, so I wanted to take a look at it myself. I was already quite sure that, while reducing the size of filesdir is important, especially to keep more stuff from being added to the tree, getting rid of every filesdir wouldn’t really make much of an impact. Some spare time, a few find commands, and Google Docs later, this is the result:

As you can see, the biggest part of the tree is eaten up by the support files, at more than twice the size of all the ebuilds; the files/ directories take up just a little more than the ebuilds, and there is a huge amount of filesystem allocation overhead, even though my tree is on a filesystem with 1 KiB blocks. Another interesting note is that the licenses use up more space than the whole set of profiles, scripts and eclasses!
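The numbers above came from a handful of find commands; a rough equivalent can be sketched in Python (the language Portage itself is written in). This is only an illustration, not the method actually used for the graphs: the category names and the `classify` and `tree_usage` helpers are hypothetical, and the allocated size assumes a POSIX system where `st_blocks` is reported in 512-byte units.

```python
import os
from collections import defaultdict


def classify(rel_path):
    """Map a path relative to the tree root to a rough category (assumed names)."""
    parts = rel_path.split(os.sep)
    name = parts[-1]
    # category/package/files/... holds patches and other auxiliary files
    if len(parts) > 3 and parts[2] == "files":
        return "files/"
    if name.endswith(".ebuild"):
        return "ebuilds"
    if name in ("ChangeLog", "Manifest", "metadata.xml"):
        return name
    if parts[0] in ("licenses", "profiles", "scripts", "eclass"):
        return parts[0]
    return "other"


def tree_usage(root):
    """Return {category: [apparent_bytes, allocated_bytes]} for the tree at root."""
    usage = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            st = os.lstat(full)
            bucket = usage[classify(os.path.relpath(full, root))]
            bucket[0] += st.st_size          # what the file contains
            bucket[1] += st.st_blocks * 512  # what the filesystem allocated
    return dict(usage)
```

Calling `tree_usage("/usr/portage")` and subtracting the apparent bytes from the allocated bytes per category gives an estimate of the allocation overhead; directories themselves are not counted in this sketch.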

For the sake of finding something to work on, let’s break the support files category down in a separate graph:

So much for those who complained that adding information about packages in metadata.xml is wasting space for users… the real space wasters are the change logs instead! But they are useful to keep around, for a while at least. I guess what we really need is better ChangeLog integration in repoman, so that a) it updates the ChangeLog on commit (stopping developers from committing without updating it!) and b) it can delete older and obsolete entries (like, keep the most recent 40 changes or so).
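The “keep the most recent 40 changes or so” idea could look roughly like the sketch below. It is not repoman code: `trim_changelog` and `ENTRY_START` are hypothetical names, and the sketch assumes the usual ChangeLog layout, with the newest entries first, each entry starting with an indented date line such as `  29 Sep 2009; Name <email> file:`, and an optional `*package-version (date)` line before the first entry of a new version.

```python
import re

# Assumed shape of an entry's first line: "  29 Sep 2009; Name <email> files:"
ENTRY_START = re.compile(r"^\s+\d{1,2} \w{3} \d{4};")


def trim_changelog(text, keep=40):
    """Return text cut down to its header and the `keep` newest entries."""
    lines = text.splitlines(keepends=True)
    seen = 0
    pending_star = None  # index of a "*pkg-ver (date)" line awaiting its entry
    for i, line in enumerate(lines):
        if line.startswith("*"):
            pending_star = i
        elif ENTRY_START.match(line):
            seen += 1
            if seen > keep:
                # Cut before the version line that introduces this entry, if any
                cut = pending_star if pending_star is not None else i
                return "".join(lines[:cut]).rstrip("\n") + "\n"
            pending_star = None
    return text  # fewer than `keep` entries: nothing to trim
```

Whether trimming happens in repoman at commit time or on the rsync staging box is a separate question; the function only shows that the cut itself is simple once the entry boundaries are known.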

Update (2017-04-30): Unfortunately the spreadsheet links I used are now broken, so no graphs are available right now.

36 thoughts on “The size of the Gentoo tree”

  1. Ehh, am I the only one who can’t see the graphics?

     About ChangeLogs: maybe it would be better to exclude them from the sync list? I know they are required for QA, but they are most useful for developers, not average users.

  2. Quite interesting. Maybe I’m wrong, but at least some things could be fixed quite easily: does everybody need to download the changelogs? A USE flag or some Portage configuration variable could probably be added to make it optional.

  3. What about just storing them on the servers and providing a little script to show them?

     <typo:code>
     #!/bin/zsh
     # Split "category/package" into its two parts
     ARGS=(`echo $1 | awk -F/ '{print $1,$2}'`)
     CL=/tmp/.gentoo/cl/$ARGS[1]/$ARGS[2].txt
     # Make sure the target directory exists, then download into it
     mkdir -p ${CL:h}
     wget -O $CL http://server.foo/$ARGS[1]/$ARGS[2].txt
     $EDITOR $CL ; rm $CL
     </typo:code>

     pro:
     - saves a lot of disk space
     - saves a lot of traffic on the servers, at least if you agree with me that few people actually read the changelogs
     - always up to date, i.e. no sync required
     - users can still fetch single changelogs by hand if needed
     - pretty easy to do and to maintain

     con:
     - user needs Internet access

     As an alternative, there could be (for example) an SQLite database containing all the changelogs, which would save disk space and allow some advanced stuff. The trouble would be keeping it up to date, but that could be done with an SQL script containing the UPDATE values, run after every --sync.

  4. Damn it, not fully awake… the above script should be called as ‘script majorcat-minorcat/entry’, i.e. ‘script www-client/elinks’.

  5. I actually tried excluding ChangeLogs from my portage tree a few days ago. (If you’re curious: add PORTAGE_RSYNC_EXTRA_OPTS="--exclude ChangeLog --delete-excluded" to your make.conf, and then sync.) I keep my portage tree in a squashfs image, and dropping changelogs reduced the size by about 25%.

     I would be utterly thrilled if you figured out a way to only keep recent entries in the ChangeLogs in the tree, though. They’re nice to have around sometimes, but history going back years is really overkill for all but a few users.

  6. Cons:
     - Increased server load
     - Changelogs show entries that don’t apply if your tree isn’t synced.

     As far as the ChangeLog goes, a server daemon really should just generate it straight from the VCS’s commit log in the first place: look at each ebuild directory, get the last 50 commits that modified files in it, convert them to text and write the result out in the rsync directory.

     Storing the stuff as a database that Portage can query on the user’s system to get changelog info (emerge --changelog kdelibs) isn’t a bad idea for reducing the disk size; unfortunately, it will probably increase network usage instead, since the DB will likely always be synced in full unless it has a special internal structure that satisfies rsync’s changed-block detection algorithm (measuring would be needed to be sure).

     [You can’t bundle UPDATE/INSERT/DELETE scripts because, update since when? Yesterday? Last week? Last year? Have the server generate it depending on the client (likely impractical due to the mirrors and server load)?]

  7. Andrew, regarding your “increased server load”: why do you think that would be the case? I mean, we’re currently delivering the ChangeLog files to each and every user, no matter whether s/he needs or wants them, so I guess only delivering them on demand would be better, guessing that approx. <10% of the userbase make use of changelogs on a regular basis.

     I for one don’t read changelogs very often, at least not the Gentoo ones. If I’m interested in the ‘news kind’, I’ll check out upstream; if there’s a bug I check out the tracker; and if some flags change, I skim through the ebuild.

     So the question is, how vital do we think changelogs are? In my eyes, they’re only useful to the devs, but they also have CVS access. If there’s some problem, I guess Google comes first, so reading a changelog online would be no problem.

     How do other distributions handle this? Do Debian or Fedora have a local store? What about *BSD?

     @Ravi, thanks for the idea to exclude them from syncing, I’ll test this with .gitignore on the funtoo-tree.

  8. A good solution to the ChangeLog problem (yes, it’s a problem IMHO) is to avoid them entirely. Why are they here? What are their purposes? Aren’t CVS changelogs enough? Wait, *yes*: ChangeLog entries can be stored as CVS commit messages, without the need to waste space in the portage tree.

     To be more precise, it’s time to move away from CVS/SVN (less Gentoo Infra maintenance) and use Git; after that, the output of “emerge -l” can be replaced with the output of the Git command “git log --relative=category/package”. Simple and effective.

  9. I must say that leaving the ChangeLogs out would not be much of a loss for the typical user. I have rarely found useful info in them; they usually just say something like “version bump”. The real info is in the upstream ChangeLogs, and even there it is rarely useful for the typical user.

     I suggest a USE flag so that the ChangeLogs are not synced unless explicitly enabled.

     /Jakob

  10. I agree that ChangeLogs are rarely read. I do occasionally read them though, most often the Gentoo ebuild-related ones but occasionally the ones in the source.

      Usually it’s to see what’s been tested on my hardware, when I want to use something that is still not considered stable enough for everyone but is newer than what I have in my ‘tree’.

      It would be nice to be able to pull them in with a separate command, even gzipped, since ‘less’ handles that just fine.

  11. WARNING: do not use --delete-excluded in make.conf or you’ll lose the local, distfiles and packages directories in addition to the ChangeLog files…

  12. Interesting ratios… though you may want to avoid pie charts next time, as they obfuscate the data, especially when data sets have similar sizes as in the first graph (overhead/files/ebuild all seem the same, as do altro/licenses); we need to read your interpretation to know the relative sizes, a problem we wouldn’t have with good ol’ bar charts.

      See why it’s seldom a good idea to resort to pie charts: http://www.perceptualedge.c

  13. One nice use of ChangeLog messages for ebuilds is to check whether updates for huge packages are really required for me. If there’s an OO.org -rX bump and the ChangeLog indicates that it is something unrelated to me, I’ll rather do echo xy >> /etc/portage/package.mask than emerge -u.

  14. @ph030: An rsync has a somewhat predictable load; serving on demand can result in unpredictable load spikes if some breakage happens and a lot of people look up the ChangeLog at once. You are probably right that this is unlikely though.

      @usefulness of logs: I generally find the logs useful when something catches my eye in an emerge -pvuDNt world, or breaks my system and I want to find out why it was stabled [bug numbers]; that happens more often than it should. On the other hand, most of the logs don’t get read, so I guess I’d have to agree that excluding them probably won’t hurt most people, provided they have an always-on net connection at least.

  15. The size of the portage tree in itself is a non-problem, as disk space is cheap nowadays and only getting cheaper.

      Performance gains would be interesting, and users would like shorter rsync times. Removing ChangeLogs from the tree could definitely help there, and as suggested above (and elsewhere) we could easily rely on VCS logs. I would suggest implementing this at the time we migrate from CVS to Git.

  16. Ben, you mean “never”, then? Because we’ve been talking about switching for what? Three years?

      And the size of the portage tree _is_ a problem because that bloody size is transferred _for each new user_ when syncing anew!

  17. OK, you are getting a lot of comments above stating that ChangeLogs are useless. Of course this is an imprecise sample of participants. For example, I once didn’t have enough data to provide a proper ChangeLog comment for a version bump, and 3 people actually found me to ask “hey, your ChangeLog wasn’t as good as it normally is, what’s new?”

      So, 3 is not a very big number either. However, it mattered enough for them to actually find me.

      I always use this motto for my commits: “VCS commit messages are for devs, ChangeLog comments are for users” – granted, normally my VCS message is the same as my ChangeLog message :)

      Additionally, I think it is a very good idea for the rsync staging box to truncate ChangeLogs at an arbitrary length, say 200 lines. I have been advocating this for a while.

  18. squashfs/append & mount … /usr/portage? Rsync wrapper? Whole snapshots may be squashfs-ed too.

  19. I agree with Jeremy Olexa because, as a user, I oftentimes like to read the changelogs to know why it is that I need to reinstall a particular package (especially if it is a big one).

      Furthermore, as people posted above, excluding the ChangeLogs is pretty trivial, so a solution could be to add a Portage variable that is set by default to exclude ChangeLogs, but could be switched on for those people who find them useful.

      Limiting the size of the ChangeLog could be a good solution too, since changes from 2 years ago are not very useful.

  20. Hi, I’d find it a very bad idea to exclude really old changelog entries. I often find things in ebuilds and patches where I ask myself “what the fuck is this about”, and walking through the changelog helps.

      Reading through the comments, I think the preferred way would be not to have the changelogs in the tree, but to store them online and have a simple tool to retrieve them.

  21. What about the Manifests?

      On a not-really-up-to-date machine, I found ~14,000 Manifests, summing up to 61 MB. Cat’ing them all into a single file leaves only 22 MB, which means about 2/3 of the space is filesystem overhead on an XFS partition.

      Since they have quite a simple structure, it would be quite easy to store and update all this in an SQL database, and it shouldn’t be much slower, if at all, than acting on a single file.

      Further:
      `gzip file` -> 11 MB
      `bzip2 -9 file` -> 8.7 MB

  22. Just some thoughts:

      1. It is not the size of the ancient files which matters (especially if the user keeps the portage tree on squashfs+aufs) but the size of the files which change frequently, and what rsync needs to transfer for them. The rsync algorithm should typically transfer only the newest entries of the ChangeLogs and recognize the old ones by checksums. It would certainly be necessary to log (for a while) detailed rsync output showing how much transfer volume is really needed for which files.

      2. Cutting ChangeLogs down to a certain number of entries may actually increase the transfer volume for rsync. It also means that the svn servers would have to store more data (how big the difference is certainly depends on the VCS). Moreover, if Gentoo switched to another VCS which has no cheap checkouts (I do not know whether git has), cutting ChangeLogs would also increase the size of the data stored on the user’s system.

  23. Just as a note for those commenting: the standard rsync algorithm is actually *disabled* for the portage tree, because --whole-file is *always* passed to rsync, so long as either the defaults are not changed, or you are syncing to any server in *.gentoo.org. This means that any time a file is changed, the entire file is redownloaded from the rsync server, not just the parts that changed.

  24. I agree with Jeremy that ChangeLogs are very important and should be kept available. But as the common practice is to use the same log message for the changelog and the commit log, we could cut out this duplication. That is why I advocate keeping only the VCS commit log, and writing a tool that makes it very easy for users to retrieve it. This would definitely cut down the volume of data which needs to be rsynced.

      Flameeyes: I don’t mean never. I am cautiously optimistic and would expect us to migrate to git within the next two years or so.

      And a ~30 MB snapshot tarball to download the portage tree before your first sync isn’t that much, not compared to the gigs of source tarballs that are needed for the average desktop install. If you are concerned about that, you’re with the wrong distro.

  25. Well… suppose you could set the ‘install’ system to use binpkg by default and encourage the use of emerge-webrsync for those doing fresh installs. Not sure whether there would be any real benefit.

      Changelogs could be a separate emerge, just like any other ebuild? Not installed or synced by default?

  26. While you’re poking at this, add details of the size of the various overheads:
      - tail-packing reiserfs3
      - 1KiB inodes
      - 4KiB inodes
      - 64KiB inodes

  27. There is also a big number of duplicate files, 9625; symlinks should work with a decent VCS.

      Also, after removing the ChangeLogs my squashfs file is 33M! Thanks for the hint.

    vserver sources suexec portage
    > mksquashfs /g/portage /srv/pcemail/portage.sqsh
    >            -noappend -no-exports -no-recovery -force-uid 250 -force-gid 250
    >            -wildcards -e '*/*/ChangeLog'
    Parallel mksquashfs: Using 8 processors
    Creating 4.0 filesystem on /srv/pcemail/portage.sqsh, block size 131072.
    [===============================================================================================================/] 101819/101819 100%
    Squashfs 4.0 filesystem, data block size 131072
            compressed data, compressed metadata, compressed fragments
            duplicates are removed
    Filesystem size 32752.03 Kbytes (31.98 Mbytes)
            25.80% of uncompressed filesystem size (126949.55 Kbytes)
    Inode table size 1350593 bytes (1318.94 Kbytes)
            34.40% of uncompressed inode table size (3925625 bytes)
    Directory table size 1276508 bytes (1246.59 Kbytes)
            38.01% of uncompressed directory table size (3358423 bytes)
    Number of duplicate files found 9625
    Number of inodes 122598
    Number of files 101816
    Number of fragments 928
    Number of symbolic links  0
    Number of device nodes 0
    Number of fifo nodes 0
    Number of socket nodes 0
    Number of directories 20782
    Number of ids (unique uids + gids) 1
    Number of uids 1
            portage (250)
    Number of gids 1
            portage (250)
    ls -l ~/md3/ifx/pcemail/portage.sqsh
    -rwxr--r-- 1 portage portage 33542144 Sep 29 00:14 /root/md3/ifx/pcemail/portage.sqsh

  28. Hey fellow Gentoo ladies :)

      Maybe I’m a little alone with this one, but I always check the ChangeLogs before I do something (plus the upstream changelogs, if there’s enough time). I do like to go through the whole list of ‘minor fixes, bumps, etc.’ and keep a close eye on things ;)

      I noticed that Portage doesn’t show the complete changelog, only the changelog up to a specific version; the rest gets cut off. This happens when Portage shows a version bump as the last entry even though a few fixes and patches were applied afterwards, which you then don’t see.

      Personally I don’t mind how large the tree is or will be in the future. My only concern is how much load the Gentoo servers can handle while still functioning healthily. If we have to get rid of something to reduce the load, feel welcome to ‘slim down’ the tree. :) But please do it in a sane way; don’t chop the ChangeLogs away just because a few people assume nobody is interested in them.

      Nostalgic ChangeLogs, maybe we can put them in the Gentoo museum. ;)

  29. I certainly care about the size of the portage tree. I find “disk space is cheap” and other “Moore takes care of it” arguments dangerous; this way the computer is never fast, because while the hardware gets faster, the software gets slower. Reasonable effort should go into efficiency. Sometimes the efficient way is also simpler and more maintainable (win/win), but even if it is more complex, the tradeoff should be given some thought.

      Regarding changelogs, I often use them, yet I would be perfectly happy with them out of the tree (and a convenient way of retrieving them). But if you do choose to leave them in the tree, why not gzip them?

  30. How about not truncating the changelogs, but keeping only the entries related to files that were actually *changed* and, most importantly, are still present in the actual portage tree?

      It’s not like it’s impossible to do. A properly formatted changelog entry contains a list of ‘changes’, like

      kannel-1.4.1.ebuild:

      or

      +files/kannel-1.4.0-mysql-list.patch, kannel-1.4.0.ebuild:

      and the entries that refer to nonexistent files could just be purged for rsync users.

  31. @Ben – throwing disc at the problem isn’t the answer. When email servers start to get full, you don’t add more disc, you tell your users to archive. You only add more disc when it’s totally necessary. That’s the climate of modern business these days, and Gentoo ultimately is a business, with a product and customers.

      I say ditch the CL’s altogether. If I want to know what’s in a new version, I look upstream. Or on p.g.o. – there’s no need for the changelogs to be in the downloads.

  32. Update – I put the exclude line as listed above into make.conf, and the used space on my root partition shrank by 2 GB after the next sync.
