Big filesystems

Very few of you probably remember that over two years ago, in October 2009, I did some investigative work on the Portage tree’s overhead to show just how much space gets wasted by small files on filesystems with overly large block sizes.

It wasn’t the only time I noted that while for things like Portage, and likely your operating system’s files, it makes sense to have smaller-than-page-size blocks, it doesn’t seem as smart to do the same for bigger files such as music and video. At the time I noted that HFS+ somehow supported 64KiB blocks with the Linux driver – a driver that is very much unstable and often unusable – while XFS refuses to play well with similarly-sized blocks, even though it is designed to support them.

I’ve read many people complaining that I didn’t know what I was talking about when I called for bigger block sizes for Linux’s filesystems. Many people insisted that the presence of extents in ext4 made it completely moot to have bigger block sizes. If that’s so, I wonder why ext4 now implements bigalloc, which is basically a trick to allocate in bigger clusters of blocks.

I read about it, with the release announcement of kernel 3.2, while I was on vacation, and I just couldn’t wait to try it out with some of my filesystems. Luckily I tried it with the least important one, though, as it’s far from mature enough for use.

The current implementation does not support online resizing, so you’re supposed to run resize2fs on the unmounted filesystem … too bad that it fails to run entirely when using the latest version of e2fsprogs. Oh, and don’t forget that the switch to turn on bigalloc is not documented anywhere yet.

All of this is to be expected given that it’s a very new feature, but I wonder why half the fuss about the 3.2 release was about a feature … that definitely is not ready for prime time, even as a testing ground. I just hope that work toward this kind of feature will also mean that XFS will gain support for 64KiB blocks, which I would prefer over ext4’s 1MiB clusters in the first place.

Also, I would like to point out one thing for those of you who wish to use this feature on volumes shared over Samba to OS X hosts: you’ll end up with tons of space wasted on .DS_Store files unless the inline data feature is also used and the inode size is increased. On my filesystems, .DS_Store files weigh between 741 bytes and 14KiB… I thought I had configured Samba to use extended attributes to store the data instead of using external files, but from what I gathered on the Netatalk mailing list recently, this conflicts with the size limit applied to EAs on ext4… I guess this is another of those things that really need some tweaking to get right.

Another good reason to use 64-bit installations: Large File Support headaches

A couple of months ago I wrote about why I made my router a 64-bit install, listing a series of reasons why 64-bit hardened systems are safer to manage than 32-bit ones, mostly because of the feature set of the CPUs themselves. What I didn’t write about that time, though, is the fact that 64-bit installs also don’t require you to deal with the curse of large file support (LFS).

It was over two years ago that I last wrote about this, and at the time my motivation was mostly drained by a widely known troll insisting that I got my explanation wrong. Just for the sake of not wanting to repeat the same pantomime, I’d like to thank Lars for actually getting me a copy of Advanced Programming in the Unix Environment, so that I can point said troll to the pages where the diagrams he referred to are: 106 to 108. And there is nothing there to corroborate his views against mine.

But now, let’s take a few steps back and look at what I’m talking about altogether.

What is large file support? It is a set of interfaces designed to work around the limits imposed by the original design of the POSIX file API on 32-bit systems. The original implementations of functions like open(), stat(), fseeko() and so on were designed using 32-bit data types, either signed or unsigned depending on the use case. This has the unfortunate effect of limiting a number of attributes to that boundary; the most obvious problem is the size of the files themselves: you cannot use open() to get a descriptor to a file that is bigger than 2GB, as the offsets would overflow. The inability of some of your software to process files bigger than 2GB isn’t, though, that much of a problem – after all, not all software can work with such files within reasonable resource constraints – but that’s not the worst problem you have to consider.
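
To make the mechanics a bit more concrete, here is a minimal sketch of my own (the file name and code are mine, not from the original post) showing what the LFS interface actually changes on a 32-bit glibc system: nothing more than the width of the types the file calls operate on. It is an illustration under those assumptions, not a definitive test.

/* lfs-sizes.c — build on a 32-bit system once as-is and once with
 * -D_FILE_OFFSET_BITS=64 (which is what the LFS interface boils down to
 * with glibc) and compare the output. */
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    /* off_t: 4 bytes without LFS on 32-bit, 8 bytes with
     * -D_FILE_OFFSET_BITS=64, and always 8 on the common 64-bit ABIs. */
    printf("sizeof(off_t) = %zu\n", sizeof(off_t));
    /* ino_t follows the same pattern, which matters for the
     * huge-filesystem case discussed below. */
    printf("sizeof(ino_t) = %zu\n", sizeof(ino_t));
    return 0;
}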

Because of this limit on file size, the new set of interfaces has always been called “large file”, but the name itself is a bit of a misnomer; this new set of interfaces, with extended 64-bit parameters and data fields, is required for operating on large file systems as well. I might not have expressed it in the most comprehensible of terms two years ago, so let’s go over it from scratch again.

In a filesystem, the files’ data and metadata are tied to structures called inodes; each inode has an individual number, and this number is listed within the content of a directory to link it to the files it contains. The number of files that can be created on a filesystem is limited by the number of unique inode numbers that the filesystem is able to cope with — you need at least one inode per file; you can check the status with df -i. This amount is in turn tied both to the size of the data field itself and to the data structure used to look up the location of the inode on the filesystem. Because of this, the ext3 filesystem does not even reach the 32-bit limit. On the other hand, both XFS and ext4, using more modern data structures, can reach that limit just fine… and they are actually designed to overcome it altogether.
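
If you’d rather query those numbers from code than with df -i, here is a minimal sketch (my own illustration, file name included) using statvfs(), which reports roughly the same inode counts df shows:

/* inode-count.c — print total and free inode counts for the filesystem
 * backing a path; roughly the same numbers `df -i` reports. */
#include <stdint.h>
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    const char *path = argc > 1 ? argv[1] : ".";
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    printf("%s: %ju inodes total, %ju free\n", path,
           (uintmax_t)vfs.f_files, (uintmax_t)vfs.f_ffree);
    return 0;
}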

Now, the fact that they are designed to support a 64-bit inode number field does not mean that they always will; for what it’s worth, XFS is designed to support block sizes over 4KiB, up to 64KiB, but the Linux kernel does not support that feature. On the other hand, as I said, the support is there to be used in the future. Unfortunately this cannot feasibly be done until we know for sure that userland software will work with such a filesystem. It is one thing to be unable to open a huge file; it is another not to be able to interact in any way with files within a huge filesystem. Which is why both Eric and I, in the previous post, focused first on testing which software was still using the old stat() calls with the data structure with a 32-bit inode number field. It’s not about the single file size, it’s a matter of huge filesystem support.
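
To show what “not being able to interact in any way” looks like in practice, here is a minimal sketch of the failure mode, assuming a 32-bit build without the LFS interface; this is my illustration, not code from Ruby-Elf or the earlier posts:

/* stat-overflow.c — compiled as a 32-bit binary *without*
 * -D_FILE_OFFSET_BITS=64, stat() has to fit the inode number and the size
 * into 32-bit fields, and glibc reports EOVERFLOW when they don't fit,
 * e.g. for files on a filesystem handing out 64-bit inode numbers. */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }

    if (stat(argv[1], &st) != 0) {
        /* "Value too large for defined data type" is the EOVERFLOW case:
         * the file is there, the program just cannot describe it. */
        fprintf(stderr, "stat(%s): %s\n", argv[1], strerror(errno));
        return 1;
    }

    printf("%s: inode %ju, %jd bytes\n", argv[1],
           (uintmax_t)st.st_ino, (intmax_t)st.st_size);
    return 0;
}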

Now, let’s wander back to why I wanted to return to this topic. With my current line of work I discovered at least one package in Gentoo (bsdiff) that was supposed to have LFS support, but didn’t, because of a simple mistake (append-lfs-flags acts on CPPFLAGS, but that variable wasn’t used in the build at all). I thought a bit about it, and there are so many ways to sneak in a mistake that would cause a package to lose LFS support even if it was added at first. For instance, in a package based on autotools that uses AC_SYS_LARGEFILE to look for proper largefile support, it is easy to forget to include config.h before any other system header, and when that happens, the largefile support is lost.
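
Here is a minimal sketch of that particular pitfall, under the assumption of a 32-bit host and a configure script using AC_SYS_LARGEFILE; the compile-time guard is my own illustration, not something the macro provides:

/* AC_SYS_LARGEFILE puts `#define _FILE_OFFSET_BITS 64` into config.h on
 * 32-bit hosts, but the macro only has an effect if it is seen before the
 * first system header that declares off_t. */
#ifdef HAVE_CONFIG_H
# include "config.h"    /* must come first, always */
#endif

#include <sys/types.h>  /* off_t's width is decided by this point */
#include <stdio.h>

/* A cheap guard against silently losing LFS: fail the build if off_t is
 * not 64 bits wide.  Had a system header been included above config.h,
 * this would trip on a 32-bit build. */
typedef char assert_lfs_enabled[sizeof(off_t) == 8 ? 1 : -1];

int main(void)
{
    printf("off_t is %zu bytes\n", sizeof(off_t));
    return 0;
}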

To make it easier to identify packages that might have problems, I’ve decided to implement a tool for this in my Ruby-Elf project called verify-lfs.rb which checks for the presence of non-LFS symbols, as well as a mix of both LFS and non-LFS interfaces. The code is available on Gitorious, although I have yet to write a man page, and I have to add a recursive scan option as well.

Finally, as the title suggests, if you are using a 64-bit Linux system you don’t even have to think about this at all: modern 64-bit architectures define the original ABI as 64-bit already, making all the largefile support headaches irrelevant. The same goes for FreeBSD as well, as they made the LFS interface their only interface with version 5, avoiding the whole mess of conditionality.

I’m seriously scared of what I could see if I were to run my script over the (32-bit) tinderbox. Sigh.

Useless legacies

I always find it at least fascinating, the religiousness (and this is most definitely not a compliment, coming from an atheist) with which some people stand to defend “classical” (or, in my opinion more properly, “legacy”) choices in the Unix world. I also tend not to consider them too much; I challenged the use of a separate /boot over two years ago, and I still stand behind my opinion: for the most common system configurations, /boot is not useful to keep separate. Of course there are catches.

*One of these catches in particular is that you need to have /boot on its own partition to use LVM for the root file system, and that in turn is something you’d probably like to have by today’s standards, so that you don’t really have to choose how much space to dedicate to root, which heavily depends on how much software you’re going to put on it. Fedora has been doing that for a while, but then it just shifts the problem to how much space to dedicate to /boot, and that became quite a problem with the 11→12 update… in general, I think the case might be building up for either using a separate /boot, or just using EFI, which as far as I can tell, can solve the problem at the root… no pun intended.*

For some reason, it seems like a huge lot of legacies relate to filesystems, or maybe it’s just because filesystems are something I struggle with continuously, especially when it comes to combining the classical Unix filesystem hierarchy with my generally less hierarchical use of it. I’m not going to argue here for not splitting the usual /usr out of the root file system (while it’s something I definitely would support, since that pretty artificial split makes the whole system startup a messy problem), nor am I going to discuss how to divide your storage space to fit the standard “legacy” hierarchy.

What I’m just wondering about is why lost+found has been so strongly defended by somebody who (I read) boasts of having experience in disaster recovery. I’m not doubting its usefulness in general, but I’m also considering that in most “desktop” cases it’s just confusing — or irritating in my case, in a particular automated system I’m working on.

First let me explain why I find this annoying: try running initdb on a newly-created, just-mounted ext3 file system. It will fail, because it finds the lost+found directory in the base of the filesystem, and since the target directory is therefore not empty, it refuses to work. There isn’t, by the way, any switch such as --force to tell it to just run anyway, which is the most obnoxious thing in all this. I know what I’m doing, I just want you to do it! So anyway, my choices here are either to remove the lost+found directory every time I mount a new filesystem (I have to admit I don’t know/don’t remember whether the directory is re-created at mount, or during fsck), or to create a sub-directory to run initdb in. Whichever the choice, it requires one further command, which is not much, but in this case it’s a slight problem.

So I went and wondered, “Is lost+found really useful? Can’t I just get rid of it?”, and then all hell broke loose.

I’m quite positive I don’t need the directory to be there empty; I can understand it might be useful when stuff is in it, but empty? On a newly-created filesystem? I have my sincere doubts about that. And even when stuff gets into it, is it really useful to have it restored there? Well almost certainly in some cases, but always? Without a way at all to get rid of that option? It sounds a bit too much for me.

Let me show you a few possible scenarios, which are what I actually experience:

  • /var/cache is on a separate filesystem for me, the reason being that it’s quite big and ends up growing a lot because it keeps, among other things, the Portage distfiles for me and the tinderbox; if anything happens to that filesystem, I won’t spend more than 5 minutes on it, it’ll be destroyed and recreated; the name cache should make it obvious, and the FHS designates it for content that can be dropped, and recreated, without trouble; do I need orphan files recovered from that filesystem? No, I just need to know whether there is something wrong with the FS; if there is, I’ll recreate it to be on the safe side, in case data got corrupted;
  • my router’s root file system… it turned out to be corrupt a couple of times and stuff was added to lost+found… did I care about that? Not really. I flashed in a new copy of the filesystem, no data loss for me there, besides once, before I set up rsnapshot, when I lost my network configuration; oh well, it took me all of half an hour to rewrite it from scratch — if you wonder what the corruption was about, it was a faulty CF card; I’ll have to write about those CF cards at some point;
  • the rest of my running data, which is all of the rest of my systems… if I were to find corruption on my filesystems, I’d do what I did in the past: clear them out, make sure I hadn’t chosen the wrong filesystem type to begin with, and then recreate them; do I care about finding the data in lost+found? Nope, I’ve got backups.

The trick here is that I’ve got backups. Of course if I didn’t have backups, or if my backups were foobar’d, I’d be looking at everything to restore my data, but to be honest, I found that it’s a much better investment to improve your backup strategy than to spend time recovering data. Of course, I don’t have “down to the microsecond” backups, as somebody told me I’d be needing to avoid using lost+found, but again, I don’t need that kind of redundancy. I have hourly backups for my systems, which is by itself above average, and it works pretty well. I’d be surprised if the vast majority of desktop systems cared about backups going back more than a week.

Now this should cover most of my points: lost+found is not indispensable. You can live perfectly well without it. I don’t think I’ve ever used it myself; when faced with corrupted filesystems (and trust me, it happened to me more than once) my solution was one of: get the backup, re-do the little work lost, or discard the data altogether. Sure, over the years I might have lost bits and pieces of stuff that I might have cared about, but nothing major. The worst thing that happened to me in the past three years has been the need to re-download the updates and drivers for Windows (XP and Vista both) that I keep around for when customers bring me their computers to fix. Okay, I have no experience with enterprise-grade post-apocalyptic disaster recovery, so what? It doesn’t change the fact that in my case (and, I’d say, a lot of users’ cases) it doesn’t matter.

I’m not asking to get rid of the feature altogether, but making it optional would be nice, or at least not forcing me to have the directory around. Interestingly enough, xfs_repair does not need the directory to be present; it’ll use it if it’s present and full, it’ll create and populate it if orphan files are found, but otherwise it’s invisible. Apple’s HFS+ is more or less on the same page. I admit ignorance about the Reiser family, JFS and ZFS.

Whatever the case, can we just stop asserting that what was good in the ‘70s, or what is good for enterprise-grade systems, is good for desktop systems as well? Can we stop accepting legacies just because they are there? I’m not for breaking heaps of compatibility at every turn (and please nobody say HAL, ‘kay? — okay, here the pun was most definitely intended), but yes, it takes challenging the status quo to get something better out of it!

P.S.: if somebody can suggest an option to mkfs or mount to avoid that directory, I’m still eager to hear about it!

Testing beforehand

Things need to be tested by developers before they are ready for public consumption. This is pretty well known in the Free Software development world, but it does not seem to get through to all users, especially the novices coming from the world of proprietary software, in particular Windows software.

This is probably because in that world, stable software does not really get many changes by default, and as a result most users tend to use experimental software, oftentimes in beta state, for their daily work. Now, this is probably due to how the software is labelled by those companies too: Apple’s “beta” Safari 4 is mostly stable compared to the older version, but I guess it’s far from complete behind the scenes; on the other hand, a development version of a piece of Free Software may very well be unusable because of crashes, since it becomes available much sooner.

Similarly, tricks that increase performance are pretty common in software ecosystems like Windows’s, because there is no other way to get better performance (and Microsoft is pretty bad at that, I guess I’ll write about that one day too). At the same time, what passes for a trick in the Free Software world may very well be totally and utterly broken.

Indeed, since I joined Gentoo there have been quite a few different tweaks and tricks that are supposed to either improve your runtime performance tenfold, or make you compile stuff in a fifth of the time. Some of these came out of just stupid copy and paste, while others are outright disinformation which I tried to debunk before. On the other hand, I’m the handler of one of the most successful tricks (at least I hope it is so!).

My problem is that, for some users, the important tricks are the ones that the developers don’t speak about. I don’t know why this is; maybe they think that the developers are out to screw them over. Maybe they think the distribution developers like me are just part of a conspiracy that wants to waste their CPU power or something. Or maybe they think we want to look cool by being able to compile stuff much sooner than they do. Who knows? But the point is that none of this is the case, of course.

What I think is that this kind of trick should really be tested by developers first, so that it doesn’t end up biting people in their bottoms. One of these tricks that lately seems to be pretty popular is in-memory builds with tmpfs. Now, this is something I really should look into doing myself, too. With 16GB of memory, with the exception of OpenOffice, this could be quite a useful addition to my tinderbox (if I can save and restore the state quickly, that is).

I do have a problem with users telling people to use this right now, as it is. The first problem is that, given that ccache and distcc usage are handled by Portage, this probably should be, too. The second problem is related to what the suggestions lack: the identification of the problems. Because, mind you, it’s not just building in memory that you’re doing; it’s also building with tmpfs!

By itself, tmpfs does not have any particular bugs that might hinder builds, but it has one interesting feature: sub-second timestamps. These are also available on XFS, so saying that Gentoo does not support building on tmpfs (because it increases the build failure rate) is far from the truth, as we do support XFS builds pretty well. Unfortunately neither users nor, I have to say, developers know about this detail. Indeed you can see it here:

flame@yamato ~ % touch foobar /tmp/foobar
flame@yamato ~ % stat -c '%y' foobar /tmp/foobar 
2009-06-02 04:04:14.000000000 +0200
2009-06-02 04:04:21.197995915 +0200

How this relates to builds is easy to understand if you know how make works: by tracking the mtime of dependencies and targets. If they don’t follow in the right sequence, the build may break or enter infinite loops (like in the case of smuxi some time ago), and this happens much more easily when the resolution of mtime is finer than a second: if the timestamps stop at one second, any command taking less than that leaves its target with the same timestamp as its inputs, so missing or reversed dependencies simply go unnoticed.
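
If you want to check from code whether the filesystem you build on keeps sub-second timestamps, the same information shown by the stat command above is available through stat(2); a minimal sketch of my own (not from the post) reading the nanosecond field:

/* mtime-ns.c — print the mtime of a path including its nanosecond part.
 * On ext3 the nanoseconds are always zero; on tmpfs or XFS they are
 * generally populated, which is exactly what exposes the make
 * dependency-ordering mistakes that a 1-second resolution would hide. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc < 2 || stat(argv[1], &st) != 0) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }

    printf("%s: mtime %ld.%09ld\n", argv[1],
           (long)st.st_mtim.tv_sec, (long)st.st_mtim.tv_nsec);
    return 0;
}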

I have already written a few posts about fixing make in my “For a Parallel World” series; most of them are useful for fixing this kind of issue too, so you might want to refer to those.

Finally, I want to say that there are other things that you should probably know when thinking about using tmpfs to build straight in memory. One of these is that, by default, gcc already builds in memory by itself, somewhat. Indeed, the -pipe compiler flag that almost everybody has in their CFLAGS variable tells the compiler just that: to keep the temporary data in memory and, for instance, run the assembler directly on it. While the kind of temporary data that is kept in the build directory and the kind kept in memory by -pipe are not the same thing, if you’re limited on memory you could probably try to disable -pipe and let the compiler fall back to temporary files.

But sincerely, I think there would be a much greater gain if people started to help out fixing parallel make issues; compiling with just one core can get pretty tiresome even on a warbox like Yamato, and that is the case with Ardour for instance, because scons is not currently called with a jobs option to build in parallel. Unfortunately, the last time I tried to declare a proper variable for the number of parallel jobs, so that it didn’t have to be hackishly extracted from MAKEOPTS, the issue ended up stalled on gentoo-dev by bikeshed arguments over the name of the variable.

On the other hand, this “trick” (if you want to call it that) could be a nice way to start, given that lots of parallel make issues also show up with tmpfs/XFS (the timestamps might go backward); I think I remember ext4 having an option to enable sub-second timestamps, so maybe developers should start by setting up their devbox with that enabled, or with XFS, so that the issues can be found even by those who don’t have enough memory to afford in-memory builds.

Further reading: Mike’s testing with in-memory builds for FATE.

Filesystems — ext4dev fails

flame@yamato ~ % touch /var/tmp/portage/test
touch: cannot touch `/var/tmp/portage/test': No space left on device
flame@yamato ~ % df -h | grep /var/tmp
32G 7.2G 23G 25% /var/tmp
flame@yamato ~ % df -i | grep /var/tmp
2097152 419433 1677719 21% /var/tmp

Now, a mount cycle later it worked fine, but it’s still not too nice, since it caused all the running emerges to fail, just like XFS did, but without leaving any trace in the kernel log, which makes it obnoxious since it’s hard to debug. I hope 2.6.28 is going to be better; certainly the tinderboxing is a nice way to stress-test filesystems.

I’m starting to consider the idea of OpenSolaris, NFS, and InfiniBand…

Have you tested that?

Even though I’ve heard Patrick complain about this many times, I never would have been able to assess how much of the tree goes untested if I hadn’t started my own little tinderbox. Now, I’m probably hitting more problems than Patrick because I’m running ~arch (or mostly ~arch) with --as-needed enabled, but it still shows that there is a huge amount of stuff that needs to be fixed, or dropped.

Up to now I’ve been using GCC 4.1, and I still hit build failures with it; now I’ve switched to GCC 4.3, even though the tracker showed a bad situation already; and of course there are packages that don’t have bugs opened just yet, because nobody has built them recently.

Still, supporting the new compilers is honestly not my main concern; there are packages that won’t build with GCC 4.3 just yet, like VirtualBox, just as there are packages that still don’t compile with GCC 4.0. What concerns me is that there is stuff that hasn’t been tested at all. For instance, sys-fs/diskdev_cmds, which was marked ~amd64, was totally broken, with fsck.hfs causing a segmentation fault as soon as it was executed (the version that is now available works; the old one has been taken out of the keyworded tree).

Since even upstream sometimes fails, one should also take the packages’ tests into consideration, possibly ensuring their failures are meaningful, and not just “ah, this would never work anyway”. If you check dev-ruby/ruby-prof, the test suite is executed, but a specific test, which is known to fail, is taken out of it first. This is actually pretty important because it saves me from using RESTRICT to disable the whole test suite, and executing the remaining tests helped me when new code was added to support rdtsc on AMD64, which was broken. The broken code never entered the tree, by the way.

Unfortunately, doing a run with FEATURES=test enabled is probably going to waste my time, since I expect a good part of the tree to fail with that; with some luck, if Zac implements a soft-fail for tests for me, I’ll be able to do such a run in the next months. I wonder if the run will be faster this time; I’ve moved my chroots partition to ext4(dev) rather than XFS, and it seems to be seriously faster. I guess once 2.6.28 is released I’ll move the rest of my XFS filesystems to ext4 (not my home directory yet though, which is ext3, nor the multimedia stuff that is going to remain HFS+ for now).

My build run also has some extra bashrc tests, besides the ones I already wrote about, that is, the checks for misplaced documentation and man pages. One checks for functions from the most common libraries (libz, libbz2, libexpat, libavcodec, libpng, libjpeg) that get bundled in, to identify possibly bundled-in copies of those; another checks for the functions that are considered insecure and dangerous by the libc itself (those for which the linker will emit a warning). It is interesting to see their output, although a bit scary.

Hopefully, the data that I gather and submit on the bugzilla from these builds will allow us to have a better, more stable, and more secure Portage tree as time goes by. And hopefully ext4 won’t fry my drives.

Filesystems — take two

After the problem last week with XFS, today seems like a second take.

I wake up this morning to a reply about my HFS+ export patch, telling me that I have to implement the get_parent interface to make sure that NFS works even when the dentry cache is empty (which is most likely what caused some issues with iTunes while I was doing my conversion); good enough, I started working on it.

And while I was actually working on it, I find that the tinderbox is not compiling. A dmesg later shows that, once again, XFS had in-memory corruption, and I have to restart the box again. Thankfully, I got my SysRescue USB stick, which allowed me to check the filesystem before restarting.

Now this brings me to a couple of problems I have to solve. The first is that I finally have to move /var/tmp to its own partition so that /var does not get clobbered if/when the filesystem goes crazy; the second is that I have to consider alternatives to XFS for my filesystems. My home is already using ext3, but I don’t need performance there so it does not matter much; my root partition is using JFS since that’s what I tried when I reinstalled the system last year, although it didn’t turn out very well, and the resize support actually ate my data away.

Since I don’t care if my data gets eaten away on /var/tmp (the worst that might happen is me losing a patch I’m working on, or not being able to fetch the config.log for a failed package – and that is something I’ve been thinking about already), I think I’ll try something more “hardcore” and see how it goes: I’ll use ext4 for /var/tmp, unless it panics my kernel, in which case I’m going to try JFS again.

Oh well, time to resume my tasks I guess!