My take on the separate /usr issue

This is a blog post I would definitely have preferred not to write — it’s a topic that honestly does not touch me that much, for a few reasons I’ll explore in a moment, and at the same time it’s one that is quite controversial, as it has quite a few meanings layered one on top of the other. Since I’m writing this anyway, I’d first like to make sure that readers know who I am and why I’m probably going to just delete comments telling me that I don’t care about compatibility with older systems and other operating systems.

My first project within Gentoo was Gentoo/FreeBSD — I have a (sometimes insane) interest in portability to operating systems that are far from mainstream. I’m a supporter of what I call “software biodiversity”, and I think that even crazy experiments have a right to exist, if only to learn tricks and find issues to avoid. So please don’t give me the kind of crap I noted above.

So, let’s see — I generally have little interest in keeping things around just for the sake of it, and as I wrote a long time ago I don’t use a separate /boot in most cases. I also generally dislike legacies kept for the sake of legacies. It’s thus a good idea to start by looking at which legacies bring us to the point of discussing whether /usr should be split at all. If it’s not to be split, there’s no point debating support for a split /usr, no?

The first legacy, which is specific to Gentoo, is tied to the fact that our default Portage tree is set to /usr/portage, and that the ebuild tree itself, the source files (distfiles) and the built binary packages are all stored there. This particular tree is hungry in disk space and even more so in inodes. Since both the tree and the open source projects we package keep growing, the amount of these two resources we need increases as well, and since they live by default under /usr, it’s entirely possible that, if the space is allocated statically when partitioning, a point will be reached where there isn’t enough space, or there aren’t enough inodes, left to allocate anything in it. If /usr/portage resides on the root filesystem, it’s also very possible, if not very likely, that the system will stop working entirely because there is no space left on it.
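For reference, a quick way to keep an eye on both resources is the venerable df, which can report inodes as well as bytes (paths as on a default Gentoo install):

df -h /usr/portage   # disk space usage
df -i /usr/portage   # inode usage, which tends to run out first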

One solution to this problem is to allocate /usr/portage its own partition — I still don’t like that much as an option, because /usr is supposed to be, according to the FHS/LSB, for read-only data. Most other distributions use subdirectories of /var for this kind of content, as that’s what it is designed for. So why are we using /usr? Well, it turns out this is something that was inspired by FreeBSD, where /usr is used for just about everything, including temporary directories and other similar uses. Indeed, /usr/portage finds its peer in /usr/ports, which is where Daniel seems to have taken the inspiration to write Portage in the first place. It should be an easy legacy to overcome, but migrating it is probably tricky enough that nobody has done so yet. Too bad.
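For those who want to move it on their own systems, the make.conf side is just a sketch like this (assuming you also move the data itself over, and fix any scripts that hardcode the old paths):

PORTDIR="/var/portage"
DISTDIR="${PORTDIR}/distfiles"
PKGDIR="${PORTDIR}/packages"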

Before somebody asks: yes, until a while ago splitting the whole of /var – which is generally considered much more sensible – was a pain in the neck, among other things because things were using /var/run and similar paths before the partition could be mounted. The situation is much better now thanks to /run being available much earlier in the boot process — this is not yet properly handled by all the init scripts out there, but we’re getting there, slowly.

Okay, so to the next issue: when do you want to split /usr at all? This depends on a number of factors, but I guess the first question is whether you’re installing a new system or maintaining an old one. If you’re installing a new one, I really can’t think of any good reason to split /usr out — the only one that comes to mind is wanting to have it in LVM while keeping the rootfs as a standalone partition — and I don’t see why. I’d rather, at that point, put the rootfs in LVM as well, and just use an initrd to accomplish that — if that’s too difficult, well, it’s a reason to fix the way initrds or LVM are handled, not to keep insisting on splitting /usr! Interestingly enough, such a situation calls for the same /boot split I resented five years ago. I still use LVM without having the rootfs in it, and without needing to split /usr at all.
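For what it’s worth, the rootfs-in-LVM setup is a few commands plus an initrd; a minimal sketch, assuming a volume group called vg and dracut as the initramfs generator:

lvcreate -L 20G -n root vg
mkfs.ext4 /dev/vg/root
# generate an initramfs that activates LVM before mounting root
dracut --add lvm --force /boot/initramfs-$(uname -r).img
# the kernel command line then simply points at the volume:
#   root=/dev/vg/root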

Speaking of which, most ready-to-install distributions only offer the option of using LVM — it makes sense, as you need to cater to as many systems as possible at once. This is where Gentoo Linux generally stands apart from the rest: the power of building things for exactly what you want to use them for makes it generally possible to skip the overgeneralization, and that’s why we’re virtually the only distribution out there able to work without an initrd.

Another point that comes up often is a system where the space in the rootfs was badly allocated, and /usr is being split out because there is not enough space. I’m sorry that this is a common issue, and I do know that it’s a pain to re-partition such a system, as it involves at least a minimal downtime. But this is why we have workarounds, including the whole initrd thing. I mean, it’s not that difficult to manage with an initrd, and yes, I can understand that it’s more work than just having the whole system boot without /usr — but it’s a sensible way to handle it, in my opinion. It’s a choice between work and work: work for everybody under the sun to get a split /usr working properly, or work for those who got the estimate wrong and now need the split /usr, and you can guess who I prefer doing the work (hint: like everybody in this line of business, I’m lazy).

Some people have said that /usr is often provided over NFS, with a very simple, lightweight rootfs used in those circumstances — I understand this need, but the current solution to support a split /usr is causing the rootfs to not be as simple and lightweight as before. The initrd route is, in that sense, probably the best option: you just get an initrd able to mount the root itself through NFS, and you’re done. The only problem left is handling the case where /etc needs to differ from one system to the next, but I’m pretty sure that can be fixed fairly easily as well.
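In other words, something along these lines on the kernel command line (server address and path are hypothetical, of course):

root=/dev/nfs nfsroot=192.168.0.1:/exports/client-root ip=dhcp

Per-host /etc could then be handled server-side, for instance with one export per client, rather than by keeping /usr split on every machine.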

I have to be honest, there is one part of /usr that I do end up splitting away very often: /usr/lib/debug — the reason is simple: it grows with the size of the sources, rather than with the size of the compiled code, and with new versions of the compilers, which add more debug information. I got to a point where the debug files occupied four or five times the size of the rest of the rootfs. But this is quite the exception.
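In Gentoo the debug files end up under /usr/lib/debug when the splitdebug feature is enabled, so splitting them away is a matter of one fstab line (the device name here is hypothetical):

# in make.conf
FEATURES="splitdebug"
# in /etc/fstab
/dev/vg/debug   /usr/lib/debug   ext4   noatime   0 2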

But why would it be that much of a problem to keep a split /usr? Well, it’s mostly a matter of what you’re supposed to be able to do without /usr mounted. For many cases, udev was and is the only problem, as such setups really don’t want much of an early-boot environment besides being able to start LVM and mount /usr; but the big problem happens if you want to be able to have even a single login with /usr not mounted — because the PAM chain has quite a few dependencies that aren’t available until it’s mounted. Moving PAM itself is not much of an option, and it gets worse: start-stop-daemon can technically also use PAM chains that partially need /usr to be available, and if that happens, no init script using s-s-d would be able to run. And that’s bad.

So, do I like the collapsing of everything into /usr? Maybe not that much, because it’s a lot of work to support multiple locations and to migrate configurations. But at the same time I’m not going to bother fighting it: I’ll just keep the rootfs and /usr on the same partition for the time being, and if I have to split something out, it’ll be /var.

Big filesystems

Very few of you probably remember that over two years ago, in October 2009, I did some investigative work on the Portage tree’s overhead to show just how much space was going to be wasted with small files on filesystems with overly large block sizes.

It wasn’t the only time I noted that while for things like Portage, and likely your operating system’s own files, it makes sense to have smaller-than-page-size blocks, it doesn’t seem as smart to do the same for bigger files such as music and video. At the time I noted that HFS+ somehow supported 64KiB blocks with the Linux driver – a driver that is very much unstable and oftentimes unusable – while XFS refuses to play well with similarly-sized blocks, even though it is designed to support them.

I’ve read many people complaining that I didn’t know what I was talking about when I called for bigger block sizes in Linux’s filesystems. Many insisted that the presence of extents in ext4 made bigger block sizes completely moot. If that’s so, I wonder why ext4 now implements bigalloc, which is basically a trick to allow bigger cluster sizes, allocating blocks in groups.

I read about it in the release announcement of kernel 3.2, while I was on vacation, and I just couldn’t wait to try it out on some of my filesystems. Luckily I tried it on the least important one, though, as it’s far from mature.

The current implementation does not support online resizing, so you’re supposed to use resize2fs on the unmounted filesystem … too bad that it fails to run entirely when using the latest version of e2fsprogs. Oh, and don’t forget that the switch to turn on bigalloc is not documented anywhere yet.
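For the curious, the invocation I ended up with, pieced together from the sources given the lack of documentation, looks more or less like this (a sketch, so take it with a grain of salt; the cluster size goes with -C):

mke2fs -t ext4 -O bigalloc -C 65536 /dev/sdb1   # 64KiB clusters
tune2fs -l /dev/sdb1 | grep -i cluster          # verify it took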

All of this is to be expected, given that it’s a very new feature, but I wonder why half the fuss about the 3.2 release was about a feature that definitely is not ready for prime time, even in testing grounds. I just hope that work toward this kind of feature will also mean that XFS gains support for 64KiB blocks, which I would prefer to ext4’s 1MiB clusters in the first place.

Also, I would like to point out one thing for those of you who wish to use this feature on volumes shared through Samba to OS X hosts: you’ll end up with tons of space wasted on .DS_Store files, unless the inline data feature is also used and the inode size is increased. On my filesystems, .DS_Store files weigh between 741 bytes and 14KiB… I thought I had configured Samba to use extended attributes to store the data instead of using external files, but from what I gathered on the Netatalk mailing list recently, this conflicts with the size limit applied to EAs on ext4… I guess this is another of those things that really need some tweaking to get right.
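For reference, this is the kind of smb.conf fragment I mean; whether to push the metadata into EAs or to veto the files outright is a matter of taste (the share name is hypothetical):

[media]
    ea support = yes
    vfs objects = streams_xattr
    veto files = /.DS_Store/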

Apple’s HFS+, open-source tools, and LLVM

The title of this post seems a bit messed up, but it’ll make sense in the end. It’s half a recount of my personal hardware troubles and half a recount of my fight with Apple’s software, and not of the kind my readers hate to read about, I guess.

I recently had to tear apart my Seagate FreeAgent Xtreme external HDD. The reasons? Well, besides leaving me without a connection while using it (with Yamato) on eSATA, forcing me to use either FireWire or USB (both much slower — and I paid for it to use eSATA!), yesterday it decided it didn’t want to let me access anything via any of the three connections, not even after a number of power cycles (waiting for it to cool down as well); this was probably related to the fact that I tried to use it again over eSATA, connected to the new laptop, to try copying an already-set-up partition from the local drive to make space for (sigh) Windows 7.

Luckily, there was no data worth spending time on in that partition, just a few GNOME settings I could recreate in a matter of minutes anyway.

Since the Oxford Electronics-based bridge on the device decided not to help me get my data back, I decided to break it up, with the help of a YouTube video (don’t say that YouTube isn’t helpful!), and took the drive itself out; it is obviously a Seagate 7200.11 1TB drive, quite a sturdy one to look at. No, I won’t add it as the 7th disk drive to Yamato, mostly because I fear it wouldn’t be able to start up any more if I did so.

Thankfully, I bought a Nilox-branded “bay” a month or so ago, when I gave away what remained of Enterprise to a friend of mine (the only task that Enterprise was still doing was saving data out of SATA disks when people brought me laptops or PCs that fried up). My choice of that bay was due to the fact that it allows you to plug in both 3.5” and 2.5” SATA disks without having to screw them in anywhere. It does look a lot like something out of the Dollhouse set, to be honest, but that doesn’t matter now.

I plugged it in and started downloading the data; I can’t be sure it is all fine, so I deleted lots and lots of stuff I couldn’t feel safe about for a while. Then I shivered, fearing the disk itself was bad and that I had no way to check it out… thankfully, the bay uses Sunplus electronics, and – lo and behold! – smartmontools has a driver for the Sunplus USB bridge! A SMART test later, and the disk turns out to feel better than any other disk I’ve ever used. Wow. Well, it’s to be expected, as I never compiled on it.
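For those in the same situation, the bridge needs to be named explicitly on the command line (the device node is obviously system-dependent):

smartctl -d usbsunplus -a /dev/sdb        # dump SMART data through the bridge
smartctl -d usbsunplus -t long /dev/sdb   # kick off an extended self-test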

Anyway, what can I do with a 1TB SATA disk I cannot plug into any computer as-is? Well, actually, one thing I can do: backup storage. Not the kind of rolling backup I’m currently doing with rsnapshot and the WD MyBook Studio II over eSATA (anything else is just too slow to back up virtual machines), but rather a fixed backup of stuff I don’t expect to be looking at or using anytime soon. But to be on the safe side, I wanted to have it available in a format I can access on the go, from the Mac as well as from Linux; and vfat is obviously not a good choice.

The choice is, for the Nth time, HFS+. Since Apple has published quite a bit of specs on the matter, the support in Linux is decent, albeit far from perfect (I still haven’t finished my NFS export patch, it does not support ACLs or extended attributes, and so on). It’s way too unreliable for rsnapshot (with hardlinking), but it should work acceptably well for this kind of storage.

The only reason I have not to use it for something I want to rely on, as it is, is that the tools for filesystem creation and check (mkfs and fsck) are quite a bit old. I’m not referring to “hfsutils” or “hfsplusutils”, both of which are written from scratch and have a number of problems, including, but not limited to, shitty 64-bit code. I’m referring to the diskdev_cmds package in Gentoo, which is a straight port of Apple’s own code, released as FLOSS under the APSL-2 license.

Yes, I call that FLOSS! You may hate Apple as much as you wish, but even the FSF considers APSL-2 a Free Software license, albeit one with problems; on the other hand, they explicitly state this (emphasis mine):

For this reason, we recommend you do not release new software using this license; but it is ok to use and improve software which other people release under this license.

Anyway, I went to Apple’s releases for the 10.6.3 software (interestingly, they haven’t yet published 10.6.4, which was released just the other day), downloaded diskdev_cmds and the xnu package that contains their basic kernel interfaces, and started working on an autotools build system to make it possible to easily port the code in the future (thanks to git and branching).

The first obstacle, beside the includes obviously changing, was that Apple decided to make good use of a feature they implemented as part of Snow Leopard’s “Grand Central Dispatch”, their “easy” multi-threading implementation (somewhat similar in concept to OpenMP): “blocks”, anonymous functions for the C language, an extension they worked into LLVM. So plain GCC is unable to build the new diskdev_cmds. I could either go fetch an older diskdev_cmds tarball, from Leopard rather than Snow Leopard, where GCD was not implemented yet, or I could up the ante and try to get it working with some other tools. Guess what?

In Gentoo we already have LLVM around, and the clang frontend as well. I decided to write an autoconf check for blocks support and rely on clang for the build. Unfortunately it also needs Apple’s own libclosure, which provides the runtime interfaces to work with blocks, and is the basis for the GCD interface. It actually resonated a bit when Snow Leopard was presented, because Apple released it for Windows as well, with the sources under the MIT license (very liberal). Unfortunately you cannot find it on the page I linked above; you have to look at the 10.6.2 page for whatever reason.
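The check itself boils down to feeding the compiler a trivial block and seeing whether it goes through; the shell equivalent of the autoconf macro would be something like this (a sketch; compile-only, so the blocks runtime is not needed at link time):

cat > conftest.c <<'EOF'
int main(void) {
    int (^answer)(void) = ^{ return 42; };
    return answer();
}
EOF
clang -fblocks -c conftest.c && echo "compiler supports blocks"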

I first attempted to merge libclosure straight into the diskdev_cmds sources, but then I decided it makes more sense to try porting it on its own, and make it available; maybe somebody will find some good use for it. Unfortunately the task is not as trivial as it looks. The package needs two very simple functions for “atomic compare and swap”, which OS X provides as part of its base library, and so does Windows. On Linux, equivalent functions are provided by HP’s libatomic_ops (you probably have it around because of PulseAudio).

Unfortunately, libatomic_ops does not build as-is with clang/LLVM; there is a mistake in the code, or in the way it’s parsed; it’s not something unexpected, given that inline assembly is very compiler-dependent. In this case it’s a size problem: it uses a constraint for integer types (32-bit) with a temporary (and same-sized input) of type unsigned char (8-bit). The second stop is again libatomic_ops’s problem: while it provides an equivalent interface to do atomic compare and swap for long types, it doesn’t do so for int types; that means it works fine on x86 (and other 32-bit architectures, where both types are 32-bit), but it won’t do for x86-64 and other 64-bit architectures. Guess what the libclosure code needs?

Now, of course, it would be possible to lift the atomic operations out of the xnu code, or just write them from scratch, as libatomic_ops already provides them all, just not correctly sized for x86-64; but the problem remains that you then have to add a number of functions for each architecture, rather than having a generic interface; xnu provides functions only for x86/x86-64 and PPC (since that’s what Apple uses/used).

And where has this left me now? Well, nowhere far, mostly with a sour feeling about libatomic_ops’s inability to provide a common, decent interface (for those who wonder, they do provide char-sized inlines for compare and swap for most architectures, and even the int-sized alternatives that I was longing for… but only for IA-64. You wouldn’t believe that until you remembered that the whole library is maintained by HP).

If I could take the time off without risking trouble, I would most likely try to get better HFS+ support into Linux, if only to make it easier and less troublesome for OS X users to migrate to Linux at one point or another. The specs are almost all out there, the code as well. Unfortunately I’m no expert in filesystems, and I lack the time to invest in the matter.

Useless legacies

I always find it at least fascinating, the religiousness (and this is most definitely not a compliment, coming from an atheist) with which some people stand to defend “classical” (or, more properly in my opinion, “legacy”) choices in the Unix world. I also tend not to give them too much weight; I challenged the use of a separate /boot over two years ago, and I still stand behind my opinion: for the most common system configurations, /boot is not useful as a separate partition. Of course there are catches.

*One particular catch is that you need /boot on its own partition to use LVM for the root file system, and that in turn is something you probably would want by today’s standards, so that you don’t really have to choose how much space to dedicate to root, which heavily depends on how much software you’re going to put on it. Fedora has been doing that for a while, but that just diverts the problem to how much space to dedicate to /boot, and that became quite a problem with the 11→12 update… In general, I think the case might be building up for either using a separate /boot, or just using EFI, which, as far as I can tell, can solve the problem at the root… no pun intended.*

For some reason, it seems like a huge lot of legacies relate to filesystems, or maybe it’s just because filesystems are something I struggle with continuously, especially for what concerns combining the classical Unix filesystem hierarchy with my generally less hierarchical use of it. I’m not going to argue for not splitting the usual /usr out of the root file system here (while it’s something I definitely would support, that pretty artificial split makes the whole system startup a messy problem), nor am I going to discuss how to divide your storage space to fit the standard “legacy” hierarchy.

What I’m just wondering about is why lost+found has been so strongly defended by somebody who (I read) boasts experience in disaster recovery. I’m not doubting its usefulness in general, but I’m also considering that in most “desktop” cases it’s just confusing — or irritating, in my case, in a particular automated system I’m working on.

First let me say why I find this annoying: try running initdb on a newly-created, just-mounted ext3 file system. It will fail, because it finds the lost+found directory in the base of the filesystem, and since the directory is not empty, it refuses to work. There isn’t, by the way, a way to tell it to run anyway, such as a --force switch, which is the most obnoxious thing in all this. I know what I’m doing, I just want you to do it! So anyway, my choices here are either to remove the lost+found directory every time I mount a new filesystem (I have to admit I don’t know/don’t remember whether the directory is re-created at mount, or during fsck), or to create a sub-directory to run initdb in. Whichever the choice, it requires one further command, which is not much, but in this case it’s a slight problem.
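To give the idea, this is the kind of dance I mean, with the sub-directory variant (paths are hypothetical):

mount /dev/vg/pgdata /srv/postgres
mkdir /srv/postgres/data         # dodge lost+found at the fs root
initdb -D /srv/postgres/data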

So I went wondering: “Is lost+found really useful? Can’t I just get rid of it?” Then hell broke loose.

I’m quite positive I don’t need the directory to be there empty; I can understand it might be useful when stuff is in it, but empty? On a newly-created filesystem? I have my sincere doubts. And even when stuff does get into it, is it really useful to have it restored there? Almost certainly in some cases, but always? Without any way at all to opt out? It sounds a bit too much to me.

Let me show you a few possible scenarios, drawn from my own experience:

  • /var/cache is on a separate filesystem for me, the reason being that it’s quite big and it ends up growing a lot, because it keeps, among other things, the Portage distfiles for me and the tinderbox; if anything happens to that filesystem, I won’t spend more than five minutes on it: it’ll be destroyed and recreated; the name cache should make it obvious, and the FHS designates it for content that can be dropped and recreated without trouble; do I need orphan files recovered from that filesystem? No, I just need to know whether there is something wrong with the FS; if there is, I’ll recreate it, to be on the safe side that the data didn’t corrupt;
  • my router’s root file system… it turned out to be corrupt a couple of times and stuff was added to lost+found… did I care? Not really. I flashed in a new copy of the filesystem; no data loss for me there, besides once, before I set up rsnapshot, when I lost my network configuration — oh well, it took me all of half an hour to rewrite it from scratch (if you wonder what the corruption was about, it was a faulty CF card; I’ll have to write about those CF cards at some point);
  • the rest of my running data, which is all of the rest of my systems… if I were to find corruption on my filesystems, I’d do like I did in the past: clear them out, make sure I hadn’t chosen the wrong filesystem type to begin with, and then recreate them; do I care about finding the data in lost+found? Nope, I’ve got backups.

The trick here is that I’ve got backups. Of course, if I didn’t have backups, or if my backups were foobar’d, I’d be looking at everything to restore my data, but to be honest, I found that it’s a much better investment to improve your backup scheme rather than spend time recovering data. Of course, I don’t have “down to the microsecond” backups, as somebody told me I’d be needing to avoid using lost+found, but again, I don’t need that kind of redundancy. I have hourly backups for my systems, which is by itself above average, and it works pretty well. I’d be surprised if the vast majority of desktop systems cared about backups more than a week old.

Now, this should cover most of my points: lost+found is not indispensable. You can live without it perfectly well. I don’t think I’ve ever used it myself: when faced with corrupted filesystems (and trust me, it happened to me more than once), my solution was one of: get the backup, re-do the little work lost, or discard the data altogether. Sure, I might have lost bits and pieces of stuff over the years that I might have cared about, but nothing major. The worst thing that happened to me in the past three years has been the need to re-download the updates and drivers for Windows (XP and Vista both) that I keep around for when customers bring me their computers to fix. Okay, I have no experience with enterprise-grade post-apocalyptic disaster recovery — so what? It doesn’t change the fact that in my case (and, I’d say, in a lot of users’ cases) it doesn’t matter.

I’m not asking to get rid of the feature altogether, but making it optional would be nice, or at least not forcing me to have the directory around. Interestingly enough, xfs_repair does not need the directory to be present; it’ll use it if it’s present and full, it’ll create and populate it if orphan files are found, but otherwise it’s invisible. Apple’s HFS+ is more or less on the same page. I admit ignorance for what concerns the Reiser family, JFS and ZFS.

Whatever the case, can we just stop asserting that what was good in the ’70s, or what is good for enterprise-grade systems, is good for desktop systems as well? Can we stop accepting legacies just because they are there? I’m not for breaking heaps of compatibility at every turn (and please, nobody say HAL, ’kay? — okay, here the pun was most definitely intended), but yes, it takes challenging the status quo to get something better out of it!

P.S.: if somebody can suggest an option to mkfs or mount to avoid that directory, I’m still eager to hear it!

To-Do lists, tasks, what to do

These holidays really sucked, for a long series of reasons, and in general I’m not feeling well, either emotionally or physically. But they offered me the time to think about what I want to do. I’ve been working on my collision detection script lately, and I’m now confident I can make it work as a proper tool to identify issues; I’d love to work on fixing those issues with upstream, but the problem is I lack the time to do that.

Even if I do improve a project a day, it’s never going to be enough, because in a few months I’d need to do it again, as the code would have rotted, and so on. I need help with this. One thing I’m going to do is work on a personal archive of autoconf macros I can use across different projects; the attributes.m4 file that comes with xine, lscube and Lennart’s projects is already a personal archive of macros in some respects; the problem is that its history is shared among different repositories, which is very nasty. Up to now I’ve avoided creating a repository for it and splitting it up into different macro files (since it’s far from being just attribute checking any longer), but I think I should look for a solution to that problem rather than keep procrastinating.

Today I stopped procrastinating on getting rid of JFS for my root filesystem: since kernel 2.6.28 was released, I’m now starting the long-awaited conversion of my partitions to ext4, beginning with the repositories (and here git wins against Mercurial big time: once the git repositories are repacked, copying them over is very, very quick, while copying over the openjdk, icedtea and xine repositories that use Mercurial takes much longer).

Talking about xine, I’m going to do some more work on that in the next few days, mostly code cleanup if I can, but I’m also planning on setting up a Transifex instance on this server for xine (and my own projects); hopefully it’ll make it simpler to provide translated versions of xine-lib, xine-ui and gxine, as well as of my own tools that need to be translated one way or another.

There are so many things I’ve been meaning to do that I haven’t been able to in months; reading is one of those, but I’m going to save that for when I go to the hospital next month for check-ups; I’m not going to bring my laptop with me this time, nor any handheld console. I’ll be around on the cellphone a bit, maybe, but that’ll be it. For the rest of the time I’ll be reading and listening to music (I’m not going to leave the iPod at home; knowing hospitals, it will come in handy). Actually, since I just have to have a CT scan, a chest X-ray and an MRCP (MRI), I don’t strictly need to stay in the hospital; but not having a driving license does not help, although I guess even if I had one, I’d better not be driving after they run tests on me.

I’m going to spend New Year’s Eve alone at home, or maybe with my mom and my sister, with her husband and my nephew. On the whole, it’s not going to be much of a holiday either, so as happened on Christmas, I’m going to spend most of the day working on some analysis or similar. I’ve seen that some of the issues I’ve brought up lately have started being taken care of, which is very good.

I know this post sounds pretty incoherent; I guess I’m incoherent myself at the moment. Anyway, if you wish to help out with anything at all, feel free to drop me a line.

Filesystems — ext4dev fails

flame@yamato ~ % touch /var/tmp/portage/test
touch: cannot touch `/var/tmp/portage/test': No space left on device
flame@yamato ~ % df -h | grep /var/tmp
32G 7.2G 23G 25% /var/tmp
flame@yamato ~ % df -i | grep /var/tmp
2097152 419433 1677719 21% /var/tmp

Now, a mount cycle later it worked fine, but it’s still not too nice, since it caused all the running emerges to fail, just like XFS did, but without leaving a trace in the kernel log, which makes it obnoxious, since it’s hard to debug. I hope 2.6.28 is going to be better; certainly the tinderboxing is a nice way to stress-test filesystems.

I’m starting to consider the idea of OpenSolaris, NFS, and InfiniBand…

Hacking the kernel

So, in my odyssey between filesystems, I had to do some serious kernel hacking to make sure I could get the HFS+ filesystem properly exported to my laptop. I had actually noticed iTunes failing often, to the point of annoyance, to copy the data, resulting in it automatically adapting to a different path for the iTunes collection, but I didn’t understand the problem.

Turns out Christoph Hellwig knows the reason pretty well: my patch was incomplete; the get_parent() method is really needed for NFS to work properly, and he also provided me with a quick test case to try it out. Sure enough, there’s a problem. Unfortunately my first try, factoring out lookup() and using that to implement get_parent() like ext2/3 and other filesystems do, failed.

After looking around the code, and Apple’s specs, I started to understand: common UNIX-like filesystems always have a hardlink named .. for the parent directory, but HFS+ does not have that; instead you have to look up the “thread” catalog entries to know the parent of a file or directory (or, as the specifications call it, folder).

Now, after having worked on this, I feel like saying a couple of things: I really like Free Software because it makes it very easy to add support for something that didn’t exist before (like NFS export of HFS+), and I like the way Apple writes complete specifications for so many things. Maybe too specific, but still nice to have some rather than none.

This has been somewhat refreshing, since I hadn’t seriously hacked at the kernel in years, not since the time I dealt with porting LIRC to kernel 2.5, which happened, as you might guess, before 2.6 was released. Actually, during the period when I worked on the LIRC patchset, I also moved from Debian to Gentoo, which was also the main reason why I implemented devfs support for LIRC devices (hey, you guys remember devfs, don’t you?).

Now, hopefully, if I can find a bit more free time for this, I should be able to submit a couple more changes to the kernel which I’ve been sitting on for a while, and maybe I can try my Ruby-Elf scripts on the kernel itself. It would be great if they could help the kernel as much as they have helped me with xine and other projects.

At any rate, the future looks nicer tonight.

Filesystems — take two

After the problem last week with XFS, today seems like a second take.

I woke up this morning to a reply about my HFS+ export patch, telling me that I have to implement the get_parent() interface to make sure that NFS works even when the dentry cache is empty (which is most likely what caused some issues with iTunes while I was doing my conversion). Good enough; I started working on it.

And while I was actually working on it, I found that the tinderbox was not compiling. A dmesg later showed that, once again, XFS had in-memory corruption, and I had to restart the box. Thankfully, I’ve got my SysRescue USB stick, which allowed me to check the filesystem before restarting.

Now, this brings me to a couple of problems I have to solve. The first is that I finally have to move /var/tmp to its own partition, so that /var does not get clobbered if/when the filesystem goes crazy; the second is that I have to consider alternatives to XFS for my filesystems. My home directory is already using ext3, but I don’t need performance there, so it does not matter much; my root partition is using JFS, since that’s what I tried when I reinstalled the system last year, although it didn’t turn out very well, and the resize support actually ate my data away.

Since I don’t care if my data gets eaten away on /var/tmp (the worst that might happen is me losing a patch I’m working on, or not being able to fetch the config.log of a failed package – and that is something I’ve been thinking about already), I think I’ll try something more “hardcore” and see how it goes: I’ll use ext4 on /var/tmp, unless it panics my kernel, in which case I’m going to try JFS again.
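The plan boils down to this (the device name is hypothetical):

mkfs.ext4 /dev/vg/vartmp
# and in /etc/fstab:
/dev/vg/vartmp   /var/tmp   ext4   noatime   0 2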

Oh well, time to resume my tasks I guess!

Filesystems

It seems like my concerns were a little misdirected; instead of the disks dying, the first problem to appear was an XFS failure on /var, after about two and a half runs of tree building. I woke up in the middle of the night with the scariest thought about something not being fine on Yamato, and indeed I came to see it not working any more. Bad, bad, bad.

I’m now considering the idea of getting a box to just handle all the storage, running something a bit better tested lately: Sun’s ZFS. While Ted Ts’o’s concerns are pretty scary indeed, it seems like ZFS is the one filesystem I could use to squeeze all the possible performance and quality out of the disks, for network serving. And as far as I remember, Sun’s Solaris operating system comes with iSCSI target software out of the box, which would really work out well for my MacBook’s needs too.

Now the problem is: does Enterprise still work? The motherboard is what I’m not sure about, but I guess I can just try it and then replace it if needed; I certainly need to replace the power supply, since it currently mounts a 250W unit, and I also need to replace the chassis, since the one I have now has a Plexiglass side, which makes it too noisy to stay turned on all the time.

I’m considering setting it up with four 500GB drives, which would cost me around 600 euro, case and power supply included; having eight, using the Promise SATA PCI card I already have, would bring me to 1K euro and 4TB of space, but I don’t think it’s worth that yet. Both the Promise card and the onboard controller are SATA/150, but that shouldn’t be too much of a problem, with the Gigabit Ethernet more than likely being the bottleneck. Unfortunately this plan will not be enacted until I get enough jobs to finish paying for Yamato and save the money for it.

Now, while I make do with what I have, there is one problem. I have my video and music collection on the external Iomega drive, “hardware” RAID1, 500GB of actual space divided roughly into 200GB for music/video and 300GB for OS X’s Time Machine; the partition table is GUID (EFI) and the partitions are HFS+, so that if Yamato is ever turned off, I can access the data directly from the laptop through FireWire. This is all fine and dandy, if it weren’t that I cannot move my iTunes folder there, because I cannot export the filesystem through NFS.

Linux needs kernel support for exporting filesystems through NFS, and the HFS+ driver in current Linux does not support this feature — yet. Because the nice thing about Linux and Free Software is that you can make them do whatever you wish, as long as you have the skills for it. And I hope I have enough skill to get this to work. I’m currently setting up a Fedora 10 install on vbox so that I can test my changes without risking a panic on my running kernel.
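The userspace side is the usual drill, for when the kernel side finally cooperates (the path and network here are hypothetical):

# /etc/exports
/mnt/iomega/media   192.168.1.0/24(rw,no_subtree_check)

exportfs -ra   # reload the export table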

Once that’s working I’ll focus again on the tinderboxing, even though I cannot deal with the disk problem just yet. I have a few things to write about on that regard, especially about the problem of bundled libraries.

Who wants to support largefile?

This post is inspired by a post by Eric Sandeen, whose blog I read last night after discovering we share an interest in making software build in parallel.

A little background for those who don’t know the issue I’m going to talk about. Classically, inode numbers and file offsets were 32-bit values, but as you might guess, nowadays this cannot hold: files bigger than 2GB (the highest offset that 32 bits can represent) are quite common (just think of DVD images or, even better, Blu-ray discs: 50GB is huge), and modern filesystems (as Eric points out: XFS, btrfs and ext4) have, or might have, 64-bit inode numbers. Since changing the size of the types would have broken ABI compatibility, GNU libc, as well as other libraries, added support for the so-called “largefile” mode. In largefile mode, the standard file operations use types with 64-bit size. The way this is implemented is by replacing calls like open() or stat() with 64-bit variants, called open64() and stat64(). Other operating systems, like FreeBSD, broke ABI compatibility and only have the 64-bit interfaces. On natively 64-bit systems, like AMD64, the 64-bit interface is enabled by default, so the 64-bit-specific variants are not needed.

Now, since the two interfaces are, well, different interfaces, the only moment when they can be switched is at build time: you need to pass some compiler defines so that the calls are replaced at build time, and thus make use of either the old or the new largefile interface. Most packages you can think of are probably using largefile already, some conditionally, some unconditionally because they need it, and some unconditionally, needed or not, just to be safe. The problem is that not all software can deal with largefile properly as it is.
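For completeness, this is what enabling it looks like from a package’s point of view; either pass the defines by hand or let autoconf work them out (a sketch of both, nothing Gentoo-specific, file names are hypothetical):

# by hand, at build time
gcc -D_FILE_OFFSET_BITS=64 -o bigcopy bigcopy.c
# or, in configure.ac, let autoconf pick the right defines:
#   AC_SYS_LARGEFILE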

The usual way to discover that a package does not support largefile is watching it fail on a >2GB file. That’s not so nice, since it means you get to fix the problem only when it becomes a problem, while it would be much better to identify it earlier, so that it can be solved before it becomes a true problem. But Eric’s post gave me an idea; I asked him for the script (which you can find attached to this post, if Typo is not going to do some funny thing; update: I finally was able to make lighttpd serve the script, for once Typo was innocent) and I used the same logic to identify packages using the 32-bit interfaces, with scanelf run right after portage installs each package.

This is not yet a complete test, since I’m forcing it to work only on x86 systems (I wanted to exclude AMD64), and it only checks the stat symbols; it should check open, read, write and all the other symbols too. More importantly, this is not going to work with the scanelf that portage installs for you right now (0.1.18), since I had to fix it a bit to properly handle regexp matching and multiple symbol matching. So if you want to try this, you’ll probably have to wait till I release a 0.1.19 version. At any rate, the code in the bashrc file is, for now, just the following:

post_src_install() {
    scanelf -q -F "#s%F" -R -s '-__xstat,-__lxstat,-__fxstat' "${D}" > "${T}"/flameeyes-scanelf-stat64.log
    if [[ -s "${T}"/flameeyes-scanelf-stat64.log ]]; then
        ewarn "Flameeyes QA Warning! Missing largefile support"
        cat "${T}"/flameeyes-scanelf-stat64.log >/dev/stderr
    fi
}

Please don’t rush to submit bugs for these things, though; they are useful to know and should probably be fixed, but please send the patches upstream rather than directly to Gentoo, for now.