LXC execution, something’s not right yet

I complained about some of LXC’s choices last week, but at the same time I was working on at least getting it to run, somewhat, so that I could make use of it on the setup we require.

The result has been a new revbump of LXC (0.8.0_rc1-r1), which contains a patch that uses libtool to build the library; this also makes sure that the library is properly created with an always-variable soname (see the link for more explanation of what that is).

This new version actually lets you go one step further: you can properly set it up to execute commands within the container directly, even though it takes a bit of time because the POSIX message queue involved is not extremely fast (Luca, do you know anything about that?). The problem is that if you’re going to run any interactive command… you get stuck, almost literally, and that includes a very simple emerge -av.
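
Since I keep pointing a finger at the POSIX message queue, here is, for reference, roughly what that API looks like; this is a generic round-trip with a made-up queue name, not code taken from LXC itself:

    /* Minimal POSIX message queue round-trip; the queue name is made up
     * for illustration, this is not LXC's code.  Link with -lrt. */
    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>

    int main(void)
    {
        struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 128 };
        mqd_t q = mq_open("/lxc-demo", O_CREAT | O_RDWR, 0600, &attr);
        if (q == (mqd_t)-1) {
            perror("mq_open");
            return 1;
        }

        const char msg[] = "hello";
        if (mq_send(q, msg, sizeof(msg), 0) == -1)
            perror("mq_send");

        char buf[128];
        unsigned int prio;
        ssize_t n = mq_receive(q, buf, sizeof(buf), &prio);
        if (n >= 0)
            printf("received %zd bytes: %s\n", n, buf);

        mq_close(q);
        mq_unlink("/lxc-demo");
        return 0;
    }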

The problem, as far as I can tell, is that the namespacing allows the container to create a new pseudo-tty (PTS) within the container itself, instead of using the one connected to the current session. This means that you cannot actually use the tty at all, making it impossible to run any kind of interactive command (at the same time, the command isn’t told that it lacks a controlling tty, so for instance an emerge -av does not get downgraded to emerge -pv this way).
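
To make it clearer what I mean by the command not being told: checking for a controlling terminal is trivial, so a tool could downgrade itself when the check fails. This is just a sketch of the check, not anything Portage or LXC actually does:

    /* Sketch: detect whether the process has a controlling terminal.
     * Opening /dev/tty only succeeds when one is present.  This is not
     * code from Portage or LXC, just an illustration of the check. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/tty", O_RDWR | O_NOCTTY);
        if (fd == -1) {
            fprintf(stderr, "no controlling tty: fall back to non-interactive mode\n");
            return 1;
        }

        /* stdin might still be a pipe even when a controlling tty exists */
        printf("controlling tty present; stdin %s a terminal\n",
               isatty(STDIN_FILENO) ? "is" : "is not");
        close(fd);
        return 0;
    }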

I’m hoping that maybe Kevin’s (whichever Kevin that is!) patches mentioned in my previous post will help get it to work; if so, that would also mean that the tinderbox would be much easier to deal with than it has been in the past, and might actually get me to restore it to a working state (it hasn’t been working for quite a long time at this point, and I’m certainly not happy about it).

For the moment, what I can tell is that I’ve half-tracked down the issue with the netprio cgroup, and contacted its original author to see how we can deal with it, and I have a couple of changes for the ebuild and init scripts queued up. Since at least the cgroup mountpoint issue has been fixed in the utilities, I’ll soon make it depend on a version of OpenRC new enough to mount the thing by itself, easing off part of the init script logic (well, to be honest I’ve already dropped most of that logic), so that it can actually grow from there…

I guess I should thank Tiziano for telling me about LXC at the time, although there is still so much work to do before it works as intended. Oh well.

CGROUPS woes

The cgroup functionality that the Linux kernel introduced a few versions ago, while originally almost invisible, is proving to attract quite a wide range of interest, which in turn has caused more than a few headaches for me and other developers.

I originally looked into cgroups because of LXC, and then I noticed it being used by Chromium, then libvirt (with its own bugs related to USB device support). Right now the cgroup functionality is also used by the userland approach to task scheduling that replaces the famous 200LOC kernel patch, and by the newest versions of the OpenVZ hypervisor.

While cgroup is a solid kernel technique, its interface doesn’t seem quite as solid. The basic userland interface is accessible through a special pseudo-filesystem, just like the ones used for /sys and /proc. Unfortunately, the way to use this interface hasn’t really been documented decently, and that results in tons of problems; in my previous post regarding LXC I actually confused the way Ubuntu and Fedora mount cgroups: it is Fedora that uses /sys/fs/cgroup as the base path for accessing cgroups, but as Lennart commented on the post itself, there’s a twist.

In practice there are two distinct interfaces to cgroups. One is a single, all-mixed-in interface, accessed through the cgroup pseudo-filesystem when it is mounted without options; this is the one you can find mounted at /cgroup (also by the lxc init script in Gentoo) or /dev/cgroups. The other interface gives access (and thus limits) to one particular type of cgroup controller (such as memory, or cpuset), and has each hierarchy mounted at a different path. That second interface is the one that Lennart designed to be used by Fedora and that has been made “official” by the kernel developers in commit 676db4af043014e852f67ba0349dae0071bd11f3 (even though it is not really documented anywhere but in that commit).
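
Stripped of init scripts and distribution policy, the difference between the two interfaces is just in the mount options; here is a rough equivalent through mount(2), keeping in mind that a controller can only belong to one hierarchy at a time, so on a real system you pick one style or the other:

    /* The two cgroup mount styles, expressed through mount(2).  Run as
     * root, with the target directories already in place; a controller
     * can live in only one hierarchy at a time, so these calls are
     * alternatives, not something to run together. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* all-mixed-in interface: no options, every controller in one tree */
        if (mount("cgroup", "/cgroup", "cgroup", 0, NULL) == -1)
            perror("mount /cgroup");

        /* per-controller interface (the Fedora layout): one hierarchy per
         * controller, chosen through the mount options */
        if (mount("cgroup", "/sys/fs/cgroup/memory", "cgroup", 0, "memory") == -1)
            perror("mount memory hierarchy");
        if (mount("cgroup", "/sys/fs/cgroup/cpuset", "cgroup", 0, "cpuset") == -1)
            perror("mount cpuset hierarchy");

        return 0;
    }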

Now, as I said, the lxc init script doesn’t follow that approach but rather takes the opposite direction; this was not intended as a way to ditch the road taken by the kernel developers or by Fedora, but rather came out of necessity: the commit above was added last summer, the tinderbox had been running LXC for over a year by that point, and of course all the LXC work I did for Gentoo was originally based on the tinderbox itself. But since I did have a talk with Lennart and the new method is the future, last month I added to my TODO list to actually look into making cgroups a supported piece of configuration in Gentoo.

And it came crashing down.

Between yesterday and this morning I actually found the time I needed to write an init script to mount the proper cgroup hierarchies the Fedora way. Interestingly enough, if you umount a hierarchy after mucking with it, you’re not going to be able to mount it again, so there won’t be any “stop” for the script anyway. But that’s the least of my problems now. Once you do mount cgroups the way you’re supposed to, following the Fedora approach, LXC stops working.
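
For the curious, what the Fedora-style setup amounts to is a tmpfs on /sys/fs/cgroup plus one mount per controller listed in /proc/cgroups; this is not the actual init script, just the same steps spelled out in C:

    /* Sketch of the per-controller ("Fedora style") cgroup setup: a tmpfs
     * on /sys/fs/cgroup, then one hierarchy for each enabled controller
     * found in /proc/cgroups.  Not the actual init script. */
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    int main(void)
    {
        if (mount("cgroup_root", "/sys/fs/cgroup", "tmpfs", 0, "mode=0755") == -1)
            perror("mount /sys/fs/cgroup tmpfs");

        FILE *fp = fopen("/proc/cgroups", "r");
        if (!fp) {
            perror("/proc/cgroups");
            return 1;
        }

        char line[256];
        while (fgets(line, sizeof(line), fp)) {
            char name[64];
            int hierarchy, num_cgroups, enabled;

            if (line[0] == '#')        /* skip the header line */
                continue;
            if (sscanf(line, "%63s %d %d %d",
                       name, &hierarchy, &num_cgroups, &enabled) != 4 || !enabled)
                continue;

            char path[128];
            snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", name);
            mkdir(path, 0755);
            /* each controller gets its own hierarchy, selected by name */
            if (mount(name, path, "cgroup", 0, name) == -1)
                perror(path);
        }

        fclose(fp);
        return 0;
    }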

I haven’t started looking into what the problem could be there, but it seems that LXC doesn’t take it very nicely when its single-access interface for cgroups is instead split into a number of different directories, each with its own little interface to use. And I can’t blame it much.

Unfortunately this is not the only obstacle LXC has to face now; besides the problem with actually shutting down a container (which only works partially, and mostly out of sheer luck with my init system), the next version of OpenRC is going to drop support for auto-detecting LXC, both because identifying the cpuset in /proc is not going to work for much longer (it’s optional in the kernel and considered deprecated) and because it wrongly identifies the newest OpenVZ guests as LXC (since they also started using the same cgroups basics as LXC). These two problems mean that soon you’ll have to use some sort of lxc-gentoo script to set up an LXC guest, which will both configure a switch to shut the whole guest down, and configure OpenRC to accept it as an LXC guest manually.

Where does this leave us? Well, first of all, I’ll have to test whether the current GIT master of LXC can cope with this kind of interface. If it doesn’t, I’ll have to talk with upstream to see that it gets supported, so that LXC can be used with a Gentoo host, as well as a Fedora one, with the new cgroups interface (and so that the interface can be made available to users for use with Chromium and other software that might make good use of it). Then it will be time to focus on the Gentoo guests, so I’ll have to evaluate the contributed lxc-gentoo scripts that I know are on the Gentoo Wiki, for a start.

But let me write this again: don’t expect LXC to work nicely for production use, now or anytime soon!

More about Linux Resource Containers and Gentoo

I have written before that I strongly object to the LXC userland being considered production-ready, and now I have another example for you.

Let’s not even dig into the fact that the buildsystem for the userland tools is quite objectionable, as it “avoids” using libtool through silly hacks in Makefile.am. Let’s not even spend much time on the fact that they have no announce-only mailing list, and that they stopped using the SF.net File Release System (which has an RSS feed of the changes, and a number of mirrors) in favour of a simple download directory; that’s just project administration gone wrong.

The one big issue is that there is no documentation of changes between one release and the next: either you follow GIT, or, if you only look at the tarball, you’re left wondering what the heck is going on. On the other hand, the commit messages alone don’t carry enough information either, so you have to read the code itself to understand what is actually happening.

So let’s begin with what brought me here: I use LXC heavily in place of testing chroots, since it’s easier to SSH into an already-set-up instance than to set up a new one each time there is something new to test. Besides a number of “named” containers, I started keeping a number of “buildyards” which I only use to test particular services; case in point, I wanted to test Squid together with the new SquidClamav, which I need for a customer of mine, so I copied over one of my previous buildyards and fired it up…

Unfortunately, the results weren’t very good: I didn’t get a portage tree bound… quickly checking around, 0.7.2 worked fine, 0.7.3 didn’t. After looking at the code it became apparent that the root filesystem mountpoint introduced in this release as a configuration option is not only used for loop-device-backed images (which are now supported, and weren’t before), but also for standard directory-based containers. If you add to this an issue I have described before (the fact that lxc does not validate that the mount paths provided for bind-mounts are within the new rootfs tree) you may start to understand the problem here.

If you haven’t seen it yet, here’s the breakdown:

  • my container’s rootfs is located at /media/chroots/lxc-buildyard4;
  • the /etc/lxc/buildyard4.conf file used to bind-mount the portage directory as /media/chroots/lxc-buildyard4/usr/portage;
  • with 0.7.2, the pivot_root system call was made over /media/chroots/lxc-buildyard4 and all was fine;
  • with 0.7.3, before pivoting, /media/chroots/lxc-buildyard4 was bind-mounted to a different path (let’s assume /usr/lib/lxc/rootfs but it was a bit more messed up);
  • when I accessed /usr/portage within the chroot I was actually accessing the path to be found at /usr/lib/lxc/rootfs/usr/portage.

Okay, so it’s a bit murky: if you think of bind mounts the same way you think about symlinks, the first bind mount should have given access to all the sub-bind-mounts as well, but that wasn’t the case, because the first one wasn’t a recursive bind mount. Which means you really have to change all your configuration files to use the new rootfs mount path, and they didn’t seem to make that very clear in any news file.
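
If you want to see the difference outside of LXC, it boils down to the MS_REC flag on the bind mount; here is a sketch, with the /mnt targets made up for illustration:

    /* A plain bind mount does not carry the sub-mounts of the source tree
     * along; only a recursive bind mount (MS_REC) does.  Target paths are
     * illustrative; run as root with the target directories existing. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* plain bind: anything bind-mounted *inside* the source, like the
         * portage tree above, is not visible through the new mountpoint */
        if (mount("/media/chroots/lxc-buildyard4", "/mnt/plain",
                  NULL, MS_BIND, NULL) == -1)
            perror("bind mount");

        /* recursive bind: the sub-mounts come along as well */
        if (mount("/media/chroots/lxc-buildyard4", "/mnt/recursive",
                  NULL, MS_BIND | MS_REC, NULL) == -1)
            perror("recursive bind mount");

        return 0;
    }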

Besides, the default configuration varies with the libdir setting, which means you’d have different paths between 32-bit and 64-bit systems (symlinks are ignored in part, remember); to avoid that, I’ve revision-bumped lxc in tree, and it now uses /usr/lib/lxc/rootfs directly, ignoring multilib altogether.

On a different note, I’m still planning to write a post detailing how cgroups work, since both LXC and libvirt (two projects I follow) make use of them, as well as Chrome/Chromium and, now, Lennart’s userland implementation of the 200LOC kernel patch. But before doing that I want to compare how the various distributions solve the mountpoint problem:

  • the cgroup filesystem has to be mounted to be used, like /sys, /dev, /proc and so on… but OpenRC currently ignores it;
  • the LXC init script accepts an already-mounted cgroup filesystem or mounts it over /cgroup;
  • as far as I can tell, Fedora uses /dev/cgroup, but that mountpoint, like /dev/pts and /dev/shm, needs to be created after udev is started, as /dev is a virtual filesystem itself (that’s what /etc/init.d/devfs does);
  • while on the other hand Ubuntu seems to rely on /sys/fs/cgroup, which is an empty directory in the /sys pseudo-filesystem, created when cgroup is enabled in the kernel.

Honestly, my preferred solution right now is the last one, since it requires no special code, just needs /sys mounted, and is much more similar to how fusectl is mounted (on /sys/fs/fuse/connections). If you have any comments, feel free to have your say here.
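
Just to show how little code that last option actually needs once /sys is mounted, here is a sketch (not the OpenRC code, just the bare steps):

    /* Mount the all-mixed-in cgroup hierarchy on the directory the kernel
     * already provides under /sys, much like fusectl ends up on
     * /sys/fs/fuse/connections.  A sketch, not OpenRC code. */
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        /* the directory is there as soon as cgroup support is enabled in
         * the kernel and /sys is mounted; no mkdir needed */
        if (stat("/sys/fs/cgroup", &st) == -1) {
            perror("/sys/fs/cgroup");
            return 1;
        }

        if (mount("cgroup", "/sys/fs/cgroup", "cgroup", 0, NULL) == -1) {
            perror("mount cgroup");
            return 1;
        }
        return 0;
    }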