More about Linux Resource Containers and Gentoo

I have written before that I strongly object to the LXC userland being considered production-ready, and now I have another example for you.

Let’s not even dig into the fact that the build system for the userland tools is quite objectionable, as it “avoids” using libtool through silly hacks in Makefile.am. Let’s not spend much time either on the fact that they have no announcement-only mailing list, and that they stopped using the SF.net File Release System (which has an RSS feed of the changes, and a number of mirrors) in favour of a simple download directory, since that’s just project administration gone wrong.

The one big issue is that there is no documentation of the changes between one release and the next. Either you follow Git, or, looking at the tarball alone, you’re left wondering what the heck is going on. On the other hand, judging from the commit messages alone there isn’t enough information either, so you have to read the code itself to understand what is happening.

So let’s begin with what brought me here: I use LXC heavily in place of testing chroots, since it’s easier to SSH into an already-set-up instance than to set up a new one each time there is something new to test. Besides a number of “named” containers, I started keeping a number of “buildyards” which I only use to test particular services; case in point, I wanted to test Squid together with the new SquidClamav, which I need for a customer of mine, so I copied over one of my previous buildyards and fired it up…

Unfortunately, the results weren’t very good: I didn’t get a Portage tree bound… quickly checking around, 0.7.2 worked fine, 0.7.3 didn’t. After looking at the code it became apparent that the root filesystem mountpoint introduced in this release as a configuration option is used not only for loop-device backed images (which are now supported, and weren’t before), but also for standard directory-based containers. If you add this to one issue I have described before (the fact that lxc does not validate that the mount paths provided for bind-mounts are within the new rootfs tree) you may start to understand the issue here.

If you haven’t seen it yet, here’s the breakdown:

  • my container’s rootfs is located at /media/chroots/lxc-buildyard4;
  • the /etc/lxc/buildyard4.conf file used to bind-mount the portage directory as /media/chroots/lxc-buildyard4/usr/portage;
  • with 0.7.2, pivot_root was called over /media/chroots/lxc-buildyard4 and all was fine;
  • with 0.7.3, before pivoting, /media/chroots/lxc-buildyard4 was bind-mounted to a different path (let’s assume /usr/lib/lxc/rootfs, even though it was a bit more messed up than that);
  • when I accessed /usr/portage within the container I was actually accessing the path found at /usr/lib/lxc/rootfs/usr/portage – which did not include my bind mount.

Okay, so it’s a bit murky: if you think of bind mounts the same way you think about symlinks, the first bind mount should have carried all the sub-bind-mounts with it as well, but that wasn’t the case, because the first one wasn’t a recursive bind mount. Which means you really have to change all your configuration files to use the new rootfs mount path, and they didn’t make that at all clear in anything like a news file.

Besides, the default configuration varies with the libdir setting, which means you’d have different paths between 32-bit and 64-bit systems (symlinks are partly ignored, remember), so to avoid that I’ve revision-bumped lxc in tree, and it now uses /usr/lib/lxc/rootfs directly, ignoring multilib altogether.
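In practice this means the fstab used for the container has to point its bind-mounts inside that rootfs mountpoint, rather than inside the container’s own directory. Roughly, for my buildyard, the change looks like this (a sketch, using the fixed path from the revision bump):

# 0.7.2 and earlier: target under the container’s own directory
/usr/portage /media/chroots/lxc-buildyard4/usr/portage none bind 0 0

# 0.7.3: target under the rootfs mountpoint used before pivoting
/usr/portage /usr/lib/lxc/rootfs/usr/portage none bind 0 0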

On a different note, I’m still planning on writing a post detailing how cgroups work, since both LXC and libvirt (two projects I follow) make use of them, as well as Chrome/Chromium and, now, Lennart’s userland implementation of the 200-LoC kernel patch. But before doing that I want to compare how the other distributions solve the mountpoint problem:

  • the cgroup filesystem has to be mounted to be used, like /sys, /dev, /proc and so on… but OpenRC currently ignores it;
  • the LXC init script accepts an already-mounted cgroup filesystem or mounts it over /cgroup;
  • as far as I can tell, Fedora uses /dev/cgroup, but that mountpoint, like /dev/pts and /dev/shm, needs to be created after udev has started, as /dev is a virtual filesystem itself (that’s what /etc/init.d/devfs does);
  • Ubuntu, on the other hand, seems to rely on /sys/fs/cgroup, which is an empty directory in the /sys pseudo-filesystem created when cgroups are enabled in the kernel.

Sincerely, my preferred solution right now is the last one, since it requires no special code (you just need /sys mounted) and is much closer to how fusectl is mounted (on /sys/fs/fuse/connections). If you have any comments, feel free to have your say here.
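For reference, the Ubuntu-style approach boils down to a single mount, either done by hand:

mount -t cgroup cgroup /sys/fs/cgroup

or through an fstab entry such as:

cgroup  /sys/fs/cgroup  cgroup  defaults  0 0

with the available controllers depending only on the kernel configuration.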

Linux Containers and Networking

So, at the moment I start writing this (and that’s unlikely to be the time I actually post it, given that I now see it could use some drawings) it’s early in the morning in Italy and I haven’t slept yet – a normal condition for me, especially lately – but I have spent a bit of time bouncing ideas around with ferringb, Ramereth and Craig for what concerns Linux Containers (LXC). Given that, I’d like to point out a couple of things regarding networking and LXC that might not have been extremely obvious before.

First of all, of the four networking types supported by LXC, I could only try two, for obvious reasons: phys is used to assign a particular physical device to the container, and only works if you have enough physical devices to go around, while vlan requires a device able to do VLAN tagging. This leaves us with veth (virtual Ethernet) and macvlan (MAC-address based virtual LAN tagging). The former is the simplest setup, and the one I’ve been using; it creates a pair of devices, one of which is assigned to the container, while the other stays on the host; you can then manage that device exactly like any other device on your system, and in my case that means it’s added to the br0 bridge where the KVM instances are also joined. LXC allows you to define the bridge to join directly in the configuration file.

Linux Containers with Virtual Ethernet

The macvlan mode is supposed to have smaller overhead, because the kernel knows the MAC address assigned to each interface beforehand; on the other hand, setting it up is slightly harder. In particular, there is one further mode parameter that can be set, to either vepa (Virtual Ethernet Port Aggregator) or bridge mode. The former isolates the containers, as if they were a number of different hosts connected to the network segment, but disallows the various containers from talking with one another; the latter instead creates a special bridge (not to be confused with the Linux bridge used above with virtual Ethernet devices) that allows all the containers to talk with one another… but isolates them from the host system.

Linux Containers with MACVLAN VEPA-mode

You end up having to choose between the performance of network-to-container and that of host-to-container traffic: in the first case you choose macvlan, reducing the work the kernel has to do, but requiring you to route your own traffic to the container through an outside router; in the second case you use veth and make the kernel handle the bridge itself. In my case, since the containers are mostly used for local testing, and the workstation will still be using the in-kernel bridge anyway, the obvious choice is veth.

Linux Containers with MACVLAN Bridge-mode
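For reference, here is roughly what the two setups look like in the container configuration; take it as a sketch rather than gospel, since the exact option names (in particular the macvlan mode key and the flags line, which don’t appear elsewhere in this post) may vary between releases:

# virtual Ethernet, joined to the host’s existing br0 bridge
lxc.network.type = veth
lxc.network.link = br0
lxc.network.flags = up

# or, alternatively: macvlan on top of the physical interface, in bridge mode
# (use vepa instead of bridge to also isolate the containers from one another)
lxc.network.type = macvlan
lxc.network.macvlan.mode = bridge
lxc.network.link = eth0
lxc.network.flags = up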

Now, when I decided to seal the tinderbox I wondered about one thing that LXC cannot do and that I would like to find the time to send upstream. As it is, I want to disallow any access from the tinderbox to the outside, minus the access to the rsync service and the non-caching Squid proxy. To achieve that I dropped IPv4 connectivity (so I don’t run any DHCP client at all), and limited myself to autoconfigured IPv6 addresses; then I set in /etc/hosts the static address for yamato.home.flameeyes.eu, and used that as the hostname for the two services. Using iptables to firewall access to anything else had unfortunate results before (the kernel locked up without any panic actually happening); while I have to investigate that again, I don’t think much has changed in that regard. There is no access to or from the outside network, since the main firewall is set to refuse talking to the tinderbox at all, but that’s not generally a good thing (I would like, at some point in the future, to allow other developers access to the tinderbox), and it does not ensure isolation between it and the other boxes on the network, which is a security risk (remember: the tinderbox builds and executes a lot of code that, to me, is untrusted).

Now, assuming that the iptables kernel problem only happens with the bridge enabled (I would be surprised if it failed that badly on a pure virtual Ethernet device!), my solution would actually have been kind of easy: I would just have used the link-local IPv6 address, and relied on Yamato as a jump host to connect to the tinderbox. Unfortunately, while LXC allows you to set a fixed hardware address for the interface created inside the container, it provides no way to do the same for the host-side interface (which also gets a random name such as veth8pUZr), so you cannot simply use iptables to enforce the policy as easily.
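For the record, the container side is just one line – here with a made-up, locally-administered address – while an equivalent knob for the host-side peer simply does not exist yet:

lxc.network.hwaddr = 02:00:00:00:00:01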

But up to this point it’s just a matter of missing configuration interfaces, so it shouldn’t be much of a problem, no? Brian pointed out a potential safety issue there though, and I went on to check it out. Since, when you use virtual Ethernet devices, it is the kernel’s bridge that takes care of deciding where to send the packets, there is no checking of the hardware address used by the container; just like the IP settings, any root user inside the container can add and remove IP addresses and change its MAC address altogether. D’uh!

I’m not sure whether this would work better with macvlan, but as it is, there is enough work still to be done on the configuration interface that – over a year after the tinderbox started running inside LXC – it’s still not ready for production use. Or at least not for the kind of production use where you actually allow third parties to access a “virtualised” root.

For those interested, the original SVG file of the (bad) network diagrams used in the article, is here and is drawn using Inkscape.

Polishing init scripts

One of the nicest features of OpenRC/Baselayout 2 (which sooner or later will hit stable, I’m sure) is that you can replace bash (which is slow, as its own documentation admits) with a faster, slimmer, pure POSIX shell. Okay, so maybe Fedora is now moving away from shells altogether, and I guess I can see why from some points of view; but as Nirbheek said it’s unlikely to be suitable for all use cases; in particular, our init system tends to work perfectly fine as it is for servers and embedded systems, so I don’t see a reason we should be switching there. Adding support for dash or any other POSIX sh-compatible fast shell is going to be a win-win situation. Do note that you still need bash to run ebuilds, though!

Now, you can already use dash for the basic init scripts provided by OpenRC and Baselayout, but all the packages need to install proper POSIX sh-compatible init scripts if you want to use it with them. Thankfully a number of users seem to care about that, such as Davide last year and Kai right now.

But POSIX compatibility is not the only thing we should actually look out for in our init scripts:

  • Eray pointed out that some scripts don’t re-create their /var/run subdirectories, which is indeed a problem that should get fixed at some point; I had similarly bad issues when running my own Gentoo-based router;
  • one too often misused parameter of start-stop-daemon is --quiet, which can be… way too quiet: if it’s passed, you’re not going to receive any output at all if the daemon you’ve tried to start fails, and that is a problem;
  • there are problems with the way the system-services PAM chain is passed through, so that limits are not respected (and if that’s the case, caps coming from PAM wouldn’t be respected either);
  • the way LXC works, init scripts looking just at the process name could cause a guest’s daemon to stop when the host’s is shut down; this is mostly exercised by killall, but start-stop-daemon, when not given a pidfile (and rather just an executable name), will have the same problem; and the same goes for pkill, it goes without saying.

These are a number of polishing tasks that are by all accounts minor and not excessively important, but… if you have free time and care about Gentoo on LXC, embedded, or with fast startup, you might want to look into those. Just saying!
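To make the /var/run and pidfile points a bit more concrete, here is a minimal sketch of the shape such an init script should take; the food daemon and its paths are made up, and I am assuming OpenRC’s checkpath helper is available (a plain mkdir/chown would do otherwise):

#!/sbin/runscript
# hypothetical init script for a daemon called "food"

depend() {
    need net
}

start() {
    ebegin "Starting food"
    # re-create the /var/run subdirectory, which may be missing after a reboot
    checkpath -d -m 0755 -o food:food /var/run/food
    # always pass a pidfile, so the daemon is never matched by name alone
    # (important with LXC, where host and guests share process names);
    # food is assumed to write its own pidfile on startup
    start-stop-daemon --start --pidfile /var/run/food/food.pid \
        --exec /usr/sbin/food
    eend $?
}

stop() {
    ebegin "Stopping food"
    start-stop-daemon --stop --pidfile /var/run/food/food.pid
    eend $?
}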

Linux Containers on Gentoo, Redux

I’ve got a few further requests for Linux Containers support in Gentoo, one of which came from Alex, so I’ve decided to take a stab at updating the current status of the stack, and to give a decent description of what the remaining problems are.

First of all, the actual software: I’m still not fond of the way upstream is dealing with the lxc package itself; the build system is still trying to be smart and ending up stupid, particularly where the LXC library is concerned. There are still very few failsafes, and the default tools, as they are, don’t really leave you much room to manage LXC properly. While libvirt should support LXC just fine, I haven’t found the time to try it again and see if it works; I’m afraid it might only work if you use the forced setup that Red Hat uses for LXC… but again, I cannot say much until I find the time to try it out and tweak it where needed.

*A note: as I stated before, a lot of the documentation and tutorials regarding libvirt only apply to Red Hat or Fedora. I can’t blame them for that – they do the work, they walk the walk – but it often means that we have to adapt them, or at least find a way to provide the pieces they expect in the right place. That takes a lot of my time.*

I’ve finally added my “custom” init script to the ebuild, with version 0.7.2; it might change further, with or without a revision bump, as I fix reported bugs. It should mostly auto-configure; the only requirement is that you symlink it to lxc.container to start the container defined in /etc/lxc/container.conf. It auto-detects the root path (so it won’t force a particular filesystem layout on you), and works with both 32- and 64-bit containers transparently, as long as there is a /sbin/init command (which I might have to change for systemd-based distributions at some point). What I now realise it lacks is support for detecting the network interface the container uses and requiring that it be started up; I can add that at some point, in the meantime use /etc/conf.d/lxc.container and add rc_need="net.yourif".
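In practice, setting up a container called “buildyard” with this script boils down to something like the following sketch (I am assuming the base script is installed as /etc/init.d/lxc, and that the container depends on net.br0):

ln -s lxc /etc/init.d/lxc.buildyard
echo 'rc_need="net.br0"' >> /etc/conf.d/lxc.buildyard
rc-update add lxc.buildyard default

with the container itself described by /etc/lxc/buildyard.conf.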

For what concerns networking, last I checked (with lxc 0.7.1 userspace and kernel 2.6.34) the macvlan system still isolated the host from the guests, which might be what you want, but it’s definitely not what I care for. I’m guessing this might actually be by design; at any rate, even though it’s technically slower, I find myself quite comfortable with using a Linux bridge as the main interface, bridging the guest’s virtual Ethernet device together with the physical interface(s) of the host. This also works fine with libvirt/KVM, so it’s not a bad decision in my opinion. I just added 0.7.2, but I can’t see how that would make a difference, as macvlan is handled in the kernel.

Thankfully, doing so with Gentoo’s networking system (which Roy wanted to deprecate, tsk!) is a piece of cake: open /etc/conf.d/net, rename config_eth0 to config_br0, then add config_eth0="null" and bridge_br0="eth0", run ln -s net.lo /etc/init.d/net.br0, and use that to bring the network up. Then on the LXC configuration side you need:

lxc.network.type = veth
lxc.network.link = br0

and you’re all set. As I said, a piece of cake.
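To spell out the host side as well, the resulting /etc/conf.d/net looks roughly like this (a sketch – the addresses are made up, and I am assuming eth0 is the only physical interface):

# /etc/conf.d/net
config_eth0="null"
bridge_br0="eth0"
config_br0="192.168.0.2/24"
routes_br0="default via 192.168.0.1"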

Slightly more difficult is handling the TTY devices properly; some people prefer to make use of the Linux virtual terminals to handle LXC containers; I sincerely don’t want it messing with my real virtual terminals, and I prefer using the lxc-console command to access a container without networking. Especially since it messes up a lot if you are using KMS with the Radeon driver (which is what I’ve been doing for the past year or so).

For this to work out, I noted two things: the first is that simply using the cgroup access control lists on the devices doesn’t help all that much (I actually haven’t tried to set them up properly just yet); on the other hand, LXC can create “pseudo-TTYs” that can be used with lxc-console. The default number (9) does not work all that well, because the Gentoo init system sets up twelve virtual terminals by default. So my solution is to use my custom static device tarball and the following snippet in the configuration:

lxc.tty = 12
lxc.pts = 128

This ensures that the TTY devices are all properly set up, so that they don’t mess with your virtual terminals, and lxc-console works like a charm in this configuration.

Now, the sad part: OpenRC is not yet stable, and I still haven’t fixed the NFS bug I found (you stop the service inside the container and the host’s NFS exports are destroyed… bad script, bad script!). On the other hand, I’m not working on LXC daily any longer, for the simplest of reasons: the tinderbox is already set up as I wish, for the most part, so I have little to no incentive to work more on this. The good news is, I’m up for hire, as I said, for what concerns Ruby. So if you really want to use LXC in production and want me to improve any Gentoo-related area of it, including libvirt, you can just contact me.

Besides that, everything should be in place. Have fun working with LXC!

LXC and why it’s not prime-time yet

Lately I have gotten a number of new requests about the status of LXC (Linux Containers) support in Gentoo; I guess this is natural, given that I have blogged a bit about it and my own tinderbox relies on it heavily to avoid polluting my main workstation’s processes with the services used by the compile – and especially test – phases. Since a new version was released on Sunday, I guess I should write about the subject again.

I have said before that in my opinion LXC is not ready yet for production use, and I maintain that opinion today. I would also rephrase it in a way that might make it easier to understand what I think: I would never trust root on a container to somebody I wouldn’t trust with root on the host. While it helps a great deal in reducing the nasty effects of an application going rogue by mistake, it neither removes the problem entirely, nor does it strengthen security against intentional meddling with the system. Not alone, at least. Not as it is.

The first problem is something I have already complained about: LXC shares the kernel with the host, obviously and by design; this is good because you don’t have to replicate drivers, resources, additional filesystem layers and all that stuff, so you get real native performance out of it. On the other hand, this also means that where the kernel does not provide namespace/cgroup isolation, it does not allow you to make distinct changes between the host system and the container. For instance, the kernel log buffer is still shared between the two, which causes no small problems when running a logger from within the container (you can do so, but you have to remember to stop it from accessing the kernel’s log). You also can’t change sysctl values between the host and the container, for instance to disable the brk() randomiser that causes trouble with a few LISP implementations.

But there are even more notes that make the whole situation pretty interesting. For instance, with the latest release (0.7.0), networking seems to have slowed down slightly; I’m not sure what the problem is exactly, but for some reason it takes quite a bit longer to connect to the container than it used to; nothing major, so I don’t have to pay excessive attention to it. On the other hand, I took the chance to try again to make it work with the macvlan network rather than the virtual Ethernet network, this time even googling around to find a solution to my problem.

Now, virtual Ethernet (veth) is not too bad; it creates a peer-to-peer connection between the host and the container, which you can then manage as you see fit: you can set up your system as a router, or use Linux’s ability to work as a bridge to join the container’s network with your base network. I usually do the latter, since it reduces the number of hops I need to add to reach the Internet. Of course, while all the management is done in-kernel, I guess there are a lot of internal hops that have to be traversed, and for a moment I thought that might have been what was slowing down the connection. Given that the tinderbox accesses the network quite a bit (I use SSH to control it), I thought macvlan would be simpler: in that case, the kernel directs the packets coming toward a specific MAC address through the virtual connection of the container.

But the way LXC does it, it means it’s one-way. By default, each macvlan interface you create also isolates the various containers from one another; you can change the mode to “bridge”, in which case the containers can chat with each other, but even then, the containers are isolated from the host. I guess the problem is that when they send packets, these get sent out from the interface they are bound to, but the kernel will ignore them if they are directed back in. No, there is currently no way to deal with that, that I know of.

Actually upstream has stated that there is no way to deal with that right now at all. Sigh.

An additional problem with LXC is that even when you do blacklist all the devices, so that the container’s users don’t have access to the actual underlying hardware, it can still mess up your host system quite a bit. For instance, if you were to start and stop the nfs init script inside the container… you’d be disabling the host’s NFS server.

And yes, I know I have promised multiple times to add an init script to the ebuild; I’ll try to update it soonish.

Gentoo as a guest OS

I have ranted before about the purported support for Gentoo in Amazon EC2; I would like to slightly defend Amazon on this matter by saying something that some people might find inflammatory but that, I think, we should acknowledge if we strive to improve: Gentoo sucks as a guest OS! Now that I have pointed at the elephant in the room, let me try to qualify and explain this statement, and see what we can do about it.

The first obvious problem is that the standard baselayout we ship has no idea how to work with vservers and other virtualised environments; we have baselayout-vserver, but last I knew it was a bit behind in features. Actually, we have very good support for guests in baselayout version 2, based on OpenRC. Indeed, one of the features that the new baselayout brought to the mix was supporting out of the box many different VPS implementations (Xen, vserver proper, OpenVZ and, lately, LXC thanks to Andrian). Unfortunately, almost two years after the addition of OpenRC to the tree, and over three and a half years after the introduction of the first alpha of what became the current situation, we still don’t have it stable, although we’re hopefully nearing that date.

02 Oct 2006; Roy Marples
+baselayout-1.13.0_alpha1.ebuild:
First alpha cut from the 1.13 branch with BSD and VServer support.
svcdir is now forced – you cannot change it’s location.
If you downgrade back to 1.12 you’ll have to copy /lib/rcscripts/init.d
to your svcdir (/var/lib/init.d by default) and run depscan.sh -u

Even discounting this problem (I actually use OpenRC on all my production systems, and never, or almost never, have had problems with it), there are still problems with a number of scripts and configurations that assume the presence, and functioning, of various real-hardware components that (obviously) we don’t have.

For instance, take the default syslog-ng configuration: it uses the /dev/tty12 device for logging… it doesn’t check whether the device exists before opening it for writing, and the end result is that it can create a regular file with the log printed into it, if you don’t remember to disable it after install (yes, I noticed this just the other day after months of having the vservers running, shame on me!). On the other hand, metalog (which requires an option to run in vservers, since it will fail to start if it cannot access the kernel, and it obviously cannot do that within a vserver; and only the currently-unstable version has that option) uses /dev/tty10 in its default configuration… just to show that we can’t even agree on the devices to use!
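From memory, the culprit in the stock configuration is a stanza roughly like this (a sketch, not the verbatim default), which needs to be commented out, or pointed somewhere sensible, on a guest:

destination console_all { file("/dev/tty12"); };
log { source(src); destination(console_all); };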

And of course there is the one big issue: our system set is too big! This means we end up bringing in a lot of stuff that is definitely not useful on vservers by default (like the kernel sources), just because we need them for the default case of real hardware. I guess one solution here would be to reduce the system set and then use the additional sets available in Portage 2.2 to handle the real-hardware cases (in the default stage… and from that an emerge -C @hwboot would give you a virtualisable system)… but alas, that’s something that never got enough momentum to become reality.

In general, though, Gentoo could become a much better OS to run as a guest… we should all try to work on improving that, step by step, bit by bit… remember: report whatever you feel is wrong with Gentoo packaging on our Bugzilla, after making sure there isn’t a report already. Don’t assume somebody else will see your problem, don’t assume that you’ll be the only one with that problem, *please report it*… just keep an open mind that maybe it’ll be marked as an invalid bug because you should have done some extra configuration… on the other hand, we really should make it easier to use Gentoo as a guest operating system.

LXC’s unpolished code

So I finally added lxc 0.6.5 to the tree for all Gentoo users to try, if they care; on the bright side, the lxc.mount option finally seems to work fine, and I also found the cause of the problem I complained about a few days ago. It is this latter problem that made me feel like writing this half-rant post.

Half-rant because I can see they are going to great lengths to improve it from release to release, but still a rant because I think it should have been less chaotic to begin with. On the other hand, I could probably be more constructive if I went on to provide patches… I’ll probably do that in the future if I free up some time, but if you follow my blog you know I’m quite swamped already between different things.

Starting from the 0.6.4 release, they dropped the “initialisation” task, and just let you run lxc-start with the config file. It was definitely a nice way to do it, as the init command wasn’t doing anything heavy that shouldn’t be done on first startup anyway. It was, though, a bit of a problem for scripts that used the old method, as a simple lxc-start -n myguest (the init step wouldn’t have been needed after a restart of the host) would mess up your system badly: it would spawn a new container using / as the new root… overriding your own system bootup. Thankfully, Andrian quickly added a check that refuses to proceed if a configuration file is not given. This does not save you from being able to explicitly mess your system up by using / as your new root, but at least it avoids possible mistakes when using the old-style calls.
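In other words, from 0.6.4 onwards you are expected to pass the configuration explicitly every time, along these lines (with a made-up container name and path):

lxc-start -n myguest -f /etc/lxc/myguest.conf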

So what about the 0.6.5 problem? Well, the problem came to be because 0.6.5 actually implements a nice feature (contributed by a non-core developer, it seems): root pivoting. The idea is to drop access to the old root, so that the guest cannot in any way access the host’s filesystem unless explicitly given access to it. It’s a very good idea, but there are two problems with it: it doesn’t really do it systematically, but rather with a “try and hope” approach, and it failed under certain conditions, saying that the original root was still busy (note that, since this happens within the container’s own mount namespace, it doesn’t matter to the rest of the system).

In the end, last night I was able to identify the problem: I had this line in the fstab file used by lxc itself:

none /tmp tmpfs size=200m 0 0

What’s wrong with it? The mountpoint. The fstab entries (and lxc.mount commands) are used without prior validation or handling, so this is not mounting /tmp for the guest, but /tmp for the host, within the guest’s mount namespace. The result is that /tmp gets mounted twice (once inherited from the base mount namespace, once within the guest’s namespace), but it only gets unmounted once (as the unmount list keeps each mount point exactly once). This is quite an obvious error on my part – I should have used /media/chroots/tinderbox/tmp as the mountpoint – but LXC being unable to catch the mistaken mountpoint (or at least warn about it) is a definite problem.
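For reference, the corrected line simply points the mountpoint at the host’s view of the guest’s root (this is my own layout, adjust to yours):

none /media/chroots/tinderbox/tmp tmpfs size=200m 0 0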

Another thing that makes me feel like LXC really needs some polishing is that you cannot just run the commands from the source directory: the build system uses autoconf and automake, but the authors explicitly backed away from libtool because the project is “Linux-only” (which really doesn’t say much about the usefulness of libtool in this case). Given that I’m not even sure whether the liblxc library is supposed to be ABI-stable or not (they have never bumped the soname, but that is suspicious), it might really be better if they used libtool and learnt how to handle it. Also, it uses recursive Makefiles badly; it would probably take just a second to build if I remade the build system as a standard non-recursive autotools package, like udev.

Oh well, let’s hope future releases keep polishing it up, bit by bit!

Linux Resource Containers and Gentoo, another step

I am currently tackling a new job, the details of which I’m not going to disclose, but one of the steps in my “pre-work setup” task was to create a new LXC-based guest, the reason for which is quite easy to guess: I wanted a clean slate to work on, rather than my messed-up live workstation.

I could have gone with Fedora or Ubuntu in KVM, or even a Gentoo install in KVM, but I still like the idea of using LXC a lot: it’s a very nice “virtualisation” technology that does not consume a huge amount of resources, nor does it require special kernel support (it works mostly fine with vanilla kernels). And as I’m one of its maintainers in Gentoo, I also wanted to check whether I could improve the situation.

Thanks to Adrian Nord, I’ve been able to learn quite a bit about LXC even without having to follow the upstream mailing lists. Unfortunately, the documentation about using LXC is scattered and sometimes messed up, so it took me a while to get this to work as intended. So here are a few notes that might come in useful to other people wanting to use it:

I’ve currently not packaged lxc-0.6.5, which was released earlier this week; my reason for avoiding it is that I cannot get it to work, or at least I couldn’t when it was released; once I can find out what the heck is wrong with it, I’ll make sure to add it. Unfortunately I cannot look into it as long as I’m using the lxc guests to do real work. On a similar, and maybe related, note, I couldn’t get 0.6.4 to work on the 2.6.33 release candidates, so it might be a kernel-version-specific problem.

You cannot rely on udev, neither to have /dev filled in, nor to have working hardware detection; I guess that might be obvious given that we’re talking about virtual environments, but there is a catch: by default, if you just bind-mount /dev (like I used to do with chroots), you end up with the guest having full access to the devices on the host. You need to set up an access control list on the device nodes that are allowed to be accessed for that to work.

The easiest way to handle the /dev problem, as Adrian pointed out to me, is to use a static /dev; this does not mean using the static-dev package, as that installs a huge amount of extra stuff we will probably never need, while at the same time not installing some other devices, such as random on my system. My solution? This tarball – not even compressed, as it’s just metadata! – creates the subset of devices that actually seem to be used.
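To give an idea of what ends up in there, recreating such a minimal static /dev by hand looks roughly like this (a sketch – the guest path is made up, and the exact set of nodes in the tarball may differ slightly):

cd /media/chroots/guest
mkdir -p dev/pts
mknod -m 666 dev/null    c 1 3
mknod -m 666 dev/zero    c 1 5
mknod -m 666 dev/random  c 1 8
mknod -m 666 dev/urandom c 1 9
mknod -m 666 dev/ptmx    c 5 2
mknod -m 600 dev/console c 5 1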

You don’t want to bind-mount /dev or /dev/pts; you want LXC to take care of the latter for you: not only will it mount a new instance of the pts filesystem, it will also bind the tty* devices to pseudo-terminals, allowing you to access the various virtual consoles through the lxc-console command. To do that you need a bit of configuration in the file:

lxc.tty = 12
lxc.pts = 128

Note the numbers there: the second one, referring to lxc.pts, is, as of 0.6.4, unused, and just needs to be non-zero so that the new /dev/pts instance is created properly. The former instead is important: that’s the number of TTYs that LXC will wrap around to pseudo-terminals. You want that to be at least 12 for Gentoo-based guests. The reason is that the consolefont and keymaps scripts will access tty1 through tty12 during start-up, and will then mess up your system badly if they are not wrapped. There are two extra catches: the first is that if the devices don’t exist, the init scripts will create regular files with those names (which can look quite strange when you go debugging your problems); the second is that you need to have the files for the given number of TTYs around: LXC will silently fail if it cannot find the file to bind at the device path. Which also takes a while to debug.

I still haven’t finished making sure that the cgroup device access controls work as intended, so for now I won’t post anything about those just yet. But you might want to look into lxc.cgroup.devices.allow to whitelist the device nodes that I have in the tarball, with the exclusion of the 4:* range (which are the real TTYs).
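A whitelist matching that tarball would look something like the following sketch; as I said, I have not verified this particular set on a running guest yet, so take it as a starting point only:

# deny everything by default
lxc.cgroup.devices.deny = a
# /dev/null and /dev/zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
# /dev/random and /dev/urandom
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
# /dev/console, /dev/ptmx and the pseudo-terminals
lxc.cgroup.devices.allow = c 5:1 rwm
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 136:* rwm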

Now, maybe I’ll be able to add an init script with the next release of LXC I’m going to package!

Backports, better access to

One of the tasks that distributions have to deal with on a daily basis is ensuring that the code they package is as free of bugs as humanly possible, without having to chase upstream continuously. The usual solution to this problem is to use the upstream-provided releases, but also apply on top of them patches that fix issues and, most importantly, backports.

Backports are patches that upstream has already accepted and merged into the current development tree, applied over an older version (usually, the latest released one). Handling these together with external patches tends to get tricky: on one side, for each new release you need to track down which ones have already been merged (just checking which ones still apply doesn’t do the trick, since you’d probably find non-merged patches that don’t apply because source files changed between the two versions); on the other, they often apply with fuzz, which has proven to be a liability with the latest GNU patch version.

Now, luckily, better tools to handle patching have been created over time: quilt is a very good one when you have to deal with generic packages, but even better is the git source control management software. When you have an upstream repository in git, you can clone it, create a new branch stemming from the latest release tag, and apply your patches, committing them right away exactly as they are. And thanks to the cherry-pick and cherry commands, handling backports (especially verifying whether they have been merged upstream) is a piece of cake.

It gets even better when you use the format-patch command to generate the patchset, since the patches will come out ordered, and described, right away; the only thing it lacks is creating a patchset tarball straight out of that, not that it’s overly difficult to do (I should probably write a script and publish it). Add tags, and the ability to reset branches, and you can see how dealing with distribution patching via git is much nicer than it was before git came along.
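The whole workflow, with made-up repository, tag and commit names, boils down to something like this sketch:

git clone git://example.org/lxc.git && cd lxc
git checkout -b gentoo-patches lxc-0.7.2    # branch off the latest release tag
git cherry-pick abc1234                     # backport a fix already merged upstream
git am ../0001-some-local-fix.patch         # add an external patch of our own
git format-patch lxc-0.7.2..gentoo-patches  # regenerate the ordered, described patchset
git cherry -v lxc-0.7.3 gentoo-patches      # at the next release, '-' marks patches already merged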

But even this is of relative usefulness if you keep the stuff only locally available. So to solve this problem, and especially to remove myself as a single point of failure (given my background, that’s quite something – even more so considering that lately I have had to spend most of my time on paid work projects, as life suddenly, and sadly, became quite complicated), I’ve decided to prepare and push out branches.

So you can see there is one repository for lxc (thanks to Adrian and Sven for the backports), and one for libvirt (thanks to Daniel, who, after daring me, went on to merge the patches I made to the schemas; two out of three are in, and the last one is the only reason why nxml integration for libvirt files isn’t in yet).

Now, there is one other project whose patches I’d like to publish this way, and that project is iscsitarget; unfortunately upstream is still using Subversion, so I’ve been using a git-svn bridge, which is far from nice. On the other hand, my reason for publishing that is that I dropped iscsitarget! Which means that if you’ve been relying on it, from the next kernel release onward you’ll probably encounter build failures (I had already fixed it to build with 2.6.32 beforehand, since I was using the release candidates for a while). Myself, I’ve moved to sys-block/tgt (thanks to Alexey), which does not require an external kernel module, but rather uses the SCSI target module that Linux already provides out of the box.

Debating future tinderbox work

I haven’t been working on the tinderbox lately because my “daily job” (which is not really daily) has swamped me badly. Since I’m going to London this week to take some days off, I’ll probably get back to the tinderbox after that.

For the next ride of the tinderbox, there is at least one thing that’s definitely going to be interesting: the new X11R7.5 release means that quite a few packages might not build at all, since they haven’t had the new include paths fixed. I found one or two packages with such problems while doing Yamato’s root filesystem rebuild (after the glibc 2.11 update).

There is another interesting idea that I should probably toy with: the way the tinderbox works, it tests all non-masked packages; by QA standards, those should not use the network at build time. During my world rebuild last night, the network went offline, and one package failed because it tried to wget a piece of source from the network. And it’s not even the first one lately.

Thanks to the fact that my tinderbox uses containers, I can easily cut it off from the network entirely, and then make sure that the ebuilds trying to use the network get their access refused.
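If I go down that road, the container side should be as simple as giving the guest no network device at all, along these lines in its configuration (assuming the empty network type is available in the lxc version in use):

lxc.network.type = empty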

The other problem to cope with is the size of the logs, and the fact that I still lack an analysis script, which means that opening new batches of bugs requires a huge amount of work, especially when it comes to attaching the log and getting some information out of it.

Any suggestion on how to proceed with the tinderbox will definitely be welcome.