LXC and why it’s not prime-time yet

Lately I got a number of new requests about the status of LXC (Linux Containers) support in Gentoo; I guess this is natural given that I have blogged a bit about it and my own tinderbox system relies on it heavily to avoid polluting my main workstation’s processes with the services used by the compile – and especially test – phases. Since a new version was released on Sunday, I guess I should write again on the subject.

I said before that in my opinion LXC is not ready yet for production use, and I maintain that opinion today. I would also rephrase it in something that might make it easier to understand what I think: I would never trust root on a container to somebody I wouldn’t trust root with on the host. While it helps a great deal to reduce the nasty effects of an application mistakenly growing rogue, it neither removes the option entirely, nor it strengthen the security for intentional meddling with the system. Not alone at least. Not as it is.

The first problem is something I have already complained about: LXC shares the same kernel, obviously and by design; this is good because you don’t have to replicate drivers, resources, additional layers for filesystem and all the stuff, so you have real native performance out of it; on the other hand, this also means that if the kernel does not provide namespace/cgroup isolation, it does not allow you to make distinct changes between the host system and the container. For instance, the kernel log buffer is still shared among the two, which causes no little problems to run a logger from within the container (you can do so, but you have to remember to stop it from accessing the kernel’s log). You also can’t change sysctl values between the host and the container, for instance to disable the brk() randomizer that causes trouble with a few LISP implementations.

But there are even more interesting notes that make the whole situation pretty interesting. For instance, with the latest release (0.7.0), networking seems to have slightly slowed down; I’m not sure what’s the problem exactly, but for some reason it takes quite a bit longer to connect to the container than it used to; nothing major so I don’t have to pay excessive attention to it. On the other hand, I took the chance to try again to make it work with the macvlan network rather than the virtual Ethernet network, this time even googling around to find the solution about my problem.

Now, Virtual Ethernet (veth) is not too bad; it creates a peer-to-peer connection between the host and the container; you can then manage that as you see fit; you can then set up your system as a router, or use Linux ability to work as a bridge to join container’s network with your base network. I usually do that, since it reduces the amount of hops I need to add to reach Internet. Of course, while all the management is done in-kernel, I guess there are a lot of internal hops that have to be passed, and for a moment I thought that might have been slowing down the connection. Given that the tinderbox accesses the network quite a bit (I use SSH to control it), I thought macvlan would be simpler: in that case, the kernel is directing the packets coming toward a specific MAC address through the virtual connection of the container.

But the way LXC does it, it means that it’s one-way. By default, actually, each macvlan interface you create, isolates the various containers one from the other as well; you can change the mode to “bridge” in which case the containers can chat one with the other, but even then, the containers are isolated from the host. I guess the problem is that when they send packets, they get sent out from the interface they are bound to but the kernel will ignore them if they are directed back in. No there is currently no way to deal with that, that I know of.

Actually upstream has stated that there is no way to deal with that right now at all. Sigh.

An additional problem with LXC is that even when you do blacklist all the devices so that the container’s users don’t have access to the actual underlying hardware, it can mess up your host system quite a bit. For instance, if you were to start and stop the nfs init script inside the container.. you’d be disabling the host’s NFS server.

And yes, I know I have promised multiple time to add an init script to the ebuild; I’ll try to update it soonish.