A good reason not to use network bridges

So one of the things I’m working on for my job is setting up Linux Containers to separate some applications — yes, I know I’m the one who said that they are not ready for prime time, but please note that what I was saying is that I wouldn’t give root inside a container to anybody I wouldn’t trust with root on the host — which is not the same as saying that they are not extremely useful to limit the resource consumption of various applications.

Anyway, there is one thing that has to be considered, which I already wrote about briefly: networking. The simplest way to set up an LXC host, if your network is a private one with a DHCP server or something along those lines, is to create a single bridge between your public network interface and the host side of the virtual Ethernet pairs — this has one unfortunate side effect: to make it work, it puts the network interface in promiscuous mode, which means it receives all the packets on the segment, even those directed to other interfaces, which slows it down quite a bit.
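
For reference, a minimal sketch of that single-bridge setup on the host, with made-up interface names and the classic brctl tools:

    # hypothetical names: eth0 is the public interface, br0 the bridge
    brctl addbr br0
    brctl addif br0 eth0      # enslaving eth0 is what puts it in promiscuous mode
    dhcpcd br0                # the bridge, not eth0, now carries the host's address

The host-side halves of the containers' veth pairs then get added to br0 as well, either by hand or by LXC itself.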

So how do you solve the issue? Well, I’m honestly not sure whether macvlan improves the situation; I’m afraid not. What I decided for Excelsior, since it is not on a private network, was to set up an internal bridge, and assign static, internal IP addresses to the containers. When I need to jump into one of the containers, I simply use the main public IP as an SSH jumphost and then connect to the correct address. I described the setup before, although I have since made a further change so that I don’t have to bother with the private IP addresses in the configuration file: I use the public IPv6 AAAA records for the containers, which simply resolve as usual once I’m inside the jumphost.
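
A minimal sketch of the jumphost trick in ~/.ssh/config, with made-up host names and assuming an OpenSSH recent enough to support ssh -W:

    # hypothetical names; "excelsior" is the box with the public IP
    Host excelsior
        HostName excelsior.example.com

    Host *.containers.example.com
        ProxyCommand ssh excelsior -W %h:%p

With that in place, connecting to any of the containers goes transparently through the jumphost.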

Of course, with the exception of jumphosts, that kind of setup, which involves NAT through iptables, has no way to receive connections from the outside.
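
The NAT part is the usual masquerading rule; a sketch with hypothetical addresses, assuming the containers sit on 192.168.100.0/24 behind the internal bridge:

    # on the host: rewrite the containers' traffic to the public interface's address
    iptables -t nat -A POSTROUTING -s 192.168.100.0/24 -o eth0 -j MASQUERADE

Outbound connections work fine this way, but there is no mapping for unsolicited inbound traffic, which is exactly the limitation described above.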

So what other options are there? One thing I’ve been thinking about was to use a layer-3 managed switch and set it to route a subnet to the LXC host — but that wouldn’t really fly. So in the end the question becomes “what do I need to access on the containers from the outside?” and the answer is simply “the websites”. The containers provide a number of services, but only the websites are exposed to the outside. So, do I need IPs that are even partially public? Not really.

The solution I’m planning right now is to set up a box with either an Apache reverse proxy or some other reverse proxy (depending on how much we want to handle on the proxy itself), and have that contact the internal containers, the same way it would work if you had a reverse proxy on the Internet and the servers on an internal network.
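
If it ends up being Apache, the proxy side is little more than mod_proxy; a sketch with made-up names and addresses:

    # needs mod_proxy and mod_proxy_http loaded
    <VirtualHost *:80>
        ServerName www.example.com
        ProxyPass        / http://192.168.100.10/
        ProxyPassReverse / http://192.168.100.10/
    </VirtualHost>

The container behind 192.168.100.10 never needs a public address; only the proxy box is reachable from outside.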

I guess at some point I should overhaul the LXC wiki page for what concerns networking; I already spent some time removing duplicated content and actually syncing it with what’s going on in the ebuild…

Linux Containers and Networking

So, at the moment I start writing this (and that’s unlikely to be the time I actually post this, given that I see now it could use some drawings) it’s early in the morning in Italy and I haven’t slept yet – a normal condition for me especially lately – but I have spent a bit of time bouncing ideas around with ferringb, Ramereth and Craig for what concerns Linux Containers (LXC). Given that, I’d like to point out a couple of things regarding networking and LXC that might not have been extremely obvious before.

First of all, of the four networking types supported by LXC, I could only try two, for obvious reasons: phys is used to assign a particular physical device to the container, and only works if you have enough physical devices to go around; vlan requires a device able to do VLAN tagging. This leaves us with veth (virtual Ethernet) and macvlan (MAC-address based virtual LAN tagging). The former is the simplest setup, and the one I’ve been using; it creates a pair of devices, one of which is assigned within the container while the other stays on the host; you can then manage that host-side device exactly like any other device on your system, and in my case that means it’s added to the br0 bridge that the KVM instances also join. LXC allows you to define the bridge to join directly in the configuration file.
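
In configuration terms, the veth setup boils down to a few lines; this is a minimal sketch, assuming the bridge is called br0 as in my case:

    lxc.network.type  = veth
    lxc.network.link  = br0      # the host-side end is enslaved to this bridge
    lxc.network.name  = eth0     # name of the interface inside the container
    lxc.network.flags = up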

Linux Containers with Virtual Ethernet

The macvlan mode is supposed to have smaller overhead, because the kernel knows the MAC address assigned to each interface beforehand; on the other hand, setting it up is slightly harder. In particular, there is a further mode parameter that can be set to either vepa (Virtual Ethernet Port Aggregator) or bridge mode. The former isolates the containers, as if they were a number of different hosts connected to the network segment, and disallows the various containers from talking with one another; the latter instead creates a special bridge (not to be confused with the Linux bridge used above with virtual Ethernet devices) that allows all the containers to talk with one another… but isolates them from the host system.
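
The corresponding configuration is just as short; a sketch assuming the physical interface is eth0:

    lxc.network.type         = macvlan
    lxc.network.link         = eth0
    lxc.network.macvlan.mode = vepa   # or "bridge" to let the containers talk to each other
    lxc.network.flags        = up

Either way the host itself stays cut off from the containers, which is what the diagrams below try to show.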

Linux Containers with MACVLAN VEPA-mode

You end up having to choose between the performance of network-to-container traffic and that of host-to-container traffic: in the first case you choose macvlan, reducing the work the kernel has to do, but requiring you to route your own traffic to the container through an outside router; in the second case you use veth and let the kernel handle the bridge itself. In my case, since the containers are mostly used for local testing, and the workstation will still be using the in-kernel bridge anyway, the obvious choice is veth.

Linux Containers with MACVLAN Bridge-mode

Now, when I decided to seal the tinderbox I wondered about one thing that LXC cannot do and that I would like to find the time to send upstream. As it is, I want to disallow any access from the tinderbox to the outside, except for the RSync service and the non-caching Squid proxy. To achieve that I dropped IPv4 connectivity (so I don’t run any DHCP client at all), and limited myself to autoconfigured IPv6 addresses; then I set in /etc/hosts the static address for yamato.home.flameeyes.eu, and used that as hostname for the two services. Using iptables to firewall access to anything else had unfortunate results before (the kernel locked up, without any actual panic happening); while I have to investigate that again, I don’t think much has changed in that regard.

There is no access to or from the outside network, since the main firewall is set to refuse talking at all with the tinderbox, but that’s not generally a good thing (I would like, at some point in the future, to allow other developers access to the tinderbox), and it does not ensure isolation between that and the other boxes on the network, which is a security risk (remember: the tinderbox builds and executes a lot of code that for me is untrusted).
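
The /etc/hosts part is nothing fancy; a sketch with a documentation-prefix address standing in for the real one:

    # inside the container's /etc/hosts; the address here is a placeholder
    2001:db8::1    yamato.home.flameeyes.eu

The rsync mirror and the Squid proxy are then referenced by that hostname, so the tinderbox never needs working DNS or IPv4 at all.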

Now, assuming that the iptables kernel problem only happens with the bridge enabled (I would be surprised if it failed that badly on a pure virtual Ethernet device!), my solution was actually kind of easy: I would just use the link-local IPv6 address, and rely on Yamato as a jump-host to connect to the tinderbox. Unfortunately, while LXC allows you to set a fixed hardware address for the interface created inside the container, it provides no way to do the same for the host-side interface (which also gets a random name such as veth8pUZr), so you cannot use iptables to enforce the policy as easily.
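
The half that does work is a single configuration line; a sketch with an obviously made-up address:

    # container-side interface only; there is no equivalent knob for the host side
    lxc.network.hwaddr = 00:16:3e:00:00:01

With the host-side end getting a random name and address on every start, there is nothing stable to hang an iptables rule on, which is exactly the missing piece.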

But up to this point it’s just a matter of missing configuration interfaces, so it shouldn’t be much of a problem, no? Brian pointed out a possible safety issue there though, and I went on to check it out. Since, when you use virtual Ethernet devices, it is the kernel’s bridge that takes care of deciding where to send the packets, based on the MAC addresses it learns, there is no checking of the hardware address used by the container; just like the IP settings, any root user inside the container can add and remove IP addresses and change its MAC address altogether. D’uh!
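
To show how little is stopping this, here is roughly what a root user inside the container can do, assuming the usual iproute2 tools are installed:

    # run as root inside the container; nothing on the host side validates this
    ip link set dev eth0 down
    ip link set dev eth0 address 00:16:3e:aa:bb:cc
    ip link set dev eth0 up
    ip addr add 192.168.100.99/24 dev eth0

The bridge will simply learn the new MAC address and keep forwarding, so any filtering keyed on the container’s address is trivially bypassed.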

I’m not sure whether this would work better with macvlan, but as it is, there is enough work to be done on the configuration interface that – over a year after the tinderbox started using LXC to run – it’s still not ready for production use — or at least not for the kind of production use where you actually allow third parties to access a “virtualised” root.

For those interested, the original SVG file of the (bad) network diagrams used in the article is here; it was drawn with Inkscape.

LXC and why it’s not prime-time yet

Lately I got a number of new requests about the status of LXC (Linux Containers) support in Gentoo; I guess this is natural given that I have blogged a bit about it and my own tinderbox system relies on it heavily to avoid polluting my main workstation’s processes with the services used by the compile – and especially test – phases. Since a new version was released on Sunday, I guess I should write again on the subject.

I said before that in my opinion LXC is not ready yet for production use, and I maintain that opinion today. I would also rephrase it in a way that might make it easier to understand what I think: I would never trust root on a container to somebody I wouldn’t trust with root on the host. While it helps a great deal to reduce the nasty effects of an application mistakenly going rogue, it neither removes the possibility entirely, nor does it strengthen the security against intentional meddling with the system. Not alone, at least. Not as it is.

The first problem is something I have already complained about: LXC shares the same kernel, obviously and by design; this is good because you don’t have to replicate drivers, resources, additional filesystem layers and all the rest, so you get real native performance out of it; on the other hand, this also means that wherever the kernel does not provide namespace/cgroup isolation, it does not allow you to make distinct changes between the host system and the container. For instance, the kernel log buffer is still shared between the two, which causes quite a few problems when running a logger from within the container (you can do so, but you have to remember to stop it from accessing the kernel’s log). You also can’t change sysctl values between the host and the container, for instance to disable the brk() randomizer that causes trouble with a few LISP implementations.
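
As an example of the kind of knob I mean (and assuming I remember the semantics correctly), the brk() randomizer is controlled by a single, global sysctl:

    # 0 disables ASLR, 1 randomizes stack/mmap/VDSO, 2 additionally randomizes brk;
    # whatever value is set applies to the host and to every container alike
    sysctl kernel.randomize_va_space=1

There is no per-container version of that value, which is exactly the limitation described above.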

But there are more notes that make the whole situation pretty interesting. For instance, with the latest release (0.7.0), networking seems to have slowed down slightly; I’m not sure what the problem is exactly, but for some reason it takes quite a bit longer to connect to the container than it used to; nothing major, so I don’t have to pay excessive attention to it. On the other hand, I took the chance to try again to make it work with the macvlan network rather than the virtual Ethernet network, this time even googling around to find a solution to my problem.

Now, Virtual Ethernet (veth) is not too bad; it creates a peer-to-peer connection between the host and the container, which you can then manage as you see fit; you can set up your system as a router, or use Linux’s ability to work as a bridge to join the container’s network with your base network. I usually do the latter, since it reduces the number of hops I need to cross to reach the Internet. Of course, while all the management is done in-kernel, I guess there are a lot of internal hops that have to be passed, and for a moment I thought that might have been what was slowing down the connection. Given that the tinderbox accesses the network quite a bit (I use SSH to control it), I thought macvlan would be simpler: in that case, the kernel directs the packets addressed to a specific MAC address straight through the virtual connection of the container.

But the way LXC does it, it’s one-way. By default, each macvlan interface you create isolates the various containers from one another as well; you can change the mode to “bridge”, in which case the containers can chat with one another, but even then the containers are isolated from the host. I guess the problem is that when they send packets, the packets go out from the interface they are bound to, but the kernel will ignore them if they are directed back in. And no, there is currently no way to deal with that, that I know of.

Actually upstream has stated that there is no way to deal with that right now at all. Sigh.

An additional problem with LXC is that even when you do blacklist all the devices, so that the container’s users don’t have access to the actual underlying hardware, a container can still mess up your host system quite a bit. For instance, if you were to start and stop the nfs init script inside the container… you’d be disabling the host’s NFS server.
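
Device blacklisting itself is straightforward; a sketch of the usual deny-all-then-whitelist approach in the container config:

    # deny everything, then allow a handful of harmless character devices
    lxc.cgroup.devices.deny  = a
    lxc.cgroup.devices.allow = c 1:3 rwm   # /dev/null
    lxc.cgroup.devices.allow = c 1:5 rwm   # /dev/zero
    lxc.cgroup.devices.allow = c 1:9 rwm   # /dev/urandom

But none of this helps with shared kernel state: the NFS server lives in the kernel, so stopping it from inside the container stops it for the host as well.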

And yes, I know I have promised multiple times to add an init script to the ebuild; I’ll try to update it soonish.