Lately I have received a number of new requests about the status of LXC (Linux Containers) support in Gentoo; I guess this is natural given that I have blogged a bit about it, and that my own tinderbox relies on it heavily to keep the services used by the compile (and especially test) phases from polluting my main workstation's processes. Since a new version was released on Sunday, I guess I should write on the subject again.
I said before that in my opinion LXC is not ready yet for production use, and I maintain that opinion today. Let me rephrase it in a way that might make my position clearer: I would never trust root on a container to somebody I wouldn't trust with root on the host. While it helps a great deal in limiting the damage an application can do when it goes rogue, it neither removes the risk entirely nor hardens the system against intentional meddling. Not alone, at least. Not as it is.
The first problem is something I have already complained about: LXC shares the kernel with the host, obviously and by design. This is good because you don't have to replicate drivers, resources or additional filesystem layers, so you get real native performance out of it. On the other hand, it also means that anything the kernel does not isolate through namespaces or cgroups cannot be configured differently between the host system and the container. For instance, the kernel log buffer is still shared between the two, which causes no small amount of trouble when running a logger from within the container (you can do so, but you have to remember to stop it from accessing the kernel's log). You also can't change sysctl values between the host and the container, for instance to disable the brk() randomizer that causes trouble with a few LISP implementations.
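To give an idea of what I mean, this is the kind of knob I'm talking about; a minimal sketch, and the value shown is only an example (if I recall the semantics correctly, 2 also randomizes the brk() start while 1 leaves it alone). The point is that the sysctl is global: change it for the sake of one container and you change it for the host and every other container too.

    # run on the host: there is no per-container version of this sysctl,
    # so the new value is visible inside every container as well
    sysctl -w kernel.randomize_va_space=1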
But there are further notes that make the whole situation even more interesting. For instance, with the latest release (0.7.0), networking seems to have slowed down slightly; I'm not sure what the problem is exactly, but for some reason it takes quite a bit longer to connect to the container than it used to. Nothing major, so I don't have to pay excessive attention to it. On the other hand, I took the chance to try again to make it work with macvlan networking rather than virtual Ethernet, this time even googling around to find a solution to my problem.
Now, virtual Ethernet (veth) is not too bad; it creates a peer-to-peer connection between the host and the container, and you can then manage that as you see fit: set up the host as a router, or use Linux's ability to act as a bridge to join the container's network with your base network. I usually do the latter, since it reduces the number of hops I need to go through to reach the Internet. Of course, even though all the management is done in-kernel, I guess there are quite a few internal hops the packets have to pass through, and for a moment I thought that might have been what was slowing down the connection. Given that the tinderbox accesses the network quite a bit (I use SSH to control it), I thought macvlan would be simpler: in that case, the kernel directs packets addressed to a specific MAC address straight through the container's virtual interface.
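For reference, this is roughly what the bridged veth setup looks like; a sketch only, with br0, eth0 and the configuration path being examples to adjust to your own naming:

    # on the host: create the bridge and enslave the physical interface
    brctl addbr br0
    brctl addif br0 eth0

    # excerpt from the container's LXC configuration (0.7 syntax);
    # the host side of the veth pair gets attached to that bridge
    lxc.network.type = veth
    lxc.network.link = br0
    lxc.network.flags = up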
But the way LXC uses it, the result is effectively one-way. By default, each macvlan interface you create also isolates the various containers from one another; you can change the mode to “bridge”, in which case the containers can talk to each other, but even then they remain isolated from the host. I guess the problem is that when they send packets, those get sent out of the interface they are bound to, and the kernel will ignore them if they are directed back in. There is currently no way to deal with that, that I know of.
Actually upstream has stated that there is no way to deal with that right now at all. Sigh.
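For completeness, this is the macvlan variant I was describing, again as a sketch with example interface names; even in “bridge” mode the host remains unreachable from the containers:

    # excerpt from the container's LXC configuration
    lxc.network.type = macvlan
    # containers on the same macvlan can talk to each other,
    # but not to the host sitting behind eth0
    lxc.network.macvlan.mode = bridge
    lxc.network.link = eth0
    lxc.network.flags = up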
An additional problem with LXC is that even when you blacklist all the devices, so that the container's users don't have access to the actual underlying hardware, the container can still mess up your host system quite a bit. For instance, if you were to start and stop the nfs init script inside the container… you'd be disabling the host's NFS server.
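To be clear, by “blacklisting the devices” I mean a cgroup device policy along these lines (a typical allow-list, sketched from memory); note that it does nothing against the NFS example above, since that goes through shared kernel state rather than device nodes:

    # excerpt from the container's LXC configuration:
    # deny everything by default, then allow a few harmless nodes
    lxc.cgroup.devices.deny = a
    # /dev/null, /dev/zero, /dev/urandom
    lxc.cgroup.devices.allow = c 1:3 rwm
    lxc.cgroup.devices.allow = c 1:5 rwm
    lxc.cgroup.devices.allow = c 1:9 rwm
    # /dev/ptmx and /dev/pts/*
    lxc.cgroup.devices.allow = c 5:2 rwm
    lxc.cgroup.devices.allow = c 136:* rwm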
And yes, I know I have promised multiple times to add an init script to the ebuild; I'll try to update it soonish.
Forgive me if I've missed something, but I would like to set up a system for testing ebuilds; nothing on the scale of what you are doing with the tinderbox, but a sandbox to test certain software that I am interested in. I've been through UML and Xen but never tried containers. Could a setup like this be done under Xen? UML is slow, and Xen seems to give you all the power of real root access with minimal overhead. Are there any docs or scripts to set up a test system like your tinderbox? Thanks.
Nothing against LXC, but I have the strange feeling that this initiative is about 5 years late compared to OpenVZ. OpenVZ just works: I've had a service with about 450 containers inside, each with its public IP address, with quotas, live migration, read-only “deduplication” (mount -o bind), VLANs… all working right away once the kernel and one tool are installed. Reading about such issues with LXC makes me wary. Why should I even bother looking at LXC then? Because it's mainstream, or what? I just don't get it.
Mostly, OpenVZ requires specific kernel versions, while LXC works mostly with stock kernels. Yes, there is a huge bias here due to the fact that LXC is _in_ the kernel; but that's just the point. If OpenVZ were in the kernel I wouldn't bother with LXC, of course.
Why is OpenVZ not in the mainline kernel?
@chriss: I really don't know; I suspect the changes are just too deep for the maintainers. I've tried to get the answer several times and have found some pieces on different forums, like this one: http://www.mail-archive.com… Anyway, I also suspect that by the time LXC is as stable as OpenVZ is now, the whole idea of containers will have gathered dust and the war will be over, with KVM/VMware the winners. Or is it already too late for containers?
Well, considering the kind of performance that a tuned kernel has on a KVM host, I would suspect that neither containers nor OpenVZ are going to make enough of a difference to be worth keeping on working with… Adding KVM guest support, virtio net and virtio disk, … it's almost as good as a real system.
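(For those unfamiliar, this is roughly what that looks like on the command line; a sketch only, where the disk image and tap interface names are placeholders:)

    # boot a guest with paravirtualized disk and network (virtio);
    # guest.img and tap0 are example names
    qemu-kvm -m 1024 \
        -drive file=guest.img,if=virtio \
        -net nic,model=virtio -net tap,ifname=tap0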
“Adding KVM guest support, virtio net and virtio disk, … it's almost as good as a real system.”

That isn't exactly the field for comparison here. The killer feature in OpenVZ (LXC?) is bind mounts. I've managed to run 450 VEs using about 200MB of storage, even though each machine holds almost 200MB: all of the containers share the same system directories and differ from each other only in a few config files and the user data. At startup each container has 24MB of unique data.

This means further optimization was possible: instead of using a disk I'm using tmpfs for the system storage and the disk for user data, so the disk is free of system I/O and memory is used instead.

BTW, this also means there is no easy way to migrate the containers (the bind mounts prevent live migration), but it is still possible with some shell tweaks (once a user is logged out, I can umount and migrate the container). Thus, I believe the performance here is unmatchable. I don't really know if there is such a possibility in KVM.
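(Roughly what such a bind-mount layout looks like, as a sketch with made-up paths; the read-only remount is needed because older kernels ignore “ro” on the initial bind:)

    # shared, read-only system tree reused by every container;
    # only the per-container data lives on its own tmpfs or disk storage
    mount -o bind /srv/template/usr /srv/ve/450/rootfs/usr
    mount -o remount,ro,bind /srv/ve/450/rootfs/usr
    mount -t tmpfs -o size=64m tmpfs /srv/ve/450/rootfs/var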
Word is that macvlan no longer isolates containers from the host with lxc-0.7.1 and at least a 2.6.33 kernel. But I'll know more in a few hours, after testing that claim.

The big advantage of lxc over OpenVZ, in my brief experience, is that building a 2.6.32 kernel with the latest (also next-to-latest) OpenVZ patches does not produce a stable result.

Since none of my prospective uses of containers involve sharing root with anyone not already trusted to share root, the only concern here is how easily a non-root app might break out of its container. Only if it can get root, right? So it still beats all hell out of running that app on the same system, but not in a container.
“The big advantage of lxc over OpenVZ, in my brief experience, is that building a 2.6.32 kernel with the latest (also next-to-latest) OpenVZ patches does not produce a stable result.”

Unfortunately, that's rather true. I'm still using 2.6.27. Anyway, do the lxc developers foresee some kind of “live migration”?
As I see it, OpenVZ and Linux-VServer did not make it into the mainstream kernel because each is a set of “proprietary” patches for a specific task. You would find people who like the OpenVZ patches more than the VServer ones and vice versa, so which one do you choose?

LXC is a generic redesign, introducing namespace support into the main kernel subsystems (see http://lxc.sourceforge.net/…). This creates code that is of interest to more people and maintainable by more people, thus it will be more stable, and it lowers the burden on the OpenVZ patch maintainers. Cool :)

Different user-tool implementations already exist to use these new kernel features; the LXC usertools and libvirt are only two of them, and I guess there will be more in the future. OpenVZ could be one more: a different management-interface approach on top of the same kernel infrastructure. This may be the reason the OpenVZ guys – as far as I know – were involved, helpful and active in LXC development.

For me it is a real improvement to have implementations of the two main virtualisation techniques – kvm and lxc, i.e. hypervisor and container – integrated into the kernel, and to just pick the right one for the task at hand without the hassle of building and maintaining custom kernels. Linux rocks, great future ahead :))

We'll see a transition phase of course, where you have to think twice whether lxc is the one to choose for the task you have in mind. But lots of the scenarios I deal with would fit well with lxc, and I guess that could be said for a lot of people. From my point of view, the one Flameeyes picked up is in fact the last remaining one where I personally would step back: I would not offer public container-based root servers using lxc. Yet. Lucky me – no plan to do so ;-))
Most of these objections apply to OpenVZ as well.
I think one of the situations where LXC may be preferable to OpenVZ is when you also depend on other external patches, in our case OpenAFS.

Also, LXC+Btrfs seems promising for improving disk efficiency and maintenance (snapshot the entire instance before applying updates, for instance).

And the performance hit of KVM is not big, but it's still noticeable, especially in disk-heavy applications; http://www.phoronix.com/sca…

What I'd personally like to see, though, is a specialized distro aimed specifically at providing lightweight overlay-style appliances on top of LXC. Something similar to slax packages and Conary, integrated with Btrfs and LXC.
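(As a sketch of the snapshot idea mentioned above, with example paths; this assumes the container root lives on a Btrfs subvolume:)

    # snapshot the container's root before applying updates; if the
    # update breaks something, point lxc.rootfs at the snapshot instead
    btrfs subvolume snapshot /var/lib/lxc/guest/rootfs \
                             /var/lib/lxc/guest/rootfs-pre-update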
I would like to offer this link, http://www.slideshare.net/b…, to give those who might not be familiar an idea of how container-based virtualization performs compared with the other forms. Also consider that I can implement lxc, openvz and vserver on machines whose CPUs are not virtualization-ready: I could run vserver, lxc and probably openvz (though I have no experience with it) on an old PPC Mac if I wanted to. The freedom to do what one wants is still important to me (don't tell me what to do with my hardware).

As far as security is concerned, may I also present this link: http://www.ibm.com/develope… and this one: https://sites.google.com/a/…. With these techniques it is possible to have a very well locked-down system.

We should avoid the human tendency of projecting our own shortcomings onto external things instead of saying “I don't know how to do something” or “I don't know if there is a way this could be done”. These methods may have their own shortcomings, until someone resolves them, and life will go on to better things.

BK
I must also say that Diego's concerns are valid for the moment. I use macvlans for my setup, and I have enabled bridging because I wanted the containers to talk to each other, but I have to investigate them further. I can isolate the containers from each other and from the system, but there are other interesting things that can be done with switches using macvlans that I have not had the time to investigate. The three available modes should be better documented, but I love the ability to create a new device on which I can house a subnet and have it totally isolated from another macvlan device with another subnet.

I haven't installed an NFS server inside a container yet, but getting a client to work requires adding these fstab entries:

    sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
    nfsd /proc/fs/nfsd nfsd rw 0 0

I haven't tested this to its fullest. Anyway, I'm just learning to use it. By the way, OpenQRM supports lxc now and they have a video on YouTube.
I wanted to mention that macvlans shouldn't be mistaken for something they are not, and were not originally meant to be. They can be used in bridge mode, but their real power has to do with what is described here: http://www.redhat.com/archi…

To get the full power of macvlans, especially in VEPA mode, you need a switch with “hairpin” functionality. It's really cool stuff: if you own a reflective switch, all communication between macvlans goes out to the switch and back in through the port it came from, and can be accounted as traffic by the switch.
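(For reference, creating a macvlan device in VEPA mode with iproute2 looks roughly like this; eth0 and macvlan0 are example names, and the hairpin/reflective-relay behaviour has to come from the external switch:)

    ip link add link eth0 name macvlan0 type macvlan mode vepa
    ip link set macvlan0 up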
Some other sources of information relating to VEPA:

http://www.definethecloud.n…
http://www.computerworld.co…
http://gr33ndata.blogspot.c…

There is a lot of documentation regarding this; search for VEPA and switch. I hope that, viewed in this context, macvlans in Linux can be put to better and more appropriate use.
I've got the feeling that by the end of 2013 it will be more than ready for prime time: SELinux, seccomp(2), AppArmor, user namespaces. What is really needed is more and more ways to make it easy to use securely in a multi-tenancy environment.
I wonder about the status now. I am setting up containers on a completely standard Debian 7.0 wheezy RC1, and error messages are spewing left and right (when Googling them you get bug reports from 2011 and 2012). Documentation is limited, and Debian is dropping support for Linux-VServer and OpenVZ. Is LXC ready for prime time now?