Report from SCaLE13x

This year I have not been able to visit FOSDEM. Funnily enough, this confirms the trend of me visiting FOSDEM only on even-numbered years: I previously skipped 2013, as I was just out of my first and only job interview, and 2011 because of contract-related timing. Since I still care about going to an open source conference early in the year, I opted instead for SCaLE, the timing of which fit my trip to Mountain View perfectly. It also allowed me to walk through Hermosa Beach once again.

So Los Angeles again it was, which meant I was able to meet with a few Gentoo developers, a few VideoLAN developers who also came all the way from Europe, and many friends whom I have met at various previous conferences. It is funny how I end up meeting some people more often through conferences than I meet my close friends from back in Italy. I guess this is the life of the frequent traveler.

While my presence at SCaLE was mostly a way to meet some of the Gentoo devs I had not met before, and to see Hugo and Ludovic from VideoLAN, whom I missed at the past two meetings, I did pay some attention to the talks — I wish I had had enough energy to go to more of them, but I was coming off three weeks straight of training, during which I sat for at least two hours a day in a room listening to talks on various technologies and projects… doing that in my free time too sounded like a bad idea.

What I found intriguing in the program, and in at least one of the talks I was able to attend, was that I could find at least a few topics that I have written about in the past. Not only are containers now all the rage, through Docker and other plumbing, but there was also a talk about static site generators, which I wrote about in 2009 and have been using for much longer than that, out of necessity.

All in all, it was a fun conference, and meeting my usual conference friends and colleagues is a great thing. And meeting the other Gentoo devs is what sparked my designs around TG4, which is good.

I would also like to thank James for suggesting I use TweetDeck during conferences, as it was definitely nicer to be able to keep track of what happened on the hashtag as well as the direct interactions and my personal stream. If you’re an occasional conferencegoer you probably want to look into it yourself. It is also the most decent way to look at Twitter during a conference on a tablet, as it does not require you to jump around between search pages and interactions (on a PC you can at least keep multiple tabs open easily).

Ah, LXC! Isn’t that good now?

If you’re using LXC, you might have noticed that there finally was a 0.8.0 release lately, after two release candidates, one of which was never really released. Would you expect everything to go well with it? Hah!

Well, you might remember that over time I found that the way you’re supposed to mount directories in the configuration files changed, from using the path that is used as the container’s root, to using the default root path used by LXC; and every time that happened, no error message warned you that you were trying to mount directories outside of the container’s tree.

Last time, the problem I hit was that if you try to mount a volume instead of a path, LXC expected you to use the full realpath as the base, which in the case of LVM volumes is quite hard to know by heart. Yes, you can look it up, but it’s a bit of a bother. So I ended up patching LXC to allow using the old format, based on the path LXC mounts it to (/usr/lib/lxc/rootfs). With the new release this changed again, and the path you’re supposed to use is /var/lib/lxc/${container}/rootfs — again, a change in a micro bump (rc2 to final) which is not documented… sigh. This would be enough to irk me, but there is more.
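
To make the change concrete, here is what a bind-mount line in a container’s fstab looks like before and after the bump: a sketch, with the container name and source path made up:

    # up to 0.8.0_rc2: target prefixed with the default rootfs mount point
    /var/cache/portage /usr/lib/lxc/rootfs/var/cache/portage none bind 0 0

    # with 0.8.0 final: target prefixed with the container's own rootfs path
    /var/cache/portage /var/lib/lxc/mycontainer/rootfs/var/cache/portage none bind 0 0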

The new version also seems to have a bad interaction with the kernel when stopping a container — the virtual Ethernet device (veth pair) is not cleaned up properly, which causes the process to stall, with something insisting on calling into the kernel and failing. The result is a not-very-happy Diego.

Even without adding in the fact that the interactions between LXC and systemd are not clear yet – with the maintainers of the two projects trying to sort out the differences between them; at least I don’t have to care about that anytime soon – this should be enough to make it explicit that LXC is not ready for prime time, so please don’t ask.

On a different, interesting note: the vulnerability publicized today that can bypass KERNEXEC? Well, unless you disable the net_admin capability in your containers (which also means you can’t set the network parameters, or use iptables), a root user within a container could leverage that particular vulnerability… which is one extremely good reason not to have untrusted users with root on your containers.
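
For reference, dropping the capability is a one-line change in the container configuration; a minimal sketch, assuming a reasonably recent LXC:

    # in the container's config file; as noted above, this also prevents
    # configuring interfaces or using iptables from inside the container
    lxc.cap.drop = net_admin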

Oh well, time to wait for the next release and see if they can fix a few more issues.

A good reason not to use network bridges

So one of the things I’m working on for my job is looking into setting up Linux Containers to separate some applications — yes, I know I’m the one who said that they are not ready for prime time, but please note that what I was saying is that I wouldn’t give root inside a container to anybody I wouldn’t trust — which is not the same as saying that they are not extremely useful for limiting the resource consumption of various applications.

Anyway, there is one thing that has to be considered, which I already wrote about briefly: networking. The simplest way to set up an LXC host, if your network is a private one with a DHCP server or something along those lines, is to create one single bridge between your public network interface and the host side of the virtual Ethernet pairs — this has one unfortunate side effect: to make it work, it puts the network interface in promiscuous mode, which means it receives all the packets directed to any other interface, and that slows it down quite a bit.
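
For context, this is roughly what the single-bridge setup amounts to; a sketch, with eth0 standing in for the public interface:

    # on the host: create the bridge and enslave the public NIC
    # (the host's own IP configuration then moves onto br0)
    brctl addbr br0
    brctl addif br0 eth0

    # in each container's config: attach the veth pair to that bridge
    lxc.network.type = veth
    lxc.network.link = br0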

So how do you solve the issue? Well, I’m honestly not sure whether macvlan improves the situation; I’m afraid not. What I decided for Excelsior, since it is not on a private network, was to set up an internal bridge, with static internal IP addresses assigned to the containers. When I need to jump into one of them, I simply use the main public IP as an SSH jumphost and then connect to the correct address. I described the setup before, although I have since made a further change so that I don’t have to bother with the private IP addresses in the configuration file: I use public IPv6 AAAA records for the containers, which simply resolve as usual once inside the jumphost.
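
The jumphost part is plain OpenSSH; a minimal sketch of the client-side configuration, with all host names hypothetical:

    # in ~/.ssh/config: reach the containers by bouncing through the
    # public host, where their addresses actually route
    Host *.containers.example.com
        ProxyCommand ssh excelsior.example.com nc %h %p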

Of course, with the exception of jumphosts, that kind of setup, which involves NAT through iptables, has no way to receive connections from the outside.

So what other options are there? One thing I had been thinking about was to use a level-3 managed switch and set it to route a subnet to the LXC host — but that wouldn’t fly too well. So in the end the question became “what is it that I need to access on the containers from the outside?”, and the answer is simply “the websites”. The containers provide a number of services, but only the websites are mapped to the outside. So, do I need IPs that are even partially public? Not really.

The solution I’m planning right now is to set up a box with either an Apache reverse proxy or some other reverse proxy (depending on how much we want to handle on the proxy itself), and have that contact the internal containers, the same way it would work if you had one reverse proxy on the Internet and the servers on an internal network.
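
In the Apache case that is little more than a mod_proxy virtual host; a sketch, with hypothetical names and an internal address standing in for a container:

    # front box; requires mod_proxy and mod_proxy_http
    <VirtualHost *:80>
        ServerName www.example.com
        ProxyPass        / http://10.0.0.2/
        ProxyPassReverse / http://10.0.0.2/
    </VirtualHost>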

I guess at some point I should overhaul the LXC wiki page for what concerns networking; I already spent some time removing some duplicated content and actually syncing it with what’s going on in the ebuild…

Securing Tinderbox network access

For those who wonder why Excelsior has been built for a week now and is still not running at full capacity, the reason is relatively simple: I’m still trying to figure out how to handle network security, so that I can give other developers access to it without risking a security breach elsewhere.

The first problem is that the setup I need to use right now and the final one will be very different: right now the system is connected to the internal network of my employer, while in production it will be DMZ’d, with a public IP available straight to it. The latter setup will let me set up an IPv6 tunnel, which means all the internal containers will be reachable by themselves, while the current one prevents me from setting it up at all.

Previously this was much easier to deal with, simply because I had control over an external firewall myself, which took care of most of the filtering; combined with LXC’s networking options, that made things decently easy to manage. Unfortunately, this time I have to do without.

One of the things that might not be obvious from the post above, nor from the documentation, is that the three setups I have drawn are the ones that require the least amount of configuration on the host side: if you use the standard bridge, you just need to start net.br0 (or equivalent), and LXC will connect the host-side virtual Ethernet pair to the bridge by itself. If you’re using the MACVLAN modes instead, the containers will piggy-back on the interface of the host, and then rely on the external router/firewall/switch to deal with the configuration — by the way, I’m glad that man lxc.conf now actually says that you’re supposed to have a reflective switch to use those modes. You can probably guess it’s not that great an idea if you expect lots of internal communication.
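
For completeness, the MACVLAN variant is also just a few lines in the container configuration; a sketch, assuming eth0 is the host’s interface:

    lxc.network.type = macvlan
    # VEPA is the mode that relies on a reflective switch upstream
    lxc.network.macvlan.mode = vepa
    lxc.network.link = eth0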

What I’m going to do for now is set up a system that is defended in depth as much as possible, with iptables carrying out enough filtering that I should be realistically safe. Unfortunately, iptables alone is not enough: what you need is iptables and ebtables (for Ethernet bridge filtering), to make sure that the containers don’t spoof each other’s IPs or something.
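
A sketch of what that ebtables layer can look like, pinning one container’s host-side veth to a single address pair (interface name, MAC and IP are all made up):

    # drop bridged frames from the container that don't carry its
    # assigned source MAC or IP
    ebtables -A FORWARD -i veth-tbox -s ! 02:00:00:00:00:02 -j DROP
    ebtables -A FORWARD -i veth-tbox -p IPv4 --ip-source ! 10.0.0.2 -j DROP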

The idea is that once IPv6 is enabled, you can jump straight into the tinderboxes, which is required to control them; but until then, the host system acts as a jump host through a custom scponly setup, which only allows forwarding, as ProxyCommand, to port 22 of the other boxes within that same system.

I’d like to show more documentation of what I’m trying to do, and what I have achieved already, but to do so I’m afraid I’ll be needing some more… visual examples, as it’s very hard to explain in words, while it should be much clearer with a series of drawings. I guess I’ll start working on them soonish, maybe if Luca can package Synfig for me…

For now, this is enough, I guess.

Hard Containers — LXC and GrSecurity

Okay, so Excelsior is here, and it’s installed, and here begins the new list of troubles, which starts, as usual, with my favourite system ever: LXC, which is the basis the Tinderbox works upon.

The first problem is not strictly tied to LXC, but to one of its required dependencies: the lynx browser fails to build in parallel if there are enough cores available (which there certainly are here!). This bug should be relatively easy to fix, but I haven’t had time to look into it just yet.

A second issue came up due to the way I was proceeding with the install, outside of office hours: the ebuild uses the kernel sources, I think to identify which capabilities are available on the system. This should be fixed as well, so that it checks the capabilities on the installed linux-headers instead of the sources, which might not be the same.

The third issue is funny: Excelsior is going to use a hardened kernel. The reason is relatively simple to understand: it’s going to run a lot of code of unknown origins, and it’ll allow other people in, so one wants to be as safe as possible… unfortunately it seems like this is not, by default, a good configuration to use with LXC.

In particular, grsecurity is designed by default to limit what you can do within a chroot, by applying a long list of restrictions. This is fine, if not for the fact that LXC also chroots to start its own setup process. I’m now fixing the ebuild to warn about those options that you have to (or might want to) disable in your GrSec setup to use LXC.

Interestingly, it’s not a good idea to disable all of them, since a few are actually very good if you want to use LXC, such as the mknod restriction, which is very good in particular if you want to make sure that only a subset of the devices are accessible (even when taking into account the allowed/denied access of the devices cgroup).

In particular, these have to be disabled:

  • Deny mounts (CONFIG_GRKERNSEC_CHROOT_MOUNT)
  • Deny pivot_root in chroot (CONFIG_GRKERNSEC_CHROOT_PIVOT)
  • Capability restrictions (CONFIG_GRKERNSEC_CHROOT_CAPS)

while the double-chroot restriction would be counter-synergistic, as it would prevent services within the container from chrooting further as part of a defense-in-depth approach.
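
As a kernel .config fragment the result would look something like this, keeping the mknod restriction on as suggested above:

    # CONFIG_GRKERNSEC_CHROOT_MOUNT is not set
    # CONFIG_GRKERNSEC_CHROOT_PIVOT is not set
    # CONFIG_GRKERNSEC_CHROOT_CAPS is not set
    # CONFIG_GRKERNSEC_CHROOT_DOUBLE is not set
    CONFIG_GRKERNSEC_CHROOT_MKNOD=y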

Then there is another issue. Before starting to set up the actual tinderbox, I wanted to prepare another container, the one I’ll be using for my own purposes, including bumping Ruby packages and stuff like that. Since the system is supposed to stay very locked down, this time I want to mount the block device straight into the container, which is a supported configuration… but it turns out that the configuration parser, trying to work around old issues (yes, that’s a one-and-a-half-year-old post), will ignore any mount request that doesn’t have the destination prefixed with the rootfs.

Unfortunately, when your rootfs is a block device, that prefixing means you’ll end up with something along the lines of /dev/sdb1/usr/portage. This also collides with the documentation in man lxc.conf:

If the rootfs is an image file or a device block and the fstab is used to mount a point somewhere in this rootfs, the path of the rootfs mount point should be prefixed with the /usr/lib/lxc/rootfs default path or the value of lxc.rootfs.mount if specified.
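
Put into configuration terms, what the manual describes would look like this sketch (volume name and paths are hypothetical):

    # container rootfs straight on an LVM volume
    lxc.rootfs = /dev/vg/container
    # mount target prefixed with the default rootfs mount point
    lxc.mount.entry = /usr/portage /usr/lib/lxc/rootfs/usr/portage none bind 0 0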

Anyway, this should be fixed in 0.8.0_rc2-r2, which is now in tree. I’ve not been able to test it thoroughly yet, so holler at me if something doesn’t work.

LXC execution, something’s not right yet

I complained about some LXC choices last week, but at the same time I was working on at least trying to get it to run, somewhat, so that I could make use of it in our required setup.

The result has been a new revbump of LXC (0.8.0_rc1-r1), which contains a patch using libtool to build the library; this also makes sure that the library is properly created with an always-variable soname (see the link for more explanation of what that is).

This new version actually allows you to go one step further: you can properly set it up to execute commands within the container, directly, even though that takes a bit of time due to the POSIX message queue not being extremely fast (Luca, do you know anything about that?). The problem is that if you’re going to run any interactive command… you get stuck, almost literally, even with a very simple emerge -av.

The problem, as far as I can tell, is that the namespacing allows the container to create a new pseudo-tty (PTS) within the container itself, instead of using the one that is connected to the current session. This means that you cannot actually use the tty at all, making it impossible to run any kind of interactive command (at the same time, it doesn’t make it known to the command that it lacks a controlling tty, so for instance an emerge -av does not get downgraded to emerge -pv this way).

I’m hoping that maybe Kevin’s (whichever Kevin that is!) patches, mentioned in my previous post, will help get it to work; if so, that would also mean that the tinderbox would be much easier to deal with than it has been in the past, and might actually get me to restore it to a working state (it hasn’t been working for quite a long time at this point, and I’m certainly not happy about it).

For the moment, what I can tell is that I’ve half-tracked down the issue with the netprio cgroup and contacted its original author to see how we can deal with it, and I have a couple of changes for the ebuild and init scripts queued up. Since at least the cgroup mountpoint issue has been fixed in the utilities, I’ll soon make the ebuild depend on a version of OpenRC new enough to mount the thing by itself, easing off part of the init script logic (well, to be honest, I’ve already dropped most of that logic), so that it can actually grow from there…

I guess I should thank Tiziano for telling me about LXC at the time, although there is still so much work to do before it works as intended. Oh well.

New container notes: bind mounted files

I’ve been using Linux Containers (LXC) for quite a while, and writing about it — in the meantime, the control group feature in the Linux kernel has grown to be used in so many different ways that a number of people probably weren’t expecting, such as the systemd and OpenRC integrations.

I’ve set up another LXC host system here where I’m working, so I can share a single box for multiple purposes without risking contaminating the different components (there is more to this, but I’ll get to it later), in a manner vaguely reminiscent of what ChromiumOS does for its own builds.

One of the things we’ve been dealing with is the problem of synchronising the user IDs so that they are kept the same across the different containers, and for that the first obvious solution that came to my mind was to use a bind mount on a file. If you didn’t know it’s possible, well, now you know it is. A bind mount on a file allows you to make a given file from the host system appear as a file within the container: just that one file, rather than the whole directory that contains it.
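
Mechanically it is a single mount call; a sketch, with the container path obviously made up:

    # make the host's passwd file appear inside the container's tree;
    # the target file has to exist already for the mount to succeed
    mount --bind /etc/passwd /var/lib/lxc/builder/rootfs/etc/passwd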

Unfortunately, it doesn’t work the way many suppose it does. It’s neither like a symlink nor a hardlink, although the latter is probably the closest idea you can have of it.

When you have a file with two hardlinks (so two references on the same filesystem), the changes between the two are reflected as long as you replace the content of the file and not the file itself. This is very important, and a very long time ago, before GIT and Quilt became popular, it was used to make patching software trees easier — you created a hardlinked copy of your sources, edited the files you needed, then diff -uR would just skip over the same-content hardlinks without caring. This method relied on editors breaking the hardlinks through a sequence of operations along the lines of “rename the current copy to act as a backup, write a new file with the same name, done”.
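
For those who never had the pleasure, the workflow went something like this sketch, with a made-up source tree:

    cp -al vanilla work                  # hardlinked copy, nearly free on disk
    $EDITOR work/src/foo.c               # the editor breaks the hardlink on save
    diff -uR vanilla work > my.patch     # untouched hardlinked files compare equal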

What happens with bound files is that if your editor works in the same way, the bound file is renamed, and a new, non-bound file is created in its place, which means that changes made within the container will not be reflected outside of it, and the same goes for changes done outside: if you want them to be reflected inside, you have to reboot the container.

So you can probably use this trick if your files are to be edited with either Emacs or VIM, since both of them, by default and in normal circumstances (Emacs via tramp will not preserve hardlinks!), seem to work just fine. Where the trouble starts is when you’re dealing with utilities in the shadow family, since they will break hardlinks for you.

So there you go for what concerns this part of containers’ usage.

CGROUPS woes

The cgroup functionality that the Linux kernel introduced a few versions ago, while originally almost invisible, is proving to attract quite a wide range of interest, which in turn has caused more than a few headaches for myself and other developers.

I originally looked into cgroups because of LXC, and then I noticed it being used by Chromium, then libvirt (with its own bugs related to USB device support). Right now the cgroup functionality is also used by the userland approach to task scheduling that replaces the famous 200-LOC kernel patch, and by the newest versions of the OpenVZ hypervisor.

While cgroups are a solid kernel technique, their interface doesn’t seem to be. The basic userland interface is accessible through a special pseudo-filesystem, just like the ones used for /sys and /proc. Unfortunately, the way to use this interface hasn’t really been documented decently, and that results in tons of problems; in my previous post regarding LXC I actually confused the way Ubuntu and Fedora mount cgroups: it is Fedora that uses /sys/fs/cgroup as the base path for accessing cgroups, but as Lennart commented on the post itself, there’s a twist.

In practice there are two distinct interfaces to cgroups. One is a single, all-mixed-in interface, accessed through the cgroup pseudo-filesystem when mounted without options; this is the one you can find mounted at /cgroup (also by the lxc init script in Gentoo) or /dev/cgroups. The other interface allows access (and thus limits) per type of cgroup (such as memory, or cpuset), and has each hierarchy mounted at a different path. That second interface is the one that Lennart designed to be used by Fedora and that has been made “official” by the kernel developers in commit 676db4af043014e852f67ba0349dae0071bd11f3 (even though it is not really documented anywhere but in that commit).
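
In mount terms, the difference boils down to this sketch:

    # all-mixed-in interface: every controller in a single hierarchy
    mount -t cgroup cgroup /cgroup

    # split interface, Fedora-style: a tmpfs base, one mount per controller
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/memory /sys/fs/cgroup/cpuset
    mount -t cgroup -o memory memory /sys/fs/cgroup/memory
    mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset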

Now, as I said, the lxc init script doesn’t follow that approach but rather takes the opposite direction; this was not intended as a way to ditch the road taken by the kernel developers or by Fedora, but was rather out of necessity: the commit above was added last summer, the Tinderbox had been running LXC for over a year at that point, and of course all the LXC work I did for Gentoo was originally based on the tinderbox itself. But since I did have a talk with Lennart, and the new method is the future, I added to my TODO list, back last month, to actually look into making cgroups a supported piece of configuration in Gentoo.

And it came crashing down.

Between yesterday and this morning I actually found the time I needed to write an init script to mount the proper cgroup hierarchy, Fedora-style. Interestingly enough, if you were to umount the hierarchy after mucking with it, you’re not going to be able to mount it again, so there won’t be any “stop” for the script anyway. But that’s the least of my problems now. Once you do mount cgroups the way you’re supposed to, following the Fedora approach, LXC stops working.

I haven’t started looking into what the problem could be there, but it seems obvious that LXC doesn’t take it very nicely when its single-access interface for cgroups is instead split into a number of different directories, each with its own little interface to use. And I can’t blame it much.

Unfortunately, this is not the only obstacle LXC has to face now. Besides the problem with actually shutting down a container (which only works partially, and mostly out of sheer luck, with my init system), the next version of OpenRC is going to drop support for auto-detecting LXC, both because identifying the cpuset in /proc is not going to work for much longer (it’s optional in the kernel and considered deprecated) and because it wrongly identifies the newest OpenVZ guests as LXC (since they also started using the same cgroup basics as LXC). These two problems mean that soon you’ll have to use some sort of lxc-gentoo script to set up an LXC guest, which will both configure a switch to shut the whole guest down, and configure OpenRC to accept it as an LXC guest manually.

Where does this leave us? Well, first of all, I’ll have to test whether the current GIT master of LXC can cope with this kind of interface. If it doesn’t, I’ll have to talk with upstream to see that it actually gets supported, so that LXC can be used with a Gentoo host, as well as a Fedora one, with the new cgroup interface (so that it can be made available to users for use with Chromium and other software that might make good use of it). Then it would be time to focus on the Gentoo guests, so I’ll have to evaluate the contributed lxc-gentoo scripts that I know are on the Gentoo Wiki, for a start.

But let me write this again: don’t expect LXC to work nicely for production use, now or anytime soon!

Why there is no third party access to the tinderbox

So we all seem to agree that the tinderbox approach (or at least a few tinderbox approaches) is the only feasible way to keep the tree in a sane state without requiring each developer to test it all alone. So why is my tinderbox (and almost any other tinderbox) not accessible to third parties? And, as many have asked before, why isn’t a tinderbox such as the one I’ve been running managed by infra themselves, rather than by me alone?

The main problem is how privileges are handled. Infra policies are quite strict, especially for what concerns root access; right now the tinderbox is implemented with LXC, which cannot be depended upon for proper privilege separation, neither locally nor on the network (even if you firewall it so that its MAC address only has access to limited services, you can still change the MAC address from within the LXC). This approach is not feasible for infra-managed boxes, and can cause quite a bit of trouble on private networks as well (such as mine).

The obvious alternative to this is to use KVM; while KVM provides much better security encapsulation, it is not a perfect fit either. In the earliest iteration of working on the tinderbox I did consider using KVM, but even with hardware support it was too slow; the reason is that most of the work done by the tinderbox is I/O bound, to the point that even on the bare OS, with a RAID0, it leaves a lot to be desired. While virtio-block greatly increases the performance of virtualised systems, it’s still nowhere near the kind of performance you can get out of LXC, and as I said, that’s still slow.

This brings up two possible solutions. One is to use tmpfs to do the builds, which is something I actually do on my normal systems but not on the tinderbox, for a number of reasons I’ll get to in a moment; the other is the recently-implemented plan9-compatible virtualisation of filesystems (in contrast to block devices upon which you build filesystems), which is supposed to be faster than NFS, as well as less tricky. I haven’t had the chance to try that out yet, but while it sounds interesting as an approach, I’m not entirely sure it’s reliable enough for what the tinderbox needs.

There is another problem with the KVM approach here, and it relates once again to infra policies: KVM takes a very high performance hit when you use hardened kernels, especially the grsecurity features. This hit makes it very difficult to depend on KVM within infra boxes; until somebody finds a way to work around that problem, there are very few chances of getting it working there.

Of course the question would then become “why do you need third parties to access the tinderbox as root?”; you can much more easily deal with restrictions for standard users than for root, even though you still have to deal with privilege-escalation bugs. Well, it’s complicated. While a lot of the obvious bugs are easily dealt with by looking at a build log and/or the emerge --info output, a lot require more information, such as the versions of packages installed on the system, the config.log file generated by autoconf, or a number of other files for other build systems. For packages that include computer-generated source or data files, relying on other software altogether, you also might need those sources to ensure that the generated code corresponds to what is expected.

All of this requires you to access the system first-hand, and by default it also requires you to access the system as root, as the Portage build directories are not world-readable. I thought of two ways to avoid the need to access the system, but neither is especially fun to deal with. The first is gathering the required logs and files and producing a one-stop log with all that data. It makes the log tremendously long and complex, and it requires the ebuilds to declare which files to gather out of a build directory to report the bugs. The other way is to simply save a tarball with the complete build directory, and access that as needed.

I originally tried to implement this second method; storing the data that way is also helpful because you can then easily clean up the build directory, which is a prerequisite to building with tmpfs. Unfortunately, I soon discovered that the builds are just too big. I’m not referring only to builds such as OpenOffice, but also to a lot of the scientific software, as well as a number of unexpected packages, especially when you run their tests, as often enough tests involve generating output and comparing it to expected results. On the pure performance scale of using tmpfs, I found that this method ended up taking more time than simply building on the hard drives. I’m not sure how it would scale within KVM.

With all these limitations, I hope it is clear why the tinderbox has been, up to now, a one-man project, at least for me. I’d sincerely be happy if infra could pick up the project and manage the running bits of it, but until somebody finds a way to deal with it that doesn’t require an order of magnitude more work than what I’ve been doing up to now, I’ll keep running my own little tinderbox, and hope that others will run one of theirs as well.

Sharing distfiles directories, some lesser-known tricks

I was curious to see why there was so much traffic coming from the Gentoo Wiki LXC page to my blog lately, so I decided to peek at the documentation there once again, slightly more thoroughly. While there are a number of things that could likely be simplified (though unless I’m paid to do so, I’m not going to work on documenting LXC until it’s stable enough upstream; as it is, it might just be a waste of time, since they change the format of the configuration files every other release), there is one thing that really baffled me:

If you want to share distfiles from your host, you might want to look at unionfs, which lets you mount the files from your host’s /usr/portage/distfiles/ read-only whilst enabling read-write within the guest.

While unionfs is the kind of feature that goes quite well with LXC, for something as simple as this it is quite overkill. On the other hand, I remembered that I also had to ask Zac about this, so it might be a good idea to take a moment to document it.

Portage already allows you to share a read-only distfiles directory.

You probably remember that I originally started using Kerberos because I wanted some safe way to share a few directories between Yamato and the other boxes. One of the common situations I had was the problem of sharing the distfiles, since quite a few are big enough that I’d rather not download them twice. Well, even with that implemented, I couldn’t get Portage to properly access them, so I looked for alternative ways. And there is a variable that solves everything: PORTAGE_RO_DISTDIRS.

How does that work? Simply mount some path, such as /var/cache/portage/distfiles/remote, as the remote distfiles directory, then set PORTAGE_RO_DISTDIRS to that value. If the required files are in the read-only directory and not in the read-write one, Portage will symlink them and ensure access on a stable path. Depending on your setup, you might want to export/import the directory over NFS, or – in the case of LXC – set it up as a read-only bind mount. Whatever you choose, you don’t need kernel support for unionfs to use a read-only distfiles directory. Okay?
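
A sketch of the LXC variant, with illustrative paths:

    # in the guest's /etc/portage/make.conf:
    PORTAGE_RO_DISTDIRS="/var/cache/portage/distfiles/remote"

    # on the host, a read-only bind mount takes the classic two steps:
    mount --bind /var/cache/portage/distfiles \
        /var/lib/lxc/guest/rootfs/var/cache/portage/distfiles/remote
    mount -o remount,ro,bind \
        /var/lib/lxc/guest/rootfs/var/cache/portage/distfiles/remote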