Okay, so Excelsior is here, and it’s installed, and here starts the new list of troubles, which begins, as usual, with my favourite system ever: LXC, which is the basis the Tinderbox works upon.
The first problem is not strictly tied to LXC, but one of the dependencies required: the lynx browser fails to build in parallel if there are enough cores available (which there certainly are here!). This bug should be relatively easy to fix but I haven’t had time to look into it just yet.
A second issue came up due to the way I was doing the install, outside of office hours: the ebuild uses the kernel sources, I think to identify which capabilities are available on the system. This should be fixed as well, so that it checks the capabilities against the installed linux-headers instead of the sources, which might not be the same.
The third issue is funny: Excelsior is going to use a hardened kernel. The reason is relatively simple to understand: it’s going to run a lot of code of unknown origins, it’ll allow other people in, so one wants to be as safe as possible… unfortunately it seems like this is not, by default, a good configuration to use with LXC.
In particular, grsecurity is designed by default to limit what you can do within a chroot, by applying a long list of restrictions. This is fine, if not for the fact that LXC also chroots to start its own setup process. I’m now fixing the ebuild to warn about those options that you have to (or might want to) disable in your GrSec setup to use LXC.
Interestingly, it’s not a good idea to disable all of them, since a few are actually very good if you want to use LXC, such as the mknod restriction, which is very good in particular if you want to make sure that only a subset of the devices are accessible (even when counting in the allowed/non-allowed access of the devices cgroup).
In particular, these have to be disabled:
- Deny mounts (CONFIG_GRKERNSEC_CHROOT_MOUNT)
- Deny pivot_root in chroot (CONFIG_GRKERNSEC_CHROOT_PIVOT)
- Capability restrictions (CONFIG_GRKERNSEC_CHROOT_CAPS)
The double-chroot restriction would also be counter-synergistic, as it would prevent services within the container from chrooting further as part of a defense-in-depth approach.
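As a rough way to check a kernel against the list above, the following Python sketch scans the text of a kernel configuration and reports which of the three options are still enabled. The function name and the sample input are made up for illustration, and it assumes the usual layout of a kernel .config file (CONFIG_FOO=y for enabled options, a comment line for disabled ones); it is not what the ebuild itself does.

```python
# Options from the list above that get in the way of LXC's startup.
LXC_CONFLICTING_OPTIONS = (
    "CONFIG_GRKERNSEC_CHROOT_MOUNT",
    "CONFIG_GRKERNSEC_CHROOT_PIVOT",
    "CONFIG_GRKERNSEC_CHROOT_CAPS",
)


def enabled_conflicts(config_text):
    """Return the conflicting options that are set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    # Keep the order of the list in the post.
    return [opt for opt in LXC_CONFLICTING_OPTIONS if opt in enabled]


if __name__ == "__main__":
    # Hypothetical excerpt of a hardened kernel configuration.
    sample = """\
CONFIG_GRKERNSEC_CHROOT=y
CONFIG_GRKERNSEC_CHROOT_MOUNT=y
# CONFIG_GRKERNSEC_CHROOT_PIVOT is not set
CONFIG_GRKERNSEC_CHROOT_MKNOD=y
CONFIG_GRKERNSEC_CHROOT_CAPS=y
"""
    for option in enabled_conflicts(sample):
        print("disable for LXC:", option)
```

The mknod restriction is deliberately left out of the tuple, since it is one of the options worth keeping.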
Then there is another issue. Before starting to set up the actual tinderbox, I wanted to prepare another container, which is the one I’ll be using for my own purposes, including bumping of Ruby packages and stuff like that. Since the system is supposed to stay very locked down, this time I want to mount the block device straight into the container, which is a supported configuration… but it turns out that the configuration parser, trying to work around old issues (yes, that’s a year-and-a-half-old post), will ignore any mount request that doesn’t have the destination rootfs prefixed.
Unfortunately when you mount a block device, it means that you’ll end up with something along the lines of /dev/sdb1/usr/portage. This also collides with the documentation in man lxc.conf:
If the rootfs is an image file or a device block and the fstab is used to mount a point somewhere in this rootfs, the path of the rootfs mount point should be prefixed with the /usr/lib/lxc/rootfs default path or the value of lxc.rootfs.mount if specified.
Anyway this should be fixed in 0.8.0_rc2-r2, which is now in tree. I’ve not been able to test it thoroughly yet, so holler at me if something doesn’t work.
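To make the prefixing rule from the man page excerpt concrete, here is a small Python sketch that builds an lxc.mount.entry line whose destination is prefixed with the rootfs mount point. The helper name and the bind-mount example are assumptions for illustration only; the default prefix is the one quoted from the documentation above.

```python
import posixpath

# Default taken from the lxc.conf man page excerpt quoted above.
DEFAULT_ROOTFS_MOUNT = "/usr/lib/lxc/rootfs"


def mount_entry(source, target, fstype, options="defaults",
                rootfs_mount=DEFAULT_ROOTFS_MOUNT):
    """Build an lxc.mount.entry line for a rootfs that lives on a device.

    target is the path as seen from inside the container, for example
    /usr/portage; it gets prefixed with the rootfs mount point.
    """
    # Strip the leading slash so join() does not throw the prefix away.
    destination = posixpath.join(rootfs_mount, target.lstrip("/"))
    return "lxc.mount.entry = {} {} {} {} 0 0".format(
        source, destination, fstype, options)


if __name__ == "__main__":
    # Hypothetical read-only bind mount of the host's Portage tree.
    print(mount_entry("/usr/portage", "/usr/portage", "none", "bind,ro"))
```

If lxc.rootfs.mount is set in the container configuration, its value would be passed as rootfs_mount instead of the default.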
Well, to limit mknod without grsec you can use cgroup.device, which specifies which devices can be created (mknod) and used; I highly recommend using it. Next is dropping the cap_mknod capability.
There is also another thing about grsec and LXC that you didn’t notice: some releases ago Spender added extra code to harden /sys/kernel/uevent_helper and /proc/sysrq-trigger, so that you now need the SYS_ADMIN cap to use them. The first one allows you to run code on the host on any kernel event: if you know the real path to the container’s rootfs, you can run whatever script you want. The other is a switch for sysrq, so you can remount read-only or even shut down the host.
Also, I am not sure what you mean by ‘/dev/sdb1/usr/portage’. If you want to mount sdb1 as /usr/portage inside the container, you just do:
> lxc.mount.entry = /dev/sdb1 /home/lxc/NAME/rootfs/usr/portage none ro 0 0
And that does not require any voodoo magic: my containers live under /home/lxc/NAME and the rootfs is mounted at /home/lxc/NAME/rootfs. It has always worked for me, even before 0.8.
FWIW, if you want very locked-down containers, don’t mount /sys in containers at all, and remember to disable caps like sys_mount, sys_admin, net_admin and so on. Without net_admin you need to configure routing (the gateway) via the LXC config; that’s available since 0.8-rc1, or you can easily backport it to 0.7.5 if you run it anywhere (I did so). This is pretty much how my containers are done.
(oops, the entry should contain fsname (eg ext4) instead of ‘none’, sorry.)
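The device whitelist and capability dropping suggested in the comment above can be sketched as a few generated configuration lines. This Python snippet is only an illustration: the helper name is made up, and the chosen devices (/dev/null, /dev/zero, /dev/urandom) and capabilities are an assumed example, not the commenter’s actual configuration.

```python
def lockdown_lines(allowed_devices, dropped_caps):
    """Return lxc.conf lines for a device whitelist plus dropped caps.

    allowed_devices holds (type, major, minor, access) tuples, where
    type is 'c' or 'b' and access is a combination of r, w and m.
    """
    # Deny every device first, then open up only what is listed.
    lines = ["lxc.cgroup.devices.deny = a"]
    for dev_type, major, minor, access in allowed_devices:
        lines.append("lxc.cgroup.devices.allow = {} {}:{} {}".format(
            dev_type, major, minor, access))
    if dropped_caps:
        lines.append("lxc.cap.drop = " + " ".join(dropped_caps))
    return lines


if __name__ == "__main__":
    devices = [
        ("c", 1, 3, "rwm"),  # /dev/null
        ("c", 1, 5, "rwm"),  # /dev/zero
        ("c", 1, 9, "rwm"),  # /dev/urandom
    ]
    for line in lockdown_lines(devices, ["mknod", "sys_admin", "net_admin"]):
        print(line)
```

Whether a given container still boots with sys_admin dropped depends on what runs inside it, so the capability list needs testing per setup.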
Thanks for the hint about uevent_helper; that one I missed. I don’t usually use sysrq, so that shouldn’t be an issue… of course disabling SYS_ADMIN entirely somewhat defeats the purpose of using a full-fledged container, but it might just work for the tinderbox.
As for /dev/sdb1… the issue is the other way around: if you have /home/lxc/NAME you’re on a quite different setup, because your rootfs is a directory. The problem I referred to applies if you’re putting your rootfs on a device, like I’m doing now; that simply doesn’t work without my patch (which is in Portage).
For now I’m setting up my own little area of the system; the next step is the first tinderbox.
Not quite: lxc.rootfs is just the mountpoint where lxc mounts everything before it chroots. You can do the same and just put an lxc.mount.entry with ‘rootfs’. My ‘rootfs’ is indeed a mountpoint, see:
> lxc.mount.entry = none /home/lxc/NAME/rootfs aufs br=/home/lxc/NAME/aufs_rw=rw:/mnt/squashfs/debian-squeeze-27-04-2012_17-23=ro 0 0
I can understand that if lxc.rootfs points to a file or block device it will be auto-mounted, but if it’s just a dir, then it’s just used as a mountpoint before the chroot. I am pretty sure the default /var/lib/lxc/rootfs would do, but I wanted to keep it separate so that if I ever mount the rootfs before lxc-start, it will work.
In the above example I point lxc.rootfs to an empty dir, where lxc-start mounts aufs. It’s a pretty neat solution as my containers are usually short-lived: the squashfs image gives me deduplication between containers and aufs gives me full-blown write support on top of it.
Well, I’m not using aufs, and in general I just want to use the logical volume as if I had a virtual machine in front of me instead of a container… and in that case you hit the bug I described above and fixed in-tree.
The kernel headers are there to learn the syscall number of setns, since LXC wants to use setns even with a glibc that does not yet expose a setns syscall wrapper. My local overlay patches out that check and hardcodes __NR_setns as manually extracted from the kernel sources, because I found that obnoxious. It looks like recent versions of sys-kernel/linux-headers have a usable definition of __NR_setns.
Could you elaborate on the pivot_root/chroot problem? As far as I know, stock LXC does not call chroot at all. I added a call to chroot immediately after the pivot_root since the documentation specifically says that pivot_root might one day require that.
“Anyway this should be fixed in 0.8.0_rc2-r2 which is now in tree”: I do not see this. http://sources.gentoo.org/c… shows several -r bumps of 0.8.0_rc1, but no sign of _rc2, dead or alive.
Sorry that I will be a bit off topic here, but I recently (this Friday) upgraded from lxc 0.7.6 to 0.8.0-rc1, as I had issues getting tun/tap to work inside the container.
I did add
lxc.cgroup.devices.allow = c 10:200 rwm
and created the node in the same fashion as I did with all the others used by the container, but I still can’t get it to create the tap device:
 * Bringing up interface sbt1
 * ERROR: interface sbt1 does not exist
 * Ensure that you have loaded the correct kernel module for your hardware
 * ERROR: net.sbt1 failed to start
The module is loaded on the host and I have no issues creating the same tap device on the host, but I need the tap in the container (as the host doesn’t have access to the network in question).
I got a big surprise after the update: the init script stopped working. This one I managed to solve. In the 0.8.0 init script, the path used to check if a container is up and running is:
/sys/fs/cgroup/cpuset/lxc/
but I don’t have the cpuset directory at all, so I had to modify it to use the following path:
/sys/fs/cgroup/lxc/
which I think was used in the 0.7.6 init script. I think we need to move the path out to a config file, say /etc/conf.d/lxc, so that it’s easy to change the path to match your setup without worrying that at the next update the init script will use a different path from the one on your system.
That the path is different for me than for you depends, I guess, on different modules being enabled. I may have missed something which causes both the different path and the tun/tap failure, but all the documentation about which modules to use is for somewhat early LXC-capable 2.6 kernels, with different modules/module names than in 3.3 kernels.
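One way to picture the path problem described in this comment is a lookup that tries each known cgroup location in turn instead of hardcoding one. The Python sketch below is only an illustration of that idea: the real init script is a shell script, the function name is hypothetical, and the two candidate paths are simply the ones mentioned in the comment.

```python
import os

# The two locations mentioned in the comment above, newest first.
CANDIDATE_DIRS = (
    "/sys/fs/cgroup/cpuset/lxc",
    "/sys/fs/cgroup/lxc",
)


def find_lxc_cgroup_dir(candidates=CANDIDATE_DIRS, is_dir=os.path.isdir):
    """Return the first candidate that exists as a directory, else None.

    is_dir is injectable so the lookup can be exercised without a real
    cgroup hierarchy mounted.
    """
    for path in candidates:
        if is_dir(path):
            return path
    return None


if __name__ == "__main__":
    found = find_lxc_cgroup_dir()
    if found is None:
        print("no lxc cgroup directory found")
    else:
        print("lxc cgroup directory:", found)
```

A configurable path in /etc/conf.d/lxc, as proposed in the comment, could simply be placed at the front of the candidate list.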
The grsec people mention that it’s inadvisable to disable CONFIG_GRKERNSEC_CHROOT_CAPS… https://forums.grsecurity.n…
I’m not sure they’re right. My (only somewhat informed) view is that you can drop all the caps you like with the new directive lxc.cap.keep (created as an optional but IMHO superior solution in response to a bug I filed regarding the architectural backwardness of lxc.conf’s existing lxc.cap.drop directive).