This month Gentoo finally marked baselayout2 and OpenRC stable, which is an outstanding accomplishment, even though it comes quite late in the game, given that OpenRC has existed for a few years now. Now that these packages are stable, they are also used to build the new stages provided for installing new copies of Gentoo.
This has caused an unforeseen (though not totally unexpected) problem with the handling of users and groups, since the new version of baselayout dropped a few users and groups that were previously defined by default, in light of its more BSD-compatible nature. Unfortunately, some of these users and groups were referenced by ebuilds. Asterisk, for example, sets its user as a member of both the asterisk and dialout groups; the latter is no longer part of the default set of groups created by baselayout, so installing Asterisk before last on a new system created from the OpenRC-based stage would have failed.
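To make the failure mode concrete, here is a minimal Python sketch of the check that was effectively missing: given the groups a package expects and the contents of a group file, list the groups that do not exist. The function names and the sample group file are hypothetical and purely illustrative; this is not how Portage or the ebuilds actually perform the check, and the sample is not the real baselayout group list.

```python
def parse_group_names(group_text):
    """Return the set of group names found in /etc/group-style text."""
    names = set()
    for line in group_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Each entry has the form name:password:gid:members
        names.add(line.split(":", 1)[0])
    return names


def missing_groups(required, group_text):
    """Return the required groups that are absent, in sorted order."""
    return sorted(set(required) - parse_group_names(group_text))


# Made-up sample of a fresh stage's group file (illustrative only).
FRESH_STAGE_GROUPS = """\
root:x:0:root
wheel:x:10:root
asterisk:x:1001:
"""

if __name__ == "__main__":
    # The Asterisk user is expected to be in both of these groups.
    print(missing_groups(["asterisk", "dialout"], FRESH_STAGE_GROUPS))
```

With this sample data the only group reported as missing is dialout, which mirrors the Asterisk case described above.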
Okay, so this is a screw-up, and one that we should fix as soon as possible, but why did it happen in the first place? There are two things to consider here: the root cause of the problem, and why it wasn’t caught before this happened. I’ll start with the second.
When testing OpenRC, we all came to the conclusion that it worked fine. Not only my computers, but even my vservers, my customer’s servers, and most of the developers’ production boxes have been running OpenRC for years. I even stopped caring about providing non-OpenRC-compatible init scripts at some point. Why did none of us hit this problem before? Being a rolling distribution, our main testing process does not involve making a new install altogether: you upgrade to OpenRC and judge whether it works or not.
It turns out this is not such a great idea where critical system packages are concerned (we have seen issues with Python before as well): when upgrading from baselayout 1 to baselayout 2, not all files are replaced; users and groups added by baselayout 1 are kept around, which makes it impossible to catch this class of issues on an upgraded system. We should probably document a more stringent stable-marking process for system components, and work with releng to find a way to test a stage so that it actually boots up with a given kernel and configuration (KVM should help a lot there).
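One cheap way to see why upgrade testing hides the problem is to compare the group file of an upgraded system with the one from a fresh stage. The following Python sketch does just that; the helper names and the two sample files are assumptions made up for illustration (including the GIDs), not data taken from real systems.

```python
def group_names(group_text):
    """Collect group names from /etc/group-style text (name:passwd:gid:members)."""
    return {
        line.split(":", 1)[0]
        for line in group_text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def only_on_upgraded(upgraded_text, fresh_text):
    """Groups present on the upgraded system but absent from the fresh stage."""
    return sorted(group_names(upgraded_text) - group_names(fresh_text))


# Made-up samples: the upgraded box still carries a group left over from the
# old baselayout, while the fresh stage does not have it.
UPGRADED = "root:x:0:root\nwheel:x:10:root\ndialout:x:20:root\n"
FRESH = "root:x:0:root\nwheel:x:10:root\n"

if __name__ == "__main__":
    print(only_on_upgraded(UPGRADED, FRESH))
```

Any group this reports is one that an ebuild could rely on without anybody noticing on an upgraded machine.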
As for the root cause of the problem, we have been fighting with this issue since I became a dev, and that’s why there is GLEP27, which is supposed to take care of managing users and groups and assigning them global IDs. Unfortunately, this is one of those GLEPs that were defined but never implemented.
To be honest, there has been work on the issue, which was also funded by the Google Summer of Code program, but the end results didn’t make it into Gentoo; they went to another project instead (which is why I always have doubts about Gentoo’s waste of GSoC funding).
So until we have a properly implemented GLEP27, which is nothing glamorous and nothing that newcomers seem to feel like tackling, we’re just dancing around a huge number of problems with the handling of users and groups, and they are not going to get easier with time, at all.
What is my plan here? I’ll probably find some time tonight or so to set up a tinderbox that uses the OpenRC-based stage, and see what might not work out of the box. Unfortunately, even that is not going to be a complete solution: if two ebuilds use the same group and are independent of each other, it is quite possible that the group is added by one and not the other, so whether they install correctly depends on the order of installation. That is simply a bad thing to have, and a difficult one to test for.
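The order dependence can be shown with a small Python sketch that tries every installation order for a set of independent packages and reports the orders in which some package needs a group nobody has created yet. The package names, the group name and the data model (each package lists the groups it creates and the groups it requires) are hypothetical; real ebuilds are not described this way, and a tinderbox run only ever exercises one such order.

```python
from itertools import permutations


def failing_orders(packages, base_groups):
    """Return (order, failing_package) for every install order that breaks.

    packages maps a package name to a dict with two sets:
    "creates" (groups it adds itself) and "requires" (groups it expects).
    """
    failures = []
    for order in permutations(sorted(packages)):
        available = set(base_groups)
        for name in order:
            # A package adds its own groups before it uses any.
            available |= packages[name]["creates"]
            if not packages[name]["requires"] <= available:
                failures.append((order, name))
                break
    return failures


# Hypothetical example: both packages use "shared", but only pkg-a creates it.
PACKAGES = {
    "pkg-a": {"creates": {"shared"}, "requires": {"shared"}},
    "pkg-b": {"creates": set(), "requires": {"shared"}},
}

if __name__ == "__main__":
    for order, culprit in failing_orders(PACKAGES, {"root"}):
        print(" -> ".join(order), "fails at", culprit)
```

In this example, installing pkg-a first works, while installing pkg-b first fails; a single test run that happens to pick the first order would never reveal the bug. Trying all permutations is only practical for a handful of packages, so this is an illustration of the problem rather than a testing strategy.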
In the meantime, please do report any package that fails to build with the new stages. Thank you!