PSA: Packages failing to install with new, OpenRC-based stages: missing users and groups

This month Gentoo finally marked baselayout2 and OpenRC stable, which is an outstanding accomplishment, even though it comes quite late in the game, given that OpenRC has existed for a few years by now. Now that these packages are stable, they are also used to build the new stages provided for installing new copies of Gentoo.

This has had an unforeseen (but not totally unexpected) problem with users and groups handling, since the new version of baselayout dropped a few users and groups that were previously defined by default, in light of its more BSD-compatible nature. Unfortunately, some of these users and groups were referenced by ebuilds, like in the case of Asterisk, that set its user as part of both the asterisk and dialout groups — the latter is no longer part of the default set of users created by baselayout, so installing Asterisk before last on a new system created from the OpenRC-based stage would have failed.

Okay, so this is a screw-up, and one that we should fix as soon as possible, but why did it happen in the first place? There are two things to consider here: the root cause of the problem, and why it wasn’t caught before this happened. I’ll start with the second one.

When testing OpenRC, we all came to the conclusion that it worked fine. Not only my computers, but even my vservers, my customer’s servers, and most of the developers’ production boxes have been running OpenRC for years. I even stopped caring about providing non-OpenRC-compatible init scripts at some point. Why did none of us hit this problem before? Being a rolling distribution, our main testing process does not involve making a new install altogether: you upgrade to OpenRC and judge whether it works or not.

Turns out this is not such a great idea when it comes to critical system packages (we have seen issues with Python before as well): when upgrading from Baselayout 1 to Baselayout 2, not all files are replaced; users and groups added by Baselayout 1 are kept around, which makes it impossible to spot this class of issues on an upgraded system. We should probably document a more stringent stable-marking process for system components, and work with releng to find a way to test a stage so that it actually boots up with a given kernel and configuration (KVM should help a lot there).

As for the root cause of the problem, we have been fighting with this issue since I became a dev, and that’s why there is GLEP27 which is supposed to take care of managing users and groups and assigning them global IDs. Unfortunately this is one of those GLEPs that were defined, but never implemented.

To be honest, there has been work on the issue, which was also funded by the Google Summer of Code program, but the end results didn’t make it into Gentoo; they went to another project instead (which is why I always have doubts about Gentoo’s waste of GSoC funding).

So until we have a properly implemented GLEP27, which is nothing glamorous and nothing that newcomers seem to feel like tackling, we’re just dancing around a huge number of problems with the handling of users and groups, and that is not going to get easier with time, at all.

What is my plan here? I’ll probably find some time tonight or so to set up a tinderbox that uses the OpenRC-based stage, and see what might not work out of the box. Unfortunately even that is not going to be a complete solution: if two ebuilds use the same group and are independent of each other, it is quite possible that the group is added by one and not the other, so whether they install correctly depends on the order of installation. That is simply a bad thing to have, and a difficult one to test for.
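To make the order dependence concrete, here is a rough Python sketch. It is only a toy model, not how Portage works: the package names, the group name and the helpers `simulate` and `failing_orders` are all made up for illustration. Each package declares which groups it creates and which groups it merely expects to exist.

```python
from itertools import permutations

# Hypothetical packages: the groups each one creates, and the groups it assumes exist.
PACKAGES = {
    "pkg-a": {"creates": {"shared"}, "expects": {"shared"}},
    "pkg-b": {"creates": set(), "expects": {"shared"}},
}


def simulate(order, packages, base_groups=frozenset()):
    """Return the first package that fails in this install order, or None."""
    groups = set(base_groups)
    for name in order:
        spec = packages[name]
        groups |= spec["creates"]          # the package adds its own groups first
        if not spec["expects"] <= groups:  # then every expected group must exist
            return name
    return None


def failing_orders(packages, base_groups=frozenset()):
    """List every install order in which some package fails."""
    return [
        order
        for order in permutations(sorted(packages))
        if simulate(order, packages, base_groups) is not None
    ]


if __name__ == "__main__":
    for order in failing_orders(PACKAGES):
        print("fails:", " -> ".join(order))
```

Passing `base_groups={"shared"}` models an upgraded system where the old baselayout already provided the group: no order fails there, which is the reason upgrade-only testing never shows the problem.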

In the meantime, please do report any package that fails to build with the new stages. Thank you!

Ranting on about EC2

Yes, I’m still fighting with Amazon’s EC2 service for the very same job, and I’m still ranty about it. Maybe I’m too old-school, but I find good old virtual servers much, much easier to deal with. It’s not that I cannot see the usefulness of the AWS approach (you can easily get something going without sustaining a huge initial investment of capital to get the virtual servers, and you can scale it up later as the work goes on), but I think more than half the interface is just an afterthought, rather than an actual design.

The whole software support for AWS is a bit strange: the original tools, which are available in Portage, are written in Java for the most part, but they don’t seem to be actively versioned and properly released by Amazon themselves, so you actually have to download the tools, then check the version from the directory inside the tarball to know the stable download URL for them (to package them in Gentoo, that is). You can find code to manage various pieces of AWS in many languages, including Ruby, but you cannot easily find an alternative console other than the ElasticFox extension for Firefox, which I have to say I have serious doubts about (my Firefox is already slow enough). On the other hand, I actually found some promising command-line utilities in Rudy (which I packaged in Gentoo with considerable effort), but besides some incompatibility with the latest version of the amazon-ec2 gem (which I fixed myself), there are other troubles with it (for example, it is not straightforward to handle multiple AMIs for different roles, and it is impossible to handle snapshot or custom AMI creation through it alone). Luckily, the upstream maintainer seems to be around and quite responsive.
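Reading the version off the directory inside the tarball is easy enough to script. This is a minimal sketch in Python; the `name-version` directory layout and the example name `example-tools-1.2.3` are assumptions made for illustration, not the naming Amazon actually uses, and the helper names are mine.

```python
import io
import tarfile


def top_level_dirs(fileobj):
    """Return the sorted top-level names found inside a tar archive."""
    names = set()
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            name = member.name
            if name.startswith("./"):  # some tarballs prefix every path with ./
                name = name[2:]
            if name and name != ".":
                names.add(name.split("/", 1)[0])
    return sorted(names)


def guess_version(dirname):
    """Return everything from the first hyphen-separated part that starts with a digit."""
    parts = dirname.split("-")
    for index, part in enumerate(parts):
        if index > 0 and part[:1].isdigit():
            return "-".join(parts[index:])
    return None


def make_example_tarball():
    """Build a tiny in-memory tarball so the sketch runs without downloading anything."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        payload = b"placeholder\n"
        info = tarfile.TarInfo("example-tools-1.2.3/README")  # hypothetical layout
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    (dirname,) = top_level_dirs(make_example_tarball())
    print(dirname, guess_version(dirname))
```

For a real download you would pass an open file object for the fetched tarball to `top_level_dirs` instead of the in-memory example.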

Speaking of the libraries, one of the problems with the various Ruby-based libraries is that one of the most commonly used ones (RightScale’s right_aws gem) is no longer maintained, or at least upstream has gone missing, and that causes an obvious stir in the community. There is a fork of it, which forks the HTTP client library as well (right_http_connection, becoming http_connection; interestingly enough, for a single one-line change that I’ve simply patched in on the Gentoo package). The problem is that the fork is worse than the original gem as far as packaging is concerned: not only does the gem not provide the documentation, Rakefile, tests and so on, but releases were not even tagged in the git repository last I checked. Alas.

Luckily, it seems like amazon-ec2 is much better at this job; not that it was pain-free, but even here upstream is available, and fast to release a newer version; the same goes for litc, and the dependencies of the above-mentioned Rudy (see also this blog post from a couple of days ago). This actually makes it so that the patches I’m applying, and adding to Gentoo, are deleted quickly or don’t even enter the tree to begin with, which keeps the size of Portage down to acceptable levels and is good for the users who have to sync it.

Now, back to EC2 support proper. I already ranted before about the lack of Gentoo support; it turns out that there is more support if you go to the American regions rather than the European one. At the same time, the European zone seems to have problems: I spent a few days wondering why right_aws failed (and I thought it was because of the bugs that caused it to be forked in the first place), but in the end I had to conclude that the problem was with AWS itself: from time to time, a batch of my requests falls into oblivion, with errors ranging from “not authorized” to “instance does not exist” (for something I’m still SSH’d into, by the way). In the end, I decided to move to a different region, US/East, which is where my current customer is doing their tests already.

Now, this is not easy either, since there is no way to simply ask Amazon to copy a volume from a given region (or zone) to another within their own systems (you can use a snapshot to recreate a volume within a region on a different availability zone, but that’s another problem). The official documentation suggests out-of-band transmission (which, for big volumes, becomes expensive), and in particular the use of rsync. Using rsync directly would be a good suggestion, and not too difficult, if not for one particular detail. As far as I can tell, the only well-supported community distribution available with a decently recent kernel (one that works with modern udev, for instance) is Ubuntu. In Ubuntu you cannot access the root user directly, as you all probably well know, and EC2 is no exception (indeed, the copiable command that they give you to connect to your instances is wrong for the Ubuntu case: they explicitly tell you to use the root user, when you have to use the ubuntu user instead, but I digress). This also means that you cannot use the root user as either origin or destination of an rsync command (you can sudo -i to get a root session on one side or the other, but not on both, and you need it on both to be able to rsync over the privileged files). Okay, the solution is definitely easy to find: you just need to tar up the tree you want to transfer, and then scp that over. Still, it strikes me as really odd that their suggested approach does not work with the only distribution that seems to be updated and supported on their platform.
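For what it’s worth, the "tar it up first" half of that workaround can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names and the example paths are made up, the script would need to run with root privileges on the source side to read privileged files, and the copy itself (scp as the unprivileged user) and the unpacking as root on the destination are left out.

```python
import os
import tarfile
import tempfile


def pack_tree(source_dir, archive_path):
    """Pack source_dir into a gzip tarball; owner IDs and modes are kept in the headers."""
    with tarfile.open(archive_path, mode="w:gz") as archive:
        # arcname="." stores paths relative to the root of the tree.
        archive.add(source_dir, arcname=".")
    return archive_path


def list_archive(archive_path):
    """Return (name, uid, gid, mode) for each entry, to check what was recorded."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [(m.name, m.uid, m.gid, m.mode) for m in archive.getmembers()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical tree standing in for the files to transfer.
        tree = os.path.join(workdir, "tree")
        os.makedirs(os.path.join(tree, "etc"))
        with open(os.path.join(tree, "etc", "example.conf"), "w") as handle:
            handle.write("key = value\n")
        archive = pack_tree(tree, os.path.join(workdir, "tree.tar.gz"))
        for entry in list_archive(archive):
            print(entry)
```

The point of going through an archive is that ownership and permissions travel inside the tarball, so only one side at a time needs a root session.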

Now, after the move to the US/East region, the problems seem to have disappeared and all commands finally succeeded every time, yippee! I was finally able to work properly on the code for my project, rather than having to fight with deployment problems (this is why my work is in development and not system administration). After such an ordeal, writing custom queries in PostgreSQL was definitely more fun (no Rails, no ActiveRecord, just pure good old PostgreSQL; okay, I’m no DBA either, and sometimes I might have difficulties getting big queries to perform properly, as demonstrated by my work on the collision checker, but a simpler and more rational scheme I can deal with pretty nicely). That is, until I had to make a change to the Gentoo image I was working with, and decided to shut it down, restart Ubuntu, and make the changes to create a new AMI; then all hell broke loose.

Turns out that, for whatever reason, for the whole day yesterday (Wednesday 17th February), after starting Ubuntu instances with both my usual keypair and a couple of newly created ones (to exclude a problem with my local setup), the instance would refuse SSH access, claiming “too many authentication failures”. I’m not sure of the cause; I’ll have to try again tonight and hope that it works, as I’m late on delivery already. Interestingly enough, the system log (which only appears for one out of ten requests for it from the Amazon console) shows everything as okay, with the sole exception of the Plymouth software crashing with a segmentation fault (code 11) just after the kernel has loaded.

So all in all, I think that as soon as this project is completed, and with the exception of possible future work on it, I will not come back to Amazon’s EC2 anytime soon; I’ll keep getting normal vservers, with proper Gentoo on them, without hourly fees, with permanent storage and so on and so forth (I’ll stick with my current provider as well, even though I’m considering adding a fallback mirror somewhere else to be on the safe side; while my blog’s not that interesting, I have a couple of sites on the vserver that might require me to have higher uptime, but that’s a completely off-topic matter right now).

Oops I did it again.

What, you’ll ask? I broke Gentoo/FreeBSD, or at least I’m preparing locally to break it, badly.

With the 6.1 release I thought I was finally safe from libraries being moved out of the base system and changing sonames (switching from FreeBSD’s pimped-up versions to the proper ones from their authors), but it seems I have hit a crazy one now.

I should have paid more attention to Tiziano the other day; I would have broken Gentoo/FreeBSD a couple of days ago already, and fixed it too, but I had to bang my head against the wall of libeditline before I ended up working on the problem.

A little background: GNU readline is a cool library for line editing, the thing you usually do in bash when you move along the command line you’re writing and change it here and there. It’s released under the GPL license, not the LGPL license, so software released under non-GPL licenses can’t use it (think of BSD or MPL licensed software). To overcome this problem, the NetBSD project developed libeditline (or simply libedit), a BSD-licensed library that is API compatible with GNU readline.

This library is used (with an inline copy) in Heimdal (which Tiziano was working on) and in Firebird (which I tried to work on last night). As this ends up creating conflicts, and in general is against our policies, I told him (and knew for myself) that the inline copy should be replaced by a shared copy of the library. For Linux it’s easy: Mike added the dev-libs/libedit package some time ago. On FreeBSD, though, where the library should be provided by sys-freebsd/freebsd-lib, these packages try to use the add_history() and readline() functions, which are not available there.

I then tried to write an ebuild for the version of libedit that’s in ports, but that didn’t work out well, mostly because it would have meant having two almost identical packages in the tree, and because it was also older than Mike’s package. As this didn’t work, I then worked on making dev-libs/libedit build on FreeBSD by splitting the Gentoo patch from the GLIBC patch, and the result is quite neat; I’m just waiting for Mike to check it and say whether he’s okay with this merge.

What’s the problem then, if this is done already? Eh, as I said, the library changed soname, from libread.so.5 to libread.so, and the result is that /bin/sh won’t load after you merge freebsd-lib and before merging freebsd-bin again with libedit installed.
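The breakage window can be pictured with a toy model in Python. Everything here is hypothetical apart from the two sonames quoted above: the `unresolved` helper is made up, the bash entry uses a placeholder soname, and real dynamic linking involves far more than a set lookup.

```python
def unresolved(needed_by_binary, provided):
    """Map each binary to the sonames it needs that no installed library provides."""
    missing = {}
    for binary, needed in needed_by_binary.items():
        gone = sorted(set(needed) - set(provided))
        if gone:
            missing[binary] = gone
    return missing


# Hypothetical link requirements; the bash soname is a placeholder, not a real version.
NEEDED = {
    "/bin/sh": ["libread.so.5"],       # still linked against the old soname
    "/bin/bash": ["libreadline.so"],   # links to readline, so it is unaffected
}

if __name__ == "__main__":
    # State after merging freebsd-lib but before re-merging freebsd-bin:
    # only the new soname is installed.
    print(unresolved(NEEDED, {"libread.so", "libreadline.so"}))
```

Once freebsd-bin is rebuilt against the new library, the `/bin/sh` entry would list the new soname and the result would be empty again.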

Okay, it’s hardly a showstopper, considering that not much of portage relies on /bin/sh and most simply depends on bash (that links to readline and is thus safe) but it will be a mess for first time installs.

For this reason, I decided that as soon as Mike merges the changes, I’ll rebuild the 6.2 stage and update the documentation myself, so that I can point people to 6.2 directly (that’s proving simpler to manage than 6.1, especially with the new baselayout). The problem is that the migration from one to the other is not that trivial, but that can be fixed easily too.

Anyway, the new stages will be shinier and cleaner, with the libedit split out, and after that we can work on getting Firebird and Heimdal on Gentoo/FreeBSD :)

Updating to BETA3

So, today FreeBSD 6.2-BETA3 was released, and I started my usual routine of bumping the ebuilds and fixing the stuff that changed. Luckily this time the only changes were in the mk definitions (a slight change in bsd.info.mk) and in the kernel’s sources (an ntfs change to the GCC 4.1 patch, and two patches, created respectively by Javier and Alex, that were merged into the official sources).

The update was smooth and this is good, the kernel is now running fine and I didn’t find any particular problem. This is a good start, and this makes me glad I didn’t build the new stages yet, so I don’t have to restart the work from scratch, considering the time required to build them.

A particular thanks for this release goes to Patrick Lauer, as it’s thanks to him that I was able to actually fetch the sources for this release, repackage them and upload to the mirrors, as pitr is unreachable, so I used the gentooexperimental server.

Also, I’ve done some frivolous keywording lately, which means multimedia stuff. The most notable keywordings are mplayer and audacious, the latter thanks to being able to stream audio around through PulseAudio. Unfortunately Audacious crashes when it’s closed, and I’m not sure why; GDB does not seem to work correctly on Prakesh, and I’m not sure why that is either. I should try to take the binpkg from Farragut and see if that works.

For those wondering: I know a new version of bsdtar/libarchive was released, but it’s totally broken, so I won’t be adding it to portage; I’m hoping for 2.0a2…

Oh, and I can assure you, Seemant, you won’t turn back that easily :)

Story of a broken stage

So, I have to apologise to all the Gentoo/FreeBSD users who got stuck with the 20060721 stage; unfortunately it was broken by two issues that I didn’t find when I was preparing the stage. The first was really my mistake: bsdtar does not like being installed with -j8. The other was due to eselect-compiler. I fixed it by preparing a 0730 snapshot, but I’m still waiting for jforman and curtis119 to upload it, and today I also prepared a 0802 stage that is built directly with gettext-0.15 (and thus libintl.so.8).

Now, let’s see what happens with the newest stage: gettext-0.15 is in that stage, which means there won’t be any particular trouble with the update, which is a really good thing, because it actually made me afraid of losing a box for a while. Thanks to Timothy I also fixed the wrong default alias for ls, and now finally ls is coloured by default when you run on a compatible terminal :)
UTF-8 works by default, too.

Unfortunately, as the mirror and torrent masters are currently unavailable (/me pokes both), I’ve temporarily uploaded the stage here, but it’ll go away as soon as the mirrors are updated.

On a totally unrelated note, KDE 3.5.4 is now unmasked in portage: the KDE mirrors started providing the tarballs, and they have the same digests as the pre-releases, so the mirroring system picked them up and they are also available on the Gentoo mirrors. Funny to see that kde-packager was not warned beforehand about the tarballs becoming publicly available, but I’m told this is the usual way to handle a release… a pretty poor thing to me, but whatever.

Oh and yes, yesterday my connection was down, again. I suspect the cause was the bad (well, not so bad for me, as I like rain and thunderstorms) weather we had in the past few days, which probably created some problems with the cables and so on. I spent the afternoon continuing to read The Dark Tower, and I have to say, reading books in English makes you enjoy them a whole lot more than reading the translations.

Heat and stagebuilding

So, today while trying out KOffice, I ended up looking at the page with libiconv’s releases, which is usually static and which I just watch to know that I have nothing to do, and… well, there was a new release, libiconv 1.11. Time to update it then.

Luckily, no big deal this time: no ABI breakage, no soname change, nothing. But there is something interesting: with GCC 4.0/4.1 it does support hidden visibility, which is quite an interesting thing, not because libiconv has a lot of symbols, but because almost anything loads it when you build with NLS support enabled.

After this, I finally decided to prepare the new stage, which was long overdue. So I cleaned up defiant, set up distcc so that even ‘cc’ and ‘gcc’ play fine, and set up a configuration for the new stage with PORTAGE_CONFIGROOT. The stagebuilding took some hours, and with defiant turned on all that time, well, the temperature in my study room went way over my preferred one. The result is that I’m still melting.

Anyway, the new stage is ready, uploaded to pecker, and waiting to take the way of torrents and mirrors. Notable changes: GCC 4.1.1 and Binutils 2.17, Portage 2.1, libiconv 1.11 and libintl installed in /lib, which means you can actually have /usr split out now.

Okay, now I’ll check a little thing (pulseaudio’s linking troubles) and then I’ll go read. I hope to take tomorrow off, but I won’t count on it of course.