Gold readiness obstacle #1: Berkeley DB

I have already said that I’m working on getting gold tested a couple of years after its introduction. The situation with this linker is a bit difficult to assess; Google is set on making heavy use of it, and it is supposedly faster at linking Chrome, among others (even though it uses an inordinate amount of RAM to achieve that; it’s the usual approach of Google software I guess: you can always throw more RAM, you can’t always throw more time!). On the other hand I can tell for sure that no distribution has tried to build its whole package set with it yet, simply by looking at the kind of packages that fail to build for one reason or another.

I’ll leave the failures that are important to other, non-Gentoo-based distributions for the next few posts; today, the target is a failure that limits itself to Gentoo systems, because it involves a workaround we implemented a long time ago, which is now going to bite our ass until we either solve it at once, or find an alternative workaround. But let’s start with the original problem.

The Berkeley DB library (berkdb) – which is now maintained by Oracle, for the record – is a very common library used for storing data in plain files. There are a number of different “generations” of API, one of which is provided by the FreeBSD C library as well (db1); and the very generic API structure (dbm) is also implemented by the GNU-project gdbm library. The use of BerkDB was much more prominent in day-to-day life a couple of years ago for any Linux user; nowadays, the storage format and library of preference is SQLite (to the point that even BerkDB itself provides an SQLite-based interface to its own storage format). But even so, it is very difficult to do without BerkDB: LibreOffice, Postfix, Evolution, Squid, Perl, … they all require BerkDB for this or that feature.

Unfortunately the most recent generation of APIs for Berkeley DB is still varying widely, and the format is not always compatible between minor version changes (so from 4.4 to 4.5, and so on). For these reasons, Gentoo has been allowing side-by-side installation of multiple Berkeley DB versions at the same time, so-called slotting. By allowing non-rebuilt software to still use the old version (and the old files), as well as allowing access to the utilities of the previous format, you make sysadmins’ work easier, usually. Unfortunately, since the functions present on more than one minor version have the same exact name, Gentoo users and developers ended up hitting ELF symbol collisions when programs and libraries linked different Berkeley DB versions.

Turns out that GLIBC is actually designed keeping this in mind, and includes symbol versioning to solve the issue: a particular string is assigned to each symbol, so that you can have multiple libraries providing ABI-incompatible symbols with the same name – usually there is a need for the API to be at least partially compatible, but I don’t want to go into too many details now – without clashes and collisions. To provide versioning you have three main options: inline with the C sources, through the use of a version script, or, with GNU ld/bfd, through the --default-symver option, which sets the version string of each symbol to the soname of the library it is exported from. This was a godsend for Gentoo at the time because it allowed avoiding collisions without having to edit anything in the build system: you just had to add the flag to the linker’s flags in the ebuild and voilà.

If you’re now wondering whether GNU gold supports this option, you’re on the right track. The answer is “no, not right now”: it currently chokes on such an option, which results in the Berkeley DB build reporting that the compiler is unable to create executables. Whether it will support said option or not in the future is still to be seen. Last time I tried to implement a bfd/ld feature in gold – namely support for emitting explicitly unversioned symbols, which is needed to build FUSE – the results were disappointing, although I understand there is a problem with implementing a build feature that cannot work at runtime right now.

So unless gold gains the same option, we need to find another solution or ignore the existence of gold for a while longer. An alternative that I have been told about already would be to replace the current --default-symver option with a --version-script option pointing to an explicit version script to set the version. Unfortunately, this is not as easily done as said, at least for the versions we have in tree right now. A similar blanket-version approach would cause no issue if it was introduced with a new slot of the package, as the version would have to be different either way, but it wouldn’t work to keep binary compatibility with the older versions.

The problem is that BerkDB isn’t installing a single library, but a number of them instead; and since --default-symver uses the library’s soname when creating the versions for its symbols, it means that for each library, you’d need a different version. Implementing this same method through use of standard versioning scripts would be a world of pain, and probably not worth the price. For now, I decided to simply mask BerkDB on the container that is testing gold, forcing as many packages as possible to use gdbm instead, which does not have the same problem.

I’m glad we decided not to go the same route with expat; even though the immediate fallout at the time was out of scale (at the time it was a dream even to think about using --as-needed; .la files are a joke in comparison!), it saved us the headache of reaching the point where we decide whether to forgo modern tools, or break binary compatibility again.

At any rate this is just the tip of the iceberg, about gold and real-world software. I’ll write more about this in the next days as I find time. For now, I wouldn’t mind if you noted your interest in testing gold… comments, flattrs (on the blog, post or, even better, tinderbox since that’s what is doing the work!) and other tokens are definitely appreciated. At least it would tell me I’m not wrong in insisting on spending time reporting and solving the gold bugs.

That’s not the DB you’re looking for

I have written before about the big problems with BerkDB, and it was over six months ago that the problems started to show up with release 5 of the library. Although this new version introduces a number of new features, a few of which I’m sure packages have started using or soon will, and although upstream has moved on to work on the 5.1 series, Gentoo still doesn’t have this version available even in ~arch.

What’s going on here? Is this a failure of QA itself like people muse from time to time? Are people going to insist that ~arch is becoming “the new stable”? I don’t think any of this is right, actually.

There are a few new problems in all this; one of these is that unfortunately, for the way we’ve been installing Berkeley DB, all of the developers feel like “lingering” in fixing their Berkeley DB support, and rather let the package use the previous versions when they haven’t been updated to use the new ones. And this results in the current mess of dependencies, in packages depending on particular versions of sys-libs/db, and the need to keep eleven versions of the same package in tree at any time.

Now, you can guess that having more code around to maintain, to build and to install is usually a bad thing. But there are more reasons to have them around at all; one of these is that the binary format of berkdb files is not stable between versions, so if you have a huge amount of data stored in version, say, 4.3, you cannot simply switch to 5.0 or vice-versa. For this reason people often enough try to stick with a single version of berkdb per system and don’t upgrade even when new versions are available.

Unfortunately, the fact that some packages bring in older BerkDB versions hampers the diagnosis of packages broken by the presence of BerkDB 5; the problem is that some of them will definitely stop working at the mere presence of Berkeley DB 5; others will simply fall back to something they seem to understand, by identifying the presence of BerkDB 4.8 or earlier and using that. Unfortunately this detection could easily be faulty and cause very obnoxious results.

The main issue is that while we do provide slotted names for the libraries (libdb-4.8.so and libdb-5.0.so), and a different directory for the headers (/usr/include/db4.8 and /usr/include/db5.0), we also provide compatibility links for libdb.so and /usr/include/db.h, both of which will cause autodetection to easily fall back to “whatever is available”, and depending on how crazy the checks are it could even use the header from one version and the library for another, which is a definitely bad idea.
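The header/library mismatch described above is the kind of thing a package could guard against explicitly. What follows is only a minimal sketch under stated assumptions: the DbVersion struct and the same_series() helper are hypothetical names invented for illustration, and the version numbers are passed in by hand so that the sketch compiles without Berkeley DB installed. In a real build the header side would presumably come from the DB_VERSION_MAJOR and DB_VERSION_MINOR macros in db.h and the library side from db_version().

```cpp
// Hypothetical stand-in for a Berkeley DB version triple. In a real build the
// header side would come from the version macros in db.h and the library side
// from db_version(); here both are supplied by the caller.
struct DbVersion {
    int major_ver;
    int minor_ver;
    int patch_ver;
};

// The file format and the API are not guaranteed to be compatible across
// minor releases, so only an exact major.minor match is treated as safe.
// Ignoring the patch level is an assumption of this sketch.
constexpr bool same_series(DbVersion header, DbVersion library) {
    return header.major_ver == library.major_ver &&
           header.minor_ver == library.minor_ver;
}
```

A package using something like this would refuse to go on when, say, the /usr/include/db4.8 header ends up paired with libdb-5.0.so, instead of silently mixing the two.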

So what am I doing and proposing to solve these issues? Well first of all I re-used a virtual machine I have lying around, removing all the old db versions and then rebuilding a few of the packages that I knew were having problems with db5, some of which I was able to fix, luckily. I’ll go through a few more soonish, since the tinderbox is not reliable to identify these problems (as it has all the versions installed).

A second task to handle is making sure that the packages that currently depend on “any version 4” of BerkDB are actually doing what they say. A common mistake was to use the dependency on any version 4 just because the code wasn’t going to work with version 3, which is wrong; and another common mistake is to require the presence of version 4 because it doesn’t work with 5, but still not ensure that version 4 is used (by leaving it to the code to decide what to use). I know it is a bit hazy to understand here, let’s just say that they might not do the right thing as it is.

Thankfully, Zac already wrote a script that can help us here, for my previous quest on fighting old automake last month (which is almost, but not completely, won), so we know what the specific packages that need work are.

One lesson to be learnt here: if you’re looking to version-slot libraries, make sure you remove the generic fallback, and rather fix the packages relying on that before it turns into a problem like this.

Tinderbox summary for May 2010: GCC 4.5, Berkeley DB 5.0; libpng 1.4

I’m a bit surprised sincerely, and not exactly in a good way, since I started with GCC 4.5 just over a month ago, that the tinderbox almost caught up with its queue already.

Now, admittedly part of the reason might be related to my optimisation of the filesystems and partitions — especially after last week, since I moved all the stuff around to divide it into three pairs of disks: two 320GB WD RAID Edition disks for the RAID1 with the OS, my home and work stuff; two 500G Samsung disks for the “scratch” partitions (/var/tmp, the tinderbox’s filesystem), and finally two 1TB WD Caviar Green for storage space (multimedia files, including samples, and distfiles, 150GB of them!).

What makes me doubtful regarding the goodness of this situation is that a lot of packages were skipped because of dependencies failing. With GCC 4.5 we have no MySQL; with Berkeley DB 5.0 we have no Apache (because of apr-util). Without those, the tinderbox drops tons and tons of packages, a whole deptree of packages that will not be tested until the roots are fixed.

At any rate, now that I actually went through the packages, I can finally say what the most common problems with GCC 4.5 are. And surprisingly, it comes down to mostly two problems, a nasty runtime one, and one “usual” boring one.

First of all, the nasty one: GCC now seems to provide runtime-based overflow protection, not totally unlike the Stack Smashing Protection that the Gentoo Hardened project used to provide (and thanks to Magnus might come back to providing); this is a good thing from one side, because overflow protection is a nice safety feature (if not a security one), but it also means that we’re going to find a lot of packages failing at runtime because of this, and that stuff is much harder to deal with; one such package is the TCL interpreter, which is overflowing at runtime for so many packages that it’s boring. The problems tied to these features have their own tracker that was started back in the 4.3 series already.

The boring problem is, once again, related to C++ (can you see why Luca is so worried now?): for some reason, up until now GCC supported one very strange syntax for it:

Foo::Bar x = Foo::Bar::Bar();

This basically consists of explicitly calling the constructor function of the class, rather than using the constructor through conversion. I would have always considered this syntax invalid, since I started learning the language, but I can tell how it could be typed wrongly; what surprised me is that it was allowed before. Sigh. The fact that Free Software has become a strict GCC monoculture does not help here, it means that instead of actually being tested, the code is just accepted if GCC supports it. It sucks now that LLVM seems to become more interesting.
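To make the difference concrete, here is a small sketch of the forms that should keep working. The Foo::Bar class below is a hypothetical stand-in (the real classes in the failing packages are of course different), and make_and_read() is a helper name made up for this example.

```cpp
namespace Foo {
// Hypothetical class standing in for the Foo::Bar of the snippet above.
class Bar {
public:
    constexpr Bar() : value_(42) {}
    constexpr int value() const { return value_; }
private:
    int value_;
};
}  // namespace Foo

// No longer accepted: Foo::Bar::Bar names the constructor itself, which
// cannot be called like an ordinary function.
//   Foo::Bar x = Foo::Bar::Bar();

// Accepted: Foo::Bar() creates a temporary object through the type name.
constexpr int make_and_read() {
    return Foo::Bar().value();
}
```

In the failing packages the fix is usually the same one-liner: write Foo::Bar x = Foo::Bar(); or simply Foo::Bar x; and drop the extra ::Bar.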

The Berkeley DB situation is much worse, I’m afraid, in terms of the time needed to solve it; the main problem there is that a lot of packages that fail with it fall into the mail software categories, and the net-mail team has been near non-existent for way too long now. This can be noted by the fact that we have stuff like mailx failing, continuous file collisions that are not being solved (and the attempts with the “mailwrapper” stuff resulted in a total revert), and generally broken and out of date packages.

We could use a few more “mailmen” working in Gentoo, since I most definitely have just barely enough clue to manage my postfix installs with the help of The Definitive Guide (that’s one of the best things I could buy from O’Reilly, without that I’d be seriously screwed).

The Berkeley DB fiasco — Barely avoided!

This is one of my ranty posts; so if you don’t want to read me complaining about various things, in both my personal life and Gentoo, you can simply stop reading here and sorry for the noise.

I’m currently in a bit of a pinch; as I stated before I had to take a week off, because some extra stress in my life drove me to very nasty migraines, and I was taking way too many meds for it. Luckily, I’ve now been able to lighten my load a bit, and thanks to a few other coincidences, I don’t have to break my neck working for a month or two, time enough to get in better shape.

The state of Gentoo when I came back, as I said, wasn’t very suggestive; not only the libpng bump that caused disarray for many many users (and could have been avoided if the other developers listened to me when I suggested both solutions, as messed up as they sounded), there was another, less visible problem, which I hit myself, but most users wouldn’t have noticed: Robin bumped Berkeley DB to version 5.0. I hit this because I had sys-libs/db unmasked a long time ago to check it against my packages in particular.

It might be interesting to note that this version of BerkDB implements a DB-backed SQLite-compatible interface (libdbsql), which is something that lots of people, me included, are curious about for the future; having an alternative to SQLite (which is usually pretty slow) is not a bad thing, considering how much software relies on it, including Firefox.

Now, with almost all the BerkDB releases we have some kind of problem; most of the time, it’s not even a problem with API breakage, but rather a problem with the software using BerkDB trying to be smarter, detecting and accepting only a limited range of BerkDB versions, even when it works perfectly fine with newer versions. Then there are the API problems, of course. This basically brings us a number of problems:

  • API changes — these are unavoidable, of course; they aren’t usually too big changes, which means that it’s usually trivial to fix the packages for the new versions;
  • packages doing autodetect to find the latest DB version — just a minor annoyance, usually; the packages know that most distributions install multiple BerkDB versions, and try to look for the latest version available, checking in descending order, so 4.8, 4.7, 4.6 and so on; it’s not a bad thing, but there are two catches:
    • packages need to have a way to override the detection, so that we can use our db-use.eclass to force our latest version (say, the package checks only up to 4.6… we can either patch the package to detect 4.7, 4.8 and now 5.0, or we can simply give the package an order to use what we know being the latest one, without patches);
    • packages need to understand that newer versions provide (mostly) the same options as the older ones; this usually relates only to the features that are introduced in a given version, and are maintained in later; if a given feature is added to 4.6, you should expect it to be available in 4.7 and 4.8 as well;
  • some packages explicitly test for the version used to be within a given expected range; even when they do provide an override, they check the declared version and fail; these are nasty, as they need to be patched every time, sometimes waiting for upstream to accept the patches;
  • some packages compare versions as “equal” rather than “greater or equal”; this is again a minor, easy-to-patch problem (and rare enough), if it’s used to check the minor version of the package; it is becoming a problem with DB 5.0 as the major version changed, and the minor is now… lesser;
  • some packages distinguish between the old DB versions (db3) and the “new” ones (db4), as between the two the API changed a lot; to do so, they check the major version of the DB library… problem is that it has changed now, and they only check for 4 rather than for “4 or later”.
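The last two points in the list are easier to see in code. The following is a minimal sketch, assuming a hypothetical DbVersion struct and hypothetical helper names; real packages would take the numbers from the version macros in db.h rather than from a struct filled in by hand. It contrasts the fragile checks with sturdier ones.

```cpp
// Hypothetical version pair; a real package would read the numbers from the
// version macros in db.h instead.
struct DbVersion {
    int major_ver;
    int minor_ver;
};

// Fragile: "new API" means exactly major version 4, so 5.0 counts as old.
constexpr bool has_new_api_fragile(DbVersion v) {
    return v.major_ver == 4;
}

// Sturdier: anything from 4.0 onwards counts as the new API.
constexpr bool has_new_api(DbVersion v) {
    return v.major_ver >= 4;
}

// Fragile: only the minor is compared, so 5.0 looks older than 4.6.
constexpr bool at_least_fragile(DbVersion v, int wanted_minor) {
    return v.minor_ver >= wanted_minor;
}

// Sturdier: compare the major first, and the minor only on a tie.
constexpr bool at_least(DbVersion v, DbVersion wanted) {
    return v.major_ver > wanted.major_ver ||
           (v.major_ver == wanted.major_ver && v.minor_ver >= wanted.minor_ver);
}
```

With DB 5.0 the fragile variants both give the wrong answer (the new API is reported as missing, and a feature added in 4.6 is reported as unavailable), while the sturdier ones keep working for 4.8 and 5.0 alike.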

There are a couple of extra problems caused by BerkDB upstream in the 5.0 version, but those are not something that I’m caring about here, they are the usual problem with any package.

While Robin’s idea was to unmask Berkeley DB as soon as the testsuite passed green, I hope I was able to slow him down on this; when I saw the implication of the new version in my running system – and decided for once to back off rather than fixing what I used, like I usually do – I prepared to run a special cycle of tinderboxing, building all the reverse-dependencies of sys-libs/db to see how many failures we would be getting with the new version. The tracker bug gives a good idea of the extent of the problem.

What I did not foresee when I started the cycle is that the build-time failures risk being just the tip of the iceberg. I’m actually very glad that I decided to run with the tinderbox right away, as the packages that showed the symptom first aren’t really among the common ones, and very few people run testsuites on these. You might have guessed the problem now: runtime failures.

Indeed, it seems like Berkeley DB 5.0 has some stricter runtime checks on the status of the files and the database, so that some operations can only be executed after the database has reached given states. If it wasn’t for testsuites, we probably would have had to experience these problems at runtime, breaking user systems (hopefully, not production systems, though!) before we could find the root cause of the problem and fix it.

Unfortunately, as I often wrote about there are too many packages with no testsuites; and so many ebuilds that restrict them without a good reason (“package’s testsuite need a local daemon running” “RESO FIXED restricted” “Eh? No! REOPEN” is a common reduced exchange between me and other maintainers on the matter of testsuites), so even if we can get complete pass on the tinderbox, we’re doomed to find more of these problems at runtime.

Anyway, the tinderbox is still running, not all the packages have hit yet; I have at least one package of mine to fix, and I’ll do so today hopefully, and in the next weeks all of us developers will try our best to avoid another huge failure with updates. On my side, I accept thank you tokens as this kind of work is not only thankless on average, but sometimes even becomes controversial as I have to fight to get some fixes done, when they require more than the basic effort to handle the package.

Sometimes you’re saved…

…and sometimes you’re not, this time it seems to me like I am finally. This night I woke up at 4am, couldn’t really sleep, and decided that there was one big TODO in my list that needed to be killed off before I could feel better; that TODO was the PAM bump.

As I wrote in the past, PAM is one of those packages for which the most irritating thing to handle during a bump is the build system, because they don’t use autotools as they are designed to be used. In this particular case, the makefiles were abusing the LDFLAGS variables to pass the libraries to link against, a bad choice. Unfortunately there are a lot of modules to fix the makefile of, so it took me a very high amount of self control to actually get around to fixing it.

I’ve spent the early morning fixing the makefiles from Intrepid, then when I got the final patch (about 34 KiB uncompressed), I tried the build.. oops, I forgot there was still the documentation problem to solve, too!

Basically, I was forcing DocBook as a dependency of PAM in the past, because it failed to build for me without those, and I never really had the time to get around looking for that to fix. I already got a report saying that PAM built fine without them, so I knew I had to do with some kind of automagic dependency (I hate them), that would have asked me again to look at PAM build system. This time I decided to fix that for good too, and now manpages and documentation won’t be rebuilt from sys-libs/pam package, instead the upstream-supplied documentation will be installed, nothing else. This hopefully will also cut the time needed for it to build, and will fix it crosscompiling, too.

I haven’t bumped pam_userdb because the buildsystem changes weren’t important for that single module, and there was no code change; I’ve decided to get a bump to pam_console though, even if I have no idea if RedHat changed the code or not, so that I could stick an extra warning on the ebuild. I «employed» already too much time to write that package once, I don’t intend maintaining it on long terms, every problem that it can cause, will be considered responsibility of the users, not mine; the pam.d files installed will collide with other packages (I hadn’t added them, I wouldn’t have added them in the first place), and pam_console useflag is evil and should be masked by default, leaving to the user to unmask it. Unfortunately it seems like Gentopia still likes using pam_console (and I can’t understand why as that code is broken by design and I’m just hoping RedHat is going to kill it off like they seem to be doing with pam_stack.so).

Anyway, with this bump I also cleaned up a few open bugs for PAM “team” (aka Robin for pam_ldap and me for the rest); luckily most of the bugs are simply requests to add new modules to Portage… I’d like to help my users with those, but I don’t have the time needed to maintain them alone, but if you want to add one particular module, I can handle a proxy maintainership I suppose.

One module that is already in portage and is giving quite a bit of trouble, though, is pam_krb5, so I’ll be looking forward to getting last rites for it unless I can find someone to maintain it; I cannot really do that myself, as I have no clue about Kerberos and I don’t really want to have some.

Now, after asking for 0.78-r5 to go stable (it was waiting for a bit to be), I’m starting thinking about getting a 0.99 version stable someday, but there are problems with the upgrade, first of all the pam.d files that used pam_stack.so won’t work anymore, they absolutely need to be changed to include syntax (luckily, most of the newly installed pam.d files will just use include syntax already); then there are the two modules that got split out the main package, pam_userdb and pam_console… about the latter I already talked above, if it was for me, pam_console wouldn’t ever go stable, but there are packages that probably depend on it on stable tree too (sigh) and currently work because it’s built with pam package (next task for later on today: make sure that packages with pam_console useflag depend on pam_console); about pam_userdb, the problem is more complicated.

It’s complicated because there might be people using pam_userdb in production environments, so the fact that upgrading pam will remove the module is not really a good thing unless you actually get documented the fact that you need the other module. Also, pam_userdb right now is building its own copy of berkdb library, static with PIC, which makes it a problem from a security point of view if a vulnerability is found in berkdb library; the problem is that using berkdb library dynamically is a no-go (it’s installed in /usr/lib rather than /lib, moving it around is far from being a good idea), and berkdb static library is not built with PIC (by policy). I’ve asked Paul to think of a solution, because most of the possible solutions would ask for a change in berkdb before pam_userdb.

The bottom line is that PAM is something I can only see as a curse, I had the misfortune to have to fix the pam_stack situation for Gentoo/FreeBSD, and now I’m all the way in, without even needing it myself in the first place (most of home users shouldn’t really need PAM), neither for home nor for my jobs. And I’m alone taking care of it, which means that if I don’t consider investing some of my time in it on a regular basis, it’s easy for bugs to pile one over the other, and being PAM a central piece for the security of the systems where it is enabled, it’s not something you’d like to have unmaintained.

I suppose the only escape route I have to get 0.99 stable, anyway, is to write some upgrade documentation… but I don’t really feel like writing one right now; I could ask our documentation project for a monkey, but last time it seemed to me like only nightmorph is actually active, and I’d rather not ask him again, as he already has a lot of things to take care of.

PAM mess, once again and forever

Seems like I’m doomed to work on PAM for the rest of my life, even if I would like to stop using it myself.

Where this doom comes from is easy to find: it’s one of the first things I’ve done when I joined Gentoo. Why this doom is heavy and annoying is a bit less easy to explain.

First of all, when I started, I just supported Martin (azarah) who did most of the work; I cleaned up a few modules, I wrote pam.eclass, and basically I made sure that the files installed in /etc/pam.d were compatible both with Linux-PAM (used, you’d never say, on Linux), and OpenPAM (used on FreeBSD).

Now, a year and a half after I joined Gentoo, I’m almost alone on the PAM herd, and I thus need to take care of sys-libs/pam too. The current version in Gentoo is pretty ancient, 0.78; even Debian beat us with 0.79 in testing and unstable (considering that the latest version is 0.99.6.3). I put a (masked) 0.99 in portage a while ago, but, as I started from scratch (because the original ebuild was waaay too complex and hard to understand for me; mind you, it was building not only PAM, but also glib-1 (for pam_console) and BerkDB (for pam_userdb) inline in the ebuild), it missed stuff like a proper Berkeley DB support and the RedHat/Fedora patches that I simply wanted not to apply. I did this because for a while Martin was away and thus I had to do something that I couldn’t handle by myself.

In July Martin was back, and he re-added the BerkDB build code, and the dependencies checking to the ebuild, messing it again almost to the same level as 0.78, so I just left sys-libs/pam once again to him. But then, he’s away again, and the package is unmaintained for a few months now again.

So here I am, after a day bumping and building KDE, fighting with PAM trying to prepare a new ebuild that can be used to test, hopefully to get a newer version at least in ~arch. This version will be a revert of my previous version, so without the static BerkDB build, but still with a berkdb useflag to enable/disable userdb module. I’m planning on simply moving out pam_userdb on its own package, copying Martin’s code, and leaving it there alone for the eternity, but to do so I first need to send this patch upstream and get it applied.. sigh, more time going away on this.

On a totally unrelated (and way merrier) note, I wish to thank Uri Sivan once again, who sent me a couple of items from my wishlist. Thanks Uri, again :)