Reliable and unreliable detection of bundled libraries

Tim asked me for some help to identify bundled libraries, so I come back to a topic that I haven’t written about in quite a while, mostly because it seemed like a lost fight with too many upstream projects.

First of all let’s focus on what I’m talking about. With the term “bundled libraries” I’m referring to libraries whose sources are copied verbatim, or only slightly modified, within another project. This is usually done by projects with either a number of dependencies, or uncommon dependencies, under the impression it makes it easier for the users to build the project. I maintain that such notion is mostly obsolete as it’s not the place of most random users to actually build their own projects, leaving that job to distributions, which have the inverse problem with bundled libraries: they make packaging harder, and software much more vulnerable to security issues.

So how do you track down bundled libraries issues? Well one option is to check out each packages’ sources, but that’s not really applicable that easily, especially since the sources might be masked — I found more than one project taking a library originally in a number of source files, concatenating all of them together, and then building it in a single object file. The other options relate to analyse the result of the build, executables and libraries. This is what I’ve done with my scripts when I started looking at a different, but slightly related, problem (symbol collisions).

You can analyse the resulting binaries in many ways; the one I’ve been doing with the scripts I’m referring to above is a very shallow detection: it wasn’t designed for the task of looking for bundled libraries, it was designed to identify libraries and executables exporting the same symbols, that would cause trouble if they were ever loaded in the same address space. For those curious, this is because the dynamic loader needs to choose only one symbol for any name, so either the libraries or the executable would end up calling the wrong function or accessing the wrong data. While this can sound too theoretical, one such bug I reported back in 2008 was the root cause of a runtime crash of Evince in 2011.

At any rate this method is not optimal for the task of finding bundled libraries for two reasons: the first is that it only checks for identical symbol names, so it doesn’t take into account libraries that are bundled and have their method renamed – and yes I have seen that done – the second is that it only works with exported symbols, that are those that the dynamic loader can collide on. What is not included here are the so-called local symbols: static symbols, symbols declared with private ELF visibility, and those that are hidden by linker scripts at link time.

To inspect the binaries deeper, you need non-stripped copies; it doesn’t really require you to have them built with DWARF data in them (the debug data that is added when building with, for instance, -ggdb); it can work even just the complete symbol table kept in the .symtab section of final binaries (and usually stripped away). To get the list of all symbols present within a binary, be them data (constants and variables) or code (functions), you can simply use the nm --defined-only command (if you add --dynamic you end up with the same data I’m analysing with my script above, as it changes the table that it uses to look up symbols).

Unfortunately this still requires to find a way to match the symbols even with prefixed versions, and with little false positives; this is why I haven’t worked on a script to deal with this kind of data yet. Although while writing this I can think of a way to make it at least scan for a particular library against a list of executables, that is not really one of the best-performing options available. I guess the proper answer here would be to find a way to generate a regular expression for each library based on the list of symbols they include, and then grep for that over the symbols exported by the other binaries.

Note here: you can’t expect that all the symbols of a bundled library are present in the final binary; when linking against object files or static archives the linker applies logic akin to the --as-needed one and will not bring in object files if no symbol from that file is referenced by the code; if you use other techniques then you can even skip single functions and data symbols that are unused by your code. Bottom line is that even if the full sources of another library are bundled and built, the final executable might not have all of them on it.

If you don’t have access to a non-stripped binary, well then your only option is to run strings over the package and try to find matching strings for the library you’re looking for; if you’re lucky, the library provides version information you can still track down in raw strings. This unfortunately is the roughest option you have available and I’d not suggest doing it for anything you have sources for; it’s a last-resort for proprietary, pre-built software.

Finally, let me say a couple of words regarding identify the version of an exported version of a library, which I have found to be done way too many times. If this happens on an ET_DYN file, such as a shared object, you can easily dlopen() it and then call functions like zlibVersion() to get the actual version of the library embedded. This ends up being pretty important when trying to tell whether a package is bundling a vulnerable version of zlib or libpng, even though it is also not extremely reliable as it is.

That’s not the DB you’re looking for

I have written before about the big problems with BerkDB and it was over six months ago that the problems started to show up with release 5 of the library. Despite this new version introduces a number of new features, a few of which I’m sure packages have started using, or will soon do, as well upstream moving on to work on the 5.1 series, Gentoo still doesn’t have this version available even in ~arch.

What’s going on here? Is this a failure of QA itself like people muse from time to time? Are people going to insist that ~arch is becoming “the new stable”? I don’t think any of this is right, actually.

There are a few new problems in all this; one of these is that unfortunately, for the way we’ve been installing Berkeley DB, all of the developers feel like “lingering” in fixing their Berkeley DB support, and rather let the package use the previous versions when they haven’t been updated to use the new ones. And this results in the current mess of dependencies, in packages depending on particular versions of sys-libs/db, and the need to keep eleven versions of the same package in tree at any time.

Now, you can guess that having more code around to maintain, to build and to install is usually a bad thing. But there are more reasons to have them around at all; one of these is that the binary format of berkdb files is not stable between versions, so if you have a huge amount of data stored in version, say, 4.3, you cannot simply switch to 5.0 or vice-versa. For this reason people often enough try to stick with a single version of berkdb per system and don’t upgrade even when new versions are available.

Unfortunately, the fact that some packages bring in older BerkDB version hampers the diagnosis of packages broken by the presence of BerkDB5; the problem is that some of them will definitely stop working at the mere presence of Berkeley DB 5; others will simply fall-back to something they seem to understand, by identifying the presence of BerkDB 4.8 or earlier and using that. Unfortunately this detection could easily be faulty and cause very obnoxious results.

The main issue is that while we do provide slotted names for the libraries (libdb-4.8.so and libdb-5.0.so), and a different directory for the headers (/usr/include/db4.8 and /usr/include/db5.0), we also provide compatibility links for libdb.so and /usr/include/db.h, both of which will cause autodetection to easily fall back to “whatever is available”, and depending on how crazy the checks are it could even use the header from one version and the library for another, which is a definitely bad idea.

So what am I doing and proposing to solve these issues? Well first of all I re-used a virtual machine I have laying around, removing all the old db versions and then rebuilding a few of the packages that I knew were having problems with db5, some of which I was able to fix, luckily. I’ll go through a few more soonish, since the tinderbox is not reliable to identify these problems (as it has all the versions installed).

A second task to handle is making sure that the packages that currently depend on “any version 4” of BerkDB are actually doing what they say. A common mistake was to use the dependency on any version 4 just because the code wasn’t going to work with version 3, which is wrong; and another common mistake is to require the presence of version 4 because it doesn’t work with 5, but still not ensure that version 4 is used (by leaving it to the code to decide what to use). I know it is a bit hazy to understand here, let’s just say that they might not do the right thing as it is.

Thankfully, Zac already wrote a script that can help us here, for my previous quest on fighting old automake last month (which is almost, but not completely, won), so we know what the specifics packages that need work are.

One lesson to be learnt here: if you’re looking to version-slot libraries, make sure you remove the generic fallback, and rather fix the packages relying on that before it turns out into a problem like this.

“Put it there and wait for users to break” isn’t a valid QA method

So it seems like Jeremy feels we’re asking too much of maintainers by asking them to test their packages. As I said in the comments to his blog, picking up poppler as a case study is a very bad idea because, well, poppler has history.

Historically, new upstream poppler releases in the recent times have been handled by guys working on the KDE frontends, when they suited their need, rather than being coordinated with other developers. This isn’t much of a problem for them because binary distributions have no problem with shipping two version of the same library on different ABI versions, and even building packages against different versions is quite okay for them — it definitely isn’t for us.

In turn, within Gentoo, in the recent times, poppler has been maintained by the KDE team; mostly by the one developer who, not for the first time, wanted to bump to 0.16.0 a couple of days ago, without adding a mask first and ensuring that all the packages in tree that use poppler would work correctly with it. Jeremy states that doing such a test would make ~arch akin to stable; I and I’m sure other of my colleagues, rather see it as “not screwing users up”.

First of all, it’s not like bumping something that you expect not having trouble; bumping grep and finding out that half the packages fail whose tarballs were built with Gentoo and a very very old version of our elibtoolize patches is unexpected. Finding out that, once again, the new minor version of poppler has API breakage is not surprising, it’s definitely expected, which means that when committing the version bump you can have either of two thoughts in mind “I don’t care whether GNOME packages break” or “Ooh shiny!”. In either case, you should learn that such thoughts are not the right state of mind for committing something to the tree. I said that before of other two developers, I maintain my opinion now.

And just to be clear, I’m not expecting that we spend months in overlays before reaching the main tree, I’m just saying that if you know it’ll break something, you should mask it beforehand, and give a heads’ up to the developers maintaining the other packages to due their job and fix it.

Now, in the comments of that post, Jeremy insists that factoring in my tinderbox is not an option for him, because it is “counting on a single developer cpu/time”. Right, because there is no other way to test reverse dependencies, uh? The tinderbox is meant to help the general situation, but it’s definitely not the only way; even though I’d have been glad to run it for poppler to find and report the problems, the task of checking its reverse dependencies is far from impossible to deal with for a single developer. There are a grand total of … thirty-nine (39!) packages with reverse dependencies on poppler! So it’s not a “can’t be done” situation, it’s a “can’t be arsed”. This also brings Jeremy’s ratio of “7/14000” packages with a problem to a more worrisome 739. See the difference?

Simply put, do your own homework: learn about the package you’re maintaining, try hard not to break reverse dependencies; if it happens that you break a package unexpectedly, hey, it’s ~arch, we can deal with it. But do not commit stuff that you know will break a good deal of the packages depending on yours. Especially if you’re not out there to fix them yourself.

The automake and libtool clash

I’ve been warned today by my friend lavish that there are packages, such as ettercap and tvtime that are failing because of a clash between libtool and older automake versions. Indeed, that is the case and I’m now on the case to resolve it.

The problem here is that libtool 2.4 requires that you use it with at least automake 1.9, so anything using 1.7 (ettercap) or 1.8 (tvtime) will fail with a number of aclocal errors. There are internal changes forcing this, it’s nothing new and nothing extraordinary. Unfortunately, autotools.eclass allows maintainers to require an older automake version. This has been extensively used up to now to avoid future-version compatibility with automake… but it’s now biting our collective asses with this.

The solution, obviously, is to make sure that ebuilds are tested against a more modern version of automake, and that is usually quite quick when using 1.8 or later, older versions might be more of a problem; turns out that I suggested to start working on this back in June 2009 — magic? ESP? Or simple knowing my tools by now?

I’m now going to work through the tree to take care of these issues, since they are quite important, but also relatively easy to fix. Plus they require autotools expertise and I’m positive that at last half the involved maintainers would be asking me to fix them anyway.

Now, the question is why didn’t my tinderbox catch this beforetime? The answer is easy: *it is not running*… between the problems and the gold f-up – and few to nobody interested in helping me – I haven’t ran the tinderbox for the past two weeks.

And yet this is the exact kind of breakage it is supposed to find. So please consider how much work I’ve been pouring in to protect us all from this stuff happening, and don’t blame me if I actually come asking for help from time to time to pay for the hardware — there are requests for specific CPU and memory hardware to upgrade the workstation on the page I listed above, if you feel like contributing directly at running it; I should add soon a request for two Caviar Black drives where most builds happen (the two Samsung SpinPoint I have start to report relocated sectors, they are at the end of their life here), or two RE4 for the operating system/work data (which is now filled to its limits).

Edit: since the system locked-up during the build, likely caused by the HDDs failing, I had to order the new ones… at a moment where I’m €1550 short to pay taxes, I had to spend another €190 for the new disks. Can you tell now why I’d wish that at least the rest of the developers would help me reducing the overhead of tinderbox by at least doing some due diligence in their packages, rather than complain if I ask for help to pay the bills&hardware of the tinderbox? — No, having the Foundation reimburse me of the expenses is not going to help me any further since I’d have to pay extra taxes over that.

Anyway, for now I think I’ll be limiting the tinderbox to two cores out of the eight I have available, and keep running it in the background, hopefully it will not cause the rest of my system (which is needed for my “daily” job — I say “daily” because I actually spend evenings and nights working on it) to hang or perform too badly.

Anyway, 6am, I haven’t slept, and I still have a number of things to do…

About the new Quagga ebuild

A foreword: some people might think that I’m writing this just to banter about what I did; my sincere reason to write, though, is to point out an example of why I dislike 5-minutes fixes as I wrote last December. It’s also an almost complete analysis of my process of ebuild maintenance so it might be interesting for others to read.

For a series of reasons that I haven’t really written about at all, I need Quagga in my homemade Celeron router running Gentoo — for those who don’t know, Quagga is a fork of an older project called Zebra, and provides a few daemons for route advertisement protocols (such as RIP and BGP). Before yesterday, the last version of Quagga in Portage was 0.99.15 (and the stable is an old 0.98 still), but there was recently a security bug that required a bump to 0.99.17.

I was already planning on getting Quagga a bump to fix a couple of personal pet peeves with it on the router; since Alin doesn’t have much time, and also doesn’t use Quagga himself, I’ve added myself to the package’s metadata; and started polishing the ebuild and its support files. The alternative would have been for someone to just pick up the 0.99.15 ebuild, update the patch references, and push it out with the 0.99.17 version, which would have categorized for a 5-minutes-fix and wouldn’t have solved a few more problems the ebuild had.

Now, the ebuild (and especially the init scripts) make a point that they were contributed by someone working for a company that used Quagga; this is a good start, from one point: the code is supposed to work since it was used; on the other hand companies don’t usually care for the Gentoo practices and policies, and tend to write ebuilds that could be polished a bit further to actually be compliant to our guidelines. I like them as a starting point, and I got used to do the final touches in those cases. So if you have some ebuilds that you use internally and don’t want to spend time maintaining it forever, you can also hire me to clean them up and merge in tree.

So I started from the patches; the ebuild applied patches from a tarball, three unconditionally and two based on USE flags; both of those had URLs tied to them that pointed out that they were unofficial feature patches (a lot of networking software tend to have similar patches). I set out to check the patches; one was changing the detection of PCRE; one was obviously a fix for --as-needed, one was a fix for an upstream bug. All five of them were on a separate patchset tarball that had to be fetched from the mirrors. I decided to change the situation.

First of all, I checked the PCRE patch; actually the whole PCRE logic, inside configure is long winded and difficult to grok properly; on the other hand, a few comments and the code itself shows that the libpcreposix library is only needed non non-GNU systems, as GLIBC provides the regcomp/@regexec@ functions. So instead of applying the patch and have a pcre USE flag, I changed to link the use or not of PCRE depending on the elibc_glibc implicit USE flag; one less patch to apply.

Second patch I looked at was the --as-needed-related patch that changed the order of libraries link so that the linker wouldn’t drop them out; it wasn’t actually as complete as I would have made. Since libtool handles transitive dependencies fine, if the libcap library is used in the convenience library, it only has to be listed there, not also in the final installed library. Also, I like to take a chance to remove unused definitions in the Makefile while I’m there. So I reworked the patch on top of the current master branch in their GIT, and sent it upstream hoping to get it merged before next release.

The third patch is a fix for an upstream bug that hasn’t been merged in a few releases already, so I kept it basically the same. The two feature patches had new versions released, and the Gentoo version seems to have gone out of sync with the upstream ones a bit; for the sake of reducing Gentoo-specific files and process, I decided to move to use the feature patches that the original authors release; since they are only needed when their USE flags are enabled, they are fetched from the original websites conditionally. The remaining patches are too small to be part of a patchset tarball, so I first simply put them in files/ are they were, with mine a straight export from GIT. Thinking about it a bit more, I decided today to combine them in a single file, and just properly handle them on Gentoo GIT (I started writing a post detailing how I manage GIT-based patches).

Patches done, the next step is clearing out the configuration of the program itself; the ipv6 USE flag handles the build and installation of a few extra specific daemons for for the IPv6 protocol; the rest are more or less direct mappings from the remaining flags. For some reason, the ebuild used --libdir to change the installation directory of the libraries, and then later installed an env.d file to set the linker search path; which is generally a bad idea — I guess the intention was just to follow that advice, and not push non-generic libraries into the base directory, but doing it that way is mostly pointless. Note to self: write about how to properly handle internal libraries. My first choice was to see if libtool set rpath properly, and in that case leave it to the loader to deal with it. Unfortunately it seems like there is something bad in libtool, and while rpath worked on my workstation, it didn’t work on the cross-build root for the router though; I’m afraid it’s related to the lib vs lib64 paths, sigh. So after testing it out on the production router, I ended up revbumping the ebuild already to unhack itif libtool can handle it properly, I’ll get that fixed upstream so that the library is always installed, by default, as a package-internal library, in the mean time it gets installed vanilla as upstream wrote it. It makes even more sense given that there are headers installed that suggest the library is not an internal library after all.

In general, I found the build system of quagga really messed up and in need of an update; since I know how many projects are sloppy about build systems, I’d probably take a look. But sincerely, before that I have to finish what I started with util-linux!

While I was at it, I fixed the installation to use the more common emake DESTDIR= rather than the older einstall (which means that it now installs in parallel as well); and installed the sample files among the documentation rather than in /etc (reasoning: I don’t want to backup sample files, nor I want to copy them to the router, and it’s easier to move them away directly). I forgot the first time around to remove the .la files, but I did so afterwards.

What remains is the most important stuff actually; the init scripts! Following my own suggestions the scripts had to be mostly rewritten from scratch; this actually was also needed because the previous scripts had a non-Gentoo copyright owner and I wanted to avoid that. Also, there were something like five almost identical init scripts in the package, where almost is due to the name of the service itself; this means also that there had to be more than one file without any real reason. My solution is to have a single file for all of them, and symlink the remaining ones to that one; the SVCNAME variable is going to define the name of the binary to start up. The one script that differs from the other, zebra (it has some extra code to flush the routes) I also rewrote to minimise the differences between the two (this is good for compression, if not for deduplication). The new scripts also take care of creating the /var/run directory if it doesn’t exist already, which solves a lot of trouble.

Now, as I said I committed the first version trying it locally, and then revbumped it last night after trying it on production; I reworked that a bit harder; beside the change in libraries install, I decided to add a readline USE flag rather than force the readline dependency (there really isn’t much readline-capable on my router, since it’s barely supposed to have me connected), this also shown me that the PAM dependency was strictly related to the vtysh optional component; and while I looked at PAM, (Updated) I actually broke it (and fixed it back in r2); the code is calling pam_start() with a capital-case “Quagga” string; but Linux-PAM puts it in all lower case… I didn’t know that, and I was actually quite sure that it was case sensitive. Turns out that OpenPAM is case-sensitive, Linux-PAM is not; that explains why it works with one but not the other. I guess the next step in my list of things to do is check out if it might be broken with Turkish locale. (End of update)

Another thing that I noticed there is that by default Quagga has been building itself as a Position Independent Executable (PIE); as I have written before using PIE on a standard kernel, without strong ASLR, has very few advantages, and enough disadvantages that I don’t really like to have it around; so for now it’s simply disabled; since we do support proper flags passing, if you’re building a PIE-complete system you’re free to; and if you’re building an embedded-enough system, you have nothing else to do.

The result is a pretty slick ebuild, at least in my opinion, less files installed, smaller, Gentoo-copyrighted (I rewrote the scripts practically entirely). It handles the security issue but also another bunch of “minor” issues, it is closer to upstream and it has a maintainer that’s going to make sure that the future releases will have an even slicker build system. It’s nothing exceptional, mind you, but it’s what it is to fix an ebuild properly after a few years spent with bump-renames. See?

Afterword: a few people, seemingly stirred up by a certain other developer, seems to have started complaining that I “write too much”, or pretend that I actually have an uptake about writing here. The main uptake I have is not having to repeat myself over and over to different people. Writing posts cost me time, and keeping the blog running, reachable and so on so forth takes me time and money, and running the tinderbox costs me money. Am I complaining? Not so much; Flattr is helping, but trust me that it doesn’t even cover the costs of the hosting, up to now. I’m just not really keen on the slandering because I write out explanation of what I do and why. So from now on, you bother me? Your comments will be deleted. Full stop.

Keep on…

  • keep on ignoring requests coming from a QA team member;
    • keep on de-CCing QA on your bugs when said QA team member state that the fix is the wrong one, just stating that “it’s the wrong place open a new bug” when your solution was decided there;
    • keep on complaining if less than 1% of bugs filed, out of literally thousands lack a log file;
    • keep on asking me to not use the f-word because it makes it bad for you to be associated with Planet Gentoo (but on the other hand, feel no harm in being associated with people who repeatedly made Gentoo unusable for its users);
    • keep on spitting on me for pointing out that your unmask ideas are between reckless and totally stupid;
    • keep ignoring the bugs that are reported for your package;
    • keep bumping packages you don’t maintain, without looking into the further QA-related issues, and without declaring yourself the maintainer;
    • keep repeating the same mistakes and when asked to revise your attitude use the “but he did it as well” card.

Keep it up this way, then look back to see if there is QA at all.

Gentoo needs you! A few things that would definitely be useful

After my ranting about QA – which brought back our lead from hiding, at least for a while, let’s see how he fares this time – a number of people asked me what they could do to help QA improve. Given that we want both to test and prevent, here comes a list of desirable features and requests that could obviously be helpful:

  • as usual, providing for another tinderbox would be a good thing as more situations could be tested; right now I’d very much like to have a server powerful enough to run a parallel tinderbox set up to test against stable-tree amd64 keyword, rather than the current unstable x86;
  • repoman right now does not warn about network errors in HOMEPAGE or SRC_URI; see this bug as for why; it would be a good thing to have;
  • again repoman, lacking network checks, lack a way to make sure that the herd specified in metadata.xml exists at all; it’s pretty important to do that since it happened before that herds weren’t updated at all;
  • there is also no way to ensure that metadata.xml lists maintainers that have accounts in Bugzilla; with proxy-maintainership sometime it happens that either the email address is not properly updated (proxier fail) or the maintainer does not have an account on Bugzilla at all (which is against the rules!); unfortunately this requires login-access to bugzilla, and is thus a no-no right now;
  • the files/ subdirectories are still a problem for the tree, especially since they are synced down to all users and should, thus, be kept slim, if not for the critical packages; as it turns out, between Samba and PostgreSQL they use over half a megabyte for files that a lot of people wouldn’t be using at all; we also had trouble with people not understanding that the 20KB size limit on files does not mean you can split it in two 10KB files, nor compress it to keep it in tree (beside the fact we’re using CVS, there is a separate patch repository for that; Robin is working on providing a reliable patch tarball hosting as well); see bug #274853 and bug #274868
  • moving on to bugzilla, there is currently no way to display inline on the browser logs that are compressed with gzip, that the new Portage can generate and make the tinderbox work much easier; this makes bug reports a bit more complex to manage; if somebody knows how to get Bugzilla to provide proper headers so that the browsers can read them, it’d be pure magic;
  • right now we lack a decent way to test for automagic dependencies; you could make use of something similar to my oneliner but it won’t consider transitive dependencies on virtual packages;
  • with split-unsplit-split-unsplit cycles, it happens that a number of packages are in tree and are now obsoleted – this is why I had to remove a number of geda ebuilds the other day – removing these packages not only reduces the pressure on the tree itself, but also allows the tinderbox to run much more smoothly without me having to mask those packages manually: when they are visible, the tinderbox might end up trying to merge them by removing the (newer) un-split ebuild, or vice-versa;
  • one more Portage (non-repoman!) feature that would be nice: per-package features or at least per-package maintainer mode while Portage has a number of features to make sure that maintainers don’t commit crap (testsuite support; die-on-some-warnings; fix-or-die) – such as the stricter feature – but using them system-wide is unfeasible (I’m okay with fixing my own stuff, but I don’t want an unusable system because some other developer didn’t fix his).

There’s something about Webmin

So a week or so ago I masked Webmin, because it dinged a lot on my tinderbox and I decided to take a look at it. The ChangeLog shows a very bleak story: webmin hasn’t had a dedicated maintainer since March 2008; that’s two and a half years ago; the last five versions were bumped by Patrick (who, if you should be reminded, rarely picks up the pieces of what he might break); last October, Victor “fixed” a bunch of sandbox violations in the 1.490 version.

Given the hits on the tinderbox and the above-noted history; I decided to mask it until somebody actually stepped up to maintain it; this was also positively acknowledged by the security team: having a package with the security issue of Webmin lacking a dedicated maintainer is calling for trouble.

Since then I received a long list of email messages of users using Webmin, either in production or for local environments, wondering why I masked it and insisting that it is in its design to access files owned by other packages anyway. I guess I’ll have to rephrase the masking reason given that, that wasn’t what I meant.

It is a common, enforced policy in Gentoo that all the packages are self-contained, and mess the least possible with the running system as they can, so that if you just emerge MySQL, it won’t start up right away, or even at the next reboot; this is one thing that put Gentoo drastically in contrast with most other distributions. And what a lot of people, me included, love the most.

If there are commands to execute to properly setup the package before making use of it, we have the special pkg_config() function that can be called through emerge --config — there aren’t many packages doing that though.

Instead, webmin-1.510.ebuild goes the totally wrong way: it tries to second-guess the user’s setup; it runs the setup step of Webmin directly within src_install() which should be protected by the sandbox; but Victor’s “fixes” actually simply allowed the ebuild to access what it shouldn’t be, not at that state at least: real devices, kernel modules, and the cron configuration.

How did I notice that? Very simple, actually: the tinderbox, just like my own system, uses fcron as cron daemon. The fcron configuration file is not predicted within the ebuild, so it still triggered sandbox violation, and caused my tinderbox to start screaming at me like my mother when I come back at 6am .

Now, if that was the only problem it wouldn’t be much; but the ebuild also fail to actually provide a decent/stable PAM configuration, and a lot of the shell code in the ebuild file is just… icky.

What can be done to save webmin? Well, more than certainly it needs a new maintainer, one who does a lot more than copying a file and running cvs add on it. The new maintainer will have to rewrite most of the ebuild, implement the good parts of it as pkg_config and make sure that it respects the user’s settings.

For most other packages, like I did for Ruby packages, I would have said that I’m happy to be hired to take care of the problematic package, with a reasonable fee to keep maintaining it from then on. But given it’s Webmin we’re talking about.. I’m not sure if I’m ready to pick it up. If somebody else actually uses it, and feels like maintaining it, it’d be best. I’m still open to be the proxying maintainer if somebody feels like pick up the pieces of it all, but wants to have somebody else reviewing the ebuild before it get pushed to users.

Broken every other week

I’m seriously disappointed in Gentoo; even after my last post on the topic the Gentoo quality seems to be on a downward spiral, rather than climbing up. The problem is social: too many developers don’t give a f♥♥k.

Let me be clear: I’m not pretending that we should be never wrong, and that mistakes can’t happen. We are, they happen. This is why we have the unstable/testing support. But at the same time, we should still try to think twice and commit once.

Hey, I screwed up Friday night. I launched a double-commit without checking its status, committing virtual/ruby-ssl and www-servers/thin… but the former’s commit was rejected by repoman and I didn’t even notice. Thankfully Mr Bones was around to notify me, and Luca was able to commit a workaround for me until I got home (I was at my sister’s for her birthday dinner).

So what kind of problems upset, demoralise and disappoint me? It’s the most pernicious ones; like those we’ve been warning about for years, those that we could see coming from a mile away, and those that are caused by sheer lack of testing for important packages.

Say you’re a developer who’s already been warned to doublecheck what you do, and asked to use package.mask before committing big changes regarding the language the official package manager is written in. When you’re just disregarding this, trying to force a new version of the language altogether; ignoring the whole (known) problem of dependencies on multiple implementations, … you probably are screaming as loud as you can “I’m going to break the tree sooner or later”.

Say you’re an old-time developer that was known to do a relatively clean work albeit scarcely documented, and you start making mistakes like forgetting to update your own wrapper when bumping autoconf (sure it’s masked, but it means you’ve not even barely tested the thing); say you actually commit half-broken ebuilds for a while; I can’t stop wondering what is wrong with you when you actually blame me for not covering your (scarcely documented at all) use case with NFS when I actually fix it to work for Kerberos (before my involvement, it didn’t work for either case, after my original involvement, it covered one case fully). Do you still think you should be the one voicing opinions about the Quality Assurance team?

And if you say that “the new GLIBC works for me”, are you saying that the package itself builds or if it’s actually integrated correctly? Because, you know, I used to rebuild the whole system whenever I made a change to basic system packages when I maintained Gentoo/FreeBSD, and saying that it’s ready for ~arch when you haven’t even rebuilt the system (and you haven’t, or you would have noticed that m4 was broken) is definitely something I’d define as reckless and I’d venture to say you’re not good material to work on the quality assurance status.

I’m just tired, very very tired, of repeating the same post over and over again. Gentoo needs stronger QA. Full stop.

Gentoo’s Quality Status

So I have complained about the fact that we had a libpng 1.4 upgrade fallout that we could have for the most part avoided by using --as-needed (which is now finally available by default in the profiles!) and starting much earlier to remove the .la files that I have been writing for so long about.

I also had to quickly post about Python breakage — and all be glad that I noticed by pure chance, and Brian was around to find a decent fix to trickle down to users as fast as possible.

And today we’re back with a broken, stable Python caused by the blind backport of over 200k of patch from the Python upstream repository.

And in all this, the QA team only seem to have myself complaining aloud about the problem. While I started writing a personal QA manifesto, so that I could officially requests new election on the public gentoo-qa mailing list, I’m currently finding it hard to complete; Mark refused to call the elections himself, and noone else in the QA team seems to think that we’re being too “delicate”.

But again, all the fuck-ups with Gentoo and Python could have been seen from a mile away; we’ve had a number of eclass changes, some of which broke compatibility, packages trying to force users into using the 3.x version of Python that even upstream considers not yet ready for production use, and an absolute rejection of working together with others. Should we have been surprised that sooner or later shit would hit the fan?

So we’re now looking for a new Python team to pick up the situation and fix the problem, which will require more (precious) Tinderbox time to make sure we can pull this without risking breaking more stuff in the middle of it. And as someone already said to me, whoever is going to pick-up Python again, will have at their hand the need to replace the “new” Python packaging framework with something more similar to what the Ruby team has been doing with Ruby NG — which could have actually been done once and for all before this..

Now, thankfully there are positive situations; one is the --as-needed entering the defaults, if not yet as strongly as I’d liked it to be; another is Alexis and Samuli asking me specifically for OCAML and XFCE tinderbox runs to identify problems beforehand, and now Brian with the new Python revision.

Markos also is trying to stir up awareness about the lack of respect for the LDFLAGS variable; since the profile sets --as-needed in the variable, you end up ignoring that if you ignore the variable. (My method of using GCC_SPECS is actually sidestepping that problem entirely.) I’ve opened a bunch of bugs on the subject today as I added the test on the tinderbox; it’s going to be tricky, because at least the Ruby packages (most of them at least) respect the flags set on the Ruby implementation build, rather than on their own, as it’s saved in a configuration file. This is a long-standing problem and not limited to Ruby, actually. I’ve been manually getting around the problem on some extensions such as eventmachine but it’s tricky to solve in a general way.

And this is without adding further problems as that pointed out by Kay and Eray that I could have found before if I had more time to work on my linking collision script — it is designed to just find those error cases, but it needs lot of boring manual lookup to identify the issues.

Now, I’d like to be able to do more about this, but as you can guess, it already eats up enough of my time that I have even trouble fitting in enough work to cover the costs of running it (Yamato is not really cheap to work on, it’s power-hungry, has crunched a couple of hard-disks already, and needs a constant flow of network data to work clearly, and this is without adding the time I pour into it to keep it working as intended). Given these points, I’m actually going to make a request if somebody can get either one of two particular pieces of hardware to me: either another 16GB of Crucial CT2KIT51272AB667 memory (it’s Registered ECC memory) or a Gigabyte i-RAM (or anything equivalent; I’ve been pointed at the ACard’s ANS-9010 as an alternative) device with 8 or 16GB (or more, but that much is good already) of memory. Either option would allow me to build on a RAM-based device which would thus reduce the build time, and make it possible to run many many more tests.