I’m in my network, monitoring!

While I was originally supposed to come here to Los Angeles to work as a firmware development engineer, I’ve ended up doing a bit more than I was called for… in particular, it seems like I’ve been enlisted to work as a system/network administrator as well, which is not that bad to be honest, even though it still means that I have to deal with a number of old RedHat and derivative systems.

As I said before, this is good because it means that I can work on open-source projects, and Gentoo maintenance, during work hours, as the monitoring is done with Munin and, lately, Icinga, running on Gentoo. The main issue is of course having to deal with so many different versions of RedHat (there is at least one RHEL3, a few RHEL4, a couple of RHEL5 – and almost all of them don’t have subscriptions – some CentOS 5, plus the new servers that are Gentoo, luckily), but that’s not the only one.

Last week I started looking into Icinga to monitor the status of services: while Munin is good for knowing how things move over time and for having an idea of “what happened at that point”, it’s still not extremely good if you just want to know “is everything okay right now or not?”. I also find most Munin plugins simpler to handle than Nagios’s (which are what Icinga would be using), and since I already want the data available on graphs, I might just as well forward the notifications. This of course does not apply to boolean checks, which are pretty silly to implement in Munin.

There is some documentation on the Munin website on how to set up Nagios notifications, and it mostly works flawlessly for Icinga, with the one difference that you have to change the NSCA configuration, as Icinga uses a different command file path and a different user, which means you have to set up:

nsca_user=icinga
nsca_group=icinga

command_file=/var/lib/icinga/rw/icinga.cmd

I’m probably going to make the init script use a selectable configuration file and install two pairs of configuration files, one in /etc/icinga and the other in /etc/nagios, so that each user can choose which ones to use. This should make it easier to set up.
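
To give an idea of what I have in mind, here is a minimal sketch of the selectable-configuration approach; the conf.d variable name is my own invention, and the paths are from memory, so don’t take them as what the final ebuild will ship:

# /etc/conf.d/nsca — pick which configuration to use
NSCA_CONFIGFILE="/etc/icinga/nsca.cfg"

# /etc/init.d/nsca (excerpt)
start() {
    ebegin "Starting nsca"
    start-stop-daemon --start --exec /usr/sbin/nsca -- \
        -c "${NSCA_CONFIGFILE:-/etc/nagios/nsca.cfg}" --daemon
    eend $?
}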

So while I don’t have much to say for now, and I have had little time to post about this in the past few days, my plan, in regard to Icinga and Munin, consists primarily of cleaning up the nagios-plugins ebuild (right now it just dumps all the contrib scripts without caring about them at all, let alone caring about their dependencies), and writing documentation on the wiki about Icinga the way I cleaned up the Munin documentation — speaking of which, Debian decided to disable CGI in their packages as well, so now the default is to keep CGI support disabled unless required, and it’s provided “as is”, without any warranty that it works. I also have to finish setting up the Munin async support, which is certainly going to be useful at this point.

I’m also trying to fit in Ruby work as well as the usual Tinderbox mangling, so… please bear with my lack of updates.

Autotools Mythbuster: On parallel testing

A “For A Parallel World” crossover!

Now that the tinderbox is actually running pretty well and the logs are getting through just fine, I’ve decided to spend some more time expanding the Autotools Mythbuster guide with more content, in particular in areas such as porting to automake 1.12 (and 1.13).

One issue which I’ll have to discuss in that guide soon, and which I’m already posting about right now, is parallel testing: it’s not really well known, and, at least for Gentoo, it ties into the EAPI=5 discussion.

Build systems using automake have a default target for testing purposes called check. This target is designed to build and execute testcases, in a pretty much transparent way. Usually this involves two main variables: check_PROGRAMS and TESTS. The former defines the binaries to build for the testcases, the latter the testcases to run.

This is counter-intuitive and might actually sound silly, but in some cases you want to build test programs as binaries, yet run scripts that call them and compare their output. This is often the case when you test a library, as you want to compare the output of a test program against a known-good output.
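
As a minimal sketch (all the names are made up for illustration), this is roughly what such a setup looks like in Makefile.am: the binary gets built as part of check, but the harness runs the script, which in turn calls the binary and compares its output with the known-good file:

check_PROGRAMS = test-output
test_output_SOURCES = test-output.c
test_output_LDADD = libfoo.la

TESTS = compare-output.sh
EXTRA_DIST = compare-output.sh expected-output.txt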

Now, up to automake 1.12, if you run make -j16 check, what is parallelized is only the building of the binaries and targets; you can for instance make use of this with check_DATA to preprocess some source files (I do that for unpaper, which only ships the original PNG files of the test data in the repository), but if your tests take time, and you have little that needs to be built, then running make -j16 check is not going to be a big win. This, added to the chance that the tests might just not work in parallel, is why the default up to now in Gentoo is to run the tests serially.

But that’s why recent automake versions introduced the parallel-tests option, which is actually going to be the default starting from 1.13. In this configuration, the tests are executed through a driver script, which launches multiple tests at once and then collects their results. Note that this is just an alternative default test harness; automake actually supports custom harnesses as well, which may or may not run in parallel.
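
For reference, before 1.13 you have to opt in explicitly; a minimal way to do that, assuming you already require a recent enough automake, is in configure.ac:

dnl configure.ac — opt in to the parallel test harness before
dnl automake 1.13 makes it the default
AM_INIT_AUTOMAKE([1.11 parallel-tests])

The same option can also be listed in AUTOMAKE_OPTIONS inside Makefile.am, if you prefer to keep it there.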

Anyway, this is something that I’ll have to write about in more detail in my guide — please be patient. In the mean time you can look at unpaper as an example, as I just updated the git tree to make the best use of the parallel test harness (it actually saved me some code).

For A Parallel World: Parallel building is not passé

It’s been a while since I last wrote about parallel building. This has only to do with the fact that the tinderbox hasn’t been running for a long time (I’m almost set up with the new one!), and not with the many people who complained to me that spending time in getting parallel build systems to work is a waste of time.

This argument has been bolstered by the presence of a --jobs option in Portage, with those people insisting that in the future Portage will build packages in parallel, so that the whole process takes less time, rather than shortening the individual build times. I said before that I didn’t feel it was going to help much, and now I definitely have some first-hand experience to tell you that it doesn’t help at all.

The new tinderbox is a 32-way system; it has two 16-core CPUs, and enough RAM for each of them; you can easily build with 64 processes at once, but I’m actually trying to push it further by using the unbound -j option (this is not proper, I know, but still). While this works nicely, we still have too many packages that force serial building due to broken build systems, and a few that break under these conditions but would very rarely break on systems with just four or eight cores, such as lynx.

I then tried, during the first two rebuilds of world (one to set my choices of USE flags and packages, the other to rebuild it hardened), running with five jobs in parallel… between the issue of the huge system set (yes, that article is over four years old) and the fact that it’s much more likely to have many packages depending on one, rather than one depending on many, this still does not saturate the CPUs if each package is still building serially.

Honestly, seeing such a monstrous system take as long as my laptop, which has a quarter of the cores and a quarter of the RAM, to build the basic system was a bit… appalling.

The big trouble seems to be with packages that don’t use make, but that could, under certain circumstances, perform parallel building. The main problem there is that we still don’t have a variable that tells us exactly how many build jobs to start, instead relying on the MAKEOPTS variable. Some ebuilds actually try to parse it to extract the number of jobs, but that fails with configurations such as mine. I guess I should propose that addition for the next EAPI version… then we might actually be able to make use of it in the Ruby eclasses to run tests in parallel, which would make testing so much faster.
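
To show why parsing MAKEOPTS is fragile, here is a hedged sketch of the kind of extraction ebuilds attempt; the exact expression varies from ebuild to ebuild, this is not a copy of any of them:

# naive job-count extraction from MAKEOPTS
jobs=$(echo "${MAKEOPTS}" | sed -ne 's/.*-j[[:space:]]*\([0-9]\+\).*/\1/p')
# with an unbounded "-j", or with the long "--jobs N" spelling, the
# pattern doesn't match at all, so the caller is left guessing:
: "${jobs:=1}"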

Speaking about parallel testing, the next automake major release (1.13 — 1.12 was released but it’s not in tree yet, as far as I can tell) will execute tests in parallel by default; this was optional starting with 1.11 and now it’s going to be the default (you can still opt out, of course). That’s going to be very nice, but we’ll also have to change our src_test default, which still uses emake -j1 and thus forces serialisation.

Speaking of which, even if your package does not support parallel testing, you should use parallel make, at least with automake, to call make check; the reason is that the check target also builds the tests’ utilities and units, and the build can be sped up a lot by building them in parallel, especially for test frameworks that rely on a number of small units instead of one big executable.
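
In ebuild terms that can be as simple as the following sketch, assuming an automake-based package whose pre-1.13 harness runs the tests serially anyway:

src_test() {
    # let make build the test programs with the user's full MAKEOPTS;
    # the old serial harness still runs the tests one at a time
    emake check || die "tests failed"
}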

Thankfully, for today there are two more packages fixed to build in parallel: Lynx (which goes down from 110 to 46 seconds to build!) and Avahi (which I fixed so that it installs in parallel fine).

Are there still qmail users in Gentoo?

This might sound like a stupid question, but judging from the tinderbox results, qmail is in need of a new maintainer, or set of maintainers.

Let’s start with the old app-doc packages that have been lingering around since 2002, taking app-doc/ucspi-tcp-man as an example. It was originally installed as a separate package, but since 2008, with the 0.88-r17 revision of sys-apps/ucspi-tcp, the man pages were merged in. Yet it took two years for the maintainer to file a stable request, and it was only this year that it finally went stable. A similar issue happened with daemontools, and both of those required me to go through and remove the old ebuilds and the -man packages that cause so much trouble with my tinderbox setup.

But there are many more issues with the related packages; for instance the huge list of man page collisions between many qmail-related packages, or the fact that some ebuilds have bogus src_test functions (which I’m now removing myself without even waiting for maintainers at this point).

All in all, the long series of bugs, some of which appear to be security-related, makes me wonder whether we need a more “hands-on” maintainer team for qmail and the rest of the djb software, which usually seems to be quite tightly related. So if you are a qmail user in Gentoo, please look at the proxy maintainers project that Markos announced two days ago; your help will be appreciated.

Pruning automake

So yesterday I found out about the libtool 2.4 problems, and today I spent quite a bit of time working on fixing the situation, and planning how to avoid this ever happening again. I think at this point the only thing we can do is start getting rid of the oldest automake versions altogether.

This is easier said than done; while we can probably safely decide that any pre-1.9 version of automake (that is, all those that don’t work with libtool 2.4) can be considered dead and has to be moved away from, doing so on a regular basis is going to be troublesome: when do we decide that an automake version is too far gone? When libtool stops working? When Perl stops working? When autoconf stops working?

But let’s start with why this is happening, or rather, why does autotools.eclass allow choosing a different automake version at all? The answer is mine to give, most likely, given that I am the original designer and implementer of the eclass.

When I designed it, I designed it to have three alternative interfaces: eautoconf, eautoreconf and eautomake. These three were supposed to cover all the use cases for autotools in ebuilds, trying not to do too much, nor too little — complications like intltoolize, gettextize and phpize actually escaped me at the time.

  • eautoreconf is the common take-it-all approach: it rebuilds all the autotools from scratch, calling aclocal, libtool (if needed), automake and so on; this is the default approach you should take if you’re uncertain about what you have to do (see the sketch after this list);
  • eautoconf is designed to be used only when there is no Makefile.am file, and thus automake need not run; if you’re going to rebuild configure for a package that does use automake, you shouldn’t use this at all; on the other hand, it doesn’t enforce WANT_AUTOMAKE=none since you can use aclocal (part of automake) without automake (see my documentation on the matter);
  • finally, eautomake was designed to optimise autotools rebuilds: rather than rebuilding the whole chain, it was supposed to only call automake, as long as the current build system was built with the same exact version, and was to be used when only Makefile.am was modified, so that configure would be left untouched; since you’re supposed to use the same version, here’s where WANT_AUTOMAKE was really necessary.
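
To make the intended usage concrete, here is a minimal ebuild sketch (the patch name is hypothetical); WANT_AUTOMAKE is read when the eclass is inherited, so it goes before the inherit line:

# hypothetical foo-1.0.ebuild (excerpt)
WANT_AUTOMAKE="1.11"   # only matters when automake actually gets run
inherit autotools eutils

src_prepare() {
    # the patch touches configure.ac and Makefile.am
    epatch "${FILESDIR}"/${P}-fix-as-needed.patch
    # rebuild the whole chain, the safe default
    eautoreconf
}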

Unfortunately, even though eautomake is much faster, its limitations nowadays make me think it was a horrible idea to implement it at all. First of all, it’s often misused: as I said, it works only as long as the version used to build the patched build system is the same as the automake it’s going to use locally; this means that if you just leave WANT_AUTOMAKE at the default (latest), or set a different value than what the original package was built with, then it’s going to behave just like eautoreconf (indeed, as soon as it finds different automake versions it’ll scream treason and proceed to a full rebuild); but even when used properly, there is not enough micro-version compatibility, so a build system built with automake 1.9.2 will require a full rebuild anyway, given that our latest supported 1.9 version is 1.9.6.

This automatic/automagic default of rebuilding all the autotools when the situation is unsafe also causes a major problem: given that on a micro version bump of automake we move from running automake only to a full rebuild, such a micro bump can cause problems to appear with newer libtool or autoconf packages just the same. And since I don’t usually go around testing on micro bumps, it’s more likely to create hidden problems.

Given all this, I’m actually considering asking to mark eautomake as deprecated, start removing its use from the tree, and add a QA warning to stop using it as soon as possible. Of course this won’t mean that WANT_AUTOMAKE will go away just yet; there are some cases where actually supporting newer automake is more work than it’s worth, but luckily such changes right now seem to be limited to the 1.9→1.10 bump, which is why I decided to set the lower bound at that version rather than sticking with 1.10 and 1.11, which are the newest. But even so, removing the whole idea of eautomake should reduce the trouble to maintainable levels.

At any rate, whatever the outcome will be, I’m pretty sure I’ll be handling most of the porting myself, but at least I hope to be able to write up any particularly confusing situation in Autotools Mythbuster — after all, I’m doing it to help others, am I not?

For A Parallel World: ebuild writing tip: faster optional builds

Today, lurking on #gentoo-hardened, I came across an ebuild written particularly badly, one that pushed to the extreme a very bad construct for what concerns parallel builds (which are a very good thing with modern multi-core, multi-thread CPUs):

src_compile() {
  if use foo; then
     emake foo || die
  fi

  if use bar; then
    emake bar || die
  fi

  if use doc; then
    emake doc || die
  fi
}

This situation wastes a lot of processing power: the three targets, with all their dependencies, will be taken into consideration serially, not in parallel; if you requested 12 jobs, but foo and bar each only have three object files as dependencies, they should have been built at the same time, not in two different invocations.

I admit I have made this mistake before, even quite recently, mostly with documentation building; so how does one solve this problem? Well, there are many options, my favourite being something along these lines:

src_compile() {
  emake \
    $(use foo && echo foo) \
    $(use bar && echo bar) \
    $(use doc && echo doc) \
    || die "emake failed"
}

Of course this has one problem, in that there is no target that is always built, so it should rather be something more like this:

src_compile() {
  local maketargets=""

  if use bar ; then
    maketargets="${maketargets} bar"
  else
    maketargets="${maketargets} foo"
  fi

  emake ${maketargets} \
    $(use doc && echo doc) \
    || die "emake failed"
}

This will make sure that all the targets will be considered at once, and will leave make to take care of dependency resolution.

I tried this approach out in the latest revision of the Drizzle ebuild that I proxy-maintain for Pavel; the result is quite impressive because doxygen, instead of taking its dear time after the build completed, runs for about half of the build process (using up only one slot of the twelve jobs I allocate for builds on Yamato).

Obviously, this won’t make any difference if the package is broken with respect to parallel build (and thus uses emake -j1), and it won’t make a difference when you’re not building in parallel, but why not do it right, while we’re at it?

About the new Quagga ebuild

A foreword: some people might think that I’m writing this just to brag about what I did; my sincere reason for writing, though, is to point out an example of why I dislike 5-minutes fixes, as I wrote last December. It’s also an almost complete walkthrough of my process of ebuild maintenance, so it might be interesting for others to read.

For a series of reasons that I haven’t really written about at all, I need Quagga on my homemade Celeron router running Gentoo — for those who don’t know, Quagga is a fork of an older project called Zebra, and provides a few daemons for routing protocols (such as RIP and BGP). Before yesterday, the last version of Quagga in Portage was 0.99.15 (and the stable one is still an old 0.98), but there was recently a security bug that required a bump to 0.99.17.

I was already planning on giving Quagga a bump to fix a couple of personal pet peeves with it on the router; since Alin doesn’t have much time, and also doesn’t use Quagga himself, I’ve added myself to the package’s metadata and started polishing the ebuild and its support files. The alternative would have been for someone to just pick up the 0.99.15 ebuild, update the patch references, and push it out as the 0.99.17 version, which would have qualified as a 5-minutes fix and wouldn’t have solved a few more problems the ebuild had.

Now, the ebuild (and especially the init scripts) makes it clear that they were contributed by someone working for a company that used Quagga; this is a good start from one point of view: the code is supposed to work, since it was used; on the other hand, companies don’t usually care about Gentoo practices and policies, and tend to write ebuilds that could be polished a bit further to actually comply with our guidelines. I like them as a starting point, and I’m used to doing the final touches in those cases. So if you have some ebuilds that you use internally and don’t want to spend time maintaining forever, you can also hire me to clean them up and merge them in tree.

So I started with the patches; the ebuild applied patches from a tarball, three unconditionally and two based on USE flags; the latter two had URLs tied to them that pointed out that they were unofficial feature patches (a lot of networking software tends to have similar patches). I set out to check the patches: one was changing the detection of PCRE, one was obviously a fix for --as-needed, and one was a fix for an upstream bug. All five of them were in a separate patchset tarball that had to be fetched from the mirrors. I decided to change the situation.

First of all, I checked the PCRE patch; actually, the whole PCRE logic inside configure is long-winded and difficult to grok properly; on the other hand, a few comments and the code itself show that the libpcreposix library is only needed on non-GNU systems, as GLIBC provides the regcomp()/regexec() functions. So instead of applying the patch and adding a pcre USE flag, I changed the ebuild to tie the use of PCRE to the implicit elibc_glibc USE flag; one less patch to apply.
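
A sketch of what that looks like in the ebuild; the exact name of the configure switch is an assumption on my part, so double-check it against quagga’s configure --help:

src_configure() {
    local myconf=""
    # GNU libc already provides regcomp()/regexec(); only reach for
    # libpcreposix on the other libcs
    use elibc_glibc || myconf="--enable-pcreposix"
    econf ${myconf}
}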

The second patch I looked at was the --as-needed-related one, which changed the order in which libraries are linked so that the linker wouldn’t drop them; it wasn’t actually as complete as I would have made it. Since libtool handles transitive dependencies fine, if the libcap library is used in the convenience library, it only has to be listed there, not also in the final installed library. Also, I like to take the chance to remove unused definitions in the Makefile while I’m there. So I reworked the patch on top of the current master branch in their Git repository, and sent it upstream hoping to get it merged before the next release.

The third patch is a fix for an upstream bug that hasn’t been merged for a few releases already, so I kept it basically the same. The two feature patches had new versions released, and the Gentoo versions seem to have gone a bit out of sync with the upstream ones; for the sake of reducing Gentoo-specific files and process, I decided to move to the feature patches that the original authors release; since they are only needed when their USE flags are enabled, they are fetched from the original websites conditionally. The remaining patches are too small to be part of a patchset tarball, so I first simply put them in files/ as they were, with mine being a straight export from Git. Thinking about it a bit more, I decided today to combine them into a single file, and just properly handle them with Git on the Gentoo side (I started writing a post detailing how I manage Git-based patches).

Patches done, the next step was clearing up the configuration of the program itself; the ipv6 USE flag handles the build and installation of a few extra daemons specific to the IPv6 protocol; the rest are more or less direct mappings from the remaining flags. For some reason, the ebuild used --libdir to change the installation directory of the libraries, and then later installed an env.d file to set the linker search path, which is generally a bad idea — I guess the intention was just to follow that advice, and not push non-generic libraries into the base directory, but doing it that way is mostly pointless. Note to self: write about how to properly handle internal libraries. My first choice was to see whether libtool set the rpath properly, and in that case leave it to the loader to deal with it. Unfortunately it seems like there is something bad in libtool, and while the rpath worked on my workstation, it didn’t work on the cross-build root for the router; I’m afraid it’s related to the lib vs lib64 paths, sigh. So after testing it on the production router, I ended up revbumping the ebuild already to un-hack it: if libtool can handle it properly, I’ll get that fixed upstream so that the library is always installed, by default, as a package-internal library; in the mean time it gets installed vanilla, as upstream wrote it. It makes even more sense given that there are headers installed that suggest the library is not an internal library after all.

In general, I found the build system of Quagga really messed up and in need of an update; since I know how many projects are sloppy about their build systems, I’ll probably take a look. But sincerely, before that I have to finish what I started with util-linux!

While I was at it, I fixed the installation to use the more common emake DESTDIR= rather than the older einstall (which means that it now installs in parallel as well), and installed the sample files among the documentation rather than in /etc (reasoning: I don’t want to back up sample files, nor do I want to copy them to the router, and it’s easier to move them away directly). I forgot the first time around to remove the .la files, but I did so afterwards.

What remains is actually the most important stuff: the init scripts! Following my own suggestions, the scripts had to be mostly rewritten from scratch; this was also needed because the previous scripts had a non-Gentoo copyright owner and I wanted to avoid that. Also, there were something like five almost identical init scripts in the package, where “almost” refers only to the name of the service itself, which means there were multiple files without any real reason. My solution is to have a single file for all of them, and symlink the remaining ones to it; the SVCNAME variable defines the name of the binary to start up. The one script that differs from the others, zebra (it has some extra code to flush the routes), I also rewrote to minimise the differences between the two (this is good for compression, if not for deduplication). The new scripts also take care of creating the /var/run directory if it doesn’t exist already, which solves a lot of trouble.
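
The shared script boils down to something like the following sketch; the paths, daemon option and service user are illustrative, not a verbatim copy of what ended up in the tree, and ripd, ospfd, bgpd and friends are then just symlinks to it:

#!/sbin/runscript

depend() {
    need net
}

start() {
    # make sure the run directory exists before starting the daemon
    mkdir -p /var/run/quagga && chown quagga:quagga /var/run/quagga
    ebegin "Starting ${SVCNAME}"
    start-stop-daemon --start --exec /usr/sbin/${SVCNAME} -- --daemon
    eend $?
}

stop() {
    ebegin "Stopping ${SVCNAME}"
    start-stop-daemon --stop --exec /usr/sbin/${SVCNAME}
    eend $?
}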

Now, as I said, I committed the first version after trying it locally, and then revbumped it last night after trying it in production; I reworked it a bit harder there. Besides the change in library installation, I decided to add a readline USE flag rather than forcing the readline dependency (there really isn’t much readline-capable on my router, since it’s barely supposed to have me connected), and this also showed me that the PAM dependency was strictly related to the optional vtysh component; and while I looked at PAM, (Updated) I actually broke it (and fixed it back in r2); the code calls pam_start() with a capital-case “Quagga” string, but Linux-PAM puts it in all lower case… I didn’t know that, and I was actually quite sure that it was case-sensitive. Turns out that OpenPAM is case-sensitive while Linux-PAM is not, which explains why it works with one but not the other. I guess the next step on my list of things to do is to check whether it might be broken with the Turkish locale. (End of update)

Another thing that I noticed there is that by default Quagga has been building itself as a Position Independent Executable (PIE); as I have written before, using PIE on a standard kernel, without strong ASLR, has very few advantages, and enough disadvantages that I don’t really like to have it around, so for now it’s simply disabled. Since we do support proper flags passing, if you’re building a PIE-complete system you’re free to enable it; and if you’re building an embedded-enough system, you have nothing else to do.

The result is a pretty slick ebuild, at least in my opinion: fewer files installed, smaller, Gentoo-copyrighted (I rewrote the scripts practically entirely). It handles the security issue but also another bunch of “minor” issues, it is closer to upstream, and it has a maintainer who’s going to make sure that future releases will have an even slicker build system. It’s nothing exceptional, mind you, but it’s what it takes to fix an ebuild properly after a few years spent with bump-renames. See?

Afterword: a few people, seemingly stirred up by a certain other developer, seem to have started complaining that I “write too much”, or insinuating that I actually get something out of writing here. The main thing I get out of it is not having to repeat myself over and over to different people. Writing posts costs me time, keeping the blog running, reachable and so on takes me time and money, and running the tinderbox costs me money. Am I complaining? Not so much; Flattr is helping, but trust me, it doesn’t even cover the costs of the hosting, up to now. I’m just not really keen on being slandered because I write out explanations of what I do and why. So from now on, you bother me? Your comments will be deleted. Full stop.

Don’t try autoconf 2.66 at home just yet!

I have to thank Arfrever for making me notice this with the Ruby 1.9 bug he reported.

The GNU project released autoconf 2.66 two days ago. Very few notable changes are present in it, and just as few were listed beforehand, so I didn’t go out of my way to test it in advance. My bad! Indeed, there is one big nasty change in it, because of which I’d tell all of you to put off the update until I say otherwise. Hopefully it won’t get unmasked in Gentoo for a while either.

There are two main problems with this release; the first is due to the implementation of a stricter macro that ensures the parameters given to it do not vary across executions:

**** The macro AS_LITERAL_IF is slightly more conservative; text containing shell quotes are no longer treated as literals. Furthermore, a new macro, AS_LITERAL_WORD_IF, adds an additional level of checking that no whitespace occurs in literals.

Well, whatever the idea behind this was, it seems to have broken the AC_CHECK_SIZEOF macro: if you pass it [void*] as a parameter, it’ll report it as not being a literal (while it is), causing the following error:

flame@yamato test % cat configure.ac
AC_INIT([foo], [0])

AC_CHECK_SIZEOF([void*])

AC_OUTPUT

flame@yamato test % autoconf
configure.ac:3: error: AC_CHECK_SIZEOF: requires literal arguments
../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from...
configure.ac:3: the top level
autom4te-2.66: /usr/bin/m4 failed with exit status: 1

This would be bad enough. But I got a nastier surprise when running autoreconf over the feng sources, whose build system I wrote myself and, if I may say so, is very well engineered:

flame@yamato feng % autoreconf -fis
configure:6275: error: possibly undefined macro: AS_MESSAGE_LOG_FDdnl
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf-2.66: /usr/bin/autoconf-2.66 failed with exit status: 1

The problem here is almost obvious, and it’s related to the dnl entry at the end of the macro name; the dnl keyword is used as an (advanced) comment delimiter in autoconf scripts, meaning “Discard to Next Line”, and it is often used to keep commands that belong together spread over multiple lines, much like the trailing backslash in many languages. A quick check of the generated configure file brings up this:

        as_fn_error $? "Package requirements (glib-2.0 >= 2.16 gthread-2.0) were not met:

$GLIB_PKG_ERRORS

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

Alternatively, you may set the environment variables GLIB_CFLAGS
and GLIB_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details." "$LINENO" AS_MESSAGE_LOG_FDdnl

You can easily see that the problem here is with the pkg-config macros (pkg.m4). Funnily enough, there is no change related to error reporting listed in the autoconf NEWS file, so I wasn’t expecting this. The problem lies further down, in the pkg-config macro file; it’s not important to fully debug it right now, and it’s actually quite easy to fix in pkg-config itself, but here’s the catch.

Since the pkg.m4 macro file is way too often bundled with the upstream packaging, and its presence overrides the copy from the system, even fixing pkg-config will not fix all the software that carries outdated copies of the macro file.

This is almost the same problem as with libtool 1 vs libtool 2 macro files, with the difference that this one is going to be much, much more common. If you’re a package maintainer, you can do something already, before this even hits the users: remove the pkg.m4 file during the src_prepare() phase; you’re already depending on pkg-config in the ebuild for it to work at build time, and since we don’t split the macro file from the command itself, you can simply rely on its presence on the system.
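
Something along these lines in the ebuild is enough; the m4/ subdirectory is the most common location for the bundled copy, but not a guarantee, so check where the package keeps its macro files:

src_prepare() {
    # drop the bundled, possibly outdated copy; the system one comes
    # with the pkg-config we already depend on
    rm -f m4/pkg.m4
    eautoreconf
}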

In the mean time, I’m not sure if I want to start testing with it just yet or if we should be waiting for 2.67…

Depend on what you use

To this day, we still get --as-needed failures for packages in Gentoo; both for new packages and bumps. To this day, checking the list of reverse dependencies of libpng is not enough to ensure that all the packages build fine with libpng-1.4 (as Samuli found out the hard way). One common problem in both is represented by missing dependencies, which in a big part are caused by transitive dependencies.

Transitive dependencies are those caused by indirect linking; since I don’t want to bore you all by repeating myself, you can read about it in this post and this one and finally another one — yes, I wrote a lot about the matter.

How do transitive, indirect dependencies cause trouble with both --as-needed and with upgrade verification? Well, it depends on a number of factors:

  • the build might work on the developers’ systems because the libraries linked against indirectly bring in the actually needed libraries, either by DT_NEEDED or by libtool archives, but the former libraries aren’t used directly, thus --as-needed breaks the link — misguided link
  • the build might work properly because some of the used (and linked-to) libraries optionally use (and link to) the otherwise missing libraries; this works just as long as they are not built without that support; for instance, you might use OpenSSL and Curl, separately, then link to Curl only, expecting it to bring in OpenSSL… but Curl might be built against GnuTLS or NSS, or neither;
  • the build might work depending on the versions of the libraries used, because one of the linked libraries might replace one library with another, dropping a needed library from the final link.

The same rules generally apply to the DEPEND and RDEPEND variables: relying on another package to bring in your own dependencies is a bad idea; even if you use GTK+, it doesn’t mean that you can avoid listing libpng as a package used, if you use it directly. On the other hand, listing libpng just because it shows up in the final link (especially when not using --as-needed), even though you don’t use it directly, is something you should definitely avoid.
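
A trivial, hedged example of what that means in an ebuild (the package atoms are just illustrative):

# the code includes <png.h> and calls libpng directly, so it gets
# listed even though GTK+ would drag the library in anyway
RDEPEND="x11-libs/gtk+:2
    media-libs/libpng"
DEPEND="${RDEPEND}"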

By ignoring transitive dependencies, you invalidate the dependency tree, which means we cannot rely on it when we’re trying to avoid huge fuckups when an important package changes API and ABI. This is why I (wrongly) snapped back at Samuli for closing the libpng-1.4 tracker before I had the chance to run it through the tinderbox.

Bottom line: please always depend on what you use directly, both in linking and in ebuilds. Please!

Thanks to Odi for letting me know that I used (consistently) the wrong word in the article. This goes to show that I either should stop writing drafts at 3am or I should proofread them before posting.

Ebuilds have to be done right

There is quite some stir right now on the gentoo-dev mailing list following a mass-masking for removal of packages for QA and security reasons; I think that Alec nailed down most of the issues with his comments:

> This thread is yet another proof that we need to introduce a “Upcoming
> masking” for unmaintained packages.

<sarcasm>

Shall I file those forms in triplicate and fax them to the main office sir?

</sarcasm>

Since amazingly I actually started the Treecleaners project; the
intent was actually to fix problems with packages. Part of the
problem is that there are hundreds of packages in the tree and the
fixes vary in complexity so it is difficult to create hard-and-fast
rules on when to keep a package versus when to toss it. One of the
things I like about masking is that it quickly gets people who
actually care about the package up to bat to fix it instead of leaving
it broken for months. I realize maintainers do not exactly enjoy this
kind of poking, however when things have been left for long enough I
believe our options become a bit more limited (in this case, masking
for removal due to unfixed sec bugs.)

Now, this is one issue I already partly addressed in my post about the five minutes fix myth, but I’d like to remind you again that even though we can easily spot some blatant problems with packages, having a package that compiles and that passes the obvious, programmatic QA checks does not really tell you much about the health status of the package; indeed, you won’t know whether the package works at all for the final users. Tying this to another post of mine (incidentally, someone complained about my self-references to posts… should I stop giving pointers and context?), I have to admit that sometimes it’s impossible to have 100% coverage of packages, among other reasons because some packages need particular hardware, or particular software components set up, to be tested effectively. On the other hand, when such a complex setup isn’t strictly needed, we should expect some level of testing when making changes, minor or otherwise.

Sometimes the mistakes are in the messages logged by the ebuild; at other times, the problem is that some important part of the package is missing, for example because the install phase is written manually in the ebuild, and upstream has added some extra utility that is installed by make install but is obviously ignored by the ebuild (this is actually one of the points that Donnie brought up when I suggested overriding upstream build systems with an eclass: we’d have to triple-check new releases to make sure that no further source files, objects or libraries were added since the previously packaged version). All these things are almost impossible to identify in a nice, programmatic, scripted way; they need knowledge of the package, checking the release notes, and having an idea of how to test it.

For instance, I’ve been looking into sys-libs/libnss-pgsql today, as I have an interest in it; the ebuild installs the shared library manually (skipping libtool’s relinking phase, by the way). Why did it do that? It takes four steps rather than the one needed for make install… well, the reason became obvious (but was not commented upon!) after changing it to use make install: a post-install check actually aborted the merge. The problem was that the package installed the Name Service Switch library in /lib, but also installed the static archive and the libtool .la file, both of which are definitely not needed in /lib. The handwritten install solves the symptoms but not the underlying problems (a sketch of the straightforward alternative follows the list):

  • it will still build the static archive (non-PIC) version, causing twice the number of compiler calls;
  • it won’t tell upstream that they forgot one thing in their Makefile.am;
  • it’s still wrong, because the libraries it links to are not available in /lib: it won’t work before /usr is mounted if /usr is on a different partition (who still does that, nowadays?!) — it should be in /usr itself, at this point (and yes, you can do that: both GNU libc and FreeBSD – which has a different NSS interface, by the way – check both /lib and /usr/lib).
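
For reference, the straightforward version looks more or less like this sketch; the library name and location are assumptions based on the description above, and the real fix of course belongs in upstream’s Makefile.am:

src_install() {
    emake DESTDIR="${D}" install || die "emake install failed"
    # upstream's Makefile.am also installs the static archive and the
    # libtool .la file, which have no business being on a live system
    rm -f "${D}"/lib/libnss_pgsql.a "${D}"/lib/libnss_pgsql.la
}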

Incidentally, why does glibc’s default nsswitch.conf use db files for services, protocols, rpc and ethers? Their presence in there means that each time you call into glibc to resolve a port name, it makes eight open() syscalls trying to find the files. It doesn’t sound quite right.

I have patches, and I have a new ebuild; I’ll see about sending them upstream and getting them committed (by someone else, or by picking up maintainership) in the next day or so. In the mean time I have to get back to my work.