Compounded issues in GLIBC 2.12

I’ve recently ranted about the fact that GLIBC 2.12 was added to portage in a timeframe that, in both my personal and professional opinion, was too short. I’ve not yet dug much into what the problems with that particular version are, and why I think it needed to stay in package.mask for a little longer.

The main reason we’re seeing failures is that the developers have been, for a few versions already, cleaning up the headers so that including only one of them won’t cause a number of other, near-unrelated headers to be included. This is the same problem as I described with -O0 and libintl.h almost two years ago.

Now, on FreeBSD, and Mac OS X by reflection, a number of these cleanups were done a long time ago, or the problem was never introduced: they both try to stick to the minimal subset of interfaces you need to bring in. This is why half the time what you have to do to port something to FreeBSD is just adding a bunch of header inclusions. This should mean the whole situation is easily handled, and should already be fixed in most situations, but that’s far from the case. Not only do a number of projects, for no good reason at all, guard the inclusion of extra headers with #ifdef checks for FreeBSD and Mac OS X, but one of the particular headers cleaned up causes a much worse problem than the usual “foo not defined” or missing constants.

The problem is that the sys/stat.h inclusion has been dropped by a number of headers, which means it has to be added back to the code itself. Unfortunately, since C allows for implicit declarations (Portage does warn of them!), a call to S_ISDIR(mymode) and similar will look to the compiler like a call to an implicitly declared function, and it will then emit a reference to the (undefined) symbol S_ISDIR… which of course is not a function but a macro defined in the header file. Again, this is not as troublesome as it looks: in the best of cases, the linker will catch the symbol as undefined and halt the build. And just to make sure, Portage logs these problems further, stating that they can create problems at runtime; of course that would be more useful if developers actually paid attention and fixed them, but let’s not go there for now.
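To make the failure mode concrete, here is a minimal sketch of the fixed form of such code. The helper names mode_is_directory and is_directory are hypothetical, made up for illustration and not taken from any affected package; the only point is that sys/stat.h is included by the code that uses the macro, rather than inherited from some other header.

```c
/* Include the header that defines the S_IS*() macros directly, instead of
 * relying on another header to drag it in as a side effect. */
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if the mode bits describe a directory.
 * With <sys/stat.h> included, S_ISDIR is a macro testing bits of the mode.
 * Without it, a compiler that still accepts implicit declarations would
 * treat S_ISDIR(mode) as a call to an undeclared function and leave an
 * undefined S_ISDIR symbol in the object file. */
int mode_is_directory(mode_t mode)
{
    return S_ISDIR(mode) ? 1 : 0;
}

/* Hypothetical helper: returns 1 if path names a directory, 0 otherwise
 * (including when stat() fails, e.g. because the path does not exist). */
int is_directory(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;

    return mode_is_directory(st.st_mode);
}
```

Calling is_directory("/") from a small main() should return 1 on any Unix-like system, while a path that does not exist returns 0.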

The real problem comes when this kind of mistake is present in shared objects; by default ld will allow shared objects to have undefined references, especially if they are plugins (since then the symbols may be resolved by the host program, if not by a library that they can be linked to). Of course, most sane libraries will be using --no-undefined for the final link… but not all of them do so. A common problematic case is Ruby and its extensions – at least outside of Gentoo, given that all our current Ruby ebuilds force --no-undefined at extension link time. The only warning you get there is the implicit declaration of the fake function, and that is far too easy to overlook.

And before you suggest that: no, -Werror-implicit-declaration is not going to be a good idea: most of the autoconf tests fail if that is passed, which results in a non-buildable system; that’s why it’s one of the few flags I play with on all my autoconf-based projects!

Also, as Ruby itself proved – ask me again why I always rant when I write about Ruby – there are ways around the warning that don’t constitute a fix even for the most optimistic of the developers.

But then, you have the million euro question: how severe is this bug? The only way to judge that is to understand what symptoms it causes. Given that this is not a security feature, you’re not going to have security bugs here, but you have:

  • build failures: obnoxious but manageable; things just failing to build are not going to worry me most of the time; when the fix is as easy as adding a missing include line, I have no reservations about going ~arch;
  • runtime load/startup failure: already a bit less acceptable; failing to load a plugin/extension with an “undefined symbol” error is rarely something users will like to see; it’s bad but not too bad;
  • runtime aborts: now this is what I’m upset about; having missing symbols at runtime means that you can trust neither the software you built, nor even the software that starts up cleanly; aborting at runtime means that your software might be in the middle of some transaction when it reaches the error situation, and could even corrupt your data; this is made more likely by the fact that it’s part of the stat(2) handling code!

There is a good thing though: -Wl,-z,now, which is part of the default settings for hardened profiles, or its runtime equivalent of setting LD_BIND_NOW=1 in the environment, will ensure that all the symbols are bound when starting up the process, rather than lazily while the code executes; it can reduce the risk of the missing symbol being hit in the middle of a transaction. Unfortunately it does not work the same way with extensions for languages like Ruby and Python, but at least it alleviates the problem a bit.
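For a host program that loads its plugins with dlopen(), the same eager-binding idea can be requested per object with the RTLD_NOW flag, so a missing symbol shows up as a load error rather than an abort halfway through the work. What follows is only a sketch under assumptions: load_plugin_eagerly is a hypothetical helper, it assumes a POSIX dlopen() (older glibc versions need -ldl at link time), and it says nothing about how Ruby or Python actually load their extensions.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: load a plugin and force all of its undefined symbols
 * to be resolved immediately. Returns the dlopen() handle, or NULL if the
 * object cannot be loaded or any symbol it needs cannot be resolved. */
void *load_plugin_eagerly(const char *path)
{
    /* RTLD_NOW makes the dynamic linker resolve every undefined symbol
     * before dlopen() returns, instead of lazily on first call. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);

    if (handle == NULL)
        fprintf(stderr, "cannot load %s: %s\n", path, dlerror());

    return handle;
}
```

A plugin built against the cleaned-up headers without the missing include would then be rejected at load time with an “undefined symbol” message from dlerror(), which falls into the second, less harmful category of the list above.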

Combine what I just wrote with the fact that even a package that is part of the system set (m4, used by autoconf, not something you can live without) failed to build with this glibc version by the time it went unmasked, which shows that the developer who chose to unmask it had not rebuilt their own system, and you might guess why I ended up using such strong words in my previous post.