I’ve recently ranted about the fact that GLIBC 2.12 was added to portage in a timeframe that, in both my personal and professional opinion, was too short. But I haven’t dug much into what the problems with that particular version are, and why I think it needed to stay in package.mask for a little longer, so let me do that now.
The main reason we’re seeing failures is that the developers have been, for a few versions already, cleaning up the headers so that including one of them no longer causes a number of other, near-unrelated headers to be included as well. This is the same problem I described with -O0 and libintl.h almost two years ago.
Now, on FreeBSD, and on Mac OS X by reflection, a number of these cleanups were done a long time ago, or the problem was never introduced in the first place: both try to stick to the minimal subset of interfaces you need to bring in. This is why half the time porting something to FreeBSD just means adding a bunch of header inclusions. This should mean the whole situation is easily handled, and already fixed in most cases, but that’s far from true. Not only do a number of projects, for no good reason at all, protect the inclusion of the extra headers with #ifdef checks for FreeBSD and Mac OS X, but one of the headers that was cleaned up causes a much worse problem than the usual “foo not defined” or constant-not-found errors.
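To make the anti-pattern concrete, here is a minimal sketch (not lifted from any specific project) of the kind of guard I mean:

```c
/* The extra include is guarded for FreeBSD/Mac OS X only, so on glibc the
 * code keeps relying on <sys/stat.h> being dragged in indirectly by other
 * headers, and breaks the moment those headers are cleaned up. */
#if defined(__FreeBSD__) || defined(__APPLE__)
# include <sys/stat.h>  /* should simply be included unconditionally */
#endif
```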
The problem is that the stat.h inclusion has been dropped from a number of headers, which means it has to be added back to the code itself. Unfortunately, since C allows implicit declarations (Portage does warn about them!), a call to S_ISDIR(mymode) and the like will look to the compiler like an implicit function declaration, and will then emit a reference to the (undefined) symbol S_ISDIR… which of course is not a function but a macro defined in the header file. Again, this is not as troublesome as it looks: in the best of cases the linker will catch the symbol as undefined and halt the build. And just to make sure, Portage additionally logs these problems, stating that they can create problems at runtime; of course that would be more useful if developers actually paid attention and fixed them, but let’s not go there for now.
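A minimal sketch of the failure mode (the function name is made up):

```c
#include <sys/stat.h>  /* the header that others used to drag in implicitly */

/* If the include above is missing, S_ISDIR is no longer defined as a macro;
 * C's implicit-declaration rules make the compiler assume an external
 * function int S_ISDIR(), and the object file is left with an undefined
 * symbol that no library will ever provide. */
int is_directory(mode_t mode)
{
    return S_ISDIR(mode);  /* a macro from <sys/stat.h>, not a function */
}
```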
The real problem comes when this kind of mistake is present in shared objects: by default ld allows shared objects to have undefined references, especially if they are plugins (since the symbols may then be resolved by the host program, if not by a library they can be linked to). Of course, most sane libraries will use --no-undefined for the final link… but not all of them do. A common problem situation is Ruby and its extensions, at least outside of Gentoo, given that all our current Ruby ebuilds force --no-undefined at extension link time. The only warning you get there is the implicit declaration of the fake function, and that is all too easy to overlook.
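To see how that slips through, consider a hypothetical plugin (file name and function made up; the flags are standard GNU ld):

```c
/* plugin.c: note that <sys/stat.h> is deliberately missing.
 *
 *   cc -shared -fPIC plugin.c -o plugin.so
 *     links fine: S_ISDIR is simply left as an undefined symbol, and the
 *     failure only shows up when the plugin is loaded and the code runs.
 *
 *   cc -shared -fPIC -Wl,--no-undefined plugin.c -o plugin.so
 *     fails right away with: undefined reference to `S_ISDIR'
 */
int plugin_is_dir(unsigned int mode)
{
    return S_ISDIR(mode);  /* implicitly declared: only a compiler warning */
}
```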
And before you suggest it: no, -Werror=implicit-function-declaration is not going to be a good idea either; most of the autoconf tests fail if that is passed, which results in a non-buildable system. That’s why it’s one of the few flags I play with on all my autoconf-based projects!
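To see why, here is a simplified sketch of the kind of test program older autoconf macros generate (not a verbatim conftest.c):

```c
/* conftest.c (simplified): exit() is called without <stdlib.h>, so with
 * -Werror=implicit-function-declaration the test fails to compile and
 * configure wrongly concludes that the probed feature is missing. */
int main(void)
{
    exit(0);
}
```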
Also, as Ruby itself proved (ask me again why I always rant when I write about Ruby), there are ways around the warning that don’t constitute a fix even for the most optimistic of developers.
But then you have the million-euro question: how severe is this bug? The only way to judge that is to understand what symptoms it causes. Given that this is not a security feature, you’re not going to have security bugs here, but you do have:
- build failures: obnoxious but manageable; things failing to build don’t worry me most of the time, and when the fix is as easy as adding a missing include line, I have no reservations about going ~arch;
- runtime load/startup failures: already a bit less acceptable; failing to load a plugin/extension with an “undefined symbol” error is rarely something users like to see; it’s bad, but not too bad;
- runtime abortions: now this is what I’m upset about; missing symbols at runtime mean that you can trust neither the software you built nor even the software that starts up cleanly; aborting at runtime means your software might be in the middle of some transaction when it reaches the error, and could even corrupt your data; and this is made all the more likely given that the missing macro is part of the stat(2) handling code!
There is a good thing though: -Wl,-z,now, which is part of the default settings for hardened profiles and whose effect can also be obtained at runtime by setting LD_BIND_NOW=1 in the environment, will ensure that all the symbols are bound when the process starts up, rather than lazily while the code executes; it reduces the risk of the missing symbol being hit in the middle of a transaction. Unfortunately it does not work the same way for extensions of languages like Ruby and Python, but it at least alleviates the problem a bit.
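For what it’s worth, the same eager binding can be requested explicitly when loading a single module; a minimal sketch, with a hypothetical plugin path:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Build with: cc main.c -ldl */
int main(void)
{
    /* RTLD_NOW forces all symbols to be resolved at dlopen() time, so a
     * missing S_ISDIR surfaces as a load failure here, rather than as an
     * abort in the middle of whatever the plugin was doing. */
    void *handle = dlopen("./plugin.so", RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "load failed: %s\n", dlerror());
        return 1;
    }
    dlclose(handle);
    return 0;
}
```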
Combine what I just wrote with the fact that even a package in the system set (m4, used by autoconf, not something you can live without) still failed to build with this glibc version by the time it went unmasked, showing that the developer who chose to unmask it had not rebuilt his own system, and you might guess why I ended up using such strong words in my previous post.
Implicit declarations are just bad. You are missing the wonderful implicit int thingy.
LD_BIND_NOW=1 might break certain mesa/xorg setups IIRC; if you end up with non-working OpenGL, check for it.
Guh, how do they break it? And luckily there doesn’t seem to be any implicit declaration in there that’s quite that disruptive.
mesa and xorg have a quite _peculiar_ loader for modules and submodules. Hopefully it has been fixed: http://cgit.freedesktop.org…
It isn’t fixed. xorg-server-1.8.99.906 doesn’t even start with LD_BIND_NOW=1 for me. Take a look at this: http://www.gentoo.org/proj/…
Diego, so many times I see You pointing out the things we need to know, as users, administrators and (future?) devs. But I would like to ask You to please list the ways we as users can help. Do we need to download catalyst and pick a build each to test? What can the average user do to take up the slack, and help find, document and fix bugs? And when we have, how can we make sure there is a CHANCE in a BLIZZARD someone will actually commit on them? Can we do it ourselves under peer review? How can we help?
When I get my new quad-core I will be building packages daily in small sets with the current catalyst scripts, which haven’t been released for ages. But then there is no manual. I approached wolf31o2 years ago about writing one, even did code analysis and found, like, dead switches in catalyst etc., so I knew what I was doing. D’you think I was taken seriously? And this was in 2005. The same tired releng website has sat there for the last 5 years, saying catalyst 1.x is deep-sixed and 2.x has no manual.
It was about that time that I gave up trying to think about doing any serious work, but I am now so distressed at the level of dismay in the community over releng and QA that I am thinking of doing some more. But the thing I ask is this: if I do all this, will it get committed? How can I offer it to the devs to check… there seems to be no path even for people with a lot of time (I’m on disability) and the motivation to do it. QA is broken. God knows releng is too; they need to test the livecds better. I still get udev /dev/pts/ related issues there too, from a poorly configured kernel, on the minimal install CD, a package it would take maybe a week to completely recompile and test on a P4. How can I fix this? If I did, WHO WOULD CARE? YOU? ME? …ANYONE? What can we do to help, Master?
Diego, about the breakage of LD_BIND_NOW=1: look inside Gentoo’s Bugzilla and you will find bugs about it breaking with hardened. Currently it seems like mesa classic is fixed, but gallium3d does not work for either radeon or nouveau.