Gentoo Linux Health Report

Maybe I should start doing this monthly, with the tinderbox’s results at hand.

So as I said, the Tinderbox is logging away, although it’s still coughing from time to time. What I want to write about here are some of the insights the current logs give me.

First of all, as you might expect considering my previous efforts, I’m testing GCC 4.7. It’s the least I can do of course, and it’s definitely important to proceed with this if we want to have it unmasked in a decent timeframe, instead of the 4.6-like æons (which were caused in part by the lack of continuous testing). The new GCC version by itself doesn’t seem to be too much of a breakthrough anyway: there is the usual set of new warnings that cause packages insisting on using -Werror to fail; there is some more header cleanup that causes software using C and POSIX interfaces in C++ to fail because it doesn’t include the system headers declaring the functions it uses; and there is a stricter application of the C++ name lookup rules, which requires this-> to be prefixed on some accesses to inherited attributes/methods — the two-phase lookup rule for members of dependent base classes, as far as I can tell.
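
To illustrate that last point: under two-phase lookup, unqualified names are not looked up in dependent base classes, so this-> (or an explicit Base&lt;T&gt;:: qualification) is needed to make them dependent. A minimal illustration of mine, rather than code from an actual failing package:

// Members of a dependent base class are not found by unqualified
// lookup inside a template; qualifying with this-> defers the
// lookup to instantiation time.
template <typename T>
struct Base {
    T value;
    T get() { return value; }
};

template <typename T>
struct Derived : Base<T> {
    T doubled() {
        // return value + get();          // rejected: names not found
        return this->value + this->get(); // OK: names are now dependent
    }
};

int main() {
    Derived<int> d;
    d.value = 21;
    return d.doubled() == 42 ? 0 : 1;
}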

The most disruptive issue with the new GCC, though, seems to be another round of stricter validation of the command-line parameters passed to the compiler: options that previous versions silently ignored are now rejected. So for instance the -mimpure-text option, which is supported on SunOS, is no longer usable on Linux — guess which kind of package fails badly due to that? Java native libraries, of course. But it’s not just them; one or two packages failed due to the use of -mno-cygwin, which is also gone from the Linux version of GCC.

So while this goes on, what other tests are being put in place? Well, I’ve been testing GnuTLS 3, simply because this version no longer bundles an antique, unmaintained configuration file parsing library. To be honest, though, I should be working a bit more on GnuTLS, as there is a dependency on readline I’d like to remove… for now I only fixed parallel build, which means the ebuild now builds at full speed.

Oh, and how can I forget that I’m testing the automake 1.12 transition, which I’m also trying to keep documented in Autotools Mythbuster — although I’m still missing the “see also” references, I hope I’ll be able to work on them in the next few days. Who knows, maybe I’ll be able to work on them on the plane (next time, no matter what, I’m going to get the “premium economy” extra — I’m tired of children screaming, and hopefully there are few families ready to pay for the extra).

The most interesting part of this transition is that we get failures for things that really don’t matter to us at all, such as the dist-lzma removal. Unfortunately we still have to deal with those. The other thing is that there are more packages than I expected relying on what is called “automatic de-ANSI-fication” (the conversion of ANSI C sources down to old-style K&R C for ancient compilers), which is also gone in this automake version. Please remember that if you don’t want to spend too much time on fixing this, you can still set WANT_AUTOMAKE=1.11 in your ebuild for the moment — but at some later point you’ll have to get upstream to support the new version.

I still have some trouble with PHP extensions not installing properly. And I have no idea why.

Now, let’s leave aside the number of complaints I could raise about developers who commit outright broken packages without taking care of them, or three- (or four-) year-old bugs still open and still present… I’d like for once to thank those developers who’re doing an excellent job by responding timely to new bugs.

So, thank you Justin (jlec), Michael (kensington), Alfredo Tupone and Hans (graaff) among others. These are just the guys who received most of the bugs… and fixed them quickly enough for me to notice — “Hey did I just file a bug about that? Ah it’s fixed already.”

Apache, Passenger, Rails: log shmock

You might or might not remember my fighting with mod_perl and my finding a bug in the handling of logs when Apache’s error log is set to use the syslog interface (which in my case means metalog). It goes without saying that, for those wondering, the upstream bug is still untouched. This should have told me that there aren’t many people using Apache’s syslog support, but sometimes I’m stubborn.

Anyway, yesterday I finally put into so-called “production” the webapp I described last week for handling customers’ computers. I got it working in no time after mongoid started to behave (tests are still restricted, because a couple fail and I’m not sure why — I’ll have to work on that with the next release, which requires far fewer hacks to test cleanly). I did encounter a nasty bug in best_in_place (http://rubygems.org/gems/best_in_place) which I ended up fixing in Gentoo even though upstream hasn’t merged my branch yet.

By getting it in “production” I simply mean configuring it to run on the twin server of this blog’s, which I’ve been using for another customer as well — and have gotten ready for a third. Since Rails 3.1 was already installed on that box, it was quite easy to move my new app there. All it took was installing the few new gems I needed and…

Well, here’s the interesting thing: I didn’t want my application to run as my user, while obviously I wanted to check out the sources with my user so that I could update it with git… how do you do that? Well, Passenger is able to run the application under whatever user owns the config/environment.rb file, so you’d expect it to be able to run under an arbitrary user as well — which is the case, but only if you’re using version 3 (which is not stable in Gentoo as of yet).

So anyway, I set up the new Passenger to change the user, made public/assets/ and another directory I write to group-writable (the app user and my user are in the same group), and then I was basically done, I thought. I start it up… but the hostname just tells me that “something went wrong”, without any clue as to what.

Okay, so the default for Passenger is to not have any log at all; not a problem, I’ll just increase the level to 1 and see the error… or not? I still get no output in Apache’s error log… which is still set to syslog… don’t tell me… I set Passenger to log to a file, and lo and behold, it works fine. I wonder if it’s time for me to learn Apache’s API and get to fix both, since it looks like I’m one of the very few people who would like to use syslog for Apache’s error log.

After getting Passenger to finally tell me what’s wrong, I found not only the reason why Rails wasn’t starting (I forgot to enable two USE flags in dev-ruby/barby, which I use for generating the QR code on the label), but also this:

Rails Error: Unable to access log file. Please ensure that /var/www/${vhost}/log/production.log exists and is chmod 0666. The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
Please note that logging negatively impacts client-side performance. You should set your logging level no lower than :info in production.

What? Rails is really telling its users to create a world-writable log file when it fails to write to it? Are they freaking kidding me? Is this really a suggestion coming from the developers of a framework for web applications, which should be security-sensitive? … Okay, so one can be smarter than them and do the right thing (in my case, make sure that the log file is actually group-writable), but if this is the kind of suggestion they find proper to give you, it’s no wonder what happened with Diaspora. So it’s one more reason why Rails shouldn’t be for the faint-hearted, and why you should pay a very good sysadmin if you want to run a Rails application.

Oh, and by the way, the cherry on top of this is that instead of just sending the log to stderr, leaving it to Passenger to wrangle – which would have worked out nicely if Passenger had a way to distinguish which app the errors are coming from – Rails also moves the log level to warning, just to spite you. And then it tells you that this impacts performance! Ain’t that lovely?

Plan for the day? If I find some extra free time, I’d like to try out and package (not necessarily in this order) syslogger, so that the whole production.log thing can go away fast.

The pain of installing RT in Gentoo — Part 2

Yesterday I ranted about installing RT for lighttpd, and I was suggested to look at otrs — I did, and even though the lack of webapp-config support/requirement was appealing, having everything installed in /var/lib didn’t sound too good… so I simply set it aside, and tried again with RT.

Since the whole lighttpd issue at least let me read an error, I decided to assume that the PostgreSQL bug I was hitting was the reason why it wouldn’t load webmux.pl either, which meant I could then try Apache once again (I did change the user’s home directory so that it wouldn’t try to use /dev/null). Lo and behold, I was right: webmux.pl was loaded properly… but then I got Apache to crash, exactly like it did on my original server (where the PostgreSQL bug wouldn’t have hit in the first place). Okay, time to debug it.

Since this time it didn’t require me to keep the production web server offline to debug, I decided to spend some time and bandwidth to get to the bottom of the problem. So after a rebuild of Apache, apr and mod_perl with debug symbols, a push to the server, and a binpkg reinstall… I was looking at the backtrace, which pointed at this little piece of code:

#ifdef MP_TRACE
    /* httpd core open_logs handler re-opens s->error_log, which might
     * change, even though it still points to the same physical file
     * (.e.g on win32 the filehandle will be different. Therefore
     * reset the tracing logfile setting here, since this is the
     * earliest place, happening after the open_logs phase.
     *
     * Moreover, we need to dup the filehandle so that when the server
     * shuts down, we will be able to log to error_log after Apache
     * has closed it (which happens too early for our likening).
     */
    {
        apr_file_t *dup;
        MP_RUN_CROAK(apr_file_dup(&dup, s->error_log, pconf),
                     "mod_perl core post_config");
        modperl_trace_logfile_set(dup);
    }
#endif

In particular, the call to apr_file_dup has its second parameter at NULL. What’s going on here? Well, it’s a bit complex. The first problem is that, for whatever reason, up to last night the mod_perl ebuild enabled debug information and tracing by default, without controlling that with a USE flag. While it is true that if you didn’t enable tracing you wouldn’t get any extra log, the tracing code as you can see is quite invasive, so I want to thank Christian for fixing this up with a new debug USE flag since last night.

Anyway, when tracing support is built into mod_perl (even though it is not enabled in the configuration file), the module decides to save a copy of the file pointer in its own configuration, to know where to output tracing if it ever gets requested. But what it doesn’t check is whether said pointer is there at all.
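
The fix seems obvious enough; here is a minimal sketch of the kind of guard I have in mind (my own guess at a patch, not necessarily what upstream will apply):

    if (s->error_log != NULL) {
        /* ErrorLog points to a file: safe to duplicate the handle */
        apr_file_t *dup;
        MP_RUN_CROAK(apr_file_dup(&dup, s->error_log, pconf),
                     "mod_perl core post_config");
        modperl_trace_logfile_set(dup);
    }
    /* with ErrorLog syslog, s->error_log is NULL: skip the dup */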

In my case, Apache is set with ErrorLog syslog; this is simply because I pass most of the information to metalog, which can then filter it down and take care of rotating the log files by itself. This is also one of the very few differences between the Apache configuration for xine-project.org and my servers, since the former just doesn’t get as many ModSecurity hits (it doesn’t have that module enabled at all, to be precise).

It goes without saying that when the error log is set to output to syslog, and not to a file, there is no s->error_log file pointer, which explains the NULL parameter and, in turn, the Apache crash. I reported this as issue #75240 which is yet to be fixed.

So I spent the best part of a month fighting with a very stupid bug. Lovely.

Unfortunately the painful installation is not done yet. There is one more issue: the var directory within the RT installation is marked as writable by the user the code will run as (so the rt user in the case of lighttpd/FastCGI, and apache in the case of Apache’s mod_perl)… but the mason_data sub-directory isn’t! And of course Mason needs to write there the cache files that are then served to the user. So the default installation still requires quiiite a bit of fiddling.

Are we done yet? Not a chance. As you might have guessed at this point, if you know the software I’m working with, I encountered another bothersome point: when using mod_perl, Apache does not drop privileges to a different user, which means that all the RT code is processed as the Apache user. And I don’t like that at all.

It is true that just dropping privileges is not enough all by itself — before moving to a KVM guest, I had quite a bit of problems with Passenger dropping to a different user but not applying processing limits, which didn’t solve the problem of a too-greedy Ruby that should have been killed and wasn’t; luckily you can solve that with grsec by enforcing per-user limits.

So my next step is probably getting Apache’s FastCGI set up and getting RT to use that… after that I will probably move it back to this server, so that the remote PostgreSQL connection is avoided altogether, since the problem with Apache crashing has been found anyway, and even if I have to fall back to mod_perl I know it’ll work just fine as long as I keep the debug USE flag disabled.

Prepare to hear again from me, I’m afraid.

The pain of installing RT in Gentoo

Since some of my customers tend to forget what they asked me to do and then complain that I’m going over budget or over time, I’ve finally decided to bite the bullet and set up some kind of tracker. Of course, given that almost all of said customers are not technical at all, using Bugzilla was completely out of the question. The choice fell on RT, which has a nice email-based interface for them to use.

Setting this up seemed a simple process after all: you just have to emerge the package, deal with the obnoxious webapp-config (you can tell I don’t like it at all!), and set up the database and mod_perl for Apache. It turned out not to be so easy. The first problem is the one I already wrote about, at least in passing: Apache went segfaulting on me when loading mod_perl on this server, and I didn’t care enough to actually go around and debug why that was.

But fear not: since I’ve already rented a second server, as I said, I decided to try deploying RT there. It shouldn’t be trouble, no? Hah, I wish.

First problem is that Apache refused to start because the webmux.pl script couldn’t be launched. Which was at the very least bothersome, since it also refused to actually show me any error message besides repeating to me that it couldn’t load it. I decided against trying mod_perl once again and moved to a more “common” configuration of lighttpd and reverse proxying, using FastCGI.

And here the trouble starts getting even nastier. To begin with, FastCGI requires you to start RT with its own init script; the one provided by the current 3.8.10 ebuild is pretty much outdated and won’t work, possibly at all. I rewrote it (and I’ll see to pushing the rewrite into Portage soon), and got it to at least try starting up. But even then it won’t start. Why is that?

It has to do with the way I decided to set up the database: since the new server will at some point run a series of WordPress instances (don’t ask!), it’ll have to run MySQL, but there will be other web apps that should use PostgreSQL, and as long as performance is not that big an issue, I wanted to keep one database software per server. This meant connecting to PostgreSQL running on Earhart (which is on the same network anyway), and to do so, besides limiting access through iptables, I set it to use SSL. That was a mistake.

Even though you may set authentication to trust in the pg_hba.conf configuration file, the client-side PostgreSQL library tries to see if there are authentication tokens to use, which in the case of SSL can be of two kinds: passwords and certificates. The former is the usual clear-text password; the latter, as the name implies, is an SSL user certificate that can be used to validate the secure connection from one point to the other. I had no interest in using user certificates at that point, so I didn’t care much about procuring or producing any.

So when I start the rt service (without using --background, that is… I’ll solve that before committing the new init script), I get this:

 * Starting RT ...
DBI connect('dbname=rt3;host=earhart.flameeyes.eu;requiressl=1','rt',...) failed: could not open certificate file "/dev/null/.postgresql/postgresql.crt": Not a directory at /usr/lib64/perl5/vendor_perl/5.12.4/DBIx/SearchBuilder/Handle.pm line 106
Connect Failed could not open certificate file "/dev/null/.postgresql/postgresql.crt": Not a directory
 at //var/www/clienti.flameeyes.eu/rt-3.8.10/lib/RT.pm line 206
Compilation failed in require at /var/www/clienti.flameeyes.eu/rt-3.8.10/bin/mason_handler.fcgi line 54.
 * start-stop-daemon: failed to start `/var/www/clienti.flameeyes.eu/rt-3.8.10/bin/mason_handler.fcgi'                                                                       [ !! ]
 * ERROR: rt failed to start

Obviously /dev/null is the home directory of the rt user, which is what I’m trying to run this as, and of course it is not a directory itself, so trying to handle it as a directory will make the calls fail exactly as expected. And if you see this, your first thought is likely to be “PostgreSQL does not support connecting via SSL without a user certificate, what a pain!”… and you’d be wrong.

Indeed, if you look at a strace of psql run as root (again, don’t ask), you’ll see this:

stat("/root/.pgpass", 0x74cde2a44210)   = -1 ENOENT (No such file or directory)
stat("/root/.postgresql/postgresql.crt", 0x74cde2a41bb0) = -1 ENOENT (No such file or directory)
stat("/root/.postgresql/root.crt", 0x74cde2a41bb0) = -1 ENOENT (No such file or directory)

So it tries to find the certificate, doesn’t find it, and proceeds to look for a different one; if even that doesn’t exist, it gives up. But that’s not what happens in the case above. The reason is probably a silly one: the library checks that errno is ENOENT before ignoring the error, while any other error is likely considered fatal.
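
The difference is easy to show with a trivial test program (a sketch of mine to print the two errno values involved, not libpq code):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

int main(void) {
    /* a missing file under a real directory fails with ENOENT, which
     * the library ignores; a path traversing a non-directory such as
     * /dev/null fails with ENOTDIR, which it treats as fatal */
    const char *paths[] = {
        "/root/.postgresql/postgresql.crt",
        "/dev/null/.postgresql/postgresql.crt",
    };
    struct stat buf;
    for (int i = 0; i < 2; i++)
        if (stat(paths[i], &buf) != 0)
            printf("%s: %s\n", paths[i], strerror(errno));
    return 0;
}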

So how do you deal with a similar issue? The obvious answer would be to make the home directory point to the RT installation directory, so that it’s also writable by the user; in most cases this only requires you to set the $HOME variable, but that’s not the case for PostgreSQL, which instead decides to be smarter than that, and looks up the home directory of the user from the passwd file…
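
The two lookup strategies are easy to compare (again a sketch of mine, not the actual libpq logic):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pwd.h>

int main(void) {
    /* what most software honours: the environment variable */
    const char *env_home = getenv("HOME");
    /* what the PostgreSQL client library goes by instead: the passwd
     * database entry, so exporting HOME before starting the service
     * does not help */
    struct passwd *pw = getpwuid(getuid());

    printf("getenv:   %s\n", env_home ? env_home : "(unset)");
    printf("getpwuid: %s\n", pw ? pw->pw_dir : "(unknown)");
    return 0;
}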

So why not change the user’s home directory to the given directory, then? One reason is that you could have multiple RT instances on the same system, mostly thanks to webapp-config; another is that even with a single RT instance, the path to the installed code has the package’s version in it, so you would have to change the user’s home on each update, which is not something you should be looking forward to.

How to solve this whole mess? Well, there is one “solution”, which is what I’m going to do: set up RT on the same system as PostgreSQL, either with lighttpd or by using FastCGI directly within Apache, I have yet to decide. Then there is the actual solution: get the PostgreSQL client library to respect $HOME, and at the same time make it not throw a fit if the home directory is not really a directory. I just don’t think I have time to dedicate to the real fix for now.

Compilers’ rant

Be warned that this post is written in the form of a rant, because I’ve spent the past twelve hours fighting with multiple compilers, trying to make sense of them while trying to get the best out of my unpaper fork thanks to their different analyses.

Let’s start with a few more notes about the Pathscale compiler I already slightly ranted about for my work on Ruby-Elf. I know I didn’t do the right thing when I posted that stuff, as I should have reported the issues upstream directly, but I didn’t have much time, I was already swamped with other tasks, and going through a very bad personal moment, so I quickly wrote up my feelings without doing proper reports. I have to thank the Pathscale people for accepting the critiques anyway, as Måns reported to me that a couple of the issues I noted, in particular the use of --as-needed and the __PIC__ definition, were taken care of (sorta, see in a moment).

First problem with the Pathscale compiler: by mistake I have been using the C++ compiler to compile C code; rather than screaming at me, it passed through properly, with one little difference: a static constant gets mis-emitted, and this is not a minor issue at all, even though I was using the wrong compiler! Instead of having the right content, the constant is emitted as an empty, zeroed-out array of characters of the right size. I only noticed because of Ruby-Elf’s cowstats reporting what should have been a constant in the .bss section. This is probably the most worrisome bug I have seen with Pathscale yet!
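
For reference, the pattern that triggers it looks something like this (a hypothetical reduction of mine, not unpaper’s actual code); cowstats flags the table because it ends up in .bss rather than .rodata:

// A constant lookup table like this should be emitted, content and
// all, into the .rodata section; the buggy build emitted it as a
// zero-filled object of the same size in .bss instead.
static const unsigned char nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

unsigned popcount_byte(unsigned char b) {
    return nibble_popcount[b & 0x0f] + nibble_popcount[b >> 4];
}

int main() {
    // returns 0 with a correct table, 1 with a zeroed-out one
    return popcount_byte(0xff) == 8 ? 0 : 1;
}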

Of course its impact is theoretically limited by the fact that I was using the wrong compiler, but since the code should be written in a way to be both valid C and C++, I’m afraid the same bug might exist for some properly-written C++ code… I hope it might get fixed soon.

The killer feature of Pathscale’s compiler is supposedly optimisation, though, and in that it looks like it is doing quite a nice job: indeed I can see from the emitted assembler that it is finding more semantics in the code than GCC seems to, even though it requires -O3 -march=barcelona to make something useful out of it — and in that case you give up debugging information, as the debug sections may reference symbols that were dropped, and the linker will be unable to produce a final executable. This is hit and miss of course, as it depends on whether the optimiser will drop those symbols, but it makes it difficult to use -ggdb at all in these cases.

Speaking about optimisations, as I said in my other post, GCC’s missed optimisation is still missed by Pathscale, even with full optimisation (-O3) turned on, and with the latest sources. And the wrong placement of static constants that I ranted about in that post is also still not fixed.

Finally, for what concerns the __PIC__ definition that Måns referred to as being fixed, well, it isn’t really as fixed as one would expect. Yes, using -fPIC now implies defining __PIC__ and __pic__ as GCC does, but there are two more issues (a quick way to check is sketched after the list):

  • While this does not apply to x86 and amd64 (it matters just for m68k, PowerPC and SPARC), GCC supports two modes for the emission of position-independent code: one that is limited by the architecture’s maximum global offset table size, and one that overrides such maximum size (I never investigated how it does that, probably through some indirect tables). The two options are enabled through -fpic (or -fpie) and -fPIC (-fPIE), and define the macros as 1 and 2 respectively; Path64 only ever defines them to 1.
  • With GCC, using -fPIE – which is used to emit Position Independent Executables – or the alternative -fpie, of course, implies the use of -fPIC, which in turn means that the two macros noted above are defined; at the same time, two more are defined, __pie__ and __PIE__, with the same values as described in the previous paragraph. Path64 defines none of these four macros when building PIE.
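
If you want to verify a compiler yourself, a trivial program like this one (my own sketch) prints which macros get defined and with which values; build it with the various -fpic/-fPIC/-fpie/-fPIE combinations and compare GCC against Path64:

#include <stdio.h>

int main(void) {
#ifdef __PIC__
    printf("__PIC__ = %d\n", __PIC__); /* 1 for -fpic, 2 for -fPIC on GCC */
#endif
#ifdef __PIE__
    printf("__PIE__ = %d\n", __PIE__); /* defined by GCC only for PIE builds */
#endif
    return 0;
}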

But enough ranting about Pathscale, before they feel I’m singling them out (which I’m not). Let’s rant a bit about Clang as well, the only compiler up to now that properly dropped write-only unit-static variables. I had very high expectations for what concerns improving unpaper through its suggestions, but… it turns out it cannot really create any executable, at least that’s what autoconf tells me:

configure:2534: clang -O2 -ggdb -Wall -Wextra -pipe -v   conftest.c  >&5
clang version 2.9 (tags/RELEASE_29/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
 "/usr/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name conftest.c -mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-linker-version 2.21.53.0.2.20110804 -momit-leaf-frame-pointer -v -g -resource-dir /usr/bin/../lib/clang/2.9 -O2 -Wall -Wextra -ferror-limit 19 -fmessage-length 0 -fgnu-runtime -fdiagnostics-show-option -o /tmp/cc-N4cHx6.o -x c conftest.c
clang -cc1 version 2.9 based upon llvm 2.9 hosted on x86_64-pc-linux-gnu
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /usr/bin/../lib/clang/2.9/include
 /usr/include
 /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.1/include
End of search list.
 "/usr/bin/ld" --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o crtbegin.o -L -L/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/../../.. /tmp/cc-N4cHx6.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed crtend.o /usr/lib/../lib64/crtn.o
/usr/bin/ld: cannot find crtbegin.o: No such file or directory
/usr/bin/ld: cannot find -lgcc
/usr/bin/ld: cannot find -lgcc_s
clang: error: linker command failed with exit code 1 (use -v to see invocation)
configure:2538: $? = 1
configure:2576: result: no

What’s going on? Well, Clang doesn’t provide its own crtbegin.o file for the C runtime prologue (while Path64 does), so it relies on the one provided by GCC, which has to be on the system somewhere. Unfortunately, to identify where this file is… it just tries and misses, path after path.

% strace -e stat clang test.c -o test |& grep crtbegin.o
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/crtbegin.o", 0x7fffc937f170)     = -1 ENOENT (No such file or directory)
stat("/../../../../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/usr/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/../../../crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)

Yes, you can see that it has a hardcoded list of GCC versions that it looks for, from higher to lower, until it falls back to some generic paths (which don’t make that much sense to me, to be honest, but nevermind). This means that on my system, which only has GCC 4.6.1 installed, you can’t use clang. This was reported last week and, while a patch is available, a real solution is still not there: we shouldn’t be patching and bumping clang each time a new micro version of GCC is released that upstream didn’t list already!

Sigh. While GCC sure has its shortcomings, this is not really looking promising either.

Gentoo Public Service Announcement: crossdev, glib and binary compatibility

Quick summary, please read it, as it is important! If you’re running a 64-bit arch such as amd64, you’re safe, and you can read this simply as future reference. The same goes if you have never used sys-devel/crossdev (even better if you have no idea what that is). But if you’re cross-compiling to a 32-bit architecture such as ARM, you probably want to read this, as you might have to restart from scratch; and if you’re running a 32-bit architecture and have sys-devel/crossdev installed, you really want to read this post.

Today started as a fun day: not only did I start filing bugs from a tinderbox that is now running with the gold link editor, but the PaX Team pointed me at an interesting, and unfortunate, issue with crossdev and glib.

The issue shows itself as VMware crashing; this is likely not the only program that does so, as any pre-built binary program using the system glib could crash the same way. The problem? A simple out-of-boundary access related to GMutex (in particular GStaticMutex). The root cause? A bit more complex to understand.

The issue was already tracked down, when relayed to me, to a configure check misreporting the size of pthread_mutex_t (the underlying type used by GMutex). On a 32-bit system the size is 24, on a 64-bit system it is 40… but the build from Portage showed the value, on i386, as 40, taken from the cache – “40 (cached)” in the configure output – which was simply wrong.

This matters: the GMutex and GStaticMutex types are not entirely opaque to their users; being semi-transparent means that their size is to be considered part of the ABI, and thus binary compatibility is broken when it changes. For a properly-built glib this is not a problem: on a given arch the structure will always have the same size. But when you have one built with a misreported value… you’ve got a problem.
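
To see how the size leaks into the ABI, consider this sketch (a hypothetical layout of mine, not the actual glib headers): the semi-transparent type is allocated by the library’s users, often embedded in their own structures, so its size gets baked into every compiled binary.

#include <stdio.h>

/* users allocate this themselves, statically or inside their structs */
typedef union {
    char   storage[40]; /* would be 24 on a correctly-built 32-bit glib */
    double align;
} my_static_mutex;

struct plugin_state {
    my_static_mutex lock;     /* the offset of the next field... */
    int             refcount; /* ...changes if the size above changes */
};

int main(void) {
    printf("%zu\n", sizeof(struct plugin_state));
    return 0;
}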

Of course up to here I didn’t talk about the root cause: why would the Portage-built glib get the wrong value for that size, given it should be an easy test to run? Well, the reason is that the underlying GLIB_SIZEOF macro used to check the size of the structure uses an execution test, which is not cross-compilable. And how does that matter, if you’re not cross-compiling?
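
For clarity, an autoconf execution test of this kind boils down to compiling and running a program along these lines (a rough sketch, not the actual macro expansion), with configure reading the result back from a file; obviously you cannot run it when the binary is built for a different architecture:

#include <stdio.h>
#include <pthread.h>

int main(void) {
    /* configure reads this file to learn the size on the build host */
    FILE *f = fopen("conftestval", "w");
    if (!f)
        return 1;
    fprintf(f, "%u\n", (unsigned)sizeof(pthread_mutex_t));
    fclose(f);
    return 0;
}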

When you install sys-devel/crossdev, the ebuild installs some “site files”, which are used to tell ./configure about known results of execution tests, since you can’t run them when cross-compiling. One of these site files sets the size of pthread_mutex_t (or, to be more precise, the size of GMutex’s underlying object) to 40. But instead of doing so only for 64-bit arches, it does so for all Linux systems, which is bad.

This was then reported as bug #367351, and I just committed a fix that should take care of the issue.

Unfortunately, once the fix is actually deployed and glib is further rebuilt, the ABI of libgthread will change, and the same issue noted with VMware above will happen for all the binaries linked against the previous glib. As a safety measure, even before the new crossdev is actually deployed, run the following set of commands:

# sed -i -e '/^glib_cv_sizeof_gmutex/d' /usr/share/crossdev/include/site/linux
# emerge -1 glib
# revdep-rebuild --library libgthread-2.0.so.0

This will make sure that the correct ABI is present once again in libgthread, and you should be safe from there on. Please note that I honestly have no idea (nor do I have a quick way to check) whether the 32-bit libgthread in the emul packages is good or not. I need to track down who’s managing those right now.

How to tell if you’re affected? First of all you should check whether your platform has a 24-byte or a 40-byte pthread_mutex_t. A quick way to do so is to compile and execute the following test code on the host (the target of the cross-compilation, if that’s what you’re doing!):

#include <stdio.h>
#include <pthread.h>

int main() {
    printf("%zu\n", sizeof(pthread_mutex_t));

    return 0;
}

If it reports 40, you’ve got a 40-byte structure, and thus the injected cache value is consistent with your system: you need not worry any further. Otherwise, you have to consider whether you have a broken glib or not. I don’t really know how to get the compiled value out of a binary glib, so unless you want to look at the build logs, I’d say it is better to consider glib broken if you ever had crossdev installed. Do note: having it installed is enough; you don’t have to have used it at all, which is the worst part of it all.

So if you ever have had crossdev installed in your system, run the three commands I already listed:

# sed -i -e '/^glib_cv_sizeof_gmutex/d' /usr/share/crossdev/include/site/linux
# emerge -1 glib
# revdep-rebuild --library libgthread-2.0.so.0

Please do this as soon as you can; it is important to have a proper ABI in libgthread for the stability of your system.

I wish I could say this is one more good reason to run amd64, but it could have gone the same way for amd64 too: this was just human error, and it was out of sheer luck that we caught it, I guess.

Mister 9% — Why I like keeping bugs open.

Last week, Kevin noted that Gentoo’s Bugzilla – recently redesigned, thanks to all the people involved! – has thousands of open bugs; in particular as of today, we have over 17000 non-resolved bugs. Of those, 1500 have been reported by me, which is about 9% of the total.

For those wondering why there is this disparity, the main reason is that I still report manually (which means that I use my own account) the bugs found by my tinderbox, and thus they add up. Together with that, you have to count the fact that I’m one of those people who like to keep bugs open until they are fixed; this is why I keep insisting that if you add a workaround to a package, you should keep a bug open so that we know the problem is not fixed yet — which happens unfortunately quite often for what concerns parallel build failures.

So how do you shorten the list of open bugs without losing data?

It is obviously natural for people joining a team that maintains a number of general packages, or for those picking up a new package, to go through the open bugs for the package and try to solve/close them. And I think that is the one process we should go through right now; the problem is that you cannot simply resolve the issues with “WONTFIX” as some people proposed: the bugs need to be fixed. You can close a bug with NEEDINFO if the user didn’t provide enough information (it happens with my bugs as well, when I forget to attach a build log), or with TEST-REQUEST if the bug is no longer reproducible on the latest version — and if you do know that the bug is solved, FIXED is a good idea as well.

A number of trivial bugs are also open for areas of Gentoo that aren’t much followed at all nowadays: this includes for instance most of the mail-related packages, some easier to solve than others. Most of the time they simply require a new, dedicated maintainer who uses the software: you cannot really maintain a package you stopped using, and I speak from experience.

Most of the bugs opened by the tinderbox, at least, also fall into the “5 minutes fix” category that I so much loathe: they look so trivial that people are prone to simply ignore the need for action on them, but when you do look into them you end up with a very messed-up ebuild, which makes you spend hours to get it right — when they don’t uncover that the upstream code or build system is so totally messed up that it would have to be rewritten to work properly.

Unfortunately, simply ignoring the “nitpick” QA bugs is not a good idea either: more often than not, the presence of these bugs shows that the package is stale, and the fact that they are relatively easy to solve only goes to show that the package hasn’t been cared for for too long. Closing them out of lack of interest is not going to solve anything at all; it’ll just mean that they’ll be hidden, and re-opened the next time the tinderbox or a user merges the package.

So rather than closing the bugs that look abandoned, waiting for other people to report them again, the solution should be to fix them. Closing them will just ensure that they’ll be reported once again, increasing the size of the Bugzilla database, which is a more serious problem than just inflating a counter.

So if the bugs look too many… proxy-maintain new packages!

Gentoo, a three-headed dog, and me — Kerberos coming to PAM

I’ve been fighting the past few days with finding a solution to strengthen the internal security of my network. I’m doing this for two main reasons: on one side, having working IPv6 on the network means that I have to either set up a front-end firewall on the router, or add firewalls on all the devices, and that’s not really so nice to do; on the other side, I’d like to give access to either containers (such as the tinderbox) or other virtual machines to other people, developers for Gentoo or other projects.

I’m not ready to just give access to them as the network is now, because some of the containers and VMs still have password logins, and from there, well, there would be access to some stuff that is better kept private. Even though I might trust some of the people I’m thinking of giving access to, I won’t trust anybody else’s security practices with access to my systems. And this is even more critical since I have/had NFS-writable directories around the system, including the distfiles cache that the tinderbox works with.

Unfortunately, most of the alternatives I know for this only work with a single user ID, and that means, among other things, that I can’t use them with Portage. So I decided to give NFSv4 and Kerberos a try. Sincerely, I’m not sure if I’ll stick with this, since it makes the whole system a whole lot more complex and, as I’ll show in a moment, it’s also not really solving my problem at its root, so it’s of little use to me.

The first problem is that the NFSv4 client support in Gentoo seems to have been totally broken up to now: bug #293593 was causing one of the necessary services to simply kill itself rather than run properly, and it was fun to debug. There is a library (libgssglue) that is used to select one out of a series of GSS API providers (either Kerberos or others); interestingly enough, this is yet another workaround for Linux missing libmap.conf, and for the general problem with multiple implementations of the same interface. This library provides symbols that are also provided by the MIT-KRB5 GSS API library (libkrb5_gssapi); when linking the GSS client daemon (rpc.gssd) it has to link both, explicitly causing symbol collisions, sigh. Unfortunately this failed for two reasons: the .la files for libtirpc (don’t ask) caused libtool to reorder the linking of libraries, getting the wrong symbols in (bad, and it shows again why we should be dropping those damn files), plus there was a stupid typo in the configure.ac file for nfs-utils where, instead of setting an empty enable_nfsv41 variable, they set enable_nfsv4, which in turn caused libgssglue not to be searched for at all.

The second problem is that right now, as I know way too well, we have no support for Kerberos in the PAM configuration for Gentoo; this is one of the reasons why I was considering more complex PAM configurations. The main problem is that most of the configurations you find in tutorials, and those that were proposed to me, make use of pam_deny to allow using either pam_unix or pam_krb5 at the same time; this in turn breaks the proper login chain used, for instance, by the GNOME Keyring. So I actually spent some time to find a possible solution to this, sketched below. Later today, when I have some extra time, I’ll be publishing a new pambase package with Kerberos support. Note: this, and probably a lot more features of pambase, will require Linux-PAM, because the OpenPAM syntax is just basic, while Linux-PAM allows much more flexibility. Somebody will have to make sure that it can work properly on FreeBSD!
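
To give an idea of the flexibility I mean, here is a hypothetical fragment (my own sketch, not the actual pambase output) using Linux-PAM’s bracketed result syntax to try Kerberos first and fall back to the local password, without going through pam_deny at all:

auth    [success=done new_authtok_reqd=done default=ignore]    pam_krb5.so try_first_pass
auth    required    pam_unix.so try_first_pass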

There is also a request for pam_ccreds to cache the credentials when running offline; I’m curious about it, but upstream does not seem to be working on it as much as it should, so I’m not sure if it’s a good solution.

Unfortunately, as I said, NFSv4 does not seem to be that good a solution: besides the continued lack of IPv6 support (which would have been nice to have, but is not required for me), if I export the distfiles over NFSv4 (with or without Kerberos), the ebuild fetch operation remains stuck in D state (blocked on I/O wait). And if I try to force the unmount of the mounted, blocked filesystem, I get the laptop to kernel panic entirely. Now, to make things easier for me, I’m re-using a Gentoo virtual machine (which I last used for writing a patch for the SCTP support in the kernel) to see if I can reproduce the problem there, and get to fix it, one way or another.

Unfortunately I’ve spent the whole night working and trying to get this running, so now I’ll try to get some rest at least (it’s 9.30am, sigh!). All the other fixes will wait for tomorrow. On the other hand, I’d welcome thank-yous if you find the help on Kerberos useful; organisations who would like to have even better Gentoo support for Kerberos are welcome to contact me as well…

Some GNU mistakes

I got so used to taking crap because I say that GNU is not perfect that, in a sort of masochistic way, I like to try opening the eyes of those who simply don’t care about having their eyes opened.

People who “advocate” Free Software seem to have a very hard time coming to the conclusion that, well, GNU might not be perfect, that some other project might actually have a better development model for Free Software, and might actually be making something better than what GNU has.

We reached the point where the only thing they could attack me on, in my previous criticism of forced features (which didn’t even name GNU!), was my being an atheist. I guess no advocate is an atheist, because they all subscribed to the Church of Emacs… well, I’m definitely not on board with that.

Which is, by itself, a bit strange given that I’m the first person to say that GNU Emacs is a saviour. I love GNU Emacs; it’s one of the best pieces of software out there, I have used it since I started using Linux, and even on OS X I can’t go around without Aquamacs — I’m glad that Stallman wrote it, and it should show that I’m well acquainted with praising GNU when it makes something better than the average!

But we also have to call out the things that don’t make sense, but that we’re still having issues with. For instance, I noted a problem with GNU tar after the GCC upgrade (that’s the GNU Compiler Collection, if you wonder). It turns out that it was GCC identifying a buffer overflow in fortified builds: the GNU tar developers tried to cut corners, leaving a strcpy() call to go over the boundary of a structure member and write to the one following it. Thankfully, Tony had a patch that solved the issue, and GNU tar 1.22-r1 is fine with both 4.4 and 4.5.
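
The pattern in question is roughly the following (a hypothetical reduction of mine, not tar’s actual header layout); with -O2 -D_FORTIFY_SOURCE=2 glibc knows the size of the individual structure member, and aborts on the overflowing strcpy() at run time:

#include <string.h>

struct header {
    char name[8];
    char mode[8];
};

int main(void) {
    struct header h;
    /* nine characters plus the terminating NUL written into an
     * 8-byte member: the copy deliberately spills into h.mode,
     * and a fortified build reports a buffer overflow */
    strcpy(h.name, "123456789");
    return (int)h.mode[0];
}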

At the same time, Tony reported a problem with gawk (which, if you don’t follow, is GNU awk). The problem was worse than a build failure, as the ./configure call was getting stuck during build, which is one of the worst things that can happen (both on a normal system and on the tinderbox) as it wastes time. It wasn’t, though, the basic gawk configure script that froze, but rather the sub-script for the bundled libsigsegv, which in the latest version of gawk for some reason became a USE flag… Besides being another example of why bundling libraries is a bad idea, the problem here was multi-folded: the USE flag was introduced without saying why it was optional at all; it was tied to a dependency on libsigsegv that wasn’t being used, as the build always used the bundled copy. The quick fix would have been to patch configure.ac to allow using the system library, but… gawk developers have no clue about GNU autoconf and GNU automake: they don’t generate the tarball with make dist – I can tell because the gnulib m4 files are missing – and they… commit the generated configure to their CVS repository. In the end, I simply dropped the USE flag from gawk.

And while I love GNU Emacs, it turns out that something in it miscompiles with the latest compiler, so I had to rebuild it with GCC 4.4.

To complete the view, we got a mere build failure in GNU lilypond.

Now, I’ll admit I’m explicitly nitpicking on GNU packages, with the last two at least; they are, most definitely, just two out of the many packages that will, at one point or another in time, fail. The first two examples, though, are two of the reasons why I say that GNU is not perfect, and why we shouldn’t refrain from replacing GNU software with better Free Software (in this case my obvious suggestion would be libarchive’s tar implementation, rather than GNU tar).

The fortify-source feature is implemented by the GNU C Library and by the GNU Compiler Collection… yet the GNU tar developers felt like they could ignore that feature altogether? It doesn’t sound too sane. And what about GNU awk failing to grok the official build system of the GNU project? (And I understand the difficulty there pretty well, given I’ve been documenting it myself… speaking of which, I’ve got a few more things to write in my guide once I have some free time I can spend in front of the monitor without crying from tiredness.)

People, grow up! GNU is not perfect, but it’s also not the only choice you have for Free Software! Let’s properly evaluate all software out there, whoever wrote it, and strive for perfection by not accepting the low quality it has today!

Using reports for bugs autofiling

In my previous post I wrote about my thoughts on developing a system to automatically produce bug reports out of a compilation failure in the tinderbox. While the idea was born out of a need of the tinderbox, after writing about it I started to think it might actually be useful to implement on a much more generic scale, so that it can be used by users as well.

The idea is that if you can actually extract all the most important information about a merge failure into a single report, it’d also be possible to use that report to open a bug report on the Bugzilla. Since you know the package name and version, you can also find the proper maintainer, making sure you’re using the latest version of the metadata file. Since you know the revision, you can also compare it with the currently-available version on some websites, so as to stop the user from reporting a bug that might have been fixed already.

Furthermore, by filing bugs through an automatic or semi-automatic system you can also provide a standard summary line for bugs, or make use of the status whiteboard, so that the auto-filing feature can also check for possible already-reported bugs (this would also turn out quite useful in the tinderbox).

The whole point of me writing down these ideas is that they might sound farfetched, and they did to me for a long while, but the truth is that you just need to break through the doubts, breach the barrier by finding one place to start attacking it, and I think I did: starting from what I need for the tinderbox now, I really want to develop the idea of report files, and then start working with somebody, and with pybugz, so that the auto-filing feature can actually be tested somewhere. This is something that we could have by the end of the year, in my opinion.

But there is more to this; there are times when just knowing versions and USE flags doesn’t cut it: for build failures you need to attach some more build logs. For instance, I already noted that econf produces a config.log, and that both epatch and the autotools.eclass functions create logs in the temporary directory of the package. While for these examples we could just wire the support up, there are many other packages that don’t use autotools, and thus won’t be helped by this. We could probably wire up CMake as well, since it is becoming widespread, but the rest?

Indeed there are packages with custom build systems, like FFmpeg, that have their own logs for the configure stage, and others, like a Python extension whose name I sincerely forgot, that hide the errors from the default build log and force you to pick up another file to find the actual errors (I find this pretty stupid, but that’s what you’re given to work with).

What I propose, then, is adding a new function that can either provide a list of files to attach, or even generate further content that is needed to be able to file a bug for a build failure of that package (think about non-GCC compiler settings, like ghc and fpc, or the Java compilers). With that inside the ebuild, you could really write up a complete report and make bug filing an almost entirely automated process, which would mean more bugs filed by users (rather than the tinderboxes) and a much more pleasing tinderbox setup (given somebody can actually wrangle the bugs in the latter case).

Anybody up to the task of helping me out here?