_FORTIFY_SOURCE, optimisations, and other details

While writing about the -O0 optimisation level I found some interesting problems related to the fortified sources support in recent GCC and GLIBC releases, which we have enabled by default since GCC 4.3.3-r1, and which serves as a partial replacement for stack-smashing protection (which, hopefully thanks to Magnus, will soon come back to Gentoo!).

As I said in that post, the special fortified functions (those that are available in the form of __$func_chk in the libc.so file, and that provide warnings at build time and proper stack traces at runtime) only get enabled if inline functions are enabled, so they are totally ignored at -O0 (simply disabling inlines with -fno-inline won’t stop them from being used, though). This means that if we have some software that does not respect the CFLAGS variables and uses -O0 in whatever context, it’s currently not using fortified sources, and as such it can crash without having warned about it beforehand.

But lack of optimisation – which is, luckily, quite a rare occurrence – is not the only thing that may cause the fortified versions of functions not to be emitted. Since the fortified versions of the functions are declared (and their wrappers defined) in files such as /usr/include/bits/stdio2.h, you need to include the headers properly for them to work. Unfortunately, it’s way too common for projects not to take headers seriously and to leave functions implicitly declared. And an implicit declaration does not take care of the fortified sources.
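
To give an idea of what we’re dealing with, here is a simplified sketch of what such a wrapper looks like; this is not the verbatim glibc code (the real header goes through a few more internal macros), but the shape is the same:

/* simplified sketch of the sprintf() wrapper from bits/stdio2.h */
#include <stddef.h>

extern int __sprintf_chk(char *s, int flag, size_t buflen,
                         const char *format, ...);

extern __inline __attribute__((__always_inline__, __gnu_inline__)) int
sprintf(char *s, const char *fmt, ...)
{
  /* __builtin_object_size() yields the destination size when the
     compiler can prove it at build time, (size_t)-1 otherwise; the
     check is thus only as good as the optimisers' knowledge */
  return __builtin___sprintf_chk(s, 1 /* fortify level - 1 */,
                                 __builtin_object_size(s, 1),
                                 fmt, __builtin_va_arg_pack());
}

Note also that the headers only enable this machinery when __OPTIMIZE__ is non-zero, and the wrapper is marked always-inline; this is why -O0 skips it entirely while -fno-inline alone does not.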

On the other hand, during testing of these little differences, I found some things that aren’t obvious at all. For instance, when you do use fortified sources, GCC fails to optimise some slightly more sophisticated cases. This is the code I used to test implicit declarations:

#ifndef IMPLICIT
# include <stdio.h>
#endif

int main() {
  char foo[12];
  return sprintf(foo, "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
}

When built with -DIMPLICIT -O1 (thus causing sprintf() to be implicitly declared), the produced executable does not crash, while it does crash with -O0 (see why I say -O0 produces different code?). The reason is that GCC picks up the call to sprintf() as a built-in function, whose semantics are known to it. In general, GCC cannot drop function calls that even just might have side effects, but in this case GCC knows what the side effect is (writing to the foo variable); its parameters are also known, so it can replace the call with its effects straight at build time. But since the foo variable is never read from, the whole copy is moot. The only thing we care about in the sprintf() call is the return value, which represents the length of the content written to the string, and which is also constant: the compiler knows it would be 26.

Indeed, the emitted code for that function, compiled that way, is as follows:

main:
        movl    $26, %eax
        ret

Interesting, isn’t it? Now, this has two related side effects: the first is that the fortified function is not used, so no warning is printed at build time; the other is that the code will not crash at runtime, because the buffer overflow is gone altogether! Can this be considered a bug in GCC? I don’t think so: it’s the code that is wrong to begin with. But you can see how disabling optimisations has now introduced a crash site. You can see how the code would otherwise behave by adding -fno-builtin to the compiler flags. This removes the semantic knowledge of the call from the compiler’s optimisers, and results in a straight call to sprintf().

But this post needs at least a bottom line to be worth anything, so here are some:

  • Gentoo treasures stability and security, which is why we enable fortified sources by default; if what you desire is pure speed, without caring about safety, you may add -D_FORTIFY_SOURCE=0 to your CFLAGS variable; this will override the fortified sources as enabled by the Gentoo spec file; if you do so, though, please do not file bugs unless you provide a patch with them as well;
  • even with the tinderbox at hand, I have nearly no way to find out whether the fortified functions are used or not; as far as I can tell, the fortified version is not emitted when the important parameters are not build-time constants (obviously);
  • one way to at least reduce the impact of possibly-skipped fortified checks is to get rid of implicit declarations; Portage already takes care of reporting them as QA warnings at the end of the merge; unfortunately, most maintainers won’t appreciate patches to remove implicit declarations, because they are usually boring and need to be redone from release to release if not properly sent upstream; if you care about security and safety, please do take care of those warnings; if you’re not the maintainer, send the fixes upstream, asking nicely for them to be added to the next release;
  • even if we get no warnings about implicit declarations, there is the risk that some stupid software decides to declare the system functions manually, rather than relying on the declarations provided by the C library; this usually happens when the software is trying to be portable but wants to feel smarter than average (see the sketch after this list); it goes without saying that it becomes a mess to identify such software and properly fix it up.
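
As an illustration of that last point, here is a minimal sketch: the same test program as above, but with the system function declared by hand. It builds without any implicit-declaration warning, yet it bypasses bits/stdio2.h entirely, so no fortified check is ever involved:

/* hand-rolled "portable" declaration instead of #include <stdio.h> */
extern int sprintf(char *s, const char *format, ...);

int main() {
  char foo[12];
  /* no build-time warning, no runtime check: the overflow goes unnoticed */
  return sprintf(foo, "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
}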

Upstream, rice it down!

While Gentoo often gets a bad name because of the so-called ricers, and upstream developers complain that we allow users to shoot themselves in the foot by setting CFLAGS as they please, it has to be said that not all upstream projects are good in that regard either. For instance, there are a number of projects that, unless you enable debug support, will force you to optimise (or even over-optimise) the code, which is obviously not the best of ideas (this does not count things like FFmpeg, which relies on Dead Code Elimination to link properly; in those cases we should be even more careful, but let’s leave that alone for now).

Now, what is the problem with forcing optimisation for non-debug builds? Well, sometimes you might not want debug support (extra verbosity, assertions, …) but you might still want to be able to fetch a proper backtrace; in such cases you have a non-debug build that needs to turn down optimisations. Why should I be forced to optimise? Most of the time, I shouldn’t be.

Over-optimisation is even nastier: when upstreams force stuff like -O3, they might not even understand that the code might easily get slower instead. Why is that? Well, one of the reasons is -funroll-loops: declaring all loops to be slower than unrolled code is an over-generalisation that you cannot expect to hold up, if you have a minimum of CPU theory in mind. Sure, the loop instructions have a higher overhead than just pushing the instruction pointer further, but unrolled loops (especially when they are pretty complex) become CPU-cache-hungry; where a loop might stay hot within the cache for many iterations, an unrolled version will most likely require more than a couple of fetch operations from memory.
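
To make the trade-off concrete, here is a minimal sketch of what unrolling means (what -funroll-loops actually generates is more sophisticated, but the space cost is of the same nature):

/* the same reduction, rolled and manually 4-way unrolled; the second
   version executes fewer branch instructions per element, but takes up
   several times the instruction-cache space */
static int sum_rolled(const int *data, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += data[i];
  return sum;
}

static int sum_unrolled(const int *data, int n) {
  int sum = 0, i;
  for (i = 0; i + 4 <= n; i += 4)
    sum += data[i] + data[i+1] + data[i+2] + data[i+3];
  for (; i < n; i++)  /* leftover iterations */
    sum += data[i];
  return sum;
}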

Now, to be honest, this was much more of an issue with the first x86-64-capable processors, because of their risible cache size (it was vaguely equivalent to the cache available on the equivalent 32-bit-only CPUs, but with code that almost literally doubled in size). This was the reason why some software, depending on a series of factors, ended up being faster when compiled with -Os rather than -O2 (optimising for size, the code size decreases and it uses less CPU cache).

At any rate, -O3 is not something I’m very comfortable working with; while I agree with Mark that we shouldn’t filter or exclude compiler flags based on compiler bugs (unless the flags are deemed experimental, as is the case for graphite) – the bugs should be fixed instead – I would also prefer to avoid hitting those bugs on production systems. And since -O3 is much more likely to hit them, I’d rather stay the hell away from it. Jesting about that, yesterday I produced a simple hack for the GCC spec files:

flame@yamato gcc-specs % diff -u orig.specs frigging.specs
--- orig.specs  2010-04-14 12:54:48.182290183 +0200
+++ frigging.specs  2010-04-14 13:00:48.426540173 +0200
@@ -33,7 +33,7 @@
 %(cc1_cpu) %{profile:-p}

 *cc1_options:
-%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage}
+%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage} %{O3:%eYou're frigging kidding me, right?} %{O4:%eIt's a joke, isn't it?} %{O9:%eOh no, you didn't!}

 *cc1plus:

flame@yamato gcc-specs % gcc -O2 hellow.c -o hellow; echo $?   
0
flame@yamato gcc-specs % gcc -O3 hellow.c -o hellow; echo $?
gcc: You're frigging kidding me, right?
1
flame@yamato gcc-specs % gcc -O4 hellow.c -o hellow; echo $?
gcc: It's a joke, isn't it?
1
flame@yamato gcc-specs % gcc -O9 hellow.c -o hellow; echo $?
gcc: Oh no, you didn't!
1
flame@yamato gcc-specs % gcc -O9 -O2 hellow.c -o hellow; echo $?
0

Of course, there is no way I could put this in production as it is. While the spec files allow enough flexibility to match only the latest optimisation level given (the one that is actually applied), rather than any parameter passed, they lack an “emit warning” instruction; the instruction above, as you can see from the value of $?, is “error out”. While I could get it running in the tinderbox, it would probably produce so much noise, and fail so many packages, that I’d spend each day just trying to find out why something failed.

But if somebody feels like giving it a try, it would be nice to ask the various upstreams to rice it down themselves, rather than our always being labelled as the ricer distribution.

P.S.: building with no optimisation at all may cause problems; in part because of reliance on features such as DCE, as stated above and as used by FFmpeg; in part because headers (including system headers) might change behaviour and cause the packages to fail.

Bundling libraries: the curse of the ancients

I was very upset by one comment from Ardour’s lead developer Paul Davis in a recently reported “bug” about the unbundling of libraries from Ardour in Gentoo. I was, to be honest, angry after reading his comment, and for a while I was tempted to answer badly; but then I decided my health was more important, backed away, thought about it, and then answered the way I answered (which I hope is diplomatic enough). Then I thought it might be useful to address the problem in a less concise way and explain the details.

Ardour bundles a series of libraries; like I wrote previously, there are problems related to this, and we dealt with them by just unbundling the libraries; now Ardour is threatening to withdraw support from Gentoo as a whole if we don’t back away from that decision. I’ll try to address his comments in multiple parts, so that you can understand why they really upset me.

First problem: the oogie-boogie crashes

It’s a quotation of Adam Savage from MythBusters; watch the show if you want to know the actual details. I learnt about it from Irregular Webcomic years ago, but I have only seen the show itself about six months ago, since in Italy it is only broadcast on satellite pay TV, and the DVDs are not available (which is why they are on my wishlist).

Let’s see what exactly Paul said:

Many years ago (even before Gentoo existed, I think) we used to distribute Ardour without the various C++ libraries that are now included, and we wasted a ton of time tracking down wierd GUI behaviour, odd stack tracks and many other bizarre bugs that eventually were traced back to incompatibilities between the way the library/libraries had been compiled and the way Ardour was compiled.

I think I have now coined a term for my own dictionary, and will call this the syndrome of the oogie-boogie bugs, for each time I hear (or find myself muttering!) “we know of past bad behaviour”. Sorry, but without documentation these things are like unprovable myths, just like the one Adam commented upon (the “pyramid power”). I’m not saying that these things didn’t happen; far from it, I’m sure they did. The problem is that they are not documented, and are thus unprovable, impossible to dissect and correct.

Also, I’m not blaming Paul or the Ardour team for being superficial, because, believe it or not, I suffer(ed, hopefully) from that syndrome myself: some time ago I reported to Mart that I was getting maintainer-mode-induced rebuilds on packages that patched both Makefile.am and Makefile.in, and that the method of patching both was thus not working; while I still maintain that it’s more consistent to always rebuild autotools (and I know I still have to write about why that is), Mart pushed me into proving it, and together we were able to identify the problem: I was using XFS for my build directory, which has sub-second mtime precision, while he was using ext3, whose mtimes are precise only to the second; so indeed I was experiencing difficulties he would never have been able to reproduce on his setup.
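
For the curious, the difference is trivial to observe; a minimal sketch, using the st_mtim member that glibc exposes:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
  struct stat st;
  if (argc < 2 || stat(argv[1], &st) != 0)
    return 1;
  /* on XFS the nanoseconds part is almost always non-zero; on ext3 it
     is always zero, which is why only my setup showed the rebuilds */
  printf("%ld.%09ld\n", (long)st.st_mtime, st.st_mtim.tv_nsec);
  return 0;
}

Run it on a file in an XFS mount and on one in an ext3 mount, and compare the fractional part.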

Just to show that this goes beyond this kind of problem: ever since I joined Gentoo, Luca has told me to be wary about suggesting the use of -O0 when debugging, because it can cause stuff to miscompile. I never took his word for it, because that’s just how I am, and he didn’t have any specifics to prove it. Turns out he wasn’t that wrong after all, since if you build FFmpeg with -O0 and Sun’s compiler, it cannot complete the link. The reason is that with older GCC, Sun’s compiler, and I’m sure others, -O0 turns off the DCE (Dead Code Elimination) pass entirely, and causes branches like if (0) to be compiled anyway; FFmpeg relies on the DCE pass always happening. (There is more to say about relying on the DCE pass, but that’s another topic altogether.)
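
The pattern FFmpeg relies on looks, in a simplified sketch with made-up names, like this:

#define CONFIG_MMX 0  /* set by the configure script */

extern void render_mmx(void); /* only compiled and linked when CONFIG_MMX is 1 */
static void render_c(void) { /* generic fallback implementation */ }

void render(void) {
  if (CONFIG_MMX)
    render_mmx(); /* with DCE, this call (and the undefined reference that
                     comes with it) disappears; without DCE, the link fails */
  else
    render_c();
}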

So again, if you want to solve bugs of this kind, you have to do just like the actual MythBusters: document, reproduce, dissect, fix (or document why you have to do something, rather than just saying that you have to). Not having the specifics of the problem makes it an “oogie-boogie” bug, and it’s impossible to deal with it.

Second problem: once upon a time

Let me repeat one particular piece of the previous quote from Paul Davis (emphasis mine): “Many years ago (even before Gentoo existed, I think)”. How many years ago is that? Well, since I don’t want to track down the data on our own site (I have to admit I find it appalling that we don’t have a “History” page), I’ll go around quoting Wikipedia. If we talk about Gentoo Linux under this very name, version 1.0 was released on March 31, 2002 (hey, it’s almost seven years ago by now). If we talk about Daniel’s project, Enoch Linux 0.75 was released in December 1999, which is more than nine years ago. I cannot confirm Paul’s memories, since their Subversion repository seems to have discarded the history information from when they were in CVS (it reports the first commit in 2005, which is certainly wrong if we consider that Wikipedia puts their “Initial Release” in 2004).

Is anything the same as it was at that time? Well, most likely there are still pieces of code that are older than that, but I don’t think any of them are in actual use nowadays. There have been, in particular, a lot of transitions since then. Are difficulties found at that time of any relevance nowadays? I sincerely don’t think so. Paul also doesn’t seem to have any documentation of newer occurrences of this, and just says that they don’t want to spend more time on debugging these problems:

We simply cannot afford the time it takes to get into debugging problems with Gentoo users only to realize that its just another variation on the SYSLIBS=1 problem.

I’ll go over that statement in more detail in the next problem; for now, let’s accept that there has been no documentation of new cases, and that all we have here is bad history. Let’s try to think about what that bad history was. We’re speaking about libraries, first of all: what does that bring us? If you’re an avid reader of my blog, you might remember what actually brought me to investigate bundled libraries in the first place: symbol collisions! Indeed this is very likely; if you remember, I did find one crash in xine due to the use of the system FFmpeg, caused by symbol collisions. So it’s certainly not a far-fetched problem.

The Unix flat namespace for symbols is certainly one big issue that projects depending on many libraries have to deal with, and I admit there aren’t many tools that can deal with it. While my collision analysis work has focused up to now on identifying the areas of the problem, in the big scheme of things it only helps to find possible candidates for collision problems. This actually made me think that I should adapt my technique to identify problems on a much smaller scale, taking one executable as input and identifying duplicated symbols. I just added this to my TODO map.
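
To show what I mean by duplicated symbols, a minimal sketch with made-up names; imagine the application bundles a copy of a function that a system library it links to also exports:

/* bundled.c -- the copy built into the application */
int frob_version(void) { return 2; }

/* the system library, linked in as well, exports its own frob_version()
   returning 1; both definitions end up in the same process image */

/* main.c */
#include <stdio.h>
extern int frob_version(void);

int main() {
  /* with the flat namespace, every reference to frob_version(),
     including the calls the system library makes internally, binds to
     whichever definition comes first in the lookup scope, not to the
     one its author intended */
  printf("%d\n", frob_version());
  return 0;
}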

Anyway, thinking about the amount of time passed since Gentoo’s creation (and thus since what Paul thinks is when the problems started to happen), we can see at least one big “event horizon” in GNU/Linux since then (and for once I use this term because it’s proper here): the libc5 to libc6 migration; the HOWTO I’m linking, from Debian, was last edited in 1997, which puts it well within the timeframe that Paul described.

So it’s quite possible that people at the time used libraries built against one C library with an Ardour built against a different one, which would almost certainly have created subtle issues, difficult to identify (for a person not skilled with linkers, at least). And that’s certainly not the only possible cause of similar crashes or, even worse, unexpected behaviour. If we look again at Paul’s comment, he speaks of “C++ libraries”; I know that Ardour is written in C++, and I think I remember some of the bundled libraries being written in C++ too; I’m not sure he’s right in calling all of them “C++ libraries” (C and C++ are two different languages, even if the foreign calling convention glue is embedded in the latter), but if even a single one of them is, it can open a different Pandora’s box.

See, if you look at GCC’s history, it wasn’t long before the Enoch 0.75 release that a huge paradigm shift started for Free Software compilers. The GNU C Compiler, nowadays the GNU Compiler Collection, was forked into the Experimental/Enhanced GNU Compiler System (EGCS) in 1997, which was merged back into GCC with the historical 2.95 release in April 1999. EGCS contained a huge amount of changes, a lot of them related to C++. But even that wasn’t near perfection; for many, C++ support was mostly ready for prime time only after release 3 at least, so there were wild changes going on at that time. Libraries built with different versions of the compiler might well have had wildly differently-built symbols with the same names and, even worse, would have been using different STL libraries. Add to the mix the infamous 2.96 release of GCC as shipped by Red Hat (I think the worst faux pas in the history of Red Hat itself), with so many bugs due to backporting that a project I was working with at the time (NoX-Wizard) officially unsupported it, suggesting the use of either 2.95 or 3.1. We even had an explicit #error if the 2.96 release was used!

A smaller-scale paradigm shift happened with the release of GCC 3.4 and the change from libstdc++.so.5 to libstdc++.so.6, which is what we use nowadays. Mixing libraries using the two ABIs and the two STL versions caused obvious and non-obvious crashes; we still have software using the older ABI, and that’s why we keep libstdc++-v3 around; Mozilla, Sun and Blackdown hackers certainly remember that time, because it was a huge mess for them. It’s one very common (and one of my favourite) arguments against the use of C++ for mission-critical system software.

Also, GCC’s backward compatibility is near non-existent: if you build something with GCC 4.3, without using static libraries, executing it on a system with GCC 4.2 will likely cause a huge amount of problems (forward compatibility, on the other hand, is always ensured). Which adds to the picture I have already painted. And do we want to talk about the visibility problem? (On a different note, I should ask Steve for a dump of my old blog to merge here; it’s boring not remembering whether a given post was written on the old one.)

I am thus not doubting Paul’s memories at all regarding problems with system libraries and so on and so forth. I would also stress another piece of his comment: “eventually were traced back to incompatibilities between the way the library/libraries had been compiled and the way Ardour was compiled”. I understand he might not be referring just to the compiler (and compiler version) used in the build; so I wish to point out two particular GCC options: -f(no-)exceptions and -f(no-)rtti.

These two options enable or disable two C++ language features: exception handling and run-time type information. I can’t find any reference to it in the current man page, but I remember it used to warn that mixing code built with and without them in the same software unit was bad; I wouldn’t expect it to be any different now, sincerely. In general the problem is avoided because each piece of software builds its own independent unit, in the form of an executable or shared object, and the boundary between those is subject to the contract we call the ABI. Shared libraries built with and without those options are supposed to work fine together (though I sincerely am not ready to bet on it), but if the lower-level object files are mixed together, bad things may happen; and since we’re talking about computers, they will, at the exact moment you don’t want them to. It’s important to note here, for all the developers not expert with linkers, that static libraries (or, more properly, static archives) are just a bunch of object files glued together, so linking something statically still means linking lower-level object files together.

So the relevance of Paul’s memories is, in my opinion, pretty low. Sure, shit happened, and we can’t swear that it’ll never happen again (most likely it will), but we can deal with that; which brings me to the next problem:

Third problem: the knee-jerk reaction

Each time some bug happens that is difficult to pin down, it seems like any developer tries to shift the blame. Upstream. Downstream. Sidestream. Ad infinitum. As a spontaneous reflex.

This happens pretty often with distributions, especially with Gentoo, which gives users “too much” freedom with their software, but it most likely happens in general, and I think it is the most frequent reason for bundling libraries. By using system libraries, developers lose what they think is “control” over their software, which in my opinion is often just sheer luck. Sometimes developers admit that their reason is just the desire to spend as little time as possible working on issues; some other times they try to explicitly move the blame onto the distributions or other projects; but at the end of the day the problem is just the same.

Free software is a moving target: you might develop software against one version of a library, not touch the code for a few months, and it works great; then a new version is released and your software stops working. And you blame the new release. You might be right (a new bug was introduced), or you might be wrong (you breached the “contract” called the API: some change happened, something that was not guaranteed to work in any particular way changed the way it worked, and you relied on the old behaviour). In either case, the answer “I don’t give a damn, just use the old version” is a sign of something pretty wrong with your approach.
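
A classic instance of the second case, as a sketch:

#include <stdio.h>
#include <string.h>

int main() {
  char buf[16] = "abcdefgh";
  /* memcpy()'s contract says the two regions must not overlap (that
     guarantee belongs to memmove()); an implementation that happens to
     copy forward makes this "work" anyway... */
  memcpy(buf + 1, buf, 8);
  /* ...until a new release changes the copy direction, the output
     changes, and the library gets blamed for breaking the program */
  puts(buf);
  return 0;
}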

The Free Software spirit should be the spirit of collaboration. If a new release of a given dependency breaks your software, you should probably just contact its author and work out between the two projects what the problem is; if a bug was introduced, make sure there is a testsuite, and that the testsuite includes a testcase for the particular issue you found. Writing testcases for bugs that happened in the past is exactly why testsuites are so useful. If the problem is that you relied on a behaviour that has changed, the author might know how not to rely on it while keeping code that works as expected, or might take steps to make sure nobody else tries that (either by improving the documentation or by changing the interface so that the behaviour is not exposed). Bundling the dependency, citing multiple problems and giving no option, is usually not the brightest step.

I’m all for giving working software to users by default, so I can understand bundling the library by default; I just think that it should either be documented why that’s the case, or there should be a chance of not using it. Someone somewhere might actually be able to find what the problem is. Just give them a chance. In my previous encounter with Firefox’s SQLite, I received a mail from Benjamin Smedberg:

Mozilla requires a very specific version of sqlite that has specific compiled settings. We know that our builds don’t work with later or earlier versions, based on testing. This is why we don’t build against system libsqlite by design.

They know, based on testing, that they can’t work with anything else. What that testing consists of, I still don’t know. Benjamin admitted he didn’t have the specifics and referred me to Shawn Wilsher, who supposedly had more details, but he never got back to me with those details. Which is quite sad, since I was eager to find out what the problem was, because SQLite is one of the most frequent oogie-boogie sources. I even noted before that the problem with SQLite seems to lie upstream, and I still maintain that in this case; while I said above that it’s a knee-jerk reaction, I have also witnessed more than a few projects having problems with SQLite, and I myself had my share of headaches because of it. But this should really start to make us think that maybe, just maybe, SQLite needs help.

But we’re not talking about SQLite here, and trust me that most upstreams will likely help you out in forward-porting your code, fixing issues and so on and so forth. Even if you, for some reason I don’t want to talk about now, decided to change the upstream library after bundling it, often you can get it back to a vanilla state by pushing your changes upstream. I know it’s feasible even with the most difficult upstreams, because I have done just that with FFmpeg, with respect to xine’s copy.

But just so we’re clear, it does not stop with libraries; the knee-jerk reaction happens with CFLAGS too. If you have many users reporting that using wild CFLAGS breaks your software, the most common reaction is to just disallow custom CFLAGS, while the reasoned approach would be to add a warning and then start identifying the culprit; it might be your code assuming something that is not always true, or it might be a compiler bug; in either case, the solution is to fix the culprit, instead of disallowing everybody from making use of custom flags.

Solution: everybody’s share

So far I have dissected Paul’s comment into three main problems; I could probably write more about each of them, and I might if the points are not clear, but the post is already long enough (and I didn’t want to split it up, because it would take too long to become available), and I wanted to reach a conclusion with a solution, which is what I already posted in my reply to the bug.

The solution to this problem is to give everybody their share of the work. Instead of “blacklisting Gentoo” like Paul proposed, they should just do the right thing and leave us to deal with the problems caused by our choices and our needs. I have already pointed out some of these in my three-part article for LWN (part 1, part 2 and part 3). This means that if you get a user reporting some weird behaviour with the Gentoo ebuild, your answer should not be “Die!” but “You should report that to the Gentoo folks over at their bugzilla”. Yes, I know it is a much longer phrase and requires much more typing, but it’s much more user-friendly and actually provides us all with a way to improve the situation.

Or you could also do the humble thing and ask for help. I have said this before: if you have a problem with anything I have written about, and have good documentation of what the problem is, you can write to me. Of course I don’t always have time to fix your issues, and sometimes, I’m afraid, I don’t even have time to look at them in a timely fashion, but I have never sent someone away because I didn’t like them. The problem is that most of the time I’m not asked at all.

Even if you end up asking me some question that would sound very silly if you knew the topic, I’m not offended by it; just as I’d rather not be asked to learn all the theory behind psychoacoustics to find out why libfaad is shrieking over my music, I don’t expect Paul to know all the ins and outs of linking problems to find out why the system libraries cause trouble. I (and others like me) have the expertise to identify a collision problem relatively quickly; I should also be able to provide tools to identify them even more quickly. But if I don’t know about a problem, I cannot magically fix it; well, not always, at least.

So Paul, this is an official offer: if you can give me the details of even a single crash or misbehaviour due to the use of system libraries, I’d be happy to look into it.

Fixing CFLAGS/LDFLAGS handling with a single boilerplate Makefile (maybe an eclass, too?)

So, in the last few weeks I’ve been filing bugs for packages that don’t respect CFLAGS (or CXXFLAGS), using the beacon trick. Aside from causing some possible false positives, the testing is going well.

The problem is that I found more than a couple of packages that either call gcc manually (I admit I’m the author of a couple of ebuilds doing just that), or where the patch to fix the Makefile would be more complex than just using a boilerplate Makefile.

So what is this boilerplate Makefile I’m talking about? Something like this:

$(TARGET): $(OBJS)
        $(CC) $(LDFLAGS) -o $@ $^ $(LIBS)

Does it work? Yes, it does: the link rule above respects LDFLAGS, and compiling each object file is left to make’s built-in rules, which already use $(CC), $(CFLAGS) and $(CXXFLAGS), so CFLAGS and CXXFLAGS are respected just fine too. The invocation in an ebuild (taking one I modified earlier today) would be as easy as:

src_compile() {
    emake CC="$(tc-getCC)" \
        TARGET="xsimpsons" \
        OBJS="xsimpsons.o toon.o" \
        LIBS="-lX11 -lXext -lXpm" || die "emake failed"
}

Now of course this would suck if you had to do it for each and every ebuild, but what if we simplified it into an eclass? Something like having an ebuild just invoke it this way:

ESIMPLE_TARGET="xsimpsons"
ESIMPLE_OBJS="xsimpsons.o toon.o"
ESIMPLE_LIBS="-lX11 -lXext -lXpm"

inherit esimple

For slightly more complicated things you could have it use pkg-config too…

ESIMPLE_TARGET="xsimpsons"
ESIMPLE_OBJS="xsimpsons.o toon.o"
ESIMPLE_REQUIRED="x11 xext xpm"

inherit esimple

so that it would call pkg-config for those, rather than using the libraries directly (this would also allow simplifying, for instance, picoxine’s ebuild, which uses xine-lib).

Even better (or maybe I’m going over the top here ;)), one could make the eclass accept a static USE flag that would call pkg-config --static instead of the standard pkg-config and append -static to the LDFLAGS, so that the resulting binary would be, well, static…

If anybody has comments about this, so it can be fleshed out before actually being proposed as an eclass, now would be a nice time to speak up, so that we can start off on the right foot!

More notes about the flags testing

Before entering the hospital in Verona, I wrote about a feasible way to check for CFLAGS; in the past two days I decided to start testing my method over a wide range of packages, running buildpkg over each category (for packages either without build-time dependencies or with build-time dependencies I had already merged in the chroot). The results have been quite interesting.

Besides the fact that this method also allows me to identify ebuilds installing pre-stripped files, I’ve found quite a few packages failing in general, and a few with broken DEPEND (as in, missing stuff that’s needed at build time; a lot of those were caused by typos or trivial mistakes in the ebuild, which I fixed myself without even bothering to open a bug). But I also noticed one thing about my test itself.

The way I designed my test, it sets up the modified CFLAGS (injecting the beacon symbol) during pre_src_compile (and now pre_src_configure for EAPI=2); the problem with that is that there are more than a couple of packages that do respect CFLAGS, but set them in stone inside the makefiles during src_unpack (and probably, nowadays, src_prepare).

While it’s not officially a mistake, I’d sincerely say this is not what’s intended; I’d expect CFLAGS to be used only during the configure/compile phases, not during the unpack/prepare phases, which should be, in my opinion, system-independent. For instance, it would be nice if one day we could run everything up to src_prepare once, and then build the package N times, as needed by multilib dependencies.

Anyway, if you’re a Gentoo developer maintaining a package that does set the CFLAGS in stone during src_unpack, you’re most likely going to get a bug from me; I won’t be disappointed if you close it, but really, don’t you think you can do better?

In general, for software that does not respect CFLAGS by default, you can work around it in many ways without resorting to the set-in-stone approach:

# This...
CFLAGS = -O2 -fomit-frame-pointer -Wall -Wextra -DSOMETHING -DOTHER
# may become
CFLAGS += -Wall -Wextra -DSOMETHING -DOTHER

# This...
gcc -O9 -funroll-all-loops -Ipath
# may become
gcc $(CFLAGS) -Ipath

and so on and so forth.