Upstream, rice it down!

While Gentoo often gets a bad name because of the so-called ricers, and upstream developers complain that we allow users to shoot themselves in the foot by setting CFLAGS as they please, it has to be said that not all upstream projects are good in that regard either. For instance, there are a number of projects that, unless you enable debug support, will force you to optimise (or even over-optimise) the code, which is obviously not the best of ideas (this doesn't count things like FFmpeg that rely on Dead Code Elimination to link properly; in those cases we should be even more careful, but let's leave that alone for now).

Now, what is the problem with forcing optimisation for non-debug builds? Well, sometimes you might not want debug support (extra verbosity, assertions, …) but you might still want to be able to fetch a proper backtrace; in such cases you have a non-debug build that needs to turn its optimisations down. Why should I be forced to optimise? Most of the time, I shouldn't.

Over-optimisation is even nastier: when upstream forces stuff like -O3, they might not even realise that the code might easily end up slower. Why is that? Well, one of the reasons is -funroll-loops: declaring all loops to be slower than unrolled code is an over-generalisation that doesn't hold up if you keep a minimum of CPU theory in mind. Sure, the loop instructions have a higher overhead than just pushing the instruction pointer further, but unrolled loops (especially when they are pretty complex) become CPU cache-hungry; where a loop might stay hot within the cache for many iterations, an unrolled version will most likely require more than a couple of fetch operations from memory.

Now, to be honest, this was much more of an issue with the first x86-64 capable processors, because of their risible cache size (it was vaguely equivalent to the cache available for the equivalent 32-bit only CPUs, but with code that almost literally doubled in size). This was the reason why some software, depending on a series of factors, ended up being faster when compiled with -Os rather than -O2 (optimise for size: the code size decreases, so it uses less CPU cache).

At any rate, -O3 is not something I'm very comfortable working with; while I agree with Mark that we shouldn't filter or exclude compiler flags (unless they are deemed experimental, as is the case for graphite) based on compiler bugs – they should be fixed – I'd also prefer not to hit those bugs in production systems. And since -O3 is much more likely to hit them, I'd rather stay the hell away from it. Jesting about that, yesterday I produced a simple hack for the GCC spec files:

flame@yamato gcc-specs % diff -u orig.specs frigging.specs
--- orig.specs  2010-04-14 12:54:48.182290183 +0200
+++ frigging.specs  2010-04-14 13:00:48.426540173 +0200
@@ -33,7 +33,7 @@
 %(cc1_cpu) %{profile:-p}

 *cc1_options:
-%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage}
+%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage} %{O3:%eYou're frigging kidding me, right?} %{O4:%eIt's a joke, isn't it?} %{O9:%eOh no, you didn't!}

 *cc1plus:

flame@yamato gcc-specs % gcc -O2 hellow.c -o hellow; echo $?   
0
flame@yamato gcc-specs % gcc -O3 hellow.c -o hellow; echo $?
gcc: You're frigging kidding me, right?
1
flame@yamato gcc-specs % gcc -O4 hellow.c -o hellow; echo $?
gcc: It's a joke, isn't it?
1
flame@yamato gcc-specs % gcc -O9 hellow.c -o hellow; echo $?
gcc: Oh no, you didn't!
1
flame@yamato gcc-specs % gcc -O9 -O2 hellow.c -o hellow; echo $?
0

Of course, there is no way I could put this in production as it is. While the spec files allow enough flexibility to match only the last optimisation level given (the one that is actually applied), rather than any parameter passed, they lack an “emit warning” instruction: the instruction used above, as you can see from the value of $?, is “error out”. While I could get it running in the tinderbox, it would probably produce so much noise and so many failing packages that I'd spend each day just trying to find out why something failed.

But if somebody feels like giving it a try, it would be nice to ask the various upstreams to rice it down themselves, rather than always having us labelled as the ricer distribution.

P.S.: building with no optimisation at all may cause problems too; in part because of reliance on features such as DCE, as stated above and as used by FFmpeg; in part because headers, including system headers, might change behaviour and cause packages to fail to build.

15 thoughts on “Upstream, rice it down!”

  1. Does graphite really cause so many problems as to warrant special treatment? What's the point in the graphite USE flag if graphite features are automatically filtered? After compiling world with parallelize-all (4.5.0) I've yet to hit a noticeable problem… (how boring!)


  2. Given that 4.5.0 is *not* released (and thus will have all its usual share of problems), I don't even consider that. And Graphite in the 4.4 series is Totally Fucked Up™. To the point that simple code can get the compiler stuck in an infinite loop and similar issues. Yes, the graphite USE flag for 4.4 should be masked, I know.


  3. Heh, I missed it for a day ^^;; Doesn't really change the drill actually, I wouldn't use it in testing until .1, following the usual GCC upgrade scheme.


  4. All this talk about easing up on the optimizations… but I still can't wait to try out the new link-time optimizer in 4.5.


  5. Hi Diego! Your paragraph “Over-optimisation is even nastier:…” contains an error: -funroll-loops is NOT activated by -O3. From the GCC documentation (I admit I didn't check the internal compiler code, but I presume the GCC documentation to be correct): “-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.” Of course, it turns out that the argument is just as valid for -finline-functions, which IS effectively activated by -O3.


  6. Uhm Alex, you know, I'm not sure about that and I'll have to check the spec file. Last I checked, I was pretty sure it was enabled. To be fair, the GCC documentation isn't always as up-to-date as you'd like it to be, so I wouldn't be surprised. On the other hand, @-ftree-vectorize@ is … even more interesting, so…


  7. Uhmm.. The person(s) @ funroll-loops.info forgot to mention that there are just as many or more dumbasses out there using Debian, etc. who also do stupid things with the respective package management systems, resulting in non-bugs being filed, fried systems, etc. Binary package management, or even a CFLAGS-limited source package, does not prevent stupidity from occurring, nor does it prevent unanticipated glitches in the operating system from occurring. Not sure why anyone would single out Gentoo users like that unless they are simply intimidated by the personal freedom.


  8. It is not mentioned in the gcc info pages, but %n is a valid specstring that produces a warning message without aborting compilation. It is valid at least as far back as gcc 4.3, and maybe farther. It has been used in production to warn users that -mno-intel-syntax was deprecated. Changing your modified spec from %e to %n produces a message, but compiles the file anyway.

     gcc -dumpspecs | grep %n
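Following that suggestion, the hack above could presumably be turned from a hard error into a warning by swapping %e for %n; an untested sketch of what the added cc1_options fragment might look like, assuming %n behaves as the commenter describes:

```
%{O3:%nlowering -O3 to -O2 would be appreciated} %{O9:%nOh no, you didn't!}
```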


  9. If you really want to get the most out of your PC, write test cases and use profile-guided optimizations. Improving branch prediction and cache locality will almost surely beat any compiler flag combination you might think of. Upstream authors might even be interested in what you came up with, and you end up helping the community.

     Blindly trying compiler flags, on the other hand, is a waste of time. I do not want to over-generalize, but I have a feeling that many Gentoo users don't actually know what (if any!) impact their flags will have, and that few of them ever used a profiler. Upstream authors often care about the speed of their code, and might have chosen compiler flags based on actual profiling data. If you are choosing compiler flags randomly, or relying entirely on your “gut feeling”, then you are throwing away that work (or, at the very least, are not doing a better job than them). Please don't take this as a personal remark – it is entirely possible that O3 is harmful on your box – I just want to suggest an alternative approach to the problem.

     These are my 2 cents. By the way, I agree that function inlining (and loop unrolling) can stress the cache, but 64-bit code is definitely not double the size of 32-bit code. Default operand size is still 32-bit, and that will be enough in a lot of cases. Pointers double in size, and more prefixes exist to access extra registers. A more realistic figure might be a 25% increase in code size. Of course, if a program's key structures are all “long” or pointers, then memory usage may actually double; maybe that's what you were suggesting.


  10. Jacopo, maybe “doubled” is a bit harsh, but I've had experience of even higher increases in code sizes… I'm not talking about the final executable size (which depends on the data structures) but really just @.text@ sections… of course that's also partially dependent on the size of data structures, but that's a digression right now and it doesn't really matter to the point.

      As for using profile-guided optimisation, that is definitely true. On the other hand I'm quite sure of a few things:

      * what I might profile on my machine will definitely work better for me, but might very well work worse for somebody with an i7 CPU; and it might still not be so on a future GCC release; so even if upstream applied profile-guided optimisations and transcribed them in their build system, their usefulness is definitely debatable;
      * I'm *not* always looking for the best speed performance; I might be looking to have meaningful backtraces, or to reduce the executables' sizes as much as possible for an embedded use case; so the profile-guided optimisations are mostly pointless here;
      * when upstream just sticks @-O3@ in their @CFLAGS@ they are most definitely not doing so because they tested the speed of their code; first they'd have to know that different compiler versions (and different compiler _patches_) produce different results, and then they would discover that some “optimisations” really slow things down, like @-ftree-vectorize@ seems to do with FFmpeg.

      Speaking of which, upstreams that complain that the use of any particular @-O@, @-f@ or @-m@ switch will break their code and _it won't be their fault_ are either ignorant of how compilers work or are actually writing bad code (mostly, under invalid assumptions). There was a time when @-ftracer -ftree-vectorize@ used for MPlayer on AMD64 caused a build failure. It wasn't a GCC bug, but the inline assembly, which was using global labels rather than local ones… and the compiler inlined the function twice, causing the same (global) label to be defined twice. Heh.


  11. Yes I'm late to this party, but some info:

      $ touch test.c && gcc-4.4.3 -O3 -march=native -fverbose-asm -S test.c
      $ grep "unroll-loops" test.s
      $

      IIRC the reason -O3 can produce slower code is generally because of -finline-functions bloating the source size.

      Another big danger with packages that force -O3 is -ftree-vectorize, which is broken in 4.3 and still broken on x86 with 4.4 and 4.5. It's not GCC itself that's buggy per se; the issues are caused by packages like ClamAV and Mozilla.* that misalign the stack.

      Failures with -O0 can also often be attributed to the lack of inlining. In C90, functions that are extern inline are only defined if the function is inlined. At -O0 nothing is inlined, so these functions aren't defined. I thought I remembered reading that this is being fixed in some way for 4.6, but I can't find the source.

      I've been thinking about writing a compiler flags policy/best practices guide for the dev manual. If you don't mind I'd like your comments before I post it publicly.
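The extern inline pitfall the commenter describes can be reproduced directly; a sketch assuming gcc with the old gnu89 inline semantics (-std=gnu89), where extern inline never emits an out-of-line definition:

```shell
cat > inl.c <<'EOF'
/* Under gnu89 semantics, "extern inline" provides an inline-only
   definition: no out-of-line copy of the function is ever emitted. */
extern inline int twice(int x) { return 2 * x; }
int main(void) { return twice(21) == 42 ? 0 : 1; }
EOF
# At -O0 nothing is inlined, so the call needs an out-of-line
# definition that does not exist: linking fails.
gcc -std=gnu89 -O0 inl.c -o inl 2>/dev/null || echo "undefined reference at -O0"
# With optimisation the call is inlined away and the build succeeds.
gcc -std=gnu89 -O2 inl.c -o inl && ./inl && echo "links and runs at -O2"
```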

