When the tests are more important than the functions

I’ve got to admit that the whole “Test-Driven Development” hype never appealed to me much — not because I think tests are wrong, but because, while tests are important, focusing almost exclusively on them is just as dumb as ignoring them altogether.

In the Ruby world there is so much talk about tests that it’s still very common to find gems that don’t work at all with newer Ruby versions, even though their specs pass just fine. Or the tests pass fine for the original author, but fail on everyone else’s system because they depend heavily on custom configuration — sometimes they depend on case-insensitive filesystems, because the gem was developed on Windows or Mac and never tested on Linux. Indeed, for the longest time Rails’ own tests would fail at every chance, and the “vendored” code it brought in never had a working testsuite. Things have improved nowadays, but not significantly.

Indeed, RubyGems does not make it easy to run tests at install time, which means that many distributed gems lack part, or all, of the testsuite — sometimes as an explicit choice; in the case of my own Ruby-Elf gem the tests are not distributed because they keep growing and weigh quite a few megabytes at this point; if you want to run them you fetch the equivalent snapshot from GitHub — the ebuild in Gentoo uses that as its basis for exactly that reason.

Sometimes even gems coming from people who are considered nearly Ruby Gods, like rails_autolink (https://rubygems.org/gems/rails_autolink) by tenderlove, end up being released with badly failing tests — the version we have in Portage is patched up, and the patch was sent upstream. Only the best for our users.

Now unfortunately, as I noted in the post’s title, some projects care more about the tests than the functionality — the project in question being the very same Typo that I use for this blog, which I have already mused about forking to implement fixes that are not important to upstream. Maybe I should have done that already; maybe I will.

So I sent a batch of changes and fixes upstream: some of them fixing issues caused by their own changes, others implementing changes to allow proper usage of Typo over SSL vhosts (yes, my blog is now available over SSL — I still have to fix a few links and object load paths in some of the older posts, but it will soon work fine), others again simply making it a bit more “SEO”-friendly, since that seems to be a big deal for the developers.

What kind of response do I get about the changes? “They fail specs” — never mind that the first commit I’m told breaks the specs actually fixes editing of blog posts, broken by a change that went straight to master; it might break the specs, but it solves a real-life issue that makes the software quite obnoxious to use. So why didn’t I check the specs myself?

group :development, :test do
  gem 'thin'
  gem 'factory_girl', '~> 3.5'
  gem 'webrat'
  gem 'rspec-rails', '~> 2.12.0'
  gem 'simplecov', :require => false
  gem 'pry-rails'
end

I have no intention of looking into this whole set of gems just to be able to run the specs of a blog engine whose specs I find vastly messed up anyway. Why do I think so? Well, among other reasons, I’ve been told quite a few times before that they would never pass on PostgreSQL — which happens to be the database that has been powering this very instance for the past eight years. I’m pretty sure it works well enough!

Well, after asking (a few times) for the specs’ output, it turns out that most of the broken specs are actually those that hardcode http:// in the URLs. Of course they break! My changes use protocol-relative URIs, which means the output changes to use // — and there is no spec that tries to validate the output for SSL-delivered blogs, which used to be broken before.

And what is upstream’s response to my changes? “It breaks here and there, would you mind looking into it?” Nope! “The commit breaks specs.” — no offer (until I complained loudly enough on IRC) to look into it themselves and fix whichever of the patches or the specs needed fixing. No suggestion that there might be something to change in the specs.

Not even a cherry-pick of the patches that do not break specs.

Indeed, as of this writing even the first patch in the series (the only one I really care about getting merged, because I don’t want to fall out of sync with master’s database, at least until I decide to just go for the fork) is still lingering there, even though there is no way in this world that it breaks the specs, as it introduces entirely new code.

Am I going to submit a new set of commits with at least the visible spec failures fixed? Not sure — I can hardly be bothered, since right now my blog is working and has the features I need, the only one missing being user-agent forwarding to Akismet. I don’t see friendliness coming from upstream, and I keep thinking that a fork might be the best option at this point, especially when, after I suggested the use of Themes for Rails to replace the current theme handling, so that it works properly with the asset pipeline (one of the best features of Rails 3), the answer was “it’s not in our targets” — well, it would be in mine, if I had the time! Mostly because being able to use SCSS would make it easier to share the stylesheets with my website (even though I’m considering getting rid of my website altogether).

So my plea to the rest of the development community, which I hope I can be considered part of, is not to be so myopic as to care more about tests passing than about features working. For sure Windows didn’t reach its level of popularity by being completely crash-proof — and at the same time I’m pretty sure they did at least some basic testing on it. The trick is always in the compromise, not in absolute care for, or neglect of, tests.

On test verbosity (A Ruby Rant, but not only)

Testing of Ruby packages is nowadays mostly a recurring joke, to the point that I didn’t feel I could add much more to my last post — but there is something I can write about, which is not limited to Ruby at all!

Let’s take a step back and see what’s going on. In Ruby you have multiple test harnesses available, although the most commonly used are Test::Unit (which was part of Ruby 1.8 and is available as a gem in 1.9) and RSpec (version 1 before, version 2 now). The two have very different defaults for test output: the former uses, by default, a quiet “progress” style (one dot per test passed), while the latter uses a verbose “documentation” style. But each of them can use the other style as well.
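
Switching style is mostly a matter of how the runner is invoked; the exact flags may differ between versions (and test/some_test.rb is obviously a placeholder), so take these as indicative rather than gospel:

rspec --format progress spec                     # quiet, one dot per example
rspec --format documentation spec                # verbose, one line per example
ruby -Itest test/some_test.rb --verbose=verbose  # Test::Unit's verbose runner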

Now, we used to rely on rake, using the Rakefile coming with the gem, the tarball, or the git snapshot, to run tests and build documentation, since both harnesses integrate very tightly with it — but since the introduction of Bundler, this has become more troublesome than simply writing the test commands out explicitly in the ebuild.

Today I went to bump a package (origin, required for the new Mongoid), and since I didn’t wire in the tests last time (I was waiting for a last-minute fix), I decided to do so this time. It’s worth noting that while, when using rake to run the tests, you’re almost entirely bound to upstream’s preference on which output style to use, it’s tremendously easy to override that when calling the harness directly in the ebuild.

The obvious question is then “which of the two styles should ebuilds use for packaged gems?” — the answer is a bit complex, which is why this post came up. Please also note that this does not only apply to Ruby packages; it should be considered for all packages where the tests can be either silent or verbose.

On default usage you only care about whether the tests pass or fail — if they fail, they’ll give you as much detail as you need, most of the time at least. So the quiet (progress) style is a very good choice: it’s less output, which means smaller logs with less noise, a win-win. For the tinderbox, though, I found out that having more verbose output is useful: to make sure that tests are not being silently skipped, to make sure that the build does not get stuck, and, if it does, to see where it got stuck.

Reconciling these two requirements is actually easier than one might think, because it’s an already-solved problem: both perl.eclass and, lately, cmake-utils.eclass support a single variable, TEST_VERBOSE, which can be set to make the tests output more details while running. Right now origin is the first Ruby ebuild supporting that variable, but talking about this with Hans we came to the conclusion that we need a wrapper to take care of it.
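
For an ebuild whose testsuite is RSpec-based, per-ebuild support boils down to something like the following sketch (not the literal code from the origin ebuild, but the same shape), which is exactly the kind of boilerplate a wrapper would save:

each_ruby_test() {
    local format=progress
    # TEST_VERBOSE switches RSpec to the documentation style
    [[ -n ${TEST_VERBOSE} ]] && format=documentation
    ${RUBY} -S rspec --format ${format} spec || die "tests failed"
}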

So what I’m working on right now is a few changes to the Ruby eclasses so that we support TEST_VERBOSE and NOCOLOR (while the new analysis script actually strips out most of the escape codes, it’s easier if it doesn’t have to!) through something such as a ruby-ng-rspec call (which can also check whether rspec is properly listed in the dependencies!). Furthermore, I plan on changing ruby-fakegem.eclass so that it allows switching between the current default test function (which uses rake), one that uses RSpec (adding the dependency!), and one that uses Test::Unit. Possibly, after this, we might want to do the same thing for building the API documentation, although that might be trickier — in general though, having the same style of API docs installed, instead of each project’s own, might be a good idea.
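
To make the direction a bit more concrete, the ruby-ng-rspec call mentioned above could be something along these lines; the name and the interface are anything but final, so consider it a sketch:

ruby-ng-rspec() {
    local opts=()
    # honour the same variables perl.eclass and cmake-utils.eclass care about
    [[ -n ${TEST_VERBOSE} ]] && opts+=( --format documentation )
    [[ -n ${NOCOLOR} ]] && opts+=( --no-color )
    ${RUBY} -S rspec "${opts[@]}" "${@:-spec}" || die "rspec failed"
}

Checking that rspec is actually listed among the dependencies would then be the eclass’s job rather than each ebuild’s.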

So if your package runs tests, please try supporting TEST_VERBOSE, and if it can optionally use colors, make sure it supports NOCOLOR as well, thanks!

Do-s and Don’t-s of Ruby testing

This article is brought to you by some small time I had free to write something down. I’m clearing my own stuff out of the way in Italy, before going back to Los Angeles at the end of the month.

I’ve been working on updating some Gentoo Ruby ebuilds lately, and I’m starting to feel the need to scream at some of the gem developers out there… to be more constructive, I’ll try to phrase my complaints this way.

Do provide tests in your gem file, or in a tarball

You’d be surprised how many times we have to rely on git snapshots in Gentoo simply because the gem file itself lacks the test files; even worse is when you have the test units, but they rely on data that is not there at all. This is usually due to the test data being too big to ship in a reasonably-sized gem.

Interestingly enough, there used to be a gem command parameter that let it run the gem’s tests at install time. I can’t seem to find it any more — which is probably why people don’t want to add big test suites to their gem files: what’s the point, especially given that most people won’t be using them at all?

It is true that my own Ruby-Elf ships neither test units nor test data in the gem, but that’s why I still release a tarball together with the gem file itself. And yes, I do release it, not just tag it on GitHub.

Do tag your releases on GitHub!

Especially if your tests are too big or boring to fit into the gem file, please do tag your release versions on GitHub; this way we already have a workaround for the lack of testing: we’ll just fetch the tarball from there and use it for building our fake gem installation.
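
Tagging (and pushing the tag!) is a matter of seconds; version number made up, obviously:

git tag -a v1.2.3 -m "release 1.2.3"
git push origin v1.2.3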

Do not mandate presence of analysis tools.

Tools such as simplecov, ruby-debug, or lately guard seem to be all the rage with Ruby developers. They are indeed very useful, but please do not mandate their presence, even if only in development mode.

While using these tools is dramatically important if you plan on doing real development on the gem, we don’t care how much coverage the gem’s tests have: if the tests pass, we know our environment should be sane enough; if they fail, we have a problem. That’s enough for us, and yet we end up having to fight with stuff like simplecov on a per-package basis most of the time.

Do not require test dependencies to build API docs (and vice versa)

This is unfortunately tied to the very simplistic dependency handling of RubyGems. When you look up a gem’s dependencies you only see “runtime” and “development” — it says nothing about what’s needed to build the docs versus what’s required to run the tests.

Since testing and documentation are usually both handled through a Rakefile, and often enough, especially for complex testing setups, the Rakefile uses the task files provided by the testing framework to declare the targets, it’s hard to even run rake -D for a gem when the testing framework isn’t installed.

Now, in Gentoo you might want to install the API documentation of a gem but not run its tests, or vice versa, run the tests but not care about the API documentation. This is a pain in the neck if the Rakefile requires all the development dependencies to be present.

This, together with the previous point, ties back to Bundler’s simplistic tracking — there is no way to avoid having the gems installed in Gentoo even if they are properly grouped and you “exclude” them: Bundler will try to resolve the dependencies anyway, and that defeats the purpose of using Gentoo’s own dependency tracking.
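
By “excluding” I mean the usual way groups are skipped at install time, something along the lines of:

bundle install --without development test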

Do not rely on system services.

This is actually something that happens with non-Ruby packages as well. Many interface libraries that implement access to services such as PostgreSQL or MongoDB will try connecting to the system instance during their test routine. This might be okay when you’re the developer, but when the tests are run on a continuous basis by someone else, it really doesn’t work: the service might not be there, or it might just not be set up the way you expect it to be.

Please provide (and document!) environment variables that can be set to change the default connection strings for those services. This way we can set up a testing instance when running the tests, so that they get their own full environment. Even better if you’re able to write setup/teardown scripts that run unprivileged and respect $TMPDIR.
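
As an example of what I mean, this is the sort of thing we can then do on our side; the variable names here assume the gem honours the standard libpq ones, which is exactly the part that needs documenting:

# spin up a throw-away cluster under ${TMPDIR}, run the tests against it,
# then tear it down; no root and no system service involved
initdb -D "${TMPDIR}/pgtest"
pg_ctl -w -D "${TMPDIR}/pgtest" -o "-k ${TMPDIR} -p 54321" start
PGHOST="${TMPDIR}" PGPORT=54321 rake spec
pg_ctl -D "${TMPDIR}/pgtest" stop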

Do have mocks available.

Possibly due to the heavy use of Ruby for Rails and other webapp/webservice technologies, there are gems out there for almost every possible third-party webservice. Unfortunately, testing these against the full-blown webservice requires private access keys; most gems I’ve seen allow you to specify separate keys to be used during the testing phase — it’s a step in the right direction, but it’s not enough.

While letting the environment declare the access keys makes it possible to set up testing within Gentoo as well, if you care about one particular package, it’s not going to let everybody run the tests automatically. What can you do then? Simple: use mocks. There are tools and libraries out there designed to let you simulate a stateless or partially-stateful webservice, so that you can exercise your own methods and see whether they behave.

One important note here: the mocks don’t strictly have to be written in Ruby! While gems don’t allow you to specify non-Ruby dependencies, distributions can, so if you rely on a mock written in Python, C or Java, you can always document it. It’s still going to be easier than getting hold of an access key and using that.
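
On the distribution side this is trivial to express; a purely hypothetical Gentoo example, with a made-up package name:

# the mock is only needed when the test phase runs, no matter what
# language it happens to be written in
DEPEND="test? ( dev-python/some-service-mock )"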

Even better, if your webservice is friendly to open source (which one isn’t, nowadays? okay, don’t answer), ask them to either provide an “official mock”, or to create standard access keys to be used against some kind of sandbox environment.

Do verify the gems you rely upon.

Especially for tests, it’s easy to forget to make sure that the software you rely upon works. Unfortunately, I’ve seen more than a couple of times that the gems used for testing do not work as expected themselves, causing a domino effect.

This is part of the usual “make sure that the software you rely upon really works”, but it seems that while most people do verify their own direct runtime dependencies, they more often fail to verify those used during testing.

Of course this list is not complete, and there probably are more things that I didn’t think about, but it’s enough of a rant for now.

One’s DoS fix is one’s test breach (A Ruby Rant)

What do you know, it did indeed become a series!

Even though I’m leaving for Los Angeles again on Monday, March 5th, I’m still trying to package up the few dependencies that I’m missing to be able to get Radiant 1.0 in tree. This became more important to me now that Radiant 1.0.0 has finally been released, even though it might become easier for me to just target 1.1.0 at some later point, given that it would then use Rails 3. Together with that I’m trying to find what I need to get Typo 6 running to update this blog’s engine under the hood.

Anyway, one of the dependencies I needed to package for Radiant 1.0 was delocalize, an interesting library that deals with localised date and time formats (something I had to hack together for a project myself, so I’m quite interested in it). Unfortunately, the version I committed to the tree is not the one I should have packaged, which would have been 0.2.6 or so, since Radiant 1.0 is still on Rails 2.3. Sadly there is no git tag for 0.2.6, so I haven’t packaged that one yet (and the .gem file only has some of the needed test files, not all of them — it’s the usual issue: why package the test files in the .gem if you don’t package the test data?), but I’m also trying to find out why the code is failing on Ruby 1.9.

Upstream is collaborative and responsive, but like many others is now more focused on Ruby 1.9 than on 1.8. What does this mean? Well, one of the recent (DoS-related) changes in most programming languages out there involves changing the way keys are hashed in hash-based storage objects (such as Ruby’s Hash); this changed the order in which Ruby 1.8 reports the content of those objects (Ruby 1.9 has ordered hashes, which means they don’t appear to have changed), and that caused headaches for both me and Hans for quite a while.

This shows just one of the many reasons why we might have to step on the throttle and try to get Ruby 1.9 stable and usable soon, possibly even before Ruby 1.8 falls out of security updates, which is scheduled for June — which is what I promised in my FOSDEM talk, and what I’m striving to make sure we deliver.

EAPI=4, missing tests, and emake diagnostics

Some time ago, well, quite a bit of time ago, I added the following snippet to the tinderbox’s bashrc file:

make() {
    if [[ "${FUNCNAME[1]}" == "einstall" ]] ; then
        emake -j1 "$@"
    else
        eqawarn "Tinderbox QA Notice: 'make' called by ${FUNCNAME[1]}"
        emake "$@"
    fi
}

I’m afraid I forgot whose suggestion that was, beyond it being a user’s. Edit: thanks to the anonymous contributor who actually remembered the original author of the code. And sorry Kevin, my memory is starting to feel like old flash memory.

Kevin Pyle suggested this code to me to find packages that ignore parallel builds simply by using the straight make invocation instead of the correct emake one. It has served me well, and it has shown me a relatively long list of packages that should have been fixed, only a handful of which actually failed with parallel make. Most of the issues were due to simple omission, rather than wilful disregard for parallel compilation.

One of the problems this has shown is that, by default, none of the make calls within the default src_test() functions run in parallel. I guess the reason is that most people assume tests are too fragile to run in parallel, but at least with (proper) autotools this is a mistake: the check_PROGRAMS and other check-related targets are rebuilt before running tests, and that can be done with parallel make just fine. Only starting with automake 1.11 do you have the option to execute the tests themselves simultaneously; otherwise the generated Makefile still calls them one by one, no matter how many make jobs you requested. This is why most of my ebuilds rewrite the src_test() function simply to call emake check.
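
In its simplest form that rewrite is just:

src_test() {
    # emake only dies by itself from EAPI=4 onwards, hence the explicit die
    emake check || die "tests failed"
}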

That’s the setup. Now, to get to the topic of the post… while watching the compilation output of the tinderbox I’ve been seeing some strange messages that I couldn’t find in the logs either. They simply told me that emake failed, yet there was no obvious failure output from it, and the package merged fine. What was the problem? Well, it’s a bit complex.

The default src_test() function contains a bit of logic to identify what the Makefile’s test target is, so that it can be used both with autotools (the default target to run tests with an automake-generated Makefile is check) and with otherwise-provided Makefiles, which likely use test as their testing target. This has unfortunate implications already, especially for packages whose maintainers didn’t test with FEATURES=test enabled:

  • if the package has a check target that doesn’t run tests, but rather checks whether the installed copy is in a working state, then you have a broken src_test(); qmail-related packages are known for this, and almost none of the ebuilds for said packages ever restricted or blocked tests;
  • if the package does not have any specific test or check target, but it’s using make’s default rules, it could be trying to build a test command out of a non-existing test.c source file;
  • in any case, it requires make to parse and load the Makefile, which can be a bit taxing, depending on the package’s build system.

To this you can now add one further issue when you combine the above-quoted snippet from the tinderbox’s bashrc with the new EAPI=4 semantics, under which ebuild helpers die by default on failure: if there is no Makefile at all in $S you get an emake failure, which of course isn’t saved in the logs but is still repeated at the end of the merge just for the sake of completeness. Okay, it’s just a nuisance and not a real issue, but it still baffled me.

So what is my suggestion here? Well, it’s something I’ve been applying myself independently of all this: declare your src_test explicitly. This means not only adding it with the expected interface when tests are present, but actually nulling it out if there are no tests: this way no overhead is added when building with FEATURES=test (which is what I do in the tinderbox). To do so, rather than using RESTRICT=test (which I at least read as “tests are supposed to work but they don’t”), do something like:

src_test() { :; } # no tests present

I usually do the same for each function in virtual packages.

I really wish more of my fellow developers started doing the same; the tinderbox would probably have much lower overhead when building packages that have no tests to run (unfortunately a common scenario), as most automake-based packages currently go through a pointless, empty make check each time they are merged, even if no test at all is present.

I’ll follow up with more test-related issues for the tinderbox in the coming weeks, if time permits. Remember that the tinderbox is on Flattr if you wish to show your appreciation… and I could definitely use some help with the hardware maintenance (I just had to replace the system’s mirrored disks, as one WD RE3 died on me).

Testing stable; stable testing

It might not be obvious, but my tinderbox is testing the “unstable” ~x86 keyword of Gentoo. This choice was originally due to the kind of testing involved (the latest and greatest versions of packages, whose adverse effects we don’t yet know), and I think it has helped tremendously, up to now, in catching things that could otherwise have bitten people months, if not years, later, especially in the least-used packages. Unfortunately, that decision also meant ignoring for the most part the other, now more common architecture (amd64 — although I wonder if the new, experimental amd32 architecture would take its place), as well as ignoring the stable branch of Gentoo, which is luckily tracked by other people from time to time.

Lacking continuous testing, though, what is actually marked stable sometimes isn’t very stable at all. The problem is made worse by the fact that sometimes the stable requests aren’t proper in the first place: it can be a user asking for a package to be stabled, with the arch teams called in before they should have been; it might be an issue the maintainer overlooked; or it might simply be that multiple maintainers had different feelings about stabilisation, which happens pretty often with team-maintained packages.

Whatever the case, once a stable request has been filed, it is quite rare that issues severe enough to deny the request are brought up: it might be that the current stable is in much worse shape, or maybe it’s a security stable request, or a required dependency of something following one of those paths. I find this a bit unfortunate; I’m always happy when issues that delay a stabilisation are brought up, even if that sometimes means having to skip a whole generation of a package, as long as it means ending up with a nicer stable package.

Testing a package is obviously a pretty taxing task: we have huge numbers of combinations of USE flags, compiler and linker flags, features, and so on and so forth. For some packages, such as PHP, testing each and every combination of USE flags is not only infeasible, but simply impossible within this universe’s lifetime. To make the work of the arch team developers more bearable, years ago, starting with the AMD64 team, the concept of Arch Testers was invented: so-called “power users” whose task is to test the package and give the green light for stable-marking.

At first, all of the ATs were as technically capable as a full-fledged developer, to the point that most of the original ones (me included) “graduated” to full devship. With time this requirement seems to have relaxed, probably because AMD64 became usable by the average user, and no longer just by those with enough knowledge to work around the landmines originally present when trying to use a 64-bit system as a desktop — I still remember seeing Firefox fail to build because of integer and pointer variable swaps, once that became an error-worthy mistake in GCC 3.4. Unfortunately, this also meant that a number of non-apparent issues became even less apparent.

Most of these issues end up being caused by one simple fault: lack of direction in testing. For most packages, and that unfortunately includes a lot of my own, the ebuild’s self-tests are nowhere near comprehensive enough to tell whether the package works or not, and unless the testers actually use the package, there is little chance they really know how to test it. Sure, actual usage covers most of the common desktop packages and a huge number of server packages, but that’s far from a perfect solution, since it’s the less-common packages that need more eyes on them.

The problem with most other software is that it might require specific hardware, software or configuration to actually be tested, and most of the time it requires a few procedures to be followed to ensure that it is tested properly. At the same time, I only know of the Java and Emacs teams publishing proper testing procedures for at least a subset of their packages. Most packages could use such documentation, but it’s not something that maintainers, me included, are used to working on. I think one of the most important tasks here is to remind developers, when they ask for stabilisation, to come up with a testing plan. Maybe after asking a few times, we’ll get to work and write up that documentation, once and for all.

A “new” Tinderbox

While I’m still hoping for somebody to fund the PAM audit and fixup (remember: if you rely on PAM on your Gentoo systems, you really want somebody to do the work!), and even though I have to reduce the tinderbox costs, I’ve got some pretty cool news for many of you out there.

Up to now, the tinderbox has been running against the unstable/testing keyword for x86. The new tinderbox, which I simply called tinderbox64, uses the unstable/testing keyword for amd64 instead.

The new tinderbox is now testing ~amd64 rather than ~x86.

Why did I decide to go this route? Well, while the 64-bit builds require more space and time, I thought a bit about it: even the stuff I introduce doesn’t get keyworded ~x86 right away, so the x86 tinderbox ends up ignoring tests on my own stuff! Besides, with even my router moving to 64-bit to get the best out of hardened, I’m starting to think x86 is not really relevant for anything nowadays.

That’s not all there is to it, of course; there are a number of issues that only appear on 64-bit (well, there are almost as many that only appear on 32-bit, but for now let me focus on the former): integer and buffer overflows, implicit function declarations that truncate pointers to integers, 64-bit unsafety that makes packages fail to build… All these conditions are more relevant because 64-bit is what you should be using on most modern systems, so it is what should be tested, even more than ~x86.

Now of course it would be better to have both tinderboxes running, and I think I could get the two of them to run in parallel, but then I’d need a new “frontend” system, one I could use both for storage and for virtual machine hosting; probably something a little beefier than the laptop I’m using, mostly in terms of RAM, would be quite nice (the i7 performs quite nicely, but 4GB of RAM is just too little to play with KVM). But even if I could afford to buy a new frontend now (I cannot), it would still mean higher monthly power costs. Right now I can roughly estimate that between power and maintenance costs (hard disks, UPSes, network connection), running the tinderbox costs me between €150 and €200/month, which is not something I can easily afford, especially considering that last year, net of taxes and most expenses, I had an income of €500/month to pay for groceries and food. Whoopsie. And this is obviously without counting the time I spend manually reviewing the results, or fixing them.

Anyway, expect another flood of bugs once the tinderbox gets back up to speed; for now, it might find a few more problems that it previously ignored, since it started building from scratch. And while the 32-bit filesystem is frozen, I’ll probably find some time to run again the collision-detection script that is part of Ruby-Elf, which is supposed to find possible collisions between libraries and the like — something particularly important to take into consideration, as those bugs tend to be the most complex to debug.

The three challenges of automated testing

With my automated testing of Gentoo ebuilds, I have picked up quite a few ideas about automated testing in general. One of these is that there are three main challenges to implementing it properly: checking, analysing, and fixing.

Checking: obviously the first obstacle you have to cope with is writing the code to check for whatever problem you’re looking out for. In the case of my tinderbox, I apply the QA tests that Portage already provides, then add some. The extended tests are almost self-explanatory, even though a few are partly Rube Goldberg machines and thus might take a bit of insight to understand what exactly they do.

But there are quite a few requirements that these QA checks have to respect; one of them is a very high signal-to-noise ratio. Removing all the noise would be ideal, but that’s almost impossible to achieve; since that’s the case, you need (as we’ll see in a moment) human work to analyse the results. But if the signal-to-noise ratio is too low, the check is pointless, as there’s too much crap to sift through before reaching the chocolate.

It is also important that the check is as non-destructive as possible. While “destructive” in software is quite relative, I have had bad experiences with QA checks that injected further failures into the software, such as the “CC not respected” checks that expected packages to fail when the wrong path was taken. The reason these are a bit of a problem is that when a single package breaks (forcefully), the dependency trees it’s part of are also broken, so you stop another long list of packages from even being tested.

Analysing: even if you have tons of checks with very little noise, you have another challenge on your hands: analysing the results and reporting them to the right person. This is very tricky; the easiest way to cope with it is simply not to write any code to analyse the output but, instead, rely on a person to check the results and report them. This is what I’m doing; it’s “easier”, but it’s very time-consuming.

The problem with analysing the errors is that sometimes they need complex interpretation; for instance while some of the GCC 4.5 errors have unique strings (the infamous constructor-specifier warning was introduced with the 4.5 version so any warning regarding that is definitely a GCC 4.5 issue), other times it can be difficult to judge whether something is a compiler bug, or an overflow that only gets uncovered by GCC itself.

It gets even worse when you add --as-needed to the mix: while it’s now mostly safe to use, the symptoms can actually appear in projects other than the one at fault (since shared objects often don’t use --no-undefined, they can appear to link properly and then leave the applications linking against them to fail), or might appear in strange forms, such as an autoconf-based check reporting a library as missing while it’s actually already in the system.

But identifying the problem is just half the analysis task; the other half is removing the noise from the results. To do so you have to collapse duplicate problems (if one library fails because of --as-needed, all its users will fail in similar but somewhat different ways), make sure that the error applies to the latest version (especially with the way I’m running the tests here, I can receive error logs for older-than-latest versions when the latest fails to build — for that or for some other reason), make sure it wasn’t fixed in the tree since I last synced (which I often cannot do properly due to time or bandwidth constraints), and also that it wasn’t reported already.

Then there is the reporting, which can also become tricky when you have test failures, as most of the ebuilds won’t use verbose testing, so you’d have to fetch the test log somehow, and since on successful merge I clear out the build directory, that’s even more tricky.

Fixing: the most difficult challenge is getting the stuff fixed, obviously. What is not obvious is what this actually means; it’s not just a matter of fixing the code, but rather of having someone look at the code with the intention of fixing it. This basically means that you have to report bugs that other developers will look at, and not just put aside and ignore forever.

Even if you have a check with little to no noise at all, reporting problems that are easily fixable, there is no point in spending time on the analysis if the problems will not be taken care of. They might even be important issues, but if the developers can’t feel them, they are not going to get solved. Portability problems are at the top of that list, but security is another area where we just can’t get enough people to care. The tinderbox would be able to produce a huge number of security reports, starting from the binaries that still use unsafe functions, but most of these would not lead to any fix because “it’s working”.

So it really is important to decide what to look out for; right now I’m trying to focus on the things that will, at one point or another, hit our users; unfortunately, even things that are somewhat important (Berkeley DB 5.0 or GCC 4.5, for instance) are being ignored.

Tinderbox — Getting Friendly with GCC 4.5

So, Mark finally gave us GCC 4.5.0 to play with in the tree… obviously it’s still masked, as the tree is definitely not tested to work with it. But that’s what I’m running the tinderbox for, isn’t it? Once the ebuild was there, I prepared to start a new from-scratch run of the tinderbox. What is a “new from-scratch run”?

Well, the tinderbox usually works incrementally: not only does it have a queue of packages to merge, but every time I sync it resumes (more or less) from where it stopped. When the whole queue is consumed, a new list is generated of packages that haven’t been merged in the past six weeks, or that have been merged but have new versions (or revisions) in the tree, and that becomes the new queue.

A new from-scratch run means that the tinderbox is cleared out: all the packages that are not part of the tree induced by the world and system sets are removed from the tinderbox, system and world are rebuilt, then a new list of all the remaining packages is generated… and then it starts. It might sound easy and quick, but unmerging 12000 packages (yes, twelve thousand!) takes well over a day.

This by itself is a good test for unmerge actions, and it provides useful clues. For instance, I know that a single unmerge of a bunch of texlive modules will take æons to complete, as it rebuilds the font cache after every unmerge. The same is true for the packages that update the MIME database, and to a lesser extent for Python modules. Having a way to deal with these things once, at the end, would most likely help us here.

Now, I originally wanted to go on and implement the spec file trickery to check for flags (remember? this was one of the first reasons why I started working on the tinderbox), but in the end I decided to leave that for another time. My reason for not doing it at the same time as GCC 4.5 is that… this time I’m going to be surprised: I sincerely have no idea which problems this GCC will produce, whereas with GCC 4.4 I had a good clue how to fix things before starting.

Also, I’ve been pretty scared by the effects of GCC 4.5 on my host system: during the rebuild, GNU tar stopped working properly, so that, for instance, all the calls to libtoolize within the ebuilds failed, as they weren’t able to update the config.sub and config.guess files. I worked around this in the middle of the merge by replacing tar with the bsdtar provided by libarchive, and all is working fine, for now. The tar failure has to be investigated, though, obviously.

Anyway, the work has started, even though it’ll take a bit to roll out entirely — I hope you’ll enjoy the results. And if you feel like thanking me, there is the wishlist — too bad that I can’t put eBooks in that list (since I don’t use a Kindle).

Upstream, rice it down!

While Gentoo often gets a bad name because of the so-called ricers, and upstream developers complain that we allow users to shoot themselves in the foot by setting CFLAGS as they please, it has to be said that not all upstream projects are good in that regard. For instance, there are a number of projects that, unless you enable debug support, will force you to optimise (or even over-optimise) the code, which is obviously not the best of ideas (this does not count things like FFmpeg that rely on Dead Code Elimination to link properly — in those cases we should be even more careful, but let’s leave that alone for now).

Now, what is the problem with forcing optimisation for non-debug builds? Well, sometimes you might not want to have debug support (extra verbosity, assertions, …) but you might still want to be able to fetch a proper backtrace; in such cases you have a non-debug build that needs to turn down optimisations. Why should I be forced to optimise? Most of the time, I shouldn’t.

Over-optimisation is even nastier: when upstream forces stuff like -O3, they might not even realise that the code might easily end up slower. Why is that? Well, one of the reasons is -funroll-loops: declaring all loops to be slower than unrolled code is an over-generalisation that simply doesn’t hold up, if you have a minimum of CPU theory in mind. Sure, the loop instructions have a higher overhead than just pushing the instruction pointer further, but unrolled loops (especially when they are pretty complex) become CPU cache-hungry; where a loop might stay hot within the cache for many iterations, an unrolled version will most likely require more than a couple of fetch operations from memory.

Now, to be honest, this was much more of an issue with the first x86-64 capable processors, because of their risible cache size (it was roughly the same as the cache available on comparable 32-bit-only CPUs, but with code that almost literally doubled in size). This was the reason why some software, depending on a series of factors, ended up being faster when compiled with -Os rather than -O2 (optimising for size, the code shrinks and uses less CPU cache).

At any rate, -O3 is not something I’m very comfortable working with; while I agree with Mark that we shouldn’t filter or exclude compiler flags (unless they are deemed experimental, as is the case for graphite) based on compiler bugs – those should be fixed – I’d also prefer to avoid hitting those bugs on production systems. And since -O3 is much more likely to hit them, I’d rather stay the hell away from it. Joking about that, yesterday I produced a simple hack for the GCC spec files:

flame@yamato gcc-specs % diff -u orig.specs frigging.specs
--- orig.specs  2010-04-14 12:54:48.182290183 +0200
+++ frigging.specs  2010-04-14 13:00:48.426540173 +0200
@@ -33,7 +33,7 @@
 %(cc1_cpu) %{profile:-p}

 *cc1_options:
-%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage}
+%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage} %{O3:%eYou're frigging kidding me, right?} %{O4:%eIt's a joke, isn't it?} %{O9:%eOh no, you didn't!}

 *cc1plus:

flame@yamato gcc-specs % gcc -O2 hellow.c -o hellow; echo $?   
0
flame@yamato gcc-specs % gcc -O3 hellow.c -o hellow; echo $?
gcc: You're frigging kidding me, right?
1
flame@yamato gcc-specs % gcc -O4 hellow.c -o hellow; echo $?
gcc: It's a joke, isn't it?
1
flame@yamato gcc-specs % gcc -O9 hellow.c -o hellow; echo $?
gcc: Oh no, you didn't!
1
flame@yamato gcc-specs % gcc -O9 -O2 hellow.c -o hellow; echo $?
0

Of course, there is no way I could put this into production as it is. While the spec files allow enough flexibility to match only the last optimisation level given (the one that is actually applied), rather than any parameter passed, they lack an “emit warning” instruction; the instruction used above, as you can see from the value of $?, is “error out”. While I could get it running in the tinderbox, it would probably produce so much noise and so many failing packages that I’d spend each day just trying to find out why something failed.

But if somebody feels like giving it a try, it would be nice to ask the various upstreams to rice it down themselves, rather than have us always labelled as the ricer distribution.

P.S.: building with no optimisation at all may cause problems too; in part because of reliance on features such as DCE, as stated above and as used by FFmpeg; in part because headers, including system headers, might change behaviour and cause the packages to fail.