
The three challenges of automated testing

With my automated testing of Gentoo ebuilds, I’ve picked up quite a few ideas about automated testing in general. One of these is that there are three main challenges in implementing it properly: checking, analysing, and fixing.

Checking: obviously, the first obstacle you have to cope with is writing the code that checks for whatever problem you’re looking out for. In the case of my tinderbox, I apply the QA tests that Portage already provides, then add some of my own. The extended tests are almost self-explanatory, even though a few are partly Rube Goldberg machines and thus might take a bit of insight to understand what exactly they do.
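
To give an idea of what such a check can look like, here’s a minimal sketch in Python that scans a build log for “implicit declaration of function” warnings. It’s only an illustration of the kind of check described above; the pattern and the log handling are assumptions made for the sketch, not how the tinderbox actually implements it.

```python
#!/usr/bin/env python3
"""Toy QA check: flag "implicit declaration" warnings in a build log.

Illustrative only; the pattern and the log handling are assumptions
made for this sketch, not the tinderbox's actual implementation.
"""
import re
import sys

# This warning almost always points at a missing include, and on 64-bit
# systems it can hide a pointer-truncation bug.
IMPLICIT_DECL = re.compile(r"warning: implicit declaration of function")


def check_log(path):
    hits = []
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if IMPLICIT_DECL.search(line):
                hits.append((lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for logfile in sys.argv[1:]:
        for lineno, line in check_log(logfile):
            print(f"{logfile}:{lineno}: {line}")
```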

But there are quite a few requirements that these QA checks have to respect; one of them is a very high signal-to-noise ratio. Removing all the noise would be ideal, but that’s almost impossible to achieve; since that’s the case, you need (as we’ll see in a moment) human work to analyse the results. And if the signal-to-noise ratio is too low, the test is pointless, as there’s too much crap to sift through before reaching the chocolate.

It is also important that the test is as non-destructive as possible. While “destructive” in software is quite relative, I have had bad experiences with QA tests that injected further failures into the software being tested, such as the “CC not respected” checks, which were designed to make packages fail when the wrong compiler was used. The reason these are a problem is that when a single package is (forcefully) broken, the dependency trees it is part of break with it, so you stop a long list of other packages from even being tested.
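
A less destructive alternative is to only report the problem, for instance by grepping the build log after the fact instead of making the build die. The sketch below assumes a particular expected compiler name and log format; both are placeholders for the example.

```python
#!/usr/bin/env python3
"""Non-destructive sketch of a "CC not respected" check.

Rather than injecting a failure into the build (and breaking its whole
dependency tree), this only inspects the build log afterwards.  The
expected compiler name and the regex are assumptions for the example.
"""
import re
import sys

EXPECTED_CC = "x86_64-pc-linux-gnu-gcc"  # placeholder toolchain name
# A log line that starts a plain `cc`/`gcc` command, bypassing $CC.
BARE_CC = re.compile(r"^\s*(cc|gcc)\s")


def report_bare_cc(path):
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if BARE_CC.match(line) and EXPECTED_CC not in line:
                # Warn only: the package still merges, and its reverse
                # dependencies still get tested.
                print(f"QA notice {path}:{lineno}: CC not respected: "
                      f"{line.rstrip()}")


if __name__ == "__main__":
    for logfile in sys.argv[1:]:
        report_bare_cc(logfile)
```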

Analysing: but even if you have tons of tests with very little noise, you have another challenge on your hands: analysing the results and reporting them to the right person. This is very tricky; the easiest way to cope with it is simply not to write any code to analyse the output but, instead, rely on a person to check the results and report them. This is what I’m doing; it’s “easier”, but it’s very time-consuming.

The problem with analysing the errors is that sometimes they need complex interpretation. For instance, while some of the GCC 4.5 errors have unique strings (the infamous constructor-specifier warning was introduced with the 4.5 release, so any warning mentioning it is definitely a GCC 4.5 issue), at other times it can be difficult to judge whether something is a compiler bug or an overflow that only gets uncovered by the new GCC itself.
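
Where a failure class does have a unique string, part of the interpretation can be mechanised by matching logs against a table of known signatures and leaving everything else to a human. The patterns in this sketch are placeholders, not the exact strings that get matched in practice.

```python
#!/usr/bin/env python3
"""Sketch of signature-based triage for build failures.

Each known failure class gets a regex; the first match wins and
anything unmatched still has to be read by a person.  The patterns are
placeholders, not the exact strings used in practice.
"""
import re
import sys

SIGNATURES = [
    ("gcc-4.5", re.compile(r"constructor.*specifier")),         # placeholder
    ("as-needed", re.compile(r"undefined reference to [`']")),  # placeholder
    ("fortify", re.compile(r"buffer overflow detected")),       # placeholder
]


def classify(log_text):
    for label, pattern in SIGNATURES:
        if pattern.search(log_text):
            return label
    return "needs-human"


if __name__ == "__main__":
    for logfile in sys.argv[1:]:
        with open(logfile, errors="replace") as log:
            print(logfile, classify(log.read()))
```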

It gets even worse when you add --as-needed to the mix: while it’s now mostly safe to use, the symptoms can appear in a different project from the one actually at fault (since shared objects are often linked without --no-undefined, a library can appear to link properly and then leave the applications linking against it to fail), or they might appear in strange forms, such as an autoconf-based check reporting a library as missing while it’s actually already installed on the system.

But identifying the problem is just half of the analysis task; the other half is removing the noise from the results. To do so you have to collapse duplicate problems (if one library fails because of --as-needed, all its users will fail in similar but slightly different ways), make sure that the error applies to the latest version (especially with the way I’m running the tests here, I can receive error logs for older-than-latest versions when the latest fails to build, whether for that reason or for something else), make sure it wasn’t fixed in the tree since I last synced (which I often cannot do properly because of time and bandwidth constraints), and make sure it wasn’t reported already.
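
The collapsing step in particular lends itself to a bit of scripting. The sketch below groups failures by an already-identified root cause and keeps only the newest version of each package; the data model is made up for the example, and the version comparison is deliberately naive (real Portage version strings would need vercmp).

```python
#!/usr/bin/env python3
"""Sketch of the noise-reduction step: collapse failures that share a
root cause and keep only the newest version of each package.

The (package, version, cause) tuples are a made-up data model for the
example; in practice this step is done by hand.
"""
from collections import defaultdict


def collapse(failures):
    """failures: iterable of (package, version, cause) tuples."""
    latest = {}
    for package, version, cause in failures:
        # Naive string comparison; real Portage versions need vercmp.
        if package not in latest or version > latest[package][0]:
            latest[package] = (version, cause)

    by_cause = defaultdict(list)
    for package, (version, cause) in latest.items():
        by_cause[cause].append(f"{package}-{version}")
    return by_cause


if __name__ == "__main__":
    sample = [
        ("media-libs/libfoo", "1.2", "as-needed"),
        ("media-libs/libfoo", "1.3", "as-needed"),
        ("app-misc/bar", "0.9", "as-needed"),  # fails only because libfoo does
    ]
    for cause, packages in sorted(collapse(sample).items()):
        print(cause, "->", ", ".join(sorted(packages)))
```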

Then there is the reporting itself, which can also become tricky when you have test failures: most ebuilds don’t run their test suites verbosely, so you have to fetch the test log somehow, and since I clear out the build directory on a successful merge, that’s even more tricky.
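
One possible workaround, sketched below rather than actually deployed, is to copy any test logs out of the build directory to a persistent location before the cleanup happens; the *.log glob and the directory layout are assumptions that fit a typical automake test harness.

```python
#!/usr/bin/env python3
"""Sketch of saving test logs before the build directory is cleaned.

The *.log glob and the directory layout are assumptions (they fit a
typical automake test harness); the point is only to copy the logs
somewhere persistent before the cleanup step removes them.
"""
import shutil
import sys
from pathlib import Path


def archive_test_logs(workdir, archive_dir):
    workdir, archive_dir = Path(workdir), Path(archive_dir)
    for log in workdir.rglob("*.log"):
        target = archive_dir / log.relative_to(workdir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(log, target)


if __name__ == "__main__":
    archive_test_logs(sys.argv[1], sys.argv[2])
```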

Fixing: the most difficult challenge is getting the stuff fixed, obviously. What is not obvious is what that actually means; it’s not just a matter of fixing the code, but rather of having someone look at the code and intend to fix it. This basically means that you have to report bugs that other developers will actually look at, rather than put to the side and ignore forever.

Even if you have a test with little to no noise at all, reporting problems that are easily fixable, there is no point in spending time on the analysis if nobody is going to take care of the results. They might even be important issues, but if the developers can’t feel them, they are not going to get solved. Portability problems sit at the top of that list, and security is another area where we just can’t get enough people to care. The tinderbox would be able to produce a huge number of security reports, starting from the binaries that still use unsafe functions, but most of these would not lead to any fix because “it’s working”.
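
Just to give an idea of what such a report could be based on, here’s a sketch that lists the dynamic symbols of a binary with nm -D and flags imports of functions that are hard to use safely. The function list is a small illustrative subset, and this is not the check the tinderbox runs.

```python
#!/usr/bin/env python3
"""Sketch of a binary-level check for hard-to-use-safely functions.

It shells out to `nm -D` and flags undefined (imported) symbols that
appear in a small, illustrative blacklist.  This is an example of the
kind of report discussed above, not the tinderbox's actual check.
"""
import subprocess
import sys

UNSAFE = {"gets", "strcpy", "strcat", "sprintf"}


def unsafe_imports(binary):
    out = subprocess.run(["nm", "-D", binary],
                         capture_output=True, text=True, check=False).stdout
    found = set()
    for line in out.splitlines():
        parts = line.split()
        # Imported symbols are listed as "U <name>", possibly versioned.
        if len(parts) >= 2 and parts[-2] == "U":
            name = parts[-1].split("@")[0]
            if name in UNSAFE:
                found.add(name)
    return found


if __name__ == "__main__":
    for binary in sys.argv[1:]:
        hits = unsafe_imports(binary)
        if hits:
            print(f"{binary}: {', '.join(sorted(hits))}")
```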

So it really is important to decide what to look out for. Right now I’m trying to focus on the things that will, at one point or another, hit our users; unfortunately, even the things that are fairly important (Berkeley DB 5.0 or GCC 4.5, for instance) are getting ignored.

Comments 2
  1. I think you have your signal-to-noise backwards. You want high signal to low noise. IE, you want just the chocolate. Low signal to noise is a static filled screen, with a constant humming in the background; you can still enjoy the show but it won’t be easy. High noise is the GCC ‘warn unused return’ on a project full of fprint()s.
