Postmortem of a patch, or how do you find what changed?

Two days ago, Luca asked me to help him figure out what was going on with a patch for libav that he knew to be the right thing, but that was acting up in a way he didn't understand: on his computer it increased the size of the final shared object by 80KiB. While that number is certainly not outlandish for a library such as libavcodec, it does seem odd at first glance that a patch removing source code ends up increasing the size of the compiled code.

My first wild guess, which (spoiler alert) turned out to be right, was that removing branches from the functions let GCC optimize them further and decide to inline them. But how do you actually make sure? Time to get the right tools for the job: dev-ruby/ruby-elf, dev-util/dwarves and sys-devel/binutils enter the battlefield.

We built libav with and without the patch on my server, and rbelf-size gave us a first idea of the difference:

% rbelf-size --diff libav-{pre,post}/avconv
        exec         data       rodata        relro          bss     overhead    allocated   filename
     6286266       170112      2093445       138872      5741920       105740     14536355   libav-pre/avconv
      +19456           +0         -592           +0           +0           +0       +18864 

Yes, there's a bug in the output (the missing filename on the difference row), I noticed. So there is a total increase of around 20KiB; how is it split up? Given that this is a build with debug info, it's easy to find out through codiff:

% codiff -f libav-{pre,post}/avconv
[snip]

libavcodec/dsputil.c:
  avg_no_rnd_pixels8_9_c    | -163
  avg_no_rnd_pixels8_10_c   | -163
  avg_no_rnd_pixels8_8_c    | -158
  avg_h264_qpel16_mc03_10_c | +4338
  avg_h264_qpel16_mc01_10_c | +4336
  avg_h264_qpel16_mc11_10_c | +4330
  avg_h264_qpel16_mc31_10_c | +4330
  ff_dsputil_init           | +4390
 8 functions changed, 21724 bytes added, 484 bytes removed, diff: +21240

[snip]

If you wonder why this object adds more code than the total increase we measured, it's because the patch also deleted functions elsewhere, causing reductions in other files. Now we know that the three functions the patch touched here did lose some code, but five other functions grew by more than 4KiB each. It's time to find out why.

A common way to do this is to have GCC emit the assembly for the translation unit and compare the two versions. Due to the size of the dsputil translation unit this turned out to be completely pointless: just the changes in the jump labels cause the whole file to be rewritten.
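
For reference, getting that assembly in the first place is just a matter of replaying the compiler command that the build system uses for the file (make V=1 usually shows it); $FULL_CFLAGS below is a placeholder for whatever flags and include paths that command contains:

% gcc $FULL_CFLAGS -S libavcodec/dsputil.c -o dsputil-pre.s    # emit assembly instead of an object
% gcc $FULL_CFLAGS -save-temps -c libavcodec/dsputil.c         # or keep the .i/.s intermediates next to the .o

So we rely instead on objdump, which gives us a full disassembly of the executable section of the object file: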

% objdump -d libav-pre/libavcodec/dsputil.o > dsputil-pre.s
% objdump -d libav-post/libavcodec/dsputil.o > dsputil-post.s
% diff -u dsputil-{pre,post}.s | diffstat
 unknown |245013 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 125163 insertions(+), 119850 deletions(-)

As you can see, trying a diff between these two files is going to be pointless: first of all because of the sheer size of the disassembled files, and secondly because each instruction is prefixed with its address offset, which means that nearly every single line will differ anyway. So what to do? Well, first of all it's useful to isolate one of the functions, so that we reduce the scope of the changes to check. I found out that there is a nice way to do so, and it relies on the way the function's label appears in the disassembly:

% fgrep -A3 avg_h264_qpel16_mc03_10_c dsputil-pre.s
00000000000430f0 <avg_h264_qpel16_mc03_10_c>:
   430f0:       41 54                   push   %r12
   430f2:       49 89 fc                mov    %rdi,%r12
   430f5:       55                      push   %rbp
--
[snip]

While it takes a moment to come up with the correct syntax, a simple sed address range, from the function's label down to the blank line that ends it, gets you the data you need:

% sed -n -e '/<avg_h264_qpel16_mc03_10_c/,/^$/p' dsputil-pre.s > dsputil-func-pre.s
% sed -n -e '/<avg_h264_qpel16_mc03_10_c/,/^$/p' dsputil-post.s > dsputil-func-post.s
% diff -u dsputil-func-{pre,post}.s | diffstat
 dsputil-func-post.s | 1430 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 1376 insertions(+), 54 deletions(-)

Okay, that's much better; but it's still a lot of code to sift through. Can't we reduce it further? Well, actually… yes. My original guess was that some function got inlined, so let's check for that. If a function is not inlined it has to be called, and the instruction for that, in this context, is callq. So let's check whether the calls that happen have changed:

% diff -u =(fgrep callq dsputil-func-pre.s) =(fgrep callq dsputil-func-post.s)
--- /tmp/zsh-flamehIkyD2        2013-01-24 05:53:33.880785706 -0800
+++ /tmp/zsh-flamebZp6ts        2013-01-24 05:53:33.883785509 -0800
@@ -1,7 +1,6 @@
-       e8 fc 71 fc ff          callq  a390 
-       e8 e5 71 fc ff          callq  a390 
-       e8 c6 71 fc ff          callq  a390 
-       e8 a7 71 fc ff          callq  a390 
-       e8 cd 40 fc ff          callq  72e0 <avg_pixels8_l2_10>
-       e8 a3 40 fc ff          callq  72e0 <avg_pixels8_l2_10>
-       e8 00 00 00 00          callq  43261 
+       e8 00 00 00 00          callq  8e670 
+       e8 71 bc f7 ff          callq  a390 
+       e8 52 bc f7 ff          callq  a390 
+       e8 33 bc f7 ff          callq  a390 
+       e8 14 bc f7 ff          callq  a390 
+       e8 00 00 00 00          callq  8f8d3 

Yes, I do use zsh. On the other hand, now that I look at the command above I notice there's a bug: the =(…) substitution does not respect $TMPDIR, as it should have used /tmp/.private/flame as the base path. Dang!

So the quick check shows that avg_pixels8_l2_10 is no longer called; but does that account for the whole size increase? Let's see if the function itself changed:

% nm -S libav-{pre,post}/libavcodec/dsputil.o | fgrep avg_pixels8_l2_10
00000000000072e0 0000000000000112 t avg_pixels8_l2_10
00000000000072e0 0000000000000112 t avg_pixels8_l2_10
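
The second field in the nm output is the symbol size, in hexadecimal; a quick conversion shows what it amounts to:

% printf '%d\n' 0x112
274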

The size is the same, 274 bytes, in both builds. The increase per caller is 4330 bytes, which is around fifteen times the size of this single function; what does that mean, then? Well, a quick look around shows this piece of code:

        41 b9 20 00 00 00       mov    $0x20,%r9d
        41 b8 20 00 00 00       mov    $0x20,%r8d
        89 d9                   mov    %ebx,%ecx
        4c 89 e7                mov    %r12,%rdi
        c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
        e8 cd 40 fc ff          callq  72e0 <avg_pixels8_l2_10>
        48 8d b4 24 80 00 00    lea    0x80(%rsp),%rsi
        00 
        49 8d 7c 24 10          lea    0x10(%r12),%rdi
        41 b9 20 00 00 00       mov    $0x20,%r9d
        41 b8 20 00 00 00       mov    $0x20,%r8d
        89 d9                   mov    %ebx,%ecx
        48 89 ea                mov    %rbp,%rdx
        c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
        e8 a3 40 fc ff          callq  72e0 <avg_pixels8_l2_10>
        48 8b 84 24 b8 04 00    mov    0x4b8(%rsp),%rax
        00 
        64 48 33 04 25 28 00    xor    %fs:0x28,%rax
        00 00 
        75 0c                   jne    4325c 

This is just a fragment, but you can see two calls to the function at 72e0 (avg_pixels8_l2_10) with only the argument set-up changing between them; the same sequence repeats over and over throughout the function, which is the signature of an unrolled loop (the xor against %fs:0x28 followed by jne at the end of the fragment is not part of the loop: it's the stack-protector check before returning). So the function gets called a large number of times by each of these callers. Knowing that these are the 10-bit versions of the functions, it's not surprising that the small helper is invoked so often; remove the call overhead (as the function is now inlined) and you can see how fifteen or so copies of that small function per caller account for the 4KiB.

So my guess was right, but incomplete: GCC not only inlined the function, but it also unrolled the loop, probably doing constant propagation in the process.
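
If you want to confirm the compiler's role in a case like this, a quick and dirty experiment, assuming you can replay the build command for the single object, is to tone down inlining and re-measure; the flags are standard GCC options, and $FULL_CFLAGS is once again a placeholder for the real build flags:

% gcc $FULL_CFLAGS -fno-inline-small-functions -fno-inline-functions-called-once -c libavcodec/dsputil.c -o dsputil-noinline.o
% size dsputil-noinline.o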

Is this it? Almost. The next step was to get some benchmark data for the code, which was mostly Luca's work (and I have next to no info on how he did that, to be entirely honest); the results on my server have been inconclusive, as the 2% loss that he originally registered was gone in further testing and would, in any case, be well within the margin of error of a non-dedicated system; no, we weren't using full-blown profiling tools for that.

While we don't have any sound numbers about it, what we're worried about are cache-starved architectures, such as Intel Atom, where the unrolling and inlining can easily cause a performance loss rather than a gain; which is why all of us developers facepalm in front of people using -funroll-all-loops and similar flags. I guess we'll have to find an Atom system to do this kind of run on…

Munin no[dt]es

Back when I was looking at entropy I took Jeremy's suggestion and started setting up Munin on my boxes to get an idea of their day-to-day behaviour. This was actually immensely useful: it showed that the EntropyKey worked fine, but that spikes in load and process count caused entropy depletion nonetheless. After a bit of time spent tweaking, trying, and working with Munin, I now have a few considerations I'd like to make you a part of.

First of all, you might remember my custom ModSecurity ruleset (which doesn't seem to pique the interest of very many people). One of the things my ruleset filters is requests coming from user agents that only report the underlying library version (such as libCurl, http-access2, libwww-perl, …), as most of the time these turn out to be custom scripts or bots. As it happens, even the latest sources of Munin do not report themselves as being part of anything "bigger", with the final result that you can't monitor Apache when my ModSecurity ruleset is present, d'oh!

Luckily, Christian helped me out and provided me with a patch (that is attached to bug #370131 which I’ll come back to later) that makes munin plugins show themselves as part of munin when sending the requests, which stops my ruleset from rejecting the request altogether.

In the bug you'll also find a patch that changes the apc_nis plugin, used to request data from APC UPSes through apcupsd, to graph one more variable: the UPS's internal temperature. This was probably ignored before because it is not present on most low-end series, such as the Back-UPS, but it is available on my SmartUPS, which last year did indeed shut down abruptly, most likely because of overheating.

Unfortunately it's not always this easy to fix the trouble; indeed one of the issues most obnoxious to me is that Munin does not support using IPv6 to connect to its nodes. The problem lies in the Perl Net-Server module, and there is no known solution barring some custom distribution patches, which I'd rather not ask to have carried in Gentoo. For now I solved this by using ssh with port forwarding over address 127.0.0.2, which is not really nice, but works decently for my needs.
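
The workaround looks something along these lines; the hostname is obviously a placeholder, 4949 is munin-node's default port, and the stanza goes in munin.conf on the master:

% ssh -f -N -L 127.0.0.2:4949:localhost:4949 node.example.com

[node.example.com]
    address 127.0.0.2
    use_node_name yes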

Another interesting area relates to sensors handling: Munin comes with a plugin to interpret the output of the sensors program from lm_sensors, which is tremendously nice for making sure that a system is not constantly overheating. Unfortunately this doesn't always work as well as it should. It turns out that the output format has changed quite a bit in the latest versions compared to what Munin expects: the min/max/hyst tuple of extra values changes name depending on the chip in use. Plus, the plugin doesn't take into consideration the possibility that one of the sensors is disabled altogether.

This last problem is what hit me on Raven (my frontend system). The sensors of the board – an Asus AT5IONT-I – were not supported on Linux up to version 2.6.38; I could only watch the values reported by coretemp, the Intel CPU temperature sensor. With 2.6.39, the it8721 driver finally supported the board and I could watch all the temperature values. But of the three temperature inputs available from the sensor, only two are actually wired, respectively to a thermal diode and a thermistor; the third is disabled, but is still reported by the sensors output with a value of -128°C, which Munin then happily graphs. The only way I found to get rid of that was to create a local sensors.d configuration file to stop the value from appearing.
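
The sensors.d snippet ended up being something along these lines; the chip pattern and which temperature input is the dead one are specific to my board, so treat both as placeholders:

chip "it8721-*"
    ignore temp3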

One interesting note relates to the network I/O graphing: Munin didn't take very kindly the counter rollover at my router's reboot, and reported a peak of petabits of I/O at the time. The problem turns out to be easy to solve, but not as easy to guess. When reading the ifconfig output, the if_ plugin does not discard any datapoint, even one bigger than the speed of the given interface, unless it can read the interface's real speed through mii-tool. Unfortunately it can only do that if the plugin is executed as root, which obviously is not the default.
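
Whether you actually want the plugin running as root is a judgement call, but if you do, the plugin configuration mirrors the hddtemp_smartctl stanza shown below; something like this in a plugin-conf.d file should be enough:

[if_*]
user root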

Another funny plugin is hddtemp_smartctl: rather than using the hddtemp command itself, it uses smartctl, which is part of smartmontools, a package I most definitely keep around more often than the former. Unfortunately it also has its limitations; the most obnoxious is that it doesn't let you just list a bunch of device paths. Instead you first have to provide it with a set of device names, and then you can override their default device paths; this is further complicated by the fact that each of the paths gets /dev/ prefixed to it. Since Yamato has a number of devices, and they don't always get the same letters, I had to set it up this way:

[hddtemp_smartctl]
user root
group disk
env.drives sd1 sd2 sd3 sd4 sd5 sd6
env.dev_sd1 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT15838155
env.dev_sd2 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT16241483
env.dev_sd3 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4499656
env.dev_sd4 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4517732
env.dev_sd5 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2720459
env.dev_sd6 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2721970

For those curious, the temperatures vary between 39°C for sd6 and 60°C for sd3.

I have to say I'm very glad I started using Munin, as it helps me understand a few important things. Among them: that I need to move my storage to a separate system, rather than delegating everything to Yamato (I'll do so as soon as I have a real office and some extra cash), and that using a bridge to connect the virtualised guests to my home network is not a good idea (having Yamato's network card in promiscuous mode means that all the packets are received by it, even those directed to the access point connecting me to the router downstairs).

There’s a new sudo in town

It was last month that I noticed the release of the new series of sudo (1.8), which brought a number of changes to the whole architecture of the package, starting with a plugin-based design for the authentication systems. Unfortunately, when I went to bump it, I decided to simply let users update to 1.7.5 and keep 1.8.0 in package.mask, because I didn't have time to solve the (many) quality issues the new version showed, starting from LDFLAGS handling.

This morning, during my usual work tour-de-force, I noticed the first release candidate of the 1.8.1 series, which is due to come out at the end of this week, so I decided to finally get to work on the ebuild for the 1.8 series so that it would work correctly. I didn't expect it would take me until past 6 in the morning to get it to work to an extent (and even then it wasn't completely right!).

While looking into the 1.8 series I also ended up getting the 1.7.6 ebuild updated to a more… acceptable state:

  • the infamous S/Key patch that I've been carrying along since I took over the package from Tavis didn't apply to 1.8, mostly just because of the source reorganisation; I ended up converting it into a proper autoconf-based check, so that it could be submitted upstream;
  • the recursive make process wasn't right in the 1.8 releases, so even if the package failed to build, the failure was never reported back to the caller, and thus the ebuild didn't fail – this is probably something I could write about in the future, as I've seen it happen before as well – I was able to fix that and send the fix upstream, but it also showed me that I had been ignoring one of the regression tests failing altogether, which is worrisome to say the least;
  • since I had to re-run autotools, I ended up hitting a nasty bug with packages using autoconf but not automake: we didn't inject the search path for extra m4 files, even though the eclass was designed with AT_M4DIR in mind for that; it is now fixed, and the various autoconf/autoheader steps are given the proper include path, so that even sudo can have its autotools build system updated;
  • while looking at the ebuilds, I decided to change both of them from EAPI=0 to EAPI=4; it didn't make much of a difference by itself, but I was able to use REQUIRED_USE to express the requirement that PAM and S/Key not both be enabled at the same time (see the sketch after this list), even though I didn't get this right the first time around; thanks Zeev!
  • another piece of almost "cargo-cult" code had been left in the ebuild since I took it over; it was in charge of adding a few variables to the list of variables to blacklist. Since this was done through sed, and it only stopped working with the 1.8 series, I never noticed that it had become meaningless: all those variables were already handled in the upstream sudo sources, so I was simply able to drop the whole logic. Note to self: make sure never to rely so much on silently-ignored sed statements!
  • today, while at it (I added rc1 this morning, then moved to rc2 today after Todd pointed out to me that the LDFLAGS issue was fixed there), I reviewed the secure path calculation, which now treats the gnat compiler paths just like those for gcc itself;
  • the sudo plugins are no longer installed in /usr/libexec but rather in the so-called pkglibdir (/usr/lib/sudo); this also required patching the build system.
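
The REQUIRED_USE bit mentioned in the list above boils down to something like this; it's a sketch rather than a copy of the actual ebuild, with pam and skey being the USE flag names I'd expect:

EAPI=4
# at most one of the two authentication backends may be enabled at a time
REQUIRED_USE="?? ( pam skey )"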

I hope to be able to unmask sudo 1.8.1 for ~arch when it comes out. With a bit of luck, by that time there won't be any patch applied at all, since I sent all of them to Todd. That would be the first version of sudo in a very long time built from the vanilla, unpatched, unmodified sources. And if you know me, I like situations like that!

Unfortunately, I have never used nor tested the LDAP support for sudo, so I’m pretty sure it doesn’t work with sudo 1.8 series in Gentoo. Pretty please if somebody has used that, let me know what needs to be fixed! Thanks.

Using the SHA2 hash family with OpenPGPv2 cards and GnuPG

I'm sure I have said this before, though I don't remember when or to whom: most of the time it feels to me like GnuPG only works out of sheer luck, or sometimes fails to work just to spite me. Which is why I end up writing things down whenever I actually manage to coerce it into behaving as I wish.

Anyway, let's start with a bit of background: a while ago the SHA1 algorithm was deemed by most experts to be insecure, which means that relying on it for Really Important Stuff is a bad idea; I still remember reading this entry by dkg that provided a good start for setting up your system to use the SHA2 family (SHA256 in particular).

Unfortunately, when I actually got the FSFE smartcard and created the new key I noticed (and noted in the post) that only SHA1 signatures worked; I set up the card to use SHA1 signatures and, to be honest, forgot about it. Today, though, I went to sign an email and… it didn't work: it reported that the created signature was invalid.

A quick check around, and it turns out that for some reason GnuPG started honouring the ~/.gnupg/gpg.conf file rather than the key preferences; maybe it was because I had to reset the PIN on the card after mistyping it on the laptop too many times (I haven't turned off the backlight since!). The configuration file was already set to use SHA256, so signing failed because the card was still set up for SHA1.

A quick googling around brought me to an interesting post from earlier this year. The problem as painted there seemed to exist only with GnuPG 1.4 (so not the 2.0 version I'm using) and was reportedly fixed. But the code in the actual sources of 2.0.16 tells a different story: the bug is the same there as it was in 1.4 back in January. What about 1.4? Well, it's not fixed in the latest release either, but it is on the Subversion branch; I only noticed that afterwards, though, so you'll see why that solution differs from mine.

Anyway, the problem is the same in the new source file: gpg does not ask the agent (and thus scdaemon) to use any digest other than RMD160, which was correct for the old cards but definitely is not for the OpenPGP v2 cards that FSFE is now providing its fellows with. If you want to fix the problem and you're a Gentoo user, you can simply install gnupg-2.0.16-r1 from my overlay; if you're not using Gentoo but building it by hand, or you want to forward it to other distributions' packages, the patch is also available…

And obviously I sent it upstream and I’m now waiting on their response to see if it’s okay to get it applied in Gentoo (with a -r2). Also remember that you have to edit your ~/.gnupg/gpg.conf to have these lines if you want to use the SHA2 family (SHA256 in this particular case):

personal-digest-preferences SHA256
cert-digest-algo SHA256
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 ZLIB BZIP2 ZIP Uncompressed
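
A quick way to double-check which digest actually gets used, once card and configuration agree, is to clear-sign something and look at the armour header; it should report something along these lines:

% echo test | gpg --clearsign | head -n 2
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256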

Maintaining backports with GIT

Last week I wrote about the good feeling of merging patches upstream – even though since then I don't think I got anything else merged, well, besides the bti fixes that I sent Greg – so this week let's start with the opposite problem: how can you handle backports sanely, and have a quick way to check what was merged upstream? The answer, at least for software that is managed upstream with GIT, is quite easy for me.

Note: yes this is a more comprehensive rehashing of what I posted last December so if you’ve been following my blog for a long time you might not be extremely surprised by the content.

So let's start with two ideas: branches and tags. For my system to work out, you need upstream to have tagged their releases properly; so if the foobar project just released version 1.2.3, we need a tag available that is called foobar-1.2.3, v1.2.3, or something along those lines. From that, we'll start a new "scratch branch"; it is important to note that it's a scratch branch, because that means it can be force-pushed and might require a complete new checkout to work properly. So we have something like the following:

% git clone git://git.foobar.floss/foobar.git
% cd foobar
% git checkout -b 1.2.3-gentoo v1.2.3

This gives us the 1.2.3-gentoo branch as the scratch branch, and we'll see how that behaves in a moment. If upstream fails to provide tags you can also try to track down which exact commit a release corresponds to – it is tricky but not unfeasible – and replace v1.2.3 with the actual SHA1 hash of the commit or, even better as you'll guess by the end of the post, tag it yourself.
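
Tagging it yourself is a one-liner; 0123abc here stands in for whatever commit you identified as the release:

% git tag v1.2.3 0123abc
% git checkout -b 1.2.3-gentoo v1.2.3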

The idea of using a scratch branch, rather than an actual "gentoo branch", is mostly a matter of simplicity for me. Most of the time I make more than a couple of changes to a project when I'm packaging it – mostly because I find it easier to fix minor autotools issues before they spread throughout the package, and to other packages as well – but only the actual fixes need to be applied to the packaged version; cleanups, improvements and optimisations I send upstream and wait for the next release. I didn't always do it this way, I admit… I changed my mind when I started maintaining too many packages to follow all of them individually. For this reason I usually have either a personal or a "gentoo" branch where I make the changes meant for the master branch, which get sent upstream and merged, and a scratch branch to handle the patches. It also makes it no different to add a custom patch or a backport to a specific version (do note, I'll try to use the word "backport" whenever possible to stress the importance of getting the stuff merged upstream so that it will, hopefully, be present in the future).

So let's say we know that in the upstream repository there have been a few commits fixing corner-case crashers that, incidentally, always seem to hit Gentoo (don't laugh, it happens more often than you'd think). The commits have the short hashes 1111111, 2222222 and 3333333 (I have no imagination for hashes, so sue me).

% git cherry-pick 1111111
% git cherry-pick 2222222
% git cherry-pick 3333333

Now you have a branch with three commits: cherry-picked copies (with different hashes) of the commits you need. At this point, what I usually do is tag the current state (in a few paragraphs you'll understand why), so that we can get the data out properly; the way you name the tag depends vastly on how you will release the backport, so I'll get to that right away.

The most common way to apply patches in Gentoo, for good or bad, is adding them to the files/ subdirectory of a package; to be honest this is my least preferred way unless they are really trivial stuff, because it means that the patches will be sent down the mirrors to all users, whether they use the software or not; also, given that you can use GIT for patch storage and versioning, it duplicates the effort. With GIT-stored patches, the easiest thing is to create a files/${PV}/ subdirectory and store there the patches as exported by git format-patch. Easy, yes; nice, nope: given that, as I'll explain, you'll be picking the patches again when a new version is released, they'll always have different hashes, and thus the files will always differ, even if the patch itself is the same patch. This not only wastes time, it makes the files non-deduplicable and also gets around the duplicated-files check. D'oh!

A more intelligent way to handle these trivial patches is to use a single, combined patch; while patchutils has a way to combine patches, it's not really smart. On the other hand GIT, like most other source control managers, can provide you with diffs between arbitrary points in the repository's history… you can thus use git diff to export a combined, complete patch in a single file (though lacking history, attribution and explanation). This helps quite a lot when you have a few, or a number of, very small patches, one or two hunks each, that would cause too much overhead in the tree. Combining bigger patches this way can also work, but then you're more likely to compress the result and upload it to the mirrors, or to some storage area, and add it to SRC_URI.

A third alternative, which also requires you to have a storage area for extra distfiles, is using a so-called "patchset tarball", as a lot of packages already do. The downside of this is that if you have a release without any patch tarball at all, it becomes less trivial to deal with. At any rate, you can just put the files created, once again, by git format-patch into a compressed tar archive; if you add them under a subdirectory such as patches/ you can then use the epatch function from eutils.eclass to apply them sequentially, simply pointing it at the directory. You can then use the EPATCH_EXCLUDE variable to skip one patch without re-rolling the entire tarball.

Note: epatch itself was designed to use a slightly different patchset tarball format, one where each patch name carries an architecture specification, or all for patches applying to all architectures. This was mostly because its first users were the toolchain-related packages, where architecture-dependent patches are very common. On the other hand, using conditional patches is usually discouraged, and mostly frowned upon, for the rest of the software; the reason being that it's much easier to make a mistake when conditionality is involved, and that's nothing new, since it was the topic of an article I wrote over five years ago.

If you export the patches as multiple files in files/, you're not really going to have to think much about naming the tag; for the other two cases you have multiple options: tie the name to the ebuild release, tie it to the CVS revision, and so on. My preferred choice is to use a single incremental, non-version-specific number for patch tarballs and combined patches, and mix that with the upstream release version in the tag; in the example above, it would be 1.2.3-gentoo+1. This is, though, just a personal preference.

The reason is simple to explain, and I hope it makes sense to others besides me. If you tie it to the release of the ebuild (i.e. ${PF}), like the Ruby team did before, you end up in trouble when you want to add a build-change-only patch – take for instance the Berkeley DB 5.0 patch: it doesn't change what is already installed on a system built with 4.8, it only allows building anew against 5.0, so bumping the release in the tree is going to waste users' time – while using the CVS revision will create quite a few jumps (if you use the revision of the ebuild, that is), since you often change the ebuild without changing the patches. Removing the indication of the upstream version is also useful, albeit rarely, when upstream does not merge any of your patches and you can simply reuse the same patchset tarball as for the previous release; that comes in handy especially when security releases are made.

At this point, as a summary you can do something like this:

  • mkdir patches; pushd patches; git format-patch v1.2.3..; popd; tar jcf foobar-gentoo-1.tar.bz2 patches – gets you a patchset tarball with the patches (similarly, you can prepare split patches to add to the tree);
  • git diff v1.2.3.. > foobar-gentoo-1.patch – creates the complete combined patch, which you can either compress and upload to the mirrors or to some storage area, or (if it's really tiny) put directly in the tree.

Now, let's say upstream releases version 1.2.4 and integrates one of our patches. Redoing the backport is quick with GIT as well, once the new tag has been fetched:

% git checkout -b 1.2.4-gentoo
% git rebase v1.2.4

If the changes are compatible, the remaining patches will be reapplied just fine, updated so that they no longer apply with fuzz; any patch that was already merged upstream will count as "empty" and will simply be dropped from the branch. At that point, you can just repeat the export described above.

When pushing to the repository, remember to explicitly push the various gentoo branches, and make sure to push --tags as well. If you're a Gentoo developer, you can host such a repository on git.overlays.gentoo.org (I host a few of them already: lxc, libvirt, quagga …); contributors who are not developers can probably ask for similar repositories to be hosted there as well.
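
For the record, the push ends up looking something like this, with origin standing for whatever remote points at the hosting repository; the forced push is needed precisely because these are scratch branches that get rebased:

% git push --force origin 1.2.3-gentoo 1.2.4-gentoo
% git push --tags origin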

I hope this can help other developers dealing with GIT-hosted upstreams to lighten their load.

About the new Quagga ebuild

A foreword: some people might think that I'm writing this just to brag about what I did; my sincere reason for writing, though, is to point out an example of why I dislike 5-minutes fixes, as I wrote last December. It's also an almost complete walkthrough of my process of ebuild maintenance, so it might be interesting for others to read.

For a series of reasons that I haven't really written about at all, I need Quagga on my homemade Celeron router running Gentoo (for those who don't know, Quagga is a fork of an older project called Zebra, and provides a few daemons for route advertisement protocols such as RIP and BGP). Before yesterday, the latest version of Quagga in Portage was 0.99.15 (and the stable version is still an old 0.98), but there was recently a security bug that required a bump to 0.99.17.

I was already planning on giving Quagga a bump to fix a couple of personal pet peeves with it on the router; since Alin doesn't have much time, and also doesn't use Quagga himself, I added myself to the package's metadata and started polishing the ebuild and its support files. The alternative would have been for someone to just pick up the 0.99.15 ebuild, update the patch references, and push it out as 0.99.17, which would have qualified as a 5-minutes fix and wouldn't have solved a few more problems the ebuild had.

Now, the ebuild (and especially the init scripts) make it clear they were contributed by someone working for a company that used Quagga; on one hand this is a good start, since the code is supposed to work, having been used in production; on the other hand, companies don't usually care for Gentoo practices and policies, and tend to write ebuilds that need to be polished a bit further to actually comply with our guidelines. I like them as a starting point, and I'm used to doing the final touches in those cases. So if you have some ebuilds that you use internally and don't want to spend time maintaining forever, you can also hire me to clean them up and merge them into the tree.

So I started from the patches; the ebuild applied patches from a tarball, three unconditionally and two based on USE flags; the latter two had URLs tied to them pointing out that they were unofficial feature patches (a lot of networking software tends to have similar patches). I set out to check the patches: one changed the detection of PCRE, one was obviously a fix for --as-needed, one was a fix for an upstream bug. All five of them were in a separate patchset tarball that had to be fetched from the mirrors. I decided to change the situation.

First of all, I checked the PCRE patch; actually, the whole PCRE logic inside configure is long-winded and difficult to grok properly; on the other hand, a few comments and the code itself show that the libpcreposix library is only needed on non-GNU systems, as GLIBC provides the regcomp/regexec functions. So instead of applying the patch and adding a pcre USE flag, I tied the use (or not) of PCRE to the implicit elibc_glibc USE flag; one patch fewer to apply.

The second patch I looked at was the --as-needed-related one, which changed the linking order of the libraries so that the linker wouldn't drop them; it wasn't actually as complete as I would have made it. Since libtool handles transitive dependencies fine, if the libcap library is used by the convenience library, it only has to be listed there, not also in the final installed library. Also, I like to take the chance to remove unused definitions in the Makefile while I'm there. So I reworked the patch on top of the current master branch in their GIT, and sent it upstream hoping to get it merged before the next release.

The third patch is a fix for an upstream bug that hasn't been merged for a few releases already, so I kept it basically the same. The two feature patches had new versions released, and the Gentoo copies seem to have gone a bit out of sync with the upstream ones; for the sake of reducing Gentoo-specific files and process, I decided to switch to the feature patches that the original authors release; since they are only needed when their USE flags are enabled, they are fetched from the original websites conditionally. The remaining patches are too small to be worth a patchset tarball, so at first I simply put them in files/ as they were, mine being a straight export from GIT. Thinking about it a bit more, I decided today to combine them into a single file, and just handle them properly in a Gentoo GIT repository (I started writing a post detailing how I manage GIT-based patches).

Patches done, the next step was sorting out the configuration of the program itself; the ipv6 USE flag handles the build and installation of a few extra daemons specific to the IPv6 protocol; the rest are more or less direct mappings from the remaining flags. For some reason, the ebuild used --libdir to change the installation directory of the libraries, and then later installed an env.d file to set the linker search path, which is generally a bad idea; I guess the intention was just to follow that advice, and not push non-generic libraries into the base directory, but doing it that way is mostly pointless. Note to self: write about how to properly handle internal libraries.

My first choice was to see if libtool set the rpath properly, and in that case leave it to the loader to deal with it. Unfortunately it seems there is something off in libtool: while the rpath worked on my workstation, it didn't work in the cross-build root for the router; I'm afraid it's related to the lib vs lib64 paths, sigh. So after testing it out on the production router, I ended up revbumping the ebuild already to unhack it: if libtool can handle it properly, I'll get that fixed upstream so that the library is always installed, by default, as a package-internal library; in the meantime it gets installed vanilla, as upstream wrote it. It makes even more sense given that there are headers installed which suggest the library is not an internal library after all.

In general, I found the build system of Quagga really messed up and in need of an update; since I know how sloppy many projects are about build systems, I'll probably take a look at it myself. But honestly, before that I have to finish what I started with util-linux!

While I was at it, I fixed the installation to use the more common emake DESTDIR="${D}" install rather than the older einstall (which means that it now installs in parallel as well), and installed the sample files among the documentation rather than in /etc (reasoning: I don't want to back up sample files, nor do I want to copy them to the router, and it's easier to move them away directly). I forgot to remove the .la files the first time around, but I did so afterwards.

What remains is the most important stuff, actually: the init scripts! Following my own suggestions, the scripts had to be mostly rewritten from scratch; this was also needed because the previous scripts had a non-Gentoo copyright owner and I wanted to avoid that. Also, there were something like five almost identical init scripts in the package, where the only difference was the name of the service itself; that meant more than one file without any real reason. My solution is to have a single file for all of them, and symlink the remaining ones to it; the SVCNAME variable defines the name of the binary to start up. The one script that differs from the others, zebra (it has some extra code to flush the routes), I also rewrote to minimise the differences between the two (this is good for compression, if not for deduplication). The new scripts also take care of creating the /var/run directory if it doesn't exist already, which avoids a lot of trouble.
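
To give an idea of the shape of it, here is a minimal sketch of the approach, not the script that actually went into the tree; paths, flags and the quagga user are illustrative:

#!/sbin/runscript
# /etc/init.d/ripd, ospfd, bgpd, … are symlinks to this one file;
# ${SVCNAME} resolves to whichever name the service was started under.

command="/usr/sbin/${SVCNAME}"
command_args="-d -f /etc/quagga/${SVCNAME}.conf"
pidfile="/var/run/quagga/${SVCNAME}.pid"

depend() {
    need net
}

start_pre() {
    # create the runtime directory if it is missing
    checkpath -d -o quagga:quagga /var/run/quagga
}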

Now, as I said, I committed the first version after trying it locally, and then revbumped it last night after trying it in production; I reworked it a bit harder there. Besides the change to the libraries' install, I decided to add a readline USE flag rather than forcing the readline dependency (there really isn't much readline-capable software on my router, since it's barely supposed to have me connected); this also showed me that the PAM dependency was strictly related to the optional vtysh component. And while I looked at PAM, (Updated) I actually broke it (and fixed it back in r2): the code calls pam_start() with a capitalised "Quagga" string, but Linux-PAM turns it into all lower case… I didn't know that, and I was actually quite sure it was case-sensitive. Turns out that OpenPAM is case-sensitive and Linux-PAM is not; that explains why it works with one but not the other. I guess the next step on my list of things to do is to check whether it might be broken with a Turkish locale. (End of update)

Another thing I noticed is that by default Quagga has been building itself as a Position Independent Executable (PIE); as I have written before, using PIE on a standard kernel, without strong ASLR, has very few advantages, and enough disadvantages that I don't really like to have it around; so for now it's simply disabled. Since we do support proper flags passing, if you're building a PIE-complete system you're free to turn it back on; and if you're building an embedded-enough system, you have nothing else to do.

The result is a pretty slick ebuild, at least in my opinion: fewer files installed, smaller, Gentoo-copyrighted (I rewrote the scripts practically in their entirety). It handles the security issue but also a bunch of other "minor" issues, it is closer to upstream, and it has a maintainer who is going to make sure that future releases get an even slicker build system. It's nothing exceptional, mind you, but it's what it takes to fix an ebuild properly after a few years spent with bump-renames. See?

Afterword: a few people, seemingly stirred up by a certain other developer, seem to have started complaining that I "write too much", or claiming that I somehow profit from writing here. The main gain I get is not having to repeat myself over and over to different people. Writing posts costs me time, keeping the blog running and reachable takes me time and money, and running the tinderbox costs me money. Am I complaining? Not so much; Flattr is helping, but trust me that so far it doesn't even cover the costs of the hosting. I'm just not really keen on being slandered because I write out explanations of what I do and why. So from now on, you bother me? Your comments will be deleted. Full stop.

The five-minutes-fix myth

One of the complaints I often get for my QA work (which, I have to say, is vocally unappreciated even by other devs) is that I could just go ahead and fix the issues I find rather than opening bugs for them, because "it just takes five minutes from bug to commit" to fix the problem.

No, it does not take five minutes to fix something, I can assure you!

Of course people will continue to say that it just takes a few minutes to find the problem and come up with a patch; the problem is that most of the time, for the kind of bugs I report, fixing them properly takes much, much more.

Most of the time, when some developer decides that a single problem does not warrant removing a package, even if it doesn't have anybody looking after it, the same package re-enters my radar at the next round of tinderboxing or, in the best of cases, a few months later.

The problem is that when a package has a maintainer, that maintainer is supposed to keep it in line with the development policies; if you don't maintain a package but commit a five-minutes fix, you're not ensuring it complies with policy, you're fixing the single symptom I reported, which most likely is just the first one I hit.

When I fix a package (and I'm not boasting here, just stating how it is), I do things like removing older versions, making sure it complies with all the policies, checking all the bugs the package has open, and checking for things there aren't bugs for yet: CFLAGS and LDFLAGS being respected, files not being pre-stripped, features that might be missing, needed version bumps, correct dependencies, and so on and so forth. It's not five minutes of work! It's often multiple hours of work instead!

What upsets me, to return to the point that Gentoo's QA problem lies with developers, is that some developers think it's fine to remove a package from the QA removal queue just because they fixed the last bug I filed. I'm sincerely considering the idea of making it policy that if you wish to save a package from QA you have to fix all the problems with it, and take maintainership of it if it breaks again. And for those who ignore this, we should definitely enforce some kind of penalty, in the form of no longer being able to "save" a package from removal.

About patches and contributions

In my last post I mentioned that users of packages lacking a maintainer in Portage should submit patches for the open bugs; while this is all well and good, it is also true that often there are bugs with patches that stay there and rot. I wish to point out some things here that might not be obvious to users, or to developers either.

The first point I want to make is that the problem is not limited to user-submitted bugs and patches; bugs and patches submitted by developers can follow the same road too, waiting for months, if not years, before they are accepted and merged. The problems here are many: some technical, some social, some difficult to fit into a single category.

The biggest technical problem I can find is that there is no easy way to identify, from a simple search, bugs which have patches waiting for review. Modifying the Bugzilla workflow is probably too complex a matter to be worth it, but there is an easy way around this, although it requires more coordination between reporters and assignees: using the "Status whiteboard" field to write "Patch waiting review" or something like that; the status whiteboard appears in searches by default, and would be useful to signal exactly this kind of thing (I used it to ask developers to flag for me particular cases where gnuconfig_update couldn't be removed).

Another technical problem is that some developer might be interested in fixing bugs that come with patches for a particular package lacking a maintainer, without becoming the maintainer themselves; or a user might like to submit patches for new bugs about a particular package. Adding themselves to the maintainer-needed alias, or watching it, is most likely going to throw at them a huge amount of bug mail they are absolutely not interested in; I'm sure I sent some hundreds of bugs the m-n way in the last month (I'll have fun watching the bug statistics in the next GMN), and I'm sure most people wouldn't care about all those bugs.

The feasible way to handle this with the current infrastructure would be to set up bug whining filtered by package name in the subject, but that's not the nicest option I guess, although it is doable without changing anything. Another idea I had would require a huge effort from our infrastructure team and might not really be that feasible: creating per-package mail aliases; basically creating a packages.g.o domain, and then creating dynamic aliases for addresses like app-arch+libarchive@p.g.o, so that the mail would be redirected to the maintainer (developer or team) by looking at the metadata. This is of course a very complex solution, requires technical changes, and might well make other things quite worse: for instance, it would be a huge mess to search for all the bugs of a given herd, and it would increase the amount of spam each alias receives in a HUGE manner. On the other hand, the domain could be set to only receive mail through Bugzilla, with the addresses being otherwise invalid.

There are then some social problems; for instance, the fact that Gentoo developers are volunteers, so you can't force them to accept patches, or to improve submitted patches they don't consider fit for merging into the tree. This means that if you want your patch merged into the tree, you need to make sure you have a good case for it, and that you improve it as much as is needed for it to be merged, which might take a considerable amount of time. Of course not all users are programmers, but you cannot expect developers to take charge of improving patches indefinitely until they can be merged, if they have no need for such patches themselves. You can always try to find someone interested in helping out who can help you improve the patch; in the worst case you can resort to paying someone to fix your patch up.

Also, most developers are likely swamped and might require some prodding to remember to take care of patches, and this is probably the heaviest showstopper; the only way to fix that is to join Gentoo and start improving the situation. Of course that requires training; it's not like we can accept every user as a developer, there are standards one has to meet. Yes, I know there are some developers who don't always live up to those standards, but that's also a problem. I'm not saying that new developers should be perfect before joining, but they have to be open to criticism and ready to learn; "works for me, it's your problem" is not a good answer.

Does anybody have better ideas on how to fix these problems?

Shadow casting

For those of you who'd like to follow Google Summer of Code progress (I'd very much like to have the students' blogs available on Universe, by the way), today (or rather yesterday) counts as a very important date :)

Seraphim prepared a patch for shadow (the package containing all the basic utilities for user management on Linux) to work with OpenPAM, rather than depending strictly on Linux-PAM. I committed it to the tree for 4.1.2.1, and upstream applied it to trunk already, so now it can be built even when using OpenPAM rather than Linux-PAM. Cool, eh?

This patch (and the related bug) is important not only because it fixes a very important piece of functionality of a Linux system with OpenPAM, but also because it will act as a base reference for fixing other software to use OpenPAM too.

Indeed, one of the most obnoxious problems with OpenPAM is that a lot of packages, instead of writing their own conversation functions, rely on misc_conv, which comes from Linux-PAM's libpam_misc. Seraphim prepared a patch that can be applied to almost any other package relying on that, to make it OpenPAM-compatible. It's very good.

Unfortunately, as Seraphim also blogged, there is one catch with being able to provide OpenPAM support, especially for the future. The problem is that, although mostly API compatible, OpenPAM and Linux-PAM are not ABI compatible. The incompatibility is quite subtle, though, because, as Seraphim learnt, you can have a system built against Linux-PAM run against OpenPAM just fine, up to a point.

The problem is that an ABI does not only cover the names of the functions or the types of their parameters, but also the values and meanings of flags. In this case, Linux-PAM and OpenPAM assign different values and meanings to their flags, so modules built against OpenPAM will not work properly with software built against Linux-PAM.

This is going to be tricky, especially once we allow users to switch from one to the other and vice versa, because it means all the software will have to be rebuilt to keep functioning as it's supposed to. And no preserved-rebuild will help us there.

Oh well, there’s time to think of that!