Future planning for Ruby-Elf

My work on Ruby-Elf tends to happen in “sprees” whenever I actually need something from it that wasn’t supported before — I guess this is true for many projects out there, but it seems to happen pretty regularly with my projects. The other day I prepared a new release after fixing the bug I found while doing the postmortem of a libav patch — and then I proceeded to give another run to my usual collisions check, after noting that I could improve the performance of the regular expressions…

But where is the project heading? Well, I hope I’ll be able to have version 2.0 out before the end of 2013 — in this version, I want to make sure I get full support for archives, so that I can actually analyze static archives without having to extract them beforehand. I’ve got a branch with the code to access the archives themselves, but it can only extract each file before actually being able to read it. The key to supporting archives properly is probably going to be supporting in-memory IO objects, as well as offset-in-file objects.
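To give an idea of what I mean by an offset-in-file object, here’s a minimal sketch (the OffsetIO name and its interface are made up for the example, this is not code from the branch): it wraps an already-open file and confines reads to the member’s region, so that the ELF parser can treat an archive member as if it were a standalone file.

# Hypothetical sketch: wrap an open IO, exposing only the region
# [base, base + size) as if it were a file of its own.
class OffsetIO
  def initialize(io, base, size)
    @io, @base, @size = io, base, size
    @pos = 0
  end

  def read(length = nil)
    length ||= @size - @pos
    length = [@size - @pos, length].min
    return nil if length <= 0
    @io.seek(@base + @pos)
    data = @io.read(length)
    @pos += data.bytesize
    data
  end

  def seek(offset)
    @pos = offset
    0
  end

  def tell
    @pos
  end
end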

I’ve also found an interesting gem called bindata, which seems to provide a decent way to decode binary data in Ruby without having to fully pre-decode it. This would probably be a killer feature for Ruby-Elf, as a lot of the time I’m forcibly decoding everything because it was extremely difficult to access it on the spot — so the first big change for Ruby-Elf 2 is going to be handing the decoding task over to bindata (or, failing that, another similar gem).
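To give an idea of why bindata looks appealing, here is my rough guess at how the declarative approach could look (this is not Ruby-Elf code, just a sketch using the bindata API as I understand it); the fields are described once and read on demand rather than decoded up front.

require 'bindata'

# Sketch of the ELF identification block, declared declaratively
# rather than decoded by hand.
class ElfIdent < BinData::Record
  string :magic,         read_length: 4   # "\x7FELF"
  uint8  :elf_class                       # 1 = 32-bit, 2 = 64-bit
  uint8  :data_encoding                   # 1 = little endian, 2 = big endian
  uint8  :file_version
  uint8  :osabi
  uint8  :abi_version
  string :padding,       read_length: 7
end

ident = ElfIdent.read(File.open("/bin/ls", "rb"))
puts "64-bit" if ident.elf_class == 2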

Another change that I plan is to drop the current version of the man pages. While DocBook is a decent way to write man pages, and standard enough to be around in most distributions, it’s one “strange” dependency for a Ruby package — and honestly the XML is a bit too verbose at times. For the beefiest man pages, the generated roff page is half as big as the source, which is the opposite of what anybody would expect.

So I’m quite decided that the next version of Ruby-Elf will use Markdown for the man pages — while it does not allow the same amount of semantic tagging, and thus I might have to handle some styling in the synopsis manually, using something like md2man should be easy (I’m not going to use ronn because of the old issue with JRuby and rdiscount), and at the same time it gives me a public HTML version for free, thanks to GitHub’s conversion.

Finally, I really hope that by Ruby-Elf 2 I’ll be able to get at least the symbol demangler for the Itanium C++ ABI working — that is the one used by modern GCC; yes, it was originally specified for the Itanic. Supporting the full DWARF specification is something that’s in the back of my mind, but I’m not very convinced right now, because it’s huge. Also, if I were to implement it I would then have to rename the library to Dungeon.

Redundant symbols

So I’ve decided to dust off my link collision script and see what the situation is nowadays. I’ve made sure that all the suppression files use non-capturing groups in their regular expressions – as that should improve the performance of the regexp matching – made the script more resilient to issues within the files (metasploit ELF files are barely valid), and ran it through.
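For those wondering what the non-capturing change amounts to, it is literally just this (illustrative patterns, not the actual suppression files):

# A capturing group forces the regexp engine to keep track of submatches
# that the suppression code never looks at:
capturing     = /^(libfoo|libbar)\.so\./
# The non-capturing form matches exactly the same strings without that
# bookkeeping, which adds up over hundreds of thousands of symbols:
non_capturing = /^(?:libfoo|libbar)\.so\./

puts "suppressed" if "libfoo.so.1" =~ non_capturing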

Well, it turns out that the situation is bleaker than ever. Besides the obvious number of symbols with a too-common name, there are still a lot of libraries and programs exporting default bison/flex symbols, the same way I found them in 2008:

Symbol yylineno@ (64-bit UNIX - System V AMD x86-64) present 59 times
Symbol yyparse@ (64-bit UNIX - System V AMD x86-64) present 53 times
Symbol yylex@ (64-bit UNIX - System V AMD x86-64) present 49 times
Symbol yy_flush_buffer@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_buffer@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_bytes@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_string@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_create_buffer@ (64-bit UNIX - System V AMD x86-64) present 47 times
Symbol yy_delete_buffer@ (64-bit UNIX - System V AMD x86-64) present 47 times
[...]

Note that for a symbol to be listed in this output, at least one library has to export it; indeed these symbols are present in quite a long list of libraries. I’m not going to track down each and every one of them, but I guess I’ll keep an eye on that list so that, if problems arise, they can easily be tracked down to this kind of collision.

Action Item: I guess my next post is going to be a quick way to handle building flex/bison sources without exposing these symbols, for both programs and libraries.

But this is not the only issue — I’ve already mentioned a long time ago that a single installed system already brings in a huge number of redundant hashing functions; on the tinderbox as it was when I scanned it, there were 57 md5_init functions (and this without counting different function names!). Some of this I’m sure boils down to gnulib making it available, and the fact that, unlike the BSD libraries, GLIBC does not have public hashing functions — using libcrypto is not an option for many people.

Action item: I’m not very big on benchmarks myself — I never understood the proper way to gather real data rather than being fooled by the scheduler. Somebody who’s more adept at that might want to gather a bunch of libraries providing MD5/SHA1/SHA256 hashing interfaces, and produce some graphs that can tell us whether it’s time to switch to libgcrypt, or nettle, or whatever else provides good performance as well as a widely-compatible license.

The presence of duplicates of memory-management symbols such as malloc and company is not that big of a deal, at first sight. After all, we have a bunch of wrappers that use interposing to account for memory usage, plus another bunch that provide alternative allocation strategies which should be faster depending on the way you use your memory. The whole thing is not bad by itself, but when one of graphviz’s libraries (libgvpr) exposes malloc, something sounds wrong. Indeed, when, even after updating my suppression filter to ignore the duplicates coming from gperftools and TBB, I still get 40 copies of realloc(), something is extremely wrong:

Symbol realloc@ (64-bit UNIX - System V AMD x86-64) present 40 times
  libgvpr
  /mnt/tbamd64/bin/ksh
  /mnt/tbamd64/bin/tcsh
  /mnt/tbamd64/usr/bin/gtk-gnutella
  /mnt/tbamd64/usr/bin/makefb
  /mnt/tbamd64/usr/bin/matbuild
  /mnt/tbamd64/usr/bin/matprune
  /mnt/tbamd64/usr/bin/matsolve
  /mnt/tbamd64/usr/bin/polyselect
  /mnt/tbamd64/usr/bin/procrels
  /mnt/tbamd64/usr/bin/sieve
  /mnt/tbamd64/usr/bin/sqrt
  /mnt/tbamd64/usr/lib64/chromium-browser/chrome
  /mnt/tbamd64/usr/lib64/chromium-browser/chromedriver
  /mnt/tbamd64/usr/lib64/chromium-browser/libppGoogleNaClPluginChrome.so
  /mnt/tbamd64/usr/lib64/chromium-browser/nacl_helper
  /mnt/tbamd64/usr/lib64/firefox/firefox
  /mnt/tbamd64/usr/lib64/firefox/firefox-bin
  /mnt/tbamd64/usr/lib64/firefox/mozilla-xremote-client
  /mnt/tbamd64/usr/lib64/firefox/plugin-container
  /mnt/tbamd64/usr/lib64/firefox/webapprt-stub
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.memprof/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.memprof/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.prof/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.prof/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg.debug/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg.debug/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc.trseg/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc.trseg/libmcurses.so
  /mnt/tbamd64/usr/lib64/OpenFOAM/OpenFOAM-1.6/lib/libhoard.so
  /mnt/tbamd64/usr/lib64/thunderbird/mozilla-xremote-client
  /mnt/tbamd64/usr/lib64/thunderbird/plugin-container
  /mnt/tbamd64/usr/lib64/thunderbird/thunderbird
  /mnt/tbamd64/usr/lib64/thunderbird/thunderbird-bin

Now it is true that, depending on the usage patterns, it’s possible to achieve a much better allocation strategy than the default coming from GLIBC — on the other hand, I’m also pretty sure that GLIBC’s own allocator improved a lot in the past few years, so I’d rather use the standard allocation than a custom one that is five or more years old. Again, this could use some working around.

In the list above, Thunderbird and Firefox for sure use (and for whatever reason re-expose) jemalloc; I have no idea if libhoard in OpenFOAM is another memory management library (and whether OpenFOAM is bundling it or not), and Mercury is so messed up that I don’t want to ask myself what it’s doing there. There are though a bunch of standalone programs listed as well.

Action item: go through the standalone programs exposing the memory interfaces — some of them likely bundle one of the already-present memory libraries, so just make them use the system copy of it (so that improvements in the library trickle down to the program); for those that use custom strategies, consider making them optional, as I’d expect most not to be very useful to begin with.

There is another set of functions, similar to the memory-management functions, which is usually brought in by gnulib; these are convenience wrappers that do error checking on top of the standard functions — xmalloc and friends. A quick check shows that these are exposed a bit too often:

Symbol xmemdup@ (64-bit UNIX - System V AMD x86-64) present 37 times
  liblftp-tasks
  libparted
  libpromises
  librec
  /mnt/tbamd64/usr/bin/csv2rec
  /mnt/tbamd64/usr/bin/dgawk
  /mnt/tbamd64/usr/bin/ekg2
  /mnt/tbamd64/usr/bin/gawk
  /mnt/tbamd64/usr/bin/gccxml_cc1plus
  /mnt/tbamd64/usr/bin/gdb
  /mnt/tbamd64/usr/bin/pgawk
  /mnt/tbamd64/usr/bin/rec2csv
  /mnt/tbamd64/usr/bin/recdel
  /mnt/tbamd64/usr/bin/recfix
  /mnt/tbamd64/usr/bin/recfmt
  /mnt/tbamd64/usr/bin/recinf
  /mnt/tbamd64/usr/bin/recins
  /mnt/tbamd64/usr/bin/recsel
  /mnt/tbamd64/usr/bin/recset
  /mnt/tbamd64/usr/lib64/lftp/4.4.2/liblftp-network.so
  /mnt/tbamd64/usr/lib64/libgettextlib-0.18.2.so
  /mnt/tbamd64/usr/lib64/man-db/libman-2.6.3.so
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/cc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/cc1obj
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/cc1plus
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/f951
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/jc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/lto1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/cc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/cc1obj
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/cc1plus
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/f951
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/jc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/lto1
  /mnt/tbamd64/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5/cc1
  /mnt/tbamd64/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5/gnat1
  /mnt/tbamd64/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5/lto1

In this case they are exposed even by the GCC tools themselves! While this brings me again to complain that gnulib should actually be a dynamically-linked libgnucompat, there is little we can do about these symbols in programs — but they should not creep into system libraries (man-db has the symbols in its private library, which is marginally better).

Action item: check the libraries exposing the gnulib symbols, and make them expose only their proper interface, rather than every single symbol they come up with.

I suppose that this is already quite a bit of data for a single blog post — if you want a copy of the symbols’ index to start working on some of the action items I listed, just contact me and I’ll send it to you; it’s a bit too big to just publish as is.

Postmortem of a patch, or how do you find what changed?

Two days ago, Luca asked me to help him figure out what was going on with a patch for libav which he knew to be the right thing, but which was acting up in a fashion he didn’t understand: on his computer, it increased the size of the final shared object by 80KiB — while this number is certainly not outlandish for a library such as libavcodec, it does seem odd at first glance that a patch removing source code increases the size of the final executable code.

My first wild guess, which (spoiler alert) turned out to be right, was that removing branches from the functions let GCC optimize them further and decide to inline them. But how to actually be sure? It’s time to get the right tools for the job: dev-ruby/ruby-elf, dev-util/dwarves and sys-devel/binutils enter the battlefield.

We’ve built libav with and without the patch on my server, and then rbelf-size told us more or less the same story:

% rbelf-size --diff libav-{pre,post}/avconv
        exec         data       rodata        relro          bss     overhead    allocated   filename
     6286266       170112      2093445       138872      5741920       105740     14536355   libav-pre/avconv
      +19456           +0         -592           +0           +0           +0       +18864 

Yes, there’s a bug in the command’s output, I noticed. So there is a total increase of around 20KiB; how is it split up? Given this is a build that includes debug info, it’s easy to find out through codiff:

% codiff -f libav-{pre,post}/avconv
[snip]

libavcodec/dsputil.c:
  avg_no_rnd_pixels8_9_c    | -163
  avg_no_rnd_pixels8_10_c   | -163
  avg_no_rnd_pixels8_8_c    | -158
  avg_h264_qpel16_mc03_10_c | +4338
  avg_h264_qpel16_mc01_10_c | +4336
  avg_h264_qpel16_mc11_10_c | +4330
  avg_h264_qpel16_mc31_10_c | +4330
  ff_dsputil_init           | +4390
 8 functions changed, 21724 bytes added, 484 bytes removed, diff: +21240

[snip]

If you wonder why it’s adding more code than we expected from the size delta, it’s because the patch also deleted functions in other places, causing some reductions elsewhere. Now we know that the three functions the patch deleted did remove some code, but five other functions grew by over 4KiB each. It’s time to find out why.

A common way to do this is to generate the assembly file (which GCC usually does not emit explicitly) for both builds and compare the two — due to the size of the dsputil translation unit, this turned out to be completely pointless: just the changes in the jump labels cause the whole file to be rewritten. So we rely instead on objdump, which gives us a full disassembly of the executable sections of the object file:

% objdump -d libav-pre/libavcodec/dsputil.o > dsputil-pre.s
% objdump -d libav-post/libavcodec/dsputil.o > dsputil-post.s
% diff -u dsputil-{pre,post}.s | diffstat
 unknown |245013 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 125163 insertions(+), 119850 deletions(-)

As you can see, trying to diff these two files directly is going to be pointless, first of all because of the size of the disassembled output, and secondly because each instruction is prefixed with its address offset, which means that every single line will be different. So what to do? Well, first of all it’s useful to isolate one of the functions, to reduce the scope of the changes to check — I found out that there is a nice way to do so, and it involves relying on the way the function’s label appears in the disassembly:

% fgrep -A3 avg_h264_qpel16_mc03_10_c dsputil-pre.s
00000000000430f0 <avg_h264_qpel16_mc03_10_c>:
   430f0:       41 54                   push   %r12
   430f2:       49 89 fc                mov    %rdi,%r12
   430f5:       55                      push   %rbp
--
[snip]

While it takes a while to come up with the correct syntax, it’s a simple sed command that can get you the data you need:

% sed -n -e '/<avg_h264_qpel16_mc03_10_c>:/,/^$/ p' dsputil-pre.s > dsputil-func-pre.s
% sed -n -e '/<avg_h264_qpel16_mc03_10_c>:/,/^$/ p' dsputil-post.s > dsputil-func-post.s
% diff -u dsputil-func-{pre,post}.s | diffstat
 dsputil-func-post.s | 1430 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 1376 insertions(+), 54 deletions(-)

Okay, that’s much better — but it’s still a lot of code to sift through; can’t we reduce it further? Well, actually… yes. My original guess was that some function got inlined, so let’s check for that. If a function is not inlined, it has to be called, and the instruction for that, in this context, is callq. So let’s check whether the calls changed:

% diff -u =(fgrep callq dsputil-func-pre.s) =(fgrep callq dsputil-func-post.s)
--- /tmp/zsh-flamehIkyD2        2013-01-24 05:53:33.880785706 -0800
+++ /tmp/zsh-flamebZp6ts        2013-01-24 05:53:33.883785509 -0800
@@ -1,7 +1,6 @@
-       e8 fc 71 fc ff          callq  a390 
-       e8 e5 71 fc ff          callq  a390 
-       e8 c6 71 fc ff          callq  a390 
-       e8 a7 71 fc ff          callq  a390 
-       e8 cd 40 fc ff          callq  72e0 
-       e8 a3 40 fc ff          callq  72e0 
-       e8 00 00 00 00          callq  43261 
+       e8 00 00 00 00          callq  8e670 
+       e8 71 bc f7 ff          callq  a390 
+       e8 52 bc f7 ff          callq  a390 
+       e8 33 bc f7 ff          callq  a390 
+       e8 14 bc f7 ff          callq  a390 
+       e8 00 00 00 00          callq  8f8d3 

Yes, I do use zsh — on the other hand, now that I look at the command above I notice that there’s a bug: it does not respect $TMPDIR, as it should have used /tmp/.private/flame as the base path, dang!

So the quick check shows that avg_pixels8_l2_10 is no longer called — but does that account for the whole size difference? Let’s see if the function itself changed:

% nm -S libav-{pre,post}/libavcodec/dsputil.o | fgrep avg_pixels8_l2_10
00000000000072e0 0000000000000112 t avg_pixels8_l2_10
00000000000072e0 0000000000000112 t avg_pixels8_l2_10

The size is the same, 274 bytes. The increase is 4330 bytes, which is around 15 times the size of that single function — so what does that mean? Well, a quick look around shows this piece of code:

        41 b9 20 00 00 00       mov    $0x20,%r9d
        41 b8 20 00 00 00       mov    $0x20,%r8d
        89 d9                   mov    %ebx,%ecx
        4c 89 e7                mov    %r12,%rdi
        c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
        e8 cd 40 fc ff          callq  72e0 
        48 8d b4 24 80 00 00    lea    0x80(%rsp),%rsi
        00 
        49 8d 7c 24 10          lea    0x10(%r12),%rdi
        41 b9 20 00 00 00       mov    $0x20,%r9d
        41 b8 20 00 00 00       mov    $0x20,%r8d
        89 d9                   mov    %ebx,%ecx
        48 89 ea                mov    %rbp,%rdx
        c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
        e8 a3 40 fc ff          callq  72e0 
        48 8b 84 24 b8 04 00    mov    0x4b8(%rsp),%rax
        00 
        64 48 33 04 25 28 00    xor    %fs:0x28,%rax
        00 00 
        75 0c                   jne    4325c 

This is just a fragment but you can see that there are two calls to the function, followed by a pair of xor and jne instructions — which is the basic harness of a loop. Which means the function gets called multiple times. Knowing that this function is involved in 10-bit processing, it becomes likely that the function gets called twice per bit, or something along those lines — remove the call overhead (as the function is inlined) and you can see how twenty copies of that small function per caller account for the 4KiB.

So my guess was right, but incomplete: GCC not only inlined the function, but it also unrolled the loop, probably doing constant propagation in the process.

Is this it? Almost — the next step was to get some benchmark data for the code, which was mostly Luca’s work (and I have next to no info on how he did that, to be entirely honest); the results on my server have been inconclusive, as the 2% loss that he originally registered was gone in further testing and would, anyway, be vastly within the margin of error of a non-dedicated system — no, we weren’t using full-blown profiling tools for that.

While we don’t have any sound numbers about it, what we’re worried about are cache-starved architectures, such as Intel Atom, where the unrolling and inlining can easily cause a performance loss rather than a gain — which is why all of us developers facepalm in front of people using -funroll-all-loops and similar. I guess we’ll have to find an Atom system to do this kind of run on…

Archiving

One of the requests at the past VDD, when I showed some VLC developers my Ruby-Elf tool suite, was for it to access archive files directly. This has been on my wishlist as well for a while, so I decided to start working on it. To be precise, I started writing the parser (and actually wrote almost all of it!) on the Eurostar that was bringing me to London from Paris.

Now, you have to understand that while the Wikipedia page is a quite good source of documentation for the format itself, it’s not exactly complete (it doesn’t really have to be). But it’s mostly to the point: there are currently two main variants of ar files, the GNU and the BSD one. And the only real difference between the two is in the way long filenames are handled — where by long filenames I mean filenames at least 16 characters long.

The GNU format handles this with a single index member that provides the names for all the files, and gives you an offset into it instead of the proper name in the header data, whereas the BSD format gives you a length, and prepends the filename to the actual file data. Which of the two options is best is well up for debate.
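To make the difference concrete, here is a rough sketch of how the per-member header can be decoded, and where the two variants diverge (illustrative only, not the code from my branch):

# Each ar member starts with a fixed 60-byte ASCII header.
Header = Struct.new(:name, :size)

def read_member_header(io, gnu_name_table = nil)
  raw = io.read(60) or return nil
  name = raw[0, 16].rstrip
  size = raw[48, 10].to_i            # decimal ASCII
  raise "bad ar header" unless raw[58, 2] == "`\n"

  if name =~ %r{\A#1/(\d+)\z}        # BSD: length given, name prepended to the data
    namelen = $1.to_i
    name = io.read(namelen).sub(/\0+\z/, '')   # Apple's ar pads with NULs
    size -= namelen
  elsif name =~ %r{\A/(\d+)\z} && gnu_name_table  # GNU: offset into the "//" index member
    name = gnu_name_table[$1.to_i..-1][%r{[^/\n]+}]
  else
    name = name.chomp("/")           # GNU terminates short names with "/"
  end

  Header.new(name, size)
end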

I already knew of the difference, so I did code in support for both variants, but of course while on the train I only had access to the GNU version of ar that is present in Binutils, so I only wrote a test for that. Now that I’m back at the office (temporarily in Los Angeles, as it seems like I’ll be moving to London soon enough), I have a Mac by my side and I decided to prepare the files for testing with its ar(1), which is supposedly BSD.

I say supposedly because something strange happened! The long-filename code is hit by two of my testcases: one uses an actual object file which happens to have a name longer than 16 characters, the other an explicitly long (very long) filename — 86 characters! But what happens is that Apple’s version of ar writes the filename as 88 characters, padding it with two final null bytes. At first I thought I got something wrong in the format, but if I use bsdtar on Linux, which provides, among other formats, support for the BSD ar format, it properly writes down 86 bytes without any kind of null termination.

More interestingly, the other archive, where the filename is just 20 characters long, is written the exact same way by both libarchive and Apple’s ar!

A special kind of bundling

I know it has been a very long time since I last posted about bundled libraries, and a long time since I actually worked on symbol collisions, which is the original reason why I started working on Ruby-Elf — even though you probably couldn’t tell nowadays, given how many more tools I’ve implemented over the same library.

Since the tinderbox was idling, due to the recent changes in distfiles storage, I decided to update the database of symbols. This actually uncovered more than a few bugs in my code, for which I should probably write a separate blog post. In the meantime I’m just going to ask here what I already asked on my streams over on identi.ca and Twitter: if you have access to an HP-UX machine, could you please check if there is an elf.h header, with a license permissive enough that I can look at it? I could fetch the entries from GNU binutils, but I’d rather not, since it’d mean mixing and matching code licensed under GPL-2 (only) and GPL-3 — although arguably constant names shouldn’t be copyrightable.

The Ruby-Elf code will be pushed tomorrow, as today gitorious wasn’t very collaborative, and I’ll probably follow with a blog post on the topic anyway.

Once I was able to get the output of the harvest and analyse scripts, I found an interesting, albeit worrisome, surprise. A long list of packages use gnulib’s MD5 code. The problem is that gnulib is not your average utility library: it isn’t installed and linked to, it is imported into the sources of the project using it. The original reason to design it this way was that it would provide replacement functions for the GNU extension interfaces, or even standard interfaces, that aren’t present on some systems, so that you wouldn’t have to stick to a twenty-year-old standard when you could reduce the code’s logic by using modern interfaces.

What happens now? Well, gnulib carries not only replacement code for functions that are implemented in glibc but not on other systems, but also a long list of interfaces that are not implemented in glibc either! And, as it happens, even an MD5 implementation. That implementation is replicated at least 115 times on the tinderbox system, going by the visible symbols — there might be a lot more, as when you hide the symbols or build a non-exported executable, my tools are not going to find them.

This use of gnulib is unlikely to go away anytime soon… and unfortunately, the more packages use gnulib, the more easily a security bug in gnulib could impact the distribution as a whole for a very long time. People, can we stop using gnulib like it was glib? Please? Just create a libgnutils or something, and make gnulib look for that before using its own modules, so that there is a chance to use a shared, common library instead… especially since some of the users of gnulib are libraries themselves, which causes the symbol collision problem that is the original reason why I’m looking at this code…

Sigh! Rant out….

Are -g options really safe?

Tonight feels like a night after a very long day. But it was just half a day spent on trying to find the end of a bug saga that started about a month ago for me.

It starts like this: postgresql-server started failing; the link editor – BFD-based ld – reported that one of the static archives installed by postgresql-base didn’t have a proper index, which should have been generated by ranlib. But simply executing ranlib on said file didn’t solve the problem.

I originally blamed the build system of PostgreSQL, but when yesterday I launched an emerge -e world to rebuild everything with GCC 4.6, another package failed in the same way: lvm2, linking to /usr/lib64/libudev.a — since I know the udev build system very well, almost like I wrote it myself, I trusted that the archive was built correctly, so it was time to look at what the real problem was.

After poking around a bit, I found that binutils’s nm, objdump and, at that point, even ld refused to display information for some relocatable objects (ET_REL files). This would have made it very difficult to debug the issue if not for two things: first, eu-nm could read the file just fine, and second, my own home-cooked nm.rb tool, which I wrote to test Ruby-Elf, reported issues with the file — but without exploding.

flame@yamato mytmpfs % nm dlopen.o
nm: dlopen.o: Bad value
flame@yamato mytmpfs % eu-nm -B dlopen.o 
0000000000000000 n .LC0
                 U _GLOBAL_OFFSET_TABLE_
                 U __stack_chk_fail
                 U dlclose
                 U dlerror
                 U dlopen
00000000000001a0 T dlopen_LTX_get_vtable
                 U dlsym
                 U lt__error_string
                 U lt__set_last_error
                 U lt__zalloc
0000000000000000 t vl_exit
00000000000000a0 t vm_close
0000000000000100 t vm_open
0000000000000040 t vm_sym
0000000000000000 d vtable
0000000000000000 n wt.1af52e75450527ed
0000000000000000 n wt.2e36542536402b38
0000000000000000 n wt.32ec40f73319dfa8
0000000000000000 n wt.442ae951f162d46e
0000000000000000 n wt.90e079bbb773abcb
0000000000000000 n wt.ac43b6ac10ce5688

I don’t have the original output from my tool, since I have since fixed it, but the issues were related, as you can guess from that output, to the various wt. symbols at the end of the list. Where do they come from? What does the ‘n’ code they are marked with mean? And why is BFD failing to deal with them? I set out to find those answers with, well, more than a hunch of what the problem would turn out to be.

So what are those symbols? Google doesn’t help at all here, since searching for “wt”, even enclosed in double quotes, turns up only results for “weight”. Yes, I know it is a way to shorten that word, but what the heck, I’m looking for a specific string! The answer, actually, is simple: they are additional debug symbols added by -gdwarf-4, which is used to emit the latest DWARF format revision. This was implemented in GCC 4.6 and is supposed to reduce the size of the debug information while including more of it, which is generally positive.

Turns out that libbfd (the library that implements all the low-level access for nm, ld and the other utilities) doesn’t like those symbols; I’m not sure if it’s the sections they are defined in, their type (which is set to STT_NONE), or something else, but it doesn’t like them at all. Interestingly enough, this does not happen with final executables and dynamic libraries, which makes it at least bearable: fewer than 40 packages had to be rebuilt on my system because they had broken static objects; unfortunately one of those was LibreOffice, d’oh!

Now, let’s look back at the nm issue, though: when I started writing Ruby-Elf, I decided not to reimplement the whole suite of ELF tools, since there are already quite a few implementations of those out there. But I did write an nm tool to debug my own issues — it also worked quite nicely, because implementing access to the codes used by nm allowed me to use the same output in my elfgrep tool to show the results. This implementation, which was never ported to the tools framework I wrote for Ruby-Elf, didn’t get installed, and was just part of the repository for my own good.

But after noticing that my version is more resilient than binutils’s, and that it produced more interesting output than elfutils’s, I decided to rework it and make it available as rbelf-nm, writing a man page and documenting the codes for the various symbol kinds. But before all this, I also rewrote the function that chooses the symbol code. Before, it relied on binding types and then on section names to produce the right code; now it relies on the symbol’s type, its binding, and the sections’ flags and type, making it as resilient as elfutils’s, and as informative as binutils’s, at least for the cases I have encountered so far.
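The gist of the new selection logic is roughly the following (a simplified sketch with made-up accessor names, not the actual rbelf-nm code): the symbol’s own binding and type are checked first, then the flags and type of the section it lives in.

# Simplified sketch (hypothetical accessor names, not the Ruby-Elf API):
# pick an nm(1)-style code from the symbol first, its section second.
def nm_code(sym)
  return "U" if sym.undefined?
  return "C" if sym.common?
  return sym.object? ? "V" : "W" if sym.weak?

  code =
    if sym.section.nil? ||
       !sym.section.flags.include?(:alloc)        then "N"  # debug-only sections
    elsif sym.section.type == :nobits             then "B"  # .bss and friends
    elsif sym.section.flags.include?(:execinstr)  then "T"  # executable code
    elsif sym.section.flags.include?(:write)      then "D"  # writable data
    else                                               "R"  # read-only data
    end

  sym.local? ? code.downcase : code
end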

I also released a new version (1.0.6.1, don’t ask!) that includes the new tool, and it is already on RubyGems and Portage if you wish to use it. Please remember that the project has a Flattr page so if you like the project, your support is definitely welcome.

Impressions of Path64 compiler

So I noticed today that an ebuild for the Path64 compiler hit Portage; being the ELF nerd that I am, I was interested in it on the technical level, more than for the optimizations (especially since I’m never happy to hear something described as “the most sophisticated” anything; claims like that tend to simply bother me).

Before starting to test the compiler, I have to say that the ebuilds themselves had a bit of trouble: the pre-built binary one (dev-lang/ekopath) changes its path at each update, which breaks Makefiles and other scripts where you use a full path as the compiler (which has to be the case if you wish to target the binary toolchain rather than the custom-built one), while the custom-built one (dev-lang/path64) does not check the validity of the dynamic linker name when trying to gather it from GCC, and breaks when using my customized specs for forced --as-needed, as they change the command line used to call collect2. Both problems are now reported in Bugzilla and I hope they’ll be solved soon.

What is my baseline test? Well, let’s start with something simple: Ruby-Elf has a number of tests implemented for multiple compilers, in particular GCC, SunStudio and ICC on Linux/AMD64; adding a new compiler just requires rebuilding some object files, and then adding a few lines of code to the testsuite to check them out. There are always a few attributes that need to be adapted, such as the ELF entry points, but that’s beside the point now; compilers are expected to have small variations in their behaviour, otherwise it wouldn’t make sense to have multiple compilers at all.

This test alone made me feel like I’m playing with an alpha version of a compiler rather than something already targeted at production use, as it seems to be sold to the public. Given that the test files I use are very small and simplistic, I wasn’t expecting any difference at all beside the most obvious ones. For instance, I already know that ICC appends a .0 suffix to all the local symbols (unit-static ones), and SunCC uses common symbols rather than BSS symbols for external TLS variables. But all in all, they are very similar. Turns out that Path64 has more semantic differences than the others.

First issue: in a very simple, hello-world type executable, where only one symbol – printf() – is used, all the other compilers manage to link only to libc.so.6, which provides that symbol. Path64 instead adds one more dependency on libgcc.so, or rather its own variation of it. This in turn adds a dependency on libm.so, which makes it two extra objects to be loaded for simple executables (yes, it might sound impossible not to load the math library, but there are cases where that actually happens). This is extra nasty because linking to that library also means emitting “weak symbols” used for C++ language support.

It’s not extremely difficult to work around, though: just add -Wl,--as-needed to the command line to make it skip over libgcc.so, as it is really unused — this is what GCC does in its specs files, by the way: it enables as-needed linking, lists its support library, then disables it again, so that the original semantics are restored.

There is one particularity to the Pathscale compiler: it sets the OS ABI on the ELF file to the code for Linux, on static executables. Neither GCC nor ICC do so (I’m not sure about SunStudio as I was unable to produce a static executable out of it last time). Nothing wrong with this, and I’m actually often wondering why compilers never did that.

Next up, the trouble starts for the compiler: one of the tests is designed to make sure that Ruby-Elf can provide the correct nm-style description code for the symbols in the object files. This is the most compiler-specific test of the whole suite, as both of the notes I wrote above about ICC and SunStudio come from it. Path64 is not so much inconsistent as it is buggy in this area, though.

The first difference is that the other three compilers emit, in the relocatable object file, an absolute symbol with the name of the source translation unit. This is not the case for Path64, but it isn’t much of a problem: the symbol is probably helpful during debugging but not for real usage of the object, so it would just be a matter of rewiring the test. Where the problems arise is when it comes to the .data.rel.ro section and Copy-on-Write, which is one of my pet peeves.

The test source file contains a combination of static, exported, and external variables and constants; since the unit is compiled as PIC, it also contains a combination of constants that do and do not require relocations:



char external_variable[] = "foo";
static char static_variable[] tc_used = "foo";

const char external_constant[] = "foo";
static const char static_constant[] tc_used = "foo";

const char *relocated_external_variable = "foo";
const char *const relocated_external_constant = "foo";

static const char *relocated_static_variable tc_used = "foo";
static const char *const relocated_static_constant tc_used = "foo";

All three of the previously-supported compilers are well behaved and emit the non-relocated constants in the .rodata section, keeping only the relocated ones (i.e., the pointers) in the .data.rel.ro section, which is copy-on-write.

Finally, for those keeping score, the missed optimization I noted back in April is missing in Path64 as well as in GCC and ICC. Only clang has so far been able to actually make the best out of that code.

I guess I’ll have some reports to do to PathScale, and I’ll keep an eye on this compiler. On the other hand, please don’t ask for this to be tested in any tinderbox for now. Before I can even just consider this, it’ll need to improve a bit further… and I’ll need a more powerful machine to use for tinderboxing.

That innocent warning… or maybe not?

Anybody who has ever programmed in C with a half-decent compiler knows that warnings are very important and you should definitely not leave them be. Of course, there are more and less important warnings, and the more the compiler’s understanding of the code increases, the more warnings it can give you (which is why using -Werror in released code is a bad idea, and why a new compiler release causes so many headaches for me and the other developers).

But there are times when the warnings, while not highlighting broken code, are an indication of more trivial issues, such as suboptimal or wasteful code. One of these warnings was introduced in GCC 4.6.0, and relates to variables that are declared and set, but never read — I dreamed of it back in 2008.

Now, the warning as it is, is pretty useful. Even though a number of times it’s going to be used to mask unused-result warnings, it can point out code where a semi-pure function — i.e., one without visible side effects, but not marked (or markable) as such because of caching and other extraneous accesses; more about that in my old article if you wish — is invoked just to set a variable that is never used. Especially with very complex functions, a lot of time might be spent processing for nothing.

Let me clarify this situation. If you have a function that silently reads data from a configuration file or a cache to give you a result (based on its parameters), you have a function that, strictly-speaking, is non-pure. But if the end result depends most importantly on the parameters, and not from the external data, you could say that the function’s interface is pure, from the caller perspective.

Take localtime() as an example: it is not a strictly-pure function because it calls tzset(), which – as the name leads you to expect – is going to set some global variables, responsible for identifying the current timezone. While these are most definitely side effects, they are not the kind of side effects that you’ll be caring for: if the initialization doesn’t happen there, it will happen the next time the function is called.

This is not the most interesting case though: tzset() is not a very expensive function, and quite likely it’ll be called (or it would have been called) at some other point in the process. But there are a number of other functions, usually related to encoding or cryptography, which rely on pre-calculated tables that might actually be calculated at the time of use (why that matters is another story).

Now, even considering this, a variable that is set but not used is usually not going to be troublesome in and of itself: if it’s used to mask a warning, for instance, you still want the side effect to apply, and you won’t be paying the price of the extra store since the compiler will not emit the variable at all… as long as said variable is an automatic one, allocated on the stack of the function. Automatic variables undergo the SSA transformation, which among other things allows unused stores to be omitted from the generated code.

Unfortunately, SSA cannot be applied to static variables, which means that assigning a static variable, even though said variable is never read, will cause the compiler to include a store of that value in the final code. Which is indeed what happens, for instance, with the following code (tested with both GCC 4.5 – which does not warn – and 4.6):

int main() {
  static unsigned char done = 0;

  done = 1;
  return 1;
}

The addition of the -Wunused-but-set-variable warning in GCC 4.6 is thus a godsend to identify these, and can actually lead to improvements in the performance of the code itself — although I wonder why GCC is still emitting the static variable in this case, since, at least in 4.6, it knows enough to warn you about it. I guess this is a matter of a missed optimization, nothing world-shattering. What I was much more surprised by is that GCC fails to warn you about a very similar situation:

static unsigned char done = 0;

int main() {

  done = 1;
  return 1;
}

In the second snippet above, the variable has been moved from function scope to unit scope, and this is enough to keep GCC from warning you about it. Obviously, to be able to catch this situation, the compiler has to perform more work than in the previous case, since the variable could be accessed by multiple functions; but at least with -funit-at-a-time it is already able to apply similar analysis, since it reports unused static functions and constants/variables. I reported this as bug #48779 upstream.

Why am I bothering to write a whole blog post about a simple missed warning and optimization? Well, while it is true that zeroed static variables don’t cause much trouble, since they are mapped to the zero page and shared by default, you could cause a huge waste of memory if you have a written-only variable that also needs relocations, like in the following code:

#include <stddef.h>

static char *messages[] = {
  NULL, /* set at runtime */
  "Foo",
  "Bar",
  "Baz",
  "You're not reading this, are you?"
};

int main(int argc, char *argv[]) {
  messages[0] = argv[0];

  return 1;
}

Note: I wrote this code as an executable just because it was easier, and you should think of it as a PIE so that you can see the issue with relocations.

In this case, the messages variable is going to be emitted even though it is never read — by the way, it is not emitted if you don’t use it at all: when a static variable is reported as unused, the compiler also drops it; not so for the written-only ones, as I said above. Luckily I can usually identify problems like these while running cowstats (part of my Ruby-Elf utilities, if you wish to try it), so I can look at the code that uses the variable, but you can guess it would have been nicer to have this in the compiler already.

I guess we’ll have to wait for 4.7 to have that. Sigh!

Ruby-Elf and collision detection improvements

While the main use of Ruby-Elf for me lately has been quite different – for instance with the advent of elfgrep or helping verifying LFS support – the original reason that brought me to write that parser was finding symbol collisions (that’s almost four years ago… wow!).

And symbol collisions are indeed still a problem, and as I wrote recently they don’t get very easy on the upstream developers’ eyes, as they are mostly an indication of possible aleatory problems in the future.

At any rate, the original script ran overnight, generated a huge database, and then required more time to produce a readable output, all of which happened using an unbearable amount of RAM. Between the ability to run it on a much more powerful box and the work done to refine it, it can currently scan Yamato’s host system in… 12 minutes.

The latest set of changes, which replaced the “one or two hours” execution time with the current “about ten minutes” (for the harvesting part; two more minutes are required for the analysis), was part of my big rewrite of the script so that it uses the same common class interfaces as the commands that are installed for use with the gem. In this setup, while still single-threaded (more on that in a moment), each file analysed requires three calls to the PostgreSQL backend, rather than something in the ballpark of 5 plus one per symbol, and this makes it quite a bit faster.

To achieve this, I first of all limited the round trips between Ruby and PostgreSQL when deciding whether a file (or a symbol) has already been added or not. In the previous iteration I was already optimising this a bit by using prepared statements (which seemed slightly faster than direct queries), but they didn’t allow me to embed the logic into them, so I had a number of select and insert statements depending on the results of those, which was bad not only because each selection required converting data types twice (from the PostgreSQL representation to C, then from that to Ruby), but also because it required calling into the database each time.

So I decided to bite the bullet and, even though I know it makes for a bunch of spaghetti code, I’ve moved part of the logic into PostgreSQL through stored procedures. Long live PL/pgSQL.

Also, to make it more robust against parsing errors on single object files, rather than queuing all the queries and then committing them in one big transaction, I create a transaction per object to commit all of its symbols, and another one when creating the indexes. This allows me to skip over objects altogether if they are broken, without stopping the whole harvesting process.

Even after introducing the per-object transaction for symbol harvesting, I found it much faster to run a single statement through PostgreSQL, within a transaction, with all the symbols; since I cannot simply run a single INSERT INTO with multiple values (because I might hit a unique constraint, when the symbols are part of a “multiple implementations” object), I at least call the same stored procedure multiple times within the same statement. This had a tremendous effect, even though the database is accessed through Unix sockets!
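In terms of the pg gem, the idea boils down to something like this (the add_symbol procedure and the schema details are made up for the example; the real procedure takes care of the “already present” logic itself):

require 'pg'

db = PG.connect(dbname: 'symbols')

# One transaction, one round trip: a single statement that calls the
# stored procedure once per symbol.
def store_symbols(db, object_id, symbols)
  calls = symbols.map do |sym|
    "SELECT add_symbol(%d, %s);" % [object_id, db.escape_literal(sym)]
  end

  db.transaction do |conn|
    conn.exec(calls.join("\n"))
  end
end

store_symbols(db, 42, %w[yyparse yylex realloc])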

Since the harvest process now takes so little time to complete compared to before, I also dropped the split between harvest and analysis: analyse.rb is gone, merged into the harvest.rb script, for which I have to write a man page sooner or later, and get it installed properly as an available tool rather than an external one.

Now, as I said before, this script is still single-threaded; on the other hand, all the other tools are “properly multithreaded”, in the sense that their code fires up a new Ruby thread for each file to analyse, and the results are synchronised so as not to step on each other’s feet. You might already know that, at least for what concerns Ruby 1.8, threading is not really implemented — green threads are used instead — which means there is no real advantage in using them; that’s definitely true. On the other hand, on Ruby 1.9, even though the pure-Ruby nature of Ruby-Elf makes the GIL a main obstacle, threading would improve the situation simply by allowing more files to be analysed while the pg backend gem sends the data over to PostgreSQL (which is probably also helped by the “big” transactions sent right now). But what about the other tools that don’t use external extensions at all?

Well, threading elfgrep or cowstats does not provide any real advantage on the “usual” Ruby implementations (MRI 1.8 and 1.9), but it provides a huge advantage when running them with JRuby: since that implementation has real threads, it can scan multiple files at once (both when listing input files asynchronously on the standard input stream and when providing all of them in one single sweep), and only synchronises to output the results. This of course makes it a bit more tricky to be sure that everything is being executed properly, but in general it makes the tools all the more sweet. Too bad that I can’t use JRuby right now for harvest.rb, as the pg gem I’m using is not available for JRuby; I’d have to rewrite the code to use JDBC instead.
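The pattern the tools follow is, in simplified form, something like this (analyse_file here is a made-up stand-in for the real per-file work): each file gets its own thread, and only the output is serialised.

require 'thread'

# Made-up stand-in for the real per-file analysis.
def analyse_file(path)
  "#{path}: #{File.size(path)} bytes"
end

output_lock = Mutex.new
threads = ARGV.map do |path|
  Thread.new(path) do |file|
    result = analyse_file(file)               # runs in parallel on JRuby
    output_lock.synchronize { puts result }   # only the output is serialised
  end
end
threads.each(&:join)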

Speaking about options passing, I’ve removed some features I originally implemented: in the original implementation, the arguments parsing was asynchronous and incremental, without limits on recursion; this meant that you could provide a list of files preceded by the at-symbol as the standard input of the process, and each of those would be scanned for… the same content. This could already have been bad because of the possible loops, but it also had a few more problems, among which was the lack of a way to add a predefined list of targets if none was passed (which I needed for harvest.rb to behave more or less like before). I’ve since rewritten the targets’ parsing code to only work with a single-depth search, and to rely on asynchronous argument passing only through the standard input, which is only used when no arguments are given, either on the command line or by the script’s defaults. It’s also much faster this way.
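The resulting behaviour is roughly the following (a sketch, not the actual implementation): command-line arguments win, then the script’s own defaults, then the standard input; @-file lists are expanded exactly one level deep.

# Expand targets with a single level of @-file indirection:
# "@list" means "read one target per line from list", but entries
# inside the list are taken literally, never expanded again.
def expand_targets(args, defaults = [])
  args = defaults if args.empty?
  args = $stdin.each_line.map(&:chomp) if args.empty?

  args.flat_map do |arg|
    if arg.start_with?("@")
      File.readlines(arg[1..-1]).map(&:chomp)
    else
      [arg]
    end
  end
end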

For today I guess these notes about Ruby-Elf are enough; in the next few days I hope to provide some more details about the information the script is giving me… it isn’t exactly fun, and it isn’t exactly the kind of thing you wanted to know about your system. But I guess that is a story for another day.

On releasing Ruby software

You probably know that I’ve been working hard on my Ruby-Elf software and its tools, which include my pride and joy elfgrep, and which are now available in the main Portage tree, so that it’s just an emerge ruby-elf away. To make the package easier to install, manage and use, I wanted to make it as much in line with Ruby packaging best practices as possible, taking into consideration both those installing it as a gem and those installing it with package managers such as Portage. This gave me a few more insights on packaging that had previously escaped me.

First of all, thankfully, RubyGems packaging is starting to be feasible without needing a bunch of third-party software; whereas a lot of software used to require Hoe or Echoe to even run tests, some of it is reeling back and simply using the standard Gem-provided Rake task for packaging; this is also the road I decided to take with Ruby-Elf. Unfortunately Gentoo is once again late in the RubyGems game, as we still have 1.3.7 rather than 1.5.0 in use; this is partly because we’ve been hitting our own roadblocks with the upgrade to Ruby 1.9, which is really proving a pain in our collective… backside — you’d expect that in early 2011 all the main Ruby packages would work with the 1.9.2 release just fine, but that’s still not the case.
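For the record, the kind of Rakefile I mean is as simple as this (a minimal sketch; the gemspec filename is assumed):

# Rakefile — gem packaging without Hoe/Echoe, relying only on the
# Rake task that RubyGems itself provides.
require 'rubygems/package_task'

spec = Gem::Specification.load('ruby-elf.gemspec')

Gem::PackageTask.new(spec) do |pkg|
  pkg.need_tar = false   # the gem does not carry the test data
end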

Integrating RubyForge upload, though, is quite difficult, because the RubyForge extension itself is quite broken and no longer works out of the box — the main problem being that it tries to use the “Any” specification for CPU, which exists no more, replaced by “Other”; you can trick it into using that by changing the automated configuration, but it’s not a completely foolproof system. The whole extension seems pretty outdated and hastily written (if there is a problem when creating the release slots or uploading the file, the state of the release is left halfway through).

As for mediating between keeping the RubyGems packaging simple and still providing all the details needed for distribution packaging, while not requiring all users to install the required development packages, I’ve decided to release two very different packages. The RubyGem only installs the code, the tools, and the man pages; it lacks the tests, because there is a lot of test data that would otherwise be installed without any need for it. The tarball, on the other hand, contains all the data from the git repository, plus the gemspec file (which is needed, for instance, in Gentoo to have the fakegem install work properly). In both cases, there are two types of files that are included in the two distributions but are not part of the git repository: the man pages and the Ragel-generated demanglers (which I’m afraid I’ll soon have to drop and replace with manually-written ones, as Ragel is unsuitable for totally recursive patterns like the C++ mangling format used by GCC3 and specified by the IA64 ABI); by distributing these directly, users are not required to have either Ragel or libxslt installed to make full use of Ruby-Elf!

Speaking about the man pages: I love the many tricks I can pull off with DocBook and XSLT; I don’t have to reproduce the same text over and over when the options, or bugs, are the same for all the tools – I have a common library implementing them – I just need to include the common file, and use XPointer to tell it which part of the file to pick up. Also, it’s quite important to me to keep the man pages updated, since I took a page out of the git book: rather than implementing the --help option with a custom description, the --help option calls up the man page of the tool. This works out pretty well, mostly because this particular gem is designed to work on Unix systems, so the man tool is always going to be present. Unfortunately in the first release I made it didn’t work out all too well, as I didn’t consider the proper installation layout of the gem; this is now fixed and works perfectly even if you use gem install ruby-elf.
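The trick itself boils down to something like this (paths simplified; the real code has to resolve the man page within the installed gem layout):

# Instead of printing a hand-written summary, --help opens the
# troff man page that ships with the gem (path resolution simplified).
if ARGV.include?("--help")
  manpage = File.expand_path("../../man/#{File.basename($0)}.1", __FILE__)
  exec("man", manpage)
end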

The one problem I still have is that I have not yet signed the packages themselves; the reason is actually quite simple: while it’s trivial with OpenSSH to proxy the ssh-agent connection, so that I can access private hosts when jumping from my frontend system to Yamato, I currently can’t find any way to proxy the GnuPG agent, which I need in order to sign the packages; sure, I could simply connect another smartcard reader to Yamato and move the card there to do the signing, but I’m not tremendously happy with such a solution. I think I’ll be writing some kind of script to do that; it shouldn’t be very difficult with ssh and nc6.

Hopefully, having now released my first proper Ruby package, and my first gem, I’ll be able to do a better job at packaging, and at fixing others’ packages, in Gentoo.