Redundant symbols

So I’ve decided to dust off my link collision script and see what the situation is nowadays. I’ve made sure that all the suppression files use non-capturing groups in their regular expressions – as that should improve the performance of the regexp matching – made the script more resilient to issues within the files (metasploit ELF files are barely valid), and ran it through.
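To make the non-capturing group point concrete, here is a minimal sketch in Python. It is not the actual script, and the suppression entries are made up for illustration. Each entry is wrapped in `(?:...)` and joined into a single pattern, so a symbol is tested once and the regexp engine records no captures.

```python
import re

# Hypothetical suppression entries (not taken from the real suppression files):
# each one is a regular expression for a symbol name to ignore.
SUPPRESSIONS = [
    r"^(?:_init|_fini)$",
    r"^__(?:bss_start|data_start)$",
    r"^(?:main|_start)$",
]

# Wrap every entry in a non-capturing group and join them into one pattern.
SUPPRESSED = re.compile("|".join(f"(?:{entry})" for entry in SUPPRESSIONS))


def is_suppressed(symbol: str) -> bool:
    """Return True if the symbol matches any suppression entry."""
    return SUPPRESSED.search(symbol) is not None
```

Since every group is non-capturing, `SUPPRESSED.groups` is 0: the combined pattern only answers "does it match", which is all a suppression list needs.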

Well, it turns out that the situation is bleaker than ever. Besides the obvious number of symbols with too-common names, there are still a lot of libraries and programs exporting the default bison/flex symbols, the same way I found them in 2008:

Symbol yylineno@ (64-bit UNIX - System V AMD x86-64) present 59 times
Symbol yyparse@ (64-bit UNIX - System V AMD x86-64) present 53 times
Symbol yylex@ (64-bit UNIX - System V AMD x86-64) present 49 times
Symbol yy_flush_buffer@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_buffer@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_bytes@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_string@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_create_buffer@ (64-bit UNIX - System V AMD x86-64) present 47 times
Symbol yy_delete_buffer@ (64-bit UNIX - System V AMD x86-64) present 47 times
[...]

Note that at least one library has to export a symbol for it to be listed in this output; indeed these symbols are present in quite a long list of libraries. I’m not going to track down each and every one of them, but I guess I’ll keep an eye on that list, so that if problems arise they can easily be tracked down to this kind of collision.
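The counting behind a listing like the one above can be sketched in a few lines of Python. This is only an illustration under assumptions: the input is a hand-made mapping from file path to exported symbol names (the paths and `find_collisions` are hypothetical). It also skips both the ELF parsing and the rule that at least one library must be involved.

```python
from collections import defaultdict


def find_collisions(exports):
    """Map each symbol exported by more than one file to the sorted list of files."""
    owners = defaultdict(set)
    for path, symbols in exports.items():
        for symbol in symbols:
            owners[symbol].add(path)
    return {sym: sorted(paths) for sym, paths in owners.items() if len(paths) > 1}


def report(collisions):
    """Format the collisions, most duplicated symbol first."""
    lines = []
    ordered = sorted(collisions.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for symbol, paths in ordered:
        lines.append(f"Symbol {symbol} present {len(paths)} times")
        lines.extend(f"  {path}" for path in paths)
    return lines


# Made-up input, only to show the shape of the data.
EXAMPLE = {
    "/usr/bin/foo": {"yyparse", "yylex", "main_loop"},
    "/usr/lib64/libbar.so": {"yyparse", "bar_init"},
    "/usr/lib64/libbaz.so": {"yyparse", "yylex"},
}
```

Calling `report(find_collisions(EXAMPLE))` lists `yyparse` first, with its three files, followed by `yylex` with two.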

Action item: I guess my next post is going to be a quick way to handle building flex/bison sources without exposing these symbols, for both programs and libraries.

But this is not the only issue — I’ve already mentioned a long time ago that a single installed system already brings in a huge number of redundant hashing functions; on the tinderbox as it was when I scanned it, there were 57 md5_init functions (and this without counting different function names!). Some of this I’m sure boils down to gnulib making it available, and the fact that, unlike the BSD libraries, GLIBC does not have public hashing functions — using libcrypto is not an option for many people.

Action item: I’m not very big on benchmarks myself; I never understood the proper way to go about getting real data rather than being fooled by the scheduler. Somebody who’s more apt at that might want to gather a bunch of libraries providing MD5/SHA1/SHA256 hashing interfaces, and produce some graphs that can let us know whether it’s time to switch to libgcrypt, or nettle, or whatever else provides us with good performance as well as a widely-compatible license.

The presence of duplicates of memory-management symbols such as malloc and company is not that big of a deal, at first sight. After all, we have a bunch of wrappers that use interposing to account for memory usage, plus another bunch to provide alternative allocation strategies that should be faster depending on the way you use your memory. The whole thing is not bad by itself, but when you get one of graphviz’s libraries (libgvpr) to expose malloc, something sounds wrong. Indeed, if even after updating my suppression filter to ignore the duplicates coming from gperftools and TBB I still get 40 copies of realloc(), something sounds extremely wrong:

Symbol realloc@ (64-bit UNIX - System V AMD x86-64) present 40 times
  libgvpr
  /mnt/tbamd64/bin/ksh
  /mnt/tbamd64/bin/tcsh
  /mnt/tbamd64/usr/bin/gtk-gnutella
  /mnt/tbamd64/usr/bin/makefb
  /mnt/tbamd64/usr/bin/matbuild
  /mnt/tbamd64/usr/bin/matprune
  /mnt/tbamd64/usr/bin/matsolve
  /mnt/tbamd64/usr/bin/polyselect
  /mnt/tbamd64/usr/bin/procrels
  /mnt/tbamd64/usr/bin/sieve
  /mnt/tbamd64/usr/bin/sqrt
  /mnt/tbamd64/usr/lib64/chromium-browser/chrome
  /mnt/tbamd64/usr/lib64/chromium-browser/chromedriver
  /mnt/tbamd64/usr/lib64/chromium-browser/libppGoogleNaClPluginChrome.so
  /mnt/tbamd64/usr/lib64/chromium-browser/nacl_helper
  /mnt/tbamd64/usr/lib64/firefox/firefox
  /mnt/tbamd64/usr/lib64/firefox/firefox-bin
  /mnt/tbamd64/usr/lib64/firefox/mozilla-xremote-client
  /mnt/tbamd64/usr/lib64/firefox/plugin-container
  /mnt/tbamd64/usr/lib64/firefox/webapprt-stub
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.memprof/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.memprof/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.prof/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.prof/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg.debug/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg.debug/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/asm_fast.gc.trseg/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc/libmcurses.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc.trseg/libcurs.so
  /mnt/tbamd64/usr/lib64/mercury/lib/hlc.gc.trseg/libmcurses.so
  /mnt/tbamd64/usr/lib64/OpenFOAM/OpenFOAM-1.6/lib/libhoard.so
  /mnt/tbamd64/usr/lib64/thunderbird/mozilla-xremote-client
  /mnt/tbamd64/usr/lib64/thunderbird/plugin-container
  /mnt/tbamd64/usr/lib64/thunderbird/thunderbird
  /mnt/tbamd64/usr/lib64/thunderbird/thunderbird-bin

Now it is true that it’s possible, depending on the usage patterns, to achieve a much better allocation strategy than the default coming from GLIBC — on the other hand, I’m also pretty sure that GLIBC’s own allocation improved a lot in the past few years so I’d rather use the standard allocation than a custom one that is five or more years old. Again this could use some working around.

In the list above, Thunderbird and Firefox for sure use (and for whatever reason re-expose) jemalloc; I have no idea if libhoard in OpenFOAM is another memory management library (and whether OpenFOAM is bundling it or not), and Mercury is so messed up that I don’t want to ask myself what it’s doing there. There are though a bunch of standalone programs listed as well.

Action item: go through the standalone programs exposing the memory interfaces. Some of them are likely to bundle one of the already-present memory libraries, so just make them use the system copy of it (so that improvements in the library trickle down to the program); for those that use custom strategies, consider making them optional, as I’d expect most not to be very useful to begin with.

There is another set of functions that are similar to the memory management functions, which is usually brought in by gnulib; these are convenience wrappers that do error checking over the standard functions: xmalloc and friends. A quick check shows that these are exposed a bit too often:

Symbol xmemdup@ (64-bit UNIX - System V AMD x86-64) present 37 times
  liblftp-tasks
  libparted
  libpromises
  librec
  /mnt/tbamd64/usr/bin/csv2rec
  /mnt/tbamd64/usr/bin/dgawk
  /mnt/tbamd64/usr/bin/ekg2
  /mnt/tbamd64/usr/bin/gawk
  /mnt/tbamd64/usr/bin/gccxml_cc1plus
  /mnt/tbamd64/usr/bin/gdb
  /mnt/tbamd64/usr/bin/pgawk
  /mnt/tbamd64/usr/bin/rec2csv
  /mnt/tbamd64/usr/bin/recdel
  /mnt/tbamd64/usr/bin/recfix
  /mnt/tbamd64/usr/bin/recfmt
  /mnt/tbamd64/usr/bin/recinf
  /mnt/tbamd64/usr/bin/recins
  /mnt/tbamd64/usr/bin/recsel
  /mnt/tbamd64/usr/bin/recset
  /mnt/tbamd64/usr/lib64/lftp/4.4.2/liblftp-network.so
  /mnt/tbamd64/usr/lib64/libgettextlib-0.18.2.so
  /mnt/tbamd64/usr/lib64/man-db/libman-2.6.3.so
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/cc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/cc1obj
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/cc1plus
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/f951
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/jc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.3/lto1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/cc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/cc1obj
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/cc1plus
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/f951
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/jc1
  /mnt/tbamd64/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/lto1
  /mnt/tbamd64/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5/cc1
  /mnt/tbamd64/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5/gnat1
  /mnt/tbamd64/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5/lto1

In this case they are exposed even by the GCC tools themselves! While this brings me again to complain that gnulib should by now actually be libgnucompat and be dynamically linked, there is little we can do about these in programs; but the symbols should not creep into system libraries (mandb has the symbols in its private library, which is marginally better).

Action item: check the libraries exposing the gnulib symbols, and make them expose only their proper interface, rather than every single symbol they come up with.
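As a rough aid for this kind of check, the sketch below takes a library's exported symbols and the prefixes that make up its intended interface, and returns whatever falls outside of it. The `ped_` prefix and the symbol set are assumptions for illustration only, not data from the index, and `leaked_symbols` is a hypothetical helper.

```python
def leaked_symbols(exported, public_prefixes):
    """Return the exported symbols that do not start with any public prefix."""
    prefixes = tuple(public_prefixes)  # str.startswith accepts a tuple of prefixes
    return sorted(sym for sym in exported if not sym.startswith(prefixes))


# Illustrative input: a library whose interface is assumed to be prefixed
# with "ped_", and which also exports a few gnulib-style helpers.
EXPORTED = {"ped_device_get", "ped_disk_new", "xmalloc", "xmemdup"}
```

Here `leaked_symbols(EXPORTED, ["ped_"])` returns `["xmalloc", "xmemdup"]`, which are the candidates to hide from the library's exported interface.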

I suppose that this is already quite a bit of data for a single blog post. If you want a copy of the symbols’ index to start working on some of the action items I listed, just contact me and I’ll send it to you; it’s a bit too big to just have it published as is.

Choosing an MD5 implementation

Yes, this post might give you a sense of déjà vu, but I’m trying to be more informative than ranty, if I can, over two and a half years after the original post…

Yesterday I ranted about gnulib and in particular about the hundred and some copies of the MD5 algorithm that it brings with it. Admittedly, the numbers seem more sensational than they would if you were to count the packages involved, as most of the copies come from GNU octave, another bunch come from GCC, and so on and so forth.

What definitely surprises me, though, is the way that people rely on said MD5 implementation, as if there were no other available. I mean, I can understand GCC not wanting to add any more dependencies that could make it ABI-dependent – just look at the recent, and infamous, gmp bump – but did GNU forget that it has its own hash library?

There are already enough MD5 implementations out there to fill a long list, so how do you choose which one to use, rather than add a new one onto your set? Well, that’s a tricky question. As far as I can tell, the most prominent factors in choosing an implementation for hash algorithms are non-technical:

Licensing is likely to be the most common issue: relying on OpenSSL’s libcrypto means that you rely on software whose license has been deemed incompatible enough with GPL-2 that there is a documented SSL exception, used by projects that combine the GNU General Public License with those libraries. This is the reason why libgcrypt exists, for the most part, but this continues GNU’s series of “let’s reinvent the wheel, and then reinvent it, and again”: GnuTLS (which is supposed to be a replacement for OpenSSL itself) also provides its own implementation. Great.

Availability can be a factor as well: software designed for BSD systems – such as libarchive – will make use of the libcrypto-style interface just fine; the reason is that at least FreeBSD (and I think NetBSD as well) provides those functions in its standard set of libraries, making it the obvious available implementation (I wish we had something like that). Adding dependencies to a piece of software is usually a problem, and that’s why gnulib is used so often (being imported into the project’s sources, it adds no further dependency). So if your average system configuration already contains an acceptable implementation, then that’s what you should go for.

Given that all the interfaces are almost identical to one another with the exception of the names and structure, and that their actual implementation has to follow the standard to make sense, the lack of technical reasons among the prominent factors for choosing one library over another is generally understandable. So how should one proceed to choose which one to use? Well, I have some general rules of thumb that could be useful.

The first is to use something that you already have available or are already using in your project: OpenSSL’s libcrypto, GLIB and libav/ffmpeg all provide an implementation of most hash functions. If you already rely on them for some of your needs, simply use their interfaces rather than looking for new ones. If there are bugs in those, or the interface is not good enough, or the performance is not as expected, try to get those fixed instead of cooking your own implementation.

If you are not using any other implementation already, then try to look at libgcrypt first. I suggest this because it’s a common library (it is required by GnuPG, among others) that implements all the main hashing algorithms; it’s LGPL, so you have very few license concerns; and it doesn’t implement any other common interface, so you don’t risk symbol collisions, as you would if you were to use a libcrypto-compatible library and at the same time bring in anything else that used OpenSSL. I have seen that happen with Drizzle a longish time ago.

If you’re a contributor to one of the projects that use gnulib’s md5 code… please consider porting it to libgcrypt, all distributors will likely thank you for that!

PAM authentication for paranoids

Before I resume working on PAM (I need to implement a change to pam_lastlog to fix a pernicious bug), I wanted to just write a quick entry for the paranoid among you who still use PAM for system login.

Since, as you most likely already know, MD5 is once again considered insecure, one obvious concern would be the fact that passwords saved in MD5 on a system are not secure either. For this reason if you’re using Linux-PAM, you can make use of the SHA512 hashing of system password keys, which I already wrote about.

Remember that to use that you have to make sure your Linux-PAM (sys-libs/pam) is built against a recent enough version of glibc. Unfortunately the version of pambase with this feature hasn’t hit stable yet: the bug above is blocking it, and I’m going to have to hack at pam_lastlog to fix that.

What I didn’t write last time is that you can easily spot if your system is using md5 passwords by running this simple command as root:

# fgrep '$1$' /etc/shadow
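For a slightly more precise check than the fgrep above, here is a sketch in Python that looks only at the password field of shadow-style lines and classifies it by its crypt(3) prefix (`$1$` for MD5, `$5$` for SHA-256, `$6$` for SHA-512). It works on lines you pass in rather than opening /etc/shadow itself, and the sample entries are made up.

```python
# crypt(3) identifies the hashing scheme by the prefix of the password field.
SCHEMES = {"$1$": "MD5", "$5$": "SHA-256", "$6$": "SHA-512"}


def hash_scheme(shadow_line):
    """Return the scheme name for a shadow-style line, or None if unrecognised."""
    fields = shadow_line.rstrip("\n").split(":")
    if len(fields) < 2:
        return None
    for prefix, name in SCHEMES.items():
        if fields[1].startswith(prefix):
            return name
    return None


def md5_users(lines):
    """Return the user names whose password field uses the MD5 scheme."""
    return [line.split(":", 1)[0] for line in lines if hash_scheme(line) == "MD5"]


# Made-up entries in shadow format, for illustration only.
SAMPLE = [
    "alice:$1$somesalt$notarealhash:15000:0:99999:7:::",
    "bob:$6$somesalt$notarealhash:15000:0:99999:7:::",
    "daemon:*:15000:0:99999:7:::",
]
```

With the sample above, `md5_users(SAMPLE)` returns `["alice"]`; locked or password-less accounts such as `daemon` are simply reported as unrecognised.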

Of course one has to access your /etc/shadow file to breach your passwords, so your system has to have been compromised before, but it’s still not nice if they can find out what your basic passwords are.

Moving on.

How many implementations of MD5 do you have in your system?

Anybody who ever looked into protocols and checksums or even downloaded the ISO file of a distribution a few years ago knows what an MD5 digest looks like. It’s obvious that the obnoxious presence of MD5 in our lives up to a few years ago (when it was declared non secure, and work started to replace it with SHA1 or SHA256) caused a similar obnoxious presence of it in the code.

Implementing MD5 checksumming is not really an easy task; for this reason there is almost the same code that gets reused from one library to another, and so on. This code has an interface that is more or less initialise, update, finish; the name of the functions might change a bit between implementation and implementation, but the gist is that.
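The initialise, update, finish shape is easy to show with Python's hashlib, used here purely as a stand-in for the C libraries discussed in this post. The function name `md5_hex` is mine.

```python
import hashlib


def md5_hex(chunks):
    """Digest an iterable of byte strings with the init/update/finish pattern."""
    ctx = hashlib.md5()        # initialise the context
    for chunk in chunks:
        ctx.update(chunk)      # update: feed the data in as many pieces as you like
    return ctx.hexdigest()     # finish: produce the digest
```

Feeding the data in pieces gives the same digest as feeding it all at once, so `md5_hex([b"a", b"bc"])` and `md5_hex([b"abc"])` both return `900150983cd24fb0d6963f7d28e17f72`.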

Now, the most common provider of these functions is certainly OpenSSL (which also implements SHA1 and other checksum algorithms), but it is not the only one. On FreeBSD, the functions are present in a system library (I forgot which one), and the same seems to have been true of the previous Linux C library (libc.so.5, used before glibc). A common problem with OpenSSL, for instance, is its GPL-incompatibility.

Now this means that a lot of software reimplemented their own MD5 code, using the same usual interface, with slightly different names: MD5Init, MD5_init, md5_init, md5init, md5_update, md5_append, md5_final, md5_finish and so on. All these interfaces are likely slightly different from one another, to the point of not being drop-in replacements, and thus cause problems when they collide with each other.
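One cheap way to see through the spelling differences when looking at a symbol list is to fold names onto a normalised key. The sketch below lower-cases the name and drops underscores; this is my own heuristic, not what the collision script does. It only groups spellings of the same word, so md5_final and md5_finish still end up apart.

```python
from collections import defaultdict


def normalise(symbol):
    """Lower-case the name and drop underscores: MD5_init -> md5init."""
    return symbol.replace("_", "").lower()


def group_variants(symbols):
    """Group symbol names that differ only in case and underscores."""
    groups = defaultdict(list)
    for symbol in symbols:
        groups[normalise(symbol)].append(symbol)
    return {key: sorted(names) for key, names in groups.items()}
```

Given MD5Init, MD5_init, md5_init and md5init, `group_variants` puts all four under the single key `md5init`.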

On every system, thus, there are multiple implementations of MD5, which, well, contributes to memory waste, as having a single implementation would be nicer and be easily shared between programs.

These packages implement their own copy of the MD5 algorithm, and export it (sometimes correctly, sometimes probably by mistake): OpenSSL (obviously), GNUTLS (obviously, as it’s intended as a semi-drop-in replacement for OpenSSL), GeoIP (uh?), Python (EH!? Python already links to OpenSSL; why on earth it doesn’t use OpenSSL for its MD5 functions really escapes me), python-fchksum (and why does it not use Python’s code?), Wireshark (again, it links to both GNUTLS and OpenSSL; why it implements its own copy of MD5 escapes me), Kopete (three times: one for the Yahoo plugin, one for the Oscar – ICQ – plugin, and a different one for Jabber; it gets even better, as KDE provides an MD5 implementation!), liblrdf, Samba (duplicated in four libraries), Wine (for its advapi32.dll reimplementation; I admit that advapi32 might be required to export it, I don’t know), pwdb, and FFmpeg (with the difference that FFmpeg’s implementation does not trigger my script, as it uses its own interface).

I’m sure there are more implementations of MD5 on a system; as I said, they are still obnoxiously present in our lives, for legacy protocols and data, and the range of different areas where MD5 checksums are used is quite wide (cryptography, network protocols, backup safety checks, multimedia – for instance the common checksum of decoded data to ensure proper decoding in lossless formats – and authentication). Likely a lot of implementations are hidden inside the source code of software, and it is likely impossible to get rid of them. But it would certainly be an interesting task if someone wants to take it on: sharing MD5 implementations means that optimising one for new CPUs will improve performance in all the software using it.

If I wasn’t sure that most developers would hate me for doing that, I’d very much like to open bugs for all the packages, pointing out possible areas of improvement in upstream code. As it is, contacting all upstreams, and creating a good lot of trackers’ accounts, is something I wouldn’t like to do in my free time, but I can easily point out improvement areas for a lot of code. I just opened python-fchksum (which is used by Portage, which in turn means that if I can optimise it, I can optimise Portage), and besides the presence of MD5 code, I can see a few more things that I could improve in it. I’ll likely write the author with a patch and a note, but it’s certainly not feasible for me to do so for every package out there, alone and in my free time…