This Time Self-Hosted

How many implementations of MD5 do you have in your system?

Anybody who has ever looked into protocols and checksums, or even downloaded the ISO file of a distribution a few years ago, knows what an MD5 digest looks like. Unsurprisingly, the obnoxious presence of MD5 in our lives up to a few years ago (when it was declared insecure, and work started to replace it with SHA1 or SHA256) caused a similarly obnoxious presence of it in the code.

Implementing MD5 checksumming is not really an easy task; for this reason almost the same code gets reused from one library to another, and so on. This code has an interface that is more or less initialise, update, finish; the names of the functions might change a bit from one implementation to another, but that is the gist.
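As a minimal sketch of that initialise, update, finish shape, here is the same pattern using Python's standard hashlib module. It is not one of the C implementations discussed in this post, and the helper name md5_hex is made up for illustration; only the three-step structure is the point.

```python
import hashlib


def md5_hex(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks.

    md5_hex is a hypothetical helper name, used here only for illustration.
    """
    ctx = hashlib.md5()       # initialise: create a fresh context
    for chunk in chunks:
        ctx.update(chunk)     # update: feed the data in, one piece at a time
    return ctx.hexdigest()    # finish: produce the final digest


# Feeding the data in pieces gives the same digest as feeding it at once.
print(md5_hex([b"a", b"bc"]) == md5_hex([b"abc"]))
```

The C variants listed later follow the same structure, with a context structure passed explicitly to each call.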

Now, the most common provider of these functions is certainly OpenSSL (which also implements SHA1 and other checksum algorithms), but it is not the only one. On FreeBSD, the functions are present in a system library (I forgot which one), and the same seems to be true of the previous Linux C library (libc.so.5, used before glibc). A common problem with OpenSSL, for instance, is its GPL-incompatibility.

Now this means that a lot of software ships its own reimplementation of the MD5 code, using the same usual interface, with slightly different names: MD5Init, MD5_init, md5_init, md5init, md5_update, md5_append, md5_final, md5_finish and so on. All these interfaces are likely to differ slightly from one another, to the point of not being drop-in replacements, and thus cause problems when they collide with each other.
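To give an idea of how such name variants can be spotted in a list of exported symbols, here is a rough sketch. It is not the script used for this post: the regular expression and the md5_like helper are assumptions made for illustration, and a real check would have to read the symbol names out of the libraries first.

```python
import re

# Assumed pattern (not taken from any real tool): "md5", an optional
# underscore, then one of the usual step names, matched case-insensitively.
MD5_SYMBOL = re.compile(r"^md5_?(init|update|append|final|finish)$", re.IGNORECASE)


def md5_like(symbols):
    """Return the symbols that look like the usual MD5 interface."""
    return [name for name in symbols if MD5_SYMBOL.match(name)]


names = [
    "MD5Init", "MD5_init", "md5_init", "md5init",
    "md5_update", "md5_append", "md5_final", "md5_finish",
    "sha1_init",     # different algorithm, not matched
    "av_md5_init",   # prefixed interface, not matched by this pattern
]

print(md5_like(names))
```

A pattern anchored like this only catches the conventional names; implementations that use their own prefix go unnoticed.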

Every system thus ends up with multiple implementations of MD5, which, well, wastes memory: a single implementation would be nicer, and could easily be shared between programs.

These packages implement their own copy of the MD5 algorithm, and export it (sometimes correctly, sometimes probably by mistake): OpenSSL (obviously), GNUTLS (obviously, as it’s intended as a semi-drop-in replacement for OpenSSL), GeoIP (uh?), Python (EH!? Python already links to OpenSSL, why on earth it doesn’t use OpenSSL for the MD5 functions really escapes me), python-fchksum (and why does it not use Python’s code?), Wireshark (again, it links to both GNUTLS and OpenSSL, why it implements its own copy of MD5 escapes me), Kopete (three times: one for the Yahoo plugin, one for the Oscar – ICQ – plugin, and a different one for Jabber; it gets even better, as KDE provides an MD5 implementation!), liblrdf, Samba (duplicated in four libraries), Wine (for its advapi32.dll reimplementation; I admit that advapi32 might be required to export it, I don’t know), pwdb, and FFmpeg (with the difference that FFmpeg’s implementation does not trigger my script, as it uses its own interface).

I’m sure there are more implementations of MD5 on a system; as I said, they are still obnoxiously present in our lives, for legacy protocols and data, and the range of areas where MD5 checksums are used is quite wide (cryptography, network protocols, backup safety checks, multimedia – for instance the common checksum of decoded data to ensure proper decoding in lossless formats – and authentication). A lot of implementations are likely hidden inside the source code of software, and it is probably impossible to get rid of them all. But it would certainly be an interesting task for someone willing to take it on: sharing MD5 implementations means that optimising one for new CPUs improves performance in all the software using it.

If I weren’t sure that most developers would hate me for doing that, I’d quite like to open bugs for all these packages, pointing out possible areas of improvement in the upstream code. As it is, contacting all the upstreams, and creating a good lot of tracker accounts, is not something I’d like to do in my free time, but I can easily point out improvement areas for a lot of code. I just opened python-fchksum (which is used by Portage, which in turn means that if I can optimise it, I can optimise Portage), and besides the presence of MD5 code, I can see a few more things that I could improve in it. I’ll likely write to the author with a patch and a note, but it’s certainly not feasible for me to do so for every package out there, alone and in my free time…

Comments 4
  1. few things; 1) last I looked, openssl wasn’t a hard dep of python till 2.5; aware our deps differ, but too lazy to pop into the configure and verify it’s an optional dep for 2.4 and under. 2) python-fchksum exporting its own md5 isn’t too surprising; suspect you’re referring to fmd5 and friends. As to why it doesn’t use python, likely related to the fact fchksum bypasses python’s File protocol, instead being handed a filename and doing md5 ops on it strictly at the c level (reading, etc).

  2. The MD5 implementation symbols exported by python-fchksum are _not_ related to the fact it exports the interface to _Python_. It’s a C interface it’s exporting, and shouldn’t be.

  3. glib-2.16 will contain a checksum API as well, including support for MD5, SHA1 and SHA256 – http://library.gnome.org/de… At least lots of GNOME packages that have their own checksum stuff will use that in GNOME-2.22, so it should bring the total count down a bit.
