Anybody who has ever looked into protocols and checksums, or even just downloaded the ISO file of a distribution a few years ago, knows what an MD5 digest looks like. The obnoxious presence of MD5 in our lives up to a few years ago (when it was declared insecure, and work started to replace it with SHA1 or SHA256) obviously caused a similarly obnoxious presence of it in code.
Implementing MD5 checksumming is not really an easy task; for this reason almost the same code gets reused from one library to the next. This code has an interface that is more or less initialise, update, finish; the function names change a bit from one implementation to another, but that’s the gist.
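To make the initialise, update, finish shape concrete, here is a minimal sketch using Python's standard hashlib module, which follows the same three-step pattern. The helper name md5_hex is made up for this example; it is not taken from any of the implementations discussed here.

```python
import hashlib

def md5_hex(chunks):
    """Digest an iterable of byte chunks using the three-step interface."""
    ctx = hashlib.md5()        # initialise: create an empty digest context
    for chunk in chunks:
        ctx.update(chunk)      # update: feed the data in, piece by piece
    return ctx.hexdigest()     # finish: produce the final digest as hex

# Feeding the data in pieces gives the same digest as hashing it in one go.
incremental = md5_hex([b"hello ", b"world"])
one_shot = hashlib.md5(b"hello world").hexdigest()
print(incremental == one_shot)  # True
```

The C implementations have the same shape, just with an explicit context structure passed to each of the three functions.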
Now, the most common provider of these functions is certainly OpenSSL (which also implements SHA1 and other digest algorithms), but it is not the only one. On FreeBSD, the functions are present in a system library (I forgot which one), and the same seems to be true of the previous Linux C library (libc.so.5, used before glibc). Not everybody can use OpenSSL, though: a common problem is its GPL-incompatibility, for instance.
This means that a lot of software reimplements its own MD5 code, using the same usual interface with slightly different names: MD5Init, MD5_init, md5_init, md5init, md5_update, md5_append, md5_final, md5_finish and so on. All these interfaces are likely slightly different from one another, to the point of not being drop-in replacements, and thus cause problems when they collide.
On every system, thus, there are multiple implementations of MD5, which, well, contributes to memory waste: a single implementation would be nicer, and could easily be shared between programs.
These packages implement their own copy of the MD5 algorithm and export it (sometimes correctly, sometimes probably by mistake): OpenSSL (obviously), GnuTLS (obviously, as it’s intended as a semi-drop-in replacement for OpenSSL), GeoIP (uh?), Python (EH!? Python already links to OpenSSL; why on earth it doesn’t use OpenSSL for its MD5 functions really escapes me), python-fchksum (and why does it not use Python’s code?), Wireshark (again, it links to both GnuTLS and OpenSSL; why it implements its own copy of MD5 escapes me), Kopete (three times: one for the Yahoo plugin, one for the Oscar – ICQ – plugin, and a different one for Jabber; it gets even better, as KDE provides an MD5 implementation!), liblrdf, Samba (duplicated in four libraries), Wine (for its advapi32.dll reimplementation; I admit that advapi32 might be required to export it, I don’t know), pwdb, and FFmpeg (with the difference that FFmpeg’s implementation does not trigger my script, as it uses its own interface).
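The kind of check that flags such duplicates can be sketched roughly as follows. This is not the script mentioned above, only an illustration under stated assumptions: the library names and export lists are invented, and the regular expression covers only the name variants listed earlier.

```python
import re
from collections import defaultdict

# Matches MD5Init, MD5_init, md5_init, md5init, md5_append, md5_finish, ...
MD5_SYMBOL = re.compile(r"^md5_?(init|update|append|final|finish)$", re.IGNORECASE)

# Different spellings of the same step in the interface.
STEP_ALIASES = {"append": "update", "finish": "final"}

def md5_step(symbol):
    """Return 'init', 'update' or 'final' for an MD5-style symbol, else None."""
    match = MD5_SYMBOL.match(symbol)
    if match is None:
        return None
    step = match.group(1).lower()
    return STEP_ALIASES.get(step, step)

def duplicated_steps(exports):
    """Map each MD5 step to the libraries exporting it, if more than one does."""
    providers = defaultdict(set)
    for library, symbols in exports.items():
        for symbol in symbols:
            step = md5_step(symbol)
            if step is not None:
                providers[step].add(library)
    return {step: sorted(libs) for step, libs in providers.items() if len(libs) > 1}

# Hypothetical export lists, made up for illustration only.
exports = {
    "libfoo.so": ["MD5Init", "MD5Update", "MD5Final", "foo_open"],
    "libbar.so": ["md5_init", "md5_append", "md5_finish"],
    "libbaz.so": ["baz_md5sum"],
}
print(duplicated_steps(exports))
```

An implementation that puts its own prefix in front of the function names does not match the pattern at all, which is the sort of reason a name-based check can miss a copy that uses its own interface.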
I’m sure there are more implementations of MD5 on a system; as I said, they are still obnoxiously present in our lives, for legacy protocols and data, and the range of areas where MD5 checksums are used is quite wide (cryptography, network protocols, backup safety checks, multimedia – for instance the common checksum of decoded data to ensure proper decoding in lossless formats – and authentication). Likely a lot of implementations are hidden inside the source code of software, and it is probably impossible to get rid of them all. But it would certainly be an interesting task if someone wants to take it on: sharing MD5 implementations means that optimising one for new CPUs will improve performance for all the software using it.
If I wasn’t sure that most developers would hate me for doing that, I’d quite like to open bugs for all the packages, pointing out possible areas of improvement in the upstream code. As it is, contacting all the upstreams and creating a good lot of tracker accounts is not something I’d like to do in my free time, but I can easily point out improvement areas for a lot of code. I just opened python-fchksum (which is used by Portage, which in turn means that if I can optimise it, I can optimise Portage), and besides the presence of MD5 code, I can see a few more things that I could improve in it. I’ll likely write to the author with a patch and a note, but it’s certainly not feasible for me to do so for every package out there, alone and in my free time…