This Time Self-Hosted

Choosing an MD5 implementation

_Yes, this post might give you a sense of déjà vu, but I’m trying to be more informative than ranty, if I can, two and a half years after the original post…_

Yesterday I ranted about gnulib, and in particular about the hundred and some copies of the MD5 algorithm that it brings with it. Admittedly, the numbers seem more sensational than they would if you were to count the packages involved, as most of the copies come from GNU Octave, another bunch come from GCC, and so on and so forth.

What definitely surprises me, though, is the way people rely on said MD5 implementation as if no other were available. I mean, I can understand GCC not wanting to add any more dependencies that could make it ABI-dependent (just look at the recent, and infamous, gmp bump), but did GNU forget that it has its own hash library?

There are already enough MD5 implementations out there to fill a long list, so how do you choose which one to use, rather than add a new one onto your set? Well, that’s a tricky question. As far as I can tell, the most prominent factors in choosing an implementation for hash algorithms are non-technical:

Licensing is likely to be the most common issue: relying on OpenSSL’s libcrypto means that you rely on software whose license has been deemed incompatible enough with GPL-2 that there is a documented SSL exception, used by projects that combine the GNU General Public License with those libraries. This is the reason why libgcrypt exists, for the most part, but GNU’s series of “let’s reinvent the wheel, and then reinvent it again” continues: GnuTLS (which is supposed to be a replacement for OpenSSL itself) also provides its own implementation. Great.

Availability can be a factor as well: software designed for BSD systems, such as libarchive, will make use of the libcrypto-style interface just fine; the reason is that at least FreeBSD (and I think NetBSD as well) provides those functions in its standard set of libraries, making it the obvious available implementation (I wish we had something like that). Adding a dependency on another piece of software is usually a problem, and that’s why gnulib is used so often (being imported into the project’s sources, it adds no further dependency). So if your average system configuration already contains an acceptable implementation, then that’s what you should go for.

Given that all the interfaces are almost identical to one another, with the exception of names and structures, and that their actual implementations have to follow the standard to make sense, it is generally understandable that few of the prominent factors in choosing one library over another are technical. So how should one proceed to choose which one to use? Well, I have some general rules of thumb that could be useful.

The first is to use something that you already have available or are already using in your project: OpenSSL’s libcrypto, GLib and libav/FFmpeg all provide an implementation of most hash functions. If you already rely on them for some of your needs, simply use their interfaces rather than looking for new ones. If there are bugs in those, or the interface is not good enough, or the performance is not as expected, try to get those fixed instead of cooking your own implementation.
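To make the "use what you already have" point concrete, here is a minimal sketch in Python rather than C, chosen only because its standard `hashlib` module is already present on most systems (and, as far as I know, is normally backed by OpenSSL's libcrypto in CPython builds). It shows the same create/update/finalize shape that the C libraries discussed here share. The helper name `md5_hex` and the `chunk_size` parameter are hypothetical, introduced for illustration only.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=64 * 1024):
    """Hash a binary stream incrementally and return the hex digest.

    Hypothetical helper: the name and chunk_size default are illustrative.
    """
    # "init" step: create a hash context from the library you already have.
    ctx = hashlib.md5()
    # "update" step: feed the data in chunks, so large inputs never need
    # to fit in memory all at once. read() returns b"" at end of stream.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        ctx.update(chunk)
    # "final" step: extract the digest as a hexadecimal string.
    return ctx.hexdigest()


if __name__ == "__main__":
    # "abc" is one of the test vectors listed in RFC 1321.
    print(md5_hex(io.BytesIO(b"abc")))  # 900150983cd24fb0d6963f7d28e17f72
```

Nothing in the sketch implements MD5 itself; swapping in another already-available library would only change the three commented calls.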

If you are not using any other implementation already, then try to look at libgcrypt first. The reason I suggest this is that it’s a common library (it is required by GnuPG, among others) implementing all the main hashing algorithms, and it’s LGPL, so you have very few license concerns. It also doesn’t implement any other common interface, so you don’t risk symbol collisions, as you would if you were to use a libcrypto-compatible library and at the same time bring in anything else that used OpenSSL; I saw that happen with Drizzle a longish time ago.

If you’re a contributor to one of the projects that use gnulib’s md5 code… please consider porting it to libgcrypt; all distributors will likely thank you for that!

Comments 3
  1. Another function that is commonly used by copy-paste instead of dynamic linking and has performance considerations is crc32 (the one in Ethernet and gzip, not Castagnoli). For example, Huffman decoding gzip files using a byte lookup table (instead of bit by bit) is so fast that crc32 becomes a factor. The common implementation (in gzip, zlib, etc.) is significantly slower than the one used by the Linux kernel (licensed GPLv2), and I know of no source for an LGPL or BSD version with the same performance. I have wondered about the feasibility of exporting the kernel’s library functions as a shared library to user mode code (like the VDSO, kinda). The pages contain code only, not data, so it should be safe. If used by enough programs, the code should be hot in cache. It is secure and reviewed by many people and optimized for each architecture.
