In the past few weeks we have witnessed a lot of concern in the free software world about the problems tied to the cheapening of attacks on the SHA-1 algorithm. Unfortunately this kind of news, while quite important, is only really meaningful to people who are actual experts in cryptography, and can confuse the hell out of people, such as me, who don’t know that much about the topic.
So while there are posts about practical attacks on git through the SHA-1 weakness, which may seem far-fetched to some users, I’d like to try to understand, and to explain, what the real-world implications of using weak hash algorithms are in many situations.
The first problem that comes to my mind is most likely social: we call them “unique identifiers”, but there is nothing in the math, as far as I can see, that makes them unique; and we call them “one-way hashes”, while obviously you can reverse one with a properly sized lookup table. What good hashes are designed for is making sure that the chance of a collision is low enough that hitting one is infeasible, and that the tables needed to reverse the hash are huge enough that normal computers can’t handle them.
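That “unique” is only probabilistic is easy to see with a small hash: by the birthday bound, a 32-bit checksum like CRC32 is expected to collide after hashing only on the order of 2¹⁶ distinct inputs, not 2³². A minimal sketch (the inputs are just made-up strings for illustration):

```python
import zlib

# Hash distinct inputs until two of them share a CRC32 value.
# By the birthday bound a 32-bit hash is expected to collide
# after roughly 2**16 inputs -- far fewer than its 2**32
# possible outputs would suggest.
seen = {}
n = 0
while True:
    data = b"input-%d" % n
    digest = zlib.crc32(data)
    if digest in seen:
        print("collision:", seen[digest], data, hex(digest))
        break
    seen[digest] = data
    n += 1
```

On a typical run this finds two different strings with the same CRC32 after well under 100,000 attempts, which is why nobody calls a 32-bit checksum an identifier.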
Of course, the concepts of “infeasible” and “huge” in computing are quite vague: while something may very well be infeasible for my cellphone, it might not be for the iMac, and it might be huge for Merrimac but not for Yamato. What was absolutely impossible for personal computers five years ago might very well be a cakewalk for a PlayStation 3 (think of the December 2008 SSL certificate attack). And this is without considering clusters and supercomputers, which means that we have to take a lateral approach to all this rather than just following the God Hash Function.
Hash functions are not just used for cryptographic work, of course; sometimes a hash is just a redundancy check to ensure that data arrived properly: the TCP protocol, for instance, still checksums its segments with a simple 16-bit ones’ complement sum, and Ethernet frames carry a CRC32, even though the collision-free space of these is far smaller than that of MD5, which itself we know is no longer a valid security solution. There are cases where you just need to tell whether some data arrived or was decoded properly, where cryptographic hashes are not important at all and speed is more of a concern. In those cases, CRC32 still performs pretty neatly.
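For comparison, the checksum TCP actually carries is nothing more than a 16-bit ones’ complement sum as described in RFC 1071 – cheap to compute and good only against accidental corruption. A minimal sketch of that algorithm:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: 16-bit ones' complement sum."""
    if len(data) % 2:          # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# The worked example from RFC 1071:
print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))  # → 0x220d
```

Verifying is symmetric: summing the data together with its own checksum yields zero, which is all the receiver checks.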
On a similar note, I still can’t understand why FLAC needs to store the MD5 hash of the uncompressed wave data; sure, it’s very desirable to have a checksum to tell whether the decode was done properly, but I’d expect a CRC32 checksum to be quite enough for that; I don’t see the point of going in with the big guns…
Anyway, this moves on to the next point: having a checksum, a hash, a digest for a file that has to be downloaded is useful to know whether the download completed successfully or whether there were problems during the transmission. Is MD5 enough there? Well, it really depends. If it’s just to make sure that data which is not tremendously important arrived intact, because it’s not going to execute code on the machine – a photo, say, or a video – then it may well be enough; sometimes CRC32 is also quite enough (if you’re an anime fan you have probably noticed quite a few times that episodes carry a CRC32 checksum in the downloaded file name – of course downloading pirated anime is illegal, just remember that next time you do it…).
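Checking a download against a published digest is a one-liner with any of these algorithms; the function and sample data below are made up for illustration:

```python
import hashlib

def verify_download(data: bytes, expected_hex: str, algo: str = "sha256") -> bool:
    """Compare the digest of downloaded bytes against a published checksum."""
    return hashlib.new(algo, data).hexdigest() == expected_hex

# Hypothetical blob standing in for a downloaded file; the expected
# digest would normally be copied from the mirror or release page.
blob = b"some downloaded file contents\n"
published = hashlib.sha256(blob).hexdigest()

print(verify_download(blob, published))            # intact download
print(verify_download(blob + b"!", published))     # corrupted in transit
```

The same call works for "md5", "sha1" or "crc-less" use cases alike; which algorithm is appropriate is exactly the question of this post.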
But is it the same for source code? Why doesn’t Gentoo use MD5 any longer? Why are we using both SHA-256 and RMD-160? Obviously it’s not the same for source code: using more resilient hash algorithms (I was going to say “stronger”, but that’s not the point of course) is necessary, but by far not sufficient. With source and executable code we don’t only want to ensure that the data was received correctly, but also that the data is what we want it to be. This means that we need to certify that the downloaded data corresponds to what was tested and found safe.
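Listing more than one digest per file, the way Gentoo’s Manifests do, means an attacker has to find a simultaneous collision in every algorithm at once. A rough sketch of computing several digests of the same data (the function name is made up for illustration; RIPEMD-160 is reported as missing if the local OpenSSL doesn’t provide it):

```python
import hashlib

def manifest_digests(data: bytes, algos=("sha256", "ripemd160")):
    """Compute several digests of the same data, Manifest-style."""
    digests = {}
    for name in algos:
        try:
            digests[name] = hashlib.new(name, data).hexdigest()
        except ValueError:
            # e.g. ripemd160 may be absent from modern OpenSSL builds
            digests[name] = None
    return digests

print(manifest_digests(b"example distfile contents\n"))
```

A forged file would have to match every listed digest at once, which is strictly harder than colliding any single one of them.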
For this step we have to introduce a different concept from the mere hash: the signature. We need to sign the data to make sure that it’s not changed, and that if it is tampered with, the signature no longer matches. GnuPG signatures are meant to do just that, but they also rely on a hash algorithm, which nowadays tends to be SHA-1, unless, like the Debian developers, you change it to SHA-256 or something else. Does it make such a difference? It depends on what you use the key for, one should say.
There are two main critiques against changing the default hashing algorithm used for GnuPG keys: the GnuPG maintainer said that the (economic) resources needed to counterfeit a signature nowadays are high enough that it would still be cheaper to just pay a very bad guy to get to you. With a gun. Or worse. The second is that to forge a signature on an email message, you’re going to need to add lots and lots of spurious data, which would be quite a giveaway that the message was tampered with.
Of course these points are both true, but there is one catch for each: the former is true for now but is not going to remain true forever; not only may more weaknesses in the algorithm be found, but the average computing power available to a single individual keeps increasing year after year. While 200 PS3 systems don’t come cheap nowadays, they are certainly more feasible, and less risky, to procure than a serial killer. And they are much lower profile.
The latter point is more interesting, because it shows some limits to the ability to forge a duplicate key or counterfeit a signed message. Indeed, whatever the algorithm used, a simple signed text message, once counterfeited, is going to be easily spotted thanks to the presence of data that is bogus or irrelevant to the message. While the chance is not null that a way is found to craft a counterfeit message containing only words in the correct language, and thus easy to blend with the rest of the message, it’s also quite far-fetched nowadays, even for CRC I’d say. That should be enough for email messages.
But is it for every application of GnuPG keys? I don’t think so. As you might have read in the post I linked earlier in this entry, about the chances of using the SHA-1 attacks to fool the GIT content tracker, it is possible to replace source code even when that means introducing bogus data, because almost nobody is going to go through all the source files to see if there is something strange in them. Similarly, spoofing signatures for binary files is not as hard to achieve as spoofing signatures for email messages. Even more so when you count that gzip, bzip2 and lzma all ignore trailing unknown data in their archives (a feature used even by Gentoo for the binary packages Portage builds). This means that keys used for signing source and binary packages, as in the cases of Debian and Gentoo, are more at risk from the SHA-1 attack than keys used just to sign email messages.
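The trailing-data behaviour is easy to see: a decompressor stops at the end of the compressed stream, so anything appended afterwards changes the archive’s hash without changing what gets decoded. A small sketch with Python’s zlib (the payload is of course made up):

```python
import gzip
import zlib

payload = gzip.compress(b"pretend this is a source tarball")
tampered = payload + b"ARBITRARY TRAILING GARBAGE"

# A gzip-format decompressor (wbits=31) stops at the end of the
# compressed stream; the appended bytes land in unused_data.
d = zlib.decompressobj(wbits=31)
decoded = d.decompress(tampered)

print(decoded)         # the original content, intact
print(d.unused_data)   # the garbage an attacker is free to tune
```

Those free trailing bytes are exactly the room a collision attack needs: the attacker can vary them at will until the digests match, while every tool happily decodes the archive as if nothing were there.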
There are more things to say about this, but since I’m no expert I don’t want to go on much longer. There is much more to be said about the panacea of signatures, too: as it appeared in my previous post about GitHub, there are quite a few users who are confused about what git tag signatures should mean to Gentoo developers and users. But this is the kind of stuff I always wanted to write about and almost never had time for; I guess I’ll try my best to find time for it.
On the matter of tree signing, would you care to review GLEPs 57–61 inclusive and email me your comments on them? Thereafter, sometime in the next two weeks, I want to submit them to the council.
Most of what you say makes sense, but I have to disagree with your suggestion to use CRC for checking anything other than unintentional corruption. One of the things that makes a hash a cryptographic one is that changing the hash input causes a change in the output that is infeasible to predict. CRC fails this test: knowing only the current hash output and the changes we make to the message (not even the message itself – just the changes are enough), we can predict the CRC of the changed message. From here it is only a small step to make another small change which cancels out the effect of the first one and gives you a message with the same CRC, without huge amounts of trailing garbage (i.e. the attacker only needs to make one extra small change that they didn’t want to make, which many people would overlook). Even MD5, with all its problems, doesn’t have that weakness yet.
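That predictability comes from CRC being affine over XOR: for equal-length inputs, crc(msg ⊕ delta) can be computed from crc(msg), crc(delta) and the CRC of an all-zero string of the same length – without ever seeing msg itself. A quick check with Python’s zlib (the messages are invented for the example):

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

msg  = b"pay 100 EUR to Alice"
evil = b"pay 999 EUR to Mallo"   # same length as msg
delta = xor(msg, evil)           # only the *changes*, not the message
zeros = bytes(len(msg))

# Predict the CRC of the tampered message from the old CRC and the
# delta alone, using the affine property of CRC32 over XOR.
predicted = zlib.crc32(msg) ^ zlib.crc32(delta) ^ zlib.crc32(zeros)
actual = zlib.crc32(evil)
print(hex(predicted), hex(actual))  # the two values match
```

A cryptographic hash, by design, gives an attacker no such shortcut: every output bit should flip unpredictably on any input change.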
You’re perfectly right on that. Indeed it should only be used to check for unintentional corruption; that’s why I made the example of the FLAC audio data. I guess I didn’t make that part clear enough.
Making a CRC collision with words in the same language (and with correct grammar) is actually trivial: you just need to find 32 or so words in the message that have synonyms of the same length, and start XORing away…