Just Enough Cryptography

Cryptographic Hash Functions — Their basic characteristics

Put very simply, a hash function is a measurement whose purpose is to detect very slight changes to data, or to determine whether or not two pieces of data are identical. Cryptographically useful hashing functions have some specific characteristics. Commonly used hash functions include MD5, SHA-1, SHA-256, and SHA-512.

Smoking a hookah after dinner in Göreme, Turkey

No, this is just flavored tobacco, in Göreme, Cappadocia, Turkey

The output of any one hash function is always of the same length, regardless of the length of the input string. Stated informally, a cryptographically strong hash is one for which a very slight change in the input causes a drastic change throughout the output, and that means you can't search for input strings producing a given hash value.

A good cryptographic hash has these four properties, where "infeasible" means "almost certainly beyond the capability of any adversary who must be kept from breaking the system for as long as the security is important":

Here is how to calculate the SHA-1 hash of the opening sentence of the CRC Handbook of Applied Cryptography:

$ echo -n "Cryptography has a long and fascinating history." | openssl sha1
d1557b56f7274faf16d38fdf9241035ab71da799 

We can replace the initial C with K and change only one bit out of the 48-character string because of the ASCII encodings:
C = 0x43 = 01000011
K = 0x4b = 01001011
That's a change of just one bit out of 384 bits in the input. 383 of the 384 bits, or 99.739583%, are identical. But the resulting hash is very different:

$ echo -n "Kryptography has a long and fascinating history." | openssl sha1
7215da622638fc3abdab4ffc91c6d2cf4c5599fe 

Compare those two nearly identical inputs:
Cryptography has a long and fascinating history.
Kryptography has a long and fascinating history.
And now compare the two very different hash results, in hexadecimal (base 16):
d1557b56f7274faf16d38fdf9241035ab71da799
7215da622638fc3abdab4ffc91c6d2cf4c5599fe
and (at least partially) in binary:
1101000101010101011110110101011011110111001001110100...
1110010000101011101101001100010001001100011100011111...

One criteria for a good hash function is that a single-bit change in the input causes changes in about half of the output bits. That's what we're seeing here.

Let's use a much larger example input. The King James translation of the Bible is available as a plain text ASCII document from Project Gutenberg. That provides a commonly available, well known, and fairly large data set. According to the Unix wc command, this file contains 4,445,260 bytes, with 100,117 lines of text making up 823,156 words:

$ wc kjv10.txt
  100117  823156 4445260 kjv10.txt 

This document contains a specification for data integrity. The second and third to last verses of the entire document (that is, The Revelation of Saint John the Divine, chapter 22, verses 18 and 19) provide a relevant test point for investigating change detection. As included in the file downloaded from gutenberg.org those verses read:

22:18 For I testify unto every man that heareth the words of the
prophecy of this book, If any man shall add unto these things, God
shall add unto him the plagues that are written in this book: 22:19
And if any man shall take away from the words of the book of this
prophecy, God shall take away his part out of the book of life, and
out of the holy city, and from the things which are written in this book. 

I have made a copy of that text file, changing the first letter of 22:19 from an upper-case "A" to a lower-case "a". That one changed character over 100,000 lines into the file makes absolutely no change in the meaning of the text. The ASCII codes for the characters A and a vary by only one bit:

Char  Hex   Binary
 A   0x41  1000001
 a   0x61  1100001

$ wc kjv10.txt kjv10-modified.txt
  100117  823156 4445260 kjv10.txt
  100117  823156 4445260 kjv10-modified.txt
  200234 1646312 8890520 total
$ diff -C 1 kjv10.txt kjv10-modified.txt
*** kjv10.txt
--- kjv10-modified.txt
***************
*** 100099,100101 ****
  shall add unto him the plagues that are written in this book: 22:19
! And if any man shall take away from the words of the book of this
  prophecy, God shall take away his part out of the book of life, and
--- 100099,100101 ----
  shall add unto him the plagues that are written in this book: 22:19
! and if any man shall take away from the words of the book of this
  prophecy, God shall take away his part out of the book of life, and 

One character out of 4,445,260, that's a 0.0000225% difference. But to be more accurate, the change is only only one bit out of 35,562,080 bits, a 0.0000028% difference. Let's see what happens to the hash values for those files. I'm using the Unix shash command to also have SHA-256 and SHA-512 available from the command line:

$ shash -a md5 kjv10*
# MD5 HASH
98074d4575a9d6970cd2e69dfde4adab  kjv10-modified.txt
a6d84a7fdfaa6906bae40e2c342e978f  kjv10.txt
$ shash -a sha1 kjv10*
# SHA1 HASH
afa2cd8f2e5797b1edde66a829bbbb20650b51fa  kjv10-modified.txt
a4f709de0b5684bee071ac912432833a0c09db7d  kjv10.txt
$ shash -a sha256 kjv10*
# SHA256 HASH
f3df0027a219731d277f2ee5496a7351b7be2028ee5126128c89f40296b4b93d  kjv10-modified.txt
f9aad5f2fed41aa145ea03f6c40716cfeece5b61c6d5fb8350a48873a87614ca  kjv10.txt
$ shash -a sha512 kjv10*
# SHA512 HASH
699f43a0750dbb6b2104e402a3622a7491c18537bff18e083c60603edf2fb7b1cbf7f5eccbf36ca7fc20b12b6929d7cf230b671f0d04e1eae4f1601107499292  kjv10-modified.txt
29df561de4486a9c3e313c7e1515b104ed2d2cf17ff26ab255eba05bff47c5382ec438488f27d6497ad19f68b87fc942702997641f1eba672d6c6c11ce95938f  kjv10.txt 

You can't reverse a hash, but you might find a collision

There is no such thing as the inverse of a hash function! There is no way to start with the hash of an input, and find an input for which you are certain that it was the input:

A collision is a pair of different inputs that produce the same hash output. That is:
   input1 != input2     but
   hash(input1) = hash(input2)

A random collision would be the discovery of any two inputs with equal hashes.

A chosen-target collision is more difficult to find. That is the case where input1 or hash(input1) is already specified, and the attacker needs to find input2.

Another description of a good hash function is in "Deploying a New Hash Algorithm", by Steven M Bellovin and Eric K Rescorla, which follows the description of section 9.2 of the CRC Handbook of Applied Cryptography, Menezes, van Oorschot, and Vanstone, available as a free PDF download:

Computers get faster and memory gets cheaper. So, a 2128 search space is no longer considered to be adequately large, and so the MD5 algorithm is not strong enough against brute-force attacks with sophisticated modern hardware. What's worse, researchers have published results for collisions (and see the results below for even more on this):

Use at least SHA-1 instead, or preferably SHA-256 or SHA-512. The SHA algorithms are of a family derived from MD5, so (very roughly speaking!) we might expect strengths proportional to their numbers of possible outputs.

Hash Output bits Possible outputs Approximate size of output space
MD5 128 2128 3.40×1038
SHA-1 160 2160 1.46×1048
SHA-256 256 2256 1.16×1077
SHA-512 512 2512 1.34×10154

That last number is awfully long. Thanks to the arbitrary-precision arithmatic provided by the Unix bc command:
13,407,807,929,942,597,099,574,024,998,205,846,127,479,365,820,592,393,377,723,561,443,721,764,030,073,546,976,801,874,298,166,903,427,690,031,858,186,486,050,853,753,882,811,946,569,946,433,649,006,084,096 

Do hash functions produce all possible outputs? Or might there be some "forbidden patterns", do hash functions enforce some form of parity preventing more than some maximum number of bits set to 0 or 1, or preventing sequential runs of identical bits longer than some threshold? Apparently not, although these "interesting" events take more and more brute-force search to discover as they become more and more interesting. See the results of my brute-force search for the details.

Yes, collisions are awfully unlikely to happen randomly, and still very difficult to find on purpose, but it is possible. Here are two 128-byte data files with identical MD5 hashes! But their SHA-1 hashes differ. Notice how similar the two files are, just six regularly-spaced bits differ:


 $ openssl md5 crypto-file*.dat
 MD5(FILE1.DAT)= a4c0d35c95a63a805915367dcfe6b751
 MD5(FILE2.DAT)= a4c0d35c95a63a805915367dcfe6b751
 $ openssl sha1 crypto-file*.dat
 SHA1(FILE1.DAT)= 2783c4ff4a3f20d25f2598a8b052b890c37dcac4
 SHA1(FILE2.DAT)= 3c35410823ef00b12d020981c1cf8564c0f89bcc
 $ hexdump -C crypto-file1.dat
 00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish      
 00000010  2f ca b5 87 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |deleted...
 00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 71 41 5a  |
 00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd f2 80 37 3c 5b  |
 00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |
 00000050  35 73 9a c7 f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |
 00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 cc 15 5c  |
 00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a d8 35 cc a7 e3  |
 00000080
 $ hexdump -C crypto-file2.dat
 00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish
 00000010  2f ca b5 07 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |deleted...
 00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 f1 41 5a  |
 00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd 72 80 37 3c 5b  |
 00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |
 00000050  35 73 9a 47 f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |
 00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 4c 15 5c  |
 00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a 58 35 cc a7 e3  |
 00000080

If you want to play with these data files:

Note that MD5 is a chained hash function, and this means that if you start with the above two 128-byte files, and append identical sequences to each, the two resulting data files will also have identical MD5 hashes. So, an infinite number of collisions could be based on these.

Post-2004 hash collision research

In the summer of 2004 some mathematicians presented a paper showing collision results for MD5 and others. See "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD" by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu, http://eprint.iacr.org/2004/199

Someone else then published a paper on weaknesses in SHA-0, which was a precursor to SHA-1. Actually, SHA-0 was the winning submission for the U.S. government's Secure Hash Algorithm, but the NSA specified a few tweaks for reasons they (of course) did not explain. For this intermediate work, see: "Near-Collisions of SHA-0" by Eli Biham and Rafi Chen, http://eprint.iacr.org/2004/146

At the Crypto 2005 conference the following summer, Xiaoyun Wang, Yiqun Yin, and Hongbo Yu (the first and third authors of the preceding summer's paper) published attacks against full SHA-1. Then, at an informal meeting at that conference, Xiaoyun Wang announced new results: the time complexity was even less than the papers (so far) had announced. Relevant papers include:

Lucks and Daum have generated Postscript files that exploit the attack — the files have the same hash, although they contain contradictory messages!
http://th.informatik.uni-mannheim.de/people/lucks/HashCollisions/

Stevens, Lenstra and de Weger have generated several PDF files with very different contents but identical hashes: http://www.win.tue.nl/hashclash/Nostradamus/ as part of their general hash work: http://www.win.tue.nl/hashclash/

In December, 2008, researchers reported finding MD5 collisions in one to two days using a system built from 200 Playstation 3's. Since six Certificate Authorities were still using MD5 signing on X.509 digital certificates, that means that with a day or two of computing you could generate forged certificates allowing a rogue system to masquerade as a "secure" server using one of those certificates. They generated an intermediate CA certificate so they could sign their own certificates. That means you would only need to do the computational attack once to generate as many apparently valid certificates as you want.

What does it all mean?

What does all that mean? The difficulty of finding random or chosen-target collisions is not as hard as we thought it was for anything in the family of hash functions including MD5 and SHA-1. So we should conclude:

No need to panic, but new systems should be designed around SHA-256 or SHA-512 with a plan to plug in replacement functions. Trusted implementations of SHA-256 and SHA-512 are available as part of the OpenSSL package, both command-line tools and shared libraries. And cryptologists should be (and are) working on a replacement for the whole SHA family.

In February 2007 NIST announced a competition to replace the current FIPS-180-2, which now specifies SHA-1, SHA-224, SHA-384, and SHA-512. FIPS-180-2 was issued in 2002, so the SHA family did not last as long as might have been expected. The new hash algorithms will be selected in a process like that used for AES, the Advanced Encryption Standard, defined by government standard FIPS-140-2. Entries were submitted at the end of October, 2008, and the final selection should happen in late 2012. See http://www.nist.gov/hash-function/.

Also see The Hashing Function Lounge, at http://www.larc.usp.br/~pbarreto/hflounge.html

Cryptographic Hash Functions
Algorithm Output Length Notes
MD1 128 Computationally inefficient...
MD2 128 Cryptographically flawed...
MD3 128 Computationally inefficient...
MD4 128 Cryptographically flawed...
MD5 128
SHA-1 160 "Secure Hash Algorithm", but use SHA-256 or SHA-512
RIPEMD-160 160
SHA-256 256
Snefru 256
GOST 256 Gosudarstvennyi Standart Soyuza SSR, Государственный Станарт Союза ССР
EKS-Blowfish 448 Based on the Blowfish cipher
SHA-512 512

Enigma encryption machine.

Back to the Security Index

Click here to inquire about advertising on this or any page on this site.
Home Unix/Linux Networking Cybersecurity Travel Technical Radio Site Map Contact


Use /bin/vi! Manipulate images with ImageMagick! Hosted on OpenBSD
Hosted on Apache This site is viewable with any browser Valid XHTML 1.0! Valid CSS!
© Bob Cromwell Feb 2012. Created with /bin/vi and ImageMagick, hosted on OpenBSD with Apache.    Root password available here, privacy policy here.