So far, we’ve discussed the basics of encryption and symmetric encryption algorithms. Encryption is used to protect the confidentiality of data. Let’s now take a small turn and look at protecting the integrity and authenticity of data.

Encryption can protect data from being read while in transit. However, this does not prevent the data from being manipulated while in transit. Hash functions can be used to help ensure that data has not been changed.

Hash functions can also be used to store sensitive data, such as passwords, without revealing it.

Let’s take a look at what hash functions do and how they can be used to protect your data.

## Anatomy of a Hash Function

A hash function takes an arbitrary length input and produces a fixed-length output. For example, you can pass 1MB worth of data or 1KB and the hash will always produce a 256-bit output value.

Hash functions operate according to one key concept: it is easy to calculate and hard to reverse. In other words, a good hash function makes it infeasible to figure out what was processed given the output.

A simple example of this is multiplication and factorization in math. It is pretty easy to multiply two numbers such as 1,522 and 256. Multiply these and you get 389,692.

However, given the number 389,692, it is much more difficult to figure out the numbers used to get there, or factorization in math terms.

Hash functions do more than this, but the idea is the same. When you hash a piece of data, it should be easy to calculate and difficult to reverse.

Another important property of cryptographic hash functions is a resistance to collisions. Collisions occur when two inputs to a hash function generate the same output. Since hashes are often used to establish integrity and authenticity of data, collisions would allow an attacker to trick you into trusting false or manipulated data.

It may seem strange to say “resistance” to collisions and not absence of collisions. This is due to the fact that there is an infinite number of inputs but a finite amount of outputs. Since the output is always the same length, collisions are inevitable. The trick is making them difficult to find.

## Hash Algorithm Choices

There have been several hash functions in wide use over the years. There are some that have been shown to be insecure. There are new ones being created. There are those in use that are believed to be secure now.

### MD5 and SHA-1

MD5 is a hash function that is still in use to some degree. It computes a 128-bit output. This function is considered to be broken. Why?

First, 128 bits is not enough to protect the hash against brute-force attacks with modern computing hardware. Even if the original value passed into the function is not known, a collision can be found pretty easily due to technical issues with how the hash function operates.

SHA-1 (Secure Hash Algorithm) was created by the NSA to replace MD5. It outputs a 160-bit value. Even though this is better than MD5, it is still a small bit size by current standards.

Another problem with SHA-1 is that collisions have now been proven to be possible with it. Google recently posted an attack where two different pdf files have been hashed to the same value. While this did take significant time and effort, attacks only get better over time. So it is safer to go with better alternatives if they are available, which they are.

### SHA-2 and SHA-3

SHA-2 and SHA-3 are families of hash functions that have been designed to replace SHA-1. SHA-2 consists of four hash functions based on the basic structure of MD5 and SHA-1. SHA-3 is the newest standard and has a much different internal structure.

SHA-2 is a family of hash functions including SHA-224, SHA-256, SHA-384, and SHA-512. The numbers indicate the number of bits of output. In practice, I have seen mostly SHA-256 and SHA-512 used. The others are not used as often.

The greatest asset to these functions is the longer output. As stated previously, you want a longer output in order to give the function the most possible output values. This makes it significantly more difficult to find collisions, which is usually the easier way to attack a hash algorithm than trying to guess the input.

SHA-3 is the newest standard, being released by NIST in 2015. This algorithm has not been around long enough to be fully tested but uses some strong techniques that show promise.

The internal algorithm used for SHA-3 adheres to a “sponge construction” model. As the name suggests, think of this algorithm as being a sponge that soaks up data like water. Then the sponge is “squeezed” to output the final result.

The differences between the structure of SHA-3 and the previous standards were purposely designed to overcome some weaknesses of those algorithms.

For example, MD5, SHA-1, and SHA-2 functions are susceptible to length extension attacks. This allows an attacker to see into the internal state of the hash function and append their own message to the original. This gives the appearance of validity of the bogus message. SHA-3 is not susceptible to this attack.

In terms of what developers should use, SHA-2 and SHA-3 should be what you are looking at for production systems. Even though SHA-3 is new, its design was made to overcome the weaknesses of the others. SHA-256 and SHA-512 have stood the test of time and are the most common choice in production systems.

## Uses of Hash Functions

One of the best aspects of hash functions is that they have many uses in cryptographic systems. We’ll cover three: storing sensitive data, creating cryptographic keys, and ensuring integrity and authenticity.

### Sensitive Data Storage

First, hash functions are used to store sensitive data. The best known use of this is storing user passwords.

My post on protecting your passwords goes in-depth on this topic. Passwords should never be stored in cleartext inside a database. Hashes are a great solution as they always generate a unique output for each input.

This means that when a user enters their password, instead of checking their actual password against a copy of it stored in a database, you hash it and compare the hash values. If they match, then the password entered was the original value that generated the hash value stored in the database. This is where resistance to collisions is a key piece of a good hash function.

Take a look at my post to find out what more developers can do to protect passwords against attack.

### Creating Cryptographic Keys

Hash functions can also be used to create cryptographic keys. Consider the following possible system design.

You have sensitive data stored in your database. Perhaps this is PII data such as Social Security Numbers or perhaps health records or credit card data that must be protected. You need to encrypt this data. How can the key be generated?

An option could be to use the password your users give you to generate a key. This can be done by using a hash function with a random salt value. If you are using a SHA-256 hash, this value can be used as the key to encrypt the user’s data with AES-256.

This design would allow each user’s data to have a different, strong encryption key, making it much more difficult for an attacker to decipher the data in the event of a data breach.

Password-Based Key Derivation Functions (PBKDFs) are made for this purpose. I’ll discuss how those work in another post.

### Integrity Checking

Another key use of hash functions is integrity checking. This means ensuring that a message was not changed in transit by a third party.

This is done by hashing the message and sending the hash of the message along with the message itself. Once the user on the other end receives it, they compute the hash of the message themselves and make sure they match.

Digital signatures also establish authenticity and integrity. They use asymmetric cryptography, also known as public-key cryptography. We haven’t discussed this in our series yet, but the idea is that each person has a separate key that can be used to encrypt and then decrypt a message.

A digital signature is made by first hashing the message. Then the hash value is encrypted with the private key of the sender. The receiver decrypts the hash value using the public key of the sender to verify the authenticity of the message. Then the message is hashed and when the hashes match the digital signature is valid.

Now let’s take a look at a highly useful way of using hashes, within Message Authentication Codes.

## HMACs for Integrity and Authenticity

A message authentication code, or MAC, is used to detect tampering with messages. Different algorithms can be used with MACs, including encryption algorithms such as AES.

The HMAC is a MAC that uses a hash function as its basis for calculating the final value used to authenticate and check the integrity of a message. It also adds a key that is shared between the two parties communicating with each other.

The key is important because if an eavesdropper tries the manipulate the message, they don’t know the key required to create a valid MAC for the new value.

The algorithm used is quite interesting. Check out my video on YouTube to see how the algorithm works. This method of calculating the HMAC prevents the problems from underlying hash functions from affecting the MAC (i.e. length extension attacks).

HMACs are used quite a bit and should definitely be considered for verifying message authenticity and integrity. In fact, if you’ve used JSON web tokens (JWTs), you’ve used HMACSHA256 to verify the token received from the client. This is an HMAC with SHA-256 as the main hash function.

## Confidentiality and Integrity

Encryption is meant to protect confidentiality. Hash functions are meant to protect integrity. They have a close relationship and are often used in combination.

As a developer or architect designing a secure system, it is important to understand this relationship and use it effectively.

Happy Hashing!