Developer's Corner

by Antonio Bello

Data Integrity Check Using Hashing

| Comments

When data is moved across a network, the recipient might require a verification to ensure that data has not been modified after send.

For example, in a client/server scenario where a client application sends some updates to a remote database via a web service, the web service might need to verify whether the data has not been modified while travelling across the network, either due to corruption or because somebody changed it. A simple way to implement such kind of verification is utilizing hashing.

How Hashing works

The idea behind hashing is to generate a (ideally) unique and non–reversible code generated from the data being sent.

Unique means that the codes generated from 2 different sets of data must be different, regardless of the data size, and even if they differ by 1 byte or 1 bit only.

Non-reversible means that the hash code cannot be used to recreate the original data.

For the purpose of signing data to check for modification, the non-reversibility feature is unnecessary, as the hash code travels with the data, but there are other cases where it is undoubtedly useful.

For example, one of the easiest way to prevent passwords from being stored in a database is to store a hash code rather than the password itself. This way, when a user is authenticating, the password is used to generate an hash key, which is compared to the hash key stored in the database (previously generated when the user account has been created, or the last time the user has changed his password) – if they match, authentication succeeds. The drawback of this method is that a “recall password” feature cannot be implemented, as the password is not stored anywhere, and the non reversibility property of the hash code prevents it from being used to recreate the original data – in such case a “reset password” feature is more appropriate.

Hashing and Salting

Using the model described above, data is protected against corruption but not against changes. In fact, whereas a corruption is immediately detected once the recipient verifies that the hash codes don’t match, an hacker attempting to modify the data simply has to create a new hash code. Since this code is sent with the data, and since the hashing algorithm is deterministic, whoever wants to modify the data has all ingredients to do so, with no risk of being detected.

So hashing is not enough to guarantee data integrity. One common practice is to use a random or predefined string, which is appended to the data before calculating the hash code. This additional string is known as salt. If the salt is known by sender and recipient only, an hacker has no way to generate an hash code that won’t be immediately detected by the recipient during hash verification.
In this scenario the salt can be thought as a private key shared between sender and recipient.

For improved security, it’s better if the salt is not 100% static – for example, a dynamic string depending from the date and/or time can be mixed to the static salt in order to generate a dynamic salt, harder for an hacker to guess. For this method to work both sender and recipient must generate the same salt in a wide enough time window. For example, supposing that sender and recipient have their clocks synchronized, if the salt is composed by current hour and minute, the widest time window is 60 seconds – but the recipient could generate a different salt just 1 second after the sender generated the salt code, if this occurs, for example, at 10:20:59. In this case the sender use 10:20 to generate the salt – if the recipient receives the message at least 1 second later, it will use 10:21 to generate the salt. And this obviously leads to a completely different hash code.

Implementing a data integrity check

There are several algorithms that can be use to generate hash codes. One of them is MD5, implementable as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
using System;
using System.Security.Cryptography;
using System.Text;

namespace Hash
{
 public class HashGenerator
 {
  public static string GenerateMd5Hash(string salt, string data)
  {
   byte[] buffer;
   byte[] binaryHash;
   string stringHash;
   MD5 md5;

   md5 = MD5.Create();

   // Combine salt + data and convert to array of bytes
   buffer = Encoding.Unicode.GetBytes(salt + data);

   // Create the hash code from salt + data
   binaryHash = md5.ComputeHash(buffer);

   // The hash code is in binary format - convert
   // it in readable string
   stringHash = Convert.ToBase64String(binaryHash);

   return stringHash;
  }
 }
}

The code is pretty simple: salt + data are combined together and converted to array of bytes, then the ComputeHash method of the MD5 instance creates the hash code given the array of bytes. Last, since the hash generated by ComputeHash is in binary format, it is converted to string and returned. To test this code we can create a simple console application like the following one:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
using System;
using Hash;

namespace HashSample
{
 public class Program
 {
  static void Main(string[] args)
  {
   string test;
   string hash;
   string salt;

   test = "This is a sample data";
   salt = "FTSA74D71KSAP";

   hash = HashGenerator.GenerateMd5Hash(salt, test);

   Console.Out.WriteLine(
    "The hash code for string '{0}' and salt '{1}' is:\n\t{2}",
    test, salt, hash
   );

   Console.In.Read();
  }
 }
}

The output produced by this application is the following:

1
2
3
`The hash code for string 'This is a sample data' and salt 'FTSA74D71KSAP' is:

s2gxEi9C3QSRGFy+cZheWA==`

To prove how the hash code changes by simply changing 1 byte in the original data, let’s see what happens if we slightly change the test string to “This is b sample data”:

1
2
3
`The hash code for string 'This is b sample data' and salt 'FTSA74D71KSAP' is:

FJr7RUcSYJMJBdfCg6DCXQ== `

We can see that these 2 hashes are completely different.

IHashable interface

If we need to generate an hash code on instances of a class, why don’t we define a proper interface? This way we can just implement the interface and write a single function to generate the hash code. The interface name is, of course, IHashable:

1
2
3
4
5
public interface IHashable
{
    string GetHashableRepresentation();
    string GenerateSalt();
}

and consists of 2 methods: GetHashableRepresentation and GenerateSalt. The latter should be implemented in order to generate a proper salt code, whereas the former should return a representation of the class instance in a string format suitable for hashing – for example, a simple string concatenation of all data members.

As I mentioned earlier, there are several algorithms usable for hashing. Following a comprehensive implementation of the hash generation, which allows the caller to choose which algorithm to use:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
public enum HashingAlgorithm
{
    Md5,
    Sha1,
    Sha256,
    Sha384,
    Sha512,
    RipeMd160
}

public static string GenerateHash(HashingAlgorithm alg, IHashable hashable)
{
    HashAlgorithm algorithm;

    algorithm = CreateHashAlgorithm(alg);

    byte[] buffer;
    byte[] binaryHash;
    string stringHash;

    // Combine salt + data and convert to array of bytes
    buffer = Encoding.Unicode.GetBytes(hashable.GetHashableRepresentation() + hashable.GenerateSalt());

    // Create the hash code from salt + data
    binaryHash = algorithm.ComputeHash(buffer);

    // The hash code is in binary format - convert
    // it in readable string
    stringHash = Convert.ToBase64String(binaryHash);

    return stringHash;
}

private static HashAlgorithm CreateHashAlgorithm(HashingAlgorithm alg)
{
    HashAlgorithm algorithm;

    switch (alg)
    {
        default:
        case HashingAlgorithm.Md5:
            algorithm = MD5.Create();
            break;

        case HashingAlgorithm.Sha1:
            algorithm = SHA1.Create();
            break;

        case HashingAlgorithm.Sha256:
            algorithm = SHA256.Create();
            break;

        case HashingAlgorithm.Sha384:
            algorithm = SHA384.Create();
            break;

        case HashingAlgorithm.Sha512:
            algorithm = SHA512.Create();
            break;

        case HashingAlgorithm.RipeMd160:
            algorithm = RIPEMD160.Create();
            break;
    }

    return algorithm;
}

The GenerateHash method requires 2 parameters: the algorithm to use and an instance of IHashable. In return, the method provides an hash code generated using the specified algorithm.

Last thing to mention, a helpful method to compare an hash code with an IHashable instance – its implementation is really very simple:

1
2
3
4
public static bool CompareHashes(HashingAlgorithm alg, string hash, IHashable hashable)
{
    return GenerateHash(alg, hashable).Equals(hash);
}

Digital signature

When used alone, hashing is good but it’s not the best. There are better ways, out of the scope of this article – but I want to spend a few words about them.

The safest and more secure way to sign data is using digital signature, which make use of asymmetric cryptography (also known as public key cryptography).

The idea is to encrypt the hash code so that it cannot be modified.
In cryptography a key is used to encrypt and decrypt data – it’s not an algorithm, but a secret data used for encryption: using a different key different encrypted data is generated.

A (very simple) example of key is the salt we’ve seen earlier: changing the salt will produce a different hash code. Keys are usually used to encrypt and decrypt data.

In symmetric cryptography the same key is used to encrypt and decrypt data. This is the biggest limitation, as the key must be owned by both parties involved in the encryption and decryption process, or sent along with the encrypted data if the recipient doesn’t have the key.

In public key cryptography a key is composed by a public key and a private key. The private key, as the name implies, is private and should never be sent to anybody. On the other hand, the public key can be published and made available to anybody. The key owner uses his private key to encrypt data – anybody can decrypt the data using the public key, but the cool feature is that the public key can be used to decrypt data encrypted by the corresponding private key only. This means that who is decrypting is sure that data has been encrypted by the owner of the private key only. Keys can also be used in the opposite way: data encrypted with the public key can only be decrypted by the corresponding private key. This way whoever encrypts data using the public key is sure that data will be available to the owner of the corresponding private key only.

In our hashing problem, if the sender encrypts the hash code using his private key, the recipient is able to detect any unauthorized data modification:

  • if the data gets corrupted or changed, the hash code calculated from the data won’t match with the encrypted hash code provided with the data

  • if the encrypted hash code provided with the data is modified, it won’t match with the hash code calculated from the data

Comments