Chapter 2. Forensic Algorithms

Forensic algorithms are the building blocks for a forensic investigator. Independent of any specific implementation, these algorithms describe the details of the forensic procedures. In the first section of this chapter, we introduce the different algorithms that are used in forensic investigations, including their advantages and disadvantages.

Algorithms

In this section, we describe the main differences between MD5, SHA256, and SSDEEP, the most common algorithms used in forensic investigations. We explain the use cases as well as the limitations and threats behind these three algorithms. This should help you understand why SHA256 is preferable to MD5 and in which cases SSDEEP can help you in an investigation.

Before we dive into the different hash functions, we will give a short summary of what a cryptographic hash function is.

A hash function is a function that maps an arbitrarily large amount of data to a value of a fixed length. The hash function ensures that the same input always results in the same output, called the hash sum. Consequently, a hash sum is a characteristic of a specific piece of data.

A cryptographic hash function is a hash function that is considered practically impossible to invert. This means that, given a predefined hash sum, there is no feasible way to construct matching input data other than trying all possible input values, that is, by brute force. Therefore, this class of algorithms is known as one-way cryptographic algorithms.

The ideal cryptographic hash function has four main properties, as follows:

  1. It must be easy to compute the hash value for any given input.
  2. It must be infeasible to generate the original input from its hash.
  3. It must be infeasible to modify the input without changing the hash.
  4. It must be infeasible to find two different inputs with the same hash (collision-resistant).

In the ideal case, if you create a hash of the given input and change only one bit of this input, the newly calculated hash will look totally different, as follows:

user@lab:~$ echo -n This is a test message | md5sum
fafb00f5732ab283681e124bf8747ed1

user@lab:~$ echo -n This is A test message | md5sum
aafb38820e0a3788eb41e9f5805e088e

If all of the previously mentioned properties are fulfilled, the algorithm is a cryptographically correct hash function and can be used to compare, for example, files with each other to prove that they haven't been tampered with during analysis or by an attacker.
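The one-bit experiment shown above with md5sum can also be reproduced in Python. The following is a minimal sketch using the standard hashlib module; the helper name differing_bits is our own and only serves the illustration. It counts how many bits differ between the two inputs and between the resulting digests.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"This is a test message"
modified = b"This is A test message"  # 'a' (0x61) -> 'A' (0x41): one flipped bit

print("input bits changed:", differing_bits(original, modified))
for name in ("md5", "sha256"):
    first = hashlib.new(name, original).digest()
    second = hashlib.new(name, modified).digest()
    changed = differing_bits(first, second)
    print(f"{name}: {changed} of {len(first) * 8} digest bits changed")
```

For a well-behaved cryptographic hash function, you would expect roughly half of the digest bits to change, even though only a single input bit was flipped.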

MD5

The MD5 message-digest algorithm was for a long time the most commonly used cryptographic hash function and is still widely used. It produces a 128-bit (16-byte) hash value, typically expressed in text format as a 32-digit hexadecimal number (as shown in the previous example). This message digest has been utilized in a wide variety of cryptographic applications and is commonly used to verify data integrity in forensic investigations. The algorithm was designed by Ronald Rivest in 1991 and has been heavily used since then.

A big advantage of MD5 is that it is fast to compute and produces small hashes. The small hash size is a major point of interest when you need to store thousands of these hashes in a forensic investigation. Just imagine how many files a common PC has on its hard drive. If you need to calculate a hash of each of these files and store them in a database, it makes a huge difference whether each calculated hash is 16 bytes or 32 bytes in size.
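To make the storage argument concrete, here is a small back-of-the-envelope sketch. The file count of 500,000 is a purely hypothetical figure, and the calculation covers only the raw digests, not any database overhead.

```python
import hashlib


def digest_storage_bytes(algorithm: str, file_count: int) -> int:
    """Return the bytes needed to store one raw digest per file."""
    return hashlib.new(algorithm).digest_size * file_count


file_count = 500_000  # hypothetical number of files on a workstation
for name in ("md5", "sha256"):
    total = digest_storage_bytes(name, file_count)
    print(f"{name}: {total / 1_000_000:.1f} MB of raw digests")
```

Storing the hashes as hexadecimal text instead of raw bytes doubles these numbers.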

Nowadays, the major disadvantage of MD5 is that it is no longer considered collision-resistant. This means that it is feasible to construct two different inputs that produce the same hash. Consequently, comparing the MD5 hash of a file at two different stages of an investigation no longer guarantees that the file has not been modified. At the moment, a collision can be created very quickly (refer to http://www.win.tue.nl/hashclash/On%20Collisions%20for%20MD5%20-%20M.M.J.%20Stevens.pdf), but it is still difficult to modify a benign file into a malicious version that keeps the MD5 hash of the original file.

The well-known cryptographer Bruce Schneier once wrote (https://www.schneier.com/blog/archives/2008/12/forging_ssl_cer.html):

"We already knew that MD5 is a broken hash function" and that "no one should be using MD5 anymore".

We would not go that far (especially because a lot of tools and services still use MD5), but you should try switching to SHA256 or at least double-check your results with the help of different hash functions in cases where it is critical. Whenever the chain of custody is crucial, we recommend using multiple hash algorithms to prove the integrity of your data.
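The recommendation to use multiple hash algorithms can be followed without reading the evidence several times. The following is a minimal sketch, assuming the file is read in 1 MiB chunks; the function name hash_file and its parameters are our own choices for the illustration.

```python
import hashlib


def hash_file(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Compute several digests of one file in a single pass over its content."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read in chunks so that large disk images do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling hash_file("evidence.dd") (a hypothetical file name) returns a dictionary with one hexadecimal digest per algorithm. Recording both values when the evidence is acquired and comparing them again later supports the chain of custody even if one of the algorithms is weakened.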

SHA256

SHA-2 is a set of cryptographic hash functions designed by the NSA (U.S. National Security Agency) and stands for Secure Hash Algorithm 2nd Generation. It was published in 2001 by NIST as a U.S. federal standard (FIPS). The SHA-2 family consists of several hash functions with digests (hash values) that are between 224 bits and 512 bits long. The cryptographic functions SHA256 and SHA512 are the most common versions of the SHA-2 hash functions; they are computed with 32-bit and 64-bit words, respectively.
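The parameters of the SHA-2 family can be inspected with Python's hashlib module. The sketch below relies on the fact that each SHA-2 function processes its input in blocks of 16 words, so the word size follows from the block size; the function name sha2_parameters is our own.

```python
import hashlib


def sha2_parameters():
    """Return digest, block, and word sizes (in bits) for common SHA-2 functions."""
    parameters = {}
    for name in ("sha224", "sha256", "sha384", "sha512"):
        hasher = hashlib.new(name)
        block_bits = hasher.block_size * 8
        parameters[name] = {
            "digest_bits": hasher.digest_size * 8,
            "block_bits": block_bits,
            "word_bits": block_bits // 16,  # one block consists of 16 words
        }
    return parameters


for name, values in sha2_parameters().items():
    print(name, values)
```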

Although these algorithms are slower to compute and produce larger hashes than MD5, they should be the preferred algorithms for integrity checks during forensic investigations. SHA256 is a widely used cryptographic hash function for which no practical collision attack is publicly known, so it is still considered collision-resistant and trustworthy.

SSDEEP

The biggest difference between MD5, SHA256, and SSDEEP is that SSDEEP is not considered to be a cryptographic hash function, as its output changes only slightly when the input is changed by one bit. For example:

user@lab:~$ echo -n This is a test message | ssdeep
ssdeep,1.1--blocksize:hash:hash,filename
3:hMCEpFzA:hurs,"stdin"

user@lab:~$ echo -n This is A test message | ssdeep
ssdeep,1.1--blocksize:hash:hash,filename
3:hMCkrzA:hOrs,"stdin"

The SSDEEP packages can be downloaded and installed as described at the following URL: http://ssdeep.sourceforge.net/usage.html#install

The fact that the hash changes only slightly is not a weakness of SSDEEP; it is a major advantage of this function. In reality, SSDEEP is a program to compute and match Context Triggered Piecewise Hashing (CTPH) values. CTPH is a technique that is also known as Fuzzy Hashing and is able to match inputs that have homologies. Inputs with homologies have sequences of identical bytes in a given order with totally different bytes in between. These bytes in between can differ in content and length. CTPH, originally based on the work of Dr. Andrew Tridgell, was adapted by Jesse Kornblum and published at the DFRWS conference in 2006 in a paper called Identifying Almost Identical Files Using Context Triggered Piecewise Hashing; refer to http://dfrws.org/2006/proceedings/12-Kornblum.pdf.
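The idea behind CTPH can be illustrated with a strongly simplified toy implementation. The sketch below is not the algorithm used by ssdeep: the rolling value is a plain sum over a small window, the piece hash is one character derived from MD5, and the names toy_ctph, trigger_modulus, and window are our own assumptions. It only shows the principle that piece boundaries are triggered by the local context, so an inserted sequence disturbs the signature only locally.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def toy_ctph(data: bytes, trigger_modulus: int = 8, window: int = 4) -> str:
    """Toy context-triggered piecewise hash (not compatible with ssdeep)."""
    signature = []
    piece_start = 0
    for i in range(len(data)):
        # The rolling value depends only on the last `window` bytes (the context).
        rolling = sum(data[max(0, i - window + 1): i + 1])
        # A trigger point ends the current piece; each piece adds one character.
        if rolling % trigger_modulus == trigger_modulus - 1:
            piece = data[piece_start: i + 1]
            signature.append(ALPHABET[hashlib.md5(piece).digest()[0] % 64])
            piece_start = i + 1
    if piece_start < len(data):  # hash whatever remains after the last trigger
        tail = data[piece_start:]
        signature.append(ALPHABET[hashlib.md5(tail).digest()[0] % 64])
    return "".join(signature)


def common_prefix_length(a: str, b: str) -> int:
    """Return the number of leading characters that two strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# 2048 bytes of pseudo-random sample data, plus a copy with 8 inserted bytes.
base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
edited = base[:1000] + b"INSERTED" + base[1000:]

sig_base, sig_edited = toy_ctph(base), toy_ctph(edited)
print("shared start:", common_prefix_length(sig_base, sig_edited))
print("shared end:  ", common_prefix_length(sig_base[::-1], sig_edited[::-1]))
```

Because the trigger points before and after the inserted bytes stay the same, the two signatures share a long common start and end, which is exactly the kind of homology a fuzzy hash is meant to expose. A cryptographic hash of the two inputs would share nothing recognizable.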

SSDEEP can be used to check how similar two files are and in which part of the file the difference is located. This feature is often used to check whether two different applications on mobile devices have a common code base, as shown in the following:

user@lab:~$ ssdeep -b malware-sample01.apk > signature.txt

user@lab:~$ cat signature.txt
ssdeep,1.1--blocksize:hash:hash,filename
49152:FTqSf4xGvFowvJxThCwSoVpzPb03++4zlpBFrnInZWk:JqSU4ldVVpDIcz3BFr8Z7,"malware-sample01.apk"

user@lab:~$ ssdeep -bm signature.txt malware-sample02.apk
malware-sample02.apk matches malware-sample01.apk (75)

In the previous example, you can see that the second sample matches the first one with a score of 75; ssdeep reports match scores between 0 and 100, where a higher score means greater similarity. Such a match indicates potential source code reuse, or at least that a large number of files inside the apk files are identical. A manual examination of the files in question is required to tell exactly which parts of the code or files are identical; however, we now know that both files are similar to each other.
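If you want to process stored signature files such as signature.txt in your own scripts, the lines can be split into their fields. The following is a minimal sketch based only on the layout announced in the header line (blocksize:hash:hash,filename); the class and function names are our own, and the sketch does not compute or compare fuzzy hashes itself.

```python
from typing import List, NamedTuple


class SsdeepSignature(NamedTuple):
    block_size: int
    hash1: str
    hash2: str
    filename: str


def parse_signature_line(line: str) -> SsdeepSignature:
    """Split one 'blocksize:hash:hash,"filename"' line into its fields."""
    # Split at the first comma only, since file names may contain commas.
    signature, _, filename = line.strip().partition(",")
    block_size, hash1, hash2 = signature.split(":")
    return SsdeepSignature(int(block_size), hash1, hash2, filename.strip('"'))


def parse_signature_file(text: str) -> List[SsdeepSignature]:
    """Parse the content of a signature file, skipping header and empty lines."""
    return [
        parse_signature_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("ssdeep,")
    ]


sample = 'ssdeep,1.1--blocksize:hash:hash,filename\n3:hMCEpFzA:hurs,"stdin"\n'
for entry in parse_signature_file(sample):
    print(entry)
```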
