Custom Encoding

Malware often uses homegrown encoding schemes. One such scheme is to layer multiple simple encoding methods. For example, malware may perform one round of XOR encryption and then afterward perform Base64 encoding on the result. Another type of scheme is to simply develop a custom algorithm, possibly with similarities to a standard published cryptographic algorithm.

Identifying Custom Encoding

We have discussed a variety of ways to identify common cryptography and encoding functions within malware when there are easily identifiable strings or constants. In many cases, the techniques already discussed can assist with finding custom cryptographic techniques. If there are no obvious signs, however, the job becomes more difficult.

For example, say we find malware with a bunch of encrypted files in the same directory, each roughly 700KB in size. Example 13-6 shows the initial bytes of one of these files.

Example 13-6. First bytes of an encrypted file

88 5B D9 02 EB 07 5D 3A 8A 06 1E 67 D2 16 93 7F    .[....]:...g....
43 72 1B A4 BA B9 85 B7 74 1C 6D 03 1E AF 67 AF    Cr......t.m...g.
98 F6 47 36 57 AA 8E C5 1D 70 A5 CB 38 ED 22 19    ..G6W....p..8.".
86 29 98 2D 69 62 9E C0 4B 4F 8B 05 A0 71 08 50    .).-ib..KO...q.P
92 A0 C3 58 4A 48 E4 A3 0A 39 7B 8A 3C 2D 00 9E    ...XJH...9{.<-..

We use the tools described thus far, but find no obvious answer. There are no strings that provide any indication of cryptography. FindCrypt2 and KANAL both fail to find any cryptographic constants. The tests for high entropy find nothing that stands out. The only test that finds any hint is a search for XOR, which finds a single xor ebx, eax instruction. For the sake of the exercise, let’s ignore this detail for now.

Finding the encoding algorithm the hard way entails tracing the thread of execution from the suspicious input or output. Inputs and outputs can be treated as generic categories. No matter whether the malware sends a network packet, writes to a file, or writes to standard output, those are all outputs. If outputs are suspected of containing encoded data, then the encoding function will occur prior to the output.

Conversely, decoding will occur after an input. For example, say you identify an input function. You first identify the data elements that are affected by the input, and then follow the execution path forward, looking into only new functions that have access to the data element in question. If you reach the end of a function, you continue in the calling function from where the call took place, again noting the data location. In most cases, the decryption function will not be far from the input function. Output functions are similar, except that the tracing must be done opposite the flow of execution.

In our example, the assumed output is the encrypted files that we found in the same directory as the malware. Looking at the imports for the malware, we see that CreateFileA and WriteFile exist in the malware, and both are in the function labeled sub_4011A9. This is also the function that happens to contain that single XOR function.

The function graph for a portion of sub_4011A9 is shown in Figure 13-14. Notice the WriteFile call on the right in the block labeled loc_40122a. Also notice that the xor ebx, eax instruction is in the loop that may occur just before the write block (loc_40122a).

The left-hand block contains a call to sub_40112F, and at the end of the block, we see a counter incremented by 1 (the counter has the label var_4). After the call to sub_40112F, we see the return value in EAX used in an XOR operation with EBX. At this point, the results of the XOR function are in bl (the low byte of EBX). The byte value in bl is then written to the buffer (at lpBuffer plus the current counter).

Putting all of these pieces of evidence together, a good guess is that the call to sub_40112F is a call to get a single pseudorandom byte, which is XORed with the current byte of the buffer. The buffer is labeled lpBuffer, since it is used later in the WriteFile function. sub_40112F does not appear to have any parameters, and seems to return only a single byte in EAX.

Function graph showing an encrypted write

Figure 13-14. Function graph showing an encrypted write

Figure 13-15 shows the relationships among the encryption functions. Notice the relationship between sub_40106C and sub_40112F, which both have a common subroutine. sub_40106C also has no parameters and will always occur before the call to sub_40112F. If sub_40106C is an initialization function for the cryptographic routine, then it should share some global variables with sub_40112F.

Connected encryption function

Figure 13-15. Connected encryption function

Investigating further, we find that both sub_40106C and sub_40112F contain multiple references to three global variables (two DWORD values and a 256-byte array), which support the hypothesis that these are a cryptographic initialization function and a stream cipher function. (A stream cipher generates a pseudorandom bit stream that can be combined with plaintext via XOR.) One oddity with this example is that the initialization function took no password as an argument, containing only references to the two DWORD values and a pointer to an empty 256-byte array.

We’re lucky in this case. The encoding functions were very close to the output function that wrote the encrypted content, and it was easy to locate the encoding functions.

Advantages of Custom Encoding to the Attacker

For the attacker, custom-encoding methods have their advantages, often because they can retain the characteristics of simple encoding schemes (small size and nonobvious use of encryption), while making the job of the reverse engineer more difficult. It is arguable that the reverse-engineering tasks for this type of encoding (identifying the encoding process and developing a decoder) are more difficult than for many types of standard cryptography.

With many types of standard cryptography, if the cryptographic algorithm is identified and the key found, it is fairly easy to write a decryptor using standard libraries. With custom encoding, attackers can create any encoding scheme they want, which may or may not use an explicit key. As you saw in the previous example, the key is effectively embedded (and obscured) within the code itself. Even if the attacker does use a key and the key is found, it is unlikely that a freely available library will be available to assist with the decryption.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset