Detecting the presence of malicious code is one of the most fundamental and challenging activities in cybersecurity operations. You have two main options when analyzing a piece of code: static and dynamic. During static analysis you analyze the code itself to determine whether indicators of malicious activity exist. During dynamic analysis, you execute the code and then look at its behavior and impact on a system to determine its functionality. In this chapter, we focus on static analysis techniques.
When dealing with potentially malicious files, be sure to perform any analysis on a system that is not connected to a network and does not contain any sensitive information. Afterward, assume that the system has been infected, and completely wipe and reimage the system before introducing it back into your network.
In this chapter, we introduce curl to interact with websites, vi to edit files, and xxd to perform base conversions and file analysis.
The curl command can be used to transfer data over a network between a client and a server. It supports multiple protocols, including HTTP, HTTPS, FTP, SFTP, and Telnet. curl is extremely versatile. The command options presented next represent only a small fraction of the capabilities available. For more information, be sure to check out the Linux man page for curl.
vi is not your typical command, but rather a full-featured command-line text editor. It is highly capable and even supports plug-ins.
To open the file somefile.txt in vi:
vi somefile.txt
When you are in the vi environment, hit the Esc key and then type i to enter Insert mode so you can edit the text. To exit Insert mode, press Esc.
To enter Command mode, hit the Esc key. You can enter one of the commands in Table 11-1 and press Enter for it to take effect.
Command | Purpose |
---|---|
b | Back one word |
cc | Replace current line |
cw | Replace current word |
dw | Delete current word |
dd | Delete current line |
:w | Write/save the file |
:w filename | Write/save the file as filename |
:q! | Quit without saving |
ZZ | Save and quit |
:set number | Show line numbers |
/ | Search forward |
? | Search backward |
n | Find next occurrence |
A full overview of vi is beyond the scope of this book. For more information, you can visit the Vim editor page.
The xxd command displays a file to the screen in binary or hexadecimal format.
-b
Display the file using binary rather than hexadecimal output
-l n
Print n bytes
-s n
Start printing at byte position n
To display somefile.txt, start at byte offset 35 and print the next 50 bytes:
xxd -s 35 -l 50 somefile.txt
The details of how to reverse engineer a binary are beyond the scope of this book. However, we do cover how the standard command line can be used to enable your reverse-engineering efforts. This is not meant to be a replacement for reverse-engineering tools like IDA Pro or OllyDbg; rather, it is meant to provide techniques that can be used to augment those tools or provide you with some capability if they are not available.
For detailed information on malware analysis, see Practical Malware Analysis by Michael Sikorski and Andrew Honig (No Starch Press). For more information on IDA Pro, see The IDA Pro Book by Chris Eagle (No Starch Press).
When analyzing files, it is critical to be able to translate easily between decimal, hexadecimal, and ASCII. Thankfully, this can easily be done on the command line. Take the starting hexadecimal value 0x41. You can use printf to convert it to decimal by using the format string "%d":
$ printf "%d" 0x41
65
To convert the decimal 65 back to hexadecimal, replace the format string with %x:
$ printf "%x" 65
41
To convert from ASCII to hexadecimal, you can pipe the character into the xxd command from printf:
$ printf 'A' | xxd
00000000: 41
To convert from hexadecimal to ASCII, use the xxd command’s -r option:
$ printf 0x41 | xxd -r
A
To convert from ASCII to binary, you can pipe the character into xxd and use the -b option:
$ printf 'A' | xxd -b
00000000: 01000001
The printf command is purposely used in the preceding examples rather than echo. That is because the echo command automatically appends a line feed that adds an extraneous character to the output. This can be seen here:
$ echo 'A' | xxd
00000000: 410a
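The conversions above can be wrapped in small shell functions for reuse. The following is a minimal sketch: the function names (hex2dec, dec2hex, char2hex, hex2char) are hypothetical helpers rather than standard commands, and they rely only on printf, so they should also work on systems where xxd is not installed.

```shell
# Hypothetical helper names; each one relies only on printf.

# hexadecimal (with or without a 0x prefix) -> decimal
hex2dec() { printf '%d\n' "0x${1#0x}"; }

# decimal -> hexadecimal
dec2hex() { printf '%x\n' "$1"; }

# single ASCII character -> hexadecimal; the leading quote tells printf
# to use the numeric code of the character that follows it
char2hex() { printf '%x\n' "'$1"; }

# hexadecimal -> ASCII character, by way of an octal escape
hex2char() { printf "\\$(printf '%03o' "0x${1#0x}")\n"; }
```

For example, hex2dec 0x41 prints 65, and hex2char 41 prints A.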
Next, let’s look further at the xxd command and how it can be used to analyze a file such as an executable.
The executable helloworld will be used to explore the functionality of xxd. The source code is shown in Example 11-1. The file helloworld was compiled for Linux into Executable and Linkable Format (ELF) by using the GNU Compiler Collection (GCC).
#include <stdio.h>

int main()
{
    printf("Hello World!\n");
    return 0;
}
The xxd command can be used to examine any part of the executable. As an example, you can look at the file’s magic number, which begins at position 0x00 and is 4 bytes in size. To do that, use -s for the starting position (in decimal), and -l for the number of bytes (in decimal) to return. The starting offset and length can also be specified in hexadecimal by prepending 0x to the number (e.g., 0x2A). As expected, the ELF magic number is seen.
$ xxd -s 0 -l 4 helloworld
00000000: 7f45 4c46  .ELF
The fifth byte of the file will tell you whether the executable is 32-bit (0x01) or 64-bit (0x02) architecture. In this case, it is a 64-bit executable:
$ xxd -s 4 -l 1 helloworld
00000004: 02
The sixth byte tells you whether the file is little-endian (0x01) or big-endian (0x02). In this case, it is little-endian:
$ xxd -s 5 -l 1 helloworld
00000005: 01
The format and endianness are critical pieces of information for analyzing the rest of the file. For example, the 8 bytes starting at offset 0x20 of a 64-bit ELF file specify the offset of the program header:
$ xxd -s 0x20 -l 8 helloworld
00000020: 4000 0000 0000 0000
You know that the offset of the program header is 0x40 because the file is little-endian. That offset can then be used to display the program header, which should be 0x38 bytes in length for a 64-bit ELF file:
$ xxd -s 0x40 -l 0x38 helloworld
00000040: 0600 0000 0500 0000 4000 0000 0000 0000  ........@.......
00000050: 4000 4000 0000 0000 4000 4000 0000 0000  @.@.....@.@.....
00000060: f801 0000 0000 0000 f801 0000 0000 0000  ................
00000070: 0800 0000 0000 0000                      ........
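The header checks above can be collected into one function. This is a sketch under stated assumptions: hexbytes, le2dec, and elfinfo are hypothetical names, the fallback to od is an addition for systems without xxd, and only the 64-bit little-endian layout described above is decoded for the program header offset.

```shell
# Hypothetical helpers for reading the ELF header fields discussed above.

# hexbytes FILE OFFSET LENGTH: print LENGTH bytes at OFFSET as one hex string
hexbytes() {
    if command -v xxd >/dev/null 2>&1; then
        xxd -p -s "$2" -l "$3" "$1" | tr -d '\n'
    else
        # Fallback (an assumption, not from the text): od is part of POSIX
        od -An -v -tx1 -j "$2" -N "$3" "$1" | tr -d ' \n'
    fi
}

# le2dec HEX: treat a hex string as a little-endian integer, print it in decimal
le2dec() {
    # Reverse the byte pairs so the most significant byte comes first
    be=$(printf '%s\n' "$1" |
         awk '{ s = ""; for (i = 1; i < length($0); i += 2) s = substr($0, i, 2) s; print s }')
    printf '%d\n' "0x$be"
}

# elfinfo FILE: report class, byte order, and program header offset
elfinfo() {
    if [ "$(hexbytes "$1" 0 4)" != "7f454c46" ]; then
        echo "not an ELF file"
        return 1
    fi
    case "$(hexbytes "$1" 4 1)" in
        01) class="32-bit" ;;
        02) class="64-bit" ;;
        *)  class="unknown" ;;
    esac
    case "$(hexbytes "$1" 5 1)" in
        01) order="little-endian" ;;
        02) order="big-endian" ;;
        *)  order="unknown" ;;
    esac
    echo "class: $class"
    echo "byte order: $order"
    # Offset 32 (0x20), 8 bytes: program header offset in a 64-bit ELF file
    if [ "$class" = "64-bit" ] && [ "$order" = "little-endian" ]; then
        echo "program header offset: $(le2dec "$(hexbytes "$1" 32 8)")"
    fi
}
```

Running elfinfo helloworld on the example executable should report a 64-bit, little-endian file with a program header offset of 64 (0x40), matching the manual steps.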
For more information on the Linux ELF file format, see the Tool Interface Standard (TIS) Executable and Linking format (ELF) Specification.
For more information on the Windows executable file format, see the Microsoft portable executable file format documentation.
Sometimes you may need to display and edit a file in hexadecimal. You can combine xxd with the vi editor to do just that. First, open the file you want to edit as normal with vi:
vi helloworld
After the file is open, enter the vi command:
:%!xxd
In vi, the % symbol represents the address range of the entire file, and the ! symbol can be used to execute a shell command, replacing the original lines with the output of the command. Combining the two as shown in the preceding example will run the current file through xxd (or any shell command) and leave the results in vi:
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0200 3e00 0100 0000 3004 4000 0000 0000  ..>.....0.@.....
00000020: 4000 0000 0000 0000 efbf bd19 0000 0000  @...............
00000030: 0000 0000 0000 4000 3800 0900 4000 1f00  ......@.8...@...
00000040: 1c00 0600 0000 0500 0000 4000 0000 0000  ..........@.....
. . .
After you have made your edits, you can convert the file back to normal by using the vi command :%!xxd -r. Write out these changes (ZZ) when you are done. Of course, you can just quit without writing (:q!) at any time, and the file will be left unchanged.
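For a change of only a byte or two, a non-interactive alternative is to overwrite the bytes in place with dd. Note that this swaps in a different technique from the vi and xxd round trip described above. The function name patch_byte is hypothetical, and because the edit is destructive, it is best tried on a copy of the file.

```shell
# patch_byte FILE OFFSET HEXBYTE  (hypothetical helper)
# Overwrite one byte of FILE at OFFSET (decimal or 0x-prefixed hex) in place.
patch_byte() {
    # Turn the hex value into an octal escape that printf emits as a raw byte
    octal=$(printf '%03o' "0x${3#0x}")
    # seek skips to the target offset; conv=notrunc leaves the rest of the file alone
    printf "\\$octal" | dd of="$1" bs=1 seek="$(($2))" conv=notrunc 2>/dev/null
}
```

For example, patch_byte copy_of_helloworld 0x05 0x02 would overwrite the sixth byte (the endianness flag examined earlier) in a file named copy_of_helloworld.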
One of the most basic approaches to analyzing an unknown executable is to extract any ASCII strings contained in the file. This can often yield information such as filenames or paths, IP addresses, author names, compiler information, URLs, and other information that might provide valuable insight into the program’s functionality or origin.
A command called strings can extract ASCII data for us, but it is not available by default on many distributions, including Git Bash. To solve this more universally, we can use our good friend egrep:
egrep -a -o '[[:print:]]{2,}' somefile.exe
This regular expression searches the specified file for two or more (that’s the {2,} construct) printable characters in a row that appear as their own contiguous word. The -a option processes the binary executable as if it were a text file. The -o option will output only the matching text rather than the entire line, thereby eliminating any of the nonprintable binary data. The search is for two or more characters because single characters are quite likely in any binary byte and thus are not significant.
To make the output even cleaner, you can pipe the results into sort with the -u option to remove any duplicates:
egrep -a -o '[[:print:]]{2,}' somefile.exe | sort -u
It may also be useful to sort the strings from longest to shortest, as the longest strings are more likely to contain interesting information. The sort command does not provide a way to do this natively, so you can use awk to augment it:
egrep -a -o '[[:print:]]{2,}' somefile.exe | sort -u | awk '{print length(), $0}' | sort -rn
Here, you first remove duplicates with sort -u and then send the output to awk to have it prepend the length of each string on each line. This output is then sorted in reverse numerical order. The duplicates are removed in a separate step because combining -u with -n in a single sort treats all lines that begin with the same number as equal, which would discard every string but one of each length.
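The string-extraction pipeline can be wrapped in a function with an adjustable minimum length. This is a sketch, not a replacement for strings: extract_strings is a hypothetical name, grep -E is used in place of egrep, and setting LC_ALL=C is an added assumption so that [[:print:]] matches only single-byte ASCII printable characters.

```shell
# extract_strings FILE [MINLEN]  (hypothetical helper)
# List unique printable strings of at least MINLEN characters (default 2),
# longest first, each prefixed with its length.
extract_strings() {
    LC_ALL=C grep -a -o -E "[[:print:]]{${2:-2},}" "$1" |
        sort -u |                          # drop duplicate strings first
        awk '{ print length($0), $0 }' |   # prepend the length of each string
        sort -rn                           # longest strings first
}
```

For example, extract_strings somefile.exe 6 lists only strings of six or more characters, which tends to cut down on noise.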
The approach of extracting strings from an executable does have its limitations. If a string is not contiguous, meaning that nonprintable characters separate one or more characters, the string will print out as individual characters rather than the entire string. This is sometimes just an artifact of how an executable is constructed, but it can also be done intentionally by malware developers to help avoid detection. Malware developers may also use encoding or encryption to similarly mask the existence of strings in a binary file.
VirusTotal is a commercial online tool used to upload files and run them against a battery of antivirus engines and other static analysis tools to determine whether they are malicious. VirusTotal can also provide information on how often a particular file has been seen in the wild, or if anyone else has identified it as malicious; this is known as a file’s reputation. If a file has never been seen before in the wild, and therefore has a low reputation, it is more likely to be malicious.
Be cautious when uploading files to VirusTotal and similar services. Those services maintain databases of all files uploaded, so files with potentially sensitive or privileged information should never be uploaded. Additionally, in certain circumstances, uploading malware files to public repositories could alert an adversary that you have identified their presence on your system.
VirusTotal provides an API that can be used to interface with the service by using curl. To use the API you must have a unique API key. To obtain a key, go to the VirusTotal website and request an account. After you create an account, log in and go to your account settings to view your API key. A real API key will not be used for the examples in this book due to security concerns; instead, we will use the text replacewithapikey anywhere your API key should be substituted.
The full VirusTotal API can be found in the VirusTotal documentation.
VirusTotal uses a Representational State Transfer (REST) request to interact with the service over the internet. Table 11-2 lists some of the REST URLs for VirusTotal’s basic file-scanning functionality.
Description | Request URL | Parameters |
---|---|---|
Retrieve a scan report | https://www.virustotal.com/vtapi/v2/file/report | apikey, resource, allinfo |
Upload and scan a file | https://www.virustotal.com/vtapi/v2/file/scan | apikey, file |
VirusTotal keeps a history of all files that have been previously uploaded and analyzed. You can search the database by using a hash of your suspect file to determine whether a report already exists; this saves you from having to actually upload the file. The limitation with this method is that if no one else has ever uploaded the same file to VirusTotal, no report will exist.
VirusTotal accepts MD5, SHA-1, and SHA-256 hash formats, which you can generate using md5sum, sha1sum, and sha256sum, respectively. Once you have generated the hash of your file, it can be sent to VirusTotal by using curl and a REST request.
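Each of the hashing commands prints the hash followed by the filename, so a lookup script usually wants only the first field. A minimal sketch, assuming the coreutils md5sum, sha1sum, and sha256sum commands are installed; file_hash is a hypothetical name:

```shell
# file_hash ALGO FILE  (hypothetical helper)
# Print only the hash of FILE; ALGO is md5, sha1, or sha256.
file_hash() {
    case "$1" in
        md5|sha1|sha256)
            # The *sum commands print "<hash>  <filename>"; keep the hash only
            "${1}sum" "$2" | cut -d ' ' -f 1 ;;
        *)
            echo "unsupported algorithm: $1" >&2
            return 1 ;;
    esac
}
```

For example, file_hash sha256 somefile.exe prints the SHA-256 hash of somefile.exe on a line by itself, ready to be passed to VirusTotal.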
The REST request is in the form of a URL that begins with https://www.virustotal.com/vtapi/v2/file/report and has the following three primary parameters:
apikey
Your API key obtained from VirusTotal
resource
The MD5, SHA-1, or SHA-256 hash of the file
allinfo
If true, will return additional information from other tools
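Assembling the request URL by hand is error-prone, because a stray space or a missing quote breaks the request. The following sketch builds the URL from its three parameters; vt_report_url is a hypothetical name, and the curl call is left as a comment because it needs network access and a real API key.

```shell
# vt_report_url APIKEY HASH [ALLINFO]  (hypothetical helper)
# Print the file-report request URL; ALLINFO defaults to false.
vt_report_url() {
    printf '%s?apikey=%s&resource=%s&allinfo=%s\n' \
        'https://www.virustotal.com/vtapi/v2/file/report' "$1" "$2" "${3:-false}"
}

# Usage (not run here):
#   curl -s "$(vt_report_url replacewithapikey "$hash")" > report.txt
```

Quoting the command substitution keeps the shell from interpreting the ampersands in the URL.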
As an example, we will use a sample of the WannaCry malware, which has an MD5 hash of db349b97c37d22f5ea1d1841e3c89eb4:
curl 'https://www.virustotal.com/vtapi/v2/file/report?apikey=replacewithapikey&resource=db349b97c37d22f5ea1d1841e3c89eb4&allinfo=false' > WannaCry_VirusTotal.txt
The resulting JSON response contains a list of all antivirus engines the file was run against and their determination of whether the file was detected as malicious. Here, we can see the responses from the first two engines, Bkav and MicroWorld-eScan:
{
  "scans": {
    "Bkav": {
      "detected": true,
      "version": "1.3.0.9466",
      "result": "W32.WannaCrypLTE.Trojan",
      "update": "20180712"
    },
    "MicroWorld-eScan": {
      "detected": true,
      "version": "14.0.297.0",
      "result": "Trojan.Ransom.WannaCryptor.H",
      "update": "20180712"
    }
    .
    .
    .
Although JSON is great for structuring data, it is a little difficult for humans to read. You can extract some of the important information, such as whether the file was detected as malicious, by using grep:
$ grep -Po '{"detected": true.*?"result":.*?,' WannaCry_VirusTotal.txt
{"detected": true, "version": "1.3.0.9466", "result": "W32.WannaCrypLTE.Trojan",
{"detected": true, "version": "14.0.297.0", "result": "Trojan.Ransom.WannaCryptor.H",
{"detected": true, "version": "14.00", "result": "Trojan.Mauvaise.SL1",
The -P option for grep is used to enable the Perl engine, which allows you to use the pattern .*? as a lazy quantifier. This lazy quantifier matches only the minimum number of characters needed to satisfy the entire regular expression, thus allowing you to extract the response from each of the antivirus engines individually rather than in a large clump.
Although this method works, a much better solution can be created using a bash script, as shown in Example 11-2.
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# vtjson.sh
#
# Description:
# Search a JSON file for VirusTotal malware hits
#
# Usage:
# vtjson.sh [<json file>]
#   <json file> File containing results from VirusTotal
#               default: Calc_VirusTotal.txt
#

# three groups: scanner name, detected value, result string
RE='^.(.*)...\{.*detect..(.*),..vers.*result....(.*).,..update.*$'

FN="${1:-Calc_VirusTotal.txt}"
# put each scanner entry on its own line, then test each line against RE
sed -e 's/{"scans": {/&\n /' -e 's/},/&\n/g' "$FN" |
while read ALINE
do
    if [[ $ALINE =~ $RE ]]
    then
        VIRUS="${BASH_REMATCH[1]}"
        FOUND="${BASH_REMATCH[2]}"
        RESLT="${BASH_REMATCH[3]}"

        if [[ $FOUND =~ .*true.* ]]
        then
            echo $VIRUS "- result:" $RESLT
        fi
    fi
done
This complex regular expression (or RE) is looking for lines that contain detect, result, and update in that sequence on a line. More importantly, the RE is also locating three substrings within any line that matches those three keywords. The substrings are delineated by the parentheses; the parentheses are not to be found in the strings that we’re searching, but rather are syntax of the RE to indicate a grouping.
Let’s look at the first group in this example. The RE is enclosed in single quotes. There may be lots of special characters, but we don’t want the shell to interpret them as special shell characters; we want them passed through literally to the regex processor. The next character, the ^, says to anchor this search to the beginning of the line. The next character, the ., matches any character in the input line. Then comes a group of any character, the . again, repeated any number of times, indicated by the *.
So how many characters will fill in that first group? We need to keep looking along the RE to see what else has to match. What has to come after the group is three characters followed by a left brace. So we can now describe that first grouping as all the characters beginning at the second character of the line, up to, but not including, the three characters before the left brace.
It’s similar with the other groupings; they are constrained in their location by the dots and keywords. Yes, this does make for a rather rigid format, but in this case we are dealing with a rather rigid (predictable) format. This script could have been written to handle a more flexible input format. See the exercises at the end of the chapter.
The sed command is preparing our input for easier processing. It puts the initial JSON keyword scans and its associated punctuation on a line by itself. It then also puts a newline after each right brace that is followed by a comma. In both edit expressions, the ampersand on the righthand side of a substitution represents whatever was matched on the left side. For example, in the second substitution, the ampersand is shorthand for a right brace and comma.
Here is where the regular expression is put into use. Be sure not to put the $RE inside quotes, or it will match for those special characters as literals. To get the regular expression behavior, put no quotes around it.
If any parentheses are used in the regular expression, they delineate a substring that can be retrieved from the shell array variable BASH_REMATCH. Index 1 holds the first substring, etc.
This is another use of the regular expression matching. We are looking for the word true anywhere in the line. This makes assumptions about our input data—that the word doesn’t appear in any other field than the one we want. We could have made it more specific (locating it near the word detected, for example), but this is much more readable and will work as long as the four letters t-r-u-e don’t appear in sequence in any other field.
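To see what each group captures, it can help to run the regular expression against a single line. The sketch below does that with one scanner entry copied from the response shown earlier; show_groups is a hypothetical name, and the left brace in the RE is written as \{ so that it is matched literally.

```shell
#!/bin/bash
# Requires bash: [[ =~ ]] and BASH_REMATCH are not available in plain sh.

RE='^.(.*)...\{.*detect..(.*),..vers.*result....(.*).,..update.*$'

# show_groups LINE  (hypothetical helper): print what each group captured
show_groups() {
    if [[ $1 =~ $RE ]]
    then
        echo "group 1: ${BASH_REMATCH[1]}"
        echo "group 2: ${BASH_REMATCH[2]}"
        echo "group 3: ${BASH_REMATCH[3]}"
    else
        echo "no match"
    fi
}

# One scanner entry, as it looks after the sed preprocessing
show_groups '"Bkav": {"detected": true, "version": "1.3.0.9466", "result": "W32.WannaCrypLTE.Trojan", "update": "20180712"},'
```

Group 1 holds the scanner name, group 2 holds the text between detected and version (including the word true), and group 3 holds the result string without its quotes. That is why the script only needs to check that group 2 contains true.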
You don’t necessarily need to use regular expressions to solve this problem. Here is a solution using awk. Now awk can make powerful use of regular expressions, but you don’t need them here because of another powerful feature of awk: the parsing of the input into fields. Example 11-3 shows the code.
# Cybersecurity Ops with bash
# vtjson.awk
#
# Description:
# Search a JSON file for VirusTotal malware hits
#
# Usage:
# vtjson.awk <json file>
#   <json file> File containing results from VirusTotal
#

FN="${1:-Calc_VirusTotal.txt}"
sed -e 's/{"scans": {/&\n /' -e 's/},/&\n/g' "$FN" |
awk '
NF == 9 {
    COMMA=","
    QUOTE="\""
    if ( $3 == "true" COMMA ) {
        VIRUS=$1
        gsub(QUOTE, "", VIRUS)

        RESLT=$7
        gsub(QUOTE, "", RESLT)
        gsub(COMMA, "", RESLT)

        print VIRUS, "- result:", RESLT
    }
}'
We begin with the same preprocessing of the input as we did in the previous script. This time, we pipe the results into awk.
Only input lines with nine fields will execute the code inside these braces.
We set up variables to hold these string constants. Note that we can’t use single quotes around the one double-quote character. Why? Because the entire awk script is being protected (from the shell interpreting special characters) by being enclosed in single quotes. (Look back three lines, and at the end of this script.) Instead, we “escape” the double quote by preceding it with a backslash.
This compares the third field of the input line to the string "true," because in awk, juxtaposition of strings implies concatenation. We don’t use a plus sign to “add” the two strings as we do in some languages; we just put them side by side.
As with the $3 used in the if clause, the $1 here refers to a field number of the input line: the first word, if you will, of the input. It is not a shell variable referring to a script parameter. Remember the single quotes that encase this awk script.
gsub is an awk function that does a global substitution. It replaces all occurrences of the first argument with the second argument when searching through the third argument. Since the second argument is the empty string, the net result is that it removes all quote characters from the string in the variable VIRUS (which was assigned the value of the first field of the input line).
The rest of the script is much the same, doing those substitutions and then printing the results. Remember, too, that awk keeps reading stdin and running through the code once for each line of input, until the end of the input.
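The field numbers used in the script are easier to follow once you see how awk splits a line. The sketch below numbers each whitespace-separated field of one scanner entry copied from the response shown earlier; show_fields is a hypothetical name.

```shell
# show_fields  (hypothetical helper): number each whitespace-separated field
# of every input line, the way awk sees them as $1, $2, and so on
show_fields() {
    awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# One scanner entry, as it looks after the sed preprocessing
printf '%s\n' '"Bkav": {"detected": true, "version": "1.3.0.9466", "result": "W32.WannaCrypLTE.Trojan", "update": "20180712"},' | show_fields
```

The entry splits into nine fields, which is why the script tests NF == 9. Field 3 is true, with its trailing comma, and field 7 is the quoted result string, which explains the comparison against "true" COMMA and the gsub calls that strip quotes and commas.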
You can upload new files to VirusTotal to be analyzed if information on them does not already exist in the database. To do that, you need to use an HTTP POST request to the URL https://www.virustotal.com/vtapi/v2/file/scan. You must also provide your API key and a path to the file to upload. The following is an example using the Windows calc.exe file that can typically be found in the c:\Windows\System32 directory:
curl --request POST --url 'https://www.virustotal.com/vtapi/v2/file/scan' --form 'apikey=replacewithapikey' --form 'file=@/c/Windows/System32/calc.exe'
When uploading a file, you do not receive the results immediately. What is returned is a JSON object, such as the following, that contains metadata on the file that can be used to later retrieve a report using the scan ID or one of the hash values:
{
  "scan_id": "5543a258a819524b477dac619efa82b7f42822e3f446c9709fadc25fdff94226-1...",
  "sha1": "7ffebfee4b3c05a0a8731e859bf20ebb0b98b5fa",
  "resource": "5543a258a819524b477dac619efa82b7f42822e3f446c9709fadc25fdff94226",
  "response_code": 1,
  "sha256": "5543a258a819524b477dac619efa82b7f42822e3f446c9709fadc25fdff94226",
  "permalink": "https://www.virustotal.com/file/5543a258a819524b477dac619efa82b7...",
  "md5": "d82c445e3d484f31cd2638a4338e5fd9",
  "verbose_msg": "Scan request successfully queued, come back later for the report"
}
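To retrieve the report later, you need to pull the scan ID or one of the hashes out of this response. The following is a rough sketch using sed; json_field is a hypothetical name, and it only handles string values in a flat response like the one above, so it is not a general JSON parser.

```shell
# json_field KEY  (hypothetical helper)
# Read a response on stdin and print the string value stored under KEY.
# Only handles simple "key": "value" pairs, not numbers or nested objects.
json_field() {
    sed -n 's/.*"'"$1"'": *"\([^"]*\)".*/\1/p'
}
```

For example, if the upload response was saved to a file named upload_response.txt (a hypothetical filename), then json_field sha256 < upload_response.txt prints the SHA-256 hash, which can be used as the resource parameter of a report request.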
VirusTotal also has features to perform scans on a particular URL, domain, or IP address. All of the API calls are similar in that they make an HTTP GET request to the corresponding URL listed in Table 11-3 with the parameters set appropriately.
Description | Request URL | Parameters |
---|---|---|
URL report | https://www.virustotal.com/vtapi/v2/url/report | apikey, resource, allinfo, scan |
Domain report | https://www.virustotal.com/vtapi/v2/domain/report | apikey, domain |
IP report | https://www.virustotal.com/vtapi/v2/ip-address/report | apikey, ip |
Here is an example of requesting a scan report on a URL:
curl 'https://www.virustotal.com/vtapi/v2/url/report?apikey=replacewithapikey&resource=www.oreilly.com&allinfo=false&scan=1'
The parameter scan=1 will automatically submit the URL for analysis if it does not already exist in the database.
The command line alone cannot provide the same level of capability as full-fledged reverse-engineering tools, but it can be quite powerful for inspecting an executable or file. Remember to analyze suspected malware only on systems that are disconnected from the network, and be cognizant of confidentiality issues that may arise if you upload files to VirusTotal or other similar services.
In the next chapter, we look at how to improve data visualization post gathering and analysis.
Create a regular expression to search a binary for single printable characters separated by single nonprintable characters. For example, p.a.s.s.w.o.r.d, where . represents a nonprintable character.
Search a binary file for instances of a single printable character. Rather than printing the ones that you find, print all the ones that you don’t find. For a slightly simpler exercise, consider only the alphanumeric characters rather than all printable characters.
Write a script to interact with the VirusTotal API via a single command. Use the options -h to check a hash, -f to upload a file, and -u to check a URL. For example:
$ ./vt.sh -h db349b97c37d22f5ea1d1841e3c89eb4
Detected: W32.WannaCrypLTE.Trojan
Visit the Cybersecurity Ops website for additional resources and the answers to these questions.