In the previous chapter, you gathered lots of data. That data is likely in a variety of formats, including free-form text, comma-separated values (CSV), and XML. In this chapter, we show you how to parse and manipulate that data so you can extract key elements for analysis.
We introduce awk, join, sed, tail, and tr to prepare data for analysis.
awk is not just a command, but actually a programming language designed for processing text. Entire books are dedicated to this subject. awk will be explained in more detail throughout this book, but here we provide a brief example of its usage.
Take a look at the file awkusers.txt in Example 6-1.
Mike Jones
John Smith
Kathy Jones
Jane Kennedy
Tim Scott
You can use awk to print each line where the user's last name is Jones.
$ awk '$2 == "Jones" {print $0}' awkusers.txt
Mike Jones
Kathy Jones
awk will iterate through each line of the input file, reading each word (separated by whitespace by default) into fields. Field $0 represents the entire line, $1 the first word, $2 the second word, and so on.
An awk program consists of patterns and corresponding code to be executed when that pattern is matched.
In this example, there is only one pattern. We test $2 to see if that field is equal to Jones. If it is, awk runs the code in the braces, which in this case prints the entire line.
If we left off the explicit comparison and instead wrote awk '/Jones/ {print $0}', the string inside the slashes would be treated as a regular expression to match anywhere in the input line. The command would print all the names as before, but it would also find lines where Jones might be the first name or part of a longer name (such as "Jonestown").
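The difference is easy to demonstrate. The following is a minimal sketch; the /tmp file name and the extra "Dave Jonestown" line are our own additions for illustration:

```shell
# Hypothetical sample file; "Dave Jonestown" is added to show
# the difference between field comparison and regex matching.
printf '%s\n' 'Mike Jones' 'John Smith' 'Dave Jonestown' > /tmp/awkdemo.txt

# Exact field comparison: matches only when field 2 is exactly "Jones".
awk '$2 == "Jones" {print $0}' /tmp/awkdemo.txt > /tmp/exact.txt

# Regex match: matches "Jones" anywhere on the line, including "Jonestown".
awk '/Jones/ {print $0}' /tmp/awkdemo.txt > /tmp/regex.txt
```

The field comparison yields only "Mike Jones", while the regex version also picks up the "Jonestown" line.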
join combines the lines of two files that share a common field. In order for join to function properly, the input files must be sorted.
Consider the files in Examples 6-2 and 6-3.
1,jdoe
2,puser
3,jsmith
0745,file1.txt,1
0830,file4.txt,2
0830,file5.txt,3
Both files share a common field of data, which is the user ID. In accesstime.txt, the user ID is in the third column. In usernames.txt, the user ID is in the first column. You can merge these two files by using join as follows:
$ join -1 3 -2 1 -t, accesstime.txt usernames.txt
1,0745,file1.txt,jdoe
2,0830,file4.txt,puser
3,0830,file5.txt,jsmith
The -1 3 option tells join to use the third column in the first file (accesstime.txt), and -2 1 specifies the first column in the second file (usernames.txt) for use when merging the files. The -t, option specifies the comma character as the field delimiter.
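As a quick sanity check, the merge above can be reproduced end to end. The /tmp paths below are our own; the file contents come from Examples 6-2 and 6-3:

```shell
# Recreate the example files (contents from Examples 6-2 and 6-3).
printf '%s\n' '1,jdoe' '2,puser' '3,jsmith' > /tmp/usernames.txt
printf '%s\n' '0745,file1.txt,1' '0830,file4.txt,2' '0830,file5.txt,3' > /tmp/accesstime.txt

# Merge on column 3 of the first file and column 1 of the second.
join -1 3 -2 1 -t, /tmp/accesstime.txt /tmp/usernames.txt > /tmp/merged.txt
```

The join field (the user ID) appears first in each output line, followed by the remaining fields of the first file, then those of the second.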
sed allows you to perform edits, such as replacing characters, on a stream of data.
The sed command is powerful and can be used for a variety of functions. However, replacing characters or sequences of characters is one of the most common. Take a look at the file ips.txt in Example 6-4.
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
You can use sed to replace all instances of the 10.0.4.35 IP address with 10.0.4.27:
$ sed 's/10.0.4.35/10.0.4.27/g' ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.27,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
In this example, sed uses the following format, with each component separated by a forward slash:

s/<regular expression>/<replace with>/<flags>
The first part of the command (s) tells sed to substitute. The second part (10.0.4.35) is a regular expression pattern. The third part (10.0.4.27) is the value to use to replace the regex pattern matches. The fourth part is optional flags, which in this case (g, for global) tells sed to replace all instances on a line (not just the first) that match the regex pattern.
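To see exactly what the g flag changes, compare the same substitution with and without it. This is a minimal sketch with made-up input:

```shell
# Without g, sed replaces only the first match on each line.
printf 'aaa\n' | sed 's/a/b/'  > /tmp/first.txt

# With g, sed replaces every match on the line.
printf 'aaa\n' | sed 's/a/b/g' > /tmp/every.txt
```

The first command yields "baa"; the second yields "bbb".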
The tr command is used to translate or map from one character to another. It is also often used to delete unwanted or extraneous characters. It only reads from stdin and writes to stdout, so you typically see it with redirects for the input and output files.
You can translate all the backslashes into forward slashes, and all the colons to vertical bars, with the tr command:

tr '\\:' '/|' < infile.txt > outfile.txt
Say the contents of infile.txt look like this:
drive:path\name
c:\Users\Default\file.txt
Then, after running the tr command, outfile.txt would contain this:

drive|path/name
c|/Users/Default/file.txt
The characters from the first argument are mapped to the corresponding characters in the second argument. Two backslashes are needed to specify a single backslash character because the backslash has a special meaning to tr; it is used to indicate special characters such as newline (\n), return (\r), or tab (\t). You use the single quotes around the arguments to avoid any special interpretation by bash.
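tr also accepts character ranges, mapping each character in the first range to the character at the same position in the second. A small sketch (the /tmp file names and sample strings are our own):

```shell
# Map each lowercase letter to the corresponding uppercase letter.
printf '%s\n' 'jdoe' | tr 'a-z' 'A-Z' > /tmp/upper.txt

# The mapping from the text, applied to a single literal string:
# backslash becomes forward slash, colon becomes vertical bar.
printf '%s\n' 'c:\Users' | tr '\\:' '/|' > /tmp/mapped.txt
```

printf '%s\n' is used instead of echo so the backslash in the sample string is passed through literally.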
Files from Windows systems often come with both a carriage return and a line feed (CR & LF) character at the end of each line. Linux and macOS systems have only the newline character to end a line. If you transfer a file to Linux and want to get rid of those extra return characters, here is how you might do that with the tr command:
tr -d '\r' < fileWind.txt > fileFixed.txt
Conversely, you can convert Linux line endings to Windows line endings by using sed:
$ sed -i 's/$/\r/' fileLinux.txt
The -i option makes the changes in place and writes them back to the input file.
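Both conversions can be verified byte for byte. In this sketch the file names are our own, and GNU sed is assumed (it understands \r in the replacement text):

```shell
# Start with Linux line endings.
printf 'one\ntwo\n' > /tmp/fileLinux.txt

# Add a carriage return before each newline (GNU sed understands \r).
sed 's/$/\r/' /tmp/fileLinux.txt > /tmp/fileWind.txt

# Strip the carriage returns again; the round trip should be lossless.
tr -d '\r' < /tmp/fileWind.txt > /tmp/fileFixed.txt
```

After the sed step, the file contains exactly one carriage return per line; after the tr step, it is identical to the original.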
Many of the files you will collect and process are likely to contain text, which makes the ability to manipulate text from the command line a critical skill. Text files are often broken into fields by using a delimiter such as a space, tab, or comma. One of the more common formats is known as comma-separated values (CSV). As the name indicates, CSV files are delimited using commas, and fields may or may not be surrounded in double quotes ("). The first line of a CSV file is often the field headers. Example 6-5 shows a sample CSV file.
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4
To extract just the name from the file, you can use cut by specifying the field delimiter as a comma and the field number you would like returned:
$ cut -d',' -f1 csvex.txt
"name"
"John Smith"
"Jane Smith"
"Bill Jones"
Note that the field values are still enclosed in double quotation marks, which may not be desirable for certain applications. To remove the quotation marks, you can simply pipe the output into tr with its -d option:
$ cut -d',' -f1 csvex.txt | tr -d '"'
name
John Smith
Jane Smith
Bill Jones
You can further process the data by removing the field header via the tail command's -n option:
$ cut -d',' -f1 csvex.txt | tr -d '"' | tail -n +2
John Smith
Jane Smith
Bill Jones
The -n +2 option tells tail to output the contents of the file starting at line number 2, thus removing the field header.
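The same three-step pipeline can also be collapsed into a single awk program. Here is a sketch using a cut-down, hypothetical stand-in for csvex.txt:

```shell
# Two-line stand-in for csvex.txt.
printf '%s\n' '"name","username"' '"John Smith","jsmith"' > /tmp/csvmini.txt

# NR > 1 skips the header line, gsub strips the quotes from field 1.
awk -F',' 'NR > 1 { gsub(/"/, "", $1); print $1 }' /tmp/csvmini.txt > /tmp/names.txt
```

Whether one awk program or a cut | tr | tail pipeline is clearer is largely a matter of taste; the pipeline mirrors the three distinct transformations, while the awk version avoids spawning three processes.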
You can also give cut a list of fields to extract, such as -f1-3 to extract fields 1 through 3, or a list such as -f1,4 to extract fields 1 and 4.
Although you can use cut to extract entire columns of data, in some instances you will want to process the file and extract fields line by line; in this case, awk may be a better choice.
Let’s suppose you want to check each user’s password hash in csvex.txt against the dictionary file of known passwords, passwords.txt; see Examples 6-6 and 6-7.
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4
password,md5hash
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
welcome,40be4e59b9a2a2b5dffb918c0e86b3d7
ninja,3899dcbab79f92af727c2190bbd8abc5
abc123,e99a18c428cb38d5f260853678922e03
123456789,25f9e794323b453885f5181f1b624d0b
12345678,25d55ad283aa400af464c76d713c07ad
sunshine,0571749e2ac330a7455809c6b0e7af90
princess,8afa847f50a716e64932d995c8e7435a
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4
You can extract each user's hash from csvex.txt by using awk as follows:
$ awk -F "," '{print $4}' csvex.txt
"password hash"
5f4dcc3b5aa765d61d8327deb882cf99
e10adc3949ba59abbe56e057f20f883e
d8578edf8458ce06fbc5bb76a58c5ca4
By default, awk uses the space character as a field delimiter, so the -F option is used to identify a custom field delimiter (,) and then print out the fourth field ($4), which is the password hash. You can then use grep to take the output from awk one line at a time and search for it in the passwords.txt dictionary file, outputting any matches:
$ grep "$(awk -F "," '{print $4}' csvex.txt)" passwords.txt
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4
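An alternative with the same effect is to write the extracted hashes to a pattern file and hand it to grep with -f, which reads one pattern per line. The /tmp file names and the miniature stand-in data below are our own:

```shell
# Miniature stand-ins for csvex.txt and passwords.txt.
printf '%s\n' '"A user","au","555",deadbeef' > /tmp/users.csv
printf '%s\n' 'password,deadbeef' 'welcome,0badf00d' > /tmp/dict.txt

# Extract the hash column into a pattern file, then search with -f.
awk -F',' '{print $4}' /tmp/users.csv > /tmp/patterns.txt
grep -f /tmp/patterns.txt /tmp/dict.txt > /tmp/matches.txt
```

Using a pattern file avoids placing the entire hash list on the command line, which matters when the input grows beyond a handful of rows.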
If a file has fixed-width field sizes, you can use the cut command's -c option to extract data by character position. In csvex.txt, the (US 10-digit) phone number is an example of a fixed-width field. Take a look at this example:
$ cut -d',' -f3 csvex.txt | cut -c2-13 | tail -n +2
555-555-1212
555-555-1234
555-555-6789
Here you first use cut in delimited mode to extract the phone number at field 3. Because each phone number is the same number of characters, you can use the cut character-position option (-c) to extract the characters between the quotation marks. Finally, tail is used to remove the file header.
Extensible Markup Language (XML) allows you to arbitrarily create tags and elements that describe data. Example 6-8 presents an example XML document.
<book title="Cybersecurity Ops with bash" edition="1">
  <author>
    <firstName>Paul</firstName>
    <lastName>Troncone</lastName>
  </author>
  <author>
    <firstName>Carl</firstName>
    <lastName>Albing</lastName>
  </author>
</book>
In this example, <book title="Cybersecurity Ops with bash" edition="1"> is a start tag that contains two attributes, also known as name/value pairs; attribute values must always be quoted. <author> is a plain start tag, <firstName>Paul</firstName> is an element that has content, and </author> is an end tag.
For useful processing, you must be able to search through the XML and extract data from within the tags, which can be done using grep. Let's find all of the firstName elements. The -o option is used so only the text that matches the regex pattern will be returned, rather than the entire line:
$ grep -o '<firstName>.*</firstName>' book.xml
<firstName>Paul</firstName>
<firstName>Carl</firstName>
Note that the preceding regex finds the XML element only if the start and end tags are on the same line. To find the pattern across multiple lines, you need to make use of two special features. First, add the -z option to grep, which treats newlines like any ordinary character in its searching and adds a null value (ASCII 0) at the end of each string it finds. Then, add the -P option and (?s) to the regex pattern; (?s) is a Perl-specific pattern-match modifier that makes the . metacharacter also match the newline character. Here's an example with those two features:
$ grep -Pzo '(?s)<author>.*?</author>' book.xml
<author>
  <firstName>Paul</firstName>
  <lastName>Troncone</lastName>
</author><author>
  <firstName>Carl</firstName>
  <lastName>Albing</lastName>
</author>
The -P option is not available in all versions of grep, including those included with macOS.
To strip the XML start and end tags and extract the content, you can pipe your output into sed:
$ grep -Po '<firstName>.*?</firstName>' book.xml | sed 's/<[^>]*>//g'
Paul
Carl
The sed expression can be described as s/expr/other/ to replace (or substitute) an expression (expr) with something else (other). The expression can be literal characters or a more complex regex. If an expression has no "other" portion, such as s/expr//, then it replaces anything that matches the regular expression with nothing, essentially removing it. The regex pattern we use in the preceding example, the <[^>]*> expression, is a little confusing, so let's break it down:
<
    The pattern begins with a literal <.

[^>]*
    Zero or more (indicated by the *) characters from the set of characters inside the brackets. The first character inside the brackets is ^, which means "not" any of the remaining characters listed; here that is just the solitary > character, so [^>] matches any character that is not >.

>
    The pattern ends with a literal >.
This should match a single XML tag, from its opening less-than to its closing greater-than character, but not more than that.
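You can convince yourself of that on a single-line sample; the echoed string below is our own:

```shell
# The tag-stripping substitution applied to one element: both the
# start tag and the end tag match <[^>]*>, leaving only the content.
printf '%s\n' '<firstName>Paul</firstName>' | sed 's/<[^>]*>//g' > /tmp/stripped.txt
```

If the pattern were the greedy <.*> instead, it would match from the first < to the last > and delete the content along with the tags.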
JavaScript Object Notation (JSON) is another popular file format, particularly for exchanging data through application programming interfaces (APIs). JSON is a simple format that consists of objects, arrays, and name/value pairs. Example 6-9 shows a sample JSON file.
{
  "title": "Cybersecurity Ops with bash",
  "edition": 1,
  "authors": [
    {
      "firstName": "Paul",
      "lastName": "Troncone"
    },
    {
      "firstName": "Carl",
      "lastName": "Albing"
    }
  ]
}
For more information on the JSON format, visit the JSON web page.
When processing JSON, you are likely going to want to extract key/value pairs, which can be done using grep. To extract the firstName key/value pairs from book.json:
$ grep -o '"firstName": ".*"' book.json
"firstName": "Paul"
"firstName": "Carl"
Again, the -o option is used to return only the characters that match the pattern rather than the entire line of the file.
If you want to remove the key and display only the value, you can do so by piping the output into cut, extracting the second field, and removing the quotation marks with tr:
$ grep -o '"firstName": ".*"' book.json | cut -d " " -f2 | tr -d '"'
Paul
Carl
We will perform more-advanced processing of JSON in Chapter 11.
Data is often collected from a variety of sources, and in a variety of files and formats. Before you can analyze the data, you must get it all into the same place and in a format that is conducive to analysis.
Suppose you want to search a treasure trove of data files for any system named ProductionWebServer. Recall that in previous scripts we wrapped our collected data in XML tags with the following format: <systeminfo host="">. During collection, we also named our files by using the hostname. You can now use either of those attributes to find and aggregate the data into a single location:
find /data -type f -exec grep '{}' -e 'ProductionWebServer' \; -exec cat '{}' >> ProductionWebServerAgg.txt \;
The command find /data -type f lists all of the files in the /data directory and its subdirectories. For each file found, it runs grep, looking for the string ProductionWebServer. If the string is found, the file is appended (>>) to the file ProductionWebServerAgg.txt. Replace the cat command with cp and a directory location if you would rather copy all of the files to a single directory than combine them into a single file.
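A shorter variant with the same effect uses grep -l, which prints only the names of matching files, and pipes them to xargs cat. This sketch builds its own throwaway data directory rather than touching /data:

```shell
# Throwaway data directory with one matching and one non-matching file.
mkdir -p /tmp/datademo
printf 'host: ProductionWebServer\n' > /tmp/datademo/a.txt
printf 'host: TestServer\n'          > /tmp/datademo/b.txt

# -r searches recursively, -l lists matching file names only.
grep -rl 'ProductionWebServer' /tmp/datademo | xargs cat > /tmp/agg.txt
```

This spawns far fewer processes than running grep and cat once per file, though file names containing whitespace would need xargs -0 with grep -Z.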
You can also use the join command to take data that is spread across two files and aggregate it into one. Take a look at the two files in Examples 6-10 and 6-11.
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
user,ip
jdoe,10.0.4.2
jsmith,10.0.4.35
msmith,10.0.4.107
tjones,10.0.4.145
The files share a common column of data, which is the IP addresses. Therefore, the files can be merged using join:
$ join -t, -2 2 ips.txt user.txt
ip,OS,user
10.0.4.2,Windows 8,jdoe
10.0.4.35,Ubuntu 16,jsmith
10.0.4.107,macOS,msmith
10.0.4.145,macOS,tjones
The -t, option tells join that the columns are delimited using a comma; by default, it uses a space character.
The -2 2 option tells join to use the second column of data in the second file (user.txt) as the key to perform the merge. By default, join uses the first field as the key, which is appropriate for the first file (ips.txt). If you needed to join using a different field in ips.txt, you would add the option -1 n, where n is replaced by the appropriate column number.
To use join, both files must already be sorted by the column you will use to perform the merge. To do this, you can use the sort command, which is covered in Chapter 7.
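A minimal sketch of that prerequisite, with made-up file names: sort each file on the join field first, then merge:

```shell
# Two unsorted files keyed on column 1.
printf '%s\n' 'zeta,3' 'alpha,1' > /tmp/left.txt
printf '%s\n' 'zeta,z' 'alpha,a' > /tmp/right.txt

# Sort each file on its first field, then join on that field
# (join uses the first field as the key by default).
sort -t, -k1,1 /tmp/left.txt  > /tmp/left.sorted
sort -t, -k1,1 /tmp/right.txt > /tmp/right.sorted
join -t, /tmp/left.sorted /tmp/right.sorted > /tmp/joined.txt
```

Note that join expects the files to be in the sort order sort produces by default (lexical, not numeric), so sort and join should agree on how the key column is ordered.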
In this chapter, we explored ways to process common data formats, including delimited, positional, JSON, and XML. The vast majority of data you collect and process will be in one of those formats.
In the next chapter, we look at how data can be analyzed and transformed into information that will provide insights into system status and drive decision making.
Given the following file tasks.txt, use the cut command to extract columns 1 (Image Name), 2 (PID), and 5 (Mem Usage).
Image Name;PID;Session Name;Session#;Mem Usage
System Idle Process;0;Services;0;4 K
System;4;Services;0;2,140 K
smss.exe;340;Services;0;1,060 K
csrss.exe;528;Services;0;4,756 K
Given the file procowner.txt, use the join command to merge the file with tasks.txt from the preceding exercise.
Process Owner;PID
jdoe;0
tjones;4
jsmith;340
msmith;528
Use the tr command to replace all of the semicolon characters in tasks.txt with the tab character and print the file to the screen.
Write a command that extracts the first and last names of all authors in book.json.
Visit the Cybersecurity Ops website for additional resources and the answers to these questions.