Chapter 6. Data Processing

In the previous chapter, you gathered lots of data. That data is likely in a variety of formats, including free-form text, comma-separated values (CSV), and XML. In this chapter, we show you how to parse and manipulate that data so you can extract key elements for analysis.

Commands in Use

We introduce awk, join, sed, tail, and tr to prepare data for analysis.

awk

awk is not just a command, but actually a programming language designed for processing text. Entire books are dedicated to this subject. awk will be explained in more detail throughout this book, but here we provide a brief example of its usage.

Common command options

-f

Read in the awk program from a specified file

Command example

Take a look at the file awkusers.txt in Example 6-1.

Example 6-1. awkusers.txt
Mike Jones
John Smith
Kathy Jones
Jane Kennedy
Tim Scott

You can use awk to print each line where the user’s last name is Jones.

$ awk '$2 == "Jones" {print $0}' awkusers.txt

Mike Jones
Kathy Jones

awk will iterate through each line of the input file, reading in each word (separated by whitespace by default) into fields. Field $0 represents the entire line—$1 the first word, $2 the second word, etc. An awk program consists of patterns and corresponding code to be executed when that pattern is matched. In this example, there is only one pattern. We test $2 to see if that field is equal to Jones. If it is, awk will run the code in the braces which, in this case, will print the entire line.
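Because each field is individually addressable, you can print selected pieces of a matching line rather than the whole thing; for example, to output only the first names of the Jones users:

$ awk '$2 == "Jones" {print $1}' awkusers.txt

Mike
Kathy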

Note

If we left off the explicit comparison and instead wrote awk '/Jones/ {print $0}', the string inside the slashes would be a regular expression matched anywhere in the input line. The command would print all the names as before, but it would also find lines where Jones is the first name or part of a longer name (such as “Jonestown”).

join

join combines the lines of two files that share a common field. In order for join to function properly, the input files must be sorted.

Common command options

-j

Join using the specified field number. Fields start at 1.

-t

Specify the character to use as the field separator. Space is the default field separator.

--header

Use the first line of each file as a header.

Command example

Consider the files in Examples 6-2 and 6-3.

Example 6-2. usernames.txt
1,jdoe
2,puser
3,jsmith
Example 6-3. accesstime.txt
0745,file1.txt,1
0830,file4.txt,2
0830,file5.txt,3

Both files share a common field of data, which is the user ID. In accesstime.txt, the user ID is in the third column. In usernames.txt, the user ID is in the first column. You can merge these two files by using join as follows:

$ join -1 3 -2 1 -t, accesstime.txt usernames.txt

1,0745,file1.txt,jdoe
2,0830,file4.txt,puser
3,0830,file5.txt,jsmith

The -1 3 option tells join to use the third column in the first file (accesstime.txt), and -2 1 specifies the first column in the second file (usernames.txt) for use when merging the files. The -t, option specifies the comma character as the field delimiter.
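As noted above, join requires its input files to be sorted on the join field; both example files here already are. If accesstime.txt were not, you could sort it on its third column first (the output filename is arbitrary):

$ sort -t, -k 3 accesstime.txt > accesstime_sorted.txt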

sed

sed allows you to perform edits, such as replacing characters, on a stream of data.

Common command options

-i

Edit the specified file and overwrite in place

Command example

The sed command is powerful and can be used for a variety of functions. However, replacing characters or sequences of characters is one of the most common. Take a look at the file ips.txt in Example 6-4.

Example 6-4. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS

You can use sed to replace all instances of the 10.0.4.35 IP address with 10.0.4.27:

$ sed 's/10.0.4.35/10.0.4.27/g' ips.txt

ip,OS
10.0.4.2,Windows 8
10.0.4.27,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS

In this example, sed uses the following format, with each component separated by a forward slash:

s/<regular expression>/<replace with>/<flags>

The first part of the command (s) tells sed to substitute. The second part of the command (10.0.4.35) is a regular expression pattern. The third part (10.0.4.27) is the value to use to replace the regex pattern matches. The fourth part is optional flags, which in this case (g, for global) tells sed to replace all instances on a line (not just the first) that match the regex pattern. Note that the dot is a regex metacharacter matching any character; to match only the literal IP address, escape the dots (10\.0\.4\.35).
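The -i option listed earlier performs the same substitution but writes the result back to the file instead of printing it. A quick sketch; note that BSD/macOS sed requires an argument to -i (for example, sed -i ''), so this form is for GNU sed:

$ sed -i 's/10\.0\.4\.35/10.0.4.27/g' ips.txt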

tail

The tail command is used to output the last lines of a file. By default, tail will output the last 10 lines of a file.

Common command options

-f

Continuously monitor the file and output lines as they are added

-n

Output the number of lines specified

Command example

To output the last line in the somefile.txt file:

$ tail -n 1 somefile.txt

12/30/2017 192.168.10.185 login.html
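The -f option is useful for monitoring a file, such as a log, that is still being written to; tail prints new lines as they are appended until you interrupt it (for example, with Ctrl-C):

$ tail -f somefile.txt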

tr

The tr command is used to translate or map from one character to another. It is also often used to delete unwanted or extraneous characters. It only reads from stdin and writes to stdout, so you typically see it with redirects for the input and output files.

Common command options

-d

Delete the specified characters from the input stream

-s

Squeeze—that is, replace repeated instances of a character with a single instance

Command example

You can translate all the backslashes into forward slashes, and all the colons to vertical bars, with the tr command:

tr '\\:' '/|' < infile.txt > outfile.txt

Say the contents of infile.txt look like this:

drive:path\name
c:\Users\Default\file.txt

Then, after running the tr command, outfile.txt would contain this:

drive|path/name
c|/Users/Default/file.txt

The characters from the first argument are mapped to the corresponding characters in the second argument. Two backslashes are needed to specify a single backslash character because the backslash has a special meaning to tr; it is used to indicate special characters such as newline (\n), return (\r), or tab (\t). You use the single quotes around the arguments to avoid any special interpretation by bash.
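The -s option is handy in similar cleanup situations; for example, to squeeze runs of repeated spaces down to a single space:

$ echo 'too    many    spaces' | tr -s ' '

too many spaces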

Tip

Files from Windows systems often come with both a carriage return and a line feed (CR & LF) character at the end of each line. Linux and macOS systems have only the newline character to end a line. If you transfer a file to Linux and want to get rid of those extra return characters, here is how you might do that with the tr command:

tr -d '\r' < fileWind.txt > fileFixed.txt

Conversely, you can convert Linux line endings to Windows line endings by using sed:

$ sed -i 's/$/\r/' fileLinux.txt

The -i option makes the changes in place and writes them back to the input file.

Processing Delimited Files

Many of the files you will collect and process are likely to contain text, which makes the ability to manipulate text from the command line a critical skill. Text files are often broken into fields by using a delimiter such as a space, tab, or comma. One of the most common formats is known as comma-separated values (CSV). As the name indicates, CSV files are delimited using commas, and fields may or may not be surrounded by double quotes ("). The first line of a CSV file often contains the field headers. Example 6-5 shows a sample CSV file.

Example 6-5. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4

To extract just the name from the file, you can use cut by specifying the field delimiter as a comma and the field number you would like returned:

$ cut -d',' -f1 csvex.txt

"name"
"John Smith"
"Jane Smith"
"Bill Jones"

Note that the field values are still enclosed in double quotations. This may not be desirable for certain applications. To remove the quotations, you can simply pipe the output into tr with its -d option:

$ cut -d',' -f1 csvex.txt | tr -d '"'

name
John Smith
Jane Smith
Bill Jones

You can further process the data by removing the field header via the tail command’s -n option:

$ cut -d',' -f1 csvex.txt | tr -d '"' | tail -n +2

John Smith
Jane Smith
Bill Jones

The -n +2 option tells tail to output the contents of the file starting at line number 2, thus removing the field header.

Tip

You can also give cut a list of fields to extract, such as -f1-3 to extract fields 1 through 3, or a list such as -f1,4 to extract fields 1 and 4.
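For example, to extract the name and password hash columns of csvex.txt in a single pass:

$ cut -d',' -f1,4 csvex.txt

"name","password hash"
"John Smith",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith",e10adc3949ba59abbe56e057f20f883e
"Bill Jones",d8578edf8458ce06fbc5bb76a58c5ca4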

Iterating Through Delimited Data

Although you can use cut to extract entire columns of data, in some instances you will want to process the file and extract fields line by line; in this case, awk may be a better choice.

Let’s suppose you want to check each user’s password hash in csvex.txt against the dictionary file of known passwords, passwords.txt; see Examples 6-6 and 6-7.

Example 6-6. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4
Example 6-7. passwords.txt
password,md5hash
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
welcome,40be4e59b9a2a2b5dffb918c0e86b3d7
ninja,3899dcbab79f92af727c2190bbd8abc5
abc123,e99a18c428cb38d5f260853678922e03
123456789,25f9e794323b453885f5181f1b624d0b
12345678,25d55ad283aa400af464c76d713c07ad
sunshine,0571749e2ac330a7455809c6b0e7af90
princess,8afa847f50a716e64932d995c8e7435a
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4

You can extract each user’s hash from csvex.txt by using awk as follows:

$ awk -F "," '{print $4}' csvex.txt

"password hash"
5f4dcc3b5aa765d61d8327deb882cf99
e10adc3949ba59abbe56e057f20f883e
d8578edf8458ce06fbc5bb76a58c5ca4

By default, awk separates fields on whitespace, so the -F option is used to specify a custom field delimiter (,); the program then prints the fourth field ($4), which is the password hash. You can then hand the awk output to grep via command substitution; grep treats each line of the substituted text as a separate pattern and searches the passwords.txt dictionary file for all of them, outputting any matches:

$ grep "$(awk -F "," '{print $4}' csvex.txt)" passwords.txt

123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4
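An equivalent approach, if you would rather handle each hash individually (for example, to record which hash produced a match), is to loop over awk's output one line at a time; the NR > 1 pattern skips the header row:

$ awk -F "," 'NR > 1 {print $4}' csvex.txt | while read hash; do grep "$hash" passwords.txt; done

password,5f4dcc3b5aa765d61d8327deb882cf99
123456,e10adc3949ba59abbe56e057f20f883e
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4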

Processing by Character Position

If a file has fixed-width field sizes, you can use the cut command’s -c option to extract data by character position. In csvex.txt, the (US 10-digit) phone number is an example of a fixed-width field. Take a look at this example:

$ cut -d',' -f3 csvex.txt | cut -c2-13 | tail -n +2

555-555-1212
555-555-1234
555-555-6789

Here you first use cut in delimited mode to extract the phone number at field 3. Because each phone number is the same number of characters, you can use the cut character position option (-c) to extract the characters between the quotations. Finally, tail is used to remove the file header.

Processing XML

Extensible Markup Language (XML) allows you to arbitrarily create tags and elements that describe data. Example 6-8 presents an example XML document.

Example 6-8. book.xml
<book title="Cybersecurity Ops with bash" edition="1">   (1)
  <author>                                               (2)
    <firstName>Paul</firstName>                          (3)
    <lastName>Troncone</lastName>
  </author>                                              (4)
  <author>
    <firstName>Carl</firstName>
    <lastName>Albing</lastName>
  </author>
</book>

(1) This is a start tag that contains two attributes, also known as name/value pairs. Attribute values must always be quoted.

(2) This is a start tag.

(3) This is an element that has content.

(4) This is an end tag.

For useful processing, you must be able to search through the XML and extract data from within the tags, which can be done using grep. Let’s find all of the firstName elements. The -o option is used so only the text that matches the regex pattern will be returned, rather than the entire line:

$ grep -o '<firstName>.*</firstName>' book.xml

<firstName>Paul</firstName>
<firstName>Carl</firstName>

Note that the preceding regex finds the XML element only if the start and end tags are on the same line. To find the pattern across multiple lines, you need to make use of two special features. First, add the -z option to grep, which treats newlines like any ordinary character in its searching and adds a null value (ASCII 0) at the end of each string it finds. Then, add the -P option and (?s) to the regex pattern; (?s) is a Perl-specific pattern-match modifier that makes the . metacharacter also match the newline character. Here’s an example with those two features:

$ grep -Pzo '(?s)<author>.*?</author>' book.xml

<author>
  <firstName>Paul</firstName>
  <lastName>Troncone</lastName>
</author><author>
  <firstName>Carl</firstName>
  <lastName>Albing</lastName>
</author>
Warning

The -P option is not available in all versions of grep, such as the version that ships with macOS.

To strip the XML start and end tags and extract the content, you can pipe your output into sed:

$ grep -Po '<firstName>.*?</firstName>' book.xml | sed 's/<[^>]*>//g'

Paul
Carl

The sed expression can be described as s/expr/other/ to replace (or substitute) an expression (expr) with something else (other). The expression can be literal characters or a more complex regex. If an expression has no “other” portion, such as s/expr//, then it replaces anything that matches the regular expression with nothing, essentially removing it. The regex pattern we use in the preceding example—namely, the <[^>]*> expression—is a little confusing, so let’s break it down:

<

The pattern begins with a literal <.

[^>]*

Zero or more (indicated by a *) characters from the set of characters inside the brackets; the first character is a ^, which means “not” any of the remaining characters listed. Here that’s just the solitary > character, so [^>] matches any character that is not >.

>

The pattern ends with a literal >.

This should match a single XML tag, from its opening less-than to its closing greater-than character, but not more than that.

Processing JSON

JavaScript Object Notation (JSON) is another popular file format, particularly for exchanging data through application programming interfaces (APIs). JSON is a simple format that consists of objects, arrays, and name/value pairs. Example 6-9 shows a sample JSON file.

Example 6-9. book.json
{                                            (1)
  "title": "Cybersecurity Ops with bash",    (2)
  "edition": 1,
  "authors": [                               (3)
    {
      "firstName": "Paul",
      "lastName": "Troncone"
    },
    {
      "firstName": "Carl",
      "lastName": "Albing"
    }
  ]
}

(1) This is an object. Objects begin with { and end with }.

(2) This is a name/value pair. Values can be a string, number, array, Boolean, or null.

(3) This is an array. Arrays begin with [ and end with ].

Tip

For more information on the JSON format, visit the JSON web page.

When processing JSON, you are likely going to want to extract key/value pairs, which can be done using grep. To extract the firstName key/value pair from book.json:

$ grep -o '"firstName": ".*"' book.json

"firstName": "Paul"
"firstName": "Carl"

Again, the -o option is used to return only the characters that match the pattern rather than the entire line of the file.

If you want to remove the key and display only the value, you can do so by piping the output into cut, extracting the second field, and removing the quotations with tr:

$ grep -o '"firstName": ".*"' book.json | cut -d " " -f2 | tr -d '"'

Paul
Carl
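Pattern matching works for simple, flat extractions like this, but it is brittle if the whitespace or nesting of the JSON changes. If the jq utility is installed on your system, it parses the document properly; here is the same extraction as a one-line sketch:

$ jq -r '.authors[].firstName' book.json

Paul
Carl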

We will perform more-advanced processing of JSON in Chapter 11.

Aggregating Data

Data is often collected from a variety of sources, and in a variety of files and formats. Before you can analyze the data, you must get it all into the same place and in a format that is conducive to analysis.

Suppose you want to search a treasure trove of data files for any system named ProductionWebServer. Recall that in previous scripts we wrapped our collected data in XML tags with the following format: <systeminfo host="">. During collection, we also named our files by using the hostname. You can now use either of those attributes to find and aggregate the data into a single location:

find /data -type f -exec grep '{}' -e 'ProductionWebServer' \; -exec cat '{}' >> ProductionWebServerAgg.txt \;

The command find /data -type f lists all of the files in the /data directory and its subdirectories. Each file found is handed to grep, which searches it for the string ProductionWebServer; because -exec treats the command's exit status as a test, the second -exec runs only when grep finds a match, appending (>>) the file's contents to ProductionWebServerAgg.txt. If you would rather gather copies of the matching files in a single directory than concatenate their contents into a single file, replace the cat command with cp and a destination directory, as sketched below.
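Here is a sketch of that copy variation; the destination directory /data/aggregated is hypothetical, and grep's -q option suppresses grep's normal output because it is being used purely as a test:

$ find /data -type f -exec grep -q 'ProductionWebServer' '{}' \; -exec cp '{}' /data/aggregated/ \;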

You can also use the join command to take data that is spread across two files and aggregate it into one. Take a look at the two files in Examples 6-10 and 6-11.

Example 6-10. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
Example 6-11. user.txt
user,ip
jdoe,10.0.4.2
jsmith,10.0.4.35
msmith,10.0.4.107
tjones,10.0.4.145

The files share a common column of data, which is the IP addresses. Therefore, the files can be merged using join:

$ join -t, -2 2 ips.txt user.txt

ip,OS,user
10.0.4.2,Windows 8,jdoe
10.0.4.35,Ubuntu 16,jsmith
10.0.4.107,macOS,msmith
10.0.4.145,macOS,tjones

The -t, option tells join that the columns are delimited using a comma; by default, it uses a space character.

The -2 2 option tells join to use the second column of data in the second file (user.txt) as the key to perform the merge. By default, join uses the first field as the key, which is appropriate for the first file (ips.txt). If you needed to join using a different field in ips.txt, you would add the option -1 n, where n is replaced by the appropriate column number.
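In the preceding example, the header lines happened to pair up because "ip" matched "ip". GNU join's --header option (listed earlier) makes that explicit, treating the first line of each file as a header rather than as data to be matched:

$ join --header -t, -2 2 ips.txt user.txt

ip,OS,user
10.0.4.2,Windows 8,jdoe
10.0.4.35,Ubuntu 16,jsmith
10.0.4.107,macOS,msmith
10.0.4.145,macOS,tjones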

Warning

To use join, both files must already be sorted by the column you will use to perform the merge. To do this, you can use the sort command, which is covered in Chapter 7.

Summary

In this chapter, we explored ways to process common data formats, including delimited, positional, JSON, and XML. The vast majority of data you collect and process will be in one of those formats.

In the next chapter, we look at how data can be analyzed and transformed into information that will provide insights into system status and drive decision making.

Workshop

  1. Given the following file tasks.txt, use the cut command to extract columns 1 (Image Name), 2 (PID), and 5 (Mem Usage).

    Image Name;PID;Session Name;Session#;Mem Usage
    System Idle Process;0;Services;0;4 K
    System;4;Services;0;2,140 K
    smss.exe;340;Services;0;1,060 K
    csrss.exe;528;Services;0;4,756 K
  2. Given the file procowner.txt, use the join command to merge the file with tasks.txt from the preceding exercise.

    Process Owner;PID
    jdoe;0
    tjones;4
    jsmith;340
    msmith;528
  3. Use the tr command to replace all of the semicolon characters in tasks.txt with the tab character and print the file to the screen.

  4. Write a command that extracts the first and last names of all authors in book.json.

Visit the Cybersecurity Ops website for additional resources and the answers to these questions.
