Chapter 7. Data Analysis

In the previous chapters, we used scripts to collect data and prepare it for analysis. Now we need to make sense of it all. When analyzing large amounts of data, it often helps to start broad and continually narrow the search as new insights are gained into the data.

In this chapter, we use the data from web server logs as input into our scripts. This is simply for demonstration purposes. The scripts and techniques can easily be modified to work with nearly any type of data.

Commands in Use

We introduce sort, head, and uniq to limit the data we need to process and display. The file in Example 7-1 will be used for command examples.

Example 7-1. file1.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html

sort

The sort command is used to rearrange the lines of a text file into numerical or alphabetical order. By default, sort will arrange lines in ascending order, starting with numbers and then letters. Uppercase letters will be placed before their corresponding lowercase letters unless otherwise specified.

Common command options

-r

Sort in descending order.

-f

Ignore case.

-n

Use numerical ordering, so that 1, 2, 3 all sort before 10. (In the default alphabetic sorting, 2 and 3 would appear after 10.)

-k

Sort based on a subset of the data (key) in a line. Fields are delimited by whitespace.

-o

Write output to a specified file.

Command example

To sort file1.txt by the filename column (the third field) and ignore the date and IP address columns, you would use the following:

sort -k 3 file1.txt

You can also sort on a subset of a field. To sort by the second octet in the IP address:

sort -k 2.5,2.7 file1.txt

This will sort using characters 5 through 7 of the second field.
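
As a quick illustration of why the -n option matters, here is a minimal sketch using inline values rather than file1.txt:

$ printf '10\n2\n3\n1\n' | sort

1
10
2
3

$ printf '10\n2\n3\n1\n' | sort -n

1
2
3
10

In the default alphabetic ordering, 10 sorts before 2; with -n, the values are compared as numbers.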

uniq

The uniq command filters out duplicate lines of data that occur adjacent to one another. To remove all duplicate lines in a file, be sure to sort it before using uniq.

Common command options

-c

Print out the number of times a line is repeated.

-f

Ignore the specified number of fields before comparing. For example, -f 3 will ignore the first three fields in each line. Fields are delimited using spaces.

-i

Ignore letter case. By default, uniq is case-sensitive.
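
Command example

To count how many times each line occurs, sort the data first and then pipe it into uniq -c. Here is a minimal sketch, assuming a hypothetical file ips.txt that contains one IP address per line:

$ cat ips.txt

192.168.10.14
192.168.10.185
192.168.10.14

$ sort ips.txt | uniq -c

      2 192.168.10.14
      1 192.168.10.185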

Web Server Access Log Familiarization

We use an Apache web server access log for most of the examples in this chapter. This type of log records page requests made to the web server, when they were made, and who made them. A sample of a typical Apache Combined Log Format file can be seen in Example 7-2. The full logfile is referenced as access.log in this book and can be downloaded from the book’s web page.

Example 7-2. Sample from access.log
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1" 200
7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64;
rv:56.0) Gecko/20100101 Firefox/56.0"
Note

Web server logs are used simply as an example. The techniques introduced throughout this chapter can be applied to analyze a variety of data types.

The Apache web server log fields are described in Table 7-1.

Table 7-1. Apache web server Combined Log Format fields
Field                                 Description                                              Field number
192.168.0.11                          IP address of the host that requested the page           1
-                                     RFC 1413 Ident protocol identifier (- if not present)    2
-                                     The HTTP authenticated user ID (- if not present)        3
[12/Nov/2017:15:54:39 -0500]          Date, time, and GMT offset (time zone)                   4–5
GET /request-quote.html               The page that was requested                              6–7
HTTP/1.1                              The HTTP protocol version                                8
200                                   The status code returned by the web server               9
7326                                  The size of the file returned in bytes                   10
http://192.168.0.35/support.html      The referring page                                       11
Mozilla/5.0 (Windows NT 6.3; Win64…   User agent identifying the browser                       12+
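
To see how these field numbers map onto command-line tools, you can print a few of them with awk. This is a minimal sketch; applied to the entry shown in Example 7-2 it prints the requesting IP address (field 1), status code (field 9), and bytes returned (field 10), and your copy of access.log will produce one such line per entry:

$ awk '{print $1, $9, $10}' access.log

192.168.0.11 200 7326
. . .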

Note

There is a second type of Apache access log known as the Common Log Format. The format is the same as the Combined Log Format except it does not contain fields for the referring page or user agent. See the Apache HTTP Server Project website for additional information on the Apache log format and configuration.

The status codes listed in Table 7-1 (field 9) are particularly informative; they tell you how the web server responded to any given request. Common codes are shown in Table 7-2.

Table 7-2. HTTP status codes
Code   Description
200    OK
401    Unauthorized
404    Page Not Found
500    Internal Server Error
502    Bad Gateway

Tip

For a complete list of codes, see the Hypertext Transfer Protocol (HTTP) Status Code Registry.

Sorting and Arranging Data

When analyzing data for the first time, it is often beneficial to start by looking at the extremes: the things that occurred the most or least frequently, the smallest or largest data transfers, etc. For example, consider the data that you can collect from web server logfiles. An unusually high number of page accesses could indicate scanning activity or a denial-of-service attempt. An unusually high number of bytes downloaded by a host could indicate site cloning or data exfiltration.

To control the arrangement and display of data, use the sort, head, and tail commands at the end of a pipeline:

…   | sort -k 2.1 -rn | head -15

This pipes the output of a script into the sort command and then pipes that sorted output into head, which will print the first 15 lines (in this case). The sort command here is using as its sort key (-k) the second field, beginning at its first character (2.1). Moreover, it is doing a reverse sort (-r), and the values will be sorted like numbers (-n). Why a numerical sort? So that 2 shows up between 1 and 3, and not between 19 and 20 (which is where alphabetical ordering would put it).

By using head, we take the first lines of the output. We could get the last few lines by piping the output from the sort command into tail instead of head. Using tail -15 would give us the last 15 lines. The other way to do this would be to simply remove the -r option on sort so that it does an ascending rather than descending sort.
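
As a concrete sketch of the two alternatives just described, either of the following pipelines yields the 15 smallest values; the first lists them in descending order, the second in ascending order:

…   | sort -k 2.1 -rn | tail -15
…   | sort -k 2.1 -n | head -15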

Counting Occurrences in Data

A typical web server log can contain tens of thousands of entries. By counting each time a page was accessed, or by which IP address it was accessed from, you can gain a better understanding of general site activity. Interesting entries can include the following:

  • A high number of requests returning the 404 (Page Not Found) status code for a specific page; this can indicate broken hyperlinks.

  • A high number of requests from a single IP address returning the 404 status code; this can indicate probing activity looking for hidden or unlinked pages.

  • A high number of requests returning the 401 (Unauthorized) status code, particularly from the same IP address; this can indicate an attempt at bypassing authentication, such as brute-force password guessing.

To detect this type of activity, we need to be able to extract key fields, such as the source IP address, and count the number of times they appear in a file. To accomplish this, we will use the cut command to extract the field and then pipe the output into our new tool, countem.sh, which is shown in Example 7-3.

Example 7-3. countem.sh
#!/bin/bash -
#
# Cybersecurity Ops with bash
# countem.sh
#
# Description:
# Count the number of instances of an item using bash
#
# Usage:
# countem.sh < inputfile
#

declare -A cnt        # assoc. array             1
while read id xtra                               2
do
    let cnt[$id]++                               3
done
# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"                           4
do
    printf '%d %s\n'  "${cnt[$id]}"  "$id"       5
done
1

Since we don’t know what IP addresses (or other strings) we might encounter, we will use an associative array (also known as a hash table or dictionary), declared here with the -A option, so that we can use whatever string we read as our index.

The associative array feature is found in bash 4.0 and higher. In such an array, the index doesn’t have to be a number, but can be any string. So you can index the array by the IP address and thus count the occurrences of that IP address. In case you are using something older than bash 4.0, Example 7-4 is an alternate script that uses awk instead.

The array references are like others in bash, using the ${var[index]} syntax to reference an element of the array. To get all the different index values that have been used (the “keys” if you think of these arrays as (key, value) pairings), use: ${!cnt[@]}.

2

Although we expect only one word of input per line, we put the variable xtra there to capture any other words that appear on the line. Each variable on a read command gets assigned the corresponding word from the input (i.e., the first variable gets the first word, the second variable gets the second word, and so on), but the last variable gets any and all remaining words. On the other hand, if there are fewer words of input on a line than there are variables on the read command, then those extra variables get set to the empty string. So for our purposes, if there are extra words on the input line, they’ll all be assigned to xtra, but if there are no extra words, xtra will be given the value of the null string (which won’t matter either way because we don’t use it).

3

Here we use that string as the index and increment its previous value. For the first use of the index, the previous value will be unset, which will be taken as zero.

4

This syntax lets us iterate over all the various index values that we encountered. Note, however, that the order is not guaranteed to be alphabetical or in any other specific order due to the nature of the hashing algorithm for the index values.

5

In printing out the value and key, we put the values inside quotes so that we always get a single value for each argument—even if that value had a space or two inside it. It isn’t expected to happen with our use of this script, but such coding practices make the scripts more robust when used in other situations.
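
Here is a minimal sketch of the read behavior described in callout 2, using here-strings instead of a logfile (the sample strings are made up):

$ read id xtra <<< "192.168.0.37 GET /index.html"
$ echo "id=$id xtra=$xtra"
id=192.168.0.37 xtra=GET /index.html
$ read id xtra <<< "192.168.0.37"
$ echo "id=$id xtra=$xtra"
id=192.168.0.37 xtra=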

And Example 7-4 shows another version, this time using awk.

Example 7-4. countem.awk
# Cybersecurity Ops with bash
# countem.awk
#
# Description:
# Count the number of instances of an item using awk
#
# Usage:
# countem.awk < inputfile
#

awk '{ cnt[$1]++ }
END { for (id in cnt) {
        printf "%d %s
", cnt[id], id
      }
    }'

Both will work nicely in a pipeline of commands like this:

cut -d' ' -f1 logfile | bash countem.sh

The cut command is not really necessary here for either version. Why? Because the awk script explicitly references the first field (with $1), and in the shell script it’s because of how we coded the read command (see 2). So we can run it like this:

bash countem.sh < logfile

For example, to count the number of times an IP address made an HTTP request that resulted in a 404 (Page Not Found) error:

$ awk '$9 == 404 {print $1}' access.log | bash countem.sh

1 192.168.0.36
2 192.168.0.37
1 192.168.0.11

You can also use grep 404 access.log and pipe it into countem.sh, but that would include lines where 404 appears in other places (e.g., the byte count, or part of a file path). The use of awk here restricts the counting only to lines where the returned status (the ninth field) is 404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the total number of times each IP address made a request that resulted in a 404 error.
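
To see the difference, consider a hypothetical log entry in which 404 appears only as the byte count. A grep for 404 would select this line, but the awk filter prints nothing, because the status code (field 9) is 200:

$ echo '192.168.0.14 - - [12/Nov/2017:15:57:01 -0500] "GET /about.html HTTP/1.1" 200 404 "-" "Mozilla/5.0"' |
awk '$9 == 404 {print $1}'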

To begin analysis of the example access.log file, you can start by looking at the hosts that accessed the web server. You can use the Linux cut command to extract the first field of the logfile, which contains the source IP address, and then pipe the output into the countem.sh script. The exact command and output are shown here.

$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn

111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
Tip

If you do not have countem.sh available, you can use the uniq command with the -c option to achieve similar results, but it requires an extra pass through the data using sort to work properly.

$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn

111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26

Next, you can further investigate by looking at the host that had the most requests, which, as can be seen in the preceding output, is IP address 192.168.0.37 with 111 requests. You can use awk to filter on the IP address, then pipe that into cut to extract the field that contains the request, and finally pipe that output into countem.sh to provide the total number of requests for each page:

$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7
| bash countem.sh

1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/theme-background/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
. . .
14 /files/theme/custom49c2.js?1490908488

The activity of this particular host is unimpressive, appearing to be standard web-browsing behavior. If you take a look at the host with the next highest number of requests, you will see something a little more interesting:

$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f7
| bash countem.sh

1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
. . .
1 /uploads/2/9/1/4/29147191/601239_orig.png

This output indicates that host 192.168.0.36 accessed nearly every page on the website exactly one time. This type of activity often indicates web-crawler or site-cloning activity. If you take a look at the user agent string provided by the client, it further verifies this conclusion:

$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq

"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

The user agent identifies itself as HTTrack, which is a tool used to download or clone websites. While not necessarily malicious, it is interesting to note during analysis.

Tip

You can find additional information on HTTrack at the HTTrack website.

Totaling Numbers in Data

Rather than just count the number of times an IP address or other item occurs, what if you wanted to know the total byte count that has been sent to an IP address—or which IP addresses have requested and received the most data?

The solution is not that much different from countem.sh: you just need a few small changes. First, you need more columns of data, so you tweak the input filter (the cut command) to extract two columns (IP address and byte count) rather than just the IP address. Second, you change the calculation from an increment (let cnt[$id]++), which produces a simple count, to a sum of that second field of data (let cnt[$id]+=$data).

The pipeline to invoke this will now extract two fields from the logfile, the first and the tenth:

cut -d' ' -f 1,10 access.log | bash summer.sh

The script summer.sh, shown in Example 7-5, reads in two columns of data. The first column consists of index values (in this case, IP addresses) and the second column is a number (in this case, number of bytes sent by the IP address). Every time the script finds a repeat IP address in the first column, it then adds the value of the second column to the total byte count for that IP address, thus totaling the number of bytes sent by the IP address.

Example 7-5. summer.sh
#!/bin/bash -
#
# Cybersecurity Ops with bash
# summer.sh
#
# Description:
# Sum the total of field 2 values for each unique field 1
#
# Usage: ./summer.sh
#   input format: <name> <number>
#

declare -A cnt        # assoc. array
while read id count
do
  let cnt[$id]+=$count
done
for id in "${!cnt[@]}"
do
    printf "%-15s %8d
"  "${id}"  "${cnt[${id}]}" 1
done
1

Note that we’ve made a few other changes to the output format. With the output format, we’ve added field sizes of 15 characters for the first string (the IP address in our sample data), left-justified (via the minus sign), and eight digits for the sum values. If the sum is larger, it will print the larger number, and if the string is longer, it will be printed in full. We’ve done this to get the data to align, by and large, nicely in columns, for readability.

You can run summer.sh against the example access.log file to get an idea of the total amount of data requested by each host. To do this, use cut to extract the IP address and bytes transferred fields, and then pipe the output into summer.sh:

$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn

192.168.0.36     4371198
192.168.0.14     2876088
192.168.0.37     2575030
192.168.0.11     2537662
192.168.0.26      665693

These results can be useful in identifying hosts that have transferred unusually large amounts of data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify such a host, the next step would be to review the specific pages and files accessed by the suspicious host to try to classify it as malicious or benign.

Displaying Data in a Histogram

You can take counting one step further by providing a more visual display of the results. You can take the output from countem.sh or summer.sh and pipe it into yet another script, one that will produce a histogram-like display of the results.

The script to do the printing will take the first field as the index into an associative array and the second field as the value for that array element. It will then iterate through the array and print a number of hash marks (#) to represent each count, scaled so that the largest count in the list is drawn with 50 # characters. The script is shown in Example 7-6.

Example 7-6. histogram.sh
#!/bin/bash -
#
# Cybersecurity Ops with bash
# histogram.sh
#
# Description:
# Generate a horizontal bar chart of specified data
#
# Usage: ./histogram.sh
#   input format: label value
#

function pr_bar ()                            1
{
    local -i i raw maxraw scaled              2
    raw=$1
    maxraw=$2
    ((scaled=(MAXBAR*raw)/maxraw))            3
    # min size guarantee
    ((raw > 0 && scaled == 0)) && scaled=1    4

    for((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar

#
# "main"
#
declare -A RA						5
declare -i MAXBAR max
max=0
MAXBAR=50	# how large the largest bar should be

while read labl val
do
    let RA[$labl]=$val					6
    # keep the largest value; for scaling
    (( val > max )) && max=$val
done

# scale and print it
for labl in "${!RA[@]}"					7
do
    printf '%-20.20s  ' "$labl"
    pr_bar ${RA[$labl]} $max				8
done
1

We define a function to draw a single bar of the histogram. This definition must be encountered before a call to the function can be made, so it makes sense to put function definitions at the front of our script. We will be reusing this function in a future script, so we could have put it in a separate file and included it here with a source command—but we didn’t.

2

We declare all these variables as local because we don’t want them to interfere with variable names in the rest of this script (or any others, if we copy/paste this script to use elsewhere). We declare all these variables as integers (that’s the -i option) because we are going to only compute values with them and not use them as strings.

3

The computation is done inside double parentheses. Inside those, we don’t need to use the $ to indicate “the value of” each variable name.

4

This is an “if-less” if statement. If the expression inside the double parentheses is true, then, and only then, is the second expression (the assignment) executed. This will guarantee that scaled is never zero when the raw value is nonzero. Why? Because we’d like something to show up in that case.

5

The main part of the script begins with a declaration of the RA array as an associative array.

6

Here we reference the associative array by using the label, a string, as its index.

7

Because the array is not indexed by numbers, we can’t just count integers and use them as indices. This construct gives all the various strings that were used as an index to the array, one at a time, in the for loop.

8

We use the label as an index one more time to get the count and pass it as the first parameter to our pr_bar function.
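
Here is a quick sketch of the scaling arithmetic and the minimum-size guarantee described in callouts 3 and 4, using hypothetical values:

$ MAXBAR=50 raw=17000 maxraw=4371198
$ ((scaled=(MAXBAR*raw)/maxraw)) ; echo $scaled
0
$ ((raw > 0 && scaled == 0)) && scaled=1 ; echo $scaled
1

Integer division would otherwise round a small but nonzero count down to an empty bar; the guard ensures that at least one # is printed.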

Note that the items don’t appear in the same order as the input. That’s because the hashing algorithm for the key (the index) doesn’t preserve ordering. You could take this output and pipe it into yet another sort, or you could take a slightly different approach.

Example 7-7 is a version of the histogram script that preserves order—by not using an associative array. This might also be useful on older versions of bash (pre 4.0), prior to the introduction of associative arrays. Only the “main” part of the script is shown, as the function pr_bar remains the same.

Example 7-7. histogram_plain.sh
#!/bin/bash -
#
# Cybersecurity Ops with bash
# histogram_plain.sh
#
# Description:
# Generate a horizontal bar chart of specified data without
# using associative arrays, good for older versions of bash
#
# Usage: ./histogram_plain.sh
#   input format: label value
#

declare -a RA_key RA_value                               1
declare -i max ndx
max=0
maxbar=50    # how large the largest bar should be

ndx=0
while read labl val
do
    RA_key[$ndx]=$labl                                   2
    RA_value[$ndx]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
    let ndx++
done

# scale and print it
for ((j=0; j<ndx; j++))                                  3
do
    printf "%-20.20s  " ${RA_key[$j]}
    pr_bar ${RA_value[$j]} $max
done

This version of the script avoids the use of associative arrays, in case you are running an older version of bash (prior to 4.x), such as on macOS systems. For this version, we use two separate arrays—one for the index value and one for the counts. Because they are normal arrays, we have to use an integer index, and so we will keep a simple count in the variable ndx.

1

Here the variable names are declared as arrays. The lowercase -a says that they are arrays, but not of the associative variety. While not strictly necessary, this is good practice. Similarly, on the next line we use the -i to declare these variables as integers, making them more efficient than undeclared shell variables (which are stored as strings). Again, this is not strictly necessary, as seen by the fact that we don't declare maxbar but just use it.

2

The key and value pairs are stored in separate arrays, but at the same index location. This approach is “brittle”—that is, easily broken, if changes to the script ever got the two arrays out of sync.

3

Now the for loop, unlike the previous script, is a simple counting of an integer from 0 to ndx. The variable j is used here so as not to interfere with the index in the for loop inside pr_bar, although we were careful enough inside the function to declare its version of i as local to the function. Do you trust it? Change the j to an i here and see if it still works (it does). Then try removing the local declaration and see if it fails (it does).

This approach with the two arrays does have one advantage. By using the numerical index for storing the label and the data, you can retrieve them in the order they were read in—in the numerical order of the index.

You can now visually see the hosts that transferred the largest number of bytes by extracting the appropriate fields from access.log, piping the results into summer.sh, and then into histogram.sh:

$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh

192.168.0.36          ##################################################
192.168.0.37          #############################
192.168.0.11          #############################
192.168.0.14          ################################
192.168.0.26          #######

Although this might not seem that useful for the small amount of sample data, being able to visualize trends is invaluable when looking across larger datasets.

In addition to looking at the number of bytes transferred by IP address or host, it is often interesting to look at the data by date and time. To do that, you can use the summer.sh script, but due to the format of the access.log file, you need to do a little more processing before you can pipe it into the script. If you use cut to extract the date/time and bytes transferred fields, you are left with data that causes some problems for the script:

$ cut -d' ' -f4,10 access.log

[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112

As shown in the preceding output, the raw data starts with a [ character. That causes a problem with the script because it denotes the beginning of an array in bash. To remedy that, you can use an additional iteration of the cut command with -c2- to remove the character. This option tells cut to extract the data by character, starting at position 2 and going to the end of the line (-). The corrected output with the square bracket removed is shown here:

$ cut -d' ' -f4,10 access.log | cut -c2-

12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
Tip

Alternatively, you can use tr in place of the second cut. The -d option will delete the character specified—in this case, the square bracket.

cut -d' ' -f4,10 access.log | tr -d '['

You also need to determine how you want to group the time-bound data: by day, month, year, hour, etc. You can do this by simply modifying the option for the second cut iteration. Table 7-3 illustrates the cut option to use to extract various forms of the date/time field. Note that these cut options are specific to Apache logfiles.

Table 7-3. Apache log date/time field extraction
Date/time extracted    Example output         Cut option
Entire date/time       12/Nov/2017:19:26:09   -c2-
Month, day, and year   12/Nov/2017            -c2-12,22-
Month and year         Nov/2017               -c5-12,22-
Full time              19:26:04               -c14-
Hour                   19                     -c14-15,22-
Year                   2017                   -c9-12,22-
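
For example, here is a sketch of totaling the bytes transferred per day rather than per hour, using the -c2-12,22- option from Table 7-3; pipe the result into histogram.sh as before if you want a bar chart:

$ cut -d' ' -f4,10 access.log | cut -c2-12,22- | bash summer.sh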

The histogram.sh script can be particularly useful when looking at time-based data. For example, if your organization has an internal web server that is accessed only during working hours of 9:00 A.M. to 5:00 P.M., you can review the server log file on a daily basis via the histogram view to see whether spikes in activity occur outside normal working hours. Large spikes of activity or data transfer outside normal working hours could indicate exfiltration by a malicious actor. If any anomalies are detected, you can filter the data by that particular date and time and review the page accesses to determine whether the activity is malicious.

For example, if you want to see a histogram of the total amount of data that was retrieved on a certain day and on an hourly basis, you can do the following:

$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh

17              ##
16              ###########
15              ############
19              ##
18              ##################################################

Here the access.log file is sent through awk to extract the entries from a particular date. Note the use of the match operator (~) instead of ==, because field 4 also contains time information. Those entries are piped into cut to extract the date/time and bytes-transferred fields, and then piped into cut again to extract just the hour. From there, the data is summed by hour using summer.sh and converted into a histogram using histogram.sh. The result is a histogram that displays the total number of bytes transferred each hour on November 12, 2017.

Tip

Pipe the output from the histogram script into sort -n to get the output in numerical (hour) order. Why is the sort needed? The scripts summer.sh and histogram.sh are both generating their output by iterating through the list of indices of their associative arrays. Therefore, their output will not likely be in a sensible order (but rather in an order determined by the internal hashing algorithm). If that explanation left you cold, just ignore it and remember to use a sort on the output.

If you want to have the output ordered by the amount of data, you’ll need to add the sort between the two scripts. You’ll also need to use histogram_plain.sh, the version of the histogram script that doesn’t use associative arrays.
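
Here is a sketch of that pipeline, listing hosts from the largest amount of data transferred to the smallest:

$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn |
bash histogram_plain.sh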

Finding Uniqueness in Data

Previously, IP address 192.168.0.37 was identified as the system that had the largest number of page requests. The next logical question is, what pages did this system request? With that answer, you can start to gain an understanding of what the system was doing on the server and categorize the activity as benign, suspicious, or malicious. To accomplish that, you can use awk and cut and pipe the output into countem.sh:

$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5

14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html

Although this can be accomplished by piping together commands and scripts, that requires multiple passes through the data. This may work for many datasets, but it is too inefficient for extremely large datasets. You can streamline this by writing a bash script specifically designed to extract and count page accesses, and this requires only a single pass over the data. Example 7-8 shows this script.

Example 7-8. pagereq.sh
#!/bin/bash -
#
# Cybersecurity Ops with bash
# pagereq.sh
#
# Description:
# Count the number of page requests for a given IP address using bash
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#

declare -A cnt                                             1
while read addr d1 d2 datim gmtoff getr page therest
do
    if [[ $1 == $addr ]] ; then let cnt[$page]+=1 ; fi
done
for id in ${!cnt[@]}                                       2
do
    printf "%8d %s
" ${cnt[$id]} $id
done
1

We declare cnt as an associative array so that we can use a string as the index to the array. In this program, we will be using the page address (the URL) as the index.

2

The ${!cnt[@]} results in a list of all the different index values that have been encountered. Note, however, that they are not listed in any useful order.

Early versions of bash do not have associative arrays. You can use awk to do the same thing—count the various page requests from a particular IP address—since awk has associative arrays.

Example 7-9. pagereq.awk
# Cybersecurity Ops with bash
# pagereq.awk
#
# Description:
# Count the number of page requests for a given IP address using awk
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#

# count the number of page requests from an address ($1)
awk -v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }                1
END { for (id in cnt) {                                          2
    printf "%8d %s
", cnt[id], id
    }
}'
1

There are two very different $1 variables on this line. The first $1 is a shell variable and refers to the first argument supplied to this script when it is invoked. The second $1 is an awk variable. It refers to the first field of the input on each line. The first $1 has been assigned to the awk variable page so that it can be compared to each $1 of awk (that is, to each first field of the input data).

2

This simple syntax results in the variable id iterating over the values of the index values to the cnt array. It is much simpler syntax than the shell’s "${!cnt[@]}" syntax, but with the same effect.

You can run pagereq.sh by providing the IP address you would like to search for and redirect access.log as input:

$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5

14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html

Identifying Anomalies in Data

On the web, a user-agent string is a small piece of textual information sent by a browser to a web server that identifies the client’s operating system, browser type, version, and other information. It is typically used by web servers to ensure page compatibility with the user’s browser. Here is an example of a user-agent string:

Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0

This user-agent string identifies the system as Windows NT version 6.3 (aka Windows 8.1), with 64-bit architecture, and using the Firefox browser.

The user agent string is interesting for two reasons: first, because of the significant amount of information it conveys, which can be used to identify the types of systems and browsers accessing the server; second, because it is configurable by the end user, which can be used to identify systems that may not be using a standard browser or may not be using a browser at all (i.e., a web crawler).

You can identify unusual user agents by first compiling a list of known-good user agents. For the purposes of this exercise, we will use a very small list that is not specific to a particular version; see Example 7-10.

Example 7-10. useragents.txt
Firefox
Chrome
Safari
Edge
Tip

For a list of common user agent strings, visit the TechBlog site.

You can then read in a web server log and compare each line to each valid user agent until you get a match. If no match is found, it should be considered an anomaly and printed to standard output along with the IP address of the system making the request. This provides yet another vantage point into the data, identifying systems with unusual user agents, and another path to further explore.

Example 7-11. useragents.sh
#!/bin/bash -
#
# Cybersecurity Ops with bash
# useragents.sh
#
# Description:
# Read through a log looking for unknown user agents
#
# Usage: ./useragents.sh  <  <inputfile>
#   <inputfile> Apache access log
#


# mismatch - search through the array of known names
#  returns 1 (false) if it finds a match
#  returns 0 (true) if there is no match
function mismatch ()                                    1
{
    local -i i                                          2
    for ((i=0; i<$KNSIZE; i++))
    do
        [[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1      3
    done
    return 0
}

# read up the known ones
readarray -t KNOWN < "useragents.txt"                      4
KNSIZE=${#KNOWN[@]}                                     5

# preprocess logfile (stdin) to pick out ipaddr and user agent
awk -F'"' '{print $1, $6}' | 
while read ipaddr dash1 dash2 dtstamp delta useragent   6
do
    if mismatch "$useragent"
    then
        echo "anomaly: $ipaddr $useragent"
    fi
done
1

We will use a function for the core of this script. It will return a success (or “true”) if it finds a mismatch; that is, if it finds no match against the list of known user agents. This logic may seem a bit inverted, but it makes the if statement containing the call to mismatch read clearly.

2

Declaring our for loop index as a local variable is good practice. It is not strictly necessary in this script but is a good habit.

3

There are two strings to compare: the input from the logfile and a line from the list of known user agents. To make for a very flexible comparison, we use the regex comparison operator (the =~). The .* (meaning “zero or more instances of any character”) placed on either side of the $KNOWN array reference means that the known string can appear anywhere within the other string for a match.

4

Each line of the file is added as an element to the array name specified. This gives us an array of known user agents. There are two identical ways to do this in bash: either readarray, as used here, or mapfile. The -t option removes the trailing newline from each line read. The file containing the list of known user agents is specified here; modify as needed.

5

This computes the size of the array. It is used inside the mismatch function to loop through the array. We calculate it here, once, outside our loop to avoid recomputing it every time the function is called.

6

The input string is a complex mix of words and quote marks. To capture the user agent string, we use the double quote as the field separator. Doing that, however, means that our first field contains more than just the IP address. By using the bash read, we can parse on the spaces to get the IP address. The last argument of the read takes all the remaining words so it can capture all the words of the user agent string.
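
Here is a minimal command-line sketch of the regex comparison from callout 3, using user-agent strings seen earlier in this chapter:

$ ua="Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0"
$ [[ "$ua" =~ .*Firefox.* ]] && echo "found a match"
found a match
$ ua="Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
$ [[ "$ua" =~ .*Firefox.* ]] || echo "no match, would be reported as an anomaly"
no match, would be reported as an anomaly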

When you run useragents.sh, it will output any user agent strings not found in the useragents.txt file:

$ bash useragents.sh < access.log

anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
.
.
.
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

Summary

In this chapter, we looked at statistical analysis techniques to identify unusual and anomalous activity in logfiles. This type of analysis can provide you with insights into what occurred in the past. In the next chapter, we look at how to analyze logfiles and other data to provide insights into what is happening on a system in real time.

Workshop

  1. The following example uses cut to print the first and tenth fields of the access.log file:

    $ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn

    Replace the cut command with the awk command. Do you get the same results? What might be different about those two approaches?

  2. Expand the histogram.sh script to include the count at the end of each histogram bar. Here is sample output:

    192.168.0.37          #############################    2575030
    192.168.0.26          ####### 665693
  3. Expand the histogram.sh script to allow the user to supply the option -s that specifies the maximum bar size. For example, histogram.sh -s 25 would limit the maximum bar size to 25 # characters. The default should remain at 50 if no option is given.

  4. Modify the useragents.sh script to add some parameters:

    1. Add code for an optional first parameter specifying the filename of the known user agents list. If not specified, default to the name useragents.txt as the script currently does.

    2. Add code for an -f option to take an argument. The argument is the filename of the logfile to read rather than reading from stdin.

  5. Modify the pagereq.sh script to not need an associative array but to work with a traditional array that uses a numerical index. Convert the IP address into a 10- to 12-digit number for that use. Caution: Don’t have leading zeros on the number, or the shell will attempt to interpret it as an octal number. Example: Convert “10.124.16.3” into “10124016003,” which can be used as a numerical index.

Visit the Cybersecurity Ops website for additional resources and the answers to these questions.
