In the previous chapters, we used scripts to collect data and prepare it for analysis. Now we need to make sense of it all. When analyzing large amounts of data, it often helps to start broad and continually narrow the search as new insights are gained into the data.
In this chapter, we use the data from web server logs as input into our scripts. This is simply for demonstration purposes. The scripts and techniques can easily be modified to work with nearly any type of data.
We introduce `sort`, `head`, and `uniq` to limit the data we need to process and display. The file in Example 7-1 will be used for command examples.
```
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html
```
The `sort` command is used to rearrange a text file into numerical and alphabetical order. By default, `sort` arranges lines in ascending order, starting with numbers and then letters. Uppercase letters will be placed before their corresponding lowercase letters unless otherwise specified.
- `-f` — Ignore case.
- `-n` — Use numerical ordering, so that 1, 2, 3 all sort before 10. (In the default alphabetic sorting, 2 and 3 would appear after 10.)
- `-k` — Sort based on a subset of the data (key) in a line. Fields are delimited by whitespace.
- `-o` — Write output to a specified file.
To sort file1.txt by the filename column (the third field) and ignore the IP address column, you would use the following:

```
sort -k 3 file1.txt
```
You can also sort on a subset of a field. To sort by the second octet in the IP address:

```
sort -k 2.5,2.7 file1.txt
```

This will sort using characters 5 through 7 of the second field.
We use an Apache web server access log for most of the examples in this chapter. This type of log records page requests made to the web server, when they were made, and who made them. A sample of a typical Apache Combined Log Format file can be seen in Example 7-2. The full logfile is referenced as access.log in this book and can be downloaded from the book’s web page.
```
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1"
200 7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64;
x64; rv:56.0) Gecko/20100101 Firefox/56.0"
```
Web server logs are used simply as an example. The techniques introduced throughout this chapter can be applied to analyze a variety of data types.
The Apache web server log fields are described in Table 7-1.
Field | Description | Field number |
---|---|---|
`192.168.0.11` | IP address of the host that requested the page | 1 |
`-` | RFC 1413 Ident protocol identifier (`-` if not present) | 2 |
`-` | The HTTP authenticated user ID (`-` if not present) | 3 |
`[12/Nov/2017:15:54:39 -0500]` | Date, time, and GMT offset (time zone) | 4–5 |
`GET /request-quote.html` | The page that was requested | 6–7 |
`HTTP/1.1` | The HTTP protocol version | 8 |
`200` | The status code returned by the web server | 9 |
`7326` | The size of the file returned in bytes | 10 |
`http://192.168.0.35/support.html` | The referring page | 11 |
`Mozilla/5.0 (Windows NT 6.3; Win64…` | User agent identifying the browser | 12+ |
There is a second type of Apache access log known as the Common Log Format. The format is the same as the Combined Log Format except it does not contain fields for the referring page or user agent. See the Apache HTTP Server Project website for additional information on the Apache log format and configuration.
The status codes mentioned in Table 7-1 (field 9) are often very informative and let you know how the web server responded to a given request. Common codes are shown in Table 7-2.
Code | Description |
---|---|
200 | OK |
401 | Unauthorized |
404 | Page Not Found |
500 | Internal Server Error |
502 | Bad Gateway |
For a complete list of codes, see the Hypertext Transfer Protocol (HTTP) Status Code Registry.
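With the status codes in hand, a quick way to profile a log is to tally how often each code appears. Here is a minimal sketch using the layout from Table 7-1 (status in field 9); a few sample lines and the `/tmp/sample.log` path are stand-ins for a real access.log:

```shell
# Tally HTTP status codes in an Apache combined-format log.
# Sample lines stand in for access.log; field 9 is the status code.
cat <<'EOF' > /tmp/sample.log
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /a.html HTTP/1.1" 200 7326 "-" "Mozilla"
192.168.0.37 - - [12/Nov/2017:15:54:40 -0500] "GET /b.html HTTP/1.1" 404 221 "-" "Mozilla"
192.168.0.37 - - [12/Nov/2017:15:54:41 -0500] "GET /c.html HTTP/1.1" 404 221 "-" "Mozilla"
EOF
# sort | uniq -c groups identical codes; sort -rn puts the most common first
cut -d' ' -f9 /tmp/sample.log | sort | uniq -c | sort -rn
```

This is the same count-and-rank pattern used throughout the rest of the chapter.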
When analyzing data for the first time, it is often beneficial to start by looking at the extremes: the things that occurred the most or least frequently, the smallest or largest data transfers, etc. For example, consider the data that you can collect from web server logfiles. An unusually high number of page accesses could indicate scanning activity or a denial-of-service attempt. An unusually high number of bytes downloaded by a host could indicate site cloning or data exfiltration.
To control the arrangement and display of data, use the `sort`, `head`, and `tail` commands at the end of a pipeline:

```
… | sort -k 2.1 -rn | head -15
```
This pipes the output of a script into the `sort` command and then pipes that sorted output into `head`, which will print the top 15 (in this case) lines. The `sort` command here is using as its sort key (`-k`) the second field beginning at its first character (`2.1`). Moreover, it is doing a reverse sort (`-r`), and the values will be sorted like numbers (`-n`). Why a numerical sort? So that 2 shows up between 1 and 3, and not between 19 and 20 (which is alphabetical order).
By using `head`, we take the first lines of the output. We could get the last few lines by piping the output from the `sort` command into `tail` instead of `head`. Using `tail -15` would give us the last 15 lines. The other way to do this would be simply to remove the `-r` option from `sort` so that it does an ascending rather than descending sort.
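The difference between the default alphabetic ordering and the `-n` numerical ordering is easy to see with a few sample values:

```shell
# Default (alphabetic) ordering: "10" sorts before "2"
printf '%s\n' 10 2 1 | sort
# Numeric ordering with -n: values sorted by magnitude
printf '%s\n' 10 2 1 | sort -n
```

The first pipeline prints 1, 10, 2; the second prints 1, 2, 10.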
A typical web server log can contain tens of thousands of entries. By counting each time a page was accessed, or by which IP address it was accessed from, you can gain a better understanding of general site activity. Interesting entries can include the following:
- A high number of requests returning the 404 (Page Not Found) status code for a specific page; this can indicate broken hyperlinks.
- A high number of requests from a single IP address returning the 404 status code; this can indicate probing activity looking for hidden or unlinked pages.
- A high number of requests returning the 401 (Unauthorized) status code, particularly from the same IP address; this can indicate an attempt at bypassing authentication, such as brute-force password guessing.
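As a sketch of how the last check looks in practice (assuming the combined-log layout described earlier, with the status code in field 9), you can rank sources of 401 responses; the sample lines and `/tmp/auth.log` path here are stand-ins for a real log:

```shell
# Count 401 (Unauthorized) responses per source IP address.
cat <<'EOF' > /tmp/auth.log
10.0.0.5 - - [12/Nov/2017:02:01:00 -0500] "GET /admin HTTP/1.1" 401 221 "-" "curl/7.58.0"
10.0.0.5 - - [12/Nov/2017:02:01:01 -0500] "GET /admin HTTP/1.1" 401 221 "-" "curl/7.58.0"
192.168.0.11 - - [12/Nov/2017:09:00:00 -0500] "GET /a.html HTTP/1.1" 200 7326 "-" "Mozilla"
EOF
# field 9 is the status; field 1 is the requesting IP address
awk '$9 == 401 {print $1}' /tmp/auth.log | sort | uniq -c | sort -rn
```

An IP address at the top of this list with a large count is a candidate for a brute-force investigation.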
To detect this type of activity, we need to be able to extract key fields, such as the source IP address, and count the number of times they appear in a file. To accomplish this, we will use the `cut` command to extract the field and then pipe the output into our new tool, countem.sh, which is shown in Example 7-3.
```bash
#!/bin/bash -
#
# Cybersecurity Ops with bash
# countem.sh
#
# Description:
# Count the number of instances of an item using bash
#
# Usage:
# countem.sh < inputfile
#
declare -A cnt                          # assoc. array

while read id xtra
do
    let cnt[$id]++
done

# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"
do
    printf '%d %s\n' "${cnt[$id]}" "$id"
done
```
Since we don't know what IP addresses (or other strings) we might encounter, we will use an associative array (also known as a hash table or dictionary), declared here with the `-A` option, so that we can use whatever string we read as our index.

The associative array feature is found in bash 4.0 and higher. In such an array, the index doesn't have to be a number, but can be any string. So you can index the array by the IP address and thus count the occurrences of that IP address. If you are using something older than bash 4.0, Example 7-4 is an alternate script that uses `awk` instead.
The array references are like others in bash, using the `${var[index]}` syntax to reference an element of the array. To get all the different index values that have been used (the "keys" if you think of these arrays as (key, value) pairings), use `${!cnt[@]}`.
Although we expect only one word of input per line, we put the variable `xtra` there to capture any other words that appear on the line. Each variable on a `read` command gets assigned the corresponding word from the input (i.e., the first variable gets the first word, the second variable gets the second word, and so on), but the last variable gets any and all remaining words. On the other hand, if there are fewer words of input on a line than there are variables on the `read` command, then those extra variables are set to the empty string. So for our purposes, if there are extra words on the input line, they'll all be assigned to `xtra`, but if there are no extra words, `xtra` will be given the value of the null string (which won't matter either way because we don't use it).
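This behavior of `read` is easy to verify directly (the sample strings here are arbitrary):

```shell
# Extra input words collect in the last variable; missing words leave it empty.
read -r id xtra <<< "192.168.0.11 GET /index.html"
echo "id=$id xtra=$xtra"     # xtra holds "GET /index.html"
read -r id xtra <<< "192.168.0.11"
echo "id=$id xtra=$xtra"     # xtra is the empty string
```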
Here we use that string as the index and increment its previous value. For the first use of the index, the previous value will be unset, which will be taken as zero.
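You can see this unset-counts-as-zero behavior in a short experiment:

```shell
declare -A cnt
let cnt[first]++ || true   # first increment: the unset value is treated as 0
                           # (let returns nonzero when the expression evaluates
                           # to 0, hence the || true guard for set -e contexts)
let cnt[first]++
echo "${cnt[first]}"       # prints 2
```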
This syntax lets us iterate over all the various index values that we encountered. Note, however, that the order is not guaranteed to be alphabetical or in any other specific order due to the nature of the hashing algorithm for the index values.
In printing out the value and key, we put the values inside quotes so that we always get a single value for each argument—even if that value had a space or two inside it. It isn’t expected to happen with our use of this script, but such coding practices make the scripts more robust when used in other situations.
And Example 7-4 shows another version, this time using `awk`.
```bash
# Cybersecurity Ops with bash
# countem.awk
#
# Description:
# Count the number of instances of an item using awk
#
# Usage:
# countem.awk < inputfile
#
awk '{ cnt[$1]++ }
END { for (id in cnt) {
          printf "%d %s\n", cnt[id], id
      }
}'
```
Both will work nicely in a pipeline of commands like this:

```
cut -d' ' -f1 logfile | bash countem.sh
```
The `cut` command is not really necessary here for either version. Why? Because the awk script explicitly references the first field (with `$1`), and in the shell script it's because of how we coded the `read` command. So we can run it like this:

```
bash countem.sh < logfile
```
For example, to count the number of times an IP address made an HTTP request that resulted in a 404 (Page Not Found) error:

```
$ awk '$9 == 404 {print $1}' access.log | bash countem.sh
1 192.168.0.36
2 192.168.0.37
1 192.168.0.11
```
You could also use `grep 404 access.log` and pipe it into countem.sh, but that would include lines where 404 appears in other places (e.g., the byte count, or part of a file path). The use of `awk` here restricts the counting to lines where the returned status (the ninth field) is 404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the total number of times each IP address made a request that resulted in a 404 error.
To begin analysis of the example access.log file, you can start by looking at the hosts that accessed the web server. You can use the Linux `cut` command to extract the first field of the logfile, which contains the source IP address, and then pipe the output into the countem.sh script. The exact command and output are shown here.

```
$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
```
If you do not have countem.sh available, you can use the `uniq` command's `-c` option to achieve similar results, but it will require an extra pass through the data using `sort` to work properly.

```
$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
```
Next, you can further investigate by looking at the host that had the most requests, which, as can be seen in the preceding output, is IP address `192.168.0.37`, with 111. You can use `awk` to filter on the IP address, then pipe that into `cut` to extract the field that contains the request, and finally pipe that output into countem.sh to provide the total number of requests for each page:

```
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/theme-background/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
.
.
.
14 /files/theme/custom49c2.js?1490908488
```
The activity of this particular host is unimpressive, appearing to be standard web-browsing behavior. If you take a look at the host with the next highest number of requests, you will see something a little more interesting:
```
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
.
.
.
1 /uploads/2/9/1/4/29147191/601239_orig.png
```
This output indicates that host `192.168.0.36` accessed nearly every page on the website exactly one time. This type of activity often indicates web-crawler or site-cloning activity. If you take a look at the user-agent string provided by the client, it further supports this conclusion:

```
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
```
The user agent identifies itself as `HTTrack`, which is a tool used to download or clone websites. While not necessarily malicious, it is interesting to note during analysis.
You can find additional information on HTTrack at the HTTrack website.
Rather than just count the number of times an IP address or other item occurs, what if you wanted to know the total byte count that has been sent to an IP address—or which IP addresses have requested and received the most data?
The solution is not that much different from countem.sh; you just need a few small changes. First, you need more columns of data, so you tweak the input filter (the `cut` command) to extract two columns (IP address and byte count) rather than just the IP address. Second, you change the calculation from an increment (`let cnt[$id]++`), a simple count, to a sum of that second field of data (`let cnt[$id]+=$data`).
The pipeline to invoke this will now extract two fields from the logfile, the first and the tenth:

```
cut -d' ' -f 1,10 access.log | bash summer.sh
```
The script summer.sh, shown in Example 7-5, reads in two columns of data. The first column consists of index values (in this case, IP addresses) and the second column is a number (in this case, number of bytes sent by the IP address). Every time the script finds a repeat IP address in the first column, it then adds the value of the second column to the total byte count for that IP address, thus totaling the number of bytes sent by the IP address.
```bash
#!/bin/bash -
#
# Cybersecurity Ops with bash
# summer.sh
#
# Description:
# Sum the total of field 2 values for each unique field 1
#
# Usage: ./summer.sh
#   input format: <name> <number>
#
declare -A cnt                          # assoc. array

while read id count
do
    let cnt[$id]+=$count
done

for id in "${!cnt[@]}"
do
    printf "%-15s %8d\n" "${id}" "${cnt[${id}]}"
done
```
We have also made a few changes to the output format: a field size of 15 characters for the first string (the IP address in our sample data), left-justified (via the minus sign), and 8 digits for the sum values. If a sum is larger, the larger number is printed; if a string is longer, it is printed in full. We've done this to get the data to align, by and large, nicely in columns for readability.
You can run summer.sh against the example access.log file to get an idea of the total amount of data requested by each host. To do this, use `cut` to extract the IP address and bytes-transferred fields, and then pipe the output into summer.sh:

```
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
192.168.0.36     4371198
192.168.0.37     2575030
192.168.0.11     2537662
192.168.0.14     2876088
192.168.0.26      665693
```
These results can be useful in identifying hosts that have transferred unusually large amounts of data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify such a host, the next step would be to review the specific pages and files accessed by the suspicious host to try to classify it as malicious or benign.
You can take counting one step further by providing a more visual display of the results. You can take the output from countem.sh or summer.sh and pipe it into yet another script, one that will produce a histogram-like display of the results.
The script to do the printing will take the first field as the index to an associative array, and the second field as the value for that array element. It will then iterate through the array and print a number of hashtags to represent the count, scaled so that the largest count in the list gets 50 `#` symbols.
```bash
#!/bin/bash -
#
# Cybersecurity Ops with bash
# histogram.sh
#
# Description:
# Generate a horizontal bar chart of specified data
#
# Usage: ./histogram.sh
#   input format: label value
#
function pr_bar ()
{
    local -i i raw maxraw scaled
    raw=$1
    maxraw=$2
    (( scaled=(MAXBAR*raw)/maxraw ))
    # min size guarantee
    (( raw > 0 && scaled == 0 )) && scaled=1

    for (( i=0; i < scaled; i++ )) ; do printf '#' ; done
    printf '\n'
}  # pr_bar

#
# "main"
#
declare -A RA
declare -i MAXBAR max

max=0
MAXBAR=50    # how large the largest bar should be

while read labl val
do
    let RA[$labl]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
done

# scale and print it
for labl in "${!RA[@]}"
do
    printf '%-20.20s ' "$labl"
    pr_bar ${RA[$labl]} $max
done
```
We define a function to draw a single bar of the histogram.
This definition must be encountered before a call to the function can be made, so it makes sense to put function definitions at the front of our script. We will be reusing this function in a future script, so we could have put it in a separate file and included it here with a `source` command, but we didn't.
We declare all these variables as local because we don't want them to interfere with variable names in the rest of this script (or any others, if we copy/paste this script to use elsewhere). We declare all these variables as integers (that's the `-i` option) because we are going to only compute values with them and not use them as strings.
The computation is done inside double parentheses. Inside those, we don't need to use the `$` to indicate "the value of" each variable name.
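A quick check of this arithmetic syntax in isolation, with arbitrary sample values:

```shell
MAXBAR=50 raw=3 maxraw=10
(( scaled=(MAXBAR*raw)/maxraw ))   # no $ needed inside (( )); integer math only
echo "$scaled"                     # prints 15
```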
This is an "if-less" `if` statement. If the expression inside the double parentheses is true, then, and only then, is the second expression (the assignment) executed. This guarantees that `scaled` is never zero when the raw value is nonzero. Why? Because we'd like something to show up in that case.
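The same short-circuit pattern can be tried on its own:

```shell
raw=1 scaled=0
(( raw > 0 && scaled == 0 )) && scaled=1   # bump a zero-width bar up to 1
echo "$scaled"                             # prints 1
```

When `raw` is 0, the test is false, the assignment is skipped, and `scaled` stays 0.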
The main part of the script begins with a declaration of the `RA` array as an associative array.
Here we reference the associative array by using the label, a string, as its index.
Because the array is not indexed by numbers, we can't just count integers and use them as indices. This construct gives all the various strings that were used as an index to the array, one at a time, in the `for` loop.
We use the label as an index one more time to get the count and pass it as the first parameter to our `pr_bar` function.
Note that the items don't appear in the same order as the input. That's because the hashing algorithm for the key (the index) doesn't preserve ordering. You could take this output and pipe it into yet another `sort`, or you could take a slightly different approach.

Example 7-7 is a version of the histogram script that preserves order, by not using an associative array. This might also be useful on older versions of bash (pre-4.0), prior to the introduction of associative arrays. Only the "main" part of the script is shown, as the function `pr_bar` remains the same.
```bash
#!/bin/bash -
#
# Cybersecurity Ops with bash
# histogram_plain.sh
#
# Description:
# Generate a horizontal bar chart of specified data without
# using associative arrays, good for older versions of bash
#
# Usage: ./histogram_plain.sh
#   input format: label value
#
declare -a RA_key RA_value
declare -i max ndx

max=0
MAXBAR=50    # how large the largest bar should be
ndx=0

while read labl val
do
    RA_key[$ndx]=$labl
    RA_value[$ndx]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
    let ndx++
done

# scale and print it
for (( j=0; j < ndx; j++ ))
do
    printf "%-20.20s " ${RA_key[$j]}
    pr_bar ${RA_value[$j]} $max
done
```
This version of the script avoids the use of associative arrays, in case you are running an older version of bash (prior to 4.x), such as on macOS systems. For this version, we use two separate arrays: one for the index value and one for the counts. Because they are normal arrays, we have to use an integer index, and so we will keep a simple count in the variable `ndx`.
Here the variable names are declared as arrays. The lowercase `a` says that they are arrays, but not of the associative variety. While not strictly necessary, this is good practice. Similarly, on the next line we use `-i` to declare these variables as integers, making them more efficient than undeclared shell variables (which are stored as strings). Again, this is not strictly necessary, as seen by the fact that we don't declare `MAXBAR` but just use it.
The key and value pairs are stored in separate arrays, but at the same index location. This approach is “brittle”—that is, easily broken, if changes to the script ever got the two arrays out of sync.
Now the `for` loop, unlike the previous script, is a simple counting of an integer from 0 to `ndx`. The variable `j` is used here so as not to interfere with the index in the `for` loop inside `pr_bar`, although we were careful enough inside the function to declare its version of `i` as local to the function. Do you trust it? Change the `j` to an `i` here and see if it still works (it does). Then try removing the local declaration and see if it fails (it does).
This approach with the two arrays does have one advantage. By using the numerical index for storing the label and the data, you can retrieve them in the order they were read in—in the numerical order of the index.
You can now visually see the hosts that transferred the largest number of bytes by extracting the appropriate fields from access.log, piping the results into summer.sh, and then into histogram.sh:
```
$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh
192.168.0.36         ##################################################
192.168.0.37         #############################
192.168.0.11         #############################
192.168.0.14         ################################
192.168.0.26         #######
```
Although this might not seem that useful for the small amount of sample data, being able to visualize trends is invaluable when looking across larger datasets.
In addition to looking at the number of bytes transferred by IP address or host, it is often interesting to look at the data by date and time. To do that, you can use the summer.sh script, but due to the format of the access.log file, you need to do a little more processing before you can pipe it into the script. If you use `cut` to extract the date/time and bytes-transferred fields, you are left with data that causes some problems for the script:

```
$ cut -d' ' -f4,10 access.log
[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112
```
As shown in the preceding output, the raw data starts with a `[` character. That causes a problem with the script because it denotes the beginning of an array in bash. To remedy that, you can use an additional iteration of the `cut` command with the `-c2-` option to remove the character. This option tells `cut` to extract the data by character, starting at position 2 and continuing to the end of the line (`-`). The corrected output with the square bracket removed is shown here:

```
$ cut -d' ' -f4,10 access.log | cut -c2-
12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
```
Alternatively, you can use `tr` in place of the second `cut`. The `-d` option deletes the specified character, in this case the square bracket.

```
cut -d' ' -f4,10 access.log | tr -d '['
```
You also need to determine how you want to group the time-based data: by day, month, year, hour, etc. You can do this by simply modifying the option for the second `cut` iteration. Table 7-3 illustrates the `cut` option to use to extract various forms of the date/time field. Note that these `cut` options are specific to Apache logfiles.
Date/time extracted | Example output | Cut option |
---|---|---|
Entire date/time | 12/Nov/2017:15:52:59 | `-c2-21` |
Month, day, and year | 12/Nov/2017 | `-c2-12` |
Month and year | Nov/2017 | `-c5-12` |
Full time | 15:52:59 | `-c14-21` |
Hour | 15 | `-c14-15` |
Year | 2017 | `-c9-12` |
The histogram.sh script can be particularly useful when looking at time-based data. For example, if your organization has an internal web server that is accessed only during working hours of 9:00 A.M. to 5:00 P.M., you can review the server log file on a daily basis via the histogram view to see whether spikes in activity occur outside normal working hours. Large spikes of activity or data transfer outside normal working hours could indicate exfiltration by a malicious actor. If any anomalies are detected, you can filter the data by that particular date and time and review the page accesses to determine whether the activity is malicious.
For example, if you want to see a histogram of the total amount of data that was retrieved on a certain day and on an hourly basis, you can do the following:
```
$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh
17                   ##
16                   ###########
15                   ############
19                   ##
18                   ##################################################
```
Here the access.log file is sent through `awk` to extract the entries from a particular date. Note the use of the like operator (`~`) instead of `==`, because field 4 also contains time information. Those entries are piped into `cut` to extract the date/time and bytes-transferred fields, and then piped into `cut` again to extract just the hour and the byte count. From there, the data is summed by hour using summer.sh and converted into a histogram using histogram.sh. The result is a histogram that displays the total number of bytes transferred each hour on November 12, 2017.
Pipe the output from the histogram script into `sort -n` to get the output in numerical (hour) order. Why is the sort needed? The scripts summer.sh and histogram.sh are both generating their output by iterating through the list of indices of their associative arrays. Therefore, their output will not likely be in a sensible order (but rather in an order determined by the internal hashing algorithm). If that explanation left you cold, just ignore it and remember to use a sort on the output.
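For example, with hour labels in the first column, a trailing `sort -n` puts histogram-style rows (shown here in hash order, as contrived sample lines) back into hour order:

```shell
printf '%s\n' '17 ##' '15 ############' '16 ###########' | sort -n
```

The output lists the 15, 16, and 17 rows in that order.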
If you want to have the output ordered by the amount of data, you’ll need to add the sort between the two scripts. You’ll also need to use histogram_plain.sh, the version of the histogram script that doesn’t use associative arrays.
Previously, IP address `192.168.0.37` was identified as the system that had the largest number of page requests. The next logical question is, what pages did this system request? With that answer, you can start to gain an understanding of what the system was doing on the server and categorize the activity as benign, suspicious, or malicious. To accomplish that, you can use `awk` and `cut` and pipe the output into countem.sh:

```
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5
14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html
```
Although this can be accomplished by piping together commands and scripts, that requires multiple passes through the data. This may work for many datasets, but it is too inefficient for extremely large datasets. You can streamline this by writing a bash script specifically designed to extract and count page accesses, and this requires only a single pass over the data. Example 7-8 shows this script.
```bash
#!/bin/bash -
#
# Cybersecurity Ops with bash
# pagereq.sh
#
# Description:
# Count the number of page requests for a given IP address using bash
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#
declare -A cnt

while read addr d1 d2 datim gmtoff getr page therest
do
    if [[ $1 == $addr ]] ; then
        let cnt[$page]+=1
    fi
done

for id in ${!cnt[@]}
do
    printf "%8d %s\n" ${cnt[$id]} $id
done
```
We declare `cnt` as an associative array so that we can use a string as the index to the array. In this program, we will be using the page address (the URL) as the index.
The `${!cnt[@]}` construct results in a list of all the different index values that have been encountered. Note, however, that they are not listed in any useful order.
Early versions of bash do not have associative arrays. You can use `awk` to do the same thing, counting the various page requests from a particular IP address, since `awk` has associative arrays.
```bash
# Cybersecurity Ops with bash
# pagereq.awk
#
# Description:
# Count the number of page requests for a given IP address using awk
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#
# count the number of page requests from an address ($1)
awk -v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }
END { for (id in cnt) {
          printf "%8d %s\n", cnt[id], id } }'
```
There are two very different `$1` variables on this line. The first `$1` is a shell variable and refers to the first argument supplied to this script when it is invoked. The second `$1` is an `awk` variable that refers to the first field of the input on each line. The first `$1` has been assigned to the `awk` variable `page` so that it can be compared to each `$1` of `awk` (that is, to each first field of the input data).
This simple syntax results in the variable `id` iterating over the index values of the `cnt` array. It is much simpler syntax than the shell's `"${!cnt[@]}"` syntax, but has the same effect.
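A tiny standalone comparison of the awk for-in iteration, using made-up page names:

```shell
# awk's for-in walks the keys of an associative array (order not guaranteed)
awk 'BEGIN { cnt["/index.html"]=2; cnt["/about.html"]=1
             for (id in cnt) printf "%d %s\n", cnt[id], id }'
```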
You can run pagereq.sh by providing the IP address you would like to search for and redirecting access.log as input:

```
$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5
      14 /files/theme/plugin49c2.js?1490908488
      14 /files/theme/mobile49c2.js?1490908488
      14 /files/theme/custom49c2.js?1490908488
      14 /files/main_styleaf0e.css?1509483497
       3 /consulting.html
```
On the web, a user-agent string is a small piece of textual information sent by a browser to a web server that identifies the client’s operating system, browser type, version, and other information. It is typically used by web servers to ensure page compatibility with the user’s browser. Here is an example of a user-agent string:
```
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
```
This user-agent string identifies the system as Windows NT version 6.3 (aka Windows 8.1), with 64-bit architecture, and using the Firefox browser.
The user agent string is interesting for two reasons: first, because of the significant amount of information it conveys, which can be used to identify the types of systems and browsers accessing the server; second, because it is configurable by the end user, which can be used to identify systems that may not be using a standard browser or may not be using a browser at all (i.e., a web crawler).
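Because the user-agent string is the sixth double-quote-delimited field of a combined-format line, you can pull it out with awk; here a single sample line stands in for access.log:

```shell
# Extract the user-agent string: split the line on double quotes,
# which makes the quoted user agent field number 6.
line='192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /a.html HTTP/1.1" 200 7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"'
echo "$line" | awk -F'"' '{print $6}'
```

This is the same field-splitting trick the upcoming useragents.sh script uses.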
You can identify unusual user agents by first compiling a list of known-good user agents. For the purposes of this exercise, we will use a very small list that is not specific to a particular version; see Example 7-10.
```
Firefox
Chrome
Safari
Edge
```
For a list of common user agent strings, visit the TechBlog site.
You can then read in a web server log and compare each line to each valid user agent until you get a match. If no match is found, it should be considered an anomaly and printed to standard output along with the IP address of the system making the request. This provides yet another vantage point into the data, identifying systems with unusual user agents, and another path to further explore.
```bash
#!/bin/bash -
#
# Cybersecurity Ops with bash
# useragents.sh
#
# Description:
# Read through a log looking for unknown user agents
#
# Usage: ./useragents.sh < <inputfile>
#   <inputfile> Apache access log
#
# mismatch - search through the array of known names
# returns 1 (false) if it finds a match
# returns 0 (true) if there is no match
function mismatch ()
{
    local -i i
    for (( i=0; i<$KNSIZE; i++ ))
    do
        [[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1
    done
    return 0
}

# read up the known ones
readarray -t KNOWN < "useragents.txt"
KNSIZE=${#KNOWN[@]}

# preprocess logfile (stdin) to pick out ipaddr and user agent
awk -F'"' '{print $1, $6}' |
while read ipaddr dash1 dash2 dtstamp delta useragent
do
    if mismatch "$useragent"
    then
        echo "anomaly: $ipaddr $useragent"
    fi
done
```
We will use a function for the core of this script. It will return a success (or “true”) if it finds a mismatch; that is, if it finds no match against the list of known user agents. This logic may seem a bit inverted, but it makes the if
statement containing the call to mismatch
read clearly.
Declaring our for
loop index as a local variable is good practice. It is not strictly necessary in this script but is a good habit.
There are two strings to compare: the input from the logfile and a line from the list of known user agents. To make for a very flexible comparison, we use the regex comparison operator (the =~
). The .*
(meaning “zero or more instances of any character”) placed on either side of the $KNOWN
array reference means that the known string can appear anywhere within the other string for a match.
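As a standalone sketch of this comparison (the strings here are hypothetical, not taken from the script):

```shell
ua='Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'
known='Firefox'

# the unquoted expansion on the right side is treated as a regex;
# the .* on either side lets the known name match anywhere in $ua
if [[ "$ua" =~ .*${known}.* ]]
then
    echo "match"
fi
```

Strictly speaking, bash's =~ operator already performs an unanchored match, so the surrounding .* is not required; it simply makes the "match anywhere" intent explicit.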
Each line of the file is added as an element to the array name specified. This gives us an array of known user agents. There are two identical ways to do this in bash: either readarray
, as used here, or mapfile
. The -t
option removes the trailing newline from each line read. The file containing the list of known user agents is specified here; modify as needed.
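A quick sketch showing that the two names are interchangeable, here reading from a here-string rather than a file:

```shell
# readarray and mapfile are two names for the same builtin;
# -t strips the trailing newline from each element
readarray -t list1 <<< $'Firefox\nChrome\nSafari\nEdge'
mapfile   -t list2 <<< $'Firefox\nChrome\nSafari\nEdge'

echo "${#list1[@]} elements"       # ${#array[@]} gives the element count
echo "${list1[0]} ... ${list2[3]}"
```

Both arrays end up with the same four elements, and ${#list1[@]} evaluates to 4, the same expression the script uses to compute KNSIZE.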
This computes the size of the array. It is used inside the mismatch
function to loop through the array. We calculate it here, once, outside our loop to avoid recomputing it every time the function is called.
The input string is a complex mix of words and quote marks. To capture the user agent string, we use the double quote as the field separator. Doing that, however, means that our first field contains more than just the IP address. By using the bash read
, we can parse on the spaces to get the IP address. The last argument of the read
takes all the remaining words so it can capture all the words of the user agent string.
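Here is a standalone sketch of that behavior, using a sample line shaped like the ones the awk preprocessing produces:

```shell
line='192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'

# read splits on whitespace; the last variable receives everything
# left over, i.e., the entire multiword user-agent string
read ipaddr dash1 dash2 dtstamp delta useragent <<< "$line"

echo "$ipaddr"
echo "$useragent"
```

After the read, $ipaddr holds 192.168.0.11 and $useragent holds the whole string beginning with Mozilla/5.0; the intermediate variables soak up the fields we don't need.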
When you run useragents.sh, it will output any user agent strings not found in the useragents.txt file:
$ bash useragents.sh < access.log
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
.
.
.
anomaly: 192.168.0.36 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
In this chapter, we looked at statistical analysis techniques to identify unusual and anomalous activity in logfiles. This type of analysis can provide you with insights into what occurred in the past. In the next chapter, we look at how to analyze logfiles and other data to provide insights into what is happening on a system in real time.
The following example uses cut
to print the first and tenth fields of the access.log file:
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
Replace the cut
command with the awk
command. Do you get the same results? What might be different about those two approaches?
Expand the histogram.sh script to include the count at the end of each histogram bar. Here is sample output:
192.168.0.37 ############################# 2575030
192.168.0.26 ####### 665693
Expand the histogram.sh script to allow the user to supply the option -s
that specifies the maximum bar size. For example, histogram.sh -s 25
would limit the maximum bar size to 25 #
characters. The default should remain at 50 if no option is given.
Modify the useragents.sh script to add some parameters:
Add code for an optional first parameter to be the filename of the known user agents. If not specified, default to the name useragents.txt
as it is currently used.
Add code for an -f
option to take an argument. The argument is the filename of the logfile to read rather than reading from stdin.
Modify the pagereq.sh script to not need an associative array but to work with a traditional array that uses a numerical index. Convert the IP address into a 10- to 12-digit number for that use. Caution: Don’t have leading zeros on the number, or the shell will attempt to interpret it as an octal number. Example: Convert “10.124.16.3” into “10124016003,” which can be used as a numerical index.
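One way to perform just the conversion step described above is with printf field widths. The function name ip2index is hypothetical, and this is only a sketch of the conversion, not a full solution to the exercise:

```shell
# convert a dotted-quad IP address into a numeric string usable
# as a plain (non-associative) array index
ip2index ()
{
    local a b c d
    IFS='.' read a b c d <<< "$1"
    # pad the last three octets to three digits each; the first octet
    # is printed unpadded, so the result never starts with a leading
    # zero (which bash arithmetic would treat as octal)
    printf '%d%03d%03d%03d' "$a" "$b" "$c" "$d"
}

ip2index "10.124.16.3"    # prints 10124016003
```

Because the first octet is never zero in a routable address, the result is a 10- to 12-digit decimal number with no leading zero, matching the example in the exercise.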
Visit the Cybersecurity Ops website for additional resources and the answers to these questions.