Common Log Format

Problem

You need a regular expression that matches each line in the log files produced by a web server that uses the Common Log Format.[11] For example:

127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] "GET /regexcookbook.html HTTP/1.1" 200 2326

The regular expression should have a capturing group for each field, to allow the application using the regular expression to easily process the fields of each entry in the log.

Solution

^(?<client>\S+) \S+ (?<userid>\S+) \[(?<datetime>[^\]]+)\]↵
 "(?<method>[A-Z]+) (?<request>[^ "]+)? HTTP/[0-9.]+"↵
 (?<status>[0-9]{3}) (?<size>[0-9]+|-)
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\]↵
 "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+"↵
 (?P<status>[0-9]{3}) (?P<size>[0-9]+|-)
Regex options: ^ and $ match at line breaks
Regex flavors: PCRE 4, Perl 5.10, Python
^(\S+) \S+ (\S+) \[([^\]]+)\] "([A-Z]+) ([^ "]+)? HTTP/[0-9.]+"↵
 ([0-9]{3}) ([0-9]+|-)
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
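
As a minimal sketch of how an application might use the Python-flavor regular expression above, the following snippet applies it to the sample entry from the Problem section. The names CLF_PATTERN and parse_entry are ours, introduced only for illustration. The line-continuation markers are dropped and the regex is written as adjacent raw strings.

```python
import re

# The Python-flavor regex from the solution, split over adjacent raw strings.
CLF_PATTERN = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)',
    re.MULTILINE,  # "^ and $ match at line breaks"
)

def parse_entry(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = CLF_PATTERN.search(line)
    return match.groupdict() if match else None

sample = ('127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
          '"GET /regexcookbook.html HTTP/1.1" 200 2326')
fields = parse_entry(sample)
print(fields)
```

Every captured value is a string. Converting the status, size, or timestamp to another type is left to the application.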

Discussion

Creating a regular expression to match any entry in a log file is generally very straightforward. It certainly is when the log format puts the same information in each entry, just with different values. This is true for web servers that save access logs using the Common Log Format, such as Apache. Each line in the log file is one log entry, and each entry consists of seven fields, delimited with spaces:

  1. IP address or hostname of the client that made the request.

  2. RFC 1413 client ID. Rarely used. A hyphen indicates the client ID is not available.

  3. The username when using HTTP authentication, and a hyphen when not using HTTP authentication.

  4. The time the request was received, between square brackets. Usually in the format [day/month/year:hour:minute:second timezone] on a 24-hour clock.

  5. The request, between double quotes, with three pieces of information, delimited by spaces:

    1. The request method,[12] such as GET, POST, or HEAD.

    2. The requested resource, which is the part of the URL after the hostname used for the request.

    3. The protocol version, which is either HTTP/1.0 or HTTP/1.1.

  6. The status code,[13] which is a three-digit number such as 200 (meaning “OK”) or 404 (“not found”).

  7. The size of the data returned to the client, excluding the headers. This can be a hyphen or zero if no response was returned.

We don’t really need to know all these details to create a regular expression that successfully matches each entry. We can assume that the web server will write only valid information to the log. Our regular expression doesn’t need to filter the log by matching only entries with certain values, because the application that uses the regular expression will do that.

So we really only need to know how the entries and fields are delimited. Then we can match each field separately into its own capturing group. Entries are delimited by line breaks, and fields are delimited by spaces. But the date and request fields can contain spaces, so we’ll need to handle those two with a bit of extra care.

The first three fields cannot contain spaces. We can easily match them with the shorthand character class \S+, which matches one or more characters that are not spaces or line breaks. Because the client ID is rarely used, we do not grab it with a capturing group.

The date field is always surrounded by square brackets, which are metacharacters in a regular expression. To match literal brackets, we escape them: \[ and \]. Strictly speaking, the closing bracket does not need to be escaped outside of a character class. But since we will put a character class between the literal brackets, escaping the closing bracket makes the regex easier to read. The negated character class [^\]]+ matches one or more characters that are not closing brackets. In JavaScript, the closing bracket must be escaped to include it as a literal in a character class. The other flavors do not require the closing bracket to be escaped when it immediately follows the opening bracket or negating caret, but we escape it anyway for clarity. We put the parentheses around the negated character class, between the escaped literal brackets: \[([^\]]+)\]. This makes our regex capture the date without the brackets around it, so the application that processes the regex matches does not have to strip off the brackets when parsing the date.
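
To illustrate the effect of placing the capturing group inside the escaped brackets, here is a small Python sketch that applies only the date part of the regex to a fragment of the sample entry. The variable names are ours.

```python
import re

# \[ and \] match the literal brackets; the group captures what lies between them.
date_field = re.compile(r'\[([^\]]+)\]')

match = date_field.search('jg [27/Apr/2012:11:27:36 +0700] "GET')
captured = match.group(1)  # date and time, without the brackets
whole = match.group(0)     # overall match, brackets included
print(captured)
print(whole)
```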

Because the request actually contains three bits of information, we use three separate capturing groups to match it. [A-Z]+ matches any uppercase word, which covers all possible request methods. The requested resource can be pretty much anything. [^ "]+ matches anything but spaces and quotes. HTTP/[0-9.]+ matches the HTTP version, allowing any combination of digits and dots for the version.

The status code consists of three digits, which we easily match with [0-9]{3}. The data size is a number or a hyphen, easily matched with [0-9]+|-. The capturing group takes care of grouping the two alternatives.
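
Since the size group can hold either digits or a hyphen, an application that wants numbers has to handle both alternatives. The sketch below shows one possible way in Python. Treating a hyphen as zero bytes is our assumption, not something the log format prescribes, and status_and_size is a hypothetical helper.

```python
import re

# Only the tail of the regex: closing quote, status code, and size.
tail = re.compile(r'" ([0-9]{3}) ([0-9]+|-)')

def status_and_size(text):
    """Return (status, size) as integers; a hyphen is treated as size 0 (our assumption)."""
    status, size = tail.search(text).groups()
    return int(status), 0 if size == '-' else int(size)

with_size = status_and_size('HTTP/1.1" 200 2326')
without_size = status_and_size('HTTP/1.1" 304 -')
print(with_size, without_size)
```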

We put a caret at the start of the regular expression and turn on the option to make it match after line breaks, to make sure that we start matching each log entry at the start of the line. This will significantly improve the performance of the regular expression on the off chance that the log file contains some invalid lines. The regex will attempt to match such lines only once, at the start of the line, rather than at every position in the line.
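
The following Python sketch shows the caret combined with the "match at line breaks" option (re.MULTILINE in Python) when the whole log is held in one string. Only the first entry comes from this recipe. The invalid line and the third entry are made up for illustration.

```python
import re

pattern = re.compile(
    r'^(\S+) \S+ (\S+) \[([^\]]+)\] "([A-Z]+) ([^ "]+)? HTTP/[0-9.]+" '
    r'([0-9]{3}) ([0-9]+|-)',
    re.MULTILINE,  # ^ matches at the start of every line, not only of the string
)

log = (
    '127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
    '"GET /regexcookbook.html HTTP/1.1" 200 2326\n'
    'this line is not a valid log entry\n'
    '127.0.0.1 - - [27/Apr/2012:11:27:40 +0700] '
    '"GET /missing.html HTTP/1.1" 404 -\n'
)

# Group 6 is the status code; the invalid line produces no match.
statuses = [m.group(6) for m in pattern.finditer(log)]
print(statuses)
```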

We did not put a dollar at the end of the line to force each log entry to end at the end of a line. If a log entry has more information, the regex simply ignores this. This allows our regular expression to work equally well on extended logs such as the Combined Log Format, described in the next recipe.
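
As a sketch of what leaving out the dollar means in practice, the Python snippet below runs the same regex on an entry with two extra quoted fields appended, in the style of the Combined Log Format. The referrer and user agent values are invented for this example.

```python
import re

pattern = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)',
    re.MULTILINE,
)

# Hypothetical entry with two trailing fields that the regex does not describe.
combined = ('127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
            '"GET /regexcookbook.html HTTP/1.1" 200 2326 '
            '"http://www.example.com/" "ExampleBrowser/1.0"')

match = pattern.search(combined)
ignored = combined[match.end():]  # text after the size field is simply not matched
print(match.group('size'))
print(ignored)
```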

Our final regular expression has seven capturing groups. To make it easy to keep track of the groups, we use named capture for the flavors that support it. JavaScript (without XRegExp) and Ruby 1.8 are the only two flavors in this book that do not support named capture. For those flavors, we use numbered groups instead.

Variations

^(?<client>\S+) \S+ (?<userid>\S+) \[(?<day>[0-9]{2})/(?<month>↵
[A-Za-z]+)/(?<year>[0-9]{4}):(?<hour>[0-9]{2}):(?<min>[0-9]{2}):↵
(?<sec>[0-9]{2}) (?<zone>[-+][0-9]{4})\] "(?<method>[A-Z]+) ↵
(?<file>[^#? "]+)(?<parameters>[#?][^ "]*)? HTTP/[0-9.]+" ↵
(?<status>[0-9]{3}) (?<size>[0-9]+|-)
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<day>[0-9]{2})/(?P<month>↵
[A-Za-z]+)/(?P<year>[0-9]{4}):(?P<hour>[0-9]{2}):(?P<min>[0-9]{2}):↵
(?P<sec>[0-9]{2}) (?P<zone>[-+][0-9]{4})\] "(?P<method>[A-Z]+) ↵
(?P<file>[^#? "]+)(?P<parameters>[#?][^ "]*)? HTTP/[0-9.]+" ↵
(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)
Regex options: ^ and $ match at line breaks
Regex flavors: PCRE 4, Perl 5.10, Python
^(\S+) \S+ (\S+) \[([0-9]{2})/([A-Za-z]+)/([0-9]{4}):([0-9]{2}):↵
([0-9]{2}):([0-9]{2}) ([-+][0-9]{4})\] "([A-Z]+) ([^#? "]+)↵
([#?][^ "]*)? HTTP/[0-9.]+" ([0-9]{3}) ([0-9]+|-)
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The regular expression presented as the solution in this recipe just matches all the fields, leaving the processing to the application that uses the regex. Depending on what the application needs to do with the log entries, it may be helpful to use a regular expression that provides some more detail.

In this variation, we match all the elements in the timestamp separately, making it easier for the application to convert the matched text into an actual date and time value. We also split up the requested object into separate “file” and “parameters” parts. If the requested object contains a ? or # character, the “file” group will capture the text before the ? or #. The “parameters” group will capture the ? or # and anything that follows. This will make it easier for the application to ignore parameters when calculating page counts, for example.
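
The following Python sketch shows one way an application might turn the separate timestamp groups into a date and time value and read the “file” and “parameters” groups. The helper to_datetime, the MONTHS table, and the ?page=2 query string are ours. The table assumes the English three-letter month abbreviations seen in the sample entry.

```python
import re
from datetime import datetime, timedelta, timezone

DETAILED = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<day>[0-9]{2})/(?P<month>'
    r'[A-Za-z]+)/(?P<year>[0-9]{4}):(?P<hour>[0-9]{2}):(?P<min>[0-9]{2}):'
    r'(?P<sec>[0-9]{2}) (?P<zone>[-+][0-9]{4})\] "(?P<method>[A-Z]+) '
    r'(?P<file>[^#? "]+)(?P<parameters>[#?][^ "]*)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)',
    re.MULTILINE,
)

# Assumption: month names are English three-letter abbreviations.
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

def to_datetime(match):
    """Build a timezone-aware datetime from the separate timestamp groups."""
    zone = match.group('zone')  # for example '+0700'
    offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[3:5]))
    if zone[0] == '-':
        offset = -offset
    return datetime(
        int(match.group('year')), MONTHS[match.group('month')],
        int(match.group('day')), int(match.group('hour')),
        int(match.group('min')), int(match.group('sec')),
        tzinfo=timezone(offset),
    )

# The sample entry with a made-up query string appended to the request.
line = ('127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
        '"GET /regexcookbook.html?page=2 HTTP/1.1" 200 2326')

m = DETAILED.search(line)
stamp = to_datetime(m)
print(stamp.isoformat())
print(m.group('file'), m.group('parameters'))
```

When the request has no ? or #, the “parameters” group does not participate in the match, and Python returns None for it.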

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors such as the caret. Recipe 2.11 explains named capturing groups.

Chapter 3 has code snippets that you can use with this regular expression to process log files in your application. If your application loads the whole log file into a string, then Recipe 3.11 shows code to iterate over all the regex matches. If your application reads the file line by line, follow Recipe 3.7 to get the regex match on each line. Either way, Recipe 3.9 shows code to get the text matched by the capturing groups.
