Combined Log Format

Problem

You need a regular expression that matches each line in the log files produced by a web server that uses the Combined Log Format.[14] For example:

127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] "GET /regexcookbook.html HTTP/1.1" 200 2326 "http://www.regexcookbook.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

Solution

^(?<client>\S+) \S+ (?<userid>\S+) \[(?<datetime>[^\]]+)\] ↵
"(?<method>[A-Z]+) (?<request>[^ "]+)? HTTP/[0-9.]+" ↵
(?<status>[0-9]{3}) (?<size>[0-9]+|-) "(?<referrer>[^"]*)" ↵
"(?<useragent>[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] ↵
"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" ↵
(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" ↵
"(?P<useragent>[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: PCRE 4, Perl 5.10, Python
^(\S+) \S+ (\S+) \[([^\]]+)\] "([A-Z]+) ([^ "]+)? HTTP/[0-9.]+" ↵
([0-9]{3}) ([0-9]+|-) "([^"]*)" "([^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

The Combined Log Format is the same as the Common Log Format, with two extra fields added at the end of each entry. The first is the referring URL and the second is the user agent. Both appear as double-quoted strings, so each can be matched with "[^"]*". Placing the capturing group around only the [^"]* retrieves the referrer or user agent without the enclosing quotes.
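
As an illustration, here is a minimal Python sketch that applies the Python-flavor regex to the sample entry from the Problem section, with the literal spaces and backslash escapes written out in full. The names COMBINED_LOG, SAMPLE, and parse_entries are hypothetical, chosen only for this example.

```python
import re

# Python-flavor pattern with named groups. re.MULTILINE corresponds to the
# "^ and $ match at line breaks" option, so every line of a log is tried.
COMBINED_LOG = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\]'
    r' "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+"'
    r' (?P<status>[0-9]{3}) (?P<size>[0-9]+|-)'
    r' "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"',
    re.MULTILINE,
)

# The sample entry from the Problem section, split only for readability.
SAMPLE = (
    '127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
    '"GET /regexcookbook.html HTTP/1.1" 200 2326 '
    '"http://www.regexcookbook.com/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"'
)


def parse_entries(log_text):
    """Return one dict of named fields per matching log line."""
    return [m.groupdict() for m in COMBINED_LOG.finditer(log_text)]


entries = parse_entries(SAMPLE)
for entry in entries:
    # The groups sit inside the quotes, so neither value includes them.
    print(entry["referrer"])
    print(entry["useragent"])
```

Because the capturing groups enclose only the [^"]* part, the two printed values come back without their surrounding double quotes. All fields are returned as strings; converting status or size to integers (and handling a size of "-") is left to the caller.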

See Also

The previous recipe explains how to match each entry in a Common Log Format web server log.
