A Web Log Processor (weblog.pl)

The second example script is one that takes a log file, as generated by Web servers, and generates statistics about the information contained in that log file. Most Web servers keep files of this sort, which keep track of how many accesses (“hits”) have been made to a Web site, the files that were requested, the sites that requested them, and other information.

Many log file-analyzer programs already exist on the Web (and there are usually programs that come with the Web server), so this example isn't breaking any new ground. The statistics it generates are fairly simple, although this script could be easily modified to include just about any information that you'd like to include. It's a good starting point for processing Web logs, or a model to follow for processing log files from any other programs.

How It Works

The weblog.pl script is called with one argument: a log file. On many Web servers, these files are commonly called access_log, and follow what is known as the common log format. The script processes for a while (it'll print the date of the logs it's working on so you know it's still working), and then prints some results. Here's an example of the sort of output you can get (this example is from the logs on my own Web server, www.lne.com):

% weblog.pl access_log
Processing log....
Processing 09/Apr/1998
Processing 10/Apr/1998
Processing 11/Apr/1998
Processing 12/Apr/1998
Web log file Results:
Total Number of Hits: 55789
Total failed hits: 1803 (3.23%)
(successful) HTML files: 18264 (33.83%)
Number of unique hosts: 5911
Number of unique domains: 2121
Most popular files:
  /Web/index.html (2456 hits)
  /lemay/index.html (1711 hits)
  /Web/Title.gif (1685 hits)
  /Web/HTML3.2/3.2thm.gif (1669 hits)
  /Web/JavaProf/javaprof_thm.gif (1662 hits)
Most popular hosts:
  202.185.174.4 (487 hits)
  vader.integrinautics.com (440 hits)
  linea15.secsa.podernet.com.mx (437 hits)
  lobby.itmin.com (284 hits)
  pyx.net (256 hits)
Most popular domains:
  mindspring.com (3160 hits)
  aol.com (1808 hits)
  uu.net (792 hits)
  grid.net (684 hits)
  compuserve.com (565 hits)

This particular output shows only the top 5 files, hosts, and domains, to save space here. You can configure the script to print out any number of those statistics.

The difference between a host and a domain might not be readily apparent; a host is the full host name of the system that accessed the Web server, which might include dynamically assigned addresses and proxy servers. The host dialup124.servers.foo.com will be a different host from dialup567.servers.foo.com. The domain, on the other hand, is a larger group of hosts, usually consisting of two or three parts. foo.com is a domain, as is aol.com or demon.co.uk. The domain listings tend to collapse separate entries for hosts into major groups—all the hosts under aol.com's purview will show up as hits from aol.com in the domain list.

Note also that a single hit can be an HTML page, an image, a form submission, or any other file. There are usually significantly more raw hits than there are actual page accesses. This script points those out by keeping track of HTML hits separately from the total number of hits.

What a Web Log Looks Like

Because the weblog.pl script processes Web log files, it helps to know what those log files look like. Web log files are stored with one hit per line, and each line in what's called common log format (common because it's common to various Web servers). Most Web servers generate their log files in this format, or can be configured to do so (many servers use a superset of the common log format with more information in it). A line in a common log format log file might look something like this (here I'm showing it to you on two lines; actually, it only appears on one in real life):

proxy2bh.powerup.com.au - - [03/Apr/1998:00:09:02 -0800]
"GET /lemay/ HTTP/1.0" 200 4621

The various elements of each line in the log file are

  • The host name accessing the server (here proxy2bh.powerup.com.au).

  • The username of the person accessing the page, as discovered through ident (a Unix service for identifying users) or through the user signing in to your site. These two fields usually show up as two dashes (- -), meaning the username couldn't be determined.

  • The date and time the hit was made, in square brackets.

  • Inside quotes, the action for the hit (actually an HTTP request method): GET retrieves a file (form submissions can ride along in the URL), POST submits form data in the body of the request, and HEAD retrieves only the header information for a file.

  • After the action, the filename (or directory) that was requested, here /lemay/.

  • The version number of the protocol, here HTTP/1.0.

  • The return code for the hit; 200 is successful, 404 is “not found,” and so on.

  • The number of bytes transferred.

Not all these elements of the log file are interesting to a statistics generator script, of course, and a lot of them won't make any sense to you unless you know how Web servers work. But a few, such as the host, the date, the filename, and the return code, can be extracted and processed for each line of the file.
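To make the format concrete, here's a minimal sketch (not part of weblog.pl) that pulls those four fields out of the sample line above with a simple whitespace split. It's less robust than the regular expression the script itself uses later (a split can't cope with, say, spaces inside the request), but it shows where each field sits:

#!/usr/bin/perl -w
use strict;

# a quick demonstration: split the sample log line on whitespace
# and pick out the four fields weblog.pl cares about
my $line = 'proxy2bh.powerup.com.au - - [03/Apr/1998:00:09:02 -0800] '
         . '"GET /lemay/ HTTP/1.0" 200 4621';
my @parts = split ' ', $line;
my ($host, $date, $file, $code) = @parts[0, 3, 6, 8];
$date =~ s/^\[//;                  # strip the leading bracket
print "$host $date $file $code\n";
# prints: proxy2bh.powerup.com.au 03/Apr/1998:00:09:02 /lemay/ 200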

Building the Script

The path of execution for this script is easier to follow than the one for the address.pl script; there are basically only two major steps—process the log and generate the statistics. We do have a number of subroutines along the way to help, however.

In fact, all of the code for this script is contained in subroutines. The body of the code consists of a bunch of global variables and two subroutine calls: &process_log() and &print_results().

The global variables are used to store the various statistics and information about parts of the log file. Because many of these statistics are hashes, using local variables and passing around the data would become complicated. In this case, keeping the data global makes it easier to manage. The global data we keep track of includes

  • The number of hits, number of failed hits, and number of hits to HTML pages

  • A hash to store the various host names, and the number of times those hosts appear in the log

  • A hash to do the same for the various files in the log

In addition, two other global variables are worth mentioning:

  • The $topthings variable stores a number indicating how many entries you want to print for the “most popular” parts of the statistics. In the example output I showed you, $topthings was set to 5, which gives us some nice short output. Setting it to 20 will print the top 20 files, hosts, and domains.

  • The $default variable should be set to the default HTML file for your Web server, often called index.html or home.html. This is the file that serves as the main file for a directory when the user doesn't ask for a specific file. Usually it's index.html.

These two variables determine how the script itself will behave. Although we could have put these variables deep inside the program, putting them up here, right up front, enables you or someone else using your script to change the overall behavior of the script in one place without having to search for the right variable to change. It's one of those “good programming practices” that make sense, no matter which programming language you're using.
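In weblog.pl, those lines sit at the very top of the script. Here's how they would look configured for the sample run shown earlier (the full listing at the end of this section ships with $topthings set to 30):

my $default = 'index.html';     # change to be your default HTML file
my $topthings = 5;              # number of files, sites, etc. to report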

Processing the Log

The first part of the weblog.pl script is the &process_log() subroutine, which loops over each line in the log file and stores various statistics about that line. I'm not going to show you every line of this subroutine, but I will point out the important parts. You can see the complete code in Listing 14.7 at the end of this section.

The core of the &process_log() subroutine is yet another while (<>) loop, to read the input one line at a time. Unlike address.pl, this script doesn't pause anywhere; it just reads the file from start to finish.

The first thing we do to process each line is to split the line into its component parts and store those parts in a hash keyed by the part name ('site', 'file', and so on). There's a separate subroutine to do the splitting, called &splitline(). Listing 14.4 shows this subroutine.

Listing 14.4. The &splitline() Subroutine
1:  sub splitline {
2:      my $in = $_[0];
3:      my %line = ();
4:      if ($in =~ /^([^\s]+)\s          # site
5:                 ([\w-]+\s[\w-]+)\s    # users
6:                 \[([^\]]+)\]\s        # date
7:                 "(\w+)\s              # protocol
8:                 (\/[^\s]*)\s          # file
9:                 ([^"]+)"\s            # HTTP version
10:                (\d{3})\s             # return code
11:                ([\d-]+)              # bytes transferred
12:     /x) {
13:        $line{'site'} = $1;
14:        $line{'date'} = $3;
15:        $line{'file'} = $5;
16:        $line{'code'} = $7;
17:        return %line;
18:     } else { return (); }
19: }

The first thing that probably catches your eye about that subroutine is that enormous monster of a regular expression smack in the middle (lines 4 through 11). It's so ugly it needs eight lines! And comments! This regular expression is written in a form called extended regular expressions, which I described in the “Going Deeper” section on Day 5, “Working with Hashes.” But here's a quick review: Say you have a particularly ugly regular expression like the one in this example (here I've put it on two lines because it doesn't fit on one line!):

if ($in =~ /^([^\s]+)\s([\w-]+\s[\w-]+)\s\[([^\]]+)\]\s"(\w+)\s
(\/[^\s]*)\s([^"]+)"\s(\d{3})\s([\d-]+)/)

Chances are good you won't be able to make heads or tails of that expression without a lot of patient dissecting or very strong tranquilizers. And debugging it won't be much fun either. But if you put the /x option on the end of the expression (as we have in line 12), then you can spread that regular expression apart into sections or onto separate lines, and comment it as you would lines of Perl code. All whitespace in the pattern is ignored; if you want to match whitespace in the text, you'll have to use \s. All the /x option does is make a regular expression easier to read and debug.
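Here's a small illustration of /x on its own, using a made-up pattern that has nothing to do with weblog.pl:

#!/usr/bin/perl -w
use strict;

# a made-up example: match a US-style phone number, with the
# pattern spread across lines and commented thanks to /x
my $phone = '555-123-4567';
if ($phone =~ /^(\d{3})-    # area code
                (\d{3})-    # exchange
                (\d{4})$    # number
              /x) {
    print "area code $1, exchange $2, number $3\n";
}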

This particular regex assumes the common log format I described earlier. Specifically:

  • Line 4 matches the site (host) name. The site always appears at the start of the line, and consists of some nonwhitespace characters followed by a space (a \s here so that extended patterns work).

  • Line 5 matches the user fields (two of them). Each consists of one or more alphanumeric characters or dashes; the two fields are separated by, and followed by, whitespace. Note the dashes inside the character classes; the \w class does not include dashes.

  • Line 6 matches the date, which is one or more characters or whitespace in between brackets ([]).

  • Line 7 matches the protocol (GET, HEAD, and so on), by matching a quote followed by one or more word characters (the closing quote comes after the HTTP version in line 9).

  • Line 8 matches the file. It always starts with a slash (/), followed by zero or more other characters and ending with whitespace.

  • Line 9 matches the HTTP version, which includes any remaining characters before the closing quote.

  • Line 10 matches the return code, which is always three digits long followed by whitespace (it would be less specific to just use \d+ here, but this is a chance to show off the use of the {3} pattern).

  • Line 11 finishes up the pattern with the number of bytes transferred, which is any number of digits. If no bytes were transferred—for example, the hit resulted in an error—this field will be a dash. The pattern covers that as well.

Each element of this regex is stored in a parenthesized expression (and a match variable), with the extra brackets or quotes removed. After the match has occurred, we can put the various matched bits into a hash. Note that we only use about half of the actual matches in the hash; we only need to store what we're actually going to use. But if you extend this example to include statistics on other parts of the hit, all you have to do is add lines to add those matches to the hash. You don't have to muck with the regular expression to get more information.
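For example, if you also wanted to track the request method and the number of bytes transferred, two extra lines inside &splitline(), just before the return %line; statement, would do it (the key names here are my own invention):

$line{'protocol'} = $4;   # GET, POST, HEAD, and so on
$line{'bytes'}    = $8;   # bytes transferred ('-' when nothing was sent)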

With the line split into its component parts, we return from the &splitline() subroutine back up to the main &process_log() routine. The next part of that subroutine checks for failed hits. If a line in the Web log didn't match the pattern (and some don't), the &splitline() subroutine returns an empty list. That's considered a failed hit, so we add it to the count of failed hits and then skip to the end of the loop to process the next line:

if (!%hit) {  # malformed line in web log
    $failhits++;
    next;
}

The next part of the script is a convenience for the person running the script. Processing a log file of any size takes a long time, and sometimes it can be hard to tell whether Perl is still working on the log file, or if the system has hung and it's never going to return anything. This part of the script prints a processing message with the date of the lines being processed, printing a new message each time a day's hits are complete and showing Perl's progress through the file:

$dateshort = &getday($hit{'date'});
if ($currdate ne $dateshort) {
    print "Processing $dateshort\n";
    $currdate = $dateshort;
}

Here, the subroutine &getday() is simply a short routine that uses a pattern to grab the day portion (day, month, and year) out of the date field so it can be compared to the date currently being processed (I'm not going to walk through &getday(); you can see it in the full code if you're curious). If the dates differ, a message is printed and the $currdate variable is updated.
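If you're curious anyway, here's the idea of &getday() in miniature: grab everything in the date field before the first colon (the real subroutine is in Listing 14.7):

my $field = '03/Apr/1998:00:09:02 -0800';
my ($day) = $field =~ /([^:]+):/;   # everything up to the first colon
print "$day\n";                     # prints 03/Apr/1998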

In addition to lines in the log file that don't match the log format, hits that matched the pattern but didn't result in an actual file being returned are also considered failed hits (misspellings in URLs or files that have moved will cause these kinds of hits, for example). These hits are recorded in the log with error codes that start with 4, such as the 404 you've probably seen on the Web. The return code was one of the things we saved from the line, so testing it is a simple pattern match:

if ($hit{'code'} =~ /^4/) { # 404, 403, etc. (errors)
    $failhits++;

The else part of this if test handles all other hits—that is, the successful ones that actually returned HTML files or images. Those hits will have return codes of 200 or 304:

} elsif ($hit{'code'} =~ /200|304/) {   # deal only with successes

Web servers are set up to deliver a default file, usually index.html, when a site requests a URL that ends in a directory name. This means that a request for /web/ and a request for /web/index.html actually refer to the same file, but they show up as different entries in the log file, which means our script will process them as different files. To collapse directories and default files, we have a couple of lines that test to see if the file requested ends with a slash, and if so, to add the default filename on the end of it. The default file, as I noted earlier, is defined by the $default variable:

if ($hit{'file'} =~ /\/$/) { # slashes map to $default
    $hit{'file'} .= $default;
}

With that done, now we can finish up the processing by incrementing the $htmlhits variable if the file is an HTML file and updating the hashes for the site and for the file:

if ($hit{'file'} =~ /\.html?$/) { # .htm or .html
    $htmlhits++;
}

$hosts{ $hit{'site'} }++;
$files{ $hit{'file'} }++;

At this point, we're at the end of the while loop, and the loop starts over again with the next line in the file. The loop continues until all the lines are processed, and then we move on to the printing part of the script.

Printing the Results

The &process_log() subroutine processes the log file line by line, and calls the &splitline() and &getday() subroutines to help. The second part of the weblog.pl script is the &print_results() subroutine, and it has a few other subroutines to help it as well. Much of print_results(), however, is as it sounds: a bunch of print statements to print out the various statistics.

First, the script checks to make sure that the log file wasn't empty (using the $totalhits variable). If the file was empty, the script prints an error message and exits. The next few lines print the total number of hits, the total number of failed hits, and the total number of HTML hits. Both of the latter are also shown as percentages: failed hits as a percentage of all hits, and HTML hits as a percentage of successful hits. We can get these values with a little math and a printf:

print "Web log file Results:
";
print "Total Number of Hits: $totalhits
";
print "Total failed hits: $failhits (";
printf('%.2f', $failhits / $totalhits * 100);
print "%)
";
print "(sucessful) HTML files: $htmlhits (";
printf('%.2f', $htmlhits / ($totalhits - $failhits) * 100);
print "%)
";

Next up: total number of hosts. We can get this value by extracting the keys of the %hosts hash into a list, and then evaluate that list in a scalar context (using the scalar function):

print 'Number of unique hosts: ';
print scalar(keys %hosts);
print "\n";

To get the number of unique domains, we need to process the %hosts hash to compress the hosts into their smaller domains, and build a new hash (%domains) that holds the count of all the hits for each domain. We'll use a subroutine called &getdomains() for that, which I'll discuss in the next section; for now, assume it's done and we have our %domains hash. We can use the same scalar trick on that hash's keys to get the number of unique domains:

my %domains = &getdomains(keys %hosts);
print 'Number of unique domains: ';
print scalar(keys %domains);
print "\n";

The last three things that get printed are the most popular files, hosts, and domains. There's a subroutine to get these values as well, called &gettop(), which sorts each hash by its values (the number of times each thing appeared in a hit), and then builds an array of descriptive strings with the keys and values in the hash. The array will contain only the top five or ten things (or whatever the value of $topthings is). More about the &gettop() subroutine in a bit.
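The heart of &gettop() is Perl's idiom for sorting a hash's keys by their values. Here it is in isolation, with a tiny made-up hash:

my %h = (a => 3, b => 10, c => 7);
my @by_count = sort { $h{$b} <=> $h{$a} } keys %h;
print "@by_count\n";    # prints: b c a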

Each of those arrays gets printed to the output to finish up. Here's the one for files:

print "Most popular files: 
";
foreach my $file (&gettop(%files)) {
    print "  $file
";
}

The &getdomains() Subroutine

We're not done yet. We still have to cover the helper subroutines for printing the statistics: &getdomains(), to extract the domains from the %hosts hash and recalculate the stats, and &gettop(), to take a hash of keys and frequencies and return the most popular elements. The &getdomains() subroutine is shown in Listing 14.5.

Listing 14.5. The &getdomains() Subroutine
1:  sub getdomains {
2:      my %domains = ();
3:      my ($sd,$d,$tld);       # secondary domain, domain, top-level domain
4:      foreach my $host (@_) {
5:          my $dom = '';
6:          if ($host =~ /(([^.]+)\.)?([^.]+)\.([^.]+)$/) {
7:              if (!defined($1)) { # only two parts (i.e., aol.com)
8:                  ($d,$tld) = ($3, $4);
9:              } else {            # a usual domain x.y.com etc.
10:                 ($sd, $d, $tld) = ($2, $3, $4);
11:             }
12:             if ($tld =~ /\D+/) { # ignore raw IPs
13:                 if ($tld =~ /com|edu|net|gov|mil|org$/i) { # US TLDs
14:                     $dom = "$d.$tld";
15:                 } else { $dom = "$sd.$d.$tld"; }
16:                 $domains{$dom} += $hosts{$host};
17:             }
18:         } else { print "Malformed: $host\n"; }
19:     }
20:     return %domains;
21: }

This is less complex than it looks. A few basic assumptions are made about the makeup of a host name: in particular, that each host name has a number of parts separated by periods, and that the domain consists of either the right-most two or three parts, depending on the name itself. In this subroutine, then, we'll reduce each host into its actual domain, and then use that domain name as the index to a new hash, storing all the original hits from the full hosts into the new domain-based hash.

The core of this subroutine is the foreach loop starting in line 4. The argument passed to this subroutine is a list of all the host names from the %hosts hash, and we loop over each host name in turn to make sure we cover them all.

The first part of that foreach loop is the long scary-looking regular expression in line 6. All this pattern does is grab the last two parts of the host name, and the last three if it can (some host names have only two parts; this regex handles those, too). Lines 7 through 11 then check how many parts we got (two or three), and assign those parts to the variables $sd, $d, and $tld ($sd stands for secondary domain, $d for domain, and $tld for top-level domain, if you want to keep them straight).

The second part of the loop determines whether we'll use two or three parts of the host as the actual domain (and ignores any hosts made up of IP numbers rather than actual domain names in line 12). The purely arbitrary rule I used for determining whether a domain has two or three parts is this: If the top-level domain (that is, the right-most part of the host name) is a US domain such as .com, .edu, and so on (full list in line 13), then the domain only has two parts. This covers aol.com, mit.edu, whitehouse.gov, and so on. If the top-level domain is anything else, it's probably a country-specific domain such as .uk, .au, .mx, and so on. Those domains typically use three parts to refer to a site, for example, citygate.co.uk or monash.edu.au. Two parts would not be enough granularity (edu.au refers to all universities in Australia, not to a specific place called edu).

This is what lines 13 through 15 deal with: building up the domain name from two or three parts and storing it in the string $dom. After we've built the domain name, we can then use it as the key in the new hash, and bring over the hits we had for the original host in line 16. By the time the domain hash is done, all the hits in the %hosts hash should be accounted for in the %domains hash as well, and we can return that hash back to the &print_results() subroutine.

One last bit: line 18 is a bit of error checking for this subroutine. If the pattern-matching expression in line 6 doesn't match, we've got a very weird host name indeed, and we print a message to that effect. Generally speaking, however, that message should never be reached, because a malformed host name in the log file usually means a malformed host name on the host itself, and the Internet makes that difficult to do.
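To see the two-or-three-part rule in action outside the script, here's a self-contained sketch (the host names are made up) that applies the same pattern and the same rule:

#!/usr/bin/perl -w
use strict;

# made-up hosts, to show how the rule collapses them into domains
foreach my $host ('dialup124.servers.foo.com',
                  'dialup567.servers.foo.com',
                  'citygate.co.uk', 'aol.com') {
    if ($host =~ /(([^.]+)\.)?([^.]+)\.([^.]+)$/) {
        my ($sd, $d, $tld) = ($2, $3, $4);
        my $dom = (!defined($sd) || $tld =~ /com|edu|net|gov|mil|org$/i)
                  ? "$d.$tld" : "$sd.$d.$tld";
        print "$host -> $dom\n";
    }
}
# dialup124.servers.foo.com -> foo.com
# dialup567.servers.foo.com -> foo.com
# citygate.co.uk -> citygate.co.uk
# aol.com -> aol.com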

The &gettop() Subroutine

One last subroutine to cover, and then we can put this week to bed and you can go have a beer and celebrate finishing two thirds of this book. This last subroutine, &gettop(), takes a hash, sorts it by value, and then trims off the top X elements, where X is determined by the $topthings variable. The subroutine returns an array of strings, where each string contains the key and value for the top X elements in a form that can be easily printed by the &print_results() subroutine that called this one in the first place. Listing 14.6 shows this subroutine.

Listing 14.6. The &gettop() Subroutine
1:  sub gettop {
2:      my %hash = @_;
3:      my $i = 1;
4:      my @topkeys = ();
5:      foreach my $key (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
6:          if ($i <= $topthings) {
7:              push @topkeys, "$key ($hash{$key} hits)";
8:              $i++;
9:          }
10:     }
11:     return @topkeys;
12: }
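Here's a quick usage sketch, assuming the subroutine above is in the same file and $topthings is set to 3. One design note: the foreach in line 5 keeps running over the whole hash even after the top $topthings entries have been collected; adding a last once $i passes $topthings would skip that wasted work on large hashes.

my %sample = ('/c.html' => 12, '/a.html' => 10,
              '/b.gif'  => 7,  '/d.html' => 2);
foreach my $entry (&gettop(%sample)) {
    print "  $entry\n";
}
# prints:
#   /c.html (12 hits)
#   /a.html (10 hits)
#   /b.gif (7 hits)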

The Code

Listing 14.7 contains the complete code for the weblog.pl script.

Note

Once again, watch out for the my variables inside foreach loops in certain versions of Perl. See the note just before Listing 14.3 for details.


Listing 14.7. The Code for weblog.pl
1:   #!/usr/bin/perl -w
2:   use strict;
3:
4:   my $default = 'index.html';     # change to be your default HTML file
5:   my $topthings = 30;             # number of files, sites, etc. to report
6:   my $totalhits = 0;
7:   my $failhits = 0;
8:   my $htmlhits = 0;
9:   my %hosts = ();
10:  my %files = ();
11:
12:  &process_log();
13:  &print_results();
14:
15:  sub process_log {
16:      my %hit = ();
17:      my $currdate = '';
18:      my $dateshort = '';
19:      print "Processing log....\n";
20:      while (<>) {
21:          chomp;
22:          %hit = &splitline($_);
23:          $totalhits++;
24:
25:          # watch out for malformed lines
26:          if (!%hit) {  # malformed line in web log
27:              $failhits++;
28:              next;
29:          }
30:
31:          $dateshort = &getday($hit{'date'});
32:          if ($currdate ne $dateshort) {
33:              print "Processing $dateshort\n";
34:              $currdate = $dateshort;
35:          }
36:
37:          # watch 404s
38:          if ($hit{'code'} =~ /^4/) { # 404, 403, etc. (errors)
39:              $failhits++;
40:          # other files
41:          } elsif ($hit{'code'} =~ /200|304/) {   # deal only with successes
42:              if ($hit{'file'} =~ /\/$/) { # slashes map to $default
43:                  $hit{'file'} .= $default;
44:              }
45:
46:              if ($hit{'file'} =~ /\.html?$/) { # .htm or .html
47:                  $htmlhits++;
48:              }
49:
50:              $hosts{ $hit{'site'} }++;
51:              $files{ $hit{'file'} }++;
52:          }
53:      }
54:  }
55:
56:  sub splitline {
57:      my $in = $_[0];
58:      my %line = ();
59:      if ($in =~ /^([^\s]+)\s          # site
60:                 ([\w-]+\s[\w-]+)\s    # users
61:                 \[([^\]]+)\]\s        # date
62:                 "(\w+)\s              # protocol
63:                 (\/[^\s]*)\s          # file
64:                 ([^"]+)"\s            # HTTP version
65:                 (\d{3})\s             # return code
66:                 ([\d-]+)              # bytes transferred
67:           /x) {
68:         # we only care about some of the values
69:         # (every other one, coincidentally)
70:         $line{'site'} = $1;
71:         $line{'date'} = $3;
72:         $line{'file'} = $5;
73:         $line{'code'} = $7;
74:         return %line;
75:      } else { return (); }
76:  }
77:
78:  sub getday {
79:      my $date;
80:      if ($_[0] =~ /([^:]+):/) {
81:          $date = $1;
82:          return $date;
83:      } else {
84:          return $_[0];
85:      }
86:  }
87:
88:  sub print_results {
89:      if ($totalhits == 0) {
90:          print "The log file is empty.\n";
91:          exit;
92:      }
93:
94:      print "Web log file Results:\n";
95:      print "Total Number of Hits: $totalhits\n";
96:      print "Total failed hits: $failhits (";
97:      printf('%.2f', $failhits / $totalhits * 100);
98:      print "%)\n";
99:
100:     print "(successful) HTML files: $htmlhits (";
101:     printf('%.2f', $htmlhits / ($totalhits - $failhits) * 100);
102:     print "%)\n";
103:
104:     print 'Number of unique hosts: ';
105:     print scalar(keys %hosts);
106:     print "\n";
107:
108:     my %domains = &getdomains(keys %hosts);
109:     print 'Number of unique domains: ';
110:     print scalar(keys %domains);
111:     print "\n";
112:
113:     print "Most popular files: \n";
114:     foreach my $file (&gettop(%files)) {
115:        print "  $file\n";
116:     }
117:     print "Most popular hosts: \n";
118:     foreach my $host (&gettop(%hosts)) {
119:        print "  $host\n";
120:     }
121:
122:     print "Most popular domains: \n";
123:     foreach my $dom (&gettop(%domains)) {
124:        print "  $dom\n";
125:     }
126: }
127:
128: sub getdomains {
129:     my %domains = ();
130:     my ($sd,$d,$tld);       # secondary domain, domain, top-level domain
131:     foreach my $host (@_) {
132:         my $dom = '';
133:         if ($host =~ /(([^.]+)\.)?([^.]+)\.([^.]+)$/) {
134:             if (!defined($1)) { # only two parts (i.e., aol.com)
135:                 ($d,$tld) = ($3, $4);
136:             } else {            # a usual domain x.y.com etc.
137:                 ($sd, $d, $tld) = ($2, $3, $4);
138:             }
139:             if ($tld =~ /\D+/) { # ignore raw IPs
140:                 if ($tld =~ /com|edu|net|gov|mil|org$/i) { # US TLDs
141:                     $dom = "$d.$tld";
142:                 } else { $dom = "$sd.$d.$tld"; }
143:                 $domains{$dom} += $hosts{$host};
144:             }
145:         } else { print "Malformed: $host\n"; }
146:     }
147:     return %domains;
148: }
149:
150: sub gettop {
151:     my %hash = @_;
152:     my $i = 1;
153:     my @topkeys = ();
154:     foreach my $key (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
155:         if ($i <= $topthings) {
156:             push @topkeys, "$key ($hash{$key} hits)";
157:             $i++;
158:         }
159:     }
160:     return @topkeys;
161: }
