Chapter 24. Logging

 

“Quickly, bring me a beaker of wine, that I may wet my brain and say something clever.”

 
 --Aristophanes

Apache comes with built-in mechanisms that log activity on your server. In this chapter I'll talk about the standard way that Apache writes log files, and some of the tricks for getting more useful information and statistics out of your server.

If you have done a default installation of Apache, two log files will be written when you run your server. These files are called access_log (access.log on Windows) and error_log (error.log on Windows). In a default installation on Unix, they can be found in /usr/local/apache/logs. On Windows, the logs are in the logs subdirectory of wherever you installed Apache. Some package managers put the log files in various other places, and you'll have to poke around to find them, or check in the configuration file for the configured location. Common places include /var/log/ and /usr/adm/.

access_log

access_log is, as the name suggests, the log of all accesses to your server. A typical entry in this file would look like

216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654

This line contains seven pieces of information. Actually, two of them are blank in this example, but there is space for seven pieces of information.

The first piece of information is the address of the remote host; that is, who is looking at your Web site. In the previous example, the host visiting my Web site is 216.35.116.91, which is, incidentally, the IP address of the machine called si3001.inktomi.com. (I figured that out by looking up the address in DNS, with the nslookup utility.) inktomi.com is a company that makes Web searching software. (I looked at their Web site.) Because the same IP address requested the file robots.txt just a few seconds earlier, I suspect that this is a Web searching spider that was indexing my Web site. (See Chapter 23, “Web Spiders,” for more information about spiders and robots.) So, just based on that first piece of information, and a glance back in the log file, I've already found out quite a bit of information about my visitors.

By default, this address is just the IP address of the remote host. You can tell Apache to look up all the hostnames, and put those hostnames in the log instead of the IP address. This is not a very good idea because it greatly slows down the logging process, and, therefore, slows down your entire server. (See Chapter 13, “Performance Tuning,” for more tips about performance.) Various other tools will go through your log after the fact and resolve all the IP addresses to hostnames, so there's no real advantage to doing this anyway.

But, if you want to, you can tell Apache to do these lookups with the directive:

HostNameLookups on

Setting HostNameLookups to double, rather than on, will cause the logging process to do a reverse lookup on the name that it finds to verify that it points back to the IP address that you started with. The value is set to off by default.
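If you leave HostNameLookups off and resolve addresses after the fact, the job amounts to looking up each distinct address once and remembering the answer. The following is a minimal sketch of that idea in Python (my choice of language for illustration; it is not something that ships with Apache). The function name resolve_hosts and its lookup parameter are my own inventions, and addresses that cannot be resolved are simply left as they are.

```python
import socket

def resolve_hosts(addresses, lookup=socket.gethostbyaddr):
    """Map each distinct IP address to a hostname, looking each one up once.

    lookup is a hypothetical hook (defaulting to the real resolver) so that
    the function can be exercised without touching DNS.
    """
    cache = {}
    for address in addresses:
        if address in cache:
            continue  # already resolved; do not query DNS again
        try:
            # gethostbyaddr returns (hostname, aliases, addresses)
            cache[address] = lookup(address)[0]
        except OSError:
            # No reverse record (or DNS unavailable): keep the raw address
            cache[address] = address
    return cache

# Example (commented out because it needs a working resolver):
# print(resolve_hosts(["216.35.116.91"]))
```

Because each address is looked up only once, a log with thousands of hits from the same spider costs a single DNS query rather than thousands.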

The second slot is blank, and almost always will be. The “-” is a placeholder for the second piece of information, which is supposed to be the identity of the visitor: not just their login name but their e-mail address, or another unique identifier. This information is supposed to be returned by identd, or directly by the browser. In the old days, back when Netscape 0.9 was the dominant browser, you would usually have e-mail addresses in this spot. However, it did not take long for unsavory marketing types to think that it would be a good idea to collect those e-mail addresses and send them unsolicited e-mail (also known as spam). So, before very long, this feature was removed from just about every browser on the market. You will almost never find information in this field.

The third piece of information is also blank. The information that would appear there is the username with which the visitor authenticated. This will appear, of course, only when you have required authentication for a particular resource. So for the majority of entries in your log file, and for most sites, this will be blank. See Chapter 21, “Authentication and Authorization,” for more details.

Next we have the time when the request was made. This information is enclosed in square brackets, and is in what is called standard-English format. So the request in the previous example was made at 14:47:37 on Saturday, August 19, 2000. The -0400 on the end of the field means that the server is in a time zone four hours behind UTC.
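If you want to work with these timestamps in a program, most languages can parse them directly. As a small sketch in Python, assuming the default (English) locale for month names, the standard strptime call handles the field from the example entry. The name parse_access_time is just one I picked.

```python
from datetime import datetime, timezone

def parse_access_time(stamp):
    """Parse the bracketed access_log time field (without the brackets)."""
    # %z consumes the trailing offset from UTC, such as -0400
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

when = parse_access_time("19/Aug/2000:14:47:37 -0400")
print(when.strftime("%A"))                         # Saturday
print(when.astimezone(timezone.utc).isoformat())   # 2000-08-19T18:47:37+00:00
```

Converting to UTC, as in the last line, is handy when you are comparing logs from servers in different time zones.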

The next piece of information is probably the most useful piece in the record. It tells what request was actually made of the server. This is typically in the format METHOD RESOURCE PROTOCOL.

In the previous example, the METHOD is GET. The other most common methods will be POST and HEAD. There are a number of other valid methods, but those three are what you will see most of the time.

The RESOURCE is the actual document, or URL, that was requested from the server. In this example, the client requested /, which is the root, or front page, of the server. In most configurations, this corresponds to the file index.html in the DocumentRoot directory, but could be something else, depending on your server configuration.

The PROTOCOL is usually going to be HTTP, followed by a version number. The version number will be either 1.0 or 1.1, with the proportions being roughly even. HTTP is the protocol that makes the Web work. HTTP/1.0 was the earlier version of this protocol, and 1.1 is the more recent version.

The sixth piece of information is a status code. This tells you whether the request was successful, or if it encountered some problem. Most of the time this is 200, which means that the transfer was successful, and everything went well. In general, a status code that starts with 2 was successful. Starting with a 3 means that the request was redirected somewhere else for some reason. Starting with a 4 means that the user did something wrong, and starting with a 5 means that the server did something wrong.
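The rule of thumb about the first digit is easy to express in code. This is only an illustrative Python sketch; the function name status_class and the labels it returns are mine, not anything defined by Apache.

```python
def status_class(code):
    """Classify an HTTP status code by its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 leaves just the leading digit
    return classes.get(code // 100, "unknown")

print(status_class(200))  # success
print(status_class(404))  # client error
```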

The exact meanings of these status codes are included in the table at the end of this section.

The seventh and final piece of information is the total number of bytes that were transferred to the client. This can tell you if a transfer was interrupted (if the number is different from the size of the file). Adding them up will tell you how much data your server transferred in a day, or week, or whatever.
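To make the seven fields concrete, here is a rough Python sketch that splits the example entry into its parts. The pattern and the field names (host, ident, user, and so on) are my own labels, and the sketch assumes the request itself contains no embedded double quotes.

```python
import re

# One common log format entry; the group names are my own labels.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf_line(line):
    """Return the seven fields of a common log format line, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the bytes slot means no body was sent; treat it as zero
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

entry = parse_clf_line(
    '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654'
)
print(entry["host"], entry["request"], entry["status"], entry["bytes"])
# 216.35.116.91 GET / HTTP/1.0 200 654
```

The two blank slots come back as “-” in the ident and user fields, just as they appear in the log.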

The following is a complete listing of possible values for the status codes—the sixth piece of information in an access_log entry.

Table 24.1. 100-Series HTTP Status Codes

100 Informational
100 Continue: The client should continue with the request.
101 Switching protocols: The server is willing to comply with the client's request to upgrade protocols.

Table 24.2. 200-Series HTTP Status Codes

200 Successful
200 OK: The request was successfully completed.
201 Resource created: The resource was successfully created.
202 Accepted: The request has been accepted for processing, but the processing has not been completed.
203 Nonauthoritative information: The information is not the definitive set as available from the origin server, but has been gathered from a local or third-party copy.
204 No content: The request was fulfilled, but no content needs to be returned.
205 Reset content: The request has been fulfilled, and the client should reset the document view that caused the request to be sent. For example, reset the contents of an HTML form so that the user can enter new information into that form.
206 Partial content: The partial GET request has been completed. This will be in response to a GET request that included a Range header, requesting only a portion of the resource.

Table 24.3. 300-Series Server Status Codes

300 Redirection
300 Multiple choices: The requested resource can be fulfilled with any one of several choices.
301 Moved permanently: The requested resource has been permanently moved to a new location.
302 Found: The resource is temporarily located somewhere else, but the client should continue to use the same URL in the future.
303 See other: Usually the same as a 302. The response to the requested URL can be found at another location and should be retrieved from there.
304 Not modified: The document has not been modified since the specified date.
305 Use proxy: The requested resource must be requested through the specified proxy, which is sent in the Location header.
306 Unused
307 Temporary redirect: The resource has temporarily moved to a new location, and the client should repeat the request using that new location.

Table 24.4. 400-Series Server Status Codes

400 Client error
400 Bad request: The request was not understood by the server.
401 Unauthorized: The request requires user authentication. This response is accompanied by a request for the necessary credentials. See Chapter 21 for more details.
402 Payment required: Not yet used.
403 Forbidden: The request was understood, but is being refused.
404 Not found: The requested resource could not be located.
405 Method not allowed: The method used is not one of the methods permitted for the requested resource.
406 Not acceptable: The requested resource is only available in representations which the client has indicated are not acceptable. See Chapter 10, “Content Negotiation,” for more information on Content Negotiation and Accept headers.
407 Proxy authentication required: Similar to 401, but indicates that a proxy server requires authentication.
408 Request timeout: The client did not produce a request in the time that the server was willing to wait.
409 Conflict: The request could not be completed because of a conflict.
410 Gone: The resource is no longer available, and there is no known forwarding address.
411 Length required: The server will not accept the request without a Content-Length header.
412 Precondition failed: A precondition specified in the request headers evaluated to false.
413 Request entity too large: The request was larger than the server was willing or able to process.
414 Request URI too long: The request URI is longer than the server is willing to interpret. Note that this is not the same as 413, which refers to the entire request entity, including headers.
415 Unsupported Media Type: The request is in a format not supported by the requested resource for the requested method.
416 Request range not satisfiable: The client request included a Range specifier, which does not specify a valid range for the requested resource. For example, it requests a byte-range that extends past the size of the requested file.
417 Expectation failed: The expectation expressed in the Expect request header could not be met by the server.

Table 24.5. 500-Series Server Status Codes

500 Server Error
500 Internal server error: The server encountered an unexpected condition that prevented it from fulfilling the request.
501 Not implemented: The server does not support the functionality required to fulfill the request.
502 Bad gateway: While acting as a gateway or proxy, the server received an invalid response from the upstream server.
503 Service unavailable: The server is currently unavailable.
504 Gateway timeout: When acting as a gateway or proxy, the server did not receive a timely response from the upstream server.
505 HTTP version not supported: The server does not support the HTTP protocol that was specified in the request.

Location and Format of the access_log File

Where the access_log is located is actually a configuration option. If you look in your configuration file, httpd.conf, you should see a line that looks like the following:

CustomLog /usr/local/apache/logs/access_log common

Note

If you're running an older version of Apache, this line might look a little different. It might be the TransferLog directive instead of the CustomLog directive. If that is the case, I really recommend that you upgrade if at all possible.

The CustomLog directive specifies where a particular log file should be stored, and what format that log should be in. The log format described previously is the common log format, which has been in use as the standard since the beginning of Web servers. That's why it still contains the ident information field, even though almost no clients actually pass that information to the server.

The path specified there is the location of the log file.

Note

Note that this location should be secured against random users writing to it. The log file is opened by the user that starts the server (normally root), not by the HTTP user specified with the User directive, so a log directory that other users can write to is a potential security problem.

LogFormat

The LogFormat directive defines the actual format of the log file. Long ago, log files came in one format, called the common format, and you were pretty much stuck with it. Then came custom log file format, and it turned out to be such a good idea that even the common format was reimplemented as a custom log file format.

LogFormat sets up a format and gives it a nickname by which you can refer to it. CustomLog sets up an actual log file, and indicates the format (by nickname, usually) that the file will use.

For example, in your default httpd.conf file, you'll find the following line:

LogFormat "%h %l %u %t \"%r\" %>s %b" common

This directive creates a log format called common, which is in the format specified in quotes. Each one of those letters means a particular piece of information, which is put into the log file in the order indicated.

The available variables, and their meanings, are listed in the documentation, and are reproduced in Table 24.6:

Table 24.6. LogFormat Variables

Variable Meaning
%…a: Remote IP-address
%…A: Local IP-address
%…B: Bytes sent, excluding HTTP headers
%…b: Bytes sent, excluding HTTP headers, in CLF format; that is, a “-” rather than a 0 when no bytes are sent
%…{FOOBAR}e: The contents of the environment variable FOOBAR
%…f: Filename
%…h: Remote host
%…H: The request protocol
%…{Foobar}i: The contents of Foobar: header line(s) in the request sent to the server
%…l: Remote logname (from identd, if supplied)
%…m: The request method
%…{Foobar}n: The contents of note Foobar from another module
%…{Foobar}o: The contents of Foobar: header line(s) in the reply
%…p: The canonical Port of the server serving the request
%…P: The process ID of the child that serviced the request
%…q: The query string (prepended with a ? if a query string exists, otherwise an empty string)
%…r: First line of request
%…s: Status. For requests that got internally redirected, this is the status of the original request; use %…>s for the last
%…t: Time, in common log format time format (standard-English format)
%…{format}t: The time, in the form given by format, which should be in strftime(3) format (potentially localized)
%…T: The time taken to serve the request, in seconds
%…u: Remote user (from auth; might be bogus if return status (%s) is 401)
%…U: The URL path requested
%…v: The canonical ServerName of the server serving the request
%…V: The server name according to the UseCanonicalName setting

In each case, the “…” indicates an (optional) condition. If the condition is met, or if no condition is given, the specified variable is logged. If the condition is not met, the variable is replaced with a “-”. I'll give some examples of this shortly.

The LogFormat line shown in the previous example, from the default httpd.conf file, creates a log format called common, which contains the remote host, remote logname, remote user, the time of the transaction, the first line of the request, the status of the request, and the number of bytes sent. This is the common log format explained in the previous section.
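One way to see how a LogFormat string maps onto the entries it produces is to turn the format string itself into a pattern that matches those entries. The following Python sketch does that for the seven directives in the common format only. The table of patterns, the group names, and the function name format_to_regex are all my own; this is not how Apache itself handles the format.

```python
import re

# Regex fragments for the directives in the common format only.
# The group names are my own labels, not Apache terminology.
DIRECTIVE_PATTERNS = {
    "%h": r'(?P<host>\S+)',
    "%l": r'(?P<logname>\S+)',
    "%u": r'(?P<user>\S+)',
    "%t": r'\[(?P<time>[^\]]+)\]',
    "%r": r'(?P<request>[^"]*)',
    "%>s": r'(?P<status>\d{3})',
    "%b": r'(?P<bytes>\d+|-)',
}

def format_to_regex(log_format):
    """Build a regex for a LogFormat string made of the directives above."""
    parts = []
    # Splitting on a capturing group keeps the directives in the result
    for token in re.split(r"(%>?[a-zA-Z])", log_format):
        if token in DIRECTIVE_PATTERNS:
            parts.append(DIRECTIVE_PATTERNS[token])
        elif "%" in token:
            raise ValueError("unsupported directive in: " + token)
        else:
            parts.append(re.escape(token))  # literal spaces, quotes, etc.
    return re.compile("".join(parts))

common = format_to_regex('%h %l %u %t "%r" %>s %b')
match = common.fullmatch(
    '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654'
)
print(match.group("host"), match.group("status"), match.group("bytes"))
# 216.35.116.91 200 654
```

Anything outside those seven directives raises an error rather than being guessed at, which seemed the safer choice for a sketch.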

Sometimes you'll want a particular piece of information logged only for certain responses. This is what the “…” referred to previously provides for. If, between the % and the variable, you put one or more HTTP status codes (separated by commas), the variable will only be logged in the event that the request returns one of those status codes. So, if you're trying to keep a log of all the broken links on your site, you might have the following:

LogFormat %404{Referer}i BrokenLinks

Conversely, if you want to log requests that don't match a particular code, put a ! in there:

LogFormat %!200U SomethingWrong
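To illustrate what these conditions do to an individual field, here is a tiny Python model of the behavior. It is a sketch of the rule as described above, not Apache's code, and the function name conditional_value and its parameters are hypothetical.

```python
def conditional_value(value, status, codes, negate=False):
    """Mimic a status-conditioned LogFormat item such as %404{Referer}i.

    codes is the set of status codes in the condition; negate=True models
    the "!" form, as in %!200U.
    """
    matched = status in codes
    if negate:
        matched = not matched
    # When the condition fails (or there is nothing to log), emit "-"
    return value if (matched and value) else "-"

# %404{Referer}i on a 404 response and on a 200 response
print(conditional_value("http://www.example.com/links.html", 404, {404}))
print(conditional_value("http://www.example.com/links.html", 200, {404}))  # -
```

Note that the condition replaces the field with “-” rather than suppressing the whole line, so a log like BrokenLinks will contain a good many lines that are just a “-”.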

CustomLog

After you have set up one or more LogFormats, you just have to apply them to a particular log file. This is done with the CustomLog directive. You can set up as many log files as you like. Each one needs to specify a log file location and which LogFormat you want to use:

CustomLog /var/log/httpd/bogus_log SomethingWrong
CustomLog /usr/local/apache/logs/broken BrokenLinks
CustomLog /usr/local/apache/logs/access_log common

The only disadvantage to doing this is that if you get some “off the shelf” log analysis application, it will assume that you are using common or combined log format because those are the ones that are most widely in use. However, many log analysis packages are able to do a good job of guessing what format you are using, if you have something other than the expected format.

Error Logs

The format of the entries in the error log is rather different from the entries in the access log that we saw previously.

But the two logs are similar because they both provide a lot of useful information, which you can use in analyzing how your server is being used, and what is going wrong.

Location of the Error Log

Your error log file should be in the same location as your access log file. It will be called error_log, or, on Windows machines, error.log.

The location of your error log can be configured with the ErrorLog directive:

ErrorLog logs/error.log

The location, unless it has a leading slash, is assumed to be relative to the ServerRoot directory.
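The rule is simple enough to state in a few lines of Python. This is just a sketch of the rule as described here; resolve_log_path is a name I made up, and /usr/local/apache is assumed as the ServerRoot because that is the default mentioned in this chapter.

```python
import posixpath

def resolve_log_path(server_root, configured):
    """Apply the leading-slash rule to a configured log location."""
    if configured.startswith("/"):
        return configured  # absolute path: used as given
    # Otherwise the location is taken relative to ServerRoot
    return posixpath.join(server_root, configured)

print(resolve_log_path("/usr/local/apache", "logs/error.log"))
# /usr/local/apache/logs/error.log
print(resolve_log_path("/usr/local/apache", "/var/log/httpd/error_log"))
# /var/log/httpd/error_log
```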

In a default Apache installation the log file is located in /usr/local/apache/logs. As with the access log, if you installed with one of the various package managers out there, you might find it just about anywhere.

What's in It?

The error log, as the name suggests, contains a record of everything that went wrong while your server was running. It also contains general diagnostic messages, such as a notification of when your server was restarted, or shut down.

You can set your log level higher or lower to control the amount, and type, of messages that appear in your log file. This is configured with the LogLevel directive. The default setting of this directive is error, which tells you about error conditions. The complete list of possible settings is contained in Table 24.7.

Table 24.7. LogLevel Values

Level: Description. Example message
emerg: Emergencies; system is unusable. Example: “Child cannot open lock file. Exiting”
alert: Action must be taken immediately. Example: “getpwuid: couldn't determine username from uid”
crit: Critical conditions. Example: “socket: Failed to get a socket, exiting child”
error: Error conditions. Example: “Premature end of script headers”
warn: Warning conditions. Example: “child process 1234 did not exit, sending another SIGHUP”
notice: Normal but significant condition. Example: “httpd: caught SIGBUS, attempting to dump core in …”
info: Informational. Example: “Server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers)…”
debug: Debug-level messages. Example: “Opening config file …”
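The levels in Table 24.7 are ordered from most to least severe, and setting LogLevel selects that level and everything more severe. The following Python sketch models that ordering. It is a simplified model of my own (is_logged is a made-up name) and it ignores any special cases the server itself may apply.

```python
# Levels in the order of Table 24.7, most severe first
LEVELS = ["emerg", "alert", "crit", "error", "warn", "notice", "info", "debug"]

def is_logged(message_level, configured_level="error"):
    """Return True if a message at message_level passes the LogLevel setting."""
    # A message passes when it is at least as severe as the configured level
    return LEVELS.index(message_level) <= LEVELS.index(configured_level)

print(is_logged("crit"))           # True
print(is_logged("warn"))           # False
print(is_logged("warn", "debug"))  # True
```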

In most cases, the things that you see in your log file will be in two categories: document errors and CGI errors. You will also occasionally see configuration errors, server start, and server stop messages.

Document Errors

Document errors are those in the 400 series of server response codes. The most common of these is 404 (Not Found). On most servers, 404s are followed in frequency by authentication errors.

A 404 error occurs whenever someone requests a resource—a URL—that is not on your server. Either they have mistyped something, there was a typo in a link somewhere, or you moved or deleted a document that used to be on your server.

Note

Jakob Nielsen, who is a highly respected usability expert, says that you should never move or delete any resource from your Web site without providing a redirect of some variety. If you're not already familiar with Nielsen's writings, you should take a look sometime at http://www.useit.com/.

When a client is unable to locate a document on your server, you'll see an entry like this in your logs:

[Fri Aug 18 22:36:26 2000] [error] [client 192.168.1.6] File does not exist:
/usr/local/apache/bugletdocs/Img/south-korea.gif

Note that, as in the case of the access_log file, this record is broken down into several fields.

First, we have the date/time stamp. The first thing that you might notice is that the format is not the same as the format in the access_log, which we called the standard-English format. This is merely an accident of history.

Note

The logging mechanisms for access and error logs were implemented by different people, who just happened to use different date formats. By the time they were cooperating a little more closely, there already existed log parsing applications that were counting on the particular date formats that had been used, and it was deemed to be too late to change.

Next, we have the level of the message. This will be one of the levels listed for LogLevel in Table 24.7. error is right between warn and crit. This simply indicates how serious the problem is. A 404 error means that you irritated someone, but it's not actually a critical condition affecting the health of your server.

The next field indicates the address of the client machine that made the request. In this case, it is a machine on my local network.

The last part of the log entry is the actual error message. In the case of a 404, it gives you the full path of the file that the server tried to serve. This is particularly useful when you're getting a 404 on a file that you just know is there. Frequently you have a configuration wrong, or the file is on a different virtual host than you thought, or some other strangeness.

Note that document errors, because they are a direct result of a client request, will be accompanied by an entry in access_log as well.
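The fields just described can be pulled apart with a short pattern. The following is a Python sketch under the assumption that entries look like the example above; the pattern and the name parse_error_line are mine, and the client field is treated as optional because not every message comes from a client request.

```python
import re
from datetime import datetime

ERROR_PATTERN = re.compile(
    r"\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] "
    r"(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)"
)

def parse_error_line(line):
    """Split an error log entry into time, level, client, and message."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The error log uses a different date format from the access log
    fields["time"] = datetime.strptime(fields["time"], "%a %b %d %H:%M:%S %Y")
    return fields

entry = parse_error_line(
    "[Fri Aug 18 22:36:26 2000] [error] [client 192.168.1.6] "
    "File does not exist: /usr/local/apache/bugletdocs/Img/south-korea.gif"
)
print(entry["level"], entry["client"])  # error 192.168.1.6
```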

Authentication errors will look very much the same:

[Tue Apr 11 22:13:21 2000] [error] [client 192.168.1.3] user rbowen
authentication failure for "/cgi-bin/hirecareers/company.cgi": password
mismatch

CGI Errors

Perhaps the most useful purpose of the error log is troubleshooting misbehaving CGI programs (or other content generation programs, such as mod_perl). Anything that a CGI program emits to STDERR (Standard Error) gets appended directly to the error log for your perusal. This means that (well-written) CGI programs, when they have problems, will tell you, via the log file, exactly what that problem is.

The downside to this is that you end up with stuff in the error log that is not in any well-defined format, and so it makes it very hard to have any automatic error-log parsing to get useful information out.

What follows is an example of an entry in an error log from problematic Perl CGI code:

[Wed Jun 14 16:16:37 2000] [error] [client 192.168.1.3] Premature
end of script headers: /usr/local/apache/cgi-bin/TestProg/announcement.cgi
Global symbol "$rv" requires explicit package name at
/usr/local/apache/cgi-bin/TestProg/announcement.cgi line 81.
Global symbol "%details" requires explicit package name at
/usr/local/apache/cgi-bin/TestProg/announcement.cgi line 84.
Global symbol "$Config" requires explicit package name at
/usr/local/apache/cgi-bin/TestProg/announcement.cgi line 133.
Execution of /usr/local/apache/cgi-bin/TestProg/announcement.cgi
aborted due to compilation errors.

Although this entry actually does follow the same format as the previous 404 error, in that it has a date, error level, and a client address, the error message itself is several lines long, which tends to confuse some log-parsing software.
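If you are writing your own parser, one way to cope with this is to treat any line that does not begin with a bracketed timestamp as a continuation of the entry before it. Here is a rough Python sketch of that approach; the pattern and the name group_entries are my own, and the sample lines are shortened for illustration.

```python
import re

# A line that opens a new entry starts with something like
# "[Wed Jun 14 16:16:37 2000] "
ENTRY_START = re.compile(
    r"\[[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}\] "
)

def group_entries(lines):
    """Group raw error log lines into entries (lists of lines)."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append([line])    # a new entry begins here
        else:
            entries[-1].append(line)  # continuation, such as CGI output
    return entries

sample = [
    "[Wed Jun 14 16:16:37 2000] [error] [client 192.168.1.3] Premature end of script headers",
    "Execution of announcement.cgi aborted due to compilation errors.",
    "[Fri Aug 18 22:36:26 2000] [error] [client 192.168.1.6] File does not exist",
]
print(len(group_entries(sample)))  # 2
```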

Even if you don't know Perl, you should be able to look at the previous error messages and glean some useful information about what went wrong. At the very least, you can tell on what lines the program had problems. Perl is very good about telling you where you made a mistake. Your mileage may vary based on what language you are using.

Without logs, it would be very difficult to troubleshoot most CGI programs because running them from the command line is a rather different environment than running them from a Web server.

Watching the Error Log

When actively developing a CGI program, or a content-generation application using some other technology, such as mod_perl, it's a good idea to actively watch the error log, so that when error conditions happen, you have immediate feedback.

This is done using the utility called tail. tail is a standard part of any Unix operating system, and tail clones are available for other operating systems.

At the command prompt, type the following:

tail -f /usr/local/apache/logs/error_log

This will show the last few lines of your log file, and, as lines are added to the file, it will show you those as they happen. The f stands for “follow” because it follows the log as it grows.
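If you would rather do the same thing from inside a program, perhaps to highlight certain messages as they arrive, the idea behind following a file is simply to keep reading and to wait a little whenever there is nothing new. The following Python sketch shows one way to do it. The function follow and its max_idle_polls parameter are my own inventions, and the sketch does not handle the log being rotated or a line that is only partly written when it is read.

```python
import time

def follow(handle, poll_seconds=1.0, max_idle_polls=None):
    """Yield lines as they are appended to an open file, much like tail -f.

    max_idle_polls is a hypothetical safety valve: after that many
    consecutive empty reads the generator stops. None means keep going
    until interrupted.
    """
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        line = handle.readline()
        if line:
            idle = 0
            yield line
        else:
            idle += 1
            time.sleep(poll_seconds)  # nothing new yet; wait and retry

# Typical use (left commented out because it runs until interrupted):
# with open("/usr/local/apache/logs/error_log") as log:
#     log.seek(0, 2)  # skip what is already in the file
#     for line in follow(log):
#         print(line, end="")
```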

If you want to use tail (and other Unix utilities) on Windows, you may want to try AINTX, which is a collection of AIX utilities for NT. You can find these at http://maxx.mc.net/jlh/nttools/html/nttools.htm.

It's extremely good practice to keep several terminal windows open, with your error log tailing in one window, and your access log tailing in the other, while you work on your site. This will tell you what is going on as it happens, so you know about problems before the customer has a chance to call you about them.

Log File Analysis

Although there is an enormous amount of information in the log files, it's not much good in its raw form.

Your marketing department, or the customer you are running the Web site for, will typically want to know how many people visited the site, what they looked at, how long they stayed, and where they found out about your site. All that information is (or might be) in your log files.

They will also want to know the names, addresses, and shoe sizes of those people, and, hopefully, their credit-card numbers. That information is not there and you need to know how to explain to your employer that not only is it not there, but the only way to get it is to explicitly ask your visitors for it and be willing to be told “no.”

What Your Log Files Can Tell You

A lot of information is available to put in your log files, including the following:

  1. Address of the remote machine: This is almost the same as “who is visiting my Web site,” but not quite. More specifically, it tells you where that visitor is from. This will be something like buglet.rcbowen.com or proxy01.aol.com.

  2. Time of visit: When did this person come to my Web site? This can tell you something about your visitors. If most of your visits come between the hours of 9 a.m. and 4 p.m., then you're probably getting visits from people at work. If it's mostly 7 p.m. through midnight, people are looking at your site from home.

    Single records, of course, give you very little useful information, but across several thousand hits, you can start to gather useful statistics.

  3. Resource requested: What parts of your site are most popular? Those are the parts that you should expand. Which parts of the site are completely neglected? Perhaps those parts of the site are just really hard to get to. Or, perhaps they are genuinely uninteresting, in which case you should spice them up a little. Of course, some parts of your site, such as your legal statements, are boring and there's nothing you can do about it, but they need to stay on the site for the two or three people that want to see them.

  4. What's broken? And, of course, your logs tell you when things are not working as they should be. Do you have broken links? Do other sites have links to your site that are not correct? Are some of your CGI programs malfunctioning? Is a robot overwhelming your site with thousands of requests per second?
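As a rough illustration of how these four questions turn into numbers, the following Python sketch tallies hosts, hours of the day, requested resources, and 404s from common log format lines. The pattern and the function name summarize are mine, and the second and third sample lines are invented for the example.

```python
import re
from collections import Counter

# Captures host, hour of day, requested resource, and status code
LINE = re.compile(
    r'(\S+) \S+ \S+ \[\d{2}/\w{3}/\d{4}:(\d{2}):[^\]]+\] '
    r'"\S+ (\S+)[^"]*" (\d{3}) '
)

def summarize(lines):
    """Count hits by remote host, hour, and resource, plus 404s by resource."""
    hosts, hours, resources, broken = Counter(), Counter(), Counter(), Counter()
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # skip anything that is not a recognizable entry
        host, hour, resource, status = match.groups()
        hosts[host] += 1
        hours[int(hour)] += 1
        resources[resource] += 1
        if status == "404":
            broken[resource] += 1
    return {"hosts": hosts, "hours": hours,
            "resources": resources, "broken": broken}

sample = [
    '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654',
    '192.0.2.7 - - [19/Aug/2000:14:52:10 -0400] "GET /missing.html HTTP/1.0" 404 210',
    '192.0.2.7 - - [19/Aug/2000:21:03:55 -0400] "GET / HTTP/1.1" 200 654',
]
stats = summarize(sample)
print(stats["resources"].most_common(1))  # [('/', 2)]
print(dict(stats["broken"]))              # {'/missing.html': 1}
```

On three lines this tells you nothing, of course. Run over several thousand hits, the same counters start to show the patterns described above.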

What Your Log Files Don't Tell You

HTTP is a stateless, anonymous protocol. This is by design, and is not, at least in my opinion, a shortcoming of the protocol. If you want to know more about your visitors, you have to be polite, and actually ask them. And be prepared to not get reliable answers. This is amazingly frustrating for marketing types. They want to know the average income, number of kids, and hair color of their target demographic. And they don't like to be told that that information is not available in the log files. However, it is quite beyond your control to get this information out of the log files. Explain to them that HTTP is anonymous.

Even what the log files do tell you is occasionally suspect. For example, you can expect to have numerous entries in your log files indicating that a machine called something like cache-mtc-am05.proxy.aol.com has visited your Web site. This tells you that this is a machine that is on the AOL network. But because of the way that AOL works, this might be one person visiting your site many times, or it might be many people visiting your site one time each. All requests coming from the AOL network are proxied. A proxy server is one that one or more people sit behind. They type an address into their browser, which makes that request to the proxy server. The proxy server gets the page (generating the log file entry on your Web site) and then passes that page back to the requesting machine. This means that you never see the request from the originating machine, but only the request from the proxy.

Another implication of this is that if, 10 minutes later, someone else sitting behind that same proxy requests the same page, a log file entry is not generated at all. They type the address, and that request goes to the proxy server. The proxy sees the request and thinks “I already have that document in memory. There's no point asking the Web site for it again.” So instead of asking your Web site for the page, it gives the copy that it already has to the client. So, not only is the address field suspect, but the number of requests is also suspect.

Most proxies will not cache dynamic content, such as the results of a CGI program, as that can change from one client to the next.

It might sound like the data that you receive is so suspect that it's useless. This is in fact not the case. It should just be taken with a grain of salt. The number of hits that your site receives is almost certainly not really the number of visitors that came to your site. But it's a good indication. And it still gives you some useful information. Just don't rely on it for exact numbers.

Getting Useful Statistics From Your Logs

So, to the real meat of all this. How do you actually generate statistics from your Web server logs?

There are two main approaches that you can take here. You can either do it yourself, or you can get one of the existing applications that is available to do it for you.

Unless you have custom log files that don't look anything like the Common log format, you should probably get one of the available apps out there. There are some excellent commercial products, and some really good free ones, so you just need to decide what features you are looking for.

The following are some of the available programs on the market. This should not be considered a comprehensive list, and you should do your own research before choosing one for your site because they all have somewhat different feature sets and different reports.

The programs that I have chosen here all either have versions that will run on both Unix and NT operating systems, or are written in Perl and will consequently run anywhere.

  • Analog: The Analog Web site (http://www.statslab.cam.ac.uk/sret1/analog/) claims that about 29% of all Web sites that use any log analysis tool at all use Analog. They claim that this makes it the most popular log analysis tool in the world.

    The example report, which you can see on the Analog Web site, seemed very thorough and contained all the stats that I might want. In addition to the pages and pages of detailed statistics, there was a very useful executive summary, which will probably be the only part that your boss will really care about.

    Analog is free software.

  • WebTrends: WebTrends provides astoundingly detailed reports on your log files, giving you all sorts of information that you did not know you could get out of these files. And there are a lot of pretty graphs generated in the report.

    It is, however, rather on the expensive side. You can look up the actual price on their Web site at http://www.webtrends.com/default.htm.

    It is also very slow, in comparison to other programs listed here.

  • WWWStat: WWWStat has been around for a very long time. It's fast, full-featured, and it's free. What more could you want? You can get it at http://www.ics.uci.edu/pub/websoft/wwwstat/ and there is a companion package (linked from that same page) that generates pretty graphs.

    It is very easy to automate WWWStat so that it generates your log statistics every night at midnight, and then generates monthly reports at the end of each month.

  • Wusage: Wusage has also been around a very long time. It is a great program, full-featured, inexpensive, and generates useful graphical reports.

    You can get Wusage at http://www.boutell.com/wusage/.

Parsing the Log Files Yourself

If you want to do your own log parsing and reporting, the best tool for the task is Perl. In fact, Perl's name (Practical Extraction and Report Language) is a tribute to its capability to extract useful information from logs and generate reports. (In reality, the name “Perl” came before the expansion of it, but I suppose that does not detract from my point.)

The Apache::ParseLog module, available from your favorite CPAN mirror, makes parsing log files simple, and, therefore, takes all the work out of generating useful reports from those logs.

For detailed information about how to use this module, install it and read the documentation. After you have installed the module, you can get at the documentation by typing perldoc Apache::ParseLog.

Trolling through the source code for WWWStat is another good way to learn about Perl log file parsing.

Because the log file format is so simple, writing code to chop each entry up into its component parts and run statistical analysis on them is straightforward.
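To give a feel for how little code this takes, here is a minimal sketch that splits a Common Log Format entry into its seven fields and counts requests by status code. It uses only core Perl rather than Apache::ParseLog. The sub names parse_clf_line and count_by_status are my own, and the second sample entry is invented for illustration; the first is the entry shown at the start of this chapter.

```perl
use strict;
use warnings;

# Split one Common Log Format line into its seven fields.
# Returns a hash reference, or nothing if the line does not match.
sub parse_clf_line {
    my ($line) = @_;
    my @fields = $line =~
        m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "(.*)" (\d{3}) (\d+|-)\s*$/
        or return;
    my %entry;
    @entry{qw(host ident user time request status bytes)} = @fields;
    return \%entry;
}

# Count how many requests produced each status code.
sub count_by_status {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        my $entry = parse_clf_line($line) or next;    # skip malformed lines
        $count{ $entry->{status} }++;
    }
    return \%count;
}

my @lines = (
    '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654',
    # Invented entry, for illustration only:
    '192.0.2.10 - - [19/Aug/2000:14:48:02 -0400] "GET /missing.html HTTP/1.0" 404 -',
);

my $count = count_by_status(@lines);
print "$_: $count->{$_}\n" for sort keys %$count;
```

In a real report you would read the lines from access_log rather than from a list, and you would probably tally by host or by requested URL as well. If you have a custom log format, the regular expression is the part that needs to change.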

Logging to a Process

You don't have to log to a file; you can log to a process. This is particularly useful if you want your logs to go to a database, or to some process that will give some type of real-time statistics on your Web site traffic.

Using the CustomLog directive, you can, instead of specifying a file to which the log should be written, specify | (the pipe character), followed by the name of a program that is to receive the logging information.

For example:

CustomLog |/usr/bin/apachelog common

Here, /usr/bin/apachelog is some program that knows what to do with Apache log file entries. This might be as simple as a Perl program that processes the log entries in some fashion, or it might be something that writes entries to a database.

The main thing to be cautious about if you're going to do this is security. Log files are opened with the permissions of the user that starts the server. This is usually root, and it applies as well to logging to a process. Make sure that the process to which you are logging is secure. If you log to an insecure process (one that some non-root user can tinker with), you run the risk of having that process replaced by another that does unsavory things. If, for example, /usr/bin/apachelog is world-writable, any user could edit it to shut down your server, mail someone the password file, or delete important files. This would be done with root permissions.
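One quick sanity check is to look at the permission bits on the program before you point Apache at it. The following is only a sketch: the sub name writable_by_others is my own, and the path is the example program from above. It tests a single condition, namely whether anyone other than the file's owner can write to the file.

```perl
use strict;
use warnings;
use Fcntl ':mode';    # for S_IWGRP and S_IWOTH

# Return true if the group or the world may write to $path.
sub writable_by_others {
    my ($path) = @_;
    my @st = stat($path) or die "cannot stat $path: $!";
    my $mode = $st[2];    # third field of stat is the file mode
    return ($mode & (S_IWGRP | S_IWOTH)) ? 1 : 0;
}

# The example program from the text; skipped if it does not exist here.
my $program = '/usr/bin/apachelog';
if (-e $program) {
    print "$program: ",
        writable_by_others($program) ? "UNSAFE, others can write to it" : "ok",
        "\n";
}
```

This does not replace a proper review. You should also confirm that root owns the file, and that the directories leading to it are not writable by other users, because a writable directory lets someone replace the file outright.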

Secondly, you should be careful about buffering. If, for example, you log to a process written in Perl, you might find that, although your Web site appears to be active, nothing is being logged by your Perl program. However, when you shut down the server, it suddenly logs everything. This is because it is buffering the output, and does not write anything out until the process terminates. Make sure you turn off output buffering if you want to get any sort of real-time reporting.
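Here is a minimal sketch of what such a program might look like, with buffering turned off. The sub name relay_log_entries is my own invention, and the "entry N:" output format is purely illustrative. The demonstration at the bottom uses in-memory filehandles so that the sketch runs on its own, without Apache.

```perl
use strict;
use warnings;
use IO::Handle;    # provides the autoflush method on filehandles

# Read log entries from $in and write a numbered copy of each to $out.
# autoflush makes every print reach $out immediately, instead of sitting
# in Perl's output buffer until the process exits.
sub relay_log_entries {
    my ($in, $out) = @_;
    $out->autoflush(1);
    my $seen = 0;
    while (my $line = <$in>) {
        chomp $line;
        $seen++;
        print {$out} "entry $seen: $line\n";
    }
    return $seen;
}

# Demonstration with in-memory filehandles.
my $input = '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] '
          . '"GET / HTTP/1.0" 200 654' . "\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";

my $n = relay_log_entries($in, $out);
close $out;
print $output;
```

When Apache runs the program, the log entries arrive on standard input, so you would call relay_log_entries(\*STDIN, $out), where $out is whatever handle you are reporting to. If you are printing to standard output, setting $| = 1 at the top of the program has the same effect as the autoflush call.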

If you want to log to a process of some kind, you might be well advised to look for a module that already implements the functionality that you are looking for. Check out http://modules.apache.org/ for a list of some of the modules available to do all sorts of cool things with Apache.

Note that logging to a process always uses more system resources than just logging to a text file. Unless you have an exceptionally good reason to log to a process, you will usually find that post-processing your log files is a better solution. It's a good idea to copy your log files off onto another machine for processing, so that there is no performance impact on your machine while the logs are being crunched.

Rotating Your Log Files

Log files get big. If you're not careful, you can end up filling up the drive (or partition) on which your log files are sitting, which can bring your server to a grinding halt.

The way around this is to move your log files to some other place before they get too big. This can be accomplished in a number of different ways. Some Unix variants come with a logrotate script that handles this for you. Red Hat, for example, comes preconfigured to rotate your logs for you every few days, based on either their size or their age.

Logfile::Rotate

If you want to do this yourself, you can use a Perl module (freely available from CPAN) called Logfile::Rotate. The following code, run periodically (perhaps once a week?) by cron, will rotate out your log file, keeping five previous log files at any given time. Each backup log file will be gzipped to conserve space.

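#!/usr/bin/perl
use strict;
use warnings;
use Logfile::Rotate;

# Rotate each Apache log, keeping five gzipped backups of each.
foreach my $log (qw(error_log access_log)) {
    my $logfile = Logfile::Rotate->new(
        File  => "/usr/local/apache/logs/$log",
        Count => 5,
        Gzip  => '/bin/gzip',
        # Restart Apache after rotating so that it opens a fresh log file.
        Post  => sub {
            system('/usr/local/apache/bin/apachectl', 'restart');
        },
    );
    $logfile->rotate();
}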

The Perl module takes care of all the details. You'll end up with files called things such as access_log.1.gz, access_log.2.gz, and so on. Each file will get bumped up one number each time and the file that used to be access_log.5.gz will be deleted each time. The Count parameter specifies how many log files are kept.

This keeps you from running out of space on your log drive, and keeps as much of an archive as you like.

rotatelogs

Additionally, Apache ships with a utility called rotatelogs, which enables you to rotate log files on a regular basis. You can use the rotatelogs utility by adding it to your Apache configuration file as a process to which you will log. The syntax will look like this:

CustomLog "|/usr/local/apache/bin/rotatelogs /var/log/archive/apachelog 86400" common

The parameter 86400 here is the number of seconds after which the log will be moved and a new log started.

The path /var/log/archive/apachelog specifies where the log files will be written. More specifically, each log file is named by taking this argument and appending a timestamp.

86400 seconds, by the way, is 24 hours.
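To make the timestamp naming concrete, here is a small worked example. It assumes, as the rotatelogs documentation describes, that the timestamp is the time in epoch seconds at which the log period nominally starts, and that this is always a multiple of the rotation time. The sub name period_start is my own, and the request time is the one from the sample access_log entry earlier in this chapter.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Start of the rotation period containing $now, assuming periods are
# aligned to multiples of $period seconds since the epoch.
sub period_start {
    my ($now, $period) = @_;
    return $now - ($now % $period);
}

# 966710857 is 19/Aug/2000:14:47:37 -0400 expressed in epoch seconds.
my $start = period_start(966710857, 86400);
my $name  = "/var/log/archive/apachelog.$start";

print "$name\n";    # /var/log/archive/apachelog.966643200
print strftime('%Y-%m-%d %H:%M:%S UTC', gmtime($start)), "\n";
                    # 2000-08-19 00:00:00 UTC
```

So with a rotation time of 86400, a request made that afternoon would land in the file that nominally started at midnight UTC that day. Knowing how the names are built makes it easier to write a cron job that compresses or deletes the older files.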

Logging for Multiple Virtual Hosts

When you have more than one virtual host on the same machine, you should have separate log files for each host. This will eliminate the problems related to trying to pull log files apart into accesses from the various hosts after the fact.

In each of your VirtualHost sections, simply specify a log file for that host. You can then handle each log file separately when it comes time to run reports.
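A configuration along these lines would do it. The host names, document roots, and log file names here are hypothetical, and the *:80 address is an assumption about how your virtual hosts are set up; use whatever your VirtualHost sections already contain.

```apache
# Hypothetical host names and paths, for illustration only.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /usr/local/apache/htdocs/example.com
    CustomLog /usr/local/apache/logs/example.com-access_log common
    ErrorLog /usr/local/apache/logs/example.com-error_log
</VirtualHost>

<VirtualHost *:80>
    ServerName www.example.org
    DocumentRoot /usr/local/apache/htdocs/example.org
    CustomLog /usr/local/apache/logs/example.org-access_log common
    ErrorLog /usr/local/apache/logs/example.org-error_log
</VirtualHost>
```

Each host now writes its own access and error logs, so you can point your reporting tool at one host's files without filtering out the others.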

There is one concern here, and that is the number of available file handles. If you are running hundreds of virtual hosts, and have a log file per host, you may run out of available file handles. This can cause system instability and can even cause your system to halt. This is primarily a concern on servers that are hosting a very large number of virtual hosts. In that case, you will need to consult the documentation for your particular operating system regarding the available number of file handles.

Summary

In this chapter, we've talked about various aspects of logging with Apache. You should now be equipped to log whatever information you're interested in, and get all sorts of useful statistics out of those log files.
