“Quickly, bring me a beaker of wine, that I may wet my brain and say something clever.”
--Aristophanes
Apache comes with built-in mechanisms that log activity on your server. In this chapter I'll talk about the standard way that Apache writes log files, and some of the tricks for getting more useful information and statistics out of your server.
If you have done a default installation of Apache, two log files will be written when you run your server. These files are called access_log (access.log on Windows) and error_log (error.log on Windows). In a default installation, they can be found in /usr/local/apache/logs. On Windows, the logs are in the logs subdirectory of wherever you installed Apache. Some package managers put the log files in other places, and you'll have to poke around to find them, or check the configuration file for the configured location. Common places include /var/log/ and /usr/adm/.
access_log is, as the name suggests, the log of all accesses to your server. A typical entry in this file would look like
216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654
This line contains seven pieces of information. Actually, two of them are blank in this example, but there is space for seven pieces of information.
The first piece of information is the address of the remote host. That is, who is looking at your Web site. In the previous example, the host visiting my Web site is 216.35.116.91, which is, incidentally, the IP address of the machine called si3001.inktomi.com. (I figured that out by looking up the address in DNS, with the nslookup utility.) inktomi.com is a company that makes Web searching software. (I looked at their Web site.) Because the same IP address requested the file robots.txt just a few seconds earlier, I suspect that this is a Web searching spider that was indexing my Web site. (See Chapter 23, “Web Spiders,” for more information about spiders and robots.) So, just based on that first piece of information, and a glance back in the log file, I've already found out quite a bit of information about my visitors.
By default, this address is just the IP address of the remote host. You can tell Apache to look up all the hostnames, and put those hostnames in the log instead of the IP address. This is not a very good idea because it greatly slows down the logging process, and, therefore, slows down your entire server. (See Chapter 13, “Performance Tuning,” for more tips about performance.) Various other tools will go through your log after the fact and resolve all the IP addresses to hostnames, so there's no real advantage to doing this anyway.
But, if you want to, you can tell Apache to do these lookups with the directive:
HostNameLookups on
Setting HostNameLookups to double, rather than on, causes Apache to follow the reverse lookup with a forward lookup on the name that it finds, to verify that the name points back to the IP address that you started with. The value is set to off by default.
The second slot is blank, and almost always will be. The “-” is a placeholder for the second piece of information, which is supposed to be the identity of the visitor: not just their login name but their e-mail address, or another unique identifier. This information is supposed to be returned by identd, or directly by the browser. In the old days, back when Netscape 0.9 was the dominant browser, you would usually have e-mail addresses in this spot. However, it did not take long for unsavory marketing types to think that it would be a good idea to collect those e-mail addresses and send them unsolicited e-mail (also known as spam). So, before very long, this feature was removed from just about every browser on the market. You will almost never find information in this field.
The third piece of information is also blank. The information that would appear there is the username with which the visitor authenticated. This will appear, of course, only when you have required authentication for a particular resource. So for the majority of entries in your log file, and for most sites, this will be blank. See Chapter 21, “Authentication and Authorization,” for more details.
Next we have the time when the request was made. This information is enclosed in square brackets, and is in what is called standard-English format. So the request in the previous example was made at 14:47:37 on Saturday, August 19. The -0400 on the end of the field means that the server is in a time zone four hours behind UTC.
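To make the time field concrete, the following Python sketch parses it with the standard datetime module. The helper name parse_access_time is made up for this example, and the sketch assumes an English (C) locale so that the month abbreviation is recognized.

```python
from datetime import datetime, timezone

def parse_access_time(field):
    """Parse the bracketed access_log time, e.g. [19/Aug/2000:14:47:37 -0400]."""
    return datetime.strptime(field.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

stamp = parse_access_time("[19/Aug/2000:14:47:37 -0400]")
print(stamp.strftime("%A"))            # day of the week for that date
print(stamp.astimezone(timezone.utc))  # the same instant expressed in UTC
```

Converting to UTC this way is handy when you need to compare logs from servers in different time zones.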
The next piece of information is probably the most useful piece in the record. It tells what request was actually made of the server. This is typically in the format METHOD RESOURCE PROTOCOL.
In the previous example, the METHOD is GET. The other most common methods will be POST and HEAD. There are a number of other valid methods, but those three are what you will see most of the time.
The RESOURCE is the actual document, or URL, that was requested from the server. In this example, the client requested /, which is the root, or front page, of the server. In most configurations, this corresponds to the file index.html in the DocumentRoot directory, but could be something else, depending on your server configuration.
The PROTOCOL is usually going to be HTTP, followed by a version number. The version number will be either 1.0 or 1.1, with the proportions being roughly even. HTTP is the protocol that makes the Web work. HTTP/1.0 was the earlier version of this protocol, and 1.1 is the more recent version.
The sixth piece of information is a status code. This tells you whether the request was successful, or if it encountered some problem. Most of the time this is 200, which means that the transfer was successful, and everything went well. In general, a status code that starts with 2 was successful. Starting with a 3 means that the request was redirected somewhere else for some reason. Starting with a 4 means that the user did something wrong, and starting with a 5 means that the server did something wrong.
The exact meanings of these status codes are included in the tables at the end of this section.
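The first-digit rule is easy to express in code. The following Python sketch is only an illustration; the function name status_class and the category labels are assumptions chosen for the example, not anything Apache provides.

```python
def status_class(code):
    """Map a three-digit HTTP status code to the broad category of its first digit."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 leaves just the leading digit.
    return classes.get(code // 100, "other")

for code in (200, 304, 404, 500):
    print(code, status_class(code))
```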
The seventh and final piece of information is the total number of bytes that were transferred to the client. This can tell you if a transfer was interrupted (if the number is different from the size of the file). Adding them up will tell you how much data your server transferred in a day, or week, or whatever.
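Putting the seven fields together, a line in this format can be split apart with a single regular expression. The sketch below is in Python; the pattern and the name parse_common_log_line are written for this example and assume the entry follows the layout described above exactly.

```python
import re

# One named group per field of the common log format.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_common_log_line(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Apache writes "-" rather than 0 when no body bytes were sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

sample = '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654'
entry = parse_common_log_line(sample)
# The request itself breaks down into METHOD RESOURCE PROTOCOL.
method, resource, protocol = entry["request"].split(" ")
print(entry["host"], method, resource, protocol, entry["status"], entry["bytes"])
```

Summing the bytes value over every parsed entry gives the transfer totals mentioned above.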
The following is a complete listing of possible values for the status codes, the sixth piece of information in an access_log entry.
Table 24.2. 200-Series HTTP Status Codes
200 | Successful |
---|---|
200 | OK: The request was successfully completed. |
201 | Resource created: The resource was successfully created. |
202 | Accepted: The request has been accepted for processing, but the processing has not been completed. |
203 | Nonauthoritative information: The information is not the definitive set as available from the origin server, but has been gathered from a local or third-party copy. |
204 | No content: The request was fulfilled, but no content needs to be returned. |
205 | Reset content: The request has been fulfilled, and the client should reset the document view that caused the request to be sent. For example, reset the contents of an HTML form so that the user can enter new information into that form. |
206 | Partial content: The partial GET request has been completed. This will be in response to a GET request that included a Range header, requesting only a portion of the resource. |
Table 24.3. 300-Series Server Status Codes
300 | Redirection |
---|---|
300 | Multiple choices: The requested resource can be fulfilled with any one of several choices. |
301 | Moved permanently: The requested resource has been permanently moved to a new location. |
302 | Found: The resource is temporarily located somewhere else, but the client should continue to use the same URL in the future. |
303 | See other: Usually the same as a 302. The response to the requested URL can be found at another location and should be retrieved from there. |
304 | Not modified: The document has not been modified since the specified date. |
305 | Use proxy: The requested resource must be requested through the specified proxy, which is sent in the Location header. |
306 | Unused |
307 | Temporary redirect: The resource has temporarily moved to a new location, and the client should repeat the request using that new location. |
Table 24.4. 400-Series Server Status Codes
400 | Client error |
---|---|
400 | Bad request: The request was not understood by the server. |
401 | Unauthorized: The request requires user authentication. This response is accompanied by a request for the necessary credentials. See Chapter 21 for more details. |
402 | Payment required: Not yet used. |
403 | Forbidden: The request was understood, but is being refused. |
404 | Not found: The requested resource could not be located. |
405 | Method not allowed: The method used is not one of the methods permitted for the requested resource. |
406 | Not acceptable: The requested resource is only available in representations which the client has indicated are not acceptable. See Chapter 10, “Content Negotiation,” for more information on Content Negotiation and Accept headers. |
407 | Proxy authentication required: Similar to 401, but indicates that a proxy server requires authentication. |
408 | Request timeout: The client did not produce a request in the time that the server was willing to wait. |
409 | Conflict: The request could not be completed because of a conflict. |
410 | Gone: The resource is no longer available, and there is no known forwarding address. |
411 | Length required: The server will not accept the request without a Content-Length header. |
412 | Precondition failed: A precondition specified in the request headers evaluated to false. |
413 | Request entity too large: The request was larger than the server was willing or able to process. |
414 | Request URI too long: The request URI is longer than the server is willing to interpret. Note that this is not the same as 413, which refers to the entire request entity, including headers. |
415 | Unsupported Media Type: The request is in a format not supported by the requested resource for the requested method. |
416 | Request range not satisfiable: The client request included a Range specifier, which does not specify a valid range for the requested resource. For example, it requests a byte-range that extends past the size of the requested file. |
417 | Expectation failed: The expectation expressed in the Expect request header could not be met by the server. |
Table 24.5. 500-Series Server Status Codes
500 | Server Error |
---|---|
500 | Internal server error: The server encountered an unexpected condition that prevented it from fulfilling the request. |
501 | Not implemented: The server does not support the functionality required to fulfill the request. |
502 | Bad gateway: While acting as a gateway or proxy, the server received an invalid response from the upstream server. |
503 | Service unavailable: The server is currently unavailable. |
504 | Gateway timeout: When acting as a gateway or proxy, the server did not receive a timely response from the upstream server. |
505 | HTTP version not supported: The server does not support the HTTP protocol that was specified in the request. |
Where the access_log is located is actually a configuration option. If you look in your configuration file, httpd.conf, you should see a line that looks like the following:
CustomLog /usr/local/apache/logs/access_log common
If you're running an older version of Apache, this line might look a little different. It might be the TransferLog directive instead of the CustomLog directive. If that is the case, I really recommend that you upgrade if at all possible.
The CustomLog directive specifies where a particular log file should be stored, and what format that log should be in. The log format described previously is the common log format, which has been in use as the standard since the beginning of Web servers. That's why it still contains the ident information field, even though almost no clients actually pass that information to the server.
The path specified there is the location of the log file.
Note that this location should be secured against random users writing to it. The log file is opened by the user that starts the server (usually root), not by the user specified with the User directive, so a log directory that other users can write to is a potential security problem.
The LogFormat directive defines the actual format of the log file. Long ago, log files came in one format, called the common format, and you were pretty much stuck with it. Then came the custom log file format, and it turned out to be such a good idea that even the common format was reimplemented as a custom log file format.
LogFormat sets up a format and gives it a nickname by which you can refer to it. CustomLog sets up an actual log file, and indicates the format (by nickname, usually) that the file will use.
For example, in your default httpd.conf file, you'll find the following line:
LogFormat "%h %l %u %t \"%r\" %>s %b" common
This directive creates a log format called common, which is in the format specified in quotes. Each one of those letters means a particular piece of information, which is put into the log file in the order indicated.
The available variables, and their meanings, are listed in the documentation, and are reproduced in Table 24.6:
Table 24.6. LogFormat Variables
In each case, the “…” indicates an (optional) condition. If the condition is met, then the specified variable is displayed. If the condition is omitted, then the variable will be replaced with a “-” if it is not defined. I'll give some examples of this shortly.
The LogFormat line shown in the previous example, from the default httpd.conf file, creates a log format called common, which contains the remote host, remote logname, remote user, the time of the transaction, the first line of the request, the status of the request, and the number of bytes sent. This is the common log format explained in the previous section.
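One way to see how a format string maps onto log entries is to turn it into a regular expression. The Python sketch below does this for the seven directives in the common format only; the table DIRECTIVE_PATTERNS and the function format_to_regex are assumptions written for this example, and any directive not in the table is treated as literal text and will simply fail to match.

```python
import re

# Regex fragments for the directives used by the common format only.
DIRECTIVE_PATTERNS = {
    "%h": r'(?P<host>\S+)',
    "%l": r'(?P<logname>\S+)',
    "%u": r'(?P<user>\S+)',
    "%t": r'\[(?P<time>[^\]]+)\]',
    "%r": r'(?P<request>[^"]*)',
    "%>s": r'(?P<status>\d{3})',
    "%b": r'(?P<bytes>\d+|-)',
}

def format_to_regex(log_format):
    """Build a compiled regex from a LogFormat string of known directives."""
    parts = []
    # Splitting on a capturing group keeps the directives in the result.
    for token in re.split(r'(%>?[a-zA-Z])', log_format):
        if token in DIRECTIVE_PATTERNS:
            parts.append(DIRECTIVE_PATTERNS[token])
        else:
            parts.append(re.escape(token))  # literal spaces and quotes
    return re.compile("".join(parts))

common = format_to_regex('%h %l %u %t "%r" %>s %b')
```

Calling common.match() on a line from access_log then yields the fields by name, in the order that the format string lists them.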
Sometimes you'll only want a particular piece of information logged if it is defined. These are what the “…” referred to previously provide for. If, between the % and the variable, you put one or more HTTP status codes, the variable will only be logged in the event that the request returns one of those status codes. So, if you're trying to keep a log of all the broken links on your site, you might have the following:
LogFormat "%404{Referer}i" BrokenLinks
Conversely, if you want to log requests that don't match a particular code, put a ! in there:
LogFormat %!200U SomethingWrong
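The effect of these conditions can be modeled in a few lines of Python. This is only a sketch of the behavior, not Apache's implementation, and the function name conditional_value and its parameters are invented for the example. Note that a request that fails the condition still produces a log line; the field just contains “-”.

```python
def conditional_value(value, status, codes=(), negate=False):
    """Model %404{Referer}i or %!200U: keep value only if the status condition holds."""
    if codes:
        matched = status in codes
        if negate:
            matched = not matched  # the "!" form inverts the test
        if not matched:
            return "-"
    # An undefined value is also logged as "-".
    return value if value else "-"

# %404{Referer}i: the referer is logged only for 404 responses.
print(conditional_value("http://example.com/links.html", 404, codes=(404,)))
print(conditional_value("http://example.com/links.html", 200, codes=(404,)))
# %!200U: the URL path is logged for anything other than a 200.
print(conditional_value("/missing.html", 404, codes=(200,), negate=True))
```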
After you have set up one or more LogFormats, you just have to apply them to a particular log file. This is done with the CustomLog directive. You can set up as many log files as you like. Each one needs to specify a log file location and which LogFormat you want to use:
CustomLog /var/log/httpd/bogus_log SomethingWrong
CustomLog /usr/local/apache/logs/broken BrokenLinks
CustomLog /usr/local/apache/logs/access_log common
The only disadvantage to doing this is that if you get some “off the shelf” log analysis application, it will assume that you are using common or combined log format because those are the ones that are most widely in use. However, many log analysis packages are able to do a good job of guessing what format you are using, if you have something other than the expected format.
The format of the entries in the error log is rather different from the entries in the access log that we saw previously.
But the two logs are similar because they both provide a lot of useful information, which you can use in analyzing how your server is being used, and what is going wrong.
Your error log file should be in the same location as your access log file. It will be called error_log, or, on Windows machines, error.log.
The location of your error log can be configured with the ErrorLog directive:
ErrorLog logs/error.log
The location, unless it has a leading slash, is assumed to be relative to the ServerRoot directory.
In a default Apache installation, the log file is located in /usr/local/apache/logs. As with the access log, if you installed with one of the various package managers out there, you might find it just about anywhere.
The error log, as the name suggests, contains a record of everything that went wrong while your server was running. It also contains general diagnostic messages, such as a notification of when your server was restarted, or shut down.
You can set your log level higher or lower to control the amount, and type, of messages that appear in your log file. This is configured with the LogLevel directive. The default setting of this directive is error, which tells you about error conditions. The complete list of possible settings is contained in Table 24.7.
Table 24.7. LogLevel Values
Level | Description | Example |
---|---|---|
emerg | Emergencies—system is unusable. | “Child cannot open lock file. Exiting” |
alert | Action must be taken immediately. | “getpwuid: couldn't determine username from uid” |
crit | Critical Conditions. | “socket: Failed to get a socket, exiting child” |
error | Error conditions. | “Premature end of script headers” |
warn | Warning conditions. | “child process 1234 did not exit, sending another SIGHUP” |
notice | Normal but significant condition. | “httpd: caught SIGBUS, attempting to dump core in …” |
info | Informational. | “Server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers)…” |
debug | Debug-level messages. | “Opening config file …” |
In most cases, the things that you see in your log file will be in two categories: document errors and CGI errors. You will also occasionally see configuration errors, server start, and server stop messages.
Document errors are things that are in the 400 series of server response codes. The most common of these is 404—Document Not Found. 404's are followed in frequency, on most servers, by authentication errors.
A 404 error occurs whenever someone requests a resource—a URL—that is not on your server. Either they have mistyped something, there was a typo in a link somewhere, or you moved or deleted a document that used to be on your server.
Jakob Nielsen, who is a highly respected usability expert, says that you should never move or delete any resource from your Web site without providing a redirect of some variety. If you're not already familiar with Nielsen's writings, you should take a look sometime at http://www.useit.com/.
When a client is unable to locate a document on your server, you'll see an entry like this in your logs:
[Fri Aug 18 22:36:26 2000] [error] [client 192.168.1.6] File does not exist: /usr/local/apache/bugletdocs/Img/south-korea.gif
Note that, as in the case of the access_log file, this record is broken down into several fields.
First, we have the date/time stamp. The first thing that you might notice is that the format is not the same as the format in the access_log, which we called the standard-English format. This is merely an accident of history.
The logging mechanisms for access and error logs were implemented by different people, who just happened to use different date formats. By the time they were cooperating a little more closely, there already existed log parsing applications that were counting on the particular date formats that had been used, and it was deemed to be too late to change.
Next, we have the level of the message. This will be one of the levels specified in the documentation for LogLevel (see Table 24.7). error is right between warn and crit. This simply indicates how serious the problem is. A 404 error means that you irritated someone, but it's not actually a critical condition affecting the health of your server.
The next field indicates the address of the client machine that made the request. In this case, it is a machine on my local network.
The last part of the log entry is the actual error message. In the case of a 404, it gives you the full path of the file that the server tried to serve. This is particularly useful when you're getting a 404 on a file that you just know is there. Frequently you have a configuration wrong, or the file is on a different virtual host than you thought, or some other strangeness.
Note that document errors, because they are a direct result of a client request, will be accompanied by an entry in access_log as well.
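The fields of an error log entry can be pulled apart in much the same way as an access log entry. The following Python sketch assumes the bracketed layout shown above; the pattern and the name parse_error_line are written for this example, and the client field is treated as optional because server start and stop messages have no client.

```python
import re
from datetime import datetime

ERROR_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)'
)

def parse_error_line(line):
    """Split an error_log entry into time, level, client, and message."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The error log uses a different date layout from the access log.
    entry["time"] = datetime.strptime(entry["time"], "%a %b %d %H:%M:%S %Y")
    return entry

sample = ("[Fri Aug 18 22:36:26 2000] [error] [client 192.168.1.6] "
          "File does not exist: /usr/local/apache/bugletdocs/Img/south-korea.gif")
entry = parse_error_line(sample)
print(entry["level"], entry["client"], entry["message"])
```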
Authentication errors will look very much the same:
[Tue Apr 11 22:13:21 2000] [error] [client 192.168.1.3] user rbowen authentication failure for "/cgi-bin/hirecareers/company.cgi": password mismatch
Perhaps the most useful purpose of the error log is troubleshooting misbehaved CGI programs (or other content generation programs, such as mod_perl). Anything that a CGI program emits to STDERR (Standard Error) gets appended directly to the error log for your perusal. This means that (well-written) CGI programs, when they have problems, will tell you, via the log file, exactly what that problem is.
The downside to this is that you end up with stuff in the error log that is not in any well-defined format, and so it makes it very hard to have any automatic error-log parsing to get useful information out.
What follows is an example of an entry in an error log from problematic Perl CGI code:
[Wed Jun 14 16:16:37 2000] [error] [client 192.168.1.3] Premature end of script headers: /usr/local/apache/cgi-bin/TestProg/announcement.cgi
Global symbol "$rv" requires explicit package name at /usr/local/apache/cgi-bin/TestProg/announcement.cgi line 81.
Global symbol "%details" requires explicit package name at /usr/local/apache/cgi-bin/TestProg/announcement.cgi line 84.
Global symbol "$Config" requires explicit package name at /usr/local/apache/cgi-bin/TestProg/announcement.cgi line 133.
Execution of /usr/local/apache/cgi-bin/TestProg/announcement.cgi aborted due to compilation errors.
Although this entry actually does follow the same format as the previous 404 error, in that it has a date, error level, and a client address, the error message itself is several lines long, which tends to confuse some log-parsing software.
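If you write your own parser, one simple way around multi-line messages is to treat any line that does not begin with a bracketed date as a continuation of the entry before it. The Python sketch below takes that approach; the names ENTRY_START and group_entries are made up for the example, and the approach assumes that output from different requests is not interleaved, which will not always hold on a busy server.

```python
import re

# A new entry starts with a bracketed date such as [Wed Jun 14 16:16:37 2000].
ENTRY_START = re.compile(r'^\[\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}\]')

def group_entries(lines):
    """Attach continuation lines (script output on STDERR) to the entry before them."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append([line])    # start a new entry
        else:
            entries[-1].append(line)  # continuation of the previous entry
    return entries
```

Each item in the returned list is one logical entry: the dated line first, followed by whatever the script wrote after it.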
Even if you don't know Perl, you should be able to look at the previous error messages and glean some useful information about what went wrong. At the very least, you can tell on what lines the program had problems. Perl is very good about telling you where you made a mistake. Your mileage may vary based on what language you are using.
Without logs, it would be very difficult to troubleshoot most CGI programs because running a program from the command line is a rather different environment than running it from a Web server.
When actively developing a CGI program, or a content-generation application using some other technology, such as mod_perl, it's a good idea to actively watch the error log, so that when error conditions happen, you have immediate feedback.
This is done using the utility called tail. tail is a standard part of any Unix operating system, and tail clones are available for other operating systems.
At the command prompt, type the following:
tail -f /usr/local/apache/logs/error_log
This will show the last few lines of your log file, and, as lines are added to the file, it will show you those as they happen. The -f stands for “follow” because it follows the log as it grows.
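If tail is not available, the same idea can be approximated in a few lines of Python. This is a minimal sketch rather than a replacement for tail: the function name follow and its parameters are assumptions for the example, it polls the file instead of being notified of changes, and it does not handle a log file that is rotated or truncated while it runs.

```python
import os
import time

def follow(path, poll_interval=0.5, max_idle_polls=None, from_end=True):
    """Yield lines appended to path; stop after max_idle_polls empty polls, if set."""
    with open(path, "r") as handle:
        if from_end:
            handle.seek(0, os.SEEK_END)  # skip what is already in the file
        idle = 0
        while max_idle_polls is None or idle < max_idle_polls:
            line = handle.readline()
            if line:
                idle = 0
                yield line.rstrip("\n")
            else:
                idle += 1
                time.sleep(poll_interval)  # nothing new yet, wait and retry
```

With the defaults, a loop such as for line in follow("/usr/local/apache/logs/error_log"): print(line) runs until you interrupt it, printing each new entry as it arrives.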
If you want to use tail (and other Unix utilities) on Windows, you may want to try AINTX, which is a collection of AIX utilities for NT. You can find these at http://maxx.mc.net/jlh/nttools/html/nttools.htm.
It's extremely good practice to keep several terminal windows open, with your error log tailing in one window, and your access log tailing in the other, while you work on your site. This will tell you what is going on as it happens, so you know about problems before the customer has a chance to call you about them.
Although there is an enormous amount of information in the log files, it's not much good in its raw form.
Your marketing department, or the customer you are running the Web site for, will typically want to know how many people visited the site, what they looked at, how long they stayed, and where they found out about your site. All that information is (or might be) in your log files.
They will also want to know the names, addresses, and shoe sizes of those people, and, hopefully, their credit-card numbers. That information is not there and you need to know how to explain to your employer that not only is it not there, but the only way to get it is to explicitly ask your visitors for it and be willing to be told “no.”
A lot of information is available to put in your log files, including the following:
Address of the remote machine: This is almost the same as “who is visiting my Web site,” but not quite. More specifically, it tells you where that visitor is from. This will be something like buglet.rcbowen.com or proxy01.aol.com.
Time of visit: When did this person come to my Web site? This can tell you something about your visitors. If most of your visits come between the hours of 9 a.m. and 4 p.m., then you're probably getting visits from people at work. If it's mostly 7 p.m. through midnight, people are looking at your site from home.
Single records, of course, give you very little useful information, but across several thousand hits, you can start to gather useful statistics.
Resource requested: What parts of your site are most popular? Those are the parts that you should expand. Which parts of the site are completely neglected? Perhaps those parts of the site are just really hard to get to. Or, perhaps they are genuinely uninteresting, in which case you should spice them up a little. Of course, some parts of your site, such as your legal statements, are boring and there's nothing you can do about it, but they need to stay on the site for the two or three people that want to see them.
What's broken? And, of course, your logs tell you when things are not working as they should be. Do you have broken links? Do other sites have links to your site that are not correct? Are some of your CGI programs malfunctioning? Is a robot overwhelming your site with thousands of requests per second?
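As an example of turning single records into statistics, the following Python sketch counts requests by hour of the day, which is one way to answer the “when do people visit” question above. The function name hits_per_hour is invented for the example; it assumes access log lines in the common format and an English (C) locale for the month names, and it silently skips lines it cannot parse.

```python
from collections import Counter
from datetime import datetime

def hits_per_hour(lines):
    """Count requests by hour of day (server local time) from access_log lines."""
    counts = Counter()
    for line in lines:
        start, end = line.find("["), line.find("]")
        if start == -1 or end == -1:
            continue  # not an access_log entry
        try:
            stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # bracketed text that is not a timestamp
        counts[stamp.hour] += 1
    return counts
```

Passing an open log file to the function works because a file object yields its lines one at a time; counts.most_common(3) then gives the three busiest hours.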
HTTP is a stateless, anonymous protocol. This is by design, and is not, at least in my opinion, a shortcoming of the protocol. If you want to know more about your visitors, you have to be polite, and actually ask them. And be prepared to not get reliable answers. This is amazingly frustrating for marketing types. They want to know the average income, number of kids, and hair color of their target demographic. And they don't like to be told that that information is not available in the log files. However, it is quite beyond your control to get this information out of the log files. Explain to them that HTTP is anonymous.
Even what the log files do tell you is occasionally suspect. For example, you can expect to have numerous entries in your log files indicating that a machine called something like cache-mtc-am05.proxy.aol.com has visited your Web site. This tells you that this is a machine that is on the AOL network. But because of the way that AOL works, this might be one person visiting your site many times, or it might be many people visiting your site one time each. All requests coming from the AOL network are proxied. A proxy server is one that one or more people sit behind. They type an address into their browser. It makes that request to the proxy server. The proxy server gets the page (generating the log file entry on your Web site). It then passes that page back to the requesting machine. This means that you never see the request from the originating machine, but only the request from the proxy.
Another implication of this is that if, 10 minutes later, someone else sitting behind that same proxy requests the same page, a log file entry is not generated at all. They type the address, and that request goes to the proxy server. The proxy sees the request and thinks “I already have that document in memory. There's no point asking the Web site for it again.” So instead of asking your Web site for the page, it gives the copy that it already has to the client. So, not only is the address field suspect, but the number of requests is also suspect.
Most proxies will not cache dynamic content, such as the results of a CGI program, as that can change from one client to the next.
It might sound like the data that you receive is so suspect that it's useless. This is in fact not the case. It should just be taken with a grain of salt. The number of hits that your site receives is almost certainly not really the number of visitors that came to your site. But it's a good indication. And it still gives you some useful information. Just don't rely on it for exact numbers.
So, to the real meat of all this. How do you actually generate statistics from your Web server logs?
There are two main approaches that you can take here. You can either do it yourself, or you can get one of the existing applications that is available to do it for you.
Unless you have custom log files that don't look anything like the Common log format, you should probably get one of the available apps out there. There are some excellent commercial products, and some really good free ones, so you just need to decide what features you are looking for.
The following are some of the available programs on the market. This should not be considered a comprehensive list, and you should do your own research before choosing one for your site because they all have somewhat different feature sets and different reports.
The programs that I have chosen here either have versions that will run on both Unix and NT operating systems, or are written in Perl, and will consequently run anywhere.
Analog: The Analog Web site (http://www.statslab.cam.ac.uk/sret1/analog/) claims that about 29% of all Web sites that use any log analysis tool at all use Analog. They claim that this makes it the most popular log analysis tool in the world.
The example report, which you can see on the Analog Web site, seemed very thorough and contained all the stats that I might want. In addition to the pages and pages of detailed statistics, there was a very useful executive summary, which will probably be the only part that your boss will really care about.
Analog is free software.
WebTrends: WebTrends provides astoundingly detailed reports on your log files, giving you all sorts of information that you did not know you could get out of these files. And there are a lot of pretty graphs generated in the report.
It is, however, rather on the expensive side. You can look up the actual price on their Web site http://www.webtrends.com/default.htm.
It is also very slow, in comparison to other programs listed here.
WWWStat: WWWStat has been around for a very long time. It's fast, full-featured, and it's free. What more could you want? You can get it at http://www.ics.uci.edu/pub/websoft/wwwstat/ and there is a companion package (linked from that same page) that generates pretty graphs.
It is very easy to automate WWWStat so that it generates your log statistics every night at midnight, and then generates monthly reports at the end of each month.
Wusage: Wusage has also been around a very long time. It is a great program, full-featured, inexpensive, and generates useful graphical reports.
You can get Wusage at http://www.boutell.com/wusage/.
If you want to do your own log parsing and reporting, the best tool for the task is Perl. In fact, Perl's name (Practical Extraction and Report Language) is a tribute to its capability to extract useful information from logs and generate reports. (In reality, the name “Perl” came before the expansion of it, but I suppose that does not detract from my point.)
The Apache::ParseLog module, available from your favorite CPAN mirror, makes parsing log files simple, and, therefore, takes all the work out of generating useful reports from those logs.
For detailed information about how to use this module, install it and read the documentation. After you have installed the module, you can get at the documentation by typing perldoc Apache::ParseLog.
Trolling through the source code for WWWStat is another good way to learn about Perl log file parsing.
Because the log file format is so simple, writing code to chop each entry up into its component parts and do statistical analysis on them is straightforward.
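As a rough illustration of how little code this takes, here is a minimal sketch that splits one Common Log Format entry into the seven fields described earlier in this chapter. It uses a plain regular expression rather than Apache::ParseLog, and the function name parse_clf_line and the hash key names are my own invention for this example, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one Common Log Format entry into its seven fields.
# Returns a hash reference, or nothing if the line does not match.
# (parse_clf_line and the key names are hypothetical, chosen for this sketch.)
sub parse_clf_line {
    my ($line) = @_;
    my ($host, $ident, $user, $time, $request, $status, $bytes) =
        $line =~ /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(.*)" (\d{3}) (\S+)\s*$/
        or return;
    return {
        host    => $host,
        ident   => $ident,
        user    => $user,
        time    => $time,
        request => $request,
        status  => $status,
        bytes   => $bytes,
    };
}

# Try it on the sample entry shown at the start of the chapter.
my $entry  = '216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654';
my $fields = parse_clf_line($entry);
print "$fields->{host} requested '$fields->{request}' (status $fields->{status})\n";
```

From here, a report is mostly a matter of looping over the file and counting: for example, incrementing a hash entry keyed on the host or the status code for each parsed line.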
You don't have to log to a file; you can log to a process. This is particularly useful if you want your logs to go to a database, or to some process that will give some type of real-time statistics on your Web site traffic.
Using the CustomLog directive, you can, instead of specifying a file to which the log should be written, specify a pipe character (|), followed by the name of a program that is to receive the logging information.
For example:
CustomLog "|/usr/bin/apachelog" common
Here, /usr/bin/apachelog is some program that knows what to do with Apache log file entries. This might be as simple as a Perl program that processes the log entries in some fashion, or it might be something that writes entries to a database.
The main thing to be cautious about if you're going to do this is security. Log files are opened with the permissions of the user that starts the server. This is usually root, and it applies as well to logging to a process. Make sure that the process to which you are logging is secure. If you log to an insecure process (one that some non-root user can tinker with), you run the risk of having that process replaced by another that does unsavory things. If, for example, /usr/bin/apachelog is world-writable, any user could edit it to shut down your server, mail someone the password file, or delete important files, and all of this would be done with root permissions.
Secondly, you should be careful about buffering. If, for example, you log to a process written in Perl, you might find that, although your Web site appears to be active, nothing is being logged by your Perl program. However, when you shut down the server, it suddenly logs everything. This is because Perl is buffering the program's output, and does not write anything out until the buffer fills or the process terminates. Make sure you turn off output buffering (in Perl, by setting the $| variable to a true value) if you want to get any sort of real-time reporting.
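To make the buffering point concrete, here is a minimal sketch of the core of a log-receiving program that turns on autoflush for its output, so that each entry is written as soon as it arrives. The function name relay_log is hypothetical, and the demonstration uses in-memory filehandles so that it can be run on its own; a real piped-log program would read from STDIN and write to wherever you want the entries to go.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;    # provides the autoflush() method on filehandles

# Copy each log entry from $in to $out as soon as it arrives.
# Returns the number of entries handled.
# (relay_log is a hypothetical name used only in this sketch.)
sub relay_log {
    my ($in, $out) = @_;
    $out->autoflush(1);    # write every entry immediately instead of buffering
    my $count = 0;
    while (my $entry = <$in>) {
        print {$out} $entry;
        $count++;
    }
    return $count;
}

# Demonstration with in-memory filehandles. A program started by
# Apache would pass \*STDIN as the input handle instead.
my $sample = qq{216.35.116.91 - - [19/Aug/2000:14:47:37 -0400] "GET / HTTP/1.0" 200 654\n};
open(my $in, '<', \$sample) or die "Cannot read sample: $!";
my $captured = '';
open(my $out, '>', \$captured) or die "Cannot open buffer: $!";
my $handled = relay_log($in, $out);
print "Relayed $handled entry\n";
```

If your program only ever prints to STDOUT, setting $| = 1 near the top of the script has the same effect for that one filehandle.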
If you want to log to a process of some kind, you might be well advised to look for a module that already implements the functionality that you are looking for. Check out http://modules.apache.org/ for a list of some of the modules available to do all sorts of cool things with Apache.
Note that logging to a process always uses more system resources than just logging to a text file. Unless you have an exceptionally good reason to log to a process, you will usually find that post-processing your log files is a better solution. It's a good idea to copy your log files off onto another machine for processing, so that there is no performance impact on your machine while the logs are being crunched.
Log files get big. If you're not careful, you can end up filling up the drive (or partition) on which your log files are sitting, which can bring your server to a grinding halt.
The way around this is to move your log files to some other place before they get too big. This can be accomplished a number of different ways. Some Unix variants come with a logrotate script that handles this for you. Red Hat, for example, comes preconfigured to rotate your logs for you every few days, based on either their size or their age.
If you want to do this yourself, you can use a Perl module (freely available from CPAN) called Logfile::Rotate. The following code, run periodically (perhaps once a week?) by cron, will rotate out your log files, keeping five previous log files at any given time. Each backup log file will be gzipped to conserve space.
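#!/usr/bin/perl
use Logfile::Rotate;

foreach $log (qw(error_log access_log)) {
    # Keep five gzipped generations of each log file
    $logfile = new Logfile::Rotate(
        File  => "/usr/local/apache/logs/$log",
        Count => 5,
        Gzip  => '/bin/gzip',
        # Restart Apache after rotating so that it opens a fresh log file
        Post  => sub {
            `/usr/local/apache/bin/apachectl restart`;
        }
    );
    $logfile->rotate();
}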
The Perl module takes care of all the details. You'll end up with files called things such as access_log.1.gz, access_log.2.gz, and so on. Each file will get bumped up one number each time, and the file that used to be access_log.5.gz will be deleted each time. The Count parameter specifies how many log files are kept.
This keeps you from running out of space on your log drive, and keeps as much of an archive as you like.
Additionally, Apache ships with a utility called rotatelogs, which enables you to rotate log files on a regular basis. You can use the rotatelogs utility by specifying it in your Apache configuration file as a process to which you will log. The syntax will look like this:
CustomLog "|/usr/local/apache/bin/rotatelogs /var/log/archive/apachelog 86400" common
The parameter 86400 here is the number of seconds after which the log will be moved and a new log started.
The path /var/log/archive/apachelog specifies where the old log files will be put for archival purposes. More specifically, the log file will be backed up to a file named by taking this argument and appending a timestamp.
86400 seconds, by the way, is 24 hours.
When you have more than one virtual host on the same machine, you should have separate log files for each host. This will eliminate the problems related to trying to pull log files apart into accesses from the various hosts after the fact.
In each of your VirtualHost sections, simply specify a log file for that host. You can then handle each log file separately when it comes time to run reports.
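As a sketch of what this might look like, the following configuration fragment gives two virtual hosts their own access and error logs. The host names and log file names are made up for this example, and the exact form of the VirtualHost address depends on your Apache version and on how your virtual hosts are set up, so treat this as an outline rather than something to paste in unchanged.

```apache
# Hypothetical hosts and file names, for illustration only
<VirtualHost *:80>
    ServerName www.example.com
    CustomLog logs/example.com-access_log common
    ErrorLog  logs/example.com-error_log
</VirtualHost>

<VirtualHost *:80>
    ServerName www.example.org
    CustomLog logs/example.org-access_log common
    ErrorLog  logs/example.org-error_log
</VirtualHost>
```

Relative paths such as logs/example.com-access_log are resolved against the ServerRoot, so on a default installation these files would sit alongside the main access_log and error_log.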
There are some concerns with available file handles. That is, if you are running hundreds of virtual hosts, and have a log file per host, you may run out of available file handles. This can cause system instability and can even cause your system to halt. This is primarily a concern on servers that are hosting a very large number of virtual hosts. In that situation, you will need to consult the documentation for your particular operating system regarding the available number of file handles.