IN THIS CHAPTER
I got remote control and a color T.V. I don't change channels so they must change me.
—"Close to the Borderline," Billy Joel
When the Web was young—or at least when you were new to the Web—it was interesting to spend hours clicking links and looking at Web pages. Eventually, you got over that, and now you just want the information you want, when you want it, and you no longer want to do the work for yourself. That's where spiders come in.
Spiders are programs that walk the Web for you, following links and grabbing information. They're also known as robots and crawlers. You can find a list of many of the currently available and active spiders online at http://info.webcrawler.com/mak/projects/robots/active/html/index.html.
Spiders are very useful, but they can also cause a lot of problems. If you have a Web site on the Internet, you will find that a steady percentage of the visits to your site are from spiders. This is because most of the major search engines use spiders to index the Web, including your Web site, for inclusion in their database.
This chapter discusses what a spider is, how spiders can make your life easier, and how to protect your Web site against spiders that you don't want to let in. You'll also learn how to give spiders the right information about your site when they visit. Finally, you'll learn briefly about writing your own spider.
The Web Robots FAQ defines a robot as “a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced.” (You can find the Web Robots FAQ at http://info.webcrawler.com/mak/projects/robots/faq.html.) What this means is that a spider starts with some page and downloads all the pages that page has links to. Then, for each of those pages, it downloads all the pages they are linked to, and so on, ad infinitum. This is done automatically by the spider program, which will presumably be collecting this information for some useful purpose.
Spiders might be collecting information for a search engine, collecting e-mail addresses for sending spam, or downloading pages for offline viewing.
Here are some examples of common types of robots:
Scooter is the robot responsible for the AltaVista search engine. Scooter fetches documents from the Web, which are then incorporated into AltaVista's database. You can search that database at http://www.altavista.com/. Most major search engines also use some type of spider to index the Web, and you will see many of them in your server logs.
EmailSiphon, and various other spiders with similar names, rove the Web, retrieving e-mail addresses from Web pages. The people who run EmailSiphon then sell those addresses to various low-lifes who then send unsolicited bulk e-mail (also known as spam) to those addresses. See the later section on excluding spiders from your site to learn how to deny access to these robots and protect your mailbox.
MOMspider is one that you can download and use on your own site to validate links and generate statistics. You can run it from your server or from your desktop. There are a large number of similar products for Web site developers to use on their own sites.
In general, spiders are good things. They can help you out in a number of ways, such as indexing your site, searching for broken links, and validating the HTML on your pages.
A common use for spiders is collecting documents from the Web for you, so that you can look at them at your leisure when you aren't online. This is called offline browsing or caching, among other things, and the products that do this are sometimes called personal agents or personal spiders. One such product, called AvantGo (http://www.avantgo.com), will even download Web content to your hand-held computer so that you can look at your favorite Web pages while on an airplane or bus.
Conversely, spiders also can cause a lot of problems on your Web site, because their traffic patterns are not the sort that you typically plan for.
One potential problem is server overload. Whereas a human user is likely to wait at least a few seconds between downloading one page and the next, the spider can start on the next page immediately after receiving the first page. Also, it can fork multiple processes and download several pages at the same time. If your server isn't equipped to handle that many simultaneous connections, or if you don't have the bandwidth to handle the requests, this might cause visitors to have a long wait for their pages to load, or even cause the server to become overloaded.
Occasionally, poorly written spiders might get trapped in some infinite portion of your Web site, such as a CGI program that generates pages with links back to itself. The spider might spend hours or days chasing its tail, so to speak. This can cause your log files to grow at an alarming rate, skew any statistical information that you might be collecting, and lead to an overloaded server.
Before you try to keep spiders out of your site, you might want to get a good idea of what spiders are visiting your site and what they're trying to do. You'll notice log entries from spiders in several ways:
The first thing that will stand out will be the user agent (if you are logging the user agent in your log files). It won't look like an ordinary browser (because it's not) and will tend to have a name such as harvester, black widow, Arachnophilia, and the like. You can see a full listing of the various known spiders in the Web Robots FAQ, discussed earlier in this chapter.
Of course, it's also important to understand that spiders, like any other Web client, are free to provide whatever client description they choose, so they could just as easily claim to be Netscape, IE, or "Bob's Handy-Dandy Browser". There is no guarantee whatsoever that the USER_AGENT string can be trusted to be true.
You might notice that a large number of pages are requested by the same client, often in quick succession.
The address from which the client is connecting can tell you quite a lot. Connections from the various search engines are frequently spiders indexing your site. For example, a connection from lobo.yahoo.com is a good indication that your site is being indexed for the Yahoo! Internet directory.
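One quick way to spot spiders is to tally the user agents in your access log. The following sketch assumes a log in Apache's combined format and a file named access_log (both are assumptions; adjust for your own server). It builds a three-line sample log so you can see the technique in isolation:

```shell
# Build a tiny sample log in combined format (three requests, two clients).
cat > access_log <<'EOF'
10.0.0.1 - - [01/Jan/2000:00:00:00 -0500] "GET / HTTP/1.0" 200 512 "-" "Mozilla/4.0"
10.0.0.2 - - [01/Jan/2000:00:00:01 -0500] "GET /a.html HTTP/1.0" 200 512 "-" "Scooter/2.0"
10.0.0.2 - - [01/Jan/2000:00:00:02 -0500] "GET /b.html HTTP/1.0" 200 512 "-" "Scooter/2.0"
EOF

# The user agent is the sixth double-quote-delimited field; count each one.
awk -F'"' '{print $6}' access_log | sort | uniq -c | sort -rn
```

On real logs, a spider tends to show up as a large request count under a single non-browser user agent near the top of this list.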
You can keep spiders off your site—or at least off certain parts of your site—in several different ways. These methods usually rely on the cooperation of the spider itself. However, you can do a number of things at the server level to deny access.
As mentioned earlier, you will probably want to keep spiders out of your CGI directories. You also will want to keep them out of portions of your site that change with such regularity that indexing would be fruitless. And, of course, there might be parts of your site that you'd just rather not have indexed, for whatever reason.
The Robots Exclusion Protocol, also known as A Standard for Robot Exclusion, is a document drafted in 1994 that outlined a method for telling robots what parts of your site you want them to stay out of. You can find the full text of this document on the WebCrawler Web site at http://info.webcrawler.com/mak/projects/robots/norobots.html.
To implement this exclusion on your Web site, you need to create a text file called robots.txt and place it in your server's document root directory. When a spider visits your site, it is supposed to fetch this document before going any further, to find out what rules you have set.
The file contains one or more User-agent lines, each followed by one or more Disallow lines specifying any directories that particular user agent (spider) is not permitted to access. Most commonly, the user agent specified will be *, which should be obeyed by all robots. In the following sample robots.txt file, all user agents are requested to stay out of the directories /cgi-bin/ and /datafiles/:
User-agent: *
Disallow: /cgi-bin/
Disallow: /datafiles/
In the following example, a particular user agent, Scooter, is requested to stay out of the directory /dont-index/:
User-agent: Scooter
Disallow: /dont-index/
robots.txt files can also contain comments. Anything following a hash character (#), until the end of that line, is a comment and will be ignored.
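For example, the sample file shown earlier could be annotated like this (the comment text is, of course, whatever you find useful):

```
# robots.txt for http://www.yoursite.com/
# Keep all robots out of the CGI and data directories.

User-agent: *
Disallow: /cgi-bin/
Disallow: /datafiles/
```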
Unfortunately, it is very easy to write a spider but considerably more difficult to write one that is well behaved. Consequently, many people write spiders that blatantly ignore your robots.txt file. Like many parts of Internet standards, it's just a suggestion, and particular implementations are free to ignore the suggestion.
Another method for requesting that spiders not enter your Web site is the ROBOTS meta tag. This HTML tag can appear in the <HEAD> section of any HTML page. The format of the tag is as follows:
<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="arguments">
<TITLE>Title here</TITLE>
</HEAD>
<BODY>
...
Possible arguments to the CONTENT attribute are as follows:
FOLLOW tells the spider that it's okay to follow any links that appear on this document.
INDEX tells the spider that it's okay to index this document. That is, the contents of this document can be cached or added to a search engine database.
NOFOLLOW tells the spider not to follow any links from this page.
NOINDEX tells the spider not to index this page.
Any of these arguments can be combined, separated by commas, as shown in the following example:
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
Two other directives specify groupings of the preceding arguments: ALL is equivalent to INDEX,FOLLOW, and NONE is equivalent to NOINDEX,NOFOLLOW.
As with the robots.txt file, obeying the rules specified in this tag is optional, but most major search engines honor any requests that you make with this meta tag.
If a spider appears to be running wild on your site, or visiting parts of your site that you really don't want it to, you should first attempt to contact the operator. You have the client's address in the log files; try e-mailing an administrator at the offending site to get hold of whoever is running the robot. Tell him what his robot is doing to your server, and ask him nicely to stop, or at least to obey your robots.txt file.
If you can't get any response or if the operator refuses to pay any attention to you, you can shut out the spider completely with some well-placed deny directives:
<Directory /usr/web/docs>
    Order allow,deny
    Allow from all
    Deny from unfriendly.spiderhost.com
</Directory>
If all else fails, have the spider's traffic blocked at your network's router or firewall. This has a disadvantage, however, in that it will also block traffic from any legitimate users coming from that system.
If you want to block a spider by something other than its address, such as its user agent, you can do this most effectively with the Deny from env= syntax, in conjunction with the SetEnvIf directive.
If, for example, you want to block traffic from a spider with the user agent EmailSiphon, you might use the following approach:
SetEnvIf User-Agent EmailSiphon Spammers
Order Allow,Deny
Allow from all
Deny from env=Spammers
The SetEnvIf directive sets an environment variable if a particular condition is satisfied. In this case, it will set the environment variable Spammers if the User-Agent string contains EmailSiphon.
The Deny from env= directive denies access if a particular environment variable is set. In this case, it looks for the existence of the environment variable Spammers, which is set when that particular user agent is seen, and denies access to that client.
Place these directives in your main server configuration file, in a <Directory> section that encompasses your entire site, and you will be able to keep this spider from looking at your site.
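Putting the pieces together, such a section might look like this (the directory path is only an example; use your own document root):

```
<Directory /usr/web/docs>
    SetEnvIf User-Agent EmailSiphon Spammers
    Order Allow,Deny
    Allow from all
    Deny from env=Spammers
</Directory>
```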
Perhaps you want to write your own special-purpose spider to do some work for you. The best advice I can give you is, simply, don't write your own spider. A plethora of spiders are already available online, most of which you can download for free. They do everything from checking links on your site, to getting the latest basketball scores, to validating your HTML syntax, to telling you that your favorite Web site has been updated. It is very unlikely that you have a need so specialized that someone has not already written a spider to do exactly what you need. You can find a spider to suit your needs at http://info.webcrawler.com/mak/projects/robots/active/html/index.html.
It can be difficult to write a spider that correctly implements the Robots Exclusion Protocol (that is, obeys all the suggestions given in the robots.txt file and any ROBOTS meta tags), so you might as well use one that someone else has already written.
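If you do decide to roll your own in Perl, note that the LWP bundle on CPAN includes LWP::RobotUA, a user-agent class that fetches a site's robots.txt for you, obeys it, and enforces a polite delay between requests. Here is a minimal sketch; the robot name, contact address, and URL are placeholders:

```perl
use LWP::RobotUA;
use HTTP::Request;

# The robot name and contact address are placeholders -- use your own.
my $ua = LWP::RobotUA->new('example-spider/0.1', 'webmaster@yoursite.com');
$ua->delay(1);    # wait at least one minute between requests to a host

# request() consults the site's robots.txt before retrieving the page.
my $req = HTTP::Request->new(GET => 'http://www.yoursite.com/');
my $res = $ua->request($req);
print $res->status_line, "\n";
```

When robots.txt forbids the page, request() returns a 403 response without contacting the page at all, so a spider built this way is well behaved by default.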
If you really feel that you must write your own spider, the best tool for the job is probably Perl. Perl's main strength is processing large quantities of text and pulling out the information that's of interest to you. Spiders spend most of their time going through Web pages (text files) and pulling out information, as well as links to other Web pages.
Several Perl modules exist specifically for processing HTML pages. These modules are available on CPAN (http://www.cpan.org/). Of particular interest are the LWP modules, in CPAN's modules/by-module/LWP/ directory, and the various HTML::* modules in CPAN's modules/by-module/HTML/ directory.
Listing 23.1 shows a very simple spider, implemented in Perl. This subroutine gets a Web page, does something with that page, and then recursively gets all the pages linked from the first page. The HTML::LinkExtor module extracts all links from an HTML document. HTML::FormatText formats an HTML page as text, so that you can get to the information without all the HTML markup. And LWP::Simple is a simple way to fetch documents from the network.
Listing 23.1. A Simple Spider
use HTML::LinkExtor;
use HTML::FormatText;
use HTML::Parse;    # provides parse_html()
use LWP::Simple;
use URI::URL;       # provides url()

my $p = HTML::LinkExtor->new();
my $Docs = {};

searchpage(0, 'http://www.yoursite.com/', $Docs);

sub searchpage {
    my ($cur_depth, $url, $Docs) = @_;
    my ($link, @links, $abs);

    print "Looking at $url, at depth $cur_depth\n";
    $Docs->{$url} = 1;    # Mark page as visited

    my $content = get($url);
    return unless defined $content;    # Skip pages that fail to fetch

    $p->parse($content);
    $content = HTML::FormatText->new->format(parse_html($content));
    DoSomethingWith($url, $content);

    @links = $p->links;
    for $link (@links) {
        next unless ($link->[0] eq 'a' && $link->[1] eq 'href');
        $abs = url($link->[2], $url)->abs;
        $abs =~ s/#.*$//;    # Strip fragment identifiers
        $abs =~ s!/$!!;      # Strip trailing slash

        # Skip some URLs
        next if $abs =~ /^mailto/i;                # E-mail link
        next if $abs =~ /\.(gz|zip|exe|tar|Z)$/;   # Binary files
        next if $abs =~ /\?\S+=\S+/;               # CGI program

        searchpage($cur_depth + 1, $abs, $Docs)
            unless ($Docs->{$abs});
    }
} # End sub searchpage
The function call in the middle, DoSomethingWith($url, $content), is, of course, where you would fill in whatever it is you wanted to do with the content you were collecting from the page.
The section labeled Skip some URLs contains regular expressions that match certain patterns in order to skip files that it would be a particularly bad idea to spider. The first regular expression matches mailto links and skips them, so that your spider does not start sending e-mail. The second regex skips any files that are likely to be binary. You might add any number of additional patterns here, such as pdf or doc. The third pattern skips URLs that have arguments on the end in GET-form syntax, indicating that they are CGI links. This is not perfect, but it eliminates some links to CGI programs that could trap the spider in an infinite URL space.
Be careful when using this code, because it can put a heavy load on a server very quickly. It doesn't follow the standard for robot exclusion, as discussed earlier, and it continues to fetch pages forever because the recursion has no exit condition. (A good approach might be to return from the subroutine as soon as $cur_depth reaches a certain value.) Test it on your own server, not mine.
Spiders are very useful tools for doing tedious work that we don't want to do manually. If carelessly written or used, however, they can wreak havoc on your Web server. This chapter focused on the various uses for spiders, as well as the ways they can be misused. You learned how to block them from your site and even how to write your own spider.