Chapter 23. Web Spiders

 

I got remote control and a color T.V.

I don't change channels so they must change me.

 
--"Close to the Borderline," Billy Joel

When the Web was young—or at least when you were new to the Web—it was interesting to spend hours clicking links and looking at Web pages. Eventually, you got over that, and now you just want the information you want, when you want it, and you no longer want to do the work for yourself. That's where spiders come in.

Spiders are programs that walk the Web for you, following links and grabbing information. They're also known as robots and crawlers. You can find a list of many of the currently available and active spiders online at http://info.webcrawler.com/mak/projects/robots/active/html/index.html.

Spiders are very useful, but they can also cause a lot of problems. If you have a Web site on the Internet, you will find that a steady percentage of the visits to your site are from spiders. This is because most of the major search engines use spiders to index the Web, including your Web site, for inclusion in their database.

This chapter discusses what a spider is, how spiders can make your life easier, and how to protect your Web site against spiders that you don't want to let in. You'll also learn how to give spiders the right information about your site when they visit. Finally, you'll learn briefly about writing your own spider.

What Are Spiders?

The Web Robots FAQ defines a robot as “a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced.” (You can find the Web Robots FAQ at http://info.webcrawler.com/mak/projects/robots/faq.html.) What this means is that a spider starts with some page and downloads all the pages that page has links to. Then, for each of those pages, it downloads all the pages they are linked to, and so on, ad infinitum. This is done automatically by the spider program, which will presumably be collecting this information for some useful purpose.

Spiders might be collecting information for a search engine, collecting e-mail addresses for sending spam, or downloading pages for offline viewing.

Some examples of common types of robots are as follows:

  • Scooter is the robot responsible for the AltaVista search engine. Scooter fetches documents from the Web, which are then incorporated into AltaVista's database. You can search that database at http://www.altavista.com/. Most major search engines also use some type of spider to index the Web, and you will see many of them in your server logs.

  • EmailSiphon, and various other spiders with similar names, rove the Web, retrieving e-mail addresses from Web pages. The people who run EmailSiphon then sell those addresses to various low-lifes who then send unsolicited bulk e-mail (also known as spam) to those addresses. See the later section on excluding spiders from your site to learn how to deny access to these robots and protect your mailbox.

  • MOMspider is one that you can download and use on your own site to validate links and generate statistics. You can run it from your server or from your desktop. There are a large number of similar products for Web site developers to use on their own sites.

Spiders: The Good and the Bad

In general, spiders are good things. They can help you out in a number of ways, such as indexing your site, searching for broken links, and validating the HTML on your pages.

A common use for spiders is collecting documents from the Web for you, so that you can look at them at your leisure when you aren't online. This is called offline browsing or caching, among other things, and the products that do this are sometimes called personal agents or personal spiders. One such product, called AvantGo (http://www.avantgo.com), will even download Web content to your hand-held computer so that you can look at your favorite Web pages while on an airplane or bus.

Conversely, spiders also can cause a lot of problems on your Web site, because their traffic patterns are not the sort that you typically plan for.

Server Overloading

One potential problem is server overload. Whereas a human user is likely to wait at least a few seconds between downloading one page and the next, the spider can start on the next page immediately after receiving the first page. Also, it can fork multiple processes and download several pages at the same time. If your server isn't equipped to handle that many simultaneous connections, or if you don't have the bandwidth to handle the requests, this might cause visitors to have a long wait for their pages to load, or even cause the server to become overloaded.

Black Holes

Occasionally, poorly written spiders might get trapped in some infinite portion of your Web site, such as a CGI program that generates pages with links back to itself. The spider might spend hours or days chasing its tail, so to speak. This can cause your log files to grow at an alarming rate, skew any statistical information that you might be collecting, and lead to an overloaded server.

Recognizing Spiders in Your Log Files

Before you try to keep spiders out of your site, you might want to get a good idea of what spiders are visiting your site and what they're trying to do. You'll notice log entries from spiders in several ways:

  1. The first thing that will stand out will be the user agent (if you are logging the user agent in your log files). It won't look like an ordinary browser (because it's not) and will tend to have a name such as harvester, black widow, Arachnophilia, and the like. You can see a full listing of the various known spiders in the Web Robots FAQ, discussed earlier in this chapter. A short Perl sketch for tallying the user agents in your logs appears after this list.

    Of course, it's also important to understand that spiders, like any other Web client, are free to provide whatever client description they choose, so they could just as easily claim to be Netscape, IE, or “Bob's Handy-Dandy Browser”. There is no guarantee whatsoever that the User-Agent string is telling the truth.

  2. You might notice that a large number of pages are requested by the same client, often in quick succession.

  3. The address from which the client is connecting can tell you quite a lot. Connections from the various search engines are frequently spiders indexing your site. For example, a connection from lobo.yahoo.com is a good indication that your site is being indexed for the Yahoo! Internet directory.
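If you are logging user agents, a short Perl script can give you a quick summary of who has been visiting. The following sketch assumes the combined log format, in which the user agent is the last quoted field on each line; the log file path is only an example, so adjust it for your server.

# Count requests per user agent in a combined-format access log.
# The path to the log file is only an example; use your own.
my %agents;

open(LOG, "/usr/local/apache/logs/access_log") or die "Can't open log: $!";
while (<LOG>) {
    # In the combined log format, the user agent is the last quoted field
    $agents{$1}++ if /"([^"]*)"\s*$/;
}
close(LOG);

# Print the most frequently seen user agents first
foreach my $agent (sort { $agents{$b} <=> $agents{$a} } keys %agents) {
    print "$agents{$agent}\t$agent\n";
}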

Excluding Spiders from Your Server

You can keep spiders off your site—or at least off certain parts of your site—in several different ways. These methods usually rely on the cooperation of the spider itself. However, you can do a number of things at the server level to deny access.

As mentioned earlier, you will probably want to keep spiders out of your CGI directories. You also will want to keep them out of portions of your site that change with such regularity that indexing would be fruitless. And, of course, there might be parts of your site that you'd just rather not have indexed, for whatever reason.

Robot Exclusion with robots.txt

The Robots Exclusion Protocol, also known as A Standard for Robot Exclusion, is a document drafted in 1994 that outlined a method for telling robots what parts of your site you want them to stay out of. You can find the full text of this document on the WebCrawler Web site at http://info.webcrawler.com/mak/projects/robots/norobots.html.

To implement this exclusion on your Web site, you need to create a text file called robots.txt, and place it in your server's document root directory. When a spider visits your site, it is supposed to fetch this document before going any further, to find out what rules you have set.

The file contains one or more User-agent lines, each followed by one or more Disallow lines specifying any directories that particular user agent (spider) is not permitted to access. Most commonly, the user agent specified will be *, which should be obeyed by all robots. In the following sample robots.txt file, all user agents are requested to stay out of the directories /cgi-bin/ and /datafiles/:

User-agent: *
Disallow: /cgi-bin/
Disallow: /datafiles/

In the following example, a particular user agent, Scooter, is requested to stay out of the directory /dont-index/:

User-agent: Scooter
Disallow: /dont-index/

robots.txt files can also contain comments. Anything following a hash character (#), until the end of that line, is a comment and will be ignored.
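For example, a commented version of the earlier robots.txt file might look like this:

# Keep all robots out of the CGI and data directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /datafiles/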

Unfortunately, it is very easy to write a spider but considerably more difficult to write one that is well behaved. Consequently, many people write spiders that blatantly ignore your robots.txt file. Like many Internet standards, the Robots Exclusion Protocol is just a suggestion, and particular implementations are free to ignore it.

The ROBOTS Meta Tag

Another method for requesting that spiders not index a page, or not follow its links, is the ROBOTS meta tag. This HTML tag can appear in the <HEAD> section of any HTML page. The format of the tag is as follows:

<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="arguments">
<TITLE>Title here</TITLE>
</HEAD>
<BODY>
...

Possible arguments to the CONTENT attribute are as follows:

  • FOLLOW tells the spider that it's okay to follow any links that appear on this document.

  • INDEX tells the spider that it's okay to index this document. That is, the contents of this document can be cached or added to a search engine database.

  • NOFOLLOW tells the spider not to follow any links from this page.

  • NOINDEX tells the spider not to index this page.

Any of these arguments can be combined, separated by commas, as shown in the following example:

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

Two other directives also specify a grouping of the preceding arguments. ALL is equivalent to INDEX,FOLLOW, and NONE is equivalent to NOINDEX,NOFOLLOW.
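For example, to ask spiders to neither index a page nor follow its links, you could use the following:

<META NAME="ROBOTS" CONTENT="NONE">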

As with the robots.txt file, obeying the rules specified in this tag is optional. Most major search engines follow any requests that you make with this meta tag.

Contacting the Operator

If a spider appears to be running wild on your site or visiting parts of your site that you really don't want it to, you should first attempt to contact the operator. You have the client's address in the log files. Try to e-mail an administrator at the offending site to get hold of whoever is running the robot. Tell him what his robot is doing to your server and ask him nicely to stop, or at least to obey your robots.txt file.
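If all you have is the client's IP address, a reverse DNS lookup will often tell you which organization to contact. The following minimal sketch does such a lookup; the address shown is only a placeholder, so substitute one from your own logs.

use Socket;

# Look up the host name for an address seen in the logs.
# 192.0.2.25 is only a placeholder address.
my $address = "192.0.2.25";
my $name = gethostbyaddr(inet_aton($address), AF_INET);

if ($name) {
    print "$address resolves to $name\n";
} else {
    print "No reverse DNS entry for $address\n";
}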

Blocking a Spider by Address

If you can't get any response or if the operator refuses to pay any attention to you, you can shut out the spider completely with some well-placed deny directives:

<Directory /usr/web/docs>
Order allow,deny
Allow from all
Deny from unfriendly.spiderhost.com
</Directory>

If all else fails, have the spider's traffic blocked at your network's router or firewall. This has a disadvantage, however, in that it will also block traffic from any legitimate users coming from that system.

Blocking a Spider by Deny from Env

If you want to block a spider by something other than its address, such as its user agent, you can do this most effectively with the Deny from env= syntax, in conjunction with the SetEnvIf directive.

If, for example, you want to block traffic from a spider with the user agent EmailSiphon, you might use the following approach:

SetEnvIf User-Agent EmailSiphon Spammers
Order Allow,Deny
Allow from all
Deny from env=Spammers

The SetEnvIf directive sets an environment variable if a particular condition is satisfied. In this case, it sets the environment variable Spammers if the User-Agent header contains the string EmailSiphon.

The Deny from env= directive denies access if a particular environment variable is set. In this case, it looks for the environment variable Spammers, which is set only for that particular user agent, and so denies access to that user agent.

Place these directives in your main server configuration file, in a <Directory> section that encompasses your entire site, and you will be able to keep this spider from looking at your site.
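Putting the pieces together, the relevant part of the configuration might look something like the following, using the same document root as in the earlier example:

<Directory /usr/web/docs>
SetEnvIf User-Agent EmailSiphon Spammers
Order Allow,Deny
Allow from all
Deny from env=Spammers
</Directory>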

Writing Your Own Spider

Perhaps you want to write your own special-purpose spider to do some work for you. The best advice I can give you is, simply, don't write your own spider. A plethora of spiders are already available online, most of which you can download for free. They do everything from checking links on your site, to getting the latest basketball scores, to validating your HTML syntax, to telling you that your favorite Web site has been updated. It is very unlikely that you have a need so specialized that someone has not already written a spider to do exactly what you need. You can find a spider to suit your needs at http://info.webcrawler.com/mak/projects/robots/active/html/index.html.

It can be difficult to write a spider that correctly implements the Robots Exclusion Protocol (that is, obeys all the suggestions given in the robots.txt file and any ROBOTS meta tags), so you might as well use one that someone else has already written.

If you really feel that you must write your own spider, the best tool for the job is probably Perl. Perl's main strength is processing large quantities of text and pulling out the information that's of interest to you. Spiders spend most of their time going through Web pages (text files) and pulling out information, as well as links to other Web pages.

Several Perl modules are used specifically for processing HTML pages. These modules are available on CPAN (http://www.cpan.org/). Of particular interest would be the LWP modules, in CPAN's modules/by-module/LWP/ directory, and various HTML::* modules in CPAN's modules/by-module/HTML/ directory.
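One LWP module worth knowing about is LWP::RobotUA, which fetches and obeys robots.txt for you and pauses between requests to the same host. The following minimal sketch fetches a single page politely; the robot name, contact address, and URL are only placeholders, so substitute your own:

use LWP::RobotUA;

# LWP::RobotUA fetches and obeys robots.txt automatically, and waits
# between requests to the same server. The name, e-mail address, and
# URL below are placeholders.
my $ua = LWP::RobotUA->new('my-spider/1.0', 'webmaster@yoursite.com');
$ua->delay(1);    # Wait at least 1 minute between requests to a host

my $response = $ua->get('http://www.yoursite.com/');
if ($response->is_success) {
    print $response->content;
} else {
    print "Request failed or was disallowed: ", $response->status_line, "\n";
}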

Listing 23.1 shows a very simple spider, implemented in Perl. The searchpage() subroutine gets a Web page, does something with that page, and then recursively visits all the pages linked from it. The HTML::LinkExtor module extracts all the links from an HTML document. HTML::FormatText, working on the parse tree built by HTML::TreeBuilder, formats an HTML page as plain text so that you can get to the information without all the HTML markup. LWP::Simple is a simple way to fetch documents from the network, and URI::URL turns relative links into absolute URLs.

Listing 23.1. A Simple Spider

use HTML::LinkExtor;
use HTML::FormatText;
use HTML::TreeBuilder;
use LWP::Simple;
use URI::URL;

my $Docs = {};

searchpage(0, 'http://www.yoursite.com/', $Docs);

sub searchpage  {
     my ($cur_depth, $url, $Docs) = @_;
     my ($link, @links, $abs);

     print "Looking at $url, at depth $cur_depth\n";
     $Docs->{$url} = 1;               # Mark page as visited

     my $content = get($url);
     return unless defined $content;  # Skip pages that can't be fetched

     # Pull the links out of the page
     my $p = HTML::LinkExtor->new();
     $p->parse($content);
     $p->eof;
     @links = $p->links;

     # Reduce the page to plain text and hand it off for processing.
     # DoSomethingWith() is whatever you want to do with each page.
     $content = HTML::FormatText->new->format(
          HTML::TreeBuilder->new_from_content($content));
     DoSomethingWith($url, $content);

     for $link (@links)  {
          # Only follow <A HREF="..."> links
          next unless ($link->[0] eq 'a' && $link->[1] eq 'href');

          # Convert the link to an absolute URL, relative to this page
          $abs = url($link->[2], $url)->abs->as_string;
          $abs =~ s/#.*$//;           # Strip fragment identifiers
          $abs =~ s!/$!!;             # Strip a trailing slash

          # Skip some URLs
          next if $abs =~ /^mailto:/i;               # E-mail link
          next if $abs =~ /\.(gz|zip|exe|tar|Z)$/;   # Binary files
          next if $abs =~ /\?\S+=\S+/;               # CGI program

          searchpage($cur_depth + 1, $abs, $Docs)
               unless ($Docs->{$abs});
     }
}  # End sub searchpage

The function call in the middle—DoSomethingWith($url, $content)—is, of course, where you would fill in whatever it is you wanted to do with the content you were collecting from the page.

The section labeled Skip some URLs contains regular expressions that match certain patterns in order to skip files that would be a particularly bad idea to spider. The first regular expression matches mailto links and skips them so that your spider does not start sending e-mail. The second regex skips files that are likely to be binary files; you might add any number of additional patterns here, such as pdf or doc. The third pattern skips URLs that have a query string on the end (GET form syntax such as ?name=value), indicating that they are CGI links. This is not perfect, but it eliminates some links to CGI programs that could trap the spider in an infinite URL space.

Caution

Be careful when using this code, because it can put a heavy load on a server very quickly. It doesn't follow the standard for robot exclusion, as discussed earlier, and it keeps fetching pages for as long as it finds new links, because the recursion has no depth limit. (A good approach might be to return from the subroutine as soon as $cur_depth reaches a certain value.) Test it on your server, not mine.
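For example, one simple guard (the limit of 3 is arbitrary) would be to add a line like this at the top of searchpage():

# Stop recursing beyond an arbitrary maximum depth
return if $cur_depth > 3;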

Summary

Spiders are very useful tools for doing tedious work that we don't want to do manually. If carelessly written or used, however, they can wreak havoc on your Web server. This chapter focused on the various uses for spiders, as well as the ways they can be misused. You learned how to block them from your site and even how to write your own spider.
