Testing the spider

To run a spider from the command line, the crawl command is used along with the name of the spider:

    $ scrapy crawl country -s LOG_LEVEL=ERROR
    $

The script runs to completion with no output. Take note of the -s LOG_LEVEL=ERROR flag; this is a Scrapy setting and is equivalent to defining LOG_LEVEL = 'ERROR' in the settings.py file. By default, Scrapy outputs all log messages to the terminal, so here the log level was raised to isolate error messages. No output means the spider completed without error -- great!
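
If you prefer to make this the project default rather than passing it on every run, the same setting can live in settings.py. A minimal sketch, assuming the example project generated earlier in this chapter:

    # example/settings.py
    LOG_LEVEL = 'ERROR'  # equivalent to passing -s LOG_LEVEL=ERROR on the command line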

To actually scrape some content from the pages, we need to add a few lines to the spider file. To start building and extracting our items, we first have to use our CountryItem class and update the crawler rules. Here is an updated version of the spider:

    from example.items import CountryItem
    ...

    rules = (
        Rule(LinkExtractor(allow=r'/index/'), follow=True),
        Rule(LinkExtractor(allow=r'/view/'), callback='parse_item')
    )

    def parse_item(self, response):
        i = CountryItem()
        ...

To extract structured data, we use the CountryItem class we created earlier. In this added code, we import the class and instantiate an object as i (our item) inside the parse_item method, which Scrapy calls with the downloaded response.

Additionally, we need to add rules so our spider can find data and extract it. The default rule searched for the URL pattern r'/Items', which is not matched anywhere on the example site. Instead, we can create two new rules based on what we already know about the site. The first rule crawls the index pages and follows their links, and the second rule crawls the country pages and passes the downloaded response to the callback function for scraping.
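
Note that parse_item is still a stub, which is why the crawl output below reports zero scraped items. For reference, here is a minimal sketch of what it might eventually look like; the field names and CSS selectors are assumptions based on the CountryItem fields and the markup of the example site's country pages, so verify them (for example, with scrapy shell) before relying on them:

    def parse_item(self, response):
        item = CountryItem()
        # Assumed markup: each field sits in a <tr> with an id such as
        # places_country__row, and the value is in a td with class w2p_fw.
        item['name'] = response.css(
            'tr#places_country__row td.w2p_fw::text').extract()
        item['population'] = response.css(
            'tr#places_population__row td.w2p_fw::text').extract()
        return item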

Let's see what happens when this improved spider is run with the log level set to DEBUG to show more crawling messages:

    $ scrapy crawl country -s LOG_LEVEL=DEBUG
    ...
    2017-03-24 11:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Belize-23> (referer: http://example.webscraping.com/index/2)
    2017-03-24 11:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Belgium-22> (referer: http://example.webscraping.com/index/2)
    2017-03-24 11:52:53 [scrapy.extensions.logstats] INFO: Crawled 40 pages (at 10 pages/min), scraped 0 items (at 0 items/min)
    2017-03-24 11:52:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/user/login?_next=%2Findex%2F0> (referer: http://example.webscraping.com/index/0)
    2017-03-24 11:53:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/user/register?_next=%2Findex%2F0> (referer: http://example.webscraping.com/index/0)
    ...

This log output shows that the index and country pages are being crawled and duplicate links are filtered, which is handy. We can also see the installed middlewares and other important information logged when the crawler first starts.

However, we also notice that the spider is wasting resources by crawling the login and register forms linked from each page, because they match the rules' regular expressions. The login URL in the preceding output ends with _next=%2Findex%2F0, which is the URL-encoded equivalent of _next=/index/0 and tells the site where to redirect after login. To prevent these URLs from being crawled, we can use the deny parameter of the rules, which also expects a regular expression and prevents every matching URL from being crawled.

Here is an updated version of the rules that prevents crawling the user login and registration forms by denying any URL containing /user/:

    rules = (
        Rule(LinkExtractor(allow=r'/index/', deny=r'/user/'), follow=True),
        Rule(LinkExtractor(allow=r'/view/', deny=r'/user/'), callback='parse_item')
    )

Further documentation about how to use the LinkExtractor class is available at http://doc.scrapy.org/en/latest/topics/link-extractors.html.

To stop the current crawl and restart it with the new code, send a quit signal with Ctrl + C. You should then see a message similar to this one:

    2017-03-24 11:56:03 [scrapy.crawler] INFO: Received SIG_SETMASK, shutting down gracefully. Send again to force

It will finish queued requests and then stop. You'll see some extra statistics and debugging at the end, which we will cover later in this section.

In addition to adding deny rules to the crawler, you can use the process_links argument of the Rule object. This allows you to define a function that iterates through the extracted links and modifies them (such as removing or adding parts of query strings), as in the sketch below. More information about crawling rules is available in the documentation: https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
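
As a rough illustration (not from the original example; the CountrySpider class name and the helper method below are assumed for the sake of the sketch), here is one way process_links could strip query strings from every extracted link before it is crawled:

    from urllib.parse import urlsplit, urlunsplit

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class CountrySpider(CrawlSpider):
        ...
        rules = (
            Rule(LinkExtractor(allow=r'/index/', deny=r'/user/'),
                 follow=True, process_links='strip_query_strings'),
            Rule(LinkExtractor(allow=r'/view/', deny=r'/user/'),
                 callback='parse_item', process_links='strip_query_strings'),
        )

        def strip_query_strings(self, links):
            # Remove the query string from each extracted link, so links that
            # differ only by query string are requested (and deduplicated) as one URL.
            for link in links:
                scheme, netloc, path, _query, fragment = urlsplit(link.url)
                link.url = urlunsplit((scheme, netloc, path, '', fragment))
            return links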