Scraping results

Now that we have complete implementations for each scraper, we will test their relative performance with this snippet. The imports in the code expect your directory structure to be similar to the book's repository, so please adjust as necessary:

import time
import re
from chp2.all_scrapers import re_scraper, bs_scraper, \
    lxml_scraper, lxml_xpath_scraper
from chp1.advanced_link_crawler import download

NUM_ITERATIONS = 1000  # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')

scrapers = [
    ('Regular expressions', re_scraper),
    ('BeautifulSoup', bs_scraper),
    ('Lxml', lxml_scraper),
    ('Xpath', lxml_xpath_scraper)]

for name, scraper in scrapers:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            # clear the cached patterns so every run recompiles them
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert result['area'] == '244,820 square kilometres'
    # record end time of scrape and output the total
    end = time.time()
    print('%s: %.2f seconds' % (name, end - start))

This example runs each scraper 1,000 times, checks that the scraped results are as expected, and then prints the total time taken. The download function used here is the one defined in the preceding chapter. Note the call to re.purge(): by default, the regular expression module caches compiled patterns, and this cache needs to be cleared to make a fair comparison with the other scraping approaches.
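To see what this cache does, here is a small, self-contained sketch (the pattern and sample string are illustrative, not taken from the book's scrapers) that times the same search with and without purging the cache between calls:

import re
import time

sample = '<td class="w2p_fw">244,820 square kilometres</td>' * 100
pattern = r'<td class="w2p_fw">(.*?)</td>'

# with the cache: the pattern is compiled once and then reused
start = time.time()
for _ in range(1000):
    re.findall(pattern, sample)
print('cached: %.2f seconds' % (time.time() - start))

# without the cache: purging forces a recompile on every iteration
start = time.time()
for _ in range(1000):
    re.purge()
    re.findall(pattern, sample)
print('purged: %.2f seconds' % (time.time() - start))

The purged loop should run noticeably slower; that difference is the compilation cost the benchmark charges to the regular expression scraper on every iteration.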

Here are the results from running this script on my computer:

$ python chp2/test_scrapers.py
Regular expressions: 1.80 seconds
BeautifulSoup: 14.05 seconds
Lxml: 3.08 seconds
Xpath: 1.07 seconds

The results on your computer will most likely differ because of hardware differences; however, the relative performance of each approach should be similar. The results show that Beautiful Soup is roughly four to thirteen times slower than the other approaches when used to scrape our example web page. This result could be anticipated, because lxml and the regular expression module were written in C, while Beautiful Soup is pure Python. An interesting observation is that lxml performed comparatively well against regular expressions, even though lxml has the additional overhead of parsing the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is amortized and lxml becomes even more competitive. As the XPath results show, lxml is able to compete directly with regular expressions. It really is an amazing module!
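One way to exploit that amortization in your own code is to parse the document once and reuse the resulting tree for every query. Here is a minimal sketch, assuming lxml is installed; the row IDs follow the example site's markup, so adjust them for your own pages:

from lxml.html import fromstring

def scrape_many_fields(html):
    # parse once; building the tree is the expensive step
    tree = fromstring(html)
    fields = {}
    for field in ('area', 'population', 'capital'):
        # each XPath query reuses the already-parsed tree cheaply
        values = tree.xpath(
            '//tr[@id="places_%s__row"]/td[@class="w2p_fw"]/text()' % field)
        fields[field] = values[0] if values else None
    return fields

With a regular expression scraper, by contrast, every extra field means another full scan of the raw HTML string.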

Although we strongly encourage you to use lxml for parsing, the biggest performance bottleneck in web scraping is usually the network, not the parser. Later, we will discuss approaches to parallelize workflows, allowing you to increase the speed of your crawlers by making multiple requests in parallel.
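As a preview of the idea, here is a minimal sketch using the standard library's concurrent.futures; the URL list and worker count are illustrative, and a production crawler would add error handling and politeness delays:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # network latency dominates each request, so threads can overlap the waiting
    with urllib.request.urlopen(url) as response:
        return response.read()

urls = [
    'http://example.webscraping.com/places/view/United-Kingdom-239',
    # further country pages would go here
]

with ThreadPoolExecutor(max_workers=4) as executor:
    pages = list(executor.map(fetch, urls))

Because each thread spends most of its time waiting on the network rather than parsing, even a handful of workers can multiply crawl throughput.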