Adding cache support to the link crawler

To support caching, the download function developed in Chapter 1, Introduction to Web Scraping, needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function and only throttle when a download is made, and not when loading from a cache. To avoid the need to pass various parameters for every download, we will take this opportunity to refactor the download function into a class so parameters can be set in the constructor and reused numerous times. Here is the updated implementation to support this:

from chp1.throttle import Throttle
from random import choice
import requests


class Downloader:
    def __init__(self, delay=5, user_agent='wswp', proxies=None, cache={}):
        self.throttle = Throttle(delay)
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = None  # we will set this per request
        self.cache = cache

    def __call__(self, url, num_retries=2):
        self.num_retries = num_retries
        try:
            result = self.cache[url]
            print('Loaded from cache:', url)
        except KeyError:
            result = None
        if result and self.num_retries and 500 <= result['code'] < 600:
            # server error so ignore result from cache
            # and re-download
            result = None
        if result is None:
            # result was not loaded from cache
            # so still need to download
            self.throttle.wait(url)
            proxies = choice(self.proxies) if self.proxies else None
            headers = {'User-Agent': self.user_agent}
            result = self.download(url, headers, proxies)
            if self.cache:
                # save result to cache
                self.cache[url] = result
        return result['html']

    def download(self, url, headers, proxies):
        ...
        return {'html': html, 'code': resp.status_code}
The full source code for the Downloader class is available at https://github.com/kjam/wswp/blob/master/code/chp3/downloader.py.

The interesting part of the Downloader class used in the preceding code is in the __call__ special method, where the cache is checked before downloading. This method first checks whether the URL was previously stored in the cache (by default, the cache is a Python dictionary). If the URL is cached, it then checks whether a server error was encountered in the previous download. Finally, if no server error was encountered, the cached result can be used. If any of these checks fails, the URL needs to be downloaded as usual, and the result is added to the cache.
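For example, with a plain dictionary as the cache (a minimal sketch; the URL below is just a placeholder), a repeated call for the same URL is served from the cache instead of the network:

cache = {}
D = Downloader(delay=5, user_agent='wswp', cache=cache)
html = D('http://example.com')  # not cached yet: throttled, downloaded, then stored
html = D('http://example.com')  # prints 'Loaded from cache: http://example.com'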

The download method of this class is almost the same as the previous download function, except that it now returns the HTTP status code so error codes can be stored in the cache. Also, instead of calling itself and testing num_retries, it first decrements self.num_retries and then calls self.download recursively if any retries remain. If you just want a simple download without throttling or caching, this method can be used instead of __call__.
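The method body is elided in the listing above; the following is only a rough sketch of that logic under the assumptions just described (using requests, decrementing self.num_retries, and retrying on 5xx responses), not the exact code from the book's repository:

    def download(self, url, headers, proxies):
        print('Downloading:', url)
        try:
            resp = requests.get(url, headers=headers, proxies=proxies)
            html = resp.text
            if resp.status_code >= 400:
                print('Download error:', resp.text)
                html = None
                if self.num_retries and 500 <= resp.status_code < 600:
                    # server error: spend one retry and try again
                    self.num_retries -= 1
                    return self.download(url, headers, proxies)
        except requests.exceptions.RequestException as e:
            # network-level failure; treating it as a 500 here is an assumption
            print('Download error:', e)
            return {'html': None, 'code': 500}
        return {'html': html, 'code': resp.status_code}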

The cache class is used here by calling result = cache[url] to load from the cache and cache[url] = result to save to the cache, mirroring the convenient interface of Python's built-in dictionary type. To support this interface, our cache class will need to define the __getitem__() and __setitem__() special methods.
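As a minimal illustration of that interface (this is not the disk-based cache built later in the chapter; the class name and internal attribute are hypothetical), a dictionary-backed cache could be written as:

class SimpleCache:
    """Hypothetical in-memory cache with dictionary-style access."""
    def __init__(self):
        self._store = {}

    def __getitem__(self, url):
        # raise KeyError for uncached URLs, as Downloader.__call__ expects
        return self._store[url]

    def __setitem__(self, url, result):
        self._store[url] = result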

The link crawler also needs to be slightly updated to support caching by adding the cache parameter, removing the throttle, and replacing the download function with the new class, as shown in the following code:

def link_crawler(..., num_retries=2, cache={}):
    crawl_queue = [seed_url]
    seen = {seed_url: 0}
    rp = get_robots(seed_url)
    D = Downloader(delay=delay, user_agent=user_agent, proxies=proxies, cache=cache)

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                continue
            html = D(url, num_retries=num_retries)
            if not html:
                continue
            ...

You'll notice that num_retries is now passed with each call, which lets us apply the number of request retries on a per-URL basis. If we used the same counter without ever resetting self.num_retries, the retries would be used up as soon as one page returned enough 500 errors, leaving none for the pages that follow.
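Concretely, because __call__ resets self.num_retries at the start of every request, a failing URL cannot exhaust the retry budget of the URLs that follow it (the URLs below are placeholders):

html = D('http://example.com/flaky', num_retries=2)   # may spend both retries on 500 errors
html = D('http://example.com/stable', num_retries=2)  # still starts with a fresh budget of 2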

You can check the full code again at the book repository (https://github.com/kjam/wswp/blob/master/code/chp3/advanced_link_crawler.py). Now, our web scraping infrastructure is prepared, and we can start building the actual cache.
