Disk Cache

To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists limitations for some popular filesystems:

Operating system    Filesystem    Invalid filename characters        Maximum filename length
Linux               Ext3/Ext4     / and \0                           255 bytes
OS X                HFS Plus      : and \0                           255 UTF-16 code units
Windows             NTFS          \, /, ?, :, *, ", >, <, and |      255 characters

To keep our file path safe across these filesystems, we will restrict it to numbers, letters, and basic punctuation, and replace all other characters with an underscore, as shown in the following code:

>>> import re 
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', url)
'http_//example.webscraping.com/default/view/Australia-1'

Additionally, the filename and the parent directories need to be restricted to 255 characters (as shown in the following code) to meet the length limitations described in the preceding table:

>>> filename = re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', url)
>>> filename = '/'.join(segment[:255] for segment in filename.split('/'))
>>> print(filename)
http_//example.webscraping.com/default/view/Australia-1

Here, no segment of our URL is longer than 255 characters, so the file path is unchanged. There is also an edge case worth considering: when the URL path ends with a slash (/), the empty string after that slash would be an invalid filename. However, removing the slash and using the preceding segment as the filename would prevent other URLs from being saved. Consider the following URLs:

  • http://example.webscraping.com/index/
  • http://example.webscraping.com/index/1

If you need to save both of these, index needs to be a directory so the child page can be saved with the filename 1. The solution our disk cache will use is to append index.html to the filename whenever the URL path ends with a slash. The same applies when the URL path is empty. To parse the URL, we will use the urlsplit function, which splits a URL into its components:

>>> from urllib.parse import urlsplit 
>>> components = urlsplit('http://example.webscraping.com/index/')
>>> print(components)
SplitResult(scheme='http', netloc='example.webscraping.com', path='/index/', query='', fragment='')
>>> print(components.path)
/index/

This function provides a convenient interface to parse and manipulate URLs. Here is an example using this module to append index.html for this edge case:

>>> path = components.path
>>> if not path:
...     path = '/index.html'
... elif path.endswith('/'):
...     path += 'index.html'
...
>>> filename = components.netloc + path + components.query
>>> filename
'example.webscraping.com/index/index.html'
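
Putting these steps together, here is a sketch of a single helper that maps a URL to a safe file path. The function name url_to_path and the max_len parameter are illustrative choices rather than part of the final cache class; note that it sanitizes the filename after dropping the scheme, which matches the final result shown in the preceding output:

import re
from urllib.parse import urlsplit

def url_to_path(url, max_len=255):
    """Map a URL to a safe, cross-platform relative file path."""
    components = urlsplit(url)
    path = components.path
    # Append index.html when the path is empty or ends with a slash
    if not path:
        path = '/index.html'
    elif path.endswith('/'):
        path += 'index.html'
    filename = components.netloc + path + components.query
    # Replace characters that are invalid on the filesystems listed earlier
    filename = re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', filename)
    # Restrict each path segment to the maximum filename length
    return '/'.join(segment[:max_len] for segment in filename.split('/'))

Applied to the two URLs from the earlier edge case, index remains a directory:

>>> url_to_path('http://example.webscraping.com/index/')
'example.webscraping.com/index/index.html'
>>> url_to_path('http://example.webscraping.com/index/1')
'example.webscraping.com/index/1'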

Depending on the site you are scraping, you may want to modify this edge-case handling. For example, some sites append a trailing slash to every URL because of how the web server expects URLs to be sent. For these sites, you might be safe simply stripping the trailing slash from every URL, as sketched below. Again, evaluate and update your web crawler's code to best fit the site(s) you intend to scrape.
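
A minimal sketch of that stripping approach (assuming the site treats URLs with and without the trailing slash as equivalent):

>>> url = 'http://example.webscraping.com/index/'
>>> url.rstrip('/')
'http://example.webscraping.com/index'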
