Expiring stale data

Our current version of the disk cache saves a value to disk for a key and then returns it whenever that key is requested in the future. This behavior is not ideal when caching web pages, because online content changes and the data in our cache will eventually become out of date. In this section, we will add an expiration time to our cached data so the crawler knows when to download a fresh copy of a web page. Supporting this is straightforward: we store a timestamp alongside each cached page and check it when the page is read back.

Here is an implementation of this:

import json
import zlib
from datetime import datetime, timedelta

class DiskCache:
    def __init__(..., expires=timedelta(days=30)):
        ...
        self.expires = expires

    ## in __getitem__ for DiskCache class
        with open(path, mode) as fp:
            if self.compress:
                data = zlib.decompress(fp.read()).decode(self.encoding)
                data = json.loads(data)
            else:
                data = json.load(fp)
        # 'expires' is stored as an ISO 8601 string in UTC
        exp_date = data.get('expires')
        if exp_date and datetime.strptime(exp_date,
                                          '%Y-%m-%dT%H:%M:%S') <= datetime.utcnow():
            print('Cache expired!', exp_date)
            raise KeyError(url + ' has expired.')
        return data

    ## in __setitem__ for DiskCache class
        result['expires'] = (datetime.utcnow() + self.expires).isoformat(timespec='seconds')

In the constructor, the default expiration time is set to 30 days with a timedelta object. Then, the __setitem__ method saves the expiration timestamp under the 'expires' key in the result dictionary, and the __getitem__ method compares the current UTC time to that timestamp, raising a KeyError if it has passed. To test this expiration, we can try a short timeout of 5 seconds, as shown here:

>>> cache = DiskCache(expires=timedelta(seconds=5))
>>> url = 'http://example.webscraping.com'
>>> result = {'html': '...'}
>>> cache[url] = result
>>> cache[url]
{'html': '...'}
>>> import time; time.sleep(5)
>>> cache[url]
Traceback (most recent call last):
...
KeyError: 'http://example.webscraping.com has expired.'

As expected, the cached result is initially available. After sleeping for five seconds, requesting the same key raises a KeyError to show that this cached download has expired.
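The expiration check can also be tried in isolation, without a DiskCache instance or a call to time.sleep. The following is only a minimal sketch of the same idea, not the DiskCache code itself: the helper names stamp_expiry and is_expired and the now parameter are hypothetical and introduced here for illustration. It also assumes timezone-aware timestamps via datetime.now(timezone.utc), since datetime.utcnow() is deprecated as of Python 3.12.

```python
from datetime import datetime, timedelta, timezone


def stamp_expiry(record, ttl, now=None):
    """Return a copy of record with an ISO 8601 'expires' timestamp added.

    Hypothetical helper: mirrors what __setitem__ does in the text,
    but uses a timezone-aware UTC time instead of datetime.utcnow().
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamped = dict(record)
    stamped['expires'] = (now + ttl).isoformat(timespec='seconds')
    return stamped


def is_expired(record, now=None):
    """Return True if the record has an 'expires' timestamp that has passed.

    Hypothetical helper: mirrors the comparison made in __getitem__.
    Records without an 'expires' key are treated as never expiring.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    exp_date = record.get('expires')
    if not exp_date:
        return False
    return datetime.fromisoformat(exp_date) <= now


# Passing a fixed 'now' makes the behavior repeatable without sleeping.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = stamp_expiry({'html': '...'}, timedelta(seconds=5), now=start)
fresh = is_expired(record, now=start)
stale = is_expired(record, now=start + timedelta(seconds=5))
```

Here fresh is False and stale is True: the record is valid at the moment it is stamped and counts as expired once the five-second lifetime has elapsed. Accepting the current time as an argument is a design choice that keeps the check easy to test; a cache class could simply call these helpers with the default.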
