Exploring requests-cache

Occasionally, you might want to cache requests made by a library that uses requests internally, or you may simply not want to manage the cache classes and handling yourself. If this is the case, requests-cache (https://github.com/reclosedev/requests-cache) is a great library that implements a few different backend options for creating a cache for the requests library. When using requests-cache, all GET requests made via the requests library will first check the cache, and only request the page if it's not found there. 

requests-cache supports several backends including Redis, MongoDB (a NoSQL database), SQLite (a lightweight relational database), and memory (which is not persistent, and therefore not recommended). Since we already have Redis set up, we can use it as our backend. To get started, we first need to install the library:

pip install requests-cache
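
Because we have chosen the Redis backend, requests-cache also needs a Python client for Redis to talk to the server. If the redis library isn't already installed from our earlier Redis cache work, it can be added the same way:

pip install redis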

Now we can set up and test our cache using a few commands in IPython:

In [1]: import requests_cache

In [2]: import requests

In [3]: requests_cache.install_cache(backend='redis')

In [4]: requests_cache.clear()

In [5]: url = 'http://example.webscraping.com/view/United-Kingdom-239'

In [6]: resp = requests.get(url)

In [7]: resp.from_cache
Out[7]: False

In [8]: resp = requests.get(url)

In [9]: resp.from_cache
Out[9]: True

If we were to use this instead of our own cache class, we would only need to instantiate the cache using the install_cache command, and then every request (provided we are using the requests library) would be stored in our Redis backend. We can also set an expiry time using a few simple commands:

from datetime import timedelta
requests_cache.install_cache(backend='redis', expire_after=timedelta(days=30))
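
If you prefer not to patch requests globally, requests-cache also provides a CachedSession class that scopes the cache to a single session object. The following is a minimal sketch; the cache_name value and the 30-day expiry are just illustrative choices:

from datetime import timedelta
import requests_cache

# Only requests made through this session object are cached
session = requests_cache.CachedSession(
    cache_name='wswp_cache', backend='redis',
    expire_after=timedelta(days=30))

resp = session.get('http://example.webscraping.com/view/United-Kingdom-239')
print(resp.from_cache)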

To test the speed of using requests-cache compared to our own implementation, we have built a new downloader and link crawler to use. This downloader also implements the suggested requests hook to allow for throttling, as documented in the requests-cache User Guide: https://requests-cache.readthedocs.io/en/latest/user_guide.html.
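
The essence of that hook is small enough to sketch here. The following is a minimal version of the pattern shown in the requests-cache documentation, which sleeps only when a response was actually fetched over the network; the 0.5 second delay is just an example value:

import time
import requests_cache

def make_throttle_hook(timeout=1.0):
    """Return a response hook that sleeps for `timeout` seconds
    whenever the response did not come from the cache."""
    def hook(response, *args, **kwargs):
        if not getattr(response, 'from_cache', False):
            time.sleep(timeout)
        return response
    return hook

session = requests_cache.CachedSession(backend='redis')
session.hooks = {'response': make_throttle_hook(0.5)}
# The first request hits the network (and sleeps); the repeat request
# is served from the cache with no delay.
session.get('http://example.webscraping.com/')
session.get('http://example.webscraping.com/')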

To see the full code, check out the new downloader (https://github.com/kjam/wswp/blob/master/code/chp3/downloader_requests_cache.py) and link crawler (https://github.com/kjam/wswp/blob/master/code/chp3/requests_cache_link_crawler.py). We can test them using IPython to compare the performance:

In [1]: from chp3.requests_cache_link_crawler import link_crawler
...
In [3]: %time link_crawler('http://example.webscraping.com/', '/(index|view)')
Returning from cache: http://example.webscraping.com/
Returning from cache: http://example.webscraping.com/index/1
Returning from cache: http://example.webscraping.com/index/2
...
Returning from cache: http://example.webscraping.com/view/Afghanistan-1
CPU times: user 116 ms, sys: 12 ms, total: 128 ms
Wall time: 359 ms

We see the requests-cache solution is slightly less performant than our own Redis solution, but it also took fewer lines of code and was still quite fast (and still much faster than our DiskCache solution). Especially if you are using another library that manages its requests internally, requests-cache is a great tool to have.
