Using the requests library

Although we have built a fairly advanced parser using only urllib, the majority of scrapers written in Python today utilize the requests library to manage complex HTTP requests. What started as a small library to help wrap urllib features in something "human-readable" is now a very large project with hundreds of contributors. Some of the features available include built-in handling of encoding, important updates to SSL and security, as well as easy handling of POST requests, JSON, cookies, and proxies.

Throughout most of this book, we will utilize the requests library for its simplicity and ease of use, and because it has become the de facto standard for most web scraping.

To install requests, simply use pip:

pip install requests

For an in-depth overview of all features, you should read the documentation at http://python-requests.org or browse the source code at https://github.com/kennethreitz/requests.

To compare differences using the two libraries, I've also built the advanced link crawler so that it can use requests. You can see the code at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler_using_requests.py. The main download function shows the key differences. The requests version is as follows:

import requests

def download(url, user_agent='wswp', num_retries=2, proxies=None):
    """ Download a given URL and return the page HTML, or None on error. """
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1, proxies)
    except requests.exceptions.RequestException as e:
        # covers connection errors, invalid URLs, timeouts, and so on
        print('Download error:', e)
        html = None
    return html
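
If you want to try the function out, a call might look like the following minimal sketch; the URL is a placeholder and the proxy addresses (taken from the example dictionary discussed below) can be omitted entirely:

proxies = {'http': 'http://myproxy.net:1234',
           'https': 'https://myproxy.net:1234'}  # placeholder proxy addresses
html = download('http://example.com', proxies=proxies)
if html is not None:
    print(html[:100])  # first 100 characters of the downloaded page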

One notable difference is the convenience of having status_code available as an attribute on every response. Additionally, we no longer need to test for character encoding, as the text attribute of our Response object handles that automatically. In the rare case of a non-resolvable URL or a timeout, everything is handled by RequestException, which makes for a simple except clause. Proxy handling is also taken care of by simply passing a dictionary of proxies (for example, {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}).
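
To illustrate these points directly with requests (a minimal sketch; the URL is a placeholder and the timeout value is an assumption, not part of the original example):

import requests

try:
    resp = requests.get('http://example.com', timeout=10)
    print(resp.status_code)   # e.g. 200
    print(resp.encoding)      # requests detects the character encoding
    print(resp.text[:60])     # text decoded using that encoding
except requests.exceptions.RequestException as e:
    # raised for connection errors, non-resolvable URLs, timeouts, etc.
    print('Request failed:', e)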

We will continue to compare and use both libraries, so that you are familiar with them depending on your needs and use case. I strongly recommend using requests whenever you are handling more complex websites, or need to use human-like browsing features such as cookies or sessions. We will talk more about these methods in Chapter 6, Interacting with Forms.
