The Login form

The first form we'll automate is the Login form, which is available at http://example.webscraping.com/user/login. To understand the form, we can use our browser development tools. With the full version of Firebug or Chrome Developer Tools, it is possible to simply submit the form and check what data was transmitted in the Network tab (as we did in Chapter 5, Dynamic Content). However, we can also see information about the form if we use the "Inspect Element" feature:

The important parts regarding how to send the form are the action, enctype, and method attributes of the form tag, and the two input fields (in the preceding image we have expanded the "password" field). The action attribute sets the HTTP location where the form data will be submitted, in this case, #, which represents the current URL. The enctype attribute (or encoding type) sets the encoding used for the submitted data, in this case, application/x-www-form-urlencoded. The method attribute is set to post, so the form data is submitted to the server in the message body of a POST request. For each input tag, the important attribute is name, which sets the name of the field when the POST data is submitted to the server.
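
To make these attributes concrete, here is a minimal sketch that reads them from a form using only the standard library's html.parser module. The SAMPLE_HTML markup and the FormInspector class are hypothetical; they mimic the structure described above rather than reproduce the real page, whose HTML may differ in detail.

```python
from html.parser import HTMLParser

# Hypothetical markup that mimics the structure described above; the real
# page's HTML may differ in detail.
SAMPLE_HTML = """
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
  <input name="email" type="text" value="" />
  <input name="password" type="password" value="" />
  <input type="submit" value="Log In" />
</form>
"""

class FormInspector(HTMLParser):
    """Record the attributes of the first form tag and the names of its inputs."""

    def __init__(self):
        super().__init__()
        self.form_attrs = {}
        self.input_names = []
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self.form_attrs:
            self.form_attrs = attrs
            self._in_form = True
        elif tag == 'input' and self._in_form and attrs.get('name'):
            # Only named inputs are sent to the server
            self.input_names.append(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False

inspector = FormInspector()
inspector.feed(SAMPLE_HTML)
print(inspector.form_attrs)   # the action, enctype, and method attributes
print(inspector.input_names)  # the field names that will appear in the POST data
```

The submit button has no name attribute in this sample, so it does not contribute a field to the submitted data.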

Form encoding
When a form uses the POST method, there are two useful choices for how the form data is encoded before being submitted to the server. The default is application/x-www-form-urlencoded, which percent-encodes reserved and non-ASCII characters as hexadecimal byte values (for example, @ becomes %40) and replaces spaces with +. However, this is inefficient for forms which contain a large amount of non-alphanumeric data, such as a binary file upload, so multipart/form-data encoding was defined. Here, the input is not percent-encoded but sent as multiple parts using the MIME format, which is the same standard used for e-mail.
The official details of this standard can be viewed at http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding.
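
As a quick illustration of the default encoding, the following sketch passes two fields through urlencode from the standard library. The field values are hypothetical and were chosen only because they contain characters that need escaping.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical values chosen to include characters that need escaping.
fields = {'email': 'user@example.com', 'password': 'p&ss word'}

encoded = urlencode(fields)
print(encoded)  # email=user%40example.com&password=p%26ss+word

# Decoding recovers the original values (parse_qs returns a list per field).
decoded = parse_qs(encoded)
print(decoded)
```

The @ and & characters become %40 and %26, and the space becomes +, so the separators between fields remain unambiguous.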

When regular users open this web page in their browser, they will enter their e-mail and password, and click on the Login button to submit their details to the server. Then, if the login process on the server is successful, they will be redirected to the home page; otherwise, they will return to the Login page to try again. Here is an initial attempt to automate this process:

>>> from urllib.parse import urlencode
>>> from urllib.request import Request, urlopen
>>> LOGIN_URL = 'http://example.webscraping.com/user/login'
>>> LOGIN_EMAIL = '[email protected]'
>>> LOGIN_PASSWORD = 'example'
>>> data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
>>> encoded_data = urlencode(data)
>>> request = Request(LOGIN_URL, encoded_data.encode('utf-8'))
>>> response = urlopen(request)
>>> print(response.geturl())
http://example.webscraping.com/user/login

This example sets the e-mail and password fields, encodes them with urlencode, and submits them to the server. When the final print statement is executed, it outputs the URL of the Login page, which means the login process has failed. Note that urlencode returns a string, so we must also encode it as bytes before urllib will accept it as the request body.

We can write the same process using requests in fewer lines:

>>> import requests
>>> response = requests.post(LOGIN_URL, data)
>>> print(response.url)
http://example.webscraping.com/user/login

The requests library allows us to explicitly post data, and will do the encoding internally. Unfortunately, this code still fails to log in.

The Login form is particularly strict and requires some additional fields to be submitted along with the e-mail and password. These additional fields can be found at the bottom of the previous screenshot, but are set to hidden and so they aren't displayed in the browser. To access these hidden fields, here is a function using the lxml library covered in Chapter 2, Scraping the Data, to extract all the input tag details in a form:

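from lxml.html import fromstring

def parse_form(html):
    """Return a dict of name -> value for every named input tag in a form."""
    tree = fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        # Skip inputs without a name, such as some submit buttons
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data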

The function in the preceding code uses lxml CSS selectors to iterate over all input tags in a form and return their name and value attributes in a dictionary. Here is the result when the code is run on the Login page:

>>> from pprint import pprint
>>> html = requests.get(LOGIN_URL)
>>> form = parse_form(html.content)
>>> pprint(form)
{'_formkey': 'a3cf2b3b-4f24-4236-a9f1-8a51159dda6d',
'_formname': 'login',
'_next': '/',
'email': '',
'password': '',
'remember_me': 'on'}

The _formkey attribute is the crucial piece; it contains a unique ID used by the server to prevent multiple form submissions. Each time the web page is loaded, a different ID is used, and the server can tell whether a form with a given ID has already been submitted. Here is an updated version of the login process which submits _formkey and other hidden values:

>>> html = requests.get(LOGIN_URL)
>>> data = parse_form(html.content)
>>> data['email'] = LOGIN_EMAIL
>>> data['password'] = LOGIN_PASSWORD
>>> response = requests.post(LOGIN_URL, data)
>>> response.url
'http://example.webscraping.com/user/login'

Unfortunately, this version doesn't work either: the login URL was returned again. We are missing another essential component: browser cookies. When a regular user loads the Login form, this _formkey value is stored in a cookie, which is compared to the _formkey value in the submitted Login form data. We can take a look at the cookies and their values via our response object:

>>> response.cookies.keys()
['session_data_places', 'session_id_places']
>>> response.cookies.values()
['"8bfbd84231e6d4dfe98fd4fa2b139e7f:N-almnUQ0oZtHRItjUOncTrmC30PeJpDgmAqXZEwLtR1RvKyFWBMeDnYQAIbWhKmnqVp-deo5Xbh41g87MgYB-oOpLysB8zyQci2FhhgU-YFA77ZbT0hD3o0NQ7aN_BaFVrHS4DYSh297eTYHIhNagDjFRS4Nny_8KaAFdcOV3a3jw_pVnpOg2Q95n2VvVqd1gug5pmjBjCNofpAGver3buIMxKsDV4y3TiFO97t2bSFKgghayz2z9jn_iOox2yn8Ol5nBw7mhVEndlx62jrVCAVWJBMLjamuDG01XFNFgMwwZBkLvYaZGMRbrls_cQh"',
'True']
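
To see why both the _formkey and the cookies matter, here is a toy model of the kind of check described above. It is only a sketch under stated assumptions: ToyLoginServer and its methods are hypothetical, and it keeps the pending keys on the server and uses the cookie purely as a session identifier, which simplifies what the real site does.

```python
import uuid

class ToyLoginServer:
    """Illustrative stand-in for the server-side check; not the site's real code."""

    def __init__(self):
        # Maps a session ID (sent to the client as a cookie) to the form
        # keys issued for that session and not yet submitted.
        self._sessions = {}

    def load_form(self, session_id=None):
        """Simulate a GET of the login page: return (session cookie, form key)."""
        if session_id not in self._sessions:
            session_id = str(uuid.uuid4())
            self._sessions[session_id] = set()
        form_key = str(uuid.uuid4())
        self._sessions[session_id].add(form_key)
        return session_id, form_key

    def submit(self, form_key, session_id=None):
        """Simulate a POST: accept only a pending key from the same session."""
        pending = self._sessions.get(session_id, set())
        if form_key in pending:
            pending.remove(form_key)  # each key is valid for one submission
            return True
        return False

server = ToyLoginServer()
cookie, key = server.load_form()
print(server.submit(key))                     # False: form key sent without the cookie
print(server.submit(key, session_id=cookie))  # True: key and cookie match
print(server.submit(key, session_id=cookie))  # False: the key was already used
```

The first call mirrors our failed attempt: the form key is valid, but without the matching cookie the server cannot tie it to a session.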

You can also see in your Python interpreter that response.cookies is a special type of object called a cookie jar. This object can also be passed to new requests. Let's retry our submission with cookies:

>>> second_response = requests.post(LOGIN_URL, data, cookies=html.cookies)
>>> second_response.url
'http://example.webscraping.com/'

What are cookies?
Cookies are small amounts of data sent by a website in the HTTP response headers, which look like this: Set-Cookie: session_id=example;. The web browser will store them, and then include them in the headers of subsequent requests to that website. This allows a website to identify and track users.
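
As a small illustration of the header format, the following sketch uses the standard library's http.cookies module to parse the example header value and rebuild the Cookie header a browser would send back. The Path attribute is an assumption added for illustration.

```python
from http.cookies import SimpleCookie

# Header value taken from the example above, with a hypothetical Path attribute.
cookie = SimpleCookie()
cookie.load('session_id=example; Path=/')

morsel = cookie['session_id']
print(morsel.value)    # example
print(morsel['path'])  # /

# What a client would send back on later requests to the same site.
request_header = '; '.join(f'{name}={m.value}' for name, m in cookie.items())
print(request_header)  # session_id=example
```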

Success! The submitted form values have been accepted and the response URL is the home page. Note that we had to send the cookies from our initial request (stored in the html variable), because these are the cookies that match the _formkey in our form data. This snippet and the other login examples in this chapter are available for download at https://github.com/kjam/wswp/tree/master/code/chp6.
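
For readers who prefer to stay with urllib, the first attempt in this section could be completed along the same lines. The following is a sketch using only the standard library, not the code from the repository above: InputCollector, parse_inputs, and login are hypothetical names, the page is assumed to be UTF-8 encoded, and the function has not been verified against the live site.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener

LOGIN_URL = 'http://example.webscraping.com/user/login'

class InputCollector(HTMLParser):
    """Collect the name and value of every named input tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name'):
            # Use '' for a missing value so urlencode never sends 'None'
            self.fields[attrs['name']] = attrs.get('value') or ''

def parse_inputs(html):
    """Standard library counterpart to parse_form; collects all inputs on the page."""
    collector = InputCollector()
    collector.feed(html)
    return collector.fields

def login(email, password, login_url=LOGIN_URL):
    """Return (opener, final URL) after attempting to log in."""
    # The opener stores cookies from the GET and resends them with the POST.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    html = opener.open(login_url).read().decode('utf-8')  # assumes UTF-8
    data = parse_inputs(html)
    data['email'] = email
    data['password'] = password
    response = opener.open(login_url, urlencode(data).encode('utf-8'))
    return opener, response.geturl()
```

Calling login(LOGIN_EMAIL, LOGIN_PASSWORD) returns the opener along with the final URL; as before, a final URL other than the Login page would indicate success, and the returned opener can be reused for further requests because its cookie jar holds the session cookies.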
