Extending the login script to update content

Now that we can log in via a script, we can extend it by adding code to update the website's country data. The code used in this section is available at https://github.com/kjam/wswp/blob/master/code/chp6/edit.py and https://github.com/kjam/wswp/blob/master/code/chp6/login.py.

You may have already noticed an Edit link at the bottom of each country page.

When logged in, clicking this link leads to another page where each property of a country can be edited.

We will make a script to increase the population of a country by one person every time it's run. The first step is to rewrite our login function to utilize Session objects. This will make our code cleaner and allow us to remain logged into our current session. The new code is as follows:

def login(session=None):
    """ Login to example website.
        params:
            session: request lib session object or None
        returns tuple(response, session)
    """
    if session is None:
        html = requests.get(LOGIN_URL)
    else:
        html = session.get(LOGIN_URL)
    data = parse_form(html.content)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    if session is None:
        response = requests.post(LOGIN_URL, data, cookies=html.cookies)
    else:
        response = session.post(LOGIN_URL, data)
    assert 'login' not in response.url
    return response, session

Now our login function can work with or without sessions. By default, it doesn't use a session and expects the caller to pass the cookies along to stay logged in. This can be problematic for some forms, however, so the session functionality is useful when extending our login function. Next, we need to extract the current values of the country by reusing the parse_form() function:

>>> import requests
>>> from chp6.login import login, parse_form
>>> session = requests.Session()
>>> COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
>>> response, session = login(session=session)
>>> country_html = session.get(COUNTRY_URL)
>>> data = parse_form(country_html.content)
>>> data
{'_formkey': 'd9772d57-7bd7-4572-afbd-b1447bf3e5bd',
'_formname': 'places/2575175',
'area': '244820.00',
'capital': 'London',
'continent': 'EU',
'country': 'United Kingdom',
'currency_code': 'GBP',
'currency_name': 'Pound',
'id': '2575175',
'iso': 'GB',
'languages': 'en-GB,cy-GB,gd',
'neighbours': 'IE',
'phone': '44',
'population': '62348448',
'postal_code_format': '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
'postal_code_regex': '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$',
'tld': '.uk'}
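
The parse_form() function reused above is imported from the chapter's login module and is not reproduced in this section. For readers who don't have it at hand, the following is a minimal stand-in sketch, assuming the form's fields are plain <input> tags with name and value attributes. It uses only the standard library's html.parser, so the chapter's own implementation may well differ; the _InputCollector class is a hypothetical helper introduced here for illustration.

```python
from html.parser import HTMLParser


class _InputCollector(HTMLParser):
    """Hypothetical helper: collect name/value pairs from <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        name = attrs.get('name')
        if name:  # skip unnamed inputs such as a bare submit button
            self.fields[name] = attrs.get('value') or ''


def parse_form(html):
    """Return a dict of the named <input> fields found in html."""
    if isinstance(html, bytes):
        html = html.decode('utf-8', errors='replace')
    collector = _InputCollector()
    collector.feed(html)
    return collector.fields
```

Hidden fields such as _formkey are collected along with the visible ones, which matters because the server expects them to be sent back when the form is submitted.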

Now we can increase the population by one and submit the updated version to the server:

>>> data['population'] = int(data['population']) + 1 
>>> response = session.post(COUNTRY_URL, data)
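
If the same update is going to run repeatedly, the increment step can be pulled into a small helper. This is only a sketch: with_incremented is a hypothetical name, not part of the chapter's edit script, and it assumes the field holds an integer encoded as a string, as the population value does in the parsed form data.

```python
def with_incremented(data, field='population', step=1):
    """Return a copy of data with the integer-valued field raised by step.

    The original dict is left untouched, and the new value is stored as a
    string to match the other form values.
    """
    updated = dict(data)
    updated[field] = str(int(updated[field]) + step)
    return updated


# Example use with the names from the session above:
#   data = with_incremented(data)
#   response = session.post(COUNTRY_URL, data)
```

Working on a copy keeps the originally parsed values available, which is handy for comparing before and after.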

When we return to the country page, we can verify that the population has increased to 62,348,449.
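
The same check can be made from the script instead of the browser by fetching the edit form again and reading the population field back. The sketch below assumes a form parser such as parse_form() is passed in and that the session object behaves like requests.Session (a get() method returning a response with a content attribute). Both function names are hypothetical.

```python
def population_matches(form_data, expected):
    """Return True if the form's population field equals expected."""
    return int(form_data['population']) == expected


def verify_update(session, url, expected, parse_form):
    """Re-fetch the edit page at url and check the stored population.

    session only needs a get(url) method whose result has a .content
    attribute; parse_form turns that content into a dict of form fields.
    """
    page = session.get(url)
    return population_matches(parse_form(page.content), expected)


# Example use with the names from the session above:
#   verify_update(session, COUNTRY_URL, 62348449, parse_form)
```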

Feel free to test and modify the other fields as well; the database is restored to the original country data each hour, so any changes are temporary. The edit script also includes code for modifying the currency field as another example. You can also experiment with modifying other countries.

Note that the example covered here is not strictly web scraping, but falls under the wider scope of online bots. The form techniques we used can also be applied to interacting with complex forms to access data you want to scrape. Make sure you use your new automated form powers for good and not for spam or malicious content bots!
