Loading cookies from the web browser

Working out how to submit the login details expected by a server can be quite complex, as the previous example demonstrated. Fortunately, there's a workaround for difficult websites: we can log in to the website manually using a web browser, and have our Python script load and reuse those cookies to be automatically logged in.

Web browsers store their cookies in a variety of formats, but Firefox and Chrome both use a format we can easily parse with Python: a SQLite database.

SQLite is a very popular open-source SQL database. It can be easily installed on many platforms and comes pre-installed on macOS. To install it on your operating system, check the SQLite download page or search for installation instructions for your platform.

To take a look at your cookies, you can (if SQLite is installed) run the sqlite3 command with the path to your cookie file (shown below is an example for Chrome):

$ sqlite3 [path_to_your_chrome_browser]/Default/Cookies
SQLite version 3.13.0 2016-05-18 10:57:30
Enter ".help" for usage hints.
sqlite> .tables
cookies meta

You will first need to find the path to your browser's configuration files, either by searching your filesystem or by searching the web for your browser and operating system. To see a table's schema in SQLite, you can use .schema, and SELECT syntax works much like in other SQL databases.
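The same database can be queried directly from Python using the standard library's sqlite3 module. The following is a minimal sketch (the function name and host_filter parameter are our own); note that recent Chrome versions store cookie values encrypted in a separate encrypted_value column, so the plain value column may be empty there:

```python
import sqlite3

def load_sqlite_cookies(db_path, host_filter=None):
    """Read name/value pairs from a Chromium-style 'Cookies' database.

    Note: recent Chrome versions encrypt cookie values into a separate
    encrypted_value column, so the plain value column may be empty.
    """
    cookies = {}
    conn = sqlite3.connect(db_path)
    try:
        query = 'SELECT host_key, name, value FROM cookies'
        for host, name, value in conn.execute(query):
            # Optionally keep only cookies whose host matches a substring
            if host_filter is None or host_filter in host:
                cookies[name] = value
    finally:
        conn.close()
    return cookies
```

The returned dictionary can then be passed to requests via its cookies parameter, just like the Firefox session dictionary built later in this section.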

In addition to storing cookies in a SQLite database, some browsers (such as Firefox) store sessions directly in a JSON file, which can be easily parsed using Python. There are also numerous browser extensions, such as SessionBuddy, which can export your sessions into JSON files. For the login, we only need to find the proper sessions, which are stored in this structure:

{"windows": [... 
"cookies": [
{"host":"example.webscraping.com",
"value":"514315085594624:e5e9a0db-5b1f-4c66-a864",
"path":"/",
"name":"session_id_places"}
...]
]}

Here is a function that can be used to parse Firefox sessions into a Python dictionary, which we can then feed to the requests library:

import json
import os

def load_ff_sessions(session_filename):
    cookies = {}
    if os.path.exists(session_filename):
        json_data = json.loads(open(session_filename, 'rb').read())
        for window in json_data.get('windows', []):
            for cookie in window.get('cookies', []):
                cookies[cookie.get('name')] = cookie.get('value')
    else:
        print('Session filename does not exist:', session_filename)
    return cookies

One complexity is that the location of the Firefox sessions file will vary, depending on the operating system. On Linux, it should be located at this path:

~/.mozilla/firefox/*.default/sessionstore.js 

In OS X, it should be located at:

~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js

Also, for Windows Vista and above, it should be located at:

%APPDATA%/Mozilla/Firefox/Profiles/*.default/sessionstore.js

Here is a helper function to return the path to the session file:

import glob
import os

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        # expanduser handles '~'; expandvars handles '%APPDATA%' on Windows
        matches = glob.glob(os.path.expandvars(os.path.expanduser(filename)))
        if matches:
            return matches[0]

Note that the glob module used here will return all matching files for the given path. Now here is an updated snippet using the browser cookies to log in:
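For instance, a quick illustration of that behavior, using throwaway directories rather than real Firefox profiles:

```python
import glob
import os
import tempfile

# Create two fake profile directories to show that glob returns
# every match for a wildcard pattern, not just the first one.
base = tempfile.mkdtemp()
for profile in ('abc.default', 'xyz.default'):
    os.makedirs(os.path.join(base, profile))

matches = glob.glob(os.path.join(base, '*.default'))
print(len(matches))  # 2
```

Since find_ff_sessions simply returns matches[0], a machine with multiple Firefox profiles may need extra logic to pick the right one.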

>>> import requests
>>> session_filename = find_ff_sessions()
>>> cookies = load_ff_sessions(session_filename)
>>> url = 'http://example.webscraping.com'
>>> html = requests.get(url, cookies=cookies)

To check whether the session was loaded successfully, we cannot rely on the login redirect this time. Instead, we will scrape the resulting HTML to check whether the logged-in user label exists. If the result here is 'Login', the sessions have failed to load correctly. If this is the case, make sure you are already logged in to the example website using your Firefox browser. We can inspect the User label for the site using our browser tools:

The browser tools show this label is located within a <ul> tag of ID "navbar", which can easily be extracted with the lxml library used in Chapter 2, Scraping the Data:

>>> from lxml.html import fromstring
>>> tree = fromstring(html.content)
>>> tree.cssselect('ul#navbar li a')[0].text_content()
'Welcome Test account'
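To make this check reusable, the label lookup can be wrapped in a small helper. This is a sketch (the function name is ours), using an XPath expression equivalent to the CSS selector above:

```python
from lxml.html import fromstring

def get_user_label(html_content):
    """Return the text of the first link inside the ul element with
    id="navbar", or None if no such link is found."""
    tree = fromstring(html_content)
    links = tree.xpath('//ul[@id="navbar"]//li/a')
    return links[0].text_content() if links else None
```

If the returned label is 'Login' (or None), the cookies did not produce a logged-in session and you should log in again in your browser before reloading them.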

The code in this section was quite complex and only supports loading sessions from the Firefox browser. There are numerous browser add-ons and extensions that support saving your sessions in JSON files, so you can explore these as an option if you need session data for login.

In the next section, we will take a look at the requests library's advanced usage for sessions (http://docs.python-requests.org/en/master/user/advanced/#session-objects), which allows you to utilize browser sessions easily when scraping with Python.
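As a minimal preview of that approach, a Session object persists cookies across all requests made through it, so browser cookies only need to be loaded once per scraping run (the cookie name mirrors the earlier example, its value is hypothetical, and no request is actually sent here):

```python
import requests

# A Session persists cookies across every request made through it.
session = requests.Session()
session.cookies.update({'session_id_places': 'example-value'})  # hypothetical value

# Every subsequent session.get()/session.post() call would send this cookie.
print(session.cookies.get('session_id_places'))  # example-value
```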
