Reverse engineering a dynamic web page

So far, we have tried to scrape data from a web page the same way as introduced in Chapter 2, Scraping the Data. This method did not work because the data is loaded dynamically using JavaScript. To scrape this data, we need to understand how the web page loads it, a process that can be described as reverse engineering. Continuing the example from the preceding section, if we open the Network tab in our browser tools and then perform a search, we will see all of the requests made for a given page. There are a lot! If we scroll up through the requests, we see mainly photos (from loading country flags), and then we notice one with an interesting name: search.json with a path of /ajax:

If we click on that URL in Chrome, we can see more details, including a preview which shows us the response in parsed form (all major browsers offer similar functionality, so your view may vary; however, the main features should behave similarly). Here, much as in the Inspect Element view of the Elements tab, we can use the carets to expand the preview and see that each country in our results is included in JSON form:

We can also open the URL directly by right-clicking and opening the URL in a new tab. When you do so, you will see it as a simple JSON response. This AJAX data is not only accessible from within the Network tab or via a browser, but can also be downloaded directly, as follows:

>>> import requests
>>> resp = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
>>> resp.json()
{'error': '',
'num_pages': 22,
'records': [{'country': 'Afghanistan',
'id': 1261,
'pretty_link': '<div><a href="/view/Afghanistan-1"><img src="/places/static/images/flags/af.png" />Afghanistan</a></div>'},
...]
}

As we can see from the previous code, the requests library allows us to access JSON responses as a Python dictionary by using the json method. We could also download the raw string response and load it using Python's json.loads method. 
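This equivalence can be illustrated without a network call. The JSON literal below is a trimmed, illustrative sample of the response shown above; on a live response, `resp.json()` and `json.loads(resp.text)` produce the same dictionary:

```python
import json

# A trimmed sample of the kind of JSON the search endpoint returns
# (structure taken from the response above; the literal is illustrative).
raw = ('{"error": "", "num_pages": 22, '
       '"records": [{"country": "Afghanistan", "id": 1261}]}')

# json.loads on the raw string is equivalent to calling resp.json()
data = json.loads(raw)

print(data['num_pages'])               # 22
print(data['records'][0]['country'])   # Afghanistan
```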

Our code gives us a simple way to scrape countries containing the letter A. Finding the details of all countries requires calling the AJAX search with each letter of the alphabet. For each letter, the search results are split into pages, and the number of pages is indicated by num_pages in the response.

Unfortunately, we cannot simply save every result returned, because the same countries will appear in multiple searches; for example, Fiji matches searches for f, i, and j. These duplicates are filtered here by storing results in a set before writing them to a text file, since the set data structure ensures unique elements.
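A minimal sketch of this deduplication, using hypothetical overlapping result lists in place of live AJAX responses:

```python
# Hypothetical search results for the letters f, i, and j;
# note that Fiji appears in all three.
results_f = ['Fiji', 'Finland', 'France']
results_i = ['Fiji', 'India', 'Ireland']
results_j = ['Fiji', 'Jamaica', 'Japan']

countries = set()
for batch in (results_f, results_i, results_j):
    countries.update(batch)  # duplicates are silently ignored

# Fiji is stored only once despite matching three searches.
print(sorted(countries))
```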

Here is an example implementation that scrapes all of the countries by searching for each letter of the alphabet and then iterating the resulting pages of the JSON responses. The results are then stored in a simple text file.

import requests
import string

PAGE_SIZE = 10

template_url = ('http://example.webscraping.com/ajax/'
                'search.json?page={}&page_size={}&search_term={}')

countries = set()

for letter in string.ascii_lowercase:
    print('Searching with %s' % letter)
    page = 0
    while True:
        resp = requests.get(template_url.format(page, PAGE_SIZE, letter))
        data = resp.json()
        print('adding %d more records from page %d' %
              (len(data.get('records')), page))
        for record in data.get('records'):
            countries.add(record['country'])
        page += 1
        if page >= data['num_pages']:
            break

with open('../data/countries.txt', 'w') as countries_file:
    countries_file.write('\n'.join(sorted(countries)))

When you run the code, you will see progressive output:

$ python chp5/json_scraper.py
Searching with a
adding 10 more records from page 0
adding 10 more records from page 1
...

Once the script completes, the countries.txt file in the relative folder ../data/ will contain a sorted list of the country names. Note that the page size can be set using the PAGE_SIZE global variable; you may want to try adjusting it to increase or decrease the number of requests.
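The trade-off can be sketched with a little arithmetic (the record total of 252 used below is an assumption for illustration; in practice, the response's num_pages field reports the real page count):

```python
import math

def requests_needed(num_records, page_size):
    """Number of paged requests required to fetch num_records results."""
    return math.ceil(num_records / page_size)

# Assumed total of 252 matching records for a given search term.
for page_size in (10, 50, 100):
    print(page_size, requests_needed(252, page_size))
```

A larger page size means fewer round trips, at the cost of bigger individual responses; servers often cap the maximum page size, so very large values may be ignored.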

This AJAX scraper provides a simpler way to extract the country details than the traditional page-by-page scraping approach covered in Chapter 2, Scraping the Data. This is a common experience: AJAX-dependent websites initially look more complex, however their structure encourages separating the data and presentation layers, which can actually make our job of extracting data easier. If you find a site with an open Application Programming Interface (or API) like this example site, you can simply scrape the API rather than using CSS selectors and XPath to load data from HTML.
