Edge cases

The AJAX search script is quite simple, but it can be simplified further by taking advantage of edge cases in the search implementation. So far, we have queried each letter of the alphabet, which means 26 separate queries with duplicate results between them. Ideally, a single search query could match all results. We will experiment with different search terms to see whether this is possible. Here is what happens when the search term is left empty:

>>> url = 'http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=' 
>>> requests.get(url).json()['num_pages']
0

Unfortunately, this did not work: there are no results. Next, we will check whether '*' matches all results:

>>> requests.get(url + '*').json()['num_pages'] 
0

Still no luck. Next, we try '.', the regular expression metacharacter that matches any single character:

>>> requests.get(url + '.').json()['num_pages'] 
26

Perfect! The server must be matching results with regular expressions, so searching each letter can now be replaced with a single search for the dot character, as sketched below.
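
Here is a minimal sketch of that replacement, paging through the results of a single '.' search with the default page size of 10 still in place. It assumes only the num_pages and records fields seen in the responses above:

import requests

template_url = ('http://example.webscraping.com/ajax/'
                'search.json?page={}&page_size=10&search_term=.')

records = []
page = 0
num_pages = 1
while page < num_pages:
    # Each response reports the total page count, so we learn it as we go
    data = requests.get(template_url.format(page)).json()
    num_pages = data['num_pages']
    records.extend(data.get('records', []))
    page += 1

One query term now covers every country, although this still downloads 26 pages.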

Furthermore, we can set the page size in the AJAX URLs using the page_size query string value. The website's search interface has options for setting this to 4, 10, and 20, with the default set to 10. So, the number of pages to download can be halved by increasing the page size to this maximum:

>>> url = 'http://example.webscraping.com/ajax/search.json?page=0&page_size=20&search_term=.' 
>>> requests.get(url).json()['num_pages']
13

Now, what if we use a much higher page size, one larger than the web interface's select box supports?

>>> url = 'http://example.webscraping.com/ajax/search.json?page=0&page_size=1000&search_term=.' 
>>> requests.get(url).json()['num_pages']
1

Apparently, the server does not validate the page size parameter against the options offered in the interface, and now returns all the results in a single page. Many web applications do not check the page_size parameter in their AJAX backend because they expect all API requests to come through the web interface.

We have now crafted a URL to download the data for all countries in a single request. Here is the updated, much simpler implementation, which saves the data to a CSV file:

from csv import DictWriter
import requests

PAGE_SIZE = 1000

template_url = ('http://example.webscraping.com/ajax/'
                'search.json?page=0&page_size={}&search_term=.')

# One request with the oversized page_size returns every record
resp = requests.get(template_url.format(PAGE_SIZE))
data = resp.json()
records = data.get('records')

# Write all records to CSV, using the keys of the first record as the header
with open('../data/countries.csv', 'w', newline='') as countries_file:
    wrtr = DictWriter(countries_file, fieldnames=records[0].keys())
    wrtr.writeheader()
    wrtr.writerows(records)
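
Because this shortcut relies on the server continuing to ignore the page_size limit, it is worth confirming that the response really did fit in a single page before trusting the CSV output. A minimal check, using the num_pages field from the same response:

# Guard against the server later enforcing the page_size options
assert data['num_pages'] == 1, \
    'results spanned multiple pages; the server may now enforce page_size'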