ID iteration crawler

In this section, we will take advantage of a weakness in the website structure to easily access all the content. Here are the URLs of some sample countries:

http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/Aland-Islands-2
http://example.webscraping.com/view/Albania-3

We can see that the URLs only differ in the final section of the URL path, with the country name (known as a slug) and ID. It is a common practice to include a slug in the URL to help with search engine optimization. Quite often, the web server will ignore the slug and only use the ID to match relevant records in the database. Let's check whether this works with our example website by removing the slug and checking the page http://example.webscraping.com/view/1:
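
One quick way to verify this from a script is to request the slugless URL and check the response code. The following sketch uses Python's standard urllib.request module (the download helper built earlier in this chapter would work just as well):

import urllib.request

# Request the URL without the slug and confirm the server still returns the page
response = urllib.request.urlopen('http://example.webscraping.com/view/1')
print(response.getcode())  # 200 indicates the record was matched by its ID alone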

The web page still loads! This is useful to know because now we can ignore the slug and simply utilize database IDs to download all the countries. Here is an example code snippet that takes advantage of this trick:

import itertools 

def crawl_site(url):
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            break
        # success - can scrape the result
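
Note that this snippet relies on the download function defined earlier in the chapter. If you are trying the code in isolation, a minimal stand-in (only a sketch, without the extra handling of the earlier version) could look like this:

import urllib.error
import urllib.request

def download(url):
    # Minimal stand-in: fetch a page and return its HTML,
    # or None when the request fails (for example, on a 404)
    print('Downloading:', url)
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.URLError:
        return None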

Now we can use the function by passing in the base URL:

>>> crawl_site('http://example.webscraping.com/view/-')
Downloading: http://example.webscraping.com/view/-1
Downloading: http://example.webscraping.com/view/-2
Downloading: http://example.webscraping.com/view/-3
Downloading: http://example.webscraping.com/view/-4
[...]

Here, we iterate the ID until we encounter a download error, which we assume means our scraper has reached the last country. A weakness in this implementation is that some records may have been deleted, leaving gaps in the database IDs. Then, when one of these gaps is reached, the crawler will immediately exit. Here is an improved version of the code that allows a number of consecutive download errors before exiting:

def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # max errors reached, exit loop
                break
        else:
            num_errors = 0
            # success - can scrape the result

The crawler in the preceding code now needs to encounter five consecutive download errors to stop iteration, which decreases the risk of stopping iteration prematurely when some records have been deleted or hidden.
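
If you expect larger gaps between valid IDs, you can raise the threshold when calling the function (the value of 10 here is arbitrary):

>>> crawl_site('http://example.webscraping.com/view/-', max_errors=10)
Downloading: http://example.webscraping.com/view/-1
Downloading: http://example.webscraping.com/view/-2
[...]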

Iterating the IDs is a convenient approach to crawling a website, but, like the sitemap approach, it will not always be available. For example, some websites check whether the slug appears in the URL and return a 404 Not Found error if it does not. Other websites use large nonsequential or nonnumeric IDs, which makes iteration impractical. Amazon, for instance, uses ISBNs as the IDs for its books, and these have at least ten digits; iterating over them would require testing billions of possible combinations, which is certainly not the most efficient approach to scraping the site's content.
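
Treating the ID space as ten-digit numbers, a quick calculation illustrates the scale:

>>> 10 ** 10  # possible values for a ten-digit numeric ID
10000000000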

As you've been following along, you might have noticed some download errors with the message TOO MANY REQUESTS. Don't worry about these for now; we will cover how to handle this type of error in the Advanced Features section of this chapter.
