After the introductory chapter, it is time to get you started with a real scraping project.
In this chapter you will learn what data you will extract over the next two chapters, using Beautiful Soup and Scrapy.
Don’t worry; the requirements are simple. We will extract information from the following website: https://www.sainsburys.co.uk/.
Sainsbury’s is an online shop with a wide range of goods on offer, which makes it a great source for a website scraping project.
The Requirements
If you look at the website, you can see this is a simple web page with a lot of information. Let me show you which parts we will extract.
One idea would be to extract something from the Halloween-themed site (see Figure 2-1 for their themed landing page). However, this is not an option because you cannot try it yourself: Halloween will be over by the time you read this (at least the 2017 edition), and I cannot guarantee that future sales will look the same.
Therefore, you will extract information on groceries. To be more specific, you will gather nutrition details from the “Meat & fish” department.
- Name of the product
- URL of the product
- Item code
- Nutrition details per 100g:
  - Energy in kilocalories
  - Energy in kilojoules
  - Fat
  - Saturates
  - Carbohydrates
  - Total sugars
  - Starch
  - Fibre
  - Protein
  - Salt
- Country of origin
- Price per unit
- Unit
- Number of reviews
- Average rating
This looks like a lot, but do not worry! You will learn how to extract this information from all the products of this department with an automated script. And if you are keen and motivated, you can extend this knowledge and extract all the nutrition information for all the products.
Preparation
As I mentioned in the previous chapter, before you start developing your scraper, you should look at the website’s terms and conditions and its robots.txt file to see whether you may extract the information you need.
When I wrote this part (November 2017), the terms and conditions of the website contained no entry restricting scrapers. This means you can create a bot to extract information.
The robots.txt file shows what is allowed and what is not. It is quite restrictive: it contains only Disallow entries, and they apply to all bots.
What can we learn from this file? For example, you shouldn’t create bots that order automatically through this website. But that is irrelevant for us, because we only need to gather information; there will be no purchasing. This robots.txt file puts no limits on our purposes, so we are free to continue our preparation and scraping.
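If you want to check a robots.txt file programmatically rather than reading it by hand, the standard library’s urllib.robotparser can do it. The Disallow path below is invented for illustration; it is not the site’s actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration --
# NOT the actual file served by the website.
robots_txt = """User-agent: *
Disallow: /webapp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The product pages are allowed, the (hypothetical) ordering path is not.
print(parser.can_fetch("*", "https://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/"))  # True
print(parser.can_fetch("*", "https://www.sainsburys.co.uk/webapp/order"))  # False
```

In a real script you would call parser.set_url() with the site’s robots.txt address and parser.read() instead of parsing a hard-coded string.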
What would limit our purposes?
Good question. An entry in the robots.txt referencing the “Meat & fish” department could limit our scraping intent. A sample entry would look like this:
User-agent: *
Disallow: /shop/gb/groceries/meat-fish/
Disallow: /shop/gb/groceries/
But entries like these would also prevent search engines from looking up the goods Sainsbury’s is selling, and that would mean a big loss of profit.
Navigating Through “Meat & Fish”
As mentioned at the beginning of this chapter, we will extract data from the “Meat & fish” department. The URL of this part of the website is www.sainsburys.co.uk/shop/gb/groceries/meat-fish.
Let’s open the URL in our Chrome browser, disable JavaScript, and reload the browser window as described in the previous chapter. Remember, disabling JavaScript enables you to see the website’s HTML code as a basic scraper will see it.
Now we can see in the DevTools that every link is in a list element (<li> tag) of an unordered list (<ul>), with class categories departments. Note down this information because we will use it later.
Here we can see that the page has no products but another list with links to detailed sites. If we look at the HTML structure in DevTools, we can see that these links are again elements of an unordered list. This unordered list has the class categories aisles.
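A minimal sketch of pulling such links out with the standard library could look like this. The HTML snippet is invented to mirror the structure just described; a real page contains much more markup:

```python
import re

# Invented HTML mirroring the "categories aisles" structure seen in DevTools.
html = """
<ul class="categories aisles">
  <li><a href="/shop/gb/groceries/meat-fish/top-sellers">Top sellers</a></li>
  <li><a href="/shop/gb/groceries/meat-fish/roast-dinner">Roast dinner</a></li>
</ul>
"""

# Grab the target of every link in the list.
links = re.findall(r'<a href="([^"]+)"', html)
print(links)
# ['/shop/gb/groceries/meat-fish/top-sellers', '/shop/gb/groceries/meat-fish/roast-dinner']
```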
Here we need to examine two things: one is the list of products; the other is the navigation.
Among those list elements, we are interested in the one with the right-pointing arrow symbol, which has the class next. It tells us whether there is a next page we must navigate to or not.
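One quick way to detect that element is a regular expression over the pagination markup. The snippet below is invented to mirror the structure described; the query parameter in the link is an assumption:

```python
import re

# Invented pagination markup: the arrow to the next page sits in an
# <li class="next"> element, as described above.
html = '<ul class="pages"><li class="next"><a href="/shop/gb/groceries/meat-fish?pageNumber=2">&gt;</a></li></ul>'

match = re.search(r'<li class="next"><a href="([^"]+)"', html)
if match:
    print("next page:", match.group(1))
# next page: /shop/gb/groceries/meat-fish?pageNumber=2
```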
Now we have reached the level of the products, so we can step further and concentrate on the real task: identifying the required information.
Selecting the Required Information
Now that we have the product, let’s identify the required information. As before, we can use the selection tool, locate the required text, and read the properties from the HTML code.
The name of the product is inside a header (h1), which is inside a div with the class productTitleDescriptionContainer.
The price and the unit are in a div of the class pricing. The price itself is in a paragraph (p) of the class pricePerUnit; the unit is in a span of the class pricePerUnitUnit.
The average rating hides in an image: the image sits inside a label of class numberOfReviews, and its alt attribute contains the decimal value of the average rating. After the image comes the text containing the number of reviews.
The item code is inside a paragraph of class itemCode.
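Putting the identified classes together, a rough extraction sketch could look like the following. The product markup and all values are invented for illustration; only the class names come from the analysis above:

```python
import re

# Invented product markup using the classes identified above; a real page
# contains much more markup, and the values here are made up.
html = """
<div class="productTitleDescriptionContainer"><h1>Chicken Breast Fillets</h1></div>
<div class="pricing">
  <p class="pricePerUnit">£5.50</p>
  <span class="pricePerUnitUnit">kg</span>
</div>
<p class="itemCode">Item code: 7678882</p>
"""

name = re.search(r'<h1>([^<]+)</h1>', html).group(1)
price = re.search(r'<p class="pricePerUnit">\s*([^<]+)', html).group(1)
unit = re.search(r'<span class="pricePerUnitUnit">([^<]+)</span>', html).group(1)
item_code = re.search(r'<p class="itemCode">\D*(\d+)', html).group(1)

print(name, price, unit, item_code)
# Chicken Breast Fillets £5.50 kg 7678882
```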
Even though we must extract many fields, we identified them easily in the website. Now it is time to extract the data and learn the tools of the trade!
Outlining the Application
After the requirements are defined and we have found every entry to extract, it is time to plan the application’s structure and behavior.
1. Download the starting page, in this case the “Meat & fish” department, and extract the links to the product pages.
2. Download the product pages and extract the links to the detailed products.
3. Extract the information we are interested in from the already downloaded product pages.
4. Export the extracted information.
These steps map naturally onto functions of the application we are developing.
Step 1 has a bit more to it: as you have seen during the analysis with DevTools, some links lead only to a grouping category, and you must extract the detail-page links from that grouping category.
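The four steps could be outlined as a function skeleton like the one below. The function names are my own invention, and the bodies are placeholders to be filled in as the chapter progresses:

```python
# A possible outline of the scraper; the function names are invented,
# and the bodies are placeholders to be implemented later.

def extract_product_list_links(department_page_html):
    """Step 1: parse the department page, return links to product list pages."""
    return []

def extract_product_links(product_list_html):
    """Step 2: parse a product list page, return links to product detail pages."""
    return []

def extract_product_information(product_html):
    """Step 3: pull the required fields out of a downloaded product page."""
    return {}

def export(products):
    """Step 4: write the extracted information to the target format."""
    print("exported", len(products), "products")
```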
Navigating the Website
Before we jump into learning the first tools you will use to scrape website data, I want to show you how to navigate websites—and this will be another building block for scrapers.
Because a website is a graph, you can use graph algorithms to navigate through the pages and links: Breadth First Search (BFS) and Depth First Search (DFS).
Using BFS, you go one level of the graph and gather all the URLs you need for the next level. For example, you start at the “Meat & fish” department page and extract all URLs to the next required level, like “Top sellers” or “Roast dinner.” Then you have all these URLs and go to the Top sellers and extract all URLs that lead to the detailed product pages. After this is done, you go to the “Roast dinner” page and extract all product details from there too, and so on. At the end you will have the URLs to all product pages, where you can go and extract the required information.
Using DFS, you go straight to the first product through “Meat & fish,” “Top sellers,” and extract the information from its site. Then you go to the next product on the “Top sellers” page and extract the information from there. If you have all the products from “Top sellers” then you move to “Roast dinner” and extract all products from there.
If you ask me, both algorithms are good, and they deliver the same result. I could write two scripts and compare them to see which one is faster, but this comparison would be biased and flawed.1
Therefore, you will implement a script that will navigate a website, and you can change the algorithm behind it to use BFS or DFS.
If you are interested in the why of both algorithms, I suggest Magnus Hetland’s book Python Algorithms.2
Creating the Navigation
Implementing the navigation is simple if you look at the algorithms, because implementing the pseudocode is the only trick to it.
The two functions shown download a page and extract from it the links that still point to the Sainsbury’s website.
Note
If you don’t filter out external URLs, your script may never end. This is only useful if you want to navigate the whole WWW to see how far you can reach from one website.
The extract_links function takes care of an empty or None page. urljoin wouldn’t complain about this, but re.findall would throw an exception, and you don’t want that to happen.
The get_links function returns all the links of the web page that point to the same host. To find out which host to use, you can utilize the urlparse function,3 which returns a tuple. The second element of this tuple is the host extracted from the URL.
If you look at the two functions just shown, you will see only one difference in their code: how the new links are added to the collection of pages still to visit. BFS appends them to the end, using the collection as a queue; DFS pushes them to the front, using it as a stack.
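Based on the description above, the helpers and the navigation loop might look roughly like the following sketch. The names extract_links and get_links come from the text; everything else (the download stand-in, the single navigate function with a flag) is my own assumption. The download step is replaced by a small in-memory page store so the sketch runs offline:

```python
import re
from urllib.parse import urljoin, urlparse

# In-memory stand-in for real HTTP downloads so the sketch runs offline.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.site/x">external</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": "",
}

def download(url):
    """Pretend download; the real version fetches the URL over HTTP."""
    return PAGES.get(url)

def extract_links(page, base_url):
    """Return all links on the page, guarding against an empty or None page."""
    if not page:
        return []
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', page)]

def get_links(page, base_url):
    """Keep only the links that point to the same host as the current page."""
    host = urlparse(base_url)[1]  # the second element of the tuple is the host
    return [link for link in extract_links(page, base_url)
            if urlparse(link)[1] == host]

def navigate(start_url, depth_first=False):
    """Visit every reachable same-host page with BFS (default) or DFS."""
    to_visit, visited = [start_url], []
    while to_visit:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.append(url)
        new_links = [l for l in get_links(download(url), url) if l not in visited]
        if depth_first:
            to_visit = new_links + to_visit  # stack-like: newest links first (DFS)
        else:
            to_visit = to_visit + new_links  # queue-like: newest links last (BFS)
    return visited

print(navigate("https://example.com/"))
# ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

Switching between BFS and DFS is a single line: where the freshly extracted links are inserted into the to_visit collection.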
The requests Library
To implement the script successfully, you must learn a bit about the requests library .
I really like the extensiveness of the Python core library, but sometimes you need libraries developed by members of the community, and the requests library is one of those.
With Python’s basic urlopen you can create simple requests and read the corresponding data, but it is cumbersome to use. The requests library adds a friendly layer on top of this complexity and makes network programming easy: it takes care of redirects and can handle sessions and cookies for you. The Python documentation recommends it as the tool to use.
Again, I won’t give you a detailed introduction into this library, just the necessary information to get you going. If you need more information, look at the project’s website.4
Installation
After installing the library, you are set up to continue with this book.
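The installation itself is most likely a single pip command (assuming a standard Python setup; use your environment’s usual package tooling if it differs):

```shell
pip install requests
```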
Getting Pages
Requesting pages is easy with the requests library: requests.get(url).
This returns a response object that contains basic information, like status code and content. The content is most often the body of the website you requested, but if you requested some binary data (like images or sound files) or JSON, then you get that back. For this book, we will focus on HTML content.
The preceding code block requests my website’s front page and, if the server returns the status code 200, which means OK, prints the first 250 characters of the content. If the server returns a different status code, it prints that code.
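A sketch of the described logic follows; example.com stands in here for the author’s front page, which is not named in the text:

```python
import requests

# example.com stands in for the author's front page, which is not named here.
response = requests.get("https://example.com")
if response.status_code == 200:   # 200 means the request was OK
    print(response.text[:250])    # first 250 characters of the body
else:
    print(response.status_code)
```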
With this we are through the basics of the requests library. As I introduce more concepts of the library later in this book, I will tell you more about it.
Now it is time to drop the default urllib calls of Python 3 and change to requests.
Switching to requests
Now it is time to finish the script and use the requests library for downloading the pages.
I surrounded the requesting method call with a try-except block because the content can have encoding issues that throw an exception and kill the whole application. We don’t want that, because the website is big and starting over would require too many resources.5
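A minimal version of such a guarded download function might look like this; the function name and the broad except clause are my own choices:

```python
import requests

def download(url):
    """Download a page's HTML; return None if the request blows up."""
    try:
        return requests.get(url).text
    except Exception:
        # Encoding or connection problems shouldn't kill the whole scrape.
        return None

print(download("not-a-valid-url"))
# None
```

Callers then check for None (just as extract_links guards against an empty page) instead of crashing mid-run.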
Putting the Code Together
If your result is slightly different, then the website’s structure changed in the meantime.
As you can see from the printed URLs, the current solution is rudimentary: the code navigates the whole website instead of focusing only on the “Meat & fish” department and nutrition details.
One option would be to extend the filter to return only relevant links, but I don’t like regular expressions because they are hard to read. Instead, let’s go ahead to the next chapter.
Summary
This chapter prepared you for the remaining parts of the book: you have met the requirements, analyzed the website to scrape, and identified where in the HTML code the fields of interest lie. You also implemented a simple scraper, mostly with basic Python tools, that navigates through the website.
In the next chapter you will learn about Beautiful Soup, a simple extractor library that helps you forget regular expressions and adds more features to traverse and extract from HTML trees like a boss.