This chapter is all about handling websites that utilize JavaScript to render information dynamically.
You have seen in the previous chapters that a basic website scraper loads the web page’s contents and extracts data from that source code. If the page includes JavaScript, it is not executed, and the dynamically rendered information is missing from the page.
This is bad, at least in those cases where you need that dynamic data.
Another interesting aspect of scraping websites that use JavaScript is that you may need clicks or button presses to reach the right page or get the right content, because these actions trigger a chain of JavaScript functions.
Now I will give you options for how you can deal with these problems. Most of the time you will find Selenium recommended as the solution if you search the Internet. However, other options exist too, and I will give you more insight into them; perhaps one of them will fit your needs better.
Reverse Engineering
This first option is for advanced developers—at least I feel advanced developers will do more reverse engineering.
The idea here is to use Chrome’s DevTools (or similar functionality in other browsers) with JavaScript enabled and monitor the XHR network flow to find out which data is requested from the server and rendered separately.
With the target endpoint (either a GET or a POST request) in your hands, you can see which parameters to provide and how they affect the results.
Let’s look at a simple example: at kayak.com you can search for flights, and therefore for airports too. In this simple example we will reverse engineer the destination search endpoint to extract some information, even if the information itself is not particularly valuable.
I’ll use Chrome for these examples. This is because I use Chrome for all my scraping tasks. It will work with Firefox too, if you know how to handle the developer tools.
As you can see in the image, I already navigated to the XHR tab inside the Network tab because all AJAX and XHR calls are listed here.
The interesting query parameters of this request are the following:

- where is the key you’re searching for.
- s is the type of the search; 58 stands for airports.
- lc is the locale; you can change it and get different results (more on this later).
- v is the version; there’s a small difference in the result format if you choose v1 instead of the default v2.
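Pulling these pieces together, a quick sketch of calling such an endpoint from Python might look like the following. The endpoint path below is a placeholder for whatever URL you copied from the XHR entry in your browser’s DevTools; only the parameter names come from the capture above.

```python
# Sketch: querying the reverse-engineered search endpoint directly.
# ENDPOINT is a placeholder -- copy the real URL from the XHR entry
# shown in your browser's DevTools Network tab.
ENDPOINT = "https://www.kayak.com/mv/marvel"


def build_search_params(term, locale="en", version="v2"):
    """Assemble the query parameters observed in the XHR call."""
    return {
        "where": term,    # the key you're searching for
        "s": "58",        # search type; 58 stands for airports
        "lc": locale,     # locale of the results
        "v": version,     # result format version (v1 or v2)
    }


def fetch_airports(term):
    """Issue the GET request; requires the requests library and network access."""
    import requests
    response = requests.get(ENDPOINT, params=build_search_params(term))
    return response.json()
```

Changing the `lc` argument here reproduces the locale experiment described below: the same airports come back with descriptions in the chosen language.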
What can we get out of this? We get some airports, some insight into how to reverse engineer JavaScript, and a feel for when to decide to use a different tool.
In this example, the JavaScript rendering is a simple HTTP GET call—nothing fancy, and I bet you already have an idea how to extract information delivered from these endpoints. Yes, using either the requests and Beautiful Soup libraries or Scrapy and some Request objects.
Back to the example: when you vary the lc value, for example, to de or es in the request, you get back different airports and the description of these airports in the locale you chose. This means JavaScript reverse engineering is not just about finding the right calls you want to use but also requires a bit of thinking.
Thoughts on Reverse Engineering
If you find a search that utilizes an HTTP endpoint to get its data, you can try to figure out how the search works. For example, instead of sending values you expect to return results, try adding search expressions. Such expressions could be * to match everything, .+ to see whether regular expressions are evaluated, or % to probe whether some kind of SQL query runs in the back end.
Summary
You see, sometimes JavaScript reverse engineering pays off: you learned that those nasty XHR calls are simple requests and that you can issue them from your own scripts. However, sometimes JavaScript does more complex things, like rendering and loading data after the initial page has loaded. And you don’t want to reverse engineer that, believe me.
Splash
Splash is an open-source JavaScript rendering engine written in Python. It is lightweight and integrates smoothly with Scrapy.
It is maintained, and new versions are released every few months, when the need arises.
Set-up
The basic and easiest usage of Splash is getting a Docker image from the developers and running it. This ensures that you have all the dependencies required by the project and can start using it. In this section we will use Docker.
To get started, install Docker if you don’t have it already. You can find more information on installing Docker here: https://docs.docker.com/manuals/ .
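With Docker in place, the Splash container can be pulled and started with two commands (these are the commands documented by the Splash project; port 8050 is Splash’s default HTTP port):

```shell
# Fetch the Splash image published by the developers
docker pull scrapinghub/splash
# Run it, exposing the HTTP API on port 8050 of the host
docker run -p 8050:8050 scrapinghub/splash
```

After the container starts, opening http://localhost:8050/ in a browser should show Splash’s built-in test page.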
Note
On some machines, administrator rights are required to start Splash. For example, on my Windows 10 computer, I had to run the docker container from an administrator console. On Unix-like machines, you may need to run the container using sudo.
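Once the container runs, you can verify it by asking Splash to render a page over plain HTTP. The render.html endpoint and its url argument are part of Splash’s documented HTTP API; the helper below is just a sketch, and the target URL is an example.

```python
from urllib.parse import urlencode

SPLASH_HOST = "http://localhost:8050"  # default port of the Splash container


def render_url(target, host=SPLASH_HOST):
    """Build the URL of Splash's render.html endpoint for a target page."""
    return "{}/render.html?{}".format(host, urlencode({"url": target}))


def fetch_rendered(target):
    """Fetch the JavaScript-rendered source.

    Requires the requests library and a running Splash instance."""
    import requests
    return requests.get(render_url(target)).text
```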
The preceding example result is just an excerpt. If you save this code into an HTML file and open it in a browser, and do the same with the source returned by Splash, you will see the same page. The difference is in the sources: Splash’s version has more lines and contains the expanded JavaScript functions.
A Dynamic Example
To see how Splash works with dynamic websites (which utilize JavaScript a lot), let’s look at a different example. For instance, http://www.protopage.com/ generates a web page for you based on a prototype, which you can customize. If you visit the site, you must wait a few seconds until the page gets rendered.
If we want to scrape data from this site (there’s not much available, but imagine it has a lot to offer) and we use a simple tool (the requests library, Scrapy) or Splash with the default settings, we get only the base page, which tells us that the page is currently being rendered.
Depending on the network speed and load on the target website, three seconds can be too short. Feel free to experiment with different values for your target websites to have the page rendered.
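The delay is controlled through render.html’s documented wait argument, the number of seconds Splash waits after the page loads before returning the source. A sketch, using three seconds as a starting point:

```python
def splash_args(url, wait=3):
    """Query arguments for Splash's render.html endpoint.

    `wait` gives the page's JavaScript time to render before Splash
    returns the source; tune it per target website."""
    return {"url": url, "wait": wait}


def fetch_with_wait(url, wait=3):
    """Requires the requests library and a Splash instance on port 8050."""
    import requests
    response = requests.get("http://localhost:8050/render.html",
                            params=splash_args(url, wait))
    return response.text
```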
Now all this is good, but how to use Splash to scrape websites?
Integration with Scrapy
The way recommended by the Splash developers is to integrate the tool with Scrapy, and because we use Scrapy as our scraping tool, we will take a thorough look at how this can be accomplished.
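The integration is configured in the project’s settings.py. The sketch below follows the scrapy-splash README (adjust SPLASH_URL to wherever your container listens; the order numbers are the README’s recommended values). The last two variables make duplicate filtering and caching aware of Splash.

```python
# settings.py additions for scrapy-splash, per the project's README.
# Adjust SPLASH_URL to your running Splash instance.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and cache storage
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```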
The second variable points to a cache storage solution that is aware of Splash. If you’re using another custom cache storage, you must adapt it to work with Splash. This requires you to subclass the aforementioned storage class and replace all calls to scrapy.utils.request.request_fingerprint with scrapy_splash.splash_request_fingerprint so that those nasty changed fingerprints work out.
The last change concerns the usage of requests: instead of the default Scrapy Request, we need to use SplashRequest.
Now let’s adapt the Sainsbury’s spider to use Splash.
Adapting the basic Spider
In an ideal world, you would only need to alter the configuration as we did in the previous section, and all requests and responses would go through Splash, because we don’t have any direct usages of Scrapy’s Request objects.
Unfortunately, we need some more configuration in the code of the scraper too. If you don’t believe me, just start the scraper without having Splash running.
To get our scraper running through Splash, we need to use a SplashRequest for every request call, that is, every time we initiate a new request (either when starting the scraper or when yielding response.follow calls).
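As a sketch, the spider’s entry point then yields SplashRequest objects instead of plain Requests. The class name, start URL, and wait time below are illustrative, not the book’s exact listing:

```python
import scrapy
from scrapy_splash import SplashRequest


class SainsburysSpider(scrapy.Spider):
    name = 'splash'
    allowed_domains = ['www.sainsburys.co.uk']

    def start_requests(self):
        # Entry point: route the first page through Splash instead of
        # issuing a plain Scrapy Request.
        yield SplashRequest(
            'https://www.sainsburys.co.uk/shop/gb/groceries/meat-fish',
            callback=self.parse,
            args={'wait': 3},  # give the page's JavaScript time to render
        )

    def parse(self, response):
        pass  # extraction logic stays the same as in the original spider
```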
So we’re good: we render the first page through Splash. But what about the other calls, like navigating to the detail pages or the next page?
To adapt these, I changed the XPath extraction code a bit. Until now, we used the response.follow approach, where we could provide the selector containing the potential next URL we want to scrape.
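Instead of response.follow, the spider can extract the href values itself and wrap each one in a SplashRequest. The XPath below is a placeholder for the real selector used against Sainsbury’s pages:

```python
from scrapy_splash import SplashRequest


def parse(self, response):
    # response.follow would yield a plain Scrapy Request that bypasses
    # Splash, so build the absolute URL ourselves and wrap it instead.
    for href in response.xpath('//ul[@class="categories"]//a/@href').extract():
        yield SplashRequest(response.urljoin(href),
                            callback=self.parse,
                            args={'wait': 3})
```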
One change I made besides the ones mentioned previously was to rename the spider to splash.
And that is it: we converted the Sainsbury’s scraper to use Splash.
What Happens When Splash Isn’t Running?
Summary
Splash is a nice Python-based website rendering tool that you can integrate easily with Scrapy.
One drawback is that you must install it either manually, through a somewhat complicated process, or by using Docker. This makes porting it to the cloud complicated (see Chapter 6 for cloud solutions), so you should use Splash mainly for local scrapers. Locally, however, its seamless integration with Scrapy gives you a great benefit when scraping websites that use JavaScript to render content dynamically.
Another drawback is the speed. When I used Splash on my local computer, it barely scraped 20 pages per minute. This is too slow for my taste, but sometimes I cannot get around it.
Selenium
If you search the Internet about website scraping, you will most often encounter articles and questions about Selenium. Originally, I wanted to leave Selenium out of this book because I don’t like its approach; it’s a bit clumsy for my taste. However, because of its popularity, I decided to add a section about this tool. Perhaps you will embed a Selenium-based solution into your Scrapy scripts (for example, you already have a Selenium scraper but want to extend it), and I want to help you with this task.
First we will look at Selenium and how to use it in a stand-alone fashion, then we will add it to a Scrapy spider.
Prerequisites
To use Selenium for website scraping, you will need a web browser. This means you will see the configured web browser (let’s say Firefox or Chrome) open up and load the website, and then Selenium does its work, executing the extraction script you defined.
To enable linking between Selenium and your browser, you must install a specific WebDriver.
For Chrome, visit https://sites.google.com/a/chromium.org/chromedriver/home . I downloaded version 2.38.
For Firefox, you need to install GeckoDriver. It can be found at GitHub. I downloaded version 0.20.1.
These drivers must be on the PATH when you’re running your Python script. I put all of them inside one folder, because then I have to add only this single folder to the PATH and all my web drivers are available.
Note
These web drivers require a specific browser version. For example, if you already have Chrome installed and download the latest version of the web driver, you may encounter an exception like the one following if you don’t update your browser:
raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: session not created exception: Chrome version must be >= 65.0.3325.0
(Driver info: chromedriver=2.38.552522 (437e6fbedfa8762dec75e2c5b3ddb86763dc9dcb),platform=Windows NT 10.0.16299 x86_64)
Basic Usage
OK, it’s nice to have the browser open automatically and navigate to the target website. But what about scraping information?
Because we have the website within our reach (in the browser), we can parse the HTML almost like we did in the previous chapters, or use Selenium’s own offering for data extraction from the web page’s HTML.
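A minimal stand-alone sketch showing both routes follows. The URL and the h1 selector are placeholders, and the find_element_by_tag_name call reflects the Selenium API of the time this chapter covers; chromedriver must be on the PATH.

```python
from selenium import webdriver


def scrape_page(url):
    """Open Chrome, load the page, and extract data two ways:
    with Selenium's own finder and with Beautiful Soup."""
    driver = webdriver.Chrome()  # chromedriver must be on the PATH
    try:
        driver.get(url)
        # Option 1: Selenium's built-in extraction functions
        heading = driver.find_element_by_tag_name('h1').text
        # Option 2: hand the rendered source over to Beautiful Soup
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        return heading, soup.title
    finally:
        driver.quit()
```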
I won’t go into detail on Selenium’s extractors because it would exceed the boundaries of this book, but let me tell you that by using Selenium you have access to a different set of extraction functions, which you can use on your browser instances.
Integration with Scrapy
Selenium can be integrated with Scrapy . The only thing you need is to configure Selenium properly (have the web drivers on the PATH and the browsers installed) and then the fun can begin.
What I like to do is disable the browser window for my scrapes. That’s because I get distracted every time I see a browser window navigating pages automatically, and it would get out of hand if you combine Scrapy with Selenium.
Besides this, you will need a middleware that intercepts calls before Scrapy sends them directly and uses Selenium instead of normal requests.
The preceding code uses Firefox as the default browser and starts it in headless mode when the spider is opened. When the spider closes, the web driver is closed too.
The interesting part is when the request happens: it is intercepted and routed through the browser and the response HTML code is wrapped into an HtmlResponse object. Now your spider gets the Selenium-loaded HTML code and you can use it for scraping.
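Such a middleware can be sketched as follows. This is a reconstruction of the behavior described above, not the book’s exact listing; the headless flag and the options keyword depend on your Selenium version.

```python
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    """Downloader middleware that renders every request in headless Firefox."""

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened,
                                signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument('-headless')  # no visible browser window
        self.driver = webdriver.Firefox(options=options)

    def spider_closed(self, spider):
        self.driver.quit()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Wrap the rendered source so the spider sees a normal response.
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)
```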
scrapy-selenium
Recently, I found a fresh project on GitHub called scrapy-selenium. It is a convenient project that you can install and use to combine the powers of Scrapy and Selenium. I think it is worth sharing this project with you.
Note
Because this project is maintained by a single developer, it may have issues. If you find something not working, feel free to raise an issue for the project, and the developer will help you fix that problem. If not, shoot me an email and I’ll see whether I can give you a solution, or perhaps maintain the application myself and deliver newer versions.
This project works just like the custom middleware we implemented in the previous section: it intercepts requests and downloads the pages using Selenium.
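Usage, as described in the project’s README at the time of writing, amounts to a few settings plus a dedicated request class (the driver choice and executable path are up to you):

```python
# settings.py -- following the scrapy-selenium README
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser without a window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

In the spider, requests that should go through the browser are then yielded as SeleniumRequest objects (from scrapy_selenium import SeleniumRequest) instead of plain Scrapy Requests.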
Summary
Selenium is an alternative tool that website scraper developers use because it supports JavaScript rendering through a browser. We saw some solutions on how to integrate Selenium with Scrapy but skipped the built-in methods to extract information.
Again, using an external tool like Selenium makes your scraping slower, even in headless mode.
Solutions for Beautiful Soup
Until now, we looked at solutions where we can integrate JavaScript-based website scraping with Scrapy. But some projects are fine using Beautiful Soup and don’t need a full scraper environment.
Splash
Splash offers manual usage too. This means you have the alternative option of getting Splash to render a website and return the source code to your own code. We can utilize this to write a simple scraper with Beautiful Soup.
The idea here is to send an HTTP request to Splash, providing the URL to render (and any configuration parameters) and get the result back, and then use Beautiful Soup on this result, which is a rendered HTML.
To stick with the previous example, we will convert the scraper from Chapter 3 into a tool that utilizes Splash to render the pages of Sainsbury’s.
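A sketch of such a conversion follows; the Splash host and the wait value are assumptions matching the local Docker setup described earlier.

```python
import requests
from bs4 import BeautifulSoup

SPLASH_URL = 'http://localhost:8050/render.html'  # local Splash container


def parse(html):
    """Parse rendered HTML with Beautiful Soup."""
    return BeautifulSoup(html, 'html.parser')


def get_page(url, wait=3):
    """Return the Splash-rendered source of `url` as a BeautifulSoup object.

    Requires a running Splash instance and network access."""
    response = requests.get(SPLASH_URL, params={'url': url, 'wait': wait})
    response.raise_for_status()
    return parse(response.text)
```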
As you can see, we call the render.html endpoint of our Splash installation and provide the target URL as a simple GET parameter.
Selenium
Of course, we can integrate Selenium into our Beautiful Soup solutions too. It works the same way as it did with Scrapy.
Again, I won’t use the built-in Selenium methods to extract information from the website. I use Selenium only to render the page and extract the information I require.
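The combination is only a few lines (the API call again reflects the Selenium version of the time; chromedriver must be on the PATH):

```python
from bs4 import BeautifulSoup
from selenium import webdriver


def rendered_soup(url):
    """Let the browser execute the JavaScript, then scrape with Beautiful Soup."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()
```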
Summary
Even though we focus on Scrapy, because in my opinion it’s currently the website scraping tool for Python, you can see that the options that make Scrapy handle JavaScript can also be added to “plain” Beautiful Soup scrapers. This gives you the option of staying with the tools you already know!
Summary
In this chapter we looked at some approaches to scrape websites that utilize JavaScript. We looked at the mainstream Selenium using a web browser to execute JavaScript and then went to the headless world, where you don’t need any window to execute JavaScript and this makes your scripts portable and easier to execute.
Naturally, using another tool to get some extra rendering done takes time and provides overhead. If you don’t require JavaScript rendering, create your scripts without any add-ons like Splash or Selenium. You’ll benefit from the speed gain.
Now we are ready to see how we can deploy our spiders to the Cloud!