Scrapy Performance Tuning

If we check the start and end times of the initial full scrape of the example site, we can see that it took approximately 1,697 seconds, which works out to an average of ~6 seconds per page. We did not use Scrapy's concurrency features, and we added a delay of ~5 seconds between requests, so Scrapy is parsing and extracting data at around 1 second per page (recall from Chapter 2, Scraping the Data, that our fastest scraper using XPath took 1.07 seconds). I gave a talk at PyCon 2014 comparing the speed of web scraping libraries, and even then, Scrapy was massively faster than any other scraping framework I could find. I was able to write a simple Google search scraper that was returning (on average) 100 requests a second. Scrapy has come a long way since then, and I always recommend it as the most performant Python scraping framework.
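To make the trade-off between delay and concurrency concrete, here is a minimal sketch of the relevant settings.py options. The setting names are standard Scrapy settings, but the values are illustrative assumptions and not the configuration used for the scrape above; choose values that respect the target site.

```python
# settings.py (excerpt): throttling and concurrency, with illustrative values

# Seconds to wait between consecutive requests to the same site.
# The scrape described above used roughly 5 seconds; 0.5 is an assumed,
# faster value for comparison.
DOWNLOAD_DELAY = 0.5

# When True (Scrapy's default), the actual wait is a random value between
# 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Maximum number of concurrent requests overall (Scrapy's default is 16).
CONCURRENT_REQUESTS = 16

# Maximum number of concurrent requests to any single domain
# (Scrapy's default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Optionally let the AutoThrottle extension adjust the delay based on how
# quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound on the delay in seconds
```

With a 5-second delay, the delay itself accounts for most of the ~6 seconds per page, so reducing it is likely to matter more than raising the concurrency limits.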

In addition to leveraging the concurrency Scrapy provides (via Twisted), Scrapy can be tuned with page caches and other performance options (such as proxies, which allow more concurrent requests to a single site). To enable the cache, you should first read the cache middleware documentation (https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache). You might have already noticed that the settings.py file contains several good examples of how to implement the proper cache settings. For implementing proxies, there are some great helper libraries (as Scrapy only gives access to a simple middleware class). At the time of writing, the most popular and most recently updated library is https://github.com/aivarsk/scrapy-proxies, which has Python 3 support and is fairly easy to integrate.
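As a rough sketch of what enabling the cache looks like, the following settings.py excerpt turns on Scrapy's built-in HTTP cache. The setting names come from the cache middleware documentation; the values, including the list of ignored status codes, are assumptions chosen for illustration.

```python
# settings.py (excerpt): HTTP cache options, with illustrative values

# Turn on Scrapy's built-in HTTP cache middleware.
HTTPCACHE_ENABLED = True

# Seconds before a cached response is considered stale; 0 means cached
# responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Directory used to store the cache; a relative path is resolved inside
# the project's data directory.
HTTPCACHE_DIR = "httpcache"

# Do not cache responses with these status codes (an assumed list of
# server errors; adjust for your site).
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]

# Store cached responses on the local filesystem.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemStorage"
```

With these settings, repeated runs of the spider read previously downloaded pages from disk rather than requesting them again, which is especially useful while developing and debugging extraction code.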

As always, libraries and recommended setups can change, so the latest Scrapy documentation should always be your first stop when checking performance and making changes to your spiders.
