Interrupting and resuming a crawl

Sometimes when scraping a website, it can be useful to pause the crawl and resume it at a later time without needing to start over from the beginning. For example, you may need to interrupt the crawl to restart your computer after a software update, or perhaps the website you are crawling is returning errors and you want to continue the crawl later.

Conveniently, Scrapy comes with built-in support for pausing and resuming crawls without needing to modify our example spider. To enable this feature, we just need to define the JOBDIR setting with a directory where the current state of the crawl can be saved. Note that a separate directory must be used to save the state of each crawl.
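
If you would rather not pass the setting on the command line for every run, JOBDIR can also be defined in the project's settings.py; the directory path below is illustrative and should point wherever you want the state saved:

# settings.py
# Directory where Scrapy persists the scheduler queue, the duplicate
# request filter, and the spider state so the crawl can be resumed later
JOBDIR = 'data/crawls/country'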

Here is an example using this feature with our spider:

$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=../../../data/crawls/country
...
2017-03-24 13:41:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Anguilla-8> (referer: http://example.webscraping.com/)
2017-03-24 13:41:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.webscraping.com/view/Anguilla-8>
{'name': ['Anguilla'], 'population': ['13,254']}
2017-03-24 13:41:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Angola-7> (referer: http://example.webscraping.com/)
2017-03-24 13:41:59 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.webscraping.com/view/Angola-7>
{'name': ['Angola'], 'population': ['13,068,161']}
2017-03-24 13:42:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Andorra-6> (referer: http://example.webscraping.com/)
2017-03-24 13:42:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.webscraping.com/view/Andorra-6>
{'name': ['Andorra'], 'population': ['84,000']}
^C2017-03-24 13:42:10 [scrapy.crawler] INFO: Received SIG_SETMASK, shutting down gracefully. Send again to force
...
[country] INFO: Spider closed (shutdown)

Here, the ^C in the line that reads Received SIG_SETMASK is the same Ctrl + C (or Cmd + C) we used earlier in the chapter to stop our scraper. For Scrapy to save the crawl state, you must wait for the crawl to shut down gracefully and resist the temptation to enter the termination sequence a second time, which would force an immediate shutdown. The state of the crawl is now saved in crawls/country inside the data directory, and we can see the saved files by looking in that directory (note that the command and directory syntax will need to be altered for Windows users):

$ ls ../../../data/crawls/country/
requests.queue requests.seen spider.state
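
Here, requests.queue holds the requests still waiting to be downloaded, requests.seen records fingerprints of the requests already processed (the duplicate filter), and spider.state stores any values the spider keeps in its state dictionary. If you want data of your own to survive a pause, you can place it in self.state, which Scrapy persists to that file whenever JOBDIR is set. The following sketch is illustrative only; the spider and counter name are not part of our example project:

import scrapy


class StatefulSpider(scrapy.Spider):
    name = 'stateful'
    start_urls = ['http://example.webscraping.com/']

    def parse(self, response):
        # self.state is an ordinary dict; with JOBDIR set, Scrapy pickles it
        # into spider.state on shutdown and reloads it when the crawl resumes
        self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
        self.logger.info('Pages seen so far: %d', self.state['pages_seen'])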

The crawl can be resumed by running the same command:

$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=../../../data/crawls/country
...
2017-03-24 13:49:49 [scrapy.core.engine] INFO: Spider opened
2017-03-24 13:49:49 [scrapy.core.scheduler] INFO: Resuming crawl (13 requests scheduled)
2017-03-24 13:49:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-24 13:49:49 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-24 13:49:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/robots.txt> (referer: None)
2017-03-24 13:49:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Cameroon-40> (referer: http://example.webscraping.com/index/3)
2017-03-24 13:49:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.webscraping.com/view/Cameroon-40>
{'name': ['Cameroon'], 'population': ['19,294,149']}
...

The crawl now resumes from where it was paused and continues as normal. This feature is not particularly useful for our example website, because the number of pages to download is manageable. However, for larger websites that could take months to crawl, being able to pause and resume crawls is quite convenient.

There are some edge cases not covered here that can cause problems when resuming a crawl, such as expiring cookies and sessions. These are mentioned in the Scrapy documentation available at http://doc.scrapy.org/en/latest/topics/jobs.html.
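
One requirement from that documentation worth highlighting is that persisted requests must be serializable with pickle, so callbacks should be regular spider methods; a request whose callback is, say, a lambda cannot be written to requests.queue and will not survive a pause. A brief, illustrative sketch (the spider and selectors are placeholders, not our example project):

import scrapy


class ResumableSpider(scrapy.Spider):
    name = 'resumable'
    start_urls = ['http://example.webscraping.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            # A named method serializes cleanly into requests.queue; a lambda
            # or nested function used as the callback could not be persisted
            yield scrapy.Request(response.urljoin(href), callback=self.parse_view)

    def parse_view(self, response):
        yield {'url': response.url}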