Tuning settings

Before running the generated crawl spider, the Scrapy settings should be updated to avoid the spider being blocked. By default, Scrapy allows up to 16 concurrent downloads (up to 8 per domain) with no delay between them, which is far faster than a real user would browse. This behavior is easy for a server to detect and block.

As mentioned in Chapter 1, the example website we are scraping is configured to temporarily block crawlers that consistently download faster than one request per second, so the default settings would get our spider blocked. Unless you are running the example website locally, I recommend adding these lines to example/settings.py so the crawler downloads only one request at a time per domain, with a reasonable 5-second delay between downloads:

CONCURRENT_REQUESTS_PER_DOMAIN = 1 
DOWNLOAD_DELAY = 5

Alternatively, these settings are already present (commented out) in the settings.py file that Scrapy generates, so you can simply uncomment them and set the values shown above. Note that Scrapy will not use this exact delay between requests, because a perfectly regular interval would itself make the crawler easier to detect and block. Instead, it randomizes each wait to between 0.5 and 1.5 times the DOWNLOAD_DELAY value (this behavior is controlled by the RANDOMIZE_DOWNLOAD_DELAY setting, which is enabled by default).
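To make the randomization concrete, here is a minimal sketch of the kind of jitter Scrapy applies, not Scrapy's actual implementation. The function name next_delay is hypothetical; only the 0.5×–1.5× range mirrors the documented behavior:

```python
import random

def next_delay(download_delay=5.0):
    """Sketch of Scrapy-style delay randomization: with
    RANDOMIZE_DOWNLOAD_DELAY enabled (the default), the actual wait
    is drawn uniformly between 0.5x and 1.5x of DOWNLOAD_DELAY."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)

# With DOWNLOAD_DELAY = 5, each wait falls somewhere in [2.5, 7.5] seconds.
```

With a 5-second DOWNLOAD_DELAY, successive requests are therefore spaced between 2.5 and 7.5 seconds apart, which looks much less mechanical to the server.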

For details about these settings and the many others available, refer to http://doc.scrapy.org/en/latest/topics/settings.html.