Threaded crawler

Now we will extend the sequential crawler to download the web pages in parallel. Note that, if misused, a threaded crawler could request content too quickly and overload a web server or cause your IP address to be blocked.
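To make the idea concrete, here is a minimal sketch of how the sequential download loop could be parallelized with threads. It is not the chapter's full implementation: it assumes a hypothetical download(url) callable and a shared crawl_queue list of URLs to process, and it omits link extraction and the per-domain delay discussed next.

    # Minimal sketch of a threaded crawler (illustrative, not the chapter's code).
    # Assumes `download(url)` is a callable that fetches a single URL and
    # `crawl_queue` is a list of URLs shared by all worker threads.
    import threading
    import time

    def threaded_crawler(crawl_queue, download, max_threads=5):
        """Process the URLs in crawl_queue using up to max_threads threads."""
        def worker():
            while True:
                try:
                    url = crawl_queue.pop()
                except IndexError:
                    break  # queue is empty, so this thread can exit
                download(url)

        threads = []
        while threads or crawl_queue:
            # discard threads that have finished
            threads = [t for t in threads if t.is_alive()]
            # start new threads while work remains and slots are free
            while len(threads) < max_threads and crawl_queue:
                thread = threading.Thread(target=worker)
                thread.daemon = True  # let Ctrl-C stop the main thread
                thread.start()
                threads.append(thread)
            time.sleep(1)  # avoid busy-waiting in the main thread

The main loop simply keeps the pool topped up to max_threads workers; popping from a plain list is safe enough here because CPython's list.pop() is atomic, so no explicit lock is needed for this simple case.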

To avoid this, our crawlers will have a delay flag to set the minimum number of seconds between requests to the same domain.

The Alexa list example used in this chapter covers one million separate domains, so this particular problem does not apply here. However, when you crawl many web pages from a single domain in the future, you should use a delay of at least one second between downloads.
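One way to enforce such a delay is a small throttle object that records when each domain was last requested and sleeps if a new request to the same domain arrives too soon. The sketch below is illustrative; the class name and interface are assumptions rather than the chapter's exact code.

    # Hedged sketch of a per-domain delay (illustrative names and interface).
    import time
    from urllib.parse import urlparse

    class Throttle:
        """Pause between consecutive downloads to the same domain."""
        def __init__(self, delay):
            self.delay = delay   # minimum seconds between requests per domain
            self.domains = {}    # maps domain -> timestamp of last request

        def wait(self, url):
            domain = urlparse(url).netloc
            last_accessed = self.domains.get(domain)
            if self.delay > 0 and last_accessed is not None:
                sleep_secs = self.delay - (time.time() - last_accessed)
                if sleep_secs > 0:
                    time.sleep(sleep_secs)
            self.domains[domain] = time.time()

In use, each worker would call something like throttle.wait(url) immediately before downloading, so requests to different domains proceed at full speed while repeated requests to the same domain are spaced out by the configured delay.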
