Sequential crawler

We can now use AlexaCallback with a slightly modified version of the link crawler we developed earlier to download the top 500 Alexa URLs sequentially. To update the link crawler, it will now take either a start URL or a list of start URLs:

# In link_crawler function

if isinstance(start_url, list):
    crawl_queue = start_url
else:
    crawl_queue = [start_url]

We also need to update the way robots.txt is handled for each site. We use a simple dictionary to store the parsers per domain (see: https://github.com/kjam/wswp/blob/master/code/chp4/advanced_link_crawler.py#L53-L72). We must also handle the fact that not every URL we encounter will be relative; some of them aren't even URLs we can visit, such as e-mail addresses using mailto: or javascript: event commands. Additionally, because some sites lack a robots.txt file and other URLs are poorly formed, a few extra error-handling sections have been added, along with a new no_robots variable, which allows us to continue crawling if we cannot, in good faith, find a robots.txt file. Finally, we added a socket.setdefaulttimeout(60) call to handle timeouts for robotparser, as well as some additional timeout arguments for the Downloader class from Chapter 3, Caching Downloads.
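The per-domain parser cache can be sketched roughly as follows. This is a minimal, hedged reconstruction of the idea rather than the book's exact code (the function names get_robots_parser and allowed, and the 'wswp' user agent, are assumptions here; see the linked repository for the real implementation):

```python
import socket
from urllib import robotparser
from urllib.parse import urljoin, urlparse

# Prevent robotparser from hanging indefinitely on unresponsive sites
socket.setdefaulttimeout(60)

def get_robots_parser(robots_url):
    """Return a RobotFileParser for robots_url, or None if it can't be read."""
    try:
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp
    except Exception:
        # Missing or unreadable robots.txt: signal the caller with None
        return None

# One parser per domain, so each site's robots.txt is fetched only once
robots = {}

def allowed(url, user_agent='wswp'):
    parts = urlparse(url)
    domain = '{}://{}'.format(parts.scheme, parts.netloc)
    if domain not in robots:
        robots[domain] = get_robots_parser(urljoin(domain, '/robots.txt'))
    rp = robots[domain]
    # no_robots behaviour: with no robots.txt found, continue in good faith
    return True if rp is None else rp.can_fetch(user_agent, url)
```

Caching the parser per domain matters here because the crawl touches 500 different sites: without the dictionary, robots.txt would be re-downloaded for every single link checked.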

The primary code to handle these cases is available at https://github.com/kjam/wswp/blob/master/code/chp4/advanced_link_crawler.py. The new crawler can then be used directly with the AlexaCallback and run from the command line as follows:

python chp4/advanced_link_crawler.py
...
Total time: 1349.7983705997467s

Taking a look at the code that runs in the __main__ section of the file, we use '$^' as our pattern to avoid collecting links from each page. You can also try to crawl all links on every page using '.' to match everything. (Warning: This will take a long time, potentially days!) 
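The trick behind the '$^' pattern is that it can never match anything: '$' only matches at the end of the input, so requiring it before '^' fails on every non-empty link. A quick check illustrates the difference between the two patterns:

```python
import re

# '$^' (end-of-string before start-of-string) never matches a non-empty
# string, so a crawler using it as its link pattern follows no links.
no_links = '$^'
# '.' matches any character, so every candidate link qualifies.
all_links = '.'

sample_links = ['/view/1', 'http://example.webscraping.com/index']
print(any(re.match(no_links, link) for link in sample_links))   # False
print(all(re.match(all_links, link) for link in sample_links))  # True
```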

The time for crawling only the first page of each site is as expected for sequential downloading, averaging ~2.7 seconds per URL (this includes the time to test the robots.txt file). Depending on your ISP speed, or if you run the script on a server in the cloud, you might see much faster results.
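The ~2.7-second average follows directly from the total time reported above, assuming all 500 URLs were crawled:

```python
# Average time per URL for the sequential crawl of the top 500 sites
total_seconds = 1349.7983705997467  # total crawl time reported above
num_urls = 500

avg = total_seconds / num_urls
print('~{:.1f} seconds per URL'.format(avg))  # → ~2.7 seconds per URL
```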
