We can now use AlexaCallback with a slightly modified version of the link crawler we developed earlier to download the top 500 Alexa URLs sequentially. First, we update the link crawler so it accepts either a single start URL or a list of start URLs:
# In link_crawler function
if isinstance(start_url, list):
    crawl_queue = start_url
else:
    crawl_queue = [start_url]
We also need to update the way robots.txt is handled for each site. We use a simple dictionary to store a parser per domain (see: https://github.com/kjam/wswp/blob/master/code/chp4/advanced_link_crawler.py#L53-L72). We also need to handle the fact that not every URL we encounter will be relative; some of them aren't even URLs we can visit, such as e-mail addresses using mailto: or javascript: event commands. Additionally, because some sites lack a robots.txt file and others contain poorly formed URLs, we add some extra error-handling code and a new no_robots variable, which allows us to continue crawling when we cannot, in good faith, find a robots.txt file. Finally, we add a socket.setdefaulttimeout(60) call to handle timeouts for robotparser, along with some additional timeout arguments for the Downloader class from Chapter 3, Caching Downloads.
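To illustrate the idea, here is a minimal sketch of per-domain robots.txt caching and link filtering along the lines just described. The helper names (can_fetch, normalize_link), the cache variables, and the broad exception handling are assumptions for illustration, not the exact code from the repository:

import socket
from urllib import robotparser
from urllib.parse import urljoin, urlparse

socket.setdefaulttimeout(60)  # keep robotparser from hanging on slow hosts

rp_cache = {}      # one robots.txt parser per domain
no_robots = set()  # domains where no robots.txt could be found

def can_fetch(url, user_agent='wswp'):
    """Check robots.txt for a URL, caching one parser per domain."""
    parts = urlparse(url)
    domain = '{}://{}'.format(parts.scheme, parts.netloc)
    if domain in no_robots:
        return True  # no robots.txt found earlier; keep crawling
    if domain not in rp_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(domain + '/robots.txt')
        try:
            rp.read()
        except Exception:
            no_robots.add(domain)  # missing or unreachable robots.txt
            return True
        rp_cache[domain] = rp
    return rp_cache[domain].can_fetch(user_agent, url)

def normalize_link(link, page_url):
    """Resolve relative links; skip mailto:, javascript:, and the like."""
    abs_link = urljoin(page_url, link)
    if urlparse(abs_link).scheme in ('http', 'https'):
        return abs_link
    return None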
The primary code to handle these cases is available at https://github.com/kjam/wswp/blob/master/code/chp4/advanced_link_crawler.py. The new crawler can then be used directly with the AlexaCallback and run from the command line as follows:
python chp4/advanced_link_crawler.py
...
Total time: 1349.7983705997467s
Taking a look at the code that runs in the __main__ section of the file, we use '$^' as our pattern, a regular expression that matches nothing, to avoid collecting links from each page. You can also try to crawl all links on every page by using '.' to match everything. (Warning: This will take a long time, potentially days!)
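As a rough sketch, the driver might look like the following; it assumes AlexaCallback populates a urls attribute when called and that link_crawler accepts a list of start URLs plus a link regex, so treat it as illustrative rather than the repository's exact code:

import time

if __name__ == '__main__':
    AC = AlexaCallback()
    AC()  # assumed to populate AC.urls with the top 500 Alexa URLs
    start_time = time.time()
    link_crawler(AC.urls, '$^')  # '$^' matches no links: only seed pages are fetched
    print('Total time: {}s'.format(time.time() - start_time))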
The time for crawling only the first page of each site is as expected for sequential downloading: an average of ~2.7 seconds per URL (this includes the time to test the robots.txt file). Depending on your ISP speed, or if you run the script on a cloud server, you might see much faster results.