Implementing a multithreaded crawler

Fortunately, Python makes threading relatively straightforward. This means we can keep a similar queuing structure to the link crawler developed in Chapter 1, Introduction to Web Scraping, but start the crawl loop in multiple threads so the queued links are downloaded in parallel. Here is a modified version of the start of the link crawler with the crawl loop moved into a function:

import time
import threading
...
SLEEP_TIME = 1

def threaded_crawler(..., max_threads=10, scraper_callback=None):
    ...

    def process_queue():
        while crawl_queue:
            ...
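
The body of process_queue is elided above. Conceptually, each thread repeatedly takes a URL off the shared queue, downloads it, and appends any newly discovered links back onto the queue. The following is only an illustrative sketch of that loop, assuming the downloader D, the get_links helper, and a seen set as in the Chapter 1 crawler; it is not the exact code from the book's repository:

def process_queue():
    while crawl_queue:
        try:
            # list.pop() is atomic under the GIL, so two threads
            # will not receive the same URL
            url = crawl_queue.pop()
        except IndexError:
            break  # another thread emptied the queue first
        html = D(url, num_retries=num_retries)
        if not html:
            continue
        for link in get_links(html):
            if link not in seen:
                seen.add(link)
                crawl_queue.append(link)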

Here is the remainder of the threaded_crawler function to start process_queue in multiple threads and wait until they have completed:

threads = []
while threads or crawl_queue:
    # the crawl is still active
    for thread in threads:
        if not thread.is_alive():
            # remove the stopped threads
            threads.remove(thread)
    while len(threads) < max_threads and crawl_queue:
        # can start some more threads
        thread = threading.Thread(target=process_queue)
        # set daemon so main thread can exit when receives ctrl-c
        thread.setDaemon(True)
        thread.start()
        threads.append(thread)
    # all threads have been processed
    # sleep temporarily so CPU can focus execution elsewhere
    for thread in threads:
        thread.join()
    time.sleep(SLEEP_TIME)

The loop in the preceding code keeps creating threads while there are URLs to crawl, up to the maximum number of threads set. During the crawl, threads may also shut down prematurely when the queue is temporarily empty. For example, consider a situation with two threads and two URLs to download. When the first thread finishes its download, the crawl queue is empty, so this thread exits. However, the second thread may then complete its download and discover additional URLs to download. The thread loop will then notice that there are still more URLs to download and that the maximum number of threads has not been reached, so it will create a new download thread.
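
To see this supervision pattern in isolation, here is a small self-contained script (not from the book) that applies the same loop to a fake workload, where processing an item can occasionally add a new item to the queue:

import random
import threading
import time

SLEEP_TIME = 1
max_threads = 2
crawl_queue = list(range(5))  # fake "URLs"

def fake_download(item):
    # simulate a slow download that sometimes discovers one new link
    time.sleep(0.2)
    if item < 100 and random.random() < 0.5:
        return [item + 100]
    return []

def process_queue():
    while crawl_queue:
        try:
            item = crawl_queue.pop()
        except IndexError:
            break  # another thread emptied the queue first
        crawl_queue.extend(fake_download(item))

threads = []
while threads or crawl_queue:
    # drop any threads that have finished
    threads = [t for t in threads if t.is_alive()]
    # top the pool back up while there is still work
    while len(threads) < max_threads and crawl_queue:
        thread = threading.Thread(target=process_queue, daemon=True)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    time.sleep(SLEEP_TIME)

print('all fake URLs processed')

The outer loop only exits once both the queue and the thread list are empty, which mirrors how threaded_crawler keeps the pool topped up for as long as any running thread might still add work.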

We might also want to add scraping logic to this threaded crawler later. To do so, we can invoke a callback function with the returned HTML. That callback may itself return additional links to crawl, so we also need to extend the links we process in the subsequent for loop:

html = D(url, num_retries=num_retries)
if not html:
    continue
if scraper_callback:
    links = scraper_callback(url, html) or []
else:
    links = []
# filter for links matching our regular expression
for link in get_links(html) + links:
    ...
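
As a concrete illustration, a callback might scrape something from the page and return extra URLs to crawl. The function below is a hypothetical example (not from the book's repository) that records page titles with a regular expression and returns no additional links; it would be passed to the crawler as the scraper_callback argument:

import re

def title_callback(url, html):
    # hypothetical scraper callback: print the page title, if any
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    if match:
        print(url, match.group(1).strip())
    # return a list of extra URLs to queue; nothing extra here
    return []

Because the crawler extends get_links(html) with whatever the callback returns, a callback that discovers URLs by other means (for example, from a sitemap or an API response) can feed them back into the crawl simply by returning them.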

The fully updated code can be viewed at https://github.com/kjam/wswp/blob/master/code/chp4/threaded_crawler.py. To have a fair test, you will also need to flush your RedisCache or use a different default database. If you have the redis-cli installed, you can do so easily from your command line:

$ redis-cli
127.0.0.1:6379> FLUSHALL
OK
127.0.0.1:6379>
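
If you would rather do this from Python (for example, when redis-cli is not installed), the redis-py client exposes the same command. A minimal sketch, assuming Redis is running on the default localhost:6379:

import redis  # pip install redis

# clear every database on the default local Redis instance
redis.StrictRedis().flushall()

Alternatively, pointing the client at an unused database number (for example, redis.StrictRedis(db=1)) keeps the timing test separate from any pages you have already cached.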

To exit the Redis client, use your normal program interrupt (usually Ctrl + C). Now, let's test the performance of this multithreaded version of the link crawler with the following command:

$ python code/chp4/threaded_crawler.py
...
Total time: 361.50403571128845s

If you take a look at the __main__ section of this crawler, you will note that you can easily pass arguments to this script, including max_threads and url_pattern. In the previous example, we used the defaults of max_threads=5 and url_pattern='$^'.

Since there are five threads, downloading is nearly four times faster! Again, your results might vary depending on your ISP or on whether you run the script from a server. Further analysis of thread performance will be covered in the Performance section.
