The Alexa list is provided as a CSV file with two columns: the rank and the domain.
Extracting this data requires a number of steps, as follows:
- Download the .zip file.
- Extract the CSV file from the .zip file.
- Parse the CSV file.
- Iterate over each row of the CSV file to extract the domain.
Here is an implementation to achieve this:
import csv
from zipfile import ZipFile
from io import BytesIO, TextIOWrapper
import requests

resp = requests.get('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip', stream=True)
urls = []  # the top 1 million URLs will be stored in this list
with ZipFile(BytesIO(resp.content)) as zf:
    csv_filename = zf.namelist()[0]
    with zf.open(csv_filename) as csv_file:
        for _, website in csv.reader(TextIOWrapper(csv_file)):
            urls.append('http://' + website)
You may have noticed that the downloaded zipped data is wrapped with the BytesIO class before being passed to ZipFile. This is necessary because ZipFile expects a file-like object rather than a raw bytes object. We also pass stream=True, which tells requests not to download the entire response body immediately (although accessing resp.content here still reads it all into memory). Next, the CSV filename is extracted from the archive's filename list; since the .zip file contains only a single file, the first name is selected. The CSV file is then opened and wrapped in a TextIOWrapper, which decodes the binary stream into text so that csv.reader can parse it. Finally, the file is iterated, and the domain in the second column is appended to the URL list, with the http:// protocol prepended to each domain to make it a valid URL.
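If holding the whole archive in memory is a concern, here is a minimal sketch of an alternative that streams the download to a local file first and opens the archive from disk. The local filename and chunk size here are arbitrary choices for illustration, not part of the original code:

import requests
from zipfile import ZipFile

url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
# stream the response body to disk in chunks rather than buffering it in memory
with requests.get(url, stream=True) as resp, open('top-1m.csv.zip', 'wb') as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
with ZipFile('top-1m.csv.zip') as zf:
    print(zf.namelist())  # expect a single CSV filename inside the archive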
To reuse this code with the crawlers developed earlier, it needs to be repackaged as an easily callable class:
class AlexaCallback:
    def __init__(self, max_urls=500):
        self.max_urls = max_urls
        self.seed_url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
        self.urls = []

    def __call__(self):
        resp = requests.get(self.seed_url, stream=True)
        with ZipFile(BytesIO(resp.content)) as zf:
            csv_filename = zf.namelist()[0]
            with zf.open(csv_filename) as csv_file:
                for _, website in csv.reader(TextIOWrapper(csv_file)):
                    self.urls.append('http://' + website)
                    if len(self.urls) == self.max_urls:
                        break
A new input argument called max_urls was added here, which sets the number of URLs to extract from the Alexa file. By default, it is set to 500 because downloading a million web pages takes a long time (as mentioned in the chapter introduction, more than 11 days when downloaded sequentially).
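As a quick usage sketch, the class can be instantiated and called like a function. The exact domains returned depend on the day's Alexa rankings, so the output in the comments is illustrative only:

callback = AlexaCallback(max_urls=10)
callback()                  # downloads the archive and fills callback.urls
print(len(callback.urls))   # 10
print(callback.urls[0])     # e.g. 'http://google.com'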