Parsing the Alexa list

The Alexa list is provided as a compressed CSV file, essentially a spreadsheet with two columns: the rank and the domain.

Extracting this data requires a number of steps, as follows:

  1. Download the .zip file.
  2. Extract the CSV file from the .zip file.
  3. Parse the CSV file.
  4. Iterate each row of the CSV file to extract the domain.

Here is an implementation to achieve this:

import csv
from zipfile import ZipFile
from io import BytesIO, TextIOWrapper
import requests

resp = requests.get('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip', stream=True)
urls = []  # the top 1 million URLs will be stored in this list
with ZipFile(BytesIO(resp.content)) as zf:
    csv_filename = zf.namelist()[0]
    with zf.open(csv_filename) as csv_file:
        for _, website in csv.reader(TextIOWrapper(csv_file)):
            urls.append('http://' + website)

You may have noticed that the downloaded zipped data is wrapped with the BytesIO class and passed to ZipFile. This is necessary because ZipFile expects a file-like interface rather than a raw bytes object. We also pass stream=True, which defers downloading the response body until resp.content is accessed. Next, the CSV filename is extracted from the archive's filename list; the .zip file contains only a single file, so the first filename is selected. The CSV file is then read through a TextIOWrapper, which decodes the binary stream into text for the csv reader. Finally, the file is iterated and the domain in the second column is added to the URL list, with the http:// protocol prepended to each domain to make it a valid URL.
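
As a quick sanity check, the resulting list can be inspected once the snippet above has run; the counts and domains shown in the comments below are illustrative only, since the rankings change over time:

# Inspect the parsed list; exact values depend on the current Alexa data.
print(len(urls))  # roughly 1,000,000 entries
print(urls[:3])   # e.g. ['http://google.com', 'http://youtube.com', 'http://facebook.com']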

To reuse this code with the crawlers developed earlier, it needs to be wrapped in an easily callable class:

class AlexaCallback:
    def __init__(self, max_urls=500):
        self.max_urls = max_urls
        self.seed_url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
        self.urls = []

    def __call__(self):
        resp = requests.get(self.seed_url, stream=True)
        with ZipFile(BytesIO(resp.content)) as zf:
            csv_filename = zf.namelist()[0]
            with zf.open(csv_filename) as csv_file:
                for _, website in csv.reader(TextIOWrapper(csv_file)):
                    self.urls.append('http://' + website)
                    if len(self.urls) == self.max_urls:
                        break

A new input argument, max_urls, sets the number of URLs to extract from the Alexa file. By default, this is set to 500 URLs because downloading a million web pages takes a long time (as mentioned in the chapter introduction, more than 11 days when downloaded sequentially).
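
As a rough usage sketch (independent of any particular crawler interface, which is not shown here), the callback can be invoked on its own to populate its URL list:

# Keep max_urls small for a quick test; the instance is callable,
# so calling it downloads and parses the Alexa CSV.
alexa = AlexaCallback(max_urls=10)
alexa()
print(alexa.urls)  # ten 'http://...' seed URLs, ready to pass to a crawler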
