Processing many large files

Here is an example of a multiprocessing application. We'll scrape Common Log Format (CLF) lines from web log files. This is the format widely used for web server access logs. The lines tend to be long, but look like the following when wrapped to the book's margins:

99.49.32.197 - - [01/Jun/2012:22:17:54 -0400] "GET /favicon.ico 
HTTP/1.1" 200 894 "-" "Mozilla/5.0 (Windows NT 6.0)
AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52
Safari/536.5"

We often have large numbers of files that we'd like to analyze. The presence of many independent files means that concurrency will have some benefit for our scraping process.

We'll decompose the analysis into two broad areas of functionality. The first phase of processing is the essential parsing of the log files to gather the relevant pieces of information. We'll further decompose the parsing into four stages. They are as follows:

  1. Read all of the lines from the multiple source log files.
  2. Create simple namedtuple objects from the lines of log entries in a collection of files.
  3. Parse the details of more complex fields such as dates and URLs.
  4. Reject uninteresting paths from the logs; we can also think of this as passing only the interesting paths.

Once past the parsing phase, we can perform a large number of analyses. For our purposes in demonstrating the multiprocessing module, we'll look at a simple analysis to count occurrences of specific paths.
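
Once the parsing pipeline yields detail objects, the counting itself can be a single expression. The following is a minimal sketch; the access objects and their path attribute are assumptions about the parsed structure:

from collections import Counter

def path_counts(access_details):
    # Count how often each distinct path occurs in the parsed details.
    return Counter(access.path for access in access_details)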

The first portion, reading from source files, involves the most input processing. Python's use of file iterators will translate into lower-level OS requests for the buffering of data. Each OS request means that the process must wait for the data to become available.

Clearly, we want to interleave the other operations so that they are not waiting for I/O to complete. We can interleave operations along a spectrum from individual rows to whole files. We'll look at interleaving whole files first, as this is relatively simple to implement.
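
As a preview of interleaving whole files, the following sketch fans a per-file job out across worker processes with multiprocessing.Pool. The analysis() function here is a hypothetical placeholder for the parse-and-summarize step developed below:

import multiprocessing
from collections import Counter

def analysis(filename):
    # Hypothetical per-file job: parse one log file and summarize it.
    # In the full design this would run the path_filter(...) pipeline and count paths.
    counts = Counter()
    return counts

def demo(filenames):
    # Each worker handles one whole file at a time; map() blocks until all finish.
    with multiprocessing.Pool(processes=4) as workers:
        results = workers.map(analysis, filenames)
    # Combine the per-file summaries into a single Counter.
    return sum(results, Counter())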

The functional design for parsing Apache CLF files can look as follows:

data = path_filter(
    access_detail_iter(
        access_iter(
            local_gzip(filename))))

We've decomposed the larger parsing problem into a number of functions that will handle each portion of the parsing problem. The local_gzip() function reads rows from locally cached GZIP files. The access_iter() function creates a NamedTuple object for each row in the access log. The access_detail_iter() function will expand on some of the more difficult-to-parse fields. Finally, the path_filter() function will discard some paths and file extensions that aren't of much analytical value.
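
A minimal sketch of these four functions might look like the following. The regular expression, the Access field names, and the excluded file extensions are illustrative assumptions, not a definitive implementation:

import gzip
import re
from collections import namedtuple
from glob import glob
from urllib.parse import urlparse

Access = namedtuple(
    "Access",
    ["host", "user", "time", "method", "path", "protocol", "status", "bytes"],
)

clf_pat = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<bytes>\S+)'
)

def local_gzip(pattern):
    # Yield an iterator over the lines of each locally cached GZIP file.
    for name in glob(pattern):
        with gzip.open(name, "rt") as log:
            yield (line.rstrip() for line in log)

def access_iter(source_iter):
    # Build an Access tuple from each line that matches the CLF pattern.
    for log in source_iter:
        for line in log:
            match = clf_pat.match(line)
            if match:
                yield Access(**match.groupdict())

def access_detail_iter(access_source):
    # Expand harder-to-parse fields; here, only the path component of the URL.
    for access in access_source:
        yield access._replace(path=urlparse(access.path).path)

def path_filter(detail_iter):
    # Pass only paths that look analytically interesting.
    excluded = (".gif", ".png", ".ico", ".css", ".js")
    for access in detail_iter:
        if not access.path.endswith(excluded):
            yield access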

It can help to visualize this kind of design as a pipeline of processing, as shown here:

(local_gzip(filename) | access_iter | access_detail_iter | path_filter) > data

This uses shell notation of a pipe (|) to pass data from process to process. The built-in Python pipes module facilitates building actual shell pipelines to leverage OS multiprocessing capabilities. Other packages such as pipetools or pipe provide a similar way to visualize a composite function. 
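
If we want the shell-like notation to be executable Python, one hypothetical approach overloads the | operator; this is only an illustration of the visualization, not how the pipetools or pipe packages are implemented, and it reuses the functions sketched above:

class Stage:
    # Wrap a generator function so that stages can be chained with |.
    def __init__(self, func):
        self.func = func
    def __ror__(self, source):
        # source | stage is evaluated as stage.func(source)
        return self.func(source)

data = (
    local_gzip(filename)
    | Stage(access_iter)
    | Stage(access_detail_iter)
    | Stage(path_filter)
)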
