Designing concurrent processing

From a functional programming perspective, we've seen three ways to apply the map() concept to data items concurrently. We can use any one of the following:

  • multiprocessing.Pool
  • concurrent.futures.ProcessPoolExecutor
  • concurrent.futures.ThreadPoolExecutor

These are almost identical in the way we interact with them; all three have a map() method that applies a function to items of an iterable collection. This fits in elegantly with other functional programming techniques. The performance is different because of the nature of concurrent threads versus concurrent processes.
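
To make this interchangeability concrete, here is a minimal sketch using nothing beyond the standard library and a trivial worker function; only the pool class changes, while the map() call stays the same:

    from multiprocessing import Pool
    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def work(n: int) -> int:
        # Stand-in for real per-item processing.
        return n * n

    if __name__ == "__main__":  # required for process-based pools
        data = range(8)
        with Pool(4) as workers:
            print(list(workers.map(work, data)))
        with ProcessPoolExecutor(max_workers=4) as workers:
            print(list(workers.map(work, data)))
        with ThreadPoolExecutor(max_workers=4) as workers:
            print(list(workers.map(work, data)))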

As we stepped through the design, our log analysis application decomposed into two overall areas:

  • The lower-level parsing: This is generic parsing that will be used by almost any log analysis application
  • The higher-level analysis application: This is more specific filtering and reduction focused on our application needs

The lower-level parsing can be decomposed into four stages, which compose as sketched after this list:

  • Reading all the lines from multiple source log files. This was the local_gzip() mapping from file name to a sequence of lines.
  • Creating simple namedtuples from the lines of log entries in a collection of files. This was the access_iter() mapping from text lines to Access objects.
  • Parsing the details of more complex fields such as dates and URLs. This was the access_detail_iter() mapping from Access objects to AccessDetails objects.
  • Rejecting uninteresting paths from the logs. We can also think of this as passing only the interesting paths. This was more of a filter than a map operation. This was a collection of filters bundled into the path_filter() function.
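
As a reminder of how these stages fit together, here is a minimal sketch, assuming the local_gzip(), access_iter(), access_detail_iter(), and path_filter() definitions from earlier in the chapter are in scope:

    # Each stage consumes the previous stage's iterable lazily,
    # so no intermediate list is ever built in memory.
    details = path_filter(
        access_detail_iter(
            access_iter(
                local_gzip(filename))))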

We defined an overall analysis() function that parsed and analyzed a given log file. It applied the higher-level filter and reduction to the results of the lower-level parsing. It can also work with a wildcard collection of files.
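
A sketch of that function might look like the following; the interesting() predicate, the url.path attribute access, and the use of a Counter as the reduction are illustrative assumptions, not the chapter's exact definitions:

    from collections import Counter

    def analysis(filename: str) -> Counter:
        # Lower-level, generic parsing pipeline.
        details = path_filter(
            access_detail_iter(
                access_iter(
                    local_gzip(filename))))
        # Higher-level, application-specific filter and reduction.
        # interesting() is a hypothetical predicate.
        return Counter(
            detail.url.path for detail in details if interesting(detail))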

Given the number of mappings involved, we can see several ways to decompose this problem into work that can be mapped to a pool of threads or processes. Here are some of the mappings we can consider as design alternatives:

  • Map the analysis() function to individual files. We use this as a consistent example throughout this chapter; see the sketch following this list.
  • Refactor the local_gzip() function out of the overall analysis() function. We can now map the revised analysis() function to the results of the local_gzip() function.
  • Refactor the access_iter(local_gzip(pattern)) function out of the overall analysis() function. We can map this revised analysis() function against the iterable sequence of the Access objects.
  • Refactor the access_detail_iter(access_iter(local_gzip(pattern))) function into a separate iterable. We will then map the path_filter() function and the higher-level filter and reduction against the iterable sequence of AccessDetails objects.
  • We can also refactor the lower-level parsing into a function that is separate from the higher-level analysis. We can map the analysis filter and reduction against the output from the lower-level parsing.
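
The first alternative, used as the running example, might look like this sketch; the use of glob to expand the wildcard pattern and the default pool size are assumptions:

    import glob
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def map_analysis(pattern: str) -> Counter:
        # Each worker opens and reads its own file, so the I/O is
        # distributed across processes along with the computation.
        combined = Counter()
        with ProcessPoolExecutor() as executor:
            for result in executor.map(analysis, glob.glob(pattern)):
                combined.update(result)
        return combined

Because each worker returns only a small Counter, very little data has to be serialized back to the parent process.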

All of these are relatively simple methods to restructure the example application. The benefit of using functional programming techniques is that each part of the overall process can be defined as a mapping. This makes it practical to consider different architectures to locate an optimal design.

In this case, however, we need to distribute the I/O processing to as many CPUs or cores as we have available. Most of these potential refactorings will perform all of the I/O in the parent process; they would only distribute the computations to multiple concurrent processes, with little resulting benefit. Consequently, we want to focus on mappings that distribute the I/O to as many cores as possible.

It's often important to minimize the amount of data being passed from process to process. In this example, we provided just short filename strings to each worker process. The resulting Counter object was considerably smaller than the 10 MB of compressed detail data in each log file. We can further reduce the size of each Counter object by eliminating items that occur only once, or we can limit our application to only the 20 most popular items.
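
Either trimming strategy is a short operation on a collections.Counter; the sample data here is purely illustrative:

    from collections import Counter

    counts = Counter({"/a": 5, "/b": 1, "/c": 3, "/d": 1})  # sample data

    # Limit the result to the 20 most popular items.
    top_20 = Counter(dict(counts.most_common(20)))

    # Alternatively, drop the items that occur only once.
    repeated = Counter(
        {path: n for path, n in counts.items() if n > 1})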

The fact that we can reorganize the design of this application freely doesn't mean we should. We can run a few benchmarking experiments to confirm our suspicion that log file parsing is dominated by the time required to read the files.
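
One such experiment might time the raw read against the full pipeline; the file name and repetition count are placeholders, and we assume local_gzip() yields the lines of the named file:

    import timeit

    # Time just reading and decompressing the lines.
    read_time = timeit.timeit(
        lambda: sum(1 for line in local_gzip("example.log.gz")),
        number=3)

    # Time the complete parse-filter-reduce pipeline.
    full_time = timeit.timeit(
        lambda: analysis("example.log.gz"),
        number=3)

    print(f"read only: {read_time:.2f}s, full: {full_time:.2f}s")

If the two timings are close, the suspicion is confirmed: reading dominates, and distributing the I/O across workers is the most valuable change.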
