Filtering with compress()

The built-in filter() function uses a predicate to determine whether an item is passed or rejected. With compress(), instead of a function that calculates a value, we use a second, parallel iterable to determine which items to pass and which to reject.

We can think of the filter() function as having the following definition:

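from itertools import compress, tee

def filter(function, iterable):
    # Conceptual definition only: this name shadows the built-in filter().
    i1, i2 = tee(iterable, 2)
    return compress(i1, map(function, i2))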

We cloned the iterable using the tee() function. We'll look at this function in detail later. The map() function applies the filter predicate, function(), to each value in the iterable, yielding a sequence of True and False values. This sequence of Booleans is used to compress the original sequence, passing only the items associated with True. This builds the features of the filter() function from the more primitive features of the compress() function.
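To see that this definition behaves like the built-in, here is a minimal runnable sketch. The name filter_via_compress is hypothetical, chosen so the built-in filter() is not shadowed. Like the definition above, it does not handle the special case of passing None as the predicate.

```python
from itertools import compress, tee

def filter_via_compress(function, iterable):
    """Hypothetical stand-in for filter(), built from compress() and tee()."""
    # Clone the source: one copy supplies the data, the other feeds the predicate.
    data, to_test = tee(iterable, 2)
    # map() yields one truth value per item; compress() keeps the matching items.
    return compress(data, map(function, to_test))

evens = list(filter_via_compress(lambda x: x % 2 == 0, range(10)))
print(evens)  # [0, 2, 4, 6, 8]
```

Because tee() buffers the items, the sketch also works when the source is a one-shot iterator rather than a reusable sequence.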

In the Re-iterating a cycle with cycle() section of this chapter, we looked at data selection using a simple generator expression. Its essence was as follows:

choose = lambda rule: (x == 0 for x in rule)
keep = [v for v, pick in zip(data, choose(all)) if pick]

Each rule must supply the values that choose() turns into a sequence of Booleans. To choose all items, the sequence simply repeats True. To pick a fixed subset, it cycles through True followed by a fixed number of copies of False.
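The following sketch makes the generator-expression approach concrete. The rule definitions are assumptions, since this section does not show them: each rule is taken to yield integers, with 0 meaning "keep this item". The names select_all and select_subset are hypothetical, used here to avoid shadowing the built-in all(), and the cycle length of 3 is arbitrary.

```python
from itertools import cycle, repeat

# Assumed rule definitions: each rule yields integers, and choose()
# treats 0 as "keep this item".
select_all = repeat(0)            # 0, 0, 0, ...
select_subset = cycle(range(3))   # 0, 1, 2, 0, 1, 2, ...

choose = lambda rule: (x == 0 for x in rule)

data = list(range(10))
# zip() stops when data is exhausted, so the infinite rules are safe here.
keep_all = [v for v, pick in zip(data, choose(select_all)) if pick]
keep_subset = [v for v, pick in zip(data, choose(select_subset)) if pick]

print(keep_all)     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(keep_subset)  # [0, 3, 6, 9]
```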

The list comprehension can be revised as compress(some_source, choose(rule)). If we make that change, the processing is simplified:

compress(data, choose(all))
compress(data, choose(subset))
compress(data, choose(randomized))

These examples rely on the alternative selection rules: all, subset, and randomized, as shown previously. The subset and randomized versions must be defined with a proper parameter that controls how many rows are picked from the source. The choose expression will build an iterable over True and False values based on one of the selection rules. The rows to be kept are selected by pairing the source iterable with the row-selection iterable.
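As a rough illustration of the three rules used with compress(), consider the sketch below. The helper make_rules(), its parameter c, and the seed argument are all hypothetical. The rule definitions are assumptions modeled on the description above, not code taken from this section.

```python
import random
from itertools import compress, count, cycle, repeat

choose = lambda rule: (x == 0 for x in rule)

def make_rules(c, seed=None):
    """Build fresh rule iterators; c is an assumed sampling parameter."""
    rng = random.Random(seed)
    return {
        "all": repeat(0),                                   # keep every row
        "subset": cycle(range(c)),                          # keep every c-th row
        "randomized": (rng.randrange(c) for _ in count()),  # keep about 1 in c
    }

data = list(range(12))
rules = make_rules(c=4, seed=1)

# compress() stops as soon as data runs out, so infinite rules are fine.
picked_all = list(compress(data, choose(rules["all"])))
picked_subset = list(compress(data, choose(rules["subset"])))
picked_random = list(compress(data, choose(rules["randomized"])))

print(picked_subset)  # [0, 4, 8]
```

Rule iterators are consumed as they are used, so this sketch builds a fresh set with make_rules() rather than reusing one rule across several passes.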

Since all of this is non-strict, rows are not read from the source until required. This allows us to process very large sets of data efficiently. Also, the relative simplicity of the Python code means that we don't really need a complex configuration file and an associated parser to make choices among the selection rules. We have the option to use this bit of Python code as the configuration for a larger data-sampling application.
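The non-strict behavior can be observed with a small experiment. In this sketch, source() is a hypothetical row source that records every row it hands out, and the cycle length of 100 is an arbitrary choice. Taking three sampled rows reads only as far into the source as needed.

```python
from itertools import compress, cycle, islice

consumed = []

def source():
    """Hypothetical row source that records each row as it is read."""
    for row in range(1_000_000):
        consumed.append(row)
        yield row

choose = lambda rule: (x == 0 for x in rule)
sample = compress(source(), choose(cycle(range(100))))

print(consumed)        # []: building the pipeline reads nothing
first_three = list(islice(sample, 3))
print(first_three)     # [0, 100, 200]
print(len(consumed))   # 201: rows 0 through 200, not the full million
```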
