When doing data cleansing, we'll often introduce filters of various degrees of complexity to exclude invalid values. We may also include a mapping to sanitize values in the cases where a valid but improperly formatted value can be replaced with a valid but proper value.
We might write the following definitions:
from collections.abc import Callable, Iterable
from functools import reduce
import operator

def comma_fix(data: str) -> float:
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

def clean_sum(
    cleaner: Callable[[str], float],
    data: Iterable[str]
) -> float:
    return reduce(operator.add, map(cleaner, data))
We've defined a simple mapping, the comma_fix() function, that will convert data from a nearly correct string format into a usable floating-point value. This will remove the comma character. Another common variation will remove dollar signs and convert to decimal.Decimal.
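As a sketch of that variation, such a function might look like the following; the name dollar_fix() is our own invention, not part of the text above:

```python
from decimal import Decimal

def dollar_fix(data: str) -> Decimal:
    # Strip a leading dollar sign and any grouping commas,
    # then convert to Decimal for exact currency arithmetic.
    return Decimal(data.replace("$", "").replace(",", ""))
```

For example, dollar_fix("$1,234.56") yields Decimal('1234.56'). Using decimal.Decimal rather than float avoids binary rounding surprises when working with currency.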
We've also defined a map-reduce operation that applies a given cleaner function, the comma_fix() function in this case, to the data before computing a sum with the reduce() function and the operator.add function.
We can apply the previously described functions as follows:
>>> d = ('1,196', '1,176', '1,269', '1,240', '1,307',
...      '1,435', '1,601', '1,654', '1,803', '1,734')
>>> clean_sum(comma_fix, d)
14415.0
We've cleaned the data, by fixing the commas, as well as computed a sum. The syntax is very convenient for combining these two operations.
We have to be careful, however, about applying the cleaning function more than once. If we're also going to compute a sum of squares, we really should not use the following definition:
comma_fix_squared = lambda x: comma_fix(x)**2
If we use clean_sum(comma_fix_squared, d) as part of computing a standard deviation, we'll do the comma-fixing operation twice on the data: once to compute the sum and once to compute the sum of squares. This is a poor design; caching the results with an lru_cache decorator can help. Materializing the sanitized intermediate values as a temporary tuple object is better.