When doing data cleansing, we'll often introduce filters of various degrees of complexity to exclude invalid values. We may also include a mapping to sanitize values in the cases where a valid but improperly formatted value can be replaced with a valid but proper value.
We might write the following definitions:
from collections.abc import Callable, Iterable
from functools import reduce
import operator

def comma_fix(data: str) -> float:
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

def clean_sum(
    cleaner: Callable[[str], float],
    data: Iterable[str]
) -> float:
    return reduce(operator.add, map(cleaner, data))
We've defined a simple mapping, the comma_fix() function, that will convert data from a nearly correct string format into a usable floating-point value. This will remove the comma character. Another common variation will remove dollar signs and convert to decimal.Decimal.
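As a sketch of that variation, such a function might look like the following; the name dollar_fix() is our own invention, not part of the text above:

```python
from decimal import Decimal

def dollar_fix(data: str) -> Decimal:
    # Strip a leading dollar sign and any grouping commas,
    # then convert to Decimal for exact currency arithmetic.
    return Decimal(data.replace("$", "").replace(",", ""))
```

For example, dollar_fix("$1,234.56") yields Decimal('1234.56'). Using decimal.Decimal rather than float avoids binary rounding surprises when working with currency.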
We've also defined a map-reduce operation that applies a given cleaner function, the comma_fix() function in this case, to the data before computing a sum with the reduce() function and the operator.add function.
We can apply the previously described functions as follows:
>>> d = ('1,196', '1,176', '1,269', '1,240', '1,307',
...      '1,435', '1,601', '1,654', '1,803', '1,734')
>>> clean_sum(comma_fix, d)
14415.0
We've cleaned the data, by fixing the commas, as well as computed a sum. The syntax is very convenient for combining these two operations.
We have to be careful, however, about applying the cleaning function more than once. If we're also going to compute a sum of squares, we really should not use the following definition:
comma_fix_squared = lambda x: comma_fix(x)**2
If we use clean_sum(comma_fix_squared, d) as part of computing a standard deviation, we'll do the comma-fixing operation twice on the data: once to compute the sum and once to compute the sum of squares. This is a poor design; caching the results with an lru_cache decorator can help. Materializing the sanitized intermediate values as a temporary tuple object is better.