We also need to check whether there are columns with a particularly low incidence of useful information. A lot is happening in the code we presented, so let's unpack it step by step.
Let's start with the inner list:
[
    (1 - (fn.count(c) / fn.count('*'))).alias(c + '_miss')
    for c in merc_out.columns
]
We loop through all the columns of the merc_out DataFrame: fn.count(c) counts the non-missing values in column c, while fn.count('*') counts all rows. Dividing the former by the latter and subtracting the result from 1 gives us the fraction of missing values in each column.
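To make the arithmetic concrete, here is a plain-Python analogue of what each aggregation expression computes. The toy rows and column names below are made up for illustration; in PySpark the counting is done by fn.count, which skips nulls for a named column but not for '*':

```python
# Toy rows standing in for a DataFrame; None marks a missing value.
rows = [
    {'Model': 'G500', 'MSRP': None},
    {'Model': 'E550', 'MSRP': None},
    {'Model': None,   'MSRP': 61000},
    {'Model': 'S550', 'MSRP': None},
]
columns = ['Model', 'MSRP']

total = len(rows)  # plays the role of fn.count('*'): counts every row

miss = {}
for c in columns:
    # plays the role of fn.count(c): counts only non-missing values
    non_missing = sum(1 for r in rows if r[c] is not None)
    miss[c + '_miss'] = 1 - non_missing / total

print(miss)  # → {'Model_miss': 0.25, 'MSRP_miss': 0.75}
```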
However, nothing is actually being calculated at this point. The way Python stores this information, at this time, is just as a list of unevaluated column expressions, in effect pointers to operations. Only after we pass the list to the .agg(...) method does it get translated into PySpark's internal execution graph, and that graph is only executed when we call the .collect() action.
The .collect() method returns a list with a single element: a Row object holding the aggregated information. We can transform the Row into a dictionary using its .asDict() method and then extract all the items from it. This results in a list of tuples in which the first element is the column name (we used the .alias(...) method to append '_miss' to each name) and the second element is the fraction of missing observations.
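Here is a small plain-Python sketch of that bookkeeping. The dictionary stands in for the .asDict() output of the single collected Row; the column names and fractions are invented for the example:

```python
# Stand-in for merc_out.agg(...).collect()[0].asDict() (made-up values).
agg_dict = {'Model_miss': 0.0, 'MSRP_miss': 0.9876, 'Displacement_miss': 0.05}

# .items() yields (column_name, missing_fraction) tuples;
# sorting by the fraction, highest first, surfaces the worst columns.
miss_sorted = sorted(agg_dict.items(), key=lambda kv: kv[1], reverse=True)
print(miss_sorted[0])  # → ('MSRP_miss', 0.9876)
```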
While looping through the elements of the sorted list, we just print them to the screen:
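The loop itself is not reproduced here; a minimal sketch, assuming the sorted list of (name, fraction) tuples described above is held in a variable called miss_sorted, might look like this:

```python
# Made-up sorted aggregation result, highest missing fraction first.
miss_sorted = [('MSRP_miss', 0.9876), ('Displacement_miss', 0.05), ('Model_miss', 0.0)]

for name, fraction in miss_sorted:
    print('{0}: {1:.2%} missing'.format(name, fraction))
# MSRP_miss: 98.76% missing
# Displacement_miss: 5.00% missing
# Model_miss: 0.00% missing
```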
Well, it looks like most of the information in the MSRP column is missing. Thus, we can drop it, as it will not bring us any useful information:
no_MSRP = merc_out.select([c for c in merc_out.columns if c != 'MSRP'])
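The comprehension inside .select(...) just filters the column list. The same filtering logic in isolation, using hypothetical column names, looks like this:

```python
# Hypothetical column list of merc_out.
columns = ['Manufacturer', 'Model', 'Displacement', 'MSRP']

# Keep every column name except 'MSRP'.
kept = [c for c in columns if c != 'MSRP']
print(kept)  # → ['Manufacturer', 'Model', 'Displacement']
```

Note that PySpark's DataFrame also exposes a .drop(...) method, so merc_out.drop('MSRP') expresses the same intent more directly.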
We still have two columns with some missing information. Let's do something about them.