Missing observations per row

To count the missing values in each row, it is easier to work with RDDs, because we can loop through the elements of each record and count how many of them are missing. So, the first thing we do is access .rdd on our new_id DataFrame. Using the .map(...) transformation, we loop through each row, extract 'Id', and count the missing elements with the sum([c == None for c in row]) expression. The outcome of these operations is an RDD whose elements each hold two values: the ID of the row and its count of missing values.

Next, we select only those rows that have more than one missing value and .collect() them on the driver. We then create a simple DataFrame from the collected records, sort it with .orderBy(...) by the count of missing values in descending order, and show the records.
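The counting, filtering, and sorting steps described above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the rows are made-up tuples rather than the contents of new_id, the ID is assumed to sit in the first position, and count_missing is a hypothetical helper name. The check uses "is None", which behaves the same as "== None" for these values.

```python
# Made-up stand-in rows: (id, then three more columns). Not the data from new_id.
rows = [
    (1, "A", None, 3.0),
    (2, None, None, None),
    (3, "C", 1, 2.0),
]


def count_missing(row):
    """Hypothetical helper: return (id, number of None values in the row).

    Assumes the ID is stored in position 0 of the row.
    """
    return row[0], sum(c is None for c in row)


# Step 1: one (id, missing_count) pair per row, like the .map(...) step.
counts = [count_missing(r) for r in rows]

# Step 2: keep rows with more than one missing value,
# then sort by the count in descending order.
flagged = sorted(
    (pair for pair in counts if pair[1] > 1),
    key=lambda pair: pair[1],
    reverse=True,
)

print(counts)
print(flagged)
```

Because Spark rows are tuples as well, a function of this shape could presumably be passed to the .map(...) transformation on new_id.rdd, with the ID looked up by name ('Id') instead of by position.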

The result looks as follows:

As you can see, one of the records has five out of eight values missing. Let's see that record:

(
    new_id
    .where('Id == 197568495616')
    .show()
)

The preceding code shows that one of the Mercedes-Benz records has most of its values missing:

So, we can drop the whole observation as there isn't really much value contained in this record. To achieve this goal, we can use the .dropna(...) method of DataFrames: merc_out = new_id.dropna(thresh=4).

If you use .dropna() without passing any parameters, any record that has a missing value will be removed.

We specify thresh=4, so we keep only the records that have at least four non-missing values; our record has only three useful pieces of information, so it is removed.
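The threshold rule can be restated as a small plain-Python sketch. It is not Spark's implementation: keep_row is a hypothetical helper, the eight-column rows are made up, and only None is treated as missing here.

```python
def keep_row(row, thresh=None):
    """Hypothetical helper restating the row-dropping rule.

    thresh=None models a call without parameters: every value must be present.
    Otherwise the row is kept when it has at least `thresh` non-None values.
    """
    non_missing = sum(c is not None for c in row)
    required = len(row) if thresh is None else thresh
    return non_missing >= required


# Made-up eight-column rows, not the data from new_id.
sparse = (1, "a", 2.0, None, None, None, None, None)  # 3 values present
partial = (2, "b", 3.0, 4, None, None, None, None)    # 4 values present
full = (3, "c", 1.0, 2, 3, 4, 5, 6)                   # 8 values present

all_rows = (sparse, partial, full)

# IDs of the rows that survive with a threshold of four non-missing values.
kept_thresh_4 = [r[0] for r in all_rows if keep_row(r, thresh=4)]

# IDs of the rows that survive when no threshold is given.
kept_default = [r[0] for r in all_rows if keep_row(r)]

print(kept_thresh_4)
print(kept_default)
```

The row with three present values is dropped under a threshold of four, while the row with exactly four is kept; without a threshold, only the complete row remains.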

Let's confirm: running new_id.count(), merc_out.count() produces (19, 18), so yes, indeed, we removed one of the records. Did we really remove the Mercedes-Benz one? Let's check:

(
    merc_out
    .where('Id == 197568495616')
    .show()
)

The preceding code snippet produces an empty table, so it did remove the record with Id equal to 197568495616, as shown in the following screenshot:
