Missing data

The last step in our data-exploration journey is to explore missing values. We have already observed that some columns contain special values standing in for missing data; in this section, however, we will focus on true missing values. First, we need to collect them:

// List every column that contains missing values, together with its NA count
// and NA percentage, sorted by NA count in descending order
val naColumns = loanDataHf.names().indices
  .filter(idx => loanDataHf.vec(idx).naCnt() > 0)
  .map(idx =>
    (loanDataHf.name(idx),
     loanDataHf.vec(idx).naCnt(),
     f"${100 * loanDataHf.vec(idx).naCnt() / loanDataHf.numRows().toFloat}%2.1f%%"))
  .sortBy(-_._2)
println(s"Columns with NAs (#${naColumns.length}):${table(naColumns)}")

The list contains 111 columns, with the proportion of missing values ranging from 0.2 percent to 86 percent.

There are plenty of columns with only five missing values each, which can be caused by errors in data collection; we can easily filter the affected rows out if they follow a pattern. For more heavily polluted columns (that is, columns with many missing values), we need to figure out the right strategy per column, based on the column semantics described in the data dictionary. Two common strategies are sketched below.
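For illustration only (this sketch is not from the original text), two such per-column strategies might look as follows in Spark. The column names annual_inc and revol_util serve merely as examples, and the code assumes that revol_util has already been converted to a numeric type:

import org.apache.spark.sql.DataFrame

def handleMissing(loanDf: DataFrame): DataFrame = {
  // Strategy 1: drop the few rows where a nearly complete column
  // (here, hypothetically, annual_inc) is missing its value
  val cleaned = loanDf.na.drop(Seq("annual_inc"))
  // Strategy 2: for a heavily polluted numeric column, impute the
  // approximate median rather than discarding the whole column
  val medianRevolUtil =
    cleaned.stat.approxQuantile("revol_util", Array(0.5), 0.01).head
  cleaned.na.fill(Map("revol_util" -> medianRevolUtil))
}

Which strategy is appropriate is a per-column decision; imputing a neutral value preserves the column's signal, while dropping rows is only safe when very few are affected.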

In all these cases, the H2O Flow UI allows us to explore basic properties of the data quickly and easily, and even to perform basic data cleanup. However, for more advanced data manipulation, Spark is the right tool, thanks to its rich library of ready-made transformations and its native SQL support.
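As a small, hypothetical example of that SQL support (assuming a SparkSession named spark and the loan data loaded into a DataFrame called loanDf), we can register the data as a temporary view and query it directly:

// Register the loan data as a temporary view and query it with plain SQL
loanDf.createOrReplaceTempView("loans")
spark.sql(
  """SELECT grade, count(*) AS cnt
    |FROM loans
    |GROUP BY grade
    |ORDER BY cnt DESC""".stripMargin).show()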

Whew! As we can see, data cleanup, while fairly laborious, is an extremely important task for the data scientist, and one that will, hopefully, yield good answers to well-thought-out questions. This process must be carefully considered for each and every new problem we are looking to solve. As the old adage goes, "Garbage in, garbage out": if the inputs are not right, our model will suffer the consequences.

At this point, it is possible to compose all the identified transformations into a shared library function:

def basicDataCleanup(loanDf: DataFrame, colsToDrop: Seq[String] = Seq()): DataFrame = {
  (
    // Normalize the interest rate to a numeric value, if the column is present
    (if (loanDf.columns.contains("int_rate"))
       loanDf.withColumn("int_rate", toNumericRateUdf(col("int_rate")))
     else
       loanDf)
      // Convert the revolving-utilization rate and all the "months since"
      // columns to numeric values
      .withColumn("revol_util", toNumericRateUdf(col("revol_util")))
      .withColumn("mo_sin_old_il_acct", toNumericMnthsUdf(col("mo_sin_old_il_acct")))
      .withColumn("mths_since_last_delinq", toNumericMnthsUdf(col("mths_since_last_delinq")))
      .withColumn("mths_since_last_record", toNumericMnthsUdf(col("mths_since_last_record")))
      .withColumn("mths_since_last_major_derog", toNumericMnthsUdf(col("mths_since_last_major_derog")))
      .withColumn("mths_since_recent_bc", toNumericMnthsUdf(col("mths_since_recent_bc")))
      .withColumn("mths_since_recent_bc_dlq", toNumericMnthsUdf(col("mths_since_recent_bc_dlq")))
      .withColumn("mths_since_recent_inq", toNumericMnthsUdf(col("mths_since_recent_inq")))
      .withColumn("mths_since_recent_revol_delinq", toNumericMnthsUdf(col("mths_since_recent_revol_delinq")))
    // Finally, drop the caller-specified columns
  ).drop(colsToDrop: _*)
}

The method takes a Spark DataFrame as input and applies all the identified cleanup transformations.
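For example, assuming the raw loan data has been loaded into a DataFrame called loanDf, the function can be invoked as follows (the columns to drop are purely illustrative):

// Hypothetical invocation; the dropped columns are examples only
val cleanLoanDf = basicDataCleanup(
  loanDf,
  colsToDrop = Seq("mths_since_last_record", "mths_since_last_major_derog"))

Now, it is time to build some models!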
