Final transformation

Finally, we can compose all the defined functionality and prepare the data for model building. First, the rawData RDD is filtered with the help of filterBadRows, which removes all bad rows. The result is then processed by the imputeNaN method, which replaces each missing value with the given value for its column:

val processedRawData = imputeNaN( 
  filterBadRows(rawData, nanCountPerRow, nanThreshold = 26), 
  imputedValues) 
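To make the composition easier to follow, here is a minimal sketch of the same filter-then-impute pipeline on plain Scala collections instead of RDDs. The helpers filterBadRowsLocal and imputeNaNLocal are hypothetical stand-ins, not the filterBadRows and imputeNaN methods defined earlier, and the sketch assumes that a row is kept when its NaN count does not exceed the threshold and that imputed values are matched to columns by position:

```scala
object FinalTransformationSketch {
  type Row = Array[Double]

  // Hypothetical stand-in for filterBadRows: keep only rows whose NaN count
  // does not exceed the threshold (the inclusive bound is an assumption).
  def filterBadRowsLocal(rows: Seq[Row], nanCountPerRow: Seq[Int], nanThreshold: Int): Seq[Row] =
    rows.zip(nanCountPerRow).collect {
      case (row, nanCount) if nanCount <= nanThreshold => row
    }

  // Hypothetical stand-in for imputeNaN: replace each NaN with the value
  // given for its column; all other cells are left unchanged.
  def imputeNaNLocal(rows: Seq[Row], values: Row): Seq[Row] =
    rows.map(row => row.zip(values).map { case (x, fill) => if (x.isNaN) fill else x })

  // Toy data: the second row consists only of missing values.
  val rawRows: Seq[Row] = Seq(
    Array(1.0, Double.NaN, 3.0),
    Array(Double.NaN, Double.NaN, Double.NaN),
    Array(4.0, 5.0, Double.NaN))

  // Number of NaN cells in each row.
  val nanCounts: Seq[Int] = rawRows.map(_.count(_.isNaN))

  // Same composition as in the text: filter first, then impute.
  val processed: Seq[Row] =
    imputeNaNLocal(
      filterBadRowsLocal(rawRows, nanCounts, nanThreshold = 2),
      Array(0.0, -1.0, 9.0))

  def main(args: Array[String]): Unit =
    println(s"Number of rows before/after: ${rawRows.size} / ${processed.size}")
}
```

In this toy data set only the all-NaN row is dropped, and no missing values remain in the rows that survive the filter.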

At the end, verify that we invoked the right transformations by at least computing the number of rows:

println(s"Number of rows before/after: ${rawData.count} / ${processedRawData.count}")

The output is as follows:

We can see that we filtered out 151 rows, which corresponds to our preceding observations.

Understanding data is the key point of data science, and that includes understanding missing data. Never skip this stage, since doing so can lead to biased models that produce deceptively good results. And, as we continuously point out, not understanding your data will lead you to ask poor questions, which ultimately results in lackluster answers.