Missing data

The data description mentions that the sensors used for activity tracking were not fully reliable and that the results contain missing data. We need to explore the data in more detail to see how this fact can influence our modeling strategy.

The first question is how many missing values are in our dataset. We know from the data description that all missing values are marked by the string NaN (that is, not a number), which is now represented as Double.NaN in the RDD rawData. In the next code snippet, we compute the number of missing values per row and the total number of missing values in the dataset:

val nanCountPerRow = rawData.map { row =>
  row.foldLeft(0) { case (acc, v) =>
    acc + (if (v.isNaN) 1 else 0)
  }
}
val nanTotalCount = nanCountPerRow.sum

val ncols = rawData.take(1)(0).length
val nrows = rawData.count

val nanRatio = 100.0 * nanTotalCount / (ncols * nrows)

println(f"""|NaN count = ${nanTotalCount}%.0f
            |NaN ratio = ${nanRatio}%.2f %%""".stripMargin)

The output is as follows:

Right, now we have an overall picture of the amount of missing values in our data. But we do not know how the missing values are distributed. Are they spread uniformly over the whole dataset? Or are there rows or columns which contain more missing values? In the following text, we will try to find answers to these questions.

A common mistake is to compare a numeric value with Double.NaN using comparison operators. For example, the expression if (v == Double.NaN) { ... } is wrong, since the Java specification says:
"NaN is unordered: (1) The numerical comparison operators <, <=, >, and >= return false if either or both operands are NaN, (2) The equality operator == returns false if either operand is NaN."
Hence, Double.NaN == Double.NaN always returns false. The right way to test a numeric value for Double.NaN is to use the method isNaN: if (v.isNaN) { ... } (or the corresponding static method java.lang.Double.isNaN).
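The following minimal snippet demonstrates the behavior (plain Scala, no Spark needed):

val v = Double.NaN
println(v == Double.NaN)           // false: == with NaN is always false
println(v.isNaN)                   // true: the correct instance method
println(java.lang.Double.isNaN(v)) // true: the corresponding static method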

Let us first consider rows. We have already computed the number of missing values per row in the previous step. Sorting these counts and taking the unique values gives us an understanding of how rows are affected by missing values:

val nanRowDistribution = nanCountPerRow.
  map(count => (count, 1)).
  reduceByKey(_ + _).
  sortBy(-_._1).
  collect

println(s"${table(Seq("#NaN","#Rows"), nanRowDistribution, Map.empty[Int, String])}")

The output is as follows:

Now we can see that the majority of rows contain a single missing value. However, there are also many rows containing 13 or 14 missing values, 40 rows containing 27 NaNs, and 107 rows which contain more than 30 missing values (104 rows with 40 missing values and 3 rows with 39 missing values). Considering that the dataset contains 41 columns, this means that those 107 rows are useless (the majority of their values are missing), leaving 3,386 rows with at least two missing values which need attention, and 885,494 rows with a single missing value. We can now look at these rows in more detail. We select all rows which contain more missing values than a given threshold, for example, 26. We also collect the indices of the rows (it is a zero-based index!):

val nanRowThreshold = 26
val badRows = nanCountPerRow.zipWithIndex.zip(rawData).
  filter(_._1._1 > nanRowThreshold).
  sortBy(-_._1._1)
println(s"Bad rows (#NaN, Row Idx, Row):\n${badRows.collect.map(x => (x._1, x._2.mkString(","))).mkString("\n")}")

Now we know exactly which rows are not useful. We have already observed that there are 107 bad rows which do not contain any useful information. Furthermore, we can see that the lines which have 27 missing values have them in the positions representing the hand and ankle IMU sensors.
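We can verify this observation directly by listing the names of the NaN columns for a few bad rows; the following sketch assumes the columnNames array used elsewhere in this chapter:

val nanColumnsPerBadRow = badRows.collect.map { case ((nanCount, rowIdx), row) =>
  // Collect the names of all columns holding NaN in this row
  (rowIdx, columnNames.zip(row).collect { case (name, v) if v.isNaN => name })
}
nanColumnsPerBadRow.take(3).foreach { case (rowIdx, names) =>
  println(s"Row ${rowIdx}: ${names.mkString(", ")}")
}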

Finally, most of these lines have activityId 10, 19, or 20 assigned, which represent the computer work, house cleaning, and playing soccer activities; these are the classes with the highest frequencies in the dataset. That leads us to a theory that the "bad" lines were produced by subjects explicitly rejecting a measurement device. Furthermore, we can also see the index of each wrong row and verify it in the input dataset. For now, we are going to leave the bad rows aside and focus on columns.

We can ask the same question about columns - are there any columns which contain a higher amount of missing values? Can we remove such columns? We can start by collecting the number of missing values per column:

val nanCountPerColumn = rawData.map { row =>
  row.map(v => if (v.isNaN) 1 else 0)
}.reduce((v1, v2) => v1.indices.map(i => v1(i) + v2(i)).toArray)

println(s"""Number of missing values per column:
^${table(columnNames.zip(nanCountPerColumn).map(t => (t._1, t._2, "%.2f%%".format(100.0 * t._2 / nrows))).sortBy(-_._2))}
^""".stripMargin('^'))

The output is as follows:

The result shows that the second column (do not forget that we have already removed invalid columns during the data load), which represents the subjects' heart rate, contains a lot of missing values. More than 90% of the values are marked as NaN, which was probably caused by the measurement process of the experiment (subjects probably did not wear the heart rate monitor during usual daily activities, but only when practicing sport).

The rest of the columns contain sporadic missing values.

Another important observation is that the first column, containing activityId, does not include any missing values. That is good news; it means that all observations were properly annotated and we do not need to drop any of them (without a training target, we cannot train a model).
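We can turn this observation into a quick sanity check, reusing the nanCountPerColumn array computed above (its first entry corresponds to the activityId column):

// The first column holds activityId; it must not contain any NaN
val activityIdNaNCount = nanCountPerColumn(0)
assert(activityIdNaNCount == 0, "The activityId column contains missing values!")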

The RDD's reduce method is an action. That means it forces evaluation of the RDD, and the result of reduce is a single value, not an RDD. Do not confuse it with reduceByKey, which is an RDD transformation that returns a new RDD of key-value pairs.
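A minimal illustration of the difference, assuming an active SparkContext sc:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduce is an action: it evaluates the RDD and returns a single value
val total = pairs.map(_._2).reduce(_ + _) // 6

// reduceByKey is a transformation: it returns a new, lazily evaluated RDD
val sumsByKey = pairs.reduceByKey(_ + _)  // RDD containing ("a", 3) and ("b", 3)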

The next step is to decide what to do with the missing data. There are many strategies to adopt; however, we need to preserve the meaning of our data.

We can simply drop all rows or columns which contain missing data - a very common approach, as a matter of fact! It makes good sense for rows which are polluted by too many missing values, but it is not a good global strategy in this case, since we observed that missing values are spread over almost all columns and rows. Hence, we need a better strategy for handling missing values.
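For completeness, the row-wise variant of this approach could look as follows; it reuses nanCountPerRow and the threshold introduced earlier, and is a sketch rather than the strategy we adopt here:

// Keep only the rows whose number of missing values is below the threshold
val filteredData = nanCountPerRow.zip(rawData).
  filter { case (nanCount, _) => nanCount <= nanRowThreshold }.
  map { case (_, row) => row }
println(s"Rows kept: ${filteredData.count} out of ${nrows}")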

A summary of the sources of missing values and of imputation methods is available, for example, in the book Data Analysis Using Regression and Multilevel/Hierarchical Models by A. Gelman and J. Hill (http://www.stat.columbia.edu/~gelman/arm/missing.pdf), or in the presentations https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf and https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf.

Considering the heart rate column first, we cannot drop it, since there is an obvious link between heart rate and practiced activity. However, we can still fill the missing values with a reasonable constant. In the context of the heart rate, replacing missing values with the mean of the observed column values - a technique referred to as mean imputation - can make good sense. We can compute it with the following lines of code:

val heartRateColumn = rawData.
  map(row => row(1)).
  filter(!_.isNaN). // keep only the observed (non-missing) values
  map(_.toInt)

val heartRateValues = heartRateColumn.collect
val meanHeartRate = heartRateValues.sum / heartRateValues.length
scala.util.Sorting.quickSort(heartRateValues)
val medianHeartRate = heartRateValues(heartRateValues.length / 2)

println(s"Mean heart rate: ${meanHeartRate}")
println(s"Median heart rate: ${medianHeartRate}")

The output is as follows:

We can see that the mean heart rate is quite high, which reflects the fact that heart rate measurements are mainly associated with sport activities (the reader can verify that). But considering, for example, the activity watching TV, a value over 90 is slightly higher than expected, since the average resting rate is between 60 and 100 (based on Wikipedia).
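The reader can verify the claim with a small per-activity aggregation; the sketch below assumes the activitiesMap lookup table used elsewhere in this chapter:

// Mean heart rate per activity, computed only from observed (non-NaN) values
val meanHeartRatePerActivity = rawData.
  filter(row => !row(1).isNaN).
  map(row => (row(0).toInt, (row(1), 1))).
  reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }.
  map { case (activityId, (sum, count)) => (activitiesMap(activityId), sum / count) }.
  collect.sortBy(-_._2)

meanHeartRatePerActivity.foreach { case (activity, hr) => println(f"$activity%-25s ${hr}%.1f") }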

So in this case, we can replace missing heart rate values with the mean resting rate (80), or we can take the computed mean value of the heart rate. Later, we will impute the computed mean value and compare or combine the results (this is called the multiple imputation method). Alternatively, we can append a column which marks lines with a missing value (see, for example, https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf).
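A sketch of the imputation itself, together with the optional indicator column appended as the last column, could look like this (using the meanHeartRate computed above):

val rawDataImputedHR = rawData.map { row =>
  val hr = row(1)
  // 1.0 marks a row where the heart rate was originally missing
  val missingFlag = if (hr.isNaN) 1.0 else 0.0
  val imputedHr = if (hr.isNaN) meanHeartRate.toDouble else hr
  row.updated(1, imputedHr) :+ missingFlag
}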

The next step is to replace the missing values in the rest of the columns. We should perform the same analysis that we did for the heart rate column and see if there is a pattern in the missing data or if the values are just missing at random. For example, we can explore a dependency between missing values and our prediction target (in this case, activityId). Hence, we collect the number of missing values per column again; however, this time we also remember the activityId associated with each missing value:

def inc[K, V](l: Seq[(K, V)], v: (K, V)) // (3)
             (implicit num: Numeric[V]): Seq[(K, V)] =
  if (l.exists(_._1 == v._1))
    l.map {
      case (v._1, n) => (v._1, num.plus(n, v._2))
      case t => t
    }
  else l ++ Seq(v)

val distribTemplate = activityIdCounts.collect.map { case (id, _) => (id, 0) }.toSeq
val nanColumnDistribV1 = rawData.map { row => // (1)
  val activityId = row(0).toInt
  row.drop(1).map { v =>
    if (v.isNaN) inc(distribTemplate, (activityId, 1)) else distribTemplate
  } // Tip: Make sure that we are returning the same type from both branches
}.reduce { (v1, v2) => // (2)
  v1.indices.map(idx => v1(idx).foldLeft(v2(idx))(inc)).toArray
}

println(s"""
^NaN Column x Response distribution V1:
^${table(Seq(distribTemplate.map(v => activitiesMap(v._1)))
         ++ columnNames.drop(1).zip(nanColumnDistribV1).map(v => Seq(v._1) ++ v._2.map(_._2)), true)}
^""".stripMargin('^'))

The output is as follows:

The preceding code is slightly more complicated and deserves explanation. The call (1) transforms each value in a row into a sequence of (K, V) pairs, where K represents the activityId stored in the row, and V is 1 if the corresponding column contains a missing value, and 0 otherwise. Then the reduce call (2) recursively merges the per-row sequences into the final result, where each column has an associated distribution represented by a sequence of (K, V) pairs, K being an activityId and V the number of missing values in rows with that activityId. The method is straightforward but overcomplicated by the use of the non-trivial function inc (3). Furthermore, this naive solution is highly memory-inefficient, since for each column we duplicate the information about the activityId.

Hence, we can rework the naive solution by slightly changing the result representation: instead of attaching an activity distribution to every column, we key the rows by activityId and count the missing values per column for each activity:

val nanColumnDistribV2 = rawData.map { row =>
  val activityId = row(0).toInt
  (activityId, row.drop(1).map(v => if (v.isNaN) 1 else 0))
}.reduceByKey { (v1, v2) =>
  v1.indices.map(idx => v1(idx) + v2(idx)).toArray
}.map { case (activityId, d) =>
  (activitiesMap(activityId), d)
}.collect

println(s"""
^NaN Column x Response distribution V2:
^${table(Seq(columnNames.toSeq) ++ nanColumnDistribV2.map(v => Seq(v._1) ++ v._2), true)}
""".stripMargin('^'))

In this case, the result is an array of key-value pairs, where the key is an activity name and the value is an array representing the distribution of missing values over the individual columns. Simply by running both samples, we can observe that the first one takes much more time than the second one. The first one also has higher memory demands and is much more complicated.
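The runtime claim is easy to check with a simple wall-clock helper; a rough sketch, but good enough for a difference of this magnitude:

def timed[T](label: String)(body: => T): T = {
  val start = System.currentTimeMillis
  val result = body
  println(s"$label took ${System.currentTimeMillis - start} ms")
  result
}

// Wrap each variant to compare running times:
// timed("V1") { ...compute nanColumnDistribV1... }
// timed("V2") { ...compute nanColumnDistribV2... }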

Finally, we can visualize the result as a heatmap where the x axis corresponds to columns and the y axis represents activities, as shown in Figure 3. Such a graphical representation gives us a comprehensible overview of how missing values are correlated with the response column:

Figure 3: Heatmap showing number of missing values in each column grouped by activity.

The generated heatmap nicely shows the correlation of missing values. We can see that missing values are connected to sensors: if a sensor is not available or malfunctioning, then all of its measured values are missing. This is visible, for example, for the ankle sensor and the playing soccer activity, among others. On the other hand, the activity watching TV does not indicate any missing-value pattern connected to a sensor.
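If no plotting facility is at hand, even a crude character-based rendering of nanColumnDistribV2 reveals the same pattern (the bucket thresholds are arbitrary, chosen just for illustration):

nanColumnDistribV2.foreach { case (activity, counts) =>
  // Map each per-column count to a density character
  val cells = counts.map(c => if (c == 0) " " else if (c < 100) "." else if (c < 1000) "+" else "#").mkString
  println(f"$activity%-25s |$cells|")
}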

Moreover, there is no other directly visible connection between missing data and activity. Hence, for now, we can decide to fill the missing values with 0.0 to express that a missing sensor provides default values. However, our goal is to remain flexible enough to experiment with different imputation strategies later (for example, imputing the mean value of observations with the same activityId).
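One way to keep that flexibility is to extract the imputation into a function which takes the substitution values as a parameter; a sketch, assuming one substitution value per column:

import org.apache.spark.rdd.RDD

def imputeNaN(data: RDD[Array[Double]], values: Array[Double]): RDD[Array[Double]] =
  data.map { row =>
    // Replace each NaN with the substitution value of its column
    row.indices.map(i => if (row(i).isNaN) values(i) else row(i)).toArray
  }

// Fill all missing values with 0.0 for now; switching strategy later
// only requires a different values array (for example, per-column means).
val processedData = imputeNaN(rawData, Array.fill(ncols)(0.0))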
