Categorical values

Spark algorithms can handle categorical features in different forms, but the features must first be transformed into the form a given algorithm expects. For example, decision trees can handle categorical features as they are, whereas linear regression and neural networks need categorical values expanded into binary columns.
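
To make the idea of expanding categorical values into binary columns concrete, here is a minimal sketch in plain Scala, without any Spark dependency. The column colors and the helper oneHot are hypothetical names introduced only for this illustration; they do not come from the dataset used in this chapter.

```scala
// Hypothetical categorical column; the values are made up for illustration.
val colors = Seq("red", "green", "blue", "green")

// Fix an ordering of the distinct categories so that each category
// always maps to the same binary column position.
val categories: Seq[String] = colors.distinct.sorted

// Expand a single categorical value into a row of 0/1 indicator columns:
// exactly one column is 1.0, the one that matches the value.
def oneHot(value: String): Seq[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0)

// Encode the whole column and print one row per input value.
val encoded: Seq[Seq[Double]] = colors.map(oneHot)
encoded.foreach(row => println(row.mkString("[", ", ", "]")))
```

Each input value becomes a row with one indicator column per distinct category, which is the form that linear models and neural networks expect.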

In this example, the good news is that all input features in our dataset are continuous. However, the target feature, activityId, is a multi-class label. The Spark MLlib classification guide (https://spark.apache.org/docs/latest/mllib-linear-methods.html#classification) says:

"The training data set is represented by an RDD of LabeledPoint in MLlib, where labels are class indices starting from zero."

But the activityId values in our dataset are not class indices starting from zero (see the computed variable activityIdCounts). Hence, we need to transform them into the form expected by MLlib by defining a map from activityId to activityIdx:

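// Map each distinct activityId to a zero-based class index
val activityId2Idx = activityIdCounts.
  map(_._1).      // keep only the activityId of each (activityId, count) pair
  collect.        // bring the distinct ids to the driver
  zipWithIndex.   // pair each id with its position: 0, 1, 2, ...
  toMap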
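
The effect of the map above can be sketched on a small local collection. This is an illustration under stated assumptions, not the chapter's own code: sampleCounts stands in for activityIdCounts, its ids and counts are made up, and the names id2Idx, idx2Id, rawLabels, and labels are hypothetical. The sketch also sorts the ids before indexing, which is an addition here so that the assignment of indices is reproducible.

```scala
// Hypothetical (activityId, count) pairs standing in for activityIdCounts;
// the ids and counts are made up for illustration.
val sampleCounts = Seq((24, 100L), (1, 250L), (7, 80L), (12, 40L))

// Sort the ids first (an assumption of this sketch) so that the same id
// always receives the same index, then pair each id with 0, 1, 2, ...
val id2Idx: Map[Int, Int] = sampleCounts.map(_._1).sorted.zipWithIndex.toMap

// Inverse lookup, useful for translating predicted indices back to ids.
val idx2Id: Map[Int, Int] = id2Idx.map(_.swap)

// Relabel raw activity ids as zero-based class indices. Labels are doubles
// because MLlib's LabeledPoint stores its label as a Double.
val rawLabels = Seq(7, 24, 1, 12, 7)
val labels: Seq[Double] = rawLabels.map(id => id2Idx(id).toDouble)

println(id2Idx.toSeq.sortBy(_._2).mkString(", "))
println(labels.mkString(", "))
```

After this transformation the labels are contiguous and start at zero, which is what the MLlib guide quoted earlier requires.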