Spark Naive Bayes model

Next, let's employ Spark's implementation of Naive Bayes. As a reminder, we purposely avoid going into the algorithm itself, as it has been covered in many machine learning books; instead, we will focus on the parameters of the model and, ultimately, on how we can "deploy" these models in a Spark Streaming application later in this chapter.

Spark's implementation of Naive Bayes is relatively straightforward, with just a few parameters to keep in mind, mainly the following:

  • getSmoothing: Sometimes referred to as "additive smoothing" or "Laplace smoothing" (the older RDD-based API calls this parameter lambda), this parameter allows us to smooth out the observed proportions of our categorical variables toward a more uniform distribution. It is especially important when observed counts are low and you don't want entire categories to receive zero probability simply because they were never sampled; the smoothing parameter combats this by guaranteeing some minimal representation for every category.
  • getModelType: There are two options here: "multinomial" (the default) or "bernoulli". The Bernoulli model type assumes that our features are binary; in our text example, the question becomes "does the review contain the word _____, yes or no?" The multinomial model type, by contrast, works with discrete word counts. A third model type that is worth knowing about, though not implemented for Naive Bayes in Spark at the time of writing, is the Gaussian model type, which allows features to be continuous values drawn from a normal distribution. To make the difference between the two supported encodings concrete, see the short sketch after this list.

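Here is a minimal sketch of how the same review would be encoded under each supported model type. The four-word vocabulary and the review text are purely illustrative assumptions, not part of our pipeline:

import org.apache.spark.ml.linalg.Vectors

// Hypothetical vocabulary: ("good", "bad", "movie", "plot")
// Hypothetical review text: "good good movie"

// Multinomial encoding: discrete term counts
val multinomialFeatures = Vectors.dense(2.0, 0.0, 1.0, 0.0)

// Bernoulli encoding: binary presence/absence (values must be 0 or 1)
val bernoulliFeatures = Vectors.dense(1.0, 0.0, 1.0, 0.0)
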
Given that we only have one hyperparameter to deal with in this case, we will simply go with the default value for our smoothing (lambda), but you are encouraged to try a grid search approach as well for optimal results (a sketch follows the code below):

import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val nbModelPath = s"$MODELS_DIR/nbModel"
val nbModel = {
  val model = new NaiveBayes()
    .setFeaturesCol(idf.getOutputCol)
    .setLabelCol("label")
    .setSmoothing(1.0)
    .setModelType("multinomial") // Note: input data are multinomial (word counts)
    .fit(trainData)
  // Evaluate the trained model on the held-out test set
  val nbPrediction = model.transform(testData)
  val nbAUC = new BinaryClassificationEvaluator().setLabelCol("label")
    .evaluate(nbPrediction)
  println(s"Naive Bayes AUC: $nbAUC")
  // Persist the model so it can be loaded later by the streaming application
  model.write.overwrite.save(nbModelPath)
  model
}
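
The grid search mentioned above can be expressed with Spark's built-in ParamGridBuilder and CrossValidator. The following is a minimal sketch that tunes only the smoothing parameter; the candidate values and the number of folds are illustrative assumptions:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val nb = new NaiveBayes()
  .setFeaturesCol(idf.getOutputCol)
  .setLabelCol("label")
  .setModelType("multinomial")

// Candidate smoothing values (illustrative; widen the grid as needed)
val paramGrid = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.5, 1.0, 2.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(nb)
  .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3) // 3-fold cross-validation

// cvModel.bestModel holds the model trained with the best-scoring smoothing value
val cvModel = cv.fit(trainData)

The best model found by the search can then be evaluated and persisted exactly as in the code above.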

It is interesting to compare the performance of different models on the same input dataset. Often, it turns out that even a simple Naive Bayes algorithm lends itself very well to text classification tasks. Part of the reason lies in the first adjective of the algorithm's name: "naive". Specifically, this algorithm assumes that our features, which in this case are globally weighted term frequencies, are mutually independent. Is this true in the real world? In practice, this assumption is often violated; nevertheless, the algorithm can still perform as well as, if not better than, more complex models.
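
Concretely, the conditional-independence assumption lets the model factor the likelihood of a whole document into a simple product over its individual terms:

P(x1, x2, ..., xn | class) = P(x1 | class) × P(x2 | class) × ... × P(xn | class)

Each per-term probability can be estimated independently from simple counts, which is why Naive Bayes trains quickly even over very large vocabularies.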
