In this section, we will see a step-by-step example using the Naive Bayes (NB) algorithm. As already stated, NB is highly scalable: the number of parameters it requires is linear in the number of variables (features or predictors) in a learning problem. This scalability has enabled the Spark community to perform predictive analytics on large-scale datasets using this algorithm. The current implementation of NB in Spark MLlib supports both multinomial NB and Bernoulli NB.
We will predict the digits from the Pen-Based Recognition of Handwritten Digits dataset using Spark machine learning APIs, including Spark MLlib, Spark ML, and Spark SQL:
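To make the linear-parameter claim concrete, the following minimal sketch counts the parameters of a multinomial NB model: one prior per class plus one conditional term per class and feature. The helper name `nbParameterCount` is hypothetical, and the sizes used (10 classes, and 16 features from 8 sampled points with an x and a y coordinate each) are assumptions about the digits dataset rather than values taken from this text.

```scala
// Hypothetical helper: number of parameters in a multinomial Naive Bayes model.
// One prior per class, plus one conditional term per (class, feature) pair.
def nbParameterCount(numClasses: Int, numFeatures: Int): Int =
  numClasses + numClasses * numFeatures

// Assumed sizes: 10 digit classes, 16 features (8 points x 2 coordinates).
val small = nbParameterCount(numClasses = 10, numFeatures = 16)
// Doubling the number of features roughly doubles the parameter count.
val large = nbParameterCount(numClasses = 10, numFeatures = 32)
println(s"16 features: $small parameters, 32 features: $large parameters")
```

For a fixed number of classes, the count grows in proportion to the number of features, which is what keeps training cheap on wide datasets.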
Step 1. Data collection, preprocessing, and exploration - The Pen-Based Recognition of Handwritten Digits dataset originates from the UCI Machine Learning Repository; the LIBSVM-formatted copy used here was downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits. The dataset was generated by collecting around 250 digit samples from each of 44 writers, recording the location of the pen at fixed time intervals of 100 milliseconds. Each digit was written inside a 500 x 500 pixel box. Finally, the samples were scaled to integer values between 0 and 100 to create consistent scaling between observations. A well-known spatial resampling technique was used to obtain 3 and 8 regularly spaced points on an arc trajectory. A sample image, along with the lines from point to point, can be visualized by plotting the 3 or 8 sampled points based on their (x, y) coordinates. The number of samples per digit in the training and test sets is shown in the following table:
Set | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | Total |
Training | 780 | 779 | 780 | 719 | 780 | 720 | 720 | 778 | 718 | 719 | 7493 |
Test | 363 | 364 | 364 | 336 | 364 | 335 | 336 | 364 | 335 | 336 | 3497 |
The training set in the preceding table consists of samples written by 30 writers, and the test set consists of samples written by the remaining 14 writers.
More on this dataset can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits-orig.names. A digital representation of a sample snapshot of the dataset is shown in the following figure:
Now, to predict the dependent variable (that is, the label) using the independent variables (that is, the features), we need to train a multiclass classifier since, as shown previously, the dataset has ten classes, that is, the ten handwritten digits 0 to 9. For the prediction, we will use the Naive Bayes classifier and evaluate the model's performance.
Step 2. Load the required library and packages:
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession
Step 3. Create an active Spark session:
val spark = SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/home/exp/")
  .appName("NaiveBayes")
  .getOrCreate()
Note that the master URL has been set to local[*], which means that all the cores of your machine will be used for processing the Spark job. You should set the SQL warehouse directory and the other configuration parameters according to your requirements.
Step 4. Create the DataFrame - Load the data stored in LIBSVM format as a DataFrame:
val data = spark.read.format("libsvm")
  .load("data/pendigits.data")
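Each line of a LIBSVM file holds a label followed by index:value pairs whose indices are 1-based, and Spark's reader converts them to 0-based feature indices when it builds the features column. The following pure-Scala sketch illustrates that layout only; it is not how Spark parses the file internally. The function name `parseLibSvmLine` and the sample line are hypothetical and not taken from the dataset.

```scala
// Hypothetical parser for one LIBSVM line: "<label> <index>:<value> ...".
// Returns the label and a sparse map from 0-based feature index to value.
def parseLibSvmLine(line: String): (Double, Map[Int, Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { token =>
    val Array(index, value) = token.split(":")
    (index.toInt - 1) -> value.toDouble // shift 1-based index to 0-based
  }.toMap
  (label, features)
}

// Made-up sample line; feature 3 is absent, so it is implicitly zero.
val (sampleLabel, sampleFeatures) = parseLibSvmLine("8 1:47 2:100 4:27")
println(s"label = $sampleLabel, features = $sampleFeatures")
```

Only the non-zero entries are stored, which is why the format pairs naturally with sparse vectors.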
For digit classification, the input feature vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of sparsity. Since the training data is used only once here, and the dataset is relatively small (that is, a few MBs), caching is not necessary; however, you can cache the DataFrame if you use it more than once.
Step 5. Prepare the training and test set - Split the data into training and test sets (25% held out for testing):
val Array(trainingData, testData) = data
  .randomSplit(Array(0.75, 0.25), seed = 12345L)
Step 6. Train the Naive Bayes model - Train a Naive Bayes model using the training set as follows:
val nb = new NaiveBayes()
val model = nb.fit(trainingData)
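The estimator above relies on its default settings. As a hedged illustration, the following sketch spells those settings out explicitly; `nbExplicit` is a hypothetical name, and the values shown are what we understand the spark.ml defaults to be, so check them against the Spark version you use.

```scala
import org.apache.spark.ml.classification.NaiveBayes

// Same kind of estimator as above, with the assumed defaults written out.
val nbExplicit = new NaiveBayes()
  .setModelType("multinomial") // "bernoulli" expects binary (0/1) features
  .setSmoothing(1.0)           // additive (Laplace) smoothing
  .setLabelCol("label")
  .setFeaturesCol("features")
```

The multinomial variant requires non-negative feature values, which holds for features scaled to the range 0 to 100.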
Step 7. Calculate the prediction on the test set - Calculate the prediction using the model transformer and finally show the prediction against each label as follows:
val predictions = model.transform(testData)
predictions.show()
As you can see in the preceding figure, some labels were predicted correctly and some were predicted wrongly. Rather than judging the model naively from a handful of rows, we need to compute the accuracy and the weighted precision, recall, and f1 measures.
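Before computing summary metrics, it can help to see which digits are confused with which. The following is a minimal sketch, assuming a DataFrame with the label and prediction columns produced by the model transformer; the function name `confusionMatrix` is hypothetical.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: one row per true label, one column per predicted label.
def confusionMatrix(predictions: DataFrame): DataFrame =
  predictions
    .groupBy("label")     // rows: true labels
    .pivot("prediction")  // columns: predicted labels
    .count()
    .na.fill(0L)          // label/prediction pairs that never occur become 0
    .orderBy("label")
```

Calling confusionMatrix(predictions).show() would print the table; large off-diagonal counts point to pairs of digits that the classifier mixes up.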
Step 8. Evaluate the model - Select the prediction and the true label to compute test error and classification performance metrics such as accuracy, precision, recall, and f1 measure as follows:
// setMetricName mutates the evaluator and returns the same instance,
// so build a separate evaluator for each metric (evaluatorFor is a helper).
def evaluatorFor(metric: String) = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName(metric)
val evaluator1 = evaluatorFor("accuracy")
val evaluator2 = evaluatorFor("weightedPrecision")
val evaluator3 = evaluatorFor("weightedRecall")
val evaluator4 = evaluatorFor("f1")
Step 9. Compute the performance metrics - Compute the classification accuracy, precision, recall, f1 measure, and error on test data as follows:
val accuracy = evaluator1.evaluate(predictions)
val precision = evaluator2.evaluate(predictions)
val recall = evaluator3.evaluate(predictions)
val f1 = evaluator4.evaluate(predictions)
Step 10. Print the performance metrics:
println("Accuracy = " + accuracy)
println("Precision = " + precision)
println("Recall = " + recall)
println("F1 = " + f1)
println(s"Test Error = ${1 - accuracy}")
You should observe values similar to the following:
Accuracy = 0.8284365162644282
Precision = 0.8361211320692463
Recall = 0.828436516264428
F1 = 0.8271828540349192
Test Error = 0.17156348373557184
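Recall and accuracy agree in the preceding output (up to floating-point rounding) because weighted recall weights each class's recall by that class's share of the true labels, which reduces to the overall fraction of correct predictions. The following pure-Scala sketch works this out on a toy set of (label, prediction) pairs; the pairs and all names are hypothetical and unrelated to the real predictions.

```scala
// Toy (label, prediction) pairs, made up for illustration only.
val pairs = Seq((0, 0), (0, 0), (0, 1), (1, 1), (2, 1), (2, 2))
val total = pairs.size.toDouble
val classes = pairs.map(_._1).distinct.sorted

// Per-class counts.
def truePositives(c: Int): Double = pairs.count { case (l, p) => l == c && p == c }.toDouble
def support(c: Int): Double = pairs.count(_._1 == c).toDouble   // true instances of c
def predicted(c: Int): Double = pairs.count(_._2 == c).toDouble // predictions of c

val toyAccuracy = pairs.count { case (l, p) => l == p } / total

// Each class's recall, weighted by its share of the true labels.
val toyWeightedRecall =
  classes.map(c => support(c) / total * (truePositives(c) / support(c))).sum

// Each class's precision, weighted the same way.
val toyWeightedPrecision = classes.map { c =>
  val precision = if (predicted(c) == 0) 0.0 else truePositives(c) / predicted(c)
  support(c) / total * precision
}.sum

println(s"accuracy = $toyAccuracy, recall = $toyWeightedRecall, precision = $toyWeightedPrecision")
```

Weighted precision uses the same class weights but divides by the number of predictions per class, so it generally differs from accuracy.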
The performance is not that bad. However, you can still increase the classification accuracy by performing hyperparameter tuning. There are further opportunities to improve the prediction accuracy by selecting appropriate algorithms (that is, a classifier or a regressor) through cross-validation and train-validation split, which will be discussed in the following section.
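As a brief preview of such tuning, the following sketch sets up a cross-validated search over the NB smoothing parameter. The candidate values, the number of folds, and the names `nbTuned`, `paramGrid`, and `cv` are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val nbTuned = new NaiveBayes()

// Candidate smoothing values (assumed for illustration).
val paramGrid = new ParamGridBuilder()
  .addGrid(nbTuned.smoothing, Array(0.1, 0.5, 1.0, 2.0))
  .build()

// 5-fold cross-validation that selects the smoothing value with the best accuracy.
val cv = new CrossValidator()
  .setEstimator(nbTuned)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

// Fitting is left commented out because it needs the training DataFrame:
// val cvModel = cv.fit(trainingData)
```

Fitting trains one model per fold and candidate value, so the cost grows with both the grid size and the number of folds.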