According to Robi Polikar et al. (Learn++: An Incremental Learning Algorithm for Supervised Neural Networks, IEEE Transactions on Systems, Man, and Cybernetics, Part C, Vol. 31, No. 4, November 2001), various algorithms have been suggested for incremental learning, and the term is used for quite different problems. In some literature, incremental learning refers to either growing or pruning a classifier. Alternatively, it may refer to selecting the most informative training samples so that a problem can be solved incrementally.
In other cases, making a regular ML algorithm incremental means performing some form of controlled modification of the classifier's weights by retraining on misclassified signals. Some algorithms are capable of learning new information; however, they do not simultaneously satisfy all of the previously mentioned criteria. They either require access to the old data or forget prior knowledge along the way, and because they cannot accommodate new classes, they are not adaptable to new datasets.
Considering these issues, in this section we will discuss how to adapt ML models using an incremental version of the original algorithms. Incremental SVMs, Bayesian networks, and neural networks will be discussed in brief. Moreover, when applicable, we will provide the regular Spark implementation of these algorithms.
It's pretty difficult to make a regular ML algorithm incremental. In short, it's possible but not altogether easy. If you want to do it, you have to either change the underlying source code of the Spark library you are using or implement the training algorithm yourself.
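To make the second option more concrete, here is a minimal plain-Java sketch of a linear SVM whose weights are updated one example at a time by a sub-gradient step on the hinge loss. It is not Spark code and does not reproduce any particular published incremental SVM; the class name OnlineLinearSvm, the constant step size, and the L2 regularization strength are assumptions made purely for illustration.

```java
/** Hypothetical sketch (not a Spark API): a linear SVM trained one example at a time. */
final class OnlineLinearSvm {
    private final double[] weights;
    private double bias;
    private final double stepSize;  // assumed constant learning rate
    private final double regParam;  // assumed L2 regularization strength

    OnlineLinearSvm(int numFeatures, double stepSize, double regParam) {
        this.weights = new double[numFeatures];
        this.stepSize = stepSize;
        this.regParam = regParam;
    }

    /** Raw score w.x + b for one feature vector. */
    double margin(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    /** One sub-gradient step on the hinge loss; label must be +1 or -1. */
    void update(double[] x, int label) {
        // The hinge loss contributes a gradient only when the example lies inside the margin.
        boolean violated = label * margin(x) < 1.0;
        for (int i = 0; i < weights.length; i++) {
            double gradient = regParam * weights[i] - (violated ? label * x[i] : 0.0);
            weights[i] -= stepSize * gradient;
        }
        if (violated) {
            bias += stepSize * label;
        }
    }

    /** Predicted class, +1 or -1. */
    int predict(double[] x) {
        return margin(x) >= 0.0 ? 1 : -1;
    }
}
```

Because update() needs only the current example, a new chunk of data can be fed to an existing model without revisiting the old data. How well the earlier knowledge is retained depends on the step size and on the data itself.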
Unfortunately, Spark does not have an incremental version of SVM implemented. Before making the linear SVM incremental, however, you first need to understand the linear SVM itself. Therefore, in the next sub-section we introduce some concepts of linear SVMs by applying Spark's implementation to a new dataset.
To our knowledge, only two implementations support incremental training: SVMHeavy (http://people.eng.unimelb.edu.au/shiltona/svm/) and LaSVM (http://leon.bottou.org/projects/lasvm). We haven't used either. Interested readers should follow these two papers on incremental SVMs to get some insight. Both are straightforward and present good research if you're just getting started:
http://cbcl.mit.edu/cbcl/publications/ps/cauwenberghs-nips00.pdf.
In this section, we will first discuss how to perform binary classification using Spark's implementation of linear SVMs. Then we will show how to adapt the same algorithm to the new data type.
Step 1: Data collection and exploration
We have collected a colon cancer dataset from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Originally, the dataset was labeled as -1.0 and 1.0 as follows:
The dataset was used in the following publication: U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumour and normal colon tissues probed by oligonucleotide arrays. Cell Biology, 96:6745-6750, 1999. Interested readers should refer to the publication to get more insights into the dataset.
After that, instance-wise normalization to mean zero and variance one is carried out, followed by feature-wise normalization to mean zero and variance one as a pre-processing step. However, for simplicity, we have replaced the label -1.0 with 0.0, since Spark's SVMWithSGD expects the labels of a binary classification problem to be 0 and 1. Therefore, the dataset now contains two labels, 1 and 0 (that is to say, it's a binary classification problem). After pre-processing and scaling, there are two classes and 2000 features. Here is a sample of the dataset in Figure 5:
Step 2: Load the necessary packages and APIs
Here is the code to load the necessary packages:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.optimization.L1Updater;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
Step 3: Configure the Spark session
The following code helps us to create the Spark session:
SparkSession spark = SparkSession
    .builder()
    .appName("JavaLDAExample")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "E:/Exp/")
    .getOrCreate();
Step 4: Create a Dataset out of the data
Here is the code to create a Dataset:
String path = "input/colon-cancer.data";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(spark.sparkContext(), path).toJavaRDD();
Step 5: Prepare the training and test sets
Here is the code to prepare the training and test sets:
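// Sample roughly 80% of the data (without replacement, seed 11) for training
JavaRDD<LabeledPoint> training = data.sample(false, 0.8, 11L);
training.cache();
// Use the points that are not in the training sample as the test set
JavaRDD<LabeledPoint> test = data.subtract(training);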
Step 6: Build and train the SVM model
The following code illustrates how to build and train the SVM model:
int numIterations = 500;
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
Step 7: Compute the raw prediction score on the test set
Here is the code to compute the raw prediction:
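// With the default threshold in place, predict() returns a 0/1 label;
// calling model.clearThreshold() beforehand makes it return the raw margin instead.
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
        public Tuple2<Object, Object> call(LabeledPoint p) {
            Double score = model.predict(p.features());
            return new Tuple2<Object, Object>(score, p.label());
        }
    });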
Step 8: Evaluate the model
Here is the code to evaluate the model:
BinaryClassificationMetrics metrics = new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels));
System.out.println("Area Under PR = " + metrics.areaUnderPR());
System.out.println("Area Under ROC = " + metrics.areaUnderROC());

Area Under PR = 0.6266666666666666
Area Under ROC = 0.875
The area under the ROC curve ranges from 0 to 1, where 0.5 corresponds to random guessing. As a rule of thumb, a value above 0.8 indicates a good classifier, while a value below 0.8 signals a weak one. The SVMWithSGD.train() method by default performs L2 regularization with the regularization parameter set to 1.0.
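To give the area under the ROC curve a concrete meaning, the following plain-Java sketch computes it as a ranking statistic: the fraction of (positive, negative) pairs in which the positive example receives the higher score. This is only an illustration of the metric, not the way Spark's BinaryClassificationMetrics is implemented; the class and method names are hypothetical.

```java
/** Hypothetical helper (not Spark's implementation): AUC as a pairwise ranking statistic. */
final class AucSketch {
    /**
     * Fraction of (positive, negative) pairs in which the positive example
     * receives the higher score; ties count as half a win.
     * Labels are expected to be 1.0 (positive) or 0.0 (negative).
     */
    static double areaUnderRoc(double[] scores, double[] labels) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1.0) {
                continue;  // outer loop visits positives only
            }
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0.0) {
                    continue;  // inner loop visits negatives only
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("Need at least one positive and one negative example");
        }
        return wins / pairs;
    }
}
```

A classifier that gives every example the same score obtains 0.5 under this definition, which is why 0.5 is read as random guessing, and a value below 0.5 means the ranking is inverted.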
If you want to configure this algorithm, you can customize SVMWithSGD further by creating a new object directly and then calling its setter methods to set the desired values.
Interestingly, all the other Spark MLlib algorithms can be customized in this way. However, if your customization goes beyond the setters and changes the library itself, you need to rebuild the source code so that the changes reach the API level. Interested readers can join the Apache Spark mailing list if they want to contribute to the open source project.
Note that the source code of Spark is available as open source on GitHub at https://github.com/apache/spark, and the project accepts pull requests that enrich Spark. More technical discussion can be found on the Spark website at http://spark.apache.org/.
For example, the following code produces an L1-regularized variant of SVMs with the regularization parameter set to 0.1, and runs the training algorithm for 500 iterations:
SVMWithSGD svmAlg = new SVMWithSGD();
svmAlg.optimizer()
    .setNumIterations(500)
    .setRegParam(0.1)
    .setUpdater(new L1Updater());
final SVMModel model = svmAlg.run(training.rdd());
Your model is now trained. If you now repeat step 7 and step 8, the following metrics will be generated:
Area Under PR = 0.9380952380952381
Area Under ROC = 0.95
If you compare this result with the result produced in step 8, it's much better now, isn't it? The higher values indicate a better classification (see also https://www.researchgate.net/post/What_is_the_value_of_the_area_under_the_roc_curve_AUC_to_conclude_that_a_classifier_is_excellent). Note, however, that depending on the data preparation you might see different results. In this way, the SVM can be optimized or adapted for the new data type.
The parameters (that is, the number of iterations, the regularization parameter, and the updater) should, of course, be set accordingly.
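To see why the choice of updater matters, the sketch below contrasts how L2 and L1 regularization act on a single weight after a gradient step. Spark's L1Updater is documented as using soft thresholding, and the l1Step method shows that general idea; the code is not Spark's implementation, it omits Spark's step-size schedule, and all names in it are hypothetical.

```java
/** Hypothetical sketch of how L2 and L1 regularization act on a single weight. */
final class RegularizationSketch {
    /** L2: the penalty adds regParam * weight to the gradient, shrinking the weight proportionally. */
    static double l2Step(double weight, double gradient, double stepSize, double regParam) {
        return weight - stepSize * (gradient + regParam * weight);
    }

    /** L1: take the plain gradient step, then soft-threshold the result towards zero. */
    static double l1Step(double weight, double gradient, double stepSize, double regParam) {
        double stepped = weight - stepSize * gradient;
        double shrinkage = stepSize * regParam;
        // Weights whose magnitude is below the shrinkage amount become exactly zero.
        return Math.signum(stepped) * Math.max(0.0, Math.abs(stepped) - shrinkage);
    }
}
```

With L1, small weights are set to exactly zero, which yields sparse models; with L2, weights shrink but rarely vanish. On a dataset with 2000 features and few instances, this difference can change the result noticeably.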
The incremental versions of the neural network in R and MATLAB provide adaptability through the adapt function. Does this function update the network iteratively instead of overwriting it? To find out, readers can try the R or MATLAB version of the incremental neural network-based classifier, selecting a subset of the first data chunk as the second chunk in training. If the network is overwritten, then when you use the net trained on the subset to test your first data chunk, it will likely predict poorly the data that does not belong to the subset.
To date, there is no implementation of the incremental version of the neural network in Spark. According to the API documentation provided at https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier, Spark's Multilayer Perceptron Classifier (MLPC) is a classifier based on the Feedforward Artificial Neural Network (FANN). The MLPC consists of multiple layers of nodes, including hidden layers. Each layer is fully connected to the next layer in the network. A node in the input layer represents the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b, followed by an activation function. The number of nodes N in the output layer corresponds to the number of classes.
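The mapping from inputs to outputs described above can be sketched in plain Java as a forward pass through fully connected layers. Following the Spark documentation, the sketch uses the sigmoid function in the intermediate layers and softmax in the output layer. It is only an illustration, not Spark's internal code, and the class name, method names, and array layout are assumptions.

```java
/** Hypothetical sketch (not Spark's code): forward pass of a fully connected network. */
final class MlpForwardSketch {
    /** Linear combination z = W x + b, where weights[j][i] connects input i to node j. */
    static double[] affine(double[][] weights, double[] bias, double[] x) {
        double[] z = new double[bias.length];
        for (int j = 0; j < bias.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < x.length; i++) {
                sum += weights[j][i] * x[i];
            }
            z[j] = sum;
        }
        return z;
    }

    /** Element-wise logistic function, used in the hidden layers. */
    static double[] sigmoid(double[] z) {
        double[] a = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            a[i] = 1.0 / (1.0 + Math.exp(-z[i]));
        }
        return a;
    }

    /** Softmax for the output layer; subtracting the maximum keeps exp() stable. */
    static double[] softmax(double[] z) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : z) {
            max = Math.max(max, v);
        }
        double[] a = new double[z.length];
        double sum = 0.0;
        for (int i = 0; i < z.length; i++) {
            a[i] = Math.exp(z[i] - max);
            sum += a[i];
        }
        for (int i = 0; i < z.length; i++) {
            a[i] /= sum;
        }
        return a;
    }

    /** Runs the input through every layer and returns one probability per class. */
    static double[] forward(double[][][] weights, double[][] biases, double[] input) {
        double[] activation = input;
        for (int layer = 0; layer < weights.length; layer++) {
            double[] z = affine(weights[layer], biases[layer], activation);
            boolean isOutputLayer = layer == weights.length - 1;
            activation = isOutputLayer ? softmax(z) : sigmoid(z);
        }
        return activation;
    }
}
```

The output array has one entry per class, and its entries sum to one, so the predicted class is simply the index of the largest entry.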
MLPC uses backpropagation for learning the model. Spark uses the logistic loss function for optimization and Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) as the optimization routine. Note that L-BFGS is an optimization algorithm in the family of quasi-Newton methods (QNM) that approximates the Broyden-Fletcher-Goldfarb-Shanno algorithm using a limited amount of main memory. To train the multilayer perceptron classifier, the following parameters need to be set:
Note that the layers consist of the input, hidden, and output layers. Moreover, a smaller value of the convergence tolerance will lead to higher accuracy at the cost of more iterations. The default block size is 128, and the maximum number of iterations defaults to 100. We suggest you set these values carefully and according to your data.
In this sub-section, we will show how Spark has implemented the neural network learning algorithms through the multilayer perceptron classifier on the Iris dataset.
Step 1: Dataset collection, processing, and exploration
The original Iris plant dataset was collected from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), then pre-processed and scaled to libsvm format by Chang et al., and placed on the dataset page of LIBSVM, a comprehensive library for support vector machines (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html), which hosts datasets for binary, multi-class, and multi-label classification tasks. The Iris dataset contains three classes and four features, where the sepal and petal lengths are scaled according to the libsvm format. More specifically, here is the attribute information:
A snapshot of the dataset is shown in Figure 6:
Step 2: Load the required packages and APIs
Here is the code to load the required packages and APIs:
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.example.SparkSession.UtilityForSparkSession;
Step 3: Create a Spark session
The following code helps us to create the Spark session:
SparkSession spark = UtilityForSparkSession.mySession();
Note that the mySession() method, which creates and returns a Spark session object, is as follows:
public static SparkSession mySession() {
    SparkSession spark = SparkSession.builder()
        .appName("MultilayerPerceptronClassificationModel")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "E:/Exp/")
        .getOrCreate();
    return spark;
}
Step 4: Parse and prepare the dataset
Load the input data in libsvm format:
String path = "input/iris.data";
Dataset<Row> dataFrame = spark.read().format("libsvm").load(path);
Step 5: Prepare the training and test set
Prepare the training and test set: training = 70%, test = 30%, and seed = 12345L:
Dataset<Row>[] splits = dataFrame.randomSplit(new double[] { 0.7, 0.3 }, 12345L);
Dataset<Row> train = splits[0];
Dataset<Row> test = splits[1];
Step 6: Specify the layers for the neural network
Specify the layers for the neural network. Here, the input layer has size 4 (features), the two intermediate (that is, hidden) layers have sizes 4 and 3, and the output layer has size 3 (classes):
int[] layers = new int[] { 4, 4, 3, 3 };
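As a quick way to see what this array implies, the following plain-Java sketch counts the weights and biases of a fully connected network with the given layer sizes. The helper is hypothetical and only does arithmetic on the array; it does not call Spark.

```java
/** Hypothetical helper: size of a fully connected network described by its layer sizes. */
final class LayerSizeSketch {
    /** Number of weights and biases needed to connect each layer to the next. */
    static int countParameters(int[] layers) {
        int total = 0;
        for (int i = 0; i + 1 < layers.length; i++) {
            // Each node in layer i + 1 has one weight per node in layer i, plus one bias.
            total += (layers[i] + 1) * layers[i + 1];
        }
        return total;
    }
}
```

For the layers { 4, 4, 3, 3 }, this gives (4 + 1) * 4 + (4 + 1) * 3 + (3 + 1) * 3 = 47 trainable values, which is a very small network.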
Step 7: Create the multilayer perceptron estimator
Create the MultilayerPerceptronClassifier trainer and set its parameters. Here, set the value of the layers param using the setLayers() method with the array from Step 6. Set the convergence tolerance of iterations using the setTol() method; a smaller value will lead to higher accuracy at the cost of more iterations. Note the default is 1E-4. Set the value of the blockSize param using the setBlockSize() method, where the default is 128. Set the seed for weight initialization using the setSeed() method; it is used if the weights are not set through setInitialWeights(). Finally, set the maximum number of iterations using the setMaxIter() method, where the default is 100:
MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
    .setLayers(layers)
    .setTol(1E-4)
    .setBlockSize(128)
    .setSeed(12345L)
    .setMaxIter(100);
Step 8: Train the model
Train the MultilayerPerceptronClassificationModel using the estimator from step 7:
MultilayerPerceptronClassificationModel model = trainer.fit(train);
Step 9: Compute the accuracy on the test set
Here is the code to compute the accuracy on the test set:
Dataset<Row> result = model.transform(test);
Dataset<Row> predictionAndLabels = result.select("prediction", "label");
Step 10: Evaluate the model
Evaluate the model, calculate the metrics, and print the accuracy, weighted precision, and weighted recall:
MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy");
MulticlassClassificationEvaluator evaluator2 = new MulticlassClassificationEvaluator().setMetricName("weightedPrecision");
MulticlassClassificationEvaluator evaluator3 = new MulticlassClassificationEvaluator().setMetricName("weightedRecall");
System.out.println("Accuracy = " + evaluator.evaluate(predictionAndLabels));
System.out.println("Precision = " + evaluator2.evaluate(predictionAndLabels));
System.out.println("Recall = " + evaluator3.evaluate(predictionAndLabels));
The output should appear as follows:
Accuracy = 0.9545454545454546
Precision = 0.9595959595959596
Recall = 0.9545454545454546
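To clarify what the weighted metrics mean, here is a small plain-Java sketch that computes accuracy, weighted precision, and weighted recall from arrays of labels and predictions. Each per-class value is weighted by the share of true instances of that class. The class name is hypothetical, the labels are assumed to be integers from 0 to numClasses - 1, and the code illustrates the definitions rather than Spark's evaluator.

```java
/** Hypothetical sketch (not Spark's evaluator): accuracy and class-weighted precision/recall. */
final class WeightedMetricsSketch {
    static double accuracy(int[] labels, int[] predictions) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == predictions[i]) {
                correct++;
            }
        }
        return correct / (double) labels.length;
    }

    static double weightedPrecision(int[] labels, int[] predictions, int numClasses) {
        return weighted(labels, predictions, numClasses, true);
    }

    static double weightedRecall(int[] labels, int[] predictions, int numClasses) {
        return weighted(labels, predictions, numClasses, false);
    }

    /** Sums per-class precision or recall, weighting each class by its share of true instances. */
    private static double weighted(int[] labels, int[] predictions, int numClasses, boolean precision) {
        int[] truePositives = new int[numClasses];
        int[] predictedCounts = new int[numClasses];
        int[] labelCounts = new int[numClasses];
        for (int i = 0; i < labels.length; i++) {
            labelCounts[labels[i]]++;
            predictedCounts[predictions[i]]++;
            if (labels[i] == predictions[i]) {
                truePositives[labels[i]]++;
            }
        }
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            int denominator = precision ? predictedCounts[c] : labelCounts[c];
            double perClass = denominator == 0 ? 0.0 : truePositives[c] / (double) denominator;
            sum += perClass * labelCounts[c] / (double) labels.length;
        }
        return sum;
    }
}
```

With these definitions, weighted recall always equals accuracy, because the class weights cancel the per-class denominators. This is consistent with the identical Accuracy and Recall values in the preceding output.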
Step 11: Stop the Spark session
The following code is used to stop the Spark session:
spark.stop();
From the preceding prediction metrics, it is clear that the classifier performs quite well. Now it's your turn: try training and testing with a new dataset and make your ML model adaptable.
As we discussed earlier, Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. A Naive Bayes based model can be trained very efficiently: it computes the conditional probability distribution of each feature given the label in a single pass over the training data. After that, it applies Bayes' theorem to compute the conditional probability distribution of the label given an observation and uses it for prediction.
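Because training only accumulates counts, Naive Bayes lends itself naturally to incremental updates. The following plain-Java sketch of a multinomial Naive Bayes keeps per-class document counts and term sums, so that new observations, and even new classes, can be added without revisiting old data. It is a minimal illustration under the usual additive-smoothing formulas, not Spark's NaiveBayes; the class IncrementalNaiveBayes and its methods are hypothetical.

```java
/** Hypothetical sketch (not a Spark API): multinomial Naive Bayes updated one document at a time. */
final class IncrementalNaiveBayes {
    private final int numFeatures;
    private final double lambda;  // additive smoothing parameter
    private final java.util.Map<Integer, Integer> docCounts = new java.util.TreeMap<>();
    private final java.util.Map<Integer, double[]> termSums = new java.util.TreeMap<>();
    private int totalDocs = 0;

    IncrementalNaiveBayes(int numFeatures, double lambda) {
        this.numFeatures = numFeatures;
        this.lambda = lambda;
    }

    /** Adds one observation; features are non-negative term frequencies. */
    void update(double[] features, int label) {
        docCounts.merge(label, 1, Integer::sum);
        double[] sums = termSums.computeIfAbsent(label, k -> new double[numFeatures]);
        for (int i = 0; i < numFeatures; i++) {
            sums[i] += features[i];
        }
        totalDocs++;
    }

    /** Returns the label with the highest smoothed log-posterior. */
    int predict(double[] features) {
        if (totalDocs == 0) {
            throw new IllegalStateException("No training data seen yet");
        }
        int numClasses = docCounts.size();
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (java.util.Map.Entry<Integer, double[]> entry : termSums.entrySet()) {
            int label = entry.getKey();
            double[] sums = entry.getValue();
            double total = 0.0;
            for (double s : sums) {
                total += s;
            }
            // Smoothed log-prior of the class.
            double score = Math.log((docCounts.get(label) + lambda) / (totalDocs + lambda * numClasses));
            // Smoothed log-likelihood of each term, weighted by its frequency.
            double logDenominator = Math.log(total + lambda * numFeatures);
            for (int i = 0; i < numFeatures; i++) {
                score += features[i] * (Math.log(sums[i] + lambda) - logDenominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Calling update() with a label that has not been seen before simply creates a new entry, so the model accommodates new classes while keeping what it learned earlier.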
However, there is still no implementation of the incremental version of the Bayesian network in Spark. According to the API documentation provided at http://spark.apache.org/docs/latest/mllib-naive-bayes.html, each observation is a document and each feature represents a term. In document classification, the value of a feature is the frequency of the term (in multinomial Naive Bayes) or a zero or one indicating whether the term was found in the document (in Bernoulli Naive Bayes).
Note that the feature values must be non-negative. The model type is selected with an optional parameter, multinomial or Bernoulli; the default is multinomial. Furthermore, additive smoothing can be used by setting the parameter λ (lambda), whose default is 1.0.
More technical details on the big data approach of Bayesian network based learning can be found in the paper: A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning by Jianwu Wang et al. (http://users.sdsc.edu/~jianwu/JianwuWang_files/A_Scalable_Data_Science_Workflow_Approach_for_Big_Data_Bayesian_Network_Learning.pdf).
Interested readers also should refer to the following publications for more insight into the incremental Bayesian networks:
The current implementation in Spark MLlib supports both multinomial Naive Bayes and Bernoulli Naive Bayes. However, the incremental version has not been implemented yet. Therefore, in this section, we will show you how to perform classification using the Spark MLlib version of Naive Bayes on the RCV1 dataset to provide you with some concepts of Naive Bayes based learning.
Note that, due to the low accuracy and precision obtained with Spark ML, we do not provide the Pipeline version but implement the same using only Spark MLlib. If you have more suitable data, you can try to implement the Spark ML version with ease.
Step 1: Data collection, pre-processing, and exploration
The dataset was downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#aloi and was introduced in the following publication: David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
Pre-processing: For the pre-processing, two steps were considered as follows:
After performing these two steps, the dataset has 53 classes and 47,236 features. Here is a snapshot of the dataset shown in Figure 7:
Step 2: Load the required library and packages
Here is the code to load the library and packages:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
Step 3: Initiate a Spark session
The following code helps us to create the Spark session:
static SparkSession spark = SparkSession
    .builder()
    .appName("JavaLDAExample")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "E:/Exp/")
    .getOrCreate();
Step 4: Prepare LabeledPoint RDDs
Parse the dataset in the libsvm format and prepare LabeledPoint RDDs:
static String path = "input/rcv1_train.multiclass.data";
JavaRDD<LabeledPoint> inputData = MLUtils.loadLibSVMFile(spark.sparkContext(), path).toJavaRDD();
For document classification, the input feature vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of sparsity. Since the training data is only used once, it is not necessary to cache it.
Step 5: Prepare the training and test set
Here is the code to prepare the training and test set:
JavaRDD<LabeledPoint>[] split = inputData.randomSplit(new double[]{0.8, 0.2}, 12345L);
JavaRDD<LabeledPoint> training = split[0];
JavaRDD<LabeledPoint> test = split[1];
Step 6: Train the Naive Bayes model
Train a Naive Bayes model by specifying the model type as multinomial and lambda = 1.0, which is the default and is suitable for multiclass classification with non-negative feature values. Note, however, that Bernoulli Naive Bayes requires feature values of 0 or 1:
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0, "multinomial");
Step 7: Calculate the prediction on the test dataset
Here is the code to calculate the prediction:
JavaPairRDD<Double, Double> predictionAndLabel = test.mapToPair(
    new PairFunction<LabeledPoint, Double, Double>() {
        @Override
        public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<>(model.predict(p.features()), p.label());
        }
    });
Step 8: Calculate the prediction accuracy
Here is the code to calculate the prediction accuracy:
// Accuracy = number of test points whose prediction equals the label / test set size
double accuracy = predictionAndLabel.filter(
    new Function<Tuple2<Double, Double>, Boolean>() {
        @Override
        public Boolean call(Tuple2<Double, Double> pl) {
            return pl._1().equals(pl._2());
        }
    }).count() / (double) test.count();
Step 9: Print the accuracy
Here is the code to print the accuracy:
System.out.println("Accuracy of the classification: "+accuracy);
This provides the following output:
Accuracy of the classification: 0.5941753719531497
This is pretty low, right? As we discussed when we tuned ML models in Chapter 7, Tuning Machine Learning Models, there are further opportunities to improve the prediction accuracy by selecting appropriate algorithms (that is, a classifier or regressor) via cross-validation and train-validation split.