Tips and performance considerations

Spark also supports cross-validation for hyperparameter tuning, which will be discussed in detail in the following chapter. Spark treats cross-validation as a meta-algorithm that fits the underlying estimator with user-specified combinations of parameters, cross-evaluates the fitted models, and outputs the best one.

There is no specific requirement on the underlying estimator, which could even be a complete pipeline, as long as it can be paired with an evaluator that reduces the predictions to a scalar metric, such as precision or recall.
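Since cross-validation is only covered in depth in the next chapter, the following is a minimal sketch of how these pieces fit together using the spark.ml API; the logistic regression estimator, the grid values, and the training DataFrame (named training here) are illustrative assumptions rather than the code used in this chapter:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.CrossValidator;
import org.apache.spark.ml.tuning.CrossValidatorModel;
import org.apache.spark.ml.tuning.ParamGridBuilder;

// The underlying estimator; a complete Pipeline would work equally well
LogisticRegression lr = new LogisticRegression();

// User-specified combinations of parameters to try (values are assumptions)
ParamMap[] paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam(), new double[] {0.01, 0.1})
    .addGrid(lr.maxIter(), new int[] {50, 100})
    .build();

// The evaluator reduces each fitted model's predictions to a scalar metric
CrossValidator cv = new CrossValidator()
    .setEstimator(lr)
    .setEvaluator(new MulticlassClassificationEvaluator()
        .setMetricName("weightedPrecision"))
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3); // cross-evaluate each combination on 3 folds

// Fits every combination and outputs the best model
CrossValidatorModel bestModel = cv.fit(training);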

Let's recall the OCR prediction once again, where we found that the precision was only 75%, which is obviously not satisfactory. To investigate the reason, let's print the confusion matrix for the label 8.0, that is, the letter "I"; looking at this matrix in Figure 40, you will find that the number of correctly predicted instances is low.
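The extraction code itself is not shown in the text; a minimal sketch follows, assuming a JavaPairRDD of (prediction, label) pairs named predictionAndLabels computed from the test set, and assuming the labels are the contiguous values 0.0 through 25.0, so that row 8 corresponds to the label 8.0:

import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.linalg.Matrix;

MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
Matrix confusion = metrics.confusionMatrix(); // rows = actual, columns = predicted

// Print the row for the actual label 8.0, that is, the letter "I"
int row = 8;
for (int col = 0; col < confusion.numCols(); col++) {
    System.out.print((long) confusion.apply(row, col) + " ");
}
System.out.println();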


Figure 40: The confusion matrix for the label 8.0 or "I"

Now let's try to use the Random Forest model for the prediction. Before going into the model training step, let's initialize the parameters needed for the Random Forest classifier, which, like the logistic regression model with LBFGS, also supports multiclass classification:

Integer numClasses = 26; // One class per letter, A through Z
// Empty map: all features are treated as continuous
HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
Integer numTrees = 5; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose.
String impurity = "gini"; // Splitting criterion for the trees
Integer maxDepth = 20; // Maximum depth of each tree
Integer maxBins = 40; // Maximum number of bins used when discretizing features
Integer seed = 12345; // Fixed seed for reproducibility

Now train the model by specifying the previous parameters, as follows:

final RandomForestModel model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

Now let's see how it performs. We'll reuse the same code segments that we used in step 9. Refer to the screenshots in Figures 41 and 42.
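That evaluation code is not repeated in the text; roughly, it looks like the following sketch, which assumes the test set is a JavaRDD of LabeledPoint instances named test:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import scala.Tuple2;

// Pair each prediction with its true label
JavaPairRDD<Object, Object> predictionAndLabels = test.mapToPair(
    p -> new Tuple2<>(model.predict(p.features()), p.label()));

// Compute the multiclass metrics shown in Figures 41 and 42
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
System.out.println("Precision = " + metrics.weightedPrecision());
System.out.println("Recall = " + metrics.weightedRecall());
System.out.println("Confusion matrix:\n" + metrics.confusionMatrix());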


Figure 41: Performance metrics for the precision and recall


Figure 42: The improved confusion matrix for the label 8.0 or "I"

If you look at Figures 41 and 42, you will see a significant improvement in all the printed metrics: the precision has increased from 75.30% to 89.20%. The reason is that the Random Forest model aggregates the votes of many decision trees, which reduces the variance of a single model and improves the overall prediction accuracy. In the confusion matrix in Figure 42, you will also find a significant improvement in the number of correctly predicted instances, marked by a diagonal arrow.

Through a process of trial and error, you can settle on a short list of algorithms that show promise, but how do you know which is the best? Moreover, as previously mentioned, it is difficult to find a machine learning algorithm that performs well on your dataset out of the box. Therefore, if you are still unsatisfied with the precision of 89.20%, I suggest you tune the parameter values and observe how the precision and recall change.
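As a starting point, here is a hedged sketch of such a tuning loop; the grid values for numTrees and maxDepth are assumptions, and the remaining parameters are reused from the initialization above:

double bestPrecision = 0.0;
for (int trees : new int[] {5, 10, 20}) {
    for (int depth : new int[] {10, 20, 30}) {
        // Retrain with the candidate parameter combination
        final RandomForestModel candidate = RandomForest.trainClassifier(training,
            numClasses, categoricalFeaturesInfo, trees, featureSubsetStrategy,
            impurity, depth, maxBins, seed);
        // Evaluate the candidate on the test set
        JavaPairRDD<Object, Object> predLabels = test.mapToPair(
            p -> new Tuple2<>(candidate.predict(p.features()), p.label()));
        double precision = new MulticlassMetrics(predLabels.rdd()).weightedPrecision();
        System.out.println("trees=" + trees + ", depth=" + depth
            + ", precision=" + precision);
        bestPrecision = Math.max(bestPrecision, precision);
    }
}
System.out.println("Best precision found: " + bestPrecision);

The cross-validation facility described at the beginning of this section automates exactly this kind of grid search, and is discussed in the following chapter.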
