Finally, we can move on to predicting the infants' survival chances. In this section, we will build two models: a linear classifier (logistic regression) and a non-linear one (a random forest). For the former, we will use all the features at our disposal, whereas for the latter, we will employ the ChiSqSelector(...) method to select the top four features.
Logistic regression is somewhat of a benchmark for building any classification model. MLlib used to provide a logistic regression model estimated using the stochastic gradient descent (SGD) algorithm; that model was deprecated in Spark 2.0 in favor of the LogisticRegressionWithLBFGS model.
The LogisticRegressionWithLBFGS model uses the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm, a quasi-Newton method that approximates the full BFGS algorithm while using only a limited amount of memory.
For those of you who are mathematically adept and interested in the details, we suggest perusing this blog post, which is a nice walk-through of the optimization algorithms: http://aria42.com/blog/2014/12/understanding-lbfgs.
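To build some intuition for what the optimizer does, here is a minimal pure-Python sketch that fits a logistic regression on a toy dataset using plain gradient descent. This is an illustration only, not MLlib code: L-BFGS minimizes the same negative log-likelihood, but converges much faster by approximating second-order (curvature) information. All names and the toy data below are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, lr=0.5, iterations=100):
    """Fit weights for y ~ sigmoid(w . x) by plain gradient descent.

    points: list of (label, [feature, ...]) tuples with label in {0, 1}.
    """
    n = len(points[0][1])
    w = [0.0] * n
    for _ in range(iterations):
        grad = [0.0] * n
        for y, x in points:
            # Prediction error drives the gradient of the log-loss.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(n):
                grad[j] += err * x[j]
        # Average the gradient and take a step downhill.
        w = [wj - lr * gj / len(points) for wj, gj in zip(w, grad)]
    return w

# Toy data: label is 1 when the single feature is positive.
data = [(1, [2.0]), (1, [1.0]), (0, [-1.0]), (0, [-2.0])]
w = train_logistic(data)
```

After training, the learned weight is positive, so positive feature values map to predicted probabilities above 0.5 and negative ones below it, which is exactly the decision boundary we would hope for on this data.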
First, we train the model on our data:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

LR_Model = LogisticRegressionWithLBFGS.train(
    births_train, iterations=10)
Training the model is very simple: we just need to call the .train(...) method. The only required parameter is an RDD of LabeledPoint objects; we also specified the number of iterations so the training does not take too long to run.
Having trained the model using the births_train dataset, let's use it to predict the classes for our testing set:
LR_results = (
    births_test
    .map(lambda row: row.label)
    .zip(LR_Model.predict(
        births_test.map(lambda row: row.features)))
).map(lambda row: (row[0], row[1] * 1.0))
The preceding snippet creates an RDD where each element is a tuple, with the first element being the actual label and the second one, the model's prediction.
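The same pairing can be sketched in plain Python on toy lists (hypothetical values, not our RDDs) to show what the final `* 1.0` is for: BinaryClassificationMetrics expects pairs of doubles, while the classifier returns integer class labels.

```python
# Toy stand-ins for the actual labels and the model's predictions.
actual = [1.0, 0.0, 1.0, 0.0]   # true labels
predicted = [1, 0, 0, 0]        # classifier output (integers)

# Pair each true label with its prediction, casting the prediction
# to float, mirroring the (row[0], row[1] * 1.0) step above.
results = [(a, p * 1.0) for a, p in zip(actual, predicted)]
```

Each element of `results` is now a `(label, prediction)` tuple of floats, the shape the evaluation metrics in the next step consume.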
MLlib provides a number of evaluation metrics for classification and regression. Let's check how well (or how poorly) our model performed:
import pyspark.mllib.evaluation as ev

LR_evaluation = ev.BinaryClassificationMetrics(LR_results)

print('Area under PR: {0:.2f}'
      .format(LR_evaluation.areaUnderPR))
print('Area under ROC: {0:.2f}'
      .format(LR_evaluation.areaUnderROC))

LR_evaluation.unpersist()
Here's what we got:

Area under PR: 0.85
Area under ROC: 0.63

The model performed reasonably well! An 85% area under the precision-recall curve indicates a good fit. In this case, we might be getting slightly more predicted deaths (true and false positives); here, that is actually a good thing, as it would allow doctors to put the expectant mother and the infant under special care.
The area under the receiver operating characteristic (ROC) curve can be understood as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one. A 63% value can be thought of as acceptable.
For more on these metrics, we point interested readers to http://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves and http://gim.unmc.edu/dxtests/roc3.htm.
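That ranking interpretation of the ROC area is easy to verify with a small pure-Python sketch. The scores below are hypothetical, not our model's output; the function simply counts, over all positive/negative pairs, how often the positive example scores higher (counting ties as half).

```python
def area_under_roc(pairs):
    """AUC = probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one."""
    pos = [s for label, s in pairs if label == 1.0]
    neg = [s for label, s in pairs if label == 0.0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# (label, score) pairs -- made-up scores for illustration.
pairs = [(1.0, 0.9), (1.0, 0.4), (0.0, 0.6), (0.0, 0.1)]
print(area_under_roc(pairs))  # 3 of 4 pairs ranked correctly -> 0.75
```

A perfect ranker scores 1.0, random guessing hovers around 0.5, which is why the 63% above is only modestly better than chance.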
Any model that can accurately predict a class using fewer features should always be preferred to a more complex one. MLlib allows us to select the most predictive features using a Chi-square selector.
Here's how you do it:
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg

selector = ft.ChiSqSelector(4).fit(births_train)

topFeatures_train = (
    births_train
    .map(lambda row: row.label)
    .zip(selector.transform(
        births_train.map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))

topFeatures_test = (
    births_test
    .map(lambda row: row.label)
    .zip(selector.transform(
        births_test.map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))
We asked the selector to return the four most predictive features and trained it on the births_train dataset. We then used the fitted selector to extract only those features from our training and testing datasets.
The .ChiSqSelector(...) method can only be used with numerical features; categorical variables need to be either hashed or dummy-coded before the selector can be used.
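To illustrate what the selector measures under the hood, here is a pure-Python sketch of the classic Chi-square statistic on a 2x2 contingency table (binary feature versus binary label). This is an illustration of the idea, not MLlib's implementation, and the toy tables are made up: a high statistic means the feature's value and the label co-occur far more often than chance would predict, so the feature is worth keeping.

```python
def chi_square(observed):
    """Chi-square statistic for a 2x2 contingency table,
    indexed as observed[feature_value][label]."""
    row = [sum(r) for r in observed]          # feature-value totals
    col = [sum(c) for c in zip(*observed)]    # label totals
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if feature and label were independent.
            expected = row[i] * col[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# A perfectly informative feature: value 0 -> label 0, value 1 -> label 1.
informative = [[10, 0], [0, 10]]
# A useless feature: labels split evenly regardless of its value.
useless = [[5, 5], [5, 5]]
print(chi_square(informative))  # 20.0
print(chi_square(useless))      # 0.0
```

Ranking features by this statistic and keeping the top k is, conceptually, what `ChiSqSelector(4)` does for us.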
We are now ready to build the random forest model.
The following code shows you how to do it:
from pyspark.mllib.tree import RandomForest

RF_model = RandomForest.trainClassifier(
    data=topFeatures_train,
    numClasses=2,
    categoricalFeaturesInfo={},
    numTrees=6,
    featureSubsetStrategy='all',
    seed=666)
The first parameter to the .trainClassifier(...) method specifies the training dataset. The numClasses parameter indicates how many classes our target variable has. As the third parameter, you can pass a dictionary where each key is the index of a categorical feature in our RDD and its value is the number of levels that the categorical feature has. The numTrees parameter specifies the number of trees in the forest. The next parameter tells the model to use all the features in our dataset instead of keeping only the most descriptive ones, while the last one specifies the seed for the stochastic part of the model.
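The ensemble idea behind numTrees can be sketched in a few lines of plain Python: each tree casts a vote for a class, and the forest returns the majority. The stump "trees" below are made up for illustration and bear no relation to what MLlib actually learns from our data.

```python
from collections import Counter

def forest_predict(trees, features):
    """Each tree votes on a class; the forest returns the majority."""
    votes = [tree(features) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical decision stumps, each thresholding one feature.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.3 else 0,
    lambda x: 0,  # a deliberately pessimistic tree
]

print(forest_predict(trees, [0.9, 0.1]))  # votes 1, 0, 0 -> 0
print(forest_predict(trees, [0.9, 0.9]))  # votes 1, 1, 0 -> 1
```

Averaging many de-correlated trees in this way is what makes the forest more robust than any single tree; `featureSubsetStrategy` controls how the trees are de-correlated (here we chose 'all', so every tree sees every feature).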
Let's see how well our model did:
RF_results = (
    topFeatures_test
    .map(lambda row: row.label)
    .zip(RF_model.predict(
        topFeatures_test.map(lambda row: row.features)))
)

RF_evaluation = ev.BinaryClassificationMetrics(RF_results)

print('Area under PR: {0:.2f}'
      .format(RF_evaluation.areaUnderPR))
print('Area under ROC: {0:.2f}'
      .format(RF_evaluation.areaUnderROC))

RF_evaluation.unpersist()
Here are the results:
As you can see, the random forest model with fewer features performed even better than the logistic regression model. Let's see how logistic regression would perform with a reduced number of features:
LR_Model_2 = LogisticRegressionWithLBFGS.train(
    topFeatures_train, iterations=10)

LR_results_2 = (
    topFeatures_test
    .map(lambda row: row.label)
    .zip(LR_Model_2.predict(
        topFeatures_test.map(lambda row: row.features)))
).map(lambda row: (row[0], row[1] * 1.0))

LR_evaluation_2 = ev.BinaryClassificationMetrics(LR_results_2)

print('Area under PR: {0:.2f}'
      .format(LR_evaluation_2.areaUnderPR))
print('Area under ROC: {0:.2f}'
      .format(LR_evaluation_2.areaUnderROC))

LR_evaluation_2.unpersist()
The results might surprise you:
As you can see, both models can be simplified and still attain the same level of accuracy. That said, you should always opt for the model with fewer variables.