Predicting infant survival

Finally, we can move on to predicting the infants' chances of survival. In this section, we will build two models: a linear classifier (logistic regression) and a non-linear one (a random forest). For the former, we will use all the features at our disposal; for the latter, we will employ the .ChiSqSelector(...) method to select only the top four features.

Logistic regression in MLlib

Logistic regression is something of a benchmark for any classification model. MLlib used to provide a logistic regression model estimated using the stochastic gradient descent (SGD) algorithm; that model was deprecated in Spark 2.0 in favor of the LogisticRegressionWithLBFGS model.

The LogisticRegressionWithLBFGS model uses the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm, a quasi-Newton method that approximates the full BFGS algorithm while using only a limited amount of memory.
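
For reference, a minimal sketch of the objective such an optimizer minimizes, assuming labels $y_i \in \{-1, +1\}$ and leaving any regularization term aside, is the average logistic loss:

$$
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, \mathbf{w}^{\top} \mathbf{x}_i\right)\right)
$$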

Note

For those of you who are mathematically adept and interested in this, we suggest perusing this blog post that is a nice walk-through of the optimization algorithms: http://aria42.com/blog/2014/12/understanding-lbfgs.

First, we train the model on our data:

from pyspark.mllib.classification \
    import LogisticRegressionWithLBFGS
LR_Model = LogisticRegressionWithLBFGS \
    .train(births_train, iterations=10)

Training the model is very simple: we just call the .train(...) method. The required parameter is an RDD of LabeledPoint objects; we also specified the number of iterations so the training does not take too long to run.
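
To make the expected input format concrete, here is a minimal sketch using a toy RDD; the values and the sc SparkContext handle are assumptions made purely for illustration, but births_train is built from LabeledPoint objects in the same way:

from pyspark.mllib.regression import LabeledPoint

# Toy data only: each element is a LabeledPoint with a numeric label
# and a feature vector; births_train has the same structure.
toy_train = sc.parallelize([
    LabeledPoint(1.0, [0.0, 1.0, 19.0]),
    LabeledPoint(0.0, [1.0, 0.0, 23.0])
])
toy_model = LogisticRegressionWithLBFGS.train(toy_train, iterations=10)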

Having trained the model using the births_train dataset, let's use the model to predict the classes for our testing set:

LR_results = (
        births_test.map(lambda row: row.label) 
        .zip(LR_Model 
             .predict(births_test
                      .map(lambda row: row.features)))
    ).map(lambda row: (row[0], row[1] * 1.0))

The preceding snippet creates an RDD of tuples where the first element is the actual label and the second is the model's prediction, cast to a float (which is what the evaluation metrics expect).
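
If you want to peek at a few of these pairs, you can pull them back to the driver; the exact values depend on your run, so we do not reproduce any here:

# Inspect the first five (actual label, predicted label) tuples.
LR_results.take(5)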

MLlib provides evaluation metrics for classification and regression. Let's check how well (or how badly) our model performed:

import pyspark.mllib.evaluation as ev
LR_evaluation = ev.BinaryClassificationMetrics(LR_results)
print('Area under PR: {0:.2f}' 
      .format(LR_evaluation.areaUnderPR))
print('Area under ROC: {0:.2f}' 
      .format(LR_evaluation.areaUnderROC))
LR_evaluation.unpersist()

Here's what we got:

Area under PR: 0.85
Area under ROC: 0.63

The model performed reasonably well: the 85% area under the Precision-Recall curve indicates a good fit. We might be predicting slightly more deaths than actually occur (that is, some false positives alongside the true positives), but in this application that is actually a good thing, as it would allow doctors to put the expectant mother and the infant under special care.

The area under the Receiver Operating Characteristic (ROC) curve can be understood as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one. A value of 63% can be considered acceptable.
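
If you would also like explicit precision and recall figures for the positive class, the same RDD of (label, prediction) pairs can be passed to MulticlassMetrics; this is an extra step we sketch here, not part of the original example:

# MulticlassMetrics expects (prediction, label) pairs, so we swap the
# order of the tuples in LR_results first.
multi_metrics = ev.MulticlassMetrics(
    LR_results.map(lambda row: (row[1], row[0])))
print('Precision: {0:.2f}'.format(multi_metrics.precision(1.0)))
print('Recall: {0:.2f}'.format(multi_metrics.recall(1.0)))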

Selecting only the most predictive features

Any model that uses fewer features while still predicting a class accurately should be preferred to a more complex one. MLlib allows us to select the most predictive features using a Chi-Square selector.

Here's how you do it:

selector = ft.ChiSqSelector(4).fit(births_train)
topFeatures_train = (
        births_train.map(lambda row: row.label) 
        .zip(selector 
             .transform(births_train 
                        .map(lambda row: row.features)))
    ).map(lambda row: reg.LabeledPoint(row[0], row[1]))
topFeatures_test = (
        births_test.map(lambda row: row.label) 
        .zip(selector 
             .transform(births_test 
                        .map(lambda row: row.features)))
    ).map(lambda row: reg.LabeledPoint(row[0], row[1]))

We asked the selector to return the four most predictive features and fitted it on the births_train dataset. We then used the fitted selector to extract only those features from both our training and testing datasets.
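
A quick, purely illustrative sanity check confirms that each element is now a LabeledPoint carrying only four features:

# Each element should be a LabeledPoint with a four-element feature vector.
topFeatures_train.take(1)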

The .ChiSqSelector(...) method can only be used for numerical features; categorical variables need to be either hashed or dummy coded before the selector can be used.
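
For instance, a minimal sketch of hashing a single categorical value into a fixed-length numeric vector before selection might look as follows; the feature value and the number of buckets are assumptions made purely for illustration:

# Hash one categorical value into a 7-bucket numeric vector; HashingTF
# lives in pyspark.mllib.feature, the module aliased as ft above.
hashing = ft.HashingTF(numFeatures=7)
hashed_birth_place = hashing.transform(['birth_place_3'])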

Random forest in MLlib

We are now ready to build the random forest model.

The following code shows you how to do it:

from pyspark.mllib.tree import RandomForest
RF_model = RandomForest \
    .trainClassifier(data=topFeatures_train, 
                     numClasses=2, 
                     categoricalFeaturesInfo={}, 
                     numTrees=6,  
                     featureSubsetStrategy='all',
                     seed=666)

The first parameter to the .trainClassifier(...) method specifies the training dataset. The numClasses parameter indicates how many classes the target variable has. As the third parameter, you can pass a dictionary that maps the index of each categorical feature in the RDD to the number of levels that feature has; ours is empty because all of our features are numeric. numTrees specifies the number of trees in the forest. The featureSubsetStrategy='all' setting tells the model to consider all the features at each split rather than a random subset of them, and the last parameter specifies the seed for the stochastic part of the model.
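
As a purely hypothetical illustration of the third parameter: if the feature at index 1 were categorical with seven distinct levels, we would pass the following mapping instead of the empty dictionary:

# Hypothetical only: the feature at index 1 is categorical with 7 levels.
categoricalFeaturesInfo = {1: 7}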

Let's see how well our model did:

RF_results = (
        topFeatures_test.map(lambda row: row.label) 
        .zip(RF_model 
             .predict(topFeatures_test 
                      .map(lambda row: row.features)))
    )
RF_evaluation = ev.BinaryClassificationMetrics(RF_results)
print('Area under PR: {0:.2f}' 
      .format(RF_evaluation.areaUnderPR))
print('Area under ROC: {0:.2f}' 
      .format(RF_evaluation.areaUnderROC))
RF_evaluation.unpersist()


As it turns out, the random forest model with fewer features performed even better than the logistic regression model trained on the full set of features. Let's see how the logistic regression would perform with a similarly reduced number of features:

LR_Model_2 = LogisticRegressionWithLBFGS \
    .train(topFeatures_train, iterations=10)
LR_results_2 = (
        topFeatures_test.map(lambda row: row.label) 
        .zip(LR_Model_2 
             .predict(topFeatures_test 
                      .map(lambda row: row.features)))
    ).map(lambda row: (row[0], row[1] * 1.0))
LR_evaluation_2 = ev.BinaryClassificationMetrics(LR_results_2)
print('Area under PR: {0:.2f}' 
      .format(LR_evaluation_2.areaUnderPR))
print('Area under ROC: {0:.2f}' 
      .format(LR_evaluation_2.areaUnderROC))
LR_evaluation_2.unpersist()

The results might surprise you.


Both models can be simplified and still attain the same level of accuracy. Having said that, you should always opt for the model with fewer variables.
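
Since we have now repeated the same evaluation pattern three times, you might wrap it in a small helper; this is our own convenience sketch, not part of the original code:

def evaluate(results):
    # Compute the areas under the PR and ROC curves for an RDD of
    # (label, prediction) tuples, then release the cached metrics.
    metrics = ev.BinaryClassificationMetrics(results)
    scores = (metrics.areaUnderPR, metrics.areaUnderROC)
    metrics.unpersist()
    return scores

print('PR: {0:.2f}, ROC: {1:.2f}'.format(*evaluate(LR_results_2)))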
