Base model

At this point, we have prepared the target prediction column and cleaned up the input data, so we can now build a base model. The base model gives us basic intuition about the data. For this purpose, we will use all columns except those detected as useless. We will also skip the handling of missing values, since we will use H2O and the RandomForest algorithm, which can handle missing values. However, the first step is to prepare a dataset with the help of the defined Spark transformations:

import com.packtpub.mmlwspark.chapter8.Chapter8Library._
val loanDataDf = h2oContext.asDataFrame(loanDataHf)(sqlContext)
val loanStatusBaseModelDf = basicDataCleanup(
  loanDataDf
    .where("loan_status is not null")
    .withColumn("loan_status", toBinaryLoanStatusUdf($"loan_status")),
  colsToDrop = Seq("title") ++ columnsToRemove)

We will simply drop all known columns that are correlated with our target prediction column and all high-cardinality categorical columns that carry a text description (except emp_title and desc, which we will use later), and apply all the basic cleanup transformations we identified in the earlier sections.
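
For reference, here is a minimal sketch of what the binarization UDF imported from the chapter library might do; the exact set of loan_status values mapped to a good loan is an assumption, so treat it as an illustration rather than the library's implementation:

import org.apache.spark.sql.functions.udf

// Hypothetical sketch only: the real toBinaryLoanStatusUdf is provided by the
// chapter's shared library; the mapping of loan_status values is an assumption.
val toBinaryLoanStatusUdfSketch = udf { (status: String) =>
  if (status.trim.toLowerCase == "fully paid") "good loan" // repaid loans are treated as good
  else "bad loan"                                          // charged off, defaulted, and so on
}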

The next step involves splitting the data into two parts. As usual, we will keep the majority of the data for training and the rest for model validation, and transform both parts into a form that is accepted by the H2O model builders:

val loanStatusDfSplits = loanStatusBaseModelDf.randomSplit(Array(0.7, 0.3), seed = 42)

val trainLSBaseModelHf = toHf(loanStatusDfSplits(0).drop("emp_title", "desc"), "trainLSBaseModelHf")(h2oContext)
val validLSBaseModelHf = toHf(loanStatusDfSplits(1).drop("emp_title", "desc"), "validLSBaseModelHf")(h2oContext)

def toHf(df: DataFrame, name: String)(h2oContext: H2OContext): H2OFrame = {
  val hf = h2oContext.asH2OFrame(df, name)
  val allStringColumns = hf.names().filter(name => hf.vec(name).isString)
  hf.colToEnum(allStringColumns)
  hf
}


With the cleaned-up data, we can easily build a model. We will blindly use the RandomForest algorithm, since it gives us direct insight into the data and the importance of individual features. We say "blindly" because, as you may recall from Chapter 2, Detecting Dark Matter - The Higgs-Boson Particle, RandomForest can accept inputs of many different types and builds many different trees from different subsets of features, which gives us confidence in using it as our out-of-the-box model when including all our features. The model therefore also defines a baseline that we would like to improve by engineering new features.

We will use the default settings. RandomForest provides out-of-the-box validation based on out-of-bag samples, so we can skip cross-validation for now. However, we will increase the number of constructed trees and limit the model builder's execution with a logloss-based stopping criterion. Furthermore, we know that the prediction target is imbalanced (the number of good loans is much higher than the number of bad loans), so we will ask for upsampling of the minority class by enabling the balance_classes option:


import _root_.hex.tree.drf.DRFModel.DRFParameters
import _root_.hex.tree.drf.{DRF, DRFModel}
import _root_.hex.ScoreKeeper.StoppingMetric
import com.packtpub.mmlwspark.utils.Utils.let

val loanStatusBaseModelParams = let(new DRFParameters) { p =>
  p._response_column = "loan_status"
  p._train = trainLSBaseModelHf._key
  p._ignored_columns = Array("int_rate")
  p._stopping_metric = StoppingMetric.logloss
  p._stopping_rounds = 1
  p._stopping_tolerance = 0.1
  p._ntrees = 100
  p._balance_classes = true
  p._score_tree_interval = 20
}
val loanStatusBaseModel1 = new DRF(loanStatusBaseModelParams, water.Key.make[DRFModel]("loanStatusBaseModel1"))
  .trainModel()
  .get()

When the model is built, we can explore its quality, as we did in the previous chapters, but our first look will be at the importance of the features:

The most surprising fact is that the zip_code and collection_recovery_fee features have a much higher importance than the rest of the columns. This is suspicious and could indicate a direct correlation between these columns and the target variable.

We can revisit the data dictionary, which describes the zip_code column as "the first three numbers of the zip code provided by the borrower in the loan application" and the collection_recovery_fee column as "post-charge off collection fee". The latter indicates a direct connection to the response column, since "good loans" will have a value equal to zero. We can also validate this fact by exploring the data, as shown below. In the case of zip_code, there is no obvious connection to the response column.
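
For example, a quick Spark aggregation over the prepared DataFrame (a sketch; it assumes the collection_recovery_fee column survived the basic cleanup) makes the relationship visible:

import org.apache.spark.sql.functions.{avg, max}

// The fee should be zero for good loans and non-zero only for charged-off loans.
loanStatusBaseModelDf
  .groupBy("loan_status")
  .agg(avg("collection_recovery_fee"), max("collection_recovery_fee"))
  .show()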

We will therefore do one more model run, but in this case, we will try to ignore both the zip_code and collection_recovery_fee columns:

loanStatusBaseModelParams._ignored_columns = Array("int_rate", "collection_recovery_fee", "zip_code")
val loanStatusBaseModel2 = new DRF(loanStatusBaseModelParams, water.Key.make[DRFModel]("loanStatusBaseModel2"))
.trainModel()
.get()

After the model is built, we can explore the variable importance graph again and see a more meaningful distribution of importance among the variables. Based on the graph, we can decide to use only the top 10 input features to reduce the model's complexity and decrease the modeling time. It is important to mention that we still need to consider the removed columns as relevant input features.
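
If we want to pick the top 10 features programmatically rather than reading them off the graph, something along the following lines should work. It assumes that the DRF model output exposes the variable importances through the _output._varimp field with parallel _names and _varimp arrays, so double-check this against your H2O version:

// Hedged sketch: print the ten most important features of the second base model.
val varImp = loanStatusBaseModel2._output._varimp
val top10Features = varImp._names
  .zip(varImp._varimp.map(_.toDouble))
  .sortBy { case (_, importance) => -importance }
  .take(10)
top10Features.foreach { case (name, importance) => println(f"$name%-25s $importance%,.2f") }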

Base model performance

Now, we can look at the model performance of the created model. We need to keep in mind that in our case, the following applies:

  • The performance of the model is reported on out-of-bag samples, not on unseen data.
  • We used fixed parameters as a best guess; however, it would be beneficial to perform a random parameter search to see how the input parameters influence the model's performance (a minimal sketch follows this list).
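
As a taste of what such a search could look like, here is a hedged sketch of a small random search over a few DRF hyper-parameters. It reuses the frames and the DRF imports from the previous snippets; the searched ranges are arbitrary, and it assumes the AUC value can be read from the AUC2 object returned by auc_obj() via its _auc field, so verify this against your H2O version:

import _root_.hex.ModelMetrics
import _root_.hex.ModelMetricsBinomial
import scala.util.Random

val rng = new Random(42)

// Build a handful of randomly parameterized DRF models (hypothetical search space).
val searchResults = (1 to 5).map { i =>
  val p = new DRFParameters
  p._response_column = "loan_status"
  p._train = trainLSBaseModelHf._key
  p._ignored_columns = Array("int_rate", "collection_recovery_fee", "zip_code")
  p._balance_classes = true
  p._ntrees = 50 + rng.nextInt(100)            // 50-149 trees
  p._max_depth = 10 + rng.nextInt(11)          // tree depth between 10 and 20
  p._sample_rate = 0.5 + rng.nextDouble() / 2  // row sampling rate between 0.5 and 1.0

  val model = new DRF(p, water.Key.make[DRFModel](s"loanStatusSearchModel$i")).trainModel().get()
  model.score(validLSBaseModelHf).delete()     // score on validation data, drop the prediction frame
  val auc = ModelMetrics.getFromDKV(model, validLSBaseModelHf)
    .asInstanceOf[ModelMetricsBinomial].auc_obj()._auc
  (p._ntrees, p._max_depth, p._sample_rate, auc)
}

// Show the tried configurations ordered by validation AUC.
searchResults.sortBy { case (_, _, _, auc) => -auc }.foreach(println)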

We can see that the AUC measured on the out-of-bag sample of data is quite high. Even the individual class errors are low for the selected threshold, which maximizes the minimum per-class accuracy. However, let's explore the performance of the model on the unseen data. We will use the part of the data we prepared for validation:

import _root_.hex.{ModelMetrics, ModelMetricsBinomial}
val lsBaseModel2PredHf = loanStatusBaseModel2.score(validLSBaseModelHf)
val lsBaseModel2PredModelMetrics = ModelMetrics.getFromDKV(loanStatusBaseModel2, validLSBaseModelHf).asInstanceOf[ModelMetricsBinomial]
println(lsBaseModel2PredModelMetrics)

The output is as follows:

The computed model metrics can be explored visually in the Flow UI as well.

We can see that the AUC is lower and the individual class errors are higher, but they are still reasonably good. However, all the measured statistical properties do not give us any notion of the "business" value of the model: how much money was lent, how much money was lost on defaulted loans, and so on. In the next step, we will try to design ad hoc evaluation metrics for the model.

What does it mean when the model makes a wrong prediction? It can consider a good loan application as bad, which results in the rejection of the application and, therefore, the loss of the profit from the loan interest. Alternatively, the model can recommend a bad loan application as good, which causes the loss of the full or partial amount of the lent money. Let's look at both situations in more detail.

The former situation can be described by the following function:

def profitMoneyLoss = (predThreshold: Double) =>
  (act: String, predGoodLoanProb: Double, loanAmount: Int, intRate: Double, term: String) => {
    val termInMonths = term.trim match {
      case "36 months" => 36
      case "60 months" => 60
    }
    val intRatePerMonth = intRate / 12 / 100
    if (predGoodLoanProb < predThreshold && act == "good loan") {
      termInMonths * loanAmount * intRatePerMonth / (1 - Math.pow(1 + intRatePerMonth, -termInMonths)) - loanAmount
    } else 0.0
  }

The function returns the amount of money lost if the model predicts a bad loan but the actual data indicates that the loan was good. The returned amount considers the predicted interest rate and term. The important variables are predGoodLoanProb, which holds the model's predicted probability that the actual loan is a good loan, and predThreshold, which allows us to set the bar at which the probability of predicting a good loan is good enough for us.
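
To make the formula concrete, here is a tiny worked example with made-up numbers: rejecting a good 36-month, 10,000 USD loan with a 12% annual interest rate costs us the sum of the annuity payments minus the principal:

// Hypothetical numbers, only to illustrate the annuity formula used in profitMoneyLoss
val loanAmount = 10000.0
val termInMonths = 36
val intRatePerMonth = 12.0 / 12 / 100                        // 12% per year -> 0.01 per month
val monthlyPayment = loanAmount * intRatePerMonth /
  (1 - Math.pow(1 + intRatePerMonth, -termInMonths))         // standard annuity payment, ~332.14
val lostProfit = termInMonths * monthlyPayment - loanAmount  // total interest, ~1,957.15
println(f"Lost profit: ${lostProfit}%,.2f USD")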

In a similar way, we will describe the latter situation:


val loanMoneyLoss = (predThreshold: Double) =>
  (act: String, predGoodLoanProb: Double, loanAmount: Int) => {
    if (predGoodLoanProb > predThreshold /* good loan predicted */
        && act == "bad loan" /* actual is bad loan */) loanAmount else 0
  }

It is good to realize that we are just following the confusion matrix definition of false positives and false negatives and applying our domain knowledge of the input data to define ad hoc model evaluation metrics.
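
To make the link to the confusion matrix explicit, here is a hypothetical helper (not part of the chapter library) that counts the four confusion-matrix cells for a given threshold. It assumes the actual-versus-predicted DataFrame used below carries the actual loan_status label and the predicted "good loan" probability column:

// Hypothetical helper: confusion-matrix counts for a given good-loan probability threshold.
def confusionCounts(actPredDf: DataFrame, threshold: Double): (Long, Long, Long, Long) = {
  val predictedGood = $"good loan" > threshold
  val tp = actPredDf.where(predictedGood && ($"loan_status" === "good loan")).count  // accepted good loans
  val fp = actPredDf.where(predictedGood && ($"loan_status" === "bad loan")).count   // accepted bad loans -> loanMoneyLoss
  val fn = actPredDf.where(!predictedGood && ($"loan_status" === "good loan")).count // rejected good loans -> profitMoneyLoss
  val tn = actPredDf.where(!predictedGood && ($"loan_status" === "bad loan")).count  // rejected bad loans
  (tp, fp, fn, tn)
}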

Now, it is time to utilize both functions and define totalLoss: how much money we can lose by accepting bad loans and missing good loans if we follow our model's recommendations:


import org.apache.spark.sql.Row
def totalLoss(actPredDf: DataFrame, threshold: Double): (Double, Double, Long, Double, Long, Double) = {

  val profitMoneyLossUdf = udf(profitMoneyLoss(threshold))
  val loanMoneyLossUdf = udf(loanMoneyLoss(threshold))

  val lostMoneyDf = actPredDf
    .where("loan_status is not null and loan_amnt is not null")
    .withColumn("profitMoneyLoss", profitMoneyLossUdf($"loan_status", $"good loan", $"loan_amnt", $"int_rate", $"term"))
    .withColumn("loanMoneyLoss", loanMoneyLossUdf($"loan_status", $"good loan", $"loan_amnt"))

  lostMoneyDf
    .agg("profitMoneyLoss" -> "sum", "loanMoneyLoss" -> "sum")
    .collect.apply(0) match {
      case Row(profitMoneyLossSum: Double, loanMoneyLossSum: Double) =>
        (threshold,
          profitMoneyLossSum, lostMoneyDf.where("profitMoneyLoss > 0").count,
          loanMoneyLossSum, lostMoneyDf.where("loanMoneyLoss > 0").count,
          profitMoneyLossSum + loanMoneyLossSum
        )
    }
}

The totalLoss function is defined for a Spark DataFrame and a threshold. The Spark DataFrame holds the actual validation data joined with the prediction, which is composed of three columns: the actual prediction for the default threshold, the probability of a good loan, and the probability of a bad loan. The threshold helps us define the right bar for the good loan probability; that is, if the good loan probability is higher than the threshold, we can consider that the model recommends accepting the loan.

If we run the function for different thresholds, including one that minimizes individual class errors, we will get the following table:

import _root_.hex.AUC2.ThresholdCriterion
val predVActHf: Frame = lsBaseModel2PredHf.add(validLSBaseModelHf)
water.DKV.put(predVActHf)
val predVActDf = h2oContext.asDataFrame(predVActHf)(sqlContext)
val DEFAULT_THRESHOLDS = Array(0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95)

println(
  table(Array("Threshold", "Profit Loss", "Count", "Loan loss", "Count", "Total loss"),
    (DEFAULT_THRESHOLDS :+
      ThresholdCriterion.min_per_class_accuracy.max_criterion(lsBaseModel2PredModelMetrics.auc_obj()))
      .map(threshold => totalLoss(predVActDf, threshold)),
    Map(1 -> "%,.2f", 3 -> "%,.2f", 5 -> "%,.2f")))

The output is as follows:

From the table, we can see that the lowest total loss for our metric corresponds to the threshold 0.85, which represents quite a conservative strategy focused on avoiding bad loans.

We can even define a function that finds the minimal total loss and corresponding threshold:

def findMinLoss(model: DRFModel,
                validHf: H2OFrame,
                defaultThresholds: Array[Double]): (Double, Double, Double, Double) = {
  import _root_.hex.{ModelMetrics, ModelMetricsBinomial}
  import _root_.hex.AUC2.ThresholdCriterion
  // Score the model on the validation frame
  val modelPredHf = model.score(validHf)
  val modelMetrics = ModelMetrics.getFromDKV(model, validHf).asInstanceOf[ModelMetricsBinomial]
  val predVActHf: Frame = modelPredHf.add(validHf)
  water.DKV.put(predVActHf)
  // Compute the total loss for each threshold and pick the minimum
  val predVActDf = h2oContext.asDataFrame(predVActHf)(sqlContext)
  val min = (defaultThresholds :+ ThresholdCriterion.min_per_class_accuracy.max_criterion(modelMetrics.auc_obj()))
    .map(threshold => totalLoss(predVActDf, threshold)).minBy(_._6)
  ( /* Threshold */ min._1, /* Total loss */ min._6, /* Profit loss */ min._2, /* Loan loss */ min._4)
}
val minLossModel2 = findMinLoss(loanStatusBaseModel2, validLSBaseModelHf, DEFAULT_THRESHOLDS)
println(f"Min total loss for model 2: ${minLossModel2._2}%,.2f (threshold = ${minLossModel2._1})")

The output is as follows:

Based on the reported results, we can see that the model minimizes the total loss at a threshold of ~0.85, which is higher than the default threshold identified by the model (the F1-maximizing threshold of 0.66). However, we need to realize that this is just a naive base model; we did not perform any tuning or search for the right training parameters. We still have two fields, emp_title and desc, which we can utilize. It's time for model improvements!
