Loan status model

The first model needs to distinguish between bad and good loans. The dataset already provides the loan_status column, which most directly represents our modeling target. Let's look at the column in more detail.

The loan status is represented by a categorical feature that has seven levels:

Fully paid: borrower paid the loan and all interest
Current: the loan is actively paid in accordance with a plan
In grace period: late payment 1-15 days
Late (16-30 days): late payment
Late (31-120 days): late payment
Charged off: a loan is 150 days past the due date
Default: a loan was lost

For the first modeling goal, we need to distinguish between good and bad loans. Good loans could be the loans that were fully paid. The rest of the loans could be considered bad loans, with the exception of current loans, which need more attention (for example, survival analysis); alternatively, we could simply remove all rows that contain the "Current" status. To transform the loan_status feature into a binary feature, we will define a Spark UDF:

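import org.apache.spark.sql.functions.udf

// Map the raw loan_status value to a binary label: only fully paid loans count as good.
val toBinaryLoanStatus = (status: String) => status.trim.toLowerCase() match {
  case "fully paid" => "good loan"
  case _ => "bad loan" // every other status, including "current"
}
// Wrap the function as a Spark UDF so it can be applied to a DataFrame column.
val toBinaryLoanStatusUdf = udf(toBinaryLoanStatus)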
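
Note that the UDF maps every status other than "Fully paid" to "bad loan", so "Current" rows end up labeled as bad unless they are removed first. The following is a minimal sketch of that removal followed by the transformation. It assumes a local SparkSession and a tiny hand-made DataFrame named loans (both are illustrative and not part of the original dataset-loading code), and it redefines the UDF so that the snippet runs on its own:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, trim, udf}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("loan-status-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the real loan data.
val loans = Seq("Fully Paid", "Current", "Charged Off", "Late (31-120 days)")
  .toDF("loan_status")

// Same mapping as in the text: only fully paid loans are good.
val toBinaryLoanStatusUdf = udf((status: String) =>
  status.trim.toLowerCase() match {
    case "fully paid" => "good loan"
    case _ => "bad loan"
  })

// Drop "Current" rows first, then replace loan_status with the binary label.
val labeled = loans
  .filter(lower(trim(col("loan_status"))) =!= "current")
  .withColumn("loan_status", toBinaryLoanStatusUdf(col("loan_status")))

labeled.show()
```

Filtering before applying the UDF keeps the decision about "Current" loans explicit, rather than hiding it in the catch-all case of the pattern match.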

We can explore the distribution of individual categories in more detail. In the following screenshot, we can also see that the ratio between good and bad loans is highly unbalanced. We need to keep this fact in mind during the training and evaluation of the model, since we would like to optimize recall, that is, the probability of detecting a bad loan:

Properties of the loan_status column.
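
To see why the imbalance matters for evaluation, here is a small plain-Scala sketch that compares accuracy with recall for the bad-loan class. The confusion-matrix counts are made up purely for illustration and do not come from the dataset; the actual direction and size of the imbalance should be read from the column properties above:

```scala
// Hypothetical confusion-matrix counts, with "bad loan" as the positive class.
val truePositives = 20L   // bad loans predicted as bad
val falseNegatives = 80L  // bad loans predicted as good (missed)
val falsePositives = 10L  // good loans predicted as bad
val trueNegatives = 890L  // good loans predicted as good

val total = truePositives + falseNegatives + falsePositives + trueNegatives

// Accuracy: share of all loans that were classified correctly.
val accuracy = (truePositives + trueNegatives).toDouble / total

// Recall for the bad-loan class: share of actual bad loans that were detected.
val recallBad = truePositives.toDouble / (truePositives + falseNegatives)

println(f"accuracy = $accuracy%.2f, recall (bad loan) = $recallBad%.2f")
```

With these invented counts the accuracy looks high while most bad loans go undetected, which is why recall on the bad-loan class is the more informative measure here.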
