Loan status model

The first model needs to distinguish between bad and good loans. The dataset already provides the loan_status column, which is the best feature representation of our modeling target. Let's look at the column in more detail.

The loan status is represented by a categorical feature that has seven levels:

  • Fully paid: borrower paid the loan and all interest
  • Current: the loan is actively paid in accordance with a plan
  • In grace period: payment is 1-15 days late
  • Late (16-30 days): payment is 16-30 days late
  • Late (31-120 days): payment is 31-120 days late
  • Charged off: a loan is 150 days past the due date
  • Default: a loan was lost

For the first modeling goal, we need to distinguish between good and bad loans. Good loans are the loans that were fully paid. The rest of the loans could be considered bad, with the exception of current loans, which need more attention (for example, survival analysis); alternatively, we could simply remove all rows that contain the "Current" status. To transform the loan_status feature into a binary feature, we define a Spark UDF:

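import org.apache.spark.sql.functions.udf

// Map the loan status to a binary label: only fully paid loans count as good.
val toBinaryLoanStatus = (status: String) => status.trim.toLowerCase() match {
  case "fully paid" => "good loan"
  case _ => "bad loan"
}
val toBinaryLoanStatusUdf = udf(toBinaryLoanStatus)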
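
As a rough illustration of how such a UDF could be applied, the following sketch drops the "Current" rows and then derives the binary label. The object name, the binary_status column, the withBinaryStatus helper, and the sample rows are all hypothetical and exist only for this example; the sketch also assumes that loan_status contains no null values, since the trim call would fail on them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, trim, udf}

object BinaryLoanStatusSketch {
  // Same mapping as in the text: only fully paid loans count as good.
  val toBinaryLoanStatus = (status: String) => status.trim.toLowerCase() match {
    case "fully paid" => "good loan"
    case _            => "bad loan"
  }
  val toBinaryLoanStatusUdf = udf(toBinaryLoanStatus)

  // Hypothetical helper: remove "Current" loans, then add the binary label
  // in a new column (the name binary_status is an assumption).
  def withBinaryStatus(loans: DataFrame): DataFrame =
    loans
      .filter(lower(trim(col("loan_status"))) =!= "current")
      .withColumn("binary_status", toBinaryLoanStatusUdf(col("loan_status")))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("loan-status-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the real loan data.
    val loans = Seq(
      "Fully Paid", "Current", "Charged Off", "Late (31-120 days)", " fully paid "
    ).toDF("loan_status")

    withBinaryStatus(loans).show(false)
    spark.stop()
  }
}
```

Filtering before labeling keeps the "Current" loans out of the "bad loan" class, which is the simpler of the two options mentioned above.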

We can explore the distribution of the individual categories in more detail. In the following screenshot, we can also see that the ratio between good and bad loans is highly unbalanced. We need to keep this fact in mind during the training and evaluation of the model, since we would like to optimize recall, that is, the probability of detecting a bad loan:

Properties of the loan_status column.
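
To make the recall objective more concrete, here is a small plain-Scala sketch (no Spark involved) that treats the bad loan as the positive class and compares accuracy with recall. The RecallSketch object, the Confusion case class, and the label counts are invented for illustration and are not taken from the dataset.

```scala
object RecallSketch {
  // Confusion-matrix counts, with "bad loan" as the positive class.
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
    // Recall: share of actual bad loans that the model detected.
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    // Accuracy: share of all loans that were classified correctly.
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
  }

  // Labels are encoded as 1 = bad loan, 0 = good loan (an assumption of this sketch).
  def confusion(actual: Seq[Int], predicted: Seq[Int]): Confusion = {
    require(actual.length == predicted.length, "label sequences must have equal length")
    val pairs = actual.zip(predicted)
    Confusion(
      tp = pairs.count { case (a, p) => a == 1 && p == 1 },
      fp = pairs.count { case (a, p) => a == 0 && p == 1 },
      tn = pairs.count { case (a, p) => a == 0 && p == 0 },
      fn = pairs.count { case (a, p) => a == 1 && p == 0 }
    )
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: 8 good loans, 2 bad loans, and a model that always predicts "good".
    val actual    = Seq(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
    val predicted = Seq.fill(10)(0)
    val c = confusion(actual, predicted)
    println(s"accuracy = ${c.accuracy}, recall = ${c.recall}")
  }
}
```

In this made-up case the model classifies 8 of 10 loans correctly, yet its recall is zero because it detects neither bad loan. This is why accuracy alone can be misleading on unbalanced data.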