The desc column transformation

The next column we will explore is desc. Our motivation is still to mine any possible information from it and improve the model's quality. The desc column contains free-text descriptions of why the borrower wishes to take out a loan. We are not going to treat them as categorical values, since most of them are unique. Instead, we will apply NLP techniques to extract useful information. In contrast to the emp_title column, we will not use the Word2Vec algorithm; instead, we will try to find words that distinguish bad loans from good loans.

For this goal, we will simply decompose descriptions into individual words (that is, tokenization), assign a weight to each word with the help of tf-idf, and explore which words are most likely to represent good or bad loans. Instead of tf-idf, we could use plain word counts, but tf-idf values provide a better separation between informative words (such as "credit") and common words (such as "loan").

Let's start with the same procedure we performed in the case of the emp_title column: defining transformations that transform the desc column into a list of unified tokens:

import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.StopWordsRemover

// Normalize the raw desc text into a unified form
val descColUnifier = new UDFTransformer("unifier", unifyTextColumn, StringType, StringType)
  .setInputCol("desc")
  .setOutputCol("desc_unified")

// Split the unified text into tokens and remove English stop words
val descColTokenizer = new UDFTransformer("tokenizer",
  tokenizeTextColumn(3)(StopWordsRemover.loadDefaultStopWords("english")),
  StringType, ArrayType(StringType, true))
  .setInputCol("desc_unified")
  .setOutputCol("desc_tokens")
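Both transformers reuse the unifyTextColumn and tokenizeTextColumn helpers defined earlier for the emp_title column. If you are reading this section on its own, the following is only a rough sketch of what those helpers might look like; the actual implementations from the previous section may differ:

// Hypothetical sketch of the text helpers introduced in the emp_title section
val unifyTextColumn = (in: String) => {
  // Lowercase the text and keep only word characters and spaces
  if (in != null) in.toLowerCase.replaceAll("[^\\w ]", "") else null
}

val tokenizeTextColumn = (minLen: Int) => (stopWords: Array[String]) => (in: String) => {
  // Split on spaces, drop tokens shorter than minLen and stop words
  if (in != null)
    in.split(" ").map(_.trim).filter(t => t.length >= minLen && !stopWords.contains(t)).toSeq
  else
    Seq.empty[String]
}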

The transformation prepares a desc_tokens column that contains a list of words for each input desc value. Next, we need to translate the string tokens into numeric form to build the tf-idf model. In this context, we will use CountVectorizer, which extracts the vocabulary of used words and generates a numeric vector for each row. A position in the vector corresponds to a single word in the vocabulary, and the value represents the number of its occurrences. We prefer CountVectorizer over hashing the tokens into a numeric vector, since we would like to keep the relation between a position in the vector and the word it represents. In contrast to Spark's HashingTF, CountVectorizer preserves the bijection between a word and its position in the generated vector. We will reuse this capability later:

import org.apache.spark.ml.feature.CountVectorizer
// Build the vocabulary and count word occurrences per row
val descCountVectorizer = new CountVectorizer()
  .setInputCol("desc_tokens")
  .setOutputCol("desc_vector")
  .setMinDF(1) // keep words that appear in at least one description
  .setMinTF(1) // keep words that appear at least once in a description
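To make the representation concrete, here is a tiny illustration with a made-up vocabulary, showing how CountVectorizer maps token lists to count vectors (it assumes a SparkSession named spark is available, as in the Spark shell):

import spark.implicits._

// Made-up token lists standing in for tokenized desc values
val toyDf = Seq(
  Seq("credit", "card", "debt"),
  Seq("debt", "debt", "consolidation"),
  Seq("credit", "consolidation")
).toDF("desc_tokens")

val toyModel = new CountVectorizer()
  .setInputCol("desc_tokens")
  .setOutputCol("desc_vector")
  .fit(toyDf)

// The vocabulary fixes the meaning of each vector position...
println(toyModel.vocabulary.mkString(", "))
// ...and each row becomes a (possibly sparse) vector of word counts
toyModel.transform(toyDf).show(false)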

Next, we define the IDF estimator, which assigns a weight to each word based on how many descriptions it appears in:

import org.apache.spark.ml.feature.IDF
// Compute inverse document frequency weights for the counted words
val descIdf = new IDF()
  .setInputCol("desc_vector")
  .setOutputCol("desc_idf_vector")
  .setMinDocFreq(1)
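For reference, Spark's IDF estimator computes the weight of a word as log((m + 1) / (df + 1)), where m is the number of documents (descriptions) and df is the number of documents containing the word, so rare words receive higher weights than ubiquitous ones. A small standalone sketch of the formula:

// Spark ML's (smoothed) inverse document frequency:
//   idf(t) = log((m + 1) / (df(t) + 1))
def idfWeight(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

// Hypothetical counts: a word present in nearly every description scores near zero,
// while a rare word scores high
idfWeight(100000, 99000) // ~0.01
idfWeight(100000, 10)    // ~9.12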

When we put all the defined transformations into a single pipeline, we can directly train it on input data:

import org.apache.spark.ml.Pipeline
val descFreqPipeModel = new Pipeline()
.setStages(
Array(descColUnifier,
descColTokenizer,
descCountVectorizer,
descIdf)
).fit(loanStatusBaseModelDf)

Now, we have a pipeline model that can produce a numeric vector for each input desc value. Furthermore, we can inspect the pipeline model's internals and extract the vocabulary from the computed CountVectorizerModel and the individual word weights from the IDFModel:

val descFreqDf = descFreqPipeModel.transform(loanStatusBaseModelDf)
import org.apache.spark.ml.feature.IDFModel
import org.apache.spark.ml.feature.CountVectorizerModel
val descCountVectorizerModel = descFreqPipeModel.stages(2).asInstanceOf[CountVectorizerModel]
val descIdfModel = descFreqPipeModel.stages(3).asInstanceOf[IDFModel]
val descIdfScores = descIdfModel.idf.toArray
val descVocabulary = descCountVectorizerModel.vocabulary
println(
s"""
~Size of 'desc' column vocabulary:
${descVocabulary.length}
~Top ten highest scores:
~
${table(descVocabulary.zip(descIdfScores).sortBy(-_._2).take(10))}
""".stripMargin('~'))

The output is as follows:

At this point, we know the individual word weights; however, we still need to figure out which words are used in the descriptions of "good loans" and which in "bad loans". For this purpose, we will utilize the word-frequency information computed by the prepared pipeline model and stored in the desc_vector column (in fact, this is the output of CountVectorizer). We will sum all these vectors separately for good and then for bad loans:

import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Element-wise sum of the vectors stored in two rows
val rowAdder = (toVector: Row => Vector) => (r1: Row, r2: Row) => {
  Row(Vectors.dense((toVector(r1).toArray, toVector(r2).toArray).zipped.map((a, b) => a + b)))
}

val descTargetGoodLoan = descFreqDf
.where("loan_status == 'good loan'")
.select("desc_vector")
.reduce(rowAdder((row:Row) => row.getAs[Vector](0))).getAs[Vector](0).toArray

val descTargetBadLoan = descFreqDf
.where("loan_status == 'bad loan'")
.select("desc_vector")
.reduce(rowAdder((row:Row) => row.getAs[Vector](0))).getAs[Vector](0).toArray

Having computed these values, we can easily find words that are used only in good or only in bad loans and explore their computed IDF weights:

val descTargetsWords = descTargetGoodLoan.zip(descTargetBadLoan)
  .zip(descVocabulary.zip(descIdfScores)).map(t => (t._1._1, t._1._2, t._2._1, t._2._2))
println(
  s"""
~Words used only in description of good loans:
~
${table(descTargetsWords.filter(t => t._1 > 0 && t._2 == 0).sortBy(-_._1).take(10))}
~
~Words used only in description of bad loans:
~
${table(descTargetsWords.filter(t => t._1 == 0 && t._2 > 0).sortBy(-_._2).take(10))}
""".stripMargin('~'))

The output is as follows:

The produced information does not seem helpful, since we got only very rare words that allow us to detect only a limited number of highly specific loan descriptions. However, we would like to be more generic and find more common words that are used by both loan types but still allow us to distinguish between good and bad loans.

Therefore, we need to design a word score that will target words with high-frequency usage in good (or bad) loans but penalize rare words. For example, we can define it as follows:

def descWordScore = (freqGoodLoan: Double, freqBadLoan: Double, wordIdfScore: Double) =>
Math.abs(freqGoodLoan - freqBadLoan) * wordIdfScore * wordIdfScore
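As a quick sanity check with made-up numbers, the score rewards words whose relative frequencies differ between the two classes and which are informative (high IDF), while a near-zero IDF weight drives the score toward zero regardless of the frequency difference:

// Hypothetical values: the word appears in 5% of good-loan and 1% of bad-loan descriptions
descWordScore(0.05, 0.01, 3.0) // |0.05 - 0.01| * 3.0 * 3.0 = 0.36
// The same frequency difference for an uninformative (low-IDF) word scores near zero
descWordScore(0.05, 0.01, 0.1) // 0.0004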

If we apply the word score method to each word in the vocabulary, we will get a list of words sorted by score in descending order:

val numOfGoodLoans = loanStatusBaseModelDf.where("loan_status == 'good loan'").count()
val numOfBadLoans = loanStatusBaseModelDf.where("loan_status == 'bad loan'").count()

val descDiscriminatingWords = descTargetsWords.filter(t => t._1 > 0 && t._2 > 0).map(t => {
  val freqGoodLoan = t._1 / numOfGoodLoans
  val freqBadLoan = t._2 / numOfBadLoans
  val word = t._3
  val idfScore = t._4
  (word, freqGoodLoan * 100, freqBadLoan * 100, idfScore, descWordScore(freqGoodLoan, freqBadLoan, idfScore))
})
println(
  table(Seq("Word", "Freq Good Loan", "Freq Bad Loan", "Idf Score", "Score"),
    descDiscriminatingWords.sortBy(-_._5).take(100),
    Map(1 -> "%.2f", 2 -> "%.2f")))

The output is as follows:

Based on the produced list, we can identify interesting words. We could take the top 10 or top 100 of them. However, we still need to figure out what to do with them. The solution is easy: for each word, we will generate a new binary feature whose value is 1 if the word is present in the desc value and 0 otherwise:

// Encode the presence of each selected word in a description as a 1.0/0.0 flag
val descWordEncoder = (denominatingWords: Array[String]) => (desc: String) => {
  if (desc != null) {
    val unifiedDesc = unifyTextColumn(desc)
    Vectors.dense(denominatingWords.map(w => if (unifiedDesc.contains(w)) 1.0 else 0.0))
  } else null
}

We can test our idea on the prepared training and validation samples and measure the quality of the model. Again, the first step is to prepare the augmented data with the new feature. In this case, the new feature is a vector that contains the binary flags generated by descWordEncoder:

val trainLSBaseModel4Df = trainLSBaseModel3Df.withColumn("desc_denominating_words", descWordEncoderUdf($"desc")).drop("desc")
val validLSBaseModel4Df = validLSBaseModel3Df.withColumn("desc_denominating_words", descWordEncoderUdf($"desc")).drop("desc")
val trainLSBaseModel4Hf = toHf(trainLSBaseModel4Df, "trainLSBaseModel4Hf")
val validLSBaseModel4Hf = toHf(validLSBaseModel4Df, "validLSBaseModel4Hf")
loanStatusBaseModelParams._train = trainLSBaseModel4Hf._key
val loanStatusBaseModel4 = new DRF(loanStatusBaseModelParams, water.Key.make[DRFModel]("loanStatusBaseModel4"))
.trainModel()
.get()

Now, we just need to compute the model's quality:

val minLossModel4 = findMinLoss(loanStatusBaseModel4, validLSBaseModel4Hf, DEFAULT_THRESHOLDS)
println(f"Min total loss for model 4: ${minLossModel4._2}%,.2f (threshold = ${minLossModel4._1})")

The output is as follows:

We can see that the new feature helps and improves the precision of our model. On the other hand, it also opens a lot of space for experimentation: we can select different words, or even use IDF weights instead of binary values when a word is present in the desc column.
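As an illustration of the latter idea, a hedged sketch of an encoder that emits each word's IDF weight instead of a binary flag, reusing descVocabulary and descIdfScores computed earlier, could look as follows:

// Hypothetical variant: emit the word's IDF weight instead of a 1.0/0.0 flag
val descWordIdfEncoder = (selectedWords: Array[String]) => {
  // Pre-compute the IDF weight of each selected word (assumes they come from descVocabulary)
  val wordWeights = selectedWords.map(w => descIdfScores(descVocabulary.indexOf(w)))
  (desc: String) => {
    if (desc != null) {
      val unifiedDesc = unifyTextColumn(desc)
      Vectors.dense(selectedWords.zip(wordWeights).map { case (w, weight) =>
        if (unifiedDesc.contains(w)) weight else 0.0
      })
    } else null
  }
}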

To summarize our experiments, we will compare the computed results for the three models we produced: (1) the base model, (2) the model trained on the data augmented by the emp_title feature, and (3) the model trained on the data enriched by the desc feature:

println(
s"""
~Results:
~
${table(Seq("Threshold", "Total loss", "Profit loss", "Loan loss"),
Seq(minLossModel2, minLossModel3, minLossModel4),
Map(1 ->"%,.2f", 2 ->"%,.2f", 3 ->"%,.2f"))}
""".stripMargin('~'))

The output is as follows:

Our small experiments demonstrated the powerful concept of feature generation. Each newly generated feature improved the quality of the base model with respect to our model-evaluation criterion.

At this point, we can finish the exploration and training of the first model for detecting good/bad loans. We will use the last model we prepared, since it gives us the best quality. There are still many ways to explore the data and improve model quality further; however, now it is time to build our second model.
