Categorical columns

In the next step, we will explore categorical columns. The H2O parser marks a column as categorical only if it contains a limited set of string values. This is the main difference from columns marked as string columns, which contain more than 90 percent unique values (see, for example, the url column that we explored in the previous paragraph). Let's collect a list of all the categorical columns in our dataset, along with the cardinality of each feature:

val categoricalColumns = loanDataHf.names().indices
  .filter(idx => loanDataHf.vec(idx).isCategorical)  // keep only categorical columns
  .map(idx => (loanDataHf.name(idx), loanDataHf.vec(idx).cardinality()))  // (name, number of levels)
  .sortBy(-_._2)  // sort by cardinality, descending

println(s"Categorical columns:${table(tblize(categoricalColumns, true, 2))}")

The output is as follows:

Now, we can explore individual columns. For example, the "purpose" column contains 13 categories, and its most frequent value is debt consolidation:
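The levels of such a column can also be listed directly. The following is a minimal sketch using the H2O Vec API (domain() returns the array of category labels):

// Print all categorical levels of the "purpose" column
println(loanDataHf.vec("purpose").domain().mkString("\n"))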

This column seems valid, but now we should focus on the suspicious columns, starting with the high-cardinality ones: emp_title, title, and desc. There are several observations:

  • The most frequent value in each column is an empty "value", which could indicate a missing value. However, for this type of column (that is, a column representing a set of values), a dedicated level for missing values makes good sense: it simply represents another possible state, "missing". Hence, we can keep it as it is.
  • The "title" column overlaps with the purpose column and can be dropped (as sketched after this list).
  • The emp_title and desc columns are purely textual descriptions. In this case, we will not treat them as categorical, but will apply NLP techniques later to extract important information from them.
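A minimal sketch of dropping the redundant column, assuming loanDataDf is a Spark DataFrame view of the same data (the name is an assumption for illustration):

// Drop the redundant "title" column; emp_title and desc are kept for later NLP processing
val loanDataWithoutTitleDf = loanDataDf.drop("title")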

Now, we will focus on the columns starting with "mths_". As their names suggest, these columns should contain numeric values, but our parser decided that they are categorical. This could be caused by inconsistencies in the collected data. For example, when we explore the domain of the "mths_since_last_major_derog" column, we can easily spot the reason:

The most common value in the column is an empty value (that is, the same deficiency that we already explored earlier). In this case, we need to decide how to replace this value in order to transform the column into a numeric one: should it be replaced by a missing value?
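To verify how dominant the empty value is, a simple group-by count works; the following is a minimal sketch, again assuming a Spark DataFrame view named loanDataDf:

import org.apache.spark.sql.functions._

// Count the occurrences of each value; the empty string should dominate
loanDataDf.groupBy("mths_since_last_major_derog")
  .count()
  .orderBy(desc("count"))
  .show(5)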

If we want to experiment with different strategies, we can define a flexible transformation for this kind of column. In this situation, we will leave the H2O API, switch to Spark, and define our own Spark UDF. Hence, as in the previous chapters, we will define a function: in this case, a function that, for a given replacement value and a string, produces the float value represented by the string, or returns the replacement value if the string is empty. Then, the function is wrapped into a Spark UDF:

import org.apache.spark.sql.functions._

// Curried function: the first argument fixes the replacement value, the second
// converts a string to Float, falling back to the replacement for empty input
val toNumericMnths = (replacementValue: Float) => (mnths: String) => {
  if (mnths != null && !mnths.trim.isEmpty) mnths.trim.toFloat else replacementValue
}

// Wrap the function into a Spark UDF, replacing empty values with 0.0
val toNumericMnthsUdf = udf(toNumericMnths(0.0f))

A good practice is to keep our code flexible enough to allow experimentation, but not to make it overcomplicated. In this case, we simply keep an open door for cases that we expect to explore in more detail.
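For illustration, the UDF could then be applied to one of the mths_ columns; this is a sketch only, assuming the loanDataDf view introduced above:

// Replace empty strings in the column with 0.0 and convert it to a numeric column
val mnthsFixedDf = loanDataDf.withColumn(
  "mths_since_last_major_derog",
  toNumericMnthsUdf(col("mths_since_last_major_derog")))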

There are two more columns that need our attention: int_rate and revol_util. Both should be numeric columns expressing percentages; however, if we explore them, we can easily see the problem: instead of numeric values, the columns contain the "%" sign. Hence, we have two more candidates for column transformations:

However, we will not process the data directly; instead, we will define a Spark UDF transformation that converts the string-based rate into a numeric rate. In the definition of our UDF, we will simply rely on the information provided by H2O, which confirms that the list of categories in both columns contains only values suffixed by the percent sign:

import org.apache.spark.sql.functions._

// Strip the "%" suffix and convert to Float; empty or null input yields NaN
val toNumericRate = (rate: String) => {
  val num = if (rate != null) rate.stripSuffix("%").trim else ""
  if (!num.isEmpty) num.toFloat else Float.NaN
}

// Wrap the function into a Spark UDF
val toNumericRateUdf = udf(toNumericRate)
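For illustration, the UDF could be applied to both columns as follows; again a sketch, assuming the loanDataDf view:

// Convert both percentage columns to numeric values
val ratesFixedDf = loanDataDf
  .withColumn("int_rate", toNumericRateUdf(col("int_rate")))
  .withColumn("revol_util", toNumericRateUdf(col("revol_util")))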

The defined UDF will be applied in this way later, together with the rest of the Spark transformations. Furthermore, we need to realize that these transformations have to be applied at training time as well as at scoring time. Hence, we will put them into our shared library.
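A minimal sketch of what such a shared library could look like; the package and object names are assumptions for illustration:

package com.example.loanpipeline

import org.apache.spark.sql.functions._

// Shared column transformations used at both training and scoring time
object Transformations extends Serializable {
  // Convert a "mths_" string to Float, falling back to the given replacement value
  val toNumericMnths = (replacementValue: Float) => (mnths: String) =>
    if (mnths != null && !mnths.trim.isEmpty) mnths.trim.toFloat else replacementValue

  // Strip the "%" suffix and convert to Float; empty or null input yields NaN
  val toNumericRate = (rate: String) => {
    val num = if (rate != null) rate.stripSuffix("%").trim else ""
    if (!num.isEmpty) num.toFloat else Float.NaN
  }

  val toNumericMnthsUdf = udf(toNumericMnths(0.0f))
  val toNumericRateUdf = udf(toNumericRate)
}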
