Chapter 8. Advanced data preparation

This chapter covers

  • Using the vtreat package for advanced data preparation
  • Cross-validated data preparation

In our last chapter, we built substantial models on nice or well-behaved data. In this chapter, we will learn how to prepare or treat messy real-world data for modeling. We will use the principles of chapter 4 and the advanced data preparation package: vtreat. We will revisit the issues that arise with missing values, categorical variables, recoding variables, redundant variables, and having too many variables. We will spend some time on variable selection, which is an important step even with current machine learning methods. The mental model summary (figure 8.1) of this chapter emphasizes that this chapter is about working with data and preparing for machine learning modeling. We will first introduce the vtreat package, then work a detailed real-world problem, and then go into more detail about using the vtreat package.

Figure 8.1. Mental model

8.1. The purpose of the vtreat package

vtreat is an R package designed to prepare real-world data for supervised learning or predictive modeling. It is designed to deal with a lot of common issues, so the data scientist doesn’t have to. This leaves them much more time to find and work on unique domain-dependent issues. vtreat is an excellent realization of the concepts discussed in chapter 4 as well as many other concepts. One of the goals of chapter 4 was to give you an understanding of some of the issues we can run into working with data, and principled steps to take in dealing with such data. vtreat automates these steps into a high-performance production-capable package, and is a formally citable methodology you can incorporate into your own work. We can’t succinctly explain everything vtreat does with data, as it does a lot; for details please see the long-form documentation here: https://arxiv.org/abs/1611.09477. In addition, vtreat has many explanatory vignettes and worked examples here: https://CRAN.R-project.org/package=vtreat.

We will work through vtreat’s capabilities in this chapter using an example of predicting account cancellation (called customer churn) using the KDD Cup 2009 dataset. In this example scenario, we will use vtreat to prepare the data for use in later modeling steps. Some of the issues vtreat helps with include the following:

  • Missing values in numeric variables
  • Extreme or out-of-range values in numeric variables
  • Missing values in categorical variables
  • Rare values in categorical data
  • Novel values (values seen during testing or application, but not during training) in categorical data
  • Categorical data with very many possible values
  • Overfit due to a large number of variables
  • Overfit due to “nested model bias”

The basic vtreat workflow (shown in figure 8.2) is to use some of the training data to create a treatment plan that records key characteristics of the data such as relationships between individual variables and the outcome. This treatment plan is then used to prepare data that will be used to fit the model, as well as to prepare data that the model will be applied to. The idea is that this prepared or treated data will be “safe,” with no missing or unexpected values, and will possibly have new synthetic variables that will improve the model fitting. In this sense, vtreat itself looks a lot like a model.

Figure 8.2. vtreat three-way split strategy

We saw a simple use of vtreat in chapter 4 to treat missing values. In this chapter, we will use vtreat’s full coding power on our customer churn example. For motivation, we will solve the KDD Cup 2009 problem, and then we will discuss how to use vtreat in general.

The KDD Cup 2009 provided a dataset about customer relationship management. This contest data supplied 230 facts about 50,000 credit card accounts. From these features, one of the contest goals was to predict account cancellation (called churn).

The basic way to use vtreat is with a three-way data split: one set for learning the data treatment, one for modeling, and a third for estimating the model quality on new data. Figure 8.2 shows the concept, which will serve as a good mnemonic once we have worked an example. As the diagram shows, to use vtreat in this manner, we split the data three ways and use one subset to prepare the treatment plan. Then we use the treatment plan to prepare the other two subsets: one subset to fit the desired model, and the other subset to evaluate the fitted model. The process may seem complicated, but from the user’s point of view it is very simple.

Let’s start with a look at an example scenario using vtreat with the KDD Cup 2009 account cancellation prediction problem.

8.2. KDD and KDD Cup 2009

Example

We are given the task of predicting which credit card accounts will cancel in a given time period. This sort of cancellation is called churn. To build our model, we have supervised training data available. For each account in the training data, we have hundreds of measured features and we know whether the account later cancelled. We want to build a model that identifies “at risk of canceling” accounts both in this data and in future applications.

To simulate this scenario, we will use the KDD Cup 2009 contest dataset.[1]

1

We share the data and steps to prepare this data for modeling in R here: https://github.com/WinVector/PDSwR2/tree/master/KDD2009.

Shortcomings of the data

As with many score-based competitions, this contest concentrated on machine learning and deliberately abstracted out or skipped over a number of important data science issues, such as cooperatively defining goals, requesting new measurements, collecting data, and quantifying classifier performance in terms of business goals. For this contest data, we don’t have names or definitions for any of the independent (or input) variables[a] and no real definition of the dependent (or outcome) variables. We have the advantage that the data comes in a ready-to-model format (all input variables and the results arranged in single rows). But we don’t know the meaning of any variable (so we unfortunately can’t join in outside data sources), and we can’t use any method that treats time and repetition of events carefully (such as time series methods or survival analysis).

a

We’ll call variables or columns used to build the model variously variables, independent variables, input variables, and so on to try and distinguish them from the value to be predicted (which we’ll call the outcome or dependent variable).

To simulate the data science processes, we’ll assume that we can use any column we’re given to make predictions (that all of these columns are known prior to needing a prediction).[2] We will assume the contest metric (AUC, or area under the curve as discussed in section 6.2.5) is the correct one, and the AUC of the top contestant is a good upper bound (telling us when to stop tuning).[3]

2

Checking if a column is actually going to be available during prediction (and not some later function of the unknown output) is a critical step in data science projects.

3

AUC is a good initial screening metric, as it measures if any monotone transformation of your score is a good score. For fine tuning, we will use R-squared and pseudo R-squared (also defined in chapter 6) as they are stricter, measuring if the exact values at hand are good scores.

8.2.1. Getting started with KDD Cup 2009 data

For our example, we’ll try to predict churn in the KDD dataset. The KDD contest was judged in terms of AUC (area under the curve, a measure of prediction quality discussed in section 6.2.5), so we’ll also use AUC as our measure of performance.[4] The winning team achieved an AUC of 0.76 on churn, so we’ll treat that as our upper bound on possible performance. Our lower bound on performance is an AUC of 0.5, as an AUC below 0.5 is worse than random predictions.

4

Also, as is common for example problems, we have no project sponsor to discuss metrics with, so our choice of evaluation is a bit arbitrary.

This problem has a large number of variables, many of which are categorical variables that have a large number of possible levels. As we will see, such variables are especially liable to overfit, even during the process of creating the treatment plan. Because of this concern, we’ll split our data into three sets: training, calibration, and test. In the following example, we’ll use the training set to design the treatment plan, and the calibration set to check for overfit in the treatment plan. The test set is reserved for a final estimate of model performance. This three-way split procedure is recommended by many researchers.[5]

5

Normally, we would use the calibration set to design the treatment plan, the training set to train the model, and the test set to evaluate the model. Since the focus of this chapter is on the data treatment process, we’ll use the largest set (dTrain) to design the treatment plan, and the other sets to evaluate it.

Let’s start work as shown in the following listing, where we prepare the data for analysis and modeling.[6]

6

Please either work in the KDD2009 subdirectory of the PDSwR2 support materials, or copy the relevant files to where you are working. The PDSwR2 support materials are available from https://github.com/WinVector/PDSwR2, and instructions for getting started can be found in appendix A.

Listing 8.1. Preparing the KDD data for analysis
d <- read.table('orange_small_train.data.gz',                          1
   header = TRUE,
   sep = '\t',
   na.strings = c('NA', ''))                                           2

churn <- read.table('orange_small_train_churn.labels.txt',
   header = FALSE, sep = '\t')                                        3
d$churn <- churn$V1                                                    4
set.seed(729375)                                                       5
rgroup <- base::sample(c('train', 'calibrate', 'test'),                6
   nrow(d),
   prob = c(0.8, 0.1, 0.1),
   replace = TRUE)
dTrain <- d[rgroup == 'train', , drop = FALSE]
dCal <- d[rgroup == 'calibrate', , drop = FALSE]
dTrainAll <- d[rgroup %in% c('train', 'calibrate'), , drop = FALSE]
dTest <- d[rgroup == 'test', , drop = FALSE]

outcome <- 'churn'
vars <- setdiff(colnames(dTrainAll), outcome)

rm(list=c('d', 'churn', 'rgroup'))                                     7

  • 1 Reads the file of independent variables. All the data is from https://github.com/WinVector/PDSwR2/tree/master/KDD2009.
  • 2 Treats both NA and the empty string as missing data
  • 3 Reads the known churn outcomes
  • 4 Adds churn as a new column
  • 5 By setting the seed to the pseudo-random number generator, we make our work reproducible: someone redoing it will see the exact same results.
  • 6 Splits data into train, calibration, and test sets. Explicitly specifies the base::sample() function to avoid a name collision in case another attached package also defines a function named sample().
  • 7 Removes unneeded objects from the workspace

We have also saved an R workspace with most of the data, functions, and results of this chapter in the GitHub repository, which you can load with the command load('KDD2009.Rdata'). We’re now ready to build some models.

We want to remind the reader: always look at your data. Looking at your data is the quickest way to find surprises. Two functions are particularly helpful for taking an initial look at your data: str() (which shows the structure of the first few rows in transposed form) and summary().

Exercise: Using str() and summary()

Before moving on, please run all of the steps in listing 8.1, and then try running str(dTrain) and summary(dTrain) yourself. We try to avoid overfit by not making modeling decisions based on looking at our holdout data.

Subsample to prototype quickly

Often the data scientist will be so engrossed with the business problem, math, and data that they forget how much trial and error is needed. It’s often an excellent idea to first work on a small subset of your training data, so that it takes seconds to debug your code instead of minutes. Don’t work with large and slow data sizes until you have to.
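
For instance, a minimal sketch of carving out a prototyping subset (the 2000-row size and the names idx and dTrain_small are illustrative choices, not part of the chapter's listings):

# Sketch: work on a small random subset of dTrain while debugging code.
# The subset size of 2000 rows is an arbitrary illustrative choice.
set.seed(25933)
idx <- base::sample(nrow(dTrain), size = 2000)
dTrain_small <- dTrain[idx, , drop = FALSE]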

Characterizing the outcome

Before starting on modeling, we should look at the distribution of the outcome. This tells us how much variation there is to even attempt to predict. We can do this as follows:

outcome_summary <- table(
   churn = dTrain[, outcome],                   1
   useNA = 'ifany')                             2

knitr::kable(outcome_summary)

outcome_summary["1"] / sum(outcome_summary)     3
#          1
# 0.07347764

  • 1 Tabulates levels of churn outcome
  • 2 Includes NA values in the tabulation
  • 3 Estimates the observed churn rate or prevalence

The table in figure 8.3 indicates that churn takes on two values: –1 and 1. The value 1 (indicating a churn, or cancellation of account, has happened) is seen about 7% of the time. So we could trivially be 93% accurate by predicting that no account ever cancels, though obviously this is not a useful model![7]

7

Figure 8.3. KDD2009 churn rate
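
As a quick sanity check on the 93% claim above, here is a sketch (using the outcome_summary table we just computed) of the accuracy of the trivial "no account ever cancels" rule:

outcome_summary["-1"] / sum(outcome_summary)     # fraction of non-churn rows
# roughly 0.93: the accuracy of always predicting "no churn"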

8.2.2. The bull-in-the-china-shop approach

Let’s deliberately ignore our advice to look at the data, to look at the columns, and to characterize the relations between the proposed explanatory variables and the quantity to be predicted. For this first attempt, we aren’t building a treatment plan, so we’ll use both the dTrain and dCal data together to fit the model (as the set dTrainAll). Let’s see what happens if we jump in and immediately try to build a model for churn == 1, given the explanatory variables (hint: it won’t be pretty).

Listing 8.2. Attempting to model without preparation
library("wrapr")                                                 1

outcome <- 'churn'
vars <- setdiff(colnames(dTrainAll), outcome)

formula1 <- mk_formula("churn", vars, outcome_target = 1)        2
model1 <- glm(formula1, data = dTrainAll, family = binomial)     3

# Error in `contrasts ...                                        4

  • 1 Attaches the wrapr package for convenience functions, such as mk_formula()
  • 2 Builds a model formula specification, asking churn == 1 to be predicted as a function of our explanatory variables
  • 3 Asks the glm() function to build a logistic regression model
  • 4 The attempt failed with an error.

As we can see, this first attempt failed. Some research will show us that some of the columns we are attempting to use as explanatory variables do not vary and have the exact same value for every row or example. We could attempt to filter these bad columns out by hand, but fixing common data issues in an ad hoc manner is tedious. For example, listing 8.3 shows what happens if we try to use just the first explanatory variable Var1 to build a model.
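
As an aside, that by-hand filtering might look something like the following sketch (the names n_distinct and constant_vars are illustrative, not part of the chapter's listings):

# Sketch: count distinct non-missing values per column; a column with at
# most one distinct value cannot help any model.
n_distinct <- vapply(dTrainAll[, vars, drop = FALSE],
                     function(col) length(unique(col[!is.na(col)])),
                     numeric(1))
constant_vars <- names(n_distinct)[n_distinct <= 1]
length(constant_vars)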

Explanatory variables

Explanatory variables are columns or variables we are trying to use as inputs for our model. In this case, the variables came to us without informative names, so they go by the names Var# where # is a number. In a real project, this would be a possible sign of uncommitted data-managing partners, and something to work on fixing before attempting modeling.

Listing 8.3. Trying just one variable
model2 <- glm((churn == 1) ~ Var1, data = dTrainAll, family = binomial)
summary(model2)
#
# Call:
# glm(formula = (churn == 1) ~ Var1, family = binomial, data = dTrainAll)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -0.3997  -0.3694  -0.3691  -0.3691   2.3326
#
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.6523837  0.1674387 -15.841   <2e-16 ***
# Var1         0.0002429  0.0035759   0.068    0.946
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 302.09  on 620  degrees of freedom
# Residual deviance: 302.08  on 619  degrees of freedom
#   (44407 observations deleted due to missingness)       1
# AIC: 306.08
#
# Number of Fisher Scoring iterations: 5

dim(dTrainAll)
# [1] 45028   234

  • 1 This means the modeling procedure threw out this much (almost all) of our training data.

We saw how to read the model summary in detail in section 7.2. What jumps out here is the line “44407 observations deleted due to missingness.” This means the modeling procedure threw out 44407 of our 45028 training rows, building a model on only the remaining 621 rows of data. So in addition to columns that do not vary, we have columns with damaging amounts of missing values.

The data problems do not end there. Take a look at another variable, this time the one named Var200:

head(dTrainAll$Var200)
# [1] <NA>    <NA>    vynJTq9 <NA>    0v21jmy <NA>
# 15415 Levels: _84etK_ _9bTOWp _A3VKFm _bq4Nkb _ct4nkXBMp ... zzQ9udm

length(unique(dTrainAll$Var200))
# [1] 14391

The head() command shows us the first few values of Var200, telling us this column has string values encoded as factors. Factors are R’s representation for strings taken from a known set. And this is where an additional problem lies. Notice the listing says the factor has 15415 possible levels. A factor or string variable with this many distinct levels is going to be a big problem in terms of overfitting and also difficult for the glm() code to work with. In addition, the length(unique(dTrainAll$Var200)) summary tells us that Var200 takes on only 14391 distinct values in our training sample. This tells us our training data sample did not see all known values for this variable. Our held-out test set contains, in addition to values seen during training, new values not in the training set. This is quite common for string-valued or categorical variables with a large number of levels, and causes most R modeling code to error-out when trying to make predictions on new data.
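
One quick way to see the novel-level problem is to ask which Var200 values in the held-out test set never appear in the training data (a sketch; the exact count depends on the random split):

# Sketch: test-set values of Var200 that were never seen during training.
novel_values <- setdiff(unique(as.character(dTest$Var200)),
                        unique(as.character(dTrainAll$Var200)))
length(novel_values)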

We could go on. We have not yet exhausted the section 8.1 list of things that can commonly go wrong. At this point, we hope the reader will agree: a sound, systematic way of identifying, characterizing, and mitigating common data quality issues would be a great help. Having a good way to work through such issues in a domain-independent way leaves us more time to work with the data and address any domain-specific issues. The vtreat package is a great tool for this task. For the rest of this chapter, we will work a bit with the KDD Cup 2009 data, and then master using vtreat in general.

8.3. Basic data preparation for classification

vtreat prepares data for use by both cleaning up existing columns or variables and by introducing new columns or variables. For our account cancellation (churn) scenario, vtreat will address the missing values, the categorical variables with very many levels, and other issues. Let’s master the vtreat process here.

First, we’ll use a portion of our data (the dTrain set) to design our variable treatments.

Listing 8.4. Basic data preparation for classification
library("vtreat")                                                       1

(parallel_cluster <- parallel::makeCluster(parallel::detectCores()))    2

treatment_plan <- vtreat::designTreatmentsC(                            3
  dTrain,
  varlist = vars,
  outcomename = "churn",
  outcometarget = 1,
  verbose = FALSE,
  parallelCluster = parallel_cluster)

  • 1 Attaches the vtreat package for functions such as designTreatmentsC()
  • 2 Starts up a parallel cluster to speed up calculation. If you don’t want a parallel cluster, just set parallel_cluster to NULL.
  • 3 Uses designTreatmentsC() to learn the treatment plan from the training data. For a dataset the size and complexity of KDD2009, this can take a few minutes.

Then, we’ll use the treatment plan to prepare cleaned and treated data. The prepare() method builds a new data frame with the same row order as the original data frame, and columns from the treatment plan (plus copying over the dependent variable column if it is present). The idea is illustrated in figure 8.4. In listing 8.5, we apply the treatment plan to the dTrain data, so we can compare the treated data to the original data.

Figure 8.4. vtreat variable preparation

Listing 8.5. Preparing data with vtreat
dTrain_treated <- prepare(treatment_plan,
                          dTrain,
                          parallelCluster = parallel_cluster)

head(colnames(dTrain))
## [1] "Var1" "Var2" "Var3" "Var4" "Var5" "Var6"
head(colnames(dTrain_treated))                                       1
## [1] "Var1"       "Var1_isBAD" "Var2"       "Var2_isBAD" "Var3"
## [6] "Var3_isBAD"

  • 1 Compares the columns of the original dTrain data to its treated counterpart

Note that the treated data both converts existing columns and introduces new columns or derived variables. In the next section, we will work through what those new variables are and how to use them.

8.3.1. The variable score frame

The vtreat process we have worked with up to now centers around designTreatmentsC(), which returns the treatment plan. The treatment plan is an R object with two purposes: to be used in data preparation by the prepare() statement, and to deliver a simple summary and initial critique of the proposed variables. This simple summary is encapsulated in the score frame. The score frame lists the variables that will be created by the prepare() method, along with some information about them. The score frame is our guide to the new variables vtreat introduces to make our modeling work easier. Let’s take a look at the score frame:

score_frame <-  treatment_plan$scoreFrame
t(subset(score_frame, origName %in% c("Var126", "Var189")))

# varName           "Var126"       "Var126_isBAD" "Var189"       "Var189_isBAD" 1
# varMoves          "TRUE"         "TRUE"         "TRUE"         "TRUE"         2
# rsq               "0.0030859179" "0.0136377093" "0.0118934515" "0.0001004614" 3
# sig               "7.876602e-16" "2.453679e-64" "2.427376e-56" "1.460688e-01" 4
# needsSplit        "FALSE"        "FALSE"        "FALSE"        "FALSE"        5
# extraModelDegrees "0"            "0"            "0"            "0"            6
# origName          "Var126"       "Var126"       "Var189"       "Var189"       7
# code              "clean"        "isBAD"        "clean"        "isBAD"        8

  • 1 The name of the derived variable or column
  • 2 An indicator that this variable is not always the same value (not a constant, which would be useless for modeling)
  • 3 The R-squared or pseudo R-squared of the variable; what fraction of the outcome variation this variable can explain on its own in a linear model
  • 4 The significance of the estimated R-squared
  • 5 An indicator that, when TRUE, is a warning to the user that the variable is hiding extra degrees of freedom (a measure of model complexity) and needs to be evaluated using cross-validation techniques
  • 6 How complex the variable is; for a categorical variable, this is related to the number of levels.
  • 7 Name of the original column the variable was derived from
  • 8 Name of the type of transformation used to build this variable

The score frame is a data.frame with one row per derived explanatory variable. Each row shows which original variable the derived variable will be produced from (origName), what type of transform will be used to produce the derived variable (code), and some quality summaries about the variable.

In our example, Var126 produces two new or derived variables: Var126 (a cleaned-up version of the original Var126 that has no NA/missing values), and Var126_isBAD (an indicator variable that indicates which rows of Var126 originally held missing or bad values).

The rsq column records the pseudo R-squared of the given variable, which is an indication of how informative the variable would be if treated as a single-variable model for the outcome. The sig column is an estimate of the significance of this pseudo R-squared. Notice that Var126_isBAD is more informative than the cleaned-up original variable Var126. This indicates we should consider including Var126_isBAD in our model, even if we decide not to include the cleaned-up version of Var126 itself!
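
Because the score frame is an ordinary data.frame, one convenient trick (a sketch, not one of the chapter's listings) is to sort it by rsq to see which derived variables look most informative on their own:

# Sketch: the derived variables with the largest single-variable pseudo R-squared.
head(score_frame[order(-score_frame$rsq),
                 c("varName", "rsq", "sig", "needsSplit", "code")])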

Informative missing values

In production systems, missingness is often very informative. Missingness usually indicates the data in question was subject to some condition (temperature out of range, test not run, or something else) and gives a lot of context in an encoded form. We have seen many situations where the information that a variable is missing is more informative than the cleaned-up values of the variable itself.

Let’s look at a categorical variable. The original Var218 has two possible levels: cJvF and UYBR.

t(subset(score_frame, origName == "Var218"))

# varName           "Var218_catP"  "Var218_catB"  "Var218_lev_x_cJvF" "Var218_lev_x_UYBR"
# varMoves          "TRUE"         "TRUE"         "TRUE"              "TRUE"
# rsq               "0.011014574"  "0.012245152"  "0.005295590"       "0.001970131"
# sig               "2.602574e-52" "5.924945e-58" "4.902238e-26"      "1.218959e-10"
# needsSplit        " TRUE"        " TRUE"        "FALSE"             "FALSE"
# extraModelDegrees "2"            "2"            "0"                 "0"
# origName          "Var218"       "Var218"       "Var218"            "Var218"
# code              "catP"         "catB"         "lev"               "lev"

The original variable Var218 produced four derived variables. In particular, notice that the levels cJvF and UYBR each gave us new derived columns or variables.

Level variables (lev)

Var218_lev_x_cJvF and Var218_lev_x_UYBR are indicator variables that have the value 1 when the original Var218 had the values cJvF and UYBR respectively;[8] we will discuss the other two variables in a bit. Recall from chapter 7 that most modeling methods work with a categorical variable with n possible levels by converting it to n (or n-1) binary variables, or indicator variables (sometimes referred to as one-hot encoding or dummies). Many modeling functions in R, such as lm or glm, do this conversion automatically; others, such as xgboost, don’t. vtreat tries to explicitly one-hot encode categoricals when it is feasible. In this way, the data can be used either by modeling functions like glm, or by functions like xgboost.

8

In a real modeling project, we would insist on meaningful level names and a data dictionary describing the meanings of the various levels. The KDD2009 contest data did not supply such information, which is a limitation of the contest data and prevents powerful methods such as using variables to join in additional information from external data sources.

By default, vtreat only creates indicator variables for “non-rare” levels: levels that appear more than 2% of the time. As we will see, Var218 also has some missing values, but the missingness only occurs 1.4% of the time. If missing values had been more common (above that 2% threshold), then vtreat would also have created a Var218_lev_x_NA indicator.
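
We can check those level frequencies directly with base R (a sketch):

# Sketch: level frequencies of Var218 in the training data, including NA.
# Only levels above vtreat's default 2% threshold get their own indicators.
round(prop.table(table(dTrain$Var218, useNA = "ifany")), 3)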

Impact variables (catB)

One-hot encoding creates a new variable for every non-rare level of a categorical variable. The catB encoding returns a single new variable, with a numerical value for every possible level of the original categorical variable. This value represents how informative a given level is: values with large magnitudes correspond to more-informative levels. We call this the impact of the level on the outcome; hence, the term “impact variable.” To understand impact variables, let’s compare the original Var218 to Var218_catB:

comparison <- data.frame(original218 = dTrain$Var218,
                         impact218 = dTrain_treated$Var218_catB)

head(comparison)
##   original218  impact218
## 1        cJvF -0.2180735
## 2        <NA>  1.5155125
## 3        UYBR  0.1221393
## 4        UYBR  0.1221393
## 5        UYBR  0.1221393
## 6        UYBR  0.1221393

For classification problems, the values of impact encoding are related to the predictions of a logistic regression model that predicts churn from Var218. To see this, we’ll use the simple missingness treatment that we used in section 4.1.3 to explicitly convert the NA values in Var218 to a new level. We will also use the logit, or log-odds function that we saw in chapter 7.

treatment_plan_2 <- design_missingness_treatment(dTrain, varlist = vars)   1
dtrain_2 <- prepare(treatment_plan_2, dTrain)                              2
head(dtrain_2$Var218)

## [1] "cJvF"      "_invalid_" "UYBR"      "UYBR"      "UYBR"      "UYBR"

model <- glm(churn ==1  ~ Var218,                                          3
             data = dtrain_2,
             family = "binomial")
pred <- predict(model,                                                     4
                newdata = dtrain_2,
                type = "response")

(prevalence <- mean(dTrain$churn == 1) )                                   5
 ## [1] 0.07347764

logit <- function(p) {                                                     6
   log ( p / (1-p) )
}

comparison$glm218 <- logit(pred) - logit(prevalence)                       7
head(comparison)

##   original218  impact218     glm218
## 1        cJvF -0.2180735 -0.2180735                                     8
## 2        <NA>  1.5155125  1.5155121
## 3        UYBR  0.1221393  0.1221392
## 4        UYBR  0.1221393  0.1221392
## 5        UYBR  0.1221393  0.1221392
## 6        UYBR  0.1221393  0.1221392

  • 1 Simple treatment to turn NA into a safe string
  • 2 Creates the treated data
  • 3 Fits the one-variable logistic regression model
  • 4 Makes predictions on the data
  • 5 Calculates the global probability of churn.
  • 6 A function to calculate the logit, or log-odds of a probability
  • 7 Calculates the catB values by hand
  • 8 Notice that the impact codes from vtreat match the “delta logit” encoded predictions from the standard glm model. This helps illustrate how vtreat is implemented.

In our KDD2009 example, we see the catB impact encoding is replacing a categorical variable with the predictions of the corresponding one-variable logistic regression model. For technical reasons, the predictions are in “link space,” or logit space, rather than in probability space, and are expressed as a difference from the null model of always predicting the global probability of the outcome. In all cases this data preparation takes a potentially complex categorical variable (that may imply many degrees of freedom, or dummy variable columns) and derives a single numeric column that picks up most of the variable’s modeling utility.
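
Equivalently, we can undo the link-space encoding by adding back the logit of the prevalence and applying the sigmoid (the inverse of the logit); the result should closely match the one-variable glm predictions. The sigmoid() helper and the prob218 column below are illustrative additions, not part of the chapter's listings:

# Sketch: convert the catB "delta logit" values back into predicted probabilities.
sigmoid <- function(x) { 1 / (1 + exp(-x)) }
comparison$prob218 <- sigmoid(comparison$impact218 + logit(prevalence))
head(comparison)    # prob218 should closely match pred, the glm predictions above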

When the modeling problem is a regression rather than a classification (the outcome is numeric), the impact encoding is related to the predictions of a one-variable linear regression. We’ll see an example of this later in the chapter.

The prevalence variables (catP)

The idea is this: for some variables, knowing how often a level occurs is very informative. For example, for United States ZIP codes, rare ZIP codes may all be from low-population rural areas. The prevalence variable simply encodes what fraction of the time the original variable takes the given level, making these whole-dataset statistics available to the modeling process in a convenient per-example format.
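
We can confirm this reading on Var218 (a sketch; level_freq is an illustrative name, and small differences from the catP values can come from implementation details):

# Sketch: the catP encoding should (approximately) match the raw level
# frequencies of Var218 in the training data.
level_freq <- prop.table(table(dTrain$Var218, useNA = "ifany"))
head(data.frame(original = dTrain$Var218,
                catP = dTrain_treated$Var218_catP))
level_freq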

Variable ethics

Note: For some applications, certain variables and inference may be either unethical or illegal to use. For example, ZIP code and race are both prohibited in the United States for credit approval decisions, due to historic “red lining” discrimination practices.

Having a sensitivity to ethical issues and becoming familiar with data and modeling law are critical in real-world applications.

Let’s look at what happened to another variable that was giving us trouble: Var200. Recall that this variable has 15415 possible values, of which only 13324 appear in dTrain, the portion of the data we used to design the treatment plan.

score_frame[score_frame$origName == "Var200", , drop = FALSE]

#           varName varMoves         rsq          sig needsSplit extraModelDegrees origName code
# 361   Var200_catP     TRUE 0.005729835 4.902546e-28       TRUE             13323   Var200 catP
# 362   Var200_catB     TRUE 0.001476298 2.516703e-08       TRUE             13323   Var200 catB
# 428 Var200_lev_NA     TRUE 0.005729838 4.902365e-28      FALSE                 0   Var200  lev

Note that vtreat only returned one indicator variable, indicating missing values. All the other possible values of Var200 were rare: they occurred less than 2% of the time. For a variable like Var200 with a very large number of levels, it isn’t practical to encode all the levels as indicator variables when modeling; it’s more computationally efficient to represent the variable as a single numeric variable, like the catB variable.

In our example, the designTreatmentsC() method recoded the original 230 explanatory variables into 546 new all-numeric explanatory variables that have no missing values. The idea is that these 546 variables are easier to work with and have a good shot of representing most of the original predictive signal in the data. A full description of what sorts of new variables vtreat can introduce can be found in the vtreat package documentation.[9]

9

8.3.2. Properly using the treatment plan

The primary purpose of the treatment plan object is to allow prepare() to convert new data into a safe, clean form before fitting and applying models. Let’s see how that is done. Here, we apply the treatment plan that we learned from the dTrain set to the calibration set, dCal, as shown in figure 8.5.

Figure 8.5. Preparing held-out data

dCal_treated <- prepare(treatment_plan,
                        dCal,
                        parallelCluster = parallel_cluster)

Normally, we could now use dCal_treated to fit a model for churn. In this case, we’ll use it to illustrate the risk of overfit on transformed variables that have needsSplit == TRUE in the score frame.

As we mentioned earlier, you can think of the Var200_catB variable as a single-variable logistic regression model for churn. This model was fit using dTrain when we called designTreatmentsC(); it was then applied to the dCal data when we called prepare(). Let’s look at the AUC of this model on the training and calibration sets:

library("sigr")

calcAUC(dTrain_treated$Var200_catB, dTrain_treated$churn)

# [1] 0.8279249

calcAUC(dCal_treated$Var200_catB, dCal_treated$churn)

# [1] 0.5505401

Notice the AUC estimated in the training data is 0.83, which seems very good. However, this AUC is not confirmed when we look at the calibration data that was not used to design the variable treatment. Var200_catB is overfit with respect to dTrain_treated. Var200_catB is a useful variable, just not as good as it appears to be on the training data.

Do not directly reuse the same data for fitting the treatment plan and the model!

To avoid overfit, the general rule is that whenever a premodeling data processing step uses knowledge of the outcome, you should not use the same data for the premodeling step and the modeling.

The AUC calculations in this section show that Var200_catB looks “too good” on the training data. Any model-fitting algorithm using dTrain_treated to fit a churn model will likely overuse this variable based on its apparent value. The resulting model then fails to realize that value on new data, and it will not predict as well as expected.

The correct procedure is to not reuse dTrain after designing the data treatment plan, but instead use dCal_treated for model training (although in this case, we should use a larger fraction of the available data than we originally allocated). With enough data and the right data split (say, 40% data treatment design, 50% model training, and 10% model testing/evaluation), this is an effective strategy.
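
In code, that strategy might look like the following sketch (the significance filter and the names model_vars, f_cal, and model_cal are placeholders for illustration, not the procedure we follow in the rest of the chapter):

# Sketch: the treatment plan was designed on dTrain; fit the model on the
# separately treated calibration partition instead of reusing dTrain_treated.
model_vars <- score_frame$varName[score_frame$sig < 1 / nrow(score_frame)]
f_cal <- wrapr::mk_formula("churn", model_vars, outcome_target = 1)
model_cal <- glm(f_cal, data = dCal_treated, family = binomial)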

In some cases, we may not have enough data for a good three-way split. The built-in vtreat cross-validation procedures allow us to use the same training data both for designing the data treatment plan and to correctly build models. This is what we will master next.

8.4. Advanced data preparation for classification

Now that we have seen how to prepare messy data for classification, let’s work through how to do this in a more statistically efficient manner. That is, let’s master techniques that let us safely reuse the same data for both designing the treatment plan and model training.

8.4.1. Using mkCrossFrameCExperiment()

Safely using the same data for data treatment design and for model construction is easy using vtreat. All we do is use the method mkCrossFrameCExperiment() instead of designTreatmentsC(). The designTreatmentsC() method relies on the three-way split strategy that we review in figure 8.6, whereas mkCrossFrameCExperiment() uses cross-validation techniques to produce a special cross-frame for training, instead of calling prepare() on the training data.

Figure 8.6. vtreat three-way split strategy again

The cross-frame is special surrogate training data that behaves as if it hadn’t been used to build its own treatment plan. The process is shown in figure 8.7, which we can contrast with figure 8.6.

Figure 8.7. vtreat cross-frame strategy

The user-visible parts of the procedures are small and simple. Figure 8.7 only looks complex because vtreat is supplying a very sophisticated service: the proper cross-validated organization that allows us to safely reuse data for both treatment design and model training.

The treatment plan and cross-frame can be built as follows. Here, we use all the data that we originally allocated for training and calibration as a single training set, dTrainAll. Then we will evaluate the data on the test set.

Listing 8.6. Advanced data preparation for classification
library("vtreat")

parallel_cluster <- parallel::makeCluster(parallel::detectCores())

cross_frame_experiment <- vtreat::mkCrossFrameCExperiment(
  dTrainAll,
  varlist = vars,
  outcomename = "churn",
  outcometarget = 1,
  verbose = FALSE,
  parallelCluster = parallel_cluster)

dTrainAll_treated <- cross_frame_experiment$crossFrame       1
treatment_plan <- cross_frame_experiment$treatments
score_frame <- treatment_plan$scoreFrame

dTest_treated <- prepare(treatment_plan,                     2
                         dTest,
                         parallelCluster = parallel_cluster)

  • 1 We will use the cross-frame to train the logistic regression model.
  • 2 Prepares the test set so we can call the model on it

The steps in listing 8.6 are intentionally very similar to those of listing 8.4. Notice that dTrainAll_treated is a value returned as part of the experiment, not something we use prepare() to produce. This overall data treatment strategy implements the ideas of figure 8.7.

Let’s recheck the estimated prediction quality of Var200 on both the training and test sets:

library("sigr")

calcAUC(dTrainAll_treated$Var200_catB, dTrainAll_treated$churn)

# [1] 0.5450466

calcAUC(dTest_treated$Var200_catB, dTest_treated$churn)

# [1] 0.5290295

Notice that the estimated utility of Var200 on the training data is now much closer to its future performance on the test data.[10] This means decisions made on the training data have a good chance of being correct when later retested on held-out test data or future application data.

10

Remember we are estimating performance from data subject to sampling, so all quality estimates are noisy, and we should not consider this observed difference to be an issue.

8.4.2. Building a model

Now that we have treated our variables, let’s try again to build a model.

Variable selection

A key part of building many variable models is selecting what variables to use. Each variable we use represents a chance of explaining more of the outcome variation (a chance of building a better model), but also represents a possible source of noise and overfitting. To control this effect, we often preselect which subset of variables we’ll use to fit. Variable selection can be an important defensive modeling step, even for types of models that “don’t need it.” The large number of columns typically seen in modern data warehouses can overwhelm even state-of-the-art machine learning algorithms.[11]

11

vtreat supplies two ways to filter variables: the summary statistics in the score frame and also a method called value_variables_C(). The summaries in the score frame are the qualities of the linear fits of each variable, so they may undervalue complex non-linear numeric relationships. In general, you might want to try value_variables_C() to properly score non-linear relationships. For our example, we’ll fit a linear model, so using the simpler score frame method is appropriate.[12]

12

We share a worked xgboost solution at https://github.com/WinVector/PDSwR2/blob/master/KDD2009/KDD2009vtreat.md, which achieves similar performance (as measured by AUC) as the linear model. Things can be improved, but we appear to be getting into a region of diminishing returns.

We are going to filter the variables on significances, but be aware that significance estimates are themselves very noisy, and variable selection itself can be a source of errors and biases if done improperly.[13] The idea we’ll use is this: assume some columns are in fact irrelevant, and use the loosest criterion that would only allow a moderate number of irrelevant columns to pass through. We use the loosest condition to try to minimize the number of actual useful columns or variables that we may accidentally filter out. Note that, while relevant columns should have a significance value close to zero, irrelevant columns should have a significance that is uniformly distributed in the interval zero through one (this is very closely related to the definition of significance). So a good selection filter would be to retain all variables that have a significance of no more than k/nrow(score_frame); we would expect only about k irrelevant variables to pass through such a filter.

13

A good article on this effect is Freedman, “A note on screening regression equations,” The American Statistician, volume 37, pp. 152-155, 1983.

This variable selection can be performed as follows:

k <- 1                                                1
(significance_cutoff <- k / nrow(score_frame))
# [1] 0.001831502
score_frame$selected <- score_frame$sig < significance_cutoff
suppressPackageStartupMessages(library("dplyr"))      2

score_frame %>%
  group_by(., code, selected) %>%
  summarize(.,
            count = n()) %>%
  ungroup(.) %>%
  cdata::pivot_to_rowrecs(.,
                          columnToTakeKeysFrom = 'selected',
                          columnToTakeValuesFrom = 'count',
                          rowKeyColumns = 'code',
                          sep = '=')

# # A tibble: 5 x 3
#   code  `selected=FALSE` `selected=TRUE`
#   <chr>            <int>           <int>
# 1 catB                12              21
# 2 catP                 7              26
# 3 clean              158              15
# 4 isBAD               60             111
# 5 lev                 74              62

  • 1 Uses our heuristic of filtering on significance at the threshold k / nrow(score_frame), with k = 1
  • 2 Brings in the dplyr package to help summarize the selections

The table shows for each converted variable type how many variables were selected or rejected. In particular, notice that almost all the variables of type clean (which is the code for cleaned up numeric variables) are discarded as being unusable. This is possible evidence that linear methods may not be sufficient for this problem, and that we should consider non-linear models instead. In this case, you might use value_variables_C() (which returns a structure similar to the score frame) to select variables, and also use the advanced non-linear machine learning methods of chapter 10. In this chapter, we are focusing on the variable preparation steps, so we will only build a linear model, and leave trying different modeling techniques as an important exercise for the reader.[14]

14

Though we do share a worked xgboost solution here: https://github.com/WinVector/PDSwR2/blob/master/KDD2009/KDD2009vtreat.md.

Building a multivariable model

Once we have our variables ready to go, building the model seems relatively straightforward. For this example, we will use a logistic regression (the topic of section 7.2). The code to fit the multivariable model is given in the next listing.

Listing 8.7. Basic variable recoding and selection
library("wrapr")

newvars <- score_frame$varName[score_frame$selected]              1

f <- mk_formula("churn", newvars, outcome_target = 1)             2
model <- glm(f, data = dTrainAll_treated, family = binomial)      3
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred

  • 1 Selects the derived variables that passed our significance filter
  • 2 Builds a formula specifying modeling churn == 1 as a function of the selected variables
  • 3 Fits the logistic regression model with R’s glm() function. Take heed of this warning: it is hinting we should move on to a regularized method such as glmnet.

Evaluating the model

Now that we have a model, let’s evaluate it on our test data:

library("sigr")

dTest_treated$glm_pred <- predict(model,                                   1
                                  newdata = dTest_treated,
                                  type = 'response')
# Warning message:                                                         2
# In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
#   prediction from a rank-deficient fit may be misleading

calcAUC(dTest_treated$glm_pred, dTest_treated$churn == 1)                  3
## [1] 0.7232192

permTestAUC(dTest_treated, "glm_pred", "churn", yTarget = 1)               4
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.7232, s.d.=0.01535, p<1e-05)."

var_aucs <- vapply(newvars,                                                5
        function(vi) {
        calcAUC(dTrainAll_treated[[vi]], dTrainAll_treated$churn == 1)
       }, numeric(1))
(best_train_aucs <- var_aucs[var_aucs >= max(var_aucs)])
## Var216_catB
##   0.5873512

  • 1 Adds the model prediction to the evaluation data as a new column
  • 2 Again, take heed of this warning: it is hinting we should move on to a regularized method such as glmnet.
  • 3 Calculates the AUC of the model on holdout data
  • 4 Calculates the AUC a second time, using an alternative method that also estimates a standard deviation or error bar
  • 5 Here we calculate the best single variable model AUC for comparison.

The model’s AUC is 0.72. This is not as good as the winning entry’s 0.76 (on different test data), but much better than the quality of the best input variable treated as a single-variable model (which showed an AUC of 0.59). Keep in mind that the permTestAUC() calculation indicated a standard deviation of the AUC estimate of 0.015 for a test set of this size. This means a difference of plus or minus 0.015 in AUC is not statistically significant.

Turning the logistic regression model into a classifier

As we can see from the double density plot of the model’s scores (figure 8.8), this model only does a moderate job of separating accounts that churn from those that don’t. If we made the mistake of using this model as a hard classifier where all individuals with a predicted churn propensity above 50% are considered at risk, we would see the following awful performance:

Figure 8.8. Distribution of the glm model’s scores on test data

table(prediction = dTest_treated$glm_pred >= 0.5,
      truth = dTest$churn)
#           truth
# prediction   -1    1
#      FALSE 4591  375
#      TRUE     8    1

The model only identifies nine individuals with such a high probability, and only one of those churns. Remember, this was an unbalanced classification problem; only 7.6% of the test examples do in fact churn. What the model can identify is individuals at an elevated risk of churning, not those that will certainly churn. For example, what if we ask the model for the individuals that are predicted to have double the expected churn risk:

table(prediction = dTest_treated$glm_pred>0.15,
      truth = dTest$churn)
#           truth
# prediction   -1    1
#      FALSE 4243  266
#      TRUE   356  110

Notice that in this case, using 0.15 as our scoring threshold, the model identified 466 potentially at-risk accounts, of which 110 did in fact churn. This subset therefore has a churn rate of about 24%, or roughly 3 times the overall churn rate. And this model identified 110 of the 376 churners, or 29% of them. From a business point of view, this model is identifying a 10% subgroup of the population that is responsible for 29% of the churning. This can be useful.
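
Those rates can be read off the confusion matrix directly; here is a sketch of the arithmetic (the names conf, recall, precision, and enrichment are illustrative):

# Sketch: recall and enrichment at the 0.15 threshold, from the confusion matrix.
conf <- table(prediction = dTest_treated$glm_pred > 0.15,
              truth = dTest$churn)
(recall <- conf["TRUE", "1"] / sum(conf[, "1"]))          # about 0.29
(precision <- conf["TRUE", "1"] / sum(conf["TRUE", ]))    # about 0.24
(enrichment <- precision / mean(dTest$churn == 1))        # about 3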

In section 7.2.3, we saw how to present the family of trade-offs between recall (what fraction of the churners are detected) and enrichment or lift (how much more common churning is in the selected set) as a graph. Figure 8.9 shows the plot of recall and enrichment as a function of threshold for the churn model.

Figure 8.9. glm recall and enrichment as a function of threshold

One way to use figure 8.9 is to draw a vertical line at a chosen x-axis threshold, say 0.2. Then the height at which this vertical line crosses each curve tells us the simultaneous enrichment and recall we would see if we classify scores above our threshold as positive. In this case, we would have a recall of around 0.12 (meaning we identify about 12% of the at-risk accounts), and an enrichment of around 3 (meaning the population we warn about has an account cancellation rate of 3 times the general population, indicating this is indeed an enhanced-risk population).

The code to produce these charts looks like this:

WVPlots::DoubleDensityPlot(dTest_treated, "glm_pred", "churn",
                           "glm prediction on test, double density plot")

WVPlots::PRTPlot(dTest_treated, "glm_pred", "churn",
                 "glm prediction on test, enrichment plot",
                 truthTarget = 1,
                 plotvars = c("enrichment", "recall"),
                 thresholdrange = c(0, 1.0))

And now we have worked a substantial classification problem using vtreat.

8.5. Preparing data for regression modeling

Preparing data for regression is very similar to preparing data for classification. Instead of calling designTreatmentsC() or mkCrossFrameCExperiment(), we call designTreatmentsN() or mkCrossFrameNExperiment().

Example

You wish to predict automobile fuel economy stated in miles per gallon from other facts about cars, such as weight and horsepower.

To simulate this scenario, we will use the Auto MPG Data Set from the UCI Machine Learning Repository. We can load this data from the file auto_mpg.RDS in the directory auto_mpg/ of https://github.com/WinVector/PDSwR2/ (after downloading this repository).

auto_mpg <- readRDS('auto_mpg.RDS')

knitr::kable(head(auto_mpg))          1

  • 1 Take a quick look at the data.

Having glanced at the data in figure 8.10, let’s take the “bull in the china shop” approach to modeling, and directly call lm() without examining or treating the data:

Figure 8.10. The first few rows of the auto_mpg data

library("wrapr")

vars <- c("cylinders", "displacement",                          1
          "horsepower", "weight", "acceleration",
          "model_year", "origin")
f <- mk_formula("mpg", vars)
model <- lm(f, data = auto_mpg)

auto_mpg$prediction <- predict(model, newdata = auto_mpg)       2

str(auto_mpg[!complete.cases(auto_mpg), , drop = FALSE])

# 'data.frame':    6 obs. of  10 variables:
#  $ mpg         : num  25 21 40.9 23.6 34.5 23
#  $ cylinders   : num  4 6 4 4 4 4
#  $ displacement: num  98 200 85 140 100 151
#  $ horsepower  : num  NA NA NA NA NA NA                       3
#  $ weight      : num  2046 2875 1835 2905 2320 ...
#  $ acceleration: num  19 17 17.3 14.3 15.8 20.5
#  $ model_year  : num  71 74 80 80 81 82
#  $ origin      : Factor w/ 3 levels "1","2","3": 1 1 2 1 2 1
#  $ car_name    : chr  ""ford pinto"" ""ford maverick"" ""renault lecar deluxe"" ...
#  $ prediction  : num  NA NA NA NA NA NA                       4

  • 1 Jump into modeling without bothering to treat the data.
  • 2 Adds the model predictions as a new column
  • 3 Notice that these cars do not have a recorded horsepower.
  • 4 So these cars do not get a prediction.

Because the dataset had missing values, the model could not return a prediction for every row. Now, we’ll try again, using vtreat to treat the data first:

library("vtreat")

cfe <- mkCrossFrameNExperiment(auto_mpg, vars, "mpg",     1
                                verbose = FALSE)
treatment_plan <- cfe$treatments
auto_mpg_treated <- cfe$crossFrame
score_frame <- treatment_plan$scoreFrame
new_vars <- score_frame$varName

newf <- mk_formula("mpg", new_vars)
new_model <- lm(newf, data = auto_mpg_treated)

auto_mpg$prediction <- predict(new_model, newdata = auto_mpg_treated)
# Warning in predict.lm(new_model, newdata = auto_mpg_treated): prediction
# from a rank-deficient fit may be misleading
str(auto_mpg[!complete.cases(auto_mpg), , drop = FALSE])
# 'data.frame':    6 obs. of  10 variables:
#  $ mpg         : num  25 21 40.9 23.6 34.5 23
#  $ cylinders   : num  4 6 4 4 4 4
#  $ displacement: num  98 200 85 140 100 151
#  $ horsepower  : num  NA NA NA NA NA NA
#  $ weight      : num  2046 2875 1835 2905 2320 ...
#  $ acceleration: num  19 17 17.3 14.3 15.8 20.5
#  $ model_year  : num  71 74 80 80 81 82
#  $ origin      : Factor w/ 3 levels "1","2","3": 1 1 2 1 2 1
#  $ car_name    : chr  ""ford pinto"" ""ford maverick"" ""renault lecar deluxe"" ...
#  $ prediction  : num  24.6 22.4 34.2 26.1 33.3 ...    2

  • 1 Try it again with vtreat data preparation.
  • 2 Now we can make predictions, even for items that have missing data.

Now, the model returns a prediction for every row, including those with missing data.

8.6. Mastering the vtreat package

Now that we have seen how to use the vtreat package, we will take some time to review what the package is doing for us. This is easiest to see with toy-sized examples.

vtreat is designed to prepare data for supervised machine learning or predictive modeling. The package is designed to help with the task of relating a bunch of input or explanatory variables to a single output to be predicted or to a dependent variable.

8.6.1. The vtreat phases

As illustrated in figure 8.11, vtreat works in two phases: a design phase and an application/prepare phase. In the design phase, vtreat learns details of your data. For each explanatory variable, it estimates the variable’s relationship to the outcome, so both the explanatory variables and the dependent variable must be available. In the application phase, vtreat introduces new variables that are derived from the explanatory variables, but are better suited for simple predictive modeling. The transformed data is all numeric and has no missing values.[15] R itself has methods for dealing with missing values, including many missing value imputation packages.[16] R also has a canonical method to convert arbitrary data.frames to numeric data: model.matrix(), which many models use to accept arbitrary data. vtreat is a specialized tool for these tasks that is designed to work very well for supervised machine learning or predictive modeling tasks.

15

Remember: missing values are not the only thing that can go wrong with the data, and not the only point vtreat addresses.

16

Figure 8.11. The two vtreat phases
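
For comparison, here is a sketch of what base R’s model.matrix() does with a tiny categorical example (d_toy is an illustrative data frame, not from the chapter):

# Sketch: model.matrix() one-hot encodes factors, but silently drops rows
# with missing values and offers no help with high-cardinality categories.
d_toy <- data.frame(x = c("a", "b", NA), y = c(1, 2, 3))
model.matrix(y ~ x, data = d_toy)    # the NA row disappears from the result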

For the treatment-design phase, call one of the following functions:

  • designTreatmentsC() Designs a variable treatment plan for a binary classification task. A binary classification task is where we want to predict if an example is in a given category, or predict the probability that an example is in the given category.
  • designTreatmentsN() Designs a variable treatment plan for a regression task. A regression task predicts a numeric outcome, given example numeric outcomes.
  • designTreatmentsZ() Designs a simple variable treatment plan that does not look at the training data outcomes. This plan deals with missing values and recodes strings as indicator variables (one-hot encoding), but it does not produce impact variables (which require knowledge of the training data outcomes).
  • design_missingness_treatment() Designs a very simple treatment that only deals with missing values, but does not one-hot encode categorical variables. Instead, it replaces NA with the token "_invalid_".
  • mkCrossFrameCExperiment() Prepares data for classification, using a cross-validation technique so the data used to design the variable treatment can be safely reused to train the model.
  • mkCrossFrameNExperiment() Prepares data for regression, using a cross-validation technique so the data used to design the variable treatment can be safely reused to train the model.

For the application or data preparation phase, we always call the prepare() method.
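Here is a minimal sketch of the overall two-phase pattern (the tiny frame toy and the variable name x are made up just for this illustration):

# Phase 1 (design): learn a treatment plan from training data.
toy <- data.frame(x = c(1, NA, 3), y = c(10, 20, 30))
plan_toy <- vtreat::designTreatmentsN(toy, varlist = "x",
                                      outcomename = "y", verbose = FALSE)

# Phase 2 (apply/prepare): produce clean, all-numeric data for fitting or scoring.
vtreat::prepare(plan_toy, toy)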

The vtreat package comes with a large amount of documentation and examples that can be found at https://winvector.github.io/vtreat/. However, in addition to knowing how to operate the package, it is critical that data scientists know what the packages they use are doing for them. So we will discuss what vtreat actually does here.

The concepts we need to review include these:

  • Missing values
  • Indicator variables
  • Impact coding
  • The treatment plan
  • The variable score frame
  • The cross-frame

This is a lot of concepts, but they are key to data repair and preparation. We will keep things concrete by working specific, but tiny, examples. Larger examples showing the performance of these methods can be found at https://arxiv.org/abs/1611.09477.

8.6.2. Missing values

As we have discussed before, R has a special code for values that are missing, not known, or not available: NA. Many modeling procedures will not accept data with missing values, so if they occur, we must do something about them. The common strategies include these:

  • Restricting down to “complete cases”— Using only the data rows where no columns have missing values. This can be problematic for model training, as the complete cases may not be distributed the same as, or be representative of, the full dataset. Also, this strategy does not tell us how to score new data that has missing values. There are theories about how to reweight data to make it more representative, but we do not encourage these methods.
  • Missing-value imputation— These are methods that use the non-missing values to infer or impute values (or distributions of values) for the missing values. An R task view dedicated to these methods can be found at https://cran.r-project.org/web/views/MissingData.html.
  • Using models that tolerate missing values— Some implementations of decision trees or random forests can tolerate missing values.
  • Treating missingness as observable information— Replacing missing values with stand-in information.

vtreat supplies an implementation of the last idea (treating missingness as observable information), as this is easy to do and very suitable for supervised machine learning or predictive modeling. The idea is simple: the missing values are replaced with some stand-in value (it can be zero, or it can be the average of the non-missing values), and an extra column is added to indicate this replacement has taken place. This extra column gives any modeling step an extra degree of freedom, or the ability to treat the imputed values separately from not-imputed values.

The following is a simple example showing the addition of the transformation:

library("wrapr")                                  1

d <- build_frame(
   "x1"    , "x2"         , "x3", "y" |
   1       , "a"          , 6   , 10  |
   NA_real_, "b"          , 7   , 20  |
   3       , NA_character_, 8   , 30  )

knitr::kable(d)

plan1 <- vtreat::design_missingness_treatment(d)
vtreat::prepare(plan1, d) %.>%                    2
    knitr::kable(.)

  • 1 Brings in the wrapr package for build_frame and the wrapr “dot pipe”
  • 2 Using wrapr’s dot pipe instead of magrittr’s forward pipe. The dot pipe requires the explicit dot argument notation discussed in chapter 5.

Notice that in figure 8.12 the x1 column has the missing value, and that value is replaced in figure 8.13 by a stand-in value, the average of the known values. The treated or prepared data (see figure 8.13) also has a new column, x1_isBAD, indicating where x1 was replaced. Finally, notice that for the string-valued column x2, the NA value is replaced with a special level code.

Figure 8.12. Our simple example data: raw

Figure 8.13. Our simple example data: treated
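If you are working at the console rather than rendering kable tables, printing the prepared frame directly shows the same structure the figures describe (a sketch of the expected result; exact column order and formatting may differ):

vtreat::prepare(plan1, d)
# Expect: x1 becomes 1, 2, 3 (the NA replaced by the mean of the known values),
# a new x1_isBAD column marks the replaced row, and the NA in x2 becomes the
# token "_invalid_"; x3 and y pass through unchanged.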

8.6.3. Indicator variables

Many statistical and machine learning procedures expect all variables to be numeric. Some R users may not be aware of this, as many R model implementations call model.matrix() under the covers to convert arbitrary data to numeric data. For real-world projects, we advise using a more controllable explicit transformation such as vtreat.[17]

17

However, in this book, for didactic purposes, we will try to minimize the number of preparation steps in each example when these steps are not the subject being discussed.

This transformation goes by a number of names, including indicator variables, dummy variables, and one-hot encoding. The idea is this: for each possible value of a string-valued variable, we create a new data column. We set each of these new columns to 1 when the string-valued variable has a value matching the column label, and zero otherwise. This is easy to see in the following example:

d <- build_frame(
   "x1"    , "x2"         , "x3", "y" |
   1       , "a"          , 6   , 10  |
   NA_real_, "b"          , 7   , 20  |
   3       , NA_character_, 8   , 30  )

print(d)
#   x1   x2 x3  y
# 1  1    a  6 10
# 2 NA    b  7 20                                        1
# 3  3 <NA>  8 30
plan2 <- vtreat::designTreatmentsZ(d,
                                   varlist = c("x1", "x2", "x3"),
                                   verbose = FALSE)
vtreat::prepare(plan2, d)
#   x1 x1_isBAD x3 x2_lev_NA x2_lev_x_a x2_lev_x_b
# 1  1        0  6         0          1          0
# 2  2        1  7         0          0          1       2
# 3  3        0  8         1          0          0

  • 1 The second value of x2 is b.
  • 2 In the second row of the treated data, x2_lev_x_b = 1.

Notice that x2_lev_x_b is 1 in the second prepared data row. This is how the transformed data retains the information that the x2 variable originally had the value of b in this row.

As we saw in the discussions of lm() and glm() in chapter 7, it is traditional statistical practice to not actually reserve a new column for one possible level of the string-valued variable. This level is called the reference level. We can identify rows where the string-valued variable was equal to the reference level, as all the other level columns are zero in such rows (other rows have exactly one 1 in the level columns). For supervised learning in general, and especially for advanced techniques such as regularized regression, we recommend encoding all levels, as seen here.
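For comparison, here is roughly what base R’s reference-level encoding looks like (the frame dd is a new toy example introduced just for this illustration):

dd <- data.frame(x2 = c("a", "b", "c"), y = c(1, 2, 3))
model.matrix(y ~ x2, data = dd)      # attribute output omitted below
#   (Intercept) x2b x2c
# 1           1   0   0
# 2           1   1   0
# 3           1   0   1

Here a is the reference level: it is represented implicitly by x2b and x2c both being zero, which is exactly the convention vtreat avoids by giving every observed level its own indicator column.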

8.6.4. Impact coding

Impact coding is a good idea that gets rediscovered often under different names (effects coding, impact coding, and more recently target encoding).[18]

18

The earliest discussion we can find on effects coding is Robert E. Sweeney and Edwin F. Ulveling, “A Transformation for Simplifying the Interpretation of Coefficients of Binary Variables in Regression Analysis,” The American Statistician, 26(5), 30–32, 1972. We, the authors, have produced research on and popularized the methodology among R and Kaggle users, adding key cross-validation methods similar to a method called “stacking”; see https://arxiv.org/abs/1611.09477.

When a string-valued variable has thousands of possible values or levels, producing a new indicator column for each possible level causes extreme data expansion and overfitting (if the model fitter can even converge in such situations). So instead we use an impact code: we replace each level code with its effect as a single-variable model. This is what produced the derived variables of type catB in our KDD2009 account cancellation example; the analogous catN-style variables are produced in the case of regression.

Let’s see the effect in a simple numeric prediction, or regression, example:

d <- build_frame(
   "x1"    , "x2"         , "x3", "y" |
   1       , "a"          , 6   , 10  |
   NA_real_, "b"          , 7   , 20  |
   3       , NA_character_, 8   , 30  )

print(d)
#   x1   x2 x3  y
# 1  1    a  6 10
# 2 NA    b  7 20
# 3  3 <NA>  8 30
plan3 <- vtreat::designTreatmentsN(d,
                                   varlist = c("x1", "x2", "x3"),
                                   outcomename = "y",
                                   codeRestriction = "catN",
                                   verbose = FALSE)
vtreat::prepare(plan3, d)
#   x2_catN  y
# 1     -10 10
# 2       0 20
# 3      10 30

The impact-coded variable is in the new column named x2_catN. Notice that in the first row it is -10, as that row’s y-value is 10, which is 10 below the average y value of 20. This encoding of “conditional delta from mean” is where names like “impact code” or “effect code” come from.
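You can reproduce the idea by hand: the impact code is approximately the per-level mean of y minus the overall mean of y (this sketch ignores the cross-validation and smoothing that vtreat also applies):

mean(d$y)                             # overall mean of y: 20
tapply(d$y, d$x2, mean) - mean(d$y)   # per-level deltas: a -> -10, b -> 0
# The NA level (row 3, with y = 30) similarly codes to 30 - 20 = 10.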

The impact coding for classification (categorical) targets is similar, except the values are in logarithmic (log-odds) units, just like the logistic regression in section 8.3.1. In this case, for data this small, the naive value of x2_catB would be minus infinity in rows 1 and 3, and plus infinity in row 2 (as the x2 levels perfectly separate the cases where y == 20 from the cases where it does not). The fact that we instead see values near plus or minus 10 is due to an important adjustment called smoothing, which says that when computing conditional probabilities, we should add a small bias toward “no effect” to get safer estimates.[19] An example of using vtreat to prepare data for a possible classification task is given next:

19

A reference on smoothing can be found here: https://en.wikipedia.org/wiki/Additive_smoothing.

plan4 <- vtreat::designTreatmentsC(d,
                                   varlist = c("x1", "x2", "x3"),
                                   outcomename = "y",
                                   outcometarget = 20,
                                   codeRestriction = "catB",
                                   verbose = FALSE)
vtreat::prepare(plan4, d)
#     x2_catB  y
# 1 -8.517343 10
# 2  9.903538 20
# 3 -8.517343 30
Smoothing

Smoothing is a method to prevent some degree of overfit and nonsense answers on small data. The idea of smoothing is an attempt to obey Cromwell’s rule: no probability estimate of zero should ever be used in empirical probabilistic reasoning. This is because if you’re combining probabilities by multiplication (the most common method of combining probability estimates), then once some term is 0, the entire estimate will be 0 no matter what the values of the other terms are. The most common form of smoothing is Laplace smoothing, which counts k successes out of n trials as a success ratio of (k + 1)/(n + 2) rather than k/n (defending against both the k = 0 and the k = n cases). Frequentist statisticians think of smoothing as a form of regularization, and Bayesian statisticians think of smoothing in terms of priors.
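A quick sketch of the arithmetic (toy counts, not vtreat’s exact internal formula): with zero observed successes the naive rate is 0, whose log-odds is minus infinity, while the smoothed rate stays finite.

k <- 0; n <- 1              # zero successes in one trial
k / n                       # naive rate: 0
log((k / n) / (1 - k / n))  # naive log-odds: -Inf
p <- (k + 1) / (n + 2)      # Laplace-smoothed rate: 1/3
log(p / (1 - p))            # smoothed log-odds: about -0.69, finite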

8.6.5. The treatment plan

The treatment plan specifies how training data will be processed before using it to fit a model, and how new data will be processed before applying the model to it. It is returned directly by the design*() methods. For the mkCrossFrame*Experiment() methods, the treatment plan is the item with the key treatments on the returned result. The following code shows the structure of a treatment plan:

class(plan4)
# [1] "treatmentplan"

names(plan4)
# [1] "treatments"    "scoreFrame"    "outcomename"   "vtreatVersion" "outcomeType"
# [6] "outcomeTarget" "meanY"         "splitmethod"
The variable score frame

An important item included in all treatment plans is the score frame. It can be pulled out of a treatment plan as follows (continuing our earlier example):

plan4$scoreFrame
#   varName varMoves       rsq sig needsSplit extraModelDegrees origName code
# 1 x2_catB     TRUE 0.0506719            TRUE                 2       x2 catB

The score frame is a data.frame with one row per derived explanatory variable. Each row shows which original variable the derived variable was produced from (origName), what type of transform was used to produce the derived variable (code), and some quality summaries about the variable. For instance, needsSplit, when TRUE, indicates that the variable is complex and requires cross-validated scoring, which is in fact how vtreat produces the variable quality estimates.
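One common use of the score frame is quick variable screening: keep only the derived variables whose cross-validated significance beats a chosen threshold. A hedged sketch follows (the 1/nrow threshold is one common heuristic, not a vtreat requirement):

score_frame <- plan4$scoreFrame
threshold <- 1 / nrow(score_frame)
selected_vars <- score_frame$varName[score_frame$sig < threshold]
selected_vars     # derived variables considered worth passing on to the model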

8.6.6. The cross-frame

A critical innovation of vtreat is the cross-frame. The cross-frame is an item found in the list of objects returned by the mkCrossFrame*Experiment() methods. It is an innovation that allows the safe use of the same data both for the design of the variable treatments and for training a model. Without this cross-validation method, you would have to reserve some of the training data to build the variable treatment plan and a disjoint set of the training data to fit a model on the treated data. Otherwise, the composite system (data preparation plus model application) may suffer from severe nested model bias: producing a model that appears good on training data, but later fails on test or application data.

The dangers of naively reusing data

Here is an example of the problem. Suppose we start with some example data where there is in fact no relation between x_bad and y. In this case, we know that any relation we think we find between them is just an artifact of our procedures, and not really there.

Listing 8.8. An information-free dataset
set.seed(2019)                                               1

d <- data.frame(                                             2
  x_bad = sample(letters, 100, replace = TRUE),
  y = rnorm(100),
  stringsAsFactors = FALSE
)
d$x_good <- ifelse(d$y > rnorm(100), "non-neg", "neg")       3

head(d)                                                      4
#   x_bad           y  x_good
# 1     u -0.05294738 non-neg
# 2     s -0.23639840     neg
# 3     h -0.33796351 non-neg
# 4     q -0.75548467 non-neg
# 5     b -0.86159347     neg
# 6     b -0.52766549 non-neg

  • 1 Sets pseudo-random number generator seed to make the example reproducible
  • 2 Builds example data where there is no relation between x_bad and y
  • 3 x_good is a noisy prediction of the sign of y, so it does have some information about y.
  • 4 Take a look at our synthetic example data. The idea is this: y is related to x_good in a noisy fashion, but unrelated to x_bad. In this case, we know what variables should be chosen, so we can tell if our acceptance procedure is working correctly.

We naively use the training data to create the treatment plan, and then prepare the same data prior to fitting the model.

Listing 8.9. The dangers of reusing data
plan5 <- vtreat::designTreatmentsN(d,                                     1
                                   varlist = c("x_bad", "x_good"),
                                   outcomename = "y",
                                   codeRestriction = "catN",
                                   minFraction = 2,
                                   verbose = FALSE)
class(plan5)
# [1] "treatmentplan"

print(plan5)                                                              2
#   origName     varName code          rsq          sig extraModelDegrees
# 1    x_bad  x_bad_catN catN 4.906903e-05 9.448548e-01                24
# 2   x_good x_good_catN catN 2.602702e-01 5.895285e-08                 1

training_data1 <- vtreat::prepare(plan5, d)                               3

res1 <- vtreat::patch_columns_into_frame(d, training_data1)               4
head(res1)
#   x_bad  x_good x_bad_catN x_good_catN           y
# 1     u non-neg  0.4070979   0.4305195 -0.05294738
# 2     s     neg -0.1133011  -0.5706886 -0.23639840
# 3     h non-neg -0.3202346   0.4305195 -0.33796351
# 4     q non-neg -0.5447443   0.4305195 -0.75548467
# 5     b     neg -0.3890076  -0.5706886 -0.86159347
# 6     b non-neg -0.3890076   0.4305195 -0.52766549

sigr::wrapFTest(res1, "x_good_catN", "y")                                 5
# [1] "F Test summary: (R2=0.2717, F(1,98)=36.56, p<1e-05)."

sigr::wrapFTest(res1, "x_bad_catN", "y")                                  6
# [1] "F Test summary: (R2=0.2342, F(1,98)=29.97, p<1e-05)."

  • 1 Designs a variable treatment plan using x_bad and x_good to predict y
  • 2 Notice that the derived variable x_good_catN comes out as having a significant signal, and x_bad_catN does not. This is due to the proper use of cross-validation in the vtreat quality estimates.
  • 3 Calls prepare() on the same data used to design the treatment plan—this is not always safe, as we shall see.
  • 4 Combines the data frames d and training_data1, using training_data1 when there are columns with duplicate names
  • 5 Uses a statistical F-test to check the predictive power of x_good_catN
  • 6 x_bad_catN’s F-test is inflated and falsely looks significant. This is due to failure to use cross-validated methods.

In this example, notice that the sigr F-test reports an R-squared of 0.23 between x_bad_catN and the outcome variable y. The F-test checks whether the fraction of variation explained (itself called the R-squared) is statistically insignificant, that is, whether it could easily arise under pure chance. So we want the true R-squared to be high (near 1) and the true F-test significance to be low (near zero) for the good variable. We also expect the true R-squared to be low (near 0), and the true F-test significance to be non-vanishing (not near zero) for the bad variable.

However, notice both the good and bad variables received favorable evaluations! This is an error, and happened because the variables we are testing, x_good_catN and x_bad_catN, are both impact codes of high-cardinality string-valued variables. When we test these variables on the same data they were constructed on, we suffer from overfitting, which erroneously inflates our variable quality estimate. In this case, a lot of the apparent quality of fit is actually just a measure of a variable’s complexity (or ability to overfit).

Also notice that the R-squared and significance reported in the score frame correctly indicate that x_bad_catN is not a high-quality variable (R-squared near zero, and significance not near zero). This is because the score frame uses cross-validation to estimate variable significance. This matters because a modeling process involving multiple variables might otherwise pick x_bad_catN over other, actually useful, variables due to x_bad_catN’s overfit-inflated quality score.

As mentioned in previous sections, the way to fix the overfitting is to use one portion of our training data for the designTreatments*() step and a disjoint portion of our training data for the variable use or evaluation (such as the sigr::wrapFTest() step).
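A sketch of that split-based fix on the same synthetic data follows (the 50/50 split proportion is illustrative):

set.seed(2019)
in_design <- runif(nrow(d)) < 0.5
d_design <- d[in_design, , drop = FALSE]    # used only to design the treatment plan
d_eval   <- d[!in_design, , drop = FALSE]   # used only to evaluate the treated variables

plan5b <- vtreat::designTreatmentsN(d_design,
                                    varlist = c("x_bad", "x_good"),
                                    outcomename = "y",
                                    codeRestriction = "catN",
                                    minFraction = 2,
                                    verbose = FALSE)
treated_eval <- vtreat::prepare(plan5b, d_eval)

sigr::wrapFTest(treated_eval, "x_bad_catN", "y")    # expect: no longer looks significant
sigr::wrapFTest(treated_eval, "x_good_catN", "y")   # expect: still looks significant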

The cross-frame to safely reuse data

Another way to do this, which lets us use all of the training data both for the design of the variable treatment plan and for model fitting, is called the cross-frame method. This is a special cross-validation method built into vtreat’s mkCrossFrame*Experiment() methods. All we do in this case is call mkCrossFrameNExperiment() instead of designTreatmentsN(), and get the prepared training data from the crossFrame element of the returned list object (instead of calling prepare()). For future test or application data, we do call prepare() with the treatment plan (which is returned as the treatments item on the returned list object), but for training we do not call prepare().

The code is as follows.

Listing 8.10. Using mkCrossFrameNExperiment()
cfe <- vtreat::mkCrossFrameNExperiment(d,
                                       varlist = c("x_bad", "x_good"),
                                       outcomename = "y",
                                       codeRestriction = "catN",
                                       minFraction = 2,
                                       verbose = FALSE)
plan6 <- cfe$treatments

training_data2 <- cfe$crossFrame
res2 <- vtreat::patch_columns_into_frame(d, training_data2)

head(res2)
#   x_bad  x_good x_bad_catN x_good_catN           y
# 1     u non-neg  0.2834739   0.4193180 -0.05294738
# 2     s     neg -0.1085887  -0.6212118 -0.23639840
# 3     h non-neg  0.0000000   0.5095586 -0.33796351
# 4     q non-neg -0.5142570   0.5095586 -0.75548467
# 5     b     neg -0.3540889  -0.6212118 -0.86159347
# 6     b non-neg -0.3540889   0.4193180 -0.52766549

sigr::wrapFTest(res2, "x_bad_catN", "y")
# [1] "F Test summary: (R2=-0.1389, F(1,98)=-11.95, p=n.s.)."

sigr::wrapFTest(res2, "x_good_catN", "y")
# [1] "F Test summary: (R2=0.2532, F(1,98)=33.22, p<1e-05)."
plan6$scoreFrame                                              1
#       varName varMoves        rsq          sig needsSplit
# 1  x_bad_catN     TRUE 0.01436145 2.349865e-01       TRUE
# 2 x_good_catN     TRUE 0.26478467 4.332649e-08       TRUE
#   extraModelDegrees origName code
# 1                24    x_bad catN
# 2                 1   x_good catN

  • 1 The F-tests on the data and the scoreFrame statistics now largely agree.

Notice now that sigr::wrapFTest() correctly considers x_bad_catN to be a low-value variable. This scheme also scores good variables correctly, meaning we can tell good from bad. We can use the cross-frame training_data2 for fitting models, with good protection against overfit from the variable treatment.
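Listing 8.10 does not show the last step of the pattern: for future test or application data we go back to prepare() with the stored treatment plan plan6. A sketch with hypothetical new rows (d_new is made up for this illustration; note the novel x_bad value):

d_new <- data.frame(x_bad  = c("a", "zzz"),   # "zzz" was never seen in training
                    x_good = c("non-neg", "neg"),
                    stringsAsFactors = FALSE)
vtreat::prepare(plan6, d_new)    # returns the x_bad_catN and x_good_catN columns;
                                 # novel levels receive the neutral code of 0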

Nested model bias

Overfit due to using the result of one model as an input to another is called nested model bias. With vtreat, this could be an issue with the impact codes, which are themselves models. For data treatments that do not look at the outcome, like design_missingness_treatment() and designTreatmentsZ(), it is safe to use the same data to design the treatment plan and fit the model. However, when the data treatment uses the outcome, we suggest either an additional data split or using the mkCrossFrame*Experiment()/$crossFrame pattern from section 8.4.1.

vtreat uses cross-validation procedures to create the cross-frame. For details, see https://winvector.github.io/vtreat/articles/vtreatCrossFrames.html.

designTreatments*() vs. mkCrossFrame*Experiment()

For larger datasets, it’s easier to use a three-way split of the training data and the designTreatments*()/prepare() pattern to design the treatment plan, fit the model, and evaluate it. For datasets that seem too small to split three ways (especially datasets with a very large number of variables), you may get better models by using the mkCrossFrame*Experiment()/prepare() pattern.
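A sketch of the three-way split mentioned here (the group names and proportions are illustrative):

set.seed(2019)
group <- sample(c("treat_design", "model_train", "test"),
                size = nrow(d), replace = TRUE, prob = c(0.25, 0.5, 0.25))
d_treat <- d[group == "treat_design", , drop = FALSE]  # design the treatment plan here
d_train <- d[group == "model_train", , drop = FALSE]   # fit the model on prepared data
d_test  <- d[group == "test", , drop = FALSE]          # hold out for final evaluation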

Summary

Real-world data is often messy. Raw, uninspected, and untreated data may crash your modeling or prediction step, or may give bad results. “Fixing” data is no substitute for having better data, but being able to work with the data you have (instead of the data you wish you had) is an advantage.

In addition to many domain-specific or problem-specific problems, you may find that in your data, there are a number of common problems that should be anticipated and dealt with systematically. vtreat is a package specialized for preparing data for supervised machine learning or predictive modeling tasks. It can also reduce your project documentation requirements through its citable documentation.[20] However, remember that tools are not an excuse to avoid looking at your data.

20

In this chapter you have learned

  • How to use the vtreat package’s designTreatments*()/prepare() pattern with a three-way split of your training data to prepare messy data for model fitting and model application
  • How to use the vtreat package’s mkCrossFrame*Experiment()/prepare() pattern with a two-way split of your training data to prepare messy data for model fitting and model application, when statistical efficiency is important