Chapter 4. Managing data

This chapter covers

  • Fixing data quality problems
  • Transforming data before modeling
  • Organizing your data for the modeling process

In chapter 3, you learned how to explore your data and how to identify common data issues. In this chapter, you’ll see how to fix the data issues that you’ve discovered. After that, we’ll talk about transforming and organizing the data for the modeling process. Most of the examples in this chapter use the same customer data that you used in the previous chapter.[1]

1

The data can be loaded by saving the file custdata.RDS from https://github.com/WinVector/PDSwR2/tree/master/Custdata and then running readRDS("custdata.RDS") in R.

As shown in the mental model (figure 4.1), this chapter again emphasizes the science of managing the data in a statistically valid way, prior to the model-building step.

Figure 4.1. Chapter 4 mental model

4.1. Cleaning data

In this section, we’ll address issues that you discovered during the data exploration/visualization phase, in particular, invalid and missing values. Missing values in data happen quite commonly, and the way you treat them is generally the same from project to project. Handling invalid values is often domain-specific: which values are invalid, and what you do about them, depends on the problem that you are trying to solve.

Example

Suppose you have a numeric variable called credit_score. Domain knowledge will tell you what the valid range for that variable should be. If the credit score is supposed to be a customer’s “classic FICO score,” then any value outside the range 300–850 should be treated as invalid. Other types of credit scores will have different ranges of valid values.
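As a concrete illustration, here is a minimal sketch of that kind of range check in R. The data frame d and its credit_score column are hypothetical, invented only for this example; out-of-range scores are converted to NA so that they can be handled later with the same missing-value treatments described in this chapter.

d <- data.frame(credit_score = c(720, 640, 9999, 810, -10))   # hypothetical FICO scores

d$credit_score <- ifelse(d$credit_score < 300 | d$credit_score > 850,
                         NA,                   # scores outside 300-850 are invalid
                         d$credit_score)

d$credit_score
## [1] 720 640  NA 810  NA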

We’ll look at an example of domain-specific data cleaning first.

4.1.1. Domain-specific data cleaning

From our data exploration in the previous chapter, we know of some issues with our data:

  • The variable gas_usage mixes numeric and symbolic data: values greater than 3 are monthly gas bills, but values from 1 to 3 are special codes. In addition, gas_usage has some missing values.
  • The variable age has the problematic value 0, which probably means that the age is unknown. In addition, there are a few customers with ages over 100 years, which may also be errors. However, for this project, we’ll treat the value 0 as invalid, and assume ages over 100 years are valid.
  • The variable income has negative values. We’ll assume for this discussion that those values are invalid.

These sorts of issues are quite common. In fact, most of the preceding problems were already in the actual census data that our notional customer data example is based on.

A quick way to treat the age and income variables is to convert the invalid values to NA, as if they were missing values. You can then treat the NAs using the automatic missing-value treatment discussed in section 4.1.2.[2]

2

If you haven’t already done so, we suggest you follow the steps in section A.1 of appendix A to install R, packages, tools, and the book examples.

Listing 4.1. Treating the age and income variables
library(dplyr)
customer_data = readRDS("custdata.RDS")                  1

customer_data <- customer_data %>%
   mutate(age = na_if(age, 0),                           2
           income = ifelse(income < 0, NA, income))      3

  • 1 Loads the data
  • 2 The function mutate() from the dplyr package adds columns to a data frame, or modifies existing columns. The function na_if(), also from dplyr, converts a specific problematic value (in this case, 0) to NA.
  • 3 Converts negative incomes to NA

The gas_usage variable has to be treated specially. Recall from chapter 3 that the values 1, 2, and 3 aren’t numeric values, but codes:

  • The value 1 means “Gas bill included in rent or condo fee.”
  • The value 2 means “Gas bill included in electricity payment.”
  • The value 3 means “No charge or gas not used.”

One way to treat gas_usage is to convert all the special codes (1, 2, 3) to NA, and to add three new indicator variables, one for each code. For example, the indicator variable gas_with_electricity will have the value 1 (or TRUE) whenever the original gas_usage variable had the value 2, and the value 0 otherwise. In the following listing, you will create the three new indicator variables, gas_with_rent, gas_with_electricity, and no_gas_bill.

Listing 4.2. Treating the gas_usage variable
customer_data <- customer_data %>%
  mutate(gas_with_rent = (gas_usage == 1),                   1
          gas_with_electricity = (gas_usage == 2),
         no_gas_bill = (gas_usage == 3) ) %>%
  mutate(gas_usage = ifelse(gas_usage < 4, NA, gas_usage))   2

  • 1 Creates the three indicator variables
  • 2 Converts the special codes in the gas_usage column to NA

4.1.2. Treating missing values

Let’s take another look at some of the variables with missing values in our customer dataset from the previous chapter. One way to find these variables programmatically is to count how many missing values are in each column of the customer data frame, and look for the columns where that count is greater than zero. The next listing counts the number of missing values in each column of the dataset.

Listing 4.3. Counting the missing values in each variable
count_missing = function(df) {                             1
   sapply(df, FUN=function(col) sum(is.na(col)) )
}

nacounts <- count_missing(customer_data)
hasNA = which(nacounts > 0)                                2
nacounts[hasNA]

##          is_employed               income         housing_type
##                25774                   45                 1720
##          recent_move         num_vehicles                  age
##                 1721                 1720                   77
##            gas_usage        gas_with_rent gas_with_electricity
##                35702                 1720                 1720
##          no_gas_bill
##                 1720

  • 1 Defines a function that counts the number of NAs in each column of a data frame
  • 2 Applies the function to customer_data, identifies which columns have missing values, and prints the columns and counts

Fundamentally, there are two things you can do with these variables: drop the rows with missing values, or convert the missing values to a meaningful value. For variables like income or age that have very few missing values relative to the size of the data (customer_data has 73,262 rows), it could be safe to drop the rows. It wouldn’t be safe to drop rows from variables like is_employed or gas_usage, where a large fraction of the values is missing.

In addition, remember that many modeling algorithms in R (and in other languages) will quietly drop rows that have missing values. So if you have wide data, and many columns have missing values, it may not be safe to drop rows with missing values. This is because the fraction of rows with at least one missing value can be high in that situation, and you can lose most of your data, as figure 4.2 shows. So for this discussion, we will convert all the missing values to meaningful values.

Figure 4.2. Even a few missing values can lose all your data.
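As a quick check before dropping any rows, you can measure how much data you would actually lose. The one-liner below (a sketch, assuming customer_data is loaded as in listing 4.1) computes the fraction of rows that have no missing values in any column; if this fraction is small, dropping incomplete rows would throw away most of the data.

# Fraction of rows of customer_data with no NAs in any column
mean(complete.cases(customer_data))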

Missing data in categorical variables

When the variable with missing values is categorical, an easy solution is to create a new category for the variable, called, for instance, missing or _invalid_. This is shown schematically for the variable housing_type in figure 4.3.

Figure 4.3. Creating a new level for missing categorical values
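Here is a minimal by-hand sketch of this fix for housing_type. It works on a copy of the data, because section 4.1.3 will show how the vtreat package performs the same step automatically on customer_data; the as.character() call guards against housing_type being stored as a factor.

library(dplyr)

demo_data <- customer_data %>%
  mutate(housing_type = ifelse(is.na(housing_type),
                               "_invalid_",                   # stand-in category for missing values
                               as.character(housing_type)))

table(demo_data$housing_type, useNA = "ifany")                # no more NAs in housing_type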

Missing values in numeric or logical variables

Suppose your income variable were missing substantial data, as in figure 4.4. You believe that income is still an important predictor of the probability of health insurance coverage, so you still want to use the variable. What do you do? This can depend on why you think the data is missing.

Figure 4.4. Income data with missing values

The nature of missing values

You might believe that the data is missing because the data collection failed at random, independent of the situation and of the other values. In this case, you can replace the missing values with a “reasonable estimate,” or imputed value. Statistically, one commonly used estimate is the expected, or mean, income, as shown in figure 4.5.

Figure 4.5. Replacing missing values with the mean

Assuming that the customers with missing income are distributed the same way as the others, replacing missing values with the mean will be correct on average. It’s also an easy fix to implement.
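For example, a minimal sketch of mean imputation for income (assuming customer_data is loaded and treated as in listing 4.1, so invalid incomes are already NA) looks like this:

mean_income <- mean(customer_data$income, na.rm = TRUE)        # average of the observed incomes

income_imputed <- ifelse(is.na(customer_data$income),
                         mean_income,                          # fill in missing incomes with the mean
                         customer_data$income)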

You can improve this estimate when you remember that income is related to other variables in your data—for instance, you know from your data exploration in the previous chapter that there’s a relationship between age and income. There might be a relationship between state of residence or marital status and income, as well. If you have this information, you can use it. The method of imputing a missing value of an input variable based on the other input variables can be applied to categorical data, as well.[3]

3

The text R in Action, Second Edition (Robert Kabacoff, 2014, http://mng.bz/ybS4) includes an extensive discussion of several value imputation methods available in R.

It’s important to remember that replacing missing values by the mean, as well as other more sophisticated methods for imputing missing values, assumes that the customers with missing income are in some sense typical. It’s possible that the customers with missing income data are systematically different from the others. For instance, it could be the case that the customers with missing income information truly have no income, because they are full-time students or stay-at-home spouses or otherwise not in the active workforce. If this is so, then “filling in” their income information by using one of the preceding methods is an insufficient treatment, and may lead to false conclusions.

Treating missing values as information

You still need to replace the NAs with a stand-in value, perhaps the mean. But the modeling algorithm should know that these values are possibly different from the others. A trick that has worked well for us is to replace the NAs with the mean, and add an additional indicator variable to keep track of which data points have been altered. This is shown in figure 4.6.

Figure 4.6. Replacing missing values with the mean and adding an indicator column to track the altered values

The income_isBAD variable lets you differentiate the two kinds of values in the data: the ones that you are about to add, and the ones that were already there.
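Here is a minimal by-hand sketch of this trick for income; the vtreat package in section 4.1.3 builds these indicator columns for you, so this is only to make the idea concrete.

library(dplyr)

demo_data <- customer_data %>%
  mutate(income_isBAD = as.numeric(is.na(income)),             # 1 marks a value that is about to be imputed
         income = ifelse(is.na(income),
                         mean(income, na.rm = TRUE),           # replace missing incomes with the mean
                         income))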

You’ve seen a variation of this approach already, in another example of systematic missing values: the gas_usage variable. For most of the customers with missing gas_usage values, the missingness isn’t random: they either pay for gas together with another bill, such as electricity or rent, or they don’t use gas. You identified those customers by adding additional indicator variables: no_gas_bill, gas_with_rent, and so on. Now you can fill in the “missing” values in gas_usage with a stand-in value, such as zero, or the average value of gas_usage.

The idea is that at the modeling step, you give all the variables—income, income_isBAD, gas_usage, no_gas_bill, and so on—to the modeling algorithm, and it can determine how to best use the information to make predictions. If the missing values really are missing randomly, then the indicator variables that you added are uninformative, and the model should ignore them. If the missing values are missing systematically, then the indicator variables provide useful additional information to the modeling algorithm.

Missingness indicators can be useful

We’ve observed in many situations that the isBAD variables are sometimes even more informative and useful than the original variables!

If you don’t know whether the missing values are random or systematic, we recommend assuming that the difference is systematic, rather than working hard to impute values to the variables based on the random missingness assumption. As we said earlier, treating missing values as if they are missing at random when they really indicate a systematic difference in some of the datums may lead to false conclusions.

4.1.3. The vtreat package for automatically treating missing variables

Since missing values are such a common problem with data, it’s useful to have an automatic and repeatable process for dealing with them. We recommend using the vtreat variable treatment package. The vtreat process creates a treatment plan that records all the information needed to repeat the data treatment process: for example, the average observed income, or all the observed values of a categorical variable like housing_type. You then use this treatment plan to “prepare” or treat your training data before you fit a model, and then again to treat new data before feeding it into the model. The idea is that treated data is “safe,” with no missing or unexpected values, and shouldn’t ruin the model.

You’ll see more-sophisticated examples of using vtreat in later chapters, but for now you will just create a simple treatment plan to manage the missing values in customer_data. Figure 4.7 shows the processes of creating and applying this simple treatment plan. First, you have to designate which columns of the data are the input variables: all of them except health_ins (which is the outcome to be predicted) and custid:

varlist <- setdiff(colnames(customer_data), c("custid", "health_ins"))

Figure 4.7. Creating and applying a simple treatment plan

Then, you create the treatment plan, and “prepare” the data.

Listing 4.4. Creating and applying a treatment plan
library(vtreat)
treatment_plan <-
      design_missingness_treatment(customer_data, varlist = varlist)
training_prepared <- prepare(treatment_plan, customer_data)

The data frame training_prepared is the treated data that you would use to train a model. Let’s compare it to the original data.

Listing 4.5. Comparing the treated data to the original
colnames(customer_data)
##  [1] "custid"               "sex"                  "is_employed"
##  [4] "income"               "marital_status"       "health_ins"
##  [7] "housing_type"         "recent_move"          "num_vehicles"
## [10] "age"                  "state_of_res"         "gas_usage"
## [13] "gas_with_rent"        "gas_with_electricity" "no_gas_bill"
colnames(training_prepared)                                               1
##  [1] "custid"                     "sex"
##  [3] "is_employed"                "income"
##  [5] "marital_status"             "health_ins"
##  [7] "housing_type"               "recent_move"
##  [9] "num_vehicles"               "age"
## [11] "state_of_res"               "gas_usage"
## [13] "gas_with_rent"              "gas_with_electricity"
## [15] "no_gas_bill"                "is_employed_isBAD"
## [17] "income_isBAD"               "recent_move_isBAD"
## [19] "num_vehicles_isBAD"         "age_isBAD"
## [21] "gas_usage_isBAD"            "gas_with_rent_isBAD"
## [23] "gas_with_electricity_isBAD" "no_gas_bill_isBAD"

nacounts <- sapply(training_prepared, FUN=function(col) sum(is.na(col)) ) 2
sum(nacounts)
## [1] 0

  • 1 The prepared data has additional columns that are not in the original data, most importantly those with the _isBAD designation.
  • 2 The prepared data has no missing values.

Now examine a few columns that you know had missing values.

Listing 4.6. Examining the data treatment
htmissing <- which(is.na(customer_data$housing_type))                  1

columns_to_look_at <- c("custid", "is_employed", "num_vehicles",
                           "housing_type", "health_ins")

customer_data[htmissing, columns_to_look_at] %>% head()                2
##           custid is_employed num_vehicles housing_type health_ins
## 55  000082691_01        TRUE           NA         <NA>      FALSE
## 65  000116191_01        TRUE           NA         <NA>       TRUE
## 162 000269295_01          NA           NA         <NA>      FALSE
## 207 000349708_01          NA           NA         <NA>      FALSE
## 219 000362630_01          NA           NA         <NA>       TRUE
## 294 000443953_01          NA           NA         <NA>       TRUE
columns_to_look_at = c("custid", "is_employed", "is_employed_isBAD",
                       "num_vehicles","num_vehicles_isBAD",
                       "housing_type", "health_ins")

training_prepared[htmissing, columns_to_look_at] %>%  head()           3
##           custid is_employed is_employed_isBAD num_vehicles
## 55  000082691_01   1.0000000                 0       2.0655
## 65  000116191_01   1.0000000                 0       2.0655
## 162 000269295_01   0.9504928                 1       2.0655
## 207 000349708_01   0.9504928                 1       2.0655
## 219 000362630_01   0.9504928                 1       2.0655
## 294 000443953_01   0.9504928                 1       2.0655
##     num_vehicles_isBAD housing_type health_ins
## 55                   1    _invalid_      FALSE
## 65                   1    _invalid_       TRUE
## 162                  1    _invalid_      FALSE
## 207                  1    _invalid_      FALSE
## 219                  1    _invalid_       TRUE
## 294                  1    _invalid_       TRUE

customer_data %>%
    summarize(mean_vehicles = mean(num_vehicles, na.rm = TRUE),
    mean_employed = mean(as.numeric(is_employed), na.rm = TRUE))       4
##   mean_vehicles mean_employed
## 1        2.0655     0.9504928

  • 1 Finds the rows where housing_type was missing
  • 2 Looks at a few columns from those rows in the original data
  • 3 Looks at those rows and columns in the treated data (along with the isBADs)
  • 4 Verifies the expected number of vehicles and the expected employment rate in the dataset

You see that vtreat replaced missing values of the categorical variable housing_type with the token _invalid_, and missing values of the numerical column num_vehicles with its average value in the original data. It also converted the logical variable is_employed to a numeric variable, and replaced missing values with its average value in the original data.

In addition to fixing missing data, there are other ways that you can transform the data to address issues that you found during the exploration phase. In the next section, we’ll examine some additional common transformations.

4.2. Data transformations

The purpose of data transformation is to make data easier to model, and easier to understand. Machine learning works by learning meaningful patterns in training data, and then making predictions by exploiting those patterns in new data. Therefore, a data transformation that makes it easier to match patterns in the training data to patterns in new data can be a benefit.

Example

Suppose you are considering the use of income as an input to your insurance model. The cost of living will vary from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. Because of this, it might be more meaningful to normalize a customer’s income by the typical income in the area where they live. This is an example of a relatively simple (and common) transformation.

For this example, you have external information about the median income in each state, in a file called median_income.RDS. Listing 4.7 uses this information to normalize the incomes. The code uses a join operation to match the information from median_income.RDS to the existing customer data. We will discuss joining tables in the next chapter, but for now, you should understand joining as copying data into a data frame from another data frame with matching rows.

Listing 4.7. Normalizing income by state
library(dplyr)
median_income_table <- readRDS("median_income.RDS")                         1
head(median_income_table)

##   state_of_res median_income
## 1      Alabama         21100
## 2       Alaska         32050
## 3      Arizona         26000
## 4     Arkansas         22900
## 5   California         25000
## 6     Colorado         32000

training_prepared <-  training_prepared %>%
  left_join(., median_income_table, by="state_of_res") %>%                  2
   mutate(income_normalized = income/median_income)

head(training_prepared[, c("income", "median_income", "income_normalized")])3

##   income median_income income_normalized
## 1  22000         21100         1.0426540
## 2  23200         21100         1.0995261
## 3  21000         21100         0.9952607
## 4  37770         21100         1.7900474
## 5  39000         21100         1.8483412
## 6  11100         21100         0.5260664

summary(training_prepared$income_normalized)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.4049  1.0000  1.5685  1.9627 46.5556

  • 1 If you have downloaded the PDSwR2 code example directory, then median_income.RDS is in the directory PDSwR2/Custdata. We assume that this is your working directory.
  • 2 Joins median_income_table into the customer data, so you can normalize each person’s income by the median income of their state
  • 3 Compares the values of income and income_normalized

Looking at the results in listing 4.7, you see that customers with an income higher than the median income of their state have an income_normalized value larger than 1, and customers with an income lower than the median income of their state have an income_normalized value less than 1. Because customers in different states get a different normalization, we call this a conditional transform. A long way to say this is that “the normalization is conditioned on the customer’s state of residence.” We would call scaling all the customers by the same value an unconditioned transform.

The need for data transformation can also depend on which modeling method you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between the input variables and the output variable is approximately linear, and that the output variable has constant variance (that is, the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.

In this section, we’ll look at some useful data transformations and when to use them:

  • Normalization
  • Centering and scaling
  • Log transformations

4.2.1. Normalization

Normalization (or rescaling) is useful when absolute quantities are less meaningful than relative ones. You’ve already seen an example of normalizing income relative to another meaningful quantity (median income). In that case, the meaningful quantity was external (it came from outside information), but it can also be internal (derived from the data itself).

For example, you might be less interested in a customer’s absolute age than you are in how old or young they are relative to a “typical” customer. Let’s take the mean age of your customers to be the typical age. You can normalize by that, as shown in the following listing.

Listing 4.8. Normalizing by mean age
summary(training_prepared$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   21.00   34.00   48.00   49.22   62.00  120.00

mean_age <- mean(training_prepared$age)
age_normalized <- training_prepared$age/mean_age
summary(age_normalized)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.4267  0.6908  0.9753  1.0000  1.2597  2.4382

A value for age_normalized that is much less than 1 signifies an unusually young customer; much greater than 1 signifies an unusually old customer. But what constitutes “much less” or “much greater” than 1? That depends on how wide an age spread your customers tend to have. See figure 4.8 for an example.

Figure 4.8. Is a 35-year-old young?

The average customer age in both populations is 50. The population 1 group has a fairly wide age spread, so a 35-year-old still seems fairly typical (perhaps a little young). That same 35-year-old seems unusually young in population 2, which has a narrow age spread. The typical age spread of your customers is summarized by the standard deviation. This leads to another way of expressing the relative ages of your customers.

4.2.2. Centering and scaling

You can rescale your data by using the standard deviation as a unit of distance. A customer who is within one standard deviation of the mean age is considered not much older or younger than typical. A customer who is more than one or two standard deviations from the mean can be considered much older, or much younger. To make the relative ages even easier to understand, you can also center the data by the mean, so a customer of “typical age” has a centered age of 0.

Listing 4.9. Centering and scaling age
(mean_age <- mean(training_prepared$age))                                  1
 ## [1] 49.21647

(sd_age <- sd(training_prepared$age))                                      2
 ## [1] 18.0124

print(mean_age + c(-sd_age, sd_age))                                       3
 ## [1] 31.20407 67.22886

training_prepared$scaled_age <- (training_prepared$age - mean_age) / sd_age4

training_prepared %>%
  filter(abs(age - mean_age) < sd_age) %>%
  select(age, scaled_age) %>%
  head()

##   age scaled_age                                                        5
## 1  67  0.9872942
## 2  54  0.2655690
## 3  61  0.6541903
## 4  64  0.8207422
## 5  57  0.4321210
## 6  55  0.3210864

training_prepared %>%
  filter(abs(age - mean_age) > sd_age) %>%
  select(age, scaled_age) %>%
  head()

##   age scaled_age                                                        6
## 1  24  -1.399951
## 2  82   1.820054
## 3  31  -1.011329
## 4  93   2.430745
## 5  76   1.486950
## 6  26  -1.288916

  • 1 Takes the mean
  • 2 Takes the standard deviation
  • 3 The typical age range for this population is from about 31 to 67.
  • 4 Uses the mean value as the origin (or reference point) and rescales the distance from the mean by the standard deviation
  • 5 Customers in the typical age range have a scaled_age with magnitude less than 1.
  • 6 Customers outside the typical age range have a scaled_age with magnitude greater than 1.

Now, values less than -1 signify customers younger than typical; values greater than 1 signify customers older than typical.

A technicality

The common interpretation of standard deviation as a unit of distance implicitly assumes that the data is distributed normally. For a normal distribution, roughly two-thirds of the data (about 68%) is within plus/minus one standard deviation from the mean. About 95% of the data is within plus/minus two standard deviations from the mean. In figure 4.8 (reproduced as a faceted graph in figure 4.9), a 35-year-old is within one standard deviation from the mean in population 1, but more than one (in fact, more than two) standard deviations from the mean in population 2.

Figure 4.9. Faceted graph: is a 35-year-old young?
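As a quick check, you can compute these fractions directly in R with pnorm():

pnorm(1) - pnorm(-1)     # fraction within one standard deviation of the mean
## [1] 0.6826895
pnorm(2) - pnorm(-2)     # fraction within two standard deviations of the mean
## [1] 0.9544997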

You can still use this transformation if the data isn’t normally distributed, but the standard deviation is most meaningful as a unit of distance if the data is unimodal and roughly symmetric around the mean.

When you have multiple numeric variables, you can use the scale() function to center and scale all of them simultaneously. This has the advantage that the numeric variables now all have similar and more-compatible ranges. To make this concrete, compare the variable age in years to the variable income in dollars. A 10-year difference in age between two customers could be a lot, but a 10-dollar difference in income is quite small. If you center and scale both variables, then the value 0 means the same thing for both scaled variables: the mean age or mean income. And the value 1.5 also means the same thing: a person who is 1.5 standard deviations older than the mean age, or who makes 1.5 standard deviations more than the mean income. In both situations, the value 1.5 can be considered a big difference from the average.

The following listing demonstrates centering and scaling four numerical variables from the data with scale().

Listing 4.10. Centering and scaling multiple numeric variables
dataf <- training_prepared[, c("age", "income", "num_vehicles", "gas_usage")]
summary(dataf)

##       age             income         num_vehicles     gas_usage
##  Min.   : 21.00   Min.   :      0   Min.   :0.000   Min.   :  4.00
##  1st Qu.: 34.00   1st Qu.:  10700   1st Qu.:1.000   1st Qu.: 50.00
##  Median : 48.00   Median :  26300   Median :2.000   Median : 76.01
##  Mean   : 49.22   Mean   :  41792   Mean   :2.066   Mean   : 76.01
##  3rd Qu.: 62.00   3rd Qu.:  51700   3rd Qu.:3.000   3rd Qu.: 76.01
##  Max.   :120.00   Max.   :1257000   Max.   :6.000   Max.   :570.00

dataf_scaled <- scale(dataf, center=TRUE, scale=TRUE)                      1

summary(dataf_scaled)
##       age               income         num_vehicles        gas_usage
##  Min.   :-1.56650   Min.   :-0.7193   Min.   :-1.78631   Min.   :-1.4198
##  1st Qu.:-0.84478   1st Qu.:-0.5351   1st Qu.:-0.92148   1st Qu.:-0.5128
##  Median :-0.06753   Median :-0.2666   Median :-0.05665   Median : 0.0000
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000
##  3rd Qu.: 0.70971   3rd Qu.: 0.1705   3rd Qu.: 0.80819   3rd Qu.: 0.0000
##  Max.   : 3.92971   Max.   :20.9149   Max.   : 3.40268   Max.   : 9.7400

(means <- attr(dataf_scaled, 'scaled:center'))                             2
 ##          age       income num_vehicles    gas_usage
##     49.21647  41792.51062      2.06550     76.00745

(sds <- attr(dataf_scaled, 'scaled:scale'))
##          age       income num_vehicles    gas_usage
##    18.012397 58102.481410     1.156294    50.717778

  • 1 Centers the data by its mean and scales it by its standard deviation
  • 2 Gets the means and standard deviations of the original data, which are stored as attributes of dataf_scaled

Because the scale() transformation puts all the numeric variables in compatible units, it’s a recommended preprocessing step for some data analysis and machine learning techniques like principal component analysis and deep learning.

Keep the training transformation

When you use parameters derived from the data (like means, medians, or standard deviations) to transform the data before modeling, you generally should keep those parameters and use them when transforming new data that will be input to the model. When you used the scale() function in listing 4.10, you kept the values of the scaled:center and scaled:scale attributes as the variables means and sds, respectively. This is so that you can use these values to scale new data, as shown in listing 4.11. This makes sure that the new scaled data is in the same units as the training data.

The same principle applies when cleaning missing values using the design_missingness_treatment() function from the vtreat package, as you did in section 4.1.3. The resulting treatment plan (called treatment_plan in listing 4.4) keeps the information from the training data in order to clean missing values from new data, as shown in listing 4.11.

Listing 4.11. Treating new data before feeding it to a model
newdata <- customer_data                              1

library(vtreat)                                       2
newdata_treated <- prepare(treatment_plan, newdata)

new_dataf <- newdata_treated[, c("age", "income",     3
"num_vehicles", "gas_usage")]

dataf_scaled <- scale(new_dataf, center=means, scale=sds)

  • 1 Simulates having a new customer dataset
  • 2 Cleans it using the treatment plan from the original dataset
  • 3 Scales age, income, num_vehicles, and gas_usage using the means and standard deviations from the original dataset

However, there are some situations when you may wish to use new parameters. For example, if the important information in the model is how a subject’s income relates to the current median income, then when preparing new data for modeling, you would want to normalize income by the current median income, rather than the median income from the time when the model was trained. The implication here is that the characteristics of someone who earns three times the median income will be different from those of someone who earns less than the median income, and that these differences are the same independent of the actual dollar amount of the income.
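For example, if you had a more recent median-income table, you would join and normalize against it rather than against the table saved at training time. In the sketch below, current_median_income_table is a hypothetical data frame with the same state_of_res and median_income columns as median_income_table, but holding current values.

library(dplyr)

newdata_treated <- newdata_treated %>%
  left_join(current_median_income_table, by = "state_of_res") %>%   # hypothetical current table
  mutate(income_normalized = income / median_income)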

4.2.3. Log transformations for skewed and wide distributions

Normalizing by mean and standard deviation, as you did in section 4.2.2, is most meaningful when the data distribution is roughly symmetric. Next, we’ll look at a transformation that can make some distributions more symmetric.

Monetary amounts—incomes, customer value, account values, or purchase sizes—are some of the most commonly encountered sources of skewed distributions in data science applications. In fact, as we'll discuss in appendix B, monetary amounts are often lognormally distributed: the log of the data is normally distributed. This leads us to the idea that taking the log of monetary data can restore symmetry and scale to the data, by making it look “more normal.” We demonstrate this in figure 4.10.

For the purposes of modeling, it’s generally not too critical which logarithm you use, whether the natural logarithm, log base 10, or log base 2. In regression, for example, the choice of logarithm affects the magnitude of the coefficient that corresponds to the logged variable, but it doesn’t affect the structure of the model. We like to use log base 10 for monetary amounts, because orders of ten seem natural for money: $100, $1000, $10,000, and so on. The transformed data is easy to read.

An aside on graphing

Notice that the bottom panel of figure 4.10 has the same shape as figure 3.7. The difference between using the ggplot layer scale_x_log10 on a density plot of income and plotting a density plot of log10(income) is primarily axis labeling. Using scale_x_log10 will label the x-axis in dollar amounts, rather than in logs.

Figure 4.10. A nearly lognormal distribution and its log

It’s also generally a good idea to log transform data containing values that range over several orders of magnitude, for example, the population of towns and cities, which may range from a few hundred to several million. One reason for this is that modeling techniques often have a difficult time with very wide data ranges. Another reason is that such data often comes from multiplicative processes rather than additive ones, so log units are in some sense more natural.

As an example of an additive process, suppose you are studying weight loss. If you weigh 150 pounds and your friend weighs 200, you’re equally active, and you both go on the exact same restricted-calorie diet, then you’ll probably both lose about the same number of pounds. How much weight you lose doesn’t depend on how much you weighed in the first place, only on calorie intake. The natural unit of measurement in this situation is absolute pounds (or kilograms) lost.

As an example of a multiplicative process, consider salary increases. If management gives everyone in the department a raise, it probably isn’t giving everyone $5,000 extra. Instead, everyone gets a 2% raise: how much extra money ends up in your paycheck depends on your initial salary. In this situation, the natural unit of measurement is percentage, not absolute dollars. Other examples of multiplicative processes:

  • A change to an online retail site increases conversion (purchases) for each item by 2% (not by exactly two purchases).
  • A change to a restaurant menu increases patronage every night by 5% (not by exactly five customers every night).

When the process is multiplicative, log transforming the process data can make modeling easier.

Unfortunately, taking the logarithm only works if the data is positive, because the log of zero is –Infinity and the log of negative values isn’t defined (R marks the log of negative numbers as NaN: not a number). There are other transforms, such as arcsinh, that you can use to decrease data range if you have zero or negative values. We don’t always use arcsinh, because we don’t find the values of the transformed data to be meaningful. In applications where the skewed data is monetary (like account balances or customer value), we instead use what we call a signed logarithm. A signed logarithm takes the logarithm of the absolute value of the variable and multiplies by the appropriate sign. Values with absolute value no greater than 1 are mapped to zero. The difference between log and signed log is shown in figure 4.11.

Figure 4.11. Signed log lets you visualize non-positive data on a logarithmic scale.

Here’s how to calculate signed log base 10 in R:

signedlog10 <- function(x) {
     ifelse(abs(x) <= 1, 0, sign(x)*log10(abs(x)))
}

This maps all datums with magnitude less than or equal to 1 to zero, so clearly this transformation isn’t useful if values with magnitude less than 1 are important. But with many monetary variables (in US currency), values less than a dollar aren’t much different from zero (or 1), for all practical purposes. So, for example, mapping account balances that are less than or equal to $1 to zero (the equivalent of every account always having a minimum balance of $1) is probably okay. You can also pick a larger threshold for “small,” such as $100. This would map all the small accounts of less than $100 to the same value, and eliminate the long left tail in figures 4.10 and 4.11. In some situations, eliminating this long tail can be desirable—for one thing, it makes a graph of the data less visually skewed.[4]

4

There are methods other than capping to deal with signed logarithms, such as the arcsinh function (see http://mng.bz/ZWQa), but they also distort data near zero and make almost any data appear to be bimodal, which can be deceiving.
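As an example of using the transform, here is a quick sketch that applies signedlog10() to the income column (assuming customer_data is loaded and signedlog10() is defined as above); incomes of a dollar or less map to zero.

signed_log_income <- signedlog10(customer_data$income)

summary(signed_log_income)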

Once you’ve got the data suitably cleaned and transformed, you’re almost ready to start the modeling stage. Before we get there, we have one more step.

4.3. Sampling for modeling and validation

Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling. In the current era of big datasets, some people argue that computational power and modern algorithms let us analyze the entire large dataset without the need to sample. But keep in mind even “big data” is usually itself a sample from a larger universe. So some understanding of sampling is always needed to work with data.

We can certainly analyze larger datasets than we could before, but sampling is still a useful tool. When you’re in the middle of developing or refining a modeling procedure, it’s easier to test and debug the code on small subsamples before training the model on the entire dataset. Visualization can be easier with a subsample of the data; ggplot runs faster on smaller datasets, and too much data will often obscure the patterns in a graph, as we mentioned in chapter 3. And often it’s not feasible to use your entire customer base to train a model.

It’s important that the dataset that you do use is an accurate representation of your population as a whole. For example, your customers might come from all over the United States. When you collect your customer data, it might be tempting to use all the customers from one state, say Connecticut, to train the model. But if you plan to use the model to make predictions about customers all over the country, it’s a good idea to pick customers randomly from all the states, because what predicts health insurance coverage for Texas customers might be different from what predicts health insurance coverage in Connecticut. This might not always be possible (perhaps only your Connecticut and Massachusetts branches currently collect the customer health insurance information), but the shortcomings of using a nonrepresentative dataset should be kept in mind.

Another reason to sample your data is to create test and training splits.

4.3.1. Test and training splits

When you’re building a model to make predictions, like our model to predict the probability of health insurance coverage, you need data to build the model. You also need data to test whether the model makes correct predictions on new data. The first set is called the training set, and the second set is called the test (or holdout) set. Figure 4.12 shows the splitting process (along with an optional split for a calibration set, which is discussed in the sidebar “Train/calibration/test splits”).

Figure 4.12. Splitting data into training and test (or training, calibration, and test) sets

The training set is the data that you feed to the model-building algorithm (we'll cover specific algorithms in part 2) so that the algorithm can fit the correct structure to best predict the outcome variable. The test set is the data that you feed into the resulting model, to verify that the model’s predictions will be accurate on new data. We’ll go into detail about the kinds of modeling issues that you can detect by using holdout data in chapter 6. For now, we’ll get our data ready for doing holdout experiments at a later stage.

Train/calibration/test splits

Many writers recommend train/calibration/test splits, where the calibration set is used to set parameters that the model-fitting algorithm needs, and the training set is used to fit the model. This is also good advice. Our philosophy is this: split the data into train/test early, don’t look at test until final evaluation, and if you need calibration data, resplit it from your training subset.
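If you do want a calibration set, the random-group trick shown in the next section extends naturally to a three-way split. Here is a minimal sketch; the 60/20/20 proportions and the variable names are just examples.

set.seed(135464)                                   # fix the seed so the split is reproducible
gp <- runif(nrow(customer_data))

train_set <- customer_data[gp > 0.4, ]             # roughly 60% of rows for training
cal_set <- customer_data[gp > 0.2 & gp <= 0.4, ]   # roughly 20% for calibration
test_set <- customer_data[gp <= 0.2, ]             # roughly 20% held out for final evaluation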

4.3.2. Creating a sample group column

A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif() function. You can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the sample group column.

For example, once you’ve labeled all the rows of your data frame with your sample group column (let’s call it gp), then the set of all rows such that gp < 0.4 will be about four-tenths, or 40%, of the data. The set of all rows where gp is between 0.55 and 0.70 is about 15% of the data (0.7 – 0.55 = 0.15). So you can repeatably generate a random sample of the data of any size by using gp.

Listing 4.12. Splitting into test and training using a random group mark
set.seed(25643)                                      1
customer_data$gp <- runif(nrow(customer_data))       2
customer_test <- subset(customer_data, gp <= 0.1)    3
customer_train <- subset(customer_data, gp > 0.1)    4

dim(customer_test)
## [1] 7463   16

dim(customer_train)
## [1] 65799    16

  • 1 Sets the random seed so this example is reproducible
  • 2 Creates the grouping column
  • 3 Here we generate a test set of about 10% of the data.
  • 4 Here we generate a training set using the remaining data.

Listing 4.12 generates a test set of approximately 10% of the data and allocates the remaining 90% of the data to the training set.

The dplyr package also has functions called sample_n() and sample_frac() that draw a random sample (a uniform random sample, by default) from a data frame. Why not just use one of these to draw training and test sets? You could, but you should make sure to set the random seed via the set.seed() command (as we did in listing 4.12) to guarantee that you’ll draw the same sample group every time. Reproducible sampling is essential when you’re debugging code. In many cases, code will crash because of a corner case that you forgot to guard against. This corner case might show up in your random sample. If you’re using a different random input sample every time you run the code, you won’t know if you will tickle the bug again. This makes it hard to track down and fix errors.
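If you do use one of these functions, set the seed immediately before each draw, as in this quick sketch of a reproducible 10% sample:

library(dplyr)

set.seed(25643)                                    # same seed, same sample every time
customer_sample <- sample_frac(customer_data, size = 0.1)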

You also want repeatable input samples for what software engineers call regression testing (not to be confused with statistical regression). In other words, when you make changes to a model or to your data treatment, you want to make sure you don’t break what was already working. If model version 1 was giving “the right answer” for a certain input set, you want to make sure that model version 2 does so also.

We find that storing a sample group column with the data is a more reliable way to guarantee reproducible sampling during development and testing.

Reproducible sampling is not just a trick for R

If your data is in a database or other external store, and you only want to pull a subset of the data into R for analysis, you can draw a reproducible random sample by generating a sample group column in an appropriate table in the database, using your database’s SQL random number function (for example, RAND() in MySQL or RANDOM() in PostgreSQL).
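As a rough sketch of that idea driven from R with the DBI package: the connection con, the customers table, and the sample_gp column are assumptions for illustration, and the exact SQL syntax and random function name (RAND() in MySQL, RANDOM() in PostgreSQL or SQLite) depend on your database.

library(DBI)

# Add a persistent sample-group column and fill it with uniform random numbers (MySQL-flavored SQL).
dbExecute(con, "ALTER TABLE customers ADD COLUMN sample_gp DOUBLE")
dbExecute(con, "UPDATE customers SET sample_gp = RAND()")

# Later, pull a reproducible 10% sample of the table into R for analysis.
customer_sample <- dbGetQuery(con, "SELECT * FROM customers WHERE sample_gp <= 0.1")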

4.3.3. Record grouping

One caveat is that the preceding trick works if every object of interest (every customer, in this case) corresponds to a unique row. But what if you’re interested less in which customers don’t have health insurance, and more in which households have uninsured members? If you’re modeling a question at the household level rather than the customer level, then every member of a household should be in the same group (test or training). In other words, the random sampling also has to be at the household level.

Suppose your customers are marked both by a household ID and a customer ID. This is shown in figure 4.13. We want to split the households into a training set and a test set. Listing 4.13 shows one way to generate an appropriate sample group column.

Figure 4.13. Example of a dataset with customers and households

Listing 4.13. Ensuring test/train split doesn’t split inside a household
household_data <- readRDS("hhdata.RDS")                 1
hh <- unique(household_data$household_id)               2

set.seed(243674)
households <- data.frame(household_id = hh,             3
                          gp = runif(length(hh)),
                         stringsAsFactors=FALSE)

household_data <- dplyr::left_join(household_data,      4
                             households,
                            by = "household_id")

  • 1 If you have downloaded the PDSwR2 code example directory, then the household dataset is in the directory PDSwR2/Custdata. We assume that this is your working directory.
  • 2 Gets the unique household IDs
  • 3 Generates a unique sampling group ID per household, and puts in a column named gp
  • 4 Joins the household IDs back into the original data

The resulting sample group column is shown in figure 4.14. Everyone in a household has the same sampling group number.

Figure 4.14. Sampling the dataset by household rather than customer

Now we can generate the test and training sets as before. This time, however, the threshold 0.1 doesn’t represent 10% of the data rows, but 10% of the households, which may be more or less than 10% of the data, depending on the sizes of the households.
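Concretely, the split now uses the household-level gp column from listing 4.13 (a quick sketch, using the same 10% threshold as before):

household_test <- subset(household_data, gp <= 0.1)      # about 10% of households
household_train <- subset(household_data, gp > 0.1)      # the remaining households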

4.3.4. Data provenance

You’ll also want to add a column (or columns) to record data provenance: when your dataset was collected, perhaps what version of your data-cleaning procedure was used on the data before modeling, and so on. This metadata is akin to version control for data. It’s handy information to have, to make sure that you’re comparing apples to apples when you’re in the process of improving your model, or comparing different models or different versions of a model.

Figure 4.15 shows an example of some possible metadata added to training data. In this example, you have recorded the original data source (called “data pull 8/2/18”), when the data was collected, and when it was treated. If, for example, the treatment date on the data is earlier than the most recent version of your data treatment procedures, then you know that this treated data is possibly obsolete. Thanks to the metadata, you can go back to the original data source and treat it again.

Figure 4.15. Recording the data source, collection date, and treatment date with data
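One lightweight way to record this kind of metadata is to add the columns directly to the data frame, as in the following sketch; the column names and the example values are ours, not a fixed convention.

library(dplyr)

training_prepared <- training_prepared %>%
  mutate(data_source = "data pull 8/2/18",           # which data pull this row came from
         collection_date = as.Date("2018-08-02"),    # when the data was collected (example value)
         treatment_date = Sys.Date())                # when this data treatment was run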

Summary

At some point, you’ll have data quality that is as good as you can make it. You’ve fixed problems with missing data and performed any needed transformations. You’re ready to go on to the modeling stage.

Remember, though, that data science is an iterative process. You may discover during the modeling process that you have to do additional data cleaning or transformation. You may have to go back even further and collect different types of data. That’s why we recommend adding columns for sample groups and data provenance to your datasets (and, later, to the models and model output), so you can keep track of the data management steps as the data and the models evolve.

In this chapter you have learned

  • Different ways of handling missing values may be more suitable for one purpose or another.
  • You can use the vtreat package to manage missing values automatically.
  • How to normalize or rescale data, and when normalization/rescaling are appropriate.
  • How to log transform data, and when log transformations are appropriate.
  • How to implement a reproducible sampling scheme for creating test/train splits of your data.