Chapter 5

Linear and Logistic Regression Models

Learning Objectives

By the end of this chapter, you will be able to:

  • Implement and interpret linear and logistic regression models
  • Compare linear and logistic regression models with cvms
  • Implement a random forest model
  • Create baseline evaluations with cvms
  • Select nondominated models when metrics rank models differently

In Chapter 1, An Introduction to Machine Learning, we were introduced to linear and logistic regression models. In this chapter, we will expand our knowledge of these tools and use cross-validation to compare and choose between a set of models.

Introduction

While neural networks are often better than linear and logistic regression models at solving regression and classification tasks, respectively, they can be very difficult to interpret. If we wish to test the hypothesis that people drink more water when the temperature rises, it's important that we can extract this information from our model. A neural network with many layers might be very good at predicting the water consumption of a person, based on features such as age, gender, weight, height, humidity, and temperature, but it would be difficult to say how temperature alone affects the prediction. Linear regression would tell us specifically how temperature contributed to the prediction. So, while we might get a worse prediction, we gain an insight into the data and, potentially, the real world. Logistic regression, which we use for binary classification, is similarly easier to interpret.

In this chapter, we will implement and interpret linear and logistic regression models. Linear regression is used to predict continuous variables, while logistic regression is used for binary classification. We will use two methods of model building, one where features (which we will also refer to as "predictors") are incrementally added based on theoretical relevance, and one where all possible models are compared with cross-validation, using the cvms package. We will learn how to compare our models to baseline models, similar to what we did in Chapter 4, Introduction to neuralnet and Evaluation Methods, but this time using the baseline() function from cvms. We will also compare the models against random forest models. Finally, we will learn how to identify the nondominated models when our evaluation metrics do not agree on which model is best.

Regression

Regression models are used to predict the value of a dependent variable from a set of independent variables, and to inform us about the strengths and forms of the potential relationships between each independent variable and the dependent variable in our dataset. While we will only cover linear and logistic regression in this chapter, it is worth noting that there are more types of regression, such as Poisson regression, for predicting count variables, such as the number of tattoos a person has, and ordinal regression, for predicting ranked variables, such as questionnaire answers ("Really Bad", "Bad", "Decent", "Good", "Really Good"), where the difference between "Decent" and "Good" is not necessarily the same as between "Really Bad" and "Bad".

Each of these regression models relies on a set of assumptions about the data. For instance, in order to meaningfully use and interpret a linear regression model, we should make sure the phenomenon we are modeling is actually somewhat linear in nature. If these assumptions do not hold true for our data, we cannot reliably interpret our models.

Note

We will not cover these assumptions extensively in this chapter, so we highly recommend learning more about them elsewhere, for instance, in Discovering Statistics using R, by Andy Field (2012), or the in-depth YouTube video "Regression V: All regression assumptions explained!" by the channel zedstatistics. Here is the link to the video: https://www.youtube.com/watch?v=0MFpOQRY0rw

In Chapter 3, Feature Engineering, we learned about feature selection. In this chapter, we will use two different approaches to choose the predictors for our model. One is to use whatever domain-specific theory we have about our task to incrementally add the most meaningful predictors. We then compare this model to the previous models on a set of metrics, to test whether the model is better with or without the added predictor. This approach is meaningful when we have a specific hypothesis we wish to test, and when the models take a long time to train. One of the disadvantages of this approach is that the process might take a long time when we have many potential predictors. Also, it can be hard to know whether we have missed any meaningful combinations of the predictors.

A second approach is to try as many models as possible and choose the model that performs best on a test set (for instance by using cross-validation). Suppose we have five potential predictors available. This leaves us with 31 different models to test and compare. If we allow for two- and three-way interactions between predictors (we will discuss these later in the Interactions section), this increases the number of models to 5,160. If we have even more predictors available, this number can quickly explode. Hence, it makes sense to combine the two approaches by using domain-specific knowledge to limit the number of models to test, for instance, by only including theoretically meaningful predictors.

While the second approach tells us which model is best at predicting the dependent variable in our test set, we do risk ending up with a model that is less theoretically meaningful. A variable can be a good predictor of another variable without there being a causal relationship between the two. For instance, we could probably (to some degree) predict how much sun we had during the summer months from the color of the grass. However, since changing the color of the grass would not affect the amount of sunshine, this is not a causal relationship. Conversely, if we were to predict the color of the grass, the amount of sun in the previous months could plausibly be a causal factor.

We should also be aware that while cross-validation reduces the risk of overfitting, it does not eliminate it. When trying a high number of models, there is a risk that some of them are better due to overfitting.

In this chapter, we will start out with the incremental model-building approach, interpreting each model along the way, and then use cross-validation to test all possible models and interpret the one that performs best. We do this for pedagogical reasons but be aware that we would usually find the best model first, before starting the interpretation process.

The following datasets will be used in this chapter:

Figure 5.1: Datasets

Linear Regression

When performing linear regression, we are trying to find linear relationships between variables. Suppose we have a cat shelter and want to know how many extra cans of cat food we need to buy after receiving new cats. A simple approach would be to find the average number of cans a cat eats per day (z) and multiply it by the number of new cats (x). This is a linear relationship: if the number of cats increases by x, make sure to buy x times z cans of cat food. Of course, other variables might affect how much a new cat eats, such as age, breed, and weight at birth. We could possibly make a better linear model by adding these as predictors.
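
As a toy illustration of this linear rule, the following sketch computes the extra cans needed per day for a handful of new cats. The numbers are made up purely for illustration:

# A minimal sketch of the linear rule described above
# (the values are made up for illustration)
avg_cans_per_cat <- 0.9   # z: average cans eaten per cat per day
new_cats <- 4             # x: number of new cats

# Extra cans of cat food needed per day
extra_cans_per_day <- new_cats * avg_cans_per_cat
extra_cans_per_day        # 3.6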

Imagine we had measured the amount of food eaten by 85 cats, along with their weight at birth (in grams). For budgeting reasons, we wish to predict the amount of food (number of cans) a newborn cat will eat per day when it grows up. We can plot these variables against each other, as shown in Figure 5.2:

Figure 5.2: Cans of cat food eaten per day at age 2 by birth weight

On the x-axis, we have the weight at birth in grams. On the y-axis, we have the number of cans eaten per day when the cat is two years old. The black points are the measurements, while the solid line represents the linear predictions. The dashed lines show the distance between each measurement and the predicted value at that birth weight. These distances are called the residuals.

Our (simulated) data seems to indicate that a heavier newborn cat will eat more at two years old than a lighter one. The residuals (the vertical, dashed lines) are used to calculate the error metrics, such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The smaller the residuals (on a test set), the better the model is. We can plot these residuals as shown in the following figure:

Figure 5.3: Residuals

The solid line lies at 0, as the residuals (the dashed lines) are the distances to that line. In this plot, the residuals seem to be distributed fairly similarly on both sides of the solid line. Because the fitted line minimizes the overall squared error, the residuals balance out, so their sum is the same on each side of the line. Sometimes, though, we can have large outliers (greatly diverging data points) on one side, pulling the line away from the more regular data points. Imagine we had a cat that ate six cans per day. That would pull the line upward, causing the model to make poorer predictions about the more typical one-can-per-day cats. The fewer cats we had measured, the bigger impact such an outlier would have. When our sample size is small, it is therefore important to check that the residuals are somewhat normally distributed (like a bell curve) around 0. This means that most of the residuals should be close to the prediction line, with fewer and fewer residuals the further away we get from the line.

We can visualize the distribution of the residuals with a density plot. This visualizes the concentrations of continuous values in a vector. Imagine pushing the points in Figure 5.3 all the way to the left and rotating the plot by -90 degrees. We then add the density plot to show the distribution of the points at the bottom:

Figure 5.4: Density plot showing the distribution of residuals

The black curvy line (the density plot) shows the shape of the distribution. We want this distribution to resemble a normal distribution (bell curve) with a median of around 0 (the vertical line). There are other types of plots commonly used to inspect normality (that the distribution is normal), as well as a set of statistical tests, such as the Shapiro-Wilk Normality Test. As we will be working with a dataset with more than a few hundred observations, we will not perform these tests or further discuss the assumption of normality in this chapter.
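
If you do want to run these checks on your own data, here is a minimal sketch using base R. It assumes a fitted lm() model named lin_model (hypothetical at this point; we fit one in Exercise 47):

# Sketch: inspecting the residual distribution of a fitted lm() model
res <- residuals(lin_model)

# Density plot of the residuals (should be roughly bell-shaped around 0)
plot(density(res), main = "Distribution of residuals")
abline(v = 0, lty = 2)

# Q-Q plot: points should lie close to the line if the residuals are normal
qqnorm(res)
qqline(res)

# Shapiro-Wilk Normality Test (defined for 3 to 5000 observations)
shapiro.test(res)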

When interpreting our regression model, we are often interested in the relation between each predictor and the dependent variable. We ask how the dependent variable changes when a predictor increases by one unit. This interpretation relies on the assumption that the predictors are independent of each other. If a predictor can itself be predicted by a linear combination of the other predictors (multicollinearity), the predictor's information is redundant and non-independent. In that case, we cannot properly interpret the effects of the single predictors. Importantly, multicollinearity does not necessarily make the model worse at predicting the dependent variable.

The assumption of homoscedasticity is that the variance of the residuals should be similar for different values of our predictors. We shouldn't, for instance, be very good at predicting the amount of cat food when the cat weighed 90 grams at birth but be really bad when the cat weighed 110 grams at birth.

The assumption of no auto-correlation is that the residuals must be independent of each other. This means that we should not be able to predict the value of a residual from the other residuals.
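
As a rough sketch of how these two assumptions are often checked, again assuming a fitted lm() model named lin_model, we can use the built-in diagnostic plots and the Durbin-Watson test from the car package (which we attach later in this chapter):

# Sketch: checking homoscedasticity and auto-correlation

# Scale-location plot: the vertical spread of the residuals should be
# roughly constant across the range of fitted values (homoscedasticity)
plot(lin_model, which = 3)

# Durbin-Watson test for auto-correlated residuals (car package);
# a statistic near 2 suggests little auto-correlation
library(car)
durbinWatsonTest(lin_model)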

Having briefly explained some of the assumptions of regression, we again recommend reading up on the assumptions of the regression models you intend to use. We will now move ahead and train a linear regression model.

Exercise 47: Training Linear Regression Models

In this exercise, we will load, inspect, and prepare the Sacramento dataset from the caret package. It contains prices for 932 houses in the Sacramento, CA area, collected over a five-day period. Then, we will train and interpret a simple linear regression model:

  1. Attach the caret, groupdata2, and cvms packages:

    # Attach Packages

    library(caret)

    library(groupdata2)

    library(cvms)

  2. Set the random seed to 1:

    set.seed(1)

  3. Load the Sacramento dataset:

    # Load Sacramento dataset

    data(Sacramento)

  4. Assign/copy the dataset to another variable name:

    # Assign data to variable name

    full_data <- Sacramento

  5. Summarize the dataset:

    # Summarize the dataset

    summary(full_data)

    The summary is as follows:

    ##              city          zip           beds           baths     

    ##  SACRAMENTO    :438   z95823 : 61   Min.   :1.000   Min.   :1.000

    ##  ELK_GROVE     :114   z95828 : 45   1st Qu.:3.000   1st Qu.:2.000

    ##  ROSEVILLE     : 48   z95758 : 44   Median :3.000   Median :2.000

    ##  CITRUS_HEIGHTS: 35   z95835 : 37   Mean   :3.276   Mean   :2.053

    ##  ANTELOPE      : 33   z95838 : 37   3rd Qu.:4.000   3rd Qu.:2.000

    ##  RANCHO_CORDOVA: 28   z95757 : 36   Max.   :8.000   Max.   :5.000

    ##  (Other)       :236   (Other):672                                 

    ##       sqft                type         price           latitude   

    ##  Min.   : 484   Condo       : 53   Min.   : 30000   Min.   :38.24

    ##  1st Qu.:1167   Multi_Family: 13   1st Qu.:156000   1st Qu.:38.48

    ##  Median :1470   Residential :866   Median :220000   Median :38.62

    ##  Mean   :1680                      Mean   :246662   Mean   :38.59

    ##  3rd Qu.:1954                      3rd Qu.:305000   3rd Qu.:38.69

    ##  Max.   :4878                      Max.   :884790   Max.   :39.02

    ##                                                                   

    ##    longitude    

    ##  Min.   :-121.6

    ##  1st Qu.:-121.4

    ##  Median :-121.4

    ##  Mean   :-121.4

    ##  3rd Qu.:-121.3

    ##  Max.   :-120.6

    The city column contains multiple cities, some of which only appear a few times.

  6. Count the appearances of each city with table(). Use sort() to order the counts:

    # Count observations per city

    sort( table(full_data$city) )

    The output is as follows:

    ##

    ##            COOL DIAMOND_SPRINGS      FORESTHILL   GARDEN_VALLEY

    ##               1               1               1               1

    ##       GREENWOOD          MATHER    MEADOW_VISTA          PENRYN

    ##               1               1               1               1

    ##    WALNUT_GROVE       EL_DORADO          LOOMIS     GRANITE_BAY

    ##               1               2               2               3

    ##   POLLOCK_PINES  RANCHO_MURIETA WEST_SACRAMENTO         ELVERTA

    ##               3               3               3               4

    ##      GOLD_RIVER          AUBURN          WILTON    CAMERON_PARK

    ##               4               5               5               9

    ##       FAIR_OAKS     PLACERVILLE      ORANGEVALE       RIO_LINDA

    ##               9              10              11              13

    ##          FOLSOM         ROCKLIN      CARMICHAEL            GALT

    ##              17              17              20              21

    ## NORTH_HIGHLANDS         LINCOLN EL_DORADO_HILLS  RANCHO_CORDOVA

    ##              21              22              23              28

    ##        ANTELOPE  CITRUS_HEIGHTS       ROSEVILLE       ELK_GROVE

    ##              33              35              48             114

    ##      SACRAMENTO

    ##             438

    Some cities have only one observation, which is not nearly enough to train a model on. As these cities might be very diverse, it can seem a bit artificial to group them together as one condition. On the other hand, doing so might show us some general tendencies of home prices in and outside Sacramento.

  7. Create a one-hot encoded column called in_sacramento, describing whether a home is located in Sacramento:

    # Create one-hot encoded factor column describing

    # if the city is Sacramento or not

    full_data$in_sacramento <- factor(

      ifelse(full_data$city == "SACRAMENTO", 1, 0)

    )

  8. Count the homes in and outside Sacramento:

    # Count observations per city condition

    table(full_data$in_sacramento)

    ##

    ##   0   1

    ## 494 438

  9. Partition the dataset into a training set (80%) and a validation set (20%). Balance the ratios of the two in_sacramento conditions between the partitions and assign each partition to a variable name:

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "in_sacramento")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

    Now, we can train a linear regression model with the in_sacramento column as a predictor of price. We use the lm() function for this. Similar to the neuralnet() function in Chapter 4, Introduction to neuralnet and Evaluation Methods, we first have a model formula: price ~ in_sacramento (read as "price predicted by in_sacramento"). We use the summary() function to view the fitted model parameters.

  10. Fit a linear model where price is predicted by the in_sacramento variable:

    # Fit model where price is predicted by in_sacramento

    lin_model <- lm(price ~ in_sacramento, data = train_set)

  11. Inspect the model summary:

    # Inspect the model summary

    summary(lin_model)

    The summary is as follows:

    ##

    ## Call:

    ## lm(formula = price ~ in_sacramento, data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -261719  -82372  -25719   54491  593071

    ##

    ## Coefficients:

    ##                Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)      291719       6214   46.94   <2e-16 ***

    ## in_sacramento1   -91427       9067  -10.08   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 123500 on 743 degrees of freedom

    ## Multiple R-squared:  0.1204, Adjusted R-squared:  0.1192

    ## F-statistic: 101.7 on 1 and 743 DF,  p-value: < 2.2e-16

Now, let's learn how to interpret the output. At the top of the output, we have the Call section. This simply repeats how we called the lm() function.

Next, we have summary statistics of the residuals. The median residual (-25719) is negative, meaning that most data points lie below the prediction line. Interestingly, the most positive residual (593071) has more than double the magnitude of the most negative residual (-261719).

In the Coefficients section, we see the coefficient estimations, along with their standard errors and significance tests. The coefficient estimates can be thought of as the neural network weights we saw in the previous chapter. For this model, the Intercept tells us the average home price ($291,719) in the smaller cities. The estimate for homes in Sacramento tells us that the average price in Sacramento is $91,427 less! So, on average, it is cheaper to buy a house in Sacramento than in one of the other cities.

The coefficient standard error (std. error) is an estimate of how uncertain the model is about the coefficient estimate. As we are working with a random sample of houses from the Sacramento area, we do not know the true coefficient, and were we to collect a different sample of houses, the coefficient estimate would almost certainly be different from our current estimate. To get closer to the true coefficient, we could collect a lot of these samples and find the average coefficient estimate. As this could be expensive and time-consuming, we would like to have an estimate of how close to the true coefficient we think the model is. The standard error helps us by estimating the standard deviation of the coefficient estimates if we were to gather a lot of samples from the real world. In our summary, the standard error for in_sacramento ($9,067) is a lot smaller than the coefficient estimate ($-91,427). This tells us that we should expect the true coefficient to be within $9,067 of the current coefficient estimate 68% of the time, and within twice that amount ~95% of the time.
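
To see this uncertainty directly, we can ask R for the coefficient estimates and their approximate 95% confidence intervals. A small sketch, using the lin_model we just fitted:

# Coefficient estimates on their own
coef(lin_model)

# 95% confidence intervals for the coefficients;
# roughly the estimate plus/minus two standard errors
confint(lin_model, level = 0.95)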

The p-value (Pr(>|t|)) tells us that the difference between living in and outside Sacramento is statistically significant (<2e-16). This means that we have strong evidence against the null hypothesis (that there is no difference between prices in and outside Sacramento) and can reasonably reject it. If the houses in our training set are representative of the real world, and the many assumptions of the model are met, accepting the alternative hypothesis (that there is a difference between prices in and outside Sacramento) will only make us look stupid about 5% of the time: if the null hypothesis were actually true, we would get a significant effect this extreme less than 5% of the time by chance (given a significance threshold of p = 0.05). Note that the size of the coefficient estimate is just as important as the p-value when deciding whether to believe the alternative hypothesis. If we have enough data points, even very small coefficients can be significant. Whether it is meaningful or not depends on the context.

In the Signif. codes section below the coefficients, we see different significance thresholds, each symbolized by a number of asterisks. We should generally decide on a threshold before running the analysis. We will use the 0.05 threshold (one asterisk), but which threshold is appropriate will depend on the context.

Finally, we get a few metrics about the model, such as adjusted R-squared. This tells us that the model can explain ~11.92% of the variance in the dependent variable in the training set.

R²

The coefficient of determination, R², or R-squared, tells us how much of the variance in the dependent variable can be explained by the predictors in our model, as a proportion between 0 and 1. Although we want this to be as large as possible, even a seemingly small R² can be good if the phenomenon we're modeling is very complex. A model with a high R² is generally better suited for making predictions, whereas a model with a small R² can still be useful for describing the effects of the various predictors. When we add predictors to a model, R² increases as well, even if the model is overfitted. To account for this, the adjusted R² only increases if the added predictor actually makes the model better (that is, it increases R² more than adding a random variable as a predictor would), and can even decrease if it doesn't.

In the previous chapter, we talked about the importance of comparing our models on their performances on a test set (either in cross-validation or a development set). The R² metric is usually calculated on the training set, although a Predicted R² metric does exist. We will therefore not use R² for model selection.
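
If we do want an R²-like number for held-out data, we can compute one from the model's predictions on the validation set. This is a simple out-of-sample analogue, not the formal Predicted R² statistic. A sketch, using lin_model and valid_set from Exercise 47:

# Sketch: an R-squared-like measure computed on the validation set
preds <- predict(lin_model, newdata = valid_set)
ss_residual <- sum((valid_set$price - preds)^2)
ss_total <- sum((valid_set$price - mean(valid_set$price))^2)
1 - ss_residual / ss_total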

In the following exercise, we will visualize the predictions our model makes on the training set.

Exercise 48: Plotting Model Predictions

In this exercise, we will plot our model's predictions of home prices on top of our training data using ggplot2. While we have used ggplot2 in the previous chapters, let's quickly refresh the basics.

We first create a ggplot object. We then use the + operator to add layers to the object. These layers can be boxes, lines, points, or many other geoms (from geometry). We specify the variables for the x and y axes via the aes() function ("aes" is from "aesthetic mapping"). We can also change the colors and sizes of the added geoms:

  1. Attach the ggplot2 package:

    # Attach ggplot2

    library(ggplot2)

  2. Assign/copy the training set to a new name:

    # Assign/copy the training set to a new name

    plotting_data <- train_set

  3. Use the model to predict prices in the training set:

    # Use the model to predict prices in the training set

    plotting_data$predicted_price <- predict(lin_model, train_set)

  4. Create a ggplot object. Specify the mapping between the variables and the axes. Remember the +:

    # Create ggplot object and specify the data and x,y axes

    ggplot(data = plotting_data,

           mapping = aes(x = in_sacramento, y = price)) +

  5. Add each observation in the training set as a point with geom_point(). Change the color and size as well. It is also possible to use the name of a common color, such as "green". Remember the +:

        # Add the training set observations as points

        geom_point(color = '#00004f', size = 0.4) +

  6. Add a prediction line with geom_line(). Because the x-axis is the in_sacramento factor, the line will start at the predicted value when in_sacramento is 0 and end at the predicted value when in_sacramento is 1. Remember the +:

        # Add the predictions as a line

        # Notice that we convert in_sacramento to the

        # type "numeric", as this allows us to draw the

        # line in-between the two binary values

        geom_line(color = '#d8007a',

                  aes(x = as.numeric(in_sacramento),

                      y = predicted_price)) +

  7. Add the light theme. There is a set of themes for ggplot2 plots to choose from. Do not add a +, as this is the last layer of our plot:

        # Add a theme to make our plot prettier

        theme_light()

    The output is as follows:

Figure 5.5: Plot of price predicted by in_sacramento

The points are the actual prices in the training set, while the solid line is what our model would predict in the two conditions. The "line" of black points in the right column seems to be shifted slightly downward from the left column, although a lot of the points are in the same range across the two conditions. The points in the left column cover a larger range of prices, which makes sense because we artificially grouped the cities together. Given that the points with similar values are all stacked on top of each other, it can be hard to tell their actual distribution. For that, we will add a violin plot behind the points.

  1. We wish to add a geom_violin() layer to the beginning of our plot. Instead of rewriting every step, just add the new layer to our code.
  2. You should have this code before the new layer:

    # Create ggplot object and specify the data and x,y axes

    ggplot(data = plotting_data,

           mapping = aes(x = in_sacramento, y = price)) +

  3. Add this code to create the new layer:

        # Add violin plot of the training data

        geom_violin(color = '#00004f', size = 0.4) +

  4. You should have the following code after the new layer:

        # Add the training data as points

        geom_point(color = '#00004f', size = 0.4) +

        # Add the predictions as a line

        # Notice that we convert in_sacramento to the

        # type "numeric", as this allows us to draw the

        # line in-between the two binary values

        geom_line(color = '#d8007a',

                  aes(x = as.numeric(in_sacramento),

                      y = predicted_price)) +

        # Add a theme to make our plot prettier

        theme_light()

    The output will appear as follows:

Figure 5.6: Improved plot of price predicted by in_sacramento

Similar to the density plot we saw previously, the broader the "violins" are, the greater number of data points have approximately that value. As the right violin (when in_sacramento is 1) is broadest at a lower point than the left violin (when in_sacramento is 0), we can tell that the bulk of Sacramento prices lie below those outside Sacramento. We can also tell though that using only this predictor in our model would not make it very successful at estimating the price of a specific home.

In this exercise, we learned how to plot the predictions against the actual observations. In the next exercise, we will learn how to incrementally add predictors.

Exercise 49: Incrementally Adding Predictors

In this exercise, we will add predictors to our linear regression model one at a time and see how the model summary changes. We have three additional predictors that are likely to impact the home price: the size (sqft), the number of bedrooms (beds), and the number of bathrooms (baths). We also have the type of house, but because it is so imbalanced, we will leave it out. As larger houses tend to have more beds and baths than smaller houses, we might hypothesize that our predictors are correlated. As you may recall, one of the assumptions of linear regression is that there is not a high degree of multicollinearity. If this assumption is not met, we need to be careful when interpreting the summaries. There are multiple approaches to checking for multicollinearity.

A simple first check is whether the predictors correlate with one another. If two predictors are highly correlated, we should consider removing one of them. Correlation is not enough though, as multicollinearity can also stem from a linear combination of multiple predictors. To check for this, it is common to calculate the Variance Inflation Factor (VIF). Let's first check the correlation between predictors, and then the VIF:

  1. Attach the car package:

    # Attach car

    library(car)

  2. Find the correlation coefficients between sqft, beds, and baths:

    cor(train_set[, c(3:5)])

    The correlation coefficients are as follows:

    ##            beds     baths      sqft

    ## beds  1.0000000 0.6452788 0.7179578

    ## baths 0.6452788 1.0000000 0.7619998

    ## sqft  0.7179578 0.7619998 1.0000000

    The pairwise correlations are sqft and baths (0.762), sqft and beds (0.718), and beds and baths (0.645). These correlation coefficients are fairly high, but not necessarily too high. They are worth bearing in mind, however, when interpreting the model summaries. To assess whether they are too high, or whether we have overly high multicollinearity in other ways, let's use VIF.

  3. Train a model with all four predictors and apply the vif() function from the car package:

    # Train a linear regression model

    # with all four predictors

    lin_model_all_predictors <- lm(

        price~in_sacramento + sqft + beds + baths,

        data = train_set)

    # Apply the vif() function to the model object

    vif(lin_model_all_predictors)

    ## in_sacramento          sqft          beds         baths

    ##      1.108431      3.133212      2.196416      2.515782

    We want the VIF values to be as close to 1 as possible. A common rule of thumb is that a VIF > 5 should be interpreted as there being high multicollinearity. In that case, we could consider removing one of the correlated predictors. In our case, the largest VIF is 3.133, which is acceptable for our analysis. Now, we will start training and interpreting some linear regression models.

  4. Train a linear model with in_sacramento and sqft as predictors of price:

    lin_model_2 <- lm(price ~ in_sacramento + sqft,

                      data = train_set)

    summary(lin_model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = price ~ in_sacramento + sqft, data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -198778  -51380  -11981   35566  403189

    ##

    ## Coefficients:

    ##                 Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)     43421.47    9118.38   4.762 2.31e-06 ***

    ## in_sacramento1 -35732.88    6306.19  -5.666 2.09e-08 ***

    ## sqft              131.09       4.29  30.558  < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 82240 on 742 degrees of freedom

    ## Multiple R-squared:  0.6105, Adjusted R-squared:  0.6095

    ## F-statistic: 581.6 on 2 and 742 DF,  p-value: < 2.2e-16

    Adding the size in square feet changed the summary quite a lot. Our intercept is now the price when living outside Sacramento with a home size of 0 square feet. For every extra square foot, we have to add $131.09 to the price. The estimated effect of living in Sacramento has now become smaller as well, though it's still statistically significant. Finally, we see that the adjusted R-squared has grown to 60.95%. Now, let's add the number of bedrooms.

  5. Train a linear model with in_sacramento, sqft, and beds as predictors of price:

    lin_model_3 <- lm(price ~ in_sacramento + sqft + beds,

                      data = train_set)

    summary(lin_model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = price ~ in_sacramento + sqft + beds, data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -197285  -51124  -10338   32609  390503

    ##

    ## Coefficients:

    ##                  Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)     88084.399  12230.228   7.202 1.46e-12 ***

    ## in_sacramento1 -32276.138   6225.042  -5.185 2.79e-07 ***

    ## sqft              154.180      6.025  25.591  < 2e-16 ***

    ## beds           -26058.703   4861.568  -5.360 1.11e-07 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 80740 on 741 degrees of freedom

    ## Multiple R-squared:  0.6251, Adjusted R-squared:  0.6235

    ## F-statistic: 411.8 on 3 and 741 DF,  p-value: < 2.2e-16

    We get an interesting estimate for the number of bedrooms. It seems that for every bedroom we add to the house, the price drops by $26,058. In other words, once the model already accounts for the size in square feet and for whether the house is located in Sacramento, an increase in the number of bedrooms decreases our price expectation. Whereas more bedrooms usually imply a larger house and therefore a higher price, here we ask what extra bedrooms do to the price when they do not also imply a larger home.

    All these effects are significant, and the adjusted R-squared has increased to 62.35%. What happens if we add the number of bathrooms to the model?

  6. Train a linear model with in_sacramento, sqft, beds, and baths as predictors of price:

    lin_model_4 <- lm(

        price ~ in_sacramento + sqft + beds + baths,

        data = train_set)

    summary(lin_model_4)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = price ~ in_sacramento + sqft + beds + baths, data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -197062  -50856  -10504   32786  389848

    ##

    ## Coefficients:

    ##                  Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)     88552.649  12780.774   6.929 9.25e-12 ***

    ## in_sacramento1 -32331.816   6244.561  -5.178 2.90e-07 ***

    ## sqft              154.666      7.142  21.656  < 2e-16 ***

    ## beds           -25916.586   4991.630  -5.192 2.69e-07 ***

    ## baths            -838.448   6596.329  -0.127    0.899

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 80800 on 740 degrees of freedom

    ## Multiple R-squared:  0.6251, Adjusted R-squared:  0.623

    ## F-statistic: 308.4 on 4 and 740 DF,  p-value: < 2.2e-16

    When we already know the values of sqft, in_sacramento, and beds, the coefficient estimate for baths is not significant (p=0.899). The adjusted R-squared is approximately the same.

    What would happen if we left out the sqft predictor?

  7. Train a linear model with in_sacramento, beds, and baths as predictors of price:

    lin_model_5 <- lm(price ~ in_sacramento + beds + baths,

                      data = train_set)

    summary(lin_model_5)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = price ~ in_sacramento + beds + baths, data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -271495  -65516  -18642   41358  519964

    ##

    ## Coefficients:

    ##                Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)       38812      16060   2.417   0.0159 *

    ## in_sacramento1   -58566       7825  -7.485 2.04e-13 ***

    ## beds              24695       5634   4.383 1.34e-05 ***

    ## baths             75745       7113  10.649  < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 103200 on 741 degrees of freedom

    ## Multiple R-squared:  0.3875, Adjusted R-squared:  0.385

    ## F-statistic: 156.2 on 3 and 741 DF,  p-value: < 2.2e-16

Removing sqft made the coefficient estimate for baths significant (p<2e-16) and turned both beds ($24,695) and baths ($75,745) into positive effects. The adjusted R-squared decreased markedly (to 0.385).

But how do we know which model to trust?

Comparing Linear Regression Models

In Chapter 4, Introduction to neuralnet and Evaluation Methods, we used cross-validation to compare our classification models based on their predictions. This technique is also useful for linear regression, although we use a different set of evaluation metrics. Instead of using the long cross-validation training loop code from that chapter, though, we will use the cvms package, which is specifically designed for cross-validating linear and logistic regression models. We simply pass it a list of our model formulas as strings, and it gives us the results of the cross-validations.

Note

The cvms package was created by one of the authors of this book, Ludvig Renbo Olsen.

Evaluation Metrics

When evaluating how well our model predicts the dependent variable in the test set, we use the residuals to calculate a set of error metrics. As mentioned previously, the residuals are the differences between the observations and the predicted values. The residuals can be both negative and positive, so we cannot simply add them together, as they might cancel out (if, for instance, one residual is 3 and another is -3, the result of adding them together would be 0). We are instead interested in the magnitudes of the residuals (the absolute values of the residuals). These are what we use in the Mean Absolute Error (MAE) metric. An alternative approach to making all the errors non-negative is to square them. We do this for the Root Mean Square Error (RMSE) metric. These two metrics are commonly used, and we wish to minimize both as far as possible. As the values of the two metrics depend on the scale of the dependent variable, they can be hard to compare between projects. We therefore mainly use them to compare our model formulas when selecting the best performing model. As we will see in the Differences between MAE and RMSE section, a key difference between the two metrics is how they respond to outliers.

MAE

The MAE (Mean Absolute Error) is the average absolute difference between the observed values and the predicted values.

The formula for MAE is as follows:

Figure 5.7: MAE formula

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Here, $y$ is the vector of observations, and $\hat{y}$ is the vector of predictions, with $y_i$ and $\hat{y}_i$ being their individual elements.

RMSE

The RMSE (Root Mean Square Error) is the square root of the average squared difference between the observed values and the predicted values. If this sounds complicated, here are the four steps involved:

  1. Find the residuals by subtracting the predicted values from the observed values.
  2. Square the residuals.
  3. Find the average squared residual.
  4. Take the square root of this average squared residual.

The formula for RMSE is as follows:

Figure 5.8: RMSE formula

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
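
To make the two formulas concrete, here is a small sketch that computes both metrics from a vector of observed values and a vector of predictions (the numbers are made up):

# Sketch: computing MAE and RMSE from observations and predictions
observed  <- c(1.0, 1.4, 0.8, 1.2, 2.1)
predicted <- c(1.1, 1.3, 1.0, 1.0, 1.6)

res <- observed - predicted

mae  <- mean(abs(res))     # Mean Absolute Error
rmse <- sqrt(mean(res^2))  # Root Mean Square Error
mae
rmse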

Differences between MAE and RMSE

One of the main differences between MAE and RMSE is how they react to outliers. RMSE penalizes large outliers more because it squares the residuals. To get a sense of how they differ, we can add two outliers (one negative and one positive) to the residuals in our cat food data and scale them, such that their magnitudes become increasingly larger than the magnitudes of the other residuals. Let's first visualize our residuals with these additional outliers:

Figure 5.9: Residuals with added outliers at three different scales

Figure 5.9 shows three different scalings of our added outliers. The outliers are simply multiplied by the scaling factors (1, 5, and 10). Now, let's see the effect of these scalings on the RMSE and MAE metrics. We will compare the effect of the outlier scalings to the effect of scaling all the residuals at once:

Figure 5.10: Comparison of MAE and RMSE at different outlier scalings

On the left in Figure 5.10, we see the effect (y-axis) of multiplying all the residuals by different scaling factors (x-axis). This shows that RMSE becomes increasingly larger than MAE when the residuals increase in magnitude. On the right, only the two outliers have been scaled. While MAE increases linearly with the total error (in the left-hand plot, it is simply multiplied by the scaling factor), the larger the scaling factor becomes, the more the two outliers inflate RMSE.

For the remainder of this chapter, we will use RMSE. This will favor models with fewer large residual outliers (errors). MAE could also be useful if we believe a prediction being $10,000 off is about twice as bad as being $5,000 off. This depends on the context in which we wish to use our model.
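
The following tiny sketch, with made-up residuals, shows how a single large outlier inflates RMSE much more than MAE:

# Sketch: the effect of one large outlier on MAE versus RMSE
regular_residuals <- c(-0.2, 0.1, -0.1, 0.2, 0.15)
with_outlier <- c(regular_residuals, 3)  # add one large outlier

mae  <- function(r) mean(abs(r))
rmse <- function(r) sqrt(mean(r^2))

mae(regular_residuals); rmse(regular_residuals)  # fairly similar
mae(with_outlier); rmse(with_outlier)            # RMSE grows much more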

Exercise 50: Comparing Models with the cvms Package

In this exercise, we will use the cvms package to cross-validate a list of model formulas:

  1. Attach the cvms and groupdata2 packages:

    # Attach packages

    library(cvms)

    library(groupdata2)

  2. Set the random seed to 2:

    # Set seed for reproducibility

    set.seed(2)

  3. Create a vector with model formulas:

    # Create vector with model formulas

    model_formulas <- c(

        "price ~ in_sacramento",

        "price ~ in_sacramento + sqft",

        "price ~ in_sacramento + sqft + beds",

        "price ~ in_sacramento + sqft + beds + baths",

        "price ~ in_sacramento + beds + baths")

  4. Create five folds in the training set. Balance the folds by in_sacramento:

    # Create folds

    train_set <- fold(train_set, k = 5,

                      cat_col = "in_sacramento")

  5. Cross-validate the model formulas with cross_validate(). In addition to the dataset and model formulas, we specify the name of the fold column, .folds, and the model family, gaussian:

    # Cross-validate our models

    # Note that we specify the model family as "gaussian",

    # which refers to linear regression

    cv_results <- cross_validate(data = train_set,

                                 models = model_formulas,

                                 fold_cols = ".folds",

                                 family = "gaussian")

  6. Print the results:

    cv_results

    The output is as follows:

    ## # A tibble: 5 x 18

    ##     RMSE    MAE   r2m   r2c    AIC   AICc    BIC Predictions Results

    ##    <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <list>      <list>

    ## 1 1.23e5 92787. 0.121 0.121 15670. 15670. 15683. <tibble [7… <tibbl…

    ## 2 8.21e4 59878. 0.610 0.610 15187. 15187. 15204. <tibble [7… <tibbl…

    ## 3 8.06e4 58516. 0.624 0.624 15166. 15166. 15188. <tibble [7… <tibbl…

    ## 4 8.08e4 58662. 0.624 0.624 15167. 15167. 15194. <tibble [7… <tibbl…

    ## 5 1.03e5 75560. 0.387 0.387 15458. 15458. 15480. <tibble [7… <tibbl…

    ## # … with 9 more variables: Coefficients <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    The output contains a lot of information. It is worth understanding the content of all the columns, but for now, we will select the metrics and model formulas with select_metrics() and discuss them. The other columns are described in the help page, accessed with ?cross_validate. We will use the kable() function from the knitr package to spice up the layout of the data frame.

  7. Attach the knitr package:

    # Attach knitr

    library(knitr)

  8. Select the metrics and model definitions with select_metrics(). Print the output with kable(), with numbers rounded to two digits:

    # Select the metrics and model formulas

    # with select_metrics()

    # Print with knitr::kable for fanciness,

    # and round to 2 decimals.

    kable(select_metrics(cv_results), digits = 2)

    The output is as follows:

Figure 5.11: The metrics in the cross-validation output

As mentioned previously, we will focus on the RMSE metric, but Figure 5.12 contains the goals for each metric in the output:

Figure 5.12: Goals for each metric

Looking at the cross-validation results in Figure 5.11, we see that the third model (the one without the number of bathrooms) has the lowest RMSE. As this is our selection criterion, this model is the best of the five. When the models are this close, we can use repeated cross-validation to get an even better estimate of the model performances. We will apply this technique later, as well as cover an approach for choosing models by multiple, disagreeing metrics.
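
As a preview of repeated cross-validation, the following sketch creates several fold columns with fold() and passes them all to cross_validate(). The argument names follow the groupdata2 and cvms documentation as we recall them, so treat this as an assumption rather than a definitive recipe:

# Sketch: repeated cross-validation with multiple fold columns
# (assumes the num_fold_cols argument in groupdata2::fold())
train_set <- fold(train_set, k = 5,
                  cat_col = "in_sacramento",
                  num_fold_cols = 3,
                  handle_existing_fold_cols = "remove")

cv_results_repeated <- cross_validate(
    data = train_set,
    models = model_formulas,
    fold_cols = paste0(".folds_", 1:3),
    family = "gaussian")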

Interactions

So far, we have asked whether the price of a home goes up or down when we increase the size or add a bedroom. But what if adding a bedroom increases the price of a small home and decreases it for a bigger home? Or vice versa? We couldn't tell! We can ask our model that question, though, through interaction terms. An interaction is written as effect1 : effect2 and is simply added to the formula as follows:

lin_model_6 <- lm(price ~ in_sacramento + sqft + beds + sqft : beds,

                  data = train_set)

summary(lin_model_6)

The summary will appear as follows:

##

## Call:

## lm(formula = price ~ in_sacramento + sqft + beds + sqft : beds, data = train_set)

##

## Residuals:

##     Min      1Q  Median      3Q     Max

## -202169  -48471  -11623   33049  396985

##

## Coefficients:

##                  Estimate Std. Error t value Pr(>|t|)   

## (Intercept)      1237.724  26124.893   0.047 0.962225   

## in_sacramento1 -30556.412   6187.792  -4.938 9.75e-07 ***

## sqft              211.963     16.515  12.835  < 2e-16 ***

## beds            -3190.095   7768.994  -0.411 0.681471   

## sqft:beds         -14.431      3.845  -3.753 0.000189 ***

## ---

## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 80040 on 740 degrees of freedom

## Multiple R-squared:  0.6321, Adjusted R-squared:  0.6301

## F-statistic: 317.8 on 4 and 740 DF,  p-value: < 2.2e-16

An alternative to the colon (:) operator is the asterisk (*) operator. This operator unpacks the interaction such that the predictors are also included in the formula on their own. If we have more than two predictors in our interaction, for instance, price ~ sqft * beds * baths (which we refer to as a three-way interaction), it will also include all the smaller interactions (the two-way interactions in this case). We will use the asterisk operator as it is an easier and cleaner way to write our formulas. We simply replace + with a * between two or more predictors, like so:

lin_model_6_2 <- lm(price ~ in_sacramento + sqft * beds,

                    data = train_set)

The summary of this model would only differ from the previous summary in the Call section.
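
If we want to see exactly which terms the asterisk operator expands to, we can inspect the formula without fitting anything. A small sketch:

# Sketch: listing the terms that the * operator expands to
attr(terms(price ~ sqft * beds), "term.labels")
# Expands to: sqft, beds, sqft:beds

attr(terms(price ~ sqft * beds * baths), "term.labels")
# Expands to: sqft, beds, baths, sqft:beds, sqft:baths,
#             beds:baths, sqft:beds:baths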

When our model formula contains interaction terms, we get separate coefficient estimates for the interactions, represented as effect1:effect2, and the individual effects. When interpreting the summary, we start with the interaction estimates. If an interaction term is significant, we can reasonably reject the null hypothesis that the effects are not interacting. This changes the interpretation of the two predictors' coefficient estimates because, at least in our dataset, these appear to be dependent on one another and not just linearly correlated with the dependent variable. In our model, the interaction between sqft and beds is significant. To interpret this, we plot the two variables against each other with the interplot package. First, we will plot the effect of sqft on the price when the number of bedrooms changes. Notice that interplot creates a ggplot2 object, to which we can add layers and themes:

# Attach interplot

library(interplot)

# Plot the effect of size

# depending on the number of bedrooms

interplot(lin_model_6, "sqft", "beds", hist = TRUE) +

  # Add labels to the axes

  labs(x = "Number of bedrooms", y = "Price") +

  # Add a theme to make our plot prettier

  theme_light()

The output will be as follows:

Figure 5.13: Interaction plot for price per sqft when beds changes

The plot contains a black line with a gray boundary (95% confidence interval) and a histogram (the bars at the bottom). Note that the scale of the y-axis (Price) is for the line, not the histogram bars.

The histogram tells us the distribution of homes per bedroom count. The black line tells us the estimated price per square foot at the different bedroom counts. When there are few bedrooms, each square foot is worth almost $200. The more bedrooms we have, the less each square foot is worth. For a home with eight bedrooms, each square foot is worth less than $100.

Does our analysis indicate that we should knock down some walls if we were selling a house? Or does it tell us that the price of homes with eight bedrooms is less affected by the size?

While a house with 8 bedrooms is unlikely to be small and cheap, a home with only one bedroom could be both small and cheap, or large and expensive. It is also possible that the relationship between price and size is not actually linear. For instance, the gained benefit from adding one square foot to a 200 square-foot home is likely larger than adding it to a 3,000 square-foot home. It is also worth noting from the histogram that we have very few homes with more than five bedrooms in our dataset. If we were a real estate agent working with smaller houses, we could investigate the effect of removing these from the dataset, as they might decrease model performance on the types of houses we care about.

Now, let's check the effect of the bedroom count on the price when size changes:

# Plot the effect of number of bedrooms

# depending on size

interplot(lin_model_6, "beds", "sqft", hist = TRUE) +

  # Add labels to the axes

  labs(x = "Size in square feet", y = "Price") +

  # Add a theme to make our plot prettier

  theme_light()

The plot is as follows:

Figure 5.14: Interaction plot for price per bedroom when size changes

In Figure 5.14, the black line shows the effect of beds on price at different house sizes. The histogram is the distribution of homes at different sizes. Again, the scale of the y-axis (Price) is for the line and not the histogram bars.

The larger our home is, the more each bedroom subtracts from the price! So, if we have a small home, adding a bedroom affects the price less negatively than if the home is big. Again, we could ask whether the effect of beds on the price is actually linear. Does adding the first bedroom to a home affect the price the same as adding the fifth bedroom?

As most of our homes have a size of between 700 and 3,000 square feet, the few very large homes could bias the model inappropriately. If our real estate agency mostly sells homes in the lower size range, we could try to remove the biggest houses from the model and see whether this changes the model performance on our types of houses.

Exercise 51: Adding Interaction Terms to Our Model

Now that we have a grasp of what interactions are, we return to our proposed models. Are there any other potential interactions we should model? Perhaps even three-way interactions? (These are even harder to interpret!) Instead of adding one interaction at a time, let's create a list of model formulas with all the combinations of the predictors (also called fixed effects), including the potential two- and three-way interactions. For this task, cvms has a function called combine_predictors(). It has a few limitations, which you can read about in the help page at ?combine_predictors. Once created, we will cross-validate all the models and select the best one. As mentioned previously, this approach does not guarantee a theoretically meaningful model, which is why we should always think critically about its meaningfulness:

  1. Attach the cvms and groupdata2 packages:

    # Attach packages

    library(cvms)

    library(groupdata2)

  2. Set the random seed to 1:

    set.seed(1)

  3. Generate all possible model formulas (with and without two- and three-way interaction terms) from the four predictors (fixed effects) using combine_predictors(). This will create 165 formulas:

    # Create all combinations of the predictors

    model_formulas <- combine_predictors(

        dependent = "price",

        fixed_effects = c("in_sacramento",

                          "sqft",

                          "baths",

                          "beds"))

  4. Print the first 15 model formulas:

    # Print the generated model formulas

    head(model_formulas, 15)

    The output is as follows:

    ##  [1] "price ~ baths"                             

    ##  [2] "price ~ beds"                              

    ##  [3] "price ~ in_sacramento"                     

    ##  [4] "price ~ sqft"                              

    ##  [5] "price ~ baths * beds"                      

    ##  [6] "price ~ baths * in_sacramento"             

    ##  [7] "price ~ baths * sqft"                      

    ##  [8] "price ~ baths + beds"                      

    ##  [9] "price ~ baths + in_sacramento"             

    ## [10] "price ~ baths + sqft"                      

    ## [11] "price ~ beds * in_sacramento"              

    ## [12] "price ~ beds * sqft"                       

    ## [13] "price ~ beds + in_sacramento"              

    ## [14] "price ~ beds + sqft"                       

    ## [15] "price ~ in_sacramento * sqft"

  5. Create five folds in the training set. Balance the folds by in_sacramento. As we already have a fold column in the training set, we ask fold() to remove it and create a new one:

    # Create folds

    train_set <- fold(

        train_set, k = 5,

        cat_col = "in_sacramento",

        handle_existing_fold_cols = "remove")

  6. Cross-validate the model formulas:

    # Cross-validate our models

    # Note that we specify the model family as "gaussian",

    # which refers to linear regression

    cv_results <- cross_validate(data = train_set,

                                 models = model_formulas,

                                 fold_cols = ".folds",

                                 family = "gaussian")

  7. Order the results by RMSE:

    # Order by RMSE

    cv_results <- cv_results[order(cv_results$RMSE),]

  8. Select the 10 best models:

    # Select the 10 best models

    # (feel free to view them all instead)

    cv_results_top10 <- head(cv_results, 10)

  9. Select the metrics and model definition columns with select_metrics() and print them with kable():

    # Select the metrics and

    # model formulas with select_metrics()

    # Print the top 10 models with

    # knitr::kable for fanciness

    kable(select_metrics(cv_results_top10), digits = 2)

    The output is as follows:

    Figure 5.15: Output using kable

    The best model has an interaction between baths and in_sacramento and an interaction between beds and sqft.

  10. Train the best model on the training set and inspect its summary. Try to interpret it:

    lin_model_7 <- lm(

        "price ~ baths * in_sacramento + beds * sqft",

        data = train_set)

    summary(lin_model_7)

    The summary is as follows:

    ##

    ## Call:

    ## lm(formula = "price ~ baths * in_sacramento + beds * sqft", data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -217093  -47986  -10551   33501  399965

    ##

    ## Coefficients:

    ##                       Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)          -38995.37   29818.83  -1.308   0.1914   

    ## baths                 11786.48    8049.75   1.464   0.1436   

    ## in_sacramento1        20807.25   19003.95   1.095   0.2739   

    ## beds                   2242.42    8047.44   0.279   0.7806   

    ## sqft                    218.39      17.11  12.763  < 2e-16 ***

    ## baths:in_sacramento1 -25478.69    8895.65  -2.864   0.0043 **

    ## beds:sqft               -16.83       3.92  -4.295 1.98e-05 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 79700 on 738 degrees of freedom

    ## Multiple R-squared:  0.6361, Adjusted R-squared:  0.6332

    ## F-statistic: 215.1 on 6 and 738 DF,  p-value: < 2.2e-16

Both interaction terms are significant, so we start by interpreting these. The first interaction term, baths:in_sacramento1, tells us that when the house is located in Sacramento, the effect of adding a bathroom decreases by $25,478.69. It also tells us that for each additional bathroom we add, the effect of being located in Sacramento decreases by $25,478.69.

The second interaction term, beds:sqft, tells us that for each square foot by which we increase the size, the effect of adding a bedroom decreases by $16.83. It also tells us that for each bedroom we add to the house, the effect of adding one square foot to the size of the house decreases by $16.83.

More formally, the x1:x2 interaction tells us the change in the coefficient for x1 when x2 is increased by 1 unit, and the change in the coefficient for x2 when x1 is increased by 1 unit.
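For example, using the rounded estimates above, the total effect of one additional bathroom for a house located in Sacramento is the baths estimate plus the baths:in_sacramento1 estimate:

# Effect of one extra bathroom for a house in Sacramento,
# computed from the rounded coefficient estimates above
11786.48 + (-25478.69)
## [1] -13692.21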

The intercept tells us the price when the house has 0 bathrooms, 0 bedrooms, a size of 0 square feet, and is located outside Sacramento. The coefficient for sqft tells us that when we have 0 bedrooms, adding a square foot to the house affects the price by $218.39. The coefficient for beds tells us that when the size is 0 square feet, adding a bedroom affects the price by $2,242.42. The coefficient for in_sacramento tells us that when we have 0 bathrooms, being located in Sacramento adds $20,807.25 to the price. And finally, the coefficient for baths tells us that when the house is located outside Sacramento, adding a bathroom adds $11,786.48.

If it's weird to think about a house with a size of 0 square feet and (more importantly) 0 bathrooms, this is a good time to emphasize that the model does not have common sense. It does not know that such a house is not meaningful, nor does it know that the size can't be a negative number. It simply finds the coefficients that lead to the lowest prediction error in the training set. However, it is possible to make these numbers more meaningful to us. By centering (not scaling) the predictors, the intercept is instead interpreted as the price when the predictors have their mean values, as sketched below. In the next section, we will discuss whether we should scale our predictors as well.
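A minimal sketch of that idea, assuming the caret package and that columns 3 to 5 of train_set hold beds, baths, and sqft (as in the code in the next section):

# Attach caret for preProcess()
library(caret)

# Center (but do not scale) the numeric predictors
center_params <- preProcess(train_set[, 3:5], method = "center")
centered_data <- train_set
centered_data[, 3:5] <- predict(center_params, centered_data[, 3:5])

# In a model fit on centered_data, the intercept is the predicted
# price at the mean number of beds, baths, and square feet
lin_model_centered <- lm(price ~ baths * in_sacramento + beds * sqft,
                         data = centered_data)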

Should We Standardize Predictors?

In previous chapters, we have standardized our features before training the models. By doing so, we make sure they are all on the same scale. This can be great for some types of models, especially when we mostly care about how well our model predicts something. When the goal is to interpret our coefficient estimates though, standardizing the predictors mostly makes our job more difficult. Now, let's standardize the numeric predictors, train our model on them, and compare it to the previous model's summary:

# Find scaling and centering parameters

params <- preProcess(train_set[, 3:5],

                     method = c("center", "scale"))

# Make a copy of the dataset

standardized_data <- train_set

# Transform the dataset

standardized_data[, 3:5] <- predict(params,

                                    standardized_data[, 3:5])

# Train a model on the standardized data

lin_model_7_2 <- lm(price ~ baths * in_sacramento + beds * sqft,

                  data = standardized_data)

summary(lin_model_7_2)

The summary of the model is as follows:

##

## Call:

## lm(formula = price ~ baths * in_sacramento + beds * sqft, data = standardized_data)

##

## Residuals:

##     Min      1Q  Median      3Q     Max

## -217093  -47986  -10551   33501  399965

##

## Coefficients:

##                      Estimate Std. Error t value Pr(>|t|)   

## (Intercept)            269271       4405  61.131  < 2e-16 ***

## baths                    8395       5734   1.464   0.1436   

## in_sacramento1         -31843       6189  -5.145 3.43e-07 ***

## beds                   -23116       4356  -5.307 1.48e-07 ***

## sqft                   119822       5637  21.255  < 2e-16 ***

## baths:in_sacramento1   -18148       6336  -2.864   0.0043 **

## beds:sqft              -10870       2531  -4.295 1.98e-05 ***

## ---

## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 79700 on 738 degrees of freedom

## Multiple R-squared:  0.6361, Adjusted R-squared:  0.6332

## F-statistic: 215.1 on 6 and 738 DF,  p-value: < 2.2e-16

The interaction terms are still significant and the adjusted R-squared has not changed. The biggest change is in the Estimate column. The intercept is no longer the price when the house has 0 bathrooms, 0 bedrooms, and a size of 0 square feet, as we have centered these predictors. It is instead the price when we have the mean value of each of these predictors in our training set. Due to the scaling, the estimates for these predictors are now on the same scale. This has the benefit that they are easier to compare. On the other hand, the estimates are harder to map to our understanding of the real world. What does it mean to say that increasing the size by one standard deviation increases the price by $119,822?

Hence, we might want to train the model without standardization (though possibly with centering) for interpretability and with standardization for the comparison of coefficient estimate sizes.

Repeated Cross-Validation

If two models are very close to one another in our cross-validation results, we need to ensure that it is not the way we created the folds that is responsible for the small difference. One way to test this is by making new folds with a different seed, and rerunning the cross-validation. We can repeat this process 100 times and average the results. Luckily, groupdata2 and cvms have this functionality built in.

Exercise 52: Running Repeated Cross-Validation

In this exercise, we will use repeated cross-validation on the eight best linear regression models from the previous exercise, Adding Interaction Terms to Our Model. We will only run 10 repetitions to save time:

  1. Attach the cvms and groupdata2 packages:

    # Attach packages

    library(cvms)

    library(groupdata2)

    library(caret)

  2. Set the random seed to 2:

    # Set seed for reproducibility and easy comparison.

    set.seed(2)

  3. Load the Sacramento dataset:

    data(Sacramento)

  4. Assign/copy the dataset to another variable name:

    full_data <- Sacramento

  5. Create the in_sacramento variable:

    # Create one-hot encoded factor column describing

    # if the city is Sacramento or not

    full_data$in_sacramento <- factor(

        ifelse(full_data$city == "SACRAMENTO", 1, 0) )

  6. Partition the dataset into a training set (80%) and a validation set (20%). Balance the ratios of the in_sacramento levels between the partitions and assign each partition to a variable name:

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "in_sacramento")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  7. Create 10 unique fold columns with fold(). This is done by specifying the num_fold_cols argument. fold() will then attempt to create 10 unique columns, named ".folds_1", ".folds_2", and so on:

    # Create folds

    train_set <- fold(train_set, k = 5,

                      cat_col = "in_sacramento",

                      num_fold_cols = 10)

  8. Extract or create the fold column names. Either extract the names of the last 10 columns in the folded dataset or use paste0() to create the strings quickly:

    # Create the 10 fold column names

    fold_cols <- paste0(".folds_", 1:10)

    # Print the names

    fold_cols

    ##  [1] ".folds_1"  ".folds_2"  ".folds_3"

    ##  [4] ".folds_4"  ".folds_5"  ".folds_6"

    ##  [7] ".folds_7"  ".folds_8"  ".folds_9"

    ##  [10]".folds_10"

  9. Reconstruct the eight best formulas from the cross-validation results, using the reconstruct_formulas() function from cvms. Start by ordering the cv_results from the Adding Interaction Terms to Our Model exercise by RMSE. If you do not have access to cv_results, you can copy these formulas from the model_formulas vector in the following output and add them to a vector manually:

    # Order cv_results by RMSE

    cv_results <- cv_results[order(cv_results$RMSE),]

    # Reconstruct formulas for the top 8 models

    model_formulas <- reconstruct_formulas(cv_results, topn = 8)

    # Print the model formulas

    model_formulas

    The output is as follows:

    ## [1] "price ~ baths * in_sacramento + beds * sqft"

    ## [2] "price ~ beds * in_sacramento + beds * sqft"

    ## [3] "price ~ baths * beds + baths * in_sacramento + sqft"

    ## [4] "price ~ baths * beds + baths * in_sacramento + beds * sqft"

    ## [5] "price ~ beds * sqft + in_sacramento * sqft"

    ## [6] "price ~ baths * in_sacramento + beds * sqft + in_sacramento * sqft"

    ## [7] "price ~ baths + beds * in_sacramento + beds * sqft"               

    ## [8] "price ~ baths * in_sacramento + beds * in_sacramento + beds * sqft"

  10. Run the repeated cross-validation with cross_validate(). In the fold_cols argument, add the list of fold column names:

    # Cross-validate our models

    # Pass the list of fold column names

    # and cross_validate() will

    # cross-validate with each one and

    # return the average results

    repeated_cv_results <- cross_validate(

        data = train_set,

        models = model_formulas,

        fold_cols = fold_cols,

        family = "gaussian")

  11. Show the results with kable():

    # Select the metrics and model formulas

    # Print with knitr::kable for fanciness

    kable(select_metrics(repeated_cv_results), digits = 2)

    The output is as follows:

    Figure 5.16: Output using kable

    The same model (price ~ baths * in_sacramento + beds * sqft) is still the best, by RMSE. The sixth model is the second best, and, in general, the rankings have changed a lot.

    We could run even more cross-validation repetitions if we wanted to be sure of their rankings. If we wish to see the results of each repetition, these results are also available.

  12. Show the results of the first 10 folds for the best model:

    # Extract the fold results for the best model

    fold_results_best_model <- repeated_cv_results$Results[[1]]

    # Print the results of the first 10 folds

    kable( head(fold_results_best_model, 10) )

    The output is as follows:

Figure 5.17: Results of the best model on the first 10 test folds

Figure 5.17 shows the results of the first 10 folds for the best model formula. Looking at the RMSE column, we see that the results vary quite a lot, depending on what fold is used as the test set. This is one of the reasons why repeated cross-validation is a good technique: we train and test the model on many combinations of the training set!

Exercise 53: Validating Models with validate()

Now that we have found the best model with cross_validate(), we want to train that model on the entire training set and evaluate it on the validation set. We could easily do this manually, but cvms has the validate() function that does this for us and returns the same metrics as cross_validate(), along with the trained model.

In this exercise, we will work with the training set from the previous exercise:

  1. Validate the best model from the previous exercise:

    # Train the model on the entire training set

    # and test it on the validation set.

    validation <- validate(

        train_data = train_set,

        test_data = valid_set,

        models = "price ~ baths * in_sacramento + beds * sqft",

        family = "gaussian")

  2. The output contains the "Results" data frame and the trained model object. Assign them to separate variables:

    valid_results <- validation$Results

    valid_model <- validation$Models[[1]]

  3. Print the results:

    kable(select_metrics(valid_results), digits = 2)

    The output is as follows:

    Figure 5.18: Results of the best model on the validation set

    The RMSE is greater on the validation set than in the cross-validation results. If we compare it to the per-fold results in Figure 5.17, though, it isn't higher than most of those RMSE scores, which indicates that the model is not overfitted.

  4. Print the summary of the model:

    summary(valid_model)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = model_formula, data = train_set)

    ##

    ## Residuals:

    ##     Min      1Q  Median      3Q     Max

    ## -209993  -46950  -12796   33713  553773

    ##

    ## Coefficients:

    ##                        Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)          -58024.456  29367.835  -1.976 0.048551 *

    ## baths                 23083.792   7865.568   2.935 0.003441 **

    ## in_sacramento1        29453.042  18898.087   1.559 0.119539   

    ## beds                   6095.281   7811.066   0.780 0.435441   

    ## sqft                    209.351     17.326  12.083  < 2e-16 ***

    ## baths:in_sacramento1 -30085.305   8904.927  -3.379 0.000767 ***

    ## beds:sqft               -17.465      3.912  -4.464 9.29e-06 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 79520 on 738 degrees of freedom

    ## Multiple R-squared:  0.6181, Adjusted R-squared:  0.615

    ## F-statistic: 199.1 on 6 and 738 DF,  p-value: < 2.2e-16

The summary differs from the one we trained in the Adding Interaction Terms to Our Model exercise, because the different random seeds used in the two exercises mean that the partitions (training and validation sets) are different. The differences between the two summaries are sufficiently large that we might question which summary is most likely to be correct. Just as the cross_validate() results contain the results of the individual cross-validation iterations, they also contain the coefficient estimates of every trained model instance. We could use these to investigate which coefficient estimates are most common, as sketched below.
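A minimal sketch of how this could look, assuming the repeated_cv_results object from Exercise 52 and that the nested Coefficients tibbles use broom-style column names (term, estimate):

# Extract the nested coefficient estimates for the first model,
# which was the best model by RMSE
coefs_best_model <- repeated_cv_results$Coefficients[[1]]

# Median estimate per term, across all folds and fold columns
aggregate(estimate ~ term, data = coefs_best_model, FUN = median)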

Activity 18: Implementing Linear Regression

In this activity, we will predict the price of a car in the cars dataset, using linear regression. Fit multiple linear regression models and compare them with cross-validation using the cvms package. Validate the best model on a validation set and interpret the model summary.

The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/caret_cars.csv.

The following steps should help you with the solution:

  1. Attach the groupdata2, cvms, caret, and knitr packages.
  2. Set the random seed to 1.
  3. Load the cars dataset from the preceding GitHub link.
  4. Partition the dataset into a training set (80%) and a validation set (20%).
  5. Fit multiple linear regression models on the training set with the lm() function, predicting Price. Try different predictors.
  6. View the summary() of each fitted model. Try to interpret the estimated coefficients. How do the interpretations change when you add or subtract predictors?
  7. Create model formulas with combine_predictors(). Limit the number of possibilities by a) using only the first four predictors, b) limiting the number of fixed effects in the formulas to 3, by specifying max_fixed_effects = 3, c) limiting the biggest possible interaction to a two-way interaction by specifying max_interaction_size = 2, and d) limiting the number of times a predictor can be included in a formula by specifying max_effect_frequency = 1. These limitations will decrease the number of models to run, which you may or may not want in your own projects.
  8. Create five fold columns with four folds each in the training set, using fold() with k = 4 and num_fold_cols = 5. Feel free to choose a higher number of fold columns.
  9. Perform repeated cross-validation on your model formulas with cvms.
  10. Select the best model according to RMSE.
  11. Fit the best model on the entire training set and evaluate it on the validation set. This can be done with the validate() function in cvms.
  12. View and interpret the summary of the best model.

    The output should be similar to the following:

    ##

    ## Call:

    ## lm(formula = Price ~ Cruise * Cylinder + Mileage, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -10485  -5495  -1425   3494  34693

    ##

    ## Coefficients:

    ##                   Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)      8993.2446  3429.9320   2.622  0.00895 **

    ## Cruise          -1311.6871  3585.6289  -0.366  0.71462   

    ## Cylinder         1809.5447   741.9185   2.439  0.01500 *

    ## Mileage            -0.1569     0.0367  -4.274 2.21e-05 ***

    ## Cruise:Cylinder  1690.0768   778.7838   2.170  0.03036 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 7503 on 638 degrees of freedom

    ## Multiple R-squared:  0.424,  Adjusted R-squared:  0.4203

    ## F-statistic: 117.4 on 4 and 638 DF,  p-value: < 2.2e-16

    Note

    The solution for this activity can be found on page 358.

Log-Transforming Predictors

When discussing interactions, we touched upon the linearity of the relationships between our predictors and the dependent variable in the Sacramento dataset. Does it seem likely that one additional square foot affects the price of a 200 square-foot home the same as of a 3,000 square-foot home? What about 100 additional square feet, which is a 50% increase for the first home, but only a ~3.3% increase for the second?

Similarly, the difference between having one or two bedrooms in a house seems bigger than the difference between having seven and eight bedrooms.

When we have non-linear relationships, it can sometimes help to apply a non-linear transformation, such as the log transformation, to the predictor (or even the dependent variable) to try and make it more linear. There are other types of transformations as well. Note that such transformations can make the summary harder to interpret.

Besides the linearity of the variable's relation to the dependent variable, the log transformation also affects its distribution. We can plot the histogram of the sqft predictor before and after the log transformation:

Figure 5.19: Distributions of log-transformed and untransformed sqft predictors

On the left in Figure 5.19, we have a histogram of the log-transformed sqft predictor in the training set. On the right, we have the histogram of the untransformed sqft predictor. Where the original version was heavily skewed, the transformed version is much closer to a normal distribution.
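A minimal sketch of how such a comparison could be plotted with base R graphics, assuming the train_set object from the previous exercises:

# Plot the log-transformed and untransformed sqft side by side
par(mfrow = c(1, 2))
hist(log(train_set$sqft), main = "log(sqft)", xlab = "log(sqft)")
hist(train_set$sqft, main = "sqft", xlab = "sqft")
par(mfrow = c(1, 1))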

Exercise 54: Log-Transforming Predictors

In this exercise, we will create log-transformed versions of the three predictors – sqft, beds, and baths. We will create model formulas with all combinations of the transformed and non-transformed predictors, and use cross-validation to select the best model:

  1. Attach caret, cvms, and groupdata2:

    # Attach packages

    library(caret)

    library(cvms)

    library(groupdata2)

  2. Set the random seed to 3:

    # Set seed for reproducibility and easy comparison.

    set.seed(3)

  3. Load the Sacramento dataset:

    data(Sacramento)

  4. Assign the dataset to another variable name:

    full_data <- Sacramento

  5. Create the in_sacramento variable:

    # Create one-hot encoded factor column describing

    # if the city is Sacramento or not

    full_data$in_sacramento <- factor(

        ifelse(full_data$city == "SACRAMENTO", 1, 0) )

  6. Create log-transformed versions of sqft, beds, and baths:

    # Log-transform predictors

    full_data$log_sqft <- log(full_data$sqft)

    full_data$log_beds <- log(full_data$beds)

    full_data$log_baths <- log(full_data$baths)

  7. Partition the dataset into a training set (80%) and validation set (20%). Balance the ratios of the in_sacramento levels between the partitions and assign each partition to a variable name:

    # Partition the dataset

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "in_sacramento")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  8. Create five folds in the training set, using fold():

    # Create folds

    train_set <- fold(train_set, k = 5,

                      cat_col = "in_sacramento")

  9. Generate all possible model formulas with and without log transformations, using combine_predictors(). As we do not want both the transformed and non-transformed version of a predictor in the same formula, we supply a nested list to the fixed_effects argument. A sub-list should contain the various versions of a predictor that we wish to try out. Formulas with all combinations of the sub-list elements are generated. To save time, we will reduce the number of formulas generated by limiting the number of times a predictor can be included in a formula to 1:

    # Create all combinations of the predictors

    model_formulas <- combine_predictors(

        dependent = "price",

        fixed_effects = list("in_sacramento",

                             list("sqft", "log_sqft"),

                             list("baths", "log_baths"),

                             list("beds", "log_beds")),

        max_effect_frequency = 1)

    # Count the number of generated model formulas

    length(model_formulas)

    ## [1] 255

  10. Show the last 10 model formulas:

    # Show the last 10 formulas

    tail(model_formulas, 10)

    The output is as follows:

    ##  [1] "price ~ log_baths + log_beds * in_sacramento * log_sqft"

    ##  [2] "price ~ log_baths + log_beds * in_sacramento * sqft"   

    ##  [3] "price ~ log_baths + log_beds * in_sacramento + log_sqft"

    ##  [4] "price ~ log_baths + log_beds * in_sacramento + sqft"   

    ##  [5] "price ~ log_baths + log_beds * log_sqft + in_sacramento"

    ##  [6] "price ~ log_baths + log_beds * sqft + in_sacramento"   

    ##  [7] "price ~ log_baths + log_beds + in_sacramento * log_sqft"

    ##  [8] "price ~ log_baths + log_beds + in_sacramento * sqft"   

    ##  [9] "price ~ log_baths + log_beds + in_sacramento + log_sqft"

    ## [10] "price ~ log_baths + log_beds + in_sacramento + sqft"

  11. Run the cross-validation with cross_validate():

    # Cross-validate our models

    cv_results <- cross_validate(data = train_set,

                                 models = model_formulas,

                                 family = "gaussian")

  12. Order the results by RMSE and subset the best 10 models:

    # Order by RMSE

    cv_results <- cv_results[order(cv_results$RMSE),]

    # Create a subset with the 10 best models

    cv_results_top10 <- head(cv_results, 10)

  13. Show the results with kable():

    # Select the metrics and model formulas

    # Print with knitr::kable for fanciness

    kable(select_metrics(cv_results_top10), digits = 2)

    The output is as follows:

Figure 5.20: Cross-validation output with log-transformed predictors included

The four best models are all variations of the best model from the previous cross-validations. As the improvement we gained by adding the log-transformed predictors is very small, we might prefer to leave them out, so the model is easier to interpret. In the next section, we will learn about logistic regression.

Logistic Regression

In linear regression, we modeled continuous values, such as the price of a home. In (binomial) logistic regression, we apply a logistic sigmoid function to the output, resulting in a value between 0 and 1. This value can be interpreted as the probability that the observation belongs to class 1. By setting a cutoff/threshold (such as 0.5), we can use it as a classifier. This is the same approach we used with the neural networks in the previous chapter. The sigmoid function is σ(z) = 1 / (1 + e^(-z)), where z is the output from the linear regression:

Figure 5.21: A plot of the sigmoid function

Figure 5.21 shows the sigmoid function applied to the output z. The dashed line represents our cutoff of 0.5. If the predicted probability is above this line, the observation is predicted to be in class 1; otherwise, it is predicted to be in class 0.
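As a minimal sketch of this idea (not code from the exercises), the sigmoid and the 0.5 cutoff can be written directly in R:

# The logistic sigmoid function
sigmoid <- function(z) 1 / (1 + exp(-z))

# Apply it to a few linear-regression outputs
z <- c(-3, -1, 0, 1, 3)
probs <- sigmoid(z)
probs

# Classify as class 1 when the probability is above the cutoff
as.integer(probs > 0.5)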

For logistic regression, we use the generalized version of lm(), called glm(), which can be used for multiple types of regression. As we are performing binary classification, we set the family argument to binomial.

Now, suppose we had a long list of houses from the Sacramento area and wanted to know whether they were located in a city outside of Sacramento. We could train a logistic regression classifier with the predictors in our dataset (including price) to classify the in_sacramento variable, and use that model on the list. Such a model would be useful in our context, so we will build a couple of them and discuss whether they are theoretically meaningful and how to interpret their summaries. Then, we will cross-validate the possible model formulas with cvms.

Exercise 55: Training Logistic Regression Models

In this exercise, we will train logistic regression models for predicting whether a home is located in Sacramento:

  1. Attach the packages:

    # Attach packages

    library(caret)

    library(cvms)

    library(groupdata2)

    library(knitr)

  2. Set the random seed to 2:

    # Set seed for reproducibility and easy comparison.

    set.seed(2)

  3. Load the Sacramento dataset:

    # Load the dataset

    data(Sacramento)

  4. Assign/copy the dataset to another variable name:

    full_data <- Sacramento

  5. Create the in_sacramento variable:

    # Create one-hot encoded factor column describing

    # if the city is Sacramento or not

    full_data$in_sacramento <- factor(

        ifelse(full_data$city == "SACRAMENTO", 1, 0) )

  6. Create log-transformed versions of sqft, beds, baths, and price:

    # Log-transform predictors

    full_data$log_sqft <- log(full_data$sqft)

    full_data$log_beds <- log(full_data$beds)

    full_data$log_baths <- log(full_data$baths)

    full_data$log_price <- log(full_data$price)

  7. Partition the dataset into a training set (80%) and a validation set (20%). Balance the ratios of the in_sacramento levels between the partitions and assign each partition to a variable name:

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "in_sacramento")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  8. Train a logistic regression model with price as a predictor of in_sacramento:

    logistic_model_1 <- glm(in_sacramento ~ price,

                            data = train_set,

                            family = "binomial")

    summary(logistic_model_1)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = in_sacramento ~ price, family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.7461  -1.0995  -0.4591   1.0455   2.5995

    ##

    ## Coefficients:

    ##               Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)  1.486e+00  1.942e-01   7.654 1.95e-14 ***

    ## price       -6.910e-06  7.992e-07  -8.647  < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 1030.07  on 744  degrees of freedom

    ## Residual deviance:  930.69  on 743  degrees of freedom

    ## AIC: 934.69

    ##

    ## Number of Fisher Scoring iterations: 4

    Due to the logistic output (between 0 and 1), we can't interpret the coefficient estimates the same way we did with linear regression. They now tell us whether a single unit increase in the predictor increases or decreases the probability that the outcome is class 1 (though not directly how much it increases/decreases). As the coefficient estimate for price is negative, this means that the higher the price, the lower the chance that a house is located in Sacramento. Or, more accurately, the lower the chance that the model will predict that a house is located in Sacramento. Since changing the price of the house would not actually change the location of the house, it is not a causal relationship. In fact, the dependent variable (in_sacramento) is not actually dependent on any of our predictors. Hence, we shouldn't interpret the model summary as a description of causal relationships, but as a description of correlations that help the model make the best possible predictions. In general, linear and logistic models can't conclude that there are causal relationships between variables. We must instead use theoretical knowledge (and experimental design, where we intervene with the phenomenon we're trying to understand and measure the effect of the intervention) to determine whether there is strong evidence of a causal relationship.

    Technically, the coefficient estimate is the change in the log odds of the class being 1 when the predictor is increased by one unit. The odds of an event x, odds(x), are defined as the probability of x, p(x), divided by the probability of it not occurring, 1 - p(x). The log odds are simply the natural logarithm of the odds, log(p(x) / (1 - p(x))); this function is also referred to as the logit function. When the log odds of x increase, the probability of x also increases.
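    A related trick, shown as a minimal sketch below (assuming the logistic_model_1 object from step 8), is to exponentiate a coefficient to get an odds ratio, which is often easier to communicate:

    # Convert the price coefficient to an odds ratio
    exp(coef(logistic_model_1)["price"])
    # A value below 1 means the odds of being located in Sacramento
    # decrease as the price increases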

    Let's add more predictors to our model.

  9. Add the number of bathrooms to the logistic model:

    logistic_model_2 <- glm(in_sacramento ~ price + baths,

                            data = train_set,

                            family = "binomial")

    summary(logistic_model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = in_sacramento ~ price + baths, family = "binomial",

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.7115  -1.0913  -0.4623   1.0606   2.4275

    ##

    ## Coefficients:

    ##               Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)  1.896e+00  2.678e-01   7.082 1.42e-12 ***

    ## price       -5.801e-06  9.150e-07  -6.340 2.30e-10 ***

    ## baths       -3.289e-01  1.430e-01  -2.300   0.0215 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 1030.07  on 744  degrees of freedom

    ## Residual deviance:  925.33  on 742  degrees of freedom

    ## AIC: 931.33

    ##

    ## Number of Fisher Scoring iterations: 4

    For each additional bathroom, the chance that the home is located in Sacramento decreases. If we compare the Akaike Information Criterion (AIC) at the bottom of the summary to the AIC of the previous model, this model's AIC is lower. AIC is a common metric for comparing model architectures. The lower the AIC, the better. Like RMSE, AIC is relative to the dataset and task, so we do not compare AICs from different tasks or datasets. Given that this model has a lower AIC, we prefer it to the first model.
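    If we prefer not to read the AIC values off the summaries, base R's AIC() function can compare the fitted models directly; a minimal sketch:

    # Compare the two fitted models by AIC (lower is better)
    AIC(logistic_model_1, logistic_model_2)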

  10. Add the sqft predictor:

    logistic_model_3 <- glm(in_sacramento ~ price + baths + sqft,

                            data = train_set,

                            family = "binomial")

    summary(logistic_model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = in_sacramento ~ price + baths + sqft, family = "binomial",

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.7353  -1.0833  -0.4555   1.0499   2.4370

    ##

    ## Coefficients:

    ##               Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)  1.897e+00  2.680e-01   7.080 1.44e-12 ***

    ## price       -6.401e-06  1.148e-06  -5.573 2.50e-08 ***

    ## baths       -4.239e-01  1.795e-01  -2.362   0.0182 *

    ## sqft         2.044e-04  2.308e-04   0.886   0.3758   

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 1030.07  on 744  degrees of freedom

    ## Residual deviance:  924.54  on 741  degrees of freedom

    ## AIC: 932.54

    ##

    ## Number of Fisher Scoring iterations: 4

    This model has a higher AIC than the previous model, although they are so close that if we partitioned our dataset with a different random seed, it could be lower.

    Now, we will generate the possible model formulas with combine_predictors() and cross-validate them.

  11. Create five folds in the training set using fold(). Balance the folds by in_sacramento:

    # Create folds

    train_set <- fold(train_set, k = 5,

                      cat_col = "in_sacramento")

  12. Generate model formulas with and without log transformations using combine_predictors(). To save time, limit the number of times a fixed effect can be included in a formula by specifying max_effect_frequency = 1:

    # Create all combinations of the predictors

    model_formulas <- combine_predictors(

        dependent = "in_sacramento",

        fixed_effects = list(list("price","log_price"),

                             list("sqft", "log_sqft"),

                             list("baths","log_baths"),

                             list("beds", "log_beds")),

        max_effect_frequency = 1)

    # Count the number of generated model formulas

    length(model_formulas)

    ## [1] 440

  13. Show the last 10 model formulas:

    # Show the last 10 formulas

    tail(model_formulas, 10)

    The last 10 model formulas are as follows:

    ##  [1] "in_sacramento ~ log_baths + log_beds * sqft + log_price"   

    ##  [2] "in_sacramento ~ log_baths + log_beds * sqft + price"       

    ##  [3] "in_sacramento ~ log_baths + log_beds + log_price * log_sqft"

    ##  [4] "in_sacramento ~ log_baths + log_beds + log_price * sqft"   

    ##  [5] "in_sacramento ~ log_baths + log_beds + log_price + log_sqft"

    ##  [6] "in_sacramento ~ log_baths + log_beds + log_price + sqft"   

    ##  [7] "in_sacramento ~ log_baths + log_beds + price * log_sqft"   

    ##  [8] "in_sacramento ~ log_baths + log_beds + price * sqft"       

    ##  [9] "in_sacramento ~ log_baths + log_beds + price + log_sqft"   

    ## [10] "in_sacramento ~ log_baths + log_beds + price + sqft"

    This generated 440 model formulas. Cross-validating all these models could take a while. Luckily, we can speed up the process by running it in parallel. Instead of cross-validating one model at a time, using only a single CPU core, we can utilize more cores and cross-validate multiple models at a time. To do this, we attach the doParallel package and register how many cores we want to use with the registerDoParallel() function. If we do not specify the number of cores, registerDoParallel() will autodetect the available cores and choose half.

  14. Attach doParallel and register the number of cores:

    # Attach doParallel

    library(doParallel)

    # Register four CPU cores

    registerDoParallel(4)

  15. Run the cross-validation with cross_validate(). Set parallel to TRUE and family to binomial:

    # Cross-validate our models in parallel

    cv_results <- cross_validate(data = train_set,

                                 models = model_formulas,

                                 family = "binomial",

                                 parallel = TRUE)

  16. Order the results by F1 and subset the 10 best models. As we want the F1 score to be as high as possible, we set decreasing to TRUE:

    # Order by F1 (descending)

    cv_results <- cv_results[

        order(cv_results$F1, decreasing = TRUE),]

    # Create a subset with the 10 best models

    cv_results_top10 <- head(cv_results, 10)

    cv_results_top10

    The output is as follows:

    ## # A tibble: 10 x 26

    ##    'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##               <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ##  1            0.684 0.663       0.657       0.711            0.669

    ##  2            0.686 0.663       0.651       0.722            0.675

    ##  3            0.683 0.660       0.651       0.714            0.669

    ##  4            0.678 0.659       0.66        0.696            0.658

    ##  5            0.672 0.658       0.671       0.673            0.646

    ##  6            0.675 0.658       0.666       0.684            0.651

    ##  7            0.676 0.657       0.66        0.691            0.654

    ##  8            0.667 0.657       0.68        0.653            0.635

    ##  9            0.668 0.657       0.677       0.658            0.637

    ## 10            0.673 0.656       0.663       0.684            0.650

    ## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence

    ## #   Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,

    ## #   Link <chr>, Dependent <chr>, Fixed <chr>

    The output contains a lot of information, so we select a subset of the metrics and focus on them.

  17. Select some of the metrics and show the results with kable():

    # Select the metrics and model formulas

    cv_results_top10 <- select_metrics(cv_results_top10)

    # Remove some of the metrics

    # Note: A great alternative is dplyr::select()

    # In general, the dplyr package is amazing

    cv_results_top10 <- cv_results_top10[,

        !(colnames(cv_results_top10) %in%

            c("Lower CI",

              "Upper CI","Detection Rate",

              "Detection Prevalence",

              "Prevalence")), drop=FALSE]

    # Print with knitr::kable for fanciness

    kable(cv_results_top10, digits = 3)

    The output is as follows:

    Figure 5.22: Logistic model output

    The formula "in_sacramento ~ baths * log_price * log_sqft + log_beds" has the highest F1 score, although it's approximately the same as the second best model. The output contains a few metrics that we haven't covered previously. Importantly, different statistical fields have different names for the same metrics, so what we know as Recall is also called Sensitivity, and what we know as Precision is also called Positive Prediction Value (Pos Pred Value). We won't go into detail about the other metrics, but we do recommend learning about them. As we learned about accuracy in the previous chapter, we will introduce the Balanced Accuracy metric, which is a version of accuracy that works well with imbalanced datasets. It has the following formula:

    Figure 5.23: Formula for balanced accuracy

    That is, balanced accuracy is the mean of sensitivity and specificity: (Sensitivity + Specificity) / 2.

    As with the regular accuracy score, we wish for balanced accuracy to be as high as possible. In the cross-validation results, it favors a different model than the F1 score. We will learn how to choose models based on multiple metrics later.

  18. Use validate() to train the best model on the training set and evaluate it on the validation set:

    # Evaluate the best model on the validation set with validate()

    V_results_list <- validate(

        train_data = train_set,

        test_data = valid_set,

        models = "in_sacramento ~ baths * log_price * log_sqft + log_beds",

        family = "binomial")

    V_model <- V_results_list$Models[[1]]

    V_results <- V_results_list$Results

    # Select metrics

    V_results <- select_metrics(V_results)

    # Remove some of the metrics

    V_results <- V_results[, !(colnames(V_results) %in% c(

        "Lower CI","Upper CI","Detection Rate",

        "Detection Prevalence", "Prevalence")), drop=FALSE]

    # Print the results

    kable(V_results, digits = 3)

    The output is as follows:

    Figure 5.24: Validation results

    The results are very similar to the cross-validation results. Now, let's view the model summary.

  19. View and interpret the summary of the best model:

    # Print the model summary and interpret it

    summary(V_model)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = model_formula, family = binomial(link = link),

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -2.2231  -1.0373  -0.3921   1.0380   2.7434

    ##

    ## Coefficients:

    ##                          Estimate Std. Error z value Pr(>|z|)

    ## (Intercept)              101.4540   113.1667   0.897   0.3700

    ## baths                    -86.3490    53.1521  -1.625   0.1043

    ## log_price                 -6.9804     9.1408  -0.764   0.4451

    ## log_sqft                 -13.5135    16.2621  -0.831   0.4060

    ## log_beds                   0.6739     0.4214   1.599   0.1097

    ## baths:log_price            6.2569     4.2603   1.469   0.1419

    ## baths:log_sqft            12.8919     7.3428   1.756   0.0791 .

    ## log_price:log_sqft         0.9241     1.3080   0.707   0.4799

    ## baths:log_price:log_sqft  -0.9484     0.5858  -1.619   0.1054

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 1030.07  on 744  degrees of freedom

    ## Residual deviance:  904.19  on 736  degrees of freedom

    ## AIC: 922.19

    ##

    ## Number of Fisher Scoring iterations: 5

Interestingly, none of the predictors or interaction terms are statistically significant. While this, in itself, would not be enough to conclude that none of the predictors affect the dependent variable, it does mean that if there were such relationships, we would not have enough evidence (data points) to say so. We also see that the AIC has not decreased much from the simple models we started with. With these things in mind, it would be interesting to see the baseline evaluation, so we know whether our model is doing better than random guessing. Normally, creating the baseline evaluation should be one of the first steps in our analysis, so we always know whether we are improving. In Chapter 4, Introduction to neuralnet and Evaluation Methods, we wrote the code for creating baseline evaluations ourselves. In the next two exercises, we will use the baseline() function from cvms to do it.

Recommendation: Once you feel comfortable with linear and logistic regression, the next step is to add random effects to your models. This is also known as mixed-effect modeling.

Exercise 56: Creating Binomial Baseline Evaluations with cvms

In this exercise, we will create baseline evaluations for the binary classification task from the previous exercise. The baseline() function will create 100 random sets of predicted probabilities and evaluate them against the dependent column, in_sacramento, in the validation set. It will also evaluate a set of all 0 predictions and a set of all 1 predictions. We will speed up the process by running the evaluations in parallel, similar to what we did in the previous exercise:

  1. Set the random seed to 1:

    # Set seed for reproducibility and easy comparison

    set.seed(1)

  2. Register four cores for running the evaluations in parallel. If you have a different number of cores available, feel free to adjust the number:

    # Attach doParallel and

    # register four CPU cores

    library(doParallel)

    registerDoParallel(4)

  3. Use baseline() to evaluate 100 random sets of probabilities against the in_sacramento variable in the validation set. The output is a list with two data frames. The first data frame contains the summarized results, and the second data frame contains all the evaluations of the random sets of probabilities. We will only look at the summarized results:

    # Create baseline evaluations of the

    # in_sacramento variable in the validation set

    binomial_baselines <- baseline(

        test_data = valid_set,

        dependent_col = "in_sacramento",

        n = 100,

        family = "binomial",

        parallel = TRUE)

  4. Print the summarized results:

    # Show the summarized results

    binomial_baselines$summarized_metrics

    The output is as follows:

    ## # A tibble: 10 x 15

    ##    Measure 'Balanced Accur…      F1 Sensitivity Specificity

    ##    <chr>              <dbl>   <dbl>       <dbl>       <dbl>

    ##  1 Mean              0.500   0.483       0.498       0.503

    ##  2 Median            0.499   0.488       0.5         0.505

    ##  3 SD                0.0333  0.0389      0.0535      0.0521

    ##  4 IQR               0.0420  0.0549      0.0795      0.0631

    ##  5 Max               0.578   0.567       0.625       0.657

    ##  6 Min               0.405   0.393       0.398       0.343

    ##  7 NAs               0       0           0           0    

    ##  8 INFs              0       0           0           0    

    ##  9 All_0             0.5    NA           0           1    

    ## 10 All_1             0.5     0.640       1           0    

    ## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred

    ## #   Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,

    ## #   Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection

    ## #   Prevalence' <dbl>, Prevalence <dbl>

    The left-hand column, Measure, tells us which statistical descriptor of the random evaluations the row describes. At the top, we have the mean of the various metrics. If our model was simply guessing, it would, on average, achieve an F1 score of 0.483. The highest F1 score obtained by guessing was 0.567, and the lowest was 0.393. When always predicting 1 (that the house is located in Sacramento), the F1 score is 0.640, as can be seen in the row where Measure is All_1.

    Our best model from the previous exercise had an F1 score of 0.65. This is higher than even the best random guess, but only slightly higher than always predicting 1. So, we check the precision (here called Pos Pred Value) and recall (here called Sensitivity) to see whether the model is simply predicting 1 for every house. Neither resembles the always-predict-1 pattern (a recall of 1 and a precision equal to the prevalence of class 1), so that is not the case, and we conclude that our model is performing better than the baseline on the F1 score.
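    As a quick sanity check of the All_1 row, here is a back-of-the-envelope sketch, assuming a class-1 prevalence of roughly 0.47 in the validation set (the value reported later in the chapter):

    # F1 when always predicting class 1:
    # recall is 1 and precision equals the prevalence of class 1
    precision <- 0.47
    recall <- 1
    2 * precision * recall / (precision + recall)  # roughly 0.64, as in the All_1 row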

In the next exercise, we will create Gaussian baseline evaluations with cvms.

Exercise 57: Creating Gaussian Baseline Evaluations with cvms

In this exercise, we will create baseline evaluations for the linear regression task from the first part of the chapter. The goal of the task is to predict the price of a house. The baseline() function will fit the formula "price ~ 1", where 1 is the intercept, on 100 random subsets of the training set, and evaluate each on the validation set. It will also fit the intercept-only model on the entire training set and evaluate that on the validation set as well.
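Concretely, this intercept-only baseline amounts to the following minimal sketch (assuming the train_set object from the previous exercises):

# Fit the intercept-only model on the training set
intercept_model <- lm(price ~ 1, data = train_set)

# Its single coefficient is the mean training-set price,
# which it predicts for every observation
coef(intercept_model)
mean(train_set$price)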

The intercept-only model simply learns the average price in the training (sub)set and then predicts that value for everything:

  1. Set the random seed to 1:

    # Set seed for reproducibility and easy comparison

    set.seed(1)

  2. Register four cores for running the evaluations in parallel. If you have a different number of cores available, feel free to adjust the number:

    # Attach doParallel and

    # register four CPU cores

    library(doParallel)

    registerDoParallel(4)

  3. Use baseline() to fit the model formula "price ~ 1 " on 100 random subsets of the training set and evaluate them on the validation set. The output is a list with two data frames. The first data frame contains the summarized results, and the second data frame contains all the evaluations of the models fitted on the random subsets. We will only look at the summarized results:

    # Create baseline evaluations of the price variable

    # in the validation set

    gaussian_baselines <- baseline(

        test_data = valid_set,

        train_data = train_set,

        dependent_col = "price",

        n = 100,

        family = "gaussian",

        parallel = TRUE)

    # Show the summarized results

    gaussian_baselines$summarized_metrics

    The summarized results are as follows:

    Figure 5.25: Summarized metrics

When always predicting the average value, our intercept-only model achieves an average RMSE of 142,928 (Min = 140,801), which is a lot higher than the RMSE of 75,057.56 we achieved in Exercise 54, Log-Transforming Predictors. Hence, we conclude that our model is better than the baseline.

Regression and Classification with Decision Trees

We can solve regression and classification tasks with a multitude of machine learning algorithms. In the previous chapter, we used neural networks, and, in this chapter, we have used linear and logistic regression. In the next chapter, we will learn about decision trees and random forests, which can also be used for these tasks. While linear and logistic regression models are usually easier to interpret, random forests can sometimes be better at making predictions. In this section, we will apply random forests to our dataset and compare the results to our linear and logistic regression models.

As we have seen in Chapter 1, An Introduction to Machine Learning, a decision tree is basically a set of if/else statements arranged as an upside-down tree, where the leaf nodes contain the possible predictions. For a specific observation, we could end up with the following paths down a tree:

  • If a home is larger than 1,500 sqft and it has more than 4 bedrooms, it is predicted to be located in Sacramento.
  • If it is larger than 1,500 sqft and has 4 bedrooms or fewer, it is predicted to be located outside Sacramento.
  • If it is 1,500 sqft or smaller and it has more than 2 bedrooms, it is predicted to be located in Sacramento.
  • If it is 1,500 sqft or smaller and it has 2 bedrooms or fewer, it is predicted to be located outside Sacramento.

In the first node, the root node, we ask whether the home is larger than 1,500 sqft. If so, we go to the left child node, which asks whether a home has more than 4 bedrooms. If not, we go to the right child node, which asks whether the home has more than 2 bedrooms. Depending on the answer in the child node, we end up in a leaf node (a node without child nodes) that contains the value of our prediction.

A random forest is an ensemble (a collection) of different decision trees, where the final prediction is based on the predictions from all the trees. In the next exercise, we will be training two random forest models.

Exercise 58: Training Random Forest Models

In this exercise, we will first train a random forest regression model to predict price in the Sacramento dataset, and then a random forest classification model to predict whether a home is located in Sacramento. We will not be using the log-transformed predictors for this example:

  1. Attach the randomForest and caret packages:

    # Attach packages

    library(randomForest)

    library(caret)

  2. Set the random seed to 1:

    # Set seed for reproducibility and easy comparison.

    set.seed(1)

  3. Fit a random forest model for predicting the price of a home:

    rf_model_regression <- randomForest(

        formula = price ~ beds + baths + sqft + in_sacramento,

        data = train_set)

  4. Calculate the RMSE. First, we use the model to predict the price in the validation set. Note that we would usually start with a development set or use cross-validation until we had found the best hyperparameters for our model. Second, we calculate the RMSE with the RMSE() function from the caret package:

    # Predict prices in the validation set

    price_predictions <- predict(rf_model_regression, valid_set)

    # Calculate and print the RMSE

    RMSE(price_predictions, valid_set$price)

    ## [1] 86957.24

    While this RMSE is lower than the baseline, it is higher than the best linear regression models we trained earlier in the chapter.

  5. Fit a random forest model for predicting whether the home is located in Sacramento:

    rf_model_classification <- randomForest(

        formula = in_sacramento ~ beds + baths + sqft + price,

        data = train_set)

  6. Use the model to predict the in_sacramento variable in the validation set and create a confusion matrix with confusionMatrix():

    # Predict in_sacramento in the validation set

    in_sacramento_predictions <- predict(rf_model_classification,

                                         valid_set)

    # Create a confusion matrix and print it

    confusionMatrix(in_sacramento_predictions,

                    valid_set$in_sacramento,

                    positive = "1",

                    mode = "prec_recall")

    The confusion matrix and statistics are as follows:

    ## Confusion Matrix and Statistics

    ##

    ##           Reference

    ## Prediction  0  1

    ##          0 64 28

    ##          1 35 60

    ##                                          

    ##                Accuracy : 0.6631         

    ##                  95% CI : (0.5905, 0.7304)

    ##     No Information Rate : 0.5294         

    ##     P-Value [Acc > NIR] : 0.0001434      

    ##                                          

    ##                   Kappa : 0.3268         

    ##                                          

    ##  Mcnemar's Test P-Value : 0.4496918      

    ##                                          

    ##               Precision : 0.6316         

    ##                  Recall : 0.6818         

    ##                      F1 : 0.6557         

    ##              Prevalence : 0.4706         

    ##          Detection Rate : 0.3209         

    ##    Detection Prevalence : 0.5080         

    ##       Balanced Accuracy : 0.6641         

    ##                                          

    ##        'Positive' Class : 1              

Interestingly, the random forest classifier obtains results very similar to those of the best logistic regression model, which also had an F1 score of 0.65. The balanced accuracy is about 3 percentage points lower. As we can tell from the confusion matrix, it does not predict the same class all the time, so we can conclude that it is better than the baseline.

Model Selection by Multiple Disagreeing Metrics

What happens if the metrics do not agree on the ranking of our models? In the last chapter, on classification, we learned about the precision and recall metrics, which we "merged" into the F1 score, because it is easier to compare models on one metric than two. But what if we did not want to (or couldn't) merge two or more metrics into one (possibly arbitrary) metric?

Pareto Dominance

If a model is better than another model on at least one metric, and at least as good on all the other metrics, it should be considered better overall. We say that the model dominates the other model.

If we remove all the models that are dominated by other models, we will have the nondominated models left. This set of models is referred to as the Pareto set (or the Pareto front). We will see in a moment why Pareto front is a fitting name.
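To make the dominance rule concrete, here is a minimal sketch (not part of the rPref package used below) for two metrics where higher is better:

# a dominates b if a is at least as good on every metric
# and strictly better on at least one of them
dominates <- function(a, b) {
  all(a >= b) && any(a > b)
}

dominates(c(Precision = 0.8, Recall = 0.6),
          c(Precision = 0.7, Recall = 0.6))  # TRUE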

Let's say that our Pareto set consists of two models. One has high precision, but low recall. The other has low precision, but high recall. Imagine that we don't want to use the F1 metric or another combination of the two. Then, unless we have a preference for one of the metrics, we do not have a way to choose between the models and should instead consider including both models in our analysis or machine learning pipeline.

Exercise 59: Plotting the Pareto Front

In this exercise, we will simulate evaluation results for 20 classification models and visualize the Pareto front with the rPref package.

The Pareto front can be just as useful for linear regression models, where one model might have the lowest RMSE, while another has the lowest MAE. As precision and recall are slightly easier to simulate realistically, we will use these metrics for this exercise:

  1. Attach the required packages: rPref, ggplot2, and knitr:

    # Attach packages

    library(rPref)

    library(ggplot2)

    library(knitr)

  2. Set the random seed to 3:

    # Set seed for reproducibility and easier comparison

    set.seed(3)

  3. Create a data frame with 20 simulated model evaluations. The metrics are precision and recall. Use the runif() function to randomly sample 20 numbers between 0 and 1:

    # Create random set of model evaluations

    # runif() samples random numbers between 0 and 1

    evaluated_models <- data.frame("Model" = c(1:20),

                                   "Precision" = runif(20),

                                   "Recall" = runif(20))

  4. Order the data frame by the two metrics and print it using the kable() function:

    # Order the data frame by the two metrics

    evaluated_models <- evaluated_models[order(evaluated_models$Precision,

                                               evaluated_models$Recall,

                                          decreasing = TRUE),]

    # Inspect the simulated results

    kable(evaluated_models, digits = 2)

    The output is as follows:

    Figure 5.26: Model precision and recall

    By simply looking at the table, we can tell that models 19 and 8 are nondominated models, as they have the highest precision and recall scores, respectively. Instead of comparing the models manually, though, we will use the psel() function from the rPref package to find the Pareto front.

  5. Find the nondominated models using the psel() function. For the second argument, pref, we specify whether a metric is supposed to be as high or as low as possible:

    # Find the Pareto front / set

    front <- psel(evaluated_models,

                  pref = high("Precision") * high("Recall"))

    # Print the Pareto front

    kable(front, digits = 2)

    The output is as follows:

    Figure 5.27: Models in the Pareto front

    Figure 5.27 shows the four nondominated models. Besides models 8 and 19, we also have models 6 and 10, which both have a higher recall than model 19, but lower precision. While model 10 has higher precision than model 6, model 6 has higher recall than model 10.

  6. Create a ggplot2 object with precision on the x-axis and recall on the y-axis. Add the models as points using geom_point(). Increase the size of the nondominated points by adding larger points on top of the smaller points. Using the geom_step() function, create a line to visualize the Pareto front. Finally, add the light theme with theme_light():

    # Create ggplot object

    # with precision on the x-axis and recall on the y-axis

    ggplot(evaluated_models, aes(x = Precision, y = Recall)) +

      # Add the models as points

      geom_point(shape = 1) +

      # Add the nondominated models as larger points

      geom_point(data = front, size = 3) +

      # Add a line to visualize the Pareto front

      geom_step(data = front, direction = "vh") +

      # Add the light theme

      theme_light()

    The Pareto front plot should look as follows:

Figure 5.28: Pareto front with precision and recall

In Figure 5.28, the four large points are the nondominated models. That is, for each of these models, no other model is better on one metric and at least as good on the other. What we do with this information is up to us. We might want a model with very high precision, so that it rarely makes mistakes when predicting our positive class. Or we might want high recall, so that it captures most of the observations in the positive class, even if some of those predictions are wrong. Alternatively, the two models in the "middle" of the Pareto front might strike the balance we want. It all depends on our use case.
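
As noted earlier, the Pareto front is just as useful when lower metric values are better, as with RMSE and MAE for regression models. Here is a minimal sketch, using a small made-up data frame of regression evaluations, where low() from rPref replaces high():

    # Made-up regression evaluations (RMSE and MAE: lower is better)
    regression_evaluations <- data.frame("Model" = c(1:3),
                                         "RMSE" = c(12.1, 11.4, 11.9),
                                         "MAE" = c(9.8, 10.2, 9.5))

    # For metrics where lower is better, use low() instead of high()
    regression_front <- psel(regression_evaluations,
                             pref = low("RMSE") * low("MAE"))

    # Models 2 and 3 are nondominated; model 3 dominates model 1,
    # as it has both a lower RMSE and a lower MAE
    regression_front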

Activity 19: Classifying Room Types

In this activity, you will classify the type of room in Airbnb listings in Amsterdam using logistic regression. You will start by creating baseline evaluations for the task. Then, you will fit and interpret multiple logistic regression models. Next, you will generate model formulas and cross-validate them with cvms. First, you will run a single cross-validation, and then you will use repeated cross-validation on the best 10-20 models. Finally, you will choose the nondominated models and validate them with validate().

Note

The Amsterdam dataset has been taken from http://insideairbnb.com/get-the-data.html and modified to suit the activity. The modified dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/amsterdam.listings.csv.

The following steps will help you complete the activity.

  1. Attach the groupdata2, cvms, caret, randomForest, rPref, and doParallel packages.
  2. Set the random seed to 3.
  3. Load the amsterdam.listings dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/amsterdam.listings.csv.
  4. Convert the id and neighbourhood columns to factors.
  5. Summarize the dataset.
  6. Partition the dataset into a training set (80%) and a validation set (20%). Balance the partitions by room_type.
  7. Prepare for running the baseline evaluations and the cross-validations in parallel by registering the number of cores for doParallel.
  8. Create the baseline evaluation for the task on the validation set with the baseline() function from cvms. Run 100 evaluations in parallel. Specify the dependent column as room_type. Note that the default positive class is Private room.
  9. Fit multiple logistic regression models on the training set with the glm() function, predicting room_type. Try different predictors. View the summary() of each fitted model and try to interpret the estimated coefficients. How do the interpretations change when you add or subtract predictors? Note that interpreting the coefficients for logarithmic predictors in logistic regression is not an easy task, so don't worry if it doesn't make sense yet.
  10. Create model formulas with combine_predictors(). To save time, limit the interaction size to 2 by specifying max_interaction_size = 2, and limit the number of times an effect can be included in a formula to 1 by specifying max_effect_frequency = 1.
  11. Create five fold columns with five folds each in the training set using fold() with k = 5 and num_fold_cols = 5. Balance the folds by room_type. Feel free to choose a higher number of fold columns.
  12. Perform cross-validation (not repeated) on your model formulas with cvms. Specify fold_cols = ".folds_1". Order the results by F1 and show the best 10 models.
  13. Perform repeated cross-validation on the 10-20 best model formulas (by F1) with cvms.
  14. Find the Pareto front based on the F1 and balanced accuracy scores. Use psel() from the rPref package, and specify pref = high("F1") * high("`Balanced Accuracy`"). Note the ticks around `Balanced Accuracy`.
  15. Plot the Pareto front with the ggplot2 code from Exercise 59, Plotting the Pareto Front. Note that you may need to add ticks around `Balanced Accuracy` when specifying x or y in aes() in the ggplot call.
  16. Use validate() to train the nondominated models on the training set and evaluate them on the validation set.
  17. View the summaries of the nondominated model(s).

    The output should be similar to the following:

    ##

    ## Call:

    ## glm(formula = model_formula, family = binomial(link = link),

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.8323  -0.4836  -0.2724  -0.0919   3.9091

    ##

    ## Coefficients:

    ##                                  Estimate Std. Error z value Pr(>|z|)

    ## (Intercept)                    15.3685268  0.4511978  34.062  < 2e-16 ***

    ## availability_365               -0.0140209  0.0030623  -4.579 4.68e-06 ***

    ## log_price                      -3.4441520  0.0956189 -36.020  < 2e-16 ***

    ## log_minimum_nights             -0.7163252  0.0535452 -13.378  < 2e-16 ***

    ## log_number_of_reviews          -0.0823821  0.0282115  -2.920   0.0035 **

    ## log_reviews_per_month           0.0733808  0.0381629   1.923   0.0545 .

    ## availability_365:log_price      0.0042772  0.0006207   6.891 5.53e-12 ***

    ## log_n_o_reviews:log_r_p_month   0.3730603  0.0158122  23.593  < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  8476.7  on 13846  degrees of freedom

    ## AIC: 8492.7

    ##

    ## Number of Fisher Scoring iterations: 6

    Note

    The solution for this activity can be found on page 365.

Summary

In this chapter, we fitted and interpreted multiple linear and logistic regression models. We learned how to calculate the RMSE and MAE metrics and examined how differently they respond to outliers. We generated model formulas and cross-validated them with the cvms package. To check whether our models perform better than random guessing and better than always making the same prediction, we created baseline evaluations for both linear regression and binary classification tasks. When multiple metrics (such as the F1 score and balanced accuracy) disagree on the ranking of models, we learned to find the nondominated models, also known as the Pareto front. Finally, we trained two random forest models and compared them to the best-performing linear and logistic regression models.

In the next chapter, you will learn about unsupervised learning.
