Chapter 2: Example Step-by-Step Data Science Project

Overview

Business Opportunity

Initial Questions

What is the business opportunity?

Do we have the data to support this project?

What type of work has been done previously on this type of problem?

What does a solution look like?

Get the Data

Select a Performance Measure

Train / Test Split

Target Variable Analysis

Predictor Variable Analysis

Modeling Considerations

Numeric Variables

Character Variables

Dummy Variables

Adjusting the TEST Data Set

Building a Predictive Model

Baseline Models

Regression Models

Non-Parametric Models

Decision Time

Implementation

Chapter Review

Overview

Learning data science is a lot like learning to play a musical instrument. There is a lot of underlying technical theory, but in the end, the only way to learn how to play is to put your hands on the instrument and practice. This chapter will be like a musical recital (or a bar room jam band). It will be a demonstration of approaches and techniques. The goal is to show a real-life example of some of the concepts that we touched on in the introductory chapter and things that we will cover in greater detail in later chapters.

The goal of this chapter is to provide you with a complete data science project. It can be easy to get lost in all the necessary steps, so here is a general outline of the workflow:

● Identify the business opportunity

● Review previous work performed for this problem

● Find data

● Select a performance measure

● Create TRAIN / TEST data sets

● Analyze the target variable

● Analyze the predictor variables

● Develop a modeling data set

● Build predictive models

● Analyze results

● Implement the model

Many of the concepts that are introduced in this chapter will be developed further in the following chapters, which provide deep-dive explanations of specific concepts and techniques. If you are not familiar with a certain term or if a concept is briefly introduced as part of this workflow, please don’t worry. These concepts are fully explained in the following chapters.

This chapter will also contain a good amount of SAS code. If you are not familiar with SAS coding, don’t worry. The next two chapters will provide a thorough explanation of all the techniques presented in this chapter.

Business Opportunity

Imagine that you are a smart person with some coding skills and you want to make a ton of money with an online business while you work from home and devote very little time to the upkeep of your business. I know, it’s a stretch. Who can imagine such a person? *Note: sarcasm does not come across well on the written page.

Having an entrepreneurial spirit, you have decided to develop a product that could be used to support another much more popular product. This type of product could give you the benefit of an established target audience. This appeals to your lazy nature since you don’t have to build an audience from scratch. After a little brainstorming, you’ve come up with a few ideas:

Amazon

Business Need – Order totals that fall just shy of the $25 threshold for free shipping.

Solution – An online search tool that will return products that will bring your total up to the minimum shipping threshold.

Ticketmaster

Business Need – Concert tickets are expensive and people need to know how much ticket prices will be, so they can save up to see the Van Halen reunion tour.

Solution – An online tool that will predict the cost of a given concert at a given location on a given date.

McDonald’s

Business Need – Cold french fries are disgusting.

Solution – An online tool that will predict when fresh, hot fries will be available at any given McDonald’s.

These are all brilliant ideas; however, some of them are not feasible.

● Amazon – A quick search reveals that this product already exists. Strike one.

● Ticketmaster – You love Van Halen so much that you don’t want other people to buy the tickets and leave you at the mercy of scalpers. Strike two.

● McDonald’s – Data is necessary to build models and you cannot find a database of historical french fry availability times for all possible McDonald’s. Maybe there is such a database for Burger King, but you give up out of frustration. Strike three.

While you wait for your next brilliant idea to hit you, you decide to list your apartment on the most popular peer-to-peer property rental site, Airbnb, and generate some passive income by renting out your guest bedroom. Airbnb is an online service that lets property owners rent out their properties or rooms to guests.

You log on to Airbnb, set up your account, list the features of your apartment, and upload some photos. Great, you are ready to make some cash. However, the final question that you need to answer for your listing is, how much do you want to charge per night?

You don’t know what your apartment is worth per night. You live in New York, so your place has to be expensive, right? But you don’t live in an expensive, fashionable neighborhood. It’s April, so it is cold and not tourist season. Is there a game or a festival or a concert in your area that might increase demand and, therefore, the price per night that you could charge?

You look at the available listings of the surrounding apartments, but they are not just like yours. Some are bigger or smaller, in different areas, and offer different features. You are about ready to give up out of frustration.

Suddenly, inspiration hits. This is it! This is the business opportunity that you’ve been looking for. You could develop an interactive web tool where people could input the features of their home, and it would give them the optimal Airbnb per diem price for a given location, date, and specific home features. Brilliant!

Initial Questions

Now that you have your brilliant idea, you need to answer these four initial questions:

● What is the business opportunity?

● Do we have the data to support this project?

● What type of work has been done previously on this type of problem?

● What does a solution look like?

What is the business opportunity?

A quick web search shows that there are approximately 4 million homes listed on Airbnb in 194 countries. Although this is very encouraging news, you want to focus only on the United States. Further research shows that there are nearly 600,000 listings in the United States. That’s great news; however, you are a cautious researcher, and you discover that there is an average of three listings per host on Airbnb. So, your total market population is about 200,000 property owners.

You have the choice of different pricing options for your product:

● Charging a one-time fee for an estimate

● Charging a monthly membership fee for unlimited estimates

● Forgoing the fee in favor of ad revenue

I’m sure that you can think of plenty of other ways to monetize your product, but we need to focus on developing the product.

Do we have the data to support this project?

After some web hunting, you hit the jackpot! You have discovered a project titled Inside Airbnb, where raw Airbnb data is made publicly available for several cities in the United States and other cities around the world (www.insideairbnb.com). Murray Cox created this project to shed light on the ongoing debate concerning the impact of Airbnb on housing dynamics. Due to Airbnb being so tight-lipped about their data, it looks like this data was scraped from the web. Thanks, Murray!

Here is your first decision point. How much of this problem do you want to take on? You are fairly confident that each city will have its own housing dynamics that are unique to that area. You decide that you want to focus your data analysis on New York City. This is where you live, and you are familiar with the dynamics of the city. It is a reasonable approach to start with an area that you are familiar with and expand once you have proven that you are able to model that environment successfully.

We will look at the data in detail in a little while, but our initial look shows that the data is organized into two categories: Listings and Calendar. The listings data set contains descriptions of the physical, geographical, and host-related attributes of each property listed in New York City as of December 6, 2018. The calendar data set contains records of the nightly prices for the following 12 months for the majority of the listings. Both of these data sets encode relevant information that could potentially be determining factors in the pricing of these properties.

Great, so we have a business plan and data to support our project. Now let’s see what kind of similar work has been done in this area. We don’t want to reinvent the wheel.

What type of work has been done previously on this type of problem?

Since the inception of the “shared economy,” there have been dozens of articles and scholarly papers written about Airbnb along with other peer-to-peer homestay networking sites. Although you have reviewed several of these papers, two of these studies appear to be the most relevant for your project.

Study #1

The online article titled “Predicting Airbnb Listing Prices with Scikit-Learn and Apache Spark” by Nick Amato was published on April 20, 2016 on the MapR blog site. This article details the construction of a predictive model using the Python Scikit-Learn package in combination with Apache Spark for performance enhancement. The author uses the GradientBoostingRegressor predictive model in combination with the GridSearchCV cross-validation technique on the Apache Spark system to achieve an error of $21.43 in the predicted rental price per night.

Takeaway

This article provides an excellent step-by-step look at developing several predictive models and assessing the accuracy of those models. The author also demonstrated how the large modeling data set produced processing issues. He had to rely on Apache Spark to increase processing efficiency and to allow the full range of exhaustive search methods to be employed on model development. So, let’s remember that the combination of the listings and calendar data sets produces a very large data set that requires significant processing power to manipulate.

Study #2

In “Neighborhood and Price Prediction for San Francisco Airbnb Listings,” Emily Tang and Kunal Sangani explore data from the Inside Airbnb project, containing a complete set of 7,029 listings of properties for rent on Airbnb in San Francisco as of November 2, 2015. 

Rather than attempting to predict price as a continuous variable, a binary response variable for the price is created that indicates whether the predicted price is above or below the median price in the data set. This method has the advantage of simplifying both the price prediction task and the evaluation of the goodness of the model. However, the disadvantage of this approach is that simple linear regression is no longer a suitable option to model the price response variable and that the results of the prediction are of limited usefulness, especially for the application of suggesting pricing to hosts.

Additionally, in the data cleaning phase, all observations in neighborhoods containing 70 or fewer listings are removed. While this reduces the burden on the neighborhood classifier, it also restricts the model’s usefulness to neighborhoods with a relatively high number of listings.

Takeaway

The final models for both price and neighborhood have impressive predictive power. In the test set, prices are categorized to the correct group (either above or below the median) with approximately 81% accuracy, while neighborhoods are categorized with 42% accuracy. While this performance seems good, the authors note that discretization of the predicted price into smaller bins is likely necessary for most applications, and is a promising direction for future work.

What does a solution look like?

We have done our homework and reviewed different approaches. We have gotten some insights into the data from previous researchers, and we have learned some best practices from these researchers. There seem to be benefits to both a continuous target variable and a binary target variable approach. The decision of our modeling approach will be driven by what we want the solution to look like.

The next decision point is establishing the problem that we are trying to solve. Do we want to construct this as a regression or a classification problem? We quickly realize that we are facing a continuous target variable problem. Airbnb requires customers to list an exact dollar amount that they will charge per night for their home. This is not a binary yes or no problem. We need to be able to generate a continuous dollar value.

Now that we have established what our target variable type will be, we have a couple of options on modeling types:

1. Transparent model – Regression model.

a. Pros – The benefit of this type of model is that it will produce a dollar value for each feature of a home (demonstrated by coefficient weights). Customers will be able to know exactly what each feature contributes to the total cost per night value of their home.

b. Cons – These types of models are traditionally not as accurate as more complex modeling types. This can result in property owners charging too little or too much for their nightly rate.

2. Black box model – This will be a more complex type of model (random forest, neural network, gradient boosting decision tree, and so on.)

a. Pros – The benefit of this type of model is that it generally produces much more accurate predictions. This could provide the homeowner with the best possible listing price for their property.

b. Cons – Although we will be able to determine the most important features that go into determining the nightly rate, we will not be able to provide a “cost per feature” in the same manner that we can with a regression model.

We have established the business model, identified a data set, learned best practices from other researchers, and envisioned our implementation. Not a bad start. But before we start fantasizing about sipping piña coladas on our Playa Del Carmen vacation paid for by this amazing business, we have more work to do.

Get the Data

We have outlined an opportunity; however, we also have a lot of decisions to make in order to bring this project to life. These decisions are not made in the dark. A general rule of data science is to let the data guide you. So, let’s get some data.

There are two ways to get the data.

1. You could go to the source Inside Airbnb data set and pull the data directly from their site. It is an excellent site that provides a lot of background information, insights, and many data attributes. You should definitely go to this site and check it out (www.insideairbnb.com).

2. I have downloaded the New York City Inside Airbnb data sets: Listings and Calendar, and I’ve placed them in our dedicated GitHub repository. These are the two main data sets that we will use for our analysis.

For the sake of consistency, I suggest that you use the data contained in our GitHub repository. The Inside Airbnb site updates their data often, so there is a good chance that the data that is currently on their site will be different from the data that we will be analyzing as part of this project. If you want to recreate this analysis to get a feel of the flow of data science (highly recommended), then use the data in our GitHub.

Web data that relies on user input is notoriously messy. People use inconsistent naming conventions (NY vs. New York vs. NYC), and they are generally terrible at spelling. I have taken the liberty of cleaning up the data by standardizing the naming conventions used in specific fields, creating logical and consistent formats for each variable, and eliminating all of the descriptive text fields. These descriptive text fields are generally very big fields, and they eat up a lot of memory and processing power.

We can always use the descriptive text fields later if we want to expand our analysis to include Natural Language Processing techniques to create variables developed from the text. But for now, let’s move forward with the numeric and categorical fields in the cleaned data sets.

Download the data from GitHub and place it in your local data folder. Once you have the data, you will need to import the data into SAS. The data import code is below in Program 2.1. Please don’t make fun of my terrible file naming convention.

Program 2.1: Import Data

FILENAME REFFILE 'C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data\listings_clean.csv';
PROC IMPORT DATAFILE=REFFILE
       DBMS=CSV
       OUT=MYDATA.Listings;
       GETNAMES=YES;
RUN;
PROC CONTENTS DATA=MYDATA.Listings; RUN;

For those of you who are new to SAS programming, this code might look a bit confusing. Don’t worry; I will give a breakdown of each part of the code in the next chapter that focuses on learning to code in SAS. Remember, the purpose of this chapter is to give you a look at the steps and the decision path of an example data science project.

Output 2.1: Data Contents Overview

The CONTENTS Procedure

Data Set Name     WORK.LISTINGS     Observations     49056
Member Type       DATA              Variables        54

The first two rows of the PROC CONTENTS output show that there are 49,056 observations with 54 variables in the Listings data set. The PROC CONTENTS procedure also produces a list of variables and attributes. Since there are 54 variables in this data set, I will not list all of them here.

Remember the old real estate saying, “Location, location, location.” Your model will be highly sensitive to the location of the property. When your data science problem is location specific, it is a good idea to generate a map graphic of your data.

Program 2.2: Create Geographic Map

ODS GRAPHICS / RESET WIDTH=6.4in HEIGHT=4.8in;
PROC SMAP plotdata=WORK.Listings;
       openstreetmap;
       scatter x=longitude y=latitude / 
markerattrs=(size=3 symbol=circle);
RUN;
ODS GRAPHICS / RESET;

SAS does an excellent job of creating picture-perfect maps with a few lines of code. The maps are highly customizable, with the ability to create choropleth maps and bubble maps along with many other options.

For additional information, you can group the data points by a categorical feature. The code below in Program 2.3 shows the Airbnb listings color-coded by the Neighborhood Group feature.

Figure 2.1: Output of Program 2.2

Program 2.3: Create a Color-Coded Map

 /*Neighborhood Group Map View*/
ODS GRAPHICS / RESET WIDTH=6.4in HEIGHT=4.8in;
PROC SMAP plotdata=WORK.Listings;
       openstreetmap;
       scatter x=longitude y=latitude /
       group=neighbourhood_group_cleansed
       name="scatterPlot" markerattrs=(size=3 symbol=circle);
       keylegend "scatterPlot" / 
       title='neighbourhood_group_cleansed';
RUN;
ODS GRAPHICS / RESET;

The map view that is grouped by neighborhood in Figure 2.2 provides an easy visual snapshot of the data. We can see that Manhattan and Brooklyn appear to have a majority of the listings while The Bronx and Staten Island have relatively few listings.

Figure 2.2: Map View by Neighborhood

Select a Performance Measure

To be a successful data scientist, you need to be brutally honest with yourself. It is easy to cheat a little bit here and there by:

● Cherry-picking test cases

● Analyzing the TEST data set to learn information that will be included in the model

● Selecting a more favorable performance measure

A performance measure is a metric that is used to determine how well a process is achieving a defined outcome. In the context of predictive modeling, performance measures provide an objective method of assessing model accuracy. Before you start developing your model, you need to first decide on a performance measure that fits your target variable and that is aligned with the goal of your model. One of the most common performance measures for a continuous target variable is the Root Mean Squared Error (RMSE). This metric is the square root of the average squared distance between the predicted and the actual values.

One important fact to keep in mind with this measure is that it severely penalizes large misses. This requires extra diligence on the upper and lower end predictions where large errors are common.

The RMSE is the performance measure that we will use to assess the performance of our model. The target value is a continuous variable (price), and our goal is to minimize the error distance between our predicted value and the actual value of price.
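
For reference, once we have a scored data set with actual and predicted prices, the RMSE can be computed in just a few lines. The sketch below is illustrative only; the data set SCORED and the variables Price (actual) and P_Price (predicted) are placeholder names for output that we will create later in the project.

/* Minimal RMSE sketch: SCORED, Price, and P_Price are placeholder names */
PROC SQL;
       SELECT SQRT(MEAN((Price - P_Price)**2)) AS RMSE
       FROM SCORED;
QUIT;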

Train / Test Split

Before we start investigating the data and making any adjustments, we first need to split the data into TRAIN and TEST data sets. Remember that we are attempting to simulate a real-world environment where we do not know anything about the data that the final model will be applied to. It would not be realistic to make all of our adjustments and learn everything that we can from the full data set and then split it into TRAIN and TEST data sets. That is a common mistake that data scientists make, and they pay for it later when their implemented model does not perform as expected.

We create the TRAIN and TEST data sets by using the PROC SURVEYSELECT procedure. This is a great tool to use that allows you to set your sampling rate (SAMPRATE). I want an 80/20 split between my TRAIN and TEST data sets, with 80% of the data going to the TRAIN data set. Although there is no hard-and-fast rule as to what the sampling percentage should be, the 80/20 split is a standard amount.

The PROC SURVEYSELECT procedure also gives you the option of specifying which sampling strategy you would like to apply. For this project, I want to apply a simple random sample without replacement. This sampling method is accomplished with the METHOD = SRS statement.

Finally, I want to establish a seed. This specifies the initial seed for random number generation. In the model development stage of the project, we want to make sure that we create the same data set every time we run our code. The establishment of a seed is a great way to ensure that you can replicate your results.

You can use any number for a seed value. I use 42 because Douglas Adams showed us that 42 is the “Answer to the Ultimate Question of Life, the Universe, and Everything” in his “Hitchhiker’s Guide to the Galaxy” series.

Program 2.4: Develop an 80/20 Split Indicator

/*Split data into TRAIN and TEST datasets at an 80/20 split*/
PROC SURVEYSELECT DATA=MYDATA.Listings SAMPRATE=0.20 SEED=42
       OUT=Full OUTALL METHOD=SRS;
RUN;

The PROC SURVEYSELECT procedure simply copies the Listings data to the output data set (WORK.Full) and adds a Selected field. The Selected field is a binary variable that is coded as a 1 or 0. The 20% of observations that are randomly selected will have an indicator of 1, while the remaining 80% will have an indicator of 0.

The output of the PROC SURVEYSELECT procedure provides you with a summary statement of the chosen parameters. The Sample Size column shows the number of observations that were randomly selected by the algorithm. A quick confirmation shows that 9812/49056 = 20%.

Output 2.2: PROC SURVEYSELECT Summary Output
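
If you want to verify the split yourself, a quick frequency count of the Selected flag on the output data set will show the 80/20 breakdown. This is just an optional check, not part of the main workflow.

/* Optional check of the 80/20 split on the Selected flag */
PROC FREQ DATA=Full;
       TABLES Selected;
RUN;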

We can create our TRAIN and TEST data sets by simply outputting records to each data set according to the value of the Selected field. Once the data is appropriately distributed to the data sets, we can drop the Selected variable.

Program 2.5: Create the TRAIN and TEST Data Sets

DATA TRAIN TEST;
    SET Full;
       IF Selected=0 THEN OUTPUT TRAIN; ELSE OUTPUT TEST;
       DROP Selected;
RUN;

One of the nice features of SAS is that the log provides you with a lot of information. Once Program 2.5 is submitted, you can check the log to confirm that the TRAIN and TEST data sets are the expected sizes and that there are no errors in the processing.

NOTE: There were 49056 observations read from the data set WORK.FULL.
NOTE: The data set WORK.TRAIN has 39244 observations and 54 variables.
NOTE: The data set WORK.TEST has 9812 observations and 54 variables.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.07 seconds

Great! Now that our data is appropriately split, we can set aside the TEST data set and not look at it until we have developed our model and are ready to test it on our hold-out sample.

Target Variable Analysis

The next step in the development of our project is to identify and analyze the target variable.

We will take the time to review the target variable analysis in detail. It is critically important that you thoroughly explore your target variable and understand the content of this variable. We will make critical decisions that will affect the construction of the modeling data set and our end results.

The target variable requires a different approach than the predictor variables due to two main factors:

1. Missing and nonsensical values – It is inadvisable to infer these types of values for the target variable. It is common practice to infer missing values for predictor variables; however, you should not do it for your target variable.

2. Predictors can be collapsed or dropped – We will have lots of different choices to make with predictor variables. These include inferring values, regrouping, or dropping these variables. However, we have only one target variable, so we cannot drop it or change the fundamental nature of the variable by inferring values. We can transform the variable in different ways, but these methods cannot change the fundamental nature of the variable.

Let’s start by focusing on the cross-sectional data in the TRAIN data set. We are attempting to predict the per diem rate of Airbnb listings in New York City as of 12/06/2018. The data set contains a few different price variables for us to choose from:

● Price

● Weekly price

● Monthly price

● Security deposit

● Cleaning fee

● Extra guest fee

The Price variable looks like the right one to use for our model. The other variables relating to cost are important pieces of information that we might use later, but for now, we will investigate our selected target variable.

The PROC UNIVARIATE procedure is a great way to investigate a continuous variable. A few simple lines of code will result in a lot of information about the distribution and range of a continuous variable.

Program 2.6: Investigate the Target Variable

PROC UNIVARIATE DATA=TRAIN; 
       VAR Price; 
       HISTOGRAM; 
RUN;

This code calls the PROC UNIVARIATE procedure on the TRAIN data set. We have specified that we want to look at the Price variable. We have also requested that the output produce a histogram of the Price variable.

The output of the PROC UNIVARIATE procedure contains a lot of information, so let’s look at it one section at a time.

The first section of Output 2.3 shows that the data contained 39,242 observations and that the average value for price is $152.15. However, you immediately see an issue with this variable. The standard deviation is larger than the mean value. This difference indicates that there are some big outliers that are pulling the average price up. We also see that the skewness is very high. The skewness measures the symmetry of the data. As a general rule of thumb, you want the skewness of the data to be between -1 and 1; values in this range indicate an approximately symmetric distribution. Since the skewness of our target variable is about 20 times greater than we would want, we will probably have to transform the variable to create a roughly normal distribution.

Output 2.3: PROC UNIVARIATE Moments Table

The next section of the PROC UNIVARIATE output shows the basic statistical measures. This section contains the average value of the Price variable (mean, median, and mode) along with measures of variability (std dev, variance, range, and interquartile range).

Output 2.4: PROC UNIVARIATE Basic Statistical Measures Table

There is a significant difference between the mean and median values. This difference is further evidence that outlier values are affecting the mean value. The range metric shows us the difference between the highest and lowest value. There is a $10,000 per diem range for price. Wow, that high end must be an awesome place to live.

Output 2.5: PROC UNIVARIATE Quartiles, Extreme Observations, and Missing Values Tables

The final section of the PROC UNIVARIATE output (Output 2.5) contains more information about the details of the variable distribution, along with outliers and missing values.

The first section of Output 2.5 shows the quantiles of the variable. This contains the minimum and maximum values as well as values at certain percentage cutoff points. The maximum value for the Price variable is $10,000 while the minimum is $0.

At this point, we need to ask ourselves a question. Does it make any sense that someone would charge $0 per night to stay at their house? Maybe they are very lonely and just want some company. It is probably just the result of bad data entry.

We also see that in the missing values section of the output, there are two observations with missing price data. This is actually good news. Usually, user-entered data is messier than this. These statistics show that less than 1% of the data would need to be either adjusted or deleted. We will make those decisions in the near future.

On the high end of the distributions, we see that the 99% cutoff value is $750. There is a massive difference between the 99% and 100% values. That top 1% of observations are extreme outlier data points.

The final piece of output that we can examine is the histogram of the Price variable. The distribution chart gives a visual demonstration of the statistics that we just reviewed.

The histogram (Output 2.6) shows a highly skewed distribution with a very long right-sided tail. The story that this chart tells us matches perfectly with the skewness value of 20.25 that we see in the statistics table above. All of this information tells us that the Price variable has significant outliers on the high end that we need to fix before we can do any modeling.

Output 2.6: PROC UNIVARIATE Histogram

Here is one of the main decision points that we need to address that will affect the rest of our project. We must determine which observations we will keep to establish our modeling data set. Extreme outliers are a big problem in model development because they pull the predicted value in a manner that is not representative of the majority of the data.

The old example of the effect of outliers is if you have 100 people in a room, their average net worth might be $50,000 per person. However, if Bill Gates walks into the room, suddenly the average net worth is now $100,000,000 per person. Bill has skewed the average net worth due to his extreme outlier position of net worth. The only way to establish an accurate average net worth is to kick Bill out of the room.

The moral of the story is that we need to throw out some observations for the model to be uninfluenced by extreme outliers. As a general rule of thumb, we can exclude the top 1% and the bottom 1% of target values if they are determined to be extreme outliers or nonsensical data.

An important note is that we are excluding only these observations because of the outlier nature of the target variable. When we see outliers in the predictors, we will have the option to set max and min thresholds and recode those values.

So, we have investigated the target variable and determined that missing values and nonsensical low-end entries (including the $0 listings) do not make sense; we will use a lower cutoff of $30. We have also investigated the high-end values, and we have determined that an upper cutoff of $750 would eliminate extreme outliers. Let’s code it up…

Program 2.7: Limit the Data and Create the Log Adjusted Target Value

/* Eliminate outliers and create log transformed price variable */
DATA Price;
       SET TRAIN;
       WHERE 30 le Price le 750;
       Price_Log = LOG(Price);
RUN;

The log shows us that the adjusted data set contains 38,503 observations. The original TRAIN data set contained 39,244; therefore, we have lost only 741 observations. The majority of these excluded observations were for properties that have a $0 per diem rate.

NOTE: There were 38503 observations read from the data set WORK.TRAIN.
      WHERE (Price>=30 and Price<=750);
NOTE: The data set WORK.PRICE has 38503 observations and 55 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.03 seconds

Although our target variable is much cleaner now that we have established our upper and lower boundaries, we expect that the distribution of the data will still be skewed to the right. A PROC UNIVARIATE procedure developed on the Price variable shows a right tail skewness with a value of 2.13. (See Output 2.7.)

Output 2.7: PROC UNIVARIATE Moments Table and Histogram

In anticipation of this skewness, I created a log-transformed version of the Price variable in the previous DATA step (Program 2.7). This approach will transform the Price variable into a normally distributed variable. A PROC UNIVARIATE developed on the log-transformed Price variable shows a skewness value of 0.26.

Output 2.8: PROC UNIVARIATE Moments Table and Histogram on the Log-Adjusted Target Variable

Great! We now have a data set where we have eliminated extreme outliers and nonsensical values. We have also transformed our target variable into a normally distributed variable. Remember that we can always transform the log value of a variable back to its original scale by taking the exponential of the log value. But for now, we are ready to move on to investigating the predictor values.
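
For reference, that back-transformation is a one-line calculation. In the sketch below, the data set Scored and the predicted variable P_Price_Log are hypothetical names for scored output that we will create later.

/* Sketch: convert a log-scale prediction back to dollars (names are hypothetical) */
DATA Scored;
       SET Scored;
       P_Price = EXP(P_Price_Log);
RUN;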

Predictor Variable Analysis

Before we begin to analyze the predictor variables, it’s a good idea to take a moment to think about what the goal of the project is and how we would envision a completed project. We want to develop a tool that property owners could use to give them the optimal per diem price for their property on Airbnb. The predictor variables will be the values that property owners enter to describe the features of their property. You have the option of including additional features based on regional factors (tourist attractions, crime rate, events, and so on), but for now, we will focus on the data that we have already gathered.

Under this scenario, we could make some assumptions:

● We can expect that the traditional home features will be significant in the model:

◦ Location

◦ Property type

◦ Number of people the property will accommodate

◦ Number of beds, baths, rooms, and so on

● Some data will not be available at the point of application:

◦ User review scores

◦ Review volume

This is a good example of the difference between a data scientist and a machine learning engineer. The machine learning engineer will generally use all possible data attributes and create the most predictive model possible by focusing on minimizing the performance metric on the TEST data set.

The data scientist has to use business knowledge to throw out data. This might sound crazy to you. We have all been trained that more data is better. However, to actualize our vision of the final project, we need to limit the data to attributes that will be available at the point of application. Although the data set contains lots of features about current properties on Airbnb, we will not be able to use all of these features.

Modeling Considerations

Another important topic that we need to consider as we investigate our predictor variables is the issue of model design. We haven’t determined what type of model we will eventually put into production. However, we will need to take different approaches to our predictor variables based on what type of model that we will implement. Here are our choices:

Parametric Model – This would be some form of a regression model where we know the functional form of the model. This type of model will require us to do a thorough evaluation of the predictor variables and make adjustments based on:

Data quality – missing data and outlier values will need to be adjusted.

Correlation – predictor variables that are correlated will need to be identified and adjusted.

Categorization – character variables with many levels will need to be collapsed into smaller categories.

Scaling – regression models do not perform well when the numeric values are on different scales. For example, the home price could range from $15K to $500K, while the same model will have “number of bathrooms” that ranges from 0 to 5.

Non-parametric Model – These types of models do not assume a functional form and can, therefore, adapt to a much greater range of functional forms. These models are not (generally) sensitive to the issues that are problematic for regression models. For example, it is often advantageous to feed the raw data into a tree-type model and allow the algorithm to determine categorization and split points.

We will start the analysis of the predictor variables with a regression model in mind. We can always use the raw variables later to build a non-parametric model.

Numeric Variables

The first step to understanding the predictor variables is to separate the numeric and categorical variables. We will need to perform different types of analysis depending on the construct of the variables.

A PROC MEANS procedure can give us a quick overview of the numeric variables. This procedure will not work for categorical variables, so we keep only the numeric variables with the KEEP= data set option.

Program 2.8: Investigate Numeric Data with PROC MEANS

PROC MEANS DATA=Price (KEEP = _NUMERIC_) N NMISS MIN MAX MEAN MEDIAN STD; RUN;

Output 2.9: PROC MEANS Output

Some of these variables we can safely ignore. The ID and Host_ID variables are unique identifiers, so as long as the number missing (NMISS) value is 0, we do not need to investigate them any further. We can also ignore the Latitude and Longitude variables because they are useful only for mapping, and we can ignore the Price and Price_Log variables since we analyzed them extensively in the previous section.

Exclude Certain Predictors

Several numeric variables will not be available at the point of application. We will assume that users of our tool will be listing new properties and not ones that already contain user review scores.

Therefore, we will exclude the following variables:

● Number of reviews

● Review scores rating

● Review scores accuracy

● Review scores cleanliness

● Review scores check-in

● Review scores communication

● Review scores location

● Review scores value

● Reviews per month

● First review

● Last review

● Host since

We will also exclude the calculated_host_listing_count because this is a system-generated variable that will not be available at the point of application. It appears that the host_listings_count variable is a reasonable substitute for this variable, but we will need to work with it because the maximum value (2,310) and the standard deviation (92) appear to be very high.

Numeric Missing Values

A common decision that data scientists have to make is how to handle missing values. Several of the variables have missing values. This issue is especially problematic for regression models. Regression models will exclude any observation with a single missing value. This exclusion can result in a highly biased model. Therefore, we need to make some decisions on what to do with these variables.

A quick look at each of the variables in Table 2.1 that have some missing values shows that the percentage of missing values ranges from nearly all of the values (square_feet = 99.1%) to almost none of the values (host_listings_count = 0.0%).

Table 2.1: PROC MEANS Missing Data Analysis

We will take a missing value imputation approach based on the nature of the variable and the volume of missing values, as described below and sketched in code at the end of this list.

Square_feet – Since nearly all of the observations are missing, this variable adds very little information to the data set. It is an absurd approach to try to impute nearly all of the values. We will exclude this variable from the model.

Security_deposit and cleaning_fee – The logical reason that these values are missing is that these properties do not require a security deposit or cleaning fees as a condition of their rental agreement. We will set these missing values to zero.

Zipcode – It doesn’t make sense to impute ZIP codes with a mean value. Instead, we can identify the correct correlation between the neighborhood and ZIP code and apply it to neighborhoods with missing ZIP code values.

Bedrooms, bathrooms, beds, host listing count – There are very few observations where there are missing values. We can impute their value with the global mean.
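
Here is one way that these imputation rules could be coded. This is only a sketch: the input data set name (Price) comes from Program 2.7, the output name Price_Imp is illustrative, and the ZIP code lookup by neighborhood is omitted for brevity. PROC STDIZE with the REPONLY option replaces only the missing values with each variable’s mean without standardizing the data.

/* Sketch of the imputation rules above (output data set name is illustrative) */
PROC STDIZE DATA=Price OUT=Price_Imp METHOD=MEAN REPONLY;
       VAR bedrooms bathrooms beds host_listings_count;
RUN;

DATA Price_Imp;
       SET Price_Imp;
       IF security_deposit = . THEN security_deposit = 0; /* no deposit required */
       IF cleaning_fee = . THEN cleaning_fee = 0;          /* no cleaning fee charged */
       DROP square_feet;                                   /* nearly all values missing */
RUN;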

Numeric Adjustments

Some of the most important decisions that data scientists make concerning their projects are whether or not to adjust the predictor variables. These adjustments can include imputing values, setting thresholds, scaling, and excluding variables. This section reviews some of those decision points for our numeric variables, and a short coding sketch of the resulting adjustments appears at the end of this section.

Host_listings_count – This variable is highly skewed, with some property owners listed as having over 2,300 properties available on Airbnb. This skewness could be a data entry issue, or it could be actual data; we don’t really know. Either way, we will need to adjust this variable to deal with the skewness of its distribution.

Further analysis of this variable shows that there are differences between property owners who have a single property, those who have between two and ten properties, and major property owners who have more than ten properties. I have created a categorical variable for these levels and discarded the original numeric variable.

Table 2.2: Frequency Distribution of Newly Created Variable

Additional fees – It does not make intuitive sense that the fees charged for a security deposit, cleaning, and extra people would affect the base per diem rate for the property. You could make the argument that these fees suggest a level of quality of the property, but they are entirely subjective and not an inherent feature of the property. These fees could be modeled separately, and you could generate suggested fee amounts for your customers. For now, we will choose to exclude them from modeling.

Zipcode – This is a tricky feature to deal with. ZIP code is obviously a strong predictive factor because our background business knowledge tells us that location is a primary driver of the per diem rate. However, for our data set, there are 187 different ZIP codes. For our regression model approach, this severely inflates the degrees of freedom in the model, and we are confronted with the curse of dimensionality. (See the previous chapter for an overview of this topic).

Other categorical features provide similar information. These include neighbourhood_cleansed, neighbourhood_group, and city. The inclusion of ZIP code could lead to overfitting. For our regression model approach, we need to make a decision about which location variable to use. I would suggest that the location variables with over 20 levels result in a sparse modeling space and that we would benefit by using the neighbourhood_group_cleansed variable with five levels. The other location variables will be retained for other modeling approaches.

Maximum_nights and minimum_nights – These are highly skewed data attributes with glaring data entry errors. These variables were capped at the 99% value. For maximum nights we set an upper threshold of 1125, and for the minimum nights, we set an upper threshold of 31.

Beds, bedrooms, and bathrooms – Each of these variables have outliers associated with them. For example, the beds variable has a maximum value of 21 while the 99th percentile is just five beds. We will need to cap these variables at their 99th percentile to avoid the severe outliers affecting the model. Therefore, the beds variable is capped at 5, the bedrooms variable is capped at 4, and the bathrooms variable is capped at 3.
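
Pulling the adjustments described above together, a capping and recoding step might look something like the following sketch. The thresholds come from the text; the input data set Price_Imp, the output data set Clean (the name that the later programs reference), and the host_cat variable name are illustrative.

/* Sketch: apply the 99th percentile caps and create the host listing category */
DATA Clean;
       SET Price_Imp;
       IF maximum_nights > 1125 THEN maximum_nights = 1125;
       IF minimum_nights > 31 THEN minimum_nights = 31;
       IF beds > 5 THEN beds = 5;
       IF bedrooms > 4 THEN bedrooms = 4;
       IF bathrooms > 3 THEN bathrooms = 3;
       /* Collapse host_listings_count into three levels */
       LENGTH host_cat $12;
       IF host_listings_count <= 1 THEN host_cat = 'Single';
       ELSE IF host_listings_count <= 10 THEN host_cat = 'Two to ten';
       ELSE host_cat = 'Major owner';
RUN;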

Collinearity Analysis

One of the foundational assumptions of regression models is that the predictor variables need to be independent (not correlated with one another). This assumption is important because if two or more variables are closely related, it is difficult to separate the individual effects of those variables on the response variable. In the end, collinearity reduces the accuracy of the estimates of the regression coefficients because it inflates the standard error of those coefficients.

The table below provides a good rule of thumb for interpreting correlation coefficients. Any correlation coefficient above 0.7 or below -0.7 is considered highly correlated.

Table 2.3: Correlation Table

Size of Correlation            Interpretation
0.9 to 1.0 (-0.9 to -1.0)      Very high correlation
0.7 to 0.9 (-0.7 to -0.9)      High correlation
0.5 to 0.7 (-0.5 to -0.7)      Moderate correlation
0.3 to 0.5 (-0.3 to -0.5)      Low correlation
0.0 to 0.3 (0.0 to -0.3)       Negligible correlation

SAS provides two great ways to identify multicollinearity through the CORR and REG procedures. PROC CORR creates a correlation matrix that contains the correlation coefficient for each variable combination.

Here is a good SAS trick. You can gather the names of all of the variables in a data set and place them into a single macro variable. The code below calls a CONTENTS procedure and writes the variable names to a separate data set. An SQL procedure then places those names into a macro variable that is referenced with an ampersand (&). Now, you just need to call the macro variable rather than writing each of the variable names over and over!

Program 2.9: Place All Variable Names into a Macro Variable

/* Create global numeric variables */
PROC CONTENTS NOPRINT DATA=Clean (KEEP=_NUMERIC_ DROP=id host_id latitude longitude Price Price_Log) OUT=var1 (KEEP=name);
RUN;
PROC SQL NOPRINT;
       SELECT name INTO :varx separated by " " FROM var1;
QUIT;
%PUT &varx;
/* Create correlation analysis */
PROC CORR DATA=Clean;
       VAR &varx.;
RUN;

PROC CORR uses the numeric variables that I kept in the macro variable. The default correlation statistic is Pearson’s r. The results of the procedure are below. I have color-coded the top 10% of high and low correlation values.

Table 2.4: Correlation Table

Variables are considered highly correlated if the correlation coefficient is above 0.7 or below -0.7. The bottom 10% of correlation values are highlighted in green. We can see that negative correlation is not a problem since the strongest negative correlation is -0.07.

For the positively correlated variables, we see that there are several cases where the correlation coefficient is above 0.7. It makes sense that the availability variables are highly correlated. If a property is available within 30 days, it would also be available within the next 60, 90, and 365 days. Also, the correlation between accommodates and beds makes sense. If a property has a high number of beds, it can accommodate more people.

Variance Inflation Factor

Let’s look at the correlation between these features from a different angle. We can create a REG procedure with the VIF option that will give us the variance inflation factor for each variable in the model. The variance inflation factor for a predictor is the ratio of the variance of its coefficient estimate in the full model to the variance it would have in a model containing only that predictor; equivalently, VIF = 1 / (1 – R²), where R² comes from regressing that predictor on all of the other predictors. The lowest possible value for a VIF is 1, but a good rule of thumb is that you want your VIF to be less than or equal to 5.

The PROC REG code with the numeric macro variable and the VIF option is included below in Program 2.10. I also added the COLLIN option, which produces collinearity diagnostics.

Program 2.10: Variance Inflation Factor in a PROC REG

PROC REG DATA=WORK.CLEAN PLOTS=ALL;
       model Price_Log= &varx / 
       selection=forward VIF COLLIN;
RUN;

Output 2.10: Parameter Estimates Table with the VIF values

We can see in Output 2.10 that the VIF values for the availability_60 and availability_90 variables are off the charts! If we think about what these variables are telling us about the property, we might interpret them as a measure of demand. I believe that we could safely exclude the availability_60, availability_90, and availability_365 variables and still retain the necessary information that we get from the single availability_30 variable.

However, the problem of correlation is just beginning for this data set. We plan on creating additional variables (called feature engineering), and we plan on adding polynomial variables to the model. All of these variables will have some degree of correlation with the other variables that they are built on. So, instead of trying to find all of the variables that are correlated and throwing them out one at a time, we will instead introduce a regularization option to the predictive model. The concept of regularization is examined in detail in the regression chapter of this book. A regularization model is a model that introduces a shrinkage parameter to the algorithm. This parameter constrains the model to ensure that the model does not overfit the data.

The two main options for regularization are RIDGE and LASSO regression. For this project, we will use LASSO regression. The LASSO approach adds a penalty term to the regression loss function that is equal to the sum of the absolute values of the coefficients. This approach reduces the negative effects of multicollinearity and, consequently, the model’s variance, and it has the effect of shrinking the coefficients of irrelevant variables to exactly zero. We will review the LASSO methodology in detail later in the book, but for now, we just need to understand that it is a method that reduces the effects of multicollinearity in our model.
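
As a preview of where we are headed, LASSO selection can be requested in SAS through PROC GLMSELECT. The sketch below only shows the mechanics with the numeric macro variable from Program 2.9; the full model specification, including the categorical variables, is developed later in the chapter.

/* Sketch: LASSO selection with PROC GLMSELECT (not the final model) */
PROC GLMSELECT DATA=Clean PLOTS=ALL;
       MODEL Price_Log = &varx. / SELECTION=LASSO(CHOOSE=CV STOP=NONE);
RUN;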

At this point, you might start realizing that data science is not just taking all of your data, throwing it into an XGBOOST model (hey, they win all the time on Kaggle), and minimizing your test error metric. That approach works well for modeling competitions and in academia, but if you are building predictive models to support a real-world business venture, there are a lot of decisions that have to be made that might work against your ability to minimize your test metric.

The question that you have to ask yourself is, “What is my goal?” Minimizing a test metric is not the goal. Building a business that is valuable to your customers by providing them with the most actionable information possible is your goal.

Scatter Matrix

An additional method of analyzing our numeric variables is to produce scatter plots of each variable’s relationship with the target variable. SAS provides a great visual demonstration of these relationships with the scatter matrix graph.

This graph contains a series of scatter plots and allows the researcher to easily see the relationships that each of the numeric predictor variables has with the target variable. We can also see the relationship that the numeric predictors have with one another.

In Output 2.11, we can already see the strong positive relationship between the price and accommodates variables, as well as between price and guests included.

Output 2.11: Scatter Matrix

Program 2.11: Scatter Plot Matrix

PROC SGSCATTER DATA=Clean;
    TITLE 'Scatter Plot Matrix';
    MATRIX Price_Log accommodates guests_included minimum_nights maximum_nights/ 
       START=TOPLEFT ELLIPSE = (ALPHA=0.05 TYPE=PREDICTED) NOLEGEND;
RUN;

Feature Engineering

Feature engineering is the process of creating additional variables based on your current variables. It is not just adding new features to your data set; it is the result of taking the time to actually think about your variables and what information they contain and then squeezing even more value from them.

For example, the code below creates two new variables based on the accommodates, beds, and bathrooms variables. When we think about real estate pricing and what features a potential renter is looking for when viewing Airbnb listings, we might consider that they are making their decision not only on the total number of beds, bathrooms, and accommodates values, but also on the “beds per accommodates” and “bathrooms per accommodates” rates.

Program 2.12: Feature Engineering

/*     Feature Engineering */
IF beds not in (., 0) then
       beds_per_accom = accommodates / beds;
else beds_per_accom = 0;
IF bathrooms not in (., 0) then
       bath_per_accom = accommodates / bathrooms;
else bath_per_accom = 0;

If a property states that it can accommodate 12 people, but it only has a single bathroom, that can severely affect demand and therefore, could be a powerful feature to include in our model.

Polynomial Variables

Not all relationships between the target and predictor variables are linear. In fact, most of the time, they are not linear. The addition of polynomial variables to the model provides the model the opportunity to capture the non-linear relationships in the numeric variables.

It does not make sense to create polynomial variables for binary numeric indicators. So we will generate only polynomial variables based on the continuous numeric variables that we have already decided to keep in the regression modeling data set.

Program 2.13: Create Polynomials

poly_accom = accommodates**2;
poly_bath = bathrooms**2;
poly_guests = guests_included**2;
poly_min = minimum_nights**2;
poly_max = maximum_nights**2;
poly_avail = availability_30**2;

Standardize Numeric Variables

The final adjustment that we need to consider for numeric variables is related to the issue of scale. If you look at the scatter matrix graph, you can see that the scales are different for each variable. Variables with a much higher scale will have a stronger influence on the model than variables with a lower scale. For example, the variable maximum_nights will have an artificially inflated influence on the model results compared to guests_included.

An easy way to adjust the numeric variables to account for differences in scale is to standardize them. There are several different methodologies for standardization, and we will cover several of them in a later chapter. If you run a PROC UNIVARIATE on the predictor variables, you will find that they all have different scales and that several of them are significantly skewed. An easy way to take care of both of these issues is to log transform these variables.

Output 2.12: Scaled Scatter Plot Matrix

Program 2.14 creates log transformations out of the raw numeric data. Notice that I add one to each of the variables. This adjustment is because if the value for a variable is zero, the log transformation is undefined. This situation creates a missing value. (Remember that observations with missing values are excluded from the regression algorithm). So if we add one to the value, it changes the zero to one, and when we take the log of one, the value is zero. This slight global shift results in all of the variables shifting one unit positive, which does not affect the outcome; however, it does ensure that there are no unforeseen missing values in the data set.

Program 2.14: Create Log Transformations

/*Standardize variables with log transformation*/
log_accom    = log(accommodates +1);
log_bath     = log(bathrooms +1);
log_bedrooms = log(bedrooms +1);

The scatter matrix of the log variables is shown in Output 2.12. Notice how the scale of the variables has been changed (particularly the maximum_nights variable). Also, the distribution of the variables has been centered as a result of the log transformation.

We have done a lot of work on the numeric variables.

● We have investigated the distribution of these variables and have made decisions concerning upper and lower thresholds.

● We have identified variables that would not be available at the point of application and therefore, we have excluded them from further consideration.

● We have identified missing values and applied the proper inference for each variable.

● We have binned continuous variables and created categorical variables where necessary.

● We have performed a correlation analysis and decided to use a LASSO regression model design to account for multicollinearity.

● We have created new variables through feature engineering.

● We have developed polynomial variables to identify non-linear relationships.

● We have scaled the variables through log transformation.

Not too bad, but we are just getting started! I hope that you didn’t actually believe that you just take your raw data and throw it into a deep learning neural network model, and it magically gives you the perfect answer. Regardless of how much I wish that were true, actual predictive modeling takes a lot of preliminary data analysis and decision making before any modeling takes place.

Let’s push forward and look at the character variables.

Character Variables

We have applied a variety of techniques to analyze and adjust the numeric variables, including PROC MEANS and PROC UNIVARIATE. Those procedures work great for numeric variables, but they do not work well for character variables. The primary method of analysis for character variables is the FREQ procedure.

We can analyze all of the character variables at once by using the KEEP=_CHARACTER_ data set option in the DATA= specification of PROC FREQ.

Program 2.15: Frequency Distribution on Character Variables

PROC FREQ DATA=Clean (KEEP= _CHARACTER_) ORDER=FREQ; RUN;

The FREQ procedure creates a table of information that contains each level of a categorical variable along with the frequency of observations, the percent of total observations, the cumulative frequency, and the cumulative percent of observations. An example PROC FREQ output for the room_type variable is shown below in Output 2.13.

Output 2.13: PROC FREQ Output

Depending on the type of machine learning algorithm that you apply to your data set, character variables are processed differently.

Regression Models – Categorical variables that have a high number of categories (levels) can be very problematic. For each level, there is an additional degree of freedom for a regression model. You can think of each level as its own variable. So, if a categorical variable has 100 levels, then it is equivalent to adding 100 individual variables to your modeling data set. You can easily see that this leads to the curse of dimensionality (very sparse data sets due to too many dimensions).

Tree-Based Models – The levels of character variables are automatically grouped together to determine the optimal split points. These types of models do not require you to collapse the levels since the algorithm does it for you. The drawback is that you will not get detailed information for each of the categorical levels.

I will assume that we will build both regression models and decision tree models to compare them. For the regression modeling data set, I will indicate which imputation method we should use to prepare the data for modeling.

A summary table of the categorical variables along with the number of levels that each variable has and my suggested imputation method is included below.

Host-dependent fields – These fields are related directly to the host and their relationship with Airbnb. They also include fields that describe requirements that hosts can select. These are binary fields (two levels) with no missing values, so we do not need to impute these values. These include:

● Host_is_superhost

● Host_has_profile_pic

● Host_identity_verified

● Instant_bookable

● Require_guest_profile_picture

● Require_guest_phone_verification

Table 2.5: Table of Character Variables

Location fields – These fields have many levels, and they are strongly correlated with one another. They range from 5 to 223 levels. Even though location variables are highly significant for property valuation, we would not be able to use all of them in a regression model. All of these variables are telling us the same thing, namely, where the property is located. We can expect that if we leave all of them in the model, it will overfit to the TRAIN data set and not generalize well to the TEST data set.

If we included the variables zip, city, neighbourhood_cleansed, and neighbourhood_group_cleansed as variables in a regression model, these four variables would result in thousands of additional levels of stratification for our model (the curse of dimensionality)! However, all of these variables could be offered to a tree-based model, and the algorithm will group the levels into categories that have the optimal split point for the leaf of the tree.

Since we are developing a regression model first, let’s decide that the categorical variables with over 20 levels are not suitable for our model. Luckily, we still have the neighbourhood_group_cleansed variable that has five levels that we can retain for our regression model.
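If you want to verify those level counts for yourself, the NLEVELS option of PROC FREQ produces a one-line summary of the number of levels per variable. This is a sketch rather than the code that produced Table 2.5:

PROC FREQ DATA=Clean NLEVELS;
       TABLES _CHARACTER_ / NOPRINT;
RUN;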

Property type – This variable has many levels; however, 97% of the observations are captured by the top five levels. The remaining 24 levels can be collapsed. I used the method of identifying levels with similar average values of the target variable. This is an interesting use of PROC MEANS with the categorical variable in the CLASS statement.

Program 2.16: Identify Levels with Similar Target Value Rates

PROC MEANS DATA=Clean;
       CLASS Property_Type;
       VAR Price;
RUN;

The results of this analysis show that the remaining 24 levels can be collapsed into two groups, with a clear delineation between levels with an average price above $200 per night and those with an average price below $200 per night. We can collapse these levels into two groups within a DATA step.

Program 2.17: Collapse Categorical Levels

DATA TRAIN_ADJ;
       SET Clean;
IF Property_Type in ('Apartment', 'House', 'Townhouse', 'Loft', 'Condominium') THEN Property_CAT = Property_Type;
       ELSE
IF Property_Type in ('Houseboat', 'Resort', 'Tent', 'Serviced ap', 'Aparthotel', 'Hotel', 'Boat', 'Other', 'Boutique ho') THEN Property_CAT = 'Group 1';
       ELSE Property_CAT = 'Group 2';
       IF host_has_profile_pic = ' ' then
              host_has_profile_pic = 'f';
       IF host_identity_verified = ' ' then
              host_identity_verified = 'f';
       IF host_is_superhost = ' ' then
              host_is_superhost = 'f';
DROP Property_Type is_location_exact calendar_updated host_response_rate host_response_time;
RUN;

A frequency distribution applied to the newly developed Property_CAT variable shows the collapsed levels in Output 2.14.

Output 2.14: Frequency Distribution of Newly Formed Categories

Dummy Variables

No, I am not insulting these variables. A dummy variable is a binary variable that is created to represent inclusion in a category. For example, you can have a single smoking status variable in your model that contains the values smoker and non-smoker. We can create a dummy variable for these categories and have a smoker_ind variable with values of 1 or 0 to indicate whether an observation is classified as a smoker. It would be redundant to have a non_smoker_ind variable because any observation with a smoker_ind value of 0 is necessarily a non-smoker.

For the character variables that we have decided to keep in the modeling data set, we will create dummy variables for each level of those variables. For example, the five levels of the bed_type variable can be used to create five individual dummy variables:

Program 2.18: Dummy Variable Creation

IF bed_type = 'Airbed' then b_air = 1; else b_air = 0;
IF bed_type = 'Couch' then b_couch = 1; else b_couch = 0;
IF bed_type = 'Futon' then b_futon = 1; else b_futon = 0;
IF bed_type = 'Pull-out Sofa' then b_pullout = 1; else b_pullout = 0;
IF bed_type = 'Real Bed' then b_real = 1; else b_real = 0;

We will follow this procedure for the following character variables:

● Bed_type

● Neighbourhood_group_cleansed

● Room_type

● Host_is_superhost

● Host_has_profile_pic

● Host_identity_verified

● Instant_bookable

● Require_guest_profile_picture

● Require_guest_phone_verification

● Host_count_CAT

● Property_CAT

In comparison, the character variables were easier to analyze and adjust than the numeric variables. We do not need to worry about correlation, outliers, transformations, or scaling for the character variables. We just need to make sure that there are not too many levels and that the levels make intuitive sense for the purposes of our model. The modeling algorithm will determine which character variables are significant.

Adjusting the TEST Data Set

We have spent a good amount of time exploring the TRAIN data set and making several adjustments. We need to apply these same adjustments to the TEST data set so that the model can be properly applied to our hold-out TEST data set and our results analyzed.

Let’s try to remember everything that we did to adjust the TRAIN data set to prepare it for regression modeling.

● Log-transformed the target variable (price)

● Set thresholds on the target variable so that the price is not missing, not less than $30, and not greater than $750

● Inferred missing zipcode values by cross-referencing ZIP codes with neighbourhood_cleansed

● Set missing values of security_deposit and cleaning_fees to 0

● Set missing values of bathrooms, bedrooms, and beds to 1

● Set upper limits of bathrooms to 4, bedrooms to 5, and beds to 5

● Set missing values of host_listing_count to 1

● Created a categorical variable for host listing count that groups values into three categories

● Set an upper limit for maximum_nights to 1125 and minimum_nights to 31

● Collapsed the property_type variable

● Created feature engineering variables

● Created polynomial variables

● Created dummy variables from categorical variables

● Log-transformed the numeric variables

Although we dropped several variables from the TRAIN data set, we do not need to worry about dropping those variables from the TEST data set. Any variable that is not retained by the modeling algorithm based on the TRAIN data set will not affect the TEST performance metric. Basically, the model will ignore all of the variables that we had previously dropped.

An important note: It is a best practice to create a macro that automatically applies these adjustments to any data set with a similar structure.
I used the DATA step in the code contained in the GitHub repository because I wanted to make sure that everyone can easily see the IF-THEN statements and understand how each of our decision points has been applied to the TEST data set.
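As a sketch of what such a macro could look like, the following applies two of the adjustments from the list above to any input data set. The macro name, parameters, and output data set names are hypothetical, and the variable names and limits should be checked against your own data:

%MACRO adjust_data(dsn_in=, dsn_out=);
       DATA &dsn_out.;
              SET &dsn_in.;
              /*Set missing monetary fields to zero*/
              IF security_deposit = . THEN security_deposit = 0;
              IF cleaning_fees = . THEN cleaning_fees = 0;
              /*Cap bathrooms, bedrooms, and beds at their upper limits*/
              IF bathrooms > 4 THEN bathrooms = 4;
              IF bedrooms > 5 THEN bedrooms = 5;
              IF beds > 5 THEN beds = 5;
       RUN;
%MEND adjust_data;

/*Apply the same logic to both data sets*/
%adjust_data(dsn_in=TRAIN, dsn_out=TRAIN_FINAL);
%adjust_data(dsn_in=TEST, dsn_out=TEST_FINAL);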

Building a Predictive Model

Finally! This is what we came here for! You have suffered through 30 pages of boring data analysis and tedious data adjustments just to get to the glamorous world of predictive modeling. I appreciate your patience. There is an old saying that 80% of a data scientist’s time is spent investigating the data, and 20% is devoted to modeling. I think that the split is closer to 90/10. Either way, let’s talk about how to develop our models.

Baseline Models

A baseline model is a very simple model built with a few data elements and a simple model design. Many people use a simple linear regression as a baseline model. It is a quick and easy way to establish a benchmark without the need for feature engineering and hyperparameter tuning. However, I would suggest that we take a step back and look at an even more basic type of model based on simple averages.

Examining the error rate produced by simple averages can give us the most basic baseline model possible. We will simply calculate the target variable’s global average for all of the TRAIN data set’s observations and use it as the predicted value.

Let’s take this concept a step further and create three basic averages.

Global average – a single value for all observations in the TRAIN data set.

Neighbourhood_Group_Cleansed average – Neighbourhood Group average for the five levels of this variable.

Neighbourhood_Cleansed average – Neighbourhood average for the 223 levels of this variable.

The code used to create these averages and the predicted error rate is contained in the GitHub repository.

The Root Mean Squared Error (RMSE) was calculated by the formula:

RMSE = sqrt( (1/n) * Σ (actual_i − predicted_i)² )

where n is the number of observations in the data set.

We used the raw Price variable instead of the log-transformed Price variable for two main reasons:

1. We have not created a model yet, so we don’t have to worry about skewed data.

2. We want an easy-to-understand error metric.
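The full program is in the GitHub repository, but a minimal sketch of the global average baseline looks something like the following. The TRAIN and TEST data set names are placeholders for whichever working data sets contain the raw Price variable:

/*Global average of Price from the TRAIN data set*/
PROC SQL NOPRINT;
       SELECT MEAN(price) INTO :global_avg FROM TRAIN;
QUIT;

/*Score the TEST data set with the global average and compute the RMSE*/
DATA baseline_test;
       SET TEST;
       pred = &global_avg.;
       sq_error = (price - pred)**2;
RUN;

PROC SQL;
       SELECT SQRT(MEAN(sq_error)) AS RMSE_Global FROM baseline_test;
QUIT;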

The table of information and the associated bar chart below show how badly these predictors perform.

Table 2.6: Baseline RMSE Performance Metrics

Data Set     Baseline Global Avg     Baseline Neigh Group Avg     Baseline Neigh Avg
TRAIN        $102.11                 $96.84                       $90.32
TEST         $101.21                 $96.09                       $89.26

Figure 2.3: Bar Chart of Baseline RMSE Performance Metrics

Wow, these are bad models. The RMSE values are extremely high. An interpretation of these errors could be that if you have a property where the correct per diem rate is $300, the global average model would suggest a price of $139.26 because that is what the predicted price is for every property. This results in a residual error of $160.74. I would not pay a cent for this service.

One good takeaway from this example is that as you add more information to your model, your residual error will decrease. The global mean contains very little information and therefore has the highest error rate. The only way to get a worse error rate is to calculate a random price between $30 and $750 (because that is our price range) and use that as the predicted price. Just out of curiosity, I did just that, and the RMSE for a random price prediction is $356.10.

As we add more information by creating average prices for the neighbourhood_group_cleansed and the neighbourhood_cleansed variables, our error rate steadily decreases. However, even though there is an improvement, these models are still terrible.

But this does provide us an opportunity to demonstrate the power of predictive modeling.

Modeling Approach

There are two different approaches to model building. The first is a very cautious approach where the model developer intentionally selects variables for model inclusion and understands how each variable will contribute to the final predicted value. This approach is often used when a model needs to be highly interpretable.

The second type of approach is used when the model developer doesn’t have to understand how the variables contribute to the final predicted value. The primary goal for this type of model is predictive accuracy.

Both of these types of models are valuable, depending on the business question that they address and the environment in which they will be implemented. For example, if this is a government model that predicts the amount of public assistance that a region will receive, then that model had better be highly interpretable. There will be a lot of scrutiny concerning the inputs, methodology, and outputs of the model.

However, if the model is an image recognition model, then there is not currently as much scrutiny in understanding how the model came to a specific predicted value. The main concern here is accuracy. No one will really care how you got the correct image classification; they are only concerned that it is accurate.

We will develop both types of models to demonstrate the benefits of these different approaches.

Regression Models

Most of the data analysis and transformations that we performed prior to modeling were specifically designed to prepare the data for regression modeling. To have a successful regression model, you need to understand the content and distribution of each variable as well as how these variables relate to one another.

The parametric functional form of the regression model performs optimally when the following conditions are met:

● Missing values and outliers are identified, and the appropriate imputations and limitations are performed.

● The individual data attributes are not skewed (this is why we created the log transformations).

● The predictors are not correlated with one another (this is why we developed our correlation analysis).

● The categorical variables are collapsed and converted to dummy variables.

General Linear Model

The GLMSELECT procedure performs effect selection within the framework of general linear models. It is a very flexible procedure that offers a wide variety of selection methods and output options. The procedure allows you to customize your model by specifying the inclusion criteria, stopping criteria, evaluation metrics, and many other options.

The first step of developing this model is to create a macro (see the Advanced SAS Programming chapter) that contains all of the variables that we want to offer to the model. This code creates a macro variable called lasso_var that we will use as the predictor for the GLMSELECT procedure.

Program 2.19: Create Macro Variable

*************************************************;
/* GLM Linear Regression MODEL                  */
*************************************************;
%let lasso_var = log_accom log_bath log_guest log_min log_max log_avil30 log_bedsper n_bronx n_brooklyn n_manhattan n_queens r_entire r_private h_super h_profile h_verified b_couch b_futon b_pullout b_real instant require_pic require_phone
hcount_level1 hcount_level2 hcount_level3 p_apart p_condo p_group2 p_house p_loft p_townhouse n_staten r_shared b_air p_group1 poly_accom poly_bath poly_guests poly_min poly_max poly_avail; 

To generate the evaluation plots, we first make sure that the ODS GRAPHICS ON statement is included. I will not provide a detailed description of each of the selections in the PROC GLMSELECT statement at this point because we will cover them in detail in the linear regression chapter of this book.

Program 2.20: PROC GLMSELECT Model

PROC GLMSELECT DATA=WORK.TRAIN_FINAL OUTDESIGN(ADDINPUTVARS)=Work.reg_design 
       PLOTS(stepaxis=normb)=all;
       MODEL Price_Log=&lasso_var. / 
selection=lasso(stop=none choose=SBC);
       OUTPUT OUT = train_score;
       SCORE DATA=TEST_FINAL PREDICTED RESIDUAL OUT=test_score;
run;

A few things to note about this code:

● The top line of the PROC GLMSELECT statement includes the OUTDESIGN statement. This statement saves the list of selected effects in a macro variable (&_GLSMOD), and this macro variable can be used later.

● I have specified the plots that I want to use to evaluate the model.

● I specify that I want to use the macro variable lasso_var to predict the value of price_log.

● I have specified that I want to use a LASSO regularization selection methodology and use Schwarz's Bayesian Criterion (SBC) to choose the final model.

● I specify that I want to output the scored TRAIN data set.

● I specify that I want to score the TEST data set and output the predictions and residuals.

Now we can reap the benefits of all the time and thought that we put into evaluating our predictor variables.

The model output shows the order in which each variable was introduced to the model and its associated SBC value. Since 42 variables are introduced to the model, I have limited the LASSO selection summary to the top ten variables.

Output 2.15: GLMSELECT Model Output

The selection summary shows us that the r_entire variable is the most influential individual predictor of price_log. The remaining variables are listed in order of predictive power.

The graphical output below shows the effect sequence of each of the variables. This is basically the order in which they have been added to the model as well as their standardized coefficients and their associated SBC values.

This is looking pretty good. We have built a modeling data set and fed that information into our GLMSELECT LASSO model, and the algorithm gave us the optimal cut point of which variables are statistically significant. That is academically satisfying. However, our regression model was not built to provide us with an A+ in our STATS101 class; it is built with the goal of providing value to our business customers. So, let’s examine the details and make some decisions.

Looking at the coefficient progression plot above, we can see that the first four variables add a lot of predictive power to the model. That is to be expected. We are starting at zero information and adding information to our model, so we should expect to see significant gains within the first few variables. However, the information gains quickly level off. The fifth through the tenth variables add slightly more value, and we get diminishing returns for each variable that is added after that.

Let’s look at it another way. We can build the same GLMSELECT LASSO model, but we can add the variables one at a time and calculate the RMSE score each time a variable is added. Then we can assess the impact on both the TRAIN and the TEST data sets. By doing this, we will be able to see where the optimal variable cutoff is for the hold-out TEST data set.

The code is a bit lengthy, but I’ll show you the top part that starts the loop:

Program 2.21: Add Variables One at a Time to the GLMSELECT Model

%macro do_glm;
       %do k=1 %to 42;
PROC GLMSELECT DATA=WORK.TRAIN_FINAL OUTDESIGN(ADDINPUTVARS)=Work.reg_design 
                     PLOTS(stepaxis=normb)=all;
                     MODEL Price_Log=&lasso_var. / 
                     selection=lasso(stop=&k choose=SBC);
                     OUTPUT OUT = train_score;
                     SCORE DATA=TEST_FINAL PREDICTED RESIDUAL 
                     OUT=test_score;
              run;
       %end;
%mend;
%do_glm

Notice that I’ve created a macro with a DO loop inside of it. This process creates a range of values from 1 to 42. The DO loop starts with the first variable in the lasso_var macro variable and cycles through to the last variable in that list. Each cycle of the loop adds the next variable from the sequential list to the model.

An important point to know is that the variables have been sorted based on the order of importance as determined in the previous GLMSELECT LASSO model. This order is important because we are looping through the variables and adding them in order from the most significant to the least significant.

So, in essence, we are running the GLMSELECT LASSO model 42 times. The remaining part of the code that is contained in the GitHub repository calculates the RMSE for the TRAIN and the TEST data sets and appends them to a master RMSE table.
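For reference, one way the per-iteration RMSE calculation and append could be sketched out is shown below. These statements would sit inside the %DO loop, after the PROC GLMSELECT step. The predicted value column name (p_Price_Log) is an assumption based on the default naming of scored variables, so verify it against your scored data sets:

/*Compute TRAIN and TEST RMSE for the model with &k variables*/
PROC SQL NOPRINT;
       SELECT SQRT(MEAN((Price_Log - p_Price_Log)**2)) INTO :train_rmse FROM train_score;
       SELECT SQRT(MEAN((Price_Log - p_Price_Log)**2)) INTO :test_rmse FROM test_score;
QUIT;

DATA rmse_k;
       k = &k.;
       train_rmse = &train_rmse.;
       test_rmse = &test_rmse.;
RUN;

/*Append the results to a master RMSE table*/
PROC APPEND BASE=rmse_master DATA=rmse_k FORCE;
RUN;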

The chart below shows the calculated RMSE values for the TRAIN and the TEST data sets. This looks similar to the SBC chart above in which we see large gains in the first few variables, but diminishing returns for our additional variables.

Figure 2.4: Calculated RMSE Values for TRAIN and TEST Data Sets
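A chart like Figure 2.4 can be drawn directly from that master RMSE table with PROC SGPLOT. The data set and column names below are the hypothetical ones from the sketch above:

PROC SGPLOT DATA=rmse_master;
       SERIES X=k Y=train_rmse / MARKERS;
       SERIES X=k Y=test_rmse / MARKERS;
       XAXIS LABEL="Number of Variables in the Model";
       YAXIS LABEL="RMSE";
RUN;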

This analysis shows us that the model begins to overfit at the point where the 25th variable is entered into the model. This point is where the RMSE of the TEST data set begins to increase. Notice that the TRAIN RMSE continues to decrease (or at least level off) as we add more variables to the model. This is exactly what we learned in the first chapter about the bias-variance tradeoff.

So, now we have two perspectives of the GLMSELECT LASSO model. The first is generated when we run the model and evaluate the model output for the TRAIN data set. This perspective states that there are 33 statistically significant variables in the data set.

The second perspective looks at the bias-variance tradeoff and shows the difference in the RMSE on the TRAIN versus the TEST data set. This perspective shows that there are 24 statistically significant variables in the model that can be used before overfitting occurs.

Both of these perspectives are technically correct; however, we are building a business model, and one of the important considerations with this type of model is for it to be parsimonious.

Parsimonious Model

A parsimonious model is a model that achieves the desired level of explanation with the fewest possible predictor variables. This idea stems from the principle of Occam's razor, which states that one should not make more assumptions than the minimum needed.

A quick look at Table 2.7 shows that the first four variables provide a lot of information to the model. The remaining 38 variables add marginal amounts of information to the model. We can attempt to quantify the impact of each additional variable by calculating the percentage decrease in the TEST RMSE for each variable.
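A short DATA step with the LAG function is one way to compute that percentage decrease, again using the hypothetical rmse_master table from the earlier sketch:

PROC SORT DATA=rmse_master; BY k; RUN;

DATA rmse_change;
       SET rmse_master;
       prev_rmse = LAG(test_rmse);
       IF prev_rmse NE . THEN pct_decrease = 100 * (prev_rmse - test_rmse) / prev_rmse;
RUN;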

Table 2.7 shows that the first four variables have a significant impact on the TEST RMSE. However, the impact quickly dissipates after the fourth variable is added to the model.

Table 2.7: RMSE Analysis

When we compare the model suggested by the TRAIN versus TEST RMSE analysis, which recommended 24 predictor variables, with the parsimonious model that suggests only four variables, the TEST RMSE of the 24-variable model is 0.40776 while the TEST RMSE of the four-variable model is 0.42649. This is a difference of only 0.01873. You can argue that this difference is marginal at best.

Bootstrapped Model

We can take one last look at the GLMSELECT LASSO model before we move on to a non-parametric model. The question that we need to address is whether our model results come from a particularly lucky draw of the data that just happened to produce a good value for us. The underlying issue is whether the model is stable across multiple draws of the data.

A bootstrapped model repeatedly samples the modeling data set and builds a model on each one of the samples. The models are then averaged together to create final estimates for the coefficients. The code below creates the bootstrapped model.

Program 2.22: Bootstrapped GLMSELECT Model

%let parsi_var = r_entire log_accom n_manhattan p_group1;
ods noproctitle;
ods graphics / imagemap=on;
proc glmselect data=WORK.TRAIN_FINAL outdesign(addinputvars)=Work.reg_design 
       plots=(EffectSelectPct ParmDistribution criterionpanel 
       ASE) seed=1;
       model Price_Log=&parsi_var. / 
       selection=stepwise(select=sbc);
       modelAverage nsamples=1000 tables=(EffectSelectPct(all)
ParmEst(all));
       output out = train_score;
       score data=TEST_FINAL PREDICTED RESIDUAL out=test_score;
run;

A few notes on Program 2.22:

● I selected the top four variables from the parsimonious model and placed them into a macro variable called parsi_var.

● The selection methodology here is stepwise, with SBC as the selection criterion.

● We are creating 1000 samples with replacement and averaging the coefficients across all of the samples.

Output 2.16a: Bootstrap Model Effect Selection

The output of the bootstrapped methodology provides us with a lot of information. The effect selection percentage shows us the percentage of the samples in which a specific variable was significant. Output 2.16a confirms that each of the four predictor variables was significant in every one of the 1000 sample models.

The next part of the bootstrap model output shows the average parameter estimates across all of the 1000 sample models. This approach produces not only the coefficient estimate but also the standard deviation as well as the estimate quartiles. These values can be instrumental in understanding the estimated range for each of the predictor variables.

Output 2.16b: Bootstrap Model Parameter Estimates

Finally, the bootstrap model output contains a graphical distribution for each of the parameter estimates, as well as the intercept.

This graphic shows the distribution of the calculated intercept and coefficient values across the 1000 samples. This is a nice real-world example of the central limit theorem.

Output 2.16c: Bootstrap Model Parameter Distributions

The final decision that we need to make concerning the regression model is to decide which regression model we should use.

Our choices are the full linear regression model with 33 statistically significant variables, the linear regression model based on the TRAIN/TEST evaluations with 24 statistically significant variables, and the parsimonious bootstrapped model with four statistically significant variables.

The benefits of the parsimonious model outweigh the marginal predictive benefits from the other two regression models. I suggest that we move forward with the parsimonious bootstrapped linear regression model with four variables.

Non-Parametric Models

When we first approached the modeling process for our Airbnb model, we made the initial assumption that we were going to build a regression model. Under this assumption, we made several changes to the data:

● Categorized some numeric variables

● Collapsed some character variables

● Standardized our numeric variables

● Created dummy variables for our character variables

● Excluded character variables with too many levels

These are all necessary steps when building a regression model. The result of this process is a modeling data set that contains approximately normally distributed target and predictor variables along with a series of binary variables, all of which the regression algorithm can easily consume and for which it produces coefficients for every significant variable in the final modeling output. These adjustments provide us with a transparent model where we understand how each variable is incorporated into the model and we have the final coefficient weights for every variable. Nothing is left unexplained about how we derived a prediction for a specific observation.

As we learned in the first chapter, there is another approach to modeling. A non-parametric model does not assume a specific functional form. This type of model can adapt to a wide range of functional forms. A linear regression model assumes that the relationship between the target variable and the predictors is linear (thus the name). Non-parametric models such as decision trees, support vector machines, neural networks, and several others do not assume that the relationship between the target and the predictors is linear.

Non-parametric models are very flexible and produce highly accurate predictions, but at the cost of transparency. For example, a gradient boosting decision tree model can often produce significantly lower test metric scores than a linear regression model, but there is no clear visibility into how the algorithm produces these results. We will delve into the details of these algorithms and their outputs later in the book, but for now, let’s apply some of these modeling types to our Airbnb data set and compare their predictions to the linear regression model.

For our comparisons, we will limit the non-parametric modeling types to tree-based models. There could be additional benefits in building other non-parametric types of models, but let’s focus on the tree-based models for this example.

Modeling Data

One of the big advantages of tree-based models is that we do not have to worry about the distribution of the predictor variables. In the previous section, we spent a lot of time analyzing each variable and making several changes to prepare the data for modeling. Tree-based models do not require that the numeric variables are normally distributed, and there is not an issue with categorical variables having too many levels. This is because the decision tree algorithm segments the predictor space into several simple regions.

The algorithm identifies the variable and split point that minimizes the Residual Sum of Squares (RSS). We will review the decision tree methodology in detail later in the book, but for now, we need to understand that the decision tree approach is fundamentally different from the linear regression approach. The decision tree assesses each variable, both numeric and categorical, and makes optimal selections of variable split points that reduce the RSS for the predictor space. Numeric variables are split at some point across a continuous spectrum, while the levels of the categorical variables are grouped together in optimal groupings that reduce the RSS.
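In the standard notation (not specific to any SAS procedure), a candidate split divides the predictor space into two regions R1 and R2, and the algorithm chooses the variable and split point that minimize

RSS = Σ over observations in R1 of (y_i − ȳ_R1)²  +  Σ over observations in R2 of (y_i − ȳ_R2)²

where ȳ_R1 and ȳ_R2 are the mean values of the target within each region.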

Because the levels of categorical variables are grouped, we do not need to worry about variables such as ZIP code that have a couple of hundred levels. The algorithm will assess such a variable and identify the optimal grouping of its levels into two categories that split the predictor space. For example, if a categorical variable has 100 levels, the decision tree might group 95 of those levels into one category and the remaining five levels into another category. These categories do not need to have an equivalent number of levels in each group.

Decision Tree

The first type of tree-based model that we will explore is a simple decision tree. This methodology constructs a single decision tree on the full TRAIN data set, and we can apply the resulting model to the hold-out TEST data set. This type of model is called a recursive binary splitting model. This is because the methodology begins with a single variable and splits the predictor space into two (binary) spaces, and then the algorithm evaluates each of the resulting predictor spaces and applies the same splitting methodology recursively (over and over).

This methodology is called "greedy" because it selects the best split at each branch of the tree. It never looks ahead to accept a locally suboptimal split that would produce a better overall outcome further down the tree.

Since tree-based models can use the character variables directly, we can go back to the data set that we built prior to creating all the dummy variables. This data set contains all of the categorical variables before they were collapsed and transformed into binary dummy variables.

The code below gathers the numeric and character variables from the data set that was created before creating the dummy variables, and it places the numeric and character variables into two separate macro variables. This will be helpful later when we need to make different specifications for each variable type.

Program 2.23: Create Numeric and Character Macro Variables

/*Create macro variable for numeric variables*/
PROC CONTENTS NOPRINT DATA=TRAIN_ADJ (KEEP=_NUMERIC_ DROP=id
host_id price price_log)
       OUT=VAR3 (KEEP=name);
RUN;
PROC SQL NOPRINT;
SELECT name INTO :tree_num SEPARATED BY ' ' FROM VAR3;
QUIT;
%PUT &tree_num;
/*Create macro variable for character variables*/
PROC CONTENTS NOPRINT DATA=TRAIN_ADJ (KEEP=_CHARACTER_ 
       DROP=Property_CAT)
       OUT=VAR4 (KEEP=name);
RUN;
PROC SQL NOPRINT;
SELECT name INTO :tree_char SEPARATED BY ' ' FROM VAR4;
QUIT;
%PUT &tree_char;

The resulting data set contains 43 variables that are a mix of the raw numeric variables, the log-transformed numeric variables, the feature engineered numeric variables, and the polynomial numeric variables as well as the raw character variables without any level collapsing. These variables are presented in Table 2.8.

We will present all of these variables to the decision tree algorithm and allow it to select the optimal variable and split points that will result in the lowest RSS value.

Table 2.8: Variable Listing

Although this process allows much more freedom to the modeling algorithm, it also requires significantly more processing power than the linear regression methodology. As we increase complexity by adding randomization and boosting to the decision tree, we consequently increase the processing power necessary to develop these models. This might not be an issue with a data set of this size, but as you try to apply these techniques to larger data sets, you will run into limits on processing power and memory. Luckily, SAS has a couple of features that can split your modeling tasks across multiple computer cores, which helps alleviate this issue.

To create a single decision tree, I have opted to use PROC HPSPLIT. This is a high-performance decision tree algorithm that has lots of options for customization. The code below develops a single decision tree on the TRAIN data set and outputs a scoring code file that can be applied to the hold-out TEST data set for evaluation.

Program 2.24: Simple Decision Tree Model

ODS GRAPHICS ON;
PROC HPSPLIT DATA=WORK.TRAIN_ADJ seed=42;
   CLASS &tree_char.;
   MODEL price_log = &tree_char. &tree_num.;
   OUTPUT OUT=hpsplout;
   CODE FILE='C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data\hpsplexc.sas';
run;

Notes on the decision tree code:

● I am building the model based on the TRAIN_ADJ data set. This is the data set that contains all of the numeric and character variables appropriate for modeling.

● I included a SEED statement to ensure reproducible results.

● The CLASS statement contains all of the character variables.

● The target variable is PRICE_LOG, and the predictor variables are the two macro variables that contain all of the numeric and character variables.

● The code file outputs a SAS program that contains all of the splitting logic. This file can be applied to other data sets for scoring.

One of the main cautions of tree-based models is that they are prone to overfitting. If they are not controlled, they will create a model with 100% accuracy on the training data that does not generalize at all to other data sets. Luckily, SAS has default values built into PROC HPSPLIT that regularize the model, and the model developer can customize these values. If you want to select a different pruning method or maximum tree depth, you certainly have those options available to you.
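As a sketch of how those defaults could be overridden, the PROC statement accepts options such as MAXDEPTH= and LEAFSIZE=, and the GROW and PRUNE statements control the growth criterion and the pruning method. The specific values below are illustrative only, and the full set of available criteria is in the HPSPLIT documentation:

PROC HPSPLIT DATA=WORK.TRAIN_ADJ SEED=42 MAXDEPTH=8 LEAFSIZE=25;
       CLASS &tree_char.;
       MODEL price_log = &tree_char. &tree_num.;
       GROW VARIANCE;        /*growth criterion for an interval target*/
       PRUNE COSTCOMPLEXITY; /*cost-complexity pruning*/
RUN;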

Output 2.17a: PROC HPSPLIT Model Information

The first section of the model output contains information about the construct of the model. Output 2.17a shows the default modeling criteria as well as the default hyperparameters that were used to build the decision tree. We will review each of these criteria in the decision tree chapter of the book, but for now, we need to understand that each of these criteria and the associated hyperparameters are implemented to regularize the decision tree model so that it does not produce a highly overfitted model that represents only the training data set.

Output 2.17b: PROC HPSPLIT Cost Complexity

The next section of the model output is a graphical representation of the impact on the average squared error for each leaf of the decision tree. This graphic looks very similar to the graphical analysis of the linear regression model that shows the reduction in the SBC for each variable added to the model.

If we were to decide that we wanted to use a single decision tree for our final solution, we would have the option of reducing the number of variables based on the premise of parsimony, just like we did with the linear regression model. This could provide us with a transparent model with a few features that are competitive with our linear regression model.

Output 2.17c: PROC HPSPLIT Tree Visualization

The next item produced by the PROC HPSPLIT procedure is the visual representation of the decision tree. This is what we are all accustomed to thinking about when we hear the words “decision tree.” The visualization begins at the root value of the full data set and begins the splitting process. The first split of the decision tree starts with the room_type variable. (Hey, that’s the same variable that was most important in the linear regression model!)

Output 2.17d: PROC HPSPLIT Variable Importance

The algorithm continues the splitting process independently at each node of the tree. Notice at the third level of the tree, the left side is split on the zip variable while the right side is split on the bathrooms variable.

Also, a variable can be used several times throughout the splitting process. Just because the zip variable was used at the third level does not mean that it cannot be used again later on down the tree.

The next piece of information produced by PROC HPSPLIT is a table of variable importance. This table is similar to the “Effect Entered” table produced by the linear regression in that it tells us what the most important variables are as they relate to the target variable. Notice that the variables in this table and the order in which they appear are similar to the variables in the “Effect Entered” table in the linear regression output.

The decision tree model easily handles the multiple level categorical variables such as zip and neighbourhood_cleansed. These variables contain a lot of information about housing prices because they are specific to a location. The linear regression could not use them because they contain too many levels; however, the decision tree easily utilizes the information contained in these variables by grouping them by the optimal split point.

We can use the scoring code produced by PROC HPSPLIT to score the hold-out TEST data set. This is easily accomplished in a DATA step where the scoring code is brought in with an INCLUDE statement.

Program 2.25: Apply Decision Tree Model to Hold-out TEST Data Set

DATA TEST_SCORED;
  SET TEST_ADJ;
  %INCLUDE 'C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data\hpsplexc.sas';
RUN; 

We can now run the same algorithm that I developed for assessing the RMSE for the linear regression model and apply it to our decision tree model. Remember that this code is contained in the GitHub repository.

The table below shows the calculated RMSE value for the TRAIN and TEST data sets for different modeling methodologies.

Table 2.9: Model RMSE Table

An observant reader might stop me right here and accuse me of an unfair comparison. The linear regression model is the parsimonious model that has a reduced number of variables. What if we reduced the variables in the simple decision tree? How would they compare then? Luckily, my data-centered OCD forces me to make a simple decision tree with the top ten variables based on relative variable importance for comparison to the parsimonious linear regression model.

As you can probably guess by now, a parsimonious simple decision tree model does not lose a lot of predictive power. The RMSE for the TRAIN data set is still 0.342, while the RMSE for the TEST data set increases very slightly to 0.366. The takeaway from this analysis is that the application of a parsimonious simple decision tree to the Airbnb data set performs better than the parsimonious linear regression model. This increase in performance is because the underlying data structure is non-linear.

Now that we have discovered the non-linear nature of the data, we can develop a few more tree-based models. You are obviously observant enough to see in the table above that we are going to build a random forest and a gradient boosting decision tree model.

Random Forest

An issue that we touched on when we developed the simple decision tree is that it is greedy in nature. That means that it selects the best variable and split point that reduces the RSS of the predictor space and then moves on to the next branch of the tree. Although this methodology makes optimal decisions at each branch of the tree, it never looks forward and makes decisions at an individual branch that might not be optimal for that particular branch but would benefit the overall RSS further down the tree. The random forest approach was developed to alleviate this issue.

A random forest is an “ensemble” modeling approach. This means that several models are brought together to create a single prediction. For a random forest to be developed, the TRAIN data set is sampled with replacement at a specified sampling rate many times. For our example, the TRAIN data set contains 38,527 observations. If I set the sampling rate to be 60%, then 23,116 randomly chosen observations will be in the sampled training data set. The remaining 40% of the observations will be in a hold-out data set that will be used to calculate the “out-of-bag” (OOB) error metric. The OOB is the mean prediction error for each sampled training data set. You can think of this as a mini TRAIN/TEST environment where a decision tree is developed on the sampled training data set and applied to the hold-out OOB test data set.

The randomization of selected observations is not the only randomized part of a random forest. For any data set, there are usually one or two variables that overpower the remaining variables. These are the variables that are most strongly correlated with the target variable. If these variables are included in a simple decision tree, they will be the top nodes of the tree and dominate the initial splitting decisions. A clever way that the random forest mitigates this issue of variable dominance is to randomly select a subset of the total number of possible variables to include in a given decision tree.

In our Airbnb example, there are 43 variables available to present to the algorithm. A rule of thumb for building a decision tree is to set the number of randomly selected predictor variables to the square root of the total number of predictors. The square root of 43 is 6.557, so we will round up and specify that we want seven randomly selected variables to be included with each decision tree.

Since this is a recursive procedure, we also need to specify how many times we want the algorithm to build these randomly selected trees. Keep in mind that the larger the number of trees that you specify, the more processing power will be required of your computer system.

The code below develops a random forest that consists of 500 randomly sampled trees, each with seven randomly sampled predictor variables. Each one of these trees is different due to the randomization of the observations and the predictors.

Program 2.26: Random Forest Model

proc hpforest data=WORK.TRAIN_ADJ
       maxtrees= 500 vars_to_try=7
       seed=42 trainfraction=0.6
       maxdepth=20 leafsize=6
       alpha= 0.1;
       target price_log/ level=interval;
       input &tree_num. / level=interval;
       input &tree_char. / level=nominal;
       ods output fitstatistics = fit;
       SAVE FILE = "C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data\fmodel_fit.bin";
run;

Notes on Program 2.26:

● I am building the model based on the TRAIN_ADJ data set. This is the data set that contains all of the numeric and character variables appropriate for modeling.

● I specify that I want to build 500 trees (MAXTREES) and have 7 predictors in each tree (VARS_TO_TRY).

● I specify that I want 60% of the data (TRAINFRACTION) to be used as the sampled training data set.

● I included a SEED statement to ensure reproducible results.

● There are separate INPUT statements for the numeric and character variables.

● The target variable is PRICE_LOG, and the target is specified to be an interval variable.

● The SAVE statement outputs a binary model file that contains the forest model. This file can be applied to other data sets for scoring (for example, with PROC HP4SCORE).

The first section of the random forest output contains a summary of the hyperparameter settings for the model. I have specified a few of these settings in the code while others I left set to their default values. From this list, you can see that there are a multitude of ways to customize your algorithm, and the interaction of these settings can result in widely different predictive values. If you were to choose to move forward with the random forest and select this modeling algorithm as the model to drive your business, you would have to develop a grid search that loops through different combinations of these hyperparameters to find the subset that performs best on your hold-out TEST data set.
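A sketch of what a small grid search might look like is shown below. The macro name, the grid values, and the output data set naming are all hypothetical; in practice, you would also score each fit and append its TRAIN/TEST error metrics to a comparison table, just as we did for the regression models:

%MACRO rf_grid;
       %DO depth = 10 %TO 30 %BY 10;
              %DO vars = 5 %TO 9 %BY 2;
                     PROC HPFOREST DATA=WORK.TRAIN_ADJ
                            maxtrees=500 vars_to_try=&vars.
                            seed=42 trainfraction=0.6
                            maxdepth=&depth. leafsize=6;
                            target price_log / level=interval;
                            input &tree_num. / level=interval;
                            input &tree_char. / level=nominal;
                            ods output fitstatistics = fit_&depth._&vars.;
                     run;
              %END;
       %END;
%MEND rf_grid;
%rf_grid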

Output 2.18a: Random Forest Model Information

The next section of the model output provides the fit statistics for each of the 500 decision trees. I have presented only the first ten trees here for a demonstration. We can see that the output contains the cumulative results of including each additional tree to the overall random forest model. As each unique tree is added to the assessment, the number of leaves increases while the average squared error decreases for both the training (train) and testing (OOB) data sets.

Output 2.18b: Random Forest Fit Statistics

One of the strong selling points of using a random forest is that due to the development of the OOB testing data set, there is no need to have a formal TEST data set. This is because the OOB data set acts as a de facto testing data set. We could, in theory, use our full data set to build the random forests (both TRAIN and TEST). However, there is still a lot of benefit to having a traditional hold-out TEST data set. This will provide you with a better assessment of how the model will perform in a real-world setting.

So, here we have another tradeoff with model development. Should you combine your TRAIN and TEST data sets to produce a random forest on your largest possible number of observations? More data nearly always produces better modeling results. Or do we split our full data set into separate TRAIN and TEST data sets and develop our random forest solely on the TRAIN data set and evaluate the results on the hold-out TEST data set? The assessment of a hold-out data set nearly always provides a better assessment of model performance. Unfortunately, there is no objectively correct answer.

The final section of the model output is the Variable Importance table. This table orders the predictive variables based on each variable’s contribution to the reduction in the Mean Squared Error (MSE). The Variable Importance table lists all of the variables that were presented to the algorithm. I have included only the top 10 here for demonstration purposes.

Output 2.18c: Random Forest Variable Importance

These variables look very similar to the ones that were highly significant in the linear regression and the decision tree models. Maybe we should start believing that room_type is the most significant aspect of Airbnb per diem pricing in New York City as of December 2018. The other variables are similar to the top ten variables in the other two models that we have developed.

In our previous models, we looked at the incremental value of each additional variable, and we made decisions on selecting a subset of those variables based on parsimony. However, that approach is not really applicable to random forest models. The random forest approach is considered a black box model because we cannot state specifically the value of each variable to the overall prediction. That is because of the random selection of the observations and the random selection of the predictors. The random forest averages together the results of hundreds of decision trees, so we cannot state the value of a single predictor. Any given predictor was not present in a certain percentage of the hundreds of decision trees. We can state the overall average variable importance based on MSE, but we cannot state how a specific variable is used to construct the final predictor because it varies based on the random construction of each composite decision tree.

I subjected the random forest to the RMSE scoring algorithm (found in the GitHub repository). The results show a marginal improvement over the simple decision tree.

Table 2.10: Model RMSE Table

We can more clearly see the cost-benefit tradeoff of increasing model complexity. As our models increase in complexity, we can see a reduction in the TEST scoring metric, but at the cost of explainability.

Gradient Boosting

The gradient boosting model design was developed from a fundamentally different approach than the traditional decision tree and random forest approaches. Rather than developing a single deep tree (many nodes and branches) or a random collection of deep trees, this approach produces a series of shallow trees (called “weak learners”) sequentially.

This approach begins with a single decision tree that is limited to two to eight levels. This is called a weak learner because, by design, it does not produce very good predictions. This shallow structure produces somewhat large residuals for each observation. But here is the brilliance of this model design: the next tree in the sequence uses the residuals from the previous tree as the target value. You are probably saying, "That's crazy talk, why in the world would you do such a thing?"

The problem with a traditional decision tree is that it is a greedy learner, which leads to non-optimal predictions. The shallow sequential design of the gradient boosting approach allows the model to learn slowly. Each shallow sequential decision tree is fitted on the residuals from the previous tree. Once a given decision tree is created, it is not modified. Each tree is added into the fitted function that updates the residuals that are the target values for the next shallow tree in the sequence. This approach slowly improves the function with increased weight on the areas of high residual values.
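In the standard notation for boosted regression trees (a schematic, not anything specific to PROC TREEBOOST), the sequence can be written as:

F_0(x) = mean(y)
r_i = y_i − F_(m−1)(x_i)
F_m(x) = F_(m−1)(x) + λ · h_m(x)

where h_m is the shallow tree fitted to the residuals r at step m, and λ is the shrinkage (learning rate), which is set to 0.1 in Program 2.27 below.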

We will review the gradient boosting approach in more detail later in the book (are you sick of hearing me say that over and over?), but for now, we need to understand that this approach is methodologically different from the traditional decision tree and the random forest approach. The caution of this approach is that we are truly in the black box model territory now. This modeling approach has a different target value for each iteration of the algorithm!

I used PROC TREEBOOST to develop the gradient boosting model. This procedure allows the model developer to adjust nearly every aspect of the gradient boosting algorithm.

One note of caution when implementing a gradient boosting model is that it is resource-intensive. When you develop a random forest model, the trees are developed independently, so the algorithm can develop several trees at a time across multiple cores. However, with a gradient boosting algorithm, the trees are developed sequentially. This means that you cannot gain processing efficiency by spreading the task across many cores. In the end, this algorithm takes some time to run, so once you click the Run button, go make a sandwich and come back to it later.

Program 2.27: Gradient Boosting Model

PROC TREEBOOST DATA=TRAIN_ADJ
       CATEGORICALBINS = 10
       INTERVALBINS = 400
       EXHAUSTIVE = 5000
       INTERVALDECIMALS = MAX
       LEAFSIZE = 100
       MAXBRANCHES = 6
       ITERATIONS = 500
       MINCATSIZE = 50
       MISSING = USEINSEARCH
       SEED = 42
       SHRINKAGE = 0.1
       SPLITSIZE = 100
       TRAINPROPORTION = 0.6;
       INPUT &tree_num. / LEVEL=INTERVAL;
       INPUT &tree_char./ LEVEL=NOMINAL;
       TARGET PRICE_LOG / LEVEL=INTERVAL;
       IMPORTANCE NVARS=50 OUTFIT=BASE_VARS;
       SUBSERIES BEST;
       CODE FILE="C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data\BOOST_MODEL_FIT.sas"
       NOPREDICTION;
       SAVE MODEL=GBS_TEST FIT=FIT_STATS 
IMPORTANCE=IMPORTANCE RULES=RULES;
RUN;

Notes on Program 2.27:

● The model is developed on the TRAIN_ADJ data set.

● I have specified that there are a maximum of 6 branches (MAXBRANCHES) for each decision tree.

● I am building 500 sequential decision trees (ITERATIONS).

● The learning rate is set to 0.1 (SHRINKAGE).

● I am using 60% of the data to train each of the decision trees.

● I am outputting the code file to score other data sets.

● I am creating data sets for the fit statistics, variable importance, and splitting rules.

The first output data set that we can look at is the fit statistics data set. This data set shows the fit metrics for each of the 500 created decision trees. I’ve shown the top 10 iterations for demonstration purposes. You can see that the Sum of Squared Error (SSE), the Average Squared Error (ASE), and the Real Average Squared Error (RASE) all decrease for each added iteration of the algorithm.

Output 2.19: Gradient Boosting Fit Statistics

The next output data set that we can inspect is the variable importance table. This table is constructed in the same manner as the variable importance table that we saw in the random forest output. This table contains the relative importance for each variable presented to the algorithm. I have presented the top 10 variables in Table 2.11.

Table 2.11: Gradient Boosting Variable Importance

We can see that room_type is the top-ranked variable for the gradient boosting model. The variables accommodates and zip are also highly predictive of the target variable. At this point, I think that we can declare room_type as the winner for the most predictive variable for Airbnb per diem price for New York City as of December 2018!

A final comparison of the calculated RMSE metric on the TRAIN and hold-out TEST data set shows that the gradient boosting algorithm generates the lowest RMSE score for all of our developed models.

Table 2.12: Model RMSE Table

This isn’t too much of a surprise. Once we discovered that the simple decision tree performed substantially better than the linear regression model, we knew that a non-parametric approach was superior to a parametric approach. The underlying data is non-linear.

The gradient boosting model generally produces the most accurate predictions of all the recursive binary splitting (decision tree) algorithms. This is because it reweights each decision tree based on the residuals of the previous tree. The algorithm focuses on fixing the big mistakes from the previous decision tree rather than on the observations where the previous tree already predicted values very close to the actual target.

Decision Time

Now that we have developed several different predictive models, we need to decide which model we would use to run our Airbnb price generator business. It is tempting to say that since the gradient boosting algorithm generated the lowest error metric that we should obviously use that model. It would give the customer the most accurate, and consequently, the most competitive and profitable price estimate. However, if we use this model, we really cannot provide the customer with any detailed information about the price estimate.

As a customer, you may want to know what factors contribute to the per diem rate. This holds especially true if you are in the business of property arbitrage. This is a business model where investors lease many properties with the goal of listing them on Airbnb (or other short-term property rental sites) in an attempt to collect rents above their monthly lease fees. Airbnb strongly discourages this behavior. But if this is your business model, it would be highly valuable for you to have detailed information about what factors contribute to per diem pricing. This way, you could acquire properties that have the features that maximize your profit.

A gradient boosting model would not provide that level of information because there is very little transparency to this model. We know which features are important, but we cannot assign a specific dollar value for each significant feature.

The random forest model also suffers from a lack of transparency. The estimates are highly accurate, but we cannot assign values at the feature level.

This leaves us with choosing between the linear regression model and the simple decision tree. The simple decision tree provides the customer with a clear logic path to a given prediction. They can see quite clearly how the estimate was generated and how each factor contributes to the final predicted value. This type of model is easy to understand, and it visibly demonstrates variable importance by showing which variables sit at the top of the tree. It also lets the customer see how important location is, because multi-level categorical variables such as zip and neighbourhood_cleansed can be included directly in the model. There is a strong case for using the simple decision tree as our modeling algorithm.

However, there are two main reasons that I would select the linear regression model:

1. Transparency – The business value that we are providing is information. This information is much more than the final predicted per diem rate. The linear regression model provides detailed information on the per-unit value for each feature in our model. I believe that is what our customers really want.

2. Accuracy – Although the linear regression model's RMSE score is higher than those of the tree-based models, it is not far off. When we transform the RMSE of the log-adjusted price target variable back into its standard form (see the sketch after Table 2.13), Table 2.13 shows that the average residual price for the linear regression model is $1.53 while the gradient boosting model's average residual price is $1.40. I am willing to give up $0.13 in accuracy in order to have full transparency.

Table 2.13: True Dollar Error Estimates
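
The back-transformation itself is just an exponentiation, which appears to be the calculation behind the dollar values in Table 2.13. A minimal sketch, assuming a small hypothetical data set MODEL_RMSE that holds one row per model with the log-scale RMSE in a column named RMSE_LOG:

DATA RMSE_DOLLARS;
    SET MODEL_RMSE;                 /* hypothetical: one row per model */
    RMSE_DOLLAR = EXP(RMSE_LOG);    /* move the error off the log(price) scale */
RUN;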

One thing to keep in mind is that these error metrics are based on the average error. We know that properties at the high and low ends of the price range will have larger residuals because they fall in the top and bottom deciles of the price distribution. The distribution of residuals for these properties has heavier tails for the linear regression model than it does for the gradient boosting model.

Outputs 2.20 and 2.21 below show the quantile tables and histograms for the linear regression model and the gradient boosting model for comparison. You can easily see that the gradient boosting model produces a tighter distribution of residuals. At nearly every point in the quantile table, the linear regression model has larger residual values.

Another way to quantitatively confirm that the gradient boosting model has a tighter distribution than the linear regression model is the kurtosis value found in the summary statistics output of PROC UNIVARIATE. Kurtosis measures how sharply peaked a distribution is; a higher value here indicates residuals that are more tightly concentrated around the center. The kurtosis value for the linear regression model is 1.14, while the kurtosis for the gradient boosting model is 2.21.
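
A minimal sketch of the kind of PROC UNIVARIATE call that produces the quantile tables, histograms, and kurtosis values; the residual data sets and the RESIDUAL column are hypothetical names rather than ones used earlier in the chapter:

PROC UNIVARIATE DATA=REG_RESIDUALS;    /* linear regression residuals */
    VAR RESIDUAL;
    HISTOGRAM RESIDUAL;
RUN;

PROC UNIVARIATE DATA=GB_RESIDUALS;     /* gradient boosting residuals */
    VAR RESIDUAL;
    HISTOGRAM RESIDUAL;
RUN;

The Moments table in each set of output reports the kurtosis, and the Quantiles table supplies the values shown in Outputs 2.20 and 2.21.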

Output 2.20: Linear Regression Quantiles

Output 2.21: Gradient Boosting Quantiles

This analysis has shown that the linear regression model performs worse than the gradient boosting model consistently across the entire range of predictions. The gradient boosting model provides better predictions at the high and low end of actual target values.

At this point, we have another decision to make. We could go back to the beginning of the analysis, stratify the data into three price ranges (high, medium, and low), and build separate models for each of these price ranges. These stratified models would likely produce better price predictions on the high and low ends, and possibly in the middle range as well. There is no guarantee that this approach would improve the estimates, but for the sake of due diligence (and my data OCD), I would consider developing the stratified models.

The other option is to accept the models that we have developed and put one into production. Often in the real world, a data scientist does not have the luxury of time to iterate through every possible modeling approach.

Implementation

Every implementation process is different because it depends on the purpose of the model and the environment of the deployment. For our example, we developed a linear regression model that will be used to estimate per diem property prices based on user input. This model will be deployed on our own website. The big advantage that we have here is that we have complete control over the deployment and do not have to rely on a corporate IT team to deploy the model for us.

We can take advantage of this deployment environment by incorporating user selection drop-down boxes that force the user to select a predetermined category. This grants us total control over the range of input values.

Because our linear regression model is parsimonious, we can easily hard code the scoring algorithm directly into the HTML code of our website. The bootstrapped version of our model provides us with the 25th percentile, the median, and the 75th percentile estimates. This gives the customer an acceptable range of values that they can use to price their property.

The customer has to select only the appropriate drop-down boxes and the algorithm will generate pricing estimates for each of their selections. The result would include a summary price estimate for a low, medium, and high pricing strategy as well as per unit dollar values for each selected feature.
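
As an illustration of what that hard-coded scoring logic might look like before it is translated into the website code, here is a sketch in the form of a DATA step. Every coefficient, variable, and value below is a made-up placeholder, not the fitted model:

DATA PRICE_ESTIMATE;
    /* Hypothetical user selections from the drop-down boxes */
    ACCOMMODATES = 4;
    ROOM_ENTIRE  = 1;    /* dummy: entire home/apt selected */

    /* Placeholder coefficients - the real values would come from the */
    /* bootstrapped regression (25th percentile, median, 75th)        */
    SCORE_LOW  = 3.9 + 0.08*ACCOMMODATES + 0.55*ROOM_ENTIRE;
    SCORE_MED  = 4.0 + 0.09*ACCOMMODATES + 0.60*ROOM_ENTIRE;
    SCORE_HIGH = 4.1 + 0.10*ACCOMMODATES + 0.65*ROOM_ENTIRE;

    /* Convert from the log(price) scale back to dollars */
    PRICE_LOW  = EXP(SCORE_LOW);
    PRICE_MED  = EXP(SCORE_MED);
    PRICE_HIGH = EXP(SCORE_HIGH);
RUN;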

Congratulations! You have just developed a data-driven online business. OK, now it's time to book the flight to Playa Del Carmen and get ready to sip piña coladas on the beach. But before you go, remember that there are a few other things that you need to do:

● Trend the pricing estimates across the calendar year with the information contained in the Calendar data set.

● Develop pricing models for each city.

● Develop the website with integrated scoring logic.

● Develop a program that monitors your predictions.

● Market your product to potential Airbnb property listers.

● Develop a program that monitors your predictive accuracy.

● Tune the modeling algorithm based on historical accuracy.

Chapter Review

The goal of this chapter was to give you a thorough review of all the steps involved in a data science project. Even though each of the steps was covered in detail, there will still be many unexpected obstacles and challenges to face when developing your own data science project. Although the general workflow may be relatively consistent across projects, the details and implementation will vary greatly.

It is very important to understand the underlying methodology and the reasoning behind choosing one analytical approach or modeling algorithm over another. The following chapters cover these issues in detail.
