By the end of this chapter, you will be able to handle, select, and normalize the features required for building a model.
We learned about the process of machine learning in Chapter 1, An Introduction to Machine Learning, and looked at the different ways to process data in Chapter 2, Data Cleaning and Pre-processing. In this chapter, we will delve deeper into the feature engineering process. Feature engineering is the process of selecting the attributes in our dataset that are related to the target field, using techniques such as correlation analysis and Principal Component Analysis (PCA). During this process, new features can also be generated that are meaningful and add information to our dataset. In addition, we can compute summary statistics of existing numeric fields and use them as features.
In this chapter, we will learn how to create features for date variables, time series data, strings, and numeric variables, and explore text features. Furthermore, we will look at how to add new features to an R data frame. We will identify and handle redundant features appropriately. Correlation analysis and PCA will be used to select the required features. The features will also be ranked using several techniques, such as learning vector quantization and random forests.
Figure 3.1 shows a typical feature engineering process, where features are extracted from the raw data before the model building process. In Figure 3.1, N features are extracted for the model.
The datasets being used are shown in the following table:
In the next section, we will discuss the domain-specific features in detail.
We have two types of features:
Features can be extracted from existing features. For instance, when we consider a date variable, we can extract the year from the full date. It is essential to extract features that suit each of these datatypes.
Imagine that you have a dataset containing information such as dates, months, and years in a non-numerical format; for example, 31/05/2019. We cannot feed this information to a machine learning algorithm as it is, because such algorithms will not understand date-type values. Thus, converting dates and times into a machine-readable format is an important skill for a machine learning engineer.
We can extract the year, month, day of the month, quarter, week, day of the week, the difference between two dates, and the hour, minute, and season. We can also find out whether the given date is a weekend or not, whether the time falls between business hours or not, whether it is a public holiday, and whether the year is a leap year. In the next exercise, we will extract the year, month, day, and weekday of the present time.
In this exercise, we will use a date value in R and extract the year, month, day, and weekday using the as.POSIXlt() function.
#Fetch the current date
current_date <- Sys.time()
current_date
The output is as follows:
## [1] "2019-03-18 00:28:09 IST"
The Sys.time() function returns the time at the moment the command is executed.
The output for Exercises 1 and 2 will depend on the current_date variable that is shown in the preceding code. The year, month, date, minutes, and seconds will be different each time.
# print the date
formatted_date <- as.POSIXlt(current_date)
formatted_date
The output is as follows:
## [1] "2019-03-18 00:28:09 IST"
We will be making use of the POSIXlt class, which is a subclass of the POSIXt class. POSIXlt represents local time and contains the year, month of the year, day of the month, hours, minutes, seconds, day of the week, day of the year, and a daylight saving time indicator.
#Fetch the year
year <- format(formatted_date, "%Y")
year
The output is as follows:
## [1] "2019"
The format() function takes the date and the section of date that is required and returns the specified value.
#Fetch the month
month <- format(formatted_date, "%m")
month
The output is as follows:
## [1] "03"
#Fetch the date
day <- format(formatted_date, "%d")
day
The output is as follows:
## [1] "18"
As can be seen from the output, it is the 18th day of the month.
#Fetch the day of week
weekday <- format(formatted_date, "%w")
weekday
The output is as follows:
## [1] "1"
The output is 1, which corresponds to Monday; the %w format returns the day of the week as a number from 0 to 6, with Sunday as 0.
Thus, we have used the built-in functions to find the current time, date, day, and day of the week.
The values of the variables are also displayed in the Environment tab of RStudio.
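The other date features listed earlier, such as the quarter, the week of the year, the difference between two dates, and weekend or leap-year indicators, can be derived in a similar way. The following is a minimal sketch using base R; the second date used for the difference is only an illustrative value:
#Quarter of the year
quarters(formatted_date)
#Week of the year (ISO week)
format(formatted_date, "%V")
#Difference between two dates, in days
as.numeric(difftime(as.Date("2019-06-15"), as.Date(formatted_date), units = "days"))
#Is the date a weekend?
weekdays(formatted_date) %in% c("Saturday", "Sunday")
#Is the year a leap year?
year_num <- as.numeric(format(formatted_date, "%Y"))
(year_num %% 4 == 0 & year_num %% 100 != 0) | (year_num %% 400 == 0)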
In the next exercise, we will extract the time and date. This is important for when we want to use time information in our features.
In this exercise, we will use the time feature in R and extract the hour and minute using the lubridate library.
install.packages("lubridate")
library(lubridate)
The lubridate package helps fetch time features. It can be installed using the install.packages("lubridate") command. The methods in the lubridate package, such as hour() and minute(), are simple to use.
#Hour of Day
hour <- hour(formatted_date)
hour
The output is as follows:
## [1] 0
#Extract Minute
min <- minute(formatted_date)
min
The output is as follows:
## [1] 28
Thus, we have used the lubridate package to find the hour and minute from a given time.
Time series data is a special type of data where some quantity is measured over time, so it contains values along with timestamps. An example would be stock prices and market forecasting, where the stock name, stock value, and time together form the time series data.
The following figure presents some time series features:
The time series features are as follows:
In this chapter, we will cover frequency domain features. In the next exercise, we will learn about binning, one of the frequency domain features.
In this exercise, we will look at binning. We will perform binning on the age data from the PimaIndiansDiabetes dataset. Binning helps in visualizing the data.
library(caret)
library(mlbench)
#Install caret if not installed
#install.packages('caret')
data(PimaIndiansDiabetes)
age <- PimaIndiansDiabetes$age
summary(age)
The output is as follows:
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.00 24.00 29.00 33.24 41.00 81.00
#Creating Bins
# set up boundaries for intervals/bins
breaks <- c(0,10,20,30,40,50,60,70,80)
# specify interval/bin labels
labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")
bins <- cut(age, breaks, include.lowest = T, right=FALSE, labels=labels)
summary(bins)
The output is as follows:
<10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 NA's
0 0 396 165 118 57 29 2 1
plot(bins, main="Binning for Age", ylab="Total Count of People",col="bisque",xlab="Age",ylim=c(0,450))
The output is as follows:
Most of the age values fall between 20 and 30. Binning has helped us categorize the continuous values and derive insights from them.
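The bins can also be attached to the data frame as a new categorical feature; a small sketch:
#Add the bins as a new categorical feature
PimaIndiansDiabetes$age_bin <- bins
table(PimaIndiansDiabetes$age_bin)
In the following activity, we will be dealing with the GermanCredit dataset and creating bins.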
In this activity, we will create bins for a continuous numeric field called Duration (this is the duration of credit for the customer) in the GermanCredit dataset. Often, we have lots of continuous data values; these values are binned to understand the Data column better. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.
These are the steps that will help you solve the activity:
Once you complete the activity, you should obtain the following output:
The solution for this activity can be found on page 331.
The following are some of the summary statistics that can be derived for numeric features:
In the following exercise, we will derive descriptive statistics for the features in the GermanCredit dataset.
In this exercise, we will calculate the mean, standard deviation, minimum, maximum, and skewness of the dataset. These numeric features can be calculated as follows:
library(caret)
data(GermanCredit)
#See the structure of the dataset
str(GermanCredit)
The output is as follows:
From the structure, we can identify the numeric fields.
#Calculate mean
mean <- mean(GermanCredit$Amount)
mean
The output is as follows:
[1] 3271.258
#Calculate standard deviation
standard_dev <- sd(GermanCredit$Amount)
standard_dev
The output is as follows:
[1] 2822.737
#Calculate median
median <- median(GermanCredit$Amount)
median
The output is as follows:
[1] 2319.5
#Identify maximum
max <- max(GermanCredit$Amount)
max
The output is as follows:
[1] 18424
#Identify minimum
min <- min(GermanCredit$Amount)
min
The output is as follows:
[1] 250
library(e1071) # load e1071 for the skewness() function
skewness <- skewness(GermanCredit$Amount)
skewness
The output is as follows:
[1] 1.943783
In this exercise, we have derived summary statistics that describe the features. In the next section, we will cover the standardization technique.
Standardization contains two steps:
At times, features have to be scaled to lie within the same range. For instance, Age and Income will have very different ranges of values; they can be scaled to [0, 1] or to a standard range such as [-1, 1].
The steps to rescale a value x from an original range [min, max] to [0, 1] are as follows:
Numerator = x - min
Denominator = max - min
Rescaled value = Numerator / Denominator
For example, to rescale values that lie in the range [-1, 1], the numerator is x - (-1) and the denominator is 1 - (-1) = 2.
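A minimal sketch of both steps, using an illustrative numeric vector:
#An illustrative numeric vector
x <- c(-1, -0.5, 0, 0.25, 1)
#Standardization: centre to mean 0 and scale to standard deviation 1
scale(x)
#Min-max rescaling to [0, 1]: numerator = x - min(x), denominator = max(x) - min(x)
(x - min(x)) / (max(x) - min(x))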
Categorical variables are a list of string values or numeric values for an attribute. For instance, gender can be "Male" or "Female". There are two types of categories: nominal and ordinal. In nominal categorical data, there is no ordering among the values in that attribute. This is the case with gender values. Ordinal categories have some order within the set of values. For instance, for temperature "Low," "Medium," and "High" have an order.
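In R, both types can be represented as factors; a small illustrative example:
#Nominal: no ordering among the values
gender <- factor(c("Male", "Female", "Female"))
#Ordinal: Low < Medium < High
temperature <- factor(c("Low", "High", "Medium"),
                      levels = c("Low", "Medium", "High"), ordered = TRUE)
temperature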
Many modeling techniques require data to have a normal distribution, so we transform data to a normal distribution wherever possible. Data is considered highly skewed if the skewness value is less than -1 or greater than 1.
Skewness denotes the alignment of the values in the specified column. A negative skewness value means that the data is skewed to the left and a positive skewness value means that the data is skewed to the right. We would not want the data to be skewed for the model. So, we will often try to reduce the skewness in the data.
In this exercise, we will find the skewness of the V4 column of the Sonar dataset. The Sonar dataset contains patterns of sonar signals bounced off mines and rocks. The columns contain the pattern information. The "M" label indicates a mine and the "R" label indicates a rock.
library(mlbench)
library(lattice)
library(caret)
library(e1071)
data(Sonar)
skewness(Sonar$V4)
The skewness is 0.5646697.
histogram(Sonar$V4,xlab="V4")
The histogram is as follows:
The histogram denotes skewness. The positive skewness value means that the graph is skewed to the right, as you can see in the preceding plot.
In this activity, we will identify the skewness of the glucose column in the PimaIndiansDiabetes dataset. We will then compare it with the skewness of the age column. The dataset is available in the mlbench package.
These are the steps that will help you solve the activity:
The solution for this activity can be found on page 334.
When a continuous variable has a skewed distribution, we can log-transform it to reduce the skewness. This will make the distribution closer to normal. The log() function is used to log-transform the values.
In this exercise, we will reduce the skewness of the data using a log transformation.
#Log Transformation
library(mlbench)
library(lattice)
data(PimaIndiansDiabetes)
transformed_data <- log(PimaIndiansDiabetes$age)
#View histogram
histogram(transformed_data)
The output is as follows:
As we can see, the data distribution looks much better; it now follows a more normal distribution.
This transformation converts a skewed distribution into one that is closer to a normal distribution.
These are features that are derived from data that requires an understanding of the business domain.
Let's imagine a dataset that contains data for the sale prices of houses in different areas of a city and that our goal is to predict the future price of any house. For this dataset, the input fields are area code, size of the house, floor number, type of house (individual/apartment), age of the property, renovated status, and so on, along with the sale price of the house. The derived features in this scenario are as follows:
Another example of deriving domain-specific features would be deriving a person's age from their birth date and the current date in a dataset containing information about people.
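A small sketch of this derivation, using illustrative column names and values:
#Illustrative data containing birth dates
people <- data.frame(name = c("A", "B"),
                     birth_date = as.Date(c("1985-04-12", "1992-11-03")))
#Derive the age in years from the birth date and the current date
people$age <- floor(as.numeric(difftime(Sys.Date(), people$birth_date,
                                        units = "days")) / 365.25)
people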
We will look at the code to add new columns to an R data frame. A new column may be a new feature or a copy of an existing column. We'll look at an example in the following exercise. Adding new features can help in improving the efficiency of a model.
In this exercise, we will add columns to an existing R data frame. These new columns can be dummy values or copies of other columns.
#Adding new features to an R data frame
library(caret)
data(GermanCredit)
#Assign the value to the new field
GermanCredit$NewField1 <- 1
str(GermanCredit)
The output is as follows:
#Copy an existing column into a new column
GermanCredit$NewField2 <- GermanCredit$Purpose.Repairs
str(GermanCredit)
The output is as follows:
We have added two new features to the dataset.
Redundant features are those that are highly correlated with each other; they contain similar information with respect to the output variable. We can remove such features by computing the correlation coefficients between features.
In this exercise, we will find redundant features, select any one among them, and remove them.
#Loading the library
library(caret)
# load the German Credit Data
data(GermanCredit)
# calculating the correlation matrix
correlationMatrix <- cor(GermanCredit[,1:9])
# printing the correlation matrix
print(correlationMatrix)
The output is as follows:
# finding the attributes that are highly correlated
filterCorrelation <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated fields
print(filterCorrelation)
The output is as follows:
[1] 2
print(correlationMatrix)
The correlation matrix is as follows:
The output is the index of the highly correlated field; here, index 2 corresponds to Amount. If two fields are highly correlated, we can remove one of them.
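A minimal sketch of dropping the flagged column(s) by index; the name of the filtered data frame is only illustrative:
#Drop the highly correlated column(s) returned by findCorrelation()
GermanCredit_filtered <- GermanCredit[,1:9][, -filterCorrelation]
str(GermanCredit_filtered)
Now that we have covered redundant features, we will move on to text features.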
Text features are generated for purely textual content, such as data containing user blogs or user feedback regarding a product on a web page. The following are some text features:
Automated feature engineering is a process where generic features are calculated for a field by a pre-defined package. For text features, the package used is textfeatures. This R package generates the common text features used to train a machine learning model with textual data. This process is important because it saves us the time it takes to implement numerous text features.
Using the textfeatures package requires R version 3.1 or above.
In this exercise, we will use the textfeatures package to find text features.
install.packages("itunesr")
install.packages("textfeatures")
install.packages("tidyverse")
library(itunesr)
library(textfeatures)
library(tidyverse)
## the text is a review of a product
text_data <- c(
  "This product was delivered very fast",
  "IT'S A GREAT DAY TODAY!",
  "The product works very efficiently",
  "The product saves us a lot of time",
  "The seller arranged a timely delivery"
)
## get the text features of a sample character vector
textfeatures(text_data)
The output is as follows:
## data frame with a character vector named "text"
df <- data.frame(
id = c(1, 2, 3),
text = c("this is A! sEntence https://github.com about #rstats @github",
"and another sentence here",
"The following list: - one - two - three Okay!?!"),
stringsAsFactors = FALSE
)
## Generate the text features
features <- textfeatures(df)
#print the text features
glimpse(features)
The output is as follows:
The output shows the features generated for the text data that we provided. Three of the features are explained below:
In the next section, we will discuss in detail the various feature selection approaches.
There are two types of feature selection techniques: forward selection and backward selection.
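As a hedged sketch (not part of the original exercises), forward and backward selection can be performed with the step() function from the stats package on a logistic regression model; the use of the first ten columns of GermanCredit here is only illustrative:
library(caret)
data(GermanCredit)
#Full model with all nine input features, and the intercept-only (null) model
full_model <- glm(Class~., data = GermanCredit[,1:10], family = "binomial")
null_model <- glm(Class~1, data = GermanCredit[,1:10], family = "binomial")
#Backward selection: start from the full model and drop features step by step
backward <- step(full_model, direction = "backward", trace = 0)
#Forward selection: start from the null model and add features step by step
forward <- step(null_model, direction = "forward",
                scope = formula(full_model), trace = 0)
formula(backward)
formula(forward)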
In the selection of features, it is useful to find the correlation between the values. In the next section, we will look at correlation analysis, which helps us determine the correlation between two values.
The correlation between two variables plays an important part in feature selection. If two features are correlated with each other and they are linearly dependent on each other, then one of the features can be dropped as it has the same relationship with the output variable as the other. The linear dependency can be in the form of positive correlation or negative correlation. A positive correlation between fields x and y means that as x increases, y also increases. A negative correlation between x and y means that as x increases, y decreases.
In this exercise, we will plot the correlation between two variables.
library(mlbench)
data(PimaIndiansDiabetes)
#Correlation Analysis between glucose and pressure
plot(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$pressure, col="red", xlab = "Glucose", ylab = "Pressure", pch=16, main = "Pearson Correlation")
The correlation between glucose and pressure is 0.544, which means there is a moderate positive correlation.
data(Sonar)
plot(Sonar$V4, Sonar$V3, col="red", xlab = "V4",
     ylab = "V3", pch=16, main = "Pearson Correlation")
The output is as follows:
The correlation value between V3 and V4 is 0.78, which indicates a strong positive correlation: as V3 increases, V4 also increases. Since they are strongly correlated and both are input fields, we can drop one and retain the other.
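As a quick check, the correlation coefficients quoted above can be computed directly with cor():
#Correlation between glucose and pressure
cor(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$pressure)
#Correlation between V3 and V4
cor(Sonar$V3, Sonar$V4)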
When we want to know whether a feature is correlated with the output variable in the real world, it is not enough to calculate the correlation coefficient for the two variables in the dataset, as these might not be representative of the real world. We also need to account for things such as the size of our dataset and the probability of the variables being correlated in our dataset by chance.
The p-value is the probability (0-1) that, given that the null hypothesis is true (that there is no correlation), we would see a correlation coefficient with the same magnitude as the one in our dataset, or higher, simply due to the random selection of observations from the target population (the real world). If the p-value is below a threshold, such as 0.05, we say that the correlation is significant. Note that significant does not mean important in this context. It instead means that we have strong evidence against the null hypothesis, meaning we can reasonably reject the null hypothesis. The p-value does not address the probability of the alternative hypothesis (that there is a correlation) directly, but by rejecting the null hypothesis, the alternative hypothesis becomes more viable. Importantly, a p-value greater than 0.05 does not mean that the two variables are not correlated in the real world, as we might just have too few datapoints to say so.
P-values are highly debated in many scientific fields. They are commonly misunderstood and misused, and it is recommended that you read up on them if you will be relying on them in your work.
It is possible for a feature to be useful in a model without being significantly correlated to the output variable. We therefore decide whether or not we should include it in the model based on whether it makes the model better at predicting the output variable in a test set. We will do this in the Recursive Feature Elimination section.
The cor.test() function in R is used to calculate the Pearson's product moment correlation coefficient between two variables. It outputs a correlation estimate and a p-value.
In this exercise, we are finding the p-value for a correlation coefficient between two variables.
library(caret)
#Calculating P-Value
component <- cor.test(Sonar$V4, Sonar$V3)
print(component)
The output is as follows:
As mentioned, cor.test() calculates the Pearson's product moment correlation coefficient between any two fields; these can be two input fields, or one input field and one output field.
#Print the P value
component$p.value
The output is as follows:
[1] 3.918396e-44
#Print the Correlation
component$estimate
The output is as follows:
cor
0.781786
The p-value of 3.918396e-44 gives us strong evidence against the null hypothesis, so we can trust the correlation coefficient of 0.781786. With such a high correlation, it is likely a good feature to include in our model.
Recursive Feature Elimination (RFE) is a recursive feature selection method available in R's caret package. It trains models on subsets of the features of different sizes and drops the least important features accordingly. The algorithms that can be used include linear regression, random forests, naïve Bayes, and bagged trees. The method builds models with subsets of the columns and prints the best subset size as output.
In the following example, nine features from the GermanCredit dataset have been provided as input to the function. RFE selects the top five features that are important.
In this exercise, we will be eliminating features using the recursive feature elimination technique. The top features will be selected for training the model.
set.seed(7)
install.packages("e1071")
install.packages("randomForest")
library(e1071)
library(randomForest)
# Attach the packages
library(mlbench)
library(caret)
# load the German Credit Data
data("GermanCredit")
# Use random forest as the method
method_fn <- rfeControl(functions=rfFuncs, method="cv", number=9)
The rfeControl() function in R creates a control object through which we can specify the functions used for model fitting and prediction, the resampling method, and the number of folds or iterations.
# run the Recursive Feature Elimination algorithm
output <- rfe(GermanCredit[,1:9], GermanCredit[,10], sizes=c(1:9), rfeControl=method_fn)
The rfe() function in R performs recursive feature elimination.
# print the output
print(output)
The output is as follows:
Recursive feature selection
Outer resampling method: Cross-Validated (9 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.7000 0.07993 0.01450 0.05992
2 0.6841 0.11776 0.03479 0.09647
3 0.6900 0.08827 0.03138 0.08838
4 0.6781 0.13250 0.03428 0.10378
5 0.7130 0.20887 0.03098 0.09935
6 0.7271 0.23200 0.03680 0.11610
7 0.7280 0.21180 0.02549 0.09123
8 0.7230 0.20408 0.02081 0.08139
9 0.7281 0.23882 0.03347 0.10788 *
The top 5 variables (out of 9):
Duration, Amount, Age, NumberPeopleMaintenance, Telephone
predictors(output)
The output is as follows:
[1] "Duration" "Amount" "Age" "NumberPeopleMaintenance"
[5] "Telephone" "InstallmentRatePercentage" "ResidenceDuration" "NumberExistingCredits"
[9] "ForeignWorker"
The preceding plot shows the accuracy values for each number of variables included in the model. For instance, with all nine variables, the accuracy is approximately 0.73.
We visited PCA in Chapter 2, Data Cleaning and Pre-Processing, where we used it for pre-processing. In this chapter, we will delve into the details of this technique. PCA reduces the dimensionality of our data. For instance, if our data contains 20 columns, we can reduce it to 5 key fields that are generated from the 20 columns and that then represent the data. PCA forms linear combinations of the fields such that the generated components are not correlated with each other and capture the maximum variance.
If our data has many dimensions, such as a large number of fields, then this technique can help to reduce the dimensions by generating the principal components. These components will represent most of the information from our high dimensional dataset. To perform PCA, there is first a check for correlation between all the fields, and then the important fields are chosen, and a linear combination of those fields is created to represent all the information from the fields. In this way, PCA helps to perform feature selection and is also known as a dimensionality reduction technique. PCA can even be applied to unlabeled data.
In this exercise, we will be using PCA to find the principal components in the PimaIndiansDiabetes dataset.
#PCA Analysis
library(mlbench)
data(PimaIndiansDiabetes)
#Use the first eight columns, which contain the input features
PimaIndiansDiabetes_subset <- PimaIndiansDiabetes[,1:8]
#Find out the Principal components
principal_components <- prcomp(x = PimaIndiansDiabetes_subset, scale. = T)
The prcomp() function performs PCA for the data in R.
#Print the principal components
print(principal_components)
The output is as follows:
The principal components are PC1, PC2, ..., PC8, in order of importance. These components are calculated from multiple fields and can be used as features in their own right.
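To decide how many components to keep, we can inspect the proportion of variance explained by each component; a small sketch:
#Proportion of variance explained by each principal component
summary(principal_components)
#Scree plot of the component variances
screeplot(principal_components, type = "lines")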
In this activity, we will use the GermanCredit dataset and find the principal components. These values can be used instead of the features. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.
These are the steps that will help you solve the activity:
The PCA values will look as follows:
The solution for this activity can be found on page 337.
While building certain models such as decision trees and random forests, the features that are important to the model (for instance, the features that have good correlation with the output variable) are known. These features are then ranked by the model. We will look at a few examples of ranking features automatically using machine learning models.
Learning Vector Quantization (LVQ) can be used to rank features based on their importance. LVQ and the variable importance function, varImp(), will be used to fetch the important variables. The GermanCredit dataset is used to demonstrate LVQ. For simplicity's sake, we are choosing the first ten columns of the GermanCredit dataset; the 10th column contains the class values to be predicted.
In this exercise, we will implement LVQ for the GermanCredit dataset and use the variable importance function to list the importance of the fields in this dataset.
set.seed(9)
# loading the libraries
library(mlbench)
library(caret)
# load the German Credit dataset
data("GermanCredit")
#Setting parameters for training
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# training the model
model <- train(Class~., data=GermanCredit[,1:10], method="lvq", preProcess="scale", trControl=control)
# Getting the variable importance
importance <- varImp(model, scale=FALSE)
# print the variable importance
print(importance)
# plot the result
plot(importance)
The output is as follows:
print(importance)
The output is as follows:
ROC curve variable importance
Importance
Duration 0.6286
Age 0.5706
Amount 0.5549
InstallmentRatePercentage 0.5434
NumberExistingCredits 0.5251
Telephone 0.5195
ForeignWorker 0.5169
ResidenceDuration 0.5015
NumberPeopleMaintenance 0.5012
An importance score has been assigned to each variable. The Duration, Age, and Amount variables are the most important variables. The least important variables are ResidenceDuration and NumberPeopleMaintenance.
When using random forests to determine variable importance, multiple trees are trained. After creating the forest, the model reports the importance of the variables used in the data. A tree-based model takes non-linear relationships into consideration. The features used in the splits are highly relevant to the output variable. We should also make sure that we avoid overfitting; therefore, the depth of the trees should be kept small.
In this exercise, we will find variable importance in the PimaIndiansDiabetes dataset using random forests.
library(mlbench)
library(caret)
library(randomForest)
data(PimaIndiansDiabetes)
random_forest <- randomForest(diabetes~., data= PimaIndiansDiabetes)
# Create an importance based on mean decreasing gini
importance(random_forest)
The output is as follows:
MeanDecreaseGini
pregnant 28.60846
glucose 88.03126
pressure 29.83910
triceps 23.92739
insulin 25.89228
mass 59.12225
pedigree 42.86284
age 48.09455
varImp(random_forest)
The importance of the variables is as follows:
Overall
pregnant 28.60846
glucose 88.03126
pressure 29.83910
triceps 23.92739
insulin 25.89228
mass 59.12225
pedigree 42.86284
age 48.09455
The features that are important have higher scores and those features can be selected for the model.
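The same scores can also be visualized with the randomForest package's built-in plot; a small sketch:
#Plot the variable importance from the random forest model
varImpPlot(random_forest, main = "Variable Importance")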
In this activity, we will use the GermanCredit dataset and perform a random forest approach on the dataset to find the features with the highest and lowest importance. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.
These are the steps that will help you solve the activity:
The expected output of variable importance will be as follows:
Overall
Duration 70.380265
Amount 121.458790
InstallmentRatePercentage 27.048517
ResidenceDuration 30.409254
Age 86.476017
NumberExistingCredits 18.746057
NumberPeopleMaintenance 12.026969
Telephone 15.581802
ForeignWorker 2.888387
The solution for this activity can be found on page 338.
We have seen the importance scores provided by random forests; in this exercise, a logistic regression model is trained on the data to identify variable importance. We will use varImp() to show the relative importance of the columns. This model will provide us with the importance of the fields in a dataset.
In this exercise, we will implement a logistic regression model.
# Load the required libraries and data
library(caret)
library(mlbench)
data(GermanCredit)
GermanCredit_subset <- GermanCredit[,1:10]
data_lm <- as.data.frame(GermanCredit_subset)
# Fit a logistic regression model
log_reg <- glm(Class~., data = GermanCredit_subset, family = "binomial")
# Using varImp() function
varImp(log_reg)
The output is as follows:
Overall
Duration 3.0412079
Amount 2.7164175
InstallmentRatePercentage 2.9227186
ResidenceDuration 0.6339908
Age 2.7370544
NumberExistingCredits 1.1394251
NumberPeopleMaintenance 0.6952838
Telephone 2.5708235
ForeignWorker 1.9652732
After building a logistic regression model, the importance of each variable is given a score. The higher the score is, the more important the variable is. The variables that are most important can be used for the model.
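For a glm, the scores returned by varImp() typically correspond to the absolute value of each coefficient's z-statistic; this is an assumption that can be checked against the model summary:
#Compare the varImp() scores with the absolute z-statistics from summary()
abs(summary(log_reg)$coefficients[-1, "z value"])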
Using varImp(), we can list the features with their importance. rpart stands for Recursive Partitioning and Regression Trees. This package contains an implementation of a tree algorithm in R, specifically known as Classification and Regression Trees (CART). In the following exercise, we will be using the rpart package in R.
In this exercise, we will be finding the variable importance using rpart. Finding the importance of variables helps to select the correct variables.
install.packages("rpart")
install.packages("randomForest")
set.seed(10)
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
PimaIndiansDiabetes_subset <- PimaIndiansDiabetes[,1:9]
PimaIndiansDiabetes_subset
#Train a rpart model
rPartMod <- train(diabetes ~ ., data=PimaIndiansDiabetes_subset, method="rpart")
#Find variable importance
rpartImp <- varImp(rPartMod)
#Print variable importance
print(rpartImp)
The output is as follows:
rpart variable importance
Overall
glucose 100.000
mass 65.542
age 52.685
pregnant 30.245
insulin 16.973
pedigree 7.522
triceps 0.000
pressure 0.000
#Plot top 5 variable importance
plot(rpartImp, top = 5, main='Variable Importance')
The plot is as follows:
From the preceding plot, it can be noted that glucose, mass, and age are most important for the output, and should therefore be included for the modeling. In the next activity, we will be selecting features using variable importance.
In this activity, we will use the GermanCredit dataset and find the variable importance using rpart. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.
These are the steps that will help you solve the activity:
The expected output of variable importance will be as follows:
The solution for this activity can be found on page 339.
Here is a table that summarizes the techniques we have looked at and the features we can select using these techniques:
Thus, we can see that most methods suggest the Duration, Amount, and Age as the features.
In this chapter, we have learned about the different types of features that are generated to train a model. We have derived domain-specific features and datatype-specific features. Also, we explored an automated technique for generating text features. The feature engineering process is essential for obtaining the best model performance. We delved into two variable transformation techniques and learned about techniques to identify redundant features and handle them in a dataset.
We have learned about forward and backward feature selection approaches and have performed correlation analysis through detailed examples. We implemented the calculation of p-values in R and looked at their role in feature selection. Recursive feature elimination is another method we used to find the best combination of features for a model. We delved into a dimensionality reduction approach, known as PCA, that drastically reduces the number of features needed, as it calculates the principal components of a dataset.
We explored several techniques for ranking the features in R. LVQ and random forests were implemented in R to observe the ranking for the features in the GermanCredit dataset. We also learned how to use the variable importance function in R to list the importance of all the variables in a dataset.
In the next chapter, we will use neural networks to solve classification problems. Along with this, we will also evaluate the models using cross-validation.