Chapter 3

Feature Engineering

Learning Objectives

By the end of this chapter, you will be able to:

  • Interpret date, time series, domain-specific, and datatype-specific data in R.
  • Perform numeric and string operations in R.
  • Handle categorical variables.
  • Generate automated text features in R.
  • Identify and add features to an R data frame.
  • Implement feature selection using correlation analysis, PCA, and RFE approaches.

In this chapter, we will be handling, selecting, and normalizing features required for building a model.

Introduction

We learned about the process of machine learning in Chapter 1, An Introduction to Machine Learning, and looked at the different ways to process data in Chapter 2, Data Cleaning and Pre-processing. In this chapter, we will delve deep into the feature engineering process. Feature engineering is a process in which we select the attributes that are related to the target field in our dataset. The selection is made using techniques such as correlation analysis and Principal Component Analysis (PCA). During this process, new features can also be generated that are meaningful and add information to our dataset. In addition to this, we can generate summary statistics of existing numeric fields as features, as they capture useful information about those fields or attributes.

In this chapter, we will learn how to create features for date variables, time series data, strings, and numeric variables, and explore text features. Furthermore, we will look at adding new features to an R data frame. We will identify and handle redundant features appropriately. Correlation analysis and PCA will be used to select the required features. The features will be ranked using several techniques, such as learning vector quantization and random forests.

Figure 3.1: The feature engineering process

Figure 3.1 denotes a typical feature engineering process, where the extraction of features from raw data is performed before the model building process. In Figure 3.1, N features are extracted for the model.

The datasets being used are shown in the following table:

Figure 3.2: The description and output variable fields for datasets

In the next section, we will discuss the types of features in detail.

Types of Features

We have two types of features:

  • Generic features, or datatype-specific features: These are features based on the datatype of the field.
  • Domain-specific features: These are features that are dependent upon the domain of the data. Here, we derive some features from the data based on our business knowledge or the domain.

Datatype-Based Features

New features can be extracted from existing ones. For instance, when we consider a date variable, we can extract the year from the full date. The features worth extracting depend on the datatype of the field.

Date and Time Features

Imagine that you have a dataset containing information such as dates, months, and years in a non-numerical format; for example, 31/05/2019. We cannot feed this information to a machine learning algorithm, as such algorithms will not understand date-type values. Thus, converting date and time into machine-readable data format is an important skill for a machine learning engineer.

We can extract the year, month, day of the month, quarter, week, day of the week, the difference between two dates, and the hour, minute, and season. We can also find out whether the given date is a weekend or not, whether the time falls between business hours or not, whether it is a public holiday, and whether the year is a leap year. In the next exercise, we will extract the year, month, day, and weekday of the present time.
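
As a quick illustration (not part of the exercise that follows), here is a minimal sketch of two such derived features, a weekend flag and a leap-year flag; the sample date is an arbitrary value chosen for the example.

    # A sample date chosen only for illustration
    sample_date <- as.POSIXlt("2019-05-31 10:15:00")

    # %u returns the weekday as 1 (Monday) to 7 (Sunday)
    is_weekend <- format(sample_date, "%u") %in% c("6", "7")
    is_weekend

    # A year is a leap year if it is divisible by 4, except century years
    # that are not divisible by 400
    year <- as.numeric(format(sample_date, "%Y"))
    is_leap_year <- (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
    is_leap_year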

Exercise 23: Creating Date Features

In this exercise, we will work with dates in R and extract the year, month, day, and weekday using the as.POSIXlt() function.

  1. Fetch the current date using the following command:

    #Fetch the current date

    current_date <- Sys.time()

    current_date

    The output is as follows:

    ## [1] "2019-03-18 00:28:09 IST"

    The Sys.time() function returns the time at the moment the command is executed.

    Note

    The output for Exercises 23 and 24 will depend on the current_date value shown in the preceding code. The year, month, date, minutes, and seconds will be different each time.

  2. Use the as.POSIXlt() function to convert the date to local time:

    # print the date

    formatted_date <- as.POSIXlt(current_date)

    formatted_date

    The output is as follows:

    ## [1] "2019-03-18 00:28:09 IST"

    We will be making use of the POSIXlt class, which is a subclass of the POSIXt class. POSIXlt represents the local time and contains the year, month of the year, day of the month, hours, minutes, seconds, day of the week, day of the year, and a daylight savings indicator.

  3. Fetch the year using the following command:

    #Fetch the year

    year <- format(formatted_date, "%Y")

    year

    The output is as follows:

    ## [1] "2019"

    The format() function takes the date and the section of date that is required and returns the specified value.

  4. Fetch the month using the following command:

    #Fetch the month

    month <- format(formatted_date, "%m")

    month

    The output is as follows:

    ## [1] "03"

  5. Fetch the date using the following command:

    #Fetch the date

    day <- format(formatted_date, "%d")

    day

    The output is as follows:

    ## [1] "18"

    As can be seen from the output, it is the 18th day of the month.

  6. Fetch the day of the week using the following command:

    #Fetch the day of week

    weekday <- format(formatted_date, "%w")

    weekday

    The output is as follows:

    ## [1] "1"

    With %w, Sunday is day 0, so the output 1 corresponds to Monday.

    Thus, we have used the built-in functions to find the current time, date, day, and day of the week.

    Note

    The values of the variables are also displayed in the Environment tab of RStudio.

In the next exercise, we will extract the time and date. This is important for when we want to use time information in our features.

Exercise 24: Creating Time Features

In this exercise, we will use the time feature in R and extract the hour and minute using the lubridate library.

  1. Install and attach the lubridate package:

    install.packages("lubridate")

    library(lubridate)

    The lubridate package helps fetch time features. It can be installed using the install.packages("lubridate") command. The methods in the lubridate package, such as hour() and minute(), are simple to use.

  2. Fetch the hour using the following command:

    #Hour of Day

    hour <- hour(formatted_date)

    hour

    The output is as follows:

    ## [1] 0

  3. Fetch the minutes using the following command:

    #Extract Minute

    min <- minute(formatted_date)

    min

    The output is as follows:

    ## [1] 28

Thus, we have used the lubridate package to find the hour and minute from a given time.

Time Series Features

Time series data is a special type of data where some quantity is measured over time, so each data value comes with a timestamp. An example would be stock prices and market forecasting, where we would have a stock name, stock value, and timestamp as the time series data.

The following figure presents some time series features:

Figure 3.3: Time series features

The time series features are as follows:

  1. Lag features: Using the lag feature of time series data, we can shift the time series values by a specific number of steps.
  2. Difference in value of timestamps: In time series data, it is often useful to derive the difference between consecutive timestamps.
  3. Window Features: Window features are features computed over a fixed interval of time (window). They can be measured using the change over the time window, the growth of the measured value with time, or the average over the time window.
  4. Power/Energy: This feature denotes the power consumption over a fixed time.
  5. Frequency Domain Features: These features summarize the data by creating bins and then finding the peak, mean, standard deviation, minimum, and maximum for each bin.
  6. K peaks: For continuous data, this refers to picking the K data points that have the highest values.

In this chapter, we will cover frequency domain features. A short sketch of lag, difference, and window features follows; after that, in the next exercise, we will learn about binning, which underlies the frequency domain features.
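
The following is a minimal sketch of the lag, difference, and window features on a small, made-up series; the values are purely illustrative.

    # A toy series of eight measurements, made up for illustration
    values <- c(10, 12, 15, 14, 18, 21, 19, 23)

    # Lag feature: shift the series by one step (NA for the first observation)
    lag_1 <- c(NA, head(values, -1))

    # Difference feature: change between consecutive observations
    diff_1 <- c(NA, diff(values))

    # Window feature: rolling mean over a window of 3 observations
    rolling_mean_3 <- sapply(seq_along(values), function(i) {
      if (i < 3) NA else mean(values[(i - 2):i])
    })

    data.frame(values, lag_1, diff_1, rolling_mean_3)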

Exercise 25: Binning

In this exercise, we will look at binning. We will perform binning on the age data from the PimaIndiansDiabetes dataset. Binning helps in visualizing the data.

  1. Attach the following packages:

    library(caret)

    library(mlbench)

    #Install caret if not installed

    #install.packages('caret')

  2. Load the PimaIndiansDiabetes dataset:

    data(PimaIndiansDiabetes)

    age <- PimaIndiansDiabetes$age

  3. Check the data summary as follows:

    summary(age)

    The output is as follows:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

      21.00   24.00   29.00   33.24   41.00   81.00

  4. Create the bins (intervals):

    #Creating Bins

    # set up boundaries for intervals/bins

    breaks <- c(0,10,20,30,40,50,60,70,80)

  5. Create labels:

    # specify interval/bin labels

    labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")

  6. Bucket the data points into the bins:

    bins <- cut(age, breaks, include.lowest = T, right=FALSE, labels=labels)

  7. Find the summary of the bins:

    summary(bins)

    The output is as follows:

      <10 10-20 20-30 30-40 40-50 50-60 60-70 70-80  NA's

        0     0   396   165   118    57    29     2     1

  8. Plot the bins:

    plot(bins, main="Binning for Age",  ylab="Total Count of People",col="bisque",xlab="Age",ylim=c(0,450))

    The output is as follows:

Figure 3.4: Binning the data

Most of the values for age fall between 20 and 30. Binning has helped to categorize the continuous values and also to derive insights. In the following activity, we will be dealing with the GermanCredit dataset and creating bins.

Activity 10: Creating Time Series Features – Binning

In this activity, we will create bins for a continuous numeric field called Duration (the duration of credit for the customer) in the GermanCredit dataset. Often, we have lots of continuous data values; binning these values helps us understand the column better. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

These are the steps that will help you solve the activity:

  1. Attach the caret package.
  2. Load the GermanCredit data. (Hint: use read.csv().)
  3. Print the summary of the duration columns. (Hint: use summary().)
  4. Use ggplot2 to plot Duration. (Hint: use of the ggplot2 package was covered in Chapter 1, An Introduction to Machine Learning.)
  5. Set up bins using [breaks <- c(0,10,20,30,40,50,60,70,80)].
  6. Set up labels, such as labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80").
  7. Create a new set of bins for the preceding labels using bins <- cut(duration, breaks, include.lowest = T, right=FALSE, labels=labels).
  8. Print summary of bins.
  9. Use plot() to plot the frequency of the new bin's variable.

Once you complete the activity, you should obtain the following output:

Figure 3.5: Plot of duration in bins

Note

The solution for this activity can be found on page 331.

Summary Statistics

The following are some of the summary statistics that can be derived for numeric features:

  1. Mean: Calculates the average for the values in the field.
  2. Standard Deviation: This defines the spread of the data points around the mean.
  3. Minimum: The lowest value in the field is the minimum value.
  4. Maximum: The highest value in the field is the maximum value.
  5. Skewness: This measures the asymmetry of the data's distribution.

In the following exercise, we will compute these descriptive statistics for features in the GermanCredit dataset.

Exercise 26: Finding Description of Features

In this exercise, we will calculate the mean, standard deviation, minimum, maximum, and skewness of the dataset. These numeric features can be calculated as follows:

  1. Attach the caret package and the GermanCredit dataset:

    library(caret)

    data(GermanCredit)

  2. Check the structure of the GermanCredit dataset:

    #See the structure of the dataset

    str(GermanCredit)

    The output is as follows:

    Figure 3.6: Section of the GermanCredit dataset

    From the structure, we can identify the numeric fields.

  3. Calculate the mean of the Amount values in the GermanCredit dataset:

    #Calculate mean

    mean <- mean(GermanCredit$Amount)

    mean

    The output is as follows:

    [1] 3271.258

  4. Calculate the standard deviation of the Amount values in the GermanCredit dataset:

    #Calculate standard deviation

    standard_dev <- sd(GermanCredit$Amount)

    standard_dev

    The output is as follows:

    [1] 2822.737

  5. Calculate the median of the Amount values in the GermanCredit dataset:

    #Calculate median

    median <- median(GermanCredit$Amount)

    median

    The output is as follows:

    [1] 2319.5

  6. Calculate the maximum Amount in the GermanCredit dataset:

    #Identify maximum

    max <- max(GermanCredit$Amount)

    max

    The output is as follows:

    [1] 18424

  7. Calculate the minimum Amount in the GermanCredit dataset:

    #Identify minimum

    min <- min(GermanCredit$Amount)

    min

    The output is as follows:

    [1] 250

  8. Calculate the skewness of the Amount values in the GermanCredit dataset. The e1071 package contains implementations of many statistical functions (such as skewness) in R:

    library(e1071)                    # load e1071

    skewness<-skewness(GermanCredit$Amount)

    skewness

    The output is as follows:

    [1] 1.943783

In this exercise, we calculated descriptive statistics for the Amount feature. In the next section, we will cover the standardizing technique.

Standardizing and Rescaling

Standardization contains two steps:

  1. Subtract the mean from the value (if x is the value, then x-mean)
  2. Then, divide by the standard deviation ((x-mean)/standard deviation).

At times, features have to be scaled to lie within the same range. For instance, Age and Income have very different ranges of values; they could be rescaled to [0, 1] or any standard range such as [-1, 1].

The steps to rescale a value x that lies between min and max (here, a range of [-1, 1], so min = -1 and max = 1) to [0, 1] are as follows:

  1. Subtract the minimum value from the value: x - min

    Numerator = x - (-1)

  2. Subtract the minimum of the range from the maximum of the range: max - min

    Denominator = 1 - (-1)

  3. Divide the numerator by the denominator to get the rescaled value. A short R sketch of both standardization and rescaling follows these steps.
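
Here is a minimal sketch of both operations in R, using the Amount column of the GermanCredit dataset from the caret package as an example.

    # Standardization and min-max rescaling of GermanCredit$Amount
    library(caret)
    data(GermanCredit)
    amount <- GermanCredit$Amount

    # Standardization: (x - mean) / standard deviation
    standardized <- (amount - mean(amount)) / sd(amount)
    # The built-in scale(amount) gives an equivalent result

    # Min-max rescaling to [0, 1]: (x - min) / (max - min)
    rescaled <- (amount - min(amount)) / (max(amount) - min(amount))

    summary(standardized)
    summary(rescaled)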

Handling Categorical Variables

Categorical variables are a list of string values or numeric values for an attribute. For instance, gender can be "Male" or "Female". There are two types of categories: nominal and ordinal. In nominal categorical data, there is no ordering among the values in that attribute. This is the case with gender values. Ordinal categories have some order within the set of values. For instance, for temperature "Low," "Medium," and "High" have an order.

  • Label Encoding: String literals need to be converted to numeric values; for example, "Male" can take the value 1 and "Female" the value 2. This is called integer encoding or label encoding. Because the integer values have a natural ordering, this approach is best suited to categorical data that is ordinal.
  • One-Hot Encoding: For nominal categories, label encoding is not suitable, as the machine learning model may learn the artificial ordering of the integers. Instead, each category is encoded as its own binary indicator; for instance, "Male" becomes (1, 0) and "Female" becomes (0, 1). (A short sketch of label, one-hot, and count encoding follows this list.)
  • Hashing: Hashing is an approach where the categories are mapped to hash values. There may be some information loss due to collisions, but it works well with both nominal and ordinal categories.
  • Count Encoding: This is a technique where the categories are replaced with their counts. A log transformation can be applied to reduce the effect of outliers.
  • Binning: Binning is an approach in which numeric values are converted to categorical (discrete) values. For example, "Age" can take a value from 1-100: 1-25 can be categorized as "young," 25-50 as "middle-aged," and 50-75 as "old age."
  • Variable Transformation: Many modeling techniques require data to have a normal distribution, so we transform data to a normal distribution wherever possible. Data is considered highly skewed if the skewness value is less than -1 or greater than 1.
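
The following is a minimal sketch of label, one-hot, and count encoding on a made-up gender vector; the column and its values are illustrative only.

    # A small, made-up categorical vector
    gender <- c("Male", "Female", "Female", "Male")

    # Label (integer) encoding: map factor levels to integers
    label_encoded <- as.integer(factor(gender, levels = c("Male", "Female")))
    label_encoded                      # 1 2 2 1

    # One-hot encoding: one binary indicator column per category
    gender_factor <- factor(gender)
    one_hot <- model.matrix(~ gender_factor - 1)
    one_hot

    # Count encoding: replace each category with its frequency
    counts <- table(gender)
    count_encoded <- as.numeric(counts[gender])
    # A log transform such as log1p(count_encoded) can damp very frequent categories
    count_encoded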

Skewness

Skewness describes the asymmetry of the values in the specified column. A negative skewness value means that the data is skewed to the left, and a positive skewness value means that the data is skewed to the right. We generally do not want heavily skewed data going into the model, so we will often try to reduce the skewness in the data.

Exercise 27: Computing Skewness

In this exercise, we will find the skewness of the V4 column of the Sonar dataset. The Sonar dataset contains patterns of signals obtained by bouncing sonar signals off rocks and mines. The columns contain the pattern information. The "M" label indicates a mine and the "R" label indicates a rock.

  1. Attach the mlbench package:

    library(mlbench)

    library(lattice)

    library(caret)

    library(e1071)

  2. Load the Sonar dataset:

    data(Sonar)

  3. Find the skewness of the V4 column:

    skewness(Sonar$V4)

    The skewness is 0.5646697.

  4. Plot the histogram:

    histogram(Sonar$V4,xlab="V4")

    The histogram is as follows:

Figure 3.7: Histogram showing skewness

The histogram denotes skewness. The positive skewness value means that the graph is skewed to the right, as you can see in the preceding plot.

Activity 11: Identifying Skewness

In this activity, we will identify the skewness of the glucose column in the PimaIndiansDiabetes dataset. We will then compare it with the skewness of the age column. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

These are the steps that will help you solve the activity:

  1. Attach the mlbench package and load the PimaIndiansDiabetes data. (Hint: use library() and read.csv().)
  2. Print the skewness of the glucose and age columns. (Hint: use skewness(<<column name>>).)
  3. Use histogram() to visualize the two columns.
Figure 3.8: Histogram of the age values of the PimaIndiansDiabetes dataset

Note

The solution for this activity can be found on page 334.

Reducing Skewness Using Log Transform

When a continuous variable has a skewed distribution, we can log-transform it to reduce skewness. This brings the distribution closer to normal. The log() function is used to log-transform the values.

Exercise 28: Using Log Transform

In this exercise, we will reduce the skewness of the age data from the PimaIndiansDiabetes dataset using a log transform.

  1. Calculate the log() of the values:

    #Log Transformation

    transformed_data <- log(PimaIndiansDiabetes$age)

  2. Plot the transformed data using the histogram() function:

    #View histogram

    histogram(transformed_data)

    The output is as follows:

Figure 3.9: Histogram of log-transformed data

As we can see, the data distribution now looks much better; the log transform has moved the skewed distribution closer to a normal distribution.

Derived Features or Domain-Specific Features

These are features derived from the data using an understanding of the business domain.

Let's imagine a dataset that contains data for the sale prices of houses in different areas of a city and that our goal is to predict the future price of any house. For this dataset, the input fields are area code, size of the house, floor number, type of house (individual/apartment), age of the property, renovated status, and so on, along with the sale price of the house. The derived features in this scenario are as follows:

  • Total sales in the area for the past week, month, and so on
  • Location of the house (central area or suburb, based on the area code)
  • Livability index (based on the age and renovated columns)

Another example of deriving domain-specific features would be deriving a person's age from their birth date and the current date in a dataset containing information about people.
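
For example, here is a minimal sketch of deriving an age feature from a hypothetical birth-date column; the dates are made up for illustration.

    # Hypothetical birth dates
    birth_dates <- as.Date(c("1985-06-12", "1992-11-03", "2000-01-25"))

    # Age in whole years as of today
    age <- floor(as.numeric(difftime(Sys.Date(), birth_dates, units = "days")) / 365.25)
    age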

Adding Features to a Data Frame

We will look at the code to add new columns to an R data frame. A new column may be a new feature or a copy of an existing column. We'll look at an example in the following exercise. Adding new features can help improve the performance of a model.

Exercise 29: Adding a New Column to an R Data Frame

In this exercise, we will add columns to an existing R data frame. These new columns can be dummy values or copies of other columns.

  1. Attach the caret package:

    #Adding new features to an R data frame

    library(caret)

  2. Load the GermanCredit dataset:

    data(GermanCredit)

  3. Add a new field, NewField1, and assign it a constant value:

    #Assign the value to the new field

    GermanCredit$NewField1 <- 1

  4. Print the structure of GermanCredit:

    str(GermanCredit)

    The output is as follows:

    Figure 3.10: Section showing the added NewField
  5. Copy an existing column into a new column, as follows:

    #Copy an existing column into a new column

    GermanCredit$NewField2 <- GermanCredit$Purpose.Repairs

  6. Print the structure of GermanCredit:

    str(GermanCredit)

    The output is as follows:

Figure 3.11: Section showing the added NewField2

We have added two new features to the dataset.

Handling Redundant Features

Redundant features are those that are highly correlated with each other. They carry similar information with respect to the output variable. We can identify such features by computing the correlation coefficients between features and then remove one feature from each highly correlated pair.

Exercise 30: Identifying Redundant Features

In this exercise, we will identify redundant features; one feature from each highly correlated pair can then be removed.

  1. Attach the caret package:

    #Loading the library

    library(caret)

  2. Load the GermanCredit dataset:

    # load the German Credit Data

    data(GermanCredit)

  3. Create a correlation matrix:

    # calculating the correlation matrix

    correlationMatrix <- cor(GermanCredit[,1:9])

  4. Print the correlation matrix:

    # printing the correlation matrix

    print(correlationMatrix)

    The output is as follows:

    Figure 3.12: The correlation matrix
  5. To find attributes that have high correlation, set the cutoff to 0.5:

    # finding the attributes that are highly correlated

    filterCorrelation <- findCorrelation(correlationMatrix, cutoff=0.5)

  6. Print the indexes that have a high level of correlation.

    # print indexes of highly correlated fields

    print(filterCorrelation)

    The output is as follows:

    [1] 2

  7. Print the correlation matrix:

    print(correlationMatrix)

    The correlation matrix is as follows:

Figure 3.13: The correlation matrix with a cutoff of 0.5

The output is the index of the highly correlated field; here it is Amount. If the fields are highly correlated, we can remove one of them. Now that we have covered redundant features, we will move on to text features.

Text Features

Text features are generated for purely textual content, such as data containing user blogs or user feedback regarding a product on a web page. The following are some text features:

  • N-grams: N-grams split the text into sequences of n consecutive words, which are then used as features. Using unigrams means splitting text into separate, individual words. For instance, "The product is extremely functional" produces the unigrams "The," "product," "is," "extremely," and "functional." The bigrams of this text are "The product," "product is," "is extremely," and "extremely functional." (A short sketch of some of these features follows Figure 3.14.)
  • TF-IDF: This stands for Term Frequency-Inverse Document Frequency. Term frequency is the number of times a term appears in a piece of text. Inverse document frequency down-weights terms that appear in many documents, so the combined score reflects how important a term is to a particular text.
  • Levenshtein Distance: This is a distance metric calculated at the character level. It is the minimum number of single-character edits needed to convert one string into another.
  • Cosine Similarity: This is a similarity metric in which two pieces of text are compared by projecting them onto vectors in a multi-dimensional space and measuring the cosine of the angle between those vectors.
  • Number of words: The words will be separated using a space.
  • Number of characters: Each character will be counted in this case.
  • Number of stop words: The stop words will be counted. A stop word is a word that is used very often, such as "a", "the", and "as."
  • Number of special characters: Special characters such as !, @, and $.
Figure 3.14: Text features
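
The following is a minimal sketch of a few of these features computed by hand in base R (the textfeatures package, used in the next exercise, automates this); the review sentence is a made-up example.

    # A made-up review sentence
    text <- "The product is extremely functional"

    # Unigrams: split on whitespace
    unigrams <- unlist(strsplit(tolower(text), "\\s+"))
    unigrams

    # Bigrams: pairs of consecutive words
    bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))
    bigrams

    # Simple counts: number of words and number of characters
    n_words <- length(unigrams)
    n_chars <- nchar(text)

    # Levenshtein (edit) distance between two words, using base R's adist()
    adist("product", "produce")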

Automated feature engineering is a process where generic features are calculated for a field by a pre-defined package. For text features, the package used is textfeatures. This R package generates the common text features used to train a machine learning model with textual data. This process is important because it saves us the time it takes to implement numerous text features.

Note

Using the textfeatures package requires R version 3.1 or above.

Exercise 31: Automatically Generating Text Features

In this exercise, we will use the textfeatures package to find text features.

  1. Install the itunesr, textfeatures, and tidyverse packages:

    install.packages("itunesr")

    install.packages("textfeatures")

    install.packages("tidyverse")

  2. Attach the itunesr, textfeatures, and tidyverse packages:

    library(itunesr)

    library(textfeatures)

    library(tidyverse)

  3. Create a text_data character vector with a few lines of text:

    ## the text is a review of a product

    text_data <- c(

      "This product was delivered very fast",

      "IT'S A GREAT DAY TODAY!",

      paste("The product works very efficiently"),

      paste("The product saves us a lot of time"),

      paste("The seller arranged a timely delivery")

    )

  4. Use the textfeatures() function on text_data:

    ## get the text features of a sample character vector

    textfeatures(text_data)

    The output is as follows:

    Figure 3.15: Text features of text_data
  5. Create a data frame with the text character vector:

    ## data frame with a character vector named "text"

    df <- data.frame(

      id = c(1, 2, 3),

      text = c("this is A! sEntence https://github.com about #rstats @github",

               "and another sentence here",

               "The following list: - one - two - three Okay!?!"),

      stringsAsFactors = FALSE

    )

  6. Generate the text features using the textfeatures() function:

    ## Generate the text features

    features <- textfeatures(df)

  7. Print the text features using the glimpse() function:

    #print the text features

    glimpse(features)

    The output is as follows:

Figure 3.16: Glimpse of the text features

The output shows the features generated for the text data that we provided. Three of the features are explained below:

  • n_hashtags: This feature value is based on the number of hashtags in the text.
  • n_commas: This feature value is based on the number of commas in the text.
  • n_digits: This feature value is based on the number of numerical digits in the text.

In the next section, we will discuss in detail the various feature selection approaches.

Feature Selection

There are two types of feature selection techniques: forward selection and backward selection.

  • Forward Selection: This is an approach that can be used for a labeled dataset. Basically, we start with one feature and build the model. We then add more features in an incremental fashion and make a note of the accuracy as we go, finally selecting the combination of features that gave the highest accuracy while training the model. One con of this technique is that, for a dataset with a large number of features, it is extremely time-consuming. Also, if an already-added feature later degrades the performance of the model, we will not detect it. (A minimal sketch of forward selection follows this list.)
  • Backward Selection: In this approach, we also need a labeled dataset. All the features are used to build the model first. We then iteratively remove features and observe the performance of the model, selecting the combination that produced the highest performance. The con of this approach is, again, that for a dataset with a large number of features, it is extremely time-consuming.
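
The following is a minimal sketch of forward selection, assuming the GermanCredit dataset from the caret package and logistic regression as the learner; it illustrates the idea rather than providing a production implementation.

    # Forward selection: keep adding the single feature that most improves
    # cross-validated accuracy, and stop when no remaining feature helps.
    library(caret)
    data(GermanCredit)

    candidates <- names(GermanCredit)[1:9]   # candidate input features
    selected <- c()
    best_overall <- 0

    repeat {
      best_feature <- NULL
      best_acc <- best_overall
      for (f in setdiff(candidates, selected)) {
        fml <- as.formula(paste("Class ~", paste(c(selected, f), collapse = " + ")))
        fit <- train(fml, data = GermanCredit, method = "glm", family = "binomial",
                     trControl = trainControl(method = "cv", number = 5))
        acc <- max(fit$results$Accuracy)
        if (acc > best_acc) {
          best_acc <- acc
          best_feature <- f
        }
      }
      if (is.null(best_feature)) break       # no remaining feature improves accuracy
      selected <- c(selected, best_feature)
      best_overall <- best_acc
    }
    selected

Backward selection can be sketched analogously, starting from all candidate features and repeatedly removing the one whose elimination hurts cross-validated accuracy the least.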

In the selection of features, it is useful to find the correlation between the values. In the next section, we will look at correlation analysis, which helps us determine the correlation between two values.

Correlation Analysis

The correlation between two variables plays an important part in feature selection. If two features are correlated with each other and they are linearly dependent on each other, then one of the features can be dropped as it has the same relationship with the output variable as the other. The linear dependency can be in the form of positive correlation or negative correlation. A positive correlation between fields x and y means that as x increases, y also increases. A negative correlation between x and y means that as x increases, y decreases.

Exercise 32: Plotting Correlation between Two Variables

In this exercise, we will plot the correlation between two variables.

  1. Load the PimaIndiansDiabetes dataset from the mlbench package:

    library(mlbench)

    data(PimaIndiansDiabetes)

  2. Use plot() with an additional parameter, main = "Pearson Correlation", to plot the correlation between the glucose and pressure fields:

    #Correlation Analysis between glucose and pressure

    plot(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$pressure, col="red", xlab = "Glucose", ylab = "Pressure", pch=16, main = "Pearson Correlation")

  3. Load the Sonar dataset from the mlbench library:

    data(Sonar)

  4. The output is as follows:

    Figure 3.17: Pearson correlation in the PimaIndiansDiabetes dataset between the Pregnant and Age variables

    A correlation of 0.544, as seen here between the pregnant and age variables, indicates a moderate positive correlation.

  5. Use plot() with an additional parameter, main = "Pearson Correlation", to plot the correlation between the V4 and V3 fields of the Sonar dataset:

    plot(Sonar$V4, Sonar$V3, col="red", xlab = "V4",

         ylab = "V3", pch=16, main = "Pearson Correlation")

    The output is as follows:

Figure 3.18: Pearson correlation between Sonar v3 and v4

The correlation value between V3 and V4 is 0.78, which indicates a strong positive correlation: as V3 increases, V4 also increases. Since they are strongly correlated and both are input fields, we can drop one and retain the other.

P-Value

When we want to know whether a feature is correlated with the output variable in the real world, it is not enough to calculate the correlation coefficient for the two variables in the dataset, as these might not be representative of the real world. We also need to account for things such as the size of our dataset and the probability of the variables being correlated in our dataset by chance.

The p-value is the probability (0-1) that, given that the null hypothesis is true (that there is no correlation), we would see a correlation coefficient with the same magnitude as the one in our dataset, or higher, simply due to the random selection of observations from the target population (the real world). If the p-value is below a threshold, such as 0.05, we say that the correlation is significant. Note that significant does not mean important in this context. It instead means that we have strong evidence against the null hypothesis, meaning we can reasonably reject the null hypothesis. The p-value does not address the probability of the alternative hypothesis (that there is a correlation) directly, but by rejecting the null hypothesis, the alternative hypothesis becomes more viable. Importantly, a p-value greater than 0.05 does not mean that the two variables are not correlated in the real world, as we might just have too few datapoints to say so.

P-values are highly debated in many scientific fields. They are commonly misunderstood and misused, and it is recommended that you read up on them if you will be relying on them in your work.

It is possible for a feature to be useful in a model without being significantly correlated to the output variable. We therefore decide whether or not we should include it in the model based on whether it makes the model better at predicting the output variable in a test set. We will do this in the Recursive Feature Elimination section.

The cor.test() function in R is used to calculate the Pearson's product moment correlation coefficient between two variables. It outputs a correlation estimate and a p-value.

Exercise 33: Calculating the P-Value

In this exercise, we are finding the p-value for a correlation coefficient between two variables.

  1. Attach the mlbench package and load the Sonar dataset (if it is not already loaded from the previous exercise):

    library(mlbench)

    data(Sonar)

  2. Calculate the p-value for the V3 and V4 fields:

    #Calculating P-Value

    component=cor.test(Sonar$V4, Sonar$V3)

    print(component)

    The output is as follows:

    Figure 3.19: P-value

    cor.test() in R is used to calculate the Pearson's product moment correlation coefficient between two fields. The two fields can be any two fields for which we need to find the correlation. It can be two input fields or one input and one output field. It will output a correlation estimate and a p-value.

  3. Print the p-value:

    #Print the P value

    component$p.value

    The output is as follows:

    [1] 3.918396e-44

  4. Print the correlation:

    #Print the Correlation

    component$estimate

    The output is as follows:

         cor

    0.781786

The p-value of 3.918396e-44 suggests that we have strong evidence against the null hypothesis and have no reason not to trust the correlation coefficient of 0.781786. With such a high correlation, it is likely a good feature to include in our model.

Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a recursive method for feature selection available in R's caret package. It repeatedly builds models on subsets of the features, ranks the features by importance, and drops the least important ones. The algorithms that can be used here include linear regression, random forest, naïve Bayes, and bagged trees. The method builds models for different subset sizes and reports the best subset size in its output.

In the following example, nine features from the GermanCredit dataset have been provided as input to the function. RFE selects the top five features that are important.

Exercise 34: Implementing Recursive Feature Elimination

In this exercise, we will be eliminating features using the recursive feature elimination technique. The top features will be selected for training the model.

  1. Install the e1071 and randomForest packages:

    set.seed(7)

    install.packages("e1071")

    install.packages("randomForest")

  2. Attach the e1071 and randomForest packages:

    library(e1071)

    library(randomForest)

  3. Attach the mlbench and caret packages:

    # Attach the packages

    library(mlbench)

    library(caret)

  4. Load the German credit data:

    # load the German Credit Data

    data("GermanCredit")

  5. Use the rfeControl() function, setting the functions parameter to rfFuncs, the method parameter to "cv", and the number parameter to 9:

    # Use random forest as the method

    method_fn <- rfeControl(functions=rfFuncs, method="cv", number=9)

    The rfeControl() function in R creates a control object through which we can specify a function for prediction/model fitting, a method that is a sampling method, and the number of folds or iterations.

  6. Run the recursive feature elimination function using GermanCredit[,1:9] as the input data frame, GermanCredit[,10] as the outcome variable, and sizes=c(1:9) as the feature subset sizes to evaluate:

    # run the Recursive Feature Elimination algorithm

    output <- rfe(GermanCredit[,1:9], GermanCredit[,10], sizes=c(1:9), rfeControl=method_fn)

    The rfe() function in R performs recursive feature elimination.

  7. Print the output:

    # print the output

    print(output)

    The output is as follows:

    Recursive feature selection

    Outer resampling method: Cross-Validated (9 fold)

    Resampling performance over subset size:

    Variables Accuracy   Kappa AccuracySD KappaSD Selected

             1   0.7000 0.07993    0.01450 0.05992        

             2   0.6841 0.11776    0.03479 0.09647        

             3   0.6900 0.08827    0.03138 0.08838        

             4   0.6781 0.13250    0.03428 0.10378        

             5   0.7130 0.20887    0.03098 0.09935        

             6   0.7271 0.23200    0.03680 0.11610        

             7   0.7280 0.21180    0.02549 0.09123        

             8   0.7230 0.20408    0.02081 0.08139        

             9   0.7281 0.23882    0.03347 0.10788        *

    The top 5 variables (out of 9):

       Duration, Amount, Age, NumberPeopleMaintenance, Telephone

  8. Print the selected features:

    predictors(output)

    The output is as follows:

    [1] "Duration"                  "Amount"                    "Age"                       "NumberPeopleMaintenance"  

    [5] "Telephone"                 "InstallmentRatePercentage" "ResidenceDuration"         "NumberExistingCredits"    

    [9] "ForeignWorker"            

Figure 3.20: Plot of accuracy versus variables

The preceding plot shows the accuracy values for each number of variables chosen for the model. For instance, with all nine variables (the subset size that was selected), the accuracy is approximately 0.73.

PCA

We visited PCA in Chapter 2, Data Cleaning and Pre-Processing, where we used PCA for pre-processing. In this chapter, we will delve into the details of this technique. PCA reduces the dimensionality of our data. For instance, if our data contains 20 columns, we can reduce it to 5 key fields generated from those 20 columns, and these 5 key fields will then represent the data. PCA forms linear combinations of the original fields; the generated components are uncorrelated with each other and are ordered so that the first components capture the maximum variance.

If our data has many dimensions, that is, a large number of fields, then this technique can help to reduce the dimensions by generating the principal components. These components will represent most of the information from our high-dimensional dataset. PCA examines the correlation structure of the fields and constructs linear combinations of them that retain as much of the information as possible. In this way, PCA helps to perform feature selection and is also known as a dimensionality reduction technique. PCA can even be applied to unlabeled data.

Exercise 35: Implementing PCA

In this exercise, we will be using PCA to find the principal components in the PimaIndiansDiabetes dataset.

  1. Load the PimaIndiansDiabetes data:

    #PCA Analysis

    data(PimaIndiansDiabetes)

  2. Create a subset of the first eight columns (the predictors) into another variable named PimaIndiansDiabetes_subset:

    #Subset the predictor columns

    PimaIndiansDiabetes_subset <- PimaIndiansDiabetes[,1:8]

  3. Find the principal components:

    #Find out the Principal components

    principal_components <- prcomp(x = PimaIndiansDiabetes_subset, scale. = T)

    The prcomp() function performs PCA for the data in R.

  4. Print the principal components:

    #Print the principal components

    print(principal_components)

    The output is as follows:

Figure 3.21: Principal components of the PimaIndiansDiabetes dataset

The principal components are PC1, PC2, ..., PC8, in their order of importance. These components are calculated from multiple fields and can be used as features on their own.
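
Continuing from the principal_components object created above, the following minimal sketch shows how to inspect how much variance each component explains and how to use the component scores as features.

    # Proportion of variance explained by each principal component
    summary(principal_components)
    variance_explained <- principal_components$sdev^2 / sum(principal_components$sdev^2)
    round(variance_explained, 3)

    # The component scores can be used as new features in place of the original columns
    pca_features <- as.data.frame(principal_components$x)
    head(pca_features)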

Activity 12: Generating PCA

In this activity, we will use the GermanCredit dataset and find the principal components. These values can be used instead of the features. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

These are the steps that will help you solve the activity:

  1. Load the GermanCredit data.
  2. Create a subset of the first nine columns into another variable named GermanCredit_subset.
  3. Use prcomp(x = GermanCredit_subset, scale. = T) to generate the principal components.
  4. Print the generated principal_components.
  5. Interpret the results.

The PCA values will look as follows:

Figure 3.22: Principal components of the GermanCredit dataset

Note

The solution for this activity can be found on page 337.

Ranking Features

While building certain models such as decision trees and random forests, the features that are important to the model (for instance, the features that have good correlation with the output variable) are known. These features are then ranked by the model. We will look at a few examples of ranking features automatically using machine learning models.

Variable Importance Approach with Learning Vector Quantization

Using Learning Vector Quantization (LVQ), we can rank features based on their importance. An LVQ model and the variable importance function, varImp(), will be used to fetch the important variables. The GermanCredit dataset is used to demonstrate LVQ. For simplicity's sake, we are choosing the first ten columns of the GermanCredit dataset; the 10th column contains the class values for prediction.

Exercise 36: Implementing LVQ

In this exercise, we will implement LVQ for the GermanCredit dataset and use the variable importance function to list the importance of the fields in this dataset.

  1. Attach the mlbench and caret packages.

    set.seed(9)

    # loading the libraries

    library(mlbench)

    library(caret)

  2. Load the GermanCredit dataset.

    # load the German Credit dataset

    data("GermanCredit")

  3. Set the parameters for training using the trainControl() function.

    #Setting parameters for training

    control <- trainControl(method="repeatedcv", number=10, repeats=3)

  4. Train the model.

    # training the model

    model <- train(Class~., data=GermanCredit[,1:10], method="lvq", preProcess="scale", trControl=control)

  5. Find the importance of the variables:

    # Getting the variable importance

    importance <- varImp(model, scale=FALSE)

  6. Print the importance of the variables:

    # print the variable importance

    print(importance)

  7. Plot the importance:

    # plot the result

    plot(importance)

    The output is as follows:

    Figure 3.23: Importance of the variables
  8. Print the importance.

    print(importance)

    The output is as follows:

    ROC curve variable importance

                              Importance

    Duration                      0.6286

    Age                           0.5706

    Amount                        0.5549

    InstallmentRatePercentage     0.5434

    NumberExistingCredits         0.5251

    Telephone                     0.5195

    ForeignWorker                 0.5169

    ResidenceDuration             0.5015

    NumberPeopleMaintenance       0.5012

An importance score has been assigned to each variable. The Duration, Age, and Amount variables are the most important variables. The least important variables are ResidenceDuration and NumberPeopleMaintenance.

Variable Importance Approach Using Random Forests

When using random forests to determine variable importance, multiple trees are trained. After creating the forest, the model reports the importance of the variables used in the data. Tree-based models take non-linear relationships into consideration, and the features used in the splits are highly relevant to the output variable. We should also make sure that we avoid overfitting; therefore, the depth of the trees should be kept small.

Exercise 37: Finding Variable Importance in the PimaIndiansDiabetes Dataset

In this exercise, we will find variable importance in the PimaIndiansDiabetes dataset using random forests.

  1. Attach the necessary packages.

    library(mlbench)

    library(caret)

    library(randomForest)

  2. Load the PimaIndiansDiabetes data.

    data(PimaIndiansDiabetes)

  3. Train a random forest model using randomForest(Class~., data= PimaIndiansDiabetes).

    random_forest <- randomForest(Class~., data= PimaIndiansDiabetes)

  4. Invoke importance() for the trained random_forest.

    # Create an importance based on mean decreasing gini

    importance(random_forest)

    The output is as follows:

              MeanDecreaseGini

    pregnant         28.60846

    glucose          88.03126

    pressure         29.83910

    triceps          23.92739

    insulin          25.89228

    mass             59.12225

    pedigree         42.86284

    age              48.09455

  5. Use the varImp() function to view the list of important variables.

    varImp(random_forest)

    The importance of the variables is as follows:

              Overall

    pregnant 28.60846

    glucose  88.03126

    pressure 29.83910

    triceps  23.92739

    insulin  25.89228

    mass     59.12225

    pedigree 42.86284

    age      48.09455

The features that are important have higher scores and those features can be selected for the model.

Activity 13: Implementing the Random Forest Approach

In this activity, we will use the GermanCredit dataset and perform a random forest approach on the dataset to find the features with the highest and lowest importance. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

These are the steps that will help you solve the activity:

  1. Load the GermanCredit data.
  2. Create a subset to load the first ten columns into GermanCredit_subset.
  3. Attach the randomForest package.
  4. Train a random forest model using random_forest<-randomForest(Class~., data=GermanCredit_subset).
  5. Invoke importance() for the trained random_forest.
  6. Invoke the varImp() function, as in varImp(random_forest).
  7. Interpret the results.

The expected output of variable importance will be as follows:

                             Overall

Duration                   70.380265

Amount                    121.458790

InstallmentRatePercentage  27.048517

ResidenceDuration          30.409254

Age                        86.476017

NumberExistingCredits      18.746057

NumberPeopleMaintenance    12.026969

Telephone                  15.581802

ForeignWorker               2.888387

Note

The solution for this activity can be found on page 338.

Variable Importance Approach Using a Logistic Regression Model

We have seen the importance scores provided by random forests; in this section, a logistic regression model is trained on the data to identify variable importance. We will use varImp() to show the relative importance of the columns. This model will help provide us with the importance of the fields in a dataset.

Exercise 38: Implementing the Logistic Regression Model

In this exercise, we will implement a logistic regression model.

  1. Attach the caret package and load the GermanCredit data:

    library(caret)

    data(GermanCredit)

  2. Create a subset of the GermanCredit data containing the first ten columns:

    GermanCredit_subset <- GermanCredit[,1:10]

  3. Create a data frame using the as.data.frame() function (GermanCredit is already a data frame, so this simply makes the structure explicit):

    data_lm = as.data.frame(GermanCredit_subset)

  4. Fit a logistic regression model:

    # Fit a logistic regression model

    log_reg = glm(Class~., data = GermanCredit_subset, family = "binomial")

  5. Use the varImp() function to list the importance of the variables:

    # Using varImp() function

    varImp(log_reg)

    The output is as follows:

                                Overall

    Duration                  3.0412079

    Amount                    2.7164175

    InstallmentRatePercentage 2.9227186

    ResidenceDuration         0.6339908

    Age                       2.7370544

    NumberExistingCredits     1.1394251

    NumberPeopleMaintenance   0.6952838

    Telephone                 2.5708235

    ForeignWorker             1.9652732

After building a logistic regression model, the importance of each variable is given a score. The higher the score is, the more important the variable is. The variables that are most important can be used for the model.

Determining Variable Importance Using rpart

Using varImp(), we can list the features with their importance. rpart stands for Recursive Partitioning and Regression Trees. This package contains an implementation of a tree algorithm in R, specifically known as Classification and Regression Trees (CART). In the following exercise, we will be using the rpart package in R.

Exercise 39: Variable Importance Using rpart for the PimaIndiansDiabetes Data

In this exercise, we will be finding the variable importance using rpart. Finding the importance of variables helps to select the correct variables.

  1. Install the following packages:

    install.packages("rpart")

    install.packages("randomForest")

    set.seed(10)

    library(caret)

    library(mlbench)

  2. Load the dataset and create a subset:

    data(PimaIndiansDiabetes)

    PimaIndiansDiabetes_subset <- PimaIndiansDiabetes[,1:9]

    PimaIndiansDiabetes_subset

  3. Train the rpart model:

    #Train a rpart model

    rPartMod <- train(diabetes ~ ., data=PimaIndiansDiabetes_subset, method="rpart")

  4. Find the variable importance:

    #Find variable importance

    rpartImp <- varImp(rPartMod)

  5. Print the variable importance:

    #Print variable importance

    print(rpartImp)

    The output is as follows:

    rpart variable importance

             Overall

    glucose  100.000

    mass      65.542

    age       52.685

    pregnant  30.245

    insulin   16.973

    pedigree   7.522

    triceps    0.000

    pressure   0.000

  6. Plot the top five variables' importance:

    #Plot top 5 variable importance

    plot(rpartImp, top = 5, main='Variable Importance')

    The plot is as follows:

Figure 3.24: Variable importance

From the preceding plot, it can be noted that glucose, mass, and age are most important for the output, and should therefore be included for the modeling. In the next activity, we will be selecting features using variable importance.

Activity 14: Selecting Features Using Variable Importance

In this activity, we will use the GermanCredit dataset and find the variable importance using rpart. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

These are the steps that will help you solve the activity:

  1. Attach the rpart and caret packages.
  2. Load the GermanCredit data.
  3. Create a subset to load the first 10 columns into GermanCredit_subset.
  4. Train an rpart model using rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart").
  5. Invoke the varImp() function, as in rpartImp <- varImp(rPartMod).
  6. Print rpartImp.
  7. Plot rpartImp using plot().
  8. Interpret the results.

The expected output of variable importance will be as follows:

Figure 3.25: Variable importance for the fields

Note

The solution for this activity can be found on page 339.

Here is a table that summarizes the techniques we have looked at and the features we can select using these techniques:

Figure 3.26: Summary of the models

Thus, we can see that most methods suggest Duration, Amount, and Age as the most important features.

Summary

In this chapter, we have learned about the different types of features that are generated to train a model. We have derived domain-specific features and datatype-specific features. Also, we explored an automated technique for generating text features. The feature engineering process is essential for obtaining the best model performance. We delved into two variable transformation techniques and learned about techniques to identify redundant features and handle them in a dataset.

We have learned about forward and backward feature selection approaches and have performed correlation analysis through detailed examples. We implemented the calculation of p-values in R and looked at its significance to the process of the selection of features. Recursive feature elimination is another way that we saw to find the best combination of features for a model. We delved into a dimensionality reduction approach, known as PCA, that drastically reduces the number of features needed, as it calculates the principal components in a dataset.

We explored several techniques for ranking the features in R. LVQ and random forests were implemented in R to observe the ranking for the features in the GermanCredit dataset. We also learned how to use the variable importance function in R to list the importance of all the variables in a dataset.

In the next chapter, we will use neural networks to solve classification problems. Along with this, we will also evaluate the models using cross-validation.
