Feature Transformations

So far in this text, we have approached feature engineering from nearly every angle. From analyzing tabular data to ascertain the levels of our data, to constructing and selecting columns using statistical measures to optimize our machine learning pipelines, we have covered a great deal of ground in working with the features in our data.

It is worth repeating that improvements to machine learning come in many forms. We generally consider two main metrics: accuracy and prediction/fit time. This means that if feature engineering tools give our pipeline higher accuracy in a cross-validated setting, or let it fit and/or predict data more quickly, we may consider that a success. Of course, our ultimate hope is to optimize for both accuracy and time, giving us a much better pipeline to work with.

The past five chapters have dealt with what is considered classical feature engineering. We have looked at five main categories/steps in feature engineering so far:

  • Exploratory data analysis: At the beginning of our work with machine learning pipelines, before even touching machine learning algorithms or feature engineering tools, we are encouraged to compute some basic descriptive statistics on our datasets and create visualizations to better understand the nature of the data
  • Feature understanding: Once we have a sense of the size and shape of the data, we should take a closer look at each of the columns in our dataset (if possible) and outline characteristics, including the level of data, as that will dictate how to clean specific columns if necessary
  • Feature improvement: This phase is about altering data values and entire columns, imputing missing values according to the level of each column and performing dummy variable transformations and scaling operations where appropriate
  • Feature construction: Once we have the best possible dataset at our disposal, we can think about constructing new columns to account for feature interaction
  • Feature selection: In the selection phase of our pipeline, we take all original and constructed columns and perform (usually univariate) statistical tests to isolate the best-performing columns, removing noise and speeding up calculations

The following figure sums up this procedure and shows us how to think about each step in the process:

Machine learning pipeline

This is an example of a machine learning pipeline using methods from earlier in this text. It consists of five main steps: analysis, understanding, improvement, construction, and selection. In the upcoming chapters, we will be focusing on a new method of transforming data that partly breaks away from this classical notion.
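To make this procedure concrete, the following is a minimal sketch of such a pipeline built with scikit-learn. The estimator choices, the placeholder data names (X_train, y_train, X_test, y_test), and the k=5 setting are illustrative assumptions rather than prescriptions from this text; they simply map the improvement, construction, and selection steps onto pipeline stages.

    # A minimal sketch, assuming a numeric feature matrix X_train with some
    # missing values and a classification target y_train (both hypothetical).
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    classical_pipeline = Pipeline([
        # feature improvement: impute missing values and scale the columns
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', StandardScaler()),
        # feature construction: add interaction terms between columns
        ('construct', PolynomialFeatures(degree=2, include_bias=False)),
        # feature selection: keep the k best columns by a univariate test
        ('select', SelectKBest(score_func=f_classif, k=5)),
        # the learner sitting at the end of the pipeline
        ('model', LogisticRegression())
    ])

    # classical_pipeline.fit(X_train, y_train)
    # print(classical_pipeline.score(X_test, y_test))

Note that the first two steps, exploratory data analysis and feature understanding, happen before any such pipeline is built; they inform which imputation, scaling, and selection choices are sensible in the first place.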

At this stage of the book, the reader is more than ready to start tackling the datasets of the world with reasonable confidence and expectations of performance. The following two chapters, Chapter 6, Feature Transformations, and Chapter 7, Feature Learning, will focus on two subsets of feature engineering that are quite heavy in both programming and mathematics, specifically linear algebra. We will, as always, do our best to explain all lines of code used in this chapter and only describe mathematical procedures where necessary.

This chapter will deal with feature transformations, a suite of algorithms designed to alter the internal structure of data to produce mathematically superior super-columns, while the following chapter will focus on feature learning, using non-parametric algorithms (those that do not depend on the shape of the data) to automatically learn new features. The final chapter of this text contains several worked-out case studies to show the end-to-end process of feature engineering and its effects on machine learning pipelines.

For now, let us begin with our discussion of feature transformation. As we mentioned before, feature transformations are a set of matrix algorithms that structurally alter our data and produce what is essentially a brand new matrix of data. The basic idea is that the original features of a dataset are the descriptors/characteristics of its data-points, and we should be able to create a new set of features that explains those data-points just as well, perhaps even better, with fewer columns.

Imagine a simple, rectangular room. The room is empty except for a single mannequin standing in the center. The mannequin never moves and is always facing the same way. You have been charged with the task of monitoring that room 24/7. Of course, you come up with the idea of adding security cameras to the room to make sure that all activity is captured and recorded. You place a single camera in a top corner of the room, facing down to look at the face of the mannequin and, in the process, catch a large part of the room on camera. With one camera, you are able to see virtually all aspects of the room. The problem is that the camera has blind spots. For example, you won't be able to see directly below the camera (due to its physical inability to see there) and behind the mannequin (as the dummy itself is blocking the camera's view). Being brilliant, you add a second camera to the opposite top corner, behind the mannequin, to compensate for the blind spots of the first camera. Using two cameras, you can now see greater than 99% of the room from a security office.

In this example, the room represents the original feature space of data and the mannequin represents a data-point, standing at a certain section of the feature space. More formally, I'm asking you to consider a three-dimensional feature space with a single data-point:

[X, Y, Z]

To try and capture this data-point with a single camera is like squashing down our dataset to have only one new dimension, namely, the data seen by camera one:

[X, Y, Z] ≈ [C1]

However, one dimension alone will likely not be enough; just as the single camera had blind spots, we add a second camera:

[X, Y, Z] ≈ [C1, C2]

These two cameras (new dimensions produced by feature transformations) capture the data in a new way, but give us enough of the information we need with only two columns instead of three. The toughest part of feature transformation is suspending our belief that the original feature space is the best one. We must be open to the possibility that there are other mathematical axes and coordinate systems that describe our data just as well, or possibly even better, with fewer features.
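As a concrete sketch of this analogy, the snippet below uses principal component analysis (PCA), one of the feature transformation techniques discussed later in this chapter, to project a hypothetical three-column dataset onto two new "camera" columns. The randomly generated data is purely illustrative.

    # A minimal sketch: squashing a hypothetical [X, Y, Z] dataset down to
    # two new columns [C1, C2] and checking how much information survives.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    # 100 points in three dimensions, generated so that most of the
    # variation lies along a flat two-dimensional "sheet" in the room
    sheet = rng.normal(size=(100, 2))
    points = sheet @ np.array([[1.0, 0.5, 0.2],
                               [0.3, 1.0, 0.1]])
    points += rng.normal(scale=0.05, size=(100, 3))  # a little noise off the sheet

    # two "cameras": project the three original columns onto two new axes
    pca = PCA(n_components=2)
    two_cameras = pca.fit_transform(points)

    print(two_cameras.shape)                    # (100, 2) -> [C1, C2]
    print(pca.explained_variance_ratio_.sum())  # fraction of the variance the two cameras capture

If the summed explained variance ratio is close to 1, the two new columns "see" nearly the whole room, which is exactly the intuition behind the camera analogy.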
