Best practices in feature engineering

In this section, we outline some good practices to follow while performing feature engineering on your available data. Some general machine learning best practices were described in Chapter 2, Machine Learning Best Practices; those apply to feature engineering too, but they address the overall machine learning workflow rather than this specific stage. In the following subsections, we provide more concrete guidance concerning feature engineering.

Understanding the data

Although the term feature engineering sounds technical, it is also an art that helps you understand where your features come from. Some vital questions arise that need to be answered before you can claim to understand the data:

  • What are the provenances of those features? Is the data real-time or coming from the static sources?
  • Are the features continuous, discrete, or neither?
  • What is the distribution of the features? Does the distribution largely depend on what subset of examples is being considered?
  • Do these features contain missing values (that is, NULL)? If so, is it possible to handle those values? Is it possible to eliminate them from the present data, and from future or upcoming data?
  • Are there duplicate or redundant entries?
  • Should we go for manual feature creation that proves to be useful? If so, how hard would it be to incorporate those features in the model training stage?
  • Are there features that can be used as standard features?

Knowing the answers to the preceding questions is important, since the provenance of the data helps you prepare your feature engineering techniques faster. You need to know whether your features are discrete or continuous, and whether the data arrives as real-time requests or from static sources. Moreover, you need to know the distribution of the data, along with its skewness and kurtosis, in order to handle outliers.

You also need to decide how to handle missing or null values, whether they can simply be removed or need to be filled with alternative values. Besides, you need to remove duplicate entries early on, which is extremely important, since duplicate data points can significantly distort the results of model validation if they are not properly excluded. Finally, you need to understand the machine learning problem itself, since knowing the problem type helps you label your data accordingly.
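As a minimal sketch of what such an initial inspection might look like in Spark (assuming Spark 2.x's SparkSession and a hypothetical companies.csv file with a numeric annual_revenue column; both names are illustrative, not from this book), you could check the summary statistics, skewness, kurtosis, null counts, and duplicates as follows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, kurtosis, skewness}

val spark = SparkSession.builder()
  .appName("UnderstandingData")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input: companies.csv with a numeric 'annual_revenue' column
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("companies.csv")

// Basic summary statistics: count, mean, stddev, min, max
df.describe("annual_revenue").show()

// Skewness and kurtosis of the distribution, useful for spotting heavy tails and outliers
df.select(skewness(col("annual_revenue")), kurtosis(col("annual_revenue"))).show()

// How many records have a missing (null) annual_revenue?
println(df.filter(col("annual_revenue").isNull).count())

// Drop duplicate rows and rows with missing values before model validation
val cleaned = df.dropDuplicates().na.drop()
```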

Innovative way of feature extraction

Be innovative while extracting and selecting features. Here we provide eight tips altogether that will help you apply these ideas during your machine learning application development.

Tip

Create the input by rolling up existing data fields to a broader level or category.

To be more specific, let's give you some examples. You can categorize your colleagues based on their title as strategic or tactical. For instance, you can encode employees with the title Vice President (VP) or above as strategic, and Director and below as tactical.

Collating several industries into a higher-level industry group is another example of such categorization: group oil and gas companies with commodity companies; gold, silver, or platinum producers as precious metal companies; high-tech giants and telecommunication firms as technology; and, for instance, define companies with more than $1B in revenue as large and those with net assets of around $1M as small.
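As a rough sketch of the title roll-up (assuming a hypothetical employees DataFrame with a title column; both are illustrative names), this kind of categorization can be expressed with Spark SQL's when/otherwise expression:

```scala
import org.apache.spark.sql.functions.{col, when}

// Hypothetical DataFrame 'employees' with a 'title' column
val withLevel = employees.withColumn(
  "level",
  when(col("title").isin("VP", "Vice President", "SVP", "CEO", "CFO"), "strategic")
    .otherwise("tactical"))

// Quick sanity check of the new rolled-up category
withLevel.groupBy("level").count().show()
```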

Tip

Split data into separate categories or bins.

To be more specific, let's give you an example. Suppose you are doing some analytics on companies whose annual revenue ranges from $50M to over $1B. You can split the revenue into sequential bins, such as $50-$200M, $201-$500M, $501M-$1B, and $1B+, for instance. Now, how do you represent these features in a usable format? It's simple: put a value of one whenever a company falls within a revenue bin; otherwise, the value is zero. There are now four new data fields created from the single annual revenue field, right?
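Here is a minimal sketch of such binning (reusing the hypothetical companies DataFrame with annual_revenue expressed in dollars); each indicator column holds one when the company falls within the bin and zero otherwise:

```scala
import org.apache.spark.sql.functions.{col, when}

// Hypothetical DataFrame 'companies' with 'annual_revenue' in dollars
val binned = companies
  .withColumn("rev_50_200M",
    when(col("annual_revenue").between(5.0e7, 2.0e8), 1).otherwise(0))
  .withColumn("rev_201_500M",
    when((col("annual_revenue") > 2.0e8) && (col("annual_revenue") <= 5.0e8), 1).otherwise(0))
  .withColumn("rev_501M_1B",
    when((col("annual_revenue") > 5.0e8) && (col("annual_revenue") <= 1.0e9), 1).otherwise(0))
  .withColumn("rev_1B_plus",
    when(col("annual_revenue") > 1.0e9, 1).otherwise(0))
```

Alternatively, Spark ML's Bucketizer transformer can map the raw revenue to a single bucket index in one step, which you could then one-hot encode.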

Tip

Think of an innovative way to combine existing data fields into new ones.

To be more specific, let's give you some examples. In the very first tip, we showed how to create new inputs by rolling up existing fields into broader ones. Now suppose you want to create a Boolean flag that identifies whether someone is in the VP-or-higher category and has more than 10 years of experience. In this case, you are creating new fields by combining existing data fields, for example by multiplying, dividing, adding, or subtracting one field by another.
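A tiny sketch of such a combination, assuming the hypothetical level column from the first tip and an equally hypothetical years_experience column:

```scala
import org.apache.spark.sql.functions.col

// Boolean flag: strategic-level (VP or higher) AND more than 10 years of experience
val flagged = withLevel.withColumn(
  "senior_strategic",
  (col("level") === "strategic") && (col("years_experience") > 10))
```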

Tip

Think about the problem at hand and be creative simultaneously.

Suppose that, by following the previous tips, you have created plenty of bins, fields, or inputs. Don't worry too much about creating too many variables at this point; it is wiser to let the brainstorming flow naturally and leave the pruning to the feature selection step.

Tip

Don't be a fool.

Be cautious about creating unnecessary fields, since creating too many features out of a small amount of data may overfit your model, which can lead to spurious results. When you look at correlations in the data, remember that correlation does not always imply causation. Our reasoning behind this common point is that modeling observational data can only show us that two variables are related; it cannot tell us why.

Research described in the book Freakonomics (Steven D. Levitt and Stephen J. Dubner, Freakonomics: A Rogue Economist Explores the Hidden Side of Everything, http://www.barnesandnoble.com/w/freakonomics-steven-d-levitt/1100550563) found that public school test score data indicates that children living in homes with a higher number of books tend to have higher standardized test scores than those with fewer books at home.

Therefore, be cautious before constructing unnecessary features; in other words, don't be a fool.

Tip

Don't over-engineer.

During the feature engineering phase, it matters less than you might think whether an iteration takes a few minutes or half a day, since the most productive time is usually spent at the whiteboard. Therefore, the most productive way to make sure it is done right is to ask the right questions of your data. It's true that nowadays the term big data is taking over from the term feature engineering, but there is no room for hacking, and hence no room for over-engineering.


Figure 4: Real interpretation of false positives and false negatives.

Tip

Beware of false positives and false negatives.

Another important aspect is comparing false negatives and false positives. Depending on the problem, getting higher accuracy on one or the other matters more. For instance, if you are doing research in the healthcare sector and trying to develop a machine learning model for disease prediction, getting false positives might be better than getting false negatives. Therefore, our suggestion in this regard is to look at the confusion matrix, which helps you see the predictions made by a classifier in a visual way.

The rows indicate the true class of each observation, while the columns correspond to the class predicted by the model, as shown in Figure 4. However, Figure 5 provides more insight. Note that the diagonal elements, also called correct decisions, are marked in bold. The last column, Acc, signifies the accuracy for each class as follows:


Figure 5: A simple confusion matrix.
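As a rough illustration (not the book's own example), Spark MLlib's MulticlassMetrics can compute a confusion matrix and per-class accuracy directly from an RDD of (prediction, label) pairs. The snippet below assumes a spark-shell session where sc is available, uses hard-coded pairs for demonstration, and equates the Acc column of Figure 5 with the per-class recall (correct predictions for a class divided by all observations of that class), since the formula itself is not reproduced here:

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Hypothetical (prediction, trueLabel) pairs produced by a binary classifier
val predictionsAndLabels = sc.parallelize(Seq(
  (1.0, 1.0), (1.0, 1.0), (0.0, 1.0),   // one false negative
  (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)))  // one false positive

val metrics = new MulticlassMetrics(predictionsAndLabels)

// Rows are true classes, columns are predicted classes
println(metrics.confusionMatrix)

// Per-class accuracy: correctly predicted observations of a class / all observations of that class
metrics.labels.foreach(l => println(s"Acc for class $l = ${metrics.recall(l)}"))
```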

Tip

Think about precision and recall before selecting features.

Finally, two more important quantities to consider are precision and recall. More technically, recall measures how many of the actual positive cases your classifier correctly identifies as positive, whereas precision measures how often a positive prediction made by your classifier is actually correct.
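Expressed as a tiny sketch in code (the counts are made-up, illustrative numbers):

```scala
// Precision and recall from raw confusion-matrix counts
// tp = true positives, fp = false positives, fn = false negatives
def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)
def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)

println(precision(80, 20)) // 0.8   -> of 100 positive predictions, 80 were correct
println(recall(80, 10))    // ~0.89 -> of 90 actual positives, 80 were found
```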

It's true that it is really difficult to keep both of these values high at the same time. However, careful feature selection will help you improve both of them in the end.

You will find more interesting and excellent descriptions of feature selection in a research paper written by Matthew Shardlow (Matthew Shardlow, An Analysis of Feature Selection Techniques, https://studentnet.cs.manchester.ac.uk/pgt/COMP61011/goodProjects/Shardlow.pdf). Now let's journey into the realm of Spark's feature engineering capabilities in the next section.
