Advanced feature engineering

In this section, we will discuss some advanced aspects of the feature engineering process: manual feature construction, feature learning, the iterative nature of feature engineering, and deep learning.

Feature construction

The best results often come from manual feature engineering, also called feature construction: the process of creating new features from the raw data. Feature selection based on feature importance can tell you about the objective utility of features; however, those features have to come from somewhere. In fact, sometimes you need to create them manually.

In contrast to feature selection, feature construction requires you to spend a lot of effort and time not on aggregating or picking features, but on the actual raw data, so that the new features increase the predictive accuracy of the model. It therefore also involves thinking about the underlying structure of the data as well as the ML problem itself.

To construct new features from a complex, high-dimensional dataset, you need to know the overall structure of the data, as well as how to use and apply the new features in predictive modeling algorithms. There are three aspects to consider, for tabular, textual, and multimedia datasets:

  • Manual feature creation from tabular data often means combining features to create new ones. You might also need to decompose or split some original features to create new features.
  • With textual data, it often means devising document- or context-specific indicators relevant to the problem, for example, when you are applying text analytics to large raw datasets such as Twitter hashtags.
  • With multimedia data such as images, it often means spending enormous amounts of time picking out the relevant structures manually.
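The first two points can be sketched in a few lines of Python. The record fields and feature names below are hypothetical, purely for illustration: a timestamp is decomposed into several parts, two numeric fields are combined into one, and a simple hashtag indicator is derived from raw text.

```python
from datetime import datetime

# A single raw record; all field names here are made up for illustration.
record = {
    "timestamp": "2017-03-15T14:30:00",
    "price": 120.0,
    "quantity": 4,
    "tweet": "Loving the new release! #spark #bigdata",
}

def construct_features(rec):
    """Construct new features by decomposing and combining raw fields."""
    ts = datetime.strptime(rec["timestamp"], "%Y-%m-%dT%H:%M:%S")
    return {
        # Decomposition: split one timestamp into several usable parts
        "hour": ts.hour,
        "weekday": ts.weekday(),
        "is_weekend": int(ts.weekday() >= 5),
        # Combination: merge two numeric features into a new one
        "total_amount": rec["price"] * rec["quantity"],
        # Text indicator: a context-specific signal from raw text
        "hashtag_count": sum(1 for w in rec["tweet"].split() if w.startswith("#")),
    }

features = construct_features(record)
print(features)
```

Note that every constructed feature encodes domain knowledge: knowing that weekends matter, that price times quantity is meaningful, or that hashtags carry signal is exactly the human insight that feature construction demands.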

Unfortunately, feature construction is not only manual; the whole process is also slower, requiring a lot of research effort from humans like you and us. However, it can make a big difference in the long run. In fact, feature construction and feature selection are not mutually exclusive; both are important in the realm of machine learning.

Feature learning

Is it possible to avoid the manual process of prescribing how to construct or extract features from raw data? Feature learning helps you get rid of this step: it is the automatic identification and use of features from raw data. It is also referred to as representation learning, and it helps your machine learning algorithm identify useful features.

Feature learning is commonly used in deep learning algorithms, and recent deep learning techniques have achieved some success in this area. Autoencoders and restricted Boltzmann machines are examples where the concept of feature learning is used. The key idea behind feature learning is to automatically learn abstract representations of the features in a compressed form, using unsupervised or semi-supervised learning algorithms.
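To make the autoencoder idea concrete, here is a minimal sketch (not a production implementation) of a linear autoencoder in NumPy. The encoder compresses 5-dimensional data into a 2-dimensional representation, and training simply minimizes the reconstruction error; the data, dimensions, and learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5 dimensions that actually live on a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing

# A minimal linear autoencoder: encode 5 -> 2, decode 2 -> 5
W_enc = rng.normal(scale=0.1, size=(5, 2))
W_dec = rng.normal(scale=0.1, size=(2, 5))
lr = 0.01

def loss(X, W_enc, W_dec):
    recon = X @ W_enc @ W_dec
    return float(np.mean((recon - X) ** 2))

initial_loss = loss(X, W_enc, W_dec)
for _ in range(1000):
    H = X @ W_enc          # compressed representation: the learned features
    recon = H @ W_dec      # reconstruction of the original data
    err = recon - X
    # Gradients of the mean squared reconstruction error
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

The matrix `H` is the learned feature representation: nothing was prescribed by hand, yet after training it captures the 2-dimensional structure hidden in the 5-dimensional input. Real autoencoders stack nonlinear layers, but the principle is the same.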

Speech recognition, image classification, and object recognition are some successful examples, where researchers have reported state-of-the-art results. Further details cannot be presented in this book for the sake of brevity.

Unfortunately, Spark has not implemented any APIs for automatic feature extraction or construction.

Iterative process of feature engineering

The whole process of feature engineering is not standalone; it is more or less iterative, since you interplay between data selection and model evaluation again and again, until you are completely satisfied or you run out of time. The iteration can be imagined as a four-step workflow that runs repeatedly over time. When you are aggregating or collecting the raw data, you might not be doing much brainstorming. However, once you start exploring the data, you really get deeper into the problem.

After that, you will look at a lot of data, study the best feature engineering techniques for related problems presented in the state of the art, and see how much you can borrow. When you have done enough brainstorming, you will start devising or extracting the required features, depending on your problem type. You might use automatic feature extraction or manual feature construction (or sometimes both). If you are not satisfied with the performance, you might redo the feature extraction process to improve it. Please refer to Figure 8 for a clear view of the iterative process of feature engineering:


Figure 8: The iterative process of feature engineering

When you have devised or extracted the features, you need to select them. You might apply a different scoring or ranking mechanism based on feature importance. Similarly, you might iterate the same process, such as devising new features, to improve the model. Finally, you will evaluate your model to estimate its accuracy on new data and make your model adaptive.
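The loop described above can be sketched in plain Python. Everything here is hypothetical scaffolding: `devise_features`, `score`, and `evaluate` are stand-ins for real brainstorming, feature-importance scoring, and model evaluation, but the control flow is the point: devise, select, evaluate, and repeat until a well-defined stopping criterion is met.

```python
import random

random.seed(42)

def devise_features(round_no):
    # Stand-in for brainstorming/extraction: each round proposes candidates
    return [f"feat_{round_no}_{i}" for i in range(3)]

def score(feature):
    # Stand-in for a feature-importance score (e.g., from a tree-based model)
    return random.random()

def evaluate(selected):
    # Stand-in for model evaluation; more kept features -> higher accuracy here
    return min(0.95, 0.5 + 0.05 * len(selected))

selected, best_accuracy, round_no = [], 0.0, 0
# Two well-defined stopping criteria: target accuracy, or a round budget
while best_accuracy < 0.9 and round_no < 10:
    round_no += 1
    candidates = devise_features(round_no)
    # Feature selection: keep only candidates scoring above a threshold
    selected += [f for f in candidates if score(f) > 0.5]
    accuracy = evaluate(selected)
    best_accuracy = max(best_accuracy, accuracy)

print(round_no, len(selected), best_accuracy)
```

Without the explicit stopping conditions in the `while` clause, this loop would never terminate, which is exactly why a well-defined problem matters.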

You also need a well-defined problem, which will help you decide when to stop the whole iteration. Once you plateau on ideas, or the accuracy gains from your ML pipeline level off, you can move on to trying other models.

Deep learning

One of the most interesting and promising developments in data representation, we would say, is deep learning. It is very popular in tensor computing applications and artificial neural network (ANN) systems. Using the deep learning technique, the network learns how to represent data at different levels.

This gives you an exponentially more powerful ability to represent the data you have. Spark can take advantage of this and can be used to improve deep learning. For a more general discussion, please refer to https://en.wikipedia.org/wiki/Deep_learning, and to learn how to deploy pipelines on a cluster with TensorFlow, see https://www.tensorflow.org/.

Recent research and development at Databricks (see https://databricks.com/) has shown that Spark can also be used to find the best set of hyperparameters for ANN training. The advantage is that Spark can perform this computation 10 times faster than a normal deep learning or neural network training workflow.

Consequently, your model training time can be reduced by up to 10 times, with a 34% lower error rate. Moreover, Spark can apply a trained ANN model to a large amount of data, so you can deploy your ML model at scale. We will discuss deep learning further, as advanced machine learning, in later chapters.
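The idea of fanning hyperparameter combinations out in parallel can be sketched with plain Python. Here, threads stand in for Spark executors, and `train_and_score` is a hypothetical placeholder that returns a made-up validation error instead of training a real ANN; the grid values are likewise invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(params):
    """Hypothetical stand-in: return a validation error for one combination.

    A real version would train a neural network with these hyperparameters;
    this fake error surface has its minimum at lr=0.1, hidden=64.
    """
    lr, hidden = params
    return abs(lr - 0.1) + abs(hidden - 64) / 100.0

# The full hyperparameter grid: every (learning rate, hidden units) pair
grid = list(product([0.01, 0.1, 1.0], [16, 64, 256]))

# Evaluate all combinations in parallel, the way Spark would distribute
# the grid across a cluster (here, threads stand in for executors)
with ThreadPoolExecutor(max_workers=4) as pool:
    errors = list(pool.map(train_and_score, grid))

best_params = grid[errors.index(min(errors))]
print(best_params)
```

Because each combination is independent, the search parallelizes trivially, which is precisely what makes it a good fit for a distributed engine such as Spark.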
