Chapter 6. Decision Trees and Other Supervised Learning Methods

In the previous chapter, we introduced Machine Learning, unsupervised methods, and supervised methods. We focused on unsupervised learning, described some algorithms, and also concentrated on classifiers. We took time to study cluster analysis, focusing on centroid-based algorithms, and we also looked at hierarchical clustering.

We used Rattle to process customer data in order to create different clusters of customers, and then, we used Qlik Sense to visualize these different clusters.

The objective of this chapter is to introduce you to supervised learning. As I explained in the previous chapter, in supervised learning, the computer analyzes a set of examples to learn how to predict the output of a new situation.

We'll focus on Decision Tree Learning, or Decision Trees, because they're widely used and the knowledge learned by the tree is easy to translate to rules in any software, such as Qlik Sense. These rules are easy to understand for human experts.

In supervised learning, we split the dataset into three datasets: training, validation, and test. The training dataset usually contains 70 percent of the original observations; our algorithm uses this dataset in the training phase to learn by example. The validation and test datasets usually contain 15 percent of the original observations each. We'll use the validation dataset to fine-tune our algorithm, and after the fine-tuning, we'll use the test dataset to evaluate the final performance of our algorithm. These three datasets match the three phases of a supervised algorithm: training, validation (or tuning), and test (or performance evaluation).
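The 70/15/15 partition described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the code Rattle runs; the function name `split_dataset`, the default seed value, and the stand-in list of 1,000 observations are assumptions made for the example.

```python
import random

def split_dataset(rows, train_frac=0.70, validation_frac=0.15, seed=42):
    """Shuffle the rows and cut them into training, validation, and test lists."""
    shuffled = list(rows)                    # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # random order, so each subset is a fair sample
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever is left, about 15 percent
    return training, validation, test

# Hypothetical stand-in for 1,000 observations (for example, loan applications).
observations = list(range(1000))
training, validation, test = split_dataset(observations)
print(len(training), len(validation), len(test))
```

Every observation lands in exactly one of the three subsets, so the test dataset stays unseen during training and tuning.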

In this chapter:

We'll describe the main concepts of Decision Tree Learning.
We'll review the algorithm and the possible applications, and we'll see examples based on these algorithms.
Then, we'll use Rattle and Qlik Sense to create an application to classify new loan applications into low risk applications and high risk applications. We'll load that data into Qlik Sense and create a few example visualizations.
After Decision Trees, we'll look at ensemble methods and Support Vector Machines.
Finally, we'll look at Neural Networks, which can be used for supervised or unsupervised learning, and at statistical methods such as Regression or Survival Analysis.

Partitioning datasets and model optimization

As we've explained, in supervised learning, we split the dataset into three subsets: training, validation, and testing.

To create the model or learner, Rattle uses the training dataset. After creating a model, we use the validation data to evaluate its performance. To improve the performance, we can use different tuning options, depending on the algorithm we're using. After tuning, we rebuild the model and evaluate its performance again. This is an iterative process; we create the model and evaluate it until we're satisfied with its performance.
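The build, evaluate, and tune cycle can be sketched as a small loop. To keep the sketch short, it swaps in a tiny nearest-neighbour classifier as a stand-in learner instead of a Decision Tree, and treats the number of neighbours `k` as the tuning option. The data, the labels, and the helper names `knn_predict`, `accuracy`, and `tune` are all hypothetical.

```python
from collections import Counter

def knn_predict(training, x, k):
    """Predict the majority label among the k training points closest to x."""
    nearest = sorted(training, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(training, evaluation, k):
    """Fraction of evaluation rows whose label the model predicts correctly."""
    hits = sum(knn_predict(training, x, k) == label for x, label in evaluation)
    return hits / len(evaluation)

def tune(training, validation, candidates):
    """Build the model once per tuning option and keep the best validation score."""
    scores = {k: accuracy(training, validation, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data: (feature value, risk label). The row (3.0, "low") is noise.
training = [(1.0, "high"), (2.0, "high"), (3.0, "low"), (4.0, "high"),
            (6.0, "low"), (7.0, "low"), (8.0, "low"), (9.0, "low")]
validation = [(2.5, "high"), (3.5, "high"), (6.5, "low"), (8.5, "low")]

best_k, scores = tune(training, validation, candidates=(1, 3, 5))
print(best_k, scores)
```

The model only ever learns from the training rows; the validation rows are used solely to compare tuning options.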

For simplicity, in this chapter, we'll see only model creation, and in the following chapter, we'll see model optimization, but in real life, this is an iterative process.

The examples in this chapter will not have any optimization.

Finally, when you're happy with the model, you can use the testing dataset to confirm its performance. You need the testing dataset because you've already used the validation dataset to optimize the model. You need to be sure that the optimizations you've made work for all data, not just for the validation data.

Rattle splits the data randomly to ensure that each dataset is representative. However, when we optimize the model and test it again, we need to be able to repeat the same experiment exactly, with the same data. Only then can we know whether we're improving the model's performance. To solve this problem, Rattle splits the dataset using a pseudorandom number generator. Every time we split the dataset using the same Seed, we'll get the same subsets.
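The effect of the Seed can be illustrated with Python's own pseudorandom number generator. This is a sketch of the general idea, not of Rattle's internals; the function name `training_indices` and the seed values 42 and 7 are arbitrary choices for the example.

```python
import random

def training_indices(n_rows, seed, train_frac=0.70):
    """Pick the row numbers of the training subset with a seeded generator."""
    rng = random.Random(seed)   # the same seed replays the same "random" sequence
    return sorted(rng.sample(range(n_rows), round(n_rows * train_frac)))

first_run = training_indices(100, seed=42)
second_run = training_indices(100, seed=42)
other_seed = training_indices(100, seed=7)

print(first_run == second_run)  # same seed, so the two subsets are identical
print(first_run == other_seed)  # a different seed will almost always pick other rows
```

Because the split is repeatable, any change in performance between two runs can be attributed to the tuning and not to a different sample of the data.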
