In this section, we will create a model to classify credit risks. We will only build the model here; we'll evaluate its performance and improve it in the next chapter.
As we did before, we'll download a dataset from the UCI Machine Learning Repository to create this example. We'll use the Statlog (German Credit Data) dataset. The source of the dataset is Professor Dr. Hans Hofmann from Institut für Statistik und Ökonometrie, Universität Hamburg. The dataset classifies people described by a set of attributes as good or bad credit risks.
The dataset can be downloaded from the following link:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
In the following screenshot, you can see the original form of this dataset. The screenshot shows us the top ten lines of the dataset. The dataset doesn't have a header line. It contains 20 attributes, and the last column is the target variable—1 for Good Credit and 2 for Bad Credit. The attributes are separated by a blank space, as shown in this screenshot:
We prefer to work with a CSV file that has a header line and comma-separated attributes. For this reason, before loading the dataset into Rattle, we use a spreadsheet editor to transform the original file.
To label each column, we used the information in the following document provided with the dataset:
Column | Label | Values |
---|---|---|
1 | | |
2 | | Numeric |
3 | | |
4 | | |
5 | | Numeric |
6 | | |
7 | | |
8 | | Numeric |
9 | | |
10 | | |
11 | | Numeric |
12 | | |
13 | | Numeric |
14 | | |
15 | | |
16 | | Numeric |
17 | | |
18 | | Numeric |
19 | | |
20 | | |
21 | | |
To create our classifier, we will start by loading the data into Rattle and identifying the target variable. During the data load, we'll split the dataset into three datasets—the training dataset, the validation dataset, and the testing dataset. As we've explained in this chapter, we'll use the training dataset to create our model, the validation dataset to tune it, and the testing dataset to evaluate the final performance. We'll come back to this in the next chapter when we look at cross-validation. The following screenshot shows how to split the original dataset into three datasets:
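Rattle performs this split for us, but the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_dataset`, the seed value, and the stand-in list of 1,000 record identifiers are ours and are not taken from Rattle.

```python
import random


def split_dataset(rows, train=0.70, validate=0.15, seed=42):
    """Shuffle rows and split them into training, validation, and testing lists."""
    shuffled = list(rows)
    # A fixed seed (an arbitrary choice here) makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validate]
    testing = shuffled[n_train + n_validate:]  # whatever remains
    return training, validation, testing


# Stand-in for the 1,000 credit records: we only need identifiers here.
observations = list(range(1000))
training, validation, testing = split_dataset(observations)
print(len(training), len(validation), len(testing))  # 700 150 150
```

Each observation ends up in exactly one of the three lists, which is what lets the testing dataset give an unbiased view of the final model.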
To create a Decision Tree, after loading the credit data, go to the Model tab. In this section, we will use Tree. We'll see the other models later in this chapter.
In the following screenshot, we can see that, to create a Decision Tree, Rattle offers us two algorithms, traditional and conditional. The traditional algorithm works as we've seen in this chapter. The conditional algorithm helps to address overfitting; this algorithm can work better than the traditional algorithm in many cases. To optimize our Tree, Rattle has six parameters. As we'll see in the next chapter, one of the most common problems of supervised learning is overfitting; these parameters will help us to avoid it by reducing the complexity of the resulting Tree:
Finally, we have two important buttons: Rules and Draw. We will use these buttons just after creating our first tree, as shown here:
Set the parameters as in the previous screenshot and press Execute:
Rattle will create a tree and show it on the screen. In the following screenshot, we've shown the root node and the first two branches of the tree:
In the second line, n= 700 is the size of the training set. Remember that our original dataset has 1,000 observations, but we've divided the complete dataset into training (70 percent), validation (15 percent), and testing (15 percent). For this reason, the size of the training dataset is 700.
In the fifth line, we see the root node. The number 1) is the node number; root denotes that this is the root node; 700 is the number of observations; 209 is the number of misclassified observations; 1 is the class predicted at this node (the most frequent value of the target variable); and (0.70142857 0.29857143) is the distribution of the target variable. In our example, 70.14 percent of the observations belong to class 1 (good credit risk) and 29.86 percent belong to class 2 (bad credit risk).
The following lines show us the second and third nodes:
In the second node, the symbol * indicates that it's a leaf. Rattle uses the attribute Status.of.existing.checking.account to create a branch. If the value of this attribute is A13 (>= 200 DM/salary assignments for at least 1 year) or A14 (no checking account), the observation belongs to the second node. This second node is a leaf with 326 observations; it predicts class 1 (good credit risk), and 44 of those observations are misclassified.
If the value of the attribute is A11 (... < 0 DM) or A12 (0 <= ... < 200 DM), the observation belongs to the third node. This node has 374 observations, but it's not a leaf node, so under this node, we'll have more branches.
Now, press the Draw button, and you'll have a graphical representation of the same tree, as shown here:
As we've seen, one advantage of trees is that they are easy to convert into rules, and these rules are easy to translate into other languages such as SQL or the Qlik Sense script language. Now press the Rules button to create the set of rules.
In our example, Rattle generates 19 rules. In the next chapter we'll see how to evaluate the performance of this model. Now, we'll focus on understanding the rules and how to use them. In the following screenshot, we see the first rule:
The rule in the previous screenshot is rule number 125. Nine observations fall under this rule (cover=9); they are about 1 percent of the training dataset. When an observation falls under this rule, the probability that the value of the target variable is 2 (Target=2) is 1.0 (prob=1.0).
This rule looks very specific because it fits perfectly into a small number of observations; we'll improve it in the following chapter.
As we've explained before, scoring is the process of predicting the output for new examples. We have two options to score new observations with our Decision Tree: we can code the Decision Tree rules in Qlik Sense, or we can use Rattle to automatically score new observations.
As you have seen before, the rules are easy to translate to an If then structure that is easy to implement in any language. Imagine Rattle provides you with a set of 10 rules and the first rule is as follows:
In the following screenshot, we see how we can create a new attribute called Prediction. In this example, we will just see the implementation of rule 109 using the Qlik Sense Data load editor, but we can use If then structures to implement all the rules, as shown here:
Now, we have a new attribute in our table called Prediction that gives us a prediction for the credit risk.
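The same If then idea can be sketched outside Qlik Sense. The following Python example is an illustration only: it does not implement rule 109, whose conditions appear only in the screenshot. Instead, it encodes the leaf described earlier in this section (checking account status A13 or A14 predicts class 1). The function name `predict` and the two sample records are hypothetical.

```python
STATUS = "Status.of.existing.checking.account"


def predict(record):
    """Return 1 (good credit risk) when the leaf rule matches, otherwise None."""
    if record[STATUS] in ("A13", "A14"):
        return 1
    # The remaining rules of the tree are not implemented in this sketch.
    return None


# Hypothetical records, reduced to the single attribute the rule needs.
applications = [{STATUS: "A14"}, {STATUS: "A11"}]
for record in applications:
    record["Prediction"] = predict(record)

print([record["Prediction"] for record in applications])  # [1, None]
```

Implementing the full tree means adding one If then branch per rule, so that every observation receives a prediction.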
Rattle provides an option to automatically score new observations. Using this option, we don't need to manually code the rules; for this reason, we will use Rattle to score new credit applications in this example.
In Rattle's Evaluate tab, there are different types of evaluation. In this section, we will use Score, as shown in the following screenshot. Under the type of evaluation, we choose the model we will use. In our example, we've only built a Tree model, so we will choose Tree.
Under the model, we must choose the data we want to score. The two most common options are Testing and CSV File. We can score new observations contained in a CSV file by selecting the CSV File option. In our example, we will use the Testing option to score the testing dataset:
Finally, we have to choose the type of report we want to create. Choose Class and a category will be created for each observation. In the Include option, choose All to include all variables in the report. Press the Execute button and Rattle will create a CSV file with all original variables and a new one called rpart, as shown in this screenshot:
Now, we have a file containing all the variables of the testing dataset and a prediction for each observation. In the next section, we will use Qlik Sense to create a visual application for the business user. With this application, business users will be able to explore information about new applications.
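Before moving to Qlik Sense, it can be useful to check the scored file quickly. The following Python sketch counts the predictions per class. The column name rpart comes from the file Rattle creates; everything else is an assumption for illustration: the in-memory sample stands in for the real file, and the column Credit.amount and all the values are invented.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample standing in for the scored CSV file.
sample_csv = io.StringIO(
    "Purpose,Credit.amount,rpart\n"
    "A40,1200,1\n"
    "A43,5800,2\n"
    "A42,2300,1\n"
)

# csv.DictReader uses the header line to name the fields of each row.
counts = Counter(row["rpart"] for row in csv.DictReader(sample_csv))
print(counts["1"], counts["2"])  # 2 1
```

To run the same check on the real output, replace the in-memory sample with the file opened by `open(path, newline="")`.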
In the previous section, we've created a Decision Tree using Rattle and we've scored the testing dataset using the model we created. In this section, we'll use Qlik Sense to build a visual application to explore new loan applications.
The German Credit dataset contains two different types of input variables: numeric and categorical. In a categorical variable such as Purpose, each observation contains one value from a fixed set; possible values for Purpose include A40, A41, A42, A43, A44, and A45. Each value has a meaning; for example, A40 means a new car. In order to help the user to understand and explore the data, we want to translate all categorical values to their meanings. As in Chapter 4, Creating Your First Qlik Sense Application, we'll add the descriptions in separate tables and we'll build a data model like the one in the following screenshot:
Remember that to link two tables, Qlik Sense needs two fields with exactly the same names.
Now, we need to create a table for each categorical variable containing the original value and its translation. For the variable Purpose, we'll create a table like the following:
Use a spreadsheet tool, such as Microsoft Excel, to create a file that contains a sheet for each categorical variable.
Now, we have two files: one containing the scored testing dataset and one containing all the descriptions for the categorical variables.
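Conceptually, linking a description table to the scored data is a lookup on the shared field. The following Python sketch shows the idea for Purpose. Only the meaning of A40 is stated in this section; the other descriptions are our reading of the documentation distributed with the dataset and should be checked against it. The scored rows and the field name Purpose description are hypothetical.

```python
# Code-to-description table for Purpose (verify against the dataset documentation).
purpose_descriptions = {
    "A40": "car (new)",
    "A41": "car (used)",
    "A42": "furniture/equipment",
    "A43": "radio/television",
    "A44": "domestic appliances",
    "A45": "repairs",
}

# Hypothetical scored rows, reduced to two fields.
scored = [
    {"Purpose": "A40", "rpart": "1"},
    {"Purpose": "A43", "rpart": "2"},
]

# The lookup works because both tables share the key field Purpose.
for row in scored:
    row["Purpose description"] = purpose_descriptions.get(row["Purpose"], "unknown")

print([row["Purpose description"] for row in scored])
```

Qlik Sense makes this association automatically when both tables contain a field with exactly the same name.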
In Chapter 4, Creating Your First Qlik Sense Application, you learned how to load data into Qlik Sense. In this example, we have a file with 14 sheets or tables. If you want to load all sheets, you can select all of them in the data load wizard, as in the following screenshot:
After loading the data, we create a visual application for the business user. You learned how to create this kind of application in Chapter 4, Creating Your First Qlik Sense Application, and Chapter 5, Clustering and Other Unsupervised Learning Methods. One benefit of Qlik Sense is self-service data visualization, which means that each user can create their own charts depending on their interests. You can create the application you want; as an example, we've created an application with two sheets. The first sheet is an overview and the second sheet contains a table to see all the details of new applications, as shown in the following screenshot: