Using a Decision Tree to classify credit risks

In this section, we will create a model to classify credit risks. We will only build the model here; we'll evaluate its performance and improve it in the next chapter.

As we did before, to create this example we'll download a dataset from the UCI Machine Learning Repository: the Statlog (German Credit Data) dataset. The source of the dataset is Professor Dr. Hans Hofmann from the Institut für Statistik und Ökonometrie, Universität Hamburg. The dataset classifies people, described by a set of attributes, as good or bad credit risks.

The dataset can be downloaded from the following link:

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

In the following screenshot, you can see the original form of this dataset. The screenshot shows us the top ten lines of the dataset. The dataset doesn't have a header line. It contains 20 attributes, and the last column is the target variable—1 for Good Credit and 2 for Bad Credit. The attributes are separated by a blank space, as shown in this screenshot:

Using a Decision Tree to classify credit risks

We prefer to work with a CSV file that has a header line and attributes separated by commas. For this reason, before loading the dataset into Rattle, we transform the original file a little with a spreadsheet editor.

To label each column, we used the information document provided with the dataset:

  • Column 1 (Status of existing checking account): A11: ... < 0 DM; A12: 0 <= ... < 200 DM; A13: ... >= 200 DM / salary assignments for at least 1 year; A14: no checking account
  • Column 2 (Duration in months): Numeric
  • Column 3 (Credit history): A30: no credits taken/all credits paid back duly; A31: all credits at this bank paid back duly; A32: existing credits paid back duly till now; A33: delay in paying off in the past; A34: critical account/other credits existing (not at this bank)
  • Column 4 (Purpose): A40: car (new); A41: car (used); A42: furniture/equipment; A43: radio/television; A44: domestic appliances; A45: repairs; A46: education; A47: (vacation - does not exist?); A48: retraining; A49: business; A410: others
  • Column 5 (Credit amount): Numeric
  • Column 6 (Savings account/bonds): A61: ... < 100 DM; A62: 100 <= ... < 500 DM; A63: 500 <= ... < 1000 DM; A64: ... >= 1000 DM; A65: unknown/no savings account
  • Column 7 (Present employment since): A71: unemployed; A72: ... < 1 year; A73: 1 <= ... < 4 years; A74: 4 <= ... < 7 years; A75: ... >= 7 years
  • Column 8 (Installment rate in percentage of disposable income): Numeric
  • Column 9 (Personal status and sex): A91: male: divorced/separated; A92: female: divorced/separated/married; A93: male: single; A94: male: married/widowed; A95: female: single
  • Column 10 (Other debtors/guarantors): A101: none; A102: co-applicant; A103: guarantor
  • Column 11 (Present residence since): Numeric
  • Column 12 (Property): A121: real estate; A122: if not A121: building society savings agreement/life insurance; A123: if not A121/A122: car or other, not in attribute 6; A124: unknown/no property
  • Column 13 (Age in years): Numeric
  • Column 14 (Other installment plans): A141: bank; A142: stores; A143: none
  • Column 15 (Housing): A151: rent; A152: own; A153: for free
  • Column 16 (Number of existing credits at this bank): Numeric
  • Column 17 (Job): A171: unemployed/unskilled - non-resident; A172: unskilled - resident; A173: skilled employee/official; A174: management/self-employed/highly qualified employee/officer
  • Column 18 (Number of people being liable to provide maintenance for): Numeric
  • Column 19 (Telephone): A191: none; A192: yes, registered under the customer's name
  • Column 20 (Foreign worker): A201: yes; A202: no
  • Column 21 (Target): 1: Good; 2: Bad
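If you prefer to script this step instead of using a spreadsheet, the same transformation can be done in R, the language Rattle is built on. The following is a minimal sketch under our own assumptions: the input file is the downloaded german.data, and the shortened column names are illustrative choices, not part of the official documentation.

```r
# Read the space-separated german.data file, attach column labels based on
# the table above, and save a comma-separated file with a header line.
german <- read.table("german.data", header = FALSE, sep = "")
names(german) <- c("Status.of.existing.checking.account", "Duration.in.months",
                   "Credit.history", "Purpose", "Credit.amount",
                   "Savings.account.bonds", "Present.employment.since",
                   "Installment.rate", "Personal.status.and.sex",
                   "Other.debtors.guarantors", "Present.residence.since",
                   "Property", "Age.in.years", "Other.installment.plans",
                   "Housing", "Number.of.existing.credits", "Job",
                   "Number.of.people.liable", "Telephone", "Foreign.worker",
                   "Target")
write.csv(german, "german_credit.csv", row.names = FALSE)
```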

To create our classifier, we will start by loading the data into Rattle and identifying the target variable. During the data load, we'll split the dataset into three datasets—the training dataset, the validation dataset, and the testing dataset. As we've explained in this chapter, we'll use the training dataset to create our model, the validation dataset to tune it, and the testing dataset to evaluate the final performance. We'll come back to this in the next chapter when we look at cross-validation. The following screenshot shows how to split the original dataset into three datasets:

Using a Decision Tree to classify credit risks
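For reference, the following R sketch mimics the 70/15/15 split that the Rattle Data tab performs; the variable names and the random seed are illustrative assumptions, not Rattle's internal code.

```r
# Split the observations into training (70%), validation (15%), and
# testing (15%) partitions.
set.seed(42)
credit <- read.csv("german_credit.csv", stringsAsFactors = TRUE)  # coded attributes as factors
n   <- nrow(credit)                       # 1,000 observations
idx <- sample(n)                          # random permutation of the row indices
train      <- credit[idx[1:round(0.70 * n)], ]
validation <- credit[idx[(round(0.70 * n) + 1):round(0.85 * n)], ]
testing    <- credit[idx[(round(0.85 * n) + 1):n], ]
```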

To create a Decision Tree, after loading the credit data, go to the Model tab. In this section, we will use Tree. We'll see the other models later in this chapter.

In the following screenshot, we can see that, to create a Decision Tree, Rattle offers us two algorithms, traditional and conditional. The traditional algorithm works as we've seen in this chapter. The conditional algorithm helps to address overfitting; this algorithm can work better than the traditional algorithm in many cases. To optimize our Tree, Rattle has six parameters. As we'll see in the next chapter, one of the most common problems of supervised learning is overfitting; these parameters will help us to avoid it by reducing the complexity of the resulting Tree:

  • Min Split: This is the minimum number of observations a node must contain before Rattle attempts to split it into new branches.
  • Min Bucket: This is the minimum number of observations allowed in each leaf.
  • Max Depth: This is the maximum depth of the tree.
  • Complexity: This parameter controls the minimum gain a split must provide in order to be kept. A high value produces a simple tree; a low value produces a more complex tree.
  • Priors: Sometimes the distribution of the target variable in the sample doesn't match the real distribution; imagine a dataset that over-represents sick patients. We can use this parameter to inform Rattle of the correct distribution of the target variable.
  • Loss Matrix: In the next chapter, we'll see that in some cases we need to distinguish between different kinds of misclassifications or errors. This parameter helps us to address that problem (see the rpart sketch after this list).
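In R terms, Priors and Loss Matrix map to the parms argument of rpart. The sketch below shows that mapping; the 0.7/0.3 prior and the cost of 5 for accepting a bad credit are illustrative values, not taken from the screenshots.

```r
# Hypothetical priors and loss matrix passed to rpart through parms.
library(rpart)
train$Target <- factor(train$Target)   # 1 = Good credit, 2 = Bad credit
loss <- matrix(c(0, 1,    # true class 1 (Good): cost 1 if predicted as Bad
                 5, 0),   # true class 2 (Bad): cost 5 if predicted as Good
               nrow = 2, byrow = TRUE)
fit <- rpart(Target ~ ., data = train, method = "class",
             parms = list(prior = c(0.7, 0.3), loss = loss))
```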

Finally, we have two important buttons: Rules and Draw. We will use these buttons just after creating our first tree, as shown here:

Using a Decision Tree to classify credit risks

Set the following parameters, as in the previous screenshot, and press Execute (a rough rpart equivalent is sketched after the list):

  • Min Split: 20
  • Max Depth: 20
  • Min Bucket: 7
  • Complexity: 0.0100
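These settings correspond roughly to the following rpart call; this is a sketch, not the exact code Rattle generates (you can inspect that in Rattle's Log tab), and it assumes train is the training partition created earlier.

```r
# Build the classification tree with the parameter values listed above.
library(rpart)
train$Target <- factor(train$Target)   # 1 = Good credit, 2 = Bad credit
tree <- rpart(Target ~ ., data = train, method = "class",
              control = rpart.control(minsplit = 20, minbucket = 7,
                                      maxdepth = 20, cp = 0.0100))
print(tree)   # the text view of the tree discussed below
```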

Rattle will create a tree and show it on the screen. The following screenshot shows the root node and the first two branches of the tree:

Using a Decision Tree to classify credit risks

In the second line, n= 700 is the size of the training set. Remember that our original dataset has 1,000 observations, but we've divided the complete dataset into training (70 percent), validation (15 percent), and testing (15 percent). For this reason, the size of the training dataset is 700.

In the fifth line, we see the root node. The number 1) is the node number; root denotes that this is the root node; 700 is the number of observations; 209 is the number of misclassified observations; 1 is the predicted (majority) value of the target variable at this node; and (0.70142857 0.29857143) is the distribution of the target variable. In our example, 70.14 percent of the observations at this node have target value 1 (good credit risk) and 29.86 percent have target value 2 (bad credit risk).
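Putting these pieces together, the printed output follows rpart's node format: node), split, n, loss, yval, (yprob). A hypothetical reconstruction of the lines described above, using only the values quoted in the text, would look like this:

```r
# n= 700
#
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
#
# 1) root 700 209 1 (0.70142857 0.29857143)
```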

Using a Decision Tree to classify credit risks

The following lines show us the second and third nodes:

Using a Decision Tree to classify credit risks

The symbol * at the end of the second node's line indicates that it is a leaf. Rattle uses the attribute Status.of.existing.checking.account to create the branch. If the value of this attribute is A13 (... >= 200 DM / salary assignments for at least 1 year) or A14 (no checking account), the observation belongs to the second node. This second node is a leaf with 326 observations classified as 1 (good credit risk), of which 44 are misclassified.

If the value of the attribute is A11 (... < 0 DM) or A12 (0 <= ... < 200 DM), the observation belongs to the third node. This node has 374 observations, but it's not a leaf node, so there are more branches below it.

Now, press the Draw button, and you'll have a graphical representation of the same tree, as shown here:

Using a Decision Tree to classify credit risks
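If you want to reproduce a similar drawing outside the Rattle window, the rattle package provides a plotting helper built on rpart.plot. A minimal sketch, assuming both packages are installed and tree is the model built earlier:

```r
# Draw the fitted tree; each node shows the predicted class and the
# class distribution of the observations that reach it.
library(rattle)
fancyRpartPlot(tree, main = "Decision Tree for the German Credit data")
```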

As we've seen, one advantage of trees is that they are easy to convert into rules, which in turn are easy to translate into other languages such as SQL or Qlik Sense script. Now press the Rules button to create the set of rules.
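In the underlying R session, the Rules button is roughly equivalent to calling the asRules() helper from the rattle package on the model; a minimal sketch, assuming tree is the rpart model built earlier:

```r
# Print one rule per leaf, with its cover (number of observations matched)
# and prob (probability of the predicted class).
library(rattle)
asRules(tree)
```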

In our example, Rattle generates 19 rules. In the next chapter we'll see how to evaluate the performance of this model. Now, we'll focus on understanding the rules and how to use them. In the following screenshot, we see the first rule:

Using a Decision Tree to classify credit risks

The rule in the previous screenshot is rule number 125. Nine observations fall under this rule (cover=9); these 9 observations represent 1 percent of the dataset. When an observation falls under this rule, the probability that the value of the target variable is 2 (Target=2) is 1.0 (prob=1.0).

This rule looks very specific because it fits a small number of observations perfectly; we'll improve the model in the following chapter.

Using Rattle to score new loan applications

As we've explained before, scoring is the process of predicting the output for new examples. We have two options to score new observations with our Decision Tree: we can code the Decision Tree rules in Qlik Sense, or we can use Rattle to score new observations automatically.

As you have seen before, the rules are easy to translate into an If-then structure that is easy to implement in any language. Imagine Rattle provides you with a set of 10 rules and the first rule is as follows:

Using Rattle to score new loan applications

In the following screenshot, we see how we can create a new attribute called Prediction. In this example, we only show the implementation of rule 109 using the Qlik Sense Data load editor, but we can use If-then structures to implement all the rules, as shown here:

Using Rattle to score new loan applications

Now, we have a new attribute in our table called Prediction that gives us a prediction for the credit risk.

Rattle provides us with an option to automatically score new observations. Using this option, we don't need to code the rules manually; for this reason, we will use Rattle to score new credit applications in this example.

In Rattle's Evaluate tab, there are different types of evaluation. In this section, we will use Score, as shown in the following screenshot. Under the type of evaluation, we choose the model we will use. In our example, we've only built a Tree model; for this reason, we will choose Tree.

Under the model, we must choose the data we want to score. The two most usual options are Testing and CSV File. We can score new observations contained in a CSV file by selecting the CSV File option. In our example, we will use the Testing option to score the testing dataset:

Using Rattle to score new loan applications

Finally, we have to choose the type of report we want to create. Choose Class and a category will be created for each observation. In the Include option, choose All to include all variables in the report. Press the Execute button and Rattle will create a CSV file with all original variables and a new one called rpart, as shown in this screenshot:

Using Rattle to score new loan applications
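For reference, this Score option corresponds roughly to the following R steps; a sketch, assuming testing is the partition created during the data load, with an output file name of our own choosing:

```r
# Predict a class for each observation in the testing partition and save the
# original variables plus the prediction; the new column is called rpart to
# match the file that Rattle produces.
scores <- testing
scores$rpart <- predict(tree, newdata = testing, type = "class")
write.csv(scores, "credit_scored_testing.csv", row.names = FALSE)
```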

Now, we have a file containing all the variables of the testing dataset and a prediction for each observation. In the next section, we will use Qlik Sense to create a visual application for the business user. With this application, business users will be able to explore the information about new loan applications.

Creating a Qlik Sense application to predict credit risks

In the previous section, we created a Decision Tree using Rattle and scored the testing dataset with the model. In this section, we'll use Qlik Sense to build a visual application to explore new loan applications.

The German Credit dataset contains two different types of input variables: numeric and categorical. In a categorical variable such as Purpose, each observation contains a coded value; the possible values for Purpose are A40, A41, A42, and so on, up to A410. Each value has a meaning; for example, A40 means a new car. In order to help the user understand and explore the data, we want to translate all categorical values into their meanings. As in Chapter 4, Creating Your First Qlik Sense Application, we'll add the descriptions in separate tables and build a data model such as the one in the following screenshot:

Creating a Qlik Sense application to predict credit risks

Remember that to link two tables, Qlik Sense needs two fields with exactly the same names.

Now, we need to create a table for each categorical variable containing the original value and its translation. For the variable Purpose, we'll create a table like the following:

Creating a Qlik Sense application to predict credit risks

Use a spreadsheet tool such as Microsoft Excel to create a file that contains a sheet for each categorical variable.

Now, we have two files: one containing the scored testing dataset, and another with the descriptions for all the categorical variables.

You learned how to load data into Qlik Sense in Chapter 4, Creating Your First Qlik Sense Application. In this example, we have a file with 14 sheets, or tables. If you want to load all of them, you can select all the sheets in the data load wizard, as in the following screenshot:

Creating a Qlik Sense application to predict credit risks

After loading the data, we create a visual application for the business user. You learned how to create this kind of application in Chapter 4, Creating Your First Qlik Sense Application, and Chapter 5, Clustering and Other Unsupervised Learning Methods. One benefit of Qlik Sense is that it provides self-service data visualization; each user can create their own charts depending on their interests. You can create whatever application you want; as an example, we've created an application with two sheets. The first sheet is an overview, and the second sheet contains a table showing all the details of the new applications, as shown in the following screenshot:

Creating a Qlik Sense application to predict credit risks