In this section, we will develop a credit risk pipeline that is commonly used in financial institutions such as banks and credit unions. First we will discuss what credit risk analysis is and why it is important before developing a Spark ML-based pipeline using a Random-Forest-based classifier. Finally, we will provide some performance improvement suggestions.
When an applicant applies for a loan and a bank receives that application, based on the applicant's profile, the bank has to make a decision whether to approve the loan application or not.
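In this regard, there are two types of risk associated with the bank's decision on the loan application. First, if the applicant is a good credit risk, that is, likely to repay the loan, rejecting the application costs the bank business. Second, if the applicant is a bad credit risk, that is, unlikely to repay the loan, approving the application results in a financial loss for the bank.
The second risk is generally considered the greater one, as the bank has a higher chance of not being reimbursed the borrowed amount.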
Therefore, most banks or credit unions evaluate the risks associated with lending money to a client, applicant, or customer. In business analytics, minimizing the risk tends to maximize the profit to the bank itself. In other words, maximizing the profit and minimizing the loss from a financial perspective is important.
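The trade-off between the two risks can be made concrete with a little arithmetic. The following is a minimal sketch, not a real credit policy: the class name, the method names, and the assumption that a default loses the entire principal are all hypothetical and chosen only for illustration.

```java
// Hypothetical sketch: expected outcome of approving a loan under an
// assumed probability of default. All figures are illustrative.
final class LoanDecision {

    /**
     * Expected profit of approving a loan.
     * With probability (1 - pDefault) the bank earns the interest;
     * with probability pDefault this sketch assumes the whole principal is lost.
     */
    static double expectedProfit(double pDefault, double principal, double interestRate) {
        double gainIfRepaid = principal * interestRate;
        double lossIfDefault = principal;
        return (1.0 - pDefault) * gainIfRepaid - pDefault * lossIfDefault;
    }

    /** Approve only when the expected profit is positive. */
    static boolean shouldApprove(double pDefault, double principal, double interestRate) {
        return expectedProfit(pDefault, principal, interestRate) > 0.0;
    }
}
```

Under these assumptions, a single default wipes out the interest earned on several repaid loans, which is why approving a bad credit risk is usually the costlier mistake.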
Often, the bank bases its decision on a loan application on various factors and parameters of the applicant, for example, the applicant's demographic and socio-economic conditions.
In this section, we will first discuss the credit risk dataset in detail in order to gain some insight. After that, we will look at how to develop a large-scale credit risk pipeline. Finally, we will provide some performance improvement suggestions toward better prediction accuracy.
The German Credit dataset was downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/. Although a detailed description of the dataset is available in the link, we provide some brief insights here in Table 3. The data contains credit-related data on 21 variables and the classification of whether an applicant is considered a good or a bad credit risk for 1000 loan applicants. Table 3 shows details about each variable that was considered before making the dataset available online:
Entry | Variable | Explanation
1 | creditability | Capable of repaying
2 | balance | Current balance
3 | duration | Duration of the loan being applied for
4 | history | Is there any bad loan history?
5 | purpose | Purpose of the loan
6 | amount | Amount being applied for
7 | savings | Monthly saving
8 | employment | Employment status
9 | instPercent | Interest percent
10 | sexMarried | Sex and marriage status
11 | guarantors | Are there any guarantors?
12 | residenceDuration | Duration of residence at the current address
13 | assets | Net assets
14 | age | Age of the applicant
15 | concCredit | Concurrent credit
16 | apartment | Residential status
17 | credits | Current credits
18 | occupation | Occupation
19 | dependents | Number of dependents
20 | hasPhone | If the applicant uses a phone
21 | foreign | If the applicant is a foreigner
Table 3: German credit dataset properties
Note that although Table 3 describes the variables in the dataset, the data file itself has no header row. Table 3 therefore lists the position, name, and meaning of each variable.
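Because the file has no header, column names have to be attached by position. The following is a minimal plain-Java sketch of that idea. The names are taken from the Credit class used later in this section, while the CreditColumns class, its label() method, and any sample line passed to it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: attaches names to the 21 positional values of one record.
final class CreditColumns {

    // Names in file order, following the fields of the Credit class.
    static final String[] NAMES = {
        "creditability", "balance", "duration", "history", "purpose", "amount",
        "savings", "employment", "instPercent", "sexMarried", "guarantors",
        "residenceDuration", "assets", "age", "concCredit", "apartment",
        "credits", "occupation", "dependents", "hasPhone", "foreign"
    };

    /** Splits one comma-separated record and maps each value to its column name. */
    static Map<String, String> label(String csvLine) {
        String[] values = csvLine.split(",", -1); // -1 keeps trailing empty fields
        if (values.length != NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + NAMES.length + " values but found " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length; i++) {
            row.put(NAMES[i], values[i].trim());
        }
        return row;
    }
}
```

The length check is deliberate: with no header to validate against, a record with a missing or extra field would otherwise shift every later value into the wrong column without any error.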
Several steps are involved: data loading, parsing, data preparation, training and test set preparation, model training, model evaluation, and result interpretation. Let's go through them one by one.
Step 1: Load required APIs and libraries
The following is the code for loading the required APIs and libraries:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.mllib.evaluation.RegressionMetrics;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
Step 2: Create a Spark session
The following code creates a Spark session:
static SparkSession spark = SparkSession.builder()
    .appName("CreditRiskAnalysis")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "E:/Exp/")
    .getOrCreate();
Step 3: Load and parse the credit risk dataset
Note that the dataset is in Comma-Separated Value (CSV) format. Now load and parse the dataset using the Databricks-provided CSV readers and prepare a Dataset of Row, as follows:
String csvFile = "input/german_credit.data";
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .load(csvFile);
Now, show the Dataset to get to know the exact structure, as follows:
df.show();
Step 4: Create an RDD of type Credit
Create an RDD of the typed class Credit, as follows:
JavaRDD<Credit> creditRDD = df.toJavaRDD().map(new Function<Row, Credit>() {
    @Override
    public Credit call(Row r) throws Exception {
        return new Credit(
            parseDouble(r.getString(0)),       // creditability
            parseDouble(r.getString(1)) - 1,   // balance
            parseDouble(r.getString(2)),       // duration
            parseDouble(r.getString(3)),       // history
            parseDouble(r.getString(4)),       // purpose
            parseDouble(r.getString(5)),       // amount
            parseDouble(r.getString(6)) - 1,   // savings
            parseDouble(r.getString(7)) - 1,   // employment
            parseDouble(r.getString(8)),       // instPercent
            parseDouble(r.getString(9)) - 1,   // sexMarried
            parseDouble(r.getString(10)) - 1,  // guarantors
            parseDouble(r.getString(11)) - 1,  // residenceDuration
            parseDouble(r.getString(12)) - 1,  // assets
            parseDouble(r.getString(13)),      // age
            parseDouble(r.getString(14)) - 1,  // concCredit
            parseDouble(r.getString(15)) - 1,  // apartment
            parseDouble(r.getString(16)) - 1,  // credits
            parseDouble(r.getString(17)) - 1,  // occupation
            parseDouble(r.getString(18)) - 1,  // dependents
            parseDouble(r.getString(19)) - 1,  // hasPhone
            parseDouble(r.getString(20)) - 1); // foreign
    }
});
The preceding code segment creates an RDD of type Credit, reading each variable as a double value through the parseDouble() method, which takes a string and returns the corresponding value as a double. The parseDouble() method is as follows:
public static double parseDouble(String str) {
    return Double.parseDouble(str);
}
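Several columns in the mapping code are passed through parseDouble() and then reduced by 1. The text does not state why. A plausible reading, treated here as an assumption, is that those columns hold category codes starting at 1, which are shifted so that they start at 0. The sketch below separates the two cases; the CreditParsing class and the parseCategory() helper are hypothetical names.

```java
// Hypothetical helpers that make the two parsing cases explicit.
final class CreditParsing {

    /** Parses a numeric field as-is, for example the loan amount or the age. */
    static double parseDouble(String str) {
        return Double.parseDouble(str.trim());
    }

    /**
     * Parses a category code and shifts it down by one.
     * Assumption: the codes in the file start at 1 and should start at 0.
     */
    static double parseCategory(String str) {
        return parseDouble(str) - 1;
    }
}
```

Naming the shift in one place keeps the long constructor call readable and makes it harder to drop the "- 1" from one column by accident.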
Now we need to know the structure of the Credit class, since it determines how the typed RDD is built. The Credit class is a plain JavaBean: it holds the 21 variables from the dataset as private fields, initializes them through its constructor, and exposes a getter and a setter method for each. Here is the class:
public class Credit {
    private double creditability;
    private double balance;
    private double duration;
    private double history;
    private double purpose;
    private double amount;
    private double savings;
    private double employment;
    private double instPercent;
    private double sexMarried;
    private double guarantors;
    private double residenceDuration;
    private double assets;
    private double age;
    private double concCredit;
    private double apartment;
    private double credits;
    private double occupation;
    private double dependents;
    private double hasPhone;
    private double foreign;

    public Credit(double creditability, double balance, double duration, double history,
            double purpose, double amount, double savings, double employment,
            double instPercent, double sexMarried, double guarantors,
            double residenceDuration, double assets, double age, double concCredit,
            double apartment, double credits, double occupation, double dependents,
            double hasPhone, double foreign) {
        super();
        this.creditability = creditability;
        this.balance = balance;
        this.duration = duration;
        this.history = history;
        this.purpose = purpose;
        this.amount = amount;
        this.savings = savings;
        this.employment = employment;
        this.instPercent = instPercent;
        this.sexMarried = sexMarried;
        this.guarantors = guarantors;
        this.residenceDuration = residenceDuration;
        this.assets = assets;
        this.age = age;
        this.concCredit = concCredit;
        this.apartment = apartment;
        this.credits = credits;
        this.occupation = occupation;
        this.dependents = dependents;
        this.hasPhone = hasPhone;
        this.foreign = foreign;
    }

    public double getCreditability() { return creditability; }
    public void setCreditability(double creditability) { this.creditability = creditability; }
    public double getBalance() { return balance; }
    public void setBalance(double balance) { this.balance = balance; }
    public double getDuration() { return duration; }
    public void setDuration(double duration) { this.duration = duration; }
    public double getHistory() { return history; }
    public void setHistory(double history) { this.history = history; }
    public double getPurpose() { return purpose; }
    public void setPurpose(double purpose) { this.purpose = purpose; }
    public double getAmount() { return amount; }
    public void setAmount(double amount) { this.amount = amount; }
    public double getSavings() { return savings; }
    public void setSavings(double savings) { this.savings = savings; }
    public double getEmployment() { return employment; }
    public void setEmployment(double employment) { this.employment = employment; }
    public double getInstPercent() { return instPercent; }
    public void setInstPercent(double instPercent) { this.instPercent = instPercent; }
    public double getSexMarried() { return sexMarried; }
    public void setSexMarried(double sexMarried) { this.sexMarried = sexMarried; }
    public double getGuarantors() { return guarantors; }
    public void setGuarantors(double guarantors) { this.guarantors = guarantors; }
    public double getResidenceDuration() { return residenceDuration; }
    public void setResidenceDuration(double residenceDuration) { this.residenceDuration = residenceDuration; }
    public double getAssets() { return assets; }
    public void setAssets(double assets) { this.assets = assets; }
    public double getAge() { return age; }
    public void setAge(double age) { this.age = age; }
    public double getConcCredit() { return concCredit; }
    public void setConcCredit(double concCredit) { this.concCredit = concCredit; }
    public double getApartment() { return apartment; }
    public void setApartment(double apartment) { this.apartment = apartment; }
    public double getCredits() { return credits; }
    public void setCredits(double credits) { this.credits = credits; }
    public double getOccupation() { return occupation; }
    public void setOccupation(double occupation) { this.occupation = occupation; }
    public double getDependents() { return dependents; }
    public void setDependents(double dependents) { this.dependents = dependents; }
    public double getHasPhone() { return hasPhone; }
    public void setHasPhone(double hasPhone) { this.hasPhone = hasPhone; }
    public double getForeign() { return foreign; }
    public void setForeign(double foreign) { this.foreign = foreign; }
}
If you look at the flow of the class, at first it declares 21 variables for the 21 features in the dataset. Then it initializes them using the constructor. The rest are simple setter and getter methods.
Step 5: Create a Dataset of type Row from the RDD of type Credit
The following code shows how to create a Dataset of type Row:
Dataset<Row> creditData = spark.sqlContext().createDataFrame(creditRDD, Credit.class);
Now save the Dataset as a temporary view, or more formally, a table in-memory for query purposes, as follows:
creditData.createOrReplaceTempView("credit");
Now let's get to know the schema of the table as follows:
creditData.printSchema();
Step 6: Create the feature vector using the VectorAssembler
Create a feature vector from the 20 predictor variables (every column except creditability, which serves as the label) using the VectorAssembler class of Spark, as follows:
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {
        "balance", "duration", "history", "purpose", "amount",
        "savings", "employment", "instPercent", "sexMarried", "guarantors",
        "residenceDuration", "assets", "age", "concCredit", "apartment",
        "credits", "occupation", "dependents", "hasPhone", "foreign" })
    .setOutputCol("features");
Step 7: Create a Dataset by combining and transforming the assembler
Create a Dataset by applying the assembler to the creditData Dataset created previously, and print its first 20 rows, as follows:
Dataset<Row> assembledFeatures = assembler.transform(creditData);
assembledFeatures.show();
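Conceptually, the assembler reads the listed input columns of each row, in the order given, and packs them into a single numeric vector. The plain-Java sketch below mimics that behaviour for one row. It illustrates the idea only and is not how Spark implements VectorAssembler; the FeatureAssemblerSketch class and its assemble() method are hypothetical.

```java
import java.util.Map;

// Hypothetical illustration of what assembling a feature vector means for one row.
final class FeatureAssemblerSketch {

    /** Copies the values of the named columns, in the given order, into one array. */
    static double[] assemble(Map<String, Double> row, String[] inputCols) {
        double[] features = new double[inputCols.length];
        for (int i = 0; i < inputCols.length; i++) {
            Double value = row.get(inputCols[i]);
            if (value == null) {
                throw new IllegalArgumentException("Missing column: " + inputCols[i]);
            }
            features[i] = value;
        }
        return features;
    }
}
```

The order of the input columns matters: position i of the resulting vector always corresponds to the i-th name, which is what lets per-feature outputs such as feature importances be mapped back to column names.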
Step 8: Create label for making predictions
Create a label column out of the creditability column of the preceding Dataset (Figure 38), as follows:
StringIndexer creditabilityIndexer = new StringIndexer()
    .setInputCol("creditability")
    .setOutputCol("label");
Dataset<Row> creditabilityIndexed = creditabilityIndexer
    .fit(assembledFeatures)
    .transform(assembledFeatures);
Now let's explore the new Dataset using the show() method as follows:
creditabilityIndexed.show();
From the preceding figure, we can see that the Dataset has only two labels, 1.0 and 0.0, which makes this a binary classification problem.
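One detail is easy to miss here: by default, StringIndexer orders the distinct values by descending frequency, so the most frequent creditability value receives the index 0.0, whatever its original value was. The plain-Java sketch below imitates that ordering. The LabelIndexerSketch class is hypothetical, and the alphabetical tie-breaking is an assumption of the sketch rather than a statement about Spark.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of frequency-based label indexing.
final class LabelIndexerSketch {

    /** Maps each distinct value to an index: 0.0 for the most frequent, then 1.0, and so on. */
    static Map<String, Double> fit(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // descending frequency
            return byCount != 0 ? byCount : a.compareTo(b);               // assumed tie-break
        });
        Map<String, Double> index = new LinkedHashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), (double) i);
        }
        return index;
    }
}
```

It is therefore worth checking which original creditability value ended up as label 1.0 before interpreting any metric that treats 1.0 as the positive class.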
Step 9: Prepare the training and test set
Prepare the training and test set as follows:
long splitSeed = 12345L;
Dataset<Row>[] splits = creditabilityIndexed.randomSplit(new double[] { 0.7, 0.3 }, splitSeed);
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
Here, roughly 70% of the rows go to the training set and 30% to the test set. The fixed long seed value makes the split reproducible, so that repeated runs do not produce different random splits.
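The effect of the seed can be shown without Spark. The sketch below assigns each row to the training or test set by an independent random draw, which is also why the resulting sizes are only approximately 70% and 30%. It is a simplified stand-in rather than Spark's randomSplit() implementation, and the SeededSplitSketch class is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a reproducible, seed-driven random split.
final class SeededSplitSketch {

    /** Returns two lists: index 0 holds the training rows, index 1 the test rows. */
    static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        Random rng = new Random(seed); // same seed, same sequence of draws
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (T row : rows) {
            if (rng.nextDouble() < trainFraction) {
                train.add(row);
            } else {
                test.add(row);
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(train);
        parts.add(test);
        return parts;
    }
}
```

Calling split() twice with the same rows and the same seed yields identical partitions, which keeps evaluation figures comparable from one run to the next.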
Step 10: Train the Random Forest model
To train the Random Forest model, use the following code:
RandomForestClassifier classifier = new RandomForestClassifier()
    .setImpurity("gini")
    .setMaxDepth(3)
    .setNumTrees(20)
    .setFeatureSubsetStrategy("auto")
    .setSeed(splitSeed);
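The "gini" impurity setting controls how each tree scores a candidate split: a node containing a mix of classes has high impurity, and a node containing a single class has an impurity of zero. The following sketch computes the standard Gini impurity, 1 minus the sum of squared class proportions, from class counts. The GiniSketch class is a hypothetical name used only for illustration.

```java
// Hypothetical illustration of the Gini impurity of a tree node.
final class GiniSketch {

    /** Gini impurity: 1 - sum over classes of (class proportion squared). */
    static double gini(long[] classCounts) {
        long total = 0;
        for (long count : classCounts) {
            total += count;
        }
        if (total == 0) {
            return 0.0; // an empty node is treated as pure in this sketch
        }
        double sumOfSquares = 0.0;
        for (long count : classCounts) {
            double proportion = (double) count / total;
            sumOfSquares += proportion * proportion;
        }
        return 1.0 - sumOfSquares;
    }
}
```

For example, a node with 7 rows of one class and 3 of the other has an impurity of 1 - (0.49 + 0.09) = 0.42, while an evenly mixed two-class node reaches the maximum of 0.5.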
As previously mentioned, this is a binary classification problem. Therefore, after fitting the Random Forest model on the training data, we will evaluate it using a binary evaluator on the label column, as follows:
RandomForestClassificationModel model = classifier.fit(trainingData);
BinaryClassificationEvaluator evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label");
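Now we apply the model to the test set to obtain the predictions on which the performance metrics will be computed, and print the learned trees for inspection, as follows:
Dataset<Row> predictions = model.transform(testData);
System.out.println(model.toDebugString());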
Step 11: Print the performance parameters
We will print several performance figures: the score returned by the binary evaluator (labelled as accuracy in the output), followed by the Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R squared, and explained variance. Let's do it as follows:
// Note: BinaryClassificationEvaluator reports areaUnderROC unless another metric is set
double accuracy = evaluator.evaluate(predictions);
System.out.println("Accuracy after pipeline fitting: " + accuracy);
// Regression-style error metrics from the older MLlib API
RegressionMetrics rm = new RegressionMetrics(predictions);
System.out.println("MSE: " + rm.meanSquaredError());
System.out.println("MAE: " + rm.meanAbsoluteError());
// rootMeanSquaredError() returns the RMSE itself, despite the printed label
System.out.println("RMSE Squared: " + rm.rootMeanSquaredError());
System.out.println("R Squared: " + rm.r2());
System.out.println("Explained Variance: " + rm.explainedVariance() + " ");
The preceding code segment generates the following output:
Accuracy after pipeline fitting: 0.7622000403307129
MSE: 1.926235109206349E7
MAE: 3338.3492063492063
RMSE Squared: 4388.8895055655585
R Squared: -1.372326447615067
Explained Variance: 1.1144695981899707E7
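If you look at the performance metrics in Step 11, the credit risk predictions are not satisfactory: the first figure is only about 0.762. Note that BinaryClassificationEvaluator reports the area under the ROC curve by default, so this value measures how well the model ranks the two classes rather than its accuracy or precision, that is, the share of correct predictions. The regression metrics should be read with caution as well: errors in the thousands cannot result from comparing 0/1 predictions with 0/1 labels, which suggests they were computed over columns other than the prediction and the label. Since lending is a sensitive financial domain, better predictive performance is clearly desirable.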
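Since BinaryClassificationEvaluator reports the area under the ROC curve by default, it helps to see how that figure differs from plain accuracy. The sketch below computes both for a handful of scored rows. The area under the ROC curve is computed from its rank-based definition, the share of positive-negative pairs in which the positive row has the higher score, with ties counted as half. This illustrates the two metrics rather than Spark's implementation, and the BinaryMetricsSketch class is a hypothetical name.

```java
// Hypothetical illustration contrasting accuracy with area under the ROC curve.
final class BinaryMetricsSketch {

    /** Share of rows whose thresholded score matches the 0/1 label. */
    static double accuracy(double[] labels, double[] scores, double threshold) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            double predicted = scores[i] >= threshold ? 1.0 : 0.0;
            if (predicted == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    /** Share of (positive, negative) pairs ranked correctly; ties count as half. */
    static double areaUnderRoc(double[] labels, double[] scores) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != 1.0) {
                continue;
            }
            for (int j = 0; j < labels.length; j++) {
                if (labels[j] != 0.0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        return pairs == 0 ? Double.NaN : wins / pairs;
    }
}
```

With labels {1, 1, 0, 0} and scores {0.9, 0.4, 0.6, 0.2}, the accuracy at a threshold of 0.5 is 0.5, whereas the area under the ROC curve is 0.75, so the two figures should not be used interchangeably.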
Now, if you want to improve the prediction performance, you could try training a model other than the Random-Forest-based classifier, for example, a Logistic Regression or Naive Bayes classifier.
Moreover, you can use an SVM-based classifier or the neural-network-based Multilayer Perceptron classifier. In Chapter 7, Tuning Machine Learning Models, we will look at how to tune the hyperparameters in order to select the best model.