Credit risk is the risk that a borrower will fail to repay the amount owed to the lender. This can happen on account of the borrower's poor financial condition, and the lender bears the risk of loss, which may be complete or partial, due to non-payment, disrupted cash flows, and increased collection costs. There are multiple scenarios in which a lender can suffer a loss; some of them are given here:
Credit risk management is the practice of mitigating such losses by understanding the adequacy of a bank's capital and loan loss reserves at any given time. To reduce credit risk, the lender needs a mechanism for performing a credit check on the prospective borrower. Banks generally quantify credit risk using two metrics: expected loss (EL) and economic capital (EC). Expected loss is the value of a possible loss multiplied by the probability of that loss occurring. Economic capital is the amount of capital necessary to cover unexpected losses. Three risk parameters are essential in the process of calculating the EL and EC measurements: the probability of default (PD), loss given default (LGD), and exposure at default (EAD). The calculation of PD is the most important of these, so we will be discussing it.
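To make the expected loss relation concrete, here is a minimal sketch in R; the PD, LGD, and EAD figures are hypothetical, chosen purely for illustration:

> # Expected loss = PD * LGD * EAD
> PD  <- 0.02      # probability of default (2%)
> LGD <- 0.45      # fraction of the exposure lost if default occurs (45%)
> EAD <- 1000000   # exposure at default, in currency units
> EL  <- PD * LGD * EAD
> EL               # 9000: the expected loss on this exposure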
For building the PD model, let us use a subset of the German Credit Data available in R. The data used for the analysis is loaded by executing the following code:
> library(caret)
> data(GermanCredit)
> LRData <- GermanCredit[,1:10]
Before starting the modeling, we need to understand the data, which can be done by executing the following code:
> str(LRData)
It gives us the column types and the kind of values each column holds, as shown here:
Figure 7.11: Column description of the dataset
In this example, our target variable is Class. Class = Good means non-defaulter and Class = Bad means defaulter. Now, to understand the distribution of all the numeric variables, we can compute the basic statistics of the numeric attributes. This can be done by executing the following code:
> summary(LRData)
A sample of the output generated by the preceding code is displayed here:
Figure 7.12: Basic statistics of numeric variables
Now let us prepare our data for modeling by executing the following code:
> set.seed(100)
> library(caTools)
> res = sample.split(LRData$Class, 0.6)
> Train_data = subset(LRData, res == TRUE)
> Test_data = subset(LRData, res == FALSE)
The preceding code generates the Train and Test data for modeling. The proportion used to split the Train and Test data is quite subjective; here we use a 60/40 split. At this point we can also compute basic statistics for imputing missing/outlier values and carry out exploratory analysis (such as information value analysis and a correlation matrix) of the independent variables with respect to the dependent variable to understand the relationships.
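As a minimal sketch of such exploratory checks (the exact set of analyses is up to the modeler), we can inspect the class distribution and the correlations among the numeric predictors:

> # Class distribution in the training sample
> prop.table(table(Train_data$Class))
> # Correlation matrix of the numeric predictors
> num_cols <- sapply(Train_data, is.numeric)
> round(cor(Train_data[, num_cols]), 2)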
Now let us try to fit the model on the Train data, which can be done by executing the following code:
> lgfit = glm(Class ~ ., data = Train_data, family = "binomial")
> summary(lgfit)
It generates the summary of the model as displayed here:
Figure 7.13: Output summary of logistic regression
As we can see in the summary, judging by the p-values, the model contains both significant and insignificant attributes. Keeping in mind the significance of the attributes and multicollinearity, we can iterate the model to find the best one. In our case, let us rerun the model with only the significant attributes.
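One common way to screen for multicollinearity is the variance inflation factor (VIF). A sketch using the vif() function from the car package (an assumption on our part, as this chapter does not prescribe a specific diagnostic) might look like this:

> library(car)
> vif(lgfit)   # values well above 5-10 suggest problematic collinearity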
This can be done by executing the following code:
> lgfit = glm(Class ~ Duration + InstallmentRatePercentage + Age, data = Train_data, family = "binomial")
> summary(lgfit)
It generates the summary output as follows:
Figure 7.14: Output summary of logistic regression having only significant attributes
The output summary shows that all the attributes considered in the model are significant.
There are a lot of statistics in logistic regression for checking model accuracy; in this case, we will use the ROC curve and the confusion matrix for accuracy checks.
We can compute the classification threshold using the KS statistic, but here let us assume a threshold value of 0.5 and try to score our Train sample by executing the following code:
> Train_data$predicted.risk = predict(lgfit, newdata = Train_data, type = "response")
> table(Train_data$Class, as.numeric(Train_data$predicted.risk >= 0.5))
It generates the confusion matrix as displayed here:
Figure 7.15: Confusion matrix for logistic regression
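From the counts in the confusion matrix, we can derive a simple overall accuracy figure; a sketch reusing the table from the preceding code:

> conf_mat <- table(Train_data$Class, as.numeric(Train_data$predicted.risk >= 0.5))
> sum(diag(conf_mat)) / sum(conf_mat)   # overall accuracy at the 0.5 threshold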
Now, let us compute the AUC by executing the following code:
> library(ROCR)
> pred = prediction(Train_data$predicted.risk, Train_data$Class)
> as.numeric(performance(pred, "auc")@y.values)
It gives the value of the AUC as shown here:
0.67925265
Now, let us plot the ROC curve by executing the following code:
> predict_Train = predict(lgfit, type = "response")
> ROCpred = prediction(predict_Train, Train_data$Class)
> ROCperf = performance(ROCpred, "tpr", "fpr")
> plot(ROCperf)
It plots the ROC curve as shown here:
Figure 7.16: ROC curve
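Although we assumed a 0.5 threshold earlier, the KS statistic mentioned before can be derived from the same ROCR performance object, since KS is the maximum separation between the true positive rate and the false positive rate across thresholds. A sketch:

> tpr <- unlist(ROCperf@y.values)
> fpr <- unlist(ROCperf@x.values)
> max(tpr - fpr)   # KS statistic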
To validate the model, we can use the same model fit created on Train_data to score Test_data and check whether the accuracy measures fall in the same range.
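A sketch of such out-of-sample validation, reusing the objects defined above, might look like this:

> Test_data$predicted.risk = predict(lgfit, newdata = Test_data, type = "response")
> table(Test_data$Class, as.numeric(Test_data$predicted.risk >= 0.5))
> pred_test = prediction(Test_data$predicted.risk, Test_data$Class)
> as.numeric(performance(pred_test, "auc")@y.values)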