Classification of banking clients according to their loan default status using machine learning algorithms.

Academic year: 2023

Share "Classification of banking clients according to their loan default status using machine learning algorithms."

Copied!
164
0
0

Loading.... (view fulltext now)

Full text

When comparing the random forest model using PCA to the random forest model using feature selection, the results showed only marginal differences across the performance metrics analyzed. The random forest model using PCA used 48 variables, while the random forest model using feature selection used only 18 variables, and thus appeared to be more appropriate for the classification problem under study.

Background

The financial institution under study has made numerous changes to loan application policies and procedures to address the increase in defaults; however, no improvements were made to post-disbursement procedures. Improving the post-disbursement process is likely to help the institution recover more of its outstanding debt and reduce the number of customers who miss payments on their loans.

Literature review

A 2021 study used tree-based machine learning algorithms to predict whether new customers are likely to default on their loans, in order to decide whether to lend to them. The results showed that the random forest algorithm performed better than the decision tree algorithm.

Problem statement

Research questions, aim and objectives

Significance of the study

Theoretical Framework

There are several feature selection methods that can be used to reduce the number of features in a model. The three main categories of feature selection methods are filter methods, wrapper methods, and embedded methods.

Project layout

In this study, several classification machine learning algorithms, namely logistic regression, decision trees, random forests, k-nearest neighbours, the naive Bayes algorithm, support vector machines, and artificial neural networks, are fitted to the default dataset using both PCA and feature selection approaches. Evaluation metrics such as accuracy, balanced accuracy, recall (true positive ratio), specificity (true negative ratio), precision (positive predictive value), negative predictive value, the Gini coefficient, and the AUC score are used to evaluate and compare the performance of the machine learning models and to identify the model that performs best.

Data

Did the applicant take out a loan with another company less than 45 days before the loan included in our data set? Did the applicant receive a lower offer, once the information was verified, compared to the original offer received during the application process?

Table 2.2 presents the client information variables, provides a description for each variable, and indicates whether these variables are numerical or categorical.

Data exploration

Homeowners default at 5.5%, while non-homeowners default at 12.4%. The arrears variable in Figure 2.7 shows that 10.5% of the customers studied have arrears, while 89.5% have no arrears.

Figure 2.2: Distribution and percentage of clients defaulting for variables in the demographics subgroup

Principal Component Analysis (PCA)

From Figure 2.15, 7 principal components are required to explain 80% of the variance, the threshold chosen by the researcher. In Figure 2.16, the correlation coefficients for all pairs of principal components are close to zero, which means that all 7 principal components are uncorrelated.

Figure 2.15 displays the cumulative percentage of variance explained by the number of principal components

Summary

The number of variables in the processed data set increased to 57 before dimensionality reduction techniques were applied (the original data contained 48 variables). The increase in the number of variables resulted from encoding categorical variables using the dummy variable method.

Brief introduction to machine learning

In this chapter, the researcher gives a brief introduction to machine learning, explains the theory of the classification algorithms used in this study and presents the performance measures against which the models will be evaluated.

Logistic regression

In most cases, the target variable (i.e., the response variable) has only two possible outcomes: either the event occurs (Y=1, e.g., the customer defaults) or the event does not occur (Y=0, e.g., the customer does not default). The test statistic D asymptotically follows a χ² distribution with n−p degrees of freedom, i.e., the degrees of freedom equal the number of parameters in the saturated model (n) minus the number of parameters in the fitted model (p) (Badi, 2017).

Decision tree (ID3, C4.5)

Often, fitting a decision tree until all leaf nodes consist of observations belonging to the same class results in overfitting (Podgorelec et al., 2002). When it overfits, the decision tree classifies the training set rather than the general population, and the algorithm becomes very specific to the training set (Podgorelec et al., 2002).

Figure 3.2 displays a decision tree that consists of both categorical and continuous variables

Random forest

During the building process of the random forest model, a training set must be created. An increase in the strength of the trees in the random forest will result in a decrease in the error rate of the random forest (Fawagreh et al., 2014; Tyralis et al., 2019).

Support vector machines

The hyperplane will pass through the origin in the absence of 𝑏 (Bhavsar & Panchal, 2012; Rampisela & Rustam, 2018). The hyperplane in the input space corresponds to a non-linear decision function, which is based on the kernel used (Hearst et al., 1998).

Figure 3. 4: Example of support vector machine structure with misclassifications
Figure 3. 4: Example of support vector machine structure with misclassifications

Naive Bayesian algorithm

When the data are numerical, a kernel density estimate can be used to estimate 𝑃(𝑋𝑖 = 𝑥𝑖|𝑌 = 𝑦). During the classification phase, the naive Bayes classifier given by Equation (3.6.8) (Krichene, 2017) is used to identify the class with the highest probability, to which the new observation is assigned.

K-nearest neighbours

The Pearson correlation coefficient is used to derive the Pearson correlation distance (Alfeilat et al., 2019). During the third step, the distance between the new data point and each training data point (using the selected distance metric) is calculated (Ali et al., 2019).

Figure 3.5: Impact of selected k value on model’s prediction

Artificial neural network

The input information is multiplied by weights and passed on to the processing components in the hidden layer of the artificial neural network (Lallahem & Mania, 2002). Gradient descent optimization is one of the most popular algorithms used to optimize an artificial neural network by minimizing the error function (Feng & Lu, 2019).

Figure 3.6: Structure of a feed-forward neural network

Evaluation metrics

True positive ratio (sensitivity/recall): recall describes the percentage of positive cases that the classification algorithm correctly identified. True negative ratio (specificity): specificity describes the percentage of negative cases that the classification algorithm correctly identified.

Table 3.1 displays the structure of a confusion matrix. TP, FP, TN and FN can be used to derive evaluation metrics such as accuracy, balanced accuracy, true positive ratio, true negative ratio, positive predictive value, negative predictive value and false positive ratio.

Logistic regression

A one-unit increase in principal component 6 is associated with an increase in the odds that the customer will default, holding all other variables constant. Being married is associated with a decrease in the odds that a customer will default, all other variables remaining constant (Married_yes equals 1 if married and 0 if not married).

Table 4.2b: Interpretation of the odds ratio estimates for the 29 significant variables in the fitted logistic regression model

Decision tree

This indicates that the decision tree appears to correctly classify a significant number of defaulters, but also to misclassify a significant number of non-defaulters. The evaluation metrics in Table 4.6 are then explored to gain a better understanding of the model's performance.

Figure 4.2: Mean absolute SHAP value for each feature in the decision tree

Random forest

Thus, although the random forest model correctly identified a significant number of defaulting customers (1,336 out of 1,811), the model also misclassified many non-defaulters as defaulters. The true positive ratio of 0.738 is then analyzed; the model correctly identified about 74% of defaulting customers.

Figure 4.4: Mean absolute SHAP value for each feature in the random forest

Support vector machines

From Table 4.8, a balanced accuracy score of approximately 70% is reported, suggesting that the model performed well overall. The balanced accuracy score is greater than the accuracy score of 0.656; this indicates that the model correctly identified a larger proportion of customers in the minority class, i.e., the default class.

Naïve Bayes Classifier

From Table 4.10, the researcher discusses the balanced accuracy score, the true positive ratio and the true negative ratio. Next, the true negative ratio of 0.625 in Table 4.12 is analyzed; the model did not perform well in identifying non-defaulters, but the true negative ratio is still acceptable as it is close to 65%.

K-nearest neighbours (K-NN)

From Table 4.14, the researcher discusses the balanced accuracy score, the true positive ratio, and the true negative ratio. The true negative ratio shown in Table 4.14 is then analyzed to gain insight into the performance of the model in identifying non-defaulters; the true negative ratio of 0.67 is acceptable as misclassification costs associated with false positives are low.

Table 4.13: Confusion matrix for the K-NN model

Artificial neural network

This suggests that although the model misclassified a significant number of non-defaulting customers, most of the customers that were classified as non-defaulting did not default. The true positive ratio (sensitivity) of 0.739 is examined next; it suggests that the model correctly identified approximately 74% of the customers who defaulted.

Table 4.16: Performance metrics for the ANN model

Summary of model performance using PCA

The researcher then analyzes the true negative ratio; this represents the percentage of actual non-defaulters who were correctly classified as non-defaulters. The researcher considers these to be good scores given the nature of the classification problem being studied.

Table 4.18: Evaluation metrics for each model under study using the PCA approach

Brief introduction to feature selection

To explain the recursive feature elimination technique, the researcher discusses it using the random forest algorithm. In this study, a tree-based feature importance method was used with the random forest model.

Figure 5.1: Feature importance for all variables in the random forest model

Logistic regression

External subsequent lending is associated with an increase in the odds of a customer defaulting when all other variables are held constant. A one-unit increase in YearsWithCurrentEmployer is associated with a ((1 − e^β) × 100)% reduction in the odds of a customer defaulting when all other variables are held constant, where β is the variable's fitted coefficient.

Table 5.3: Performance metrics for the logistic regression model using feature selection and the model using the full set of features

Decision tree

Thus, the decision tree using feature selection performed better when predicting the default class and the decision tree using all features under consideration performed slightly better when predicting the non-default class. The true negative ratio of 0.585 obtained by the decision tree using feature selection was lower than the true negative ratio score of 0.601 obtained by the decision tree using all features.

Table 5.6: Confusion matrix for the decision tree using feature selection and the decision tree using the full set of features

Random forest

The true negative ratio of 0.645 achieved by the model using only selected features was slightly lower than the true negative ratio of 0.654 achieved by the model using the full set of features. From this table, the AUC score and Gini coefficient of 0.745 and 0.491, respectively, obtained by the model using feature selection, were both slightly higher than the AUC score and Gini coefficient of 0.744 and 0.488, respectively, obtained by the model that included all features.

Support vector machine

From Table 5.11, the balanced accuracy score of the SVM model using feature selection is 0.676, which is higher than the score of 0.661 obtained by the model using all features. From Table 5.11, it can be observed that the AUC score and Gini obtained by the model using selected features are 0.733 and 0.466 respectively; these values are higher than the AUC score and Gini obtained by the model using all features, namely 0.720 and 0.440, respectively.

Naïve Bayes classifier

Thus, the Naïve Bayes classifier using feature selection performed significantly better when identifying defaulters; however, the classifier using the full feature set correctly identified more non-defaulters. A summary of the performance metrics for the Naïve Bayes classifier that used the selected features and the one that used the full feature set is shown in Table 5.13.

Table 5.12: Confusion matrix for the Naïve Bayes classifier using feature selection, and the model using the full set of features

K-nearest neighbours

From Table 5.15, it can be observed that the K-NN model using feature selection obtained a true positive ratio of 0.642, while the K-NN model using the full feature set obtained a slightly higher true positive ratio of 0.658. All other evaluation metrics in Table 5.15 indicate better performance for the K-NN model that included only the selected features.

Artificial neural network

The true positive ratio achieved by the model using only the selected features was 0.722, while the true positive ratio achieved by the model using all features was 0.743. Although the ANN model achieved a lower true positive ratio with feature selection, the true positive ratio of 0.722 is still very good.

Summary of model performance using feature selection

Therefore, the performance metrics associated with the random forest model using the PCA approach are compared to the performance metrics associated with the random forest model that used feature selection. The ROC curves for the random forest classifier using feature selection and the random forest classifier using PCA are then analyzed in Figure 5.2.

Table 5.18: Confusion matrix for each model under study using feature selection

Conclusions

The random forest model using feature selection used 18 features, while the random forest model using PCA used 48 features. Therefore, the random forest model using feature selection seemed most suitable for the classification problem under study.

Limitations to the study

The third research question aimed to identify the main risk factors associated with defaulting customers; based on the recommended model (i.e., the random forest using feature selection), these factors are the customer's age, the number of years the customer has been with their current employer, the maximum amount offered to the customer, the maximum instalment-to-income ratio allowed for the customer, the percentage of the total offer taken up by the customer, and the customer's instalment amount as a percentage of their income. Thus, features related to a customer's age, income, and credit amount often appear to be important in models that aim to predict customers' default status.

Recommendations for further study

Principal Component Analysis

# Reconstructed fragment of the PCA preprocessing code; the opening of the
# snippet and the full column list passed to drop() were lost in extraction.
import pandas

dataset.reset_index(drop=True, inplace=True)  # plausible completion of the truncated first call
principalDf.reset_index(drop=True, inplace=True)
# Append the principal components, then drop the original columns they replace.
dataset = pandas.concat([dataset, principalDf], axis=1)
dataset.drop(['Age'], axis=1, inplace=True)  # ...remaining columns truncated in the source

# Restore column labels after scaling returned plain arrays.
X_train_scaled = pandas.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pandas.DataFrame(X_test_scaled, columns=X_test.columns)

Feature Selection


References
