When comparing the random forest model using PCA to the random forest model using feature selection, the results showed only a marginal difference in each performance metric analyzed. The random forest model using PCA used 48 variables, while the random forest model using feature selection used only 18 variables and therefore appeared more appropriate for the classification problem under study.
Background
The financial institution under study has made numerous changes to its loan application policies and procedures to address the increase in defaults; however, no improvements were made to post-disbursement procedures. Improving the post-disbursement process is likely to help the financial institution recover more outstanding debt and reduce the number of customers who miss payments on their loans.
Literature review
One study (2021) used tree-based machine learning algorithms to predict whether new customers are likely to default on their loans, in order to decide whether to lend to them. The results showed that the random forest algorithm performed better than the decision tree algorithm.
Problem statement
Research questions, aim and objectives
Significance of the study
Theoretical Framework
There are several feature selection methods that can be used to reduce the number of features in the model. The three main categories of feature selection methods are the filter method, the wrapper method, and the embedded method.
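As a brief, hedged illustration (assuming the scikit-learn library, consistent with the Python/pandas stack used in the appendix; the selector names and parameter values are illustrative, not taken from the study), the three categories could be instantiated as follows:

    # Sketch of the three feature selection categories (X, y denote the
    # feature matrix and the binary default target).
    from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # Filter method: score each feature independently of any model (ANOVA F-test here).
    filter_selector = SelectKBest(score_func=f_classif, k=10)

    # Wrapper method: repeatedly fit a model and discard the weakest features.
    wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)

    # Embedded method: importances are produced as a by-product of model training.
    embedded_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))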
Project layout
In this study, several classification machine learning algorithms, namely logistic regression, decision trees, random forest, k-nearest neighbours, the naive Bayes algorithm, support vector machines, and artificial neural networks, are fitted to the default dataset using both PCA and feature selection approaches. Evaluation metrics such as accuracy, balanced accuracy, recall, specificity, precision, negative predictive value, the Gini coefficient, and the AUC score are used to evaluate and compare the performance of the machine learning models and to identify the model that performed best.
Data
Two indicator variables record additional credit behaviour: whether the applicant took out a loan with another company less than 45 days before the loan included in our data set, and whether the applicant received a lower offer, once the information was verified, than the original offer received during the application process.

Data exploration
Homeowners default at a rate of 5.5%, while non-homeowners default at a rate of 12.4%. The arrears variable in Figure 2.7 shows that 10.5% of the customers studied have arrears, while 89.5% have none.

Principal Component Analysis (PCA)
From Figure 2.15, 7 principal components are required to explain the 80% of total variance threshold chosen by the researcher. In Figure 2.16, the correlation coefficients for all pairs of principal components are close to zero, which means that the 7 principal components are uncorrelated.
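As a minimal sketch (assuming scikit-learn and numpy; X_train_scaled denotes the scaled training data produced in the appendix code), the number of components needed to reach the 80% threshold could be found as follows:

    import numpy
    from sklearn.decomposition import PCA

    # Fit PCA and locate the smallest number of components whose cumulative
    # explained variance ratio reaches the 80% threshold (7 in this study).
    pca = PCA()
    pca.fit(X_train_scaled)
    cumulative = numpy.cumsum(pca.explained_variance_ratio_)
    n_components = int(numpy.argmax(cumulative >= 0.80)) + 1
    print(n_components)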

Summary
The number of variables in the processed data set increased to 57 before the application of dimensionality reduction techniques (the original data included 48 variables). The increase in the number of variables resulted from encoding the categorical variables using the dummy variable method.
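For illustration, a minimal sketch of the dummy variable encoding with pandas (the column name Married is an assumption, inferred from the Married_yes dummy mentioned later in the text):

    import pandas

    # Each level of a categorical variable becomes a 0/1 dummy column,
    # e.g. Married -> Married_yes; this is what grew the 48 variables to 57.
    dataset = pandas.get_dummies(dataset, columns=['Married'], drop_first=True)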
Brief introduction to machine learning
In this chapter, the researcher gives a brief introduction to machine learning, explains the theory of the classification algorithms used in this study and presents the performance measures against which the models will be evaluated.
Logistic regression
In most cases, the target variable (i.e., the response variable) has only two possible outcomes: the event occurs (Y = 1, e.g., the customer defaults) or the event does not occur (Y = 0, e.g., the customer does not default). The test statistic D asymptotically follows a χ² distribution with n − p degrees of freedom, i.e., the degrees of freedom equal the number of parameters in the saturated model (n) minus the number of parameters in the fitted model (p) (Badi, 2017).
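For reference, the standard forms behind this discussion are sketched below (textbook expressions; the notation may differ from the document's own equations):

\[
\log\frac{P(Y=1 \mid \mathbf{x})}{1 - P(Y=1 \mid \mathbf{x})} = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1},
\qquad
D = -2\,(\ell_{\text{fitted}} - \ell_{\text{saturated}}) \;\sim\; \chi^2_{n-p} \ \text{(asymptotically)}.
\]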
Decision tree (ID3, C4.5)
Often, fitting a decision tree until all leaf nodes consist of observations belonging to the same class results in overfitting (Podgorelec et al., 2002). When it overfits, the decision tree memorizes the training set rather than generalizing to the broader population, and the algorithm becomes highly specific to the training set (Podgorelec et al., 2002).
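As a hedged sketch (scikit-learn assumed; the hyperparameter values are illustrative), overfitting can be limited by constraining tree growth or by cost-complexity pruning rather than growing every leaf pure:

    from sklearn.tree import DecisionTreeClassifier

    # Stop before leaves become pure (depth/leaf-size limits) and apply
    # cost-complexity pruning (ccp_alpha) to keep the tree from memorizing
    # the training set.
    tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=50,
                                  ccp_alpha=0.001, random_state=0)
    tree.fit(X_train, y_train)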

Random forest
During the building process of the random forest model, a training set must be created. An increase in the strength of the trees in the random forest will result in a decrease in the error rate of the random forest (Fawagreh et al., 2014; Tyralis et al., 2019).
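A minimal sketch (scikit-learn assumed; parameter values illustrative) of this construction, where each tree is grown on a bootstrap sample of the training set:

    from sklearn.ensemble import RandomForestClassifier

    # bootstrap=True grows each tree on a resampled training set, and
    # max_features='sqrt' decorrelates the trees by randomizing the splits.
    forest = RandomForestClassifier(n_estimators=500, bootstrap=True,
                                    max_features='sqrt', random_state=0)
    forest.fit(X_train, y_train)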
Support vector machines
The hyperplane will pass through the origin in the absence of the bias term b (Bhavsar & Panchal, 2012; Rampisela & Rustam, 2018). The hyperplane corresponds to a non-linear decision function in the input space, which depends on the kernel used (Hearst et al., 1998).
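As a brief sketch (scikit-learn assumed; parameters illustrative), a kernel SVM in which the fitted intercept plays the role of b and the RBF kernel yields a non-linear decision function in the input space:

    from sklearn.svm import SVC

    # The intercept b is fitted by default; with an RBF kernel the separating
    # hyperplane in feature space maps to a non-linear boundary in input space.
    svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
    svm.fit(X_train_scaled, y_train)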

Naive Bayesian algorithm
When the data are numerical, a kernel density estimate can be used to estimate P(X_i = x_i | Y = y). During the classification phase, the naive Bayes classifier given by Equation (3.6.8) (Krichene, 2017) is used to identify the class with the highest probability, to which the new observation is assigned.
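The standard forms of these two expressions are sketched below for reference (textbook versions; the notation of the document's Equation (3.6.8) may differ), where n_y is the number of training observations in class y, h is the bandwidth, and K is the kernel function:

\[
\hat{P}(X_i = x_i \mid Y = y) = \frac{1}{n_y h} \sum_{j:\, y_j = y} K\!\left(\frac{x_i - x_{ij}}{h}\right),
\qquad
\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{d} P(X_i = x_i \mid Y = y).
\]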
K-nearest neighbours
The Pearson correlation coefficient is used to derive the Pearson correlation distance (Alfeilat et al., 2019). During the third step, the distance between the new data point and each training data point (using the selected distance metric) is calculated (Ali et al., 2019).
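As a hedged sketch (scikit-learn assumed; k and the metric are illustrative — the Euclidean metric is shown, while the Pearson correlation distance discussed above would require a custom or scipy-based metric):

    from sklearn.neighbors import KNeighborsClassifier

    # For each new observation, compute its distance to every training point
    # under the chosen metric and vote over the k nearest neighbours.
    knn = KNeighborsClassifier(n_neighbors=15, metric='euclidean')
    knn.fit(X_train_scaled, y_train)
    predictions = knn.predict(X_test_scaled)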

Artificial neural network
The input information is multiplied by weights and passed on to the processing components in the hidden layer of the artificial neural network (Lallahem & Mania, 2002). Gradient descent optimization is one of the most popular algorithms used to optimize an artificial neural network by minimizing the error function (Feng & Lu, 2019).
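A minimal sketch (scikit-learn assumed; the architecture and learning rate are illustrative) of a feed-forward network whose weights are fitted by gradient descent on the error function:

    from sklearn.neural_network import MLPClassifier

    # Inputs are multiplied by weights and passed through one hidden layer;
    # stochastic gradient descent ('sgd') minimizes the training error.
    ann = MLPClassifier(hidden_layer_sizes=(32,), solver='sgd',
                        learning_rate_init=0.01, max_iter=500, random_state=0)
    ann.fit(X_train_scaled, y_train)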

Evaluation metrics
True positive ratio (sensitivity/recall): recall describes the percentage of positive cases that the classification algorithm correctly identified. True negative ratio (specificity): specificity describes the percentage of negative cases that the classification algorithm correctly identified.
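For illustration, both ratios (and the balanced accuracy used throughout the results chapters) can be computed directly from a confusion matrix; a sketch assuming scikit-learn, with default as the positive class:

    from sklearn.metrics import confusion_matrix

    # ravel() unpacks the 2x2 matrix as (tn, fp, fn, tp), with non-default as
    # the negative class and default as the positive class.
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    sensitivity = tp / (tp + fn)                  # true positive ratio (recall)
    specificity = tn / (tn + fp)                  # true negative ratio
    balanced_accuracy = (sensitivity + specificity) / 2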

Logistic regression
A one-unit increase in principal component 6 is associated with an increase in the probability that the customer will default, holding all other variables constant. A value of 1 for Married_yes is associated with a decrease in the probability that a customer will default, all other variables remaining constant (Married_yes equals 1 if the customer is married and 0 otherwise).

Decision tree
This indicates that the decision tree appears to correctly classify a significant number of defaulters, but also to misclassify a significant number of non-defaulters. The evaluation metrics in Table 4.6 are then explored to gain a better understanding of the model's performance.

Random forest
Thus, although the random forest model correctly identified a significant number of customers who defaulted (1336 out of 1811), the model also misclassified many non-defaulters as defaulters. The true positive ratio of 0.738 is then analyzed; the model correctly identified about 74% of defaulting customers.

Support vector machines
From Table 4.8, a balanced accuracy score of approximately 70% is reported, suggesting that the model performed well overall. The balanced accuracy score is greater than the accuracy score of 0.656; this indicates that the model correctly identified a larger proportion of customers in the minority class, i.e., the default class.
Naïve Bayes Classifier
From Table 4.10, the researcher discusses the balanced accuracy score, the true positive ratio, and the true negative ratio. Next, the true negative ratio of 0.625 in Table 4.12 is analyzed; the model did not perform well in identifying non-defaulters, but the true negative ratio is still acceptable as it is close to 65%.
K-nearest neighbours (K-NN)
From Table 4.14, the researcher discusses the balanced accuracy score, the true positive ratio, and the true negative ratio. The true negative ratio shown in Table 4.14 is then analyzed to gain insight into the performance of the model in identifying non-defaulters; the true negative ratio of 0.67 is acceptable as misclassification costs associated with false positives are low.

Artificial neural network
This suggests that although the model misclassified a significant number of non-defaulting customers, most of the customers classified as non-defaulting did indeed not default. We next examine the true positive ratio (sensitivity) of 0.739; it suggests that the model correctly identified approximately 74% of customers who defaulted.

Summary of model performance using PCA
The researcher then analyzes the true negative ratio; this represents the percentage of actual non-defaulters who were correctly classified as non-defaulters. The researcher considers these to be good scores given the nature of the classification problem being studied.

Brief introduction to feature selection
To explain the recursive feature elimination technique, the researcher discusses it using the random forest algorithm. In this study, a tree-based feature importance method based on the random forest model was used.
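A hedged sketch (scikit-learn assumed; parameter values illustrative) of recursive feature elimination driven by random forest importances, in the spirit of the approach described:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    # Each round fits a random forest, ranks the features by importance, and
    # drops the weakest, until 18 features (as in this study) remain.
    rfe = RFE(estimator=RandomForestClassifier(n_estimators=200, random_state=0),
              n_features_to_select=18, step=1)
    rfe.fit(X_train, y_train)
    selected_features = X_train.columns[rfe.support_]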

Logistic regression
External subsequent lending is associated with an increase in the odds of a customer defaulting when all other variables are held constant. A one-unit increase in YearsWithCurrentEmployer is associated with a reduction in the odds of a customer defaulting when all other variables are held constant; the percentage reduction is given by (1 − e^β) × 100%, where β is the estimated coefficient.

Decision tree
Thus, the decision tree using feature selection performed better when predicting the default class and the decision tree using all features under consideration performed slightly better when predicting the non-default class. The true negative ratio of 0.585 obtained by the decision tree using feature selection was lower than the true negative ratio score of 0.601 obtained by the decision tree using all features.

Random forest
The true negative ratio of 0.645 achieved by the model using only selected features was slightly lower than the true negative ratio of 0.654 achieved by the model using the full set of features. From this table, the AUC score and Gini coefficient of 0.745 and 0.491, respectively, obtained by the model using feature selection were both slightly higher than the AUC score and Gini coefficient of 0.744 and 0.488, respectively, obtained by the model that included all features.
Support vector machine
From Table 5.11, the balanced accuracy score of the SVM model using feature selection is 0.676, which is higher than the score of 0.661 obtained by the model using all features. From Table 5.11, it can be observed that the AUC score and Gini obtained by the model using selected features are 0.733 and 0.466 respectively; these values are higher than the AUC score and Gini obtained by the model using all features, namely 0.720 and 0.440, respectively.
Naïve Bayes classifier
Thus, the Naïve Bayes classifier using feature selection performed significantly better when identifying non-defaulters; however, the classifier using the full feature set correctly identified more defaulters. A summary of the performance metrics for the Naïve Bayes classifier that used the selected features and the one that used the full feature set is shown in Table 5.13.

K-nearest neighbours
From Table 5.15, it can be observed that the K-NN model using feature selection obtained a true positive ratio of 0.642, while the K-NN model using the full feature set obtained a slightly higher true positive ratio of 0.658. All other evaluation metrics in Table 5.15 favour the K-NN model that included only the selected features.
Artificial neural network
The true positive ratio achieved by the model using only the selected features was 0.722, while the true positive ratio achieved by the model using all features was 0.743. Although the ANN model achieved a lower true positive ratio with feature selection, the true positive ratio of 0.722 is still very good.
Summary of model performance using feature selection
Therefore, the performance metrics associated with the random forest model using the PCA approach are compared to those associated with the random forest model that used feature selection. The ROC curves for the random forest classifier using feature selection and the random forest classifier using PCA are then analyzed in Figure 5.2.

Conclusions
The random forest model using feature selection used 18 features, while the random forest model using PCA used 48 features. Therefore, the random forest model using feature selection seemed most suitable for the classification problem under study.
Limitations to the study
The third research question aimed to identify the main risk factors associated with defaulting customers; based on the recommended model (i.e., the random forest using feature selection), these factors are the customer's age, the number of years the customer has been with their current employer, the maximum amount offered to the customer, the maximum instalment-to-income ratio allowed for the customer, the percentage of the total offer taken up by the customer, and the instalment amount as a percentage of the customer's income. Thus, features related to a customer's age, income, and credit amount often appear to be important in models that aim to predict customers' default status.
Recommendations for further study
Principal Component Analysis
    import pandas

    # Convert the scaled arrays back to DataFrames, keeping the original column names.
    X_train_scaled = pandas.DataFrame(X_train_scaled, columns=X_train.columns)
    X_test_scaled = pandas.DataFrame(X_test_scaled, columns=X_test.columns)

    # Align indices, append the principal components to the dataset, and drop
    # the original variables that the components replace.
    dataset.reset_index(drop=True, inplace=True)
    principalDf.reset_index(drop=True, inplace=True)
    dataset = pandas.concat([dataset, principalDf], axis=1)
    dataset.drop(['Age', ...], axis=1, inplace=True)  # column list truncated in the source
Feature Selection