1. Introduction
The fintech ecosystem is a living organism, growing and transforming along with technological development and driven by consumers' demands for ubiquity, instantaneity, and user experience. The term fintech became widely used around 2015, but the merger of finance and technology is not a novelty (Arner et al., 2015, 2016). We have experienced fintech ever since the creation of automated teller machines and e-banking, which streamlined access to financial products and connected financial institutions with consumers. The Cambridge Centre for Alternative Finance (CCAF) has been tracking the fintech ecosystem since 2015, generating survey-based information about the industry's development and valuable insights that have served as a benchmark for participants at a global level. For 2020, they collected information from 703 firms, compared to 205 European firms in 2015 and around three hundred firms in the Americas in 2016. These reports show how the growth of the fintech ecosystem reflects a complex dynamic between individuals and their need for access to financing and liquidity.
According to the 2nd Global Alternative Finance Market Benchmarking report prepared by the CCAF, the most representative market in the fintech ecosystem is crowdlending, or peer-to-peer (P2P) consumer lending. It has a global market volume of USD 3.5 billion, ahead of all other business models in the ecosystem. This business model creates a virtual marketplace where investors and borrowers meet. While investors choose the loans they will fund according to their risk appetite, the platform generates a credit scoring model to assess borrowers' creditworthiness (Ziegler et al., 2021).
We can attribute the growth of this market to the credit access restrictions that followed the 2007 Global Recession in developed countries and to impaired financial access in emerging economies (Brunnermeier, 2009). These platforms seized the opportunity and developed an internet-based marketplace that brings borrowers and investors together, creating a business model with high transaction rates. Likewise, the marketplace owes much of its success to allowing the unbanked population to participate in the P2P dynamic, which advances financial inclusion goals while steadily raising the number of financial services consumers.
The success of P2P lending markets is evident and justified. The platforms belonging to this market have forged a trusting relationship with users, offering transparent, fast, and convenient access to financing. Early platforms such as ZOPA (UK based, founded in 2005), LendingClub, and Prosper (USA based, founded in 2007) inspired the creation of more platforms in other countries. China is the best example. From 2007 its P2P consumer lending market grew exponentially, making it a significant competitor in the fintech ecosystem until 2018. China's Banking Regulatory Commission passed strict regulations for the P2P marketplace in 2016, and several platforms could not comply. These requirements restricted platforms from pooling loans and offering credit services, so they can act only as intermediaries. The Chinese P2P lending market started to crash in 2018, facing fraud and delinquency scandals. By 2020 the CCAF omitted China from the global fintech report due to the dramatic fall in P2P lending market activities (Stern et al., 2017).
Although regulation triggered the turmoil in the Chinese P2P lending market, platforms elsewhere are not exempt from difficulties in returning capital and recovering loans under more flexible regulation. Therefore, some regulatory requirements now oblige the platform to share the risk with investors (e.g., regulation in some countries now demands that platforms participate in funding the loans).
Risk is the cornerstone of this mechanism, and it relates mainly to the creditworthiness of borrowers and the platform's capacity to tell bad borrowers from good ones. Therefore, P2P lending platforms aim to mitigate the default risk or to compensate for it through attractive yields that earn investors' trust. For risk mitigation, each platform develops its own credit model that determines the default probabilities and repayment capacities of potential borrowers. To achieve this, platforms usually rely on historical data and design quantitative models. Access to such datasets is challenging, mainly for startups, which rely on benchmarks to construct their credit models.
This research analyzes a public dataset from a well-known P2P lending platform, LendingClub (https://www.lendingclub.com/), to identify a set of predictors of payment or default. Some of these predictors include credit history, loan characteristics, borrower characteristics, and the probability of default. We review other available research papers that study the LendingClub dataset in different periods and with different methodologies. In this study, we extract the information for the whole operation of this platform (2007-2020) and analyze which of the 140 available predictors are suitable for a standard data-driven classification and prediction model. The methodology involves testing a Random Forest algorithm to identify feature importance in the default prediction.
Results show that we can develop a model with only a few variables and achieve accurate classification metrics. From the 140 variables available, we selected nine according to the feature importance provided by the Random Forest algorithm. We also address the class imbalance in the target variable with SMOTE oversampling of default observations. We identified only a marginal improvement over the analysis with the original class imbalance, which we attribute to the Random Forest algorithm's capacity to handle this condition.
This study contributes to credit risk assessment for the fintech ecosystem. We show that credit history variables are key determinants of default prediction for the LendingClub dataset. The layout of this paper is as follows. Section 2 presents the literature review; section 3 explains the model; sections 4 and 5 describe the dataset and the methodology, including descriptive statistics, respectively. Results for the model are discussed in section 6, followed by conclusions in section 7.
2. Literature Review
Research on the P2P lending market falls into two broad groups. The first group studies social and behavioral aspects of the mechanism to identify what motivates investors to fund a specific loan and participate in a P2P lending market. Several approaches find that borrowers' characteristics, such as their photograph (Gonzalez & Loureiro, 2014), profile, and social networks, together with a description of the loan purpose, influence investors' decisions. Another set of studies relates investors' decisions to herd behavior in P2P lending platforms, demonstrating that investors prefer to fund loans that already have a certain percentage of funders to share the risk with (Lee & Lee, 2012; Zhang & Liu, 2012). Further studies try to assess the information asymmetry present in these marketplaces and mitigate it to avoid adverse selection in investment (Weiss et al., 2010). Authors find the inclusion of soft information helpful in mitigating adverse selection (Tao et al., 2017). From the social perspective, P2P lending is related to financial inclusion efforts and social capital development, since these platforms have enabled access to financial services for unbanked individuals and young people with little credit history (Hasan et al., 2020; Maskara et al., 2021).
The second group of research focuses on the operation of the business model and the use of technological developments, such as big data software and alternative data analytics, in creating credit scoring models and credit risk analysis. Platforms access information such as consumers' payment history, insurance claims, and social networks and combine it with traditional data sources such as FICO scores to better assess users' creditworthiness (Jagtiani & Lemieux, 2019). Within this group, some works suggest using psychometric and demographic variables, as well as email usage information, to assess creditworthiness when credit history is unavailable, and report that statistical classifiers built on these predictors reach sufficient accuracy (Djeundje et al., 2021).
With the increase in mobile device usage, several studies have included metrics generated on these devices in credit risk assessment. Variables such as call records, mobile location, installed applications, and SMS activity are shown to increase accuracy in default prediction when used along with credit bureau information (Agarwal et al., 2020; Björkegren & Grissen, 2018; Óskarsdóttir et al., 2019). Soft data and user-generated text are also employed to enhance predictive models for credit risk assessment (Netzer et al., 2019). Text is processed and categorized to determine creditworthiness according to spelling error rate, length of text, upper and lower cases, readability, and tone. These variables improve credit risk models (Berg et al., 2020). There is also a body of literature on credit default probability and overall credit risk assessment that implements traditional econometric alternatives and AI-based algorithms. Econometric models such as binary classifications or logistic regressions are used as benchmarks for estimating probabilities because they are highly interpretable.
Most P2P lending research is developed using the LendingClub dataset, one of the few public datasets available for such tasks. The platform has quarterly information about loans (2007-2020) available for investors to help them analyze borrowers' conditions. Serrano-Cinca et al. (2015) used the LendingClub dataset and, through a logit model, assessed the determinant variables in default prediction for a period of six years (2008-2014). This work identified variables related to the characteristics of the loan and the borrower's credit history and proposed a model using loan purpose, annual income, housing situation, credit history, and indebtedness levels. In a second approach, they used logistic regression for predictive assessment, concluding that the strongest predictors are the risk grade assigned by the platform and indebtedness levels.
Despite the transparency of a logit model, researchers have applied other methodologies that are not restricted to linear assumptions and can handle large data volumes, outperforming logit classification and lending themselves better to Big Data methods. Research on the LendingClub dataset uses Machine Learning algorithms because they do not require extensive preprocessing and can handle multicollinearity. The literature shows the application of algorithms such as support vector machines, neural networks, and ensemble methods. Cho et al. (2019) trained an Instance-Based Entropy Fuzzy SVM algorithm to identify default probabilities in P2P lending. They proposed investment decision models to maximize the expected return on non-defaulting loans. Kin et al. (2020) also trained an ensemble of four classifiers (neural networks, random forest, adaptive boosting, and extreme gradient boosting) considering five common characteristics for credit analysis. Ensemble methodologies train several weak estimators to yield a unified, robust estimation, seeking lower error rates and better predictive accuracy (Dietterich, 2000). Deep learning algorithms such as convolutional neural networks were trained in Chengeta and Mabika (2021) to identify default and possible fraud in P2P lending. The authors propose a model in which loan purpose, employment status, and credit scores identify possible defaulters.
Several research papers use tree-structured algorithms for borrower classification and identify potential defaulters according to the importance of the dataset features (Breiman, 2001). In contrast with logit regressions, studies find that Machine Learning algorithms such as Random Forest (RF) build better prediction models for binary and multilevel classification (e.g., Jin et al., 2015; Li and Zengyi, 2020; Ye et al., 2018; Zhu et al., 2019). Zhu et al. (2019) use the LendingClub dataset for 2019Q1 and apply RF to fifteen attributes describing credit characteristics, such as loan amount, installment, and grade, concluding that this algorithm outperforms SVM, Decision Tree, and logistic regression. Li and Zengyi (2020) propose a model for evaluating lenders' profit, using LendingClub to validate the model. In contrast to Zhu et al., the authors found variables such as debt-to-income ratio and interest rate to be relevant. Jin et al. (2015) employ RF for feature selection and a subsequent evaluation of other Machine Learning models. The resulting selected variables are term, annual income, loan amount, debt-to-income ratio, credit grade, and revolving utilization. Ye et al. (2018) develop a profit score model using an RF optimized by a genetic algorithm to study the maximization of lender profits.
3. Model
Random forest is an ensemble method that combines multiple decision trees and can be applied to classification and regression problems. The idea appeared at the beginning of this century: grow trees that differ from one another by injecting randomness into the selection of variables, and then vote for the most popular class (classification) or average the predictions (regression). Breiman (2001) introduced random forest as a new prediction tool to compete with boosting and adaptive bagging.
Assuming a sample,
Where the couples
The coordinates of
So, for a learning sample
From each prediction
Therefore, let the random Forest
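For reference, the usual random forest construction for binary classification can be sketched as below. The notation ($\mathcal{L}_n$, $\hat{h}_\ell$, $\Theta_\ell$, $q$) is assumed here for illustration, following the standard presentation of Breiman (2001); it is not taken from the original derivation.

```latex
\begin{align}
  \mathcal{L}_n &= \{(X_1, Y_1), \dots, (X_n, Y_n)\},
    \qquad (X_i, Y_i) \ \text{i.i.d. copies of } (X, Y), \\
  X &= (X^{1}, \dots, X^{p}) \in \mathbb{R}^{p},
    \qquad Y \in \{0, 1\}, \\
  \hat{h}_{\ell}(x) &= \hat{h}(x;\, \Theta_{\ell}, \mathcal{L}_n),
    \qquad \ell = 1, \dots, q, \\
  \hat{h}_{RF}(x) &= \arg\max_{k \in \{0, 1\}}
    \sum_{\ell = 1}^{q} \mathbf{1}\{\hat{h}_{\ell}(x) = k\}.
\end{align}
```

Here each $\Theta_\ell$ stands for the randomness used to grow tree $\ell$ (the bootstrap sample and the variables drawn at each split), and the forest predicts the class that receives the most votes among the $q$ trees.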
4. Data
We use the LendingClub snapshot dataset for the 2007-2020Q3 period, the last dataset available since LendingClub retired the investment notes from the platform in the last quarter of 2020. This dataset is available on Kaggle, a data science repository. The original dataset contains 140 variables and 2,925,493 individual loan observations. Notably, the dataset has a high percentage of missing values for several variables. The dataset covers loan and borrower characteristics, such as credit history, credit risk scores, and credit-issuing conditions. The loan status variable discloses the current repayment stage of each loan, where the status is "fully paid," "charged off," "late," "in grace period," or "current."
5. Methodology
We performed data preprocessing before implementing the algorithm. We eliminated variables with more than 50% missing values and discarded individual observations with a blank in any feature. We adopt this approach to avoid imputing missing values, since imputation could bias the information. This dataset is not a time series; individuals do not necessarily meet the same conditions, especially regarding the variables that represent credit history. Finally, we encode categorical variables as dummies.
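The three preprocessing steps can be sketched as follows. This is a minimal illustration on toy records using only the Python standard library; the helper names, the toy columns, and the use of `None` for a blank are assumptions made for the example and do not reproduce the authors' code.

```python
# Toy records standing in for loan rows; None marks a missing value.
rows = [
    {"loan_amnt": 1000, "term": "36 months", "desc": None},
    {"loan_amnt": 2000, "term": "60 months", "desc": None},
    {"loan_amnt": None, "term": "36 months", "desc": "car"},
    {"loan_amnt": 4000, "term": "36 months", "desc": None},
]


def drop_sparse_columns(rows, max_missing=0.5):
    """Drop every column whose share of missing values exceeds max_missing."""
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / len(rows) <= max_missing
    ]
    return [{col: r[col] for col in keep} for r in rows]


def drop_incomplete_rows(rows):
    """Discard every row that still has a blank in any remaining column."""
    return [r for r in rows if all(v is not None for v in r.values())]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 dummy per observed level."""
    levels = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = int(r[column] == level)
        encoded.append(new_row)
    return encoded


clean = one_hot(drop_incomplete_rows(drop_sparse_columns(rows)), "term")
```

In this toy case `desc` is removed because three of four values are missing, the row without `loan_amnt` is discarded, and `term` becomes two dummy columns.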
The dependent variable for this dataset is loan status. We keep loans whose status is fully paid or charged off to represent non-defaulted and defaulted loans, respectively. Under these conditions, we can apply binary classification algorithms to predict default probabilities based on the features presented in the LendingClub dataset.
We apply an RF algorithm to the preprocessed dataset to find the most representative variables for the classification objective. We use a randomized 60%-40% split for the training and testing subsets, respectively. We apply the feature importance approach to reduce the dimensionality of the dataset, based on how the model rates the input variables' relevance in the testing phase. Feature importance lies between zero and one (0 = no influence over the target, and 1 = perfect target prediction).
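A randomized 60%-40% split can be sketched as below. The function name and the seed are illustrative assumptions. The row count is the sum of the fully paid and charged-off loans reported in this section (1,121,412 + 269,193), and the resulting 40% share equals the row totals of the test-set confusion matrices in the results.

```python
import random


def train_test_split_indices(n_rows, train_share=0.6, seed=0):
    """Shuffle the row indices and cut them into training and testing parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = int(round(n_rows * train_share))
    return indices[:cut], indices[cut:]


n_loans = 1_121_412 + 269_193  # fully paid + charged-off loans
train_idx, test_idx = train_test_split_indices(n_loans)
```

The feature importance ranking itself would then be obtained by fitting the forest on the rows in `train_idx`; that step is omitted here because it depends on the library used.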
Resulting features are: “recoveries”, “total_rec_prncp”, “collection_recovery_fee”, “last_fico_range_high”, “last_fico_range_low”, “last_pymnt_amnt”, “total_pymnt_inv”, “total_pymnt”, “funded_amnt”, “installment”, “loan_amnt”, “funded_amnt_inv”, “debt_settlement_flag_Y”, “debt_settlement_flag_N”, “total_rec_int”, “term”, “total_rec_late_fee”, “int_rate”, “issue_d”, “grade_A”.
We eliminate recovery-related variables because they trivially explain defaulted or delinquent loans: LendingClub charges fees for the recovery of principal and interest on late or missed payments. We also delete redundant features with a correlation coefficient over 0.80. We maintain the following variables for further analysis:
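The redundancy filter can be sketched as follows, assuming Pearson correlation and a greedy rule that keeps the first-listed feature of each highly correlated pair. The tie-breaking rule, the helper names, and the toy values are assumptions for illustration; the text does not state which member of a correlated pair was retained.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_correlated(features, threshold=0.80):
    """Keep a feature only if |r| <= threshold against every feature kept so far.

    `features` maps a feature name to its list of values; iteration order
    decides which member of a correlated pair survives (assumed rule).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


toy = {
    "loan_amnt": [1000, 2000, 3000, 4000, 5000],
    "funded_amnt": [1000, 2000, 3000, 4000, 4900],  # nearly identical to loan_amnt
    "int_rate": [12.0, 7.5, 15.1, 9.9, 11.3],
}
kept = drop_correlated(toy)
```

On the toy data `funded_amnt` is dropped because it is almost perfectly correlated with `loan_amnt`, while `int_rate` is retained.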
Feature name | Description | Type |
---|---|---|
last_fico_range_high | The upper boundary of the range that the borrower's last pulled FICO score belongs to. | Numeric |
last_pymnt_amnt | Last total payment amount received | Numeric |
loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. | Numeric |
debt_settlement_flag_Y | Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company. | Categoric: 1 for 'YES,' 0 for 'NO' |
total_rec_int | Interest received to date | Numeric |
term | The number of payments on the loan. Values are in months and can be either 36 or 60. | Numeric |
int_rate | Interest Rate on the loan | Numeric |
issue_d | The month in which the loan was funded | Numeric |
grade_A | LC assigned loan grade | Categoric: 1 for grade A, 0 for other grades |
Source: From the LendingClub data dictionary
Figure 1 displays the Pearson correlation matrix for the selected features. Table 2 presents the descriptive statistics.
Statistic | last_fico_range_high | last_pymnt_amnt | loan_amnt | debt_settlement_flag_Y | total_rec_int | term | int_rate | issue_d | grade_A |
---|---|---|---|---|---|---|---|---|---|
mean | 678.24 | 5531.04 | 14962.89 | 0.03 | 2579.96 | 42.34 | 13.29 | 2015 | 0.18 |
std | 81.91 | 7283.07 | 9027.74 | 0.16 | 2809.52 | 10.58 | 4.87 | 1.63 | 0.38 |
min | 0 | -400 | 1000 | 0 | 0 | 36 | 5.31 | 2012 | 0 |
0.25 | 624 | 400.12 | 8000 | 0 | 796.42 | 36 | 9.75 | 2015 | 0 |
0.5 | 694 | 2026.8 | 12900 | 0 | 1651.48 | 36 | 12.74 | 2016 | 0 |
0.75 | 734 | 8437.4 | 20000 | 0 | 3293.22 | 60 | 16.02 | 2017 | 0 |
max | 850 | 42192.05 | 40000 | 1 | 31714.37 | 60 | 30.99 | 2020 | 1 |
Furthermore, we use the Synthetic Minority Oversampling Technique (SMOTE) to handle the class imbalance problem present in this dataset. The dataset contains 1,121,412 fully paid loans and 269,193 charged-off loans (80.64% and 19.36%, respectively). We oversample the charged-off loan status, so the fully paid/charged-off ratio is 40%. Additionally, we apply 5-fold cross-validation to evaluate performance stability with the F1-macro score and the accuracy score.
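A minimal sketch of the oversampling step, assuming the plain SMOTE variant that interpolates between a minority point and one of its k nearest minority neighbours. The function names, `k=5`, and the seed are illustrative assumptions. The 40% target is read here as a charged-off to fully paid ratio, which is one possible interpretation of the sentence above, and the training-set class counts are derived as the dataset totals minus the test-set row totals of the confusion matrices.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments joining a minority point
    to one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def n_synthetic_needed(n_majority, n_minority, ratio):
    """Synthetic rows needed so that minority / majority reaches `ratio`."""
    return max(0, round(ratio * n_majority) - n_minority)


# Assumed training-set counts: totals minus the test-set class totals.
n_paid_train = 1_121_412 - 448_565
n_charged_off_train = 269_193 - 107_677
n_new = n_synthetic_needed(n_paid_train, n_charged_off_train, ratio=0.40)

# Toy demonstration on four two-dimensional minority points.
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_synthetic = smote(toy_minority, n_new=10, k=2)
```

Because each synthetic point lies on a segment between two existing minority points, the new observations stay inside the region already covered by the charged-off class.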
6. Results
We ran the RF algorithm on the dataset restricted to the features selected in the methodology section. We demonstrate that the full set of variables in the original dataset may not be essential for a credit risk analysis. From the 140 variables available, most models use only ten to fifteen variables for class prediction, as seen in the literature. In this section, we show that the RF algorithm yields robust results for class prediction. We trained on 60% of the dataset, left the rest for testing, and performed k-fold cross-validation on the train and test samples. We repeat the process for the oversampled train set to assess improvements in classification metrics. Random Forest results are compared to logit classification to show the performance improvement of the RF model. Table 3 presents the cross-validation results for train and test samples.
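The fold construction behind 5-fold cross-validation can be sketched as below. `k_fold_indices` is a hypothetical helper that assumes the rows were shuffled beforehand and uses contiguous folds; stratification by class is not shown.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (training_indices, validation_indices) for k contiguous folds."""
    # The first n_rows % k folds receive one extra row each.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = list(range(0, start)) + list(range(stop, n_rows))
        yield training, validation
        start = stop


folds = list(k_fold_indices(12, k=5))
```

Each row is used for validation exactly once and for training in the remaining four folds, so the five scores measure how stable the classifier is across different partitions.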
5-Fold Cross-Validation: F1-Macro Score
Classifier | 1st Fold | 2nd Fold | 3rd Fold | 4th Fold | 5th Fold | Mean | Std.Dev |
---|---|---|---|---|---|---|---|
RF | 0.96652641 | 0.96583453 | 0.96610146 | 0.96632976 | 0.96521389 | 0.96600121 | 0.000510266 |
RF (SMOTE) | 0.97124209 | 0.9718793 | 0.97129974 | 0.98113967 | 0.98021147 | 0.97515445 | 0.005056883 |
LOGIT | 0.90917119 | 0.90903796 | 0.91266451 | 0.91066891 | 0.91063253 | 0.91043502 | 0.001312509 |
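The summary columns of the cross-validation table can be recomputed from the fold scores with the standard library. This is only a consistency check on the reported figures; the variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# Fold-level F1-macro scores copied from the cross-validation table.
rf_folds = [0.96652641, 0.96583453, 0.96610146, 0.96632976, 0.96521389]
logit_folds = [0.90917119, 0.90903796, 0.91266451, 0.91066891, 0.91063253]

rf_mean = mean(rf_folds)
rf_sample_std = stdev(rf_folds)            # divides by n - 1
logit_mean = mean(logit_folds)
logit_population_std = pstdev(logit_folds)  # divides by n
```

With these inputs, the reported RF deviation agrees with the sample form (`stdev`), while the reported Logit deviation appears to agree with the population form (`pstdev`), so the two rows are best compared through their means.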
The F1-Macro Score is a classification metric that averages the per-class F1 scores; this metric is helpful for skewed data and for measuring classification performance in class-imbalance situations because it treats all classes as equals regardless of their support. The confusion matrix is a tangible representation of the predictive performance of the algorithm; it presents the number of correctly classified and misclassified observations. In Table 4, we present the confusion matrix results for the test set, and in Table 5, we present the classification report.
Random Forest prediction for Classification (Test set)
True Class | Predicted 0 | Predicted 1 | Predicted 0 (SMOTE) | Predicted 1 (SMOTE) |
---|---|---|---|---|
0 | 443731 | 4834 | 442452 | 6113 |
1 | 6581 | 101096 | 5421 | 102256 |

Logit prediction for Classification (Test set)
True Class | Predicted 0 | Predicted 1 |
---|---|---|
0 | 433375 | 15190 |
1 | 15746 | 91931 |

Classification Report
Classifier | Accuracy | F1-Score (class 0) | F1-Score (class 1) | H-Score |
---|---|---|---|---|
RF | 0.98 | 0.99 | 0.95 | 0.88 |
RF (SMOTE) | 0.98 | 0.99 | 0.95 | 0.88 |
Logit | 0.94 | 0.97 | 0.86 | 0.69 |
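The accuracy and F1 figures of the classification report follow directly from the test-set confusion matrices, as the sketch below illustrates. It assumes that class 0 denotes fully paid loans and class 1 charged-off loans, which is consistent with the class totals; the function and key names are illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, per-class F1 and macro F1 from a 2x2 confusion matrix.

    tn: true 0 predicted 0, fp: true 0 predicted 1,
    fn: true 1 predicted 0, tp: true 1 predicted 1.
    """
    total = tn + fp + fn + tp
    f1_class0 = 2 * tn / (2 * tn + fn + fp)  # class 0 treated as the positive class
    f1_class1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "f1_class0": f1_class0,
        "f1_class1": f1_class1,
        "f1_macro": (f1_class0 + f1_class1) / 2,  # unweighted class average
    }


# Counts copied from the test-set confusion matrices.
rf = binary_metrics(tn=443731, fp=4834, fn=6581, tp=101096)
rf_smote = binary_metrics(tn=442452, fp=6113, fn=5421, tp=102256)
logit = binary_metrics(tn=433375, fp=15190, fn=15746, tp=91931)
```

Rounded to two decimals, these values match the accuracy and per-class F1 columns of the report; the H-Score requires the predicted probabilities and cannot be recovered from the confusion matrix alone.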
The H-Score proposed by Hand (2009) is a Bayesian approach that specifies a prior distribution for each class loss independent of the algorithm. This measure replaces ROC-AUC scores, since these depend on the algorithm used (Hand, 2009; Hand & Anagnostopoulos, 2013). The H-Score allows the cost of misclassification to be set through a severity ratio. In this case, we penalize misclassification symmetrically, selecting a severity ratio of one. We observe that performance drops in both the RF and RF SMOTE predictions by an average of 10% compared to the F1-Score and Accuracy metrics. We evaluate several metrics because we propose a model for default prediction; from the business perspective, misclassification is problematic as it leads to unnecessary or unwanted risks.
Figure 2 displays how the selected features behave for RF and RF SMOTE. We observe a slight change in the position of the debt settlement flag, interest rate, and loan amount. This result indicates that the SMOTE alternative weights the features differently, assigning more importance to interest received and interest rate. Nevertheless, the top three variables are consistent across both techniques. FICO score, last payment amount, and total interest received to date give important insights about the loan and the borrower. FICO scores result from a proprietary credit scoring algorithm based on credit history, while the interest received to date and the last payment amount represent the borrower's behavior on the LendingClub platform.
7. Conclusions
We can draw two important conclusions from this study. First, the performance metrics obtained confirm the Random Forest algorithm's capacity to solve binary classification problems. We highlight the interpretation transparency achieved using this algorithm. The feature importance result allowed us to perform a dimensionality reduction that produced a robust model for default prediction using only nine variables. These results are comparable to those of other research articles that use Machine Learning algorithms and report similar performance. Second, we note the influence of traditional credit scoring variables on default prediction problems.
Therefore, P2P lending platforms still have to consider credit bureau information to assess the credit risk of potential borrowers as a principal requirement. The resulting model presented in this study is data-driven; it is not strictly conclusive for all P2P platforms. Nonetheless, LendingClub is a consolidated corporation considered as a benchmark for the P2P lending market. Startups and active platforms can benefit from this study to generate credit risk models.