Predicting loan default

Daria Morgan
7 min read · Aug 6, 2020

Combining detailed customer data with machine learning models to better predict default.


Banks use the term default to describe any event where a borrower fails to repay either the interest or principal on their loan on time. As such, a default can occur when a borrower is unable to make timely payments, misses payments, or avoids or stops making payments.

Loan defaults represent a large risk for banks. If banks can better predict who will default, they can price their loans to compensate for the risk they take. The question I wanted to answer is this: can banks use better data to help them optimize their operations? In my analysis I set out to answer that question using supervised learning techniques.

Data Source

For my analysis I decided to use a data set provided by a company called Home Credit as part of their competition posted on Kaggle.

Source: https://en.wikipedia.org/wiki/Home_Credit

Home Credit is a leading international multi-channel provider of consumer finance founded in the Czech Republic in 1997, with operations on 3 continents. It strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience (in an environment that lacks traditional credit agencies like FICO), Home Credit makes use of a variety of alternative data — including telcom and transactional information — to predict their clients’ repayment abilities.

The data structure.

The data itself comprises seven CSV files containing ample information on applicants, their credit history (if they have one), and their payment history with Home Credit and/or other financial institutions.

I downloaded these files and stored them on an AWS EC2 instance since they are large (more than 5 GB). I also created relational tables out of these files in a PostgreSQL database, connected it to my Jupyter notebooks to reduce computational cost, and used SQL and Python for data preparation (cleaning, grouping, and aggregation) before the analysis.
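The query-from-notebook workflow can be sketched as follows. This is a minimal illustration, not the actual pipeline: the table and column names are stand-ins, and an in-memory SQLite engine replaces the PostgreSQL database so the snippet is self-contained (in practice the connection string would point at Postgres).

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for the PostgreSQL database: an in-memory SQLite engine.
# The real connection would look like
# create_engine("postgresql://user:password@host/homecredit").
engine = create_engine("sqlite://")

# A tiny stand-in for the application table.
apps = pd.DataFrame({
    "SK_ID_CURR": [1001, 1002, 1003],
    "AMT_INCOME_TOTAL": [120000.0, 90000.0, 200000.0],
    "TARGET": [0, 1, 0],
})
apps.to_sql("application", engine, index=False, if_exists="replace")

# Pre-aggregate in SQL before pulling data into the notebook,
# so only the small summary travels over the connection.
query = """
    SELECT TARGET, COUNT(*) AS n, AVG(AMT_INCOME_TOTAL) AS avg_income
    FROM application
    GROUP BY TARGET
"""
summary = pd.read_sql(query, engine)
print(summary)
```

Pushing the grouping and aggregation into SQL this way is what keeps the notebook's memory footprint small when the raw files are gigabytes in size.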

Exploratory data analysis

Home Credit provided exhaustive information about applicants/clients and their loans: the application table alone describes each applicant in 122 columns. These features (columns) include the highest level of a client's education, whether they are married, how many family members they have, what area they live in, whether they own a car, a phone, etc.

The data set has lots of missing values, with almost 70% of values missing for some of the features.

For that reason, I decided to go through each feature in the application table and hand-select the ones that made the most sense for me and the analysis (since Home Credit operates on three continents, some of the features are specific to particular regions).

As the first step, I decided to get a grasp of the data by performing an exploratory analysis.

The target value distribution shows class imbalance, which will be handled later in the analysis.

The target value in the data set is represented by:

  • 1: default, a client with payment difficulties (he/she had a late payment on at least one of the first installments of the loan);
  • 0: no default, all other cases.

The target values are skewed heavily towards one class (0, no default), which is good news for Home Credit because most of their customers are able to repay their loans. For the analysis, however, this class imbalance would interfere with class separation when applying machine learning algorithms, so it needs to be addressed with techniques such as oversampling the minority class, under-sampling the majority class, or synthesizing new minority-class examples. I found this article particularly helpful for getting a better understanding of all of the data balancing techniques.

Income for about a third of the data set is in the range from 80k to 150k, with a couple of outliers. Since the data come from different countries, the currency and the exchange rate applied are unknown.
Age is distributed almost evenly, with spikes around 30 and 40 years old. The same pattern can be seen in the ages of defaulted applicants; however, default is most likely for applicants between 25 and 32 years old.

As part of the feature engineering, I created a basic new feature, debt-to-income ratio, which I believe is used in virtually every analysis of whether an applicant is able to repay a loan. I also wanted to plot it against the applicant's age to see if there is any correlation.
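Computing the new feature is a one-liner in pandas. A minimal sketch on a toy slice of the application table, assuming the credit-amount and income columns (`AMT_CREDIT`, `AMT_INCOME_TOTAL`) and the `DAYS_BIRTH` column, which the data set stores as negative days relative to the application date:

```python
import pandas as pd

# Toy slice of the application table.
df = pd.DataFrame({
    "AMT_CREDIT": [300000.0, 150000.0, 450000.0],
    "AMT_INCOME_TOTAL": [120000.0, 90000.0, 150000.0],
    "DAYS_BIRTH": [-12000, -15000, -20000],  # age stored as negative days
})

# Debt-to-income ratio: total credit relative to annual income.
df["DTI"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]

# Convert DAYS_BIRTH to age in years, for plotting DTI against age.
df["AGE_YEARS"] = (-df["DAYS_BIRTH"] / 365.25).round(1)
print(df[["DTI", "AGE_YEARS"]])
```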

High values of the debt-to-income ratio (DTI) are more common for clients older than 50, most likely pensioners (retired), who may also have savings with which to repay the loan.
Loans in the data set are mainly short-term, in particular revolving loans and cash loans.
Lower secondary and secondary special education have almost 10% and 9% default rates, respectively, within their education groups.

The majority of applicants with Home Credit have either lower secondary or secondary special educations, and at first glance it looks like they also have the highest default rate. However, when you look at the default rates by category, secondary special actually has the second-highest rate, and lower secondary the highest.

Higher default rates can be seen in the Laborers, Sales and Core staff, and Drivers occupation categories.

The Analysis

Before starting the analysis, I created a few more features by going through the tables with applicants' credit histories with Home Credit and/or other financial institutions. Some of these features include:

  • the average number of applications a client made previously;
  • the rejection rate, for clients who had applied for a loan before;
  • the average number of days a client was past due.
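Features like these fall out of a single grouped aggregation over the credit-history tables. A hedged sketch on a toy stand-in for the previous-application table (the column names here are illustrative, not the exact Kaggle schema):

```python
import pandas as pd

# Toy stand-in for the previous-application history table.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],          # client id
    "STATUS": ["Approved", "Refused", "Approved", "Refused", "Refused"],
    "DAYS_PAST_DUE": [0, 5, 12, 0, 30],
})

# One row per client, with the engineered per-client features.
agg = prev.groupby("SK_ID_CURR").agg(
    prev_app_count=("STATUS", "size"),
    prev_app_reject_ratio=("STATUS", lambda s: (s == "Refused").mean()),
    avg_days_past_due=("DAYS_PAST_DUE", "mean"),
).reset_index()
print(agg)
```

The resulting one-row-per-client table can then be joined back onto the application table on `SK_ID_CURR`.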

More information on which features were selected and which were created can be found on my GitHub page.

I chose to focus on two metrics, the F1 score and AUC, because the data set's target value is imbalanced:

  • AUC is the area under the ROC (Receiver Operating Characteristic) curve. The ROC is a probability curve, and the AUC measures the degree of class separability. AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, one whose predictions are 100% correct has an AUC of 1.0, and an AUC of 0.5 means the model has no class-separation capacity at all, i.e. it is guessing randomly.
  • The F1 score is the harmonic mean of precision and recall, where precision is the fraction of positive predictions that were correct, and recall is the fraction of positive cases that were predicted correctly.

A better understanding of the F1 score can be gained from the confusion matrix.
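Both metrics are available out of the box in scikit-learn. A small sketch with made-up labels and scores; note that AUC is computed from the predicted probabilities, while F1 needs hard labels after thresholding:

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy ground truth and model outputs: y_prob are predicted
# probabilities of default, y_pred the labels after a 0.5 threshold.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.3, 0.05]
y_pred = [int(p >= 0.5) for p in y_prob]

auc = roc_auc_score(y_true, y_prob)   # ranking quality of the scores
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision/recall
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```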

Logistic regression results.

As a baseline model for my analysis, I went with logistic regression and a few features such as debt-to-income ratio, education type, the applicant's age, and the number of days he/she was employed.

I also tried a couple of techniques for handling class imbalance and found that balancing the class weights worked best for my analysis in terms of the metric values (this can be done by setting a parameter of the logistic regression model in sklearn).
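The class-weighting baseline can be sketched like this. The data here is a synthetic imbalanced set standing in for the Home Credit features (roughly 8% positives, similar in spirit to the real target distribution), not the actual data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in: ~8% of samples in the default class.
X, y = make_classification(n_samples=2000, n_features=6,
                           weights=[0.92], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights each class inversely to its
# frequency, so the minority (default) class is not drowned out.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```

Without the `class_weight` setting, a model on data this skewed often learns to predict the majority class almost everywhere, which is exactly what the F1 score penalizes.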

The results of my base model can be seen in the graph above, with an F1 score of 0.197 and an AUC of 0.623, which is a good starting point for further work.

I was able to improve the F1 and AUC scores to 0.27 and 0.726, respectively, by adding more features describing an applicant in greater detail (32 features altogether).

Going further, I also wanted to check the performance of other machine learning algorithms using the same set of features:

  • Random forest,
  • XGBoost Classifier.

Since the data set has lots of missing values, it also helps that tree-based models such as these cope with missing data better than logistic regression (XGBoost, in particular, handles missing values natively).

Comparison of the performance of all three models.

The random forest and XGBoost classifiers performed better than logistic regression, with a slight advantage for the XGBoost classifier.

The top 15 features that the XGBoost classifier identified as important are provided below. Among them are the earlier-created features: DTI, loan_dpd (days past due on loan payments, built from the repayment history of previously disbursed Home Credit loans), prev_app_reject_ratio (rejection rate for clients who had applied for a loan before), and pos_avg_dpd (average days past due on cash loans with Home Credit).
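Extracting such a ranking is straightforward, since fitted tree ensembles expose a `feature_importances_` attribute; XGBoost's `XGBClassifier` exposes the same attribute, so the pattern carries over directly. A sketch on synthetic data, with feature names borrowed from the engineered features above purely for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names mirroring the engineered features above.
names = ["DTI", "loan_dpd", "prev_app_reject_ratio",
         "pos_avg_dpd", "age", "income"]

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Pair importances with names and sort, highest first.
importances = (pd.Series(clf.feature_importances_, index=names)
               .sort_values(ascending=False))
print(importances.head(15))
```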

Conclusion

The winning score of the Kaggle competition was an AUC of 0.8. My model scored 0.76, which is a good start but also leaves room for improvement.

To improve my model, I would go through the data set again, do more feature engineering, and revise the features I picked: since the data set has more than 150 features, I may have ruled out some helpful ones in my hand-selection process.

In addition, algorithms such as XGBoost and random forest have many hyperparameters which, paired with new features, could give the model's metrics a significant boost.
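A hyperparameter search over such a space can be sketched with scikit-learn's `RandomizedSearchCV`. The parameter grid and data below are illustrative only (a real XGBoost search would also cover parameters like `learning_rate`, `max_depth`, and `subsample`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic imbalanced stand-in for the Home Credit features.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# Illustrative search space for a random forest.
param_dist = {
    "n_estimators": [100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

# Randomly sample 5 configurations, scoring each by cross-validated AUC,
# consistent with the evaluation metric used above.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_dist, n_iter=5, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```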
