Predicting loan default

Photo by Floriane Vita on Unsplash
Source: https://en.wikipedia.org/wiki/Home_Credit
The data structure.
The data set has many missing values: for some features, almost 70% of the entries are missing.
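A minimal sketch of how the share of missing values per feature could be measured. The records and column names below are hypothetical toy data, not taken from the actual data set, and missing entries are assumed to be stored as `None`.

```python
def missing_share(rows, columns):
    """Return the fraction of rows in which each column is missing (None)."""
    n = len(rows)
    if n == 0:
        return {c: 0.0 for c in columns}
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}


# Hypothetical toy records; the column names are illustrative only.
rows = [
    {"income": 120000.0, "own_car_age": None},
    {"income": 90000.0, "own_car_age": 4.0},
    {"income": None, "own_car_age": None},
    {"income": 150000.0, "own_car_age": None},
]

shares = missing_share(rows, ["income", "own_car_age"])
for column, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{column}: {share:.0%} missing")
```

With the data loaded into a pandas DataFrame `df`, the same figure is usually obtained with `df.isna().mean()`.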
The target distribution shows class imbalance, which will be handled later in the analysis.
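The imbalance can be quantified as the share of each class. The sketch below also computes "balanced" class weights, n_samples / (n_classes * class_count), which is one common way to compensate for imbalance; it is shown as an illustration and is an assumption, not necessarily the approach used in this analysis. The labels are hypothetical toy data.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: cnt / n for cls, cnt in counts.items()}


def balanced_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical toy target: 1 = default, 0 = repaid.
labels = [0] * 9 + [1]

target_shares = class_shares(labels)
weights = balanced_weights(labels)
print(target_shares)
print(weights)
```

The rarer the class, the larger its weight, so misclassified defaults cost more during training.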
About a third of the applicants have an income in the range from 80k to 150k. The data contains a couple of outliers. Since the data comes from different countries, the currency and any exchange rate that was applied are unknown.
Age is distributed almost evenly, with some spikes around 30 and 40 years old. The same pattern can be seen in the age of defaulted applicants. However, default is most likely for applicants aged 25 to 32.
High values of the debt-to-income ratio (DTI) are more common among clients older than 50, most likely pensioners, who may also have savings with which to repay the loan.
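The exact DTI definition used for the plot is not stated. The sketch below assumes DTI is the periodic debt payment divided by income over the same period; the function name and the sample figures are hypothetical.

```python
def debt_to_income(payment, income):
    """Debt-to-income ratio: periodic debt payment divided by income for the
    same period. Returns None when the ratio is undefined (missing values or
    non-positive income). This definition is an assumption."""
    if payment is None or income is None or income <= 0:
        return None
    return payment / income


# Hypothetical applicants: (annual loan payment, annual income).
applicants = [(25000.0, 100000.0), (30000.0, 60000.0), (12000.0, 0.0)]
ratios = [debt_to_income(p, i) for p, i in applicants]
print(ratios)
```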
Loans in the data set are mainly represented by short-term ones, in particular revolving loans and cash loans.
Within the education groups, lower secondary and secondary special have default rates of almost 10% and 9%, respectively.
Higher default rates are seen in the Laborers, Sales and Core staff, and Drivers categories.
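Per-category default rates such as those for education and occupation can be computed by dividing the number of defaults in a group by the group size. A minimal sketch on hypothetical toy records, where each record is a (category, target) pair and target is 1 for default:

```python
from collections import defaultdict


def default_rate_by_group(records):
    """Return {group: share of records in that group with target == 1}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for group, target in records:
        totals[group] += 1
        defaults[group] += target
    return {group: defaults[group] / totals[group] for group in totals}


# Hypothetical toy records; the values do not come from the data set.
records = [
    ("Laborers", 1), ("Laborers", 0), ("Laborers", 0), ("Laborers", 0),
    ("Managers", 0), ("Managers", 0),
]

rates = default_rate_by_group(records)
for group, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {rate:.1%}")
```

With a pandas DataFrame, the equivalent is a `groupby` on the category column followed by the mean of the target.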
Logistic regression results.
Comparison of the performance of all three models.
