Skip to content

(Invalid URL)

Loan Data

Introduction

Context

This dataset (source) consists of data from almost 10,000 borrowers that took loans - with some paid back and others still in progress. It was extracted from lendingclub.com which is an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.

Load packages

library(skimr)
library(tidyverse)
library(GGally)
library(ggridges)
library(caret)
library(MLeval)
#got an error for missing package, hence the below
install.packages("e1071")
install.packages("kernlab")
install.packages("klaR")
install.packages("naivebayes")

Load the Data

loans <- readr::read_csv('data/loans.csv.gz')
skim(loans) %>% 
  select(-(numeric.p0:numeric.p100)) %>%
  select(-(complete_rate))

Exploratory data analysis

Data preparation

Here below we will prepare the data for further analysis, and verify that there are no missing values

# transform to factor with readable labels
Loans_prep <- loans %>%
  #transform the skewed data using log or sqrt, add the actual annual income
  mutate(log_days_with_cr_line = log(days_with_cr_line),
         log_installment = log(installment),
         #use sqrt here because of the 0s
         sqrt_revol_bal = sqrt(revol_bal),
         sqrt_inq_last_6mths = sqrt(inq_last_6mths),
         sqrt_delinq_2yrs = sqrt(delinq_2yrs),
         sqrt_pub_rec = sqrt(pub_rec),
         #also get back the annual income, for analysis
         annual_inc = exp(log_annual_inc)) %>%
  #Use factors for the charts
  within({
    credit_policy = factor(credit_policy, levels = c(1, 0), labels = c("Credit", "No credit"))
    not_fully_paid = factor(not_fully_paid, levels = c(1, 0), labels = c("Defaulter", "Non-defaulter"))
    purpose = factor(purpose, levels= c("credit_card", "debt_consolidation", "major_purchase", "home_improvement", "educational", "small_business", "all_other"), labels = c("Credit card", "Debt consolidation", "Major purchase", "Home improvement", "Educational", "Small business", "All other"))
    inq_last_6mths = factor(inq_last_6mths) 
    delinq_2yrs = factor(delinq_2yrs) 
    pub_rec = factor(pub_rec)
  })
  
#check for potential missing value
anyNA(Loans_prep)

Let's first have a look at the numeric features of the dataset


#show the correlation plot for all the numeric features (show the ones already transformed)
ggpairs(Loans_prep[c("int_rate", "log_installment", "log_annual_inc", "sqrt_revol_bal", "dti", "fico", "log_days_with_cr_line", "revol_util")],
        ggplot2::aes(colour=Loans_prep$not_fully_paid, alpha = 0.5))

This dataset includes 9 578 loans, for each of which 14 features are reported. \

4 of the features are related to the loan: \

  • the purpose (7 categories are reported),\
  • the interest rate\
  • the monthly installments owed if the loan is funded.\
  • whether or not the loan was fully paid.

And 10 features are characterizing the borrower:\

  • whether or not the customer meets the credit underwriting criteria of the Lending Club, and \
  • the natural log of his self-reported annual income. The logarithm transformation is justified by the very strong skew of the income data (a few very high income are stretching the curve to the right).\
  • his debt-to-income ratio (debt / annual income).\
  • his FICO credit score .\
  • the number of days he has had a credit line.\
  • his revolving balance (amount unpaid at the end of the credit card billing cycle).\
  • his revolving line utilization rate (amount of the credit line used / total credit available).\
  • the number of inquiries by creditors in the last 6 months.\
  • the number of times he had been 30+ days past due on a payment in the past 2 years.\
  • the number of derogatory public records (bankruptcy filings, tax liens, or judgments).\

What we do not know however, is the total amount of the loan, or the number of payment planned, therefore it is not clear how the time for paying back the loan can be calculated.

The first question we could try to answer looking at these features is: How well does the credit policy of the Lending Club indicates whether or not a loan is fully paid?

Credit policy and payment status

We will start by generating a confusion matrix to compare the predictions (in that case, meeting the credit club policy criteria) and the actual outcome (the loan being paid). Important to note here: We assume that the objective is to identify the loan that will not be paid, therefore the true positive class corresponds to a loan not fully paid, for a borrower who does not meet credit policy criteria (defaulter).

# Vector of the payment status
actual <- Loans_prep$not_fully_paid

#change the label of the Lending Club credit policy to match with the payment status
predicted_LC = as.factor(ifelse(Loans_prep$credit_policy == "No credit", "Defaulter", "Non-defaulter"))

#generate a confusion matrix
confusion_LC = confusionMatrix(table(actual, predicted_LC))
confusion_LC

As it can be seen, the accuracy of the credit underwriting criteria of the Lending Club is 75%, which could appear as quite good, However it is important to consider here what the goal of this criteria is. In fact, the sensitivity is only 28%, meaning most of the borrower who actually defaulted on payment still met this criteria. On the other hand, the Specificity is 86 %, meaning that the vast majority of those who met this criteria actually paid their loan back.

In order to assess the potential for making accurate prediction from the available data, let's start by looking at the distribution of the loans characteristics.

Loan characteristics: Interest rate and monthly installment

# Installment vs Interest rate
Loans_prep %>%
  ggplot(aes(x = installment, y = int_rate, level = not_fully_paid)) + 
  stat_density_2d(geom = "polygon", 
                  aes(alpha = ..level.., fill = not_fully_paid),
                  bins = 5) +
  theme_classic() + 
  ggtitle("Interest rate vs installment") +
  xlab("Installment") +
  ylab("Interest rate") +
  theme(legend.title = element_blank()) +
  geom_smooth(method = "lm",aes(fill = not_fully_paid))

Most of the installment are located below 500 per month. Interest rate seem to increase with the value of installment, but they are very spread, therefore it is difficult to come to a strong conclusion on the relation here. However it is already interesting to note that at equivalent level of monthly installments, loan that were not paid show higher interest rate on average.

The highest concentration of paid loan seems to occur at low interest rate and low installment, and both paid and unpaid loan show a high density around the 12% interest rate and 250 per month installment .

Let us now look at the Borrowers and their balance in relation to the interest rates

Borrower's characteristics: Income and balance

# Annual income vs Interest rate
Loans_prep %>%
  filter(exp(log_annual_inc) < 100000) %>%
  ggplot(aes(x = annual_inc, y = int_rate, level = not_fully_paid)) + 
  stat_density_2d(geom = "polygon", 
                  aes(alpha = ..level.., fill = not_fully_paid),
                  bins = 5) +
  theme_classic() + 
  ggtitle("Interest rate vs annual income") +
  xlab("Annual income") +
  ylab("Interest rate") +
  theme(legend.title = element_blank())

There seems to be virtually no direct relation between the Annual income an the interest rate. However, again it can be observed that at equivalent level of annual income, payment default occur at higher interest rate on average.