Competition - 2nd Place Winner - Loan Data

from IPython.display import Image
Image("Graph.PNG")

Analysis of lendingclub.com Loan Data

This dataset (source) consists of data from almost 10,000 borrowers who took out loans, some of which have been paid back while others are still in progress. It was extracted from lendingclub.com, an organization that connects borrowers with investors. Read on to see how the graph above was produced and how to predict whether a loan is not going to be paid off!

First, some important imports and housekeeping:

# Load packages
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')

%%html
<style>
  table {margin-left: 0 !important;}
  .container { width:80% !important; }  
</style>

PROBLEM STATEMENT

The purpose is to gain useful insight from the available data and train a model to predict the probability of a loan not being paid in full.

In the first part, the exploratory analysis will provide information and insights on the data. Tables and visualizations will be used to better understand the available data and answer relevant questions.

In the second part, the data will be used to predict the probability of a loan not being paid in full. Different models will be trained and the best-performing one will be selected.

Notebook structure

The code cells are collapsed most of the time to keep the reading easy and flowing. Whenever you feel you need to examine the code, simply expand the cells by clicking the three-dots icon on the collapsed cell.

EDIT: There seems to be a problem with collapsed cells, so I will keep all code cells expanded until a fix is implemented.

Disclaimer ;)

I am not a financial analyst, so my knowledge of the business logic in that area is limited. While going through the notebook, try to focus on its structure and on the data analysis and machine learning concepts rather than on the business knowledge. Thanks!

RESEARCH

Some information for context, from: https://www.capitalone.com/learn-grow/money-management/revolving-credit-balance/

How Does Revolving Credit Work?

If you’re approved for a revolving credit account, like a credit card, the lender will set a credit limit. The credit limit is the maximum amount you can charge to that account. When you make a purchase, you’ll have less available credit. And every time you make a payment, your available credit goes back up.

Revolving credit accounts are open ended, meaning they don’t have an end date. As long as the account remains open and in good standing, you can continue to use it. Keep in mind that your minimum payment might vary from month to month because it’s often calculated based on how much you owe at that time.

What Is a Revolving Balance?

If you don’t pay the balance on your revolving credit account in full every month, the unpaid portion carries over to the next month. That’s called a revolving balance.

You might apply for credit assuming you’ll always pay your balance in full every month. But real life can get in the way. Cars break down. Doctors’ appointments come up. And if you can’t pay your full balance, you’ll find yourself carrying a revolving balance to the following month.

What About Revolving Balances and Interest?

As the Consumer Financial Protection Bureau (CFPB) explains, “A credit card’s interest rate is the price you pay for borrowing money.” And the higher your revolving balance, the more interest you might be charged. But you can typically avoid interest charges by paying your balance in full every month.

What’s Revolving Utilization and How Does It Impact Credit Score?

Your credit utilization ratio—sometimes called revolving utilization—is how much available credit you have compared with the amount of credit you’re using. According to the CFPB, you can calculate your credit utilization ratio by dividing your total balances across all of your accounts by your total credit limit.

So why does your credit utilization ratio matter? It’s one of the factors that determines your credit score. If you manage credit responsibly and keep your utilization ratio relatively low, it might help you improve your credit score. The CFPB recommends keeping your utilization below 30% of your available credit.
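
To make the utilization ratio concrete, here is a minimal sketch of the calculation described above: total balances divided by total credit limits. The account names and dollar amounts are made up for illustration and do not come from the dataset; the `credit_utilization` helper is hypothetical as well.

```python
# Hypothetical accounts: (current balance, credit limit) in dollars
accounts = {
    "card_a": (1200.0, 5000.0),
    "card_b": (300.0, 2500.0),
    "card_c": (0.0, 2500.0),
}


def credit_utilization(accounts):
    """Return total balances divided by total credit limits, in percent."""
    total_balance = sum(balance for balance, _ in accounts.values())
    total_limit = sum(limit for _, limit in accounts.values())
    if total_limit <= 0:
        raise ValueError("total credit limit must be positive")
    return 100.0 * total_balance / total_limit


utilization = credit_utilization(accounts)
print(f"Utilization: {utilization:.1f}%")  # 1,500 / 10,000 -> 15.0%
```

In this toy case the ratio sits below the 30% level mentioned above. The `revol_util` column used later in the notebook appears to capture the same idea for each borrower, although that reading is an assumption here.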

EXPLORATORY ANALYSIS

Before going into the analysis, the dataset has to be examined and cleaned.

# Load data from the csv file
df = pd.read_csv('loan_data.csv', index_col=None)

# Change the dots in the column names to underscores
df.columns = [c.replace(".", "_") for c in df.columns]
print(f"Number of rows/records: {df.shape[0]}")
print(f"Number of columns/variables: {df.shape[1]}")
df.head()

# Understand the variables
variables = pd.DataFrame(columns=['Variable','Number of unique values','Number of nulls', 'Values'])

for i, var in enumerate(df.columns):
    variables.loc[i] = [var, df[var].nunique(), df[var].isnull().sum(), df[var].unique().tolist()]
    
# Join with the variables dataframe
var_dict = pd.read_csv('variable_explanation.csv', index_col=0)
variables.set_index('Variable').join(var_dict)

FEATURES

From the overview above, we know which features are available and what their types are. For convenience, we can organize the features of the dataset into useful groups:

NUMERIC features containing numeric data
BINARY features containing binary data (0,1)
CATEGORICAL features with categorical values
LOAN features related to the loan itself
PERSON features related to the person getting the loan
TARGET the target feature for training the model

NUMERIC = ["int_rate", "installment", "log_annual_inc", "dti", "fico", "days_with_cr_line", "revol_bal", "revol_util", "inq_last_6mths", "delinq_2yrs", "pub_rec"]
BINARY = ["credit_policy","not_fully_paid"]
CATEGORICAL = ["purpose"]
LOAN = ["int_rate", "installment", "days_with_cr_line", "revol_bal", "revol_util"]
PERSON = ["log_annual_inc", "dti", "fico", "inq_last_6mths", "delinq_2yrs", "pub_rec"]
TARGET = ["not_fully_paid"]

#also change the type for TARGET to categorical
#df[TARGET] = df[TARGET].astype('category')
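
Since these groups are typed by hand, a quick consistency check can catch a misspelled or forgotten column. The sketch below repeats the lists so that it runs on its own; the variable names `loan_person_cover_numeric` and `all_features` are introduced here for illustration only.

```python
# Feature groups, repeated here so the check is self-contained
NUMERIC = ["int_rate", "installment", "log_annual_inc", "dti", "fico",
           "days_with_cr_line", "revol_bal", "revol_util",
           "inq_last_6mths", "delinq_2yrs", "pub_rec"]
BINARY = ["credit_policy", "not_fully_paid"]
CATEGORICAL = ["purpose"]
LOAN = ["int_rate", "installment", "days_with_cr_line", "revol_bal", "revol_util"]
PERSON = ["log_annual_inc", "dti", "fico", "inq_last_6mths", "delinq_2yrs", "pub_rec"]
TARGET = ["not_fully_paid"]

# LOAN and PERSON should not share any feature ...
assert set(LOAN).isdisjoint(PERSON), "a feature is in both LOAN and PERSON"

# ... and together they should cover exactly the numeric features
loan_person_cover_numeric = (set(LOAN) | set(PERSON)) == set(NUMERIC)
print("LOAN + PERSON == NUMERIC:", loan_person_cover_numeric)

# The type-based groups should not overlap either
all_features = NUMERIC + BINARY + CATEGORICAL
assert len(all_features) == len(set(all_features)), "duplicate feature name"

# With the real data loaded, one could also compare against the DataFrame:
# unlisted = set(df.columns) - set(all_features)
```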

MISSING VALUES & IMPUTATION

Missing values might create errors in the analysis. From the table above, we can see that there are no missing values, so we can skip the imputation step :)
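
For reference, had there been missing values, the check and a simple fix could look like the sketch below. It uses a small made-up DataFrame (`toy`) rather than the real data, and median imputation is only one possible choice, not something this notebook needed to apply.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps, for illustration only
toy = pd.DataFrame({
    "fico": [700, 680, np.nan],
    "dti": [10.5, np.nan, 18.2],
    "purpose": ["credit_card", "all_other", "educational"],
})

# Count the nulls per column and show only the columns that have any
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])

# One simple option: fill numeric gaps with the column median
numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
print(imputed)
```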

OUTLIERS

Outliers might skew aggregations and bias the trained model. The dataset does not have many features (columns), so we can check the min and max of each feature to locate outliers. For example, for the binary features we expect a minimum of 0 and a maximum of 1.

df[BINARY].agg(['min','max'])

df[NUMERIC].agg(['min','max']).round(2)

The range of each feature seems to be within expectations, except for revol_bal, which ranges from 0 to 1.2 million! Let's examine this feature in more detail by showing its distribution with a boxplot:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,2))
sns.boxplot(data=df, x='revol_bal')
plt.xticks(ticks=[0,50000,200000,400000,600000,800000,1000000,1200000], labels=['0','50K','200K','400K','600K','800K','1000K','1200K'])
plt.title("Distribution of revolving balance");
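
By default, a seaborn boxplot draws its whiskers at 1.5 times the interquartile range (IQR) and marks everything beyond them as individual points. The same rule can be applied numerically to count those points. The sketch below is one way to do it, using a made-up series (`toy_revol_bal`) instead of the real column; the helper name `iqr_outlier_mask` is hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values, whis=1.5):
    """Return a boolean mask that is True for values outside the whiskers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return (values < lower) | (values > upper)


# Made-up balances with one extreme value, for illustration only
toy_revol_bal = pd.Series([0, 3000, 8000, 12000, 18000, 25000, 1200000])

mask = iqr_outlier_mask(toy_revol_bal)
print(f"Flagged {mask.sum()} of {len(toy_revol_bal)} values")
print(toy_revol_bal[mask].tolist())
```

On the real data, the same helper could be called as `iqr_outlier_mask(df['revol_bal'])`. Note that for a heavily right-skewed feature this rule tends to flag many points, so the count is a starting point for inspection rather than a list of rows to drop.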
