Loan Data
Context
This dataset consists of data from almost 10,000 borrowers who took out loans, some of which have been paid back while others are still in progress. It was extracted from lendingclub.com, an organization that connects borrowers with investors.
%%capture
pip install imblearn
%%capture
# Load packages
# Import
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from timeit import default_timer as timer
import time
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import plot_partial_dependence
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, classification_report, precision_recall_curve
from sklearn.model_selection import cross_val_score, KFold
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.utils import resample,shuffle
# Imbalanced-learn (resampling inside a pipeline)
from imblearn.pipeline import Pipeline, make_pipeline
from imblearn.over_sampling import SMOTE
# Others
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
cmap = sns.color_palette('Blues_r')
Load your data
# Load data from the csv file
df = pd.read_csv('loan_data.csv', index_col=None)
# Change the dots in the column names to underscores
df.columns = [c.replace(".", "_") for c in df.columns]
print(f"Number of rows/records: {df.shape[0]}")
print(f"Number of columns/variables: {df.shape[1]}")
df.head()
1. Executive summary
According to research by BCG, data analytics has been credited as one of the biggest drivers behind digital lending. The availability of data, coupled with a favorable regulatory environment, is projected to enable nearly $1 trillion worth of digital loans to be disbursed in India alone in the next 5 years. For a digital lender to be competitive in the market, it must have an appropriate model for making lending decisions. Thus, based on the dataset, which contains client credit records and loan history, we propose a model to predict the chance that a client will not fully repay a loan. We first created a simple logistic regression model using features selected with RFE (with Random Forest and Gradient Boosting as the underlying models). However, recall was low at 2% using this approach. Next, stratified k-fold cross-validation with SMOTE was applied during model training to address the class imbalance. Hyperparameters were also tuned using GridSearchCV. Recall then improved significantly to 61.2%. As a next step, we recommend researching more complex models, such as neural networks, to further improve the model and reduce the lender's credit risk.
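As a rough illustration of the RFE step mentioned above, the sketch below ranks features with a Random Forest as the underlying estimator. It runs on synthetic data from `make_classification`, not on the loan dataset, and the settings (`n_estimators=50`, `n_features_to_select=5`) are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption): 10 features, 4 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=2, random_state=0
)

# RFE repeatedly fits the estimator and drops the least important feature
# (step=1) until only n_features_to_select remain
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # rank 1 marks a kept feature

# Reduce the matrix to the selected columns
X_demo_selected = selector.transform(X_demo)
print(X_demo_selected.shape)
```

The same selector could wrap a `GradientBoostingClassifier` instead, since RFE only needs an estimator that exposes `feature_importances_` or `coef_`.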
2. Business Problem and motivation
For a digital lender to be profitable, it must not only have a data strategy to acquire the data necessary for credit modelling; more importantly, it must possess analytics models that allow it both to make lending decisions and to monitor the financial health of its clients. In this report, we attempt to create a model that digital lenders can use to make lending decisions based on the client data they have acquired. (McKinsey, 2021)
Our hypothesis is that, based on a client's loan history and credit records, the lender will be able to predict the chance that the client will not fully repay the loan. Such a prediction can be improved by performing rigorous feature selection using the RFE method with an ensemble model. In addition, by fine-tuning the model (hyperparameter tuning, SMOTE, k-fold cross-validation), we can further improve its recall.
3. Exploratory data analysis
We first checked for missing values in the dataset and found none.
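A minimal sketch of such a check is shown below. It uses a small made-up frame (`toy`, with one deliberately missing value) so the output is easy to read; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "int_rate": [0.11, 0.13, np.nan, 0.09],
    "fico": [737, 707, 682, 712],
    "not_fully_paid": [0, 0, 1, 0],
})

# Count missing values per column; all zeros means nothing needs imputing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isna().any().any())
print(f"Any missing values: {has_missing}")
```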
# Create train and test sets
train, test = train_test_split(df, test_size=0.2, random_state=0)
# Carve the validation set out of the training portion so it does not duplicate the test set
train, valid = train_test_split(train, test_size=0.2, random_state=0)
train.head()
# Train: X and y split
X_train = train.drop('not_fully_paid', axis=1)
y_train = train[['not_fully_paid']]
X_train.head()
y_train.head()
# Test: X and y split
X_test = test.drop('not_fully_paid', axis=1)
y_test = test[['not_fully_paid']]
X_test.head()
# Validation: X and y split
X_valid = valid.drop('not_fully_paid', axis=1)
y_valid = valid[['not_fully_paid']]
X_valid.head()
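A quick sanity check on a three-way split is to confirm that the parts are disjoint and that each keeps a similar share of the positive class. The sketch below does this on a made-up frame (`toy_df`). It also passes `stratify=`, which is an assumption on our part and is not used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame (assumption): 1000 rows, 20% in the positive class
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "fico": rng.integers(600, 850, size=1000),
    "not_fully_paid": np.repeat([0, 1], [800, 200]),
})

# First hold out the test set, then split the remainder into train and validation
toy_train_val, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=0, stratify=toy_df["not_fully_paid"]
)
toy_train, toy_valid = train_test_split(
    toy_train_val, test_size=0.2, random_state=0,
    stratify=toy_train_val["not_fully_paid"],
)

# Report the size and positive-class rate of each part
for name, part in [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]:
    print(name, len(part), round(part["not_fully_paid"].mean(), 3))

# No row should appear in more than one part
overlap = (
    set(toy_train.index) & set(toy_valid.index)
    | set(toy_train.index) & set(toy_test.index)
    | set(toy_valid.index) & set(toy_test.index)
)
print(f"Overlapping rows: {len(overlap)}")
```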