Seaborn Heatmaps Tutorial - Loan Data
This dataset consists of more than 9,500 loans with information on the loan structure, the borrower, and whether the loan was paid back in full. The data was extracted from LendingClub.com, a company that connects borrowers with investors.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
loan_data = pd.read_csv("loan_data.csv")
print(loan_data.shape)
loan_data.head(100)
Data dictionary
# | Variable | Explanation |
---|---|---|
0 | credit_policy | 1 if the customer meets the credit underwriting criteria; 0 otherwise. |
1 | purpose | The purpose of the loan. |
2 | int_rate | The interest rate of the loan (more risky borrowers are assigned higher interest rates). |
3 | installment | The monthly installments owed by the borrower if the loan is funded. |
4 | log_annual_inc | The natural log of the self-reported annual income of the borrower. |
5 | dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
6 | fico | The FICO credit score of the borrower. |
7 | days_with_cr_line | The number of days the borrower has had a credit line. |
8 | revol_bal | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle). |
9 | revol_util | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available). |
10 | inq_last_6mths | The borrower's number of inquiries by creditors in the last 6 months. |
11 | delinq_2yrs | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
12 | pub_rec | The borrower's number of derogatory public records. |
13 | not_fully_paid | 1 if the loan is not fully paid; 0 otherwise. |
Source of dataset.
Prepare Your Data
We will look for missing data and outliers.
loan_data.describe()
# Count missing values
missing_values_count = loan_data.isnull().sum()
print(missing_values_count)
Our dataset does not contain any missing values, so we can move on to the outliers.
# Subset the dataframe to the numeric columns from int.rate through revol.util
numeric_df = loan_data.loc[:, "int.rate":"revol.util"]
# Create a list of numeric columns only
numeric_cols = numeric_df.columns.tolist()
# Look for far-out outliers: values more than 3 * IQR (interquartile range) beyond the quartiles
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR
outlier_bool = (numeric_df > upper_bound) | (numeric_df < lower_bound)
print(outlier_bool.sum())
# Identify which rows to remove
rows_to_delete = outlier_bool.any(axis=1)
# Create a filtered dataframe with outliers removed (~ inverts the boolean mask)
filtered_df = numeric_df[~rows_to_delete]
print(f"Rows removed: {len(numeric_df)-len(filtered_df)}")
We removed 457 rows that contained at least one value beyond the far-out boundaries (more than 3 times the IQR below Q1 or above Q3). Depending on the goals of your project, you may decide to retain these outliers.
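To make the far-out rule concrete, here is a minimal sketch on a toy dataframe. The `balance` column and its values are invented for illustration and do not come from loan_data.csv; the steps mirror the ones applied to `numeric_df` above.

```python
import pandas as pd

# Toy data (hypothetical values): seven ordinary points and one extreme one.
toy = pd.DataFrame({"balance": [10, 12, 13, 14, 15, 16, 18, 500]})

# Quartiles and interquartile range, computed per column.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Far-out fences sit 3 * IQR beyond the quartiles.
lower_fence = q1 - 3 * iqr
upper_fence = q3 + 3 * iqr

# Flag values outside the fences, then flag any row with a flagged value.
is_outlier = (toy < lower_fence) | (toy > upper_fence)
rows_with_outlier = is_outlier.any(axis=1)

# Keep only the rows without outliers; ~ inverts the boolean mask.
toy_filtered = toy[~rows_with_outlier]
print(f"Rows removed: {len(toy) - len(toy_filtered)}")
```

Only the row holding 500 lies outside the fences here, so it is the only row dropped. Replacing the multiplier 3 with 1.5 gives the more common, stricter rule, which typically flags more rows.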
Creating Your First Heatmap
We will create a heatmap showing the correlation coefficient between each pair of numeric variables in our data. By default, corr() computes the Pearson correlation.
# Calculate the correlation matrix
correlation_matrix = filtered_df.corr()
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap="coolwarm")
plt.show()
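It can help to see what a correlation matrix looks like before reading it as colors. The sketch below builds one from synthetic data; the column names and the random values are assumptions made up for this example, not part of the loan dataset. It also builds an optional boolean mask for the redundant upper triangle.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical columns) with known relationships to x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_plus_noise": x + rng.normal(scale=0.1, size=200),  # strongly positive
    "neg_x": -x,                                          # perfectly negative
    "unrelated": rng.normal(size=200),                    # independent of x
})

# Pairwise Pearson correlations: a square, symmetric matrix
# with ones on the diagonal and values between -1 and 1.
corr = toy.corr()
print(corr.round(2))

# Optional: True marks the cells above the diagonal, which repeat
# the information already shown below it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
```

Because the matrix is symmetric, passing `mask=mask` to `sns.heatmap` hides the duplicate upper triangle. Setting `vmin=-1, vmax=1` pins the color scale to the full correlation range, and `annot=True` prints each coefficient in its cell.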