Seaborn Heatmaps Tutorial - Loan Data

This dataset consists of more than 9,500 loans, with information on the loan structure, the borrower, and whether the loan was paid back in full. The data was extracted from LendingClub.com, a company that connects borrowers with investors.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
loan_data = pd.read_csv("loan_data.csv")
print(loan_data.shape)
loan_data.head(100)

Data dictionary

Variable            Explanation
credit_policy       1 if the customer meets the credit underwriting criteria; 0 otherwise.
purpose             The purpose of the loan.
int_rate            The interest rate of the loan (more risky borrowers are assigned higher interest rates).
installment         The monthly installments owed by the borrower if the loan is funded.
log_annual_inc      The natural log of the self-reported annual income of the borrower.
dti                 The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico                The FICO credit score of the borrower.
days_with_cr_line   The number of days the borrower has had a credit line.
revol_bal           The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol_util          The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq_last_6mths      The borrower's number of inquiries by creditors in the last 6 months.
delinq_2yrs         The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub_rec             The borrower's number of derogatory public records.
not_fully_paid      1 if the loan is not fully paid; 0 otherwise.
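
The code later in this tutorial selects columns as "int.rate" and "revol.util", which suggests the CSV headers use dots rather than the underscores shown in the data dictionary. If you would rather work with the underscore names, a rename along these lines should do it. This is a minimal sketch on a toy frame (the values are made up, and the dotted headers are an assumption); the rest of the tutorial keeps the dotted names.

```python
import pandas as pd

# Toy frame standing in for loan_data. The dotted headers are an assumption
# based on the column labels used later in this tutorial.
toy = pd.DataFrame({"int.rate": [0.11, 0.13], "revol.util": [52.1, 76.7]})

# rename() returns a new dataframe, so the original dotted names stay intact
renamed = toy.rename(columns=lambda name: name.replace(".", "_"))

print(renamed.columns.tolist())
```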

Source of dataset.

Prepare Your Data

We will look for missing data and outliers.

loan_data.describe()
# Count missing values
missing_values_count = loan_data.isnull().sum()
print(missing_values_count)

Our dataset does not contain any missing values so we can move on to the outliers.

# Subset dataframe for numeric columns only
numeric_df = loan_data.loc[:, "int.rate":"revol.util"]

# Create a list of numeric columns only
numeric_cols = numeric_df.columns.tolist()
# Look for far-out outliers using the IQR (interquartile range)
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

outlier_bool = (numeric_df > upper_bound) | (numeric_df < lower_bound)
print(outlier_bool.sum())
# Identify which rows to remove
rows_to_delete = outlier_bool.any(axis=1)

# Create a filtered dataframe with outliers removed
# (~ is the element-wise NOT operator for boolean masks)
filtered_df = numeric_df[~rows_to_delete]
print(f"Rows removed: {len(numeric_df)-len(filtered_df)}")

We removed 457 rows that contained at least one far-out outlier, meaning a value more than 3 × IQR below the first quartile or above the third quartile. Depending on the goals of your project, you may decide to retain these outliers.
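
To see how the far-out fences behave, here is a small sketch of the same steps on a toy column. The values are hypothetical and chosen so that exactly one observation falls outside the fences.

```python
import pandas as pd

# Hypothetical values: nine ordinary observations and one extreme one
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Far-out fences sit 3 * IQR beyond the quartiles
lower = q1 - 3 * iqr
upper = q3 + 3 * iqr

# Flag values outside the fences, then keep rows with no flagged value
is_outlier = (toy < lower) | (toy > upper)
kept = toy[~is_outlier.any(axis=1)]

print(f"Fences: {lower['x']} to {upper['x']}")
print(f"Rows removed: {len(toy) - len(kept)}")
```

Using 3 × IQR rather than the more common 1.5 × IQR makes the filter more conservative, so only the most extreme values are dropped.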

Creating Your First Heatmap

We will create a heatmap showing the correlation coefficient between each pair of numeric variables in our data. By default, corr() computes the Pearson correlation, which ranges from -1 to 1.

# Calculate the correlation matrix
correlation_matrix = filtered_df.corr()
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap="coolwarm")
plt.show()
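
A few optional arguments can make a correlation heatmap easier to read. The sketch below is one possible variation, not part of the original walkthrough: it uses synthetic data with hypothetical column names in place of filtered_df, fixes the color scale to the full range from -1 to 1, prints the coefficients in each cell, and hides the upper triangle, which mirrors the lower one.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for filtered_df; column names are hypothetical
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),   # positively related to a
    "c": -a + rng.normal(scale=0.5, size=200),  # negatively related to a
    "d": rng.normal(size=200),                  # unrelated noise
})

corr = demo_df.corr()  # Pearson correlation by default

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    mask=mask,
    cmap="coolwarm",
    vmin=-1, vmax=1, center=0,  # anchor the diverging colormap at zero
    annot=True, fmt=".2f",      # print each coefficient to two decimals
    square=True,
    ax=ax,
)
ax.set_title("Correlation matrix (synthetic data)")
plt.show()
```

Fixing vmin and vmax keeps the colors comparable across plots; without them, the color scale is fitted to the range of values in the matrix being drawn.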