Seaborn Heatmaps Tutorial - Loan Data

This dataset consists of more than 9,500 loans with information on the loan structure, the borrower, and whether the loan was paid back in full. The data was extracted from LendingClub.com, a company that connects borrowers with investors.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

loan_data = pd.read_csv("loan_data.csv")
print(loan_data.shape)
loan_data.head(100)

Data dictionary

	Variable	Explanation
0	credit_policy	1 if the customer meets the credit underwriting criteria; 0 otherwise.
1	purpose	The purpose of the loan.
2	int_rate	The interest rate of the loan (more risky borrowers are assigned higher interest rates).
3	installment	The monthly installments owed by the borrower if the loan is funded.
4	log_annual_inc	The natural log of the self-reported annual income of the borrower.
5	dti	The debt-to-income ratio of the borrower (amount of debt divided by annual income).
6	fico	The FICO credit score of the borrower.
7	days_with_cr_line	The number of days the borrower has had a credit line.
8	revol_bal	The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
9	revol_util	The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
10	inq_last_6mths	The borrower's number of inquiries by creditors in the last 6 months.
11	delinq_2yrs	The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
12	pub_rec	The borrower's number of derogatory public records.
13	not_fully_paid	1 if the loan is not fully paid; 0 otherwise.

Source of dataset.

Prepare Your Data

We will look for missing data and outliers.

loan_data.describe()

# Count missing values
missing_values_count = loan_data.isnull().sum()
print(missing_values_count)

Our dataset does not contain any missing values, so we can move on to the outliers.

# Subset dataframe for numeric columns only
numeric_df = loan_data.loc[:, "int.rate":"revol.util"]

# Create a list of numeric columns only
numeric_cols = numeric_df.columns.tolist()

# Look for far-out outliers using the IQR (interquartile range):
# values more than 3 * IQR below Q1 or above Q3
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

outlier_bool = (numeric_df > upper_bound) | (numeric_df < lower_bound)
print(outlier_bool.sum())

# Identify which rows to remove
rows_to_delete = outlier_bool.any(axis=1)

# Create a filtered dataframe with outliers removed
# (use ~ to invert a boolean mask; unary minus is not a logical NOT)
filtered_df = numeric_df[~rows_to_delete]
print(f"Rows removed: {len(numeric_df)-len(filtered_df)}")

We removed 457 rows that contained at least one value beyond the far-out boundary (more than 3 × IQR below Q1 or above Q3). Depending on the goals of your project, you may decide to retain these outliers.
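To make the far-out rule concrete, here is a minimal sketch on a small made-up column. The `toy` dataframe and its values are hypothetical and are not part of the loan data; the steps mirror the ones used above on `numeric_df`.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values and one extreme value.
toy = pd.DataFrame({"value": [10, 11, 12, 12, 13, 13, 14, 15, 16, 100]})

# Quartiles and interquartile range of the column
q1 = toy["value"].quantile(0.25)
q3 = toy["value"].quantile(0.75)
iqr = q3 - q1

# Far-out fences sit 3 * IQR beyond the quartiles
lower_fence = q1 - 3 * iqr
upper_fence = q3 + 3 * iqr

# Flag values outside the fences, then keep everything else
is_outlier = (toy["value"] < lower_fence) | (toy["value"] > upper_fence)
kept = toy[~is_outlier]

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Rows removed: {len(toy) - len(kept)}")
```

Only the extreme value falls outside the fences here. Using a multiplier of 1.5 instead of 3 would give the narrower, more common outlier rule and would typically remove more rows.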

Creating Your First Heatmap

We will create a heatmap showing the correlation coefficient between each of the numeric variables in our data.

# Calculate the correlation matrix
correlation_matrix = filtered_df.corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap="coolwarm")
plt.show()
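A correlation matrix is symmetric, so the upper triangle of the heatmap repeats the lower one. One optional refinement is to hide the repeated half with a boolean mask. The sketch below builds such a mask on made-up data; the `toy` dataframe, its column names, and the random seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three independent columns plus one ("d")
# constructed to be strongly correlated with "a".
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy["d"] = 2 * toy["a"] + rng.normal(scale=0.1, size=50)

# Pearson correlation matrix (the default for DataFrame.corr)
corr = toy.corr()

# True above the diagonal: these are the cells to hide
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Preview of what remains visible: masked cells become NaN
lower_triangle = corr.mask(mask)
print(lower_triangle.round(2))
```

The same `mask` array can be passed to `sns.heatmap` through its `mask` parameter. Setting `annot=True` prints the coefficient in each cell, and `vmin=-1, vmax=1` pins the color scale to the full range of a correlation coefficient.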

