Seaborn Heatmaps Tutorial - Loan Data

    This dataset consists of more than 9,500 loans, with information on the loan structure, the borrower, and whether the loan was paid back in full. The data was extracted from LendingClub.com, a company that connects borrowers with investors.

    # Import libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    loan_data = pd.read_csv("loan_data.csv")
    print(loan_data.shape)
    loan_data.head(100)

    Data dictionary

    Variable           Explanation
    credit_policy      1 if the customer meets the credit underwriting criteria; 0 otherwise.
    purpose            The purpose of the loan.
    int_rate           The interest rate of the loan (more risky borrowers are assigned higher interest rates).
    installment        The monthly installments owed by the borrower if the loan is funded.
    log_annual_inc     The natural log of the self-reported annual income of the borrower.
    dti                The debt-to-income ratio of the borrower (amount of debt divided by annual income).
    fico               The FICO credit score of the borrower.
    days_with_cr_line  The number of days the borrower has had a credit line.
    revol_bal          The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
    revol_util         The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
    inq_last_6mths     The borrower's number of inquiries by creditors in the last 6 months.
    delinq_2yrs        The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
    pub_rec            The borrower's number of derogatory public records.
    not_fully_paid     1 if the loan is not fully paid; 0 otherwise.

    Source of dataset.

    Prepare Your Data

    We will look for missing data and outliers.

    loan_data.describe()
    # Count missing values
    missing_values_count = loan_data.isnull().sum()
    print(missing_values_count)

    Our dataset does not contain any missing values, so we can move on to the outliers.
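For comparison, here is a minimal sketch of what the same check reports when values are missing. The `toy` frame and its contents are made up for illustration and are not taken from the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries
toy = pd.DataFrame({
    "fico": [700, np.nan, 680, 720],
    "dti": [10.5, 12.0, np.nan, np.nan],
    "purpose": ["credit_card", "all_other", None, "credit_card"],
})

# isnull() marks NaN and None as True; summing counts them per column
missing_count = toy.isnull().sum()

# The mean of a boolean column is the share of missing entries
missing_share = toy.isnull().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

A column with a large share of missing values may call for imputation or removal before computing correlations; which one is appropriate depends on the project.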

    # Subset dataframe for numeric columns only
    numeric_df = loan_data.loc[:, "int.rate":"revol.util"]
    
    # Create a list of numeric columns only
    numeric_cols = numeric_df.columns.tolist()
    # Look for far-out outliers using the IQR (interquartile range)
    Q1 = numeric_df.quantile(0.25)
    Q3 = numeric_df.quantile(0.75)
    
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    
    outlier_bool = (numeric_df > upper_bound) | (numeric_df < lower_bound)
    print(outlier_bool.sum())
    # Identify which rows to remove
    rows_to_delete = outlier_bool.any(axis=1)
    
    # Create a filtered dataframe with outliers removed (~ inverts the boolean mask)
    filtered_df = numeric_df[~rows_to_delete]
    print(f"Rows removed: {len(numeric_df) - len(filtered_df)}")

    We removed 457 rows containing at least one value beyond the far-out fences (more than 3 × IQR below Q1 or above Q3). Depending on the goals of your project, you may decide to retain these outliers.
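To make the fence logic concrete, the following sketch applies the same steps to a small made-up frame with one obviously extreme value. The `toy` data and column names are hypothetical; only the 3 × IQR rule mirrors the code above.

```python
import pandas as pd

# Hypothetical toy data: "balance" contains one extreme value
toy = pd.DataFrame({
    "rate": [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17],
    "balance": [100, 120, 130, 150, 160, 170, 180, 5000],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Far-out fences: 3 * IQR beyond the quartiles, one pair per column
lower = q1 - 3 * iqr
upper = q3 + 3 * iqr

# The fence Series are aligned with the DataFrame columns by label
is_outlier = (toy < lower) | (toy > upper)
rows_with_outlier = is_outlier.any(axis=1)

# Keep only the rows where no column is flagged
toy_filtered = toy[~rows_with_outlier]

print(is_outlier.sum())
print(f"Rows removed: {len(toy) - len(toy_filtered)}")
```

Note that a row is dropped as soon as any single column is flagged, so the number of rows removed can be smaller than the total count of flagged values.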

    Creating Your First Heatmap

    We will create a heatmap showing the correlation coefficient between each pair of numeric variables in our data.
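As a quick illustration of what each cell of such a matrix holds, the sketch below computes a correlation matrix on made-up data and reproduces one entry by hand. It relies on `DataFrame.corr()` using the Pearson coefficient by default; the `toy` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, not taken from the loan data
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8],  # rises with x
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls exactly as x rises
})

# Pairwise Pearson correlation of all columns
corr = toy.corr()
print(corr.round(3))

# The same coefficient for x and y computed by hand:
# sum of co-deviations divided by the root of the product of squared deviations
x, y = toy["x"], toy["y"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
print(round(r_manual, 3))
```

Values range from -1 (perfect negative linear relationship) to 1 (perfect positive), and the diagonal is always 1 because every variable is perfectly correlated with itself.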

    # Calculate the correlation matrix
    correlation_matrix = filtered_df.corr()
    # Create the heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_matrix, cmap="coolwarm")
    plt.show()
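A correlation heatmap is often easier to read with the coefficients printed in the cells, the color scale pinned to the full range from -1 to 1, and the redundant upper triangle hidden. The sketch below shows one possible way to do this with the `annot`, `fmt`, `vmin`, `vmax`, and `mask` parameters of `sns.heatmap`. The `demo_df` frame is random stand-in data, an assumption made so the example runs on its own; in the tutorial you would pass `correlation_matrix` instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for filtered_df: four random columns
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
# Make "b" depend on "a" so that one pair is strongly correlated
demo_df["b"] = 0.8 * demo_df["a"] + 0.2 * demo_df["b"]

corr = demo_df.corr()

# True cells are hidden: the upper triangle, including the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    mask=mask,
    cmap="coolwarm",
    vmin=-1,       # pin the color scale to the full correlation range
    vmax=1,
    annot=True,    # print each coefficient in its cell
    fmt=".2f",     # two decimal places
    ax=ax,
)
ax.set_title("Correlation matrix (demo data)")
plt.show()
```

Pinning `vmin` and `vmax` keeps the midpoint of the diverging colormap at zero, so colors remain comparable across different datasets.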