Analyzing Credit Scores with tidymodels in R

    Welcome to Analyzing Credit Scores with tidymodels in R!

    In this live training, we'll explore what differentiates consumer credit score levels and demonstrate how dimensionality reduction can retain much of the information in a dataset while reducing its size. We'll use the embed and tidymodels packages to build UMAP and decision tree models. We will demonstrate the concept of information by comparing the performance of decision tree models before and after applying UMAP dimensionality reduction.

    If you want to learn more about dimensionality reduction and the tidymodels framework, check out the new Dimensionality Reduction in R course.

    Let's get started!

    Setup Environment

    First, we'll load the necessary packages -- tidyverse, tidymodels, and embed (note that embed needs to be installed first).

    I'm assuming you've used the tidyverse before. If you have not used the tidymodels or embed packages before, here's a quick summary.

    • tidymodels -- next generation of packages that incorporate tidyverse principles into machine learning and modeling.
    • embed -- contains extra recipes steps to create "embeddings" (i.e., encoding predictors)
    # install the 'embed' package
    install.packages("embed")
    # load the needed packages
    library(tidyverse)
    library(tidymodels)
    library(embed)
    # set options to enlarge our plots
    options(repr.plot.width=12, repr.plot.height=16)

    Load the Credit Data

    The data was adapted from Kaggle's "Credit score classification" data (thanks Rohan Paris!).

    We'll load it using read_csv() and take a glimpse of it.

    # the credit score data is available here
    data_url <- ""
    # use read_csv to load the data
    credit_df <- read_csv(data_url)
    # reorder the credit_score factor levels
    credit_df <- credit_df %>% 
      mutate(credit_score = factor(credit_score, levels = c("Poor", "Standard", "Good")))
    # look at the available features
    glimpse(credit_df)

    The data's dimensionality is just its number of columns. credit_df has 23 dimensions, or features -- one target variable (credit_score) and 22 predictor variables.

    The target variable -- credit_score -- is categorical and has three levels: Poor, Standard, and Good. So, from a machine learning perspective we'll be dealing with a classification problem.
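
    As a quick check on the two paragraphs above, the column count and the target's levels can be read straight from the data frame. The sketch below uses a tiny made-up tibble (toy_credit, with hypothetical values) in place of credit_df so that it runs on its own; the same calls should work on credit_df once it is loaded.

```r
library(dplyr)

# toy stand-in for credit_df (hypothetical values, for illustration only)
toy_credit <- tibble(
  annual_income = c(20000, 35000, 60000, 90000),
  age           = c(23, 31, 40, 52),
  credit_score  = factor(
    c("Poor", "Standard", "Standard", "Good"),
    levels = c("Poor", "Standard", "Good")
  )
)

# dimensionality = number of columns (target + predictors)
n_dims <- ncol(toy_credit)
n_predictors <- n_dims - 1

# the target's levels, in the order set by factor()
score_levels <- levels(toy_credit$credit_score)

# number of consumers in each credit score level
score_counts <- toy_credit %>% count(credit_score)
```

    On the real data, ncol(credit_df) would be expected to return 23, and count(credit_df, credit_score) shows how balanced the three classes are.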

    Our core objective is to understand what differentiates consumers with poor, standard, and good credit scores. In short, we want to explain why consumers' credit scores differ. Along the way, we'll learn about UMAP (a feature extraction algorithm) and the tidymodels framework.


    Let's visually explore credit_df a little and see if we can understand why consumers have different credit scores.

    NOTE: As humans we can't visualize high-dimensional data -- we are limited to about three dimensions (maybe four, if you add animation to capture time).

    What differentiates consumer credit scores?

    Let's generate a few plots to see if we can discover a few predictors that do a good job of separating the credit scores.

    Annual income density plot

    Let's start by plotting the distribution of annual income for each of the three credit score levels.

    # plot annual_income distribution for each credit score level
    credit_df %>%  
      ggplot(aes(x = annual_income, color = credit_score)) +
      geom_density() +
      xlim(0, 200000)

    Takeaway: Those with lower annual income tend to have poorer credit scores. That means that annual income contains information that helps us determine credit score.

    Age density plot

    Let's explore the age of consumers by creating a density plot of age for each of the credit score levels.

    # plot age distribution for each credit_score level
    credit_df %>%  
      ggplot(aes(x = age, color = credit_score)) +
      geom_density()

    Takeaway: Older consumers tend to have better credit scores. In other words, age also contains some information that is useful for determining credit_score.
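
    The income and age takeaways read off the density plots can also be cross-checked numerically by summarising each predictor within each credit score level. This is a minimal sketch on a made-up tibble (toy_credit); its values are hypothetical and chosen only to show the shape of the output, not to reflect the real data. Swapping in credit_df should give the actual medians.

```r
library(dplyr)

# toy stand-in for credit_df (hypothetical values, for illustration only)
toy_credit <- tibble(
  credit_score  = factor(
    c("Poor", "Poor", "Standard", "Standard", "Good", "Good"),
    levels = c("Poor", "Standard", "Good")
  ),
  annual_income = c(18000, 22000, 40000, 50000, 80000, 100000),
  age           = c(22, 26, 30, 36, 44, 50)
)

# median income and age within each credit score level
score_summary <- toy_credit %>%
  group_by(credit_score) %>%
  summarise(
    median_income = median(annual_income),
    median_age    = median(age)
  )

print(score_summary)
```

    The median is used here rather than the mean because income distributions are usually right-skewed; if the medians rise from Poor to Good on the real data, that agrees with what the density plots suggest.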

    Delay from due date vs. credit history months
    • Delay from due date = the average number of days late on payment

    • Credit history months = the number of months of credit history the consumer has on record

    Let's explore both of these features using a scatterplot that separates the credit score levels by color.

    # plot delay_from_due_date vs credit_history_months 
    credit_df %>%  
      ggplot(aes(x = delay_from_due_date, y = credit_history_months , color = credit_score)) +
      geom_jitter(alpha = 0.4)