Analyzing Credit Scores with tidymodels in R
Welcome to Analyzing Credit Scores with tidymodels in R!
In this live training, we'll explore what differentiates consumer credit score levels and demonstrate how dimensionality reduction can retain much of the information in a dataset while reducing its size. We'll use the embed and tidymodels packages to build UMAP and decision tree models. We'll demonstrate the concept of information by comparing the performance of decision tree models before and after applying UMAP dimensionality reduction.
If you want to learn more about dimensionality reduction and the tidymodels framework, check out the new Dimensionality Reduction in R course.
Let's get started!
Setup Environment
First, we'll load the necessary packages: tidyverse, tidymodels, and embed (note that we will need to install embed first).
I'm assuming you've used the tidyverse before. If you have not used the tidymodels or embed packages before, here's a quick summary.
- tidymodels: the next generation of packages that incorporate tidyverse principles into machine learning and modeling.
- embed: contains extra recipes steps to create "embeddings" (i.e., encodings of predictors).
# install the 'embed' package
install.packages('embed')
# load the needed packages
library(tidyverse)
library(tidymodels)
library(embed)
# set options to enlarge our plots
options(repr.plot.width=12, repr.plot.height=16)
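If you rerun this notebook often, reinstalling embed on every run is unnecessary. Below is a minimal sketch of a guard. Note that install_if_missing() is a hypothetical helper written for this example (it is not part of any package); it relies only on base R's requireNamespace() and install.packages().

```r
# hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  is_missing <- !requireNamespace(pkg, quietly = TRUE)
  if (is_missing) {
    install.packages(pkg)
  }
  # return (invisibly) whether an install was attempted
  invisible(is_missing)
}
```

Calling install_if_missing("embed") in place of the unconditional install.packages('embed') would skip the download when the package is already present.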
Load the Credit Data
The data was adapted from Kaggle's "Credit score classification" data (thanks Rohan Paris!).
We'll load it using read_csv() and take a glimpse of it.
# the credit score data is available here
data_url <- "https://assets.datacamp.com/production/repositories/6081/datasets/e02471e553bc28edddc1fe862666d36e04daed80/credit_score.csv"
# use read_csv to load the data
credit_df <- read_csv(data_url)
# reorder the credit_score factor levels
credit_df <- credit_df %>%
  mutate(credit_score = factor(credit_score, levels = c("Poor", "Standard", "Good")))
# look at the available features
glimpse(credit_df)
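The reordering step matters because factor() sorts its levels alphabetically by default, which would put Good before Poor. A small sketch with a made-up vector (not the real data) shows the difference:

```r
# made-up credit score labels, for illustration only
scores <- c("Standard", "Poor", "Good", "Poor")

# default: levels are sorted alphabetically
default_levels <- levels(factor(scores))
# "Good" "Poor" "Standard"

# explicit: levels follow the order we supply
ordered_levels <- levels(factor(scores, levels = c("Poor", "Standard", "Good")))
# "Poor" "Standard" "Good"
```

With the explicit order, plot legends and tabulations list the levels from Poor to Good.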
The data's dimensionality is just its number of columns. credit_df has 23 dimensions, or features: one target variable (credit_score) and 22 predictor variables.
The target variable, credit_score, is categorical and has three levels: Poor, Standard, and Good. So, from a machine learning perspective, we'll be dealing with a classification problem.
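To make the column counting concrete, here is a sketch on a tiny made-up data frame, toy_df, which stands in for credit_df. The values are invented; only the column names are borrowed from the data. The same base R calls should work on credit_df itself.

```r
# toy stand-in for credit_df (values are made up for illustration)
toy_df <- data.frame(
  credit_score = factor(
    c("Poor", "Good", "Standard", "Poor"),
    levels = c("Poor", "Standard", "Good")
  ),
  annual_income = c(21000, 88000, 45000, 30000),
  age = c(23, 51, 37, 29)
)

# dimensionality = number of columns
n_dims <- ncol(toy_df)

# every column except the target is a predictor
n_predictors <- n_dims - 1

# number of observations in each class of the target
class_counts <- table(toy_df$credit_score)
```

Checking class counts early is a cheap way to spot an imbalanced target before fitting a classifier.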
Our core objective is to understand what differentiates consumers with poor, standard, and good credit scores. In short, we want to explain why consumers' credit scores differ. Along the way, we'll learn about UMAP (a feature extraction algorithm) and the tidymodels framework.
Exploration
Let's visually explore credit_df a little and see if we can understand why consumers have different credit scores.
NOTE: As humans, we can't visualize high-dimensional data. We are limited to about three dimensions (maybe four, if you add animation to capture time).
What differentiates consumer credit scores?
Let's generate a few plots to see if we can find predictors that do a good job of separating the credit score levels.
Annual income density plot
Let's start by plotting the distribution of annual income for each of the three credit score levels.
# plot annual_income distribution for each credit score level
credit_df %>%
  ggplot(aes(x = annual_income, color = credit_score)) +
  geom_density() +
  xlim(0, 200000)
Takeaway: Those with lower annual income tend to have poorer credit scores. That means that annual income contains information that helps us determine credit score.
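A density plot gives a visual impression; a per-group summary can back it up with numbers. The sketch below uses a made-up tibble, toy_df, in place of credit_df, so the values are not results from the real data.

```r
library(dplyr)

# made-up stand-in for credit_df (illustration only)
toy_df <- tibble(
  credit_score = factor(
    c("Poor", "Poor", "Standard", "Good"),
    levels = c("Poor", "Standard", "Good")
  ),
  annual_income = c(21000, 30000, 45000, 88000)
)

# median annual income for each credit score level
income_by_score <- toy_df %>%
  group_by(credit_score) %>%
  summarise(median_income = median(annual_income))
```

Replacing toy_df with credit_df would give the medians for the real data, one row per credit score level.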
Age density plot
Let's explore the age of consumers by creating a density plot of age for each of the credit score levels.
# plot age distribution for each credit_score level
credit_df %>%
  ggplot(aes(x = age, color = credit_score)) +
  geom_density()
Takeaway: Older consumers tend to have better credit scores. In other words, age also contains some information that is useful for determining credit_score.
Delay from due date vs. credit history months
- Delay from due date = the average number of days late on payment
- Credit history months = the number of months of credit history the consumer has on record
Let's explore both of these features using a scatterplot that separates the credit score levels by color.
# plot delay_from_due_date vs credit_history_months
credit_df %>%
  ggplot(aes(x = delay_from_due_date, y = credit_history_months, color = credit_score)) +
  geom_jitter(alpha = 0.4)