Data Analysis: Predicting Credit Default & Wage Growth Forecast

Executive Summary

This project applied decision tree models to uncover key drivers in two datasets: credit default risk and wage growth. In the credit model, missed payments emerged as the most influential factor, followed by credit utilization, with asset ownership playing a smaller role — yielding 92% accuracy, 96% precision, and 93% recall. In the economic model, recession status and GDP growth rate were the strongest predictors of wage growth, with an RMSE of 0.85 indicating strong predictive reliability. Scenario testing revealed that borrowers with clean payment histories and higher assets face significantly lower default risk, while robust GDP growth and low unemployment consistently align with stronger wage growth. These findings provide actionable, interpretable insights for both financial decision-making and economic policy planning.

Image by Megan Rexazin Conde via Pixabay

Introduction

This project applies decision tree algorithms to address two distinct predictive modeling challenges. The first model predicts credit default risk based on individual financial behavior and asset ownership. The second model forecasts wage growth using macroeconomic indicators. Both models leverage real-world-like datasets to demonstrate supervised learning workflows, model optimization, and practical scenario analysis. The goal is to produce actionable insights that would inform financial institutions' risk management strategies and policymakers' economic planning.

Data Preparation

We worked with two distinct datasets, each designed to answer a specific business question.

Credit Card Default Data (600 records, 8 fields): This dataset captures customer payment history, credit usage, and asset ownership. Key factors we focused on included whether a customer had missed a payment, their level of credit utilization, and the assets they owned (none, a car, a house, or both). The goal was to predict whether a customer would default on their credit obligations.

Economic Data (99 records, 6 fields): This dataset reflects broader economic conditions, including whether the economy was in recession, the unemployment rate, and GDP growth. Our objective here was to forecast wage growth under different economic scenarios.

# Step 1

# Loading package to show the decision tree
install.packages("rpart.plot")

# Loading credit card default data set
credit_default <- read.csv(file='credit_card_default.csv', header=TRUE, sep=",")

print("installation completed!")
print("data set (first 6 rows)")
head(credit_default, 6)

# ---- Step 1: type cleanup (REPLACE your current block with this) ----
to_bool_factor <- function(x) {
  sx <- as.character(x)
  yes_vals <- c("1","yes","y","true","t","Yes","YES")
  no_vals  <- c("0","no","n","false","f","No","NO")
  out <- ifelse(sx %in% yes_vals, "yes",
         ifelse(sx %in% no_vals,  "no", NA))
  factor(out, levels = c("no","yes"))
}

# Map 0/1 or yes/no to "no"/"yes"
credit_default$default        <- to_bool_factor(credit_default$default)
credit_default$missed_payment <- to_bool_factor(credit_default$missed_payment)

# Recode assets (numeric 1/2/3 or character "1"/"2"/"3") -> labels
credit_default$assets <- factor(as.character(credit_default$assets),
  levels = c("1","2","3"),
  labels = c("none","car","car_house")
)

# numeric
credit_default$credit_utilize <- as.numeric(credit_default$credit_utilize)

# Save factor levels for prediction use later
lv_mp     <- levels(credit_default$missed_payment)
lv_assets <- levels(credit_default$assets)

credit_default$assets <- factor(credit_default$assets,
  levels = c("1","2","3"),
  labels = c("none","car","car_house")
)

# Load required libraries
library(rpart)
library(rpart.plot)

3 Classification Decision Tree: Credit Default Prediction

3.1 Methodology The dataset was split into training (70%) and validation (30%) sets using a fixed random seed for reproducibility. A classification decision tree was trained using the 'rpart' package in R, with 'missed_payment', 'credit_utilize', and 'assets' as predictors. Model complexity was tuned using the cost-complexity pruning parameter (cp), selected based on validation error minimization.

3.2 Results The pruned classification tree revealed 'missed_payment' as the most significant predictor of default, followed by 'credit_utilize'. Asset ownership had a smaller but measurable impact. The confusion matrix on the validation set yielded strong predictive performance: • Accuracy: 92% • Precision: 96% • Recall: 93% True negatives were correctly identified in 74 cases, and true positives in 100 cases, with minimal false classifications.

3.3 Scenario Predictions • Scenario A: No missed payments, owns car and house, 30% credit utilization → Predicted: No Default (Probability ≈ 19.9%) • Scenario B: Missed payments, no assets, 30% credit utilization → Predicted: Default (Probability ≈ 95.3%)

# Step 2
cat("\nStep 2: Split 70/30 (train/test)\n")
set.seed(6751342)
samp.size <- floor(0.70 * nrow(credit_default))
train_ind <- sample(seq_len(nrow(credit_default)), size = samp.size)
train.data1 <- credit_default[train_ind, ]
test.data1  <- credit_default[-train_ind, ]
cat("Train rows:", nrow(train.data1), " | Test rows:", nrow(test.data1), "\n")

# Step 3

cat("\nStep 3: Fit classification tree (missed_payment + credit_utilize + assets)\n")
set.seed(6751342)
model1 <- rpart(
  default ~ missed_payment + credit_utilize + assets,
  method = "class",
  data   = train.data1,
  control = rpart.control(minsplit = 10)
)
printcp(model1)

# Step 4
cat("\nStep 4: Plot cp table\n")
plotcp(model1, minline = TRUE, lty = 3, col = 2, upper = c("size","splits","none"))

# Step 5

set.seed(6751342)
pruned_model1 <- rpart(
  default ~ missed_payment + credit_utilize + assets,
  method = "class",
  data = train.data1,
  control = rpart.control(cp = 0.010101)
)
printcp(pruned_model1)

# Step 6
cat("\nStep 6: Plot pruned classification tree\n")
rpart.plot(pruned_model1)

# Step 7

library(rpart.plot)
rpart.plot(pruned_model1)

# make predictions
pred <- predict(pruned_model1, newdata = test.data1, type = "class")

# ensure consistent levels (adjust the order/labels if your positive class comes first)
test.data1$default <- factor(test.data1$default, levels = c("no","yes"))
pred <- factor(pred, levels = levels(test.data1$default))

# build confusion matrix with consistent dimensions
conf.matrix <- table(
  "Actual default"    = test.data1$default,
  "Prediction default"= pred
)

cat("Confusion Matrix\n")
print(conf.matrix)

# Step 7.5

# prediction 1
newdata1 <- data.frame(
  missed_payment  = factor("no",  levels = lv_mp),
  assets          = factor("car_house", levels = lv_assets),  # ensure this level exists exactly
  credit_utilize  = 0.30
)
predict(pruned_model1, newdata1, type = "class")

# prediction 2
newdata2 <- data.frame(
  missed_payment  = factor("yes", levels = lv_mp),
  assets          = factor("none", levels = lv_assets),
  credit_utilize  = 0.30
)
predict(pruned_model1, newdata2, type = "class")

want <- c("no","yes")
have <- colnames(conf.matrix)
conf.matrix <- conf.matrix[, intersect(want, have), drop = FALSE]

# Step 8

# Load the data set
economic <- read.csv(file='economic.csv', header=TRUE, sep=",")

# Print the first six rows
print("head")
head(economic, 6)

4 Regression Decision Tree: Wage Growth Forecasting

4.1 Methodology The economic dataset was split into training (80%) and validation (20%) sets. A regression decision tree was trained using 'economy', 'unemployment', and 'gdp' as predictors. The optimal cp value was selected based on the lowest validation RMSE.

4.2 Results The model identified 'economy' status as the primary driver of wage growth, followed by GDP growth rate. The validation RMSE was approximately 0.85, indicating that predictions deviated from actual wage growth by less than 1 percentage point on average.

4.3 Scenario Predictions • Scenario A: No recession, unemployment 3.4%, GDP growth 3.5% → Predicted wage growth ≈ 7.2% • Scenario B: Recession, unemployment 7.4%, GDP growth 1.4% → Predicted wage growth ≈ 2.1%

‌
‌
‌