Insurance companies invest a lot of time and money into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries, it is a legal requirement to have car insurance in order to drive a vehicle on public roads, so the market is very large!
Knowing all of this, On the Road car insurance has requested your services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked you to use simple Logistic Regression, identifying the single feature that results in the best-performing model, as measured by accuracy.
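In other words, the deliverable is a set of one-feature logistic regressions compared by their accuracy. As a minimal sketch of what a single such fit looks like in R (using age purely as an illustrative feature and assuming the data have already been read into car_data; the full solution below loops over every candidate column):

model <- glm(outcome ~ age, data = car_data, family = binomial)
preds <- as.integer(predict(model, type = "response") > 0.5)
accuracy <- mean(preds == car_data$outcome)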
They have supplied you with their customer data as a csv file called car_insurance.csv, along with a table (below) detailing the column names and descriptions.
The dataset
| Column | Description |
|---|---|
| id | Unique client identifier |
| age | Client's age |
| gender | Client's gender |
| driving_experience | Years the client has been driving |
| education | Client's level of education |
| income | Client's income level |
| credit_score | Client's credit score (between zero and one) |
| vehicle_ownership | Client's vehicle ownership status |
| vehicle_year | Year of vehicle registration |
| married | Client's marital status |
| children | Client's number of children |
| postal_code | Client's postal code |
| annual_mileage | Number of miles driven by the client each year |
| vehicle_type | Type of car |
| speeding_violations | Total number of speeding violations received by the client |
| duis | Number of times the client has been caught driving under the influence of alcohol |
| past_accidents | Total number of previous accidents the client has been involved in |
| outcome | Whether the client made a claim on their car insurance (response variable) |
# Import required libraries
library(readr)
library(dplyr)
library(tidyr)      # for drop_na()
library(glue)
library(yardstick)

# Start coding!

# Clean data
# Read the dataset
car_data <- read.csv("car_insurance.csv")
# Explore structure and summary
str(car_data)
summary(car_data)
colSums(is.na(car_data)) # Check for missing values
# Drop the ID column and any rows with missing values
clean_data <- car_data %>%
  dplyr::select(-id) %>%   # id is an identifier, not a predictor
  drop_na()

head(clean_data)
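# Alternative (a sketch, not part of the brief): rather than dropping incomplete
# rows, missing numeric values could be imputed with their column means so fewer
# observations are lost. No specific columns are assumed here; every numeric
# column is filled, which is only sensible for genuinely continuous variables.
imputed_data <- car_data %>%
  dplyr::select(-id) %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))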
# Ensure outcome is binary (0/1)
clean_data$outcome <- as.numeric(as.character(clean_data$outcome))
# Identify features
features <- setdiff(names(clean_data), "outcome")
# Data frame to store each feature and the accuracy of its single-feature model
model_results <- data.frame(feature = character(), accuracy = numeric(), stringsAsFactors = FALSE)
for (f in features) {
  fml <- as.formula(paste("outcome ~", f))
  model <- glm(fml, data = clean_data, family = binomial)

  # Predict in-sample probabilities and convert to 0/1 class labels
  preds <- ifelse(predict(model, type = "response") > 0.5, 1, 0)

  # Accuracy: proportion of correctly classified clients
  acc <- mean(preds == clean_data$outcome)

  # Store result
  model_results <- rbind(model_results, data.frame(feature = f, accuracy = acc))
}
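# Side note (a sketch, not required by the brief): accuracy could also be
# computed with the already-loaded yardstick package, which expects factor
# inputs. Shown here on the predictions left over from the last feature
# fitted in the loop above.
acc_yardstick <- yardstick::accuracy_vec(
  truth    = factor(clean_data$outcome, levels = c(0, 1)),
  estimate = factor(preds, levels = c(0, 1))
)
acc_yardstick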
# Measuring performance
# View all model accuracies
print(model_results)
best <- model_results %>%
  arrange(desc(accuracy)) %>%
  slice(1)
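# Optional check (a sketch, beyond what the brief asks for): refit the winning
# single-feature model so its coefficient can be inspected directly.
best_model <- glm(as.formula(paste("outcome ~", best$feature)),
                  data = clean_data, family = binomial)
summary(best_model)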
# Final output
best_feature_df <- data.frame(
  best_feature = best$feature,
  best_accuracy = best$accuracy
)
print(best_feature_df)
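# Since glue is already loaded, the result can also be reported as a readable
# one-liner (a sketch; rounding to 4 decimal places is an arbitrary choice).
glue("Best single feature: {best$feature} (accuracy = {round(best$accuracy, 4)})")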