Insurance companies invest a lot of time and money into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries it is a legal requirement to have car insurance in order to drive a vehicle on public roads, so the market is very large!
Knowing all of this, On the Road car insurance have requested my services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked me to identify the single feature that results in the best performing model, as measured by accuracy, so they can start with a simple model in production.
They have supplied me with their customer data as a CSV file called car_insurance.csv. The table below lists the column names and their descriptions.
The dataset
| Column | Description |
|---|---|
| id | Unique client identifier |
| age | Client's age |
| gender | Client's gender |
| driving_experience | Years the client has been driving |
| education | Client's level of education |
| income | Client's income level |
| credit_score | Client's credit score (between zero and one) |
| vehicle_ownership | Client's vehicle ownership status |
| vehcile_year | Year of vehicle registration |
| married | Client's marital status |
| children | Client's number of children |
| postal_code | Client's postal code |
| annual_mileage | Number of miles driven by the client each year |
| vehicle_type | Type of car |
| speeding_violations | Total number of speeding violations received by the client |
| duis | Number of times the client has been caught driving under the influence of alcohol |
| past_accidents | Total number of previous accidents the client has been involved in |
| outcome | Whether the client made a claim on their car insurance (response variable) |
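Before modelling, it can help to check the loaded columns against the table above and to see which ones contain missing values. The following is a minimal sketch, not part of the original analysis: `summarize_columns` is a hypothetical helper, and the toy DataFrame stands in for car_insurance.csv with made-up values.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })


# Toy stand-in for car_insurance.csv (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "credit_score": [0.62, None, 0.35],
    "annual_mileage": [12000.0, 11000.0, None],
    "outcome": [0, 1, 0],
})
summary = summarize_columns(toy)
print(summary)
```

Calling the same helper on the real DataFrame would show which columns need imputation before fitting.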
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import logit
# Loading the car insurance dataset
car_ins = pd.read_csv('car_insurance.csv')
# Filling missing values in 'credit_score' and 'annual_mileage' columns with their respective means
car_ins['credit_score'] = car_ins['credit_score'].fillna(car_ins['credit_score'].mean())
car_ins['annual_mileage'] = car_ins['annual_mileage'].fillna(car_ins['annual_mileage'].mean())
# Preparing data by dropping 'outcome' and 'id' columns to get feature set
features = car_ins.drop(columns=['outcome', 'id'])
# Initializing a list to store models
models = []
# Building logistic regression models for each feature and storing them
for col in features.columns:
    model = logit(f"outcome ~ {col} + 0", data=car_ins).fit()
    models.append(model)
# Initializing a list to store accuracy of each model
accuracies = []
# Calculating accuracy for each model and storing it
for i in range(len(models)):
    # pred_table() rows are observed classes, columns are predicted classes
    conf_matrix = models[i].pred_table()
    TN = conf_matrix[0,0]
    FN = conf_matrix[1,0]
    FP = conf_matrix[0,1]
    TP = conf_matrix[1,1]
    accuracy = (TN + TP) / (TN + TP + FN + FP)
    accuracies.append(accuracy)
# Creating a DataFrame to store accuracies and feature names
accuracy_df = pd.DataFrame({'feature': features.columns, 'accuracy': accuracies})
# Heatmap to visualize the accuracy of the features
plt.figure(figsize=(12, 8))
sns.heatmap(accuracy_df.pivot_table(index='feature', values='accuracy'), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Accuracy Heatmap')
plt.show()
# Identifying the best performing model based on accuracy
highest_acc_index = accuracies.index(max(accuracies))
best_feature = features.columns[highest_acc_index]
# Creating a DataFrame to display the best feature and its accuracy
best_feature_df = pd.DataFrame({'best_feature': [best_feature], 'best_accuracy': [max(accuracies)]})
best_feature_df

The model result
The resulting DataFrame shows that driving_experience gives the highest single-feature model accuracy. Among the features considered, it is therefore the best individual predictor of whether a customer makes a claim during the policy period. A plausible explanation is that more experienced drivers are less prone to accidents and so less likely to claim. For On the Road car insurance, this makes driving_experience the natural choice for a simple first model in production.
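The accuracy used to rank the features is the share of correct predictions in each model's confusion matrix, that is, the diagonal divided by the total. A minimal sketch of that calculation follows; `accuracy_from_pred_table` is a hypothetical helper and the 2x2 table is made up for illustration, not taken from the fitted models.

```python
import numpy as np


def accuracy_from_pred_table(table) -> float:
    """Accuracy = correct predictions (diagonal) / all predictions."""
    table = np.asarray(table, dtype=float)
    return float(np.trace(table) / table.sum())


# Hypothetical 2x2 table: rows are observed classes, columns are predicted classes.
toy_table = [[60, 10],   # observed 0: 60 true negatives, 10 false positives
             [15, 15]]   # observed 1: 15 false negatives, 15 true positives
toy_accuracy = accuracy_from_pred_table(toy_table)
print(toy_accuracy)  # 0.75
```

Here 75 of the 100 toy predictions lie on the diagonal, so the accuracy is 0.75.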
The next best performing feature, according to the heatmap above, is income. One possible explanation is that clients with higher incomes can afford safer vehicles or better driver training, which could reduce the likelihood of a claim. This is useful for On the Road car insurance when tailoring products and premiums to income levels. If the model is later extended beyond a single feature, income would be a natural candidate to add.
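The ranking can also be read from a sorted table instead of the heatmap. The sketch below assumes a DataFrame with `feature` and `accuracy` columns, like the one built earlier; `rank_features` is a hypothetical helper, and the feature names and values are placeholders rather than results from car_insurance.csv.

```python
import pandas as pd


def rank_features(accuracy_df: pd.DataFrame) -> pd.DataFrame:
    """Sort a feature/accuracy table from best to worst and add a 1-based rank."""
    ranked = accuracy_df.sort_values("accuracy", ascending=False).reset_index(drop=True)
    ranked["rank"] = ranked.index + 1
    return ranked


# Placeholder values for illustration only; not results from car_insurance.csv.
toy_accuracy_df = pd.DataFrame({
    "feature": ["feature_a", "feature_b", "feature_c"],
    "accuracy": [0.70, 0.78, 0.74],
})
ranked = rank_features(toy_accuracy_df)
print(ranked)
```

The first two rows of the ranked table then give the best and second-best single features directly.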