Insurance Claims Prediction: Finding the Most Informative Feature
---
Car insurance is mandatory in most countries, making the market huge and competitive. Insurance providers aim to accurately assess the risk of a customer making a claim to set fair premiums and minimize losses.
On the Road Car Insurance has asked for help identifying the single most predictive feature that can be used in a simple model to predict whether a customer will file a claim. With limited infrastructure, they want something interpretable, measurable, and easy to implement.
Goal:
Build and compare single-feature logistic regression models and identify the one feature that gives the highest accuracy in predicting insurance claims.
Dataset Overview
The dataset car_insurance.csv contains customer attributes that may influence how likely a customer is to make a claim. Each row represents one customer.
The dataset contains the following columns:

| Column | Description |
|---|---|
| id | Unique client identifier |
| age | Client's age |
| gender | Client's gender |
| driving_experience | Years the client has been driving |
| education | Client's level of education |
| income | Client's income level |
| credit_score | Client's credit score (between zero and one) |
| vehicle_ownership | Client's vehicle ownership status |
| vehicle_year | Year of vehicle registration |
| married | Client's marital status |
| children | Client's number of children |
| postal_code | Client's postal code |
| annual_mileage | Number of miles driven by the client each year |
| vehicle_type | Type of car |
| speeding_violations | Total number of speeding violations received by the client |
| duis | Number of times the client has been caught driving under the influence of alcohol |
| past_accidents | Total number of previous accidents the client has been involved in |
| outcome | Whether the client made a claim on their car insurance (response variable) |

# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

Step 1: Load and Inspect the Data
We begin by loading the dataset and checking for:
- Missing values
- Data types
- Basic data distributions
# Load the data, then check column types and missing-value counts
car_insurance = pd.read_csv("car_insurance.csv")
car_insurance.info()
car_insurance.isna().sum()

Distribution of Credit Score
We observe the distribution and handle missing values by replacing them with the mean.
# Replace missing credit scores with the column mean
car_insurance["credit_score"] = car_insurance["credit_score"].fillna(car_insurance["credit_score"].mean())

import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(car_insurance["credit_score"])
plt.show()

Distribution of Annual Mileage
Again, we fill missing values with the mean to prepare clean input for modeling.
# Replace missing annual mileage values with the column mean
car_insurance["annual_mileage"] = car_insurance["annual_mileage"].fillna(car_insurance["annual_mileage"].mean())

sns.histplot(car_insurance["annual_mileage"])
plt.show()

# Confirm that no missing values remain
car_insurance.info()
car_insurance.isna().sum()

Step 2: Train Logistic Regression Models (One Feature at a Time)
We build separate logistic regression models using one predictor at a time to predict whether a customer will make a claim (outcome).
We use statsmodels for simplicity and interpretability.
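
One possible way to run this comparison is sketched below. It is not necessarily the approach used in the rest of the analysis. The helper name `best_single_feature`, the decision to skip `id`, and the 0.5 classification threshold are assumptions made for illustration, and the small synthetic DataFrame only stands in for car_insurance.csv so that the snippet runs on its own. Accuracy is read from the confusion matrix returned by `pred_table()` as (true negatives + true positives) / total observations.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit


def best_single_feature(df, target="outcome", exclude=("id",)):
    """Fit one logit model per feature and return accuracies, best first.

    Hypothetical helper: the name and the default arguments are assumptions.
    """
    results = []
    for col in df.columns:
        if col == target or col in exclude:
            continue
        # One model per predictor, e.g. "outcome ~ credit_score"
        model = logit(f"{target} ~ {col}", data=df).fit(disp=0)
        # Confusion matrix at the default 0.5 threshold:
        # rows are actual classes, columns are predicted classes
        conf = model.pred_table()
        accuracy = (conf[0, 0] + conf[1, 1]) / conf.sum()
        results.append({"feature": col, "accuracy": accuracy})
    return (
        pd.DataFrame(results)
        .sort_values("accuracy", ascending=False)
        .reset_index(drop=True)
    )


# Synthetic stand-in data (NOT the real dataset): the outcome is generated
# from credit_score only, so annual_mileage carries no signal here.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "id": np.arange(n),
    "credit_score": rng.uniform(0, 1, n),
    "annual_mileage": rng.normal(12000, 2500, n),
})
claim_prob = 1 / (1 + np.exp(-(3 - 6 * demo["credit_score"])))
demo["outcome"] = (rng.uniform(0, 1, n) < claim_prob).astype(int)

ranking = best_single_feature(demo)
print(ranking)
```

With the real data, the same call would be `best_single_feature(car_insurance)`, and the first row of the result would name the feature with the highest accuracy. Because accuracy is measured on the same rows used for fitting, it is an in-sample figure. That is adequate for ranking single features but is not an estimate of performance on new customers.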