Insurance Claims Prediction: Finding the Most Informative Feature

---

Car insurance is mandatory in most countries, making the market both huge and competitive. Insurance providers aim to accurately assess the risk of a customer making a claim to set fair premiums and minimize losses.

On the Road Car Insurance has asked for help identifying the single most predictive feature that can be used in a simple model to predict whether a customer will file a claim. With limited infrastructure, they want something interpretable, measurable, and easy to implement.

Goal:
Build and compare single-feature logistic regression models and identify the one feature that gives the highest accuracy in predicting insurance claims.

Source:
Accenture: Machine Learning in Insurance

Dataset Overview

The dataset car_insurance.csv includes various customer attributes that may influence their likelihood to make a claim. Each row represents one customer.

The dataset

Column | Description
id | Unique client identifier
age | Client's age:
  • 0: 16-25
  • 1: 26-39
  • 2: 40-64
  • 3: 65+
gender | Client's gender:
  • 0: Female
  • 1: Male
driving_experience | Years the client has been driving:
  • 0: 0-9
  • 1: 10-19
  • 2: 20-29
  • 3: 30+
education | Client's level of education:
  • 0: No education
  • 1: High school
  • 2: University
income | Client's income level:
  • 0: Poverty
  • 1: Working class
  • 2: Middle class
  • 3: Upper class
credit_score | Client's credit score (between zero and one)
vehicle_ownership | Client's vehicle ownership status:
  • 0: Does not own their vehicle (paying off finance)
  • 1: Owns their vehicle
vehicle_year | Year of vehicle registration:
  • 0: Before 2015
  • 1: 2015 or later
married | Client's marital status:
  • 0: Not married
  • 1: Married
children | Client's number of children
postal_code | Client's postal code
annual_mileage | Number of miles driven by the client each year
vehicle_type | Type of car:
  • 0: Sedan
  • 1: Sports car
speeding_violations | Total number of speeding violations received by the client
duis | Number of times the client has been caught driving under the influence of alcohol
past_accidents | Total number of previous accidents the client has been involved in
outcome | Whether the client made a claim on their car insurance (response variable):
  • 0: No claim
  • 1: Made a claim

# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

Step 1: Load and Inspect the Data

We begin by loading the dataset and checking for:

  • Missing values
  • Data types
  • Basic data distributions
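
# Load the dataset; each row is one customer
car_insurance = pd.read_csv("car_insurance.csv")
# Column data types and non-null counts
car_insurance.info()
# Number of missing values per column
car_insurance.isna().sum()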
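
The inspection above covers missing values and data types. For the third item, basic distributions, one option is `describe()`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real dataset) so it runs without `car_insurance.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for car_insurance.csv (values are made up)
toy = pd.DataFrame({
    "credit_score": [0.62, np.nan, 0.45, 0.80],
    "annual_mileage": [12000.0, 9000.0, np.nan, np.nan],
    "outcome": [0, 1, 0, 1],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Summary statistics (count, mean, std, quartiles) for the numeric columns
summary = toy.describe()
print(summary)
```

On the real data, the same two calls would be applied to `car_insurance` instead of `toy`.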

Distribution of Credit Score

We replace missing credit scores with the column mean, then plot the resulting distribution.

import seaborn as sns
import matplotlib.pyplot as plt

# Replace missing credit scores with the column mean
car_insurance["credit_score"] = car_insurance["credit_score"].fillna(car_insurance["credit_score"].mean())

# Plot the distribution after imputation
sns.histplot(car_insurance["credit_score"])
plt.show()

Distribution of Annual Mileage

Again, we fill missing values with the mean to prepare clean input for modeling.

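# Replace missing annual mileage values with the column mean
car_insurance["annual_mileage"] = car_insurance["annual_mileage"].fillna(car_insurance["annual_mileage"].mean())

# Plot the distribution after imputation
sns.histplot(car_insurance["annual_mileage"])
plt.show()

# Confirm that no missing values remain
car_insurance.info()
car_insurance.isna().sum()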
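
Mean imputation has a side effect worth keeping in mind: it leaves the column mean unchanged but shrinks the spread, because every filled value sits exactly at the center. The sketch below illustrates this on a made-up series (`scores` is hypothetical data, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical credit scores with two missing entries (made-up values)
scores = pd.Series([0.2, 0.4, np.nan, 0.6, np.nan, 0.8])

# Fill the gaps with the mean of the observed values
filled = scores.fillna(scores.mean())

# The mean is preserved, but the standard deviation drops
print("mean before:", scores.mean(), "after:", filled.mean())
print("std before:", scores.std(), "after:", filled.std())
```

For a column with few missing values the effect is small, which is a reasonable assumption behind using this simple approach here.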

Step 2: Train Logistic Regression Models (One Feature at a Time)

We build separate logistic regression models using one predictor at a time to predict whether a customer will make a claim (outcome).

We use statsmodels for simplicity and interpretability.
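
A minimal sketch of this one-feature-at-a-time comparison follows. It is one possible way to do it, not necessarily the exact code used for the final analysis. It runs on synthetic data so it is self-contained: the `demo` frame, the helper `single_feature_accuracy`, and the simulated claim probabilities are all assumptions made for illustration. Accuracy is taken from the confusion matrix returned by `pred_table()`, which uses a 0.5 probability threshold by default.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for car_insurance.csv (illustrative data only)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "driving_experience": rng.integers(0, 4, size=n),
    "credit_score": rng.uniform(0, 1, size=n),
})
# Simulate claims that become less likely with more driving experience
p_claim = 1 / (1 + np.exp(-(1.0 - 1.2 * demo["driving_experience"])))
demo["outcome"] = (rng.uniform(size=n) < p_claim).astype(int)


def single_feature_accuracy(df, feature, target="outcome"):
    """Fit target ~ feature with logistic regression and return accuracy."""
    model = logit(f"{target} ~ {feature}", data=df).fit(disp=0)
    # pred_table()[i, j]: count of rows with actual class i and predicted class j
    conf = model.pred_table()
    return (conf[0, 0] + conf[1, 1]) / conf.sum()


# Fit one model per candidate feature and record its accuracy
features = [col for col in demo.columns if col != "outcome"]
accuracies = {f: single_feature_accuracy(demo, f) for f in features}

# Pick the feature whose model has the highest accuracy
best_feature = max(accuracies, key=accuracies.get)
print(accuracies)
print("Best feature:", best_feature)
```

On the real data, the feature list would come from `car_insurance` with `id` and `outcome` dropped, for example `car_insurance.drop(columns=["id", "outcome"]).columns`.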
