Project: Modeling Car Insurance Claim Outcomes

Insurance Claims Prediction: Finding the Most Informative Feature

---

Car insurance is mandatory in most countries, which makes the market both large and competitive. Insurance providers aim to accurately assess the risk that a customer will make a claim, so they can set fair premiums and minimize losses.

On the Road Car Insurance has asked for help identifying the single most predictive feature that can be used in a simple model to predict whether a customer will file a claim. With limited infrastructure, they want something interpretable, measurable, and easy to implement.

Goal:
Build and compare single-feature logistic regression models and identify the one feature that gives the highest accuracy in predicting insurance claims.

Source:
Accenture: Machine Learning in Insurance

Dataset Overview

The dataset car_insurance.csv includes various customer attributes that may influence their likelihood to make a claim. Each row represents one customer.

The dataset

Column	Description
`id`	Unique client identifier
`age`	Client's age: `0`: 16-25 `1`: 26-39 `2`: 40-64 `3`: 65+
`gender`	Client's gender: `0`: Female `1`: Male
`driving_experience`	Years the client has been driving: `0`: 0-9 `1`: 10-19 `2`: 20-29 `3`: 30+
`education`	Client's level of education: `0`: No education `1`: High school `2`: University
`income`	Client's income level: `0`: Poverty `1`: Working class `2`: Middle class `3`: Upper class
`credit_score`	Client's credit score (between zero and one)
`vehicle_ownership`	Client's vehicle ownership status: `0`: Does not own their vehicle (paying off finance) `1`: Owns their vehicle
`vehicle_year`	Year of vehicle registration: `0`: Before 2015 `1`: 2015 or later
`married`	Client's marital status: `0`: Not married `1`: Married
`children`	Client's number of children
`postal_code`	Client's postal code
`annual_mileage`	Number of miles driven by the client each year
`vehicle_type`	Type of car: `0`: Sedan `1`: Sports car
`speeding_violations`	Total number of speeding violations received by the client
`duis`	Number of times the client has been caught driving under the influence of alcohol
`past_accidents`	Total number of previous accidents the client has been involved in
`outcome`	Whether the client made a claim on their car insurance (response variable): `0`: No claim `1`: Made a claim

# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

Step 1: Load and Inspect the Data

We begin by loading the dataset and checking for:

Missing values
Data types
Basic data distributions

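# Load the dataset (one row per customer) and list column dtypes and non-null counts
car_insurance = pd.read_csv("car_insurance.csv")
car_insurance.info()

# Count missing values in each column
car_insurance.isna().sum()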
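
The three checks listed above (missing values, data types, basic distributions) can also be gathered into a single report. The following is a minimal sketch, assuming a small made-up frame in place of car_insurance.csv; the helper name `inspection_report` and the toy values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for car_insurance.csv
toy = pd.DataFrame({
    "credit_score": [0.62, np.nan, 0.48, 0.71],
    "annual_mileage": [12000.0, 11000.0, np.nan, np.nan],
    "outcome": [0, 1, 0, 1],
})


def inspection_report(df):
    """Return one row per column: dtype, missing count, and share of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean(),
    })


report = inspection_report(toy)
print(report)

# describe() summarizes the distribution of each numeric column
print(toy.describe())
```

On the real data, the same helper could be called as `inspection_report(car_insurance)` to see at a glance which columns need imputation.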

Distribution of Credit Score

We replace missing credit_score values with the column mean, then plot the resulting distribution.

# Replace missing credit scores with the column mean
car_insurance["credit_score"] = car_insurance["credit_score"].fillna(car_insurance["credit_score"].mean())

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of credit scores after imputation
sns.histplot(car_insurance["credit_score"])
plt.show()

Distribution of Annual Mileage

Again, we fill missing values with the mean to prepare clean input for modeling.

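# Replace missing annual mileage values with the column mean
car_insurance["annual_mileage"] = car_insurance["annual_mileage"].fillna(car_insurance["annual_mileage"].mean())

# Histogram of annual mileage after imputation
sns.histplot(car_insurance["annual_mileage"])
plt.show()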

# Confirm that no missing values remain after imputation
car_insurance.info()
car_insurance.isna().sum()
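
The two mean-imputation steps above follow the same pattern, so they could be folded into one helper. This is a sketch only: `fill_with_mean` is a hypothetical name, and the toy frame below uses made-up values, not the real dataset.

```python
import numpy as np
import pandas as pd


def fill_with_mean(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's mean."""
    filled = df.copy()
    for col in columns:
        # Series.mean() skips NaN by default, so the mean uses observed values only
        filled[col] = filled[col].fillna(filled[col].mean())
    return filled


# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "credit_score": [0.2, np.nan, 0.6],
    "annual_mileage": [10000.0, 14000.0, np.nan],
})

clean = fill_with_mean(toy, ["credit_score", "annual_mileage"])
print(clean)
print(clean.isna().sum())
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the distributions before and after imputation.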

Step 2: Train Logistic Regression Models (One Feature at a Time)

We build separate logistic regression models using one predictor at a time to predict whether a customer will make a claim (outcome).

We use statsmodels for simplicity and interpretability.
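
One way this comparison could look is sketched below. It is not necessarily the approach used later in the project: the data is synthetic (a hypothetical `noise_feature` alongside a simulated `driving_experience`), the helper `single_feature_accuracy` is an illustrative name, and accuracy is taken from the confusion matrix returned by statsmodels' `pred_table()` at its default 0.5 threshold.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the cleaned data (made-up values, not the real dataset)
rng = np.random.default_rng(0)
n = 1000
experience = rng.integers(0, 4, size=n)   # categories 0-3, simulated to be informative
noise = rng.normal(size=n)                # unrelated to the outcome by construction
# Simulated claim probability falls as driving experience rises
p_claim = 1 / (1 + np.exp(-(1.5 - 1.5 * experience)))
toy = pd.DataFrame({
    "driving_experience": experience,
    "noise_feature": noise,
    "outcome": (rng.random(n) < p_claim).astype(int),
})


def single_feature_accuracy(df, feature, target="outcome"):
    """Fit outcome ~ feature and return the in-sample accuracy at a 0.5 threshold."""
    model = logit(f"{target} ~ {feature}", data=df).fit(disp=0)
    conf = model.pred_table()  # rows: observed class, columns: predicted class
    return (conf[0, 0] + conf[1, 1]) / conf.sum()


features = ["driving_experience", "noise_feature"]
accuracies = {f: single_feature_accuracy(toy, f) for f in features}
best_feature = max(accuracies, key=accuracies.get)

print(accuracies)
print("Best feature:", best_feature)
```

On the real data, `features` would be every column except `id` and `outcome`. Note that this measures accuracy on the same rows used for fitting, so it is an optimistic estimate.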
