Insurance Charge Prediction Model
Introduction
In this project, we will look at a dataset from Kaggle about insurance charges. The data includes each customer's age, sex, BMI, number of children, smoking status, region, and annual insurance charges. We will use this dataset to build a model that predicts the insurance charge for a new customer. An insurance company could then use such predictions to set premiums and deductibles.
Importing Data and Libraries
#importing libraries
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
#importing data
data = pd.read_csv('insurance.csv')
df = data.copy()
df
Data Exploration
Now we will explore the data using basic summary statistics, checking correlations between the attributes, and checking for missing values.
#Getting basic stats for the features
df.describe()
Right from the start, I can see we will need to scale the data: age runs from 18 to 64 and BMI from roughly 16 to 53, while children only runs from 0 to 5. All of the numeric columns also have a count of 1338, so there appear to be no missing values among them.
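We can confirm that directly by counting missing values per column (a quick sketch; every count should come back as 0):
#Counting missing values in each column
df.isnull().sum()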
#Checking for any correlations between the numeric attributes
#(numeric_only=True is needed in newer pandas versions to skip the categorical columns)
df.corr(numeric_only=True)
None of the attributes are highly correlated with one another, so we won't have to worry about multicollinearity. Age appears to be the attribute most strongly correlated with charges.
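To rank the attributes by their relationship with the target, we can sort the charges column of the correlation matrix (a small sketch built on the same corr call as above):
#Sorting attributes by their correlation with charges, strongest first
df.corr(numeric_only=True)['charges'].sort_values(ascending=False)
Below we visualize the pairwise relationships between the attributes with a scatter matrix.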
#Scatter matrix of the numeric attributes, with axis labels rotated for readability
axes = scatter_matrix(df, figsize=(12, 8))
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(90)
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha('right')
plt.tight_layout()
plt.gcf().subplots_adjust(wspace=0, hspace=0)
plt.show()
Data Preparation
Now we will prepare the data for the machine learning models, including one-hot encoding the categorical attributes, scaling the numeric features, and splitting the data into training and testing sets. As stated above, we don't need to drop any attributes to accommodate multicollinearity.
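A minimal sketch of those steps using the transformers imported above (the 80/20 split and random_state are assumptions, as is the sparse_output argument, which is called sparse in scikit-learn versions before 1.2):
#Separating the features from the target
X = df.drop('charges', axis=1)
y = df['charges']

#Splitting first so the transformers are fit only on the training data (assumed 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cat_cols = ['sex', 'smoker', 'region']
num_cols = ['age', 'bmi', 'children']

#One-hot encoding the categorical attributes
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_cat = encoder.fit_transform(X_train[cat_cols])
X_test_cat = encoder.transform(X_test[cat_cols])

#Scaling the numeric attributes
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[num_cols])
X_test_num = scaler.transform(X_test[num_cols])

#Recombining into the final training and testing matrices
X_train_prep = np.hstack([X_train_num, X_train_cat])
X_test_prep = np.hstack([X_test_num, X_test_cat])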