Insurance Charge Prediction Model - Linear Regression/Decision Tree
  • Insurance Charge Prediction Model

    Introduction

    In this project, we will look at a dataset from Kaggle about insurance charges. The data includes each customer's age, sex, BMI, number of children, smoking status, region, and annual insurance charges. We will use this dataset to build a model that predicts the insurance charge for a new customer, which the insurance company could then use to set premiums and deductibles.

    Importing Data and Libraries

    #importing libraries
    import numpy as np
    import pandas as pd
    from pandas.plotting import scatter_matrix
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import mean_squared_error
    
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    
    %matplotlib inline
    #importing data
    data = pd.read_csv('insurance.csv')
    df = data.copy()
    df

    Data Exploration

    Now we will explore the data using basic statistics, checking correlations between attributes, and checking for missing values.

    #Getting basic stats for the features
    df.describe()
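The missing-value check mentioned above can be made explicit with `isnull().sum()`. A minimal sketch, using a small stand-in frame with the same column names as the Kaggle insurance dataset (the real notebook would run this on the full `df`):

```python
import pandas as pd

# Stand-in rows with the same schema as insurance.csv
df = pd.DataFrame({
    'age': [19, 33], 'sex': ['female', 'male'], 'bmi': [27.9, 22.7],
    'children': [0, 0], 'smoker': ['yes', 'no'],
    'region': ['southwest', 'northwest'], 'charges': [16884.92, 21984.47],
})

# Count missing cells per column; the insurance dataset is expected
# to have none, so every entry should be 0
missing = df.isnull().sum()
print(missing)
print(missing.sum())  # total number of missing cells
```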

    Right from the start, I can see we will need to scale the data: age ranges from 18 to 64 and BMI from roughly 16 to 53, but children only goes from 0 to 5. All of the numeric columns also have a count of 1338, so none of them have missing values.

    #Checking correlations between the numeric features
    df.corr(numeric_only=True)

    None of the numeric attributes are highly correlated with one another, so we won't have to worry about multicollinearity. Also, age appears to be the numeric attribute most correlated with charges. Below we will visualize the relationships between the attributes.
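One quick way to see which attribute tracks charges most closely is to sort the charges column of the correlation matrix. A minimal sketch with stand-in numeric data (the real notebook would use the full `df`; `numeric_only` assumes pandas 1.5+):

```python
import pandas as pd

# Stand-in numeric columns shaped like the insurance data
df = pd.DataFrame({
    'age': [19, 33, 45, 60, 52],
    'bmi': [27.9, 22.7, 30.1, 25.8, 31.2],
    'children': [0, 1, 2, 0, 3],
    'charges': [1884.9, 4984.4, 8000.0, 12923.1, 11500.5],
})

# Correlation of every numeric column with charges, strongest first;
# charges itself always comes out on top with a correlation of 1.0
corr_with_charges = df.corr(numeric_only=True)['charges'].sort_values(
    ascending=False)
print(corr_with_charges)
```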

    axes = scatter_matrix(df,figsize=(12, 8))
    
    for ax in axes.flatten():
        ax.xaxis.label.set_rotation(90)
        ax.yaxis.label.set_rotation(0)
        ax.yaxis.label.set_ha('right')
    
    plt.tight_layout()
    plt.gcf().subplots_adjust(wspace=0, hspace=0)
    plt.show()

    Data Preparation

    Now we will prepare the data for the machine learning model: one-hot encoding the categorical attributes, scaling the numeric features, and splitting the data into training and test sets. As noted above, we don't need to drop any attributes to deal with multicollinearity.
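The preparation steps just described can be sketched as follows. This is a minimal outline under stated assumptions, not the exact code used in this notebook: the column names match the Kaggle insurance dataset, the split fraction is illustrative, and a `ColumnTransformer` is used here to bundle the scaling and encoding:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in rows with the same schema as insurance.csv
df = pd.DataFrame({
    'age': [19, 33, 45, 60], 'sex': ['female', 'male', 'male', 'female'],
    'bmi': [27.9, 22.7, 30.1, 25.8], 'children': [0, 1, 2, 0],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['southwest', 'northwest', 'southeast', 'northeast'],
    'charges': [16884.92, 21984.47, 8000.0, 28923.14],
})

X = df.drop(columns='charges')
y = df['charges']

# Scale the numeric columns and one-hot encode the categoricals;
# handle_unknown='ignore' keeps transform from failing on a category
# that never appeared in the training split
prep = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'bmi', 'children']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'smoker', 'region']),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the preprocessing on the training set only, then apply it
# unchanged to the test set to avoid data leakage
X_train_prepped = prep.fit_transform(X_train)
X_test_prepped = prep.transform(X_test)
```

Fitting the scaler and encoder on the training set alone, then reusing them on the test set, keeps test-set statistics from leaking into the model.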