Project: Predicting Insurance Charges using Regression

Project Brief:

As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

Dataset Summary

Dataset: insurance.csv, which contains information on health insurance customers. Here's what you need to know about the data you'll be working with:

insurance.csv

Column	Data Type	Description
`age`	int	Age of the primary beneficiary.
`sex`	object	Gender of the insurance contractor (male or female).
`bmi`	float	Body mass index, a key indicator of body fat based on height and weight.
`children`	int	Number of dependents covered by the insurance plan.
`smoker`	object	Indicates whether the beneficiary smokes (yes or no).
`region`	object	The beneficiary's residential area in the US, divided into four regions.
`charges`	float	Individual medical costs billed by health insurance.

Once your model is built using the insurance.csv dataset, the next step is to apply it to the validation_dataset.csv. This new dataset, similar to your training data minus the charges column, tests your model's accuracy and real-world utility by predicting costs for new customers.
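Because the validation file is described as the training data minus the charges column, it can be worth confirming that before modelling. The following is a minimal sketch: `check_validation_schema` is a hypothetical helper, and the two small frames are made-up stand-ins for insurance.csv and validation_dataset.csv.

```python
import pandas as pd


def check_validation_schema(train_df, validation_df, target="charges"):
    """Return (missing, extra) feature columns of validation_df relative to train_df."""
    expected = [col for col in train_df.columns if col != target]
    missing = [col for col in expected if col not in validation_df.columns]
    extra = [col for col in validation_df.columns if col not in expected]
    return missing, extra


# Made-up rows for illustration only; not taken from insurance.csv
train_example = pd.DataFrame({
    "age": [19, 45],
    "sex": ["female", "male"],
    "bmi": [27.9, 30.1],
    "children": [0, 2],
    "smoker": ["yes", "no"],
    "region": ["southwest", "northeast"],
    "charges": [1000.0, 2000.0],
})
validation_example = train_example.drop(columns="charges")

missing, extra = check_validation_schema(train_example, validation_example)
print("Missing columns:", missing)
print("Unexpected columns:", extra)
```

Two empty lists mean the validation features line up with the training features by name.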

# Import required libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

insurance.info()

insurance

insurance.isnull().sum()

So far, I have observed several problems with the dataset:

Missing data
People aged 0
Inconsistencies in gender categorisation: there are entries for Male, female, M and F
Negative numbers of children
Inconsistent capitalisation of the regions
The charges column has the object data type: some entries have a dollar sign attached, which makes them strings

Rows with missing data make up only around 5% of the complete dataset, so I will drop them in this project.
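The decision to drop incomplete rows depends on how many rows are affected. A minimal sketch of how that share could be measured is shown here; `missing_row_fraction` is a hypothetical helper and the small frame is made-up data, not the real insurance.csv.

```python
import numpy as np
import pandas as pd


def missing_row_fraction(df):
    """Fraction of rows that contain at least one missing value."""
    return df.isnull().any(axis=1).mean()


# Made-up data for illustration: 2 of the 4 rows have a missing value
example = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "bmi": [22.5, 30.1, np.nan, 27.0],
    "smoker": ["no", "yes", "no", "no"],
})

fraction = missing_row_fraction(example)
print(f"{fraction:.0%} of rows have a missing value")
```

Counting rows rather than individual cells matters because `dropna()` removes a whole row as soon as any one of its values is missing.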

insurance['sex'].value_counts()

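def clean_dataset(insurance):
    # Work on a copy so the raw DataFrame is left untouched
    insurance = insurance.copy()
    # Standardise the gender labels to "M" and "F"
    insurance['sex'] = insurance['sex'].replace({"male": "M", "man": "M", "female": "F", "woman": "F"})
    # Strip the dollar sign (raw string avoids an invalid escape sequence) and convert to float
    insurance['charges'] = insurance['charges'].replace({r"\$": ""}, regex=True).astype(float)
    # Keep only rows with a positive age; copy so the assignments below do not hit a view
    insurance = insurance[insurance['age'] > 0].copy()
    insurance['region'] = insurance['region'].str.lower()
    # Treat a negative number of children as 0
    insurance.loc[insurance['children'] < 0, 'children'] = 0

    return insurance.dropna()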

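def train_and_evaluate_model(insurance):
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    categorical_feats = ['sex', 'smoker', 'region']
    numerical_feats = ['age', 'bmi', 'children']

    # One-hot encode the categorical features and combine them with the numerical ones
    X_categorical = pd.get_dummies(X[categorical_feats], drop_first=True)
    X_combined = pd.concat([X[numerical_feats], X_categorical], axis=1)
    # Unscaled float array: columns are matched by position at prediction time
    X_features = X_combined.to_numpy(dtype=float)

    # Pipeline: the scaler is fitted inside the pipeline, so it must receive the
    # unscaled features. Scaling them beforehand would leave the pipeline's scaler
    # with nothing to learn, and raw inputs would go unscaled at prediction time.
    steps = [("scaler", StandardScaler()), ("model", LinearRegression())]
    insurance_model_pipeline = Pipeline(steps)

    # Evaluation: 5-fold cross-validation (each fold refits a clone of the pipeline)
    mse_scores = -cross_val_score(insurance_model_pipeline, X_features, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_features, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)

    # Fit the final model on all of the training data
    insurance_model_pipeline.fit(X_features, y)

    return insurance_model_pipeline, mean_mse, mean_r2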

cleaned_insurance = clean_dataset(insurance)

insurance_model, mean_mse, r2_score = train_and_evaluate_model(cleaned_insurance)

print(mean_mse, r2_score)
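The mean squared error is expressed in squared units of charges, which makes it hard to read. Taking the square root gives an error in the same units as the charges column. A small sketch, where `rmse_from_mse` is a hypothetical helper and the MSE value is made up rather than a result from this model:

```python
import numpy as np


def rmse_from_mse(mse):
    """Convert a mean squared error into a root mean squared error."""
    return float(np.sqrt(mse))


# Made-up MSE for illustration only; not a result from the insurance model
example_mse = 36_000_000.0
example_rmse = rmse_from_mse(example_mse)
print(f"RMSE: {example_rmse:,.2f}")
```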

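validation_data = pd.read_csv('validation_dataset.csv')

# Encode the categorical columns the same way as in training. This assumes the
# validation file contains the same categories as the cleaned training data,
# so that the dummy columns line up.
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)

# Pass a float array so the columns are matched by position, as in training
validation_predictions = insurance_model.predict(validation_data_processed.to_numpy(dtype=float))
validation_data['predicted_charges'] = validation_predictions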
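Calling pd.get_dummies separately on the training and validation data only produces matching columns when both files contain the same categories. One possible safeguard, sketched here under assumptions, is to reindex the encoded validation frame to the training column list. `align_to_training_columns` is a hypothetical helper, and the column names and rows are assumed examples rather than values read from the real files.

```python
import pandas as pd


def align_to_training_columns(encoded_df, training_columns):
    """Reorder encoded_df to training_columns, adding absent dummy columns as zeros."""
    return encoded_df.reindex(columns=training_columns, fill_value=0)


# Assumed training columns after one-hot encoding (illustrative names only)
training_columns = [
    "age", "bmi", "children", "sex_M", "smoker_yes",
    "region_northwest", "region_southeast", "region_southwest",
]

# Made-up customers; no row here lives in the "southwest" region
new_customers = pd.DataFrame({
    "age": [30, 52],
    "sex": ["M", "M"],
    "bmi": [24.0, 31.5],
    "children": [1, 3],
    "smoker": ["yes", "no"],
    "region": ["southeast", "northwest"],
})

# Encode without drop_first; the reindex keeps only the columns used in training
encoded = pd.get_dummies(new_customers, columns=["sex", "smoker", "region"])
aligned = align_to_training_columns(encoded, training_columns)
print(aligned.columns.tolist())
```

Dummy columns that were not used in training are dropped, and training columns missing from the new data are added as zeros, so the result has the column order the model was fitted on.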

