Project: From Data to Dollars - Predicting Insurance Charges

Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

Dataset Summary

Meet your primary tool: the insurance.csv dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

insurance.csv

Column	Data Type	Description
`age`	int	Age of the primary beneficiary.
`sex`	object	Gender of the insurance contractor (male or female).
`bmi`	float	Body mass index, a key indicator of body fat based on height and weight.
`children`	int	Number of dependents covered by the insurance plan.
`smoker`	object	Indicates whether the beneficiary smokes (yes or no).
`region`	object	The beneficiary's residential area in the US, divided into four regions.
`charges`	float	Individual medical costs billed by health insurance.

A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the insurance.csv dataset, the next step is to apply it to the validation_dataset.csv. This new dataset, similar to your training data minus the charges column, tests your model's accuracy and real-world utility by predicting costs for new customers.

Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

#shape of the data
insurance.shape

#data types 
insurance.dtypes

insurance.describe()

# Cleaning the charges column, i.e., removing non-numeric characters
insurance['charges'] = insurance['charges'].apply(lambda x: str(x).replace('$', '').replace(',', ''))

# Convert the column to float
insurance['charges'] = insurance['charges'].astype(float)
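
The cast to float above raises an error if any value still contains non-numeric text after the `$` and `,` characters are stripped. A minimal sketch of a more tolerant variant, assuming it is acceptable to turn unparseable entries into `NaN` so they can be dropped later; the sample values below are made up for illustration and are not taken from insurance.csv:

```python
import pandas as pd

# Hypothetical sample values, not taken from insurance.csv
raw_charges = pd.Series(["$16884.924", "1725.5523", "4,449.46", None, "n/a"])

# Strip currency symbols and thousands separators, then coerce
# anything that still is not a number to NaN instead of raising
cleaned = pd.to_numeric(
    raw_charges.astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(cleaned)
```

Rows that end up as `NaN` are then removed by whatever null handling runs afterwards.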

# Defining a cleaning function
def transform_data(insurance):
    # Keep only rows where age is greater than 0 and drop rows with missing
    # values before encoding (otherwise a missing sex or smoker value would
    # be silently encoded as 0); copy to avoid SettingWithCopyWarning
    insurance = insurance[insurance['age'] > 0].dropna().copy()

    # Replace negative values in the children column with 0, as the project
    # guide specifies; in my own opinion, these values should be replaced
    # with their absolute values instead
    insurance.loc[insurance['children'] < 0, 'children'] = 0

    # Convert the sex and smoker columns to numeric form
    insurance['sex'] = insurance['sex'].apply(lambda x: 1 if x == 'male' else 0)
    insurance['smoker'] = insurance['smoker'].apply(lambda x: 1 if x == 'yes' else 0)

    # The region column has inconsistent spellings, so convert all to lowercase
    insurance['region'] = insurance['region'].str.lower()

    # Convert the region column to numeric form using get_dummies and
    # concatenate the result to the original dataframe
    insurance = pd.concat([insurance, pd.get_dummies(insurance['region'], prefix='region')], axis=1)
    insurance = insurance.drop('region', axis=1)
    return insurance

#calling the above function
insurance_transformed = transform_data(insurance)
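
The cleaning function's comment weighs two ways of handling negative `children` values: setting them to 0 (the project guide's choice) or taking the absolute value. A small sketch of the difference, using a made-up series rather than the real column:

```python
import pandas as pd

# Hypothetical values for illustration only
children = pd.Series([0, 2, -1, 3, -4])

# Option 1: treat negative counts as "no dependents"
clipped = children.clip(lower=0)

# Option 2: treat the minus sign as a data-entry error
absolute = children.abs()

print(clipped.tolist())   # [0, 2, 0, 3, 0]
print(absolute.tolist())  # [0, 2, 1, 3, 4]
```

Which option is right depends on how the negative values got into the data, which the dataset itself does not reveal.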

#verify the results
insurance_transformed.head()

#check the datatypes again
insurance_transformed.dtypes

# Let's look at the null values
insurance_transformed.isna().sum()

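# Scaling features
X = insurance_transformed.drop(columns=['charges'])
y = insurance_transformed['charges']
# Keep the fitted scaler so the same transformation can be applied to the validation data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)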
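
Fitting the scaler on all rows before cross-validation lets each held-out fold influence the scaling statistics. One way to avoid this, sketched here on synthetic data (the `X_demo` and `y_demo` arrays are stand-ins, not the insurance features), is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is refit inside every fold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features and target (assumption, not real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scaler is fit on the training part of each fold only
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(cv_r2.mean())
```

For plain linear regression the scores are usually almost identical either way, but the pipeline form stays correct if the model is later swapped for one that is sensitive to scaling.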

# Importing the model
from sklearn.linear_model import LinearRegression

# Training a model
insurance_model = LinearRegression()
insurance_model.fit(X_scaled, y)

# 5-fold cross-validation; scikit-learn reports MSE as a negative score, so negate it
mse_scores = -cross_val_score(insurance_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(insurance_model, X_scaled, y, cv=5, scoring='r2')

mean_mse = np.mean(mse_scores)
r2_score = np.mean(r2_scores)
print('MSE :',mean_mse)
print('R2 :',r2_score)
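
MSE is expressed in squared dollars, which is hard to read as a cost. Taking the square root gives an RMSE in the same unit as `charges`. A short sketch with made-up per-fold MSE values (not results from this model):

```python
import numpy as np

# Hypothetical per-fold MSE values, for illustration only
mse_demo = np.array([3.6e7, 3.8e7, 3.2e7, 3.9e7, 3.7e7])

# RMSE per fold, in dollars
rmse_per_fold = np.sqrt(mse_demo)

# RMSE of the mean MSE across folds
rmse_overall = np.sqrt(mse_demo.mean())

print(rmse_per_fold.round(2))
print(round(rmse_overall, 2))
```

An RMSE of roughly 6,000 would mean a typical prediction is off by about that many dollars.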

# Loading the validation dataset for prediction
validation_data = pd.read_csv('validation_dataset.csv')
validation_data.head()
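
To predict on new customers, the validation rows need the same encoding, the same column order, and the same fitted scaler as the training data. The sketch below shows one possible way to do this on toy rows. The `encode` helper, the toy dataframes, and the `predicted_charges` column name are all assumptions made for illustration; none of them come from the project files.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def encode(df):
    """Hypothetical helper: encode sex, smoker and region as numeric columns."""
    out = df.copy()
    out["sex"] = (out["sex"] == "male").astype(int)
    out["smoker"] = (out["smoker"] == "yes").astype(int)
    out["region"] = out["region"].str.lower()
    return pd.get_dummies(out, columns=["region"], prefix="region", dtype=int)

# Toy rows made up for illustration; not taken from either CSV file
train = pd.DataFrame({
    "age": [19, 45, 33, 60, 28, 52],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 30.1, 22.7, 25.8, 33.0, 29.4],
    "children": [0, 2, 1, 0, 3, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "region": ["southwest", "Southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [16884.9, 8000.0, 4500.0, 12000.0, 20000.0, 10500.0],
})
new_customers = pd.DataFrame({
    "age": [40, 23],
    "sex": ["male", "female"],
    "bmi": [28.0, 24.5],
    "children": [1, 0],
    "smoker": ["no", "yes"],
    "region": ["northeast", "southwest"],
})

# Fit the scaler and the model on the training features only
X_train = encode(train.drop(columns="charges"))
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), train["charges"])

# Align the new rows to the training columns; region dummies that do not
# appear in the new data are added and filled with 0
X_new = encode(new_customers).reindex(columns=X_train.columns, fill_value=0)
new_customers["predicted_charges"] = model.predict(scaler.transform(X_new))
print(new_customers)
```

The `reindex` step matters because `get_dummies` only creates columns for regions that actually occur, so a small validation file can otherwise end up with fewer columns than the model expects.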

