Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.
Dataset Summary
Meet your primary tool: the insurance.csv dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:
insurance.csv
| Column | Data Type | Description |
|---|---|---|
| age | int | Age of the primary beneficiary. |
| sex | object | Gender of the insurance contractor (male or female). |
| bmi | float | Body mass index, a key indicator of body fat based on height and weight. |
| children | int | Number of dependents covered by the insurance plan. |
| smoker | object | Indicates whether the beneficiary smokes (yes or no). |
| region | object | The beneficiary's residential area in the US, divided into four regions. |
| charges | float | Individual medical costs billed by health insurance. |
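To get oriented before any cleaning, here is a minimal sketch (assuming insurance.csv sits in the working directory, as the project files suggest) that loads the file and confirms the columns and types listed above:

import pandas as pd

insurance = pd.read_csv('insurance.csv')
print(insurance.head())    # first few customers
print(insurance.dtypes)    # note: charges may load as object if the values carry a '$' sign
print(insurance.shape)     # number of rows and columns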
A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the insurance.csv dataset, the next step is to apply it to the validation_dataset.csv. This new dataset, similar to your training data minus the charges column, tests your model's accuracy and real-world utility by predicting costs for new customers.
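As a quick sanity check, a short sketch (again assuming both CSVs are in the working directory) that compares the two files' schemas; the validation file should match the training columns except for charges:

import pandas as pd

train_cols = set(pd.read_csv('insurance.csv', nrows=0).columns)
valid_cols = set(pd.read_csv('validation_dataset.csv', nrows=0).columns)
print(train_cols - valid_cols)   # expected: {'charges'}
print(valid_cols - train_cols)   # expected: empty, if the schemas match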
Let's Get Started!
This project is your playground for applying data science in a meaningful way, with results that have real-world applications. Ready to explore the data and uncover patterns that could revolutionize healthcare planning? Let's begin this exciting journey!
Clean the data
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
# Remove the $ sign from charges and convert to float
insurance['charges'] = insurance['charges'].str.strip('$').astype('float')
# Normalize sex labels so the category holds only two values
insurance['sex'] = insurance['sex'].map({'female': 'F', 'male': 'M', 'woman': 'F', 'man': 'M'})
#convert region category to lowercase for consistency
insurance['region'] = insurance['region'].str.lower()
#drop missing value rows
insurance = insurance.dropna()
print(insurance['sex'].unique())
print(insurance['region'].unique())
# Convert object columns to category and check their levels
for col in insurance.columns:
    if insurance[col].dtype == 'object':
        insurance[col] = insurance[col].astype('category')
        print(insurance[col].unique())
# Inspect numeric features for inconsistencies and fix them
display(insurance[['age', 'bmi', 'children', 'charges']].describe())
# Drop rows with non-positive ages and clip negative children counts to zero
insurance = insurance[insurance['age'] > 0]
insurance.loc[insurance['children'] < 0, 'children'] = 0
# Print info() and describe() for the final cleaned dataset
print(insurance.info())
print(insurance.describe())
Model development and training
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as MSE
# Select features
X_pre = insurance[['sex', 'smoker', 'region', 'age', 'bmi', 'children']]
y = insurance['charges']
# One-hot encode the categorical features, dropping the first level of each
Xd = pd.get_dummies(X_pre, columns=['sex', 'smoker', 'region'], drop_first=True)
# Split into train and test sets, then scale (fit the scaler on the training set only)
Xd_train, Xd_test, y_train, y_test = train_test_split(Xd, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(Xd_train)
X_test = scaler.transform(Xd_test)
# Fit a linear regression and report R2, MSE, and RMSE on the test set
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
score = linreg.score(X_test, y_test)
print("Test R2:", score)
mse = MSE(y_test, y_pred)
rmse = mse ** 0.5
print("Test MSE:", mse, "Test RMSE:", rmse)
# Try cross-validation on the training set for a more robust estimate
mse_cv = -cross_val_score(linreg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
r2_cv = cross_val_score(linreg, X_train, y_train, cv=5, scoring='r2')
# Print cross-validation results
print("Cross-validated MSE:", mse_cv)
print("Cross-validated R2:", r2_cv)
mean_mse = np.mean(mse_cv)
mean_r2 = np.mean(r2_cv)
print(f'Mean MSE: {mean_mse}')
print(f'Mean R2: {mean_r2}')
Predict charges for the data in validation_dataset.csv
import pandas as pd
# Read in the validation dataset
valid_df = pd.read_csv('validation_dataset.csv')
# Check categorical and numeric features for inconsistencies, just in case
print(valid_df.info())
print(valid_df['sex'].unique())
print(valid_df['smoker'].unique())
print(valid_df['region'].unique())
print(valid_df.describe())
# One-hot encode categorical features to match the training encoding
X_d = pd.get_dummies(valid_df, columns=['sex', 'smoker', 'region'], drop_first=True)
print(X_d.info())
# Use the regression model from the previous step to predict charges.
# Align the dummy columns with the training features and apply the same scaler
# that the model was trained with.
X_d = X_d.reindex(columns=Xd.columns, fill_value=0)
valid_pred = linreg.predict(scaler.transform(X_d))
valid_df['pred_charges'] = valid_pred
# Floor predicted charges below 1000 at 1000
valid_df.loc[valid_df['pred_charges'] < 1000, 'pred_charges'] = 1000
#print validation dataset with predicted charges as the last column
print(valid_df.head(10))