Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.
Dataset Summary
Meet your primary tool: the insurance.csv dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:
insurance.csv
Column | Data Type | Description |
---|---|---|
age | int | Age of the primary beneficiary. |
sex | object | Gender of the insurance contractor (male or female). |
bmi | float | Body mass index, a key indicator of body fat based on height and weight. |
children | int | Number of dependents covered by the insurance plan. |
smoker | object | Indicates whether the beneficiary smokes (yes or no). |
region | object | The beneficiary's residential area in the US, divided into four regions. |
charges | float | Individual medical costs billed by health insurance. |
A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the insurance.csv dataset, the next step is to apply it to validation_dataset.csv. This new dataset, similar to your training data minus the charges column, tests your model's accuracy and real-world utility by predicting costs for new customers.
Let's Get Started!
This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!
# Re-run this cell
# Import required libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd
# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head(10), insurance.info()
Task Overview:
Develop a regression model using the insurance.csv dataset to predict charges. Evaluate the model's accuracy using the R-Squared Score. Then, apply the model to estimate predicted_charges for unseen data in validation_dataset.csv.
Task 1:
Build a regression model to predict charges using the insurance.csv dataset. Evaluate the R-Squared Score of your trained model and save it as a variable named r2_score. The model's success will be assessed based on its R-Squared Score, which must exceed a threshold of 0.65.
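For readers who want to see what the R-Squared Score measures, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot. The helper name `r_squared` and the four charge values are made up for illustration; they do not come from insurance.csv.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up charges, purely illustrative
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

score = r_squared(actual, predicted)
meets_threshold = score > 0.65  # the threshold named in Task 1
print(round(score, 4), meets_threshold)
```

A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than always predicting the mean charge.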
Task 2:
Use the trained model to predict charges for the data in validation_dataset.csv. Store the predictions in a new column named predicted_charges within the validation dataset, and save it as a pandas DataFrame called validation_data. Ensure a minimum basic charge of 1000.
⚠️ Note: If you encounter errors during model training, make sure the insurance DataFrame is properly cleaned and ready for modeling.
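The cleaning step this note refers to can be sketched on a toy DataFrame. Everything below is an assumption about the kinds of problems the raw file might contain (a "$" prefix on charges, mixed-case region names, missing values); the function name `clean_insurance` and the sample rows are hypothetical.

```python
import pandas as pd

def clean_insurance(df):
    """Return a cleaned copy of the input (hypothetical helper, assumed rules)."""
    out = df.copy()
    # Strip a literal dollar sign; regex=False keeps "$" from being read
    # as the end-of-string anchor
    out["charges"] = pd.to_numeric(
        out["charges"].astype(str).str.replace("$", "", regex=False),
        errors="coerce",
    )
    out["region"] = out["region"].str.lower()
    # Drop rows with any missing value, including charges that failed to parse
    out = out.dropna().copy()
    out["children"] = out["children"].astype(int)
    return out.reset_index(drop=True)

# Made-up rows, not taken from insurance.csv
raw = pd.DataFrame({
    "age": [19, 45, None],
    "children": [0.0, 2.0, 1.0],
    "region": ["Southwest", "northeast", "southeast"],
    "charges": ["$16884.92", "8240.59", "$4449.46"],
})
cleaned = clean_insurance(raw)
print(cleaned)
```

Working on a copy leaves the raw DataFrame untouched, so the cleaning can be re-run from the original data if a rule changes.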
# Drop rows with missing values before type conversion; .copy() avoids SettingWithCopyWarning
insurance_cleaned = insurance.dropna().copy()
# Convert columns to appropriate types and normalize categories
insurance_cleaned['children'] = insurance_cleaned['children'].astype('int')
# regex=False strips the literal "$"; read as a regex, "$" would only match the end of the string
insurance_cleaned['charges'] = insurance_cleaned['charges'].astype(str).str.replace("$", "", regex=False).astype(float)
insurance_cleaned['region'] = insurance_cleaned['region'].str.lower()
# One-hot encode the categorical variables
insurance_encoded = pd.get_dummies(insurance_cleaned, columns=["sex", "smoker", "region"], drop_first=True)
insurance_cleaned.info() , insurance_cleaned.head()
# Checking for NaN values and dropping them
insurance_encoded = insurance_encoded.dropna()
# Defining the features and the target variables
X = insurance_encoded.drop(columns=["charges"]) # Independent variables
y = insurance_encoded["charges"] # Target variable
# Standardize the features
X_scaled = StandardScaler().fit_transform(X)
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Build and train regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Perform cross-validation (5-fold)
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring="r2")
# Make predictions
y_pred = model.predict(X_test)
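# Evaluate R-squared score on the test set
# model.score returns R-squared for regressors, so this cell still works when re-run
# after the name r2_score has been rebound from the imported function to a float
r2_score = model.score(X_test, y_test)
r2_score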
The R-Squared Score is 0.667, which is above the required threshold of 0.65. The model therefore explains approximately 66.7% of the variance in charges on the test set, making it a reasonably good fit.
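A single train/test split gives only one estimate of the score. The 5-fold scores stored in cv_scores can be summarized in the same way as in the sketch below, which uses synthetic data (not insurance.csv) and hypothetical names such as `X_demo` and `demo_scores`. It also wraps the scaler and the model in a scikit-learn pipeline so that the scaler is refit on the training portion of each fold, a design choice that differs from the notebook, where the features are scaled once before splitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, purely illustrative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Scaling inside the pipeline is fit separately on each training fold
pipeline = make_pipeline(StandardScaler(), LinearRegression())
demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")

print("R-squared per fold:", np.round(demo_scores, 3))
print(f"mean = {demo_scores.mean():.3f}, std = {demo_scores.std():.3f}")
```

Reporting the mean together with the spread across folds shows whether a score near the 0.65 threshold is stable or depends on which rows landed in the test set.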
# Load the validation dataset
validation_file_path = "validation_dataset.csv"
validation_data = pd.read_csv(validation_file_path)
# Cleaning the validation dataset
validation_data["children"] = validation_data["children"].astype("Int64")
validation_data["region"] = validation_data["region"].str.lower()
# One-hot encode categorical variables to match training data
validation_encoded = pd.get_dummies(validation_data, columns=["sex", "smoker", "region"], drop_first=True)
# Ensure the validation dataset has the same feature columns as the training dataset
missing_cols = set(X.columns) - set(validation_encoded.columns)
for col in missing_cols:
    validation_encoded[col] = 0 # Add missing columns with default value 0
# Ensure the columns match the training data by reordering
validation_encoded = validation_encoded[X.columns]
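# Standardize features with a scaler fitted on the training features X,
# so the validation data is scaled with the same mean and variance the model was trained on
validation_scaled = StandardScaler().fit(X).transform(validation_encoded)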
# Make predictions
validation_data["predicted_charges"] = model.predict(validation_scaled)
# Enforce the minimum basic charge of 1000
validation_data["predicted_charges"] = validation_data["predicted_charges"].clip(lower=1000)
# Display updated dataset with threshold applied
validation_data.head()
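The validation steps above can also be sketched end to end on toy data. This is one possible arrangement, not the notebook's exact code: it assumes the scaler fitted on the training features is reused for the new rows, and it encodes the new data without drop_first before aligning it to the training columns with reindex. All names (`train`, `new_customers`, `toy_scaler`, `toy_model`) and all values are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up training rows and new customers, purely illustrative
train = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 35],
    "smoker": ["no", "yes", "no", "yes", "no", "no"],
    "charges": [500.0, 20000.0, 6000.0, 30000.0, 11000.0, 5000.0],
})
new_customers = pd.DataFrame({"age": [18, 55], "smoker": ["no", "no"]})

X_toy = pd.get_dummies(
    train.drop(columns=["charges"]), columns=["smoker"], drop_first=True, dtype=float
)
y_toy = train["charges"]

# Fit the scaler once on the training features and keep it for later reuse
toy_scaler = StandardScaler().fit(X_toy)
toy_model = LinearRegression().fit(toy_scaler.transform(X_toy), y_toy)

# Encode every category, then keep exactly the training columns in training order;
# categories absent from the new data become all-zero columns
new_encoded = pd.get_dummies(new_customers, columns=["smoker"], dtype=float)
new_encoded = new_encoded.reindex(columns=X_toy.columns, fill_value=0)

# Transform with the training scaler (no refitting), predict, and apply the floor
new_scaled = toy_scaler.transform(new_encoded)
predictions = pd.Series(toy_model.predict(new_scaled), index=new_customers.index)
new_customers["predicted_charges"] = predictions.clip(lower=1000)
print(new_customers)
```

Aligning with reindex avoids a subtle problem with drop_first on new data: if the new rows contain only one category of a variable, drop_first removes that category's column, and the information is lost before the columns are matched to the training set.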