Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

Dataset Summary

Meet your primary tool: the insurance.csv dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

insurance.csv

| Column | Data Type | Description |
|---|---|---|
| age | int | Age of the primary beneficiary. |
| sex | object | Gender of the insurance contractor (male or female). |
| bmi | float | Body mass index, a key indicator of body fat based on height and weight. |
| children | int | Number of dependents covered by the insurance plan. |
| smoker | object | Indicates whether the beneficiary smokes (yes or no). |
| region | object | The beneficiary's residential area in the US, divided into four regions. |
| charges | float | Individual medical costs billed by health insurance. |

A bit of data cleaning is needed before the dataset is ready for modeling. Once your model is built using the insurance.csv dataset, the next step is to apply it to validation_dataset.csv. This new dataset has the same columns as your training data minus the charges column, and it tests your model's real-world utility by predicting costs for new customers.
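
The relationship between the two files can be checked before any modeling. The sketch below is only illustrative: it uses two tiny in-memory frames with made-up rows in place of the real CSV files, and `missing_feature_columns` is a hypothetical helper name, not part of the project template.

```python
import pandas as pd

# Columns described in the dataset table; 'charges' is the prediction target.
FEATURE_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region"]
TARGET_COLUMN = "charges"


def missing_feature_columns(df: pd.DataFrame) -> list:
    """Return the expected feature columns that are absent from df."""
    return [col for col in FEATURE_COLUMNS if col not in df.columns]


# Made-up stand-ins for insurance.csv and validation_dataset.csv.
train_sample = pd.DataFrame({
    "age": [19, 45],
    "sex": ["female", "male"],
    "bmi": [27.9, 30.1],
    "children": [0, 2],
    "smoker": ["yes", "no"],
    "region": ["southwest", "northeast"],
    "charges": [15000.0, 9000.0],
})
validation_sample = train_sample.drop(columns=[TARGET_COLUMN])

train_missing = missing_feature_columns(train_sample)
validation_missing = missing_feature_columns(validation_sample)
validation_has_target = TARGET_COLUMN in validation_sample.columns

print("Missing in training sample:", train_missing)
print("Missing in validation sample:", validation_missing)
print("Validation sample has target:", validation_has_target)
```

Running the same helper on the frames loaded from the real files would flag a renamed or missing column early, before it surfaces as an error at prediction time.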

Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
# insurance_data_path = 'insurance.csv'
insurance = pd.read_csv('insurance.csv')

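# Data cleaning
def clean_data(insurance):
    """Cleans the insurance dataset by performing several preprocessing tasks"""
    # Work on a copy so the caller's DataFrame is not modified in place
    insurance = insurance.copy()
    # Harmonize the different spellings used for sex
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    # Strip dollar signs from charges and convert to float
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    # Keep only rows with a positive age; copy to avoid SettingWithCopyWarning below
    insurance = insurance[insurance["age"] > 0].copy()
    # Negative child counts are treated as zero
    insurance.loc[insurance["children"] < 0, "children"] = 0
    # Normalize region names to lowercase
    insurance["region"] = insurance["region"].str.lower()
    # Drop any rows that still contain missing values
    return insurance.dropna()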

insurance = clean_data(insurance)
insurance.info()
# Display unique values for each column
# for column in insurance.columns:
#     print(f"Unique values in {column}: {insurance[column].unique()}")
# Implement model creation and training here
# Use as many cells as you need

# Import necessary libraries for regression
# (pandas and numpy were already imported in the first cell)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#insurance.head(10)
def create_and_evaluate_regression_model(insurance):
    """
    Prepares the data, fits a linear regression model, and evaluates it using cross-validation.
    
    Parameters:
    - insurance: pandas DataFrame, the cleaned insurance dataset.
    
    Returns:
    - A tuple containing the fitted sklearn Pipeline object, mean MSE, and mean R2 scores.
    """
    # Preprocessing
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    categorical_features = ['sex', 'smoker', 'region']
    numerical_features = ['age', 'bmi', 'children']
    
    # Convert categorical variables to dummy variables
    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    
    # Combine numerical features with dummy variables
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
    # Scaling followed by linear regression. The scaler lives inside the
    # pipeline so it is fitted only on the training folds during
    # cross-validation and is reapplied automatically at prediction time.
    steps = [("scaler", StandardScaler()), ("lin_reg", LinearRegression())]
    insurance_model_pipeline = Pipeline(steps)
    
    # Fitting the model on the unscaled features (the pipeline does the scaling)
    insurance_model_pipeline.fit(X_processed, y)
    
    # Evaluating the model
    mse_scores = -cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)
    
    return insurance_model_pipeline, mean_mse, mean_r2

# Usage example (insurance was already cleaned above)
insurance_model, mean_mse, mean_r2 = create_and_evaluate_regression_model(insurance)
print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)

# Predict on validation data
validation_df = pd.read_csv('validation_dataset.csv')

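# One-hot encode every category present in the validation data
validation_df_transformed = pd.get_dummies(validation_df, columns=['sex', 'smoker', 'region'])

# Keep exactly the training feature columns, in the training order; a dummy
# column that is missing from the validation data is filled with 0
training_columns = pd.get_dummies(insurance.drop(columns='charges'), columns=['sex', 'smoker', 'region'], drop_first=True).columns
validation_df_transformed = validation_df_transformed.reindex(columns=training_columns, fill_value=0)

# Make predictions using the trained model
validation_prediction = insurance_model.predict(validation_df_transformed)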

# Add predicted charges to the validation data
validation_df['predicted_charges'] = validation_prediction

# Adjust predictions to ensure minimum charge is $1000
validation_df.loc[validation_df['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
validation_df.head()
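
An alternative to encoding the training and validation frames separately is to move the encoding into the pipeline itself, so the raw validation columns can be passed straight to `predict`. The sketch below is one possible arrangement, not the project's required solution: it uses scikit-learn's `ColumnTransformer` and `OneHotEncoder`, and the two small frames contain made-up rows that stand in for the real CSV files. The encoder keeps all categories (no dropped reference level), which differs from the `drop_first=True` encoding used above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ["sex", "smoker", "region"]
numerical_features = ["age", "bmi", "children"]

# Scale numeric columns and one-hot encode categorical ones in a single step.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
model = Pipeline([("preprocess", preprocessor), ("lin_reg", LinearRegression())])

# Made-up rows standing in for the cleaned insurance.csv data.
train_sample = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [24.0, 28.5, 31.0, 26.2, 35.1, 29.9],
    "children": [0, 1, 2, 3, 0, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "region": ["southwest", "northeast", "southeast",
               "northwest", "southeast", "northeast"],
    "charges": [17000.0, 4500.0, 8000.0, 11000.0, 36000.0, 13500.0],
})
# Made-up rows standing in for validation_dataset.csv (no charges column).
validation_sample = pd.DataFrame({
    "age": [40, 23],
    "sex": ["female", "male"],
    "bmi": [27.0, 33.3],
    "children": [2, 0],
    "smoker": ["no", "yes"],
    "region": ["northwest", "southwest"],
})

model.fit(train_sample.drop(columns="charges"), train_sample["charges"])

# Predict on the raw validation columns and apply the $1000 minimum charge.
predicted = pd.Series(
    model.predict(validation_sample), index=validation_sample.index
).clip(lower=1000)
validation_sample["predicted_charges"] = predicted
print(validation_sample)
```

Because the encoder is fitted once on the training data, the validation frame never needs its own `get_dummies` call, which removes the risk of the two design matrices ending up with different columns.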