Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a leading health insurance company, you will use machine learning to predict customer healthcare costs. Your predictions will help tailor services and guide customers in planning their healthcare expenses more effectively.
Dataset Summary
Meet your primary tool: the insurance.csv dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:
insurance.csv
| Column | Data Type | Description |
|---|---|---|
| age | int | Age of the primary beneficiary. |
| sex | object | Gender of the insurance contractor (male or female). |
| bmi | float | Body mass index, a key indicator of body fat based on height and weight. |
| children | int | Number of dependents covered by the insurance plan. |
| smoker | object | Indicates whether the beneficiary smokes (yes or no). |
| region | object | The beneficiary's residential area in the US, divided into four regions. |
| charges | float | Individual medical costs billed by health insurance. |
Some data cleaning is needed before the dataset is ready for modeling. Once your model is built using the insurance.csv dataset, the next step is to apply it to validation_dataset.csv. This new dataset has the same columns as your training data, minus charges; predicting costs for these new customers tests your model's real-world utility.
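Before cleaning, it helps to see what actually needs fixing. The following is a minimal sketch of such an audit, run on a small made-up frame rather than on insurance.csv itself; the rows and the `audit` helper are hypothetical and only mimic the kinds of issues the cleaning code in this project handles (inconsistent labels, dollar signs in charges, negative values, missing entries).

```python
import numpy as np
import pandas as pd

# Hypothetical rows (not from insurance.csv) that mimic typical data-quality problems
toy = pd.DataFrame({
    "age": [19, -45, 33, np.nan],
    "sex": ["female", "M", "man", "F"],
    "region": ["southwest", "Southeast", "NORTHWEST", None],
    "children": [0, -1, 2, 1],
    "charges": ["16884.924", "$1725.5523", "4449.462", None],
})

def audit(df):
    """Summarize data-quality issues without modifying df (illustrative helper)."""
    return {
        # Count of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Distinct labels reveal inconsistent spellings and casing
        "sex_labels": sorted(df["sex"].dropna().unique()),
        "region_labels": sorted(df["region"].dropna().unique()),
        # Values that cannot be valid for these columns
        "non_positive_age_rows": int((df["age"] <= 0).sum()),
        "negative_children_rows": int((df["children"] < 0).sum()),
        # Charges stored as text with a currency symbol
        "charges_with_dollar_sign": int(
            df["charges"].astype(str).str.contains(r"\$", na=False).sum()
        ),
    }

report = audit(toy)
for key, value in report.items():
    print(f"{key}: {value}")
```

Running the same checks on the real data before and after cleaning is one way to confirm that each cleaning step did what it was meant to do.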
Let's Get Started!
This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

def preprocess_dataset(insurance):
    # Work on a copy so the caller's DataFrame is not modified
    insurance = insurance.copy()
    # For the sex column, ensure it contains only 'male' and 'female'
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    # Ensure all values in the region column are in lowercase
    insurance['region'] = insurance['region'].str.lower()
    # Clean the charges column by removing the dollar sign and convert the column to float
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    # Keep only rows with a positive age (this also drops rows where age is missing)
    insurance = insurance[insurance['age'] > 0].copy()
    # To handle negative values in the children column, replace them with 0
    insurance.loc[insurance['children'] < 0, 'children'] = 0
    # Impute the most frequent value for object columns and the mean for numeric columns
    for col in insurance.columns:
        if insurance[col].dtype == "object":
            insurance[col] = insurance[col].fillna(insurance[col].value_counts().index[0])
        else:
            insurance[col] = insurance[col].fillna(insurance[col].mean())
    # Convert categorical data into numeric
    insurance_dummies = pd.get_dummies(insurance, drop_first=True)
    return insurance_dummies

insurance_dummies = preprocess_dataset(insurance)
print(insurance_dummies.head())

# Define feature variables
X = insurance_dummies.drop("charges", axis=1).values
# Define target variable
y = insurance_dummies["charges"].values
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X)

# Instantiate the model
reg = LinearRegression()
reg.fit(X_train_scaled, y)
# Compute 5-fold cross-validation R-Squared scores
cv_scores = cross_val_score(reg, X_train_scaled, y, cv=5)
# The mean of the R-Squared scores
r2_score = np.mean(cv_scores)
print(r2_score)

# Loading the validation insurance dataset
val_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(val_data_path)
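# One-hot encode the validation data and align its columns with the training features
validation_dummies = pd.get_dummies(validation_data, drop_first=True)
feature_cols = insurance_dummies.drop("charges", axis=1).columns
validation_dummies = validation_dummies.reindex(columns=feature_cols, fill_value=0)
validation_dummies.head()

# Define feature variables, scaled with the scaler that was fitted on the training data
X_val = scaler.transform(validation_dummies.values)
# Predict charges for the validation data (the model was trained on scaled features)
y_pred = reg.predict(X_val)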
# Add predicted charges to the validation data
validation_data['predicted_charges'] = y_pred
# For any values below $1000, set them to $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000
print(validation_data.head())
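
The steps above (encode, scale, fit, cross-validate, predict) can also be bundled into a single scikit-learn Pipeline, so the exact preprocessing learned on the training data is reused for new customers and the scaler is refit inside each cross-validation fold. What follows is a sketch under stated assumptions, not the project's reference solution: the training frame is synthetic with made-up values, and the names `train`, `model`, and `new_customers` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned training data (all values are made up)
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 5, n),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
train["charges"] = (
    250 * train["age"] + 300 * train["bmi"]
    + 20000 * (train["smoker"] == "yes") + rng.normal(0, 2000, n)
)

numeric_cols = ["age", "bmi", "children"]
categorical_cols = ["sex", "smoker", "region"]

# Scale numeric columns and one-hot encode categorical ones in one step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])
model = Pipeline([("preprocess", preprocess), ("reg", LinearRegression())])

X_train, y_train = train.drop(columns="charges"), train["charges"]

# 5-fold cross-validated R-squared; preprocessing is refit within each fold
cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()
model.fit(X_train, y_train)

# New customers pass through the same fitted encoder and scaler,
# even when they contain only a subset of the known categories
new_customers = pd.DataFrame({
    "age": [25, 52], "sex": ["female", "male"], "bmi": [22.5, 31.0],
    "children": [0, 2], "smoker": ["no", "yes"], "region": ["southwest", "southwest"],
})
# Apply the same $1000 floor used above
predictions = np.clip(model.predict(new_customers), 1000, None)
print(cv_r2, predictions)
```

Because the encoder remembers the categories seen during fitting, the validation features always arrive in the same column order as the training features, which removes the need to align dummy columns by hand.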