Competition - Blood Sugars

What do your blood sugars tell you?

Executive Summary

Objective:

The objective of this analysis is to build a predictive model to identify individuals at risk of developing diabetes based on various health metrics. The model aims to assist healthcare providers in early intervention and prevention strategies.

Dataset:

The dataset used for this analysis includes information on various health metrics such as glucose levels, blood pressure, BMI, age, and pregnancy history. The target variable is the diabetes outcome, indicating whether an individual has diabetes or not. The data was then subset to focus on the most important features.

Data Preprocessing:

Handling Missing Values: Zero values in columns like glucose and BMI were identified as missing data and imputed using median values.

Model Building:

Random Forest Classifier: Achieved an accuracy of 73% with a recall of 85% for the positive class (diabetes).

Key Findings:

Imbalanced Dataset: The dataset is imbalanced, with the majority class (non-diabetic) being far larger than the minority class (diabetic).

Feature Importance: Glucose levels, BMI, and age were identified as the most important features in predicting diabetes.

Model Performance: The Random Forest Classifier showed promising results. Precision for the non-diabetic class is high (0.89): when the model predicts that someone is non-diabetic, it is right 89% of the time. Precision for the diabetic class is lower (0.58), so roughly four in ten positive predictions are false positives. Recall for the diabetic class is high (0.85), meaning the model captures most of the people who actually have diabetes. Recall for the non-diabetic class is lower (0.66), meaning about a third of non-diabetic individuals are incorrectly flagged as diabetic.

I chose to prioritize recall slightly more than precision because the cost of missing a true positive (a person with diabetes) is higher than the cost of a false positive. This approach ensures that individuals at risk are identified early, allowing for timely intervention and better health outcomes.
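To make the trade-off concrete, here is a minimal sketch of how precision and recall for one class follow from raw prediction counts. The helper `precision_recall` and the counts are hypothetical, chosen only for illustration; they are not the confusion matrix of the model reported above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for one class from raw counts."""
    # Precision: of everything predicted positive, how much was truly positive?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much did the model find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for the diabetic class (illustration only)
tp, fp, fn = 40, 30, 10
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With these made-up counts, only 10 of 50 true cases are missed, at the price of 30 false alarms. That is the kind of balance the paragraph above argues for when a missed diagnosis is the costlier error.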

# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

Explore

# Read the csv file into a dataframe
data = pd.read_csv('data/diabetes.csv')

# Display the first few rows of the DataFrame
print(data.shape)
display(data.head())

# Check if there's any duplicates
data.duplicated().sum()

# Check the data types and whether there are any missing values
data.info()

What are the most important features affecting the diabetes outcome?

# Rank all columns by their correlation with the Outcome column
top_features = data.corr().sort_values(by='Outcome', ascending=False)
top_features['Outcome']
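One caveat with sorting by the signed correlation is that a strongly negative relationship ends up at the bottom of the list, even though it may be just as informative. The sketch below ranks by absolute correlation instead and drops the target's correlation with itself. The column names `feat_a`, `feat_b`, `feat_c` and the toy values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy data (hypothetical values): feat_c is perfectly negatively correlated
# with Outcome, feat_a is strongly positive, feat_b is uncorrelated.
toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4],
    "feat_b": [1, 0, 1, 0],
    "feat_c": [3, 3, 1, 1],
    "Outcome": [0, 0, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy.corr()["Outcome"].drop("Outcome")

# Rank by strength of the relationship, regardless of its sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear, pairwise relationships, so a ranking like this is a quick screening step rather than a full feature-importance analysis.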

# Subset the data to focus on the most important features.
# .copy() makes df independent of data, so later modifications are safe.
df = data[['Glucose', 'BMI', 'Age', 'Outcome']].copy()
df.head()

# Generate descriptive statistics for the subset of data to investigate any problems
df.describe().T

Zero values are physiologically impossible for Glucose and BMI, so they most likely represent missing data or entry errors. I'll impute them with the median.
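Before imputing, it can help to count how many zeros each column actually contains. The following is a small sketch on a toy frame with illustrative values; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the subset (illustrative values, not the real data)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "BMI": [33.6, 26.6, 0.0, 0.0, 43.1],
    "Age": [50, 31, 32, 21, 33],
})

# Count zero entries per column; a nonzero count in Glucose or BMI
# signals values that should be treated as missing
zero_counts = (toy == 0).sum()
print(zero_counts)
```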

# Replace 0 values in the Glucose column with the median
gluco_median = df['Glucose'].median()

# Assign the result back instead of using inplace=True on a column selection,
# which may not update df in recent pandas versions
df['Glucose'] = df['Glucose'].replace(0, gluco_median)

# Replace 0 values in the BMI column with the median
bmi_median = df['BMI'].median()

df['BMI'] = df['BMI'].replace(0, bmi_median)
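As a possible alternative, the zeros can first be converted to `NaN` so that they do not influence the median, and then filled with scikit-learn's `SimpleImputer`. This is a sketch under that assumption, shown on toy data with illustrative values; it is not the method used for the results reported in the summary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the Glucose and BMI columns (illustrative values)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
})

# Mark zeros as missing so they are excluded when the median is computed
toy_nan = toy.replace(0, np.nan)

# Fill each column's missing values with that column's median
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy_nan), columns=toy.columns)
print(toy_imputed)
```

In a modeling workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the medians.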

Let's visualize the relationship between diabetes and the features selected in the previous step.
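One simple way to do this is a box plot of each feature split by outcome. The sketch below uses plain matplotlib on a toy frame with illustrative values; the layout and labels are assumptions, and on the real data `toy` would be replaced by `df`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the subset (illustrative values, not the real data)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

features = ["Glucose", "BMI", "Age"]
fig, axes = plt.subplots(1, len(features), figsize=(12, 4))

for ax, col in zip(axes, features):
    # One box per class: non-diabetic (0) first, then diabetic (1)
    groups = [toy.loc[toy["Outcome"] == k, col].to_numpy() for k in (0, 1)]
    ax.boxplot(groups)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Non-diabetic", "Diabetic"])
    ax.set_title(col)

fig.tight_layout()
```

A clear upward shift of the diabetic box relative to the non-diabetic one would be consistent with that feature being a useful predictor.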
