Predicting Diabetes Risk Using Key Health Indicators

What do your blood sugars tell you?

📖 Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes early, or preventing it altogether, can reduce the risk of serious complications such as cardiovascular disease, kidney damage, and vision loss. This competition involves developing a predictive model that detects potential diabetes cases, ideally early enough for preventive treatment to begin.

💾 The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and data types are as follows:

Pregnancies Type: Numerical (Discrete) Description: Number of times the patient has been pregnant.
Glucose Type: Numerical (Continuous) Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL).
BloodPressure Type: Numerical (Continuous) Description: Diastolic blood pressure (mm Hg).
SkinThickness Type: Numerical (Continuous) Description: Triceps skinfold thickness (mm).
Insulin Type: Numerical (Continuous) Description: 2-hour serum insulin (μU/mL).
BMI Type: Numerical (Continuous) Description: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction Type: Numerical (Continuous) Description: A function that represents the likelihood of diabetes based on family history.
Age Type: Numerical (Continuous) Description: Age of the patient in years.
Outcome Type: Categorical (Binary) Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.
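
Before modeling, it can help to confirm the column types listed above and to look for placeholder values. In widely circulated copies of this dataset, zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are generally treated as missing measurements rather than real readings; whether that holds for data/diabetes.csv is an assumption to verify. The sketch below is illustrative only: the helper `count_zero_placeholders` and the small demo frame are hypothetical and not part of the original analysis.

```python
import pandas as pd

# Columns where a value of 0 is physiologically implausible (assumption:
# zeros in these columns stand for missing measurements).
ZERO_SUSPECT_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zero_placeholders(df, columns=ZERO_SUSPECT_COLUMNS):
    """Return the number of exact zeros in each listed column present in df."""
    present = [col for col in columns if col in df.columns]
    return (df[present] == 0).sum()


# Made-up rows for illustration only; not taken from the real dataset.
demo = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
})

zero_counts = count_zero_placeholders(demo)
print(demo.dtypes)   # confirm which columns are integer and which are float
print(zero_counts)   # zeros per suspect column
```

On the real data, `count_zero_placeholders(data)` would show how many rows might need imputation or exclusion before the model is trained.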

Executive Summary

This project focuses on developing a predictive model to assess the risk of diabetes based on key health indicators. Using the feature importances of a random forest fitted to the dataset, the most influential factors were identified as glucose level, BMI, and age, followed by the Diabetes Pedigree Function.

Interactive visualizations were employed to provide a clearer understanding of the relationships between these factors and diabetes outcomes. For example, the Glucose vs. BMI scatter plot shows the following patterns:

Clustering: Data points tend to cluster in two main regions: one around lower glucose levels and BMI values, and another towards higher glucose levels and BMI. This suggests the presence of distinct groups within the population based on diabetes risk.
Colour Coding: The colour gradient on the plot ranges from purple (indicating a lower risk or absence of diabetes) to yellow (signifying a higher risk or presence of diabetes). This visual cue helps in quickly identifying high-risk individuals.
Correlation: A positive correlation was observed between glucose levels and BMI. As BMI increases, there is a general trend towards higher glucose levels, suggesting a potential association between obesity and diabetes.
Outliers: A few data points are located outside the main clusters, indicating individuals with either unusually high or low glucose levels for their BMI. These outliers warrant further investigation to understand the factors that might influence these deviations.

In addition, the distribution of data points within each cluster provides insights into the variability of glucose levels and BMI within different diabetes risk categories. This plot, however, does not account for other factors such as age, genetics, and lifestyle, which are also critical in assessing diabetes risk.
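
The positive association between glucose and BMI described above can also be checked numerically rather than read off the plot. The following is a minimal sketch, assuming a DataFrame with the `Glucose`, `BMI`, and `Outcome` columns; the helper `glucose_bmi_correlation` and the demo values are hypothetical and serve only to show the calculation.

```python
import pandas as pd


def glucose_bmi_correlation(df):
    """Pearson correlation between Glucose and BMI, overall and per Outcome."""
    overall = df["Glucose"].corr(df["BMI"])
    by_outcome = {
        outcome: group["Glucose"].corr(group["BMI"])
        for outcome, group in df.groupby("Outcome")
    }
    return overall, by_outcome


# Made-up values for illustration only; not taken from the real dataset.
demo = pd.DataFrame({
    "Glucose": [85, 90, 100, 110, 140, 150, 165, 180],
    "BMI": [22.0, 24.5, 23.0, 27.0, 30.0, 33.5, 31.0, 36.0],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

overall_r, r_by_outcome = glucose_bmi_correlation(demo)
print(f"Overall Pearson r: {overall_r:.2f}")
for outcome, r in r_by_outcome.items():
    print(f"Outcome {outcome}: r = {r:.2f}")
```

Comparing the overall coefficient with the per-outcome values gives a rough sense of whether the trend holds within each group or mainly reflects the difference between the two groups.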

Furthermore, the model was applied to predict the diabetes risk for a hypothetical individual (age 54, glucose 125 mg/dL, BMI of about 30.3), resulting in a predicted probability of 0.50, which indicates a moderate risk. This underscores the importance of maintaining healthy glucose levels and BMI as part of diabetes prevention strategies.

Key Recommendations:

Regular monitoring of glucose levels and BMI.
Implementing lifestyle changes to manage BMI and reduce diabetes risk.
Further research to explore the outliers and their underlying causes.
Incorporating additional factors not captured in this dataset, such as lifestyle and more detailed genetic information, into future models for a more comprehensive risk assessment.

Overall, this analysis provides a valuable foundation for understanding the relationship between key health indicators and diabetes risk. The insights gained from the scatter plot and other visualizations can guide targeted interventions and early detection efforts to reduce the impact of diabetes.

import pandas as pd

data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()

Visualizing the data distribution of glucose

import seaborn as sns
import matplotlib.pyplot as plt

# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')

# Boxplot of Glucose levels by Outcome
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data, palette=["blue", "red"])
plt.title("Glucose Levels by Diabetes Outcome")
plt.xlabel("Diabetes Outcome")
plt.ylabel("Glucose Level (mg/dL)")
plt.show()

Visualizing the age distribution

# Density plot of Age by Outcome
plt.figure(figsize=(8, 6))
sns.kdeplot(data=data, x='Age', hue='Outcome', fill=True, common_norm=False, palette=["blue", "red"], alpha=0.5)
plt.title("Age Distribution by Diabetes Outcome")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()

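from sklearn.ensemble import RandomForestClassifier

# Reload the data so 'Outcome' is numeric again; it was cast to 'category' for the plots above
data = pd.read_csv('data/diabetes.csv')

# Separate the features from the target
X = data.drop('Outcome', axis=1)
y = data['Outcome']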

# Initialize and train RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank the features by the forest's impurity-based importance scores
importances = model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances}).sort_values(by='Importance', ascending=False)

print(importance_df)
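
The model above is fitted on every row, so neither the importances nor the later predicted probability comes with an out-of-sample check. One common way to add such a check is a stratified hold-out split scored with ROC AUC. The sketch below is an assumption about how that could look, not part of the original analysis: `holdout_auc` is a hypothetical helper, and synthetic data from `make_classification` stands in for the real features so the block runs without the CSV file.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def holdout_auc(X, y, test_size=0.25, random_state=42):
    """Fit a random forest on a training split and score ROC AUC on the rest."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    clf.fit(X_train, y_train)
    # Column 1 of predict_proba holds the probability of the positive class
    positive_proba = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, positive_proba)


# Synthetic stand-in with 8 features, matching the number of predictors above
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
auc = holdout_auc(X_demo, y_demo)
print(f"Hold-out ROC AUC on synthetic data: {auc:.2f}")
```

With the real data, calling `holdout_auc(X, y)` would give a first estimate of how well the forest separates the two outcomes on rows it has not seen.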

import plotly.express as px

# `data` is the DataFrame loaded above; 'Outcome' is numeric here, so Plotly applies a continuous colour scale
fig = px.scatter(data, x='Glucose', y='BMI', color='Outcome',
                 labels={'Glucose': 'Glucose Level (mg/dL)', 'BMI': 'Body Mass Index'},
                 title='Glucose vs BMI by Diabetes Outcome')

fig.show()

# Define the individual's attributes
individual = pd.DataFrame({
    'Pregnancies': [0],  # Assuming no pregnancies
    'Glucose': [125],
    'BloodPressure': [70],
    'SkinThickness': [20],
    'Insulin': [80], 
    'BMI': [96 / ((178 / 100) ** 2)],  # 96 kg at 178 cm, about 30.3
    'DiabetesPedigreeFunction': [0.5],
    'Age': [54]
})

# Predict the risk; column 1 of predict_proba is the probability of class 1 (diabetes)
risk = model.predict_proba(individual)
print(f"Probability of having diabetes: {risk[0][1]:.2f}")
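
The BMI entry for the individual above is computed inline from a weight of 96 kg and a height of 178 cm, using the formula given in the column description (weight in kg divided by height in metres squared). A small helper makes the unit conversion explicit; `bmi_from_height_weight` is a hypothetical name introduced here for illustration.

```python
def bmi_from_height_weight(weight_kg, height_cm):
    """Body mass index: weight in kg divided by height in metres squared."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100  # convert centimetres to metres
    return weight_kg / height_m ** 2


bmi = bmi_from_height_weight(96, 178)
print(f"BMI: {bmi:.1f}")  # prints "BMI: 30.3"
```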