Predictive Modeling for Early Detection of Diabetes: Insights from Diagnostic Data

What do your blood sugars tell you?

📖 Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes in its early stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that detects potential diabetes cases effectively, ideally before preventive treatment begins.

Executive Summary

Objective: Develop a predictive model to detect diabetes using diagnostic measurements from Pima Indian women.

Key Features: Glucose level, Blood Pressure, BMI, Insulin, DiabetesPedigreeFunction (family history), Pregnancies and Age

Data Handling:

Missing values in features like Insulin, SkinThickness, and BloodPressure were imputed using KNN.
No significant multicollinearity detected between features (all VIF values < 5).

Feature Correlations:

Glucose has the strongest positive correlation with diabetes outcome (0.47).
BMI (0.29), age (0.24), and pregnancies (0.22) show moderate positive correlations with diabetes.

Model Comparisons:

Random Forest and Logistic Regression were tested and tuned with hyperparameter search.
Logistic Regression selected due to better computational efficiency and similar predictive performance.
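
The comparison described above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data is synthetic (`make_classification`), and the grids, the 3-fold split, and the ROC-AUC scoring are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with eight features (NOT the Pima data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Candidate models with small, illustrative hyperparameter grids (assumptions).
candidates = {
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [3, 5, None]},
    ),
}

# Tune each model with cross-validated grid search and keep the best result.
results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=3)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(f"{name}: best params {params}, mean CV ROC-AUC {score:.3f}")
```

When the two scores are close, preferring the cheaper and more interpretable model is a reasonable tie-breaker, which matches the choice stated above.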

Model Performance (Logistic Regression):

Accuracy: 76%
Precision: 0.68
Recall: 0.62
F1 Score: 0.65
ROC-AUC: 0.81 (indicating good predictive ability).
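
The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below is illustrative only: `f1_from` and `classification_metrics` are hypothetical helper names, and the confusion-matrix counts are toy values, not the notebook's results.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Consistency check on the figures reported above (precision 0.68, recall 0.62).
reported_f1 = f1_from(0.68, 0.62)
print(f"F1 implied by the reported precision and recall: {reported_f1:.2f}")

# Toy counts for illustration only (not taken from the notebook).
toy = classification_metrics(tp=3, fp=1, fn=2, tn=4)
print(toy)
```

A recall of 0.62 means that roughly four in ten true diabetes cases are missed at the chosen decision threshold, which matters for a screening use case.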

Key Feature Importance:

DiabetesPedigreeFunction has the strongest positive impact on diabetes risk.
Other important features: BMI, pregnancies, and glucose levels.

Prediction Example:

Predicted 41% risk of diabetes for a 54-year-old person with BMI of 30.31 and glucose level of 125 mg/dL.
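
Logistic regression turns a weighted sum of the features (the log-odds) into a probability through the sigmoid function. The fitted coefficients are not shown in this section, so the sketch below only illustrates that mapping for the reported 41% risk; `sigmoid` and `logit` are helper names introduced here.

```python
import math


def sigmoid(z: float) -> float:
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Map a probability to log-odds (inverse of the sigmoid)."""
    return math.log(p / (1.0 - p))


reported_risk = 0.41               # probability reported in the example above
log_odds = logit(reported_risk)    # the linear score the model must have produced
print(f"log-odds for a 41% risk: {log_odds:.3f}")
print(f"back to a probability:   {sigmoid(log_odds):.2f}")
```

A negative log-odds value corresponds to a probability below 50%. Under the common 0.5 decision threshold (an assumption, since the notebook's threshold is not shown), this patient would be classified as non-diabetic despite the elevated risk.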

Conclusion: With a ROC-AUC of 0.81 and a recall of 0.62, the model discriminates well between diabetic and non-diabetic patients and can be used to support early diagnosis and preventive healthcare efforts.

💾 The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and data types are as follows:

Pregnancies. Type: Numerical (Discrete). Description: Number of times the patient has been pregnant.
Glucose. Type: Numerical (Continuous). Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
BloodPressure. Type: Numerical (Continuous). Description: Diastolic blood pressure (mm Hg).
SkinThickness. Type: Numerical (Continuous). Description: Triceps skinfold thickness (mm).
Insulin. Type: Numerical (Continuous). Description: 2-hour serum insulin (μU/mL).
BMI. Type: Numerical (Continuous). Description: Body mass index (weight in kg / (height in m)^2).
DiabetesPedigreeFunction. Type: Numerical (Continuous). Description: A function that represents the likelihood of diabetes based on family history.
Age. Type: Numerical (Continuous). Description: Age of the patient in years.
Outcome. Type: Categorical (Binary). Description: Class variable indicating whether the patient is diagnosed with diabetes (1 = Yes, 0 = No).

Imports and Handling Data

# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')

Check for Data Quality
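
The original cell is not shown, so the following is only a sketch of one common quality check for this kind of data. It assumes that a zero in a physiological measurement means "not recorded" (a zero BMI or blood pressure is not plausible), while a zero in Pregnancies is a valid count. The rows are toy values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names; values are made up for illustration.
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 0],          # zero is a valid count here
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Assumption: a zero in these columns encodes a missing measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count missing values per column and check for duplicated rows.
missing_counts = cleaned.isna().sum()
print(missing_counts)
print("duplicate rows:", int(cleaned.duplicated().sum()))
```

Marking such zeros as `NaN` first is what lets an imputer treat them as gaps rather than as real measurements.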

Impute missing values using KNN Imputer
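
The imputation cell is hidden, so this is a minimal sketch of KNN imputation with scikit-learn's `KNNImputer`, not the notebook's actual code. The matrix is a toy example, and `n_neighbors=2` is an illustrative choice; the value of k used in the analysis is not stated.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix; columns stand for Glucose, BMI, Insulin (made-up values).
X = np.array([
    [100.0, 25.0, 80.0],
    [102.0, 26.0, np.nan],   # Insulin missing for this patient
    [150.0, 35.0, 200.0],
    [148.0, 34.0, 190.0],
    [101.0, 25.5, 90.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, with distance measured on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed[1])  # the row that had the missing Insulin value
```

Because the method is distance-based, features on larger scales dominate the neighbor search; scaling the features before imputation is a common precaution.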

1. Correlation of Features

Linear relationship (e.g., Pearson correlation)

Glucose has the strongest positive correlation with the Outcome (0.47), indicating that higher glucose levels are associated with a higher likelihood of diabetes.
BMI (0.29) and Age (0.24) also have moderate positive correlations with the Outcome, suggesting that higher BMI and age could be associated with increased risk of diabetes.
Pregnancies (0.22) shows a moderate positive correlation with the Outcome, indicating that the number of pregnancies could also be a factor.
BloodPressure, SkinThickness, DiabetesPedigreeFunction (family history), and Insulin show weak correlations with the Outcome.

A high correlation suggests that changes in one variable are strongly associated with changes in another. However, this doesn’t tell you how important the feature is when considering all features together in a model.
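
The Pearson coefficient behind the figures above can be sketched directly from its definition. The values below are toy numbers, not the dataset, and `pearson_r` is a helper name introduced here; on a pandas DataFrame, `DataFrame.corr()` computes the same statistic by default.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Toy example: a continuous feature against a binary outcome (made-up values).
glucose = [85, 89, 116, 137, 148, 183, 78, 197]
outcome = [0, 0, 0, 1, 1, 1, 0, 1]

r = pearson_r(glucose, outcome)
print(f"Pearson r (toy data): {r:.2f}")
```

Each coefficient looks at one feature in isolation, which is why a feature with a weak correlation can still carry a large coefficient once the model accounts for the other features.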

Variance Inflation Factor (VIF)

VIF = 1: There is no multicollinearity between the feature and the others.
VIF between 1 and 5: There is a moderate amount of multicollinearity, but it's not typically a concern.
VIF > 5: There is a high level of multicollinearity, and you may want to investigate further.
VIF > 10: Indicates severe multicollinearity, and you should strongly consider removing or combining features.
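
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below implements that definition with NumPy on synthetic data; it is not the notebook's code, and `vif` is a helper name introduced here.

```python
import numpy as np


def vif(X) -> np.ndarray:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Synthetic features: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
features = np.column_stack([a, b, c])

vifs = vif(features)
print(dict(zip(["a", "b", "c"], np.round(vifs, 2))))
```

In this synthetic case the two near-duplicate columns land far above the VIF > 10 threshold, while the independent column stays close to 1.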

