
What do your blood sugars tell you?

πŸ“– Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes in its earlier stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that effectively detects potential diabetes cases, ideally early enough for preventive treatment to begin.

πŸ’Ύ The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and Data Types are as follows:

  • Pregnancies Type: Numerical (Continuous) Description: Number of times the patient has been pregnant.

  • Glucose Type: Numerical (Continuous) Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

  • BloodPressure Type: Numerical (Continuous) Description: Diastolic blood pressure (mm Hg).

  • SkinThickness Type: Numerical (Continuous) Description: Triceps skinfold thickness (mm).

  • Insulin Type: Numerical (Continuous) Description: 2-Hour serum insulin (mu U/ml).

  • BMI Type: Numerical (Continuous) Description: Body mass index (weight in kg/(height in m)^2).

  • DiabetesPedigreeFunction Type: Numerical (Continuous) Description: A function that represents the likelihood of diabetes based on family history.

  • Age Type: Numerical (Continuous) Description: Age of the patient in years.

  • Outcome Type: Categorical (Binary) Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

Executive Summary

Project Overview

This project involves analyzing a dataset for predicting the likelihood of diabetes using several features such as glucose levels, blood pressure, insulin, and body mass index (BMI). The primary objective is to build a predictive model that accurately identifies individuals at risk of diabetes based on both biological factors and family history.

Data Preprocessing and Feature Engineering

  • Outlier Handling: Outliers were identified and handled using both the IQR method and a custom technique based on the z-score (3 standard deviations). This ensured that extreme values did not unduly influence the model.
  • Missing Values: Missing values, primarily represented as zeros in columns like Insulin, SkinThickness, and BloodPressure, were replaced with values predicted from other available features rather than simple mean imputation. This enhanced the data's predictive power.
  • Feature Scaling: The dataset was scaled using methods appropriate for each feature; RobustScaler was used so that any outliers not handled earlier would not affect model performance and no feature would disproportionately impact the model.
  • Mutual Information Analysis: A mutual information analysis was performed to rank features by their importance in predicting the target variable (diabetes outcome). Key features such as Glucose, BloodPressure, Insulin, and Age had the highest mutual information scores, indicating they are the most predictive. Features were also scaled by their importance to the target variable. A sketch of these steps follows this list.
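
A minimal sketch of these preprocessing steps, assuming the column names above and using a median fill as a stand-in for the notebook's model-based imputation (the exact IQR bounds, z-score threshold, and imputation model are not reproduced here):

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv('data/diabetes.csv')

# Treat physiologically impossible zeros as missing
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)

# Clip extreme values to the IQR fences (one of the two outlier strategies described)
for col in zero_as_missing:
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    data[col] = data[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Median fill as a simple placeholder; the notebook predicts missing values from other features
data[zero_as_missing] = data[zero_as_missing].fillna(data[zero_as_missing].median())

X = data.drop(columns='Outcome')
y = data['Outcome']

# Robust scaling so remaining outliers have limited influence
X_scaled = pd.DataFrame(RobustScaler().fit_transform(X), columns=X.columns)

# Rank features by mutual information with the target
mi = pd.Series(mutual_info_classif(X_scaled, y, random_state=42), index=X.columns)
print(mi.sort_values(ascending=False))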

Feature Importance

Based on mutual information, the most important features for predicting diabetes are:

Feature                      Mutual Information
Insulin                      0.385976
SkinThickness                0.262024
Glucose                      0.170098
BloodPressure                0.135170
BMI                          0.106836
Age                          0.098755
DiabetesPedigreeFunction     0.027636
Pregnancies                  0.026698

Less informative features such as Pregnancies and DiabetesPedigreeFunction were also retained despite their lower scores.

Multicollinearity Detection

  • Variance Inflation Factor (VIF) analysis revealed significant multicollinearity among features such as BMI (VIF: 18.4) and Glucose (VIF: 16.7).
  • This was not addressed further, as the modelling algorithms used are not sensitive to multicollinearity. A VIF computation sketch follows this list.
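
For reference, a VIF check along these lines can be done with statsmodels (the exact feature set and preprocessing the notebook used for this check are assumptions; data refers to the preprocessed dataframe from the sketch above):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = data.drop(columns='Outcome')
X_const = add_constant(X)  # add an intercept so the VIFs are interpretable

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop('const').sort_values(ascending=False))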

Modeling Approaches

Several classification algorithms were explored, including:

  • Logistic Regression
  • RandomForestClassifier
  • GradientBoostingClassifier
  • XGBClassifier
  • SVC
  • GaussianNB

SMOTE was used to ensure that class imbalances did not bias the model results, and class weighting was adjusted to further handle the imbalanced target variable.

The final model was a StackingClassifier built from the best-performing models, with hyperparameters tuned using RandomizedSearchCV, achieving an accuracy of 0.95.
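
A condensed sketch of this setup, reusing X_scaled and y from the preprocessing sketch; the base estimators, hyperparameter grid, and cross-validation settings below are illustrative assumptions rather than the exact configuration used:

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=42)

# Oversample the minority class on the training split only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Tune one of the base models with randomized search (illustrative grid)
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_distributions={'n_estimators': [200, 400, 600], 'max_depth': [None, 4, 8]},
    n_iter=5, cv=5, scoring='f1', random_state=42,
)
search.fit(X_res, y_res)

# Stack the tuned model with other learners, with logistic regression as the meta-learner
stack = StackingClassifier(
    estimators=[('rf', search.best_estimator_),
                ('gb', GradientBoostingClassifier(random_state=42)),
                ('xgb', XGBClassifier(eval_metric='logloss', random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_res, y_res)
print(stack.score(X_test, y_test))

SMOTE is applied to the training split only, so the test set keeps its original class balance and the reported score is not inflated by synthetic samples.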

Model Evaluation

The models were evaluated using the following metrics (a computation sketch follows this list):

  • Accuracy: The proportion of correct predictions.
  • Precision: The proportion of true positive predictions among all positive predictions.
  • Recall: The proportion of true positive predictions among all actual positive instances.
  • F1 Score: The harmonic mean of precision and recall.
  • ROC-AUC: The area under the receiver operating characteristic curve, which measures the model's ability to distinguish between classes.
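
These metrics can be computed with sklearn; for example, using the stacked model and test split from the sketch above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_pred = stack.predict(X_test)
y_proba = stack.predict_proba(X_test)[:, 1]  # probabilities for the positive class

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))
print('ROC-AUC  :', roc_auc_score(y_test, y_proba))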

Results

  • The models achieved strong performance; the final StackingClassifier reached an accuracy of 0.95.
  • Feature engineering, such as combining correlated features and scaling, played a crucial role in improving performance further.
  • The results also highlighted the importance of Glucose, Insulin, and Age in predicting diabetes.

Handling Real-World Scenarios

  • The Person class was created to handle missing values, filling them with the column means, and to predict the likelihood of diabetes from the available data.
  • The Person class was also used to predict the likelihood of diabetes for a new patient based on their medical history and demographic information. A sketch of the idea follows this list.
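
The notebook's Person class is not reproduced here; the sketch below illustrates the idea, with attribute names taken from the dataset and the mean-fill behaviour from the description above (the constructor signature, the predict_risk method, and the omission of scaling are assumptions):

import pandas as pd

class Person:
    """One patient's measurements; unsupplied values are filled with training-set means."""

    FEATURES = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

    def __init__(self, feature_means, **measurements):
        # Any feature not supplied is filled with its mean from the training data
        self.values = {f: measurements.get(f, feature_means[f]) for f in self.FEATURES}

    def predict_risk(self, model):
        # Probability of a positive (diabetic) outcome from a fitted classifier
        row = pd.DataFrame([self.values], columns=self.FEATURES)
        return model.predict_proba(row)[0, 1]

# e.g. Person(X.mean(), Glucose=150, BMI=32.1, Age=45).predict_risk(stack)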

Future Work

  • Hyperparameter Tuning: Further hyperparameter tuning could be performed to optimize the model's performance.
  • Feature Selection: Advanced feature selection techniques like recursive feature elimination could be used to identify the most predictive features.
  • Ensemble Methods: Additional ensemble methods like AdaBoost could be explored to further enhance model performance.
  • Website Implementation: I plan on creating a website where users can input their medical data and receive a prediction of their diabetes risk.

Conclusion

This notebook provides a comprehensive workflow for data preprocessing, feature engineering, and model building to predict diabetes. The approach is robust, handling issues like outliers, missing values, and multicollinearity, while leveraging key features like Glucose, SkinThickness and Insulin to achieve high prediction accuracy.

πŸ“š Libraries

In this notebook, we will use the following libraries:

  • numpy and pandas for data manipulation
  • matplotlib and seaborn for data visualization
  • sklearn for model building and evaluation
  • xgboost for gradient boosting algorithms

πŸ“ Note

This notebook is a work in progress, and I will continue to update it with new insights and model improvements. If you have any suggestions or feedback, please feel free to leave a comment. This model was trained for a competition with limited data and features and is in no way a substitute for professional medical advice. Always consult a healthcare provider for medical advice and treatment.

LinkedIn: Adejori Eniola

Import packages, load data, and handle default settings

import warnings
warnings.filterwarnings("ignore")  # suppress library warnings in the notebook output
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = [10, 5]  # default figure size for all plots

data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()

Quick Exploration and General Statistics

#view the data types of each column and the number of non-null values
data.info()
#view the summary statistics of the data
data.describe()

πŸ“š Observations:

  • Several features (e.g., Glucose, Blood Pressure, Skin Thickness, Insulin, BMI) have a minimum value of 0, which may indicate missing data or errors.
  • The high variance in some features, especially Insulin, suggests significant variability in the dataset.
  • Median values (50th percentile) are often different from the mean, indicating skewness in the data distribution.
  • An insulin value of 846 seems unlikely and looks like a possible outlier.

This summary helps in understanding the central tendency, spread, and potential issues such as missing data or outliers in the dataset.
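
As a quick check on the zero values noted above, the zero entries per column can be counted directly (an illustrative snippet, not from the original notebook):

suspect_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Number of rows where each measurement is recorded as 0 (i.e. likely missing)
print((data[suspect_cols] == 0).sum())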

πŸ“Š Exploratory Data Analysis (EDA)
