
What Do Your Blood Sugars Tell You?


Executive Summary

Competition Overview

Diabetes mellitus remains a global health issue, with several thousand people dying from the condition each day. Detecting diabetes in its early stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss.

This competition involves developing a predictive model that effectively detects potential diabetes cases, ideally before preventive treatment begins.


Dataset

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling. The columns and data types are as follows:

Features:

  • Pregnancies: Numerical (Discrete); Number of times the patient has been pregnant.
  • Glucose: Numerical (Continuous); Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  • BloodPressure: Numerical (Continuous); Diastolic blood pressure (mm Hg).
  • SkinThickness: Numerical (Continuous); Triceps skinfold thickness (mm).
  • Insulin: Numerical (Continuous); 2-Hour serum insulin (mu U/ml).
  • BMI: Numerical (Continuous); Body mass index (weight in kg/(height in m)^2).
  • DiabetesPedigreeFunction: Numerical (Continuous); A function that represents the likelihood of diabetes based on family history.
  • Age: Numerical (Continuous); Age of the patient in years.

Target:

  • Outcome: Categorical (Binary); Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

Findings

Identifying key features in the dataset

Two initial models were built using all of the dataset features: a simple Random Forest classifier and a Logistic Regression classifier with Lasso regularization. Each model handles the data and evaluates feature importance in its own way, which can lead to different rankings. In this case, however, the top three features were identical in both models: Glucose, BMI, and Age. Such consistency signals that these features are important predictors of diabetes.

For features that appear moderately important in only one model, such as Pregnancies in the logistic regression model, the likely explanation is how that model handles linear relationships.

To reconcile the two rankings, SHAP analysis was performed. SHAP provides a unified way to explain individual predictions and to measure the global importance of each feature. The results again showed that Glucose, BMI, and Age were the most important features explaining the diabetes outcome. The other features were dropped from the final model.
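The comparison of the two initial models can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the column names `f0` to `f3` are placeholders, and the settings `n_estimators=200` and `C=0.1` are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the diabetes dataset): the outcome depends
# only on f0 and f1, so both models should rank those two features highest.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
signal = 2.0 * X["f0"] + 1.0 * X["f1"] + rng.normal(scale=0.5, size=500)
y = (signal > 0).astype(int)

# Random Forest: impurity-based importances, one value per feature.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_importance = pd.Series(rf.feature_importances_, index=X.columns)

# L1-regularized (Lasso) logistic regression: absolute coefficients on
# standardized inputs serve as a rough importance measure.
X_scaled = StandardScaler().fit_transform(X)
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X_scaled, y)
lr_importance = pd.Series(np.abs(lasso_lr.coef_[0]), index=X.columns)

rf_top2 = set(rf_importance.nlargest(2).index)
lr_top2 = set(lr_importance.nlargest(2).index)
print("Random Forest top features:", sorted(rf_top2))
print("Lasso logistic top features:", sorted(lr_top2))
```

The two importance measures are on different scales, so only the rankings are comparable, not the raw values.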

Most Important Predictors of diabetes outcome

With the aforementioned feature importance analysis as our basis, Glucose, BMI, and Age were selected as the most significant features influencing the model's predictions of diabetes. High values in these features are strongly associated with an increased likelihood of diabetes.

Moderate Predictors

Features such as insulin, the number of pregnancies, and the diabetes pedigree function show some influence on the diabetes outcome but are less impactful than the top features. Moreover, some pairs of features are strongly positively correlated, such as number of pregnancies with age, or insulin with glucose; in each such pair, the weaker feature should be removed to reduce the risk of multicollinearity.
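A correlation check of this kind might look like the sketch below. The data is synthetic and the 0.7 cutoff is an assumption made for illustration; the analysis above does not state which threshold was used.

```python
import numpy as np
import pandas as pd
from itertools import combinations

# Synthetic stand-in data (NOT the diabetes dataset): Pregnancies is built
# to rise with Age, while BMI is generated independently of both.
rng = np.random.default_rng(1)
age = rng.uniform(21, 70, size=300)
frame = pd.DataFrame({
    "Age": age,
    "Pregnancies": 0.2 * age + rng.normal(scale=1.0, size=300),
    "BMI": rng.normal(32, 6, size=300),
})

corr = frame.corr()  # pairwise Pearson correlation
threshold = 0.7      # assumed cutoff for "strong" correlation

# Collect every feature pair whose absolute correlation meets the cutoff.
flagged = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for a, b in combinations(frame.columns, 2)
    if abs(corr.loc[a, b]) >= threshold
]
print(flagged)
```

From each flagged pair, the feature with the lower importance score would be the candidate for removal.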


Relationship between diabetes and key predictors: Glucose, BMI, and Age

Glucose vs. BMI: Higher glucose levels are more commonly associated with diabetic individuals. While there is some overlap, non-diabetic individuals tend to have lower glucose levels.

Glucose vs. Age: Age does not show a clear pattern of differentiation for diabetes in this plot. Both diabetic and non-diabetic groups span a wide range of ages, indicating that while age is a factor, glucose levels are a stronger indicator of diabetes in this dataset.

BMI vs. Age: The plot suggests that BMI alone may not be a strong differentiator for diabetes when considered in combination with age. However, based on the results of the feature importance analysis, both BMI and age contribute to the overall risk profile.


Model Building

Using the selected key features of Glucose, BMI, and Age, we evaluated five different classification algorithms using 5-fold cross-validation.

We selected the model that achieved the highest mean F1-score with reasonable variance, to ensure consistent performance across folds. The F1-score was the benchmark metric because it balances precision and recall. The best performing model was the Support Vector Machine (SVM) classifier. Its hyperparameters were then tuned, but the default settings ultimately gave the best performance.
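The selection procedure can be sketched as below. This is an illustration under stated assumptions: the data is synthetic, and the five candidates are inferred from the notebook's imports rather than confirmed by the text. Scaling inside a pipeline for the SVM, k-NN, and logistic regression models is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-feature stand-in for Glucose, BMI, and Age.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Candidate models (assumed list); distance- and margin-based models are
# wrapped in a pipeline so scaling is refit inside each fold.
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "KNeighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>18}: mean F1 = {scores.mean():.3f}, std = {scores.std():.3f}")

# Pick the highest mean F1; the standard deviation is reported so that an
# unstable winner can be spotted and reconsidered.
best = max(results, key=lambda name: results[name][0])
print("Best by mean F1:", best)
```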


Risk Assessment

Assessing Diabetes Risk Using Best Model

With our trained SVM model on hand, we estimated the risk of diabetes for a person aged 54, with a height of 178 cm, a weight of 96 kg, and a glucose level of 125 mg/dL.

The model predicts that the probability of such a person having diabetes is about 62%, an elevated risk. This result signals that a visit to a clinic for a check-up is imperative and that increased attention should be devoted to behavioral and dietary modification.
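Because the model takes BMI rather than height and weight, the inputs must first be converted with the BMI formula from the dataset description. The sketch below shows that step only; the variable names and the column order of the input row are assumptions, and no model is fitted here.

```python
import pandas as pd

# Example person from the risk assessment above.
age_years = 54
height_cm = 178
weight_kg = 96
glucose_mg_dl = 125

# BMI = weight in kg / (height in m)^2
height_m = height_cm / 100
bmi = weight_kg / height_m ** 2
print(f"BMI: {bmi:.1f}")

# One-row frame in an assumed column order (Glucose, BMI, Age); the order
# must match whatever order the model was trained on.
patient = pd.DataFrame(
    [{"Glucose": glucose_mg_dl, "BMI": bmi, "Age": age_years}]
)
print(patient)
```

A row like `patient` could then be passed to a fitted classifier's `predict_proba` method, which for scikit-learn's `SVC` requires the model to have been created with `probability=True`.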


Table of Contents

    1. Importing Required Libraries
    2. Reading and Understanding our Data
    3. Data Processing & Preliminary EDA
    4. Initial Model Building for Feature Importance
    5. Feature Importance Reconciliation
    6. Exploratory Data Analysis of Key Features
    7. Model Building
    8. Assessing Diabetes Risk Using Best Model
    9. Conclusion


1. Importing Required Libraries

# core libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_style('whitegrid')

# sklearn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, make_scorer, f1_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import plotly.offline as pyo

# other
import shap
import ipywidgets as widgets
from IPython.display import display, clear_output
from itertools import combinations

# helpers
import helpers as h

# warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

%load_ext autoreload
%autoreload 2


2. Reading Data

# import dataset
data = pd.read_csv('data/diabetes.csv')
print(f"The dataset comprises {data.shape[0]} observations, {data.shape[1] - 1} features, and 1 target column")
# show first 5 rows
data.head()
# evaluate feature type
data.info()
# check for missing values
data.isnull().sum()
# review dataset for outliers and zero values
data.describe()

Commentary on initial dataset import:
  • Every column of the dataset is stored as a numeric type.
  • There are no null values.
  • There are clear instances of incorrect data entry, recorded as zeros, in the columns Glucose, BloodPressure, SkinThickness, Insulin, and BMI. This will be rectified in the next stage: all zero values in these columns will be replaced with the column median.
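
The zero-replacement step described above could be sketched as follows. The toy frame is invented for illustration, and computing the median over the non-zero entries only is an assumption; the commentary does not say whether zeros are included when the median is taken.

```python
import pandas as pd

# Toy frame (NOT the real data): zeros in Glucose and BMI stand in for
# invalid entries, while a zero in Pregnancies is a legitimate value.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 8, 1],
})

# Columns in which a zero cannot be a real measurement.
zero_invalid_cols = ["Glucose", "BMI"]

cleaned = toy.copy()
for col in zero_invalid_cols:
    # Assumption: take the median of the non-zero entries so the invalid
    # zeros do not pull the replacement value down.
    non_zero_median = cleaned.loc[cleaned[col] != 0, col].median()
    # Keep non-zero values; substitute the median where the value is zero.
    cleaned[col] = cleaned[col].where(cleaned[col] != 0, non_zero_median)

print(cleaned)
```

Pregnancies is left out of `zero_invalid_cols` on purpose, since zero pregnancies is a valid observation.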


3. Data Processing & Preliminary EDA