Understanding Diabetes Incidence: Feature Selection, Bayesian Optimization and Model Evaluation

Executive Summary

This analysis examines the distribution and factors contributing to diabetes incidence within our dataset, which includes adults aged from 20 to 80 years. Approximately 35% of individuals in the dataset are diagnosed with diabetes, underscoring the need to understand contributing factors.

Key Findings

Data Quality Issues: Variables such as Blood Pressure, BMI, Glucose, Skin Thickness, and Insulin contain biologically implausible zero values. While fewer than 30 observations (3% of the dataset) can be safely discarded, Skin Thickness and Insulin have too many zero values to remove without significant data loss.
Correlation and Feature Importance: Glucose shows a strong correlation with diabetes outcomes, whereas Skin Thickness, Blood Pressure, and Insulin exhibit weak correlations. Recursive Feature Elimination with Cross-Validation (RFECV) using a standard Random Forest Classifier confirms the minimal importance of Insulin and Skin Thickness.
Feature Selection: Glucose and BMI are significant predictors of diabetes, demonstrating distinct means based on diabetes outcomes and high feature importance. Despite its low correlation, Blood Pressure is included in the best subset identified by RFECV.
Metric Selection: Given the imbalanced dataset and the objective of minimizing false negatives (i.e., predicting a person is healthy when they actually have diabetes), recall was chosen as the primary evaluation metric.
Model Selection: Tree-based models were prioritized for their interpretability; Logistic Regression and XGBoost were used only for score comparison.
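The RFECV step mentioned in the findings can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual run: the dataset from `make_classification`, the `n_estimators=50` setting, the stratified 5-fold split, and the use of recall as the RFECV scoring metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the diabetes data: 8 features, roughly 35% positives.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    weights=[0.65, 0.35],
    random_state=42,
)

# Recursively drop the least important feature and score each subset with CV.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="recall",  # assumed here; the scoring used in the notebook is not stated
)
selector.fit(X, y)

print("Number of features kept:", selector.n_features_)
print("Kept feature mask:", selector.support_)
print("Feature ranking (1 = kept):", selector.ranking_)
```

Features with a ranking of 1 form the selected subset; higher rankings were eliminated earlier, which is how a feature with low correlation can still end up in the best subset.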

Recommendations

While Glucose, BMI, and Age are important predictors of diabetes outcomes, a fine-tuned Decision Tree based solely on these variables achieved lower recall and ROC AUC scores than a similar model using more features. Notably, the inclusion of family history, described by the Diabetes Pedigree Function, improves model performance.

To enhance recall, the decision threshold is lowered from 0.5 to 0.3: diabetes screening is recommended for anyone with a predicted probability of 30% or higher, which reduces false negatives.
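The effect of lowering the threshold can be illustrated with a small sketch. The labels and probabilities below are made up for the example (they are not model output from this notebook), and the `recall` helper is a hypothetical function written only for this illustration.

```python
import numpy as np


def recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Share of actual positives that were predicted positive."""
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    false_neg = np.sum((y_true == 1) & (y_pred == 0))
    return float(true_pos / (true_pos + false_neg))


# Illustrative labels and predicted probabilities of diabetes (assumed values).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
proba = np.array([0.90, 0.45, 0.32, 0.35, 0.10, 0.20, 0.60, 0.05])

# Turn probabilities into class predictions at both thresholds.
pred_default = (proba >= 0.5).astype(int)
pred_lowered = (proba >= 0.3).astype(int)

recall_default = recall(y_true, pred_default)
recall_lowered = recall(y_true, pred_lowered)

# False positives: healthy people who would be sent for screening.
false_pos_default = int(np.sum((y_true == 0) & (pred_default == 1)))
false_pos_lowered = int(np.sum((y_true == 0) & (pred_lowered == 1)))

print(f"Recall at 0.5: {recall_default:.2f}, false positives: {false_pos_default}")
print(f"Recall at 0.3: {recall_lowered:.2f}, false positives: {false_pos_lowered}")
```

In this toy sample, recall rises from 0.25 to 0.75 while false positives go from 1 to 2, which is the trade-off accepted when false negatives are the costlier error.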

The model first evaluates glucose, followed by BMI. Typically, low glucose and a healthy BMI (< 26.45) indicate good health, while high glucose and BMI suggest diabetes.

For instance, a 54-year-old with a height of 1.78 m, a weight of 96 kg, and a glucose level of 125 mg/dL has a 32% predicted risk, prompting a recommendation for screening. This prediction aligns with common sense: although their glucose level is within the normal range, their slight obesity increases their risk of having or developing diabetes.
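As a quick check of this example, the BMI can be computed from the stated height and weight using the standard formula (weight in kilograms divided by the square of height in meters). The sketch below is only illustrative arithmetic: the variable names are hypothetical, and the 32% risk is copied from the text rather than produced by a model.

```python
# Values taken from the worked example in the text.
height_m = 1.78
weight_kg = 96.0
predicted_risk = 0.32  # quoted from the text, not computed by a model here

# Thresholds quoted in the recommendations.
BMI_SPLIT = 26.45           # BMI split reported for the decision tree
SCREENING_THRESHOLD = 0.30  # lowered decision threshold

# BMI = weight (kg) / height (m) squared.
bmi = weight_kg / height_m**2
above_bmi_split = bmi >= BMI_SPLIT
recommend_screening = predicted_risk >= SCREENING_THRESHOLD

print(f"BMI: {bmi:.1f}")  # BMI: 30.3
print(f"Above the tree's BMI split: {above_bmi_split}")
print(f"Recommend screening: {recommend_screening}")
```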

⚙️ Technical Summary

While this is a competition, it is also a learning exercise. As a result, I took this opportunity to explore several concepts further, including:

Feature Selection: I experimented with both filter and wrapper methods. Normally, I do not rely heavily on these since I am usually either very familiar with the datasets or can apply common sense. It was interesting, but it definitely does not replace Exploratory Data Analysis (EDA). During EDA, we can identify more complex relationships between variables and even generate new features. The best approach is a mix of domain knowledge and feature selection methods.
Gradient Descent: I have been wanting to explore this for a while and I am satisfied with how fast it computes results. I only used it with Logistic Regression since Support Vector Machines are not suitable for this problem as they do not naturally output instance probabilities.
Hyperparameter Tuning using Informed Search: I enjoyed exploring Coarse to Fine and Bayesian Optimization methods. It was interesting to note that Bayesian Optimization did not work well for Logistic Regression, as it does not establish any meaningful relationship between applying or not applying a penalty and the value of that penalty. As for Genetic approaches, I am not convinced. While I do not doubt that with more processing time it could find a complex model with better performance, from my experience, stakeholders prefer models they can understand. Still, for critical applications, they seem like a feasible approach.
Using Polars: I have been using Polars for database connections and data cleaning at work, and I find it more intuitive than Pandas. However, it is less versatile; for example, it cannot handle the PostgreSQL daterange data type the way SQLAlchemy (used by Pandas) can. In this notebook, I did not enjoy using Polars as much because I wanted to create charts with hvplot, but DataLab does not support it. I had to resort to seaborn, which converts Polars DataFrames back to Pandas before plotting.

Overall, I am satisfied with this learning experience and look forward to continuing to explore these topics.

As a side note, I left a few technical comments throughout the notebook marked with ⚙️.

Review of DataLab

This was my first experience using the Premium DataLab, and I have a few thoughts to share.

In terms of computation speed, it is significantly better. The AI tool is also very helpful in curating texts and handling boring tasks. Speaking of boring tasks, one feature I would really appreciate is a built-in linter. As someone who primarily works with Python scripts, I rely heavily on tools like Ruff and Vulture.

Regarding the AI, it would be beneficial if we could provide prompts that affect all cells, such as "Indicate used but unimported libraries in the present notebook" or "Add Google style docstrings and type hints to all functions in this notebook." Nevertheless, I appreciate that the AI only performs the actions it was prompted to perform.

Overall, I do not regret upgrading to DataLab Premium. The enhanced performance and helpful AI features have made my workflow more efficient and enjoyable.

Data Sources

The dataset encompasses diagnostic measurements pertinent to diabetes from Pima Indian women, incorporating various medical and demographic attributes.

All feature variables are numerical and continuous, whereas the outcome variable is numerical and binary.

Target Variable

Outcome - Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 signifies a positive diagnosis, while 0 signifies a negative diagnosis.

Feature Variables

Pregnancies - Number of times the patient has been pregnant.

Glucose - Plasma glucose concentration measured 2 hours after an oral glucose tolerance test (mg/dL).

Blood Pressure - Diastolic blood pressure (mmHg).

Skin Thickness - Triceps skinfold thickness (mm).

Insulin - 2-Hour serum insulin (µU/mL).

BMI - Body mass index, calculated as weight in kilograms divided by the square of height in meters (kg/m²).

Diabetes Pedigree Function - A function representing the likelihood of diabetes based on family history.

Age - Age of the patient in years.

%%capture
!pip install GPyOpt
!pip install pywaffle
!pip install dtreeviz

import re
from typing import Optional

import numpy as np
import pandas as pd
import polars as pl
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

from tabulate import tabulate

from sklearn.model_selection import (
    train_test_split, 
    RandomizedSearchCV, 
    GridSearchCV, 
    cross_val_score
)
from sklearn.feature_selection import (
    SelectKBest, 
    f_classif, 
    RFECV, 
    SequentialFeatureSelector
)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, 
    confusion_matrix, 
    classification_report, 
    roc_auc_score, 
    roc_curve, 
    auc, 
    precision_recall_curve, 
    f1_score, 
    recall_score, 
    precision_score
)
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

from pywaffle import Waffle
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from GPyOpt.methods import BayesianOptimization
from tpot import TPOTClassifier
import xgboost as xgb
import dtreeviz

RANDOM_STATE = 42
PALETTE = "husl"

⚙️ Although Polars DataFrames are not fully supported by the tabulate library and their styling features are still considered unstable (as of September 2024), Polars offers a wide range of configuration options to enhance their display. However, while the tables look well-formatted when editing the code, they do not render that way in the live publication.

pl.Config.set_tbl_cols(100)
pl.Config.set_fmt_str_lengths(22)
pl.Config.set_tbl_hide_dataframe_shape(True)
pl.Config.set_tbl_hide_column_data_types(True)

# With *polars*, the first row shown by `.head()` corresponds to the column data **types**
data: pl.DataFrame = pl.read_csv("data/diabetes.csv")

col_pattern: re.Pattern = re.compile(r"([a-z]+)([A-Z]+)")
data = data.rename({col: re.sub(col_pattern, r"\1 \2", col) for col in data.columns})

print(f"Dataset has {data.shape[0]} rows.")
print(data.head(3))
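To make the renaming step easier to follow, here is a small standalone sketch of what the regular expression does. The sample column names are assumed to match the raw CSV headers (they are inferred from the feature list, not shown in the notebook).

```python
import re

# Same pattern as above: a run of lowercase letters followed by uppercase letters.
col_pattern = re.compile(r"([a-z]+)([A-Z]+)")

# Assumed raw header names, used here for illustration only.
sample_columns = [
    "BloodPressure",
    "SkinThickness",
    "DiabetesPedigreeFunction",
    "BMI",
    "Age",
]

# Insert a space at every lowercase-to-uppercase boundary.
renamed = {col: col_pattern.sub(r"\1 \2", col) for col in sample_columns}

for old, new in renamed.items():
    print(f"{old} -> {new}")
```

Names without a lowercase-to-uppercase boundary, such as `BMI` and `Age`, are left unchanged.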

EDA

Distribution of Diabetes Incidence

This analysis examines the distribution of diabetes incidence within our dataset, categorizing the population into two groups: those diagnosed with diabetes and those without the condition.

Diabetes is a prevalent chronic disease that affects a significant portion of the population. In our dataset, approximately 35% of individuals are diagnosed with diabetes. This statistic highlights the critical need to understand the factors contributing to diabetes.

⚙️ Chart was created using pywaffle.

outcome_counts: pl.DataFrame = data["Outcome"].value_counts()
outcome_dict = dict(zip(outcome_counts["Outcome"].to_list(), outcome_counts["count"].to_list()))
outcome_dict['total'] = outcome_dict[0] + outcome_dict[1]

fig = plt.figure(
    figsize=(15, 15),
    FigureClass=Waffle,
    rows=16,
    values=[outcome_dict[0], outcome_dict[1]],
    colors=["#2ECC71", "#E74C3C"],
    icons='person',
    font_size=16,
    icon_style='solid',
    icon_legend=True,
    legend={
        'labels': [f"Healthy ({outcome_dict[0]/outcome_dict['total']:.0%})", 
                   f"Diabetes ({outcome_dict[1]/outcome_dict['total']:.0%})"], 
        'loc': 'lower center', 
        'bbox_to_anchor': (0.5, -0.2),
        'ncol': 2,
        'fontsize': 20
    }
)

plt.title("Diabetes Incidence Distribution", fontsize=24, fontweight='bold', color='#34495E', pad=20)

plt.show()

Missing Data Analysis

The dataset appears to be well-curated, with no missing values detected in any field.

⚙️ Checking for null values in Polars is very straightforward. However, if missing values were present, we could use the missingno library to identify and visualize patterns of missing data.

print("Number of missing values per field:")
print(data.null_count())

