Predicting Diabetes Using Machine Learning | Comparison of Classification Models
By: Zachary Raup
Project Introduction:
This project demonstrates how to classify whether a patient has diabetes using various machine learning models and techniques within the scikit-learn library. With diabetes being a leading cause of serious health issues, including heart disease, kidney failure, and vision loss, early detection and intervention are crucial for effective management and treatment. By leveraging machine learning algorithms, we can enhance diagnostic accuracy, enabling healthcare providers to identify at-risk individuals more efficiently.
The models applied in this project include Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM). To enhance model performance, techniques such as scaling, Random Forest feature importance analysis, and hyperparameter optimization are utilized. The models are evaluated using cross-validation metrics and visualized through confusion matrices and ROC curves, allowing for a comprehensive assessment of their effectiveness.
Ultimately, this project aims to provide insights into the best-performing models for diabetes prediction, contributing to more accurate and timely healthcare decisions. The findings may inform future research and applications in predictive analytics within the healthcare domain, potentially leading to better patient outcomes and reduced healthcare costs.
Skills demonstrated in this project:
- Data preprocessing with train_test_split and StandardScaler (Invalid URL)
- Feature importance using Random Forest (Invalid URL)
- Hyperparameter Optimization with KNN and GridSearch (Invalid URL)
- Model comparison (Logistic Regression, KNN, Decision Tree, Random Forest, SVM) (Invalid URL)
- Model evaluation using precision, recall, accuracy, and cross-validation metrics (Invalid URL)
- ROC Curves (Invalid URL)
About the Data:
The dataset used is sourced from DataCamp and was a practice dataset from the Supervised Learning with scikit-learn course (Invalid URL). It is a cleaned version of the diabetes dataset with no missing data, and the target variable, diabetes, is a binary column where 1 indicates that the patient has diabetes and 0 indicates that they do not. The "dpf" column refers to Diabetes Pedigree Function. This column provides a measure of the likelihood of diabetes based on a person's genetic history (family history) and the interaction of genetics with other risk factors. This cleaned dataset version allows this project to focus on model training, evaluation, and interpretation without worrying about data preprocessing issues like missing values.
Load the Cleaned Diabetes Dataset
This section imports necessary Python libraries for data manipulation and visualization. It loads the cleaned diabetes dataset into a Pandas DataFrame, providing a structured view of the data, which is essential for subsequent processing.
# Importing libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
diabetes_df = pd.read_csv('datasets/diabetes_clean.csv')
# Preview the diabetes dataset
print(diabetes_df.head())
Data Preprocessing: Split and Scale Data
Assign the feature variables (X) and target variable (y) and randomly split the dataset into training (80%) and testing (20%) sets with train_test_split from sklearn. The stratification ensures that the class distribution of the target variable (diabetes/no diabetes) is maintained, which is critical for balanced model evaluation. This section also applies standard scaling to the data. Scaling ensures that all feature values are normalized and on the same scale, which assists models like KNN and SVM that are sensitive to differences in feature magnitudes.
# Import train_test_split and StandardScaler from sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assign features and target
X = diabetes_df.drop("diabetes", axis=1).values
y = diabetes_df["diabetes"].values
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21, stratify=y)
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Feature Importance using Random Forest
In this section, a Random Forest classifier is trained to rank the features based on their importance towards classifying the target variable. The Random Forest model evaluates each feature's contribution by considering how much each one improves the accuracy of the model when used in tree splits. By identifying the top-ranked features, we can focus on the most informative variables, potentially simplifying the model, improving interpretability, and maintaining or enhancing performance. This step also helps identify redundant or irrelevant features, reducing the risk of overfitting and unnecessary complexity in the model.
# Import RandomForestClassifier for feature importance
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Initialize and fit Random Forest on the scaled training data
rf = RandomForestClassifier(random_state=21)
rf.fit(X_train_scaled, y_train)
# Extract feature importances from the Random Forest model
importances = rf.feature_importances_
# Use the original feature names (before scaling)
feature_names = diabetes_df.columns[:-1]
# Visualize feature importance using a bar chart
plt.figure(figsize=(10, 6))
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh', color='lightgreen')
plt.title('Random Forest Feature Importance')
plt.xlabel('Importance Score')
plt.show()
# Select the most important features based on Random Forest importance scores
important_features = feature_names[importances > 0.1]
print(f"Selected Important Features: {important_features}")
# Get the indices of important features
important_feature_indices = [list(feature_names).index(feat) for feat in important_features]
# Redefine X with the important features only (for NumPy arrays)
X_train_imp = X_train[:, important_feature_indices]
X_test_imp = X_test[:, important_feature_indices]
Insights from Random Forest Feature Importance
The Random Forest model's feature importance analysis revealed that 'glucose' is the most significant predictor of diabetes, with a score of 0.26, as shown in the accompanying bar chart. The next most important features are 'bmi' (0.17), 'dpf' (0.14), and 'age' (0.13), with a notable drop in importance beyond these variables. To maintain the model's predictive accuracy while simplifying it, a threshold is set, retaining only features with an importance score greater than 0.1. This ensures that the model remains interpretable and efficient by focusing on the most impactful variables, while discarding those with lower predictive power.
Hyperparameter Optimization for KNN with GridSearch
Next, optimize the K-Nearest Neighbors (KNN) model by selecting the best value for the n_neighbors hyperparameter. The n_neighbors parameter controls how many neighboring data points are considered when making a prediction, directly impacting the model's decision boundary. Smaller values of n_neighbors make the model more sensitive to local variations, while larger values provide smoother, more generalized predictions.
To find the optimal value, GridSearch is used from sklearn, which is a systematic approach that evaluates a range of possible n_neighbors values to identify the one that maximizes model accuracy. GridSearch works by performing an exhaustive search over the predefined hyperparameter space, selecting the configuration that yields the best performance based on cross-validation results. This process ensures that the KNN model generalizes well to unseen data, reducing the risk of overfitting and making the model more reliable and robust for predicting diabetes.
# Import KNeighborsClassifier, Pipeline, and GridSearchCV for hyperparameter tuning
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
# Set up a pipeline for scaling and running the KNN model before using Grid Search
knn_pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
# Set up the parameter grid for GridSearch
knn_params = {"knn__n_neighbors": np.arange(1, 50)}
# Perform Grid Search to find optimal value for n_neighbor
grid_search = GridSearchCV(knn_pipeline, knn_params, cv=5)
grid_search.fit(X_train_imp, y_train)
# Extract the best n_neighbors value
best_n_neighbors = grid_search.best_params_['knn__n_neighbors']
# Print best hyperparameters and score
print("Best KNN Model:")
print(f"Best score: {grid_search.best_score_:.4f}")
print(f"Best parameters: {grid_search.best_params_}")
# Extract results from GridSearchCV
cv_results = grid_search.cv_results_
# Plot the mean test score vs n_neighbors
plt.figure(figsize=(10, 6))
plt.plot(knn_params["knn__n_neighbors"], cv_results["mean_test_score"], marker='o', linestyle='--', color='b')
plt.title("KNN Hyperparameter Optimization: n_neighbors")
plt.xlabel("Number of Neighbors (n_neighbors)")
plt.ylabel("Mean Cross-Validation Accuracy")
plt.grid(True)
plt.show()
Analysis of Hyperparameter Tuning
The GridSearchCV process identified that the optimal value for the n_neighbors parameter in the K-Nearest Neighbors (KNN) model is 18, which yielded the highest mean cross-validation accuracy score of 0.79. This optimal number of neighbors enhances the balance between bias and variance, ensuring that the model generalizes well to unseen data. Going forward, the KNN models will be trained and tested using this optimal value of n_neighbors = 18, maximizing performance by utilizing the best configuration found during the grid search.
Compare Multiple Models
In this section, we compare the performance of five different classification models: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM). We will use cross-validation to evaluate each model's performance, which involves dividing the training data into multiple folds. This approach trains each model on several subsets and tests it on the remaining fold, providing a more reliable assessment of how well the model generalizes to unseen data. The comparison will help us identify the best-performing model based on accuracy.
-
Logistic Regression: This model estimates the probability that a given input point belongs to a particular category. It uses a logistic function to model the relationship between the dependent variable (diabetes status) and one or more independent variables (features).
-
K-Nearest Neighbors (KNN): KNN classifies a data point based on how its neighbors are classified. It identifies the 'k' closest training examples in the feature space and assigns the class most common among those neighbors.
-
Decision Tree: This model creates a flowchart-like structure where each internal node represents a feature test, each branch represents an outcome of the test, and each leaf node represents a class label. Decision trees split the data at each node to maximize information gain or minimize impurity.
-
Random Forest: An ensemble method that constructs multiple decision trees during training and outputs the mode of their predictions for classification tasks. It helps improve accuracy and control overfitting by averaging the results of individual trees.
-
Support Vector Machine (SVM): SVM finds the optimal hyperplane that separates different classes in the feature space. It maximizes the margin between the classes, making it effective in high-dimensional spaces.
To evaluate these models, cross-validation is used to obtain a robust measure of accuracy.
# Import classifiers and cross-validation tools
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold
import matplotlib.pyplot as plt
# Extract the best n_neighbors value
best_n_neighbors = grid_search.best_params_['knn__n_neighbors']
# Initialize classification models
models = {
"Logistic Regression": LogisticRegression(),
"KNN": KNeighborsClassifier(n_neighbors=best_n_neighbors),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(),
"SVM": SVC(probability=True)
}
# Evaluate models using cross-validation
results = []
model_names = []
kf = KFold(n_splits=5, shuffle=True, random_state=21)
for name, model in models.items():
cv_results = cross_val_score(model, X_train_imp, y_train, cv=kf)
results.append(cv_results)
model_names.append(name)
print(f"{name} Cross-Validation Mean Accuracy: {cv_results.mean():.4f}")
### Visualizing Cross-Validation Accuracy Comparison
# Plot the boxplot of cross-validation results
plt.figure(figsize=(10, 6))
plt.boxplot(results, labels=model_names)
plt.title("Model Comparison (Cross-Validation Accuracy)")
plt.ylabel("Accuracy")
plt.show()
Analysis of the Training Model Comparison
The Logistic Regression model stands out as the most effective classifier in this case, balancing accuracy with stability. For classifying whether a patient has diabetes, this model would likely generalize the best to unseen data. While SVM and KNN also performed well, they displayed slightly more variability. The Decision Tree and Random Forest models, while useful in some scenarios, did not perform as well for this specific dataset, showing lower accuracy and more significant variability. Therefore, based on the cross-validation results, Logistic Regression should be considered the preferred model for this task.