What do your blood sugars tell you?
📖 Background
Diabetes mellitus remains a global health issue, with several thousand people dying from this single condition every day. Detecting diabetes in its earlier stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model for effectively detecting potential diabetes cases, ideally early enough for preventive treatment to begin.
💾 The data
The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.
The columns and data types are as follows:
- Pregnancies (Numerical, Continuous): Number of times the patient has been pregnant.
- Glucose (Numerical, Continuous): Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- BloodPressure (Numerical, Continuous): Diastolic blood pressure (mm Hg).
- SkinThickness (Numerical, Continuous): Triceps skinfold thickness (mm).
- Insulin (Numerical, Continuous): 2-hour serum insulin (mu U/ml).
- BMI (Numerical, Continuous): Body mass index (weight in kg/(height in m)^2).
- DiabetesPedigreeFunction (Numerical, Continuous): A function that scores the likelihood of diabetes based on family history.
- Age (Numerical, Continuous): Age of the patient in years.
- Outcome (Categorical, Binary): Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.
import pandas as pd
data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()
Visualizing the data distribution of glucose
import seaborn as sns
import matplotlib.pyplot as plt
# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')
# Boxplot of Glucose levels by Outcome
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data, palette=["blue", "red"])
plt.title("Glucose Levels by Diabetes Outcome")
plt.xlabel("Diabetes Outcome")
plt.ylabel("Glucose Level (mg/dL)")
plt.show()
Visualizing the age distribution
# Density plot of Age by Outcome
plt.figure(figsize=(8, 6))
sns.kdeplot(data=data, x='Age', hue='Outcome', fill=True, common_norm=False, palette=["blue", "red"], alpha=0.5)
plt.title("Age Distribution by Diabetes Outcome")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()
💪 Competition challenge
In this challenge, you will focus on the following key tasks:
- Determine the most important factors affecting the diabetes outcome.
- Create interactive plots to visualize the relationship between diabetes and the determined factors from the previous step.
- What is the risk of a person of age 54, height 178 cm, weight 96 kg, and a glucose level of 125 mg/dL developing diabetes?
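Since the dataset encodes body size as BMI rather than height and weight, the challenge case first needs its BMI computed. A quick calculation using the standard BMI formula from the column description above:

```python
# BMI = weight in kg / (height in m)^2, per the BMI column description
weight_kg = 96
height_m = 1.78
bmi = weight_kg / height_m ** 2
print(round(bmi, 1))  # 30.3
```

A BMI of about 30.3 places this person just over the conventional obesity threshold of 30.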
===============================================================
EXECUTIVE SUMMARY
Logistic regression (via statsmodels and scikit-learn) and supervised learning with KNN (via scikit-learn) are used to evaluate the predictive power of critical features as diabetes indicators.
The predictive power of the indicators is moderate for all three methods, with accuracy scores of about 0.7.
The most important factors affecting the diabetes outcome are Glucose level, BMI, and Age.
The risk for a person of age 54, height 178 cm, weight 96 kg, and a Glucose level of 125 is evaluated using all three methods.
Both logistic regression methods yield a 'no diabetes' outcome, while the KNN method indicates a possible 'diabetes risk'. This is probably because the case sits right at the margin.
For more clear-cut cases, such as a higher glucose level or an older age, all three methods correctly identify the diabetes outcome.
===============================================================
PRELIMINARY ANALYSIS
Data validation
Cross correlation
# Check data types and missing values
print(data.info())
print(data.isna().sum())
# Convert Outcome back to int so it is included in the numerical correlation
data['Outcome'] = data['Outcome'].astype('int')
display(data.corr())
sns.heatmap(data.corr())
plt.show()
Preliminary analysis - result:
Correlation coefficients between variables and Outcome are moderate to low.
The top indicator of diabetes is Glucose level, followed by BMI and Age.
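The ranking above can be read off directly by sorting the absolute correlations with Outcome. A minimal sketch; a small synthetic stand-in DataFrame is used here so the snippet is self-contained, whereas in the notebook the `data` loaded from data/diabetes.csv would be used:

```python
import pandas as pd

# Tiny synthetic sample standing in for data/diabetes.csv (illustration only)
data = pd.DataFrame({
    "Glucose": [85, 168, 122, 190, 99, 140],
    "BMI": [24.1, 35.2, 28.0, 33.5, 22.7, 31.0],
    "Age": [25, 51, 33, 60, 29, 45],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Rank features by the absolute value of their correlation with Outcome
ranking = data.corr()["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranking)
```

On the real dataset this puts Glucose first, followed by BMI and Age, matching the heatmap.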
===============================================================
LOGISTIC REGRESSION USING STATSMODELS
Outcome as a function of all available numerical variables
Confusion matrix
Accuracy, sensitivity, specificity, precision, and recall
Best indicators of diabetes
#from scipy.stats import linregress
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
def logit_results(results):
    print(results.params)
    conf_matrix = results.pred_table()
    print(f'Confusion matrix \n {conf_matrix}')
    # pred_table() rows are actual values, columns are predictions
    TN = conf_matrix[0, 0]
    TP = conf_matrix[1, 1]
    FN = conf_matrix[1, 0]
    FP = conf_matrix[0, 1]
    print('\nProportion of correct (positive or negative) predictions:')
    accuracy = round((TN + TP) / (TN + TP + FN + FP), 2)
    print(f'Accuracy {accuracy}')
    print('\nProportion of positive predictions that are actually positive:')
    precision = round(TP / (TP + FP), 2)
    print(f'Precision {precision}')
    print('\nProportion of actual positives identified correctly:')
    sensitivity = round(TP / (TP + FN), 2)
    print(f'Sensitivity {sensitivity}')
    print('\nProportion of actual negatives identified correctly:')
    specificity = round(TN / (TN + FP), 2)
    print(f'Specificity {specificity}')
    print('\nProportion of actual positives identified correctly (recall equals sensitivity):')
    recall = round(TP / (TP + FN), 2)
    print(f'Recall {recall}')
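As a sanity check on the metric formulas, here is a standalone numeric example with a hypothetical confusion matrix (illustration only, not results from the dataset):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted
conf = np.array([[80, 10],   # TN, FP
                 [20, 40]])  # FN, TP
TN, FP, FN, TP = conf[0, 0], conf[0, 1], conf[1, 0], conf[1, 1]

accuracy = (TN + TP) / conf.sum()   # 120 / 150 = 0.8
precision = TP / (TP + FP)          # 40 / 50 = 0.8
sensitivity = TP / (TP + FN)        # 40 / 60 ≈ 0.67 (same as recall)
specificity = TN / (TN + FP)        # 80 / 90 ≈ 0.89
print(accuracy, precision, round(sensitivity, 2), round(specificity, 2))
```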
#run logistic regression
import numpy as np
from statsmodels.graphics.mosaicplot import mosaic
results=smf.logit('Outcome~Pregnancies+Glucose+BloodPressure+BMI+SkinThickness+Insulin+DiabetesPedigreeFunction+Age',data=data).fit()
#Find relevant results and display
print('\nLogistic regression using all variables:')
print('==========================================')
logit_results(results)
conf_matrix=results.pred_table()
mosaic(conf_matrix)
plt.show()
# Repeat for different combinations of selected indicators
print('\nLogistic regression using only glucose:')
print('==========================================')
results=smf.logit('Outcome~Glucose',data=data).fit()
logit_results(results)
print('\nLogistic regression using only BMI:')
print('==========================================')
results=smf.logit('Outcome~BMI',data=data).fit()
logit_results(results)
print('\nLogistic regression using glucose and BMI:')
print('==========================================')
results=smf.logit('Outcome~Glucose+BMI',data=data).fit()
logit_results(results)
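The executive summary also refers to scikit-learn based methods. A minimal sketch of how the sklearn LogisticRegression imported above and a KNN classifier could score the challenge case; the feature choice (Glucose, BMI, Age) follows the preliminary analysis, a small synthetic stand-in for the dataset keeps the snippet self-contained, and the exact training details are assumptions, not the notebook's final models:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic stand-in for the Pima dataset (illustration only)
data = pd.DataFrame({
    "Glucose": [85, 168, 122, 190, 99, 140, 110, 155],
    "BMI": [24.1, 35.2, 28.0, 33.5, 22.7, 31.0, 26.5, 34.0],
    "Age": [25, 51, 33, 60, 29, 45, 38, 55],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = data[["Glucose", "BMI", "Age"]]
y = data["Outcome"]

logreg = LogisticRegression(max_iter=1000).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# The challenge case: glucose 125, BMI 96 / 1.78**2 (about 30.3), age 54
case = pd.DataFrame({"Glucose": [125], "BMI": [96 / 1.78 ** 2], "Age": [54]})
print("Logistic regression:", logreg.predict(case)[0])
print("KNN:", knn.predict(case)[0])
```

On the real dataset, the two model families can disagree on such a borderline case, which is the behaviour the executive summary describes.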