What do your blood sugars tell you?
📖 Background
Diabetes mellitus remains a global health issue, with several thousand people dying from this single condition every day. Detecting diabetes in its earlier stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model for effectively detecting potential diabetes cases, ideally early enough for preventive treatment to begin.
💾 The data
The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.
The columns and data types are as follows:
- Pregnancies (Numerical, Continuous): Number of times the patient has been pregnant.
- Glucose (Numerical, Continuous): Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- BloodPressure (Numerical, Continuous): Diastolic blood pressure (mm Hg).
- SkinThickness (Numerical, Continuous): Triceps skinfold thickness (mm).
- Insulin (Numerical, Continuous): 2-hour serum insulin (mu U/ml).
- BMI (Numerical, Continuous): Body mass index (weight in kg/(height in m)^2).
- DiabetesPedigreeFunction (Numerical, Continuous): A function that scores the likelihood of diabetes based on family history.
- Age (Numerical, Continuous): Age of the patient in years.
- Outcome (Categorical, Binary): Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.
import pandas as pd
data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()
Visualizing the data distribution of glucose
import seaborn as sns
import matplotlib.pyplot as plt
# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')
# Boxplot of Glucose levels by Outcome
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data, palette=["blue", "red"])
plt.title("Glucose Levels by Diabetes Outcome")
plt.xlabel("Diabetes Outcome")
plt.ylabel("Glucose Level (mg/dL)")
plt.show()
Visualizing the age distribution
# Density plot of Age by Outcome
plt.figure(figsize=(8, 6))
sns.kdeplot(data=data, x='Age', hue='Outcome', fill=True, common_norm=False, palette=["blue", "red"], alpha=0.5)
plt.title("Age Distribution by Diabetes Outcome")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()
💪 Competition challenge
In this challenge, you will focus on the following key tasks:
- Determine the most important factors affecting the diabetes outcome.
- Create interactive plots to visualize the relationship between diabetes and the determined factors from the previous step.
- What is the risk of a person of age 54, height 178 cm, weight 96 kg, and a glucose level of 125 mg/dL developing diabetes?
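Since the dataset encodes body size as BMI rather than height and weight, the challenge case first needs its BMI computed. A quick calculation using the standard BMI formula from the column description above:

```python
# BMI = weight in kg / (height in m)^2, per the BMI column description
weight_kg = 96
height_m = 1.78
bmi = weight_kg / height_m ** 2
print(round(bmi, 1))  # 30.3
```

A BMI of about 30.3 places this person just over the conventional obesity threshold of 30.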
===============================================================
EXECUTIVE SUMMARY
Logistic regression (via statsmodels and scikit-learn) and supervised learning with KNN (via scikit-learn) are used to evaluate the predictive power of critical features as diabetes indicators.
The predictive power of the indicators is moderate for all three methods, with accuracy scores of about 0.7.
The most important factors affecting the diabetes outcome are Glucose level, BMI, and Age.
The risk for a person of age 54, height 178 cm, weight 96 kg, and a Glucose level of 125 is evaluated using all three methods.
Both logistic regression methods yield a 'no diabetes' outcome, while the KNN method indicates a possible 'diabetes risk'. This is probably because the case sits right at the margin.
For more clear-cut cases, such as a higher glucose level or an older age, all three methods correctly identify the diabetes outcome.
===============================================================
PRELIMINARY ANALYSIS
Data validation
Cross correlation
# Check data types and missing values
print(data.info())
print(data.isna().sum())
# Convert Outcome back to int so it is included in the numerical correlation
data['Outcome'] = data['Outcome'].astype('int')
display(data.corr())
sns.heatmap(data.corr())
plt.show()
Preliminary analysis - result:
Correlation coefficients between variables and Outcome are moderate to low.
The top indicator of diabetes is Glucose level, followed by BMI and Age.
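The ranking above can be read off directly by sorting the absolute correlations with Outcome. A minimal sketch; a small synthetic stand-in DataFrame is used here so the snippet is self-contained, whereas in the notebook the `data` loaded from data/diabetes.csv would be used:

```python
import pandas as pd

# Tiny synthetic sample standing in for data/diabetes.csv (illustration only)
data = pd.DataFrame({
    "Glucose": [85, 168, 122, 190, 99, 140],
    "BMI": [24.1, 35.2, 28.0, 33.5, 22.7, 31.0],
    "Age": [25, 51, 33, 60, 29, 45],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Rank features by the absolute value of their correlation with Outcome
ranking = data.corr()["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranking)
```

On the real dataset this puts Glucose first, followed by BMI and Age, matching the heatmap.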
===============================================================
LOGISTIC REGRESSION USING STATSMODELS
Outcome as a function of all available numerical variables
Confusion matrix
Accuracy, sensitivity, specificity, precision, and recall
Best indicators of diabetes
#from scipy.stats import linregress
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
def logit_results(results):
    print(results.params)
    conf_matrix = results.pred_table()
    print(f'Confusion matrix \n {conf_matrix}')
    # pred_table() rows are actual values, columns are predictions
    TN = conf_matrix[0, 0]
    TP = conf_matrix[1, 1]
    FN = conf_matrix[1, 0]
    FP = conf_matrix[0, 1]
    print('\nProportion of correct (positive or negative) predictions:')
    accuracy = round((TN + TP) / (TN + TP + FN + FP), 2)
    print(f'Accuracy {accuracy}')
    print('\nProportion of positive predictions that are actually positive:')
    precision = round(TP / (TP + FP), 2)
    print(f'Precision {precision}')
    print('\nProportion of actual positives identified correctly:')
    sensitivity = round(TP / (TP + FN), 2)
    print(f'Sensitivity {sensitivity}')
    print('\nProportion of actual negatives identified correctly:')
    specificity = round(TN / (TN + FP), 2)
    print(f'Specificity {specificity}')
    print('\nProportion of actual positives identified correctly (recall equals sensitivity):')
    recall = round(TP / (TP + FN), 2)
    print(f'Recall {recall}')
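As a sanity check on the metric formulas, here is a standalone numeric example with a hypothetical confusion matrix (illustration only, not results from the dataset):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted
conf = np.array([[80, 10],   # TN, FP
                 [20, 40]])  # FN, TP
TN, FP, FN, TP = conf[0, 0], conf[0, 1], conf[1, 0], conf[1, 1]

accuracy = (TN + TP) / conf.sum()   # 120 / 150 = 0.8
precision = TP / (TP + FP)          # 40 / 50 = 0.8
sensitivity = TP / (TP + FN)        # 40 / 60 ≈ 0.67 (same as recall)
specificity = TN / (TN + FP)        # 80 / 90 ≈ 0.89
print(accuracy, precision, round(sensitivity, 2), round(specificity, 2))
```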
#run logistic regression
import numpy as np
from statsmodels.graphics.mosaicplot import mosaic
results=smf.logit('Outcome~Pregnancies+Glucose+BloodPressure+BMI+SkinThickness+Insulin+DiabetesPedigreeFunction+Age',data=data).fit()
#Find relevant results and display
print('\nLogistic regression using all variables:')
print('==========================================')
logit_results(results)
conf_matrix=results.pred_table()
mosaic(conf_matrix)
plt.show()
# Repeat for different combinations of selected indicators
print('\nLogistic regression using only glucose:')
print('==========================================')
results=smf.logit('Outcome~Glucose',data=data).fit()
logit_results(results)
print('\nLogistic regression using only BMI:')
print('==========================================')
results=smf.logit('Outcome~BMI',data=data).fit()
logit_results(results)
print('\nLogistic regression using glucose and BMI:')
print('==========================================')
results=smf.logit('Outcome~Glucose+BMI',data=data).fit()
logit_results(results)
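The executive summary also refers to scikit-learn based methods. A minimal sketch of how the sklearn LogisticRegression imported above and a KNN classifier could score the challenge case; the feature choice (Glucose, BMI, Age) follows the preliminary analysis, a small synthetic stand-in for the dataset keeps the snippet self-contained, and the exact training details are assumptions, not the notebook's final models:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic stand-in for the Pima dataset (illustration only)
data = pd.DataFrame({
    "Glucose": [85, 168, 122, 190, 99, 140, 110, 155],
    "BMI": [24.1, 35.2, 28.0, 33.5, 22.7, 31.0, 26.5, 34.0],
    "Age": [25, 51, 33, 60, 29, 45, 38, 55],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = data[["Glucose", "BMI", "Age"]]
y = data["Outcome"]

logreg = LogisticRegression(max_iter=1000).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# The challenge case: glucose 125, BMI 96 / 1.78**2 (about 30.3), age 54
case = pd.DataFrame({"Glucose": [125], "BMI": [96 / 1.78 ** 2], "Age": [54]})
print("Logistic regression:", logreg.predict(case)[0])
print("KNN:", knn.predict(case)[0])
```

On the real dataset, the two model families can disagree on such a borderline case, which is the behaviour the executive summary describes.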