Learn about the factors that affect having diabetes

What do your blood sugars tell you?

📖 Background

Diabetes mellitus remains a global health issue that causes several thousand deaths each day. Detecting diabetes in its early stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that detects potential diabetes cases, ideally before preventive treatment begins.

💾 The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and Data Types are as follows:

Pregnancies Type: Numerical (Discrete) Description: Number of times the patient has been pregnant.
Glucose Type: Numerical (Continuous) Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
BloodPressure Type: Numerical (Continuous) Description: Diastolic blood pressure (mm Hg).
SkinThickness Type: Numerical (Continuous) Description: Triceps skinfold thickness (mm).
Insulin Type: Numerical (Continuous) Description: 2-Hour serum insulin (mu U/ml).
BMI Type: Numerical (Continuous) Description: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction Type: Numerical (Continuous) Description: A function that represents the likelihood of diabetes based on family history.
Age Type: Numerical (Continuous) Description: Age of the patient in years.
Outcome Type: Categorical (Binary) Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

import pandas as pd

data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()

Visualizing the data distribution of glucose

import seaborn as sns
import matplotlib.pyplot as plt

# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')

# Boxplot of Glucose levels by Outcome
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data, palette=["blue", "red"])
plt.title("Glucose Levels by Diabetes Outcome")
plt.xlabel("Diabetes Outcome")
plt.ylabel("Glucose Level (mg/dL)")
plt.show()

The box plot suggests that most people with higher glucose levels (above roughly 120) have diabetes. Let's use this information to see whether BMI has anything to do with it.
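To put a number on what the box plot hints at, here is a minimal sketch that compares the share of positive outcomes below and at or above a glucose threshold. The helper name `outcome_rate_by_threshold` and the toy values are my own assumptions for illustration; they are not part of the dataset or the original notebook.

```python
def outcome_rate_by_threshold(values, outcomes, threshold=120):
    """Return (rate_below, rate_at_or_above): share of outcome == 1 on each side of `threshold`."""
    below_total = below_pos = above_total = above_pos = 0
    for value, outcome in zip(values, outcomes):
        if value >= threshold:
            above_total += 1
            above_pos += int(outcome)
        else:
            below_total += 1
            below_pos += int(outcome)
    # Guard against an empty group so the function never divides by zero
    rate_below = below_pos / below_total if below_total else float("nan")
    rate_above = above_pos / above_total if above_total else float("nan")
    return rate_below, rate_above


# Toy values for illustration only (NOT taken from the dataset)
toy_glucose = [90, 100, 110, 125, 140, 160]
toy_outcome = [0, 0, 1, 0, 1, 1]
low, high = outcome_rate_by_threshold(toy_glucose, toy_outcome)
print(f"rate below 120: {low:.2f}, rate at or above 120: {high:.2f}")
```

On the real frame this could presumably be called as `outcome_rate_by_threshold(data['Glucose'], data['Outcome'])`, since the function only iterates over the two columns.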

Visualization of BMI vs Glucose

import matplotlib.pyplot as plt
import seaborn as sns

# Keep only the patients with a glucose level of 120 or more
glucose_range = data[data['Glucose'] >= 120]

# Scatter plot of BMI against glucose, coloured by diabetes outcome
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=glucose_range, s=100, palette=["blue", "red"])

plt.xlabel('Glucose')
plt.ylabel('BMI')
plt.title('Relationship between Glucose (>= 120) and BMI')

plt.show()

The scatter plot is crowded, but looking closely I can spot more blue points (no diabetes) where the BMI is less than 30. So I'll go with the assumption that people with a BMI above 30 are more likely to have diabetes, especially in the area where glucose is between 120 and 150. Maybe these are the same people with glucose levels around 120 shown in the first plot. Still, I want to make sure I'm not making too broad an assumption, so let's count how many people with a BMI between 20 and 30 have diabetes and how many do not, and compare that with the people in the 30-40 range.

Number of people with diabetes in the BMI ranges 20-29 and 30-40

import seaborn as sns
import matplotlib.pyplot as plt

BMI20_29 = data[(data['BMI'] < 30) & (data['BMI'] >= 20)]
BMI30_40 =  data[(data['BMI'] <= 40) & (data['BMI'] >= 30)]

# Count the number of points where outcome is 1 in each BMI range
outcome_1_count = BMI20_29[BMI20_29['Outcome'] == 1].shape[0]
outcome_1_count2 = BMI30_40[BMI30_40['Outcome'] == 1].shape[0]

print(f'Number of points with outcome = 1 and BMI between 20 and 29: {outcome_1_count}')
print(f'Number of points with outcome = 1 and BMI between 30 and 40: {outcome_1_count2}')

YES! As expected, a higher BMI goes along with a higher chance of having diabetes. But wait a minute, let me see how many points are actually in each range, just to be fair 🙄

import seaborn as sns
import matplotlib.pyplot as plt 

BMI20_29 = data[(data['BMI'] < 30) & (data['BMI'] >= 20)]
BMI30_40 =  data[(data['BMI'] <= 40) & (data['BMI'] >= 30)]
# Count the number of total points
total_count_of20_29_points = BMI20_29.shape[0]
total_count_of30_40_points = BMI30_40.shape[0]
print(f'total count of 20-29 : {total_count_of20_29_points}')
print(f'total count of 30-40 : {total_count_of30_40_points}')
# Percentage of points with outcome = 1 over the total in each BMI range
print(f'percentage of points with outcome = 1 over total in (20-29) BMI range: {(outcome_1_count/total_count_of20_29_points) * 100:.2f}%')
print(f'percentage of points with outcome = 1 over total in (30-40) BMI range: {(outcome_1_count2/total_count_of30_40_points) * 100:.2f}%')

Computing the percentage of points in each range supports our first assumption! Are you still not as convinced as I am?

I think it's not fair to compare the two groups with a 105-point difference between them, so let's do the chi-squared test, and if the p-value is < .005 I promise I won't bother you again with my assumption.
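Here is one way the chi-squared test could be sketched for the 2x2 table of BMI range against outcome. This is a hand-rolled Pearson test without continuity correction, written with the standard library only; the function name `chi2_2x2` and the placeholder counts are my own assumptions. `scipy.stats.chi2_contingency` would be the usual tool, though it applies Yates' continuity correction to 2x2 tables by default, so its numbers would differ slightly.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test of independence for a 2x2 table of counts.

    Table layout:
                     Outcome = 1   Outcome = 0
        BMI 20-29        a             b
        BMI 30-40        c             d

    No continuity correction is applied. Every row and column total
    must be non-zero, otherwise an expected count is zero.
    """
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of BMI range and outcome
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Placeholder counts for illustration only (NOT the dataset's real counts)
stat, p = chi2_2x2(10, 20, 20, 10)
print(f"chi-squared = {stat:.3f}, p-value = {p:.4f}")
print("significant at .005" if p < 0.005 else "not significant at .005")
```

With the counts from the cells above, the call would presumably look like `chi2_2x2(outcome_1_count, total_count_of20_29_points - outcome_1_count, outcome_1_count2, total_count_of30_40_points - outcome_1_count2)`.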
