What do your blood sugars tell you?
Background
Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes at an early stage and intervening can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that effectively detects potential diabetes cases, ideally before preventive treatment begins.
The data
The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.
The columns and data types are as follows:
- Pregnancies. Type: Numerical (Continuous). Description: Number of times the patient has been pregnant.
- Glucose. Type: Numerical (Continuous). Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- BloodPressure. Type: Numerical (Continuous). Description: Diastolic blood pressure (mm Hg).
- SkinThickness. Type: Numerical (Continuous). Description: Triceps skinfold thickness (mm).
- Insulin. Type: Numerical (Continuous). Description: 2-hour serum insulin (mu U/ml).
- BMI. Type: Numerical (Continuous). Description: Body mass index (weight in kg/(height in m)^2).
- DiabetesPedigreeFunction. Type: Numerical (Continuous). Description: A function that represents the likelihood of diabetes based on family history.
- Age. Type: Numerical (Continuous). Description: Age of the patient in years.
- Outcome. Type: Categorical (Binary). Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.
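Several of the measurements listed above, such as glucose, blood pressure, and BMI, cannot plausibly be zero in a living patient, so zeros in those columns may be placeholders for missing values. That is an assumption about this dataset, not something stated in the description. A minimal sketch for counting such entries, where `ZERO_AS_MISSING` and `count_zero_placeholders` are hypothetical names introduced here:

```python
import pandas as pd

# Assumption: a value of exactly 0 in these columns is not a real measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zero_placeholders(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.Series:
    """Count, per column, the rows whose value is exactly zero."""
    # Only look at the listed columns that actually exist in the frame.
    present = [col for col in columns if col in df.columns]
    return (df[present] == 0).sum()
```

Once the CSV has been loaded into `data`, `count_zero_placeholders(data)` returns one count per column. Whether to impute or drop those rows is a separate modeling decision.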
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()
Visualizing the data distribution of glucose
import seaborn as sns
import matplotlib.pyplot as plt
# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')
# Boxplot of Glucose levels by Outcome
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data, palette=["blue", "red"])
plt.title("Glucose Levels by Diabetes Outcome")
plt.xlabel("Diabetes Outcome")
plt.ylabel("Glucose Level (mg/dL)")
plt.show()
Visualizing the age distribution
# Density plot of Age by Outcome
plt.figure(figsize=(8, 6))
sns.kdeplot(data=data, x='Age', hue='Outcome', fill=True, common_norm=False, palette=["blue", "red"], alpha=0.5)
plt.title("Age Distribution by Diabetes Outcome")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()
Competition challenge
In this challenge, you will focus on the following key tasks:
- Determine the most important factors affecting the diabetes outcome.
- Create interactive plots to visualize the relationship between diabetes and the determined factors from the previous step.
- What is the risk of diabetes for a person aged 54, with a height of 178 cm, a weight of 96 kg, and a glucose level of 125 mg/dL?
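For the third task, BMI follows from the height and weight via the formula in the data description: 96 / 1.78^2 is roughly 30.3 kg/m^2. One possible way to turn that into a risk estimate is a logistic regression fitted on only the three columns known for this person. This is a sketch under assumptions, not the competition's prescribed method: the use of scikit-learn, the feature subset, and the helper names `bmi_from_height_weight`, `fit_risk_model`, and `predict_risk` are all introduced here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use only the columns that are known for the person in question.
FEATURES = ["Glucose", "BMI", "Age"]


def bmi_from_height_weight(height_cm: float, weight_kg: float) -> float:
    """Body mass index: weight in kg divided by squared height in m."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def fit_risk_model(df: pd.DataFrame, features=FEATURES, target="Outcome"):
    """Fit a logistic regression on standardized features."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # astype(int) also handles an Outcome column stored as a category.
    model.fit(df[list(features)], df[target].astype(int))
    return model


def predict_risk(model, age, height_cm, weight_kg, glucose, features=FEATURES):
    """Return the model's estimated probability that Outcome == 1."""
    person = pd.DataFrame([{
        "Glucose": glucose,
        "BMI": bmi_from_height_weight(height_cm, weight_kg),
        "Age": age,
    }])[list(features)]
    # Column 1 of predict_proba corresponds to the positive class (1).
    return float(model.predict_proba(person)[0, 1])
```

With the notebook's DataFrame, `predict_risk(fit_risk_model(data), age=54, height_cm=178, weight_kg=96, glucose=125)` would give the estimated probability. The number depends on the model choice and is not validated on held-out data, so it is best read as a rough estimate.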
Judging criteria
This is a community-based competition. Once the competition concludes and voting opens, you'll be able to view and vote for other participants' submissions. The top 5 most upvoted entries will win. The winners will receive DataCamp merchandise.
Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights.
- Try to include an executive summary of your recommendations at the beginning.
- Check that all the cells run without error.
Time is ticking. Good luck!
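df = data
df.head()
# Function to convert camelCase to snake_case
def camel_to_snake(name):
    # Insert an underscore before every capital letter except the first, then lowercase.
    # Note: acronyms are split letter by letter, so 'BMI' becomes 'b_m_i'.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
# Apply the function to all column names
df.columns = [camel_to_snake(col) for col in df.columns]
df
df.columns
# Boxplot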
plt.figure(figsize=(10, 6))
sns.boxplot(x='outcome', y='b_m_i', data=df, palette=["blue", "red"])
plt.xlabel('Diabetes Outcome')
plt.ylabel('BMI')
plt.title('BMI by Diabetes Outcome')
plt.show()