What possible indicators does diabetes have?

What do your blood sugars tell you?

📖 Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting and preventing diabetes in its early stages can help reduce the risk of serious health issues such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that effectively detects potential diabetes cases, ideally before preventive treatment begins.

💾 The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and data types are as follows:

Pregnancies. Type: Numerical (Continuous). Description: Number of times the patient has been pregnant.
Glucose. Type: Numerical (Continuous). Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
BloodPressure. Type: Numerical (Continuous). Description: Diastolic blood pressure (mm Hg).
SkinThickness. Type: Numerical (Continuous). Description: Triceps skinfold thickness (mm).
Insulin. Type: Numerical (Continuous). Description: 2-hour serum insulin (mu U/ml).
BMI. Type: Numerical (Continuous). Description: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction. Type: Numerical (Continuous). Description: A function that represents the likelihood of diabetes based on family history.
Age. Type: Numerical (Continuous). Description: Age of the patient in years.
Outcome. Type: Categorical (Binary). Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

# Load necessary libraries
suppressMessages(library(readr))

# Load the data from the CSV file
data <- read_csv("data/diabetes.csv", show_col_types = FALSE)

# Display the first few rows of the DataFrame
head(data)

str(data)

Visualizing the data distribution of glucose

suppressMessages(library(ggplot2))

# Ensure that Outcome is a factor
data$Outcome <- as.factor(data$Outcome)

# Boxplot of Glucose levels by Outcome
ggplot(data, aes(x = Outcome, y = Glucose, fill = Outcome, group = Outcome)) +
  geom_boxplot() +
  labs(title = "Glucose Levels by Diabetes Outcome",
       x = "Diabetes Outcome",
       y = "Glucose Level (mg/dL)") +
  scale_fill_manual(values = c("blue", "red")) +
  theme_minimal()

Visualizing the age distribution

# Density plot of Age by Outcome
ggplot(data, aes(x = Age, fill = Outcome, group = Outcome)) +
  geom_density(alpha = 0.5) +
  labs(title = "Age Distribution by Diabetes Outcome",
       x = "Age",
       y = "Density") +
  scale_fill_manual(values = c("blue", "red")) +
  theme_minimal()

💪 Competition challenge

In this challenge, you will focus on the following key tasks:

Determine the most important factors affecting the diabetes outcome.
Create interactive plots to visualize the relationship between diabetes and the determined factors from the previous step.
What is the risk of diabetes for a person aged 54, with a height of 178 cm, a weight of 96 kg, and a glucose level of 125 mg/dL?
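
One simple way to start on the first task is to rank the numeric predictors by the absolute value of their correlation with Outcome. The sketch below is only a first pass under stated assumptions: `rank_by_correlation` is a hypothetical helper name, Outcome is assumed to be coded 0/1, and a linear correlation will miss non-linear effects and interactions between factors.

```r
# Hypothetical helper: rank numeric predictors by |correlation| with a 0/1 outcome.
rank_by_correlation <- function(df, outcome = "Outcome") {
  # Accept the outcome either as a factor ("0"/"1") or as a number
  y <- as.numeric(as.character(df[[outcome]]))
  predictors <- setdiff(names(df), outcome)
  # Keep only the numeric predictor columns
  predictors <- predictors[vapply(df[predictors], is.numeric, logical(1))]
  cors <- vapply(predictors,
                 function(p) cor(df[[p]], y, use = "complete.obs"),
                 numeric(1))
  # Strongest association first
  sort(abs(cors), decreasing = TRUE)
}

# Usage, assuming `data` was loaded as in the cells above:
# rank_by_correlation(data)
```

The result is a named numeric vector, so the top entries are candidates for the interactive plots in the second task.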

🧑‍⚖️ Judging criteria

This is a community-based competition. Once the competition concludes, you'll have the opportunity to view and vote for the best submissions of others as the voting begins. The top 5 most upvoted entries will win. The winners will receive DataCamp merchandise.

✅ Checklist before publishing into the competition

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your story.
Make sure the workbook reads well and explains how you found your insights.
Try to include an executive summary of your recommendations at the beginning.
Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Hi, I am Brian, and this is my submission for a competition that is already over.

This is not meant to be an entry to the competition, but rather a way of using the competition as a challenge to learn new skills and reinforce existing ones. In this particular project, I will be using R to reinforce my learning of R techniques in data analysis.

In this challenge, we will focus on the points raised by the key tasks, specifically:

What is the relationship between "Outcome" and the other factors?
Which plots best highlight those relationships?
What is the risk of diabetes for a person aged 54, with a height of 178 cm, a weight of 96 kg, and a glucose level of 125 mg/dL? Specifically, this means we want to plot as many factors as possible against the outcome.
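
For the third question, the dataset has no height or weight column, so the two values first have to be combined into a BMI using the formula from the column description. The sketch below does that and then shows one possible way to turn the inputs into a probability. Logistic regression on Age, Glucose and BMI is an assumption chosen for illustration, not the only reasonable model, and `bmi_from_cm_kg` and `estimate_risk` are hypothetical helper names.

```r
# BMI = weight in kg / (height in m)^2
bmi_from_cm_kg <- function(height_cm, weight_kg) {
  weight_kg / (height_cm / 100)^2
}

# The person described in the challenge
person <- data.frame(
  Age = 54,
  Glucose = 125,
  BMI = bmi_from_cm_kg(178, 96)
)
person$BMI  # roughly 30.3

# Hypothetical helper: fit a logistic regression on three predictors and
# return the predicted probability of Outcome == 1 for `new_person`.
estimate_risk <- function(df, new_person) {
  # Works whether Outcome is stored as a factor ("0"/"1") or as a number
  df$Outcome <- as.integer(as.character(df$Outcome))
  fit <- glm(Outcome ~ Age + Glucose + BMI, data = df, family = binomial())
  unname(predict(fit, newdata = new_person, type = "response"))
}

# Usage, assuming `data` was loaded as in the cells above:
# estimate_risk(data, person)
```

The returned value is a model-based estimate for this population only, so it should be read alongside the plots rather than as a clinical risk.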

While most of the content may be technical, my aim is for the general community to be able to read this workbook without any issues, so a personal goal is to make the workbook as readable as possible.

Exploratory Data Analysis (EDA)

Before drawing any conclusions, we must first analyse the data and perform EDA to clean it of potential errors that could lead to incorrect conclusions. These errors can take many forms, such as missing data, non-standardised values, or values that fall outside the expected range.
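
As a rough illustration of the out-of-range check, the sketch below counts zeros in measurement columns where a value of zero is not physiologically plausible. It is written in R rather than SQL, `count_implausible_zeros` is a hypothetical helper name, and the default list of columns is an assumption rather than something stated in the data description.

```r
# Hypothetical helper: count zero values in columns where zero is implausible.
# The default column list is an assumption, not part of the dataset documentation.
count_implausible_zeros <- function(df,
                                    cols = c("Glucose", "BloodPressure",
                                             "SkinThickness", "Insulin", "BMI")) {
  cols <- intersect(cols, names(df))  # ignore columns that are not present
  vapply(cols, function(col) sum(df[[col]] == 0, na.rm = TRUE), integer(1))
}

# Usage, assuming `data` was loaded as in the cells above:
# count_implausible_zeros(data)
```

A non-zero count would suggest that missing measurements may have been recorded as 0, which a NULL check alone would not reveal.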

We will use SQL to quickly determine whether these errors are present and take appropriate measures to correct them. Note that spotting errors is only one aspect of EDA: the way an error is resolved, for example by removing the affected rows or by replacing the values, can lead to different analysis results. Even so, it is good to identify the potential problems first.

We can go through them one by one, starting with missing data.

Missing data

SELECT * 
FROM 'data/diabetes.csv'
WHERE "Outcome" IS NULL;

Although not every query is shown, the same query was run against every column. The query returns rows only where the chosen column is NULL, so any missing value would show up in the results. Fortunately, none of the columns returned a result, meaning that no values are missing.
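
The same check can also be run over all columns at once in R. This is a minimal sketch that assumes the data has been loaded into a data frame as in the earlier cells. `count_missing` is a hypothetical helper name.

```r
# Hypothetical helper: number of NA values in each column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Usage, assuming `data` was loaded with read_csv() as above:
# count_missing(data)            # one count per column
# any(count_missing(data) > 0)   # TRUE if anything is missing
```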

Duplicated Data