Exploring Diabetes Risk: Feature Analysis and Predictive Modeling

What do your blood sugars tell you?

📖 Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes early, or preventing it altogether, can reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that detects potential diabetes cases effectively, ideally before preventive treatment begins.

💾 The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and Data Types are as follows:

Pregnancies Type: Numerical (Continuous) Description: Number of times the patient has been pregnant.
Glucose Type: Numerical (Continuous) Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL).
BloodPressure Type: Numerical (Continuous) Description: Diastolic blood pressure (mm Hg).
SkinThickness Type: Numerical (Continuous) Description: Triceps skinfold thickness (mm).
Insulin Type: Numerical (Continuous) Description: 2-Hour serum insulin (mu U/ml).
BMI Type: Numerical (Continuous) Description: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction Type: Numerical (Continuous) Description: A function that represents the likelihood of diabetes based on family history.
Age Type: Numerical (Continuous) Description: Age of the patient in years.
Outcome Type: Categorical (Binary) Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

Summary

This document outlines a methodology for estimating diabetes risk. We first analyze the data to identify which attributes have the greatest impact on the diabetes outcome. We then create interactive plots of the relationships between these key factors and diabetes risk, showing how variation in each factor may change the likelihood of developing diabetes. Next, we train a RandomForestClassifier on the Pima Indians Diabetes dataset to predict diabetes risk from the identified factors. Finally, we apply the model to a specific individual: age 54 years, height 178 cm, weight 96 kg, and a glucose level of 125 mg/dL. Combining factor analysis, visualization, and machine learning in this way yields a risk prediction tailored to that individual.

Key Points

Key Factors: Identifying important factors involves analyzing various data attributes to understand their impact on diabetes risk.
Interactive Visualization: Plots are created to provide visual insights into the relationship between diabetes and the determined factors.
BMI Calculation: Body Mass Index (BMI) is computed to classify the individual's weight category, which is a crucial component in assessing diabetes risk.
Machine Learning Model: We use a RandomForestClassifier model trained on the Pima Indians Diabetes dataset. This model predicts diabetes risk from characteristics such as age, BMI, and glucose level. Other factors, such as blood pressure and insulin, are set from trimmed values when specific data is unavailable for the individual.
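The BMI step described above can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: the helper names `bmi` and `bmi_category` are hypothetical, and the category cut-offs (18.5, 25, 30) are the standard WHO adult thresholds, assumed here for the weight classification.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a WHO adult weight category (assumed cut-offs)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"


# The individual from the challenge: 96 kg, 178 cm
person_bmi = bmi(weight_kg=96, height_cm=178)
print(f"BMI = {person_bmi:.1f} ({bmi_category(person_bmi)})")
# prints: BMI = 30.3 (obese)
```

The resulting value is what would be passed to the model as the BMI feature, since the dataset stores BMI rather than height and weight.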

# Import the pandas library for data manipulation and analysis
import pandas as pd

# Import the numpy library for numerical operations and handling arrays
import numpy as np

# Import the seaborn library for statistical data visualization
import seaborn as sns

# Import the matplotlib.pyplot module for creating static, animated, and interactive visualizations
import matplotlib.pyplot as plt

# Import the plotly.express module for creating interactive and elaborate plots
import plotly.express as px

# Import the stats module from scipy for statistical functions and tests
from scipy import stats

# Import the RandomForestClassifier from scikit-learn's ensemble module for classification tasks
from sklearn.ensemble import RandomForestClassifier

# Import train_test_split from scikit-learn to split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Import StandardScaler from scikit-learn to standardize features by removing the mean and scaling to unit variance
from sklearn.preprocessing import StandardScaler

# Import accuracy_score and classification_report from scikit-learn's metrics module to evaluate the classification model
from sklearn.metrics import accuracy_score, classification_report


# Load the diabetes dataset from the CSV file into a DataFrame
data = pd.read_csv('data/diabetes.csv')
# Display the first few rows of the DataFrame
data.head()

# Count the total NaN values in columns that are not of 'category' type
nan_counts = data.select_dtypes(exclude=['category']).isna().sum()

# Create a DataFrame to show column names and NaN counts
nan_counts_df = pd.DataFrame({
    'Column Name': nan_counts.index,
    'NaN Count': nan_counts.values
})

# Display the result
print(nan_counts_df)

# Count the number of non-null (non-missing) values for each column in the DataFrame 'data'
total_values = data.count()

# Print the results
print(total_values)

# Generate descriptive statistics of the DataFrame 'data'
data.describe()
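A NaN count alone may understate how much data is missing. In the Pima Indians Diabetes dataset as it is commonly distributed, missing measurements are recorded as 0 rather than NaN, so a count of zeros in columns where zero is not a plausible reading is a useful companion check. The sketch below assumes that convention applies to this copy of the file; the helper `count_implausible_zeros` and the column list are illustrative, and the small frame is made-up data, not rows from the dataset.

```python
import pandas as pd

# Assumption: a zero in these columns cannot be a real measurement
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_implausible_zeros(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.Series:
    """Return the number of exact-zero entries in each listed column that exists in df."""
    present = [c for c in columns if c in df.columns]
    return (df[present] == 0).sum()


# Tiny made-up frame for illustration only
demo = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 0.0],
})
zero_counts = count_implausible_zeros(demo)
print(zero_counts)
```

On the real data the call would be `count_implausible_zeros(data)`. Large counts would also explain suspicious minimum values of 0 in the `data.describe()` output.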

Visualizing the data distribution of glucose



# Ensure 'Outcome' is treated as a categorical variable
data['Outcome'] = data['Outcome'].astype('category')

# Boxplot of Glucose levels by Outcome
plt.figure(figsize=(8, 6))
sns.boxplot(x='Outcome', y='Glucose', data=data, palette=["blue", "red"])
plt.title("Glucose Levels by Diabetes Outcome")
plt.xlabel("Diabetes Outcome")
plt.ylabel("Glucose Level (mg/dL)")
plt.show()

Visualizing the age distribution

# Density plot of Age by Outcome
plt.figure(figsize=(8, 6))
sns.kdeplot(data=data, x='Age', hue='Outcome', fill=True, common_norm=False, palette=["blue", "red"], alpha=0.5)
plt.title("Age Distribution by Diabetes Outcome")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()

💪 Competition challenge

In this challenge, you will focus on the following key tasks:

Determine the most important factors affecting the diabetes outcome.
Create interactive plots to visualize the relationship between diabetes and the determined factors from the previous step.
What is the risk of diabetes for a person aged 54, with a height of 178 cm, a weight of 96 kg, and a glucose level of 125 mg/dL?
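The third task can be approached roughly as sketched below: train a RandomForestClassifier, build a single-row input for the person, and read off the predicted probability of the positive class. Several things here are assumptions, not the notebook's confirmed method: the helper `estimate_risk` is hypothetical, features that are unknown for the person are filled with the training median, and the demo frame is synthetic so that the block runs without `data/diabetes.csv`. The number it prints therefore says nothing about real diabetes risk.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def estimate_risk(df, known, target="Outcome", seed=42):
    """Train on df and return the predicted probability of class 1 for one person.

    Features missing from `known` are filled with the training median
    (an assumption of this sketch).
    """
    X = df.drop(columns=[target])
    y = df[target].astype(int)
    model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    person = X.median().to_dict()   # defaults for unknown features
    person.update(known)            # overwrite with the known values
    row = pd.DataFrame([person], columns=X.columns)
    return float(model.predict_proba(row)[0, 1])


# Synthetic stand-in data: outcome probability rises with glucose
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BloodPressure": rng.normal(70, 12, n),
    "BMI": rng.normal(32, 7, n),
    "Age": rng.integers(21, 80, n),
})
p = 1 / (1 + np.exp(-(demo["Glucose"] - 125) / 15))
demo["Outcome"] = (rng.random(n) < p).astype(int)

risk = estimate_risk(demo, {"Age": 54, "BMI": 96 / 1.78 ** 2, "Glucose": 125})
print(f"Estimated risk on synthetic data: {risk:.2f}")
```

With the real dataset, the first argument would be `data`, and the returned value is the share of trees' votes for Outcome = 1, which is best read as a model score, not a calibrated clinical probability.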

Analysis of Key Determinants Influencing Diabetes Outcomes
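One common way to rank the determinants is the impurity-based feature importance of a fitted random forest. The sketch below shows the idea under stated assumptions: `rank_features` is a hypothetical helper, the forest settings are arbitrary, and the demo data is synthetic with an outcome driven by glucose alone, so the ranking it prints only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(df: pd.DataFrame, target: str = "Outcome", seed: int = 42) -> pd.Series:
    """Fit a random forest and return impurity-based importances, largest first."""
    X = df.drop(columns=[target])
    y = df[target].astype(int)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic stand-in data so the sketch runs without data/diabetes.csv
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 300),
    "BMI": rng.normal(32, 7, 300),
    "Age": rng.integers(21, 80, 300),
})
demo["Outcome"] = (demo["Glucose"] > 130).astype(int)  # outcome depends on glucose only

importances = rank_features(demo)
print(importances)
```

Applied to the real data this would be `rank_features(data)`. Impurity-based importances sum to 1 and can favor features with many distinct values, so they are a starting point for the analysis and not a definitive ranking.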
