Competition - Blood Sugars - Finding a model to predict diabetes


Competition description

What do your blood sugars tell you?

Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes in its early stages and acting on it can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that effectively detects potential diabetes cases, ideally before preventive treatment begins.

The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and Data Types are as follows:

Pregnancies Type: Numerical (Continuous) Description: Number of times the patient has been pregnant.

Glucose Type: Numerical (Continuous) Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

BloodPressure Type: Numerical (Continuous) Description: Diastolic blood pressure (mm Hg).

SkinThickness Type: Numerical (Continuous) Description: Triceps skinfold thickness (mm).

Insulin Type: Numerical (Continuous) Description: 2-Hour serum insulin (mu U/ml).

BMI Type: Numerical (Continuous) Description: Body mass index (weight in kg/(height in m)^2).

DiabetesPedigreeFunction Type: Numerical (Continuous) Description: A function that represents the likelihood of diabetes based on family history.

Age Type: Numerical (Continuous) Description: Age of the patient in years.

Outcome Type: Categorical (Binary) Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

Competition challenge

In this challenge, you will focus on the following key tasks:

Determine the most important factors affecting the diabetes outcome.
Create interactive plots to visualize the relationship between diabetes and the factors identified in the previous step.
What is the risk of diabetes for a person aged 54, with a height of 178 cm, a weight of 96 kg, and a glucose level of 125 mg/dL?
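
The dataset has no height or weight column, only BMI, so the height and weight in the last question need to be converted before they can be used with any model built on this data. A minimal sketch of that conversion, using the BMI formula from the column description (the variable names are illustrative):

```python
# Patient values taken from the challenge question.
height_cm = 178
weight_kg = 96

# BMI = weight in kg / (height in m)^2, as defined in the column description.
height_m = height_cm / 100
bmi = weight_kg / height_m ** 2

print(f"BMI: {bmi:.1f}")  # BMI: 30.3
```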

Executive summary of your recommendations

Database cleaning: search for missing or null values in your database and treat them.
Database cleaning 2: search for impossible or highly discrepant values and treat them. Descriptive statistics can help here.
Bring context to your data (this is a health topic, with many scientific articles describing it).
Visualize your data (histograms and box plots, for example) and compare your findings with the context you described.
Different variables can carry information about the same process. A redundancy analysis can help identify them.
Search for a method that suits your data (here, we used Logistic Regression). Also, think about which metrics matter for your model (Precision? Recall? Accuracy?) and adapt your model accordingly.
Adapt your model to the user's reality (if it is a screening program and the health system cannot afford a specific test, is it a good idea to include that test in your model?).
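
To make the choice between metrics concrete, the sketch below computes accuracy, precision, and recall by hand from a small set of made-up labels and predictions. The data is hypothetical and only illustrates the definitions; scikit-learn's `accuracy_score` and `recall_score` compute the same quantities.

```python
# Hypothetical true labels and predictions (1 = diabetes, 0 = no diabetes).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Confusion-matrix cells.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are real
recall = tp / (tp + fn)             # share of real positives that are detected

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy example the accuracy is 0.70 while the recall is only 0.50: half of the diabetic patients are missed, which matters in a screening setting where an undetected case is usually the costlier error.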

METHODOLOGY

CLEANING DATABASE

First of all, the database should be evaluated for null and blank values.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight
import plotly.express as px


data = pd.read_csv('data/diabetes.csv')

# Checking for null (NaN) values in the entire DataFrame.
has_na = data.isnull().values.any()

# Checking for issues and printing the corresponding message.
if has_na:
    print('There are issues with NA (null) values')
else:
    print('No issues with NA values')

Fortunately, the database did not contain any blank or null values.
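
The check above only reports whether any null value exists anywhere. When nulls are present, a per-column count shows where they are. A minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `data`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    'Glucose': [148, np.nan, 183],
    'BMI': [33.6, 26.6, np.nan],
    'Age': [50, 31, 32],
})

# Number of missing values in each column.
na_per_column = toy.isnull().sum()
print(na_per_column)
```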

The second step is to check the descriptive statistics of each variable, primarily to identify impossible or highly discrepant values.

# Dictionary with the descriptive data
descriptive_stats = {
    'Variable': [],
    'Mean': [],
    'Median': [],
    'Max': [],
    'Min': []
}

# List to extract statistics
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Loop to calculate each column statistic
for column in columns:
    descriptive_stats['Variable'].append(column)
    descriptive_stats['Mean'].append(np.mean(data[column]))
    descriptive_stats['Median'].append(np.median(data[column]))
    descriptive_stats['Max'].append(np.max(data[column]))
    descriptive_stats['Min'].append(np.min(data[column]))

# DataFrame with the descriptive statistics
descriptive_df = pd.DataFrame(descriptive_stats)

# Configuring axis from the figure and table
fig, ax = plt.subplots(figsize=(10, 4)) 
ax.axis('tight')
ax.axis('off')

# Creating table using matplotlib
table = ax.table(cellText=descriptive_df.values, colLabels=descriptive_df.columns, cellLoc='center', loc='center')

# Show the table as a figure
plt.show()

# Number of rows and columns in the database
data.shape
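
As an alternative to the explicit loop, the same four statistics can be obtained in one call with `DataFrame.agg`. This is a sketch on a small made-up frame, not the code used for the table above; `toy` is a hypothetical stand-in for `data[columns]`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data[columns]`; the values are made up.
toy = pd.DataFrame({
    'Glucose': [0, 100, 140],
    'BMI': [0.0, 25.0, 35.0],
})

# One row per variable, one column per statistic.
summary = toy.agg(['mean', 'median', 'max', 'min']).T
summary.columns = ['Mean', 'Median', 'Max', 'Min']

print(summary.round(2))
```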

The maximum values are all plausible given the nature of each variable. The minimum values are not: several variables have a minimum of 0, which is physiologically impossible. For example, no living person has a blood glucose level of 0, and a BMI of 0 would imply a body weight of 0 kg.

Therefore, we will exclude every row in which Age, BMI, Glucose, Insulin, BloodPressure, or SkinThickness equals zero. Before this exclusion, the database contained 768 rows.
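
One way this exclusion could be written in pandas is sketched below on a small made-up frame. The names `toy`, `no_zero_columns`, and `cleaned` are illustrative assumptions. Pregnancies is deliberately left out of the check, since zero pregnancies is a valid value.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    'Pregnancies': [0, 2, 1, 3],
    'Glucose': [120, 0, 95, 130],
    'BloodPressure': [70, 80, 0, 72],
    'SkinThickness': [30, 25, 20, 35],
    'Insulin': [100, 90, 85, 0],
    'BMI': [28.0, 31.5, 24.2, 33.1],
    'Age': [45, 33, 29, 51],
})

# Columns in which a value of zero is treated as impossible.
no_zero_columns = ['Age', 'BMI', 'Glucose', 'Insulin', 'BloodPressure', 'SkinThickness']

# How many zeros each of these columns contains.
zero_counts = (toy[no_zero_columns] == 0).sum()
print(zero_counts)

# Keep only the rows in which none of these columns is zero.
cleaned = toy[(toy[no_zero_columns] != 0).all(axis=1)]
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Counting the zeros per column first shows which variables drive the loss of rows, which is useful context before deciding whether dropping them is acceptable.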