What do your blood sugars tell you?

📖 Background

Diabetes mellitus remains a global health issue, causing several thousand deaths each day. Detecting diabetes in its early stages can help reduce the risk of serious complications such as circulatory system diseases, kidney malfunction, and vision loss. This competition involves developing a predictive model that effectively detects potential diabetes cases, ideally before preventive treatment begins.

💾 The data

The dataset contains diagnostic measurements that are associated with diabetes, which were collected from a population of Pima Indian women. The data includes various medical and demographic attributes, making it a well-rounded resource for predictive modeling.

The columns and data types are as follows:

  • Pregnancies. Type: Numerical (Discrete). Description: Number of times the patient has been pregnant.

  • Glucose. Type: Numerical (Continuous). Description: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

  • BloodPressure. Type: Numerical (Continuous). Description: Diastolic blood pressure (mm Hg).

  • SkinThickness. Type: Numerical (Continuous). Description: Triceps skinfold thickness (mm).

  • Insulin. Type: Numerical (Continuous). Description: 2-hour serum insulin (mu U/ml).

  • BMI. Type: Numerical (Continuous). Description: Body mass index (weight in kg/(height in m)^2).

  • DiabetesPedigreeFunction. Type: Numerical (Continuous). Description: A function that represents the likelihood of diabetes based on family history.

  • Age. Type: Numerical (Continuous). Description: Age of the patient in years.

  • Outcome. Type: Categorical (Binary). Description: Class variable (0 or 1) indicating whether the patient is diagnosed with diabetes. 1 = Yes, 0 = No.

Executive Summary

The dataset contains zero values in columns where a zero is not physiologically plausible (glucose, blood pressure, skin thickness, and BMI). Mean imputation was tried for these zeros, but it was not kept because it did not significantly improve the model's performance. All features are kept when training the model.

The best-performing model was Logistic Regression, with an accuracy score of ~76%. The top 3 features from the model are Glucose, BMI, and Pregnancies.

Recommendations:

  1. Find a better way to impute zero values in features that cannot realistically be zero.
  2. Fine-tune the Logistic Regression model.
  3. Consult domain experts on whether additional features could improve model accuracy.

Data Validation

A check of each column shows several where a zero value is almost impossible, namely glucose, blood pressure, skin thickness, and BMI. The rest of the columns are clean. Initially, the column mean was imputed wherever a 0 value did not make sense. However, since the model did not perform significantly better with imputation, it was decided to leave those values at 0.

A recommendation would be to use a more sophisticated method to impute the 0 values in glucose, blood pressure, skin thickness, and BMI.
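The zero check and the mean imputation that was tried can be sketched roughly as follows. This is a minimal illustration, not the notebook's hidden code: the four rows in `df` are made-up values, and `zero_invalid`, `zero_counts`, and `imputed` are names chosen here for the example.

```python
import pandas as pd

# Hypothetical toy rows; the real competition data is not reproduced here.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "SkinThickness": [35, 29, 0, 23],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero is treated as a missing measurement.
zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "BMI"]

# Count the zeros in each of those columns.
zero_counts = (df[zero_invalid] == 0).sum()

# Turn zeros into NaN, then fill each column with the mean of its
# remaining (non-zero) values.
imputed = df.copy()
masked = imputed[zero_invalid].mask(imputed[zero_invalid] == 0)
imputed[zero_invalid] = masked.fillna(masked.mean())

print(zero_counts)
print(imputed)
```

Masking the zeros before taking the mean matters: otherwise the zeros themselves would pull the imputed value down.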

Exploratory Data Analysis

Correlations between features were inspected in case any features were redundant. Based on the heatmap, no pair of features shows a 'very strong' correlation. Hence, none of the features are dropped.
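One way to make the redundancy check explicit, alongside the heatmap, is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below assumes a cutoff of 0.8 for 'very strong' (the report does not state one) and uses synthetic data; `strong_pairs` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs[pairs.abs() > threshold]

# Synthetic stand-in columns; values are random, not the real data.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI": rng.normal(32, 7, 200),
    "Age": rng.integers(21, 80, 200),
})

independent_result = strong_pairs(features)

# Adding a column that is a rescaled copy of another creates a redundant pair.
features["GlucoseCopy"] = features["Glucose"] * 2.0
redundant_result = strong_pairs(features)

print(independent_result)
print(redundant_result)
```

An empty result corresponds to the conclusion above: nothing is redundant enough to drop.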

A series of scatter plots, each with a fitted sigmoid curve, is used to visualize how each feature affects the outcome.
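The sigmoid curve for a single feature can be obtained by fitting a one-feature logistic regression and evaluating its predicted probability over a grid of feature values. The sketch below assumes scikit-learn and uses synthetic values standing in for one feature and the binary outcome; the plotting step is left out, and the notebook's actual approach may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for one feature and the binary outcome (not real data).
rng = np.random.default_rng(1)
x = rng.normal(120, 30, 300)
p_true = 1 / (1 + np.exp(-(x - 125) / 15))
y = (rng.random(300) < p_true).astype(int)

# Fit a logistic regression on this single feature.
model = LogisticRegression(max_iter=1000).fit(x.reshape(-1, 1), y)

# Evaluate the fitted probability of outcome 1 across the feature's range.
grid = np.linspace(x.min(), x.max(), 100)
curve = model.predict_proba(grid.reshape(-1, 1))[:, 1]

# `grid` and `curve` can then be drawn as a line on top of a scatter of (x, y).
print(curve[:5])
```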

Machine Learning

Three models are trained on the data (KNeighborsClassifier, Logistic Regression, and Random Forest Classifier). Based on the accuracy scores, confusion matrices, and ROC AUC curves, Logistic Regression has the best performance among the three and is the selected model.
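A comparison of this kind might look like the sketch below. It is an illustration under several assumptions: the data comes from scikit-learn's `make_classification` rather than the competition file, the split is 80/20, the models use default hyperparameters, and features are standardized for KNN and Logistic Regression. None of these choices are confirmed by the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 8 features (not the real dataset).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LogisticRegression": make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=1000)),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    probabilities = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "roc_auc": roc_auc_score(y_test, probabilities),
        "confusion_matrix": confusion_matrix(y_test, predictions),
    }

for name, metrics in results.items():
    print(name, round(metrics["accuracy"], 3), round(metrics["roc_auc"], 3))
```

On a small dataset, a single train/test split can rank models differently from one run to the next, so cross-validated scores would give a steadier comparison.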

Important Features

Using the Logistic Regression model, the coefficient of each feature is extracted to pinpoint which features affect the outcome the most. A table is generated that shows the feature coefficients and their absolute values.
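A coefficient table like the one described can be built along these lines. The sketch uses synthetic data with three made-up feature columns, and it standardizes the features before fitting so that coefficient magnitudes are comparable across different units; whether the notebook scaled its features is not stated, so treat that step as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data in which "Glucose" is constructed to have the largest effect.
rng = np.random.default_rng(42)
names = ["Glucose", "BMI", "Age"]
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=names)
logit = 2.0 * X["Glucose"] + 0.8 * X["BMI"] + 0.1 * X["Age"]
y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize (an assumption), then fit the logistic regression.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# One row per feature: signed coefficient and its absolute value,
# sorted so the most influential feature comes first.
table = pd.DataFrame({"feature": names, "coefficient": model.coef_[0]})
table["abs_coefficient"] = table["coefficient"].abs()
table = table.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)

print(table)
```

The sign of each coefficient shows the direction of the effect on the predicted probability, while the absolute value is what the ranking uses.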
