ASCVD Understanding and Prediction
Table of Contents
- Introduction
- Data & Methodology
- Data Understanding
- Data Preparation
- Exploratory Analysis
- Modeling & Evaluation
- Conclusion & Recommendations
1. Introduction
1.1. Problem Statement
Atherosclerosis is a disease of the arterial walls that makes them thicker and less elastic. It is a leading cause of death globally and can lead to serious health problems such as heart attacks, strokes, and damage to the arteries in the legs. Risk factors include high cholesterol, diabetes, smoking, family history, a sedentary lifestyle, being overweight, and high blood pressure. Symptoms result from blood flow that is reduced or blocked by plaque buildup, and they vary with the artery affected. Diagnosis relies on medical examinations such as angiography or ultrasonography. Treatment includes modifying risk factors, making lifestyle changes, and taking antiplatelet and antiatherogenic drugs. By identifying and analyzing risk factors, we can predict the likelihood of developing atherosclerotic cardiovascular disease (ASCVD) and take preventive measures accordingly.
1.2. Project Objective
This personal project aims to:
- Analyze the impact of different factors, such as age, gender, and medical examination results, on the development of cardiovascular disease.
- Build a machine learning model to predict the presence or absence of cardiovascular disease using those features.
1.3. Executive Summary
This report explores the impact of demographics, physical characteristics, health examination results, and lifestyle on the development of ASCVD. Through this analysis, we identified critical features that increase the risk of ASCVD and built a promising machine learning model (F1 score of 0.74, ROC-AUC of 0.80) to predict the presence or absence of cardiovascular disease.
The analysis produced several findings that can inform preventive measures to reduce the risk of ASCVD:
- Age, high blood pressure, high blood glucose, high cholesterol, and obesity are high-risk factors for ASCVD. When these factors combine, they increase the risk even further.
- A lifestyle with regular physical activity can help reduce the risk of having ASCVD.
- In this dataset, smoking and alcohol intake on their own were not associated with a higher risk of ASCVD. However, people who keep these habits and also lead a sedentary lifestyle can still increase their risk of cardiovascular disease.
With this information, we can take proactive steps to reduce the risk of ASCVD and improve our overall health.
2. Data & Methodology
2.1. The Data
The data used in this project is taken from Kaggle (source).
There are 3 types of input features:
- Objective: patient's demographics;
- Examination: results of medical examination;
- Subjective: information given by the patient (lifestyle).
Feature | Variable Type | Variable | Value Type |
---|---|---|---|
Age | Objective Feature | age | int (days) |
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical code (1 - women, 2 - men) |
Systolic blood pressure | Examination Feature | ap_hi | int |
Diastolic blood pressure | Examination Feature | ap_lo | int |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol intake | Subjective Feature | alco | binary |
Physical activity | Subjective Feature | active | binary |
Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
All of the dataset values were collected at the moment of medical examination.
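As a minimal sketch of how the encodings in the table could be decoded, the snippet below builds a tiny made-up frame with the same column names, converts `age` from days to years, and maps the `gender` codes to labels. The sample values, the `age_years` and `gender_label` column names, and the 365.25 days-per-year divisor are assumptions for illustration only; they are not taken from the dataset or from the project's own code.

```python
import pandas as pd

# Hypothetical rows that only mimic the documented schema (not real records)
sample = pd.DataFrame({
    "age": [18250, 21915],    # int (days)
    "gender": [1, 2],         # 1 - women, 2 - men
    "height": [160, 175],     # int (cm)
    "weight": [60.0, 82.5],   # float (kg)
})

# Convert age from days to years (assumed divisor: 365.25 days per year)
sample["age_years"] = (sample["age"] / 365.25).round(1)

# Decode the categorical gender code into a readable label
sample["gender_label"] = sample["gender"].map({1: "women", 2: "men"})

print(sample[["age", "age_years", "gender", "gender_label"]])
```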
2.2. Methodology
Our process of analyzing data involves various methods as detailed below:
Data Understanding
- Collecting the initial data from Kaggle and importing it into the DataFrame.
- Utilizing several tools and techniques to comprehend the structure, contents, and quality of the data and identify potential issues that require further investigation or correction.
Data Preparation
- Data cleaning: Using a data cleaning checklist to recognize and resolve any quality problems with the data, including issues with data constraints, text and categorical data, data uniformity, and missing data.
- Data transformation: Modifying and creating new variables from existing data to make it more appropriate for our analysis objectives.
- Data validation: Verifying and validating the data after cleaning and transforming.
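The three preparation steps can be sketched on a toy frame as follows. The specific rule (systolic pressure should not be lower than diastolic) and the derived `bmi` column are assumptions chosen to illustrate a data constraint and a transformation; the project's actual checklist and derived variables may differ.

```python
import pandas as pd

# Hypothetical rows using the documented column names (not real records)
df = pd.DataFrame({
    "height": [165, 170, 158],     # cm
    "weight": [70.0, 95.0, 54.0],  # kg
    "ap_hi": [120, 80, 140],       # systolic
    "ap_lo": [80, 120, 90],        # diastolic
})

# Data cleaning (assumed constraint): systolic must be >= diastolic
valid_bp = df["ap_hi"] >= df["ap_lo"]
cleaned = df.loc[valid_bp].copy()

# Data transformation (assumed feature): body mass index in kg/m^2
cleaned["bmi"] = cleaned["weight"] / (cleaned["height"] / 100) ** 2

# Data validation: re-check the constraint and the new column
assert (cleaned["ap_hi"] >= cleaned["ap_lo"]).all()
assert cleaned["bmi"].notna().all()

print(cleaned)
```

Filtering with a boolean mask and then calling `.copy()` keeps the original frame untouched, which makes it easy to compare row counts before and after cleaning.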
Exploratory Analysis
- Analyzing the data, running statistical tests, and visualizing the data to discover insights.
- Our exploratory data analysis includes univariate, bivariate, and multivariate analysis tasks.
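To illustrate the difference between a univariate and a bivariate view, the sketch below counts the cholesterol levels and then computes the share of positive `cardio` cases per level. The rows are invented for the example, so the resulting numbers say nothing about the real dataset; the variable names `level_counts` and `cardio_rate` are likewise hypothetical.

```python
import pandas as pd

# Made-up rows using two documented columns (not real records)
toy = pd.DataFrame({
    "cholesterol": [1, 1, 1, 2, 2, 3, 3, 3],
    "cardio":      [0, 0, 1, 0, 1, 1, 1, 0],
})

# Univariate: how often each cholesterol level occurs
level_counts = toy["cholesterol"].value_counts().sort_index()

# Bivariate: share of positive cardio cases within each cholesterol level
cardio_rate = toy.groupby("cholesterol")["cardio"].mean()

print(level_counts)
print(cardio_rate)
```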
Modeling & Evaluation
- Building a machine learning model on the training dataset, informed by the findings of the exploratory analysis.
- Fine-tuning the model's parameters and evaluating the trained model on an unseen dataset.
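The train-then-evaluate loop can be sketched as below, using the two metrics quoted in the executive summary (F1 score and ROC-AUC). This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), and the random forest, the 80/20 split, and all parameter values are placeholders rather than the model or settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 11 input features and the binary target
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=42
)

# Hold out an unseen test set (assumed 80/20 split, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit a placeholder classifier on the training portion only
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# F1 uses hard class predictions; ROC-AUC uses the positive-class probability
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
print(f"F1: {f1:.2f}, ROC-AUC: {auc:.2f}")
```

Scoring ROC-AUC on predicted probabilities rather than on hard labels matters, because the metric measures how well the model ranks positive cases above negative ones across all thresholds.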
2.3. Importing Libraries
To make our data analysis, visualization, and modeling more efficient, we have developed some user-defined function modules. To use them, please ensure that the module files are copied to the same directory as this notebook.
(+) Libraries
```python
# Pandas
import pandas as pd
pd.set_option("display.max_columns", None)

# NumPy and standard-library utilities
import os, sys, glob, re, math
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
import sklearn.tree as skltr
import sklearn.ensemble as sklen
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import sklearn.metrics as sklme
```
(+) User-defined Modules