Millions of people develop some sort of heart disease every year, and heart disease is the biggest killer of both men and women in the United States and around the world. Statistical analysis has identified many risk factors associated with heart disease, such as age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, lack of physical exercise, and more.

In this project, you will run statistical tests and models using the Cleveland heart disease dataset to assess one particular factor: the maximum heart rate achieved during exercise, and how it is associated with the likelihood of heart disease.

Examining how heart rate responds to exercise, along with other factors such as age and gender, may reveal abnormalities that could be indicative of heart disease. Let's find out more!

The Data

Available in Cleveland_hd.csv

| Column | Type | Description |
| --- | --- | --- |
| age | continuous | age in years |
| sex | discrete | 0=female, 1=male |
| cp | discrete | chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic |
| trestbps | continuous | resting blood pressure (in mm Hg) |
| chol | continuous | serum cholesterol (in mg/dl) |
| fbs | discrete | fasting blood sugar > 120 mg/dl: 1=true, 0=false |
| restecg | discrete | resting electrocardiogram result (nominal): 0=normal, 1=ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2=probable or definite left ventricular hypertrophy by Estes' criteria |
| thalach | continuous | maximum heart rate achieved |
| exang | discrete | exercise-induced angina: 1=yes, 0=no |
| oldpeak | continuous | ST depression induced by exercise relative to rest |
| slope | discrete | slope of the peak exercise ST segment: 1=upsloping, 2=flat, 3=downsloping |
| ca | continuous | number of major vessels colored by fluoroscopy (0 to 3) |
| thal | discrete | 3=normal, 6=fixed defect, 7=reversible defect |
| class | discrete | diagnosis class: 0=no presence, 1=minor indicators for heart disease, 2 = >1, 3 = >2, 4=major indicators for heart disease |

# Load necessary packages
library(tidyverse)
library(yardstick)

# Load the data
hd_data <- read.csv("Cleveland_hd.csv")

# Make class binary
hd_data$class <- ifelse(hd_data$class == 0, 0, 1)

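# Columns the data dictionary marks as discrete. read.csv() typically loads
# numeric codes as numbers, so convert them to factors; otherwise every column
# would take the t-test branch below.
discrete_cols <- intersect(c("sex", "cp", "fbs", "restecg", "exang", "slope", "thal"), names(hd_data))
hd_data[discrete_cols] <- lapply(hd_data[discrete_cols], factor)

# Run statistical tests to find significant predictors:
# Welch t-test for continuous columns, chi-squared test for discrete ones
p_values <- sapply(setdiff(names(hd_data), "class"), function(col) {
  if (is.numeric(hd_data[[col]])) {
    t.test(hd_data[[col]] ~ hd_data$class)$p.value
  } else {
    chisq.test(table(hd_data[[col]], hd_data$class))$p.value
  }
})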

# Get the three most significant features (smallest p-values) as a character vector
highly_significant <- names(sort(p_values))[1:3]

# Fit a logistic regression model on the selected features
model <- glm(class ~ ., data = hd_data[, c(highly_significant, "class")], family = binomial)

# Predict for every row of hd_data. Rows with missing predictors get NA rather
# than being dropped, so preds stays aligned with hd_data$class.
probs <- predict(model, newdata = hd_data, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

# Calculate metrics, ignoring rows that could not be scored
accuracy <- mean(preds == hd_data$class, na.rm = TRUE)
confusion <- table(predicted = preds, actual = hd_data$class)

# Convert highly_significant to a list explicitly
highly_significant <- as.list(highly_significant)
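
The yardstick package is loaded at the top, but the metrics above are computed by hand. As a rough sketch, the same accuracy and confusion matrix could be obtained with yardstick's accuracy() and conf_mat(), which expect factor columns. The toy data frame results and its values are made up for illustration; in the project it would be built from hd_data$class and preds.

```r
library(yardstick)

# Hypothetical toy results standing in for hd_data$class (truth) and preds (estimate)
results <- data.frame(
  truth    = factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1)),
  estimate = factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))
)

# Accuracy, returned as a one-row tibble with columns .metric, .estimator, .estimate
acc <- yardstick::accuracy(results, truth = truth, estimate = estimate)

# Confusion matrix; yardstick puts predictions in rows and truth in columns
cm <- yardstick::conf_mat(results, truth = truth, estimate = estimate)

print(acc)
print(cm)
```

Calling the functions with the yardstick:: prefix keeps them distinct from the numeric accuracy variable defined earlier.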