Millions of people develop some form of heart disease every year, and heart disease is the leading cause of death for both men and women in the United States and around the world. Statistical analysis has identified many risk factors associated with heart disease, including age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical exercise.
In this project, you will run statistical tests and fit a model on the Cleveland heart disease dataset to assess one particular factor: the maximum heart rate a person can achieve during exercise, and how it is associated with the likelihood of having heart disease.
Examining the maximum heart rate achieved during exercise, along with other factors such as age and sex, may reveal abnormalities that are indicative of heart disease. Let's find out more!
The Data
Available in Cleveland_hd.csv
| Column | Type | Description | 
|---|---|---|
| age | continuous | age in years | 
| sex | discrete | 0=female 1=male | 
| cp | discrete | chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic | 
| trestbps | continuous | resting blood pressure (in mm Hg) | 
| chol | continuous | serum cholesterol in mg/dl | 
| fbs | discrete | fasting blood sugar > 120 mg/dl: 1=true 0=false | 
| restecg | discrete | resting electrocardiogram result (nominal): 0=normal, 1=ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2=probable or definite left ventricular hypertrophy by Estes' criteria | 
| thalach | continuous | maximum heart rate achieved | 
| exang | discrete | exercise induced angina: 1=yes 0=no | 
| oldpeak | continuous | ST depression induced by exercise relative to rest | 
| slope | discrete | slope of the peak exercise ST segment: 1=upsloping 2=flat 3=downsloping | 
| ca | continuous | number of major vessels (0-3) colored by fluoroscopy | 
| thal | discrete | 3=normal 6=fixed defect 7=reversible defect | 
| class | discrete | diagnosis classes: 0=no presence, 1=minor indicators for heart disease, 2=>1, 3=>2, 4=major indicators for heart disease | 
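# Load the necessary packages
# Install Metrics only if it is missing, instead of reinstalling on every run
if (!requireNamespace("Metrics", quietly = TRUE)) install.packages("Metrics")
library(tidyverse)
library(yardstick)
# Metrics is attached last, so its accuracy() masks yardstick::accuracy()
library(Metrics)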
# Load the data
hd_data <- read.csv("Cleveland_hd.csv")
# Inspect the first five rows
head(hd_data, 5)
# Create a binary outcome from the class variable: 1 = heart disease, 0 = no heart disease
hd_data <- hd_data %>% mutate(hd = ifelse(class > 0, 1, 0))
# Convert the sex feature to a factor with levels 0 and 1 and labels "Female" and "Male"
hd_data <- hd_data %>% mutate(sex = factor(sex, levels = 0:1, labels = c("Female", "Male")))
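# Use statistical tests to check which features are associated with heart disease
# Check the sex variable (categorical vs. categorical: chi-squared test)
hd_sex <- chisq.test(hd_data$sex, hd_data$hd)
print(hd_sex)
# Check the age variable (continuous vs. binary group: Welch two-sample t-test)
hd_age <- t.test(age ~ hd, data = hd_data)
print(hd_age)
# Check the thalach variable (continuous vs. binary group: Welch two-sample t-test)
hd_heartrate <- t.test(thalach ~ hd, data = hd_data)
print(hd_heartrate)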
# Save the highly significant features to a list
highly_significant <- list("age", "sex", "thalach")
# Optional: explore the associations graphically
# Optional: recode the binary heart disease feature to be labelled
hd_data %>% mutate(hd_labelled = ifelse(hd == 0, "No disease", "Disease")) -> hd_data
# Optional: visualize the sex associations
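# position = "fill" plots proportions (0 to 1), so format the axis as percentages to match the label
ggplot(data = hd_data, aes(x = hd_labelled, fill = sex)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ylab("Sex %")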
# Optional: visualize the age associations
ggplot(data = hd_data, aes(x = hd_labelled, y = age)) + geom_boxplot()
# Optional: visualize the thalach associations
ggplot(data = hd_data, aes(x = hd_labelled, y = thalach)) + geom_boxplot()
# Build a model to predict heart disease using the significant features as predictors
model <- glm(hd ~ age + sex + thalach, data = hd_data, family = "binomial")
# Extract the model summary
summary(model)
# Predict the probability of heart disease
pred_prob <- predict(model, newdata = hd_data, type = "response")
# Create a decision rule using probability 0.5 as cutoff and save the predicted decision into the main data frame
hd_data$pred_hd <- ifelse(pred_prob >= 0.5, 1, 0)
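# Calculate and print the accuracy score
# Note: this is in-sample accuracy, since the model is scored on the data it was fitted to
# Call Metrics::accuracy() explicitly because yardstick also exports a function named accuracy()
accuracy <- Metrics::accuracy(actual = hd_data$hd, predicted = hd_data$pred_hd)
print(paste("Accuracy=", accuracy))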
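# Calculate and print the confusion matrix
# yardstick expects predictions in the rows and the truth in the columns of the table
confusion <- conf_mat(table(hd_data$pred_hd, hd_data$hd, dnn = c("Prediction", "Truth")))
confusion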
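
To make the link between the 0.5 cutoff, the accuracy score, and the confusion matrix concrete, here is a minimal base R sketch. The `truth` and `prob` vectors are made-up toy values for illustration only (they are not taken from Cleveland_hd.csv), and the sensitivity and specificity calculations are an addition that the analysis above does not report.

```r
# Toy data, invented for illustration (assumption: not from Cleveland_hd.csv)
truth <- c(1, 0, 1, 1, 0, 0, 1, 0)                          # 1 = disease, 0 = no disease
prob  <- c(0.81, 0.35, 0.42, 0.66, 0.58, 0.12, 0.93, 0.27)  # hypothetical predicted probabilities

# Same decision rule as above: probability >= 0.5 means predicted disease
pred <- ifelse(prob >= 0.5, 1, 0)

# Fix the factor levels so the table is always 2 x 2, even if one class is never predicted
lv <- c(0, 1)
cm <- table(Prediction = factor(pred, levels = lv),
            Truth      = factor(truth, levels = lv))

# Read the four cells: rows are predictions, columns are the truth
tp <- cm["1", "1"]  # predicted disease, has disease
tn <- cm["0", "0"]  # predicted no disease, has no disease
fp <- cm["1", "0"]  # predicted disease, has no disease
fn <- cm["0", "1"]  # predicted no disease, has disease

acc  <- (tp + tn) / sum(cm)  # share of all patients classified correctly
sens <- tp / (tp + fn)       # share of diseased patients that were caught
spec <- tn / (tn + fp)       # share of healthy patients correctly cleared

print(cm)
print(c(accuracy = acc, sensitivity = sens, specificity = spec))
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why the individual cells of the confusion matrix are worth reading alongside it.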