In 1986, a group of urologists in London published a research paper in The British Medical Journal that compared the effectiveness of two different methods to remove kidney stones. Treatment A was open surgery (invasive), and treatment B was percutaneous nephrolithotomy (less invasive). When they looked at the results from 700 patients, treatment B had a higher success rate. However, when they only looked at the subgroup of patients different kidney stone sizes, treatment A had a better success rate. What is going on here? This known statistical phenomenon is called Simpon’s paradox. Simpon's paradox occurs when trends appear in subgroups but disappear or reverse when subgroups are combined.
The Data
Available on kidney_stone_data.csv
Column | Type | Description |
---|---|---|
treatment | discrete | Treatment method, indicated by A or B |
stone_size | discrete | Size of the kidney stone, categorized as 'small' or 'large' |
success | discrete | Outcome of the treatment: 1=successful, 0=unsuccessful |
In this project, you are going to explore Simpon’s paradox using multiple regression and other statistical tools. Our main goal is to determine if Treatment A is superior to Treatment B after accounting for the severity of the kidney stones. Let's dive in now!
# Load the necessary packages
library(readr)
library(dplyr)
library(ggplot2)
library(broom)
# Load the data
data <- read_csv("kidney_stone_data.csv")
# Inspect the first five rows
head(data, 5)
# Calculate the number and frequency of success and failure of each treatment
data %>%
group_by(treatment, success) %>%
summarise(N = n()) %>%
mutate(Freq = round(N/sum(N),3))
# Calculate number and frequency of success and failure by stone size for each treatment
sum_data <-
data %>%
group_by(treatment, stone_size, success) %>%
summarise(N = n()) %>%
mutate(Freq = round(N/sum(N),3))
# Print out the data frame we just created
sum_data
# Run a Chi-squared test
trt_ss <- chisq.test(data$treatment, data$stone_size)
# Print out the result in tidy format
tidy(trt_ss)
# Run a multiple logistic regression
logistic_model <- glm(data = data, success ~ stone_size + treatment, family = 'binomial')
# Print out model coefficient table in tidy format
tidy(logistic_model)
# Save the tidy model output into an object
tidy_m <- tidy(logistic_model)
# Is small stone more likely to be a success after controlling for treatment option effect?
small_high_success <- "Yes"
# Is treatment A significantly better than B?
A_B_sig <- "No"