Survival analysis
In medicine, especially but not only in oncology, it is very important to analyse overall survival, recurrence-free survival and perform other time-to-event analyses. In this example, using simulated data, we will analyse overall mammary cancer survival. We will try to see if survival is associated with a few predictive variables.
The dataset
For every woman, we have data about
- date of cancer diagnosis
- date of last follow up visit
- survival at last follow up visit
- possible predictors of survival: BMI, cancer stage, neo_adiuvant_therapy
suppressWarnings(suppressMessages(library(dplyr, quiet = T)))
set.seed(111)
# Upload data (I created the data, it's not real!)
db = read.csv("mammary_cancer_data.csv") %>% select(-X) %>%
filter(as.Date(date_diagnosis,format="%d/%m/%Y")>as.Date("01/01/1900",format="%d/%m/%Y"))
db = db %>%
# simulate stage data and neo_adiuvant_therapy data accordingly
mutate(stage = sample(1:4, nrow(db), replace = T),
neo_adiuvant_therapy = ifelse(stage %in% c(3,4), 1, neo_adiuvant_therapy),
# calculate survival time after diagnosis
survival_time = abs(as.numeric(as.Date(last_visit,format="%d/%m/%Y") - as.Date(date_diagnosis,format="%d/%m/%Y")-sample(1:5000, nrow(db)))),
survival_factor = as.factor(survival),
neo_adiuvant_therapy = as.factor(neo_adiuvant_therapy),
stage = as.factor(stage),
BMI_cat = ifelse(BMI > 30, "Obese", "Not Obese"))
head(db)
Data summary
BMI, stage and neoadiuvant therapy are not significantly associated with survival status at bivariate analysis, but time has not yet been considered.
# Summary of data
install.packages("arsenal", quiet = T)
library(arsenal, quiet = T)
as.data.frame(summary(tableby(survival_factor ~ BMI + neo_adiuvant_therapy + stage, data = db), text = T))
Kaplan-Meier Analysis
Overall Kaplan-Meier survival curve shows a median survival of more than 4000 days, which is more than 10 years.
library(survival, quiet = T)
library(ggplot2, quiet = T)
suppressWarnings(suppressMessages(install.packages("survminer", quiet = T)))
suppressWarnings(suppressMessages(library(survminer, quiet = T)))
# overall survival
sop <- Surv(db$survival_time, db$survival)
fitkm <- survfit(sop ~ 1, type="kaplan-meier")
ggsurvplot(
fitkm,
data = db,
conf.int = TRUE,
xlab = "Days", ylab = "",
ggtheme = theme_light(),
surv.median.line = "hv", legend.labs = c("Overall survival"),
legend.title = "",risk.table = F,
palette = c("#8C3F4D"), font.y = 15, font.x = 15, font.tickslab = 10, font.legend = 12)
Bivariate and Multivariate Analysis
BMI is associated with survival at bivariate analysis while neoadiuvant therapy and stage seem not (remember this is fake data!!)
listavariabili<-list("BMI_cat", "stage", "neo_adiuvant_therapy")
listatitoli<- c("BMI", "Cancer stage", "Neo-adiuvant therapy")
listaetichette <- list(c("Not Obese", "Obese"), c("1", "2", "3", "4"), c("No", "Yes"))
formulae <- lapply(listavariabili, function(x) as.formula(paste0("sop ~ ", x)))
fits <- surv_fit(formulae, data = db)
p1_st <- ggsurvplot(fits, data = db,
pval = TRUE,
#pval.coord = c(200, 0.10),
xlab = "Days", ylab = "",
ggtheme = theme_light(),
#surv.median.line = "hv",
title = "",
legend.title=listatitoli, tables.height = 0.28, tables.theme = theme_cleantable(), conf.int = F, tables.col = "strata", tables.y.text = F, censor = T,
palette = c("#8C3F4D", "#3E606F", "#6476a1", "red"), font.y = 7, font.x = 7, font.tickslab = 10, font.legend = 12, legend.labs = listaetichette,
risk.table = T, risk.table.title = "", risk.table.fontsize = 3
)
options(repr.plot.width=10, repr.plot.height=20)
arrange_ggsurvplots(p1_st, ncol = 1, nrow = 3, print = TRUE, title = "Kaplan-Meier plots for survival in mammary cancer stratified by Body Mass Index, neoadiuvant therapy and stage.\nLog rank test p values are shown for each group.")
suppressWarnings(suppressMessages(install.packages("Publish", quiet = T)))
suppressWarnings(suppressMessages(library(Publish, quiet = T)))
fit<-coxph(sop~stage + neo_adiuvant_therapy + BMI, data = db)
a <- publish(fit, print = F, factor.reference = "extraline", units = NULL, probindex = FALSE,pvalue.digits=3,pvalue.stars=TRUE)
a_df <- as.data.frame(a[["regressionTable"]])
a_df[, 1:5]