Skip to content
0

Patient Characteristics and Readmission Modeling

Introduction

Hospital X seeks our expertise in comprehensively analyzing a decade's worth of data on patient readmissions following discharge. The medical staff is seeking our assistance in determining whether initial diagnoses, number of procedures, or other factors can help better predict the probability of readmission. They hope to use the findings to target follow-up calls and attention to patients who are more likely to be readmitted.

Objectives

The main objective of this report is to explore patient characteristics and readmissions. It specifically aims to:

  1. Describe the overall and by age characteristics of the patients.
  2. Investigate and model the patient readmissions by their representing features.
  3. Identify patient groups with the best readmission rates.

Data Used

The hospital information used in this report is part of the clinical care at 130 US hospitals and integrated delivery networks. Below is a list of variables and their descriptions:

vars-desc

Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

# ---------- Packages & Datasets

# Load pre-installed, required packages
suppressPackageStartupMessages(library(tidyverse)) 
suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(scales))

# Install and load the 'ggfun' package
# For round rectangle borders and backgrounds in ggplots
suppressWarnings(suppressMessages(install.packages("ggfun", verbose=TRUE, quiet=TRUE)))
suppressPackageStartupMessages(library(ggfun))

# Install and load the 'ggchicklet' package
# For bar charts with rounded corners
suppressWarnings(suppressMessages(remotes::install_github("hrbrmstr/ggchicklet", auth_token = "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")))
suppressPackageStartupMessages(library("ggchicklet"))

# Install and load the 'ggthemes' package
# For using the look of a plot theme
suppressWarnings(suppressMessages(install.packages("ggthemes")))
suppressPackageStartupMessages(library(ggthemes))

# Install and load the 'patchwork' package
# For combining ggplots into the same graphic					   
suppressWarnings(suppressMessages(install.packages("patchwork", verbose=TRUE, quiet=TRUE))) 
suppressPackageStartupMessages(library(patchwork))

# Install and load the "mlbench" package
#									   
#suppressWarnings(suppressMessages(install.packages("mlbench")))
#suppressPackageStartupMessages(library(mlbench))

# Install and load the "ggpubr" package
# For boxplots
suppressWarnings(suppressMessages(install.packages("ggpubr")))
suppressPackageStartupMessages(library("ggpubr"))
									   
# Read 'readmissions' dataset
readmissions <- readr::read_csv('data/hospital_readmissions.csv', show_col_types = FALSE) %>%
	mutate_if(is.character,as.factor) %>%
	mutate_all(~ if_else(.x == "Missing", NA,.x))

#is.na(readmissions)
#colSums(is.na(readmissions))
#which(colSums(is.na(readmissions))>0)
#names(which(colSums(is.na(readmissions))>0))
#missmap(readmissions, col=c("blue", "red"), legend=FALSE)

#summary(readmissions)
#par(mfrow=c(1,6))
#for(i in 1:17) {
#    boxplot(readmissions[,i], main=names(readmissions)[i])
#}
# ----- For link's image thumbnail

# Install and load the 'png' package
# For superimposing PNG images in a ggplot								   
suppressWarnings(suppressMessages(install.packages("png", verbose=TRUE, quiet=TRUE)))       
suppressPackageStartupMessages(library(png))  

# Create a data
data <- data.frame(x = 1:3,
                   y = 1:3)

# Read the PNG file
my_image <- readPNG("documentation/cover_Fig10_diag_stacked_bar_plot.png", native = TRUE)

# Create a plot and combine with the image
ggplot(data, aes(x, y)) +
	geom_point() +
	theme_minimal() +
	theme(axis.title = element_blank(),
          axis.text = element_blank(),
          axis.line = element_blank(),
          axis.ticks = element_blank()) +
	inset_element(p = my_image,
                  left = -0.52,
                  bottom = -0.03,
                  right = 1.52,
                  top = 1.02)

Results and Discussion

Patient Characteristics

The following information describe the characteristics of the sample composing of 25,000 patients admitted to the hospital after being discharged.

Overall
Numerical

Age: The grouped mean and median ages are approximately 68 and 69, respectively, with a standard deviation of ~13, indicating that the hospital primarily admitted elderly patients. As shown in Figure 1, the distribution of patients across age groups is approximately symmetric at the mean.

# Frequency Distribution Table (FDT) for Age
age_fdt <- readmissions %>%
	select(age) %>%
	group_by(age) %>%
	count() %>%
	mutate(class_interval = paste(as.numeric(unlist(regmatches(age, gregexpr("[[:digit:]]+", age)))), collapse = "-"),
           class_mark = mean(as.numeric(unlist(regmatches(age, gregexpr("[[:digit:]]+", age)))))) %>%
	ungroup() %>%
	mutate(cum_freq = cumsum(n),
           n_times_cm = n*class_mark) %>%
	select(age, class_interval, everything())

# Create GroupedMedian() function to calculate the median of grouped data	
# Reference: http://stackoverflow.com/a/18931054/1270695
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the 
  #   required "intervals" matrix. "trim" removes any unwanted 
  #   characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency class before median class
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}

# Grouped Mean, Median, and Sample SD
grouped_age_stats <- age_fdt %>% 
 	summarize(Variable = "Age group",
			  "Mean" = sum(n_times_cm)/sum(n),
              "Std. Dev." = sqrt(sum(n*(Mean-class_mark)^2)/(sum(n))),
			  "Med." = GroupedMedian(frequencies = age_fdt$n, intervals = age_fdt$class_interval, sep = "-")
			 )

grouped_age_stats

# Bar plot
age_bar_plot <- ggplot(age_fdt, aes(group=1)) + 
	geom_chicklet(aes(x = age,
                      y = n,
					  group=1), 
                  color="white",
                  fill="#6C8CBF",
                  radius = grid::unit(1, "mm"), position="stack") +
    
	# Plot mean and median lines
	geom_vline(xintercept=3.344, color="red", linewidth=0.6) +
 
	ggtitle("Fig. 1: Bar Graph of the Patients' Age Groups\n") +
	labs(x="\nAge group", y="Number of patients\n") +
	scale_y_continuous(expand = c(0.01, 0), limits = c(0,7000),
                      breaks = seq(0, 7000, by=1000)) +
	theme_economist() + 
	scale_color_economist() +
	theme(plot.title = element_text(size= 12),
          panel.grid.minor = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3),
		  panel.grid.major = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3),
          axis.ticks = element_blank()
         ) +
	annotate("text", x=2.7, y=6500, 
			 label=paste("Mean = ",sum(age_fdt$n_times_cm)/sum(age_fdt$n)),
			 color="red",
			 size=3.5)

Time in hospital: The mean and median lengths of stay in the hospital are 4.4533 and 4, respectively, with a standard deviation of ~3. Figure 2 shows a positively skewed distribution for this patient feature.

# Summary statistics for numerical variables
sum_stats <- data.frame(
    Variable = readmissions %>%
		select_if(is.numeric) %>%
		colnames) %>%
	bind_cols(as.data.frame(t(readmissions %>%
                              summarise_if(is.numeric, list(mean)) %>%
                              bind_rows(readmissions %>%
                                        	summarise_if(is.numeric, list(sd)), 
                                        readmissions %>%
                                        	summarise_if(is.numeric, list(min)),
                                        readmissions %>%
                                        	summarise_if(is.numeric, list(median)),
                                        readmissions %>%
                                        	summarise_if(is.numeric, list(max)))
                             )) %>%
              rename(Mean = V1,
                     `Std. Dev.` = V2,
                     `Min.` = V3,
                     `Median` = V4,
                     `Max.` = V5))

rownames(sum_stats) <- 1: nrow(sum_stats)
sum_stats

# time in hospital
time_in_hospital_dst_plot <- ggplot(readmissions, aes(x = time_in_hospital)) + 
	geom_histogram(aes(y=after_stat(density)),
                   colour="white",
                   fill="#5AA7A7",
                   binwidth=1) +
	geom_density(alpha=0.2,
                 fill="#FF6666") +

	# Plot mean and median lines
	geom_vline(aes(xintercept = mean(time_in_hospital)), col="red", linewidth=0.6) +
	#geom_vline(aes(xintercept = median(time_in_hospital)), col="blue", linewidth=0.6) +
	
	ggtitle("Fig. 2: Distribution of the Time Length in Hospital\n") +
	labs(x="\nNumber of days (from 1 to 14)", y="Density\n") +
	scale_x_continuous(expand = c(0.01, 0), 
                       breaks = seq(0, 14, by=2)) +
	scale_y_continuous(expand = c(0.01, 0),
                       limits = c(0,0.20),
                       breaks = seq(0,0.20, by=0.025)) +
	theme_economist() + 
	scale_color_economist() +
	theme(plot.title = element_text(size= 12),
          panel.grid.minor = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3),
		  panel.grid.major = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3)
         ) +
	annotate("text", x=mean(readmissions$time_in_hospital)+1.45, y=(0.200+0.175)/2, 
			 label=paste("Mean = ",round(mean(readmissions$time_in_hospital),4)),
			 color="red",
			 size=3.5)

Number of procedures: The mean and median number of medical procedures performed during the hospital stay are 1.3524 and 1, with a standard deviation of 1.7152, respectively, whereas the mean and median numbers of laboratory procedures performed are 43.2408 and 44, respectively, with a standard deviation of 19.8186. This implies that throughout their hospitalization, patients underwent laboratory procedures more frequently than medical procedures.

Based on Figures 3 and 4, the two types of procedures exhibit dissimilar distributional characteristics, with medical procedures demonstrating positive skewness, and laboratory procedures showing slight symmetry.

# Number of Procedures
n_procedures_hs_dst_plot <- ggplot(readmissions, aes(x = n_procedures)) + 
	geom_histogram(aes(y=after_stat(density)),
                   colour="white",
                   fill="#5AA7A7",
                   binwidth=0.5
                  ) +
	geom_density(alpha=0.2,
                 fill="#FF6666") +

	# Plot mean and median lines
	geom_vline(aes(xintercept = mean(n_procedures)), col="red", linewidth=0.6) +
	#geom_vline(aes(xintercept = median(n_procedures)), col="blue", linewidth=0.6) +
	
	ggtitle("Fig. 3: Distribution of the Number of Medical Procedures\n") +
	labs(x="\nNumber of procedures performed during the hospital stay", y="Density\n") +
	theme_economist() + 
	scale_color_economist() +
	theme(plot.title = element_text(size= 12),
          panel.grid.minor = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3),
		  panel.grid.major = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3)
         ) +
	annotate("text", x=mean(readmissions$n_procedures)+0.72, y=0.7, 
			 label=paste("Mean = ",round(mean(readmissions$n_procedures),4)),
			 color="red",
			 size=3.5)

# Number of Lab Procedures
n_lab_procedures_hs_dst_plot <- ggplot(readmissions, aes(x = n_lab_procedures)) + 
	geom_histogram(aes(y=after_stat(density)),
                   colour="white",
                   fill="#5AA7A7",
                   binwidth=3
                  ) +
	geom_density(alpha=0.2,
                 fill="#FF6666") +

	# Plot mean and median lines
	geom_vline(aes(xintercept = mean(n_lab_procedures)), col="red", linewidth=0.6) +
	#geom_vline(aes(xintercept = median(n_lab_procedures)), col="blue", linewidth=0.6) +
	
	ggtitle("Fig. 4: Distribution of the Number of Laboratory Procedures\n") +
	labs(x="\nNumber of lab procedures performed during the hospital stay", y="Density\n") +
	theme_economist() + 
	scale_color_economist() +
	theme(plot.title = element_text(size= 12),
          panel.grid.minor = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3),
		  panel.grid.major = element_line(color="grey",
                                          linetype="dashed",
                                          linewidth=0.3)
         ) +
	annotate("text", x=mean(readmissions$n_lab_procedures)+14, y=0.025, 
			 label=paste("Mean = ",round(mean(readmissions$n_lab_procedures),4)),
			 color="red",
			 size=3.5)

#ggarrange(n_procedures_hs_dst_plot, n_lab_procedures_hs_dst_plot, 
#          ncol = 1, nrow = 2)

Number of medications: The mean and median numbers of medications administered during the hospital stay are 16.2524 and 15, with a standard deviation of 8.0605, as well as a slightly skewed distribution to the right.


1 hidden cell