Building a healthier tomorrow: Reducing hospital readmissions today
1. Introduction
1.1 Background
We work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave us access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want us to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.
They want to focus follow-up calls and attention on those patients with a higher probability of readmission.
1.2 The data
We have access to ten years of patient information (source):
Information in the file
- "age" - age bracket of the patient
- "time_in_hospital" - days (from 1 to 14)
- "n_procedures" - number of procedures performed during the hospital stay
- "n_lab_procedures" - number of laboratory procedures performed during the hospital stay
- "n_medications" - number of medications administered during the hospital stay
- "n_outpatient" - number of outpatient visits in the year before a hospital stay
- "n_inpatient" - number of inpatient visits in the year before the hospital stay
- "n_emergency" - number of visits to the emergency room in the year before the hospital stay
- "medical_specialty" - the specialty of the admitting physician
- "diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
- "diag_2" - secondary diagnosis
- "diag_3" - additional secondary diagnosis
- "glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
- "A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
- "change" - whether there was a change in the diabetes medication ('yes' or 'no')
- "diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
- "readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')
Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.
1.3 Methods
During the exploratory analysis phase, we carried out some preliminary data preparation, including:
- handling missing values and duplicates in the dataset
- summary statistics (number of rows and columns, altering variable types, finding age range, dispersion of data, min and max values, etc.)
- aggregating data (such as grouping variables, summarizing measures and mutating columns)
- feature scaling (also known as data normalization), which is necessary for implementing a machine learning model
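The report does not say which scaling method was used, so the following is only a minimal sketch of one common choice, min-max scaling, applied to a made-up data frame. The helper `min_max()` and the toy values are illustrative assumptions, not the analysis code.

```r
# Toy data (assumed values, not taken from the hospital file)
toy <- data.frame(
  time_in_hospital = c(1, 3, 7, 14),
  n_medications    = c(1, 16, 20, 79)
)

# Hypothetical helper: rescale a numeric vector to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the helper to every column
scaled <- as.data.frame(lapply(toy, min_max))
print(scaled)
```

After scaling, every column runs from 0 to 1, so variables measured on different scales (days versus counts of medications) become comparable.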
2. Exploratory Analysis
Figure 1. Number of Patients by Age Group
Figure 1 shows that the patients in the dataset are grouped into six age brackets:
- [40-50),
- [50-60),
- [60-70),
- [70-80),
- [80-90),
- [90-100).
The [70-80) age bracket has the most patients (6,837), followed by the [60-70) age bracket with 5,913 patients. The [90-100) age bracket has the fewest, with only 750 patients.
Figure 2. Time in Hospital by Age Group
Figure 2 shows:
- An unusually sharp peak for patients in the [70-80) and [80-90) age groups. The dark blue and pink lines follow a similar pattern, in which the number of hospital stays rises quickly between 1 and 3 days compared with the other age groups.
- All age groups except [90-100) follow a broadly similar trend, with a stay of 3 days being the most common. Only a small percentage of patients in those age groups stay much longer than 3 days (visible in the downward slope of the lines after the peak). By contrast, the grey line for patients in the [90-100) age range is much flatter. In other words, even though its peak is at 3-4 days, a larger share of patients in the [90-100) bracket stay in hospital for longer.
Figure 3. Boxplots For Different Variables
- Starting from the left, the first plot shows that patients' time in hospital tends to increase with age. This can be seen in the median and in the 1st and 3rd quartiles. The time_in_hospital variable ranges from 1 to 14 days. The majority of patients stayed in the hospital for less than 6 days, with 25% of patients staying for 2 days or less.
- Moving to the next box plot, we notice that patients in the 40-80 age groups typically underwent 1 procedure, whereas older patients typically underwent none. Based on our summary statistics, the n_procedures variable has a mean value of 1.35 and ranges from 0 to 6 procedures. In summary, most patients underwent few procedures, with 75% of patients undergoing 2 procedures or fewer.
- The n_lab_procedures box plot shows that patients underwent about 43 lab procedures on average, with values ranging from 1 to 113. The majority of patients underwent 60 or fewer lab procedures, with 25% of patients undergoing 30 or fewer. We also see some outliers in the [50-60) and [60-70) age brackets, meaning that some patients in those brackets underwent an extremely high number of lab procedures.
- The n_medications box plot shows that patients receive 16 medications on average. The number of medications ranges from 1 to 79. The majority of patients were prescribed 20 or fewer medications, with 25% of patients being prescribed 11 or fewer.
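The quartiles that a box plot draws can also be computed directly. A minimal sketch on made-up values (the toy data frame is an assumption for illustration only, not the hospital data):

```r
# Toy data: two age brackets with four stays each (assumed values)
toy <- data.frame(
  age = rep(c("[40-50)", "[80-90)"), each = 4),
  time_in_hospital = c(1, 2, 3, 4, 3, 4, 5, 8)
)

# 1st quartile, median and 3rd quartile of the stay length per age bracket
quartiles <- tapply(toy$time_in_hospital, toy$age,
                    quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)
```

Comparing the three values across brackets gives the same reading as comparing the boxes in Figure 3, but as numbers rather than by eye.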
Figure 4. Procedures by Age Group &
Figure 5. Distribution of Emergencies
- Figure 4 shows that the number of patients tends to decrease as the number of procedures increases. There are also some notable differences between age categories. For example, the [60-70) age category has the highest frequency of patients who had 3, 4, 5 or 6 procedures. Compared with the [70-80) bracket (which has a similar number of patients, see Figure 1), this raises a question: why do patients in the [60-70) bracket undergo more procedures?
- Figure 5 shows a distribution that appears to be right-skewed. Overall, this plot suggests that emergency room visits in the year before the hospital stay are relatively rare among the patients in the dataset, and that when they do occur, they are typically limited to one visit per patient.
Figure 6. Distribution of Medications &
Figure 7. Number of Lab Procedures by Age Group
- Figure 6 suggests that the distribution of the number of medications is right-skewed.
- Figure 7 shows a large spike in the frequency of observations with n_lab_procedures = 1 for the [40-50) age group. This suggests that there may be a specific reason why this age group has a higher proportion of patients with only one lab procedure, which could be related to the underlying medical conditions or treatments for this age group. This should be analysed further.
Further analysis suggests that the number of lab procedures is not strongly correlated (positively or negatively) with any other variable in the dataset, so the spike could be due to random chance.
Note: We may also want to consider excluding this age group from the next phase of analysis (the machine learning model) or adjusting for this large spike in observations.
Figure 8. Effect of Primary, Secondary, Third Diagnosis & Diabetes Medication on Readmissions.
The image above features four separate bar charts, each with a different variable on the x-axis and the proportion of patients in each category on the y-axis, grouped by whether they were readmitted or not.
The first three charts show the proportion of patients readmitted or not for their primary, secondary, and third diagnosis, respectively. The fourth chart shows the proportion of patients who were prescribed diabetes medication, grouped by whether they were readmitted or not.
The horizontal line added to each chart indicates the point at which the proportions of readmitted and not readmitted patients are equal. From the blue bar charts we notice:
- A stronger effect of diabetes (as primary diagnosis) on a patient's readmission (see the diabetes group crossing the red line in the first bar plot).
- The effect of diabetes tends to decrease for diagnoses 2 and 3. Based on the above, it is not clear whether diabetes plays a profound role in a patient's readmission. Further analysis will be conducted.
- The pink bar plot suggests that diabetes medication has a noticeable effect on a patient's readmission (Figure 9 gives a clearer view). The importance of a diabetes diagnosis and diabetes medication for patient readmission is further explored in a machine learning model (random forest) for predicting readmissions, and in validating the doctors' assumptions (see parts 3.2 and 3.3).
Figure 9. Effect of Diabetes Medication & Glucose Test on Readmissions.
Figure 9 shows a stacked bar plot with a reference line to analyze the relationship between:
- diabetes medication prescribed and readmissions (a horizontal version of the one in Figure 8). There seems to be a positive association between diabetes medication and readmissions.
- glucose_test and readmissions. A high glucose test result is associated with a higher probability of readmission.
Figure 10. Correlation Heatmap between Numerical Variables
The plot displays the pairwise correlations among the numerical variables in the "readmissions" data frame as a heatmap, with color representing each correlation and the coefficients printed in the cells. Based on our findings, there is no notable linear relationship between most pairs of variables, apart from n_medications and time_in_hospital. Their coefficient of 0.45 indicates a moderate positive correlation: one variable tends to increase as the other increases. However, the correlation is not strong, meaning that other factors may also contribute to the relationship between the two variables.
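A correlation matrix like the one behind the heatmap can be obtained with base R's `cor()`. The sketch below uses made-up numbers, so its coefficients are not those of the report:

```r
# Toy data with three numeric columns (assumed values)
toy <- data.frame(
  time_in_hospital = c(1, 2, 3, 5, 8, 13),
  n_medications    = c(5, 9, 8, 15, 14, 30),
  n_procedures     = c(0, 2, 1, 0, 3, 1)
)

# Pearson correlation for every pair of numeric columns
corr_matrix <- cor(toy, method = "pearson")
print(round(corr_matrix, 2))
```

The resulting matrix is symmetric with ones on the diagonal, and it is the kind of input that the `corrplot` package loaded in the appendix draws as a heatmap.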
Figure 11. Cramer's V for 3 Pairs of Categorical Variables
Cramer's V values range from 0 to 1, with higher values indicating a stronger association between two categorical variables. In this case, the highest value is 0.505 for the relationship between "change" and "diabetes_med". This suggests a moderate association between these two variables. The Cramer's V value for "readmitted" and "diabetes_med" is 0.062, indicating a weak association. The Cramer's V value for "glucose_test" and "A1Ctest" is 0.052, indicating an even weaker association.
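For reference, Cramer's V can be derived from the Pearson chi-squared statistic as V = sqrt(X^2 / (n * min(r - 1, c - 1))), where n is the number of observations and r and c are the numbers of rows and columns. The sketch below applies this to the diabetes_med by readmitted counts reported in section 3.2. The helper name `cramers_v()` is ours, and the report may have used the `vcd` package from the appendix instead.

```r
# Counts from the contingency table in section 3.2
tab <- matrix(c(3385, 2387,
                9861, 9367),
              nrow = 2, byrow = TRUE,
              dimnames = list(diabetes_med = c("no", "yes"),
                              readmitted   = c("no", "yes")))

# Hypothetical helper: Cramer's V from a contingency table
cramers_v <- function(tab) {
  # Pearson chi-squared statistic without continuity correction
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

v <- cramers_v(tab)
print(round(v, 3))
```

With these counts the result is roughly 0.062, in line with the value quoted for "readmitted" and "diabetes_med".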
Figure 12. Radar Chart of Patient's Average "Performance"
The radar chart allows us to quickly see and compare the different "habits" or "performance" of patients in their respective age range in multiple categories, providing valuable insights into their relative "strengths" and "weaknesses". By analyzing the shape of the polygons, we notice the following:
- The average patient in the [40-50) age range needs to stay overnight in a hospital more frequently. The same patient also visits the emergency room far more often than the other patients (on average).
- The average patient in the [60-70) age range receives more procedures and more medications during the hospital stay than patients in the other age ranges.
- The average patient in the [80-90) age range tends to have a higher number of laboratory procedures performed during the hospital stay. These patients also stay longer in the hospital but usually do not need to stay overnight. In other words, they usually go home the same day after the procedure or treatment.
- The average patient in the [90-100) age range spends the most time in hospital and has a high number of laboratory procedures performed during the hospital stay.
IMPORTANT: the radar chart shows the average value of each metric per patient within each age range.
3. Questions
3.1 What is the most common primary diagnosis by age group?
Figure 13. Most Common Primary Diagnosis by Age Group
- The most common diagnosis for patients in the age range of 40-50 is listed as "Other".
- For patients in the age ranges of 50-100, the most common diagnosis is "Circulatory". This could suggest that circulatory diseases become more prevalent as people age.
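One way to find the most common primary diagnosis per age group is to cross-tabulate the two columns and take the largest count in each row. A minimal sketch on made-up rows (the toy data frame is an assumption, and the report may have used a different approach):

```r
# Toy data (assumed values, not the hospital file)
toy <- data.frame(
  age    = c("[40-50)", "[40-50)", "[40-50)", "[50-60)", "[50-60)", "[50-60)"),
  diag_1 = c("Other", "Other", "Circulatory", "Circulatory", "Circulatory", "Diabetes")
)

# Cross-tabulate age bracket (rows) against primary diagnosis (columns)
tab <- table(toy$age, toy$diag_1)

# For each age bracket, pick the diagnosis with the highest count
# (which.max returns the first maximum, so ties go to the first column)
most_common <- apply(tab, 1, function(row) names(row)[which.max(row)])
print(most_common)
```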
3.2 Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates.
Note:
The findings on diabetes in Figure 8 of the exploratory analysis suggest that a diagnosis of 'diabetes' might not play such a profound role in readmission. It is still worth exploring whether diabetes medication has an effect on readmission.
One way to explore the effect is through a chi-squared test of independence. This test can determine if there is a significant association between the two variables, "diabetes_med" and "readmitted".
The contingency table shows the number of patients with and without diabetes medication who were readmitted or not readmitted to the hospital. In the table:
- 3385 patients with no diabetes medication were not readmitted.
- 2387 patients with no diabetes medication were readmitted.
- 9861 patients with diabetes medication were not readmitted.
- 9367 patients with diabetes medication were readmitted.
The chi-squared test is used to test the independence of two categorical variables, in this case, diabetes medication and readmission status.
The test results show a significant association between diabetes medication and readmission status (p-value < 2.2e-16), meaning that the two are not independent. The X-squared statistic of 96.256 measures how far the observed counts depart from those expected under independence. It grows with sample size, so it does not by itself indicate a strong association: the Cramer's V of 0.062 for the same pair (Figure 11) points to a weak one.
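The test can be reproduced from the counts listed above. This is a sketch under the assumption that base R's `chisq.test()` was used with its default continuity correction for 2x2 tables; the object names are ours.

```r
# Contingency table built from the four counts listed above
tab <- matrix(c(3385, 2387,
                9861, 9367),
              nrow = 2, byrow = TRUE,
              dimnames = list(diabetes_med = c("no", "yes"),
                              readmitted   = c("no", "yes")))

# Share of readmitted patients within each medication group
readmit_rate <- prop.table(tab, margin = 1)[, "yes"]
print(round(readmit_rate, 3))

# Chi-squared test of independence (Yates correction is the default for 2x2)
test <- chisq.test(tab)
print(test)
```

With these counts the statistic comes out at about 96.26 on 1 degree of freedom, matching the value reported above. The row proportions show the size of the difference: roughly 41% of patients without diabetes medication were readmitted, against roughly 49% of those with it.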
Worth mentioning:
Based on the Importance plot in 3.3, even though diabetes medication on its own seems to play a central role in readmission, its role becomes relatively negligible when it is used in conjunction with other predictors.
3.3 On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?
Based on the Random Forest model (with one-hot encoding, accuracy ~ 77%), we identify a few patient characteristics that are most strongly associated with readmission.
Feature importance: The plot below contains factors that are most important for predicting readmission and could guide the hospital in prioritizing certain types of patients for follow-ups. The higher the importance score of a feature, the more important it is in predicting the outcome variable.
- The most important feature in predicting readmission is "n.emergency" with a feature importance score of 1088.3378. This suggests that the number of emergency visits in the year before the hospital stay is a strong predictor of the likelihood of readmission.
- The second most important feature is "n.inpatient" with a feature importance score of 832.6012, followed by "n.outpatient" with a score of 700.4211. These features suggest that the number of inpatient and outpatient visits in the year before the hospital stay are also strong predictors of the likelihood of readmission.
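The one-hot encoding mentioned for the model turns each level of a categorical variable into its own 0/1 column. A minimal base R sketch on made-up rows follows. The report does not show how the encoding was done, so the use of `model.matrix()` here is an assumption.

```r
# Toy data: one categorical and one numeric predictor (assumed values)
toy <- data.frame(
  diag_1      = factor(c("Circulatory", "Diabetes", "Other", "Circulatory")),
  n_emergency = c(0, 2, 1, 0)
)

# "- 1" drops the intercept so every level of diag_1 gets its own indicator column
one_hot <- model.matrix(~ diag_1 - 1, data = toy)

# Combine the indicator columns with the untouched numeric predictor
encoded <- cbind(toy["n_emergency"], as.data.frame(one_hot))
print(encoded)
```

Each row has exactly one indicator set to 1, so the categorical information is preserved while every column becomes numeric.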
3.4 Conclusion
If we combine the information in the Feature Importance plot and the radar chart, we conclude that the hospital should focus its follow-up efforts on patients in the [40-50) age range, as they are highly prone to emergencies and more frequently need to stay overnight in a hospital. The next groups worth mentioning are patients in the [70-80) and [80-90) age ranges, as they stay longer in the hospital but usually do not need to stay overnight.
4. Appendix
# install packages
install.packages("cowplot")
library(cowplot)
install.packages("fmsb")
install.packages("remotes")
remotes::install_github("ricardo-bion/ggradar", dependencies = TRUE) # ggradar is installed from GitHub, not CRAN
suppressPackageStartupMessages((install.packages("corrplot")))
suppressMessages(install.packages(c("hrbrthemes","ggthemes")))
library(fmsb)
library(tidyverse)
library(ggthemes)
library(hrbrthemes)
library(corrplot)
install.packages("vcd") #used for calculating correlation of binary data.
library(vcd)
# install & load the 'ggchicklet' package for bar charts with rounded corners
suppressMessages(install.packages("ggchicklet", repos = "https://cinc.rud.is", verbose=TRUE, quiet=TRUE))
suppressPackageStartupMessages(library(ggchicklet))
# read file
readmissions <- read_csv('data/hospital_readmissions.csv', show_col_types = FALSE)
readmissions
#number of missing value present in the dataset
sapply(readmissions, function(x) sum(is.na(x)))
# Check for duplicates
sum(duplicated(readmissions))
str(readmissions)
summary(readmissions)
# Assuming each row corresponds to a patient:
# Plot for age
counts = table(readmissions$age)
barplot(counts, border=F, col=c("#9EC1E3", "#9EC1E3", "#9EC1E3", "#416E9B", "#9EC1E3", "#BFBFBF"),
legend = rownames(counts), beside=TRUE)
# Plot for relationship between categorical variables "age" and "time_in_hospital"
counts <- table(readmissions$age, readmissions$time_in_hospital)
barplot(counts, border=F, col=c("#9EC1E3", "#9EC1E3", "#9EC1E3", "#416E9B", "#D59FBD", "#BFBFBF"),
legend = rownames(counts), beside = TRUE)
# Transform the data into a long format
data_long <- data.frame(age = rep(rownames(counts), ncol(counts)),
time_in_hospital = rep(colnames(counts), each = nrow(counts)),
count = as.vector(counts)) %>%
mutate(time_in_hospital = as.numeric(time_in_hospital))
# Plot the data
ggplot(data_long, aes(x = time_in_hospital, y = count, color = age)) +
geom_line(size = 1.7) +
labs(x = "Time in Hospital", y = "Count") +
scale_color_manual(values = c("#9EC1E3", "#9EC1E3", "#9EC1E3", "#416E9B", "#D59FBD", "#BFBFBF")) +
theme_classic()