Skip to content
0

Hospital Readmissions - A Short Bayesian Approach

📖 Summary

Based on the given dataset comprising patient readmission records, I set to address three key questions:

  1. What is the most common primary diagnosis by age group? By simple tabulating occurrences, I found that for all age groups circulatory problems where the most common primary diagnosis, except for the 40-50 group where the most common one is of type "Other".

  2. Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates Using a multivariate Bayesian generalized linear model (GLM), I found that with >99% probability, diabetes as a primary diagnosis is associated with an increased likelihood of patient readmission.

  3. On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission? Based on the aforementioned GLM, by taking into account all positive coefficients I would suggest focusing on patients that present numerous emergency, inpatient or outpatient visits from the previous year and that were prescribed a diabetes medication.

🚀 Install dependecies, load dataset

Let us begin by installing the Python Bayesian framework pymc, then importing the dataset along with all relevant libraries.

# Install pymc
!pip3 install pymc -q --no-warn-script-location

# Imports
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
hosp_df = pd.read_csv('data/hospital_readmissions.csv')

We are now ready to take on the analysis to address each of the three questions.

⚙️ Analysis

1. What is the most common primary diagnosis by age group?

This is a simple question that only requires reshaping a long frequency table, then determining the highest frequency per age group.

# Extract relative frequency of primary diagnoses by age group
diag1_age = hosp_df.groupby("age")["diag_1"].value_counts(normalize=True).unstack()
# Display single most common primary diagnosis per group
print(diag1_age.idxmax(axis=1))

# Create stacked barplot
diag1_age.plot.bar(stacked=True, colormap="Set2", ylabel="Relative frequency")
# Add legend
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Primary diagnosis")

As we can see, in all age groups Circulatory is the most common primary diagnosis, except for the 40-50 group, where Other is the most common one. The stacked barplot provides additional insights, for example showing how the proportions of diabetes and circulatory diagnoses decrease and increase with age, respectively.

2. Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates

After determining which patients had Diabetes as a primary diagnsosis, we could next conduct a (Chi-square) test of indepencence; however, we would need to resolve differences with respect to the other variables, particularly age. Therefore, I propose instead fitting a Bayesian GLM (logistic regression) based on seven variables from the dataset to predict readmitted and ultimately ascertain the impact of diabetes primary diagnosis.

Before formalizing the model structure, let us first handle feature encoding, i.e. encode binary and ordinal categorial variables, and capture diabetes as primary diagnosis.

# Flag diabetes diagnosis
hosp_df["diabetes_diag"] = (hosp_df["diag_1"] == "Diabetes").astype(int)
# Encode age as ordinal categorical
map_age = {
    "[40-50)": 0,
    "[50-60)": 1,
    "[60-70)": 2,
    "[70-80)": 3,
    "[80-90)": 4,
    "[90-100)": 5
}
hosp_df["age_ord"] = hosp_df["age"].map(map_age)
# Binarize `readmitted`, `change` and `diabetes_med_bin`
hosp_df["readmitted_bin"] = hosp_df["readmitted"].map({"yes": 1, "no": 0})
hosp_df["change_bin"] = hosp_df["change"].map({"yes": 1, "no": 0})
hosp_df["diabetes_med_bin"] = hosp_df["diabetes_med"].map({"yes": 1, "no": 0})
# Add intercept to df
hosp_df["intercept"] = 1

# Select predictors
predictors =  [
    "age_ord",
    "time_in_hospital",
    "n_procedures",
    "n_lab_procedures",
    "n_medications",
    "n_outpatient",
    "n_inpatient",
    "n_emergency",
    "diabetes_diag",
    "change_bin",
    "diabetes_med_bin" 
]

Let us now formalize the model structure. Our GLM will be essentially built on two prior distributions, one for the intercept and - multiple instances of - one for the predictor coefficients

as you can see these priors are normal distributions centered around zero, thus reflecting our ignorance over i) the value of the intercept relative to a 50% readmission probability and ii) the impact of the individual predictors on readmission probability. Then we can define the logit of the response variable as

Finally, we use the logistic function to convert the logit to the estimated readmission probability , which we assume to parameterize a Bernoulli distribution that explains the observed values of readmitted

We can now fit the model using pymc and sample the posterior samples of the parameters - this may take about 10min ⏳

# Initialize model
mod = pm.Model()

# Fit model and sample posterior
with mod:
    # Intercept prior
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    # Beta coefficient priors
    beta = pm.Normal("beta", mu=0, sigma=5, shape=len(predictors))
    # Compute logit
    logit = alpha + sum([beta[i]*hosp_df[pred] for i, pred in enumerate(predictors)])
    # Apply logistic function to extract probability p
    p = 1 / (1 + np.exp(-logit))
    # Condition likelihood of p based on Bernoulli distribution with observed values `readmitted_bin`
    Y_obs = pm.Bernoulli("Y_obs", p=p, observed=hosp_df["readmitted_bin"])
    # Draw 1000 posterior samples
    trace = pm.sample()
    
# Examine chains
pm.plot_trace(trace)

We see that the chains converge nicely and are rather stable. Next we explore the distributions of those same samples after restructuring the NumPy arrays appropriately, and building a dataframe identifying the individual coefficients. In addition, we will determine the proportion of sampled values from the coefficient parameterizing diabetes_diag that are greater than zero, which will tell us the probability - based on our model structure and assumptions - that a diabetes primary diagnosis increases the chances of patient readmission.

# Extract posterior samples of beta
post_alpha = trace.posterior['alpha'].values.reshape(-1, 1)
post_beta = trace.posterior['beta'].values.reshape(-1, len(predictors))
post_coeffs = pd.DataFrame(np.concatenate([post_alpha, post_beta], axis=1), columns=["alpha"] + predictors)

# Violin plot with seaborn
ax = sns.violinplot(data=post_coeffs, orient='h')
ax.set(xlabel=r'$\beta$')
plt.axvline(0., linestyle='--', color='black', alpha=.25)

# Print probability that beta from diabetes_diag > 0
prob_pos = (post_coeffs["diabetes_diag"] > 0).mean()
print("Probability diabetes diagnosis increases readmission chances: {:.3f}".format(prob_pos))

Lots of interesting insights: firstly, we observe with >99% confidence that diabetes diagnosis has a positive effect on readmission. This should come as no surprise since this dataset was constructed around individuals somehow diagnosed with diabetes (cf. source). Then, as one would expect, we see that the eldest patients have greater chances of readmission. We also see how the parameter adjusted itself to conform to the proportion of readmitted patients in the dataset, which is smaller than 50%.

3. On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?

To answer this question we can again use the GLM posterior samples; by simply taking into account the posterior distributions of the individual coefficients we learn that following up on patients that present numerous emergency, inpatient or outpatient visits the year before and that were prescribed a diabetes medication would lead to a better monitoring of those at higher risk of readmission.

💉 Final Notes

Concluding, we answered the first question by simply computing the relative frequencies of the different primary diagnoses per age group. Then, to address the last two questions we capitalized on the relatively small number of features in the given dataset to devise a Bayesian GLM approach, simultaneously determining the impact of a diabetes diagnosis on readmission and the subset of patients that are at a higher risk of readmission and thus deserve close follow-up.

Further work could provide additional clues and insights. I would like to point out that the proposed GLM model could be based on a different structure and a set of different covariates.

If you enjoyed this short Bayesian excursion, please give it a thumbs up 👍 All feedback is welcome!