Reducing hospital readmissions
📖 Background
You work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave you access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want you to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.
They want to focus follow-up calls and attention on those patients with a higher probability of readmission.
💾 The data
You have access to ten years of patient information (source):
Information in the file
- "age" - age bracket of the patient
- "time_in_hospital" - days (from 1 to 14)
- "n_procedures" - number of procedures performed during the hospital stay
- "n_lab_procedures" - number of laboratory procedures performed during the hospital stay
- "n_medications" - number of medications administered during the hospital stay
- "n_outpatient" - number of outpatient visits in the year before a hospital stay
- "n_inpatient" - number of inpatient visits in the year before the hospital stay
- "n_emergency" - number of visits to the emergency room in the year before the hospital stay
- "medical_specialty" - the specialty of the admitting physician
- "diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
- "diag_2" - secondary diagnosis
- "diag_3" - additional secondary diagnosis
- "glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
- "A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
- "change" - whether there was a change in the diabetes medication ('yes' or 'no')
- "diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
- "readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')
Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.
import pandas as pd
df = pd.read_csv('data/hospital_readmissions.csv')
df.head()
💪 Competition challenge
Create a report that covers the following:
- What is the most common primary diagnosis by age group?
- Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates.
- On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?
⌛️ Time is ticking. Good luck!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data/hospital_readmissions.csv')
df.shape
INTRODUCTION
There are many ways to explore the data and also a lot of variables to consider. For the purposes of this analysis we will focus on three main questions.
- What is the most common primary diagnosis by age group?
- What is the effect of a diabetes diagnosis on the readmission rate?
- On which groups of patients should the hospital focus its follow-up efforts to better monitor patients with a high probability of readmission?
Primary analysis
The dataset has 25,000 rows and 17 columns, and there appear to be no null values and no duplicate rows. Some columns could still store data improperly, so we'll list and count the unique values in each categorical column.
Categorical data
df.info()
df.duplicated().sum()
cat_cols = df.select_dtypes(exclude="number").columns
for col in cat_cols:
    print('Column', col)
    print(df[col].value_counts())
The columns medical_specialty, diag_1, diag_2 and diag_3 contain values that stand for unrecorded data. We'll look at these columns more closely by plotting them.
cols = ['medical_specialty', 'diag_1', 'diag_2', 'diag_3']
# One count plot per column, with bars ordered from most to least frequent
fig, axes = plt.subplots(1, len(cols), figsize=(30, 10))
for index, col in enumerate(cols):
    sns.countplot(x=col, data=df, ax=axes[index], order=df[col].value_counts().index)
    axes[index].tick_params('x', labelrotation=45)
    axes[index].set_title(col)
All columns except medical_specialty contain very few missing values, about 1% altogether. The medical_specialty column, however, contains a lot.
df['medical_specialty'].value_counts(normalize = True)
Missing values account for almost 50% of the data in this column, so for about half of the admissions we don't know what type of specialist admitted the patient.
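The shares quoted above can be checked for all four columns in one step. The following is a minimal sketch, assuming unrecorded entries are stored under the literal label 'Missing' (an assumption; use whatever label the value counts actually show). The helper name placeholder_share and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def placeholder_share(frame, columns, placeholder="Missing"):
    """Return, per column, the fraction of rows equal to the placeholder label."""
    # eq() gives a boolean frame; the mean of booleans is the share of True values.
    return frame[columns].eq(placeholder).mean()

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "medical_specialty": ["Missing", "Cardiology", "Missing", "Surgery"],
    "diag_1": ["Circulatory", "Other", "Missing", "Diabetes"],
})
shares = placeholder_share(toy, ["medical_specialty", "diag_1"])
print(shares)
```

On the real data, the same call would be placeholder_share(df, cols) with the cols list defined earlier.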
It is also worth noting that a large portion of diagnoses fall outside the standard categories and are labelled Other. This already shows in the primary diagnosis. We expect Other to be even more prevalent in the secondary and additional secondary diagnoses, because a more common condition would likely have been recorded as the primary diagnosis.
On average, Other and Circulatory diagnoses account for about two thirds of all diagnoses across the three diagnosis columns.
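One way to verify a pooled figure like this is to stack the three diagnosis columns and compute each category's share of all entries. This is a sketch on a made-up toy frame; the helper name pooled_category_share is hypothetical, and the numbers it prints are not results from the hospital data.

```python
import pandas as pd

def pooled_category_share(frame, columns):
    """Stack the given columns and return each category's share of all entries."""
    # melt() turns the selected columns into one long column of values.
    stacked = frame[columns].melt(value_name="diagnosis")["diagnosis"]
    return stacked.value_counts(normalize=True)

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "diag_1": ["Circulatory", "Other", "Respiratory"],
    "diag_2": ["Other", "Other", "Circulatory"],
    "diag_3": ["Other", "Circulatory", "Diabetes"],
})
share = pooled_category_share(toy, ["diag_1", "diag_2", "diag_3"])
combined = share["Other"] + share["Circulatory"]
print(share)
print("Other + Circulatory:", round(combined, 3))
```

On the real data, the call would be pooled_category_share(df, ['diag_1', 'diag_2', 'diag_3']).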
The age column stores patients' ages as brackets in categorical format. The dataset covers people aged 40 and over.
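Because the brackets are stored as plain strings, it can help to convert them to an ordered categorical so that plots and group summaries follow age order. The sketch below assumes the labels look like '[40-50)', which is an assumption about the file's format, and sorts them by the first number in each label. The helper name to_ordered_age is hypothetical.

```python
import re
import pandas as pd

def to_ordered_age(series):
    """Convert age-bracket labels to an ordered categorical, sorted by lower bound."""
    def lower_bound(label):
        # First run of digits in the label, e.g. 40 for '[40-50)'.
        return int(re.search(r"\d+", label).group())

    ordered = sorted(series.dropna().unique(), key=lower_bound)
    return pd.Categorical(series, categories=ordered, ordered=True)

# Toy labels in the assumed format (made up for illustration).
toy_age = pd.Series(["[70-80)", "[40-50)", "[90-100)", "[50-60)"])
age_cat = to_ordered_age(toy_age)
print(list(age_cat.categories))
```

On the real data, this would be applied as df['age'] = to_ordered_age(df['age']).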
This was a quick overview of the categorical data for this dataset.
Let's take a look at numerical data the same way to see if we find any missing values.
Numerical data