Understanding Hospital Readmission
Table of contents
- Introduction
- Executive summary
- Data and Methods
- Exploratory Data Analysis (EDA)
- Further consideration
- Conclusions and Recommendations
- Annex
1. Introduction
Hospital readmission is a problem in healthcare where patients are discharged from the hospital and then readmitted within a certain period of time, often within 30 days of their initial discharge. This is a costly and preventable problem that can negatively impact patients' health outcomes and quality of life. Causes of readmissions include inadequate care during initial hospitalization and poor discharge planning. Patients with chronic conditions, such as heart failure, diabetes, and respiratory disease, are at a particularly high risk of readmission. To reduce readmissions, interventions such as improved care coordination, enhanced patient education, and medication management are implemented. Machine learning and artificial intelligence (AI) algorithms are also used to predict which patients are at the highest risk of readmission and enable healthcare providers to intervene proactively to prevent readmissions.
2. Executive summary
Our consulting company has been tasked with helping a hospital group improve their understanding of patient readmissions. We have been given access to ten years' worth of data on patients who were readmitted to the hospital after being discharged. Our goal is to assess whether initial diagnoses, the number of procedures, or other variables could provide insight into the probability of readmission, and to identify those patients who are at a higher risk of readmission so that the hospital can focus their follow-up calls and attention accordingly.
To achieve these objectives, we have prepared a report covering the following:
- Analysis of the most common primary diagnosis by age group.
- Exploration of the impact of a diabetes diagnosis on readmission rates.
- Identification of patient groups that the hospital should focus their follow-up efforts on to better monitor patients with a high probability of readmission.
Results
The report begins with a brief overview and cleaning of the data, followed by an explanation of the methodologies employed to extract the most valuable insights. The exploratory data analysis has indicated that:
- The most frequently observed primary diagnosis across age groups is 'Circulatory', except for the 40-50 age group, where 'Other' is the most common diagnosis.
- Investigation of the readmission rates concerning primary, secondary, and tertiary diagnoses reveals that:
  - Patients diagnosed with diabetes have a higher readmission rate than those diagnosed with other conditions: 48% for patients with diabetes against 44% for patients diagnosed with 'Other' diseases.
  - The Chi-square test used to assess the dependence between primary diagnosis and readmission revealed a statistically significant association between a primary diagnosis of diabetes and hospital readmission.
- The hospital should concentrate its follow-up efforts on patient groups with a high likelihood of readmission, including:
  - patients in the age range of 50 to 90 years old
  - patients diagnosed with diabetic, circulatory, and respiratory diseases
- According to the machine learning models developed during the analysis, the features that have the most significant impact on the readmission rate include:
  - the number of outpatient visits in the year before a hospital stay
  - the number of inpatient visits in the year before a hospital stay
  - the number of medications administered during the hospital stay
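The comparison of readmission rates and the Chi-square test mentioned in the findings can be illustrated with a minimal sketch. It is not the analysis code behind the reported 48% and 44% figures: the `toy` dataframe below is made-up data that only borrows the `diag_1` and `readmitted` column names, and it uses `scipy.stats.chi2_contingency` as a stand-in for the Pingouin routine named in the Methods section.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for the real dataset (values are invented for illustration only).
toy = pd.DataFrame({
    "diag_1": ["Diabetes"] * 50 + ["Other"] * 50,
    "readmitted": ["yes"] * 35 + ["no"] * 15 + ["yes"] * 15 + ["no"] * 35,
})

# Readmission rate per primary diagnosis: share of rows with readmitted == 'yes'.
rates = (toy["readmitted"] == "yes").groupby(toy["diag_1"]).mean()

# Contingency table of diagnosis vs. readmission, then the Chi-square test of independence.
# For a 2x2 table SciPy applies Yates' continuity correction by default.
observed = pd.crosstab(toy["diag_1"], toy["readmitted"])
chi2, p_value, dof, expected = chi2_contingency(observed)

print(rates)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests that readmission is not independent of the diagnosis group; it says nothing about the size or direction of the effect, which is why the rates are printed alongside it.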
 
 
Key Recommendations
- Further analysis should be conducted on patient groups identified as having a high probability of readmission to determine the specific factors contributing to their readmission rates.
- The hospital should implement targeted intervention programs for patients in the identified age group and for those diagnosed with diabetic, circulatory, or respiratory diseases to reduce their readmission rates.
- Evaluate the performance of different machine learning models and identify opportunities for model improvement.
- Analyze the impact of different hospital policies and practices, such as discharge planning and post-discharge follow-up, on readmission rates.
3. Data and Methods
As a company, we have access to a dataset that contains patient information spanning a period of ten years (source).
Information in the Dataset
- "age" - age bracket of the patient
- "time_in_hospital" - days (from 1 to 14)
- "n_procedures" - number of procedures performed during the hospital stay
- "n_lab_procedures" - number of laboratory procedures performed during the hospital stay
- "n_medications" - number of medications administered during the hospital stay
- "n_outpatient" - number of outpatient visits in the year before a hospital stay
- "n_inpatient" - number of inpatient visits in the year before the hospital stay
- "n_emergency" - number of visits to the emergency room in the year before the hospital stay
- "medical_specialty" - the specialty of the admitting physician
- "diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
- "diag_2" - secondary diagnosis
- "diag_3" - additional secondary diagnosis
- "glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
- "A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
- "change" - whether there was a change in the diabetes medication ('yes' or 'no')
- "diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
- "readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')
Remarks on the data:
The dataframe contains 25,000 rows and 17 columns, with no missing values or duplicate rows. Most of the numeric columns are positively skewed, likely because of the large number of outliers (11,181 in total). To avoid losing important information during analysis, we retained these outliers.
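The report does not state how the outliers were counted. One common convention, assumed here purely for illustration, is the 1.5 × IQR rule; the helper `count_iqr_outliers` and the toy values below are hypothetical and are not taken from the original analysis.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the matching dataframe columns.
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Toy data reusing two column names from the dataset description (values invented).
toy = pd.DataFrame({
    "n_outpatient": [0, 0, 0, 0, 0, 0, 0, 1, 1, 12],
    "time_in_hospital": [1, 2, 2, 3, 3, 4, 4, 5, 6, 7],
})

outliers = count_iqr_outliers(toy)   # outlier count per column
skewness = toy.skew()                # positive values indicate a long right tail

print(outliers)
print(skewness)
```

Summing the per-column counts gives a single total comparable in spirit to the figure quoted above, although a different outlier definition would produce a different number.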
Methods
Our exploratory data analysis involved various methodologies, including data cleaning, data visualization, statistical analysis, and machine learning algorithms. To clean the data, we used Pandas to handle missing values and outliers and to transform variables as necessary. We also used Scikit-learn tools, such as OneHotEncoder, to prepare the data for machine learning algorithms. For visualization, we employed Matplotlib and Seaborn to create various plots, including bar plots, line plots, and heat maps, to identify patterns and relationships. Additionally, we used the Pingouin library for statistical analysis, including the Chi-square test to understand relationships between variables. For machine learning, we implemented several algorithms, namely k-Nearest Neighbors, Logistic Regression, and Random Forests, and evaluated the models with cross-validation using accuracy, precision, recall, and F1 score.
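The preprocessing and evaluation steps described above can be sketched as a single Scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the models behind the reported results: the data is synthetic, only three of the dataset's column names are reused, and the choice of Logistic Regression, five folds, and F1 scoring is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (invented values; column names follow the dataset description).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "n_inpatient": rng.poisson(1.0, n),
    "n_medications": rng.integers(1, 40, n),
    "diag_1": rng.choice(["Circulatory", "Diabetes", "Other"], n),
})
# Synthetic binary target loosely tied to n_inpatient (1 = readmitted).
y = (X["n_inpatient"] + rng.normal(0, 1, n) > 1).astype(int)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["n_inpatient", "n_medications"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["diag_1"]),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Stratified folds keep the readmitted / not-readmitted ratio similar in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"Mean F1 across folds: {scores.mean():.3f}")
```

Putting the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, which avoids leaking information from the validation fold into the model. Swapping `LogisticRegression` for `KNeighborsClassifier` or `RandomForestClassifier` leaves the rest of the sketch unchanged.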
Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.
4. Exploratory Data Analysis (EDA)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg
import time
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, KFold, GridSearchCV
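from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, precision_recall_curve, auc, roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay

# Load the dataset and display it
df = pd.read_csv('data/hospital_readmissions.csv')
df

# Column types and non-null counts
df.info()

# Missing values per column
df.isna().sum()

# Number of duplicated rows
df.duplicated().sum()

# Unique values in each column
# (DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement)
for column, values in df.items():
    unique_values = values.sort_values().unique()
    print(f"Unique values in column '{column}': {unique_values}")

# Keep only the numeric columns
num = df.select_dtypes(exclude=['object'])

# Plot dataframe distribution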
sns.set_palette('Blues_r')
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes[-1, -1].remove()
for ax, col in zip(axes.flatten(), num.columns):
    sns.kdeplot(num[col], ax=ax, fill=True)
    ax.set_title(col, fontsize=15)
    
plt.subplots_adjust(hspace=0.3, wspace=0.3)
plt.show()
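The first question from the executive summary, the most common primary diagnosis by age group, can be sketched as a grouped mode over the `age` and `diag_1` columns. The toy rows and the bracket labels such as `[40-50)` are assumptions made for illustration; the source only says that `age` holds an age bracket.

```python
import pandas as pd

# Toy rows (invented); the bracket label format is an assumption.
toy = pd.DataFrame({
    "age": ["[40-50)", "[40-50)", "[40-50)", "[50-60)", "[50-60)", "[50-60)"],
    "diag_1": ["Other", "Other", "Circulatory", "Circulatory", "Circulatory", "Diabetes"],
})

# For each age bracket, count the diagnoses and keep the most frequent one.
top_diag = toy.groupby("age")["diag_1"].agg(lambda s: s.value_counts().idxmax())

print(top_diag)
```

When two diagnoses are tied within a bracket, `idxmax` returns only the first one it encounters, so a real analysis may want to inspect the full `value_counts()` output rather than the single winner.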