Competition - hospital readmissions

Reducing hospital readmissions

📖 Background

We work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave us access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want us to assess whether initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.

Readmissions are costly, sometimes doubling the cost of care, which makes the readmission rate a key performance indicator for hospitals. Doctors want to focus follow-up calls and attention on the patients with a higher probability of readmission.

💾 The data

We have access to ten years of patient information (source):

Information in the file

"age" - age bracket of the patient
"time_in_hospital" - days (from 1 to 14)
"n_procedures" - number of procedures performed during the hospital stay
"n_lab_procedures" - number of laboratory procedures performed during the hospital stay
"n_medications" - number of medications administered during the hospital stay
"n_outpatient" - number of outpatient visits in the year before a hospital stay
"n_inpatient" - number of inpatient visits in the year before the hospital stay
"n_emergency" - number of visits to the emergency room in the year before the hospital stay
"medical_specialty" - the specialty of the admitting physician
"diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
"diag_2" - secondary diagnosis
"diag_3" - additional secondary diagnosis
"glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
"A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
"change" - whether there was a change in the diabetes medication ('yes' or 'no')
"diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
"readmitted" - whether the patient was readmitted to the hospital ('yes' or 'no')

Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

# Start by importing the necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset and display its first 10 rows
df = pd.read_csv('data/hospital_readmissions.csv')
df.head(10)

First, let's display general information about the dataset to check for missing values and incorrect data types. It looks like there are no missing values, and all data types match the initial description of the data, so we can begin the data manipulation.

df.info()
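To make the missing-value check explicit instead of reading it off the df.info() output, a small helper can tabulate the dtype and null count per column. This is only a sketch on a toy frame: summarize_columns and the toy values are illustrative and not part of the original notebook. It detects true NaN values only, so placeholder strings would need a separate check.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy frame standing in for the hospital data (values are made up).
toy = pd.DataFrame({
    "time_in_hospital": [1, 3, 14, 2],
    "n_medications": [10.0, np.nan, 18.0, 7.0],
    "readmitted": ["yes", "no", "no", "yes"],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, the same call would be summarize_columns(df).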

Duplicates.

# Check for fully duplicated rows with DataFrame.duplicated()
duplicate_rows = df[df.duplicated()]
print("Number of duplicate rows: ", duplicate_rows.shape[0])

Data outliers.

As data analysts, we need to make sure the data we work with is accurate and relevant. One common issue is the presence of outliers, which can skew the results of an analysis. To detect them, we can combine a box plot with the describe() method. The box plot shows the first quartile, median, and third quartile of each column; with Plotly's default settings the whiskers extend to the most extreme points within 1.5 × IQR of the box, and any points beyond them are drawn individually, which makes unusually large or small values easy to spot. The describe() method summarizes the distribution of each column, including the count, mean, standard deviation, minimum, quartiles, and maximum.

Together, these two views give a fuller picture of the data and help us decide how to treat any outliers. This analysis applies only to the numeric columns of the dataframe, since the methods used here are defined only for numerical values.

df.describe()

Outliers can be counted by identifying values that fall outside of the expected range based on statistical methods such as the Interquartile Range (IQR) or standard deviation.

For example, in a data set with a mean of μ and standard deviation of σ, values more than 3σ away from the mean could be considered outliers. The specific method used to identify outliers may depend on the distribution of the data, the goal of the analysis, and other factors.
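The two rules above can be sketched as follows. This is a minimal illustration on made-up values, not the notebook's own code: the helper names count_outliers_iqr and count_outliers_sigma, the 1.5 multiplier for the IQR fences, and the toy frame are all assumptions.

```python
import numpy as np
import pandas as pd


def count_outliers_iqr(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


def count_outliers_sigma(frame: pd.DataFrame, n_sigma: float = 3.0) -> pd.Series:
    """Count values more than n_sigma standard deviations from the column mean."""
    numeric = frame.select_dtypes(include=[np.number])
    z_scores = (numeric - numeric.mean()) / numeric.std()
    return (z_scores.abs() > n_sigma).sum()


# Toy frame (made-up values): one extreme n_medications value, none elsewhere.
toy = pd.DataFrame({
    "n_medications": [10, 11, 12, 13] * 5 + [79],
    "time_in_hospital": [1, 2, 3, 4, 5, 6, 7] * 3,
    "readmitted": ["yes", "no", "no"] * 7,  # non-numeric, ignored by both helpers
})

iqr_counts = count_outliers_iqr(toy)
sigma_counts = count_outliers_sigma(toy)
print(iqr_counts)
print(sigma_counts)
```

The two rules do not always agree: the 3σ rule depends on the mean and standard deviation, which are themselves pulled by extreme values, while the IQR rule depends only on the quartiles.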

# Select only numeric columns in the dataframe
numeric_df = df.select_dtypes(include=[np.number])

# Create a box plot of the numeric columns
fig = px.box(numeric_df, title="Box Plot of Numeric Columns", color_discrete_sequence=px.colors.sequential.Plasma)

# Show the plot
fig.show()

Conclusion: The df.describe() output suggests that the numeric columns contain outliers. For example, the maximum values of n_lab_procedures and n_medications are 113 and 79 respectively, while their 75th percentiles are only 57 and 20. Such extreme values can reduce the accuracy of statistical models, so they need careful handling: we can either remove them or transform the data to make the analysis more robust to them.

By default, this code removes outliers from all columns in the dataframe, but you can modify the code to only remove outliers from specific columns, if desired. We'll store the new dataframe as df_outliers_removed.
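One possible shape for such a removal step is sketched below, assuming the IQR rule with a 1.5 multiplier. The helper remove_outliers_iqr, its columns parameter, and the toy values are hypothetical and may differ from the code actually used in the notebook.

```python
import numpy as np
import pandas as pd


def remove_outliers_iqr(frame: pd.DataFrame, columns=None, k: float = 1.5) -> pd.DataFrame:
    """Drop rows that fall outside the IQR fences in any of the selected columns.

    columns=None checks every numeric column; pass a list to restrict the check.
    """
    if columns is None:
        columns = frame.select_dtypes(include=[np.number]).columns
    subset = frame[list(columns)]
    q1 = subset.quantile(0.25)
    q3 = subset.quantile(0.75)
    iqr = q3 - q1
    outside = (subset < q1 - k * iqr) | (subset > q3 + k * iqr)
    # Keep only rows with no flagged value; NaN is never flagged, so it is kept.
    return frame[~outside.any(axis=1)]


# Toy frame (made-up values): row 0 is extreme in n_lab_procedures,
# row 20 is extreme in n_medications.
toy = pd.DataFrame({
    "n_lab_procedures": [113] + [40, 45, 50, 55] * 5,
    "n_medications": [10, 11, 12, 13] * 5 + [79],
    "readmitted": ["yes", "no", "no"] * 7,
})

# Default: check all numeric columns.
df_outliers_removed = remove_outliers_iqr(toy)

# Restricted: check a single column only.
only_medications = remove_outliers_iqr(toy, columns=["n_medications"])

print(len(toy), len(df_outliers_removed), len(only_medications))
```

A caveat with this rule: in a count column where at least three quarters of the values are 0, both quartiles are 0, the IQR is 0, and every non-zero value is flagged, so restricting the check to selected columns can be the safer choice.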
