Competition - Analyzing Patient Readmissions: Identifying Risk Factors and Targeted Follow-Up Efforts for a Hospital Group.

Reducing hospital readmissions

📖 Background

You work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave you access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want you to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.

They want to focus follow-up calls and attention on those patients with a higher probability of readmission.

💾 The data

You have access to ten years of patient information (source):

Information in the file

"age" - age bracket of the patient
"time_in_hospital" - days (from 1 to 14)
"n_procedures" - number of procedures performed during the hospital stay
"n_lab_procedures" - number of laboratory procedures performed during the hospital stay
"n_medications" - number of medications administered during the hospital stay
"n_outpatient" - number of outpatient visits in the year before a hospital stay
"n_inpatient" - number of inpatient visits in the year before the hospital stay
"n_emergency" - number of visits to the emergency room in the year before the hospital stay
"medical_specialty" - the specialty of the admitting physician
"diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
"diag_2" - secondary diagnosis
"diag_3" - additional secondary diagnosis
"glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
"A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
"change" - whether there was a change in the diabetes medication ('yes' or 'no')
"diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
"readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')

Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mticker
import matplotlib.colors as mcolors
from scipy.stats import chi2_contingency, f_oneway
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
import plotly.graph_objects as go

df = pd.read_csv('data/hospital_readmissions.csv')
df

💪 Competition challenge

Create a report that covers the following:

What is the most common primary diagnosis by age group?
Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates.
On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?

⌛️ Time is ticking. Good luck!

# Category frequencies for the primary, secondary, and additional secondary diagnoses
print(df['diag_1'].value_counts())
print('-' * 120)
print(df['diag_2'].value_counts())
print('-' * 120)
print(df['diag_3'].value_counts())

df.info()
print('-' * 120)
print(df.shape)

df.describe()

df.isna().sum()
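
Note that `isna()` only counts true nulls. The diagnosis columns appear to use the literal string 'Missing' as a placeholder (the filtering step in Q1 relies on this), so those rows would not show up in the null counts. Below is a minimal sketch for counting such a placeholder per column. It runs on a small made-up frame rather than the real data, and the helper name `count_placeholder` is hypothetical, not part of the original notebook.

```python
import pandas as pd


def count_placeholder(frame: pd.DataFrame, placeholder: str = "Missing") -> pd.Series:
    """Count, per column, how many cells equal the placeholder string."""
    return frame.eq(placeholder).sum()


# Toy data for illustration only (not taken from hospital_readmissions.csv)
toy = pd.DataFrame({
    "diag_1": ["Circulatory", "Missing", "Other"],
    "diag_2": ["Missing", "Missing", "Diabetes"],
    "n_procedures": [1, 0, 3],
})

print(toy.isna().sum())        # no true nulls in any column
print(count_placeholder(toy))  # placeholder hits per column
```

On the notebook's data the equivalent call would be `count_placeholder(df)`, assuming the placeholder is spelled exactly 'Missing'.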

Q1.-What is the most common primary diagnosis by age group?

This question can be answered directly from the data. We group the records by age group, count each primary diagnosis, and show the counts in a stacked bar chart.

# Filter out rows where 'diag_1', 'diag_2', or 'diag_3' is equal to 'Missing'
df_filtered = df[(df['diag_1'] != 'Missing') & (df['diag_2'] != 'Missing') & (df['diag_3'] != 'Missing')]

# Group by "age" and count occurrences of each unique value in "diag_1" column
diag_1_counts_by_age = df_filtered.groupby("age")["diag_1"].value_counts()

# Convert the result to a DataFrame and reset the index
diag_1_counts_by_age = diag_1_counts_by_age.to_frame(name="count").reset_index()

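# Plot the results as a stacked bar chart
# Pivot so that each diagnosis is a column, with one row per age group
diag_1_counts_pivot = diag_1_counts_by_age.pivot(index="age", columns="diag_1", values="count").fillna(0)
plt.figure(figsize=(12, 6))  # Set the size of the figure
colors = plt.get_cmap('tab20', len(diag_1_counts_pivot.columns))  # One color per diagnosis (plt.cm.get_cmap is removed in recent matplotlib)
bottom = np.zeros(len(diag_1_counts_pivot))  # Running height of each stack
for i, diag_1 in enumerate(diag_1_counts_pivot.columns):
    counts = diag_1_counts_pivot[diag_1].to_numpy()
    # Draw this diagnosis on top of the previous ones instead of overlapping them
    plt.bar(diag_1_counts_pivot.index, counts, bottom=bottom, color=colors(i), label=diag_1)
    bottom += counts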

plt.title("Fig1. Counts of Diagnoses by Age Group")  # Set the title of the chart
plt.xlabel("Age Group")  # Set the x-axis label
plt.ylabel("Count")  # Set the y-axis label
plt.legend(title="Diagnosis")  # Add a legend for the "diag_1" values
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()  # Display the chart

# Filter out rows where 'diag_1', 'diag_2', or 'diag_3' is equal to 'Missing'
df_filtered = df[(df['diag_1'] != 'Missing') & (df['diag_2'] != 'Missing') & (df['diag_3'] != 'Missing')]

# Group by "age" and count occurrences of each unique value in "diag_1" column
diag_1_counts_by_age = df_filtered.groupby("age")["diag_1"].value_counts(normalize=True)  # Calculate proportions using normalize=True

# Convert the result to a DataFrame, reset the index, and rename the count column to proportion
diag_1_proportions_by_age = diag_1_counts_by_age.to_frame(name="proportion").reset_index()

# Create a pivot table to convert "diag_1" values into separate columns
diag_1_proportions_pivot = diag_1_proportions_by_age.pivot(index="age", columns="diag_1", values="proportion")

# Plot the results
plt.figure(figsize=(12, 6))  # Set the size of the figure
colors = plt.get_cmap('tab20', len(diag_1_proportions_pivot.columns))  # Get colors for each unique value in "diag_1" (plt.cm.get_cmap is removed in recent matplotlib)
for i, diag_1 in enumerate(diag_1_proportions_pivot.columns):
    # Filter data for each "diag_1" value
    data = diag_1_proportions_pivot[diag_1]
    plt.plot(data.index, data.values, marker='o', color=colors(i), label=diag_1)  # Create a line plot for each "diag_1" value

plt.title("Fig2. Proportions of Diagnoses by Age Group")  # Set the title of the chart
plt.xlabel("Age Group")  # Set the x-axis label
plt.ylabel("Proportion")  # Set the y-axis label
plt.legend(title="Diagnosis")  # Add a legend for the "diag_1" values
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()  # Display the chart

# Create a DataFrame for the proportions
diag_1_proportions_df = pd.DataFrame(diag_1_proportions_pivot.to_records())

# Add a new column that shows the most frequent disease (max value) for each row
diag_1_proportions_df['Most frequent disease'] = diag_1_proportions_df.iloc[:, 1:].idxmax(axis=1)

# Display the resulting DataFrame
display(diag_1_proportions_df)

# Iterate through each row in the DataFrame
for index, row in diag_1_proportions_df.iterrows():
    # Get the age group and its most frequent disease and proportion
    age_group = row['age']
    most_frequent_disease = row['Most frequent disease']
    proportion = row[most_frequent_disease]
    # Print the statement
    print(f"The most common disease for age group {age_group} is {most_frequent_disease} with a proportion of {proportion:.2%}.")

Circulatory disease is the most common primary diagnosis in every age group except the first, [40-50], where the most common category is "Other".
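
As a cross-check, the same answer can be computed more compactly. The sketch below uses `pd.crosstab` with `normalize="index"` to get the share of each diagnosis within an age group, then `idxmax` to pick the largest one. The frame is toy data, and `top_diagnosis_by_age` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd


def top_diagnosis_by_age(frame: pd.DataFrame,
                         age_col: str = "age",
                         diag_col: str = "diag_1") -> pd.DataFrame:
    """Return the most common diagnosis and its share for each age group."""
    # Rows: age groups; columns: diagnoses; each row sums to 1
    shares = pd.crosstab(frame[age_col], frame[diag_col], normalize="index")
    return pd.DataFrame({
        "most_common": shares.idxmax(axis=1),
        "share": shares.max(axis=1),
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "age": ["[40-50]", "[40-50]", "[40-50]", "[50-60]", "[50-60]", "[50-60]"],
    "diag_1": ["Other", "Other", "Circulatory",
               "Circulatory", "Circulatory", "Respiratory"],
})

summary = top_diagnosis_by_age(toy)
print(summary)
```

With the notebook's data this would be called as `top_diagnosis_by_age(df_filtered)`. If two diagnoses tie within an age group, `idxmax` returns the first one in column order, so ties are worth checking separately.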
