Competition - hospital readmissions

Reducing hospital readmissions

📖 Background

You work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave you access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want you to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.

They want to focus follow-up calls and attention on those patients with a higher probability of readmission.

💾 The data

You have access to ten years of patient information (source):

Information in the file

"age" - age bracket of the patient
"time_in_hospital" - days (from 1 to 14)
"n_procedures" - number of procedures performed during the hospital stay
"n_lab_procedures" - number of laboratory procedures performed during the hospital stay
"n_medications" - number of medications administered during the hospital stay
"n_outpatient" - number of outpatient visits in the year before a hospital stay
"n_inpatient" - number of inpatient visits in the year before the hospital stay
"n_emergency" - number of visits to the emergency room in the year before the hospital stay
"medical_specialty" - the specialty of the admitting physician
"diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
"diag_2" - secondary diagnosis
"diag_3" - additional secondary diagnosis
"glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
"A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
"change" - whether there was a change in the diabetes medication ('yes' or 'no')
"diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
"readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')

Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

import pandas as pd
df = pd.read_csv('data/hospital_readmissions.csv')
df.head()

💪 Competition challenge

Create a report that covers the following:

What is the most common primary diagnosis by age group?
Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates.
On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?

🧑‍⚖️ Judging criteria

CATEGORY	WEIGHTING	DETAILS
Recommendations	35%	Clarity of recommendations - how clear and well presented the recommendation is. Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid? Number of relevant insights found for the target audience.
Storytelling	35%	How well the data and insights are connected to the recommendation. How the narrative and whole report connects together. Balancing making the report in-depth enough but also concise.
Visualizations	20%	Appropriateness of visualization used. Clarity of insight from visualization.
Votes	10%	Up voting - most upvoted entries get the most points.

✅ Checklist before publishing into the competition

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your story.
Make sure the workbook reads well and explains how you found your insights.
Try to include an executive summary of your recommendations at the beginning.
Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

df = pd.read_csv('data/hospital_readmissions.csv')

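# Compute the correlation matrix on the numeric columns only
# (pandas >= 2.0 raises an error if string columns are included)
corr = df.corr(numeric_only=True)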

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(250, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

ax.set_title('Correlation between variables')

import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

df = pd.read_csv('data/hospital_readmissions.csv')

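# Encode the target as 1 (readmitted) / 0 (not readmitted) so it can be plotted with the numeric columns
df['readmitted'] = df['readmitted'].map({'yes': 1, 'no': 0})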

attributes = ['readmitted','n_inpatient','n_outpatient','n_emergency','time_in_hospital','n_medications','n_lab_procedures','n_procedures']

scatter_matrix(df[attributes], figsize=(12,8), alpha=0.3)


#Compute the correlation matrix

#corr = df.corr()
#
#corr['n_lab_procedures'].sort_values(ascending=False)

sns.set_style('darkgrid')
ax = sns.histplot(data=df, x='diag_1', stat='percent', color='lightskyblue')
ax.patches[0].set_facecolor('salmon')
for i in ax.containers:
    ax.bar_label(i,fmt='%.1f%%')
plt.xticks(rotation=90)
plt.show()

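# Count patients per (age group, primary diagnosis) pair
df_grouped = df.groupby(['age','diag_1']).size().to_frame(name='count').reset_index()
# Sort by count so that head(1) keeps the most common diagnosis in each age group
df_grouped.sort_values(['age','count'], ascending=[True, False]).groupby('age').head(1).reset_index(drop=True)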

xtab = pd.crosstab(index=df['readmitted'],columns=df['diabetes_med'])
print(xtab)

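# importing the required function
from scipy.stats import chi2_contingency

# Performing Chi-sq test of independence between readmission and diabetes medication
chi2, p_value, dof, expected = chi2_contingency(xtab)

# The p-value is the probability of seeing a statistic at least this extreme if H0 (independence) were true
# If p-value > 0.05 we fail to reject H0; otherwise we reject it

print('The P-Value of the ChiSq Test is:', p_value)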
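
The crosstab above looks at diabetes medication, while the challenge question asks about a diabetes diagnosis. A minimal sketch of that second angle, on invented toy data rather than the hospital file, flags any patient with diabetes among the primary or secondary diagnoses and compares readmission rates between the two groups. The label "Diabetes" and the helper column `has_diabetes_diag` are assumptions for illustration; check `df['diag_1'].unique()` for the real spelling.

```python
import pandas as pd

# Toy stand-in for the hospital data; all values are invented for illustration.
toy = pd.DataFrame({
    "diag_1": ["Diabetes", "Circulatory", "Respiratory", "Diabetes", "Digestive", "Circulatory"],
    "diag_2": ["Circulatory", "Diabetes", "Other", "Other", "Other", "Respiratory"],
    "diag_3": ["Other", "Other", "Other", "Circulatory", "Diabetes", "Other"],
    "readmitted": ["yes", "yes", "no", "yes", "no", "no"],
})

# Flag patients with a diabetes diagnosis in any of the three diagnosis columns.
# Assumption: the category label is spelled exactly "Diabetes".
diag_cols = ["diag_1", "diag_2", "diag_3"]
toy["has_diabetes_diag"] = toy[diag_cols].eq("Diabetes").any(axis=1)

# Share of readmitted patients within each group (mean of a boolean is a rate).
readmit_rate = (
    toy["readmitted"].eq("yes")
    .groupby(toy["has_diabetes_diag"])
    .mean()
)
print(readmit_rate)
```

On the real data, the same flag could replace `diabetes_med` in the crosstab so that the chi-square test addresses the diagnosis directly.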

import pandas as pd
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', 50)

df = pd.read_csv('data/hospital_readmissions.csv')
df = df.sample(frac=1, random_state=13)

df.drop(['diag_2','diag_3'], axis=1, inplace=True)


X_train, X_test, y_train, y_test = train_test_split(df.drop("readmitted", axis=1), df["readmitted"], test_size=0.25, random_state=13)

# Treat the binary variables differently from the categorical ones, so that the mapping is explicit: no=0 and yes=1
bi_cols = ('diabetes_med','change')

for b in bi_cols:
    df[b+'_bin'] = df[b].map({'yes': 1, 'no': 0})
    df.drop(b, axis=1, inplace=True)

    

# Select the categorical variables and apply an Ordinal Encoder to map them to 0, 1 or 2: no, normal, high
from sklearn.preprocessing import OrdinalEncoder

data=df

# Categorical columns chosen. 
cat_cols = ('glucose_test','A1Ctest')

# Explicit category order; unique() would return the values in order of appearance instead.
# This assumes the raw labels are exactly 'no', 'normal' and 'high'.
cats = ['no', 'normal', 'high']

# Initialize an empty DataFrame to store the encoded data
encoded_data = pd.DataFrame(index=data.index)

# Loop through each categorical column and apply the encoding
for c in cat_cols:
    encoder = OrdinalEncoder(categories=[cats])
    encoded = encoder.fit_transform(data[[c]])
    # Reuse data's (shuffled) index so the final concat aligns rows correctly
    encoded = pd.DataFrame(encoded, columns=[c+'_encoded'], index=data.index)
    encoded_data = pd.concat([encoded_data, encoded], axis=1)
    data.drop(c, axis=1, inplace=True)


data_encoded = pd.concat([data, encoded_data], axis=1)

#data_encoded = data_encoded.drop([list(cat_cols)],axis=1)
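
Because `df` is shuffled with `sample(frac=1)` before encoding, its index labels are no longer in order, and `pd.concat(..., axis=1)` pairs rows by label, not by position. The toy sketch below (invented values, hand-written index standing in for a shuffle, hypothetical `mapping` dictionary instead of the encoder) shows how a frame built from a bare array ends up next to the wrong rows unless it reuses the original index.

```python
import pandas as pd

# Toy frame whose index mimics a shuffled DataFrame (labels out of order).
toy = pd.DataFrame({"A1Ctest": ["no", "normal", "high"]}, index=[2, 0, 1])

# Positional codes, as an encoder would return them (plain NumPy array, no index).
mapping = {"no": 0, "normal": 1, "high": 2}
codes = toy["A1Ctest"].map(mapping).to_numpy()

# A fresh RangeIndex (0, 1, 2): concat matches labels, so codes land on other rows.
misaligned = pd.concat([toy, pd.DataFrame({"A1Ctest_encoded": codes})], axis=1)

# Reusing the original index keeps each code next to the row it came from.
aligned = pd.concat(
    [toy, pd.DataFrame({"A1Ctest_encoded": codes}, index=toy.index)], axis=1
)

print(misaligned)
print(aligned)
```

In this toy case every row of `misaligned` carries the wrong code, while `aligned` matches the mapping throughout.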



#dum_cols = ['age','diag_1']
#
#
#for d in dum_cols:
#    data_dum = pd.DataFrame()
#    dummies = pd.get_dummies(data_encoded, prefix=[d], columns=[d])
#    data_dum = pd.concat([data_dum, dummies], axis=1)
#
#data_encdum = pd.concat([data_encoded, data_dum], axis=1)


# Apply OneHotEncoder to the categorical variables that have no natural order (unlike e.g. good, bad, worse)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(),['age','diag_1','medical_specialty']),
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array so it fits in a DataFrame

# Keep the target out of the feature matrix; it is still a 'yes'/'no' string here
features = data_encoded.drop(columns='readmitted')
transformed = transformer.fit_transform(features)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
#print(transformed_df.info())

transformed_df = transformed_df.astype(float)
#print(transformed_df.head(5))

# Import necessary libraries
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(transformed_df)
scaled_data = pd.DataFrame(X_train_scaled,columns=transformed_df.columns)

print(scaled_data.head())


#dummies = pd.get_dummies(data_encoded, prefix=['age'],columns=['age'])
#dummies.head()

#data_encoded.info()
#data_encdum.info()

# Print the encoded data
#print(data_encoded[data_encoded['A1Ctest'] == 'high'])

print(data['medical_specialty'].unique())
print(data['diag_1'].unique())
# diag_2 and diag_3 were dropped from the DataFrame earlier, so they cannot be printed here

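# Import necessary libraries
from sklearn.preprocessing import StandardScaler

# Scale only the numeric features; the raw split still contains string columns
num_cols = X_train.select_dtypes(include='number').columns
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[num_cols])
X_test_scaled = scaler.transform(X_test[num_cols])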


######################################### Code to include a transformation with Pandas or others (in this case an
######################################### np.log) in the sklearn pipeline. This way we do not have to apply it twice, to
######################################### the train and the test set. Naturally, the split has to be done beforehand.
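import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Define a custom transformer that applies a log transformation to a column
class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        X[self.column] = np.log(X[self.column])
        return X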
# Load the dataset into a Pandas DataFrame
df = pd.read_csv('my_data.csv')

# Define the preprocessing pipeline
preprocessing = ColumnTransformer(
    [('log', LogTransformer(column='numeric_column_1'), ['numeric_column_1']),
     ('scale', StandardScaler(), ['numeric_column_2']),
     ('onehot', OneHotEncoder(), ['categorical_column']),
     ('ordinal', OrdinalEncoder(), ['glucose_test','A1Ctest'])],
    remainder='passthrough')

# Define the logistic regression model
model = LogisticRegression()

# Combine the preprocessing pipeline and the logistic regression model into a single pipeline
pipeline = Pipeline([('preprocessing', preprocessing),
                     ('model', model)])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Evaluate the performance of the pipeline on the test data
accuracy = pipeline.score(X_test, y_test)
print('Test accuracy:', accuracy)
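
The template above uses placeholder column names and a placeholder file. As a hedged illustration of the same idea, the sketch below runs a comparable pipeline end to end on synthetic data that borrows column names from the dataset description. The values are random, so the accuracy it prints says nothing about the real problem. It also swaps the custom `LogTransformer` for scikit-learn's built-in `FunctionTransformer` with `np.log1p`, which is an alternative chosen here, not the notebook's method; the `levels` ordering is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer, OneHotEncoder, OrdinalEncoder, StandardScaler,
)

# Synthetic stand-in: random values under column names from the data description.
rng = np.random.default_rng(13)
n = 200
toy = pd.DataFrame({
    "n_medications": rng.integers(1, 40, n),
    "time_in_hospital": rng.integers(1, 15, n),
    "diag_1": rng.choice(["Circulatory", "Respiratory", "Digestive"], n),
    "glucose_test": rng.choice(["no", "normal", "high"], n),
    "A1Ctest": rng.choice(["no", "normal", "high"], n),
    "readmitted": rng.choice(["yes", "no"], n),
})

# Split first, so every transformation is fitted on the training rows only.
X = toy.drop(columns="readmitted")
y = toy["readmitted"].map({"yes": 1, "no": 0})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13
)

# Assumed ordering of the test results, lowest to highest.
levels = ["no", "normal", "high"]
preprocessing = ColumnTransformer([
    ("log", FunctionTransformer(np.log1p), ["n_medications"]),
    ("scale", StandardScaler(), ["time_in_hospital"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["diag_1"]),
    ("ordinal", OrdinalEncoder(categories=[levels, levels]), ["glucose_test", "A1Ctest"]),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("model", LogisticRegression(max_iter=1000)),
])

# The pipeline applies the same fitted preprocessing to train and test data.
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

Keeping the preprocessing inside the pipeline means `fit` only ever sees the training rows, so the test rows are transformed with statistics learned from the training set.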
