Project: Dr. Semmelweis and the Importance of Handwashing

Hungarian physician Dr. Ignaz Semmelweis worked at the Vienna General Hospital with childbed fever patients. Childbed fever is a deadly disease affecting women who have just given birth, and in the early 1840s, as many as 10% of the women giving birth at the Vienna General Hospital died from it. Dr. Semmelweis discovered that the cause was the contaminated hands of the doctors delivering the babies, and on June 1st, 1847, he decreed that everyone should wash their hands. This was an unorthodox and controversial request, since nobody in Vienna knew about bacteria at the time.

I will reanalyze the data that led Semmelweis to discover the importance of handwashing, and measure its impact on the number of deaths at the hospital.

The data is stored as two CSV files within the data folder.

data/yearly_deaths_by_clinic.csv contains the number of births and deaths at the two clinics of the Vienna General Hospital for the years 1841 to 1846.

Column	Description
`year`	Years (1841-1846)
`births`	Number of births
`deaths`	Number of deaths
`clinic`	Clinic 1 or clinic 2

data/monthly_deaths.csv contains data from 'Clinic 1' of the hospital where most deaths occurred.

Column	Description
`date`	Date (YYYY-MM-DD)
`births`	Number of births
`deaths`	Number of deaths

# Import libraries to play with
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Read in csv files from the data folder
df_monthly = pd.read_csv('data/monthly_deaths.csv')
df_yearly = pd.read_csv('data/yearly_deaths_by_clinic.csv')

# Inspect first dataframe (info() prints its own summary and returns None)
print(df_yearly.shape)
df_yearly.info()
display(df_yearly.head())

# Inspect second dataframe
print(df_monthly.shape)
df_monthly.info()
display(df_monthly.head())

# Convert date format to pd datetime
df_monthly['date'] = pd.to_datetime(df_monthly['date'])

# Verify data types
print(df_monthly.dtypes)

# Create copy to play with
df_copy = df_monthly.copy()

# Convert date to year format
df_copy['date'] = df_copy['date'].dt.year

# Group by year and sum stats
df_copy = df_copy.groupby('date', as_index=False)[['births', 'deaths']].sum()

# Create clinic 1 column
df_copy['clinic'] = 'clinic 1'

# Rename date column
df_copy = df_copy.rename(columns={'date': 'year'})

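# Filter the yearly data for clinic 1 and combine it with the yearly totals from the monthly data
# Note: any year present in both sources has its births and deaths summed twice here
clinic_1 = df_yearly[df_yearly['clinic'] == 'clinic 1']
df_1 = pd.concat([df_copy, clinic_1]).groupby(['year', 'clinic'], as_index=False)[['births', 'deaths']].sum()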
print(df_1)

# Filter for clinic 2 information (copy so a new column can be added without a SettingWithCopyWarning)
df_2 = df_yearly[df_yearly['clinic'] == 'clinic 2'].copy()
print(df_2)

# Create death percentage column (stored as a proportion for now)
df_2['death_percentage'] = df_2['deaths'] / df_2['births']
df_1['death_percentage'] = df_1['deaths'] / df_1['births']

# Merge dataframes
df = df_1.merge(df_2, how='left', on='year', suffixes=['_clinic_1', '_clinic_2'])

# Fill null values
df = df.fillna(0)

# Drop clinic columns
df = df.drop(['clinic_clinic_1', 'clinic_clinic_2'], axis=1)

# Replace zeroes in clinic 2 with values from clinic 1
df['death_percentage_clinic_2'] = np.where(df['death_percentage_clinic_2'] == 0, df['death_percentage_clinic_1'], df['death_percentage_clinic_2'])

# Calculate the average death percentage across the two clinics
df['death_percentage'] = ((df['death_percentage_clinic_2'] + df['death_percentage_clinic_1']) / 2 * 100).round(2)

# View dataframe
display(df)
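
The yearly figure above is the simple average of the two clinics' death rates, which gives each clinic the same weight no matter how many births it handled. A pooled rate (total deaths divided by total births) weights each clinic by its births instead. Below is a minimal sketch of the difference. The counts are made up for illustration, and the function names `mean_of_rates` and `pooled_rate` are my own, not part of the analysis above.

```python
def mean_of_rates(clinics):
    """Unweighted average of the per-clinic death percentages."""
    rates = [deaths / births * 100 for births, deaths in clinics]
    return sum(rates) / len(rates)


def pooled_rate(clinics):
    """Total deaths over total births, as a percentage."""
    total_births = sum(births for births, _ in clinics)
    total_deaths = sum(deaths for _, deaths in clinics)
    return total_deaths / total_births * 100


# Hypothetical (births, deaths) pairs for two clinics in one year
example_year = [(3000, 240), (2000, 80)]

print(round(mean_of_rates(example_year), 2))  # average of the 8% and 4% clinic rates
print(round(pooled_rate(example_year), 2))    # 320 deaths out of 5000 births
```

When the clinics handle similar numbers of births the two figures stay close; they drift apart as the birth counts become more unequal.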

# Visualize death percentage by year
sns.set_style('darkgrid')  # set the style before plotting so it applies to this figure
sns.barplot(data=df, x='year', y='death_percentage', color='#333333')
plt.ylabel('Death Percentage')
plt.xlabel('Year')
plt.title('Percentage of Childbirth Deaths by Year')
plt.show()

# Find the year with the highest death percentage
highest_year = df.loc[df['death_percentage'].idxmax()]

print(f"The year with the highest death percentage was {int(highest_year['year'])} with a death percentage of {highest_year['death_percentage']:.2f}%.")

# Split dataframe into before and after handwashing began (copies, so new columns can be added safely)
df_before = df_monthly[df_monthly['date'] < '1847-06-01'].copy()
df_after = df_monthly[df_monthly['date'] >= '1847-06-01'].copy()

# Create death proportion column
df_before['death_percentage'] = df_before['deaths'] / df_before['births']
df_after['death_percentage'] = df_after['deaths'] / df_after['births']

# Create copies to play with 
df_before_copy = df_before.copy()
df_after_copy = df_after.copy()

# Flag each period so the two summaries can be told apart once combined
df_before_copy['handwashing_started'] = False
df_after_copy['handwashing_started'] = True

# Find mean proportion before handwashing
df_before_copy = df_before_copy.groupby('handwashing_started')['death_percentage'].mean().round(4) * 100

# Find mean proportion after handwashing
df_after_copy = df_after_copy.groupby('handwashing_started')['death_percentage'].mean().round(4) * 100

# Combine dataframes
monthly_summary = pd.concat([df_before_copy, df_after_copy]).reset_index()

# View results
monthly_summary.head()
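
The same before-and-after summary can also be produced from a single dataframe by adding the flag first and grouping once. This is only a sketch under assumptions: the four monthly rows are invented, and the names `monthly`, `handwashing_start`, and `summary` are placeholders rather than objects from the analysis above.

```python
import pandas as pd

# Hypothetical monthly records; the real values live in monthly_deaths.csv
monthly = pd.DataFrame({
    "date": pd.to_datetime(["1847-04-01", "1847-05-01", "1847-06-01", "1847-07-01"]),
    "births": [300, 300, 250, 250],
    "deaths": [60, 30, 5, 5],
})

handwashing_start = pd.Timestamp("1847-06-01")

summary = (
    monthly.assign(
        # Monthly death rate as a percentage
        death_percentage=lambda d: d["deaths"] / d["births"] * 100,
        # True for months on or after the handwashing decree
        handwashing_started=lambda d: d["date"] >= handwashing_start,
    )
    .groupby("handwashing_started")["death_percentage"]
    .mean()
    .round(2)
)
print(summary)
```

Grouping on one flag column avoids keeping separate copies of each period in sync.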

# Format datetime column for month and year
df_before['month'] = df_before['date'].dt.month
df_before['year'] = df_before['date'].dt.year
df_after['month'] = df_after['date'].dt.month
df_after['year'] = df_after['date'].dt.year

# Select date and proportion columns (copies, so the columns can be overwritten safely)
df_before = df_before[['month', 'year', 'death_percentage']].copy()
df_after = df_after[['month', 'year', 'death_percentage']].copy()

# Convert proportions to percentages
df_before['death_percentage'] = (df_before['death_percentage'] * 100).round(2)
df_after['death_percentage'] = (df_after['death_percentage'] * 100).round(2)

# Preview before df
display(df_before.head())

# Preview after df
display(df_after.head())

# Bootstrap samples of before and after
mean_differences = []
boot_before_means = []
boot_after_means = []

for i in range(3000):
    bootstrap_before = df_before['death_percentage'].sample(frac=1, replace=True)
    bootstrap_after = df_after['death_percentage'].sample(frac=1, replace=True)
    mean_differences.append(bootstrap_before.mean() - bootstrap_after.mean())
    boot_before_means.append(bootstrap_before.mean())
    boot_after_means.append(bootstrap_after.mean())
    
# Convert to pd series
series = pd.Series(mean_differences)

# Calculate the 2.5th and 97.5th percentiles of the mean differences for a 95% confidence interval
lower = np.quantile(series, 0.025).round(2)
upper = np.quantile(series, 0.975).round(2)
# 95% interval bounds of the bootstrap means for each period (two values per array)
clean_before = np.percentile(boot_before_means, [2.5, 97.5])
clean_after = np.percentile(boot_after_means, [2.5, 97.5])

# Append intervals
confidence_interval = [lower, upper]

# Convert to series
confidence_interval = pd.Series(confidence_interval)

# Results
print(confidence_interval)
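
For reference, the percentile bootstrap used above can be written as one small function. This is a dependency-free sketch, not the exact code of this analysis: `bootstrap_diff_ci` is a name I made up, the two lists are invented percentages, and the fixed seed is my own addition so that repeated runs give the same interval.

```python
import random
from statistics import mean


def bootstrap_diff_ci(before, after, n_boot=3000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(before) - mean(after)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size
        resampled_before = rng.choices(before, k=len(before))
        resampled_after = rng.choices(after, k=len(after))
        diffs.append(mean(resampled_before) - mean(resampled_after))
    diffs.sort()
    # Read the interval bounds off the sorted differences
    lower = diffs[int(n_boot * alpha / 2)]
    upper = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical monthly death percentages, not the hospital data
before = [10.0, 12.5, 9.0, 11.0, 13.5, 8.5, 10.5, 12.0]
after = [2.0, 3.5, 1.5, 2.5, 4.0, 1.0, 3.0, 2.2]

lower, upper = bootstrap_diff_ci(before, after)
print(f"95% interval for the drop: {lower:.2f} to {upper:.2f} percentage points")
```

An interval that sits entirely above zero suggests the drop is unlikely to be a resampling artifact.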

# Plot the distribution of bootstrap means
sns.histplot(boot_before_means, bins=20, kde=True, color='blue', label='Before handwashing')
sns.histplot(boot_after_means, bins=20, kde=True, color='red', label='After handwashing')

plt.legend()
plt.xlabel('Death Percentage')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution of Death Percentages')
plt.show()

I am going to run a hypothesis test on the claim that the percentage of childbirth deaths significantly decreased after people started washing their hands.

I will use a two-sample t-test on the monthly death percentages, since the bootstrap distributions of both means follow normal distribution patterns. I will use Welch's version of the test, which does not assume that the two periods have equal variances.

Null Hypothesis: There is no significant difference in death percentage before and after handwashing was introduced.

Alternative Hypothesis: The percentage of deaths is significantly lower with handwashing.

I will be using a 5% significance level (95% confidence).

# Perform a one-sided Welch's t-test on the monthly death percentages
t_stat, p_value = stats.ttest_ind(df_after['death_percentage'], df_before['death_percentage'], equal_var=False, alternative='less')

# Print results
print(f"T-Statistic: {t_stat.round(4)}")
print(f"P-Value: {p_value.round(4)}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: Handwashing significantly lowers the death rate.')
else:
    print('Fail to reject the null hypothesis: Handwashing does not significantly lower the death rate.')

print(f"The average percent of childbirth deaths before handwashing was {df_before['death_percentage'].mean().round(2)}%.")
print(f"The average percent of childbirth deaths after handwashing dropped to {df_after['death_percentage'].mean().round(2)}%.")
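
As a cross-check that does not rely on any normality assumption, the same one-sided question can be put to a permutation test, which is a different technique from the t-test above. The sketch below is illustrative only: `permutation_p_value` is a hypothetical helper, the two lists are invented percentages, and the number of shuffles and the seed are arbitrary choices.

```python
import random
from statistics import mean


def permutation_p_value(after, before, n_perm=5000, seed=0):
    """One-sided p-value for mean(after) < mean(before) under label shuffling."""
    rng = random.Random(seed)
    observed = mean(after) - mean(before)
    pooled = list(after) + list(before)
    n_after = len(after)
    at_least_as_low = 0
    for _ in range(n_perm):
        # Shuffle the period labels by shuffling the pooled values
        rng.shuffle(pooled)
        diff = mean(pooled[:n_after]) - mean(pooled[n_after:])
        if diff <= observed:
            at_least_as_low += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (at_least_as_low + 1) / (n_perm + 1)


# Hypothetical monthly death percentages, not the hospital data
before_pct = [10.0, 12.5, 9.0, 11.0, 13.5, 8.5, 10.5, 12.0]
after_pct = [2.0, 3.5, 1.5, 2.5, 4.0, 1.0, 3.0, 2.2]

p_perm = permutation_p_value(after_pct, before_pct)
print(f"Permutation p-value: {p_perm:.4f}")
```

If the permutation p-value and the t-test p-value lead to the same decision at the 5% level, the conclusion does not hinge on the choice of test.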