Project: Dr. Semmelweis and the Importance of Handwashing

Hungarian physician Dr. Ignaz Semmelweis worked at the Vienna General Hospital with childbed fever patients. Childbed fever is a deadly disease affecting women who have just given birth, and in the early 1840s, as many as 10% of the women giving birth died from it at the Vienna General Hospital. Dr.Semmelweis discovered that it was the contaminated hands of the doctors delivering the babies, and on June 1st, 1847, he decreed that everyone should wash their hands, an unorthodox and controversial request; nobody in Vienna knew about bacteria.

I will reanalyze the data that made Semmelweis discover the importance of handwashing and its impact on the hospital and the number of deaths.

The data is stored as two CSV files within the data folder.

data/yearly_deaths_by_clinic.csv contains the number of women giving birth at the two clinics at the Vienna General Hospital between the years 1841 and 1846.

Column	Description
`year`	Years (1841-1846)
`births`	Number of births
`deaths`	Number of deaths
`clinic`	Clinic 1 or clinic 2

data/monthly_deaths.csv contains data from 'Clinic 1' of the hospital where most deaths occurred.

Column	Description
`date`	Date (YYYY-MM-DD)
`births`	Number of births
`deaths`	Number of deaths

# Imported libraries
import pandas as pd
import matplotlib.pyplot as plt

The first question is what year had the highest yearly proportion of deaths at each clinic.

# Load and inspect the yearly data
yearly = pd.read_csv('data/yearly_deaths_by_clinic.csv')
yearly.head()

# Add proportion_deaths to the DataFrames to gain more insight
yearly["proportion_deaths"] = yearly["deaths"] / yearly["births"]

# Plot the year versus proportion deaths and separate by clinic 
for clinic in yearly['clinic'].unique():
    clinic_data = yearly[yearly['clinic'] == clinic]
    plt.plot(clinic_data['year'], clinic_data['proportion_deaths'], label=clinic)
    
plt.xlabel('Year')
plt.ylabel('Proportion of Deaths')
plt.title('Year vs Proportion of Deaths by Clinic')
plt.legend(title='Clinic')
plt.grid(True)
plt.show()

It is clearly that 1842 has the highest yearly proportion of deaths.

highest_year = 1842

The second question I want answer is what are the mean proportions of deaths before and after handwashing from the monthly data.

# Load and inspect the monthly data
monthly = pd.read_csv("data/monthly_deaths.csv")
monthly.head()

# Add proportion_deaths to the DataFrame
monthly["proportion_deaths"] = monthly["deaths"] / monthly["births"]

# Add the threshold as the date handwashing was introduced
handwashing_start = '1847-06-01'

# Create a boolean column that shows True after the date handwashing was introduced
monthly['handwashing_started'] = monthly['date'] >= handwashing_start

# Group by the new boolean column calculate the mean proportion of deaths
# Reset the index to store the result as a DataFrame
monthly_summary = monthly.groupby('handwashing_started').agg(
    mean_proportion_deaths=('proportion_deaths', 'mean')
).reset_index()

print(monthly_summary)

Finally, I would analyze the difference in the mean monthly proportion of deaths and calculate a 95% confidence interval

# Split the monthly data into before and after handwashing was introduced
before_washing = monthly[monthly["date"] < handwashing_start]
after_washing = monthly[monthly["date"] >= handwashing_start]
before_proportion = before_washing["proportion_deaths"]
after_proportion = after_washing["proportion_deaths"]

# Perform a bootstrap analysis of the reduction of deaths due to handwashing
boot_mean_diff = []
for i in range(3000):
    boot_before = before_proportion.sample(frac=1, replace=True)
    boot_after = after_proportion.sample(frac=1, replace=True)
    boot_mean_diff.append( boot_after.mean() - boot_before.mean() )

# Calculate a 95% confidence interval
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
print(confidence_interval)

To conclude, it is no doubt that handwashing is really important before delivering the babies.