Hungarian physician Dr. Ignaz Semmelweis worked at the Vienna General Hospital with childbed fever patients. Childbed fever is a deadly disease affecting women who have just given birth, and in the early 1840s as many as 10% of the women giving birth at the Vienna General Hospital died from it. Dr. Semmelweis discovered that the cause was the contaminated hands of the doctors delivering the babies, and on June 1st, 1847, he decreed that everyone should wash their hands. This was an unorthodox and controversial request: nobody in Vienna knew about bacteria.

You will reanalyze the data that led Semmelweis to discover the importance of handwashing, and examine its impact on the hospital and the number of deaths.

The data is stored as two CSV files within the data folder.

data/yearly_deaths_by_clinic.csv contains the yearly number of births and deaths at the two clinics of the Vienna General Hospital between 1841 and 1846.

| Column | Description |
| --- | --- |
| year | Years (1841-1846) |
| births | Number of births |
| deaths | Number of deaths |
| clinic | Clinic 1 or clinic 2 |

data/monthly_deaths.csv contains monthly data from 'Clinic 1', the clinic of the hospital where most deaths occurred.

| Column | Description |
| --- | --- |
| date | Date (YYYY-MM-DD) |
| births | Number of births |
| deaths | Number of deaths |

# Imported libraries
import pandas as pd
import matplotlib.pyplot as plt
ydbc = pd.read_csv("data/yearly_deaths_by_clinic.csv")
md = pd.read_csv("data/monthly_deaths.csv")

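# Bind each column name to a global variable of the same name (year, births,
# deaths, clinic, date) so columns can be referenced as ydbc[deaths], md[date], etc.
for df in [ydbc, md]:
    for col in df.columns:
        globals()[col] = col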
ydbc.head()

1. What year had the highest yearly proportion of deaths at each clinic?

import seaborn as sns
deaths_to_births = "deaths_to_births"
ydbc[deaths_to_births] = ydbc[deaths]/ydbc[births]
ydbc = ydbc.sort_values(by=year).reset_index(drop=True)
sns.lineplot(data=ydbc, x=year, y=deaths_to_births, hue=clinic)
plt.show()
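# Year of the single highest proportion of deaths across both clinics
highest_year = int(ydbc.loc[ydbc[deaths_to_births].idxmax(), year])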
print(highest_year)
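
The question asks about each clinic, whereas a single highest_year covers both clinics at once. A minimal sketch of a per-clinic lookup follows. It runs on made-up numbers, not the hospital's records, and the names toy, peak_rows and peak_by_clinic are illustrative assumptions rather than part of the project.

```python
import pandas as pd

# Toy numbers for illustration only; they are NOT the hospital's records.
toy = pd.DataFrame({
    "year":   [1841, 1842, 1843, 1841, 1842, 1843],
    "births": [1000, 1000, 1000, 1000, 1000, 1000],
    "deaths": [50, 120, 80, 30, 40, 60],
    "clinic": ["clinic 1"] * 3 + ["clinic 2"] * 3,
})
toy["deaths_to_births"] = toy["deaths"] / toy["births"]

# For each clinic, idxmax returns the row label of its largest proportion
peak_rows = toy.groupby("clinic")["deaths_to_births"].idxmax()

# Look those rows up to get the year and proportion per clinic
peak_by_clinic = toy.loc[peak_rows, ["clinic", "year", "deaths_to_births"]]
peak_by_clinic = peak_by_clinic.reset_index(drop=True)
print(peak_by_clinic)
```

The same two steps should carry over to ydbc, assuming it has the clinic, year and deaths_to_births columns created above.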

2. Handwashing was introduced on June 1st, 1847. What are the mean proportions of deaths before and after handwashing from the monthly data?

md.head()
md[deaths_to_births]=md[deaths]/md[births]
md[date] = pd.to_datetime(md[date])
md[date].head()
hw_started = pd.to_datetime("1847-06-01")
hw_condition = md[date] >= hw_started
handwashing_started = "handwashing_started"
# Assign a boolean Series, not a DataFrame, to the new column
md[handwashing_started] = hw_condition
# Average the monthly proportion of deaths before and after handwashing began
monthly_summary = md.groupby(handwashing_started)[deaths_to_births].mean().reset_index()
monthly_summary.head()

3. Analyze the difference in the mean monthly proportion of deaths before and after the introduction of handwashing using all of the data and calculate a 95% confidence interval.

The difference is computed as the mean deaths_to_births ratio after handwashing minus the mean before it. Because the ratio tended to be much smaller after handwashing was introduced, both bounds of the interval are expected to be negative, and the lower bound is the one with the larger absolute value.
before = md.loc[~md[handwashing_started], deaths_to_births]
after = md.loc[md[handwashing_started], deaths_to_births]
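# Bootstrap: resample each group with replacement and record the difference in means
mean_diff_list = []
for _ in range(1000):
    a = after.sample(frac=1, replace=True)
    b = before.sample(frac=1, replace=True)
    # Difference in means (after minus before)
    mean_diff = a.mean() - b.mean()
    mean_diff_list.append(mean_diff)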
mean_diff_series = pd.Series(mean_diff_list)
lower = mean_diff_series.quantile(.025)
upper = mean_diff_series.quantile(.975)

confidence_interval = pd.Series([lower, upper])
confidence_interval
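
The loop above draws fresh random samples on every run, so the bounds change slightly each time. A minimal sketch of a seeded, reusable percentile bootstrap follows. The function name bootstrap_mean_diff_ci, its parameters and the toy proportions are assumptions made for illustration. They are not part of the project or its data.

```python
import numpy as np
import pandas as pd

def bootstrap_mean_diff_ci(after, before, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(after) - mean(before)."""
    rng = np.random.default_rng(seed)  # fixed seed makes the result repeatable
    after = np.asarray(after, dtype=float)
    before = np.asarray(before, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size
        a = rng.choice(after, size=after.size, replace=True)
        b = rng.choice(before, size=before.size, replace=True)
        diffs[i] = a.mean() - b.mean()
    # Take the alpha/2 and 1 - alpha/2 quantiles of the resampled differences
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return pd.Series([lower, upper], index=["lower", "upper"])

# Toy proportions for illustration only; they are NOT the hospital's records.
toy_before = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10]
toy_after = [0.02, 0.03, 0.01, 0.02, 0.03, 0.02]
ci = bootstrap_mean_diff_ci(toy_after, toy_before)
print(ci)
```

With the notebook's data, the call would presumably be bootstrap_mean_diff_ci(after, before), where after and before are the two Series defined above.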