Hypothesis Testing in Python
Run the hidden code cell below to import the data used in this course.
# Import pandas and NumPy (np is used in later cells)
import pandas as pd
import numpy as np
# Import the course datasets
republican_votes = pd.read_feather('datasets/repub_votes_potus_08_12.feather')
democrat_votes = pd.read_feather('datasets/dem_votes_potus_12_16.feather')
late_shipments = pd.read_feather('datasets/late_shipments.feather')
stackoverflow = pd.read_feather("datasets/stack_overflow.feather")
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
In pandas, a value's proportion in a categorical DataFrame column can be quickly calculated using the syntax:
prop = (df['col'] == val).mean()
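As a minimal sketch of why this works, the comparison produces a boolean Series, and the mean of booleans is the share of True values. The DataFrame below is made up purely for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({"late": ["Yes", "No", "No", "Yes", "No"]})

# The comparison yields a boolean Series; its mean is the share of True values
prop = (df["late"] == "Yes").mean()
print(prop)  # 0.4 (two of the five rows are "Yes")
```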
# Print the late_shipments dataset
print(late_shipments)
# Calculate the proportion of late shipments
late_prop_samp = (late_shipments['late'] == 'Yes').mean()
# Print the results
print(late_prop_samp)
Since variables have arbitrary ranges and units, we need to standardize them. For example, a hypothesis test that gave different answers if the variables were in Euros instead of US dollars would be of little value. Standardization avoids that.
One standardized value of interest in a hypothesis test is called a z-score. To calculate it, you need three numbers: the sample statistic (point estimate), the hypothesized statistic, and the standard error of the statistic (estimated from the bootstrap distribution).
The sample statistic is available as late_prop_samp.
late_shipments_boot_distn is a bootstrap distribution of the proportion of late shipments, available as a list.
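The notes do not show how the bootstrap distribution was generated. One common way to build such a list is sketched below, under stated assumptions: the toy data, the seed, and the 1000 replicates are all hypothetical choices, not values from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the late_shipments data (values are made up)
toy_shipments = pd.DataFrame({"late": ["Yes"] * 6 + ["No"] * 94})

rng = np.random.default_rng(seed=42)
is_late = (toy_shipments["late"] == "Yes").to_numpy()

toy_boot_distn = []
for _ in range(1000):
    # Resample with replacement, same size as the original sample
    resample = rng.choice(is_late, size=is_late.size, replace=True)
    # Store the proportion of late shipments in this resample
    toy_boot_distn.append(resample.mean())

# The standard deviation of the bootstrap distribution estimates the standard error
toy_std_error = np.std(toy_boot_distn, ddof=1)
print(toy_std_error)
```

Each replicate is one plausible value of the sample proportion, so the spread of the list approximates how much the statistic varies from sample to sample.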
# Hypothesize that the proportion is 6%
late_prop_hyp = 0.06
# Calculate the standard error
std_error = np.std(late_shipments_boot_distn, ddof=1)
# Find z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error
# Print z_score
print(z_score)
In order to determine whether to choose the null hypothesis or the alternative hypothesis, you need to calculate a p-value from the z-score.
You'll now return to the late shipments dataset and the proportion of late shipments.
The null hypothesis, $H_0$, is that the proportion of late shipments is six percent.
The alternative hypothesis, $H_A$, is that the proportion of late shipments is greater than six percent.
The observed sample statistic, late_prop_samp, the hypothesized value, late_prop_hyp (6%), and the bootstrap standard error, std_error, are available. norm from scipy.stats has also been loaded without an alias.
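The direction of the alternative hypothesis decides which tail of the standard normal distribution gives the p-value. The sketch below shows all three cases side by side. The z-score is a made-up number, not the one from the late shipments data.

```python
from scipy.stats import norm

# Made-up z-score for illustration only
z_score = 1.5

# Right-tailed: the alternative says "greater than" the hypothesized value
p_right = 1 - norm.cdf(z_score)
# Left-tailed: the alternative says "less than"
p_left = norm.cdf(z_score)
# Two-tailed: the alternative says "different from"
p_two = 2 * (1 - norm.cdf(abs(z_score)))

print(p_right, p_left, p_two)
```

Because the alternative here is "greater than six percent", the right-tailed form is the one that applies.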
# Calculate the z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error
# Calculate the p-value
p_value = 1 - norm.cdf(z_score, loc=0, scale=1)
# Print the p-value
print(p_value)
If you give a single estimate of a sample statistic, you are bound to be wrong by some amount. For example, the hypothesized proportion of late shipments was 6%. Even if evidence suggests the null hypothesis that the proportion of late shipments is equal to this, for any new sample of shipments, the proportion is likely to be a little different due to sampling variability. Consequently, it's a good idea to state a confidence interval. That is, you say, "we are 95% 'confident' that the proportion of late shipments is between A and B" (for some value of A and B).
# Calculate 95% confidence interval using quantile method
lower = np.quantile(late_shipments_boot_distn, 0.025)
upper = np.quantile(late_shipments_boot_distn, 0.975)
# Print the confidence interval
print((lower, upper))
The hypothesis test for determining if there is a difference between the means of two populations uses a different type of test statistic from the z-scores you saw in Chapter 1. It's called "t", and it can be calculated from three values from each sample: the mean, the standard deviation, and the sample size.
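The notes do not show where the per-sample values come from. One possible way to obtain them is a grouped aggregation, sketched below. The column name weight_kilograms and all the numbers are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: the column names and values are assumptions
toy = pd.DataFrame({
    "late": ["Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "weight_kilograms": [2700.0, 2500.0, 3100.0, 1800.0, 2100.0, 1600.0, 2300.0],
})

# Mean, sample standard deviation (pandas uses ddof=1 by default), and count per group
summary = toy.groupby("late")["weight_kilograms"].agg(["mean", "std", "count"])

# Unpack the three values for each group
xbar_no, s_no, n_no = summary.loc["No"]
xbar_yes, s_yes, n_yes = summary.loc["Yes"]
print(summary)
```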
# Calculate the numerator of the test statistic
numerator = xbar_yes - xbar_no
# Calculate the denominator of the test statistic
denominator = np.sqrt(s_no**2/n_no + s_yes**2/n_yes)
# Calculate the test statistic
t_stat = numerator / denominator
# Print the test statistic
print(t_stat)
Previously, you calculated the test statistic for the two-sample problem of whether the mean weight of shipments is smaller for shipments that weren't late (late == "No") compared to shipments that were late (late == "Yes"). In order to make decisions about it, you need to transform the test statistic with a cumulative distribution function to get a p-value.
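A minimal sketch of that transformation follows, using the t-distribution CDF from scipy.stats. The summary statistics are made up, and the degrees of freedom (total observations minus two) are an assumption that is not stated in these notes. The difference is taken here as "not late" minus "late", so the left tail is the relevant one; if the difference is taken the other way round, the sign of the statistic flips and the right tail applies instead.

```python
import numpy as np
from scipy.stats import t

# Hypothetical summary statistics (made up for illustration)
xbar_no, s_no, n_no = 1900.0, 3100.0, 500
xbar_yes, s_yes, n_yes = 2700.0, 2500.0, 60

# Difference taken as "not late" minus "late", so a smaller mean for
# not-late shipments gives a negative statistic
t_stat = (xbar_no - xbar_yes) / np.sqrt(s_no**2 / n_no + s_yes**2 / n_yes)

# Assumed degrees of freedom: total number of observations minus two
degrees_of_freedom = n_no + n_yes - 2

# Left-tailed test: probability of a statistic at least this negative
p_value = t.cdf(t_stat, df=degrees_of_freedom)
print(t_stat, p_value)
```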