Hypothesis Testing in Python
Run the hidden code cell below to import the data used in this course.
# Import pandas
import pandas as pd
# Import the course datasets
republican_votes = pd.read_feather('datasets/repub_votes_potus_08_12.feather')
sample_dem_data = pd.read_feather('datasets/dem_votes_potus_12_16.feather')
late_shipments = pd.read_feather('datasets/late_shipments.feather')
stackoverflow = pd.read_feather("datasets/stack_overflow.feather")
Calculating the sample mean
The late_shipments dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. The late column denotes whether or not the part was delivered late. A value of "Yes" means that the part was delivered late, and a value of "No" means the part was delivered on time.
You'll begin your analysis by calculating a point estimate (or sample statistic), namely the proportion of late shipments.
In pandas, a value's proportion in a categorical DataFrame column can be quickly calculated using the syntax:
prop = (df['col'] == val).mean()
late_shipments is available, and pandas is loaded as pd.
# Calculate the proportion of late shipments
late_prop_samp = (late_shipments['late'] == 'Yes').mean()
# Print the results
print(late_prop_samp)
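The reason the syntax above works is that the comparison produces a boolean column, and the mean of booleans is the fraction of True values. Here is a minimal sketch on a hypothetical toy DataFrame (toy_shipments and its values are made up for illustration, not taken from the course data):

```python
import pandas as pd

# Hypothetical toy data: 8 deliveries, 2 of them late
toy_shipments = pd.DataFrame(
    {"late": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"]}
)

# The comparison yields a boolean Series (True where the part was late)
is_late = toy_shipments["late"] == "Yes"

# True counts as 1 and False as 0, so the mean is the proportion of late parts
toy_late_prop = is_late.mean()

# value_counts(normalize=True) gives the proportion of every category at once
toy_props = toy_shipments["late"].value_counts(normalize=True)

print(toy_late_prop)
print(toy_props)
```

With 2 late deliveries out of 8, both approaches report a late proportion of 0.25.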
Calculating a z-score
Since variables have arbitrary ranges and units, we need to standardize them. For example, a hypothesis test that gave different answers if the variables were in Euros instead of US dollars would be of little value. Standardization avoids that.
One standardized value of interest in a hypothesis test is called a z-score. To calculate it, you need three numbers: the sample statistic (point estimate), the hypothesized statistic, and the standard error of the statistic (estimated from the bootstrap distribution).
The sample statistic is available as late_prop_samp.
late_shipments_boot_distn is a bootstrap distribution of the proportion of late shipments, available as a list.
pandas and numpy are loaded with their usual aliases.
Instructions:
Hypothesize that the proportion of late shipments is 6%.
Calculate the standard error from the standard deviation of the bootstrap distribution.
Calculate the z-score.
import numpy as np
# late_shipments_boot_distn was not available, creating as in Sampling in Python
late_ship_sample = late_shipments.sample(n=500)
late_shipments_boot_distn = []
for i in range(1000):
    sample = late_ship_sample.sample(frac=1, replace=True)
    late_shipments_boot_distn.append(
        (sample['late'] == 'Yes').mean()
    )
np.mean(late_shipments_boot_distn)
# Hypothesize that the proportion is 6%
late_prop_hyp = 0.06
# Calculate the standard error (sample standard deviation of the bootstrap distribution)
std_error = np.std(late_shipments_boot_distn, ddof=1)
# Find z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error
# Print z_score
print(z_score)
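The whole calculation can be sketched end to end on synthetic data. Everything named toy_* below, the helper z_score_from_bootstrap, and the 7% late rate are assumptions for illustration, not course objects. The sketch uses ddof=1 for the sample standard deviation, which changes the result only negligibly at 1000 replicates.

```python
import numpy as np

def z_score_from_bootstrap(sample_stat, hypothesized_stat, boot_distn):
    """Standardize a sample statistic using the bootstrap standard error."""
    # The standard deviation of the bootstrap distribution estimates the standard error
    std_error = np.std(boot_distn, ddof=1)
    return (sample_stat - hypothesized_stat) / std_error

# Synthetic stand-in for the data: 500 deliveries, 35 of them late (7%)
rng = np.random.default_rng(42)
toy_late = np.array([1] * 35 + [0] * 465)
toy_prop = toy_late.mean()

# Bootstrap: resample with replacement, same size as the sample, 1000 times
toy_boot_distn = [
    rng.choice(toy_late, size=toy_late.size, replace=True).mean()
    for _ in range(1000)
]

# Hypothesized proportion of 6%, as in the exercise
toy_z = z_score_from_bootstrap(toy_prop, 0.06, toy_boot_distn)
print(toy_z)
```

The z-score is unitless: it expresses the gap between the sample statistic and the hypothesized value as a number of standard errors.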
Criminal trials vs. hypothesis testing
Either H_A or H_0 is true (not both).
Initially, H_0 is assumed to be true.
The test ends in either "reject H_0" or "fail to reject H_0".
If the evidence from the sample is "significant" that H_A is true, reject H_0, else choose H_0.
Significance level is "beyond a reasonable doubt" for hypothesis testing.
p-values: probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true
from scipy.stats import norm
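# Calculate the p-value for a right-tailed test (alternative: greater)
p_value = 1 - norm.cdf(z_score, loc=0, scale=1)
# Print the p-value (the result is close to the one from the lecture).
# It is the probability of a z-score at least this large if the true proportion is 6%,
# not the chance that the proportion is 6%. It's large, so we fail to reject H_0.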
print(p_value)
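Which tail of the normal distribution you use depends on the alternative hypothesis. Here is a minimal sketch covering the three cases. The helper name p_value_from_z and the example z-score of 1.0 are assumptions made for illustration.

```python
from scipy.stats import norm

def p_value_from_z(z, alternative="greater"):
    """Return the p-value of a z-score under a standard normal null distribution."""
    if alternative == "greater":    # right-tailed test
        return norm.sf(z)           # survival function, equal to 1 - norm.cdf(z)
    if alternative == "less":       # left-tailed test
        return norm.cdf(z)
    if alternative == "two-sided":  # both tails, by symmetry of the normal curve
        return 2 * norm.sf(abs(z))
    raise ValueError(f"unknown alternative: {alternative!r}")

example_z = 1.0  # illustrative value only
for alt in ("greater", "less", "two-sided"):
    print(alt, p_value_from_z(example_z, alt))
```

For a z-score of 1.0, the right-tailed p-value is about 0.159, the left-tailed about 0.841, and the two-sided about 0.317.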
Calculating a confidence interval
If you give a single estimate of a sample statistic, you are bound to be wrong by some amount. For example, the hypothesized proportion of late shipments was 6%. Even if the evidence is consistent with the null hypothesis that the proportion of late shipments equals 6%, the proportion in any new sample of shipments is likely to be a little different due to sampling variability. Consequently, it's a good idea to state a confidence interval. That is, you say, "we are 95% 'confident' that the proportion of late shipments is between A and B" (for some values of A and B).
Sampling in Python demonstrated two methods for calculating confidence intervals. Here, you'll use quantiles of the bootstrap distribution to calculate the confidence interval.
late_prop_samp and late_shipments_boot_distn are available; pandas and numpy are loaded with their usual aliases.
# Calculate 95% confidence interval using quantile method
lower = np.quantile(late_shipments_boot_distn, 0.025)
upper = np.quantile(late_shipments_boot_distn, 0.975)
# Print the confidence interval (6% is within this interval, so we failed to reject H_0)
print((lower, upper))
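For comparison, here is a sketch of the quantile method next to a standard error method, which places a normal curve on the point estimate with the bootstrap standard error as its scale. The helper names and the synthetic bootstrap distribution are assumptions for illustration. Whether this matches the second method from Sampling in Python exactly is also an assumption.

```python
import numpy as np
from scipy.stats import norm

def ci_quantile(boot_distn, confidence=0.95):
    """Confidence interval from the quantiles of a bootstrap distribution."""
    alpha = 1 - confidence
    return np.quantile(boot_distn, alpha / 2), np.quantile(boot_distn, 1 - alpha / 2)

def ci_standard_error(point_estimate, boot_distn, confidence=0.95):
    """Confidence interval from a normal curve centered on the point estimate."""
    alpha = 1 - confidence
    std_error = np.std(boot_distn, ddof=1)
    lower = norm.ppf(alpha / 2, loc=point_estimate, scale=std_error)
    upper = norm.ppf(1 - alpha / 2, loc=point_estimate, scale=std_error)
    return lower, upper

# Synthetic stand-in for a bootstrap distribution of a proportion
rng = np.random.default_rng(1)
toy_boot_distn = rng.normal(loc=0.06, scale=0.01, size=5000)
toy_point_estimate = 0.06

ci_q = ci_quantile(toy_boot_distn)
ci_se = ci_standard_error(toy_point_estimate, toy_boot_distn)
print(ci_q)
print(ci_se)
```

When the bootstrap distribution is roughly normal, as it is here by construction, the two intervals nearly coincide. They can differ when the distribution is skewed.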
Examples:
The null hypothesis is that the population mean for the two groups is the same, and the alternative hypothesis is that the population mean for users who started coding as children is greater than for users who started coding as adults.
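Written symbolically, the pair of hypotheses above might look like this, where the symbols \mu_{\text{child}} and \mu_{\text{adult}} are assumed names for the population means of the two groups:

```latex
\begin{align*}
H_0 &: \mu_{\text{child}} - \mu_{\text{adult}} = 0 \\
H_A &: \mu_{\text{child}} - \mu_{\text{adult}} > 0
\end{align*}
```

Because the alternative uses "greater than", this would be a right-tailed test.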
Two-Sample and ANOVA Tests
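Two sample mean test statistic
The hypothesis test for determining if there is a difference between the means of two populations uses a different type of test statistic from the z-scores you saw in Chapter 1. It's called "t", and it can be calculated from three values from each sample (the mean, the standard deviation, and the sample size) using this equation, where the subscripts name the two groups described below:
$$t = \frac{\bar{x}_{\text{no}} - \bar{x}_{\text{yes}}}{\sqrt{\dfrac{s_{\text{no}}^2}{n_{\text{no}}} + \dfrac{s_{\text{yes}}^2}{n_{\text{yes}}}}}$$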
While trying to determine why some shipments are late, you may wonder if the weight of the shipments that were on time is less than the weight of the shipments that were late. The late_shipments dataset has been split into a "yes" group, where late == "Yes" and a "no" group where late == "No". The weight of the shipment is given in the weight_kilograms variable.
The sample means for the two groups are available as xbar_no and xbar_yes. The sample standard deviations are s_no and s_yes. The sample sizes are n_no and n_yes. numpy is also loaded as np.
Instructions:
Calculate the numerator of the test statistic.
Calculate the denominator of the test statistic.
Use those two numbers to calculate the test statistic.
# Calculate the numerator of the test statistic
numerator = xbar_no - xbar_yes
# Calculate the denominator of the test statistic
denominator = np.sqrt(s_no ** 2 / n_no + s_yes ** 2 / n_yes)
# Calculate the test statistic
t_stat = numerator / denominator
# Print the test statistic
print(t_stat)
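Since xbar_no, s_no, n_no and their "yes" counterparts are preloaded in the exercise, here is a self-contained sketch that computes the same quantities from raw values and cross-checks the statistic against scipy. The toy_* arrays are synthetic, not the late_shipments weights. The degrees of freedom n_no + n_yes - 2 used for the left-tailed p-value are a simplifying assumption, since Welch's test uses a different approximation.

```python
import numpy as np
from scipy.stats import t, ttest_ind

# Synthetic stand-ins for the weights of on-time ("no") and late ("yes") shipments
rng = np.random.default_rng(0)
toy_weights_no = rng.normal(loc=100, scale=20, size=200)
toy_weights_yes = rng.normal(loc=120, scale=25, size=60)

# Three values from each sample: mean, sample standard deviation, and size
xbar_no_toy, xbar_yes_toy = toy_weights_no.mean(), toy_weights_yes.mean()
s_no_toy, s_yes_toy = toy_weights_no.std(ddof=1), toy_weights_yes.std(ddof=1)
n_no_toy, n_yes_toy = toy_weights_no.size, toy_weights_yes.size

# Same numerator and denominator as in the exercise
t_stat_toy = (xbar_no_toy - xbar_yes_toy) / np.sqrt(
    s_no_toy ** 2 / n_no_toy + s_yes_toy ** 2 / n_yes_toy
)

# Cross-check: the unequal-variance (Welch) t-test in scipy uses the same statistic
welch = ttest_ind(toy_weights_no, toy_weights_yes, equal_var=False)

# Left-tailed p-value (alternative: on-time shipments weigh less than late ones)
dof = n_no_toy + n_yes_toy - 2  # assumed degrees of freedom
p_value_toy = t.cdf(t_stat_toy, df=dof)

print(t_stat_toy, welch.statistic)
print(p_value_toy)
```

A strongly negative t-statistic with a small left-tailed p-value would support the idea that on-time shipments are lighter on average.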