Hypothesis Testing in Healthcare: Drug Safety

A pharmaceutical company, GlobalXYZ, has just completed a randomized controlled drug trial. To promote transparency and reproducibility of the drug's outcome, the company has shared the dataset with your organization, a non-profit that focuses primarily on drug safety.

The dataset provided contains records of five adverse effects, demographic data, vital signs, etc. Your organization is primarily interested in the drug's adverse reactions and wants to know whether they occur, if at all, in significant proportions. It has asked you to explore the data and answer some questions.

The dataset drug_safety.csv was obtained from Hbiostat courtesy of the Vanderbilt University Department of Biostatistics. It records five adverse effects: headache, abdominal pain, dyspepsia, upper respiratory infection, and chronic obstructive airway disease (COAD). It also includes demographic data, vital signs, lab measures, etc. The ratio of drug observations to placebo observations is 2 to 1.

For this project, the dataset has been modified to record the presence or absence of adverse effects (adverse_effects) and the number of adverse effects experienced by a single individual (num_effects).

The columns in the modified dataset are:

| Column | Description |
|---|---|
| sex | The gender of the individual |
| age | The age of the individual |
| week | The week of the drug testing |
| trx | The treatment (Drug) and control (Placebo) groups |
| wbc | The count of white blood cells |
| rbc | The count of red blood cells |
| adverse_effects | The presence of at least a single adverse effect |
| num_effects | The number of adverse effects experienced by a single individual |

The original dataset can be found here.

Your organization has asked you to explore and answer some questions from the data collected. See the project instructions.

# Import packages
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import pingouin
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
drug_safety = pd.read_csv("drug_safety.csv")

# Explore the dataset to gain further understanding
print(drug_safety.head(10))
print(drug_safety.info())
# Count the adverse_effects column values for each trx group
adv_eff_by_trx = drug_safety.groupby('trx')['adverse_effects'].value_counts()
adv_eff_by_trx
# Compute total rows in each group
adv_eff_by_trx_totals = adv_eff_by_trx.groupby('trx').sum()
adv_eff_by_trx_totals
# Create an array of the "Yes" counts for each group
yeses = [adv_eff_by_trx['Drug']['Yes'], adv_eff_by_trx['Placebo']['Yes']]
yeses
# Create an array of the total number of rows in each group
n = [adv_eff_by_trx_totals['Drug'], adv_eff_by_trx_totals['Placebo']]
n
# We perform a two-sided z-test here to compare the proportion of people who experienced adverse effects
# between the Drug and Placebo groups. The test checks if there is a statistically significant difference
# in the rates of adverse effects between the two groups, without assuming in advance which group will have
# a higher or lower rate. In simple terms, it helps us find out if the drug changes the chance of adverse
# effects compared to placebo, in either direction (higher or lower).

why did i have to do a proportions ztest? please explain it to me simply and with steps

Why Did We Do a Proportions Z-Test?

A proportions z-test helps us compare the rates (proportions) of a certain outcome between two groups. In your case, you wanted to know if the proportion of people who experienced adverse effects is different between the Drug and Placebo groups.

Simple Steps:
  1. Count the "Yes" responses: For each group (Drug and Placebo), count how many people had adverse effects.
  2. Count the total in each group: Find out how many people are in each group overall.
  3. Calculate the proportions: For each group, divide the number of "Yes" responses by the total number in that group.
  4. Run the z-test: The z-test checks if the difference between these two proportions is bigger than what we would expect by random chance.
  5. Interpret the result: If the test says the difference is statistically significant, it means the drug might really cause more (or fewer) adverse effects than the placebo.
In summary:

You did a proportions z-test to see if the rate of adverse effects is truly different between the Drug and Placebo groups, or if any difference you see could just be due to random variation.
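
To make the steps above concrete, here is a minimal sketch that carries out a pooled, two-sided two-proportion z-test by hand using only Python's standard library. The counts are hypothetical placeholders, not values from drug_safety.csv, and the helper name two_proportion_ztest is made up for illustration. In the notebook, proportions_ztest from statsmodels does this work.

```python
import math

def two_proportion_ztest(successes, totals):
    """Pooled, two-sided z-test for the difference between two proportions."""
    (x1, x2), (n1, n2) = successes, totals
    # Step 3: proportion of "Yes" responses in each group
    p1, p2 = x1 / n1, x2 / n2
    # Under the null hypothesis both groups share one common proportion
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error of the difference in proportions under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Step 4: how many standard errors apart the two proportions are
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (assumption for illustration only)
hypothetical_yeses = [100, 45]    # "Yes" counts for Drug, Placebo
hypothetical_n = [1000, 500]      # group sizes for Drug, Placebo

z_stat, p_value = two_proportion_ztest(hypothetical_yeses, hypothetical_n)
print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")
```

Step 5 then amounts to comparing p_value with a chosen significance level such as 0.05: a p-value above that level means the observed difference is consistent with random variation.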

# Perform a two-sided z-test on the two proportions
two_sample_results = proportions_ztest(yeses, n)

# Store the p-value
two_sample_p_value = two_sample_results[1]
two_sample_p_value
# We do this to check if the number of adverse effects ("num_effects") is related to the treatment group ("trx").
# If they are independent, it means the treatment does not affect the number of adverse effects.
# If they are not independent, it suggests the treatment group may influence how many adverse effects occur.
# The chi-squared test of independence helps us answer this question statistically.

num_effects_groups = pingouin.chi2_independence(
    data=drug_safety, x="num_effects", y="trx")
num_effects_groups
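
As a rough illustration of what a chi-squared test of independence computes, the sketch below works through a small contingency table by hand. The counts are hypothetical placeholders (rows stand for 0, 1, and 2 adverse effects; columns for Drug and Placebo), not values from drug_safety.csv. The p-value line relies on the closed form of the chi-squared distribution that holds only for 2 degrees of freedom; pingouin handles the general case.

```python
import math

# Hypothetical observed counts (assumption for illustration only).
# Rows: num_effects = 0, 1, 2. Columns: Drug, Placebo.
observed = [
    [880, 460],
    [100, 35],
    [20, 5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts if num_effects and trx were independent
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# With exactly 2 degrees of freedom the upper-tail probability is exp(-x / 2).
# This shortcut is NOT valid for other values of dof.
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.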
# Explanation:
# This step is about getting the p-value from the results of the chi-squared test we ran earlier.
# The p-value tells us whether there is a statistically significant relationship between the treatment group ("trx")
# and the number of adverse effects ("num_effects").
# A small p-value (typically less than 0.05) means there is evidence that the treatment group and number of effects are related.

# What the code does:
# - 'num_effects_groups' is a tuple of three DataFrames returned by pingouin.chi2_independence:
#   the expected counts, the observed counts, and the test statistics.
# - Index 2 selects the third element of that tuple, the DataFrame of test statistics.
# - ["pval"] selects the p-value column, and [0] takes its first row, which holds the Pearson chi-squared test.
# - This p-value is stored in the variable 'num_effects_p_value', and then displayed as the output of the cell.

num_effects_p_value = num_effects_groups[2]["pval"][0]
num_effects_p_value

what does num_effects_groups[2] mean?

The expression num_effects_groups[2] accesses the third element (Python uses zero-based indexing) of the object num_effects_groups.

In your code, num_effects_groups holds the return value of pingouin.chi2_independence, which is a tuple of three DataFrames: the expected counts under independence, the observed counts, and a table of test statistics. It is not a single DataFrame, so the index 2 picks an element of the tuple rather than a row.

So, num_effects_groups[2] retrieves the DataFrame of test statistics. Your code then reads its "pval" column and takes the first row with [0], which gives the p-value of the Pearson chi-squared test.
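
The indexing pattern is easier to see with tuple unpacking. The sketch below uses a hypothetical stand-in function, fake_chi2_independence, that only mimics the three-part return shape (expected, observed, stats) of pingouin.chi2_independence. Plain dictionaries of lists stand in for DataFrames, and every number in it is a made-up placeholder.

```python
def fake_chi2_independence():
    """Hypothetical stand-in mimicking the (expected, observed, stats) return shape."""
    # Placeholder values only; dicts of lists stand in for DataFrames.
    expected = {"Drug": [90.0, 10.0], "Placebo": [45.0, 5.0]}
    observed = {"Drug": [92, 8], "Placebo": [43, 7]}
    stats = {"test": ["pearson", "cressie-read"], "pval": [0.25, 0.26]}
    return expected, observed, stats

result = fake_chi2_independence()

# Access by position, as in the notebook: element 2 is the stats table
by_index = result[2]["pval"][0]

# Same value, but with each part of the tuple given a descriptive name
expected, observed, stats = result
by_name = stats["pval"][0]

print(by_index == by_name)
```

Unpacking into named variables makes it clear which of the three results is being read, at the cost of two names that may go unused.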

If you want to see what num_effects_groups contains and its structure, you can display its contents in a code cell.

# Display the type and a summary of num_effects_groups to better understand its structure
print(type(num_effects_groups))
try:
    # Try to display the first few elements if it's a list or similar
    for i, item in enumerate(num_effects_groups):
        print(f"Element {i}: type={type(item)}")
        if hasattr(item, 'head'):
            display(item.head())
        else:
            print(item)
        if i >= 2:
            break
except Exception as e:
    print(f"Error displaying num_effects_groups: {e}")