Project: Data-Driven Decisions with A/B Testing
As a Data Scientist at a leading online travel agency, you’ve been tasked with evaluating the impact of a new search ranking algorithm designed to improve conversion rates. The Product team is considering a full rollout, but only if the experiment shows a clear positive effect on the conversion rate and does not lead to a longer time to book.
They have shared two A/B test datasets: session-level booking data ("sessions_data.csv") and the user-level control/variant assignment ("users_data.csv"). Your job is to analyze and interpret the results, determine whether the new ranking system delivers a statistically significant improvement, and provide a clear, data-driven recommendation.
sessions_data.csv

| column | data type | description |
|---|---|---|
| session_id | string | Unique session identifier (unique for each row) |
| user_id | string | Unique user identifier (non-logged-in users have missing user_id values; each user can have multiple sessions) |
| session_start_timestamp | string | When the session started |
| booking_timestamp | string | When a booking was made (missing if no booking was made during the session) |
| time_to_booking | float | Time from session start to booking, in minutes (missing if no booking was made during the session) |
| conversion | integer | New column to create: whether the session ended with a booking (0 if booking_timestamp or time_to_booking is null, otherwise 1; see the sketch after this table) |
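A minimal sketch of how this derived column could be created with pandas, assuming the file has been loaded into a DataFrame named sessions (a hypothetical name; the analysis below works on the merged table and uses booking_timestamp alone, which is equivalent because both columns are missing together exactly when no booking was made):

# Hypothetical sketch: conversion is 1 only when both booking fields are present
sessions['conversion'] = (
    sessions['booking_timestamp'].notnull()
    & sessions['time_to_booking'].notnull()
).astype(int)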
users_data.csv

| column | data type | description |
|---|---|---|
| user_id | string | Unique user identifier (only logged-in users appear in this table) |
| experiment_group | string | Control/variant split for the experiment (expected to be an equal 50/50 split) |
The full-on (full rollout) criteria are the following:
- Primary metric (conversion): the effect must be statistically significant and show a positive effect (an increase).
- Guardrail metric (time_to_booking): the effect must either be statistically insignificant or show a positive effect (a decrease).
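Translated into code, the two criteria reduce to a pair of boolean checks. A compact, illustrative helper (the function name and arguments are hypothetical; the analysis below inlines the same logic):

def full_on_decision(pval_primary, effect_primary, pval_guardrail, effect_guardrail, alpha=0.10):
    """Return True only if both full-on criteria hold (illustrative helper)."""
    primary_ok = (pval_primary < alpha) and (effect_primary > 0)        # significant increase
    guardrail_ok = (pval_guardrail > alpha) or (effect_guardrail <= 0)  # guardrail not harmed
    return primary_ok and guardrail_ok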
import pandas as pd
import numpy as np
from scipy.stats import chisquare
from pingouin import ttest
from statsmodels.stats.proportion import proportions_ztest

sessions = pd.read_csv('sessions_data.csv')
users = pd.read_csv('users_data.csv')

sessions.sample(5)
users.sample(5)
sessions.shape
users.shape

# Inner join keeps only sessions from logged-in users listed in users_data.csv
sessions_x_users = sessions.merge(users, on="user_id", how="inner")
sessions_x_users.info()

# Assign 0 to conversion where booking_timestamp is null and 1 where not null
sessions_x_users["conversion"] = sessions_x_users["booking_timestamp"].notnull().astype(int)

confidence_level = 0.90  # Set the pre-defined confidence level (90%)
alpha = 1 - confidence_level  # Significance level for hypothesis tests

# SAMPLE RATIO MISMATCH TEST
# Check whether traffic is split evenly between the experiment groups (a basic sanity check for sample ratio mismatch)
groups_count = sessions_x_users['experiment_group'].value_counts()
print(groups_count)
n = sessions_x_users.shape[0] # Total sample size
srm_chi2_stat, srm_chi2_pval = chisquare(f_obs = groups_count, f_exp = (n/2, n/2))
srm_chi2_pval = round(srm_chi2_pval, 4)
print(f'\nSRM\np-value: {srm_chi2_pval}')  # If p < alpha, there's likely a sampling issue
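# For intuition, the statistic chisquare computes here is just the squared
# deviation of the observed group counts from a perfect 50/50 split.
# Illustrative by-hand check, reusing the variables above (not part of the
# original analysis):
observed = groups_count.values  # observations per experiment group
expected = n / 2                # expected count under an even split
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(f'Manual chi-square statistic: {chi2_manual:.4f}')  # should match srm_chi2_stat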
# EFFECT ANALYSIS - PRIMARY METRIC
# Compute success counts and sample sizes for each group
success_counts = sessions_x_users.groupby('experiment_group')['conversion'].sum().loc[['control', 'variant']]
sample_sizes = sessions_x_users['experiment_group'].value_counts().loc[['control', 'variant']]
# Run Z-test for proportions (binary conversion metric)
zstat_primary, pval_primary = proportions_ztest(
    success_counts,
    sample_sizes,
    alternative='two-sided',
)
pval_primary = round(pval_primary, 4)
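# The z-test gives a p-value but no interval estimate. If a confidence interval
# for the absolute difference in conversion rates is also wanted, statsmodels
# provides confint_proportions_2indep (in recent versions). Illustrative sketch,
# not part of the original analysis:
from statsmodels.stats.proportion import confint_proportions_2indep

ci_low, ci_upp = confint_proportions_2indep(
    success_counts['variant'], sample_sizes['variant'],
    success_counts['control'], sample_sizes['control'],
    compare='diff',  # CI for p_variant - p_control
    alpha=alpha,     # matches the pre-defined 90% confidence level
)
print(f'90% CI for conversion rate difference: [{ci_low:.4f}, {ci_upp:.4f}]')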
# EFFECT SIZE FUNCTION
def estimate_effect_size(df: pd.DataFrame, metric: str) -> float:
    """
    Calculate the relative effect size (variant vs. control).

    Parameters:
    - df (pd.DataFrame): data with experiment_group ('control', 'variant') and metric columns.
    - metric (str): name of the metric column.

    Returns:
    - effect_size (float): relative difference of the variant mean vs. the control mean.
    """
    avg_metric_per_group = df.groupby('experiment_group')[metric].mean()
    effect_size = avg_metric_per_group['variant'] / avg_metric_per_group['control'] - 1
    return effect_size
# Estimate effect size for the conversion metric
effect_size_primary = estimate_effect_size(sessions_x_users, 'conversion')
effect_size_primary = round(effect_size_primary, 4)
print(f'\nPrimary metric\np-value: {pval_primary:.4f} | effect size: {effect_size_primary:.4f}')
# EFFECT ANALYSIS - GUARDRAIL METRIC
# T-test on time_to_booking for control vs variant
# (defined only for sessions that ended in a booking, so missing values are dropped)
stats_guardrail = ttest(
    sessions_x_users.loc[sessions_x_users['experiment_group'] == 'control', 'time_to_booking'].dropna(),
    sessions_x_users.loc[sessions_x_users['experiment_group'] == 'variant', 'time_to_booking'].dropna(),
    alternative='two-sided',
)
pval_guardrail, tstat_guardrail = stats_guardrail['p-val'].values[0], stats_guardrail['T'].values[0]
pval_guardrail = round(pval_guardrail, 4)
# Estimate effect size for the guardrail metric
effect_size_guardrail = estimate_effect_size(sessions_x_users, 'time_to_booking')
effect_size_guardrail = round(effect_size_guardrail, 4)
print(f'\nGuardrail\np-value: {pval_guardrail} | effect size: {effect_size_guardrail}')
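# time_to_booking is typically right-skewed, so a nonparametric Mann-Whitney U
# test is a common robustness companion to the t-test. Optional sketch; not
# part of the decision criteria below:
from pingouin import mwu

mwu_guardrail = mwu(
    sessions_x_users.loc[sessions_x_users['experiment_group'] == 'control', 'time_to_booking'].dropna(),
    sessions_x_users.loc[sessions_x_users['experiment_group'] == 'variant', 'time_to_booking'].dropna(),
    alternative='two-sided',
)
print(f"Mann-Whitney p-value: {mwu_guardrail['p-val'].values[0]:.4f}")  # compare with pval_guardrail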
# DECISION
# Primary metric must be statistically significant and show a positive effect (increase)
criteria_full_on_primary = (pval_primary < alpha) and (effect_size_primary > 0)
# Guardrail must either be statistically insignificant or show a positive effect (decrease)
criteria_full_on_guardrail = (pval_guardrail > alpha) or (effect_size_guardrail <= 0)
# Final launch decision based on both metrics
if criteria_full_on_primary and criteria_full_on_guardrail:
    decision_full_on = 'Yes'
    print('\nThe experiment results are significantly positive and the guardrail metric was not harmed; we are going full on!')
else:
    decision_full_on = 'No'
    print('\nThe experiment results are inconclusive or the guardrail metric was harmed; we are pulling back!')