Welcome to the A/B Testing Case Study!
Before diving into the data and uncovering insights, we will introduce the product, the business problem, and the metrics. Afterwards, we will briefly go through the assumptions, hypotheses, and data preparation steps, since a carefully designed A/B test sets us up for success. Finally, we will explore the data collected from the A/B test, perform statistical analysis, and draw conclusions about the impact of the change being tested, then decide whether to launch the feature and weigh the potential metric trade-offs.
It is important to highlight that the product mentioned in this project is fictitious. In addition to being a Data Scientist, I am a certified life coach and a big advocate of more affordable and accessible mental health. Therefore, I defined my ideal mindfulness app with its features, functionalities, and the business problems that might come with it.
Now, let's move to the fun part!
1. Prior to A/B Testing
1.1. Background of the Product
This is "SSHWellness". It is a mindfulness app with a web version that offers various products, including personalized meditations, breathing exercises, sleep routines, education, 1:1 life coaching sessions, and discounts on mindfulness books, diaries, and events.
SSHWellness helps users find a product that serves their needs, which is its value proposition. The main revenue stream is the number of subscribers per year, since it is a subscription-based business with a 7-day free trial. There is also special pricing for Students, Educators, Veterans, and Low-Income families.
1.2. User Journey
New User → Paywall Displayed → Trial Started → Trial Converted → Subscription Renewed 1, 2, ... N (N: renewal frequency based on the weekly/monthly/annual plan)
A user downloads the app and sees the paywall. If the user signs up, the free trial period starts. To enable personalized onboarding, the user provides some information, such as their main interest(s), and answers a couple of follow-up questions accordingly. After this short, three-page information-gathering step, the user lands on an opening page with content tailored to them by a recommendation system. These selections can be altered later. The free trial content is limited, and conversion is required to access the full content.
The main point is to engage the user in the first few minutes of using the product by providing the value they are looking for. Then, we hope the user likes the experience, comes back the next day, and purchases a plan at the end of the free trial period. Eventually, the retention/loyalty of the user (subscription renewal) becomes the metric of interest for long-term use of the product.
2. Preparation to A/B Testing
2.1. Definition of the Business Problem:
To assess the performance of the app, we built cohorts by users' first-seen date, observed their behavior during the free trial period, and compared it with the lapse date (the earliest date a user can subscribe after the free trial completes). As a result, we observed drops of various sizes in subscriptions ("Trial Converted") across cohorts.
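As a minimal sketch of this kind of cohort view, assuming a hypothetical users table with user_id, first_seen_date, and trial_converted columns (the column names and values below are illustrative placeholders, not the app's actual schema or data):
import pandas as pd
# Hypothetical user-level data; column names and values are placeholders for illustration.
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'first_seen_date': pd.to_datetime(['2023-01-02', '2023-01-05', '2023-01-20',
                                       '2023-01-25', '2023-02-03', '2023-02-10']),
    'trial_converted': [1, 0, 1, 0, 0, 1],  # 1 = subscribed after the 7-day free trial
})
# Build monthly cohorts from the first-seen date and compare trial conversion across them.
users['cohort'] = users['first_seen_date'].dt.to_period('M')
cohort_conversion = users.groupby('cohort')['trial_converted'].agg(['count', 'mean'])
cohort_conversion.columns = ['users', 'trial_conversion_rate']
print(cohort_conversion)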
The business problem we want to tackle: "How can we convince users to purchase a plan after the free trial?"
As we benchmark against competitors and look for best practices, we have decided to test the power of user reviews to increase first subscription purchases. Running an experiment that demonstrates users' satisfaction with the app means that "SSHWellness" can prove the value of the product to its users and ultimately improve both subscription (#) and revenue ($) retention.
2.2. Definition of A/B test, User Target and Exposure
A/B testing is a powerful tool for data-driven business decisions. A/B tests are Randomized Controlled Trials (RCTs) that let us answer the causality questions in our minds in a scientific way. Even if the experiment does not reach statistical significance in the end, the results can still provide insights for certain segments that can feed the marketing strategy going forward.
A/B test: Test a new paywall with social proof (customer reviews) to increase trial conversions.
User Target: To test the paywall appearance change, the experiment will be limited to US users only.
Randomization Unit & Exposure: Users are randomized individually, and exposure is the event when a user sees the change. After the 7-day free trial period ends, users get an email that takes them to the paywall for trial conversion. The email will take our control group to the current paywall. The treatment group, on the other hand, will see a new paywall with a small selection of customer reviews to demonstrate the value of the paid subscription.
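One common way to make such a per-user assignment deterministic is hash-based bucketing; the sketch below assumes a hypothetical experiment name and user ID, and is only one possible implementation rather than the app's actual assignment logic:
import hashlib
def assign_variant(user_id, experiment='paywall_social_proof'):
    # Hash the user id together with the experiment name so that different
    # experiments produce independent 50/50 assignments for the same user.
    digest = hashlib.md5(f'{experiment}:{user_id}'.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return 'treatment' if bucket < 50 else 'control'
print(assign_variant('user_12345'))  # prints either 'control' or 'treatment'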
2.3. Clarification of Hypothesis & KPIs (Metrics)
The null hypothesis states that control and treatment have the same effect on the response; in other words, the new paywall does not improve the conversion rate. When we have enough evidence to reject the null hypothesis, we conclude that the new paywall makes a statistically significant difference in the conversion rate.
- Ho: The conversion rates of the current and new paywalls are the same.
- Ha: The conversion rates of the current and new paywalls are different.
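As a sketch of how this two-sided hypothesis could be tested once the experiment data is collected, a two-proportion z-test from statsmodels is one reasonable choice; the counts below are placeholders, not real results:
from statsmodels.stats.proportion import proportions_ztest
# Hypothetical conversion counts and group sizes at the end of the experiment.
conversions = [760, 840]    # control, treatment
group_sizes = [9500, 9500]  # users who reached the paywall in each group
# Two-sided test: Ho says both paywalls convert at the same rate.
z_stat, p_value = proportions_ztest(conversions, group_sizes, alternative='two-sided')
print(f'z = {z_stat:.2f}, p-value = {p_value:.4f}')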
Important Metrics to consider:
- North Star Metric: The number of subscribers per year is the primary KPI at the business level.
- Driver Metric (Goal of the test): The first subscription purchase rate per user is the primary KPI at the product level and the metric used in the A/B test; the goal is to increase it.
- Guardrail Metrics: These are the metrics we monitor for negative changes while conducting the A/B test.
- Business: The sign-up rate for the free trial per day should not go down. Cancellations per day and refunds per day should not go up.
- Validity: Sample ratio mismatch (SRM), i.e., the observed split between control and treatment deviating from the intended split (see the check sketched right after this list).
- Segmentation Metrics: These are used to observe subgroup variability (e.g., device type, location, demographics). In this case study, the test is assumed to be conducted on US customers only.
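A minimal sketch of how the sample ratio mismatch guardrail could be checked, assuming a 50/50 intended split and hypothetical group sizes (a chi-square goodness-of-fit test is a common choice):
from scipy.stats import chisquare
# Hypothetical observed group sizes at the end of the experiment.
observed = [9450, 9550]             # control, treatment
expected = [sum(observed) / 2] * 2  # a 50/50 split was intended
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value (e.g. < 0.001) would flag a sample ratio mismatch, meaning the
# randomization or event logging should be investigated before trusting the results.
print(f'chi-square = {stat:.2f}, p-value = {p_value:.4f}')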
2.4. Experiment Design
Right before any randomized trial starts, the test parameters must be clarified in order to calculate the sample size and experiment duration, and to conduct statistical inference after the experiment is completed.
Here are three important experiment parameter assumptions to help us navigate through the test:
1. Significance level (alpha = 5%): The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that all the modeling assumptions, including the null hypothesis Ho, are true (Greenland, Senn, et al. 2016). The significance level alpha is the threshold the p-value is compared against, and it equals the probability of a Type I error: rejecting Ho when it is true. Here I assume a 5% chance of doing so. However, the significance level must be evaluated based on the risk appetite of the business. If the product team believes that adding reviews will create a long-lasting positive impact on the product, then they can take a higher risk (10%) to launch the feature.
2. Statistical power (power = 80%): The probability of correctly rejecting the null hypothesis when it is false. With 80% power, there is an 80% chance of detecting a given effect when it exists, which is a reasonable starting point. If we have to rerun the experiment in order to achieve statistical significance for a reasonable lift, we can increase the power to 90% (reducing the Type II error to 10%). That means running the experiment with a bigger sample size, which reduces variability as the estimate gets closer to the population parameter.
3. Sensitivity (minimum detectable effect) (mde = 10%): This parameter is the smallest lift/improvement that would bring us more revenue (more conversions and, hopefully, long-term loyalty) than it costs (the cost of adding the reviews to the paywall). Our case study is based on a start-up still developing its product, so a 10% MDE is reasonable. If this were an online platform with billions of dollars in revenue per year, we might be interested in a 1% lift. Therefore, the expected ROI is a reasonable metric to guide the choice of lift, as the quick calculation below illustrates.
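As a rough back-of-the-envelope illustration of how expected ROI could guide the choice of MDE, here is a quick calculation; every number below is a hypothetical assumption, not an SSHWellness figure:
# Hypothetical inputs, for illustration only.
monthly_trial_users = 20_000   # assumed users reaching the paywall each month
baseline_conversion = 0.08     # current paywall conversion rate
mde = 0.10                     # 10% relative lift we aim to detect
revenue_per_subscriber = 60    # assumed average annual revenue per new subscriber ($)
feature_cost = 15_000          # assumed one-off cost of building the review paywall ($)
# Extra subscribers and revenue per year if the lift materializes.
extra_subscribers = monthly_trial_users * 12 * baseline_conversion * mde
extra_revenue = extra_subscribers * revenue_per_subscriber
print(f'Extra subscribers per year: {extra_subscribers:.0f}')
print(f'Extra revenue per year: ${extra_revenue:,.0f} vs. one-off cost ${feature_cost:,}')
# If the extra revenue comfortably exceeds the cost, a 10% lift is worth detecting.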
2.4.1. Experiment parameters
alpha = 0.05
power = 0.80
mde = 0.10
# Proportions for both groups
p1 = 0.08 # Control (Current Paywall Conversion Rate based on our data)
p2 = round(p1 * (1 + mde),3) # Treatment (Expected New Paywall Conversion Rate)
2.4.2. "Minimum Sample Size" based on selected parameters
In this case, we set up our experiment to detect a fairly large effect. If we were looking to detect a smaller impact (mde = 0.01) and/or with more confidence (alpha = 0.01), then we would need to run the experiment with a bigger sample size (see the quick sensitivity check right after the sample size calculation below).
import math
import statsmodels.api as sm
from statsmodels.stats.power import tt_ind_solve_power
# Calculate the standardized effect size (Cohen's h) for the two proportions
effect = sm.stats.proportion_effectsize(p1, p2)
# Estimate the sample size for each group
n = tt_ind_solve_power(effect_size=effect, power=power, alpha=alpha)
# Round up to the nearest thousand so we never fall below the required minimum
n = int(math.ceil(n / 1000) * 1000)
print(f'To detect an effect of {100*mde:.1f}% lift from the current paywall conversion rate at {100*p1:.0f}%, '
      f'the sample size per group required is {n}.'
      f'\nThe total sample size required in this experiment is {2*n}.')
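To make the point about smaller MDEs and stricter significance levels concrete, here is a quick sensitivity check that reuses the same effect-size and power functions; the parameter grid is an illustrative choice of mine, not part of the original design:
import math
import statsmodels.api as sm
from statsmodels.stats.power import tt_ind_solve_power
# Illustrative grid: the current design versus a smaller MDE and a stricter alpha.
scenarios = [(0.05, 0.10), (0.05, 0.01), (0.01, 0.10), (0.01, 0.01)]  # (alpha, mde)
p1 = 0.08  # current paywall conversion rate
for alpha_i, mde_i in scenarios:
    p2_i = p1 * (1 + mde_i)
    effect_i = sm.stats.proportion_effectsize(p1, p2_i)
    n_i = math.ceil(tt_ind_solve_power(effect_size=effect_i, power=0.80, alpha=alpha_i))
    print(f'alpha={alpha_i:.2f}, mde={mde_i:.2f} -> sample size per group: {n_i:,}')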