Introduction to A/B Testing in R
Today, we analyze a fictional dataset spanning seven months, from July 2022 to January 2023, in which an A/B test was launched in October.
The dataset was created in data_creation.ipynb.
library(tidyverse)
# lmtest and sandwich are not directly available from Workspace, so they are installed from source tarballs here. Locally, install.packages("lmtest") and install.packages("sandwich") should work.
install.packages("lmtest_0.9-40.tar.gz", repos = NULL, type = "source")
install.packages("sandwich_3.0-2.tar.gz", repos = NULL, type = "source")
#
library(lmtest)
library(sandwich)
Task 1
Load experiment_data.csv via read_csv and look at some random rows.
Month shows the time dimension, ranging from July 2022 to January 2023.
Group indicates whether a customer is in the treatment group (New) or the control group (Existing).
Treated is always 0 for the control (Existing) group, and also for the New group before October (prior to implementing the experiment).
Dollars is the amount in $ spent by our customers.
id is a personal identifier of the customer.
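A minimal sketch of Task 1 follows. Because experiment_data.csv is not available here, the block first writes a tiny made-up stand-in file to a temporary path; with the real data you would pass "experiment_data.csv" to read_csv instead. Reading Month as character is an assumption, chosen so that string comparisons such as Month != "202210" stay exact.

```r
library(readr)
library(dplyr)

# Toy stand-in for experiment_data.csv (all values are made up for illustration)
toy <- data.frame(
  Month   = rep(c("202207", "202211"), each = 3),
  Group   = rep(c("Existing", "New", "New"), times = 2),
  Treated = c(0, 0, 0, 0, 1, 1),
  Dollars = c(20.5, 18.0, 22.3, 21.1, 25.4, 27.9),
  id      = rep(1:3, times = 2)
)
toy_path <- tempfile(fileext = ".csv")
write_csv(toy, toy_path)

# Assumption: keep Month as a character column; other column types are guessed
customer_data <- read_csv(toy_path, col_types = cols(Month = col_character()))

# Look at a few random rows (seed set only to make the sample reproducible)
set.seed(1)
random_rows <- slice_sample(customer_data, n = 3)
random_rows
```

slice_sample() draws rows without replacement by default, so each call shows a different part of the data unless a seed is set.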
Task 2
Look at customer_data to see the number of customers we observe per month in each group. How many individual customers are there?
Look at the Treated column by Month.
cat("Rows and unique rows in the dataset:\n")
cat("\nUnique/distinct months in the dataset:\n")
cat("\nUnique/distinct customers in the dataset:\n")
cat("\nNumber of clients by group ('New' vs 'Existing'):\n")
cat("\n")
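The counts asked for in Task 2 could be computed along the following lines. This is only a sketch on a small made-up data frame that mimics the columns of customer_data; the names customers_per_month, n_customers_total and treated_by_month are hypothetical.

```r
library(dplyr)

# Toy stand-in for customer_data (made-up values)
customer_data <- data.frame(
  Month   = rep(c("202207", "202211"), each = 4),
  Group   = rep(c("Existing", "Existing", "New", "New"), times = 2),
  Treated = c(0, 0, 0, 0, 0, 0, 1, 1),
  id      = rep(1:4, times = 2)
)

# Number of distinct customers observed per month and group
customers_per_month <- customer_data %>%
  group_by(Month, Group) %>%
  summarize(n_customers = n_distinct(id), .groups = "drop")

# Number of individual customers in the whole dataset
n_customers_total <- n_distinct(customer_data$id)

# Share of treated rows per month
treated_by_month <- customer_data %>%
  group_by(Month) %>%
  summarize(share_treated = mean(Treated), .groups = "drop")

customers_per_month
n_customers_total
treated_by_month
```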
Task 3
Aggregate the whole dataset by Month and Group and look at the Dollars spent with a line plot.
if(FALSE) {
# Average Dollars and share of treated customers per month and group
month_group_data <- customer_data %>%
  group_by(Month, Group) %>%
  summarize(Dollars = mean(Dollars), Treatment = mean(Treated), .groups = "drop")
month_group_data %>% arrange(Month, Group)
}
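One possible way to draw the line plot for Task 3 with ggplot2 is sketched below. The month_group_data frame here is a made-up stand-in with the same columns as the aggregate described above, and the axis labels are only suggestions.

```r
library(ggplot2)

# Toy stand-in for the aggregated data (made-up averages)
month_group_data <- data.frame(
  Month   = rep(c("202207", "202208", "202209"), times = 2),
  Group   = rep(c("Existing", "New"), each = 3),
  Dollars = c(20.1, 19.8, 20.4, 19.9, 20.3, 20.0)
)

# Month is a character column, so group = Group is needed to connect the points
p <- ggplot(month_group_data,
            aes(x = Month, y = Dollars, color = Group, group = Group)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Average Dollars spent", color = "Group")
p
```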
if(FALSE) {
# Drop October, because some customers in the 'New' group had already seen the new product while others still saw the old one
customer_data <- customer_data %>% filter(Month != "202210")
# Add a binary indicator for the actual A/B testing period
customer_data$AB_period <- ifelse(customer_data$Month %in% c("202211", "202212", "202301"), 1, 0)
# Check that each month maps to the expected AB_period value
table(customer_data$Month, customer_data$AB_period)
}
Task 4
Plot the Dollars spent by Group in the actual A/B time period.
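A boxplot is one option for Task 4. The sketch below uses simulated data in place of customer_data and assumes the AB_period indicator has already been added; other plot types (histograms, density plots) would work just as well.

```r
library(dplyr)
library(ggplot2)

set.seed(7)
# Simulated stand-in for customer_data (made-up values)
customer_data <- data.frame(
  Group     = rep(c("Existing", "New"), each = 50),
  AB_period = rep(c(0, 1), times = 50),
  Dollars   = rnorm(100, mean = 20, sd = 5)
)

# Keep only observations from the actual A/B period
ab_data <- customer_data %>% filter(AB_period == 1)

p_ab <- ggplot(ab_data, aes(x = Group, y = Dollars, fill = Group)) +
  geom_boxplot() +
  labs(x = "Group", y = "Dollars spent", fill = "Group")
p_ab
```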
Task 5
Plot the Dollars spent by Group in the actual A/B time period again. This time, however, use a new dataset in which each customer's Dollars are averaged by period, so that no customer contributes multiple observations within the same period.
if(FALSE) {
# Now aggregate at the customer level so that we get one row for each customer before and after seeing the "New" product
customer_data_aggregated <- customer_data %>%
  group_by(id, Treated, Group, AB_period) %>%
  summarize(Dollars = mean(Dollars), .groups = "drop") %>%
  arrange(id, Treated, Group)
head(customer_data_aggregated)
tail(customer_data_aggregated)
}
Task 6
Now let's compare the Dollars spent by the New vs. the Existing Group in the actual A/B testing period.
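One way to make the comparison in Task 6 is a two-sample t-test on the A/B period only. The sketch below runs on simulated data standing in for customer_data_aggregated; the built-in effect of 3 dollars is purely hypothetical. Note that t.test() runs Welch's test (unequal variances) by default.

```r
set.seed(42)
# Simulated stand-in for customer_data_aggregated (made-up numbers):
# 40 customers with one row before and one row during the A/B period
customer_data_aggregated <- data.frame(
  id        = rep(1:40, each = 2),
  Group     = rep(c("Existing", "New"), each = 40),
  AB_period = rep(c(0, 1), times = 40)
)
# Hypothetical effect: New customers spend 3 dollars more in the A/B period
customer_data_aggregated$Dollars <- 20 +
  3 * (customer_data_aggregated$Group == "New" &
         customer_data_aggregated$AB_period == 1) +
  rnorm(80, sd = 5)

# Restrict to the A/B period and compare the two groups
ab_data <- subset(customer_data_aggregated, AB_period == 1)
welch <- t.test(Dollars ~ Group, data = ab_data)

welch$estimate  # group means
welch$p.value   # p-value of the two-sided test
```

Passing var.equal = TRUE would give the classical pooled-variance t-test instead.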
Task 7
But we could also compare only the New group before and after implementing the A/B test. Let's do that!
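Since the same customers are observed before and during the A/B period, a paired t-test is one natural choice for the before/after comparison. The sketch below again uses simulated data in place of customer_data_aggregated and assumes every New customer has exactly one row per period.

```r
set.seed(42)
# Simulated stand-in for customer_data_aggregated (made-up numbers)
customer_data_aggregated <- data.frame(
  id        = rep(1:40, each = 2),
  Group     = rep(c("Existing", "New"), each = 40),
  AB_period = rep(c(0, 1), times = 40)
)
customer_data_aggregated$Dollars <- 20 +
  3 * (customer_data_aggregated$Group == "New" &
         customer_data_aggregated$AB_period == 1) +
  rnorm(80, sd = 5)

# Keep the New group and sort so that before/after values line up by customer
new_data <- subset(customer_data_aggregated, Group == "New")
new_data <- new_data[order(new_data$id, new_data$AB_period), ]
before <- new_data$Dollars[new_data$AB_period == 0]
after  <- new_data$Dollars[new_data$AB_period == 1]

# Paired test of the within-customer change
paired_test <- t.test(after, before, paired = TRUE)
paired_test$estimate  # mean of the within-customer differences
paired_test$p.value
```

Unlike the comparison against the Existing group, a before/after comparison cannot separate the effect of the new product from anything else that changed over time, such as seasonality.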
Task 8
Calculate the standard deviation of the Dollars spent in the A/B period by the New group and use power.t.test() to calculate the sample size necessary to get statistically significant results at the p = 0.05 significance level, assuming power = 0.8 (and equal variances).
#round(sd(customer_data_aggregated %>% filter(Group == "New" & AB_period == 1) %>% pull(Dollars)), 1)
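The call to power.t.test() could look as sketched below. Both inputs are hypothetical placeholders: sd_new stands for the standard deviation computed from the data, and delta is the smallest difference in Dollars you want to be able to detect, which the task does not specify. power.t.test() leaves out exactly one of n, delta, sd, power and sig.level and solves for it, so omitting n returns the required sample size per group.

```r
# Hypothetical inputs: replace with the SD computed from the data
# and the minimum effect (in dollars) that matters for the business
sd_new <- 5
delta  <- 2

res <- power.t.test(
  delta       = delta,
  sd          = sd_new,
  sig.level   = 0.05,
  power       = 0.8,
  type        = "two.sample",
  alternative = "two.sided"
)

# n is the required number of customers per group; round up to a whole customer
n_per_group <- ceiling(res$n)
n_per_group
```

Halving delta roughly quadruples the required sample size, so the choice of minimum detectable effect matters much more than small changes in the estimated standard deviation.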