Website redesign -- 4-fold A/B testing

from IPython import display
display.Image("Website Redesign.png")

Which version of the website should you use?

📖 Background

You work for an early-stage startup in Germany. Your team has been working on a redesign of the landing page. The team believes a new design will increase the number of people who click through and join your site.

They have been testing the changes for a few weeks and now they want to measure the impact of the change and need you to determine if the increase can be due to random chance or if it is statistically significant.

💾 The data

The team assembled the following file:

Redesign test data

"treatment" - "yes" if the user saw the new version of the landing page, "no" otherwise.
"new_images" - "yes" if the page used a new set of images, "no" otherwise.
"converted" - 1 if the user joined the site, 0 otherwise.

The control group is those users with "no" in both columns: the old version with the old set of images.

import pandas as pd
df = pd.read_csv('./data/redesign.csv')
df.head()
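Before analyzing, it can help to confirm that the columns hold only the documented values and to see how many users fall into each cell of the 2x2 design. The sketch below runs on a small made-up frame named `toy` (a hypothetical stand-in for `df`); the `is_control` column is also an illustrative name, not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for redesign.csv (values are made up for illustration)
toy = pd.DataFrame({
    "treatment":  ["no", "no", "yes", "yes", "no", "yes"],
    "new_images": ["no", "yes", "no", "yes", "no", "yes"],
    "converted":  [0, 1, 1, 0, 1, 0],
})

# Confirm the columns only hold the documented values
assert set(toy["treatment"]) <= {"yes", "no"}
assert set(toy["new_images"]) <= {"yes", "no"}
assert set(toy["converted"]) <= {0, 1}

# Flag the control group: old page with old images
toy["is_control"] = (toy["treatment"] == "no") & (toy["new_images"] == "no")

# Number of users in each of the four cells of the 2x2 design
cell_sizes = toy.groupby(["treatment", "new_images"]).size()
print(cell_sizes)
print("control users:", int(toy["is_control"].sum()))
```

The same checks should work on the real `df` by replacing `toy`, assuming the file matches the column descriptions above.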

💪 Challenge

Complete the following tasks:

Analyze the conversion rates for each of the four groups: the new/old design of the landing page and the new/old pictures.
Can the increases observed be explained by randomness? (Hint: Think A/B test)
Which version of the website should they use?

1. Analyzing the conversion rates for each of the four groups: the new/old design of the landing page and the new/old pictures.

# Conversion rate = mean of the 0/1 "converted" flag within each of the four groups
df.groupby(['treatment', 'new_images'])['converted'].mean().unstack().style.background_gradient(cmap='Blues').format('{:,.2%}')
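A rate on its own hides how many users it is based on. The sketch below reports users, conversions, and rate per group, plus a rough 95% normal-approximation (Wald) interval. The counts in `cells` are made up for illustration and the frame `toy` is a hypothetical stand-in for `df`. statsmodels' `proportion_confint` provides alternative interval methods if preferred.

```python
import numpy as np
import pandas as pd

# (users, conversions) per group: made-up numbers, not the real data
cells = {
    ("no", "no"): (200, 20),
    ("no", "yes"): (200, 22),
    ("yes", "no"): (200, 30),
    ("yes", "yes"): (200, 24),
}

# Expand the counts into one row per user, like the original file
rows = []
for (treatment, new_images), (n, k) in cells.items():
    rows += [{"treatment": treatment, "new_images": new_images, "converted": 1}] * k
    rows += [{"treatment": treatment, "new_images": new_images, "converted": 0}] * (n - k)
toy = pd.DataFrame(rows)

# Users, conversions, and conversion rate for each of the four groups
summary = toy.groupby(["treatment", "new_images"])["converted"].agg(
    users="count", conversions="sum", rate="mean"
)

# Wald interval: rate +/- 1.96 standard errors
se = np.sqrt(summary["rate"] * (1 - summary["rate"]) / summary["users"])
summary["ci_low"] = summary["rate"] - 1.96 * se
summary["ci_high"] = summary["rate"] + 1.96 * se
print(summary)
```

Overlapping intervals are only a rough visual cue; the formal comparison is still the test in section 2.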

2. Can the increases observed be explained by randomness?

To answer this question, we compare conversion rates between groups that differ in exactly one factor. Among the four groups, the following comparisons make sense:

New version with new images vs. old version with new images (effect of the new version when the new images are used)
New version with new images vs. new version with old images (effect of the new images when the new version is used)
Old version with old images vs. old version with new images (effect of the new images when the old version is used)
Old version with old images vs. new version with old images (effect of the new version when the old images are used)
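The four comparisons listed above are exactly the pairs of groups that differ in one factor while the other is held fixed. A short sketch using only the standard library enumerates them; the tuple layout `(treatment, new_images)` is an assumption made for this illustration.

```python
from itertools import combinations, product

# The four groups of the 2x2 design as (treatment, new_images) tuples
groups = list(product(["no", "yes"], repeat=2))

# Keep only pairs that differ in exactly one of the two factors
comparisons = [
    (a, b)
    for a, b in combinations(groups, 2)
    if sum(x != y for x, y in zip(a, b)) == 1
]

for a, b in comparisons:
    print(f"treatment={a[0]}, new_images={a[1]}  vs  treatment={b[0]}, new_images={b[1]}")
```

The two excluded pairs change both factors at once, so a difference between them could not be attributed to either the design or the images alone.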

2.1. Defining a function for the A/B test:

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

def statistical_experiment(df, col, state):
    """Two-proportion z-test for the effect of the other factor, holding `col` fixed at `state`."""
    # The factor under test is whichever of the two yes/no columns is not `col`
    other_col = df.select_dtypes(object).columns.drop(col)[0]

    # Conversions with and without the factor under test, within the chosen subgroup
    yes_experiment = df.loc[(df[col] == state) & (df[other_col] == 'yes'), 'converted']
    no_experiment = df.loc[(df[col] == state) & (df[other_col] == 'no'), 'converted']

    successes = [no_experiment.sum(), yes_experiment.sum()]
    nobs = [no_experiment.count(), yes_experiment.count()]

    # Two-sided z-test for equality of the two conversion rates
    z_stat, pval = proportions_ztest(successes, nobs=nobs)

    labels = {'treatment': 'new version', 'new_images': 'set of new images'}
    negation = ' not' if state == 'no' else ''
    print(f"p-value: {pval:.3f} for effectiveness of using {labels.get(other_col, other_col)} vs not using it "
          f"for the group where {labels.get(col, col)} is{negation} used")
    return pval
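For reference, the statistic that `proportions_ztest` computes with its default settings (pooled variance, two-sided alternative) can be reproduced by hand. The sketch below uses only the standard library; the helper name `two_proportion_ztest` is hypothetical, and the counts are made up rather than taken from the dataset.

```python
from math import erfc, sqrt

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Pooled two-sided z-test for equality of two proportions (hypothetical helper)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis both groups share one rate, estimated from the pooled counts
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|)
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Made-up counts for illustration only
z, p_value = two_proportion_ztest(100, 1000, 120, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

The sign of `z` depends only on the order of the two groups; the two-sided p-value is the same either way.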

2.2. Getting p-values for each combination to assess significance

from itertools import product

# Run the test with each factor held fixed at each of its two states ("yes"/"no")
for state, col in product(df.new_images.unique(), df.select_dtypes(object).columns):
    statistical_experiment(df, col, state)
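Because four related tests are run, one may also want to adjust the p-values for multiple comparisons before comparing them with the significance level. Below is a minimal sketch of Holm's step-down adjustment in plain Python. `holm_adjust` is a hypothetical helper, and the p-values are placeholders, not results from this dataset. statsmodels offers the same adjustment through `multipletests(..., method='holm')`.

```python
def holm_adjust(pvals):
    """Return Holm-adjusted p-values in the original order (hypothetical helper)."""
    m = len(pvals)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Placeholder p-values for four comparisons (illustration only)
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = holm_adjust(raw)
for p, p_adj in zip(raw, adjusted):
    print(f"raw p = {p:.3f} -> Holm-adjusted p = {p_adj:.3f}")
```

An adjusted p-value below the chosen significance level still counts as significant after accounting for the number of tests.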

2.3. Interpreting the results at the standard significance level (α = 0.05)

We have enough statistical evidence to say:

The new version is better when the old set of images is used. The increase observed here is unlikely to be due to random chance alone.

We do not have enough statistical evidence to say:

The new version is better when the new set of images is used. The increase here could be random.
The new set of images is better when the old version is used. The increase here could be random.
The new set of images is better when the new version is used. The increase here could be random.

3. Which version of the website should they use?

They should use the new version with the old images, as this combination has the highest conversion rate (12.00%). It is also the only change for which we have enough statistical evidence to call it worthwhile.