Website redesign - A/B testing

1. Data exploration

First of all, we have to explore our dataset. Let's load the table, look at its first and last rows, and check whether any values are missing.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from math import lgamma

df = pd.read_csv('./data/redesign.csv')
print(df.head())
print(df.tail())
print(df.info())

nan_elements1 = df['converted'].isnull().values.any()
nan_elements2 = df['treatment'].isnull().values.any()
nan_elements3 = df['new_images'].isnull().values.any()
print(f'NaN elements in converted, treatment, new_images columns: {nan_elements1, nan_elements2, nan_elements3}')

We have just found out that there are 40 484 observations and no missing values.
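
The same check can be done for all columns at once. Below is a minimal sketch on a toy table (the `toy` DataFrame and its values are made up for illustration and are not part of the real dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    'treatment': ['yes', 'no', 'yes', 'no'],
    'new_images': ['no', 'no', 'yes', None],
    'converted': [1, 0, 0, 1],
})

# Number of missing values per column, for all columns at once.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if at least one value is missing anywhere in the table.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

With the real data, `df.isna().sum()` should report zero for every column if nothing is missing.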

Data structure

There are only 3 columns in the dataset, and each of them is binary. Judging by the code below, 'treatment' and 'new_images' hold the strings 'yes' or 'no', while 'converted' holds 1 ('yes') or 0 ('no').

'treatment' shows if the user saw the new version of the landing page.
'new_images' shows if the page used a new set of images.
'converted' shows if the user joined the site.

Groups of users

The control group (CG) is those users with 'no' in both columns: the old version with the old set of images.
The AB group is those users with 'yes' in the first column ('treatment') and 'no' in the second ('new_images').
The BA group, on the contrary, is those users with 'no' and 'yes' respectively.
The AA group is the reversed control group with 'yes' in both columns.

| treatment / new_images | yes | no |
| ---: | :----: | :--- |
| yes | AA | AB |
| no | BA | CG |
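
To keep the group names next to the data, the two flags can be mapped to a label. This is only a sketch: the `GROUP_LABELS` dictionary, the `group` column and the toy rows are hypothetical helpers, not part of the original dataset.

```python
import pandas as pd

# Labels follow the group definitions above; keys are (treatment, new_images).
GROUP_LABELS = {
    ('no', 'no'): 'CG',
    ('yes', 'no'): 'AB',
    ('no', 'yes'): 'BA',
    ('yes', 'yes'): 'AA',
}

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    'treatment': ['no', 'yes', 'no', 'yes', 'yes'],
    'new_images': ['no', 'no', 'yes', 'yes', 'no'],
    'converted': [0, 1, 0, 1, 0],
})

# Build the label from the two flag columns, row by row.
toy['group'] = [GROUP_LABELS[pair] for pair in zip(toy['treatment'], toy['new_images'])]
print(toy['group'].value_counts())
```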

Let's write the get_info function for a more detailed study of the data in each group.

def get_info(treatment, new_images):
    df_new = df.loc[(df['treatment'] == treatment) & (df['new_images'] == new_images)]
    joined = np.array(df_new['converted'])
    true = np.count_nonzero(joined)
    false = np.count_nonzero(joined == 0)
    tf = true+false
    print(f'Group: treatment = {treatment}, new_images = {new_images}')
    print(f'Total number of users in the group: {tf}')
    print(f'Number of users who joined the site: {true}')
    print(f'Number of users who did not join the site: {false}')
    print(f'Ratio of users who joined the site: {np.round(true/tf*100, 4)}%')
    print(' ')
    
get_info('no', 'no')
get_info('yes', 'no')
get_info('no', 'yes')
get_info('yes', 'yes')

num1, num2, num3, num4 = 10121, 10121, 10121, 10121
joined1, joined2, joined3, joined4 = 1084, 1215, 1139, 1151
rate1, rate2, rate3, rate4 = joined1/num1, joined2/num2, joined3/num3, joined4/num4
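
The counts above are typed in by hand from the printed output. As an alternative, they can be derived with a single `groupby`. The sketch below runs on a made-up toy table; the column names are the ones used in this notebook, while `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    'treatment': ['no', 'no', 'no', 'yes', 'yes'],
    'new_images': ['no', 'no', 'no', 'no', 'no'],
    'converted': [0, 1, 0, 1, 1],
})

# One row per (treatment, new_images) group:
# number of users, number of joined users, and conversion rate.
summary = (
    toy.groupby(['treatment', 'new_images'])['converted']
       .agg(users='count', joined='sum', rate='mean')
)
print(summary)

# Values for a single group can be read by its (treatment, new_images) key.
cg_users = summary.loc[('no', 'no'), 'users']
cg_joined = summary.loc[('no', 'no'), 'joined']
```

Applied to the real `df`, the same call should reproduce the four group sizes and conversion counts without retyping them.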

Group exploration

As we can see, all groups contain 10 121 observations. The control group has the lowest share of joined users (10.71%), and the AB group has the highest (12%). The BA and AA groups (11.25% and 11.37%) are above the control group but below the AB group.

However, we cannot draw any conclusions yet. The differences may be due to random chance. Let's move on to the next part of the study to investigate this.

2. Mann-Whitney U test

Before the Bayesian A/B testing, we can try to solve this problem with the Mann-Whitney U test. Later we will be able to compare the results.

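The Mann-Whitney U test is a nonparametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X being greater than Y equals the probability of Y being greater than X. Simply put, H₀: the two distributions are equal, and H₁: they are not equal. The code below uses the one-sided option alternative='less', so H₁ there is that the control group tends to give smaller values than the compared group. It is a very useful tool in our case of a discrete distribution.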

a = np.zeros(num1)
a[:joined1] = 1
b1, b2, b3 = np.zeros(num2), np.zeros(num3), np.zeros(num4)
b1[:joined2] = 1
b2[:joined3] = 1
b3[:joined4] = 1
stat1, p_value1 = stats.mannwhitneyu(a, b1, alternative='less')
stat2, p_value2 = stats.mannwhitneyu(a, b2, alternative='less')
stat3, p_value3 = stats.mannwhitneyu(a, b3, alternative='less')
print('Mann-Whitney U test')
print(f'p-value for group AB: {p_value1:0.3f}')
print(f'p-value for group BA: {p_value2:0.3f}')
print(f'p-value for group AA: {p_value3:0.3f}')

We have now compared all three groups (AB, BA, AA) with the control group. The p-value is low enough (<.05) only in the first test, so we can reject the null hypothesis and say that the AB group has statistically significantly better results. The two other tests show higher p-values (>.05), so we fail to reject the null hypothesis: the differences between CG and the BA and AA groups are not significant.
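
As a cross-check, the same comparison can be made with a one-sided two-proportion z-test, a common alternative for comparing two conversion rates. This test is not part of the original analysis; the function name `two_proportion_z_test` is hypothetical, and the counts are the ones printed by get_info above.

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(joined_a, n_a, joined_b, n_b):
    """One-sided z-test of H0: p_b <= p_a against H1: p_b > p_a."""
    p_a, p_b = joined_a / n_a, joined_b / n_b
    # Pooled conversion rate under the null hypothesis of equal rates.
    pooled = (joined_a + joined_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = stats.norm.sf(z)
    return z, p_value

n = 10121
control_joined = 1084
joined_by_group = {'AB': 1215, 'BA': 1139, 'AA': 1151}

results = {}
for name, joined in joined_by_group.items():
    results[name] = two_proportion_z_test(control_joined, n, joined, n)
    z, p_value = results[name]
    print(f'{name} vs control: z = {z:0.3f}, p-value = {p_value:0.3f}')
```

If both tests are appropriate here, they should lead to the same decisions at the 5% level.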

3. Bayesian A/B testing

There are more than 10 000 observations in each group, so it is possible to use binomial or even normal distribution (as a consequence of the central limit theorem).

As we do not know the exact value of p (the probability that the user will join the site) for X ~ B(n,p), it is more appropriate to use the Bayesian approach and describe our uncertainty about p with a Beta distribution.
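
The Beta parameters used in the code below (joined+1 and num-joined+1) are what we get if we assume a uniform prior Beta(1, 1) on p. A short derivation, where n is the number of users in a group and k the number of joined users (both symbols are introduced here only for illustration):

```latex
\begin{aligned}
p &\sim \mathrm{Beta}(1, 1), \qquad k \mid p \sim \mathrm{B}(n, p), \\
\pi(p \mid k) &\propto \underbrace{p^{k}(1-p)^{n-k}}_{\text{likelihood}} \cdot \underbrace{p^{1-1}(1-p)^{1-1}}_{\text{prior}}
  = p^{(k+1)-1}(1-p)^{(n-k+1)-1}, \\
p \mid k &\sim \mathrm{Beta}(k+1,\; n-k+1).
\end{aligned}
```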

Visualization

First, let's visualize our data. Two functions, beta_mode and plot, will help to do this.

a1, b1 = joined1+1, num1-joined1+1
a2, b2 = joined2+1, num2-joined2+1
a3, b3 = joined3+1, num3-joined3+1
a4, b4 = joined4+1, num4-joined4+1
beta1 = stats.beta(a1, b1)
beta2 = stats.beta(a2, b2)
beta3 = stats.beta(a3, b3)
beta4 = stats.beta(a4, b4)

def beta_mode(a, b):
    return (a-1)/(a+b-2)

def plot(betas, names, linf=0.09, lsup=0.15):
    sns.set_theme()
    x=np.linspace(linf,lsup, 100)
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
    for f, name in zip(betas,names):
        y = f.pdf(x)
        y1 = f.cdf(x)
        y_mode = beta_mode(f.args[0], f.args[1])
        y_var = f.var()
        ax1.plot(x, y, label=f'{name} group')
        ax2.plot(x, y1, label=f'{name} group')
    ax1.legend()
    ax1.set_ylabel('PDF')
    ax1.set_xlabel('conversion rate')
    ax2.set_ylabel('CDF')
    ax2.set_xlabel('conversion rate')
    ax2.legend()
    fig.suptitle('PDF and CDF of Beta distribution')
    plt.show()

plot([beta1, beta2, beta3, beta4], names=['Control', 'AB', 'BA', 'AA'])

The PDF and CDF plots above make the difference between the groups easier to see. Based on the previous section, we expect the AB group to be the best one, and visually this seems to be the case. Let's do the final step and check it mathematically!
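
One way to put numbers on what the plots show is a 95% credible interval for each group's conversion rate. This is a sketch that rebuilds the Beta distributions from the counts above, assuming the same +1 parameters; the `intervals` dictionary is a hypothetical helper.

```python
from scipy import stats

n = 10121
joined_by_group = {'Control': 1084, 'AB': 1215, 'BA': 1139, 'AA': 1151}

intervals = {}
for name, k in joined_by_group.items():
    # Same parameters as beta1..beta4: joined + 1 and not joined + 1.
    posterior = stats.beta(k + 1, n - k + 1)
    # Equal-tailed interval that contains 95% of the probability mass.
    low, high = posterior.interval(0.95)
    intervals[name] = (low, high)
    print(f'{name} group: [{low:0.4f}, {high:0.4f}]')
```

The less two intervals overlap, the clearer the separation between the corresponding curves.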

Calculations

John Dock described a very good way to solve such tasks in this work. However, we will use a simpler approach based on random sampling from the distributions with the given parameters. Because it is a simulation, the estimates are approximate, but with this many draws they should be very close to the exact values.
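
For comparison, the probability that one group's rate exceeds another's can also be approximated without random numbers, by integrating pdf_B(x) · cdf_A(x) over x. The sketch below is not the method from the work mentioned above; the function `prob_b_beats_a` and the grid size are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def prob_b_beats_a(beta_a, beta_b, n_points=20001):
    """Approximate P(p_b > p_a) as the integral of pdf_b(x) * cdf_a(x) dx."""
    # Integrate only where the two distributions have noticeable mass.
    lo = min(beta_a.ppf(1e-9), beta_b.ppf(1e-9))
    hi = max(beta_a.ppf(1 - 1e-9), beta_b.ppf(1 - 1e-9))
    x = np.linspace(lo, hi, n_points)
    integrand = beta_b.pdf(x) * beta_a.cdf(x)
    # Trapezoidal rule written out by hand.
    return float(np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(x)))

n = 10121
control = stats.beta(1084 + 1, n - 1084 + 1)
ab = stats.beta(1215 + 1, n - 1215 + 1)
ba = stats.beta(1139 + 1, n - 1139 + 1)

p_ab = prob_b_beats_a(control, ab)
p_ba = prob_b_beats_a(control, ba)
p_same = prob_b_beats_a(control, control)  # sanity check, should be about 0.5
print(f'P(AB > control) = {p_ab:0.4f}')
print(f'P(BA > control) = {p_ba:0.4f}')
print(f'P(control > control) = {p_same:0.4f}')
```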

beta1_rvs = beta1.rvs(10121)
beta2_rvs = beta2.rvs(10121)
beta3_rvs = beta3.rvs(10121)
beta4_rvs = beta4.rvs(10121)

delta1 = (beta2.mean()-beta1.mean())/beta1.mean()
add_val1 = (beta2_rvs > beta1_rvs).mean()
delta2 = (beta3.mean()-beta1.mean())/beta1.mean()
add_val2 = (beta3_rvs > beta1_rvs).mean()
delta3 = (beta4.mean()-beta1.mean())/beta1.mean()
add_val3 = (beta4_rvs > beta1_rvs).mean()
print(f'The AB group has a {delta1*100:2.2f}% higher mean conversion rate; probability that it beats the control group: {add_val1*100:2.1f}%')
print(f'The BA group has a {delta2*100:2.2f}% higher mean conversion rate; probability that it beats the control group: {add_val2*100:2.1f}%')
print(f'The AA group has a {delta3*100:2.2f}% higher mean conversion rate; probability that it beats the control group: {add_val3*100:2.1f}%')
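
The simulation above gives slightly different numbers on every run. A minimal sketch of a reproducible variant with a fixed seed and a rough Monte Carlo error estimate is shown below; the seed value, the number of draws and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily
n = 10121
n_draws = 200_000                # number of simulated values per group

control = stats.beta(1084 + 1, n - 1084 + 1)
ab = stats.beta(1215 + 1, n - 1215 + 1)

# Independent draws from both distributions with the same generator.
control_draws = control.rvs(n_draws, random_state=rng)
ab_draws = ab.rvs(n_draws, random_state=rng)

# Share of draws in which the AB rate is higher than the control rate.
prob = (ab_draws > control_draws).mean()
# Standard error of that share, treating each draw as a Bernoulli trial.
mc_error = np.sqrt(prob * (1 - prob) / n_draws)
print(f'P(AB > control) = {prob:0.4f} (Monte Carlo error about {mc_error:0.4f})')
```

More draws make the estimate more stable; the error shrinks with the square root of the number of draws.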

There is nothing surprising about these results. They are very similar to the results obtained using the Mann-Whitney U test. It is time to sum up.

4. Conclusion

The result of every group is better than the result of the control group.
The AB group has the best result.
Only the result of the AB group can be recognized as statistically significant at the 5% level of significance.

Thus, the AA and BA groups may look better than the control group merely due to random chance. The developers should choose the new version of the landing page with the old set of images (AB group).