1. Of cats and cookies

Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It's a classic "connect three"-style puzzle game where the player must connect tiles of the same color to clear the board and win the level. It also features singing cats. We're not kidding!

As players progress through the levels of the game, they will occasionally encounter gates that force them to wait a non-trivial amount of time or make an in-app purchase to progress. In addition to driving in-app purchases, these gates serve the important purpose of giving players an enforced break from playing the game, which will hopefully increase and prolong their enjoyment of it.

But where should the gates be placed? Initially the first gate was placed at level 30, but in this notebook we're going to analyze an AB-test where we moved the first gate in Cookie Cats from level 30 to level 40. In particular, we will look at the impact on player retention. But before we get to that, a key step before undertaking any analysis is understanding the data. So let's load it in and take a look!

# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Reading in the data
df = pd.read_csv("cookie_cats.csv")

# Exploring the dataset
print(df.info())

# Showing the first few rows
df.head()

2. The AB-test data

The data we have is from 90,189 players that installed the game while the AB-test was running. The variables are:

  • userid - a unique number that identifies each player.
  • version - whether the player was put in the control group (gate_30 - a gate at level 30) or the group with the moved gate (gate_40 - a gate at level 40).
  • sum_gamerounds - the number of game rounds played by the player during the first 14 days after install.
  • retention_1 - did the player come back and play 1 day after installing?
  • retention_7 - did the player come back and play 7 days after installing?

When a player installed the game, he or she was randomly assigned to either gate_30 or gate_40. As a sanity check, let's see if there are roughly the same number of players in each AB group.

# Counting the number of players in each AB group.
df["version"].value_counts()

3. The distribution of game rounds

It looks like there is roughly the same number of players in each group, nice!

The focus of this analysis will be on how the gate placement affects player retention, but just for fun, let's plot the distribution of the number of game rounds players played during their first two weeks with the game.

# This command ensures that plots are displayed inline in the Jupyter notebook
%matplotlib inline

# Grouping the data by the number of game rounds played and counting the number of users for each group
plot_df = df.groupby("sum_gamerounds")["userid"].count()

# Plotting the number of users for each of the first 100 round counts,
# which is where the bulk of the distribution lies
ax = plot_df.head(100).plot()

# Setting the label for the x-axis
ax.set_xlabel("Rounds played")

# Setting the label for the y-axis
ax.set_ylabel("Users");

4. Check for normality

Looking at the distribution of rounds played, we can see that it is not normally distributed. To confirm this we will run the Shapiro-Wilk test. Getting a p-value close to 0 means there is very strong evidence against the null hypothesis, which, in the case of the Shapiro-Wilk test, is the hypothesis that the data is normally distributed. (Note that for samples this large, scipy warns that the Shapiro-Wilk p-value may not be accurate for N > 5000, but the Q-Q plots below tell the same story.)

# Checking the game rounds of both AB-groups for normality using the Shapiro-Wilk test
shapiro_gate_30 = stats.shapiro(df[df["version"]=="gate_30"]['sum_gamerounds'])
shapiro_gate_40 = stats.shapiro(df[df["version"]=="gate_40"]['sum_gamerounds'])

# Display the Shapiro-Wilk test results
print("Shapiro-Wilk Test Gate 30:", shapiro_gate_30)
print("Shapiro-Wilk Test Gate 40:", shapiro_gate_40)

# Visual inspection of the distribution using Q-Q plots
plt.figure(figsize=(12, 6))

# Q-Q plot for the gate_30 group's game rounds
plt.subplot(1, 2, 1)
stats.probplot(df[df["version"]=="gate_30"]['sum_gamerounds'], dist="norm", plot=plt)
plt.title('Q-Q Plot Gate 30')

# Q-Q plot for the gate_40 group's game rounds
plt.subplot(1, 2, 2)
stats.probplot(df[df["version"]=="gate_40"]['sum_gamerounds'], dist="norm", plot=plt)
plt.title('Q-Q Plot Gate 40')

plt.tight_layout()
plt.show()
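Since the game rounds are clearly not normally distributed, a parametric t-test would be a poor fit for comparing rounds played between the two groups. As a side note, a non-parametric alternative such as the Mann-Whitney U test could be used instead; the following is a minimal sketch, not part of the main analysis:

# A minimal sketch: comparing rounds played between the two AB-groups with the
# non-parametric Mann-Whitney U test, which does not assume normality
u_stat, p_value = stats.mannwhitneyu(
    df[df["version"] == "gate_30"]["sum_gamerounds"],
    df[df["version"] == "gate_40"]["sum_gamerounds"],
    alternative="two-sided"
)
print("Mann-Whitney U:", u_stat, "p-value:", p_value)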

5. Game rounds played by version

We see that there is not much difference in the number of game rounds played between the two versions. Most people played at most 100 rounds before quitting the game.

# Categorizing sum_gamerounds into specified ranges
df["gamerounds_category"] = pd.cut(df["sum_gamerounds"],
                                   bins=[-1, 0, 100, 500, 1000, float('inf')],
                                   labels=["0", "0-100", "100-500", "500-1000", "1000+"])

# Displaying the counts of each category within each version
category_counts = df.groupby(["version", "gamerounds_category"]).size().unstack(fill_value=0)

# Visualizing the result
category_counts.plot(kind='bar', figsize=(10, 6))
plt.title('Game rounds by Version')
plt.xlabel('Version')
plt.ylabel('Counts')
plt.xticks(rotation=0)
plt.legend(title='Game rounds')
plt.show()

6. Overall 1-day retention

In the plot above we can see that some players install the game but then never play it (0 game rounds), some players just play a couple of game rounds in their first week, and some get really hooked!

What we want is for players to like the game and to get hooked. A common metric in the video gaming industry for how fun and engaging a game is, is 1-day retention: the percentage of players that come back and play the game one day after installing it. The higher 1-day retention is, the easier it is to retain players and build a large player base.

As a first step, let's look at what 1-day retention is overall.

# The % of users that came back the day after they installed
display(df["retention_1"].value_counts(normalize=True).get(True))

7. 1-day retention by AB-group

So, a little less than half of the players come back one day after installing the game. Now that we have a benchmark, let's look at how 1-day retention differs between the two AB-groups.

# Calculating 1-day retention for each AB-group
df.groupby("version")["retention_1"].mean()

8. Should we be confident in the difference?

It appears that there was a slight decrease in 1-day retention when the gate was moved to level 40 (44.2%) compared to the control when it was at level 30 (44.8%). It's a small change, but even small changes in retention can have a large impact. But while we are certain of the difference in the data, how certain should we be that a gate at level 40 will be worse in the future?
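One quick way to put a p-value on the observed difference is a two-proportion z-test. This is a sketch, not part of the original analysis, and it assumes the statsmodels library, which is not imported elsewhere in this notebook:

# A minimal sketch: two-proportion z-test on 1-day retention
# (assumes statsmodels is installed; it is not used elsewhere in this notebook)
from statsmodels.stats.proportion import proportions_ztest

successes = df.groupby("version")["retention_1"].sum()
totals = df.groupby("version")["retention_1"].count()
z_stat, p_value = proportions_ztest(count=successes, nobs=totals)
print("z-statistic:", z_stat, "p-value:", p_value)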

There are a couple of ways we can get at the certainty of these retention numbers. Here we will use bootstrapping: We will repeatedly re-sample our dataset (with replacement) and calculate 1-day retention for those samples. The variation in 1-day retention will give us an indication of how uncertain the retention numbers are.
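A minimal sketch of this bootstrap might look like the following (the 500 iterations below are an arbitrary choice; more iterations give a smoother picture at the cost of runtime):

# A minimal bootstrap sketch: re-sampling the players with replacement and
# computing 1-day retention per AB-group for each re-sample
# (500 iterations is an arbitrary choice)
boot_1d = []
for i in range(500):
    boot_mean = df.sample(frac=1, replace=True).groupby("version")["retention_1"].mean()
    boot_1d.append(boot_mean)

# One row per bootstrap sample, one column per AB-group
boot_1d = pd.DataFrame(boot_1d)

# Kernel density estimate of the bootstrapped 1-day retention for each group
boot_1d.plot(kind="density")
plt.xlabel("1-day retention")
plt.show()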