This post walks through an end-to-end A/B test using a real mobile game dataset. The goal is simple: decide whether placing a progression gate at level 5 or at level 10 leads to better player retention. The analysis is implemented in a Jupyter notebook, and this article translates that notebook into a readable narrative while keeping the technical substance intact.
Problem context
In many mobile games, progression gates slow players down unless they invest time or make an in-app purchase. The placement of these gates can meaningfully affect player experience and long-term retention.
Players were randomly assigned to one of two versions.
- Gate at level 5
- Gate at level 10
Success is evaluated using one-day retention and seven-day retention.
Dataset overview
The dataset contains 90,189 players. Each row represents a single user and includes the following fields:

- `userid`
- `version`, indicating gate placement
- `sum_gamerounds`, the total rounds played
- `retention_1` and `retention_7`, binary retention indicators
Before analyzing outcomes, we validate the experiment split.
```python
import pandas as pd

df = pd.read_csv("game_app_ab_testing.csv")
df.groupby("version")["userid"].count()
```

Both variants have comparable sample sizes, which allows a fair comparison.
Understanding player activity
Before focusing on retention, it helps to understand overall engagement. The distribution of total game rounds is heavily right-skewed: most players churn early, while a small fraction play for a long time.
```python
# Count how many players finished each number of total rounds,
# then plot the first 100 round counts to show the long tail
plot_df = df.groupby("sum_gamerounds")["userid"].count()
plot_df.head(100).plot()
```

This pattern is common in free-to-play games and reinforces why retention is a more stable metric than total playtime.
Retention metrics
Retention is measured at two horizons.
- One-day retention captures immediate engagement
- Seven-day retention reflects short-term value
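Because both indicators are binary, a retention rate is simply the mean of the column. The toy frame below is a stand-in for the real dataset, used only to make the computation concrete:

```python
import pandas as pd

# Toy stand-in rows; in the notebook, df is loaded from the CSV
df = pd.DataFrame({
    "retention_1": [1, 0, 1, 0, 0],
    "retention_7": [1, 0, 0, 0, 0],
})

# The mean of a 0/1 indicator is the retention rate
print(df["retention_1"].mean())  # 0.4 on this toy frame
print(df["retention_7"].mean())  # 0.2 on this toy frame
```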
A small number of players record zero game rounds yet still return. This edge case is rare but worth keeping in mind when interpreting results.
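This edge case is easy to quantify with a boolean filter. The sketch below runs on a toy stand-in frame; in the notebook, the same filter would run on the loaded `df`:

```python
import pandas as pd

# Toy stand-in rows; the real df has 90,189 players
df = pd.DataFrame({
    "sum_gamerounds": [0, 0, 3, 12],
    "retention_1": [1, 0, 1, 0],
})

# Players who logged zero rounds yet still returned the next day
ghosts = df[(df["sum_gamerounds"] == 0) & (df["retention_1"] == 1)]
print(len(ghosts))  # 1 on this toy frame
```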
Comparing gate placement
We begin by computing average retention for both variants.
```python
df.groupby("version")[["retention_1", "retention_7"]].mean()
```

At first glance, the level 5 gate shows slightly higher retention at both horizons. To understand whether this difference is meaningful, we estimate uncertainty using bootstrapping.
Bootstrapping the difference
Bootstrapping allows us to approximate the sampling distribution without relying on parametric assumptions. We repeatedly resample users with replacement and recompute retention for each group.
```python
boot_1d = []
for _ in range(5000):
    sample = df.sample(frac=1, replace=True)
    stats = sample.groupby("version")["retention_1"].mean()
    boot_1d.append(stats)
```

The same approach is applied to seven-day retention. From these samples, we compute the percentage difference between gate placements.
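The percentage-difference step can be sketched as follows. This is a self-contained toy version: the synthetic data, the variant labels `gate_5` and `gate_10`, and the reduced iteration count are all assumptions for illustration, not the notebook's actual values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (labels are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "version": rng.choice(["gate_5", "gate_10"], size=2000),
    "retention_1": rng.random(2000) < 0.45,
})

# Bootstrap: resample players with replacement, recompute group means
boot_1d = []
for _ in range(500):
    sample = df.sample(frac=1, replace=True)
    boot_1d.append(sample.groupby("version")["retention_1"].mean())

boot_df = pd.DataFrame(boot_1d)  # one row per bootstrap iteration
# Percentage difference of level-5 retention relative to level-10
boot_1d_diff = (boot_df["gate_5"] - boot_df["gate_10"]) / boot_df["gate_10"] * 100
print(boot_1d_diff.describe())
```

The same transformation applied to the seven-day samples yields the seven-day difference series.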
Visualizing the results
Kernel density plots make the comparison intuitive.
```python
boot_1d_diff.plot(kind="kde")
boot_7d_diff.plot(kind="kde")
```

In both cases, most of the distribution lies above zero, indicating higher retention when the gate is placed at level 5.
Probability interpretation
Rather than focusing on a single p-value, we compute the probability that level 5 outperforms level 10.

```python
(boot_1d_diff > 0).mean()
(boot_7d_diff > 0).mean()
```

These probabilities are high for both metrics, providing strong evidence that earlier gating improves retention.
Final takeaway
Moving the progression gate from level 10 to level 5 increases both one-day and seven-day retention. The effect size is modest but consistent and statistically robust.
From a product perspective, this suggests that earlier friction does not necessarily harm engagement. From a data perspective, it highlights how bootstrapping offers a clear and interpretable framework for A/B testing decisions.
The accompanying notebook contains the full implementation and can be reused as a template for similar experiments.