How to analyse A/B experiments using Bayesian “expected loss”

A how-to guide to calculating Bayesian Expected Loss for your experiment

Iqbal Ali
Towards Data Science


Image by David Schwarzenberg from Pixabay

Let’s imagine we’ve had an A/B experiment running for the last four weeks. We now have a decision to make: is the variation a winner or loser? How certain are we of this decision?

To help answer this question, here are the results of our imaginary experiment. It compares the conversion rates for our two variation groups cumulatively tracked over time:

Image by author. The story of reward: cumulative “conversion rates” plotted over time, higher is better. The control group is on top; the cumulative probability is 89.9%.

The above line graph charts the cumulative conversion rates of our two variations. The control group has a slightly higher conversion rate. This has been the case for the last three weeks.

The overall Bayesian probability for this result is 89.9%, which here is the probability that the leading variant really is the better one. This is just below the threshold our company commonly accepts (~90%). If you’re used to thinking in frequentist terms, treat this as roughly analogous to the level of statistical significance.

Some context on the experiment itself: we’re using this test to de-risk the release of an important feature. A “win” here is desirable, but not essential. A “flat” result, meanwhile, is acceptable. A “loss” is something we’d ideally want to avoid.

Overall, based on the above view, we can see that the experiment is likely a “loss”, but we’re not completely certain.

This is a common situation for many experiments. A decision needs to be made, but the results aren’t clear enough to make it with confidence. What usually happens in these cases is that the experiment keeps running, in the hope that more traffic brings greater certainty.

All this happens because the view above is an incomplete picture of our experiment. It shows the reward for running this test (or the lack of one, in this case). What’s missing is a view of the risk to weigh against it.

I’ve written about Expected Loss in an earlier article. Developed for VWO by Chris Stucchio, Expected Loss presents the risk of choosing one variant of an experiment over another. You can find out more about it in his whitepaper.
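In rough terms, if λ_A and λ_B are the (unknown) true conversion rates of two variants, the expected loss of choosing A is the average of how much worse A is than B, counting only the cases where A actually is worse:

EL(A) = E[ max(λ_B − λ_A, 0) ]

Zero means choosing A is risk-free; the bigger the number, the more conversion rate we stand to give up.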

But essentially, the lower the risk, the better. If we chart the cumulative expected loss of choosing one variation over another, our experiment looks like this:

Image by author. The story of risk: cumulative “expected loss” plotted over time, lower is better. The variation group is on top and the control group on the bottom; the lines criss-cross to start and then settle towards the target of zero.

We can see that the relative “cost” of choosing the variant over the control is 0.75%. What this means financially depends on the traffic and segment being tested. It will also differ from company to company.

For instance, 0.75% might be a deal-breaker for some larger companies with high traffic volumes.

Reviewing this line graph now, we can see that the lines look stable enough to accept this view of the “risk”. So now, there comes a business decision: does the need for rolling out this feature outweigh the risk?

We’ve given the decision-makers enough information to make their decision. We could also have made this decision at week 3 depending on the urgency. Showing the risk alongside the reward is a tremendously powerful way to tell the data story of our experiment.

Calculating Expected Loss

So, how do we calculate Expected Loss? Let’s take a deeper look, using a trusty Jupyter Notebook and putting our Python 3 skills to the test!

What we’ll be using

If you want to follow along, I’m going to assume you’ve got Jupyter Notebook and Python 3 installed.

Also, we’ll be using numpy and scipy, so you’ll need to install those too. functools is part of the Python 3 standard library, so it’s already available.
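If either library is missing, a one-liner in the terminal takes care of it (assuming you’re using pip):

pip install numpy scipy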

Our first cell in our notebook looks like this:

import numpy as np               # array maths and means
from scipy.stats import beta     # the beta distribution we'll sample from
from functools import reduce     # for folding a list into a single sum

This loads all the libraries we need.

Load some example data

Next, we add our data. This means adding the visits (or users if that’s your flavour of jam) for each variant, along with the conversions.

visits_control = 5625348
orders_control = 219197
visits_variation = 5613277
orders_variation = 221100

I just made those numbers up, but if doing this for real, we’ll be getting these from our analytics tool of choice.

Calculate the conversion rates

We need to know the conversion rate we’re dealing with for each variant.

conversion_control = orders_control / visits_control
conversion_variation = orders_variation / visits_variation
print("Control:", conversion_control)
print("Variation:", conversion_variation)

The output for this:

Control: 0.0389659448624334
Variation: 0.03938875633609387

These are 3.90% and 3.94% respectively, to two decimal places.
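Incidentally, if you’d rather have Python do the percentage formatting for you, an f-string does the trick:

print(f"Control: {conversion_control:.2%}")      # Control: 3.90%
print(f"Variation: {conversion_variation:.2%}")  # Variation: 3.94%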

Random samples from the normal distribution

Let’s get into some basic statistical concepts. Per the central limit theorem, we can assume each observed conversion rate is approximately normally distributed, and so draw a normal distribution curve for each one.
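As a quick aside for the curious, the width of each curve is governed by the standard error of a conversion rate, sqrt(p * (1 - p) / n). A minimal sketch for the control group, using the numbers defined above:

se_control = np.sqrt(conversion_control * (1 - conversion_control) / visits_control)
print(se_control)  # ~0.000082, a very narrow curve at this traffic volume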

Here is what a normal distribution curve looks like:

Image by author. A normal distribution bell-curve, with the mean marked in the centre.

The horizontal axis represents the conversion rate, and the pink line marks our “observed” conversion rate. So, for the control group, this is 0.038965 or 3.90%. For the variation, this is 0.039389 or 3.94%.

These values are our µ or the “mean” for each of our distribution curves.

The vertical axis represents the density of “observations”. The shape under the curve is the probability distribution of these observations: the further we move to the left or right of our µ, the fewer “observations” we have.

We can mark some lines for the standard deviations σ:

Image by author. Normal distribution curve with the standard deviations marked either side of the mean.

Imagine we were to pull some random samples from our control group. What conversion rates would we see for these random samples? Well, there is a 68.26% chance that a random sample would come from the pink shaded area above (34.13% + 34.13%):

Image by author. Normal distribution curve highlighting -1/+1 standard deviations from the mean.

Further from the mean means less chance of those values.
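Don’t just take the 68.26% figure on trust, by the way; scipy can confirm it in one line:

from scipy.stats import norm
print(norm.cdf(1) - norm.cdf(-1))  # 0.6826894921370859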

Don’t worry if this isn’t clear yet. Things will hopefully become clearer as we continue. Let’s go ahead and use the magic of Python to pull some random samples from our distribution curves.

Edit: correction, we are actually going to be pulling samples from a beta distribution, which is the natural choice for modelling a conversion rate. At sample sizes like ours, it’s practically indistinguishable from the normal curve described above, so the intuition still holds.
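If you want to check how close the two distributions really are, here’s a small sketch comparing their 95% intervals for the control group (reusing the counts from above and the norm import from earlier):

a, b = orders_control, visits_control - orders_control             # successes, failures
sd = np.sqrt(conversion_control * (1 - conversion_control) / visits_control)
print(beta.ppf([0.025, 0.975], a, b))                              # beta interval
print(norm.ppf([0.025, 0.975], loc=conversion_control, scale=sd))  # normal interval

The two intervals match to several decimal places.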

First, we define how many random samples we want:

N_MC = 10

Although we’d normally use a number like 10,000 or 100,000, it’s much easier to use a smaller number as a demonstration.

To get our random samples, we pass “successes” and “failures” into scipy’s beta function as its two shape parameters. The call looks something like this:

beta.rvs(successes, failures, size=N_MC)

We use these values to generate our random samples:

control_successes = orders_control                   # visits that converted
control_failures = visits_control - orders_control   # visits that didn't
control_sample = beta.rvs(control_successes, control_failures, size=N_MC)

If we printed the value of control_sample we’d get this:

[0.03903641 0.0389794  0.03905511 0.0390165  0.03891369 0.03899223
0.03884811 0.03901279 0.03884893 0.03901625]

A list of ten sample conversion rates based on our observed “mean”. They look pretty close to our control conversion rate of 0.038965. That’s what we’d expect.

We need to do the same for the Variation group:

var_successes = orders_variation                     # visits that converted
var_failures = visits_variation - orders_variation   # visits that didn't
variation_sample = beta.rvs(var_successes, var_failures, size=N_MC)

The printed output of variation_sample:

[0.0393568  0.03934686 0.03950938 0.03935167 0.03929198 0.0393716
0.03940077 0.03934958 0.03939534 0.03936369]

They look pretty close to our variation conversion rate of 0.039389. That’s also what we’d expect.

Calculating the Expected Loss

So, now we have:

  1. control_sample: a list of 10 random conversion rates based on our observed conversion rate for the control group
  2. variation_sample: a list of 10 random conversion rates based on our observed conversion rate for the variation group

Now we want to compare the two sets of samples. It’s easy to do this if we zip them up into a list of tuples (a tuple here being a pair of values). So, if we do this…

samples = list(zip(variation_sample, control_sample))

…and print the output of samples, we’d get:

[(0.03935680021728885, 0.03903640775019114), (0.039346863185651795, 0.03897940249438995), (0.03950938, 0.03905511), (0.039351671006072196, 0.0390165049777375), (0.03929198308373445, 0.03891368745047992), (0.03937159635366056, 0.038992234526441655), (0.03940077430481381, 0.0388481084116417), (0.03934958290885752, 0.039012789327513016), (0.039395339750152525, 0.03884893384273168), (0.039363691751648874, 0.03901625035958206)]

Here’s an alternative view of this:

Image by author. Visualising the list of tuples.

Having a list of tuples makes it easier to compare our two values. To work out the Expected Loss for the control group, we need to perform the following actions for each sample pair in our list:

  1. Subtract the control sample from the variation sample (variation_sample - control_sample)
  2. Floor any negative result to 0 (because Expected Loss is either 0 or a positive number, never a negative value)
  3. Return the average (mean) of the resulting list

To do this in Python, we create a list of differences, diff_list:

# For each (variation, control) pair, keep the difference, floored at zero
diff_list = map(lambda sample: np.max([sample[0]-sample[1], 0]), samples)

Then we add up every item in diff_list…

sum_diff = reduce(lambda x,y:x+y, diff_list)

…then take the mean and express it as a percentage.

EL_CONTROL = sum_diff/N_MC * 100.

The final Expected Loss for the control group is:

0.04018252880723959

In other words, we expect a loss of 0.04018% if we picked the control group.

Now, we need to do the same for the variation group:

# Same steps again, but from the variation's point of view (control minus variation)
diff_list = map(lambda sample: np.max([sample[1]-sample[0], 0]), samples)
sum_diff = reduce(lambda x,y:x+y, diff_list)
EL_VAR = sum_diff/N_MC * 100.

Printing EL_VAR gives us:

0.0

In other words, we expect to lose nothing (0%) by picking the variation. So, overall the variation group represents the least risk.

It’s important to note that we used only 10 samples here. If we were doing this for real, we would use many more. But essentially it’s the same process. Simple, right?
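In fact, once the idea is clear, the whole calculation collapses into a few vectorised lines. Here’s a minimal sketch of a reusable helper (the function name and signature are my own invention, not an official API):

def expected_loss(orders_a, visits_a, orders_b, visits_b, n_samples=100000):
    # Expected loss (as a %) of choosing variant A over variant B
    a = beta.rvs(orders_a, visits_a - orders_a, size=n_samples)
    b = beta.rvs(orders_b, visits_b - orders_b, size=n_samples)
    return np.mean(np.maximum(b - a, 0)) * 100

print(expected_loss(orders_control, visits_control,
                    orders_variation, visits_variation))  # risk of picking control
print(expected_loss(orders_variation, visits_variation,
                    orders_control, visits_control))      # risk of picking variation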

Now, these calculations of Expected Loss may or may not be particularly useful on their own. We still need to know whether the result is “significant”.

For this, we’d need to plot the cumulative results over time. You can take the Expected Loss figures into Excel and plot the graph there, or you can create it in a Jupyter Notebook (I may write a follow-up on this for those who are interested).
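To give a flavour of the Jupyter route, here’s a minimal sketch using matplotlib and the expected_loss helper from above. The daily_* lists are hypothetical stand-ins for per-day totals exported from your analytics tool:

import matplotlib.pyplot as plt

# Cumulative totals built from (hypothetical) daily figures
cum_vc, cum_oc = np.cumsum(daily_visits_control), np.cumsum(daily_orders_control)
cum_vv, cum_ov = np.cumsum(daily_visits_variation), np.cumsum(daily_orders_variation)

# Expected loss of each choice, recalculated at each day's cumulative totals
el_control = [expected_loss(oc, vc, ov, vv)
              for oc, vc, ov, vv in zip(cum_oc, cum_vc, cum_ov, cum_vv)]
el_variation = [expected_loss(ov, vv, oc, vc)
                for oc, vc, ov, vv in zip(cum_oc, cum_vc, cum_ov, cum_vv)]

plt.plot(el_control, label="control")
plt.plot(el_variation, label="variation")
plt.xlabel("day")
plt.ylabel("expected loss (%)")
plt.legend()
plt.show()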

The final view would look something like this:

Image by author. Expected Loss example: variation group on top, control group on the bottom; the lines criss-cross to start and then settle towards the target of zero.

We can apply some rules to validate whether a test is stable enough to conclude. Rules I’ve used with the teams I’ve worked with are:

  1. Seven days without the expected loss lines crossing
  2. Consistency of the expected loss lines
  3. A “probability of being best” of 90% or more

Note: the probability of being best is best calculated from the conversion rate samples rather than the expected loss figures, since we floored the negative values to zero when calculating the latter.
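Happily, the same Monte Carlo samples give us this probability almost for free: it’s simply the share of draws in which one variant beats the other.

# Share of Monte Carlo draws where the variation beats the control
prob_variation_best = np.mean(variation_sample > control_sample)
print(prob_variation_best)

With only 10 samples this will be lumpy, so as before, use a much larger N_MC for a real read.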

Even though the example line graphs have several weeks of data, you could take the read earlier (perhaps at week 3). The idea is that the business is aware of the risk before deciding the results of the test.

Utilising Expected Loss views in this way has helped reduce the runtime of many experiments I’ve run, especially those where the primary goal was to de-risk a feature or rollout.

I really believe that viewing experiments through the lens of risk as well as reward is incredibly powerful, not just when analysing experiments, but also when communicating results to stakeholders, as it makes decision-making much easier.

Would love to hear your thoughts. Any other views to make experiment results clearer for the end-user?

About me

I’m Iqbal Ali, writer of comics and former Head of Optimisation at Trainline.

I help companies with their experimentation programs through training, setting up processes, and telling the data stories of experiments!

Here’s my LinkedIn if you want to connect. Or follow me here on Medium.
