Slide 1

Slide 1 text

USABLE A/B TESTING A BAYESIAN APPROACH

Slide 2

Slide 2 text

WHY SHOULD WE CARE? ▸ Make absolutely sure that if we introduce changes to our product, the members are not negatively affected

Slide 3

Slide 3 text

A/B TESTING AT ▸ regularly run live tests ▸ ensure scientists find interesting content ▸ new product designs/features ▸ new algorithms ▸ several experiments on all parts of the platform

Slide 4

Slide 4 text

A/B TESTING EXPERIMENT - GENERAL CONSIDERATIONS ▸ Define the motivation behind the experiment ▸ Think about your user segments ▸ Don’t optimise very small parts of your product ▸ Don’t test too many versions in parallel ▸ Don’t get frustrated

Slide 5

Slide 5 text

A/B TESTING - KEY METRICS ▸ Know your baseline ▸ define the key metrics you are working on ▸ understand the range of acceptable fluctuations ▸ look at specific time ranges ▸ NOT knowing this will lead to erroneous findings

Slide 6

Slide 6 text

EXPERIMENT FRAMEWORK - SPLIT TESTING ▸ diverts your traffic to different experiment variants ▸ Visual Website Optimiser ▸ Optimizely ▸ in-house ▸ A/A experiments

Slide 7

Slide 7 text

HYPOTHESIS BASED TESTING ▸ Most widespread technique for experiment frameworks ▸ p-value ▸ confidence intervals ▸ needs a fixed sample size in advance

Slide 8

Slide 8 text

A/B TESTING - THE SAMPLE SIZE ▸ Calculate the sample size ▸ define the minimal detectable effect ▸ statistical power ▸ how often will you recognise a successful test ▸ typically 80% ▸ significance level ▸ how often will you observe a positive result although there is none ▸ typically 5%

Slide 9

Slide 9 text

SAMPLE SIZE CALCULATION - EXAMPLE ▸ Power = 0.8, α = 0.05 ▸ Test duration now depends on your traffic to your website
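For a concrete sense of the numbers, a rough sketch of the standard normal-approximation formula for two proportions; the 3% baseline and 5% relative lift used here are illustrative assumptions (they match the by-hand example in the appendix), not values from this slide:

```python
from scipy.stats import norm

# Rough per-variant sample size for comparing two conversion rates
# (normal-approximation formula, two-sided test).
def sample_size(p_base, p_var, alpha=0.05, power=0.8):
    z_a = norm.ppf(1 - alpha / 2)   # significance level (two-sided)
    z_b = norm.ppf(power)           # statistical power
    p_bar = (p_base + p_var) / 2    # pooled rate
    delta = abs(p_var - p_base)     # minimal detectable effect (absolute)
    n = 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return int(n) + 1

# 3% baseline, 5% relative lift -> roughly 200k views per variant
n = sample_size(0.03, 0.0315)
```

Small effects on a small baseline need very large samples, which is why test duration is driven by your traffic.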

Slide 10

Slide 10 text

RUN YOUR EXPERIMENT - IDEAL WORLD ▸ You stick to the rules ▸ The experiment runs until the pre-calculated sample size is reached ▸ Your website has enough traffic ▸ You don’t look at your experiment while it is running ▸ You only do a simple A/B split experiment ▸ You are patient

Slide 11

Slide 11 text

THE REAL WORLD - A STORY OF MISUNDERSTANDING ▸ Significance based experiment evaluation ▸ calculate the p-value using your preferred test ▸ the probability of data at least as extreme as yours, given the null hypothesis ▸ you can only reject the null hypothesis ▸ NO indication of the importance of your result ▸ calculate confidence intervals ▸ capture the uncertainty of your measurement

Slide 12

Slide 12 text

PROBLEMS WITH SIGNIFICANCE BASED METHODS ▸ Large sample sizes increase experiment run time ▸ Statistical significance is not a valid stopping criterion ▸ the p-value can reach significance very early on because of the novelty effect ▸ Confidence interval ▸ NOT a 95% probability that the true parameter falls within the interval

Slide 13

Slide 13 text

MULTIPLE NULL HYPOTHESIS TESTS ▸ If you peek at the running experiment you increase your chance of falsely detecting a statistically significant result ▸ effectively corrupts your test ▸ Segmentation of the experiment after the fact ▸ the greater n, the more false positives ▸ Multiple goals ▸ chance of at least one false positive: 1 − (1 − 0.05)^n
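The growth of this family-wise error rate can be checked directly:

```python
# Probability of at least one false positive when evaluating n
# independent goals/segments/peeks at significance level 0.05.
def family_wise_error(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

# One test keeps the promised 5%; twenty tests push the chance
# of a spurious "significant" result well past 60%.
print(round(family_wise_error(1), 3))   # 0.05
print(round(family_wise_error(10), 3))  # 0.401
print(round(family_wise_error(20), 3))  # 0.642
```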

Slide 14

Slide 14 text

WHAT IS EASIER TO COMMUNICATE? ▸ We reject the null hypothesis that variant A = variant B with a p-value of 0.02 ▸ There is an 85% chance that variant B has an 8% lift over variant A

Slide 15

Slide 15 text

BAYESIAN REASONING Sherlock Holmes: “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” (Doyle, 1890)

Slide 16

Slide 16 text

BAYES THEOREM

P(H | E) = P(E | H) · P(H) / P(E)

▸ P(H | E): posterior probability of your hypothesis given the evidence
▸ P(E | H): likelihood of your evidence if the hypothesis is true
▸ P(H): prior probability of your hypothesis
▸ P(E): prior probability of the evidence

POSTERIOR ∝ PRIOR × LIKELIHOOD

Slide 17

Slide 17 text

NON-OVERLAPPING POPULATIONS ▸ Show the different variants to two different populations ▸ A and B are independent:

P(CVR_A, CVR_B | data) = P(CVR_A | data) · P(CVR_B | data)

Slide 18

Slide 18 text

CONVERSION RATE - BAYES THEOREM TO THE RESCUE ▸ Like a coin flip - click or no-click ▸ compare 2 variants (A and B) ▸ Posterior probability is a two-dimensional function of CVR_A and CVR_B:

P(CVR_A, CVR_B | data) = [P(data | CVR_A) · P(CVR_A) / P(data)] · [P(data | CVR_B) · P(CVR_B) / P(data)]

Slide 19

Slide 19 text

LIKELIHOOD & PRIOR PROBABILITY ▸ Likelihood (binomial):

P(views_A, clicks_A | CVR_A) = (views_A choose clicks_A) · CVR_A^clicks_A · (1 − CVR_A)^(views_A − clicks_A)

▸ Prior probability (Beta):

P(CVR_A) = CVR_A^(a − 1) · (1 − CVR_A)^(b − 1) / B(a, b)

Slide 20

Slide 20 text

CONJUGATE BETA DISTRIBUTION ▸ prior and posterior distribution are of the same family ▸ Posterior distribution for variant A:

P(CVR_A | views_A, clicks_A) = P(views_A, clicks_A | CVR_A) · P(CVR_A) / P(views_A, clicks_A)

Slide 21

Slide 21 text

POSTERIOR PROBABILITY ▸ Becomes a Beta distribution again ▸ Can be updated regularly ▸ Example for variant A:

Beta(CVR_A; a + clicks_A, b + views_A − clicks_A)
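A minimal sketch of this conjugate update in Python, using the variant A counts from the example experiment a few slides on and an assumed uniform Beta(1, 1) prior:

```python
from scipy.stats import beta

# Conjugate update: Beta(a, b) prior + binomial click data gives a
# Beta(a + clicks, b + views - clicks) posterior.
a, b = 1, 1                     # uninformative prior
views_a, clicks_a = 6037, 3254  # variant A from the example experiment

posterior = beta(a + clicks_a, b + views_a - clicks_a)
print(round(posterior.mean(), 3))  # 0.539, matching the ~53.9% mode
```

Because the posterior is again a Beta distribution, each day's new clicks and views can simply be added to the parameters.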

Slide 22

Slide 22 text

UNINFORMATIVE PRIOR

Slide 23

Slide 23 text

INFORMATIVE PRIOR

Slide 24

Slide 24 text

EXAMPLE EXPERIMENT ▸ 2 variants ▸ Experiment ran for 1 week ▸ Find the posterior probability that the CVR of variant B is greater than that of variant A ▸ Monte Carlo style sampling

            Clicks   Views
Variant A     3254    6037
Variant B     3576    6040

Slide 25

Slide 25 text

DRAWING SAMPLES MONTE CARLO STYLE ▸ Neat trick to get samples for CVR_A and CVR_B from their respective posterior distributions ▸ To make this even easier we can do this in Python using numpy ▸ Obtain the distribution of credible conversion rates ▸ Highest density interval (HDI)

Slide 26

Slide 26 text

EVALUATE USING PYTHON
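The code on this slide was not captured in the transcript; a minimal sketch of the numpy evaluation the previous slide describes, assuming uniform Beta(1, 1) priors:

```python
import numpy as np

# Monte Carlo evaluation of the example experiment: draw credible
# conversion rates from each variant's Beta posterior and compare.
rng = np.random.default_rng(0)
n_samples = 100_000

clicks_a, views_a = 3254, 6037
clicks_b, views_b = 3576, 6040

# Beta(1 + clicks, 1 + views - clicks) posteriors (uniform prior)
samples_a = rng.beta(1 + clicks_a, 1 + views_a - clicks_a, n_samples)
samples_b = rng.beta(1 + clicks_b, 1 + views_b - clicks_b, n_samples)

# Posterior probability that B converts better than A
p_b_better = (samples_b > samples_a).mean()
```

For this data the difference is large relative to the posterior spread, so `p_b_better` comes out essentially at 1.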

Slide 27

Slide 27 text

PLOTTING THE DISTRIBUTIONS ▸ Variant A mode: 53.90% ▸ Variant B mode: 59.20%

Slide 28

Slide 28 text

THE RELATIVE DIFFERENCE ▸ lower bound: 6.42% ▸ upper bound: 13.32% ▸ 85% probability that the lift of B over A is at least 8%
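One way to obtain such bounds from the posterior samples (a central 95% credible interval here as a simple stand-in for the HDI; uniform priors assumed):

```python
import numpy as np

# Distribution of the relative lift of B over A from the example
# experiment, plus a 95% credible interval and P(lift > 8%).
rng = np.random.default_rng(0)
samples_a = rng.beta(1 + 3254, 1 + 6037 - 3254, 100_000)
samples_b = rng.beta(1 + 3576, 1 + 6040 - 3576, 100_000)

lift = (samples_b - samples_a) / samples_a      # relative lift of B over A
lower, upper = np.percentile(lift, [2.5, 97.5])
p_lift_over_8 = (lift > 0.08).mean()            # chance of at least 8% lift
```

The resulting interval and probability land close to the slide's 6.42%/13.32% bounds and 85% figure.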

Slide 29

Slide 29 text

MULTIPLE VARIANTS

Slide 30

Slide 30 text

OVERALL PROBABILITY
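With more than two variants, the same sampling idea gives the probability that each variant is the overall best; the variant C counts below are hypothetical, added only to make the example three-way:

```python
import numpy as np

# Sample every variant's posterior jointly and count how often each
# one has the highest conversion rate in a draw.
rng = np.random.default_rng(0)
data = {
    "A": (3254, 6037),  # clicks, views (from the example experiment)
    "B": (3576, 6040),
    "C": (3400, 6000),  # hypothetical third variant
}

samples = {
    name: rng.beta(1 + clicks, 1 + views - clicks, 100_000)
    for name, (clicks, views) in data.items()
}
stacked = np.stack(list(samples.values()))  # shape: (variants, samples)
winners = np.argmax(stacked, axis=0)        # index of best variant per draw
p_best = {name: (winners == i).mean() for i, name in enumerate(data)}
```

`p_best` sums to 1 by construction, so it reads directly as "the probability this variant is the best overall".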

Slide 31

Slide 31 text

CALCULATE LIFT ▸ Relative lift of Variant E compared to Variant A (default): min. 0.156, max. 1.048

Slide 32

Slide 32 text

SEQUENTIAL DATA COLLECTION ▸ threshold-of-not-caring ▸ expected performance loss of choosing one variant over the other ▸ evaluate the ROPE (region of practical equivalence) ▸ compare ROPE and HDI for the relative lift ▸ decide which version to use by weighing potential losses proportional to the lift being lost
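A sketch of the expected-loss part of this decision rule, reusing the example experiment's counts with uniform priors; the 0.0001 threshold corresponds to the 0.01% threshold on the next slide:

```python
import numpy as np

# Expected performance loss: the conversion rate you give up, on
# average, if you pick a variant and it turns out to be the worse one.
# In a sequential experiment you stop once the smaller loss drops
# below your threshold-of-not-caring.
rng = np.random.default_rng(0)
samples_a = rng.beta(1 + 3254, 1 + 6037 - 3254, 100_000)
samples_b = rng.beta(1 + 3576, 1 + 6040 - 3576, 100_000)

loss_choose_a = np.maximum(samples_b - samples_a, 0).mean()
loss_choose_b = np.maximum(samples_a - samples_b, 0).mean()

threshold = 0.0001  # 0.01% threshold-of-not-caring
stop = min(loss_choose_a, loss_choose_b) < threshold
```

Here choosing B costs essentially nothing while choosing A forfeits the full lift, so the experiment can stop in favour of B.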

Slide 33

Slide 33 text

EXAMPLE - SEQUENTIAL SAMPLES ▸ Threshold of caring: 0.01% ▸ Every day add new clicks and views ▸ Stop as soon as we have acquired enough samples

Slide 34

Slide 34 text

EXAMPLE SEQUENTIAL LIFT ▸ ROPE as the decision criterion for which version to choose ▸ if the HDI lies completely inside the ROPE ▸ no difference between the two variants

Slide 35

Slide 35 text

THANK YOU Nora Neumann @nora_neu https://github.com/researchgate

Slide 36

Slide 36 text

SAMPLE SIZE CALCULATORS ▸ Evan Miller ▸ http://www.evanmiller.org/ab-testing/sample-size.html ▸ Optimizely ▸ https://www.optimizely.com/resources/sample-size-calculator/ ▸ VWO ▸ https://vwo.com/ab-split-test-duration/ ▸ By hand (example: 3% base conversion, 5% relative MDE = 3.15%)

from statsmodels.stats.power import tt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's effect size for moving from 3% to 3.15% conversion
es = proportion_effectsize(0.03, 0.0315)
# sample size per variant for 80% power at a 5% significance level
n = tt_ind_solve_power(effect_size=es, ratio=1, power=0.8, alpha=0.05)

Slide 37

Slide 37 text

FURTHER RESOURCES ▸ John Kruschke ▸ http://doingbayesiandataanalysis.blogspot.de ▸ http://www.indiana.edu/~kruschke/BEST/BEST.pdf ▸ Chris Stucchio ▸ https://www.chrisstucchio.com ▸ VWO Smart stats ▸ https://vwo.com/bayesian-ab-testing/