
Usable A/B Testing - A Bayesian Approach

nneu
May 21, 2016


Transcript

  1. WHY SHOULD WE CARE? MAKE ABSOLUTELY SURE THAT IF WE INTRODUCE CHANGES TO OUR PRODUCT, THE MEMBERS ARE NOT NEGATIVELY AFFECTED
  2. A/B TESTING AT ▸ regularly run live tests ▸ ensure scientists find interesting content ▸ new product designs/features ▸ new algorithms ▸ several experiments on all parts of the platform
  3. A/B TESTING EXPERIMENT - GENERAL CONSIDERATIONS ▸ Define motivation behind experiment ▸ Think about your user segments ▸ Don’t optimise very small parts of your product ▸ Don’t test too many versions in parallel ▸ Don’t get frustrated
  4. A/B TESTING - KEY METRICS ▸ Know your baseline ▸ define the key metrics you are working on ▸ understand the range of acceptable fluctuations ▸ look at specific time ranges ▸ NOT knowing this will lead to erroneous findings
  5. EXPERIMENT FRAMEWORK - SPLIT TESTING ▸ diverts your traffic to different experiment variants ▸ Visual Website Optimizer ▸ Optimizely ▸ in house ▸ A/A experiments
  6. HYPOTHESIS BASED TESTING ▸ Most widespread technique for experiment frameworks ▸ p-value ▸ confidence intervals ▸ needs a fixed sample size in advance
  7. A/B TESTING - THE SAMPLE SIZE ▸ Calculate sample size ▸ define minimal detectable effect ▸ statistical power ▸ how often will you recognise a successful test ▸ typically 80% ▸ significance level ▸ how often will you observe a positive result although there is none ▸ typically 5%
  8. SAMPLE SIZE CALCULATION - EXAMPLE ▸ Power = 0.8, α = 0.05 ▸ The test duration then depends on the traffic to your website
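
A back-of-the-envelope sketch of that calculation (not from the deck): the classical normal-approximation formula for comparing two proportions, with a made-up 3% baseline conversion rate, 5% relative MDE and 2,000 visitors per variant per day.

    # Rough sketch: sample size per variant via the normal approximation for two
    # proportions (two-sided test). Baseline, MDE and daily traffic are example numbers.
    from scipy.stats import norm

    baseline = 0.03                     # current conversion rate
    mde_rel = 0.05                      # minimal detectable effect, relative (5%)
    variant = baseline * (1 + mde_rel)
    alpha, power = 0.05, 0.8

    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    p_bar = (baseline + variant) / 2

    # visitors needed per variant
    n = (z_alpha + z_power) ** 2 * 2 * p_bar * (1 - p_bar) / (variant - baseline) ** 2

    daily_visitors_per_variant = 2_000  # assumption about your traffic
    print(f"{n:,.0f} visitors per variant, about {n / daily_visitors_per_variant:.0f} days")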
  9. RUN YOUR EXPERIMENT - IDEAL WORLD ▸ You stick to the rules ▸ The experiment runs until the pre-calculated sample size is reached ▸ Your website has enough traffic ▸ You don’t look at your experiment while it is running ▸ You only do a simple A/B split experiment ▸ You are patient
  10. THE REAL WORLD - A STORY OF MISUNDERSTANDING ▸ Significance-based experiment evaluation ▸ calculate the p-value using your preferred test ▸ the probability of data at least as extreme as yours, given the null hypothesis ▸ you can only reject the null hypothesis ▸ NO indication of the importance of your result ▸ calculate confidence intervals ▸ capture the uncertainty of your measurement
  11. PROBLEMS WITH SIGNIFICANCE BASED METHODS ▸ Large sample sizes increase experiment run time ▸ Statistical significance is not a valid stopping criterion ▸ the p-value can reach significance very early on because of the novelty effect ▸ Confidence interval ▸ NOT a 95% probability that the true parameter falls within the interval
  12. MULTIPLE NULL HYPOTHESIS TESTS ▸ If you peek at the running experiment you increase your chance of falsely detecting a statistically significant result ▸ effectively corrupts your test ▸ Segmentation of the experiment after the fact ▸ the greater n, the more false positives ▸ Multiple goals: P(at least one false positive) = 1 − (1 − 0.05)^n
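
The last bullet is the family-wise false-positive rate; a tiny illustration (the choice of n values is arbitrary):

    # chance of at least one false positive when testing n independent goals at alpha = 0.05
    alpha = 0.05
    for n in (1, 5, 10, 20):
        print(n, round(1 - (1 - alpha) ** n, 3))
    # with 10 goals the chance of a spurious "winner" is already about 40%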
  13. WHAT IS EASIER TO COMMUNICATE? ▸ “We reject the null hypothesis that variant A = variant B with a p-value of 0.02” ▸ “There is an 85% chance that variant B has at least an 8% lift over variant A”
  14. BAYESIAN REASONING Sherlock Holmes: “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” (Doyle, 1890)
  15. BAYES THEOREM ▸ P(H | E) = P(E | H) · P(H) / P(E) ▸ P(H | E): posterior probability of your hypothesis given the evidence ▸ P(E | H): likelihood of your evidence if the hypothesis is true ▸ P(H): prior probability of your hypothesis ▸ P(E): prior probability of the evidence ▸ POSTERIOR ∝ PRIOR · LIKELIHOOD
  16. NON OVERLAPPING POPULATIONS ▸ Show the different variants to two different populations ▸ A and B are independent ▸ P(CVR_A, CVR_B | data) = P(CVR_A | data) · P(CVR_B | data)
  17. CONVERSION RATE - BAYES THEOREM TO THE RESCUE ▸ Like a coin flip - click or no-click ▸ compare 2 variants (A and B) ▸ Posterior probability is a two-dimensional function of CVR_A and CVR_B ▸ P(CVR_A, CVR_B | data) = [P(data | CVR_A) · P(CVR_A) / P(data)] · [P(data | CVR_B) · P(CVR_B) / P(data)]
  18. LIKELIHOOD & PRIOR PROBABILITY ▸ Likelihood: P(views_A, clicks_A | CVR_A) = (views_A choose clicks_A) · CVR_A^clicks_A · (1 − CVR_A)^(views_A − clicks_A) ▸ Prior probability: P(CVR_A) = CVR_A^(a − 1) · (1 − CVR_A)^(b − 1) / B(a, b)
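
A minimal sketch of those two ingredients for variant A using scipy; the uniform Beta(1, 1) prior is an assumption, and the counts are taken from the example experiment a few slides further on.

    from scipy.stats import beta, binom

    a, b = 1, 1                       # Beta(a, b) prior; uniform is an assumption
    views_A, clicks_A = 6037, 3254    # observed data for variant A

    def likelihood(cvr_a):
        # P(views_A, clicks_A | CVR_A): binomial probability of the observed clicks
        return binom.pmf(clicks_A, views_A, cvr_a)

    prior = beta(a, b)                # P(CVR_A)
    print(likelihood(0.54), prior.pdf(0.54))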
  19. CONJUGATE BETA DISTRIBUTION ▸ prior and posterior distribution are of the same family ▸ Posterior distribution for variant A: P(CVR_A | views_A, clicks_A) = P(views_A, clicks_A | CVR_A) · P(CVR_A) / P(views_A, clicks_A)
  20. POSTERIOR PROBABILITY ▸ Becomes a Beta distribution again ▸ Can be updated regularly ▸ Example for variant A: CVR_A | data ~ Beta(a + clicks_A, b + views_A − clicks_A)
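
The same conjugate update as a scipy sketch, again assuming a Beta(1, 1) prior and variant A’s counts from the example experiment:

    from scipy.stats import beta

    a, b = 1, 1
    clicks_A, views_A = 3254, 6037

    # conjugacy: the posterior is simply a Beta with the clicks and non-clicks added in
    posterior_A = beta(a + clicks_A, b + views_A - clicks_A)
    print(posterior_A.mean())            # posterior mean conversion rate
    print(posterior_A.interval(0.95))    # central 95% credible interval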
  21. EXAMPLE EXPERIMENT ▸ 2 variants ▸ Experiment ran for 1 week ▸ find the posterior probability that the CVR of variant B is greater than that of variant A ▸ Monte Carlo style sampling

                  Clicks   Views
      Variant A     3254    6037
      Variant B     3576    6040
  22. DRAWING SAMPLES MONTE CARLO STYLE ▸ Neat trick to get samples for CVR_A and CVR_B from their respective posterior distributions ▸ To make this even easier we can do it in Python using numpy ▸ Obtain a distribution of credible conversion rates ▸ Highest density interval (HDI)
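
A sketch of that numpy trick, carrying over the Beta(1, 1) prior assumption and the counts from the example experiment:

    import numpy as np

    rng = np.random.default_rng(42)
    n_samples = 100_000

    # draw credible conversion rates from each variant's Beta posterior
    samples_A = rng.beta(1 + 3254, 1 + 6037 - 3254, size=n_samples)
    samples_B = rng.beta(1 + 3576, 1 + 6040 - 3576, size=n_samples)

    # posterior probability that variant B converts better than variant A
    print((samples_B > samples_A).mean())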
  23. THE RELATIVE DIFFERENCE ▸ lower bound: 6.42% ▸ upper bound: 13.32% ▸ 85% probability that the lift of B over A is at least 8%
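
Continuing from the samples above, the numbers on this slide can be reproduced roughly as follows (np.percentile gives an equal-tailed interval, which is only an approximation of the HDI):

    # relative lift of B over A, one value per posterior sample
    lift = (samples_B - samples_A) / samples_A

    print(np.percentile(lift, [2.5, 97.5]))   # roughly the 6.4%-13.3% bounds above
    print((lift > 0.08).mean())               # roughly 0.85: probability of at least an 8% lift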
  24. SEQUENTIAL DATA COLLECTION ▸ threshold-of-not-caring ▸ expected performance loss of choosing one variant over the other ▸ evaluate the ROPE (region of practical equivalence) ▸ compare ROPE and HDI for the relative lift ▸ decide which version to use by weighing potential losses proportional to the lift being lost
  25. EXAMPLE - SEQUENTIAL SAMPLES ▸ Threshold of caring: 0.01% ▸ Every day add new clicks and views ▸ Stop as soon as we have acquired enough samples
  26. EXAMPLE SEQUENTIAL LIFT ▸ ROPE as the decision rule for which version to choose ▸ if the HDI lies completely inside the ROPE ▸ there is no difference between the two variants
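
A hedged sketch of this decision rule, reusing samples_A, samples_B and lift from the snippets above; the exact expected-loss definition and the ROPE width vary between write-ups, so both are assumptions here:

    import numpy as np

    def expected_loss(chosen, other):
        # average relative lift you forfeit if the chosen variant is actually the worse one
        return np.maximum((other - chosen) / chosen, 0).mean()

    threshold_of_caring = 0.0001            # 0.01%, as on the slide
    loss_if_B = expected_loss(samples_B, samples_A)
    loss_if_A = expected_loss(samples_A, samples_B)
    if min(loss_if_A, loss_if_B) < threshold_of_caring:
        print("stop the experiment and ship", "B" if loss_if_B < loss_if_A else "A")

    # ROPE check: if the credible interval of the lift lies entirely inside a small
    # region around zero, treat the two variants as practically equivalent
    rope = (-0.01, 0.01)                    # ROPE width is an assumption
    hdi = np.percentile(lift, [2.5, 97.5])
    if rope[0] < hdi[0] and hdi[1] < rope[1]:
        print("no practical difference between the variants")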
  27. SAMPLE SIZE CALCULATORS ▸ Evan Miller ▸ http://www.evanmiller.org/ab-testing/sample-size.html ▸ Optimizely ▸ https://www.optimizely.com/resources/sample-size-calculator/ ▸ VWO ▸ https://vwo.com/ab-split-test-duration/ ▸ By hand (example: 3% base conversion, 5% relative MDE = 3.15%):

    from statsmodels.stats.power import tt_ind_solve_power
    from statsmodels.stats.proportion import proportion_effectsize

    # effect size for lifting a 3% conversion rate to 3.15%
    es = proportion_effectsize(0.03, 0.0315)
    # required sample size per variant for 80% power at a 5% significance level
    n = tt_ind_solve_power(effect_size=es, ratio=1, power=0.8, alpha=0.05)
  28. FURTHER RESOURCES ▸ John Kruschke ▸ http://doingbayesiandataanalysis.blogspot.de ▸ http://www.indiana.edu/~kruschke/BEST/BEST.pdf ▸ Chris Stucchio ▸ https://www.chrisstucchio.com ▸ VWO Smart stats ▸ https://vwo.com/bayesian-ab-testing/