
how_to_ab_test_with_confidence_railsconf.pdf

Frederick Cheung

April 13, 2021

Transcript

  1. The Plan • Intro: What's an A/B Test? • Test setup errors • Errors during the test • Test analysis errors • Best practices (Photo by Javier Allegue Barros on Unsplash)
  2. Not just buttons! • Layouts / designs / flows • Algorithms (eg recommendation engines) • Anything where you can measure a difference
  3. Significance • Is the observed difference just noise? • p value of 0.05 = 5% chance it's a fluke • The statistical test depends on the type of metric • No guarantees on the magnitude of the difference
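     As an illustration of the statistics (my own sketch, not code from the talk), a two-sided two-proportion z-test is one appropriate test for a conversion-rate metric:

     # Minimal sketch: two-sided two-proportion z-test for conversion rates.
     # Assumes reasonably large samples; other metric types need other tests.
     def z_test_p_value(conversions_a, visitors_a, conversions_b, visitors_b)
       p_a = conversions_a.fdiv(visitors_a)
       p_b = conversions_b.fdiv(visitors_b)
       pooled = (conversions_a + conversions_b).fdiv(visitors_a + visitors_b)
       standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / visitors_a + 1.0 / visitors_b))
       z = (p_a - p_b) / standard_error
       # Two-sided p-value from the standard normal CDF
       2 * (1 - 0.5 * (1 + Math.erf(z.abs / Math.sqrt(2))))
     end

     # Using the checkout numbers that appear later in this deck
     # (2,600 vs 2,500 conversions out of 5,000 users each) gives p ≈ 0.045.
     puts z_test_p_value(2600, 5000, 2500, 5000)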
  4. Test power • How small a change do I want to detect? • 10% to 20% is much easier to measure than 0.1% to 0.2%
  5. Sample size • Check this is feasible! • Ideally you don't look at results or change anything until the sample size is reached • Be wary of very short experiments
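     Tying test power and sample size together, here is a rough sketch (my own helper names, using the standard formula for comparing two proportions) of how many users each variant needs at 5% significance and 80% power:

     # Rough sketch: users needed per variant to detect a change from
     # baseline_rate to target_rate at alpha = 0.05 (two-sided) and 80% power.
     Z_ALPHA_2 = 1.96 # two-sided alpha = 0.05
     Z_BETA    = 0.84 # power = 0.80

     def sample_size_per_variant(baseline_rate, target_rate)
       variance = baseline_rate * (1 - baseline_rate) + target_rate * (1 - target_rate)
       ((Z_ALPHA_2 + Z_BETA)**2 * variance / (target_rate - baseline_rate)**2).ceil
     end

     # A 10% -> 20% change is cheap to measure, a 0.1% -> 0.2% change is not:
     puts sample_size_per_variant(0.10, 0.20)    # ~200 users per variant
     puts sample_size_per_variant(0.001, 0.002)  # ~23,500 users per variant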
  6. Bayesian A/B testing • Allows you to model your existing knowledge & uncertainties • Can be better with low base rates • The underlying maths are a bit more complicated
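     A minimal sketch of the Beta-Binomial model commonly used for Bayesian A/B testing of conversion rates (the helpers and numbers below are mine): with a uniform prior, the posterior for each variant is a Beta distribution, and the probability that B beats A can be computed by numerical integration.

     # With a uniform Beta(1, 1) prior, the posterior after `conversions` successes
     # out of `visitors` trials is Beta(conversions + 1, visitors - conversions + 1).
     def log_beta_pdf(x, a, b)
       Math.lgamma(a + b)[0] - Math.lgamma(a)[0] - Math.lgamma(b)[0] +
         (a - 1) * Math.log(x) + (b - 1) * Math.log(1 - x)
     end

     # Probability that variant B's true rate is higher than A's,
     # by numerical integration over a grid on (0, 1).
     def probability_b_beats_a(conv_a, n_a, conv_b, n_b, steps: 2000)
       a1, b1 = conv_a + 1.0, n_a - conv_a + 1.0
       a2, b2 = conv_b + 1.0, n_b - conv_b + 1.0
       dx = 1.0 / steps
       xs = (1...steps).map { |i| i * dx }
       pdf_a = xs.map { |x| Math.exp(log_beta_pdf(x, a1, b1)) * dx }
       pdf_b = xs.map { |x| Math.exp(log_beta_pdf(x, a2, b2)) * dx }
       # tail_b[i] ≈ P(rate_B > xs[i]), built as a reverse cumulative sum
       tail_b = Array.new(xs.size, 0.0)
       (xs.size - 2).downto(0) { |i| tail_b[i] = tail_b[i + 1] + pdf_b[i + 1] }
       pdf_a.each_with_index.sum { |p, i| p * tail_b[i] }
     end

     puts probability_b_beats_a(2500, 5000, 2600, 5000) # roughly 0.98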
  7. class User < ActiveRecord::Base
       def ab_group
         if id % 2 == 0
           'experiment'
         else
           'control'
         end
       end
     end
  8. Non-random split • Older users in one group • Newer users in the other group • New users were less loyal!
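     A common mitigation, sketched below as my own example rather than something from the talk: derive the group from a hash of the user id and the experiment name, so every experiment gets its own stable split.

     require 'digest'

     class User < ActiveRecord::Base
       # Hash the experiment name together with the id so the split is stable per
       # user but independent for every experiment. A rule like `id % 2` reuses the
       # same two fixed cohorts for every test, and anything correlated with how
       # ids were assigned leaks into the split.
       def ab_group(experiment_name)
         bucket = Digest::SHA256.hexdigest("#{experiment_name}:#{id}").to_i(16) % 2
         bucket.zero? ? 'experiment' : 'control'
       end
     end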
  9. [Funnel diagram: two separate funnels, one per variant. Home Page 50,000 users → 30,000 → 15,000 → 5,000 users reaching Checkout Page A; the same 50,000 → 30,000 → 15,000 → 5,000 reaching Checkout Page B]
  10. [Same funnels with outcomes: Checkout Page A 2,600 conversions, Checkout Page B 2,500 conversions]
  11. [Funnel diagram: a single funnel. Home Page 100,000 users → 60,000 → 30,000 users, with Checkout Page A and Checkout Page B at 5,000 users each (2,600 vs 2,500 conversions)]
  12. Not agreeing on the setup • Scope of the test (what pages, users, countries ...) • What is the goal? How do we measure it? • Agree on *one* metric
  13. Repeated significance testing • Invalidates the significance calculation • Difficult to resist! • Stick to your sample size • This is fine with Bayesian A/B testing
  14. Do the maths • Use the appropriate statistical test • Significance on one metric does not imply significance on another
  15. [Chart: values from -4 to 8 plotted week by week from week 1 to week 7]
  16. If the result is 'wrong' • Start looking at result splits • Start digging for potential errors • "Hey, what about this other metric?" • A well-documented test can help
  17. Don't reinvent the wheel • The Split and Vanity gems do a good job • Consider platforms (Optimizely, Google Optimize) • But understand your tool and its drawbacks
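     For orientation only, a rough sketch of what a test might look like with the Split gem's ab_test / ab_finished helpers (the experiment and controller names here are invented; check the gem's documentation for the exact API and configuration):

     # Split exposes ab_test / ab_finished as helpers in controllers and views.
     class CheckoutController < ApplicationController
       def show
         # Returns the alternative this user is bucketed into and records participation.
         @checkout_variant = ab_test(:checkout_page, 'page_a', 'page_b')
       end

       def complete
         # Records a conversion for this user's alternative.
         ab_finished(:checkout_page)
       end
     end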
  18. Resist the urge to check/tinker • Repeated significance testing • Changing the test while it is running (restart the test if necessary)
  19. A/A tests • Do the full process but with no difference between the variants • Allows you to practise
  20. Be wary of overtesting • Let's test everything! • Can be paralysing/time consuming • Not a substitute for vision / talking to your users
  21. Document your test • Metric (inc. outliers etc.) • Success criteria • Scope • Sample size / test power • Significance calculation/process • Meaningful variant names
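     One possible shape for that documentation (my own illustrative format, not prescribed by the talk) is a small structured record kept next to the experiment code:

     # Hypothetical example of a written-down test plan; all field names and
     # values are illustrative.
     CHECKOUT_PAGE_TEST = {
       variants:     { control: 'current checkout page', new_layout: 'single-step checkout' },
       scope:        'logged-in users, web checkout pages only',
       metric:       'checkout conversion rate (outliers: exclude internal/test accounts)',
       success:      'new_layout converts at least 1 percentage point better',
       sample_size:  23_500, # per variant, from the power calculation
       significance: 'two-proportion z-test, two-sided, alpha = 0.05'
     }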