how_to_ab_test_with_confidence_railsconf.pdf

How to A/B Test with confidence
@fglc2
Photo by Ivan Aleksic on Unsplash

The Plan
• Intro: What's an A/B Test?
• Test setup errors
• Errors during the test
• Test analysis errors
• Best practices
Photo by Javier Allegue Barros on Unsplash

What is an A/B test?

"Buy Now" or "Order"

[Figure: a crowd of users]

[Figure: the users divided between the two buttons]
Buy Now: 49 orders
Order: 56 orders

Is the difference real?

Not just buttons!
• Layouts / designs / flows
• Algorithms (e.g. recommendation engines)
• Anything where you can measure a difference

Jargon

Significance
• Is the observed difference just noise?
• p value of 0.05 = 5% chance of seeing a difference at least this large when there is no real difference (a fluke)
• The statistical test depends on the type of metric
• No guarantees on the magnitude of the difference
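
To make the question concrete, here is a minimal sketch of one common test for a conversion-style metric, a two-proportion z-test, written in plain Ruby. The slides only give the order counts (49 and 56); the 1,000 visitors per variant and the method name `two_proportion_p_value` are assumptions made for illustration.

```ruby
# Two-sided p value for the difference between two conversion rates,
# using a pooled two-proportion z-test (normal approximation).
def two_proportion_p_value(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  return 1.0 if se.zero? # no variation at all, nothing to distinguish

  z = (conv_b.to_f / n_b - conv_a.to_f / n_a) / se
  Math.erfc(z.abs / Math.sqrt(2))
end

# 49 vs 56 orders, ASSUMING 1,000 visitors saw each button.
p_value = two_proportion_p_value(49, 1_000, 56, 1_000)
puts format('p value: %.3f', p_value)
puts(p_value < 0.05 ? 'significant at 0.05' : 'not significant at 0.05')
```

Under that assumed traffic the difference is well within what noise alone would produce. A different metric type (for example revenue per user) would need a different test.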

Test power
Photo by Michael Longmire on Unsplash

Test power
• How small a change do I want to detect?
• 10% to 20% is much easier to measure than 0.1% to 0.2%

Sample size
• Check this is feasible!
• Ideally you don't look at or change anything until the sample size is reached
• Be wary of very short experiments
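
As a rough feasibility check, the sketch below uses the standard normal-approximation formula for comparing two proportions. The significance level (5%, two-sided) and power (80%) are assumptions, hard-coded as their z values, and `sample_size_per_group` is a hypothetical helper name.

```ruby
# Approximate users needed PER GROUP to detect a change in conversion
# rate from p1 to p2.
# z_alpha = 1.96 is the two-sided 5% significance level,
# z_beta = 0.8416 is 80% power (both are assumptions).
def sample_size_per_group(p1, p2, z_alpha: 1.96, z_beta: 0.8416)
  variance = p1 * (1 - p1) + p2 * (1 - p2)
  ((z_alpha + z_beta)**2 * variance / (p1 - p2)**2).ceil
end

big_change   = sample_size_per_group(0.10, 0.20)   # 10% to 20%
small_change = sample_size_per_group(0.001, 0.002) # 0.1% to 0.2%

puts "10% to 20%:   #{big_change} users per group"
puts "0.1% to 0.2%: #{small_change} users per group"
```

Both cases double the rate, yet the low base rate needs on the order of a hundred times more users, which is why it is worth checking feasibility before starting.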

Bayesian A/B testing

Bayesian A/B testing
• Allows you to model your existing knowledge & uncertainties
• Can be better with low base rates
• The underlying maths are a bit more complicated
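
One way to get a feel for the Bayesian approach is to compute the probability that variant B's true conversion rate is higher than A's. The sketch below models each rate with a Beta posterior and uses a known closed-form sum that is valid when the Beta parameters are integers. The uniform Beta(1, 1) prior, the traffic numbers and the method names are assumptions for illustration; a real prior would encode what you already know.

```ruby
# Log of the Beta function. Math.lgamma returns [value, sign].
def log_beta(a, b)
  Math.lgamma(a).first + Math.lgamma(b).first - Math.lgamma(a + b).first
end

# Probability that B's true rate exceeds A's, given conversions and
# visitors for each variant. The prior parameters must be integers for
# this closed-form sum to apply.
def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior_alpha: 1, prior_beta: 1)
  alpha_a = prior_alpha + conv_a
  beta_a  = prior_beta + n_a - conv_a
  alpha_b = prior_alpha + conv_b
  beta_b  = prior_beta + n_b - conv_b

  (0...alpha_b).sum do |i|
    Math.exp(
      log_beta(alpha_a + i, beta_a + beta_b) -
      Math.log(beta_b + i) -
      log_beta(1 + i, beta_b) -
      log_beta(alpha_a, beta_a)
    )
  end
end

# 49 vs 56 orders, ASSUMING 1,000 visitors per variant.
probability = prob_b_beats_a(49, 1_000, 56, 1_000)
puts format('P(B beats A) = %.2f', probability)
```

The output is a direct statement about the variants rather than a p value, which is part of the appeal. The price is the extra maths and the need to choose a prior.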

Test setup errors

Group Randomisation Photo by Macau Photo Agency on Unsplash

class User < ActiveRecord::Base
  def ab_group
    if id % 2 == 0
      'experiment'
    else
      'control'
    end
  end
end

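require 'digest'

class User < ActiveRecord::Base
  # Hash the experiment name together with the user id so that every
  # experiment gets its own, effectively random, split.
  def ab_group(experiment)
    hash = Digest::SHA1.hexdigest("#{experiment}-#{id}").to_i(16)
    if hash % 2 == 0
      'experiment'
    else
      'control'
    end
  end
end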

Non random split
• Older users in one group
• Newer users in the other group
• New users were less loyal!
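
A quick sanity check on a hash-based split like the `ab_group(experiment)` method above is to confirm that it is roughly 50/50 and that membership in one experiment says nothing about membership in another. This is a standalone sketch: the experiment names and the range of ids are made up, and the method takes the id as an argument so that it runs without Rails.

```ruby
require 'digest'

# Standalone version of the hash-based assignment (no ActiveRecord).
def ab_group(experiment, user_id)
  hash = Digest::SHA1.hexdigest("#{experiment}-#{user_id}").to_i(16)
  hash.even? ? 'experiment' : 'control'
end

ids = (1..10_000).to_a

# Hypothetical experiment names, for illustration only.
in_checkout_test = ids.select { |id| ab_group('new_checkout', id) == 'experiment' }
also_in_homepage_test = in_checkout_test.count do |id|
  ab_group('new_homepage', id) == 'experiment'
end

share = in_checkout_test.size.to_f / ids.size
overlap = also_in_homepage_test.to_f / in_checkout_test.size

puts format('share in experiment group: %.3f', share)
puts format('of those, share also in the other experiment group: %.3f', overlap)
```

Both figures should come out close to 0.5. With `id % 2` the overlap would be 1.0: the same users land in the experiment group of every test, so any pre-existing difference between the two halves follows you from test to test.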

Starting too early

[Figure, built up over several slides: the test starts at the Home Page, so each variant gets its own funnel. Home Page 50,000 users → 30,000 users → 15,000 users → Checkout Page A / Checkout Page B, 5,000 users each → 2600 conversions / 2500 conversions]

[Figure, final slide: one shared funnel (Home Page 100,000 users → 60,000 users → 30,000 users) that only splits at Checkout Page A / Checkout Page B, 5,000 users each → 2600 conversions / 2500 conversions]
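
The cost of starting too early can be put into numbers with the funnel figures from the slides. The sketch below runs the same two-proportion z-test twice: once counting only the 5,000 users per variant who reached the checkout page, and once counting all 50,000 users per variant who entered at the home page. The choice of test and the method name are assumptions; the slides do not say which calculation was used.

```ruby
# Two-sided p value from a pooled two-proportion z-test.
def two_proportion_p_value(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  return 1.0 if se.zero?

  z = (conv_a.to_f / n_a - conv_b.to_f / n_b) / se
  Math.erfc(z.abs / Math.sqrt(2))
end

# Same conversions (2600 vs 2500), different denominators.
at_checkout  = two_proportion_p_value(2_600, 5_000, 2_500, 5_000)
at_home_page = two_proportion_p_value(2_600, 50_000, 2_500, 50_000)

puts format('users counted from the checkout page: p = %.3f', at_checkout)
puts format('users counted from the home page:     p = %.3f', at_home_page)
```

Counting from the checkout page, the result just clears the 0.05 threshold. Counting from the home page it does not, because 45,000 users per variant who never saw either checkout page only add noise.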

Not agreeing setup
• Scope of the test (what pages, users, countries ...)
• What is the goal? How do we measure it?
• Agree *one* metric

Errors during the test Photo by Sarah Kilian on Unsplash

A test measures the impact of all differences

[Diagram: Ecommerce Service and Recommendation Service, annotated "10x more crashes"]

Repeated significance testing
• Invalidates significance calculation
• Difficult to resist!
• Stick to your Sample Size
• This is fine with Bayesian A/B testing
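
The effect of peeking is easy to simulate. The sketch below runs many A/A tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. It compares checking once at the planned sample size with checking after every 100 users per group and stopping at the first significant result. All parameters (10% conversion rate, 2,000 users per group, 500 simulated tests, the random seed) are arbitrary assumptions.

```ruby
# Two-sided p value from a pooled two-proportion z-test.
def p_value(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  return 1.0 if se.zero?

  z = (conv_b.to_f / n_b - conv_a.to_f / n_a) / se
  Math.erfc(z.abs / Math.sqrt(2))
end

rng = Random.new(42) # fixed seed so the run is repeatable
runs = 500
sample_size = 2_000  # planned users per group
peek_every = 100
rate = 0.1           # same true rate in both groups (an A/A test)

peeking_hits = 0     # significant at ANY peek
final_hits = 0       # significant at the planned sample size only

runs.times do
  conv_a = conv_b = 0
  significant_at_some_peek = false

  (1..sample_size).each do |n|
    conv_a += 1 if rng.rand < rate
    conv_b += 1 if rng.rand < rate
    next unless (n % peek_every).zero?

    significant = p_value(conv_a, n, conv_b, n) < 0.05
    significant_at_some_peek ||= significant
    final_hits += 1 if significant && n == sample_size
  end

  peeking_hits += 1 if significant_at_some_peek
end

puts format('false positives, one check at the end: %.1f%%', 100.0 * final_hits / runs)
puts format('false positives, stopping at any peek: %.1f%%', 100.0 * peeking_hits / runs)
```

The single check should land near the nominal 5%, while the peeking strategy produces false positives several times as often, even though nothing differs between the groups.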

Test analysis errors Photo by Isaac Smith on Unsplash

Do the maths
• Use the appropriate statistical test
• Significance on one metric does not imply significance on another

Outliers Photo by Ministerie van Buitenlandse Zaken

Understanding the domain

[Chart: weeks 1 to 3, y-axis from -4 to 0]

[Chart: weeks 1 to 7, y-axis from -4 to 8]

Results splitting

We aren't neutral

If the result is 'right' 🎉

If the result is 'wrong'
• Start looking at result splits
• Start digging for potential errors
• "Hey, what about this other metric?"
• A well documented test can help

Best practices Photo by SpaceX on Unsplash

Don't reinvent the wheel
• The Split and Vanity gems do a good job
• Consider platforms (Optimizely, Google Optimize)
• But understand your tool and its drawbacks

Resist the urge to check/tinker
• Repeated significance testing
• Changing the test while it is running (restart the test if necessary)

A/A tests
• Do the full process but with no difference between the variants
• Allows you to practise

Be wary of overtesting
• Let's test everything!
• Can be paralysing/time consuming
• Not a substitute for vision / talking to your users

Document your test
• Metric (inc. outliers etc.)
• Success criteria
• Scope
• Sample size / test power
• Significance calculation/process
• Meaningful variant names
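
One lightweight way to keep that documentation next to the code is a small plan object that must be filled in before the test starts. This is only a sketch: `TestPlan`, its field names and every value below are hypothetical, chosen to mirror the checklist above.

```ruby
# A hypothetical record of everything agreed BEFORE the test starts.
TestPlan = Struct.new(
  :name, :metric, :outlier_rule, :success_criteria, :scope,
  :sample_size_per_variant, :minimum_detectable_change,
  :significance_process, :variants,
  keyword_init: true
)

plan = TestPlan.new(
  name: 'checkout_button_copy',
  metric: 'orders per user reaching the checkout page',
  outlier_rule: 'exclude accounts flagged as resellers',
  success_criteria: 'variant beats control on the metric',
  scope: 'logged-in users, all countries, checkout page only',
  sample_size_per_variant: 5_000,
  minimum_detectable_change: 'relative +5%',
  significance_process: 'two-proportion z-test, p < 0.05, one check at the end',
  variants: %w[buy_now order]
)

# Refuse to start a test with gaps in its documentation.
missing = plan.to_h.select { |_field, value| value.nil? }.keys
raise "Test plan incomplete: #{missing.join(', ')}" unless missing.empty?

puts "#{plan.name}: #{plan.variants.join(' vs ')}"
```

Variant names such as `buy_now` and `order` are easier to read in a results table months later than `a` and `b`.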

Thank you! @fglc2

Further Reading
• https://www.evanmiller.org/how-not-to-run-an-ab-test.html
• https://making.lyst.com/bayesian-calculator/
• https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html
