how_to_ab_test_with_confidence_railsconf.pdf
Frederick Cheung
April 13, 2021
Transcript
How to A/B Test with confidence
@fglc2
Photo by Ivan Aleksic on Unsplash
The Plan
• Intro: What's an A/B Test?
• Test setup errors
• Errors during the test
• Test analysis errors
• Best practices
Photo by Javier Allegue Barros on Unsplash
What is an A/B test?
"Buy Now" or "Order"?
[Figure: a crowd of visitors, then the same crowd split into two groups, one shown "Buy Now" and the other "Order"]
Buy Now: 49 orders
Order: 56 orders
Is the difference real?
Not just buttons!
• Layouts / designs / flows
• Algorithms (e.g. recommendation engines)
• Anything where you can measure a difference
Jargon
Significance
• Is the observed difference just noise?
• p value of 0.05 = a 5% chance of seeing a difference at least this big from noise alone
• The statistical test depends on the type of metric
• No guarantees on the magnitude of the difference
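As a rough illustration of the significance bullets, the sketch below runs a two-sided two-proportion z-test, one common choice when the metric is a conversion rate. The earlier slide gives the order counts (49 and 56) but not the number of visitors, so the 1,000 visitors per button used here is an assumption, and `two_proportion_p_value` is a hypothetical helper name.

```ruby
# Two-sided p-value for a difference between two conversion rates,
# using the pooled two-proportion z-test (normal approximation).
def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b)
  rate_a = conversions_a.to_f / visitors_a
  rate_b = conversions_b.to_f / visitors_b
  pooled = (conversions_a + conversions_b).to_f / (visitors_a + visitors_b)
  standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / visitors_a + 1.0 / visitors_b))
  return 1.0 if standard_error.zero? # no variation at all, so nothing to detect

  z = (rate_a - rate_b) / standard_error
  Math.erfc(z.abs / Math.sqrt(2)) # P(|Z| >= |z|) for a standard normal
end

# Assumed: 1,000 visitors saw each button (the slide only gives order counts).
p_value = two_proportion_p_value(49, 1000, 56, 1000)
puts format('p = %.3f', p_value)
```

With these assumed visitor counts the p-value comes out far above 0.05, so 49 versus 56 orders would be well within noise. A different metric type (for example revenue per user) would need a different test.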
Test power
Photo by Michael Longmire on Unsplash
• How small a change do I want to detect?
• 10% to 20% is much easier to measure than 0.1% to 0.2%
Sample size
• Check this is feasible!
• Ideally you don't look at or change anything until the sample size is reached
• Be wary of very short experiments
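To make the test power and sample size points concrete, here is a minimal sketch of the usual normal-approximation formula for comparing two conversion rates. The 5% significance level and 80% power (and their z values) are assumed defaults, not figures from the talk, and `sample_size_per_variant` is a hypothetical helper.

```ruby
# Approximate visitors needed per variant to detect a change in conversion
# rate from base_rate to target_rate with a two-sided test.
# Defaults assume 5% significance (z = 1.96) and 80% power (z = 0.8416).
def sample_size_per_variant(base_rate, target_rate, z_alpha: 1.96, z_power: 0.8416)
  variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
  effect = target_rate - base_rate
  ((z_alpha + z_power)**2 * variance / effect**2).ceil
end

large_rates = sample_size_per_variant(0.10, 0.20)   # 10% to 20%
tiny_rates  = sample_size_per_variant(0.001, 0.002) # 0.1% to 0.2%

puts "10% to 20%:   #{large_rates} visitors per variant"
puts "0.1% to 0.2%: #{tiny_rates} visitors per variant"
```

Under these assumptions the first case needs a couple of hundred visitors per variant and the second needs tens of thousands, which is the feasibility check to do before starting.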
Bayesian A/B testing
• Allows you to model your existing knowledge & uncertainties
• Can be better with low base rates
• The underlying maths are a bit more complicated
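One common form of Bayesian A/B testing for conversion rates, sketched below, puts a Beta prior on each variant's rate and estimates the probability that B beats A by sampling from the two posteriors. This is an illustration, not necessarily the approach the talk had in mind: the uniform Beta(1, 1) prior, the visitor counts and all helper names are assumptions, and the sampler only handles whole-number Beta parameters.

```ruby
# Draw from Gamma(shape, 1) for a positive integer shape by summing
# `shape` exponential draws.
def gamma_draw(shape, rng)
  (1..shape).sum { -Math.log(1.0 - rng.rand) }
end

# Draw from Beta(alpha, beta) for positive integer parameters.
def beta_draw(alpha, beta, rng)
  x = gamma_draw(alpha, rng)
  y = gamma_draw(beta, rng)
  x / (x + y)
end

# Monte Carlo estimate of P(rate_b > rate_a), starting from a uniform
# Beta(1, 1) prior on each rate. Prior knowledge could be encoded by
# starting from different Beta parameters instead.
def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws: 2000, rng: Random.new(42))
  wins = draws.times.count do
    rate_a = beta_draw(1 + conv_a, 1 + n_a - conv_a, rng)
    rate_b = beta_draw(1 + conv_b, 1 + n_b - conv_b, rng)
    rate_b > rate_a
  end
  wins.to_f / draws
end

# Made-up data: 10 of 200 visitors converted on A, 20 of 200 on B.
puts probability_b_beats_a(10, 200, 20, 200)
```

The output is a direct statement such as "B is better than A with probability p", which is the quantity a decision rule can then be built on.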
Test setup errors
Group Randomisation
Photo by Macau Photo Agency on Unsplash
class User < ActiveRecord::Base
  def ab_group
    if id % 2 == 0
      'experiment'
    else
      'control'
    end
  end
end
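require 'digest'

class User < ActiveRecord::Base
  # Hash the experiment name together with the user id, so that each
  # experiment splits the users differently.
  def ab_group(experiment)
    hash = Digest::SHA1.hexdigest("#{experiment}-#{id}").to_i(16)
    if hash % 2 == 0
      'experiment'
    else
      'control'
    end
  end
end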
Non random split
• Older users in one group
• Newer users in the other group
• New users were less loyal!
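A quick way to gain confidence in a hash-based split is to check its balance before the test starts. The sketch below reuses the hashing scheme from the slide as a plain function so it runs without ActiveRecord; the experiment names, the 10,000 sequential ids and the assumption that lower ids are older accounts are all made up for illustration. `tally` needs Ruby 2.7 or later.

```ruby
require 'digest'

# Same hashing scheme as the slide, written as a plain function.
def ab_group(experiment, user_id)
  hash = Digest::SHA1.hexdigest("#{experiment}-#{user_id}").to_i(16)
  hash.even? ? 'experiment' : 'control'
end

user_ids = (1..10_000).to_a

# Overall balance for one experiment.
counts = user_ids.map { |id| ab_group('button-copy', id) }.tally
puts counts

# Balance among the oldest and newest accounts (assuming lower ids are older).
oldest = user_ids.first(2_000).count { |id| ab_group('button-copy', id) == 'experiment' }
newest = user_ids.last(2_000).count { |id| ab_group('button-copy', id) == 'experiment' }
puts "oldest: #{oldest}/2000, newest: #{newest}/2000"

# Overlap with a second experiment: if the two splits are independent,
# about a quarter of all users land in both experiment groups.
both = user_ids.count do |id|
  ab_group('button-copy', id) == 'experiment' && ab_group('new-checkout', id) == 'experiment'
end
puts "in both experiment groups: #{both}/#{user_ids.size}"
```

The same kind of check can be run against real user attributes (signup date, country, plan) to catch a split that accidentally puts older and newer users in different groups.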
Starting too early
[Figure, built up over several slides: two parallel funnels, one per variant]
Home Page: 50,000 users per variant
→ 30,000 users
→ 15,000 users
→ Checkout Page A / Checkout Page B: 5,000 users each
→ 2600 conversions / 2500 conversions
[Figure: one shared funnel that only splits at the checkout page]
Home Page: 100,000 users
→ 60,000 users
→ 30,000 users
→ Checkout Page A / Checkout Page B: 5,000 users each
→ 2600 conversions / 2500 conversions
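One reading of the funnel slides is that enrolling users at the home page dilutes the test with people who never see either checkout page. The sketch below plugs the slide's numbers into a pooled two-proportion z-test for both enrolment points. The choice of test and the helper name `p_value` are assumptions.

```ruby
# Pooled two-proportion z-test, two-sided p-value (normal approximation).
def p_value(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  return 1.0 if se.zero?

  z = (conv_a.to_f / n_a - conv_b.to_f / n_b) / se
  Math.erfc(z.abs / Math.sqrt(2))
end

# Same 2600 vs 2500 conversions, counted against different denominators.
enrolled_at_home_page = p_value(2600, 50_000, 2500, 50_000)
enrolled_at_checkout  = p_value(2600, 5_000, 2500, 5_000)

puts format('enrolled at home page: p = %.3f', enrolled_at_home_page)
puts format('enrolled at checkout:  p = %.3f', enrolled_at_checkout)
```

With these numbers the difference falls just under the 0.05 threshold when only checkout visitors are counted, and clearly misses it when all 50,000 home page visitors per variant are counted.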
Not agreeing the setup
• Scope of the test (what pages, users, countries ...)
• What is the goal? How do we measure it?
• Agree *one* metric
Errors during the test
Photo by Sarah Kilian on Unsplash
A test measures the impact of all differences
[Diagram: Ecommerce Service and Recommendation Service; annotation: 10x more crashes]
Repeated significance testing
• Invalidates significance calculation
• Difficult to resist!
• Stick to your Sample Size
• This is fine with Bayesian A/B testing
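The effect of repeated significance testing can be seen in a small simulation. The sketch below runs A/A tests (both variants convert at the same rate) and compares declaring a winner at the first significant peek with testing once at the planned sample size. All parameters (500 simulations, 2,000 users per variant, a peek every 100 users, a 10% conversion rate) and the helper names are assumptions chosen to keep it quick.

```ruby
# Two-sided p-value from a pooled two-proportion z-test with n users per variant.
def p_value(conv_a, conv_b, n)
  pooled = (conv_a + conv_b).to_f / (2 * n)
  se = Math.sqrt(pooled * (1 - pooled) * 2.0 / n)
  return 1.0 if se.zero?

  Math.erfc(((conv_a - conv_b).to_f / n / se).abs / Math.sqrt(2))
end

# Simulate A/A tests and count how often each stopping rule finds a "winner".
def false_positive_rates(simulations: 500, sample_size: 2000, peek_every: 100,
                         rate: 0.1, alpha: 0.05, rng: Random.new(1))
  peeking_hits = 0
  fixed_hits = 0
  simulations.times do
    conv_a = conv_b = 0
    significant_at_any_peek = false
    significant_at_end = false
    1.upto(sample_size) do |n|
      conv_a += 1 if rng.rand < rate
      conv_b += 1 if rng.rand < rate
      next unless (n % peek_every).zero?

      significant = p_value(conv_a, conv_b, n) < alpha
      significant_at_any_peek ||= significant
      significant_at_end = significant if n == sample_size
    end
    peeking_hits += 1 if significant_at_any_peek
    fixed_hits += 1 if significant_at_end
  end
  { peeking: peeking_hits.to_f / simulations, fixed: fixed_hits.to_f / simulations }
end

rates = false_positive_rates
puts rates
```

There is no real difference in any of these tests, so every "significant" result is a false positive. The fixed-size rule should stay near the nominal 5%, while stopping at the first significant peek finds a false winner far more often.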
Test analysis errors
Photo by Isaac Smith on Unsplash
Do the maths
• Use the appropriate statistical test
• Significance on one metric does not imply significance on another
Outliers
Photo by Ministerie van Buitenlandse Zaken
Understanding the domain
[Chart: weekly values for weeks 1 to 3, y-axis from -4 to 0]
[Chart: weekly values for weeks 1 to 7, y-axis from -4 to 8]
Results splitting
💰
💰
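The slides give only the title here, but one well-known risk with splitting results into segments is that every extra slice is another chance for a false positive. The arithmetic below assumes independent segments, each tested at a 5% significance level. It is an illustration of that general risk rather than a reconstruction of what the speaker showed, and `chance_of_spurious_win` is a hypothetical helper.

```ruby
# Chance of at least one false positive when the same result is checked in
# `segments` independent slices, each tested at significance level `alpha`.
def chance_of_spurious_win(segments, alpha: 0.05)
  1 - (1 - alpha)**segments
end

[1, 5, 10, 20].each do |segments|
  puts format('%2d segments: %.0f%%', segments, 100 * chance_of_spurious_win(segments))
end
```

Under these assumptions, ten segments already give roughly a 40% chance that at least one of them looks significant by accident.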
We aren't neutral
If the result is 'right' 🎉
If the result is 'wrong'
• Start looking at result splits
• Start digging for potential errors
• Hey, what about this other metric?
• A well documented test can help
Best practices
Photo by SpaceX on Unsplash
Don't reinvent the wheel
• The Split and Vanity gems do a good job
• Consider platforms (Optimizely, Google Optimize)
• But understand your tool and its drawbacks
Resist the urge to check/tinker
• Repeated significance testing
• Changing the test while it is running (restart the test if necessary)
A/A tests
• Do the full process but with no difference between the variants
• Allows you to practise
Be wary of overtesting
• Let's test everything!
• Can be paralysing/time consuming
• Not a substitute for vision / talking to your users
Document your test
• Metric (inc. outliers etc.)
• Success criteria
• Scope
• Sample size / test power
• Significance calculation/process
• Meaningful variant names
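One lightweight way to keep that documentation next to the code is a small frozen record that has to be filled in before the test starts. The sketch below is hypothetical throughout: the `TestPlan` struct, its field names and every value are made-up examples that follow the checklist, not anything prescribed by the talk.

```ruby
# A hypothetical record of what was agreed before the test started.
# Field names follow the checklist; all values are made-up examples.
TestPlan = Struct.new(
  :metric, :outlier_rule, :success_criteria, :scope,
  :sample_size_per_variant, :significance_process, :variants,
  keyword_init: true
)

CHECKOUT_TEST = TestPlan.new(
  metric: 'orders per visitor reaching checkout',
  outlier_rule: 'exclude orders flagged as bulk purchases',
  success_criteria: 'two-sided p < 0.05 at the planned sample size',
  scope: 'logged-in users, checkout page only',
  sample_size_per_variant: 5_000,
  significance_process: 'two-proportion z-test, run once at the end',
  variants: %w[single_page_checkout multi_step_checkout]
).freeze

# Refuse to start a test whose plan has gaps.
missing = CHECKOUT_TEST.to_h.select { |_, value| value.nil? }.keys
raise "Test plan is incomplete: #{missing.join(', ')}" unless missing.empty?

puts CHECKOUT_TEST.to_h
```

Because the record is frozen, changing the metric or success criteria mid-test means writing a new plan, which makes after-the-fact adjustments visible.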
Thank you! @fglc2
Further Reading
• https://www.evanmiller.org/how-not-to-run-an-ab-test.html
• https://making.lyst.com/bayesian-calculator/
• https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html