Slide 1

Slide 1 text

How To Avoid Lying To Yourself With Statistics: A Shortlist of Core Principles for Productive Online Controlled Experiments

Slide 2

Slide 2 text

● Why Experiment?
● Why Sweat the Details?
● Core Experimentation Principles By Phase:
  ● Design
  ● Execution
  ● Analysis
● Wrap-Up + Q&A

Slide 3

Slide 3 text

Why Experiment?

Slide 4

Slide 4 text

How Effective Are We At Moving The Needle?
80-90% of features shipped have a negative or neutral impact on the metrics they were designed to improve.*
*Source: HBR, “The Surprising Power of Online Experiments” by Ronny Kohavi and Stefan Thomke

Slide 5

Slide 5 text

Who Will Ultimately Decide?

Slide 6

Slide 6 text

Who Will Ultimately Decide? “Some designers and managers have attacked A/B testing because of a misconception that testing somehow replaces or diminishes design. Nothing could be further from the truth. Good design comes first.”

Slide 7

Slide 7 text

Who Will Ultimately Decide?
“The key is that it is the end users who are the final arbiters of design success, and A/B testing is the instrument that informs us of their judgment.”

Slide 8

Slide 8 text

● Why Experiment?
● Why Sweat the Details?
● Core Experimentation Principles By Phase:
  ● Design
  ● Execution
  ● Analysis
● Wrap-Up + Q&A

Slide 9

Slide 9 text

Why Sweat The Details When Using Experimentation?

Slide 10

Slide 10 text

We want to avoid the cost of false signals.
Remember this example and you’ll never confuse false positive and false negative again:
Type I Error: False positive
Type II Error: False negative

Slide 11

Slide 11 text

We want to avoid the cost of human bias. Here are 180+ forms of it. (It’s a big problem.)

Slide 12

Slide 12 text

Controlled experimentation done right removes external influences.
[Diagram: Metrics Change = Your new feature + All other changes in the world]

Slide 13

Slide 13 text

Controlled experimentation can distinguish noise from a real signal. There is always noise or randomness in the data. If we flip a coin 1000 times, we’re unlikely to get exactly 500 heads and 500 tails (only about a 3% chance) - that doesn’t mean the coin is biased. Experimentation statistics allow us to determine if a result is big enough to rule out normal variations and noise.
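The coin-flip claim above can be checked exactly with the binomial distribution; the slide’s “~3%” figure is the rounded exact value of about 2.5%. A minimal sketch (the function name is illustrative):

```python
from math import comb

# Probability of getting exactly `heads` heads in `flips` fair coin flips.
# Even for a perfectly fair coin, landing exactly 500/500 in 1000 flips is
# rare (~2.5%), so "not exactly 50/50" is no evidence the coin is biased.
def prob_exact_heads(flips, heads, p=0.5):
    """Exact binomial probability P(X = heads) for X ~ Binomial(flips, p)."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

p_exactly_500 = prob_exact_heads(1000, 500)
print(f"P(exactly 500 heads in 1000 flips) = {p_exactly_500:.3f}")
```
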

Slide 14

Slide 14 text

Statistically Significant: The result is unlikely to have occurred by chance alone

Slide 15

Slide 15 text

● Why Experiment?
● Why Sweat the Details?
● Core Experimentation Principles By Phase:
  ● Design
  ● Execution
  ● Analysis
● Wrap-Up + Q&A

Slide 16

Slide 16 text

Core Experimentation Principles By Phase Design Execution Analysis

Slide 17

Slide 17 text

Our example: Airbnb host sign-up page. A: no video. B: with video.

Slide 18

Slide 18 text

Design
1. Have an Explicit Hypothesis
2. Choose Metrics with Care
3. Perform Power Analysis to Determine Time Needed To Hit Traffic Goal
4. Consider Business Cycles & Seasonality
Why? To prevent after-the-fact fitting of results to wishful thinking or hidden biases.

Slide 19

Slide 19 text

Design # 1 Have an Explicit Hypothesis

Slide 20

Slide 20 text

Standardized Format, Easy & Repeatable Steps. A good hypothesis has a benchmark and is clear, concise, understandable, testable, and measurable. By Craig Sullivan

Slide 21

Slide 21 text

Airbnb hypothesis: Because we see potential hosts aren’t aware of their level of control, we expect that adding an informative video will provide reassurances, leading to more sign-ups. We’ll measure this using the metric: proportion of users who sign up as hosts. A: no video. B: with video.

Slide 22

Slide 22 text

Airbnb hypothesis: Because our survey showed potential hosts aren’t aware of their level of control, we expect that adding an informative video for Canadian users will provide reassurances, leading to more sign-ups. We’ll measure this using the metric: proportion of users who visited the host homepage who sign up as hosts. We expect to see an increase of 5% in the host-page conversion rate over a period of 2 weeks. A: no video. B: with video.

Slide 23

Slide 23 text

Design # 2 Choose Metrics With Care

Slide 24

Slide 24 text

# 2.1 Characteristics of Good Metrics Design: #2 Choose Metrics With Care

Slide 25

Slide 25 text

Characteristics of Good Metrics
Key properties:
● Meaningful - The metric directly captures business value or customer satisfaction.
● Sensitive - The metric should change due to small changes in user satisfaction or business value, so that the result of an experiment is clear within a reasonable time period.
● Directional - If the business value increases, the metric should move consistently in one direction. If it decreases, the metric should move in the opposite direction.
● Understandable - It should be easily understood by business executives.

Slide 26

Slide 26 text

# 2.2 Beware of focusing on the wrong metric. Design: #2 Choose Metrics With Care

Slide 27

Slide 27 text

The lure of winning: poor choice of OEC. What went wrong? Queries run per user as the north star metric at Bing.

Slide 28

Slide 28 text

The lure of winning: blatant gaming. Chinese shoe company tricks people into swiping an Instagram ad with a fake strand of hair.

Slide 29

Slide 29 text

Examples of Key Hypothesis Metrics
● # Sessions / User - Google
● # Tweets / User - Twitter
● # Rides / User - Uber
One way to recognize whether you have a good key metric is to intentionally experiment with a bad feature you know your users would not like.

Slide 30

Slide 30 text

Airbnb Key Metric: Proportion of users who visited the host homepage who sign up as hosts. A: no video. B: with video.

Slide 31

Slide 31 text

# 2.3 Beware of measuring just one metric. Design: #2 Choose Metrics With Care

Slide 32

Slide 32 text

NYPD CompStat One Metric: Reduce Reported Crime

Slide 33

Slide 33 text

As NYPD Assistant Commissioner Ronald J. Wilhelmy wrote in a November 2013 internal NYPD strategy document: [W]e cannot continue to evaluate personnel on the simple measure of whether crime is up or down relative to a prior period. Most importantly, CompStat has ignored measurement of other core functions. Chiefly, we fail to measure what may be our highest priority: public satisfaction. We also fail to measure quality of life, integrity, community relations, administrative efficiency, and employee satisfaction, to name just a few other important areas.

Slide 34

Slide 34 text

Guardrail metrics: Guardrails are metrics that should not degrade in pursuit of the key metric. Like the key metrics, they should be directional and sensitive, but not necessarily tie back to business value.

Slide 35

Slide 35 text

# 2.4 Beware of too many metrics. Design: #2 Choose Metrics With Care

Slide 36

Slide 36 text

Every comparison you make brings an α chance of a false positive. This applies to testing multiple:
- Metrics
- Treatments
- Segments
More comparisons → more false positives!
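The growth of that false-positive risk is easy to quantify: with m independent comparisons each tested at level α, the chance of at least one false positive is 1 − (1 − α)^m. A quick sketch (function name is illustrative):

```python
# Family-wise error rate: the chance of at least one false positive when
# making m independent comparisons, each at significance level alpha.
def family_wise_error_rate(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20):
    rate = family_wise_error_rate(m)
    print(f"{m:>2} comparisons -> {rate:.0%} chance of at least one false positive")
# At alpha = 0.05: 1 comparison -> 5%, 5 -> ~23%, 20 -> ~64%
```
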

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Safe ways to measure many metrics:
● Use a stricter confidence level 1,2
● Re-test significant metrics that weren’t your primary metric
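The “stricter confidence level” the slide cites (reference 1, the Bonferroni correction) simply divides α by the number of metrics tested. A minimal sketch with hypothetical per-metric p-values:

```python
# Bonferroni correction: with m metrics, test each at alpha/m instead of
# alpha, keeping the overall chance of any false positive near alpha.
def bonferroni_significant(p_values, alpha=0.05):
    m = len(p_values)
    return [p < alpha / m for p in p_values]

p_values = [0.001, 0.02, 0.04, 0.3]  # hypothetical p-values for 4 metrics
# With 4 metrics, each is tested at 0.05 / 4 = 0.0125, so only the
# first (p = 0.001) survives the correction.
print(bonferroni_significant(p_values))
```

The Benjamini-Hochberg procedure (reference 2) is less conservative and is usually preferred when many metrics are tracked.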

Slide 39

Slide 39 text

# 3 Perform Power Analysis to Determine Time Needed To Hit Traffic Goal. Design

Slide 40

Slide 40 text

Understanding the power of your experiment is vital.
● Any test will only be able to measure differences larger than a given size
● Larger sample size → can detect smaller differences
Minimum Likely Detectable Effect (MLDE): “The smallest effect your test is likely to be able to detect”

Slide 41

Slide 41 text

Power : “The likelihood of detecting an effect of a given size” Minimum Likely Detectable Effect (MLDE) : “The smallest change your test is likely to be able to detect” α = 0.05

Slide 42

Slide 42 text

Power : “The likelihood of detecting an effect of a given size” Minimum Likely Detectable Effect (MLDE) : “The smallest change your test is likely to be able to detect” α = 0.05
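Power analysis as described above can be sketched with the standard normal-approximation formula for a two-proportion test; the numbers below reuse the deck’s Airbnb rates (25% vs. 30%) and the function name is illustrative:

```python
import math
from statistics import NormalDist

# Sample size per arm needed to detect a lift from p_base to p_treat
# with the given alpha and power, using the normal approximation for
# a two-proportion test.
def sample_size_per_arm(p_base, p_treat, alpha=0.05, power=0.8):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # e.g. 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    n = (z_alpha + z_power) ** 2 * variance / (p_treat - p_base) ** 2
    return math.ceil(n)

# Detecting a 25% -> 30% conversion lift needs roughly 1,250 users per arm.
print(sample_size_per_arm(0.25, 0.30))
```

Dividing the required sample size by daily traffic gives the run time needed, which is the “time to hit traffic goal” this principle asks you to plan for.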

Slide 43

Slide 43 text

# 4 Consider Business Cycles & Seasonality Design

Slide 44

Slide 44 text

Business Cycles & Seasonality: Usually different traffic on different days of the week → recommend running weekly cycles

Slide 45

Slide 45 text

Execution
1. Monitor For Data Flow, Randomization (SRM)
2. Monitor for Big Degradations
3. Don’t Peek / Cherry Pick Outcome

Slide 46

Slide 46 text

# 5 Monitor For Data Flow, Randomization (SRM) Execution

Slide 47

Slide 47 text

Is data flowing through?

Slide 48

Slide 48 text

Did randomization work as targeted? (Sample Ratio Mismatch)

Slide 49

Slide 49 text

# 6 Monitor For Big Degradations Execution

Slide 50

Slide 50 text

Catch obvious broken stuff

Slide 51

Slide 51 text

# 7 Don’t Peek or Cherry Pick Outcome Execution

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Analysis
1. Observe Statistical Significance, Not Just Raw Deltas of Metrics
2. Don’t “Dig” for “Insights” Except to Formulate a New Hypothesis To Test

Slide 54

Slide 54 text

# 8 Observe Statistical Significance, Not Just Raw Deltas of Metrics Analysis

Slide 55

Slide 55 text

p-value = how likely is it that you’d see a difference as big as this if there were no real difference.
Example: flip a coin 100 times to check if it’s fair.
● Getting 45 heads, 55 tails is not that unlikely for a fair coin → p-value = 0.32 (a 32% chance of getting a difference this big with a fair coin)
● Getting 10 heads, 90 tails is very unlikely! → p-value = 0.000000000000001 (close to a 0% chance of getting a difference this big with a fair coin)
You can calculate the p-value yourself, or use an online calculator 5,6
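The coin p-values above can be reproduced with a two-sided z-test using the normal approximation to the binomial (which is what yields the slide’s 0.32 for 45/55). A minimal sketch:

```python
from statistics import NormalDist

# Two-sided p-value for "is this coin fair?" using the normal
# approximation to the binomial distribution.
def coin_p_value(heads, flips):
    mean = flips * 0.5                 # expected heads for a fair coin
    sd = (flips * 0.25) ** 0.5         # binomial sd with p = 0.5
    z = abs(heads - mean) / sd
    return 2 * (1 - NormalDist().cdf(z))

print(coin_p_value(45, 100))  # ~0.32: plausible for a fair coin
print(coin_p_value(10, 100))  # astronomically small: the coin is biased
```
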

Slide 56

Slide 56 text

Airbnb example analysis:
A: 1000 visitors, 30% (300) signed up
B: 1000 visitors, 25% (250) signed up
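For the Airbnb numbers, the question is whether the 30% vs. 25% gap is real or noise. A two-proportion z-test (a common way to analyze this kind of result; function name is illustrative) gives a p-value:

```python
from statistics import NormalDist

# Two-proportion z-test: is the observed conversion gap between A and B
# larger than what random noise would plausibly produce?
def two_proportion_p_value(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)   # rate under "no difference"
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = abs(p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(z))

# A: 300/1000 signed up, B: 250/1000 signed up.
p = two_proportion_p_value(300, 1000, 250, 1000)
print(f"p-value: {p:.3f}")  # ~0.012 -> significant at alpha = 0.05
```

Here the gap is statistically significant, so under the deck's decision rules the better-performing experience (A, no video) would be ramped.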

Slide 57

Slide 57 text

“Not Statistically Significant” does not mean your treatment had no impact. It means that your test wasn’t able to confidently say it did have an impact.

Slide 58

Slide 58 text

# 9 Don’t “Dig” for “Insights” except to Formulate a New Hypothesis Analysis

Slide 59

Slide 59 text

Experimentation Results & Actions
Statistically Significant → Continue to ramp the winning experience to maximise its value

Slide 60

Slide 60 text

Experimentation Results & Actions
Statistically Significant (positive) → Continue to ramp the winning experience to maximise its value
Statistically Significant (negative) → Kill experiment and re-iterate on your implementation

Slide 61

Slide 61 text

Experimentation Results & Actions
Statistically Significant (positive) → Continue to ramp the winning experience to maximise its value
Statistically Significant (negative) → Kill experiment and re-iterate on your implementation
Statistically Inconclusive → This inconclusive result means we can’t confidently say there was an impact.

Slide 62

Slide 62 text

● Why Experiment?
● Why Sweat the Details?
● Core Experimentation Principles By Phase:
  ● Design
  ● Execution
  ● Analysis
● Wrap-Up + Q&A

Slide 63

Slide 63 text

Takeaways.

Slide 64

Slide 64 text

Users are the final arbiters of your design decisions. Takeaway One.

Slide 65

Slide 65 text

Experimentation allows us to watch our users vote with their actions. Takeaway Two.

Slide 66

Slide 66 text

To achieve a meaningful outcome, we must know what we are testing upfront and invest time in choosing the right metrics. Takeaway Three.

Slide 67

Slide 67 text

Any well-designed, implemented, and analyzed experiment is a successful experiment. Takeaway Four.

Slide 68

Slide 68 text

References:
Multiple Comparison Corrections
1. Bonferroni correction: https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
2. Benjamini-Hochberg procedure: https://www.statisticshowto.datasciencecentral.com/benjamini-hochberg-procedure/
Online power analysis calculators
3. For proportions: http://www.experimentationhub.com/hypothesis-kit.html
4. For means: http://statulator.com/SampleSize/ss2M.html
Online p-value calculators
5. For proportions: https://www.socscistatistics.com/tests/chisquare
6. For means: http://www.statskingdom.com/140MeanT2eq.html

Slide 69

Slide 69 text

Review Periods & Seasonality: Usually different traffic on different days of the week → recommend running weekly cycles

Slide 70

Slide 70 text

Correlation is not Causation

Slide 71

Slide 71 text

Business Cycles & Seasonality: Usually different traffic on different days of the week → recommend running weekly cycles