
How To Avoid Lying To Yourself With Statistics

Dave Karow
September 13, 2019

Great teams don’t run experiments to prove they are right; they run them to answer questions. Since it’s human nature to want things to go your way, how can we guard against falling into the traps of wishful thinking and hidden biases as we strive to reach meaningful outcomes?

You can follow a handful of core principles in the design, execution, and analysis of experiments, principles proven by teams that run hundreds or even thousands of experiments every month. If you do, you'll increase the chances of learning something truly useful and reduce the odds of wasting your time heading off in an unproductive direction due to false signals.

Whether you are an “old hand” at online experimentation, merely “experimentation curious,” or somewhere in between, you’ll tilt the odds of productive outcomes in your favor by reviewing these slides, watching the video, or reading the transcript posted at https://www.split.io/blog/how-to-avoid-lying-to-yourself-with-statistics/


Transcript

  1. How To Avoid Lying To Yourself With Statistics: A Shortlist of Core Principles for Productive Online Controlled Experiments
  2. • Why Experiment? • Why Sweat the Details? • Core

    Experimentation Principles By Phase: • Design • Execution • Analysis • Wrap-Up + Q&A
  3. How Effective Are We At Moving The Needle? 80-90% of features shipped have negative or neutral impact on the metrics they were designed to improve.* (*Source: HBR, “The Surprising Power of Online Experiments” by Ronny Kohavi and Stefan Thomke)
  4. Who Will Ultimately Decide? “Some designers and managers have attacked

    A/B testing because of a misconception that testing somehow replaces or diminishes design. Nothing could be further from the truth. Good design comes first.”
  5. Who Will Ultimately Decide? “The key is that it is the end users who are the final arbiters of design success, and A/B testing is the instrument that informs us of their judgment.”
  6. • Why Experiment? • Why Sweat the Details? • Core

    Experimentation Principles By Phase: • Design • Execution • Analysis • Wrap-Up + Q&A
  7. We want to avoid the cost of false signals. Remember this example and you’ll never confuse false positive & false negative again: Type I Error = false positive; Type II Error = false negative.
  8. We want to avoid the cost of human bias. Here are 180+ forms of it. (It’s a big problem.)
  9. Controlled experimentation can distinguish noise from a real signal. There is always noise or randomness in the data. If we flip a coin 1000 times, we’re unlikely to get exactly 500 heads and 500 tails (only about a 3% chance), but that doesn’t mean the coin is biased. Experimentation statistics allow us to determine if a result is big enough to rule out normal variation and noise.
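To make the coin-flip intuition concrete, here is a minimal sketch in Python (scipy assumed available; only the 1000-flip scenario comes from the deck):

```python
from scipy.stats import binom

# Chance of getting *exactly* 500 heads in 1000 fair flips
p_exact = binom.pmf(500, n=1000, p=0.5)
print(f"P(exactly 500 heads) = {p_exact:.3f}")  # ~0.025, i.e. only ~3%

# Range covering 95% of outcomes for a genuinely fair coin
lo, hi = binom.ppf([0.025, 0.975], n=1000, p=0.5)
print(f"95% of fair coins land between {lo:.0f} and {hi:.0f} heads")  # ~469 to ~531
```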
  10. • Why Experiment? • Why Sweat the Details? • Core

    Experimentation Principles By Phase: • Design • Execution • Analysis • Wrap-Up + Q&A
  11. Design 1. Have an Explicit Hypothesis 2. Choose Metrics with

    Care 3. Perform Power Analysis to Determine Time Needed To Hit Traffic Goal 4. Consider Business Cycles & Seasonality Why? To prevent after-the-fact fitting of results to wishful thinking or hidden biases.
  12. Standardized Format, Easy & Repeatable Steps: a good hypothesis has a benchmark; is clear, concise, and understandable; and is testable and measurable. (By Craig Sullivan)
  13. Airbnb hypothesis: Because we see potential hosts aren’t aware of their level of control, we expect that adding an informative video will provide reassurance, leading to more sign-ups. We’ll measure this using the metric: proportion of users who sign up as hosts. (Variants: A = no video, B = with video)
  14. Airbnb hypothesis: Because our survey showed potential hosts aren’t aware of their level of control, we expect that adding an informative video for Canadian users will provide reassurance, leading to more sign-ups. We’ll measure this using the metric: proportion of users who visited the host homepage who sign up as hosts. We expect to see an increase of 5% in the host-page conversion rate over a period of 2 weeks. (Variants: A = no video, B = with video)
  15. Characteristics of Good Metrics. Key properties: • Meaningful: the metric directly captures business value or customer satisfaction. • Sensitive: the metric should change due to small changes in user satisfaction or business value, so that the result of an experiment is clear within a reasonable time period. • Directional: if the business value increases, the metric should move consistently in one direction; if it decreases, the metric should move in the opposite direction. • Understandable: it should be easily understood by business executives.
  16. The lure of winning: poor choice of OEC. What went wrong? Queries run per user as the north-star metric at Bing. (A worse search experience can force users to run more queries, so this metric can rise even as quality degrades.)
  17. The lure of winning: blatant gaming. A Chinese shoe company tricks people into swiping its Instagram ad with a fake strand of hair.
  18. Examples of Key Hypothesis Metrics: • # Sessions / User (Google) • # Tweets / User (Twitter) • # Rides / User (Uber). One way to recognize whether you have a good key metric is to intentionally experiment with a bad feature you know your users would not like.
  19. Airbnb Key Metric: proportion of users who visited the host homepage who sign up as hosts. (Variants: A = no video, B = with video)
  20. As NYPD Assistant Commissioner Ronald J. Wilhelmy wrote in a November 2013 internal NYPD strategy document: “[W]e cannot continue to evaluate personnel on the simple measure of whether crime is up or down relative to a prior period. Most importantly, CompStat has ignored measurement of other core functions. Chiefly, we fail to measure what may be our highest priority: public satisfaction. We also fail to measure quality of life, integrity, community relations, administrative efficiency, and employee satisfaction, to name just a few other important areas.”
  21. Guardrail metrics: guardrails are metrics that should not degrade in pursuit of the key metric. Like the key metrics, they should be directional and sensitive, but they need not tie back to business value.
  22. Every comparison you make brings an α% chance of a false positive. This applies when testing multiple metrics, treatments, or segments. More comparisons → more false positives! (See the sketch below.)
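A minimal Python illustration of how fast the family-wise error rate grows (the per-comparison α = 0.05 and independence are assumptions for the example):

```python
# Chance of at least one false positive across m independent comparisons at alpha = 0.05
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} comparisons -> {fwer:.0%} chance of at least one false positive")
# 1 -> 5%, 5 -> 23%, 10 -> 40%, 20 -> 64%
```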
  23. Safe ways to measure many metrics: use a stricter confidence level (1, 2); re-test significant metrics that weren’t your primary metric.
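For the corrections in references 1 and 2, statsmodels offers both in one call; a sketch (the p-values here are hypothetical):

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.030, 0.040, 0.200]  # hypothetical per-metric p-values

# "bonferroni" = Bonferroni correction (ref 1); "fdr_bh" = Benjamini-Hochberg (ref 2)
for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adjusted], list(reject))
```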
  24. Understanding the power of your experiment is vital. • Any test will only be able to measure differences larger than a given size. • Larger sample size → can detect smaller differences. Minimum Likely Detectable Effect (MLDE): “the smallest effect your test is likely to be able to detect.”
  25. Power: “the likelihood of detecting an effect of a given size.” Minimum Likely Detectable Effect (MLDE): “the smallest change your test is likely to be able to detect.” (α = 0.05)
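A power-analysis sketch in Python with statsmodels (the 10% baseline and 5% relative lift are hypothetical numbers in the spirit of the Airbnb example, not figures from the deck):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10            # hypothetical host-page conversion rate
mlde = baseline * 1.05     # smallest lift we want to detect (5% relative)

effect = proportion_effectsize(baseline, mlde)  # Cohen's h for two proportions
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Need roughly {n_per_arm:,.0f} users per variant")  # on the order of 58,000
```

Dividing that sample size by your weekly eligible traffic gives the run time that design step 3 asks you to determine up front.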
  27. Business Cycles & Seasonality: traffic usually differs across days of the week → recommend running experiments in whole-week cycles.
  28. Execution: 1. Monitor For Data Flow, Randomization (SRM; see the sketch below) 2. Monitor for Big Degradations 3. Don’t Peek / Cherry-Pick the Outcome
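A minimal sample ratio mismatch (SRM) check, assuming a configured 50/50 split: a chi-square goodness-of-fit test on assignment counts (the counts below are hypothetical):

```python
from scipy.stats import chisquare

observed = [50_912, 49_088]         # hypothetical users actually assigned to A / B
expected = [sum(observed) / 2] * 2  # what a 50/50 split should produce

stat, p = chisquare(observed, f_exp=expected)
if p < 0.001:
    print(f"Possible SRM (p = {p:.2e}): fix assignment before trusting any results")
else:
    print("Split looks healthy")
```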
  29. Analysis 1. Observe Statistical Significance, Not Just Raw Deltas of

    Metrics 2. Don’t “Dig” for “Insights” except to Formulate a New Hypothesis To Test
  30. p-value = how likely it is that you’d see a difference as big as this if there were no real difference. Example: flip a coin 100 times to check if it’s “fair.” • Getting 45 heads, 55 tails is not that unlikely for a fair coin → p-value = 0.32 (a 32% chance of a difference this big with a fair coin). • Getting 10 heads, 90 tails is very unlikely → p-value ≈ 0.000000000000001 (close to 0% chance with a fair coin). You can calculate the p-value yourself, or use an online calculator (5, 6).
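Reproducing those coin-flip p-values with a two-sided z-test (the normal approximation, which matches the 0.32 above; a sketch, with a helper name that is mine, not the deck’s):

```python
from math import sqrt
from scipy.stats import norm

def two_sided_p(heads: int, flips: int, p0: float = 0.5) -> float:
    """Normal-approximation p-value for `heads` out of `flips` if the true rate is p0."""
    se = sqrt(p0 * (1 - p0) / flips)
    z = (heads / flips - p0) / se
    return 2 * norm.sf(abs(z))

print(f"{two_sided_p(45, 100):.2f}")   # 0.32: plausible for a fair coin
print(f"{two_sided_p(10, 100):.1e}")   # ~1.2e-15: essentially impossible for a fair coin
```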
  31. “Not statistically significant” does not mean your treatment had no impact; it means your test wasn’t able to confidently say it did have an impact.
  32. Experimentation Results & Actions: statistically significant winner → continue to ramp the winning experience to maximise its value; statistically significant loser → kill the experiment and re-iterate on your implementation.
  33. Experimentation Results & Actions: statistically significant winner → continue to ramp the winning experience to maximise its value; statistically significant loser → kill the experiment and re-iterate on your implementation; statistically inconclusive → we can’t confidently say there was an impact.
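The decision rule on these two slides, as a small sketch (the function name and α threshold are my assumptions, not the deck’s):

```python
def next_action(p_value: float, lift: float, alpha: float = 0.05) -> str:
    """Map an experiment readout to the actions described above."""
    if p_value >= alpha:
        return "Inconclusive: we can't confidently say there was an impact"
    if lift > 0:
        return "Ramp the winning experience to maximise its value"
    return "Kill the experiment and re-iterate on the implementation"

print(next_action(p_value=0.01, lift=+0.04))  # significant winner -> ramp
print(next_action(p_value=0.30, lift=+0.02))  # inconclusive -> don't over-read it
```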
  34. • Why Experiment? • Why Sweat the Details? • Core

    Experimentation Principles By Phase: • Design • Execution • Analysis • Wrap-Up + Q&A
  35. Takeaway Three: to achieve a meaningful outcome, we must know what we are testing upfront and invest time in choosing the right metrics.
  36. References. Multiple comparison corrections: (1) Bonferroni correction: https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf (2) Benjamini-Hochberg procedure: https://www.statisticshowto.datasciencecentral.com/benjamini-hochberg-procedure/ Online power analysis calculators: (3) for proportions: http://www.experimentationhub.com/hypothesis-kit.html (4) for means: http://statulator.com/SampleSize/ss2M.html Online p-value calculators: (5) for proportions: https://www.socscistatistics.com/tests/chisquare (6) for means: http://www.statskingdom.com/140MeanT2eq.html