Lizzie Eardley & Sophie Harpur - Online experimentation: with great power comes great responsibility (Turing Fest 2019)

Experimentation has the power to provide you with insights about your users, to let you roll out software safely, and to ensure that the features you are building are delivering value. But with this power comes great responsibility: it’s dangerously easy to be misled by the data. Are you making the right decisions based on your experiment results? Are you falling into the common traps of interpreting statistics?

Join an optimistic product manager, Sophie, and a pedantic data scientist, Lizzie, as this experimentation duo discuss and explain experimentation from all perspectives.

Turing Fest

August 28, 2019

Transcript

  1. What we will talk about today

    Experimentation allows you to feel confident when you deploy and in the product decisions you make.
    - What is experimentation?
    - What can it do for you?
    How you should be experimenting:
    - Design
    - Execution
    - Analysis
  2. Engineering teams are moving faster than ever before

    Release frequency over time:
    - 1990s: every few years
    - 2000s: twice a year
    - 2010s: weekly
    - 2015: several per day
    - Today: hundreds of times a day
  3. Why do we need controlled experimentation?

    - To enable safe, fast releases
    - To ensure you’re delivering value
  4. What is online controlled experimentation?

    Compare two versions, A and B: look at the data to decide which version works better.
  5. Why do we need controlled experimentation?

    - To enable safe, fast releases
    - To ensure you’re delivering value
    - To remove external factors
    - To distinguish noise from a real signal
    - To limit human biases
  6. Controlled experimentation removes external influences.

    A change in your metrics can come from your new feature or from all other changes in the world.
  7. Controlled experimentation can distinguish noise from a real signal.

    There is always noise or randomness in the data. If we flip a coin 1000 times, we’re very unlikely to get exactly 500 heads and 500 tails (about a 2.5% chance), but that doesn’t mean the coin is biased. Experimentation statistics allow us to determine whether a result is too big to be due to normal variation and noise.
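The coin-flip claim can be checked directly from the binomial distribution. The sketch below is illustrative only; the function names and the 469-531 range (roughly two standard deviations either side of 500) are assumptions, not taken from the slides. The exact-500 probability comes out at about 0.025, while counts within that wider range are what a fair coin produces most of the time.

```python
from math import comb


def prob_exact_heads(flips: int, heads: int) -> float:
    """Probability of exactly `heads` heads in `flips` tosses of a fair coin."""
    return comb(flips, heads) / 2 ** flips


def prob_heads_within(flips: int, low: int, high: int) -> float:
    """Probability that the number of heads falls in [low, high], inclusive."""
    return sum(comb(flips, k) for k in range(low, high + 1)) / 2 ** flips


p_exact = prob_exact_heads(1000, 500)
p_typical = prob_heads_within(1000, 469, 531)  # range chosen for illustration

print(f"exactly 500 heads: {p_exact:.4f}")
print(f"between 469 and 531 heads: {p_typical:.4f}")
```

So a result such as 520 heads is ordinary noise, whereas something far outside that range would be evidence of a real bias.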
  8. False Positives & False Negatives

    - Type I Error (false positive)
    - Type II Error (false negative)
  9. Controlled experimentation can limit human biases.

    There are 180+ cognitive biases that mess with how we process data, think critically, and perceive reality [1].
  10. 80-90% of features shipped have a negative or neutral impact on the metrics they were designed to improve*

    * Source: HBR, "The Surprising Power of Online Experiments" by Ronny Kohavi and Stefan Thomke
  11. Three stages of online controlled experimentation

    Compare two versions, A and B: look at the data to decide which version works better.
    1. Design
    2. Execution
    3. Analysis
  12. A real example: Airbnb host sign-up page

    A: no video. B: with video.
  13. Standardised Format, Easy & Repeatable Steps

    A good hypothesis has a benchmark and is clear, concise, understandable, testable and measurable. (By Craig Sullivan)
  14. Airbnb hypothesis

    Because we see potential hosts aren’t aware of their level of control, we expect that adding an informative video will provide reassurance, leading to more sign-ups. We’ll measure this using the metric: proportion of users who sign up as hosts.

    A: no video. B: with video.
  15. Airbnb hypothesis

    Because our survey showed potential hosts aren’t aware of their level of control, we expect that adding an informative video for Canadian users will provide reassurance, leading to more sign-ups. We’ll measure this using the metric: proportion of users who visited the host homepage who sign up as hosts. We expect to see an increase of 5% in the host-page conversion rate over a period of 2 weeks.

    A: no video. B: with video.
  16. Characteristics of Good Metrics

    Key properties:
    - Meaningful: the metric directly captures business value or customer satisfaction.
    - Sensitive: the metric should change due to small changes in user satisfaction or business value, so that the result of an experiment is clear within a reasonable time period.
    - Directional: if the business value increases, the metric should move consistently in one direction. If it decreases, the metric should move in the opposite direction.
    - Understandable: it should be easily understood by business executives.
  17. The lure of winning

    Chinese shoe company tricks people into swiping an Instagram ad with a fake strand of hair.
  18. Examples of Key Hypothesis Metrics

    Examples:
    - # Sessions / User (Google)
    - # Tweets / User (Twitter)
    - # Rides / User (Uber)
    One way to recognize whether you have a good key metric is to intentionally experiment with a bad feature you know your users would not like.
  19. Airbnb Key Metric

    Proportion of users who visited the host homepage who sign up as hosts.

    A: no video. B: with video.
  20. Guardrail metrics

    Guardrails are metrics that should not degrade in pursuit of the key metric. Like the key metrics, they should be directional and sensitive, but they do not necessarily have to tie back to business value.
  21. Every comparison you make brings an α% chance of a false positive.

    This applies to testing multiple:
    - Metrics
    - Treatments
    - Segments
    More comparisons → more false positives!
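To see how quickly false positives pile up, the sketch below computes the chance of at least one false positive across several comparisons. It assumes the comparisons are independent and that none of them has a real effect, which is a simplification; the function name is hypothetical.

```python
def chance_of_any_false_positive(alpha: float, comparisons: int) -> float:
    """Chance that at least one of `comparisons` independent tests is a false
    positive when each test has false-positive rate `alpha` and nothing is
    really different."""
    return 1 - (1 - alpha) ** comparisons


for k in (1, 5, 20):
    rate = chance_of_any_false_positive(0.05, k)
    print(f"{k:>2} comparisons at alpha=0.05 -> {rate:.1%} chance of a false positive")
```

Under these assumptions, checking 20 metrics at a 5% level makes at least one spurious "win" more likely than not.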
  22. Safe ways to measure many metrics

    - Use a stricter confidence level [2,3]
    - Re-test significant metrics that weren’t your primary metric
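References [2] and [3] point to the Bonferroni correction and the Benjamini-Hochberg procedure. A minimal sketch of both follows; the p-values are made up for illustration and the function names are assumptions rather than anything shown in the talk.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni: compare every p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: find the largest rank k whose sorted p-value is at most
    (k / m) * alpha, then flag every test ranked k or better."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant


# Made-up p-values for five metrics measured in one experiment.
p_values = [0.001, 0.012, 0.025, 0.045, 0.20]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false positives among the flagged metrics, so it usually flags more.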
  23. Understanding the power of your experiment is vital.

    - Any test will only be able to measure differences larger than a given size
    - Larger sample size → can detect smaller differences
    Minimum Likely Detectable Effect (MLDE): “The smallest effect your test is likely to be able to detect”
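As a rough illustration of how sample size drives the smallest detectable effect, the sketch below uses the standard normal-approximation formula for comparing two proportions. The 25% baseline rate, the 5% significance level and the 80% power are assumptions chosen for the example, and the function name is hypothetical; it is not necessarily how any particular tool computes its MLDE.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, users_per_variant, alpha=0.05, power=0.8):
    """Approximate smallest absolute difference in conversion rate that a
    two-sided test at level `alpha` detects with probability `power`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_variant)
    return (z_alpha + z_power) * standard_error


for n in (1000, 4000, 16000):
    mde = minimum_detectable_effect(0.25, n)
    print(f"{n:>6} users per variant -> detectable difference of about {mde:.3f}")
```

Quadrupling the sample size halves the detectable difference, which is why small effects need long-running or high-traffic experiments.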
  24. Business Cycles & Seasonality

    Traffic usually differs across the days of the week → we recommend running experiments in whole weekly cycles.
  25. You can also do it yourself.

    // getVariant() and saveResult() stand in for your own assignment and logging code
    String variant = getVariant();
    if (variant.equals("on")) {
        // insert on code here
    } else if (variant.equals("off")) {
        // insert off code here
    } else {
        // insert control code here
    }
    saveResult(variant, metricValue);
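The slide leaves getVariant() and saveResult() undefined. One common way to fill them in, sketched here in Python as an assumption rather than as the speakers' implementation, is to hash the user ID together with an experiment name so that each user always lands in the same variant. The names `get_variant`, `save_result`, the experiment key and the in-memory `results` list are all hypothetical.

```python
import hashlib

VARIANTS = ("control", "on", "off")


def get_variant(user_id: str, experiment: str, variants=VARIANTS) -> str:
    """Deterministically map a user to a variant by hashing the user ID
    together with the experiment name (equal split across variants)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


results = []  # stand-in for a real metrics store


def save_result(variant: str, metric_value: float) -> None:
    """Record the metric value observed for a user in the given variant."""
    results.append((variant, metric_value))


variant = get_variant("user-42", "host-signup-video")
save_result(variant, 1.0)
print(variant, results)
```

Hashing instead of calling a random number generator on every request keeps the user's experience consistent across visits, and including the experiment name stops the same users from always sharing a bucket across experiments.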
  26. Things to check for

    - Data is flowing
    - Sample ratio mismatch
    - Big degradations (monitoring)
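A sample ratio mismatch check compares how many users actually landed in each variant with the split you intended. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom; the counts and the function name are made up for illustration, and what p-value should trigger an alert is a judgment call the slides do not specify.

```python
from math import erfc, sqrt


def srm_p_value(count_a: int, count_b: int, expected_share_a: float = 0.5) -> float:
    """p-value of a chi-square goodness-of-fit test (1 degree of freedom)
    comparing observed variant counts with the intended traffic split."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi_square = ((count_a - expected_a) ** 2 / expected_a
                  + (count_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi_square / 2))


print(f"5000 vs 4950 users: p = {srm_p_value(5000, 4950):.3f}")
print(f"5000 vs 4600 users: p = {srm_p_value(5000, 4600):.6f}")
```

A very small p-value suggests the assignment or the data pipeline is broken, in which case the experiment's results should not be trusted until the cause is found.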
  27. p-value = how likely it is that you’d see a difference as big as this if there were no real difference

    - We’re looking for evidence to reject the Null Hypothesis that there is no real difference
    - Low p-value = strong evidence that the Null is not true, i.e. your results are not simply due to noise
  28. Airbnb example analysis

    A: 1000 visitors, 30% (300) signed up
    B: 1000 visitors, 25% (250) signed up
    A p-value of 0.01 means there is only a 1% chance you’d see a difference this big if no one cared about the video at all. [6,7]
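The p-value for these numbers can be reproduced with a two-proportion z-test, which for a 2x2 table matches a chi-square test without continuity correction. This is a sketch using only the standard library; the function name is an assumption, and the online calculators in [6,7] may apply slightly different corrections.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates,
    using the pooled-variance z-test."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_a - rate_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_value = two_proportion_p_value(300, 1000, 250, 1000)
print(f"p-value: {p_value:.4f}")
```

The result is a little over 0.012, which rounds to the 0.01 quoted on the slide.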
  29. Experimentation Results & Actions

    - Statistically significant (positive impact): continue to ramp the winning experience to maximise its value
    - Statistically significant (negative impact): kill the experiment and re-iterate on your implementation
  30. Experimentation Results & Actions

    - Statistically significant (positive impact): continue to ramp the winning experience to maximise its value
    - Statistically significant (negative impact): kill the experiment and re-iterate on your implementation
    - Statistically inconclusive [8]: this inconclusive result means we can’t confidently say there was an impact
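One way to turn a result into one of these three outcomes is to look at the confidence interval for the difference between variants, the topic of reference [8]. The sketch below uses a normal-approximation (Wald) interval; the 95% level, the function names and the second set of counts are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (rate of B minus rate of A), normal approximation."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    standard_error = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    difference = rate_b - rate_a
    return difference - z * standard_error, difference + z * standard_error


def verdict(low: float, high: float) -> str:
    """Classify a result by whether the interval excludes zero."""
    if low > 0:
        return "significant improvement: continue to ramp"
    if high < 0:
        return "significant degradation: kill and re-iterate"
    return "inconclusive: cannot confidently say there was an impact"


for counts in ((300, 1000, 250, 1000), (260, 1000, 250, 1000)):
    low, high = difference_interval(*counts)
    print(counts, f"[{low:+.3f}, {high:+.3f}]", verdict(low, high))
```

An interval that straddles zero does not prove there is no effect; it only says the experiment could not distinguish the effect from noise at this sample size.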
  31. Takeaway Three

    Know what you are testing upfront and invest time in choosing the right metrics.
  32. Come chat to us: @soph_lwc, @LizzieEardley, www.Split.io

    Resources:
    [1] Biases: https://www.visualcapitalist.com/every-single-cognitive-bias/
    Multiple comparison corrections:
    [2] Bonferroni correction: https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
    [3] Benjamini-Hochberg procedure: https://www.statisticshowto.datasciencecentral.com/benjamini-hochberg-procedure/
    Online power analysis calculators:
    [4] For proportions: http://www.experimentationhub.com/hypothesis-kit.html
    [5] For means: http://statulator.com/SampleSize/ss2M.html
    Online p-value calculators:
    [6] For proportions: https://www.socscistatistics.com/tests/chisquare
    [7] For means: http://www.statskingdom.com/140MeanT2eq.html
    [8] Confidence intervals: https://conversionxl.com/blog/ab-testing-statistics/