Lizzie Eardley & Sophie Harpur - Online experimentation: with great power comes great responsibility (Turing Fest 2019)

A little about us: Experimentation team at Skyscanner; Experimentation Product
team at Split Software

What we will talk about today: Experimentation allows you to
feel confident when you deploy and in the product decisions you make.
- What is experimentation?
- What can it do for you?
- How you should be experimenting
  - Design
  - Execution
  - Analysis

Engineering teams are moving faster than ever before.
- 1990s: every few years
- 2000s: twice a year
- 2010s: weekly
- 2015: several per day
- Today: hundreds of times a day

Are we failing more and shipping shittier software faster?

Despite moving faster, we still fail

How do we rapidly deliver valuable software?

Experimentation changes the way we work: delivering speed, delivering value.

Why do we need controlled experimentation?
- To enable safe fast releases
- To ensure you’re delivering value

8% to 23%: growth in Bing’s search market share since
2009

What is controlled experimentation? An experiment where everything is held
constant except for one variable

The first documented controlled experiment was run by a doctor from
Edinburgh in the 1700s!

...facilitating online experiments in the modern world.

What is online controlled experimentation? A vs. B: look at the data
to decide which version works better.

Why do we need controlled experimentation?
- To enable safe fast releases
- To ensure you’re delivering value
- To remove external factors
- To distinguish noise from a real signal
- To limit human biases

Controlled experimentation removes external influences. [Diagram: a metrics change
can come from your new feature or from all other changes in the world.]


Controlled experimentation can distinguish noise from a real signal. There
is always noise or randomness in the data. If we flip a coin 1000 times:
- we’re very unlikely to get exactly 500 heads and 500 tails (about a 3% chance)
- that doesn’t mean the coin is biased
Experimentation statistics allow us to determine if a result is too big to be due to normal variation and noise.
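The coin-flip figure can be checked directly. The sketch below is not from the talk; it uses the binomial formula for a fair coin, and the helper name `prob_exact_heads` is made up for illustration. The exact value comes out at roughly 2.5%, which the slide rounds to 3%.

```python
from math import comb

def prob_exact_heads(flips: int, heads: int) -> float:
    """Probability of exactly `heads` heads in `flips` fair coin flips."""
    return comb(flips, heads) / 2 ** flips

# Chance of a perfect 500/500 split in 1000 flips.
p_exact = prob_exact_heads(1000, 500)

# Chance of landing anywhere from 480 to 520 heads (inclusive).
p_near = sum(prob_exact_heads(1000, k) for k in range(480, 521))

print(f"exactly 500 heads: {p_exact:.4f}")
print(f"480 to 520 heads:  {p_near:.4f}")
```

A perfectly fair coin therefore rarely produces a perfectly even split, which is why a small difference between two variants is not evidence of anything on its own.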

Statistically Significant: The result is unlikely to have occurred by
chance alone

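False Positives & False Negatives.
- Type I error (false positive): concluding there is a difference when there is none
- Type II error (false negative): missing a difference that is really there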

Controlled experimentation can limit human biases. There are 180+ cognitive
biases that mess with how we process data, think critically, and perceive reality[1].

80-90% of features shipped have negative or neutral impact on the metrics they were designed to improve*
* Source: HBR - The Surprising Power of Online Experiments by Ronny Kohavi, Stefan Thomke

Three stages of online controlled experimentation (A vs. B: look at the data to decide which version works better):
1. Design
2. Execution
3. Analysis

A real example: Airbnb host sign-up page. A: no video.
B: with video.

Design. Hypothesis, Metrics, Power Analysis

Hypothesis.

Standardised Format, Easy & Repeatable Steps. A good hypothesis has
a benchmark and is clear, concise, understandable, testable and measurable. (By Craig Sullivan)

Airbnb hypothesis: Because we see potential hosts
aren’t aware of their level of control, we expect that adding an informative video will provide reassurances leading to more sign-ups. We’ll measure this using the metric: proportion of users who sign up as hosts. (A: no video. B: with video.)

Airbnb hypothesis: Because our survey showed potential hosts aren’t
aware of their level of control, we expect that adding an informative video for Canadian users will provide reassurances leading to more sign-ups. We’ll measure this using the metric: proportion of users who visited the host homepage who sign up as hosts. We expect to see an increase of 5% in the host-page conversion rate over a period of 2 weeks. (A: no video. B: with video.)

Metrics.

Characteristics of Good Metrics. Key properties:
- Meaningful - The metric directly captures business value or customer satisfaction.
- Sensitive - The metric should change due to small changes in user satisfaction or business value, so that the result of an experiment is clear within a reasonable time period.
- Directional - If the business value increases, the metric should move consistently in one direction. If it decreases, the metric should move in the opposite direction.
- Understandable - It should be easily understood by business executives.

The lure of winning: Chinese shoe company tricks people into
swiping an Instagram ad with a fake strand of hair.

Examples of Key Hypothesis Metrics:
- # Sessions / User - Google
- # Tweets / User - Twitter
- # Rides / User - Uber
One way to recognize whether you have a good key metric is to intentionally experiment with a bad feature you know your users would not like.

Airbnb Key Metric: Proportion of users who visited the
host homepage who sign up as hosts. (A: no video. B: with video.)

Beware of measuring one metric.

Guardrail metrics: Guardrails are metrics that should not degrade in
pursuit of the key metric. Like the key metrics, they should be directional and sensitive, but they need not tie back to business value.

Beware of too many metrics.

Every comparison you make brings an α chance of a false
positive. This applies to testing multiple:
- Metrics
- Treatments
- Segments
More comparisons → more false positives!
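To see how quickly this adds up, here is a small sketch (not from the talk) of the chance of at least one false positive across several comparisons. It assumes the comparisons are independent, which real metrics often are not, and the function name is hypothetical.

```python
def chance_of_any_false_positive(alpha: float, comparisons: int) -> float:
    """Chance that at least one of `comparisons` independent tests
    gives a false positive when each is run at significance level alpha."""
    return 1 - (1 - alpha) ** comparisons

# Illustrative values only: alpha = 0.05 is a common default.
for k in (1, 5, 10, 20):
    print(f"{k:>2} comparisons: {chance_of_any_false_positive(0.05, k):.1%}")
```

Under these assumptions, twenty comparisons at α = 0.05 already give a better than even chance of at least one spurious "win".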

Safe ways to measure many metrics:
- Use a stricter confidence level [2,3]
- Re-test significant metrics that weren’t your primary metric
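The two corrections referenced in [2,3] can be sketched in a few lines. This is a minimal illustration rather than the speakers' implementation; the function names and the p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni: test every p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: find the largest rank k whose sorted p-value is
    at most (k / m) * alpha, then flag the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant

# Made-up p-values for five metrics in one experiment.
p_values = [0.001, 0.012, 0.025, 0.045, 0.2]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two: it controls the chance of any false positive at all, while Benjamini-Hochberg controls the expected share of false positives among the metrics flagged.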

Power Analysis.

Understanding the power of your experiment is vital.
- Any test will only be able to measure differences larger than a given size
- Larger sample size → can detect smaller differences
Minimum Likely Detectable Effect (MLDE): “The smallest effect your test is likely to be able to detect”
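The link between sample size and detectable effect can be sketched with the standard normal-approximation formula for comparing two proportions. This is an assumption on our part: the calculators linked from this talk may use a different method. The function name, the 25% baseline, and the 80% power and 5% significance defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative lift
    in a conversion rate with a two-sided test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Illustrative numbers: 25% baseline conversion, 5% vs 10% relative lift.
small_lift = sample_size_per_variant(0.25, 0.05)
large_lift = sample_size_per_variant(0.25, 0.10)
print(f"5% lift:  {small_lift} visitors per variant")
print(f"10% lift: {large_lift} visitors per variant")
```

Halving the effect you want to detect roughly quadruples the traffic you need, which is why it pays to do this calculation before the experiment starts.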

Many online power calculators out there to help [4,5]: experimentationhub.com/hypothesis-kit.html


You don’t need super high traffic… but it helps!

Business Cycles & Seasonality: traffic usually differs across the days
of the week → we recommend running experiments in whole weekly cycles
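One way to apply the weekly-cycle advice is to round the planned run time up to whole weeks. The sketch below is hypothetical: the function name and the traffic figures are made up, and it assumes traffic is split evenly across variants.

```python
from math import ceil

def run_length_days(required_per_variant, variants, daily_visitors):
    """Days needed to reach the required sample, rounded up to whole weeks
    so that every day of the week is covered equally."""
    total_needed = required_per_variant * variants
    days = ceil(total_needed / daily_visitors)
    return ceil(days / 7) * 7

# Illustrative numbers: 19,144 visitors per variant, 2 variants, 3,000 visitors a day.
print(run_length_days(19144, 2, 3000), "days")
```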

Execution. Randomization, Data Collection

Many tools exist to help.

You can also do it yourself.

    variant = getVariant();
    if (variant.equals("on")) {
        // insert on code here
    } else if (variant.equals("off")) {
        // insert off code here
    } else {
        // insert control code here
    }
    saveResult(variant, metricValue);
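The slide leaves the assignment step open. One common approach, sketched here as an assumption rather than as the speakers' method, is to hash the user ID together with the experiment name so that each user always sees the same variant. The function signature, the experiment name `host-video`, and the choice of SHA-256 are all illustrative.

```python
import hashlib

def get_variant(user_id: str, experiment: str, variants=("on", "off")) -> str:
    """Deterministically assign a user to a variant by hashing.
    The same user and experiment always map to the same variant."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# Hypothetical usage: count how 10,000 made-up users are split.
counts = {"on": 0, "off": 0}
for i in range(10_000):
    counts[get_variant(f"user-{i}", "host-video")] += 1
print(counts)
```

Including the experiment name in the hash keeps assignments independent between experiments, so a user who lands in "on" for one test is not systematically in "on" for the next.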

Things to check for:
- Data is flowing
- Sample ratio mismatch
- Big degradations (monitoring)
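A sample ratio mismatch check can be sketched as a chi-square goodness-of-fit test on the visitor counts. This is a minimal illustration, not the speakers' tooling: the function name and the counts are made up, and the alert threshold you pick is a judgment call.

```python
from math import erfc, sqrt

def srm_p_value(count_a, count_b, expected_share_a=0.5):
    """p-value of a chi-square test (1 degree of freedom) that the observed
    split between two variants matches the intended split."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # Survival function of chi-square with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))

# Made-up counts for an intended 50/50 split.
print(f"5000 vs 5050: p = {srm_p_value(5000, 5050):.3f}")
print(f"5000 vs 5300: p = {srm_p_value(5000, 5300):.4f}")
```

A very low p-value here suggests the randomization or data collection is broken, in which case the experiment's results should not be trusted until the cause is found.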

Don’t Peek!
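Why peeking is a problem can be shown with a small simulation. The sketch below is our own illustration with made-up parameters: both variants have the same true conversion rate, so every "significant" result is a false positive, and we compare checking once at the end against checking at ten interim looks and stopping at the first significant one.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value (pooled standard error)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(runs=300, visitors=2000, looks=10, rate=0.3, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    step = visitors // looks
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p = 1.0
        for look in range(1, looks + 1):
            # Both variants convert at the same rate: no real difference.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = two_sided_p(conv_a, look * step, conv_b, look * step)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        final_hits += p < alpha
    return peeking_hits / runs, final_hits / runs

peeking_rate, final_rate = simulate()
print(f"false positives when peeking:        {peeking_rate:.1%}")
print(f"false positives checking at the end: {final_rate:.1%}")
```

Because the final look is one of the ten looks, the peeking rate can never be lower than the end-only rate, and in practice it is usually several times higher.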

Likelihood there was no impact

Analysis. Statistics, Conclusions, Actions

p-value = how likely is it that you’d see a
difference as big as this if there were no real difference
- We’re looking for evidence to reject the Null Hypothesis that there is no real difference
- Low p-value = strong evidence that the Null is not true, your results are not simply due to noise

Airbnb example analysis
A: 1000 visitors, 30% (300) signed up
B: 1000 visitors, 25% (250) signed up
A p-value of 0.01 means there is only a 1% chance you’d see a difference this big if no-one cared about the video at all. [6,7]
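The p-value for these counts can be reproduced with a two-proportion z-test, which for a two-sided test is equivalent to a chi-square test without continuity correction. This is a sketch of one standard method and not necessarily how the slide's figure was produced; the function name is hypothetical. The result comes out at about 0.01, in line with the slide.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled standard error under the null of no real difference."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Counts from the example: A 300 of 1000, B 250 of 1000.
p_value = two_proportion_p_value(300, 1000, 250, 1000)
print(f"p-value: {p_value:.4f}")
```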

Experimentation Results & Actions
- Statistically significant (improvement): continue to ramp the winning experience to maximise its value.
- Statistically significant (degradation): kill the experiment and re-iterate on your implementation.
- Statistically inconclusive [8]: this inconclusive result means we can’t confidently say there was an impact.
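These three outcomes can be read off a confidence interval for the difference between the variants. The sketch below is our own illustration using a normal-approximation interval; the function names and verdict labels are hypothetical, and the counts are the ones from the analysis example.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (rate of B - rate of A), normal approximation."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se

def verdict(low, high):
    """Map the interval to an outcome: entirely above zero, entirely below, or spanning it."""
    if low > 0:
        return "significant improvement"
    if high < 0:
        return "significant degradation"
    return "inconclusive"

low, high = diff_confidence_interval(300, 1000, 250, 1000)
print(f"B - A: [{low:.3f}, {high:.3f}] -> {verdict(low, high)}")
```

An interval that spans zero is the inconclusive case: the data are compatible with both a small gain and a small loss.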

Takeaways.

Takeaway One. Experimentation can help you gain confidence and safety within your
deployment cycles.

Takeaway Two. Experimentation allows you to understand your product and your users
in a new way.

Takeaway Three. Know what you are testing upfront and invest time in
choosing the right metrics.

Takeaway Four. Any well designed, implemented and analysed experiment is a successful
experiment.

Come chat to us: @soph_lwc, @LizzieEardley, www.Split.io

Resources:
[1] Biases: https://www.visualcapitalist.com/every-single-cognitive-bias/
Multiple Comparison Corrections
[2] Bonferroni correction: https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
[3] Benjamini Hochberg procedure: https://www.statisticshowto.datasciencecentral.com/benjamini-hochberg-procedure/
Online power analysis calculators
[4] For proportions: http://www.experimentationhub.com/hypothesis-kit.html
[5] For means: http://statulator.com/SampleSize/ss2M.html
Online p-value calculators
[6] For proportions: https://www.socscistatistics.com/tests/chisquare
[7] For means: http://www.statskingdom.com/140MeanT2eq.html
[8] Confidence Intervals: https://conversionxl.com/blog/ab-testing-statistics/
