PyData Dallas 2015: How to conclude online experiments in Python

volodymyrk How to conclude online experiments in Python
Volodymyr (Vlad) Kazantsev, Head of Data Science at Product Madness

volodymyrk Goal of the tutorial
Uncover the “magic” behind the statistics used for A/B testing and other online experiments

volodymyrk A quick bio (timeline from now back to 2004)
• Head of Data Science (Social Gaming)
• Product Manager at King
• MBA at London Business School
• Visual Effect developer (Avatar, Batman, ...)
• MSc in Probability (Kiev Uni, Ukraine)

volodymyrk Different kinds of tests
• Classic A/B tests
• Long running activities with control groups
• Longitudinal tests

volodymyrk Why bother?
• To test your hypothesis and learn
• To avoid blindly following HiPPOs
• To audit performance of product and marketing teams

volodymyrk Why Stats?
• To separate data from the noise
• To quantify uncertainty

volodymyrk Fruit Crush Epic
The story of an almost real mobile game, in an almost real gaming company... and one Data Scientist

volodymyrk Day-1 3 seconds panic-attack

volodymyrk Day 1 - loading time panic-attack!

volodymyrk Taxonomy of Classical stat testing
Which Test?
• 1 Sample
  - Mean: σ known → z-test, one sample; σ unknown → t-test, one sample
  - Proportion: z-test for proportion
  - Variance: Chi-squared test
• 2 Samples
  - Mean, independent samples: σ1, σ2 known → z-test for (μ1 - μ2); σ1, σ2 unknown → t-test for (μ1 - μ2)
  - Mean, dependent samples: z-test or t-test for dependent samples
  - Proportion: z-test, 2 proportions
  - Variance: F-test
• >2 Samples: ANOVA

volodymyrk One sample t-test
Null Hypothesis:
- avg. loading time <= 3 seconds for last hour's observations
Alternative Hypothesis:
- population mean is > 3 seconds for last hour's observations
Test:
- single sample, one-sided t-test

volodymyrk One sample t-test
t_value = t-test(samples, expected mean)
p-value = t-distribution lookup(t_value, sample_size) = 0.086
p-value: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true

volodymyrk If you want to code it yourself
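
The code from this slide is not in the transcript. A minimal sketch of what a hand-rolled one-sided, one-sample t-test could look like is shown below. The function name and the loading times are hypothetical, so the result will not match the 0.086 quoted in the talk.

```python
import numpy as np
from scipy import stats

def one_sample_t_test_greater(samples, expected_mean):
    """One-sided test of H0: mean <= expected_mean vs H1: mean > expected_mean."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Standard error of the mean, using the sample (n - 1) standard deviation.
    sem = samples.std(ddof=1) / np.sqrt(n)
    t_value = (samples.mean() - expected_mean) / sem
    # Upper-tail area of the t-distribution with n - 1 degrees of freedom.
    p_value = stats.t.sf(t_value, df=n - 1)
    return t_value, p_value

# Hypothetical loading times in seconds (not the data from the talk).
loading_times = [3.4, 2.9, 3.8, 3.1, 2.7, 3.6, 3.3, 3.0, 3.9, 2.8]
t_value, p_value = one_sample_t_test_greater(loading_times, expected_mean=3.0)
print(f"t = {t_value:.3f}, one-sided p = {p_value:.3f}")
```

The "lookup" step is the call to `stats.t.sf`, which takes the degrees of freedom (sample size minus one) rather than the sample size itself.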

volodymyrk Stats in Python
Classical: numpy, scipy.stats, statsmodels.stats
Bayesian: theano, pymc3
* High-level view. Lots of stuff missing here. pymc3 uses statsmodels for GLM

volodymyrk One sample t-test and z-test
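
As a rough illustration of how the two tests relate, the sketch below runs `scipy.stats.ttest_1samp` on hypothetical data and then looks the same statistic up in the normal distribution instead of the t-distribution. The data, the target and the variable names are assumptions, not taken from the slide.

```python
import numpy as np
from scipy import stats

# Hypothetical loading times in seconds (not the data from the talk).
rng = np.random.default_rng(0)
loading_times = rng.normal(loc=3.1, scale=0.8, size=200)
target = 3.0

# t-test: scipy returns a two-sided p-value by default.
t_stat, p_two_sided = stats.ttest_1samp(loading_times, popmean=target)
# Convert it to the one-sided alternative "mean > target".
p_t = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# z-test: the same statistic, looked up in the standard normal distribution.
p_z = stats.norm.sf(t_stat)

print(f"t-test p = {p_t:.4f}, z-test p = {p_z:.4f}")
```

With a couple of hundred observations the two p-values are nearly identical; the difference only matters for small samples.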

volodymyrk Confidence Interval

volodymyrk Confidence Interval for the Mean
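
One common way to build this interval is mean ± t* × standard error, with t* taken from the t-distribution. The sketch below assumes that construction; the function name and the data are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(samples, confidence=0.95):
    """Two-sided t-based confidence interval for the population mean."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    mean = samples.mean()
    sem = stats.sem(samples)  # uses ddof=1 by default
    # Critical value that leaves (1 - confidence) / 2 in each tail.
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem

# Hypothetical loading times in seconds (not the data from the talk).
loading_times = [3.4, 2.9, 3.8, 3.1, 2.7, 3.6, 3.3, 3.0, 3.9, 2.8]
low, high = mean_confidence_interval(loading_times)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```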

volodymyrk Standard Error of the Mean in Python
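
The standard error of the mean is the sample standard deviation divided by the square root of the sample size. A standard-library sketch follows; the helper name and the data are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(samples):
    # statistics.stdev uses the n - 1 (sample) denominator.
    return statistics.stdev(samples) / math.sqrt(len(samples))

# Hypothetical loading times in seconds (not the data from the talk).
loading_times = [3.4, 2.9, 3.8, 3.1, 2.7, 3.6, 3.3, 3.0, 3.9, 2.8]
sem = standard_error_of_mean(loading_times)
print(f"SEM = {sem:.4f}")
```

Because of the square root, quadrupling the sample size only halves the standard error.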

volodymyrk Next Day

volodymyrk Day-2 OMG, my Retention is low!

volodymyrk Is my day-1 retention low?
Day-1 results:
installs             448
returned next day    123
Day-1 retention      27.46%
Retention target     30%

volodymyrk One sample z-test for proportion
Null Hypothesis:
- avg. retention >= 30%
Alternative Hypothesis:
- avg. retention < 30%
Test:
- single sample, one-sided z-test for proportion

volodymyrk In Python...
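
The slide's code is not in the transcript. A standard-library sketch of the one-sided test on the day-1 numbers (123 of 448 installs returned, target 30%) could look like this. It assumes the standard error is computed under the null hypothesis, which may differ from what the talk used.

```python
import math
from statistics import NormalDist

installs = 448
returned = 123
target = 0.30

p_hat = returned / installs
# Standard error of the proportion, assuming the null (retention = 30%) is true.
se_null = math.sqrt(target * (1 - target) / installs)
z = (p_hat - target) / se_null
# Lower-tail p-value, because the alternative is "retention < 30%".
p_value = NormalDist().cdf(z)

print(f"retention = {p_hat:.4f}, z = {z:.2f}, one-sided p = {p_value:.3f}")
```

Under these assumptions z is roughly -1.18 and the p-value roughly 0.12, so this sample alone does not show that retention is below the target at the usual 5% level.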

volodymyrk So what is my confidence interval?
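
A minimal sketch of a normal-approximation (Wald) confidence interval for the day-1 retention rate is shown below. The choice of the Wald interval and the function name are assumptions; other intervals (Wilson, exact) would give slightly different bounds.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, trials, confidence=0.95):
    """Wald (normal approximation) confidence interval for a proportion."""
    p_hat = successes / trials
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(successes=123, trials=448)
print(f"95% CI for day-1 retention: ({low:.3f}, {high:.3f})")
```

With these inputs the interval runs from about 23.3% to about 31.6%, so it still contains the 30% target.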

volodymyrk Day-5 Connect with Facebook or Die! The First A/B
test

volodymyrk A/B test 1 - connect to Facebook

volodymyrk A/B test design
Users are split 50% / 50% into Group A and Group B.
[Diagram labels: Start Level 1, Start Level 1, Finish Level 1]
                    Group A    Group B
Have seen prompt    2501       2141
Connected           1104       1076
Connect rate        44.1%      50.2%

volodymyrk Is it statistically significant?

volodymyrk Two samples z-test for proportion
Null Hypothesis:
- avg. connection rate is the same. P1 = P2
Alternative Hypothesis:
- P1 ≠ P2
Test:
- two samples z-test for proportion. Two sided

volodymyrk Two samples z-test for proportion in Python
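
The slide's code is not in the transcript. The sketch below applies a pooled two-sided z-test to the counts from the A/B test design slide (1104 of 2501 vs 1076 of 2141). The function name is hypothetical, and the pooled standard error is an assumption about the variant used.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test of H0: P1 = P2, using the pooled proportion."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

z, p_value = two_proportion_z_test(1104, 2501, 1076, 2141)
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

Under these assumptions z is about -4.16, far beyond the usual 1.96 threshold, so the difference in connect rate is statistically significant.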

volodymyrk Confidence interval for difference in proportion

volodymyrk In Python
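
A sketch of a 95% confidence interval for the difference in connect rate (Group B minus Group A), using the counts from the A/B test design slide. It assumes the usual unpooled standard error; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def diff_proportion_confidence_interval(success_a, n_a, success_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for p_b - p_a."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Unpooled standard error: each group keeps its own variance estimate.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

low, high = diff_proportion_confidence_interval(1104, 2501, 1076, 2141)
print(f"95% CI for the uplift: ({low:.3f}, {high:.3f})")
```

The interval comes out at roughly 3.2 to 9.0 percentage points and excludes zero, which agrees with the significance test.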

volodymyrk What should we measure, exactly?
[Figure: funnel diagram for two groups of 1000 users each, from Start Level 1 to Start Level 2. First group: connected 47%, retained 82% (node counts 150, 400, 450, 30, 390, 430). Second group: connected 50%, retained 80% (node counts 160, 840, 40, 400, 400).]

volodymyrk What about Bayesian Stats?

volodymyrk Bayesian Credible Interval vs. CI
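
The slide content is not in the transcript. As a rough illustration of a credible interval, the sketch below puts a uniform Beta(1, 1) prior on the day-1 retention rate (123 of 448) and reads the interval off draws from the Beta posterior. The prior, the choice of data and the function name are all assumptions, and sampling stands in for an exact quantile lookup.

```python
import random

def beta_credible_interval(successes, trials, confidence=0.95, draws=100_000, seed=0):
    """Equal-tailed credible interval for a rate, estimated from posterior draws."""
    rng = random.Random(seed)
    # Uniform Beta(1, 1) prior -> Beta(1 + successes, 1 + failures) posterior.
    alpha = 1 + successes
    beta = 1 + trials - successes
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - confidence) / 2
    low = samples[int(tail * draws)]
    high = samples[int((1 - tail) * draws) - 1]
    return low, high

low, high = beta_credible_interval(successes=123, trials=448)
print(f"95% credible interval for retention: ({low:.3f}, {high:.3f})")
```

With a flat prior and a few hundred observations the credible interval lands very close to the classical confidence interval; the interpretation differs, since the credible interval is a direct probability statement about the rate given the data and the prior.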

volodymyrk Day-30 Do you want to buy last chance?
A/B testing Revenue

volodymyrk How much is an extra life worth?
• “LOSER!!! Purchase another chance for only.. $0.99”
• “LOSER!!! Purchase another chance for only.. $1.99”

volodymyrk How are we going to test it?
Consider
• There are multiple items to buy in game (lives, boosters, blenders, etc)
• We expect more people to make a $0.99 purchase, so we hope to make more money overall, even at the lower price
A/B test Design
• We will show the A/B test to new users only
• Will run for 2 months
• We will measure overall revenue per user in the first 30 days
• Null-hypothesis: we make more money from $0.99 group
Measurements
• Difference in Average Revenue Per User (ARPU) in 30 days
• Difference in Conversion Rate (% of users who make at least 1 purchase)

volodymyrk Results
count    450      390
mean     151.9    214.2
25%      20.8     26.5
50%      55.3     69.4
75%      147.3    231.3
max      3960     3647.8
* random generator used in the example is available in ipython notebooks
** distribution is made more extreme than what is normally observed in a casual game, like our imaginary match-3 title

volodymyrk Results
30,000 users in each group
450 payers vs. 390 payers
p-value = 0.037: Significant
p-value = ???: Is it Significant?
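
The transcript does not say which test produced 0.037. A pooled two-sided z-test on the conversion counts (450 vs 390 payers out of 30,000 users each) reproduces that figure, so the sketch below assumes that is the comparison being made.

```python
import math
from statistics import NormalDist

users_per_group = 30_000
payers = (450, 390)

rate_1 = payers[0] / users_per_group
rate_2 = payers[1] / users_per_group
# Pooled conversion rate under the null hypothesis of equal rates.
pooled = sum(payers) / (2 * users_per_group)
se = math.sqrt(pooled * (1 - pooled) * (2 / users_per_group))
z = (rate_1 - rate_2) / se
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

The p-value comes out at about 0.037. The open question on the slide is then the second metric, revenue per user, which is far more skewed than a yes/no conversion flag.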

volodymyrk Welch's t-test (σ1 ≠ σ2)
Can we actually use a t-test?
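
In SciPy, Welch's t-test is `scipy.stats.ttest_ind` with `equal_var=False`. The sketch below runs it on hypothetical, heavily skewed revenue data; the log-normal parameters are made up and are not the random generator from the talk's notebooks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical revenue per payer, log-normal to mimic a long right tail.
revenue_a = rng.lognormal(mean=4.0, sigma=1.3, size=450)
revenue_b = rng.lognormal(mean=4.2, sigma=1.3, size=390)

# equal_var=False selects Welch's test, which does not assume σ1 = σ2.
t_stat, p_value = stats.ttest_ind(revenue_a, revenue_b, equal_var=False)

print(f"mean A = {revenue_a.mean():.1f}, mean B = {revenue_b.mean():.1f}")
print(f"Welch t = {t_stat:.2f}, two-sided p = {p_value:.3f}")
```

Whether the t-test can be trusted here depends on the sample means being roughly normal, which is why the talk goes on to check it against other methods.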

volodymyrk Poor man’s non-parametric test: split 5, p < 3%

volodymyrk If you don’t know enough stats - simulate!
This is very close to the p-value from the t-test
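
The transcript does not show which simulation the talk used. One standard option is a permutation test: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The sketch below assumes that approach, with hypothetical log-normal revenue data.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled data reassigns group labels at random.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical skewed revenue per payer, not the talk's generator.
data_rng = random.Random(1)
revenue_a = [data_rng.lognormvariate(4.0, 1.3) for _ in range(450)]
revenue_b = [data_rng.lognormvariate(4.2, 1.3) for _ in range(390)]
p_value = permutation_p_value(revenue_a, revenue_b)
print(f"permutation p = {p_value:.3f}")
```

The test makes no distributional assumption beyond exchangeability under the null, so agreement with the t-test p-value is evidence that the t-test is behaving reasonably on this data.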

volodymyrk Can we improve sensitivity?
27 players across both groups have spent more than $1000: 10 in the $0.99 group and 17 in the $1.99 group
Max spent = $3960

volodymyrk And we re-run our analysis
Again, we can use a t-test

volodymyrk Final Thoughts

volodymyrk Can we analyse distributions?
• You can quantify the difference between two curves
• Area under the curve is Average Revenue per User

volodymyrk Is 30 day revenue a good metric?
[Figure: LTV projection A vs. LTV projection B]

volodymyrk Summary:
• There are only a few stats tests that any Data Scientist must know
• t-tests are robust enough to be useful even with skewed data sets
• Bayesian methods and MCMC are cool, but don’t use MCMC for trivial cases
• It is hard to detect the difference in heavily-skewed cases
IPython Notebooks for this tutorial are available at: http://nbviewer.ipython.org/github/VolodymyrK/stats-testing-in-python
