PyData Dallas 2015: How to conclude online experiments in Python

Slides for PyData Dallas 2015 "How to conclude online experiments in Python"

VolodymyrK

April 25, 2015

Transcript

  1. volodymyrk How to conclude online experiments in Python

    Volodymyr (Vlad) Kazantsev, Head of Data Science at Product Madness
  2. volodymyrk Goal of the tutorial

    Uncover the “magic” behind statistics used for A/B testing and other online experiments
  3. volodymyrk A quick bio (timeline: Now to 2004)

    • Head of Data Science (Social Gaming)
    • Product Manager at King
    • MBA at London Business School
    • Visual Effects developer (Avatar, Batman, ...)
    • MSc in Probability (Kiev Uni, Ukraine)
  4. volodymyrk Different kinds of tests

    • Classic A/B tests
    • Long-running activities with control groups
    • Longitudinal tests
  5. volodymyrk Why bother?

    • To test your hypothesis and learn
    • To avoid blindly following HiPPOs
    • To audit performance of product and marketing teams
  6. volodymyrk Fruit Crush Epic

    The story of an almost real mobile game in an almost real gaming company... and one Data Scientist
  7. volodymyrk Taxonomy of Classical stat testing: Which Test?

    1 Sample
      Mean
        σ known: z-test one sample
        σ unknown: t-test one sample
      Proportion: z-test for proportion
      Variance: Chi-squared test
    2 Samples
      Mean
        independent samples, σ₁, σ₂ known: z-test for (μ₁ - μ₂)
        independent samples, σ₁, σ₂ unknown: t-test for (μ₁ - μ₂)
        dependent samples: z-test or t-test for dependent samples
      Proportion: z-test, 2 proportions
      Variance: F-test
    >2 Samples: ANOVA
  8. volodymyrk Taxonomy of Classical stat testing: Which Test? (same chart as slide 7)
  9. volodymyrk One sample t-test

    Null Hypothesis: average loading time is <= 3 seconds for the last hour's observations
    Alternative Hypothesis: population mean is > 3 seconds for the last hour's observations
    Test: single sample, one-sided t-test
  10. volodymyrk One sample t-test

    t_value = t-test(samples, expected mean)
    p-value = t-distribution lookup(t_value, sample_size)
    p-value: 0.086, the probability of obtaining a result at least as extreme as the one observed, assuming the Null Hypothesis is true
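
The two pseudocode steps on slide 10 can be sketched with scipy.stats roughly as follows. The loading-time samples and the helper name `one_sample_t_test_greater` are hypothetical, made up for illustration, so the output is not expected to reproduce the 0.086 quoted on the slide.

```python
import numpy as np
from scipy import stats


def one_sample_t_test_greater(samples, expected_mean):
    """One-sided, one-sample t-test of H0: mean <= expected_mean."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Standard error of the mean, using the sample standard deviation.
    standard_error = samples.std(ddof=1) / np.sqrt(n)
    t_value = (samples.mean() - expected_mean) / standard_error
    # "t-distribution lookup": upper-tail probability with n - 1 degrees of freedom.
    p_value = stats.t.sf(t_value, df=n - 1)
    return t_value, p_value


# Hypothetical loading times in seconds (illustrative only, not from the talk).
loading_times = [3.4, 2.9, 3.8, 3.1, 2.7, 3.6, 3.3, 3.0, 3.9, 2.8]
t_value, p_value = one_sample_t_test_greater(loading_times, expected_mean=3.0)
print(f"t = {t_value:.3f}, one-sided p = {p_value:.3f}")
```

The t statistic matches what `scipy.stats.ttest_1samp` returns; that function reports a two-sided p-value by default, which is halved here because the alternative is one-sided and the sample mean lies above 3 seconds.
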
  11. volodymyrk Stats in Python

    numpy, scipy.stats, statsmodels.stats, theano, pymc3 (Classical / Bayesian)
    * High-level view. Lots of stuff missing here. pymc3 uses statsmodels for GLM
  12. volodymyrk Is my day-1 retention low?

    Day-1 results (Fruit Crush Epic):
    installs             448
    returned next day    123
    Day-1 retention      27.46%
    Retention target     30%
  13. volodymyrk Taxonomy of Classical stat testing: Which Test? (same chart as slide 7)
  14. volodymyrk One sample z-test for proportion

    Null Hypothesis: average retention >= 30%
    Alternative Hypothesis: average retention < 30%
    Test: single sample, one-sided z-test for proportion
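
A minimal sketch of this test on the day-1 counts from slide 12 (123 returning players out of 448 installs, 30% target). It uses only the Python standard library; the helper name `one_sample_proportion_z_test` is an assumption for illustration, and the standard error is computed under the null value of 30%.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z_test(successes, trials, target):
    """One-sided z-test of H0: proportion >= target vs H1: proportion < target."""
    observed = successes / trials
    # Standard error under the null hypothesis (proportion equal to the target).
    standard_error = sqrt(target * (1 - target) / trials)
    z_value = (observed - target) / standard_error
    # Lower-tail probability, because the alternative is "retention < target".
    p_value = NormalDist().cdf(z_value)
    return observed, z_value, p_value


observed, z_value, p_value = one_sample_proportion_z_test(123, 448, 0.30)
print(f"retention = {observed:.2%}, z = {z_value:.2f}, one-sided p = {p_value:.3f}")
```

With these counts the one-sided p-value comes out at roughly 0.12, so at the usual 5% level the data do not show that retention is below the 30% target.
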
  15. volodymyrk A/B test design

    Diagram labels: Group A (50%), Group B (50%); Start Level 1, Start Level 1, Finish Level 1

    Group A: have seen prompt 2501, connected 1104, connect rate 44.1%
    Group B: have seen prompt 2141, connected 1076, connect rate 50.2%
    Fruit Crush Epic
  16. volodymyrk Taxonomy of Classical stat testing: Which Test? (same chart as slide 7)
  17. volodymyrk Two samples z-test for proportion

    Null Hypothesis: the average connection rate is the same, P₁ = P₂
    Alternative Hypothesis: P₁ ≠ P₂
    Test: two-sample, two-sided z-test for proportion
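
As a rough sketch, the two-sided test can be run on the connect counts from slide 15 with a pooled standard error. The helper name `two_proportion_z_test` is hypothetical, and the code assumes the first set of counts on the slide (2501 / 1104) belongs to one group and the second (2141 / 1076) to the other; for a two-sided test the order does not affect the p-value.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_1, trials_1, successes_2, trials_2):
    """Two-sided z-test of H0: P1 == P2, using the pooled proportion."""
    rate_1 = successes_1 / trials_1
    rate_2 = successes_2 / trials_2
    # Under H0 both groups share one rate, so pool the counts to estimate it.
    pooled = (successes_1 + successes_2) / (trials_1 + trials_2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / trials_1 + 1 / trials_2))
    z_value = (rate_1 - rate_2) / standard_error
    # Two-sided p-value: probability mass in both tails beyond |z|.
    p_value = 2 * NormalDist().cdf(-abs(z_value))
    return z_value, p_value


z_value, p_value = two_proportion_z_test(1104, 2501, 1076, 2141)
print(f"z = {z_value:.2f}, two-sided p = {p_value:.6f}")
```
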
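  18. volodymyrk What should we measure, exactly?

    [Diagram: two player funnels from Start Level 1 to Start Level 2, each starting with 1000 players. First funnel: connected 47%, retained 82%. Second funnel: connected 50%, retained 80%. Node counts in extraction order: 1000, 1000, 150, 400, 450, 30, 390, 430, 160, 840, 40, 400, 400.]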
  19. volodymyrk How much is an extra life worth?

    LOSER!!! Purchase another chance for only.. $0.99
    LOSER!!! Purchase another chance for only.. $1.99
    Fruit Crush Epic
  20. volodymyrk How are we going to test it?

    Consider
    • There are multiple items to buy in the game (lives, boosters, blenders, etc.)
    • We expect more people to make a $0.99 purchase, so we hope to make more money overall, even at the lower price

    A/B test Design
    • We will show the A/B test to new users only
    • It will run for 2 months
    • We will measure overall revenue per user in the first 30 days
    • Null-hypothesis: we make more money from the $0.99 group

    Measurements
    • Difference in Average Revenue Per User (ARPU) in 30 days
    • Difference in Conversion Rate (% of users who make at least 1 purchase)
  21. volodymyrk Results

    count    450      390
    mean     151.9    214.2
    25%      20.8     26.5
    50%      55.3     69.4
    75%      147.3    231.3
    max      3960     3647.8

    Fruit Crush Epic
    * random generator used in the example is available in IPython notebooks
    ** distribution is made more extreme than what is normally observed in a casual game, like our imaginary match-3 title
  22. volodymyrk Results

    30,000 users in each group
    450 payers vs 390 payers: p-value = 0.037 (significant)
    p-value = ??? Is it significant?
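
The slide does not say which test produced 0.037. As a sanity check, and assuming it refers to the conversion-rate difference, a pooled two-proportion z-test on 450 versus 390 payers out of 30,000 users per group gives about the same value. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

users_per_group = 30_000
payers_first, payers_second = 450, 390

rate_first = payers_first / users_per_group
rate_second = payers_second / users_per_group

# Pooled conversion rate under H0 (both groups convert at the same rate).
pooled_rate = (payers_first + payers_second) / (2 * users_per_group)
standard_error = sqrt(pooled_rate * (1 - pooled_rate) * 2 / users_per_group)

z_value = (rate_first - rate_second) / standard_error
p_value = 2 * NormalDist().cdf(-abs(z_value))  # two-sided
print(f"conversion {rate_first:.2%} vs {rate_second:.2%}, "
      f"z = {z_value:.2f}, p = {p_value:.3f}")
```
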
  23. volodymyrk Taxonomy of Classical stat testing: Which Test? (same chart as slide 7)
  24. volodymyrk Can we improve sensitivity?

    27 players have spent > $1000 across both groups: 10 in the $0.99 group and 17 in the $1.99 group
    Max spent = $3960
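
To see why a handful of big spenders makes the revenue comparison hard, here is a hedged sketch of a two-sample t-test on synthetic, heavily skewed revenue. The log-normal generator, its parameters and the seed are assumptions chosen for illustration; this is not the random generator from the talk's notebooks, and the printed numbers will not match the slides. Welch's variant (`equal_var=False`) is used so the two groups may have different variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
users_per_group = 30_000


def simulate_revenue(payers, sigma):
    """Revenue per user: zero for non-payers, log-normal spend for payers."""
    revenue = np.zeros(users_per_group)
    revenue[:payers] = rng.lognormal(mean=4.0, sigma=sigma, size=payers)
    return revenue


# Hypothetical groups: payer counts follow slide 22, spend shapes are made up.
group_first = simulate_revenue(payers=450, sigma=1.5)
group_second = simulate_revenue(payers=390, sigma=1.6)

result = stats.ttest_ind(group_first, group_second, equal_var=False)
print(f"ARPU {group_first.mean():.2f} vs {group_second.mean():.2f}")
print(f"t = {result.statistic:.3f}, two-sided p = {result.pvalue:.3f}")
```

Rerunning with different seeds shows how much the p-value moves when a few extreme spenders land in one group or the other.
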
  25. volodymyrk Can we analyse distributions?

    You can quantify the difference between two curves
    Area under the curve is Average Revenue per User
    Fruit Crush Epic
    * random generator used in the example is available in IPython notebooks
    ** distribution is made more extreme than what is normally observed in a casual game, like our imaginary match-3 title
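
The transcript does not say which curves the slide plots. One curve with the stated property is the empirical survival function (the share of users whose revenue exceeds x): for non-negative revenue, the area under it equals the mean, that is, ARPU. A small sketch with hypothetical revenue values:

```python
import numpy as np


def area_under_survival_curve(revenue):
    """Area under the empirical survival function S(x) = share of users with revenue > x."""
    x = np.sort(np.asarray(revenue, dtype=float))
    n = x.size
    # Widths of the steps between consecutive sorted values, starting from 0.
    widths = np.diff(np.concatenate(([0.0], x)))
    # Height of each step: fraction of users still above the left edge of the step.
    heights = (n - np.arange(n)) / n
    return float(np.sum(widths * heights))


# Hypothetical revenue per user (illustrative only): mostly zeros, a few payers.
revenue = [0, 0, 0, 0, 0.99, 0.99, 4.95, 19.80, 0, 250.0]
area = area_under_survival_curve(revenue)
print(f"area under survival curve = {area:.3f}, mean revenue = {np.mean(revenue):.3f}")
```

Under this reading, the difference between two groups' curves is the area between them, which is the difference in ARPU.
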
  26. volodymyrk Summary:

    • There are only a few stats tests that every Data Scientist must know
    • t-tests are robust enough to be useful even with skewed data sets
    • Bayesian methods and MCMC are cool, but don’t use MCMC for trivial cases
    • It is hard to detect a difference in heavily skewed cases

    IPython Notebooks for this tutorial are available at: http://nbviewer.ipython.org/github/VolodymyrK/stats-testing-in-python