Class 6: Finding False Findings Fast and Furiously

David Evans
January 31, 2019

Class 6: Finding False Findings Fast
https://uvammm.github.io/class6

Markets, Mechanisms, and Machines
University of Virginia
cs4501/econ4559 Spring 2019
David Evans and Denis Nekipelov
https://uvammm.github.io/

Transcript

  1. MARKETS, MECHANISMS, MACHINES
    University of Virginia, Spring 2019
    Class 6: Experiments
    31 January 2019
    cs4501/econ4559 Spring 2019
    David Evans and Denis Nekipelov
    https://uvammm.github.io
  2. Project Grading
    Excellent job: met our expectations for this project and got what we hoped you would out of it.
    Reasonable: missed some things we hoped you would get, but a good effort and got most of what we wanted out of this.
    Some serious problems: seem to be missing key ideas, or sign of unacceptable effort; didn’t get what we think you should out of this.
    Positive modifiers:
  3. Unbounded Scale
    Exceptional! Better than we thought possible!
    Breakthrough result, should be published in a top venue
    Worthy of a Turing Award/Nobel Prize
  4. Project 2 Due Tuesday
    More opportunities for … than Project 1
    Projects will get more and more open-ended, until the Final Project, where you will be able to select your own problem
  5. Experiments

  6. Current citation rates suggest that I am among the 10 scientists worldwide who are currently the most commonly cited, perhaps also the currently most-cited physician. This probably only proves that citation metrics are highly unreliable, since I estimate that I have been rejected over 1,000 times in my life. Regardless, I consider myself privileged to have learned and to continue to learn from interactions with students and young scientists (of all ages) from all over the world and I love to be constantly reminded that I know next to nothing.
    PLOS Medicine, 2005
  7. Why Most Research Findings Are False
    \( R = \dfrac{\text{true relationships}}{\text{not true relationships}} \)
    Genome study: 100 000 markers, ~10 associated with disease
    \( R \approx 10^{-4} \)
  8. Why Most Research Findings Are False
    Genome study: 100 000 markers, ~10 associated with disease
    \( R \approx 10^{-4} \)
    \( R = \dfrac{\text{true relationships}}{\text{not true relationships}} \)
  9. Study Outcomes
    Pick true relationship, probability study finds it true: \( 1 - \beta \)
    (\( \beta \): “Type 2 error rate”, “false negative”)
    Pick false relationship, probability study finds it true: \( \alpha \)
    (\( \alpha \): “Type 1 error rate”, “false positive”)
  10. Positive Predictive Value
    \( \mathrm{PPV} = \dfrac{\text{number of true positives}}{\text{total number of positive outcomes}} \)
    \( = P(\text{relationship is true} \mid \text{experiment finds it true}) \)
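The definition above can be turned into a number once R, α, and β are fixed. The following is a minimal sketch, assuming the tested relationships split into true and not-true in the ratio R and that a study reports a true relationship with probability 1 − β and a not-true one with probability α. The function name and the α = 0.05, β = 0.20 values in the usage lines are illustrative assumptions; R = 10⁻⁴ is the genome-study figure from the earlier slide.

```python
def positive_predictive_value(R: float, alpha: float, beta: float) -> float:
    """Fraction of positive study outcomes that are true relationships.

    R     : ratio of true to not-true relationships among those tested
    alpha : Type 1 error rate (false positive probability)
    beta  : Type 2 error rate (false negative probability)
    """
    true_fraction = R / (R + 1)       # share of tested relationships that are true
    false_fraction = 1 / (R + 1)      # share that are not true
    true_positives = (1 - beta) * true_fraction
    false_positives = alpha * false_fraction
    return true_positives / (true_positives + false_positives)


# Genome-study ratio from the slides; alpha and beta are assumed values.
ppv = positive_predictive_value(R=1e-4, alpha=0.05, beta=0.20)
print(f"PPV = {ppv:.4f}")  # about 0.0016
```

Under these assumptions fewer than 2 in 1,000 positive findings would be true, because the false positives from the huge pool of not-true relationships swamp the few true ones.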
  11. 10

  12. Why is \( \alpha = 0.05 \)?

  13. Why is \( \alpha = 0.05 \)?

  14. https://xkcd.com/882/ https://xkcd.com/1478/

  15. Mostly False Results in Practice

  16. 15

  17. It’s not just in medicine...

  18. It’s not just in medicine...
    Thomas Herndon, UMass student who attempted to replicate for Econ class assignment
  19. Web Experiments
    controlled experiments
    randomized experiments
    A/B tests
    split tests
    Control/Treatment
    …
  20. Widespread Use and Value

  21. 20

  22. 21

  23. A/B Test
    [Diagram: Population of Users → All incoming requests → Splitter → Control (Existing System) or Treatment (Modified Behavior) → Logging → Log A, Log B → Analyzer]
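The splitter in the diagram has to send each user to the same arm on every request. A minimal sketch of one common way to do that is to hash a user identifier into a bucket. The function name `assign_variant`, the experiment label, and the 50/50 split are assumptions for illustration, not details from the slides.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "exp-1",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# Stand-ins for Log A (control) and Log B (treatment).
logs = {"control": [], "treatment": []}
for i in range(10_000):
    uid = f"user-{i}"  # hypothetical user identifiers
    logs[assign_variant(uid)].append(uid)

print({arm: len(entries) for arm, entries in logs.items()})
```

Hashing rather than drawing a fresh random number per request keeps a returning user in the same arm without storing any assignment table.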
  24. Analyzing the Logs
    [Diagram: Splitter → Control (Existing System) or Treatment (Modified Behavior) → Logging → Log A, Log B → Analyzer]
    Overall Evaluation Criterion (OEC): quantitative measure of goal
    Also called: Response, Evaluation Metric
  25. Terminology
    [Diagram: Splitter → Control (Existing System) or Treatment (Modified Behavior) → Logging → Log A, Log B → Analyzer]
    Factor: variable controlled by experiment
    Experimental Unit: entity over which metrics are calculated
  26. Tracking Users
    HTTP is Stateless
    Client → Server: GET / HTTP/1.1
    Server → Client: <html ...>
    Client → Server: GET /syllabus HTTP/1.1
    Server → Client: <html ...>
  27. Cookies
    HTTP is Stateless
    Client → Server: GET / HTTP/1.1
    Server → Client: <html ...>
    Client → Server: GET / HTTP/1.1
    Server → Client: <html ...>
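Because HTTP is stateless, a server that wants to recognize a returning browser typically hands it an identifier in a `Set-Cookie` header and reads it back from the `Cookie` header on later requests. The sketch below uses Python's standard `http.cookies` module; the cookie name `uid` and the helper `identify` are hypothetical, chosen only for illustration.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

COOKIE_NAME = "uid"  # hypothetical cookie name


def identify(cookie_header: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (user_id, Set-Cookie value to send, or None if already known)."""
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)          # parse the request's Cookie header
    if COOKIE_NAME in jar:
        return jar[COOKIE_NAME].value, None
    user_id = uuid.uuid4().hex           # new visitor: mint a random identifier
    out = SimpleCookie()
    out[COOKIE_NAME] = user_id
    out[COOKIE_NAME]["path"] = "/"
    return user_id, out[COOKIE_NAME].OutputString()


# First visit: no Cookie header, so the server issues an identifier.
uid, set_cookie = identify(None)
print("Set-Cookie:", set_cookie)

# Later visit: the browser sends the cookie back and the same user is recognized.
same_uid, nothing_new = identify(f"{COOKIE_NAME}={uid}")
print(same_uid == uid, nothing_new)
```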
  28. Opening your Cookie Jar: chrome://settings/siteData?search=cookies

  29. Firefox: Tools | Web Developer | Storage Inspector

  30. Terminology
    [Diagram: Splitter → Control (Existing System) or Treatment (Modified Behavior) → Logging → Log A, Log B → Analyzer]
    Factor: variable controlled by experiment
    Experimental Unit: entity over which metrics are calculated; here, the user (browser client)
  31. 30

  32. 31

  33. Challenge: find the non-ad content!

  34. How Big an Experiment?

  35. How Big an Experiment?
    Strategy 1: How much can you spend?
    \( n = \dfrac{\text{Total budget} - \text{fixed costs}}{\text{cost per data point}} \)
  36. How Big an Experiment?
    Strategy 2: Needed statistical power
    Probability of rejecting false null hypothesis: if treatment causes a difference in OEC, probability it will be detected
  37. 36

  38. 37

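  39. \( n = \dfrac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\left(\dfrac{\mu_0 - \mu_1}{\sigma}\right)^2} \)
    \( \alpha = 0.05 \): \( z_{1-\alpha/2} = 1.9599\ldots \)
    \( \beta = 0.20 \): \( z_{1-\beta} = 0.84\ldots \)
    \( 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 \approx 15.7 \approx 16 \)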
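The constant that the slide rounds to 16 can be recomputed from the normal quantiles. A minimal sketch using the standard library's `statistics.NormalDist` follows; the function names are illustrative assumptions, and the defaults are the slide's α = 0.05 and β = 0.20.

```python
from statistics import NormalDist


def sample_size_constant(alpha: float = 0.05, beta: float = 0.20) -> float:
    """Numerator 2 * (z_{1-alpha/2} + z_{1-beta})**2 of the sample-size formula."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = standard_normal.inv_cdf(1 - beta)         # about 0.84 for beta = 0.20
    return 2 * (z_alpha + z_beta) ** 2


def sample_size(delta: float, alpha: float = 0.05, beta: float = 0.20) -> float:
    """n for a standardized difference delta = (mu_0 - mu_1) / sigma."""
    return sample_size_constant(alpha, beta) / delta ** 2


constant = sample_size_constant()
print(f"constant = {constant:.2f}")  # about 15.70, rounded up to 16 in the rule of thumb
```

Tightening either error rate raises the constant: for example, `sample_size_constant(beta=0.10)` is larger than the default.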
  40. Minimum Sample Size
    \( n = \dfrac{16}{\Delta^2} \)
    \( \Delta = \text{sensitivity relative to standard deviation} = \dfrac{\mu_0 - \mu_1}{\sigma} \)
    van Belle’s “Rule of Thumb”
  41. Example
    5% of visitors in experimental population purchase
    average purchase = $100, standard deviation = $40.
    OEC: revenue
    How many users do we need for experiment to detect 10% change in revenue?
    based on Kohavi et al., Controlled experiments on the web: survey and practical guide. 2008.
  42. Example
    5% of visitors in experimental population purchase
    average purchase = $100, standard deviation = $40.
    OEC: revenue
    average spending = \( 0.05 \times \$100 + 0.95 \times \$0 = \$5.00 \)
    How many users do we need for experiment to detect 10% change in revenue?
    Detect 10% change with \( \alpha = 0.05 \), \( \beta = 0.20 \):
      \( \Delta = \text{sensitivity}/\text{std dev} = 0.1 \times \$5 / \$40 = 0.0125 \)
      \( n = 16/\Delta^2 = 102{,}400 \)
    Detect 1% change with \( \alpha = 0.05 \), \( \beta = 0.20 \):
      \( n = 16/(0.01 \times \$5 / \$40)^2 = 10.24\text{M} \)
    based on Kohavi et al., Controlled experiments on the web: survey and practical guide. 2008.
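The arithmetic on this slide can be checked in a few lines. This is a sketch of the rule of thumb n = 16/Δ² applied to the slide's figures ($5.00 average spending, $40 standard deviation); the helper name `users_needed` is an assumption for illustration.

```python
def users_needed(relative_change: float, mean: float, std_dev: float) -> float:
    """Rule-of-thumb sample size n = 16 / delta**2 for a relative change in the mean."""
    delta = relative_change * mean / std_dev   # sensitivity relative to std dev
    return 16 / delta ** 2


average_spending = 0.05 * 100 + 0.95 * 0       # $5.00 per visitor

n_10_percent = users_needed(0.10, average_spending, 40)
n_1_percent = users_needed(0.01, average_spending, 40)

print(f"10% change: {n_10_percent:,.0f} users")   # 102,400
print(f" 1% change: {n_1_percent:,.0f} users")    # 10,240,000
```

Because n scales with 1/Δ², detecting a change ten times smaller takes a hundred times as many users.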
  43. Example
    5% of visitors in experimental population purchase
    average purchase = $100, standard deviation = $40.
    OEC: conversion rate (replacing revenue)
    assume conversion modeled as Bernoulli trial: standard dev = \( \sqrt{p(1-p)} = 0.22 \) for \( p = 0.05 \)
    \( \Delta = \text{sensitivity}/\text{std dev} = 0.1 \times 0.05 / 0.22 \approx 0.022 \)
    How many users do we need for experiment to detect 10% change in conversion rate?
    conversion rate OEC: \( n = 16/\Delta^2 = 33{,}057 \)
    revenue OEC: \( n = 16/\Delta^2 = 102{,}400 \)
    based on Kohavi et al., Controlled experiments on the web: survey and practical guide. 2008.
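The conversion-rate figure is sensitive to rounding. The sketch below redoes the calculation both ways under the slide's assumptions (p = 0.05, a 10% relative change, n = 16/Δ²). The variable names are illustrative. With Δ rounded to 0.022 it gives the slide's 33,057; with unrounded intermediate values it gives roughly 30,400.

```python
from math import sqrt

p = 0.05                                  # conversion rate
std_dev = sqrt(p * (1 - p))               # Bernoulli standard deviation, about 0.218
delta = 0.10 * p / std_dev                # 10% relative change, about 0.0229

n_unrounded = 16 / delta ** 2             # about 30,400
n_rounded = 16 / 0.022 ** 2               # about 33,057, using delta rounded to 0.022

print(f"std dev      = {std_dev:.4f}")
print(f"delta        = {delta:.4f}")
print(f"n (exact)    = {n_unrounded:,.0f}")
print(f"n (rounded)  = {n_rounded:,.0f}")
```

Either way, the conversion-rate OEC needs roughly a third as many users as the revenue OEC in this example, since its standard deviation is smaller relative to the effect being sought.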
  44. A/A Test
    [Diagram: Population of Users → All incoming requests → Splitter → two copies of Control (Existing System) → Logging → Log A, Log A’ → Analyzer]
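In an A/A test both arms run the existing system, so any "significant" difference the analyzer reports is a false positive. The simulation below is a minimal sketch of that idea, assuming a 5% conversion rate, 1,000 users per arm, and a two-sided two-proportion z-test at α = 0.05. All of these are illustrative assumptions rather than details from the slides.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x_a: int, n_a: int, x_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (x_a + x_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0                       # no conversions at all: nothing to detect
    z = (x_a / n_a - x_b / n_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(trials: int = 500, users_per_arm: int = 1000,
                           rate: float = 0.05, alpha: float = 0.05,
                           seed: int = 0) -> float:
    """Share of simulated A/A tests that report a 'significant' difference."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        # Both arms draw from the same distribution: there is no real effect.
        a = sum(rng.random() < rate for _ in range(users_per_arm))
        b = sum(rng.random() < rate for _ in range(users_per_arm))
        if two_proportion_p_value(a, users_per_arm, b, users_per_arm) < alpha:
            flagged += 1
    return flagged / trials


false_positive_rate = aa_false_positive_rate()
print(f"A/A tests flagged as significant: {false_positive_rate:.1%}")
```

If the splitter, logging, and analyzer are sound, the flagged share should sit near α; a rate far from it suggests a problem in the experiment pipeline.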
  45. Multi-Variable Tests (MVT)
    Test multiple factors in one experiment:
    - More efficient: test many factors at once on same population
    - Estimate interactions between factors
    F1: good; F2: no effect; F1 + F2: bad
  46. “Full Factorial” MVT
    [Diagram: All incoming requests → Splitter; K factors]
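A full factorial design gives every combination of factor levels its own group. The sketch below assumes each of the K factors has just two levels (off = 0, on = 1), which the slide does not state; under that assumption the splitter needs 2^K groups.

```python
from itertools import product
from typing import List, Tuple


def full_factorial(k: int) -> List[Tuple[int, ...]]:
    """All 2**k on/off combinations of k two-level factors."""
    return list(product([0, 1], repeat=k))


groups = full_factorial(3)
for number, levels in enumerate(groups, start=1):
    print(number, levels)

print(len(full_factorial(10)), "groups for 10 factors")
```

The exponential growth in groups is what motivates testing only a fraction of the combinations.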

  47. “Fractional Factorial” MVT
    Plackett-Burman Design: each pair of factors appears same number of times
    Group  F1  F2  F3
    1      1   1   1
    2      0   1   1
    3      1   0   1
    4      1   1   0
  48. Experiments Gone Awry

  49. 48

  50. \( n = \dfrac{16}{\Delta^2} \)
    \( \Delta = \sqrt{16/689003} = 0.0048 \)
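Inverting the rule of thumb n = 16/Δ² gives the smallest standardized effect an experiment of a given size can be expected to detect. A minimal sketch, with a hypothetical helper name and the slide's n = 689,003:

```python
from math import sqrt


def min_detectable_delta(n: int) -> float:
    """Smallest standardized effect detectable with n users, from n = 16 / delta**2."""
    return sqrt(16 / n)


delta = min_detectable_delta(689_003)
print(f"delta = {delta:.4f}")  # about 0.0048 standard deviations
```

With a sample this large, effects of under half a percent of a standard deviation become statistically detectable.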

  51. 50

  52. 51

  53. 52

  54. 53

  55. 54

  56. 55

  57. Charge
    Project 2 is due Tuesday (Feb 5)
    Next week: resource allocation, stable matching
    [Image: “B2B Marketers Should Stop A/B Testing in 2018”]