Lies, Damned Lies, and Statistics @ EuroPython 2018

Talk presented at EuroPython 2018 in Edinburgh:
https://ep2018.europython.eu/conference/talks/lies-damned-lies-and-statistics

Abstract:
Statistics show that eating ice cream causes death by drowning.

If this sounds baffling, this talk will help you understand correlation, bias, statistical significance and other statistical concepts that are commonly (mis)used to support arguments that lead, by accident or on purpose, to the wrong conclusions.

The casual observer is exposed to statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users.

The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Agenda:
- Correlation and causation
- Simpson’s Paradox
- Sampling bias
- Data visualisation gone wild
- Statistical significance (and data dredging, a.k.a. p-hacking)

Marco Bonzanini

July 27, 2018

Transcript

  1. Lies, Damned Lies and Statistics
     @MarcoBonzanini
     EuroPython 2018, Edinburgh, UK
     July 2018
  2. In the Vatican City there are 5.88 popes per square mile
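The figure on this slide is plain arithmetic: one pope divided by a very small area. A minimal sketch of that calculation, assuming an area of roughly 0.17 square miles for Vatican City (the slide does not state the area, and the function name is made up for this example):

```python
# Assumption: Vatican City covers roughly 0.17 square miles (not stated on the slide).
VATICAN_AREA_SQ_MILES = 0.17
POPES = 1


def density_per_sq_mile(count: int, area_sq_miles: float) -> float:
    """Return how many items there are per square mile of area."""
    return count / area_sq_miles


popes_per_sq_mile = density_per_sq_mile(POPES, VATICAN_AREA_SQ_MILES)
print(f"{popes_per_sq_mile:.2f} popes per square mile")
```

The number is arithmetically correct and still tells us nothing useful: it is a rate built from a single observation and a tiny denominator.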
  3. This talk is about:
     • The misuse of statistics in everyday life
     • How (not) to lie with statistics
     This talk is not about:
     • Python
     • Advanced Statistical Models
     The audience (you!):
     • Good citizens
     • An interest in statistical literacy (without an advanced Math degree?)
  4. LIES, DAMNED LIES AND CORRELATION

  5. Correlation

  6. Correlation
     • Informal: a connection between two things
     • Measure the strength of the association between two variables

  7. Linear Correlation

  8. Linear Correlation
     [Figure: two scatter plots of y against x, labelled Positive and Negative]
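The slides do not name a specific measure of linear correlation. A common choice is Pearson's r, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below assumes that measure and uses made-up data; `pearson_r` is a name chosen for this example:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises as x rises: positive correlation
y_down = [10, 8, 6, 4, 2]  # falls as x rises: negative correlation

print(pearson_r(x, y_up))
print(pearson_r(x, y_down))
```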

  9. Correlation Example

  10. Correlation Example
      [Figure: plot with axes Temperature and Ice Cream Sales ($$$)]

  11. “Correlation does not imply causation”

  12. [Figure: plot with axes Deaths by drowning and Ice Cream Sales ($$$)]

  13. Lurking Variable

  14. Lurking Variable
      [Figure: two panels, Temperature with Ice Cream Sales ($$$) and Temperature with Deaths by drowning]
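The ice cream example can be reproduced with a small simulation. This is a sketch on synthetic data: the coefficients, noise levels and seed are arbitrary assumptions, and ice cream sales never influence drownings in the simulated world. Both depend only on temperature.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic world (all numbers are assumptions): temperature drives both variables.
temperature = [rng.uniform(10, 35) for _ in range(1000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

# Raw correlation between two variables that do not affect each other.
r_raw = pearson_r(ice_cream_sales, drownings)

# Correlation after removing the part of each variable explained by temperature.
r_controlled = pearson_r(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)

print(f"raw: {r_raw:.2f}, controlling for temperature: {r_controlled:.2f}")
```

The raw correlation is strong, but it largely disappears once temperature is accounted for, which is the signature of a lurking variable.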
  15. More Lurking Variables

  16. More Lurking Variables
      [Figure: Damage caused by fire and Firefighters deployed]

  17. More Lurking Variables
      [Figure: Damage caused by fire and Firefighters deployed]
      Fire severity?

  18. Correlation and causation

  19. Correlation and causation
      • A causes B, or B causes A
      • A and B both cause C
      • C causes A and B
      • A causes C, and C causes B
      • No connection between A and B

  20. http://www.tylervigen.com/spurious-correlations

  21. http://www.tylervigen.com/spurious-correlations

  22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

  25. LIES, DAMNED LIES, SLICING AND DICING YOUR DATA

  26. Simpson’s Paradox

  27. University of California, Berkeley: graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox

  28. University of California, Berkeley: graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
      Gender bias?

  29. University of California, Berkeley: graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox

  30. University of California, Berkeley: graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox

  31. University of California, Berkeley: graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox

  32. University of California, Berkeley: graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
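Simpson's paradox is easier to see with numbers. The sketch below uses invented figures (they are NOT the Berkeley data, and the department names are hypothetical) in which women have the higher admission rate in every department, yet the lower rate overall, because most of them apply to the more competitive department.

```python
# Hypothetical admissions data, invented for illustration (NOT the Berkeley figures).
# department -> group -> (applicants, admitted)
applications = {
    "Dept X (less competitive)": {"men": (80, 48), "women": (20, 14)},
    "Dept Y (more competitive)": {"men": (20, 4), "women": (80, 24)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


def per_department_rates(data):
    """Admission rate of each group within each department."""
    return {
        dept: {group: rate(*counts) for group, counts in groups.items()}
        for dept, groups in data.items()
    }


def overall_rates(data):
    """Admission rate of each group after pooling all departments."""
    totals = {}
    for groups in data.values():
        for group, (applicants, admitted) in groups.items():
            seen_applicants, seen_admitted = totals.get(group, (0, 0))
            totals[group] = (seen_applicants + applicants, seen_admitted + admitted)
    return {group: rate(*counts) for group, counts in totals.items()}


by_department = per_department_rates(applications)
overall = overall_rates(applications)

for dept, rates in by_department.items():
    print(dept, rates)
print("Overall", overall)
```

Slicing by department and pooling everything give opposite answers to the question "who is admitted more often?", so the choice of aggregation level is itself part of the argument.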
  33. LIES, DAMNED LIES AND SAMPLING BIAS


  34. Sampling

  35. Sampling
      • A selection of a subset of individuals
      • Purpose: make estimates about the whole population
      • Hello Big Data!

  36. Bias

  37. Bias
      • Prejudice? Intuition?
      • Cultural context?
      • In science: a systematic error

  38. “Dewey defeats Truman”

  39. “Dewey defeats Truman”
      https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

  40. “Dewey defeats Truman”
      https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
      • The Chicago Tribune printed the wrong headline on election night
      • The editor trusted the results of the phone survey
      • … in 1948, a sample of phone users was not representative of the general population
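The effect of surveying only phone users can be sketched with a toy electorate. Every number below is an assumption made up for illustration (it is not 1948 polling data): phone owners are a minority and lean towards one candidate more than everyone else does.

```python
import random

rng = random.Random(1948)  # fixed seed so the run is repeatable

# Hypothetical electorate of 10,000 voters: (owns_phone, votes_dewey).
# Assumed split: 60% of phone owners vote Dewey, 40% of non-owners do.
population = (
    [(True, True)] * 1800
    + [(True, False)] * 1200
    + [(False, True)] * 2800
    + [(False, False)] * 4200
)


def dewey_share(voters):
    """Fraction of the given voters who vote Dewey."""
    return sum(1 for _, votes_dewey in voters if votes_dewey) / len(voters)


true_share = dewey_share(population)

# Biased survey: only people with a phone can be reached.
phone_owners = [voter for voter in population if voter[0]]
phone_estimate = dewey_share(rng.sample(phone_owners, 500))

# Unbiased survey of the same size: every voter is equally likely to be picked.
random_estimate = dewey_share(rng.sample(population, 500))

print(f"true share:      {true_share:.2f}")
print(f"phone survey:    {phone_estimate:.2f}")
print(f"random sample:   {random_estimate:.2f}")
```

A larger phone survey would not help: the error is systematic, so more data from the same biased source only makes the wrong answer look more precise.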
  41. Survivorship Bias

  42. Survivorship Bias
      • Bill Gates, Steve Jobs, Mark Zuckerberg are all college drop-outs
      • … should you quit studying?

  43. LIES, DAMNED LIES AND DATAVIZ

  44. “A picture is worth a thousand words”

  45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet

  46. https://venngage.com/blog/misleading-graphs/

  47. https://venngage.com/blog/misleading-graphs/

  48. https://venngage.com/blog/misleading-graphs/

  49. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  50. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  51. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html

  53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  57. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  58. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  59. LIES, DAMNED LIES AND SIGNIFICANCE

  60. Significant = Important?

  61. Statistically Significant Results

  62. Statistically Significant Results
      • We are quite sure they are reliable (not by chance)
      • Maybe they’re not “big”
      • Maybe they’re not important
      • Maybe they’re not useful for decision making

  63. p-values

  64. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

  65. p-values
      • Probability of observing our results (or more extreme) when the null hypothesis is true
      • Probability, not certainty
      • Often p < 0.05 (arbitrary)
      • Can we afford to be fooled by randomness 1 time out of every 20?
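The "1 time out of 20" point can be checked by simulation. In the sketch below the null hypothesis is true by construction (a fair coin), and we count how often an exact two-sided binomial test still reports p < 0.05. The choice of test, the number of flips, the number of experiments and the seed are all assumptions made for this example.

```python
import random
from math import comb


def binomial_p_value(heads, flips):
    """Exact two-sided p-value for `heads` out of `flips`, assuming a fair coin.

    Sums the probability of every outcome that is no more likely than the
    observed one, i.e. results at least as extreme under the null hypothesis.
    """
    probs = [comb(flips, k) / 2**flips for k in range(flips + 1)]
    observed = probs[heads]
    # Small tolerance so that ties are not lost to floating-point rounding.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


rng = random.Random(0)  # fixed seed so the run is repeatable
FLIPS, EXPERIMENTS, ALPHA = 100, 2000, 0.05

false_positives = 0
for _ in range(EXPERIMENTS):
    # The coin really is fair, so any "significant" result is a false alarm.
    heads = sum(rng.random() < 0.5 for _ in range(FLIPS))
    if binomial_p_value(heads, FLIPS) < ALPHA:
        false_positives += 1

false_positive_rate = false_positives / EXPERIMENTS
print(f"'significant' results with nothing to find: {false_positive_rate:.1%}")
```

With a discrete test like this one the rate sits somewhat below 5%, but it never reaches zero: a threshold of 0.05 accepts being fooled by chance at roughly that frequency.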
  66. Data dredging

  68. Data dredging
      • a.k.a. Data fishing or p-hacking
      • Convention: formulate hypothesis, collect data, prove/disprove hypothesis
      • Data dredging: look for patterns until something statistically significant comes up
      • Looking for patterns is OK; testing the hypothesis on the same data set is not
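Why dredging "works" so reliably follows from simple probability. If each test has a 5% chance of a false positive, the chance that at least one of k tests comes up significant by luck alone is 1 - 0.95^k. The sketch below assumes the tests are independent, which real analyses of one data set rarely are; the function name is made up for this example.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests is
    'significant' at level `alpha` when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, trying 20 unrelated hypotheses on the same data gives better-than-even odds of finding something "significant" where nothing exists.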
  69. SUMMARY

  70. “Everybody lies” (Dr. House)

  71. • Good Science™ vs. big headlines
      • Nobody is immune
      • Ask questions: What is the context? Who’s paying? What’s missing?
      • … “so what?”
  72. THANK YOU @MarcoBonzanini speakerdeck.com/marcobonzanini GitHub.com/bonzanini marcobonzanini.com
