Lies, Damned Lies and Statistics @ PyCon Italia 2018

Statistics show that eating ice cream causes death by drowning.

If this sounds baffling, this talk will help you understand correlation, bias, statistical significance and other statistical concepts that are commonly (mis)used to support an argument that leads, by accident or on purpose, to the wrong conclusions.

The casual observer is exposed to the use of statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users.

The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Marco Bonzanini

April 19, 2018

Transcript

  1. Lies, Damned Lies and Statistics
     @MarcoBonzanini, PyCon Nove, Florence, Italy, April 2018
  2. In the Vatican City there are 5.88 popes per square mile
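The figure on slide 2 can be reproduced with a single division. This is a minimal sketch that assumes one pope and a land area of roughly 0.17 square miles (about 0.44 km²) for the Vatican City; neither input appears on the slide. The arithmetic is correct, yet the resulting "density" says nothing useful, which is presumably the point of the slide.

```python
# Assumed inputs (not stated on the slide): one pope, and an area of
# about 0.17 square miles (roughly 0.44 km^2) for the Vatican City.
popes = 1
area_sq_miles = 0.17

popes_per_sq_mile = popes / area_sq_miles
print(f"{popes_per_sq_mile:.2f} popes per square mile")
```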
  3. This talk is about:
     • The misuse of statistics in everyday life
     • How (not) to lie with statistics
     This talk is not about:
     • Python
     • Advanced Statistical Models
     The audience (you!):
     • Good citizens
     • An interest in statistical literacy (without an advanced Math degree?)
  4. LIES, DAMNED LIES AND CORRELATION
  5. Correlation
  6. Correlation
     • Informal: a connection between two things
     • Measure the strength of the association between two variables
  7. Linear Correlation
  8. Linear Correlation [plot panels: Positive, Negative]
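One common way to put a number on linear correlation is the Pearson coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The slides do not name a specific measure, so treating it as the one intended is an assumption. The sketch below computes it by hand on made-up data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative data, not taken from the talk.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]   # tends to go up as x goes up
falling = [6, 4, 5, 4, 2]  # tends to go down as x goes up

r_positive = pearson_r(x, rising)
r_negative = pearson_r(x, falling)
print(f"positive example: r = {r_positive:.2f}")
print(f"negative example: r = {r_negative:.2f}")
```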

  9. Correlation Example
  10. Correlation Example [plot axes: Temperature, Ice Cream Sales ($$$)]
  11. “Correlation does not imply causation”
  12. [plot axes: Deaths by drowning, Ice Cream Sales ($$$)]
  13. Lurking Variable
  14. Lurking Variable [plot axes: Temperature, Ice Cream Sales ($$$); Temperature, Deaths by drowning]
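The lurking-variable effect can be reproduced with a small simulation. In this sketch, temperature drives both ice cream sales and drownings, and neither affects the other. All coefficients and noise levels are invented for illustration. The raw correlation between sales and drownings comes out strong, but it should largely vanish once the straight-line effect of temperature is removed from both series.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented model: both quantities depend on temperature only.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + rng.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

raw_r = pearson_r(ice_cream_sales, drownings)
adjusted_r = pearson_r(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)
print(f"sales vs drownings, raw:                      r = {raw_r:.2f}")
print(f"sales vs drownings, temperature accounted for: r = {adjusted_r:.2f}")
```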
  15. More Lurking Variables
  16. More Lurking Variables [plot axes: Damage caused by fire, Firefighters deployed]
  17. More Lurking Variables [plot axes: Damage caused by fire, Firefighters deployed] Fire severity?
  18. Correlation and causation
  19. Correlation and causation
      • A causes B, or B causes A
      • A and B both cause C
      • C causes A and B
      • A causes C, and C causes B
      • No connection between A and B
  20. http://www.tylervigen.com/spurious-correlations
  21. http://www.tylervigen.com/spurious-correlations
  22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations
  23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations
  24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

  25. LIES, DAMNED LIES, SLICING AND DICING YOUR DATA
  26. Simpson’s Paradox
  27. University of California, Berkeley: Graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
  28. University of California, Berkeley: Graduate school admissions in 1973. Gender bias?
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
  29. University of California, Berkeley: Graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
  30. University of California, Berkeley: Graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
  31. University of California, Berkeley: Graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
  32. University of California, Berkeley: Graduate school admissions in 1973
      https://en.wikipedia.org/wiki/Simpson%27s_paradox
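The mechanics of Simpson's paradox can be shown with a toy table. The counts below are hypothetical and are NOT the real 1973 Berkeley figures (see the linked page for those). In this sketch women have the higher admission rate in each department, yet the lower rate overall, because most of them apply to the department that rejects most applicants.

```python
# Hypothetical counts, NOT the real 1973 Berkeley data.
# department -> group -> (applicants, admitted)
applications = {
    "Department X": {"men": (800, 480), "women": (100, 65)},
    "Department Y": {"men": (200, 40), "women": (900, 225)},
}


def rate(applied, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied


# Rates within each department.
per_department = {}
for dept, groups in applications.items():
    per_department[dept] = {g: rate(*counts) for g, counts in groups.items()}
    for group, value in per_department[dept].items():
        print(f"{dept:13} {group:6} {value:.0%}")

# Rates after pooling the departments together.
overall = {}
for group in ("men", "women"):
    applied = sum(groups[group][0] for groups in applications.values())
    admitted = sum(groups[group][1] for groups in applications.values())
    overall[group] = rate(applied, admitted)
    print(f"{'Overall':13} {group:6} {overall[group]:.0%}")
```

Slicing by department and pooling everything give opposite answers to the same question, so the choice of grouping has to be justified rather than picked for the result it produces.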
  33. LIES, DAMNED LIES AND SAMPLING BIAS
  34. Sampling
  35. Sampling
      • A selection of a subset of individuals
      • Purpose: make estimates about the whole population
      • Hello Big Data!
  36. Bias
  37. Bias
      • Prejudice? Intuition?
      • Cultural context?
      • In science: a systematic error
  38. “Dewey defeats Truman”
  39. “Dewey defeats Truman”
      https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
  40. “Dewey defeats Truman”
      https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
      • The Chicago Tribune printed the wrong headline on election night
      • The editor trusted the results of the phone survey
      • … in 1948, a sample of phone users was not representative of the general population
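A phone-only survey can miss the mark in the way sketched below. The population is entirely invented: the share of phone owners and the voting preferences are assumptions chosen for illustration, not 1948 data. A random sample of 1,000 voters should land close to the true level of support, while an equally large sample of phone owners stays systematically off however many people are asked.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Invented population: 30% own a phone; phone owners favour candidate A
# (60%), everyone else favours candidate A less often (40%).
population = []
for _ in range(100_000):
    has_phone = rng.random() < 0.30
    votes_a = rng.random() < (0.60 if has_phone else 0.40)
    population.append((has_phone, votes_a))


def support_for_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(votes_a for _, votes_a in voters) / len(voters)


true_support = support_for_a(population)

# Unbiased: every voter is equally likely to be asked.
random_estimate = support_for_a(rng.sample(population, 1000))

# Biased: only phone owners can be reached.
phone_owners = [voter for voter in population if voter[0]]
phone_estimate = support_for_a(rng.sample(phone_owners, 1000))

print(f"true support for A:    {true_support:.1%}")
print(f"random sample says:    {random_estimate:.1%}")
print(f"phone-only sample says: {phone_estimate:.1%}")
```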
  41. Survivorship Bias
  42. Survivorship Bias
      • Bill Gates, Steve Jobs, Mark Zuckerberg are all college drop-outs
      • … should you quit studying?
  43. LIES, DAMNED LIES AND DATAVIZ
  44. “A picture is worth a thousand words”
  45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet
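Anscombe's quartet is four small data sets whose summary statistics are nearly identical even though their scatter plots look nothing alike. The sketch below checks the "nearly identical" half of that claim. The values are the commonly published ones, reproduced here as an assumption; verify them against the linked page before relying on them.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# Commonly published values of the quartet (assumed; check the linked page).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of x, mean of y and correlation for each data set.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), mean(ys), pearson_r(xs, ys))
    mx, my, r = summary[name]
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

Plotting each pair (for example with a scatter plot) is what reveals the differences that these numbers hide.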

  46. https://venngage.com/blog/misleading-graphs/
  47. https://venngage.com/blog/misleading-graphs/
  48. https://venngage.com/blog/misleading-graphs/
  49. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T
  50. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T
  51. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T
  52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html
  53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections
  54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections
  55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections
  56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  57. LIES, DAMNED LIES AND SIGNIFICANCE
  58. Significant = Important?
  59. Statistically Significant Results
  60. Statistically Significant Results
      • We are quite sure they are reliable (not by chance)
      • Maybe they’re not “big”
      • Maybe they’re not important
      • Maybe they’re not useful for decision making
  61. p-values
  62. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values
  63. p-values
      • Probability of observing our results (or more extreme) when the null hypothesis is true
      • Probability, not certainty
      • Often p < 0.05 (arbitrary)
      • Can we afford to be fooled by randomness every 1 time out of 20?
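The definition on slide 63 can be made concrete with a coin. The scenario is hypothetical and not from the talk: a coin lands heads 60 times in 100 tosses, and the null hypothesis is that the coin is fair. The sketch computes the one-sided p-value exactly, that is, the probability of seeing 60 or more heads from a fair coin.

```python
from math import comb


def p_value_at_least(heads, tosses):
    """Probability of at least `heads` heads in `tosses` tosses of a fair coin.

    This is a one-sided p-value under the null hypothesis "the coin is fair".
    """
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2 ** tosses


# Hypothetical observation: 60 heads out of 100 tosses.
p = p_value_at_least(60, 100)
print(f"p-value: {p:.4f}")
print("below the conventional 0.05 threshold" if p < 0.05 else "not below 0.05")
```

A small p-value here means only that such a result would be unusual for a fair coin. It is not the probability that the coin is fair, and it says nothing about how large or important the bias is.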
  64. Data dredging
  65. [no transcript text]
  66. Data dredging
      • a.k.a. Data fishing or p-hacking
      • Convention: formulate hypothesis, collect data, prove/disprove hypothesis
      • Data dredging: look for patterns until something statistically significant comes up
      • Looking for patterns is ok; testing the hypothesis on the same data set is not
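The cost of dredging can be estimated with a short calculation and a simulation. The sketch assumes 20 unrelated hypotheses tested on pure noise at the 0.05 threshold; the number 20 is an arbitrary choice for illustration. It relies on the fact that, when the null hypothesis is true, the p-value of a continuous test is uniformly distributed between 0 and 1, so each test is simulated as one uniform random draw.

```python
import random

ALPHA = 0.05        # conventional significance threshold
N_TESTS = 20        # hypothetical number of hypotheses tried on one data set
N_STUDIES = 10_000  # simulated "studies", each running all N_TESTS tests

# Analytic: chance that at least one of N_TESTS independent tests on pure
# noise falls below ALPHA.
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

# Simulation: every null hypothesis is true, so each p-value is drawn
# uniformly from [0, 1).
rng = random.Random(1)
studies_with_a_hit = sum(
    any(rng.random() < ALPHA for _ in range(N_TESTS))
    for _ in range(N_STUDIES)
)
simulated = studies_with_a_hit / N_STUDIES

print(f"analytic:  {p_at_least_one:.1%}")
print(f"simulated: {simulated:.1%}")
```

With enough attempts, a "significant" pattern in noise is the expected outcome rather than a surprise, which is why a pattern found by searching needs to be tested on fresh data.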
  67. SUMMARY
  68. “Everybody lies” (Dr. House)
  69. • Good Science ™ vs. Big headlines
      • Nobody is immune
      • Ask questions: What is the context? Who’s paying? What’s missing?
      • … “so what?”
  70. THANK YOU
      @MarcoBonzanini
      speakerdeck.com/marcobonzanini
      GitHub.com/bonzanini
      marcobonzanini.com