Lies, Damned Lies, and Statistics @ PyCon UK 2019

Marco Bonzanini
September 14, 2019

Statistics show that eating ice cream causes death by drowning. If this sounds baffling, this talk will help you to understand correlation, bias, statistical significance and other statistical techniques that are commonly (mis)used to support an argument that leads, by accident or on purpose, to drawing the wrong conclusions.

The casual observer is exposed to statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users. The purpose of this talk is to help the audience recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Agenda:
- Correlation and causation
- Simpson’s Paradox
- Sampling bias and polluted surveys
- Data visualisation gone wild
- Statistical significance (and Data dredging a.k.a. p-hacking)

Transcript

  1. Lies, Damned Lies and Statistics @MarcoBonzanini PyCon UK 2019

  2. In the Vatican City there are 5.88 popes per square mile

  3. This talk is about: the misuse of stats in everyday life. This talk is NOT about: Python. The audience (you!): good citizens, with an interest in statistical literacy (without an advanced Math degree?)

  4. LIES, DAMNED LIES AND CORRELATION

  5. Correlation

  6. Correlation • Informal: a connection between two things • Measure: the strength of the association between two variables

  7. Linear Correlation

  8. Linear Correlation [two plots of y against x: Positive, Negative]
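To make "positive" and "negative" linear correlation concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the numbers are made up for illustration and do not come from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers, for illustration only
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y grows as x grows
falling = [10, 8, 6, 4, 2]   # y shrinks as x grows

print(pearson(xs, rising))   # perfectly linear and increasing: +1
print(pearson(xs, falling))  # perfectly linear and decreasing: -1
```

Values near +1 or -1 indicate a strong linear association, values near 0 a weak one. The coefficient says nothing about which variable, if any, causes the other.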

  9. Correlation Example

  10. Correlation Example [plot: Temperature and Ice Cream Sales ($$$)]

  11. “Correlation does not imply causation”

  12. [plot: Ice Cream Sales ($$$) and Deaths by drowning]

  13. Lurking Variable

  14. Lurking Variable [plots: Temperature and Ice Cream Sales ($$$); Temperature and Deaths by drowning]

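The lurking-variable idea can be sketched with a small simulation. Everything here is an assumption made for illustration: the coefficients, the noise levels and the helper names `pearson` and `residuals` are invented, and the data is synthetic. Temperature drives both ice cream sales and drownings; the two never influence each other.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(42)

# Synthetic data: temperature influences both quantities (invented coefficients)
temperature = [rng.uniform(5, 35) for _ in range(1000)]
ice_cream = [20 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

# Looks like a strong link between ice cream and drowning...
raw = pearson(ice_cream, drownings)

# ...which disappears once the effect of temperature is removed from both
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(drownings, temperature))

print(f"raw correlation:      {raw:.2f}")
print(f"after removing temp.: {adjusted:.2f}")
```

The raw correlation comes out strongly positive, while the correlation between the residuals stays close to zero. In real data the lurking variable is rarely known in advance, which is what makes it lurk.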
  15. More Lurking Variables

  16. More Lurking Variables [plot: Firefighters deployed and Damage caused by fire]

  17. More Lurking Variables [plot: Firefighters deployed and Damage caused by fire; Fire severity?]

  18. Correlation and causation

  19. Correlation and causation [diagrams of possible relationships between A, B and C]

  20. http://www.tylervigen.com/spurious-correlations

  21. http://www.tylervigen.com/spurious-correlations

  22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

  25. LIES, DAMNED LIES, SLICING AND DICING YOUR DATA

  26. Simpson’s Paradox

  27. University of California, Berkeley Graduate school admissions in 1973 https://en.wikipedia.org/wiki/Simpson%27s_paradox

  28. University of California, Berkeley Graduate school admissions in 1973: Gender bias? https://en.wikipedia.org/wiki/Simpson%27s_paradox

  29. University of California, Berkeley Graduate school admissions in 1973 https://en.wikipedia.org/wiki/Simpson%27s_paradox

  30. University of California, Berkeley Graduate school admissions in 1973 https://en.wikipedia.org/wiki/Simpson%27s_paradox

  31. University of California, Berkeley Graduate school admissions in 1973 https://en.wikipedia.org/wiki/Simpson%27s_paradox

  32. University of California, Berkeley Graduate school admissions in 1973 https://en.wikipedia.org/wiki/Simpson%27s_paradox
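The mechanics of Simpson's paradox can be reproduced with a few lines of arithmetic. The counts below are invented for illustration and are NOT the Berkeley figures; the department and group labels are hypothetical too. Group B has the higher admission rate in every department, yet the lower rate overall, because most of group B applies to the department that rejects most applicants.

```python
# (applicants, admitted) per department and group; invented numbers
applications = {
    "easy department": {"group A": (800, 480), "group B": (100, 70)},
    "hard department": {"group A": (200, 40), "group B": (900, 270)},
}

def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants

# Admission rate of each group within each department
per_department = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in applications.items()
}

# Admission rate of each group with the departments pooled together
overall = {}
for group in ("group A", "group B"):
    applicants = sum(applications[d][group][0] for d in applications)
    admitted = sum(applications[d][group][1] for d in applications)
    overall[group] = rate(applicants, admitted)

for dept, rates in per_department.items():
    print(dept, {g: f"{r:.0%}" for g, r in rates.items()})
print("overall", {g: f"{r:.0%}" for g, r in overall.items()})
```

Within each department group B is ahead (70% vs 60%, and 30% vs 20%), while the pooled figures put group A ahead (52% vs 34%). Which view is the relevant one depends on the question being asked.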

  33. LIES, DAMNED LIES AND SAMPLING BIAS

  34. Sampling

  35. Sampling • A selection of a subset of individuals • Purpose: estimate about the whole population • Hello Big Data!

  36. Bias

  37. Bias • Prejudice? Intuition? • Cultural context? • In science: a systematic error

  38. “Dewey defeats Truman”

  39. https://en.wikipedia.org/wiki/Dewey_Defeats_Truman “Dewey defeats Truman”

  40. “Dewey defeats Truman” • The Chicago Tribune printed the wrong headline on election night • The editor trusted the results of the phone survey • … in 1948, a sample of phone users was not representative of the general population https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

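A phone-only survey can be sketched as a simulation. All the numbers below are hypothetical and chosen only to make the effect visible; they are not historical figures, and "candidate A" is a placeholder. The point is that a large sample does not help if the people who can be reached differ systematically from the rest.

```python
import random

rng = random.Random(1948)

# Invented population: phone owners lean towards candidate A more than others do
PHONE_OWNERSHIP = 0.3        # hypothetical share of people who own a phone
SUPPORT_WITH_PHONE = 0.6     # hypothetical support for A among phone owners
SUPPORT_WITHOUT_PHONE = 0.4  # hypothetical support for A among everyone else

population = []
for _ in range(50_000):
    has_phone = rng.random() < PHONE_OWNERSHIP
    support = SUPPORT_WITH_PHONE if has_phone else SUPPORT_WITHOUT_PHONE
    votes_a = rng.random() < support
    population.append((has_phone, votes_a))

def share_for_a(people):
    """Fraction of the given people who vote for candidate A."""
    return sum(votes_a for _, votes_a in people) / len(people)

true_share = share_for_a(population)

# Biased sample: only people with a phone can be reached
phone_owners = [person for person in population if person[0]]
phone_survey = share_for_a(rng.sample(phone_owners, 2000))

# Unbiased sample: everyone has the same chance of being asked
random_survey = share_for_a(rng.sample(population, 2000))

print(f"true share:    {true_share:.1%}")
print(f"phone survey:  {phone_survey:.1%}")
print(f"random survey: {random_survey:.1%}")
```

Under these assumptions the phone survey predicts a win for candidate A, while the true share and the random survey both sit below 50%.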
  41. Survivorship Bias

  42. Survivorship Bias • Bill Gates, Steve Jobs, Mark Zuckerberg are all college drop-outs • … should you quit studying?

  43. LIES, DAMNED LIES AND DATAVIZ

  44. “A picture is worth a thousand words”

  45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet

  46. https://venngage.com/blog/misleading-graphs/

  47. https://venngage.com/blog/misleading-graphs/

  48. https://venngage.com/blog/misleading-graphs/

  49. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  50. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  51. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html

  53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  57. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  58. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  59. LIES, DAMNED LIES AND SIGNIFICANCE

  60. Significant = Important ?

  61. Statistically Significant Results

  62. Statistically Significant Results • We are quite sure they are reliable (not by chance) • Maybe they’re not “big” • Maybe they’re not important • Maybe they’re not useful for decision making

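The gap between "significant" and "big" can be illustrated with a two-sample z-test. This is a sketch under simplifying assumptions: two groups of equal size, a known common standard deviation, and made-up numbers. The function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(mean_difference, std_dev, n_per_group):
    """Two-sided p-value of a z-test comparing the means of two groups.

    Assumes equal group sizes and a known, common standard deviation.
    """
    standard_error = std_dev * math.sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up scenario: the groups differ by 0.1 on a scale with standard deviation 15
difference, std_dev = 0.1, 15
effect_size = difference / std_dev   # difference in units of standard deviation

p_small_sample = two_sided_p_value(difference, std_dev, n_per_group=500)
p_huge_sample = two_sided_p_value(difference, std_dev, n_per_group=500_000)

print(f"effect size:         {effect_size:.4f}")
print(f"p with 500 / group:  {p_small_sample:.3f}")
print(f"p with 500k / group: {p_huge_sample:.5f}")
```

The same negligible difference is nowhere near significant with 500 observations per group and comfortably below 0.05 with 500,000. The p-value reflects how much data there is as well as how large the effect is, so it cannot say whether the effect matters.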
  63. p-values

  64. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

  65. p-values • Probability of observing our results (or more extreme) when the null hypothesis is true • Probability, not certainty • Often p < 0.05 (arbitrary) • Can we afford to be fooled by randomness every 1 time out of 20?

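The "1 time out of 20" can be checked by simulation. In this sketch every experiment is pure noise, so the null hypothesis is true by construction, and any result below the 0.05 threshold is a false alarm. The test used is a simple z-test with known standard deviation; the sample size and number of experiments are arbitrary choices.

```python
import math
import random

def p_value_for_mean(sample, sigma=1.0):
    """Two-sided z-test p-value for the null hypothesis 'the true mean is 0'.

    Assumes the standard deviation sigma of the data is known.
    """
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(2019)
experiments = 4000
false_alarms = 0

for _ in range(experiments):
    # The null hypothesis holds: 30 draws of pure noise centred on 0
    sample = [rng.gauss(0, 1) for _ in range(30)]
    if p_value_for_mean(sample) < 0.05:
        false_alarms += 1

false_alarm_rate = false_alarms / experiments
print(f"false alarm rate: {false_alarm_rate:.3f}")
```

The rate lands close to 0.05: with this threshold, roughly one experiment in twenty reports an effect that is not there.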
  66. Data dredging

  67. None
  68. Data dredging • a.k.a. Data fishing or p-hacking • Convention: formulate hypothesis, collect data, prove/disprove hypothesis • Data dredging: look for patterns until something statistically significant comes up • Looking for patterns is ok; testing the hypothesis on the same data set is not

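Why dredging always finds something can be sketched in two steps: the chance of at least one false positive across many independent tests, and a simulation in which 200 noise "features" are tested against a noise outcome. The helper names are hypothetical, the data is synthetic, and the p-value for a correlation uses the Fisher z-transform, which is only an approximation.

```python
import math
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def approx_p_value(r, n):
    """Approximate two-sided p-value for a correlation r over n points."""
    z = math.atanh(r) * math.sqrt(n - 3)   # Fisher z-transform
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(7)
N_POINTS, N_FEATURES = 50, 200

def noise():
    return [rng.gauss(0, 1) for _ in range(N_POINTS)]

def dredge(feature_ids):
    """Test each feature against the outcome on a newly drawn, pure-noise data set."""
    outcome = noise()
    return [i for i in feature_ids
            if approx_p_value(pearson(noise(), outcome), N_POINTS) < 0.05]

print(f"20 tests: {chance_of_false_positive(20):.0%} chance of a false positive")

found = dredge(range(N_FEATURES))   # exploration: some features look "significant"
confirmed = dredge(found)           # the same hypotheses, tested on fresh data

print(f"'significant' on exploration data: {len(found)}")
print(f"still significant on fresh data:   {len(confirmed)}")
```

With 20 tests the chance of at least one false positive is already about 64%. In the simulation a handful of noise features pass the threshold on the exploration data, and most of them fail when the same hypotheses are tested on a fresh data set.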
  69. LIES, DAMNED LIES AND CELEBRITIES ON TWITTER

  70. https://twitter.com/billgates/status/1118196606975787008

  71. P(mosquito|death) ≠ P(death|mosquito)
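The two conditional probabilities can be computed side by side from counts. The counts below are invented for a hypothetical population and are not real mortality figures. Both probabilities share the same numerator but divide by very different denominators, and Bayes' theorem shows how one converts into the other.

```python
# Invented counts for a hypothetical population; not real figures
population = 1_000_000
deaths = 10_000            # people who died, whatever the cause
bitten = 600_000           # people bitten by a mosquito
bitten_and_died = 3_000    # people who belong to both groups

# Same numerator, different denominators
p_mosquito_given_death = bitten_and_died / deaths   # among the dead, how many were bitten
p_death_given_mosquito = bitten_and_died / bitten   # among the bitten, how many died

# Bayes' theorem: P(death|mosquito) = P(mosquito|death) * P(death) / P(mosquito)
p_death = deaths / population
p_mosquito = bitten / population
via_bayes = p_mosquito_given_death * p_death / p_mosquito

print(f"P(mosquito|death) = {p_mosquito_given_death:.3f}")
print(f"P(death|mosquito) = {p_death_given_mosquito:.3f}")
print(f"via Bayes         = {via_bayes:.3f}")
```

With these made-up counts, P(mosquito|death) is 0.3 while P(death|mosquito) is 0.005. Quoting one as if it were the other overstates the risk by a factor of 60.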

  72. SUMMARY

  73. “Everybody lies” (Dr. House)

  74. • Good Science ™ vs. Big headlines • Nobody is immune • Ask questions: What is the context? Who’s paying? What’s missing? • … “so what?”

  75. THANK YOU @MarcoBonzanini @PyDataLondon