Marco Bonzanini
June 16, 2019

Lies, Damned Lies and Statistics @ PyLondinium 2019

Slides for my presentation on "Lies, Damned Lies and Statistics" at PyLondinium 2019
https://pylondinium.org/talks/talk-23.html

Abstract
Statistics show that eating ice cream causes death by drowning. If this sounds baffling, this talk will help you to understand correlation, bias, statistical significance and other statistical techniques that are commonly (mis)used to support an argument that leads, by accident or on purpose, to drawing the wrong conclusions.

The casual observer is exposed to the use of statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users. The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Agenda:
- Correlation and causation
- Sampling bias and polluted surveys
- Data visualisation gone wild
- Statistical significance (and data dredging, a.k.a. p-hacking)

Transcript

1. Lies, Damned Lies
and Statistics
@MarcoBonzanini
PyLondinium 2019

2. In the Vatican City
there are 5.88 popes
per square mile
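
The figure on this slide is plain arithmetic: one pope divided by a very small area. A minimal sketch, assuming an area of roughly 0.17 square miles (about 0.44 km^2) for Vatican City; the constant and function names are illustrative and not from the slides.

```python
# Assumption: Vatican City covers roughly 0.17 square miles (about 0.44 km^2)
# and has exactly one pope.
POPES = 1
AREA_SQ_MILES = 0.17


def density(count: float, area: float) -> float:
    """Return the number of items per unit of area."""
    return count / area


popes_per_sq_mile = density(POPES, AREA_SQ_MILES)
print(f"{popes_per_sq_mile:.2f} popes per square mile")
```

The number is correct and still misleading: a rate computed over a tiny denominator says little about the thing being counted.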

3. This talk is about: the misuse of stats in everyday life
This talk is NOT about: Python
The audience (you!): good citizens, with an interest in
statistical literacy (without an advanced Math degree?)

4. LIES, DAMNED LIES
AND CORRELATION

5. Correlation

6. Correlation
• Informal: a connection between two things
• Measure the strength of the association
between two variables
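
One common way to measure the strength of a linear association is the Pearson correlation coefficient, which ranges from -1 to +1. The slides do not say which measure is meant, so this is a sketch under that assumption; the temperature and sales figures are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up data: daily temperature (Celsius) and ice cream sales.
temperature = [14, 16, 19, 22, 25, 28, 31]
ice_cream_sales = [120, 150, 210, 260, 330, 400, 460]

r_positive = pearson_r(temperature, ice_cream_sales)
# Flipping the sign of one variable flips the sign of the correlation.
r_negative = pearson_r(temperature, [-s for s in ice_cream_sales])
print(f"positive: {r_positive:.3f}, negative: {r_negative:.3f}")
```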

7. Linear Correlation

8. Linear Correlation
[Figure: two x-y scatter plots, labelled Positive and Negative]

9. Correlation Example

10. Correlation Example
[Figure: plot with axes Temperature and Ice Cream Sales ($$$)]

11. “Correlation
does not imply
causation”

12. [Figure: plot with axes Deaths by drowning and Ice Cream Sales ($$$)]

13. Lurking Variable

14. Lurking Variable
[Figure: two plots, one with axes Temperature and Ice Cream Sales ($$$), one with axes Temperature and Deaths by drowning]
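
The lurking-variable argument can be checked with a small simulation. In this sketch temperature drives both ice cream sales and drownings, and the two never influence each other. All coefficients and noise levels are invented for illustration; removing the linear effect of temperature (correlating the residuals) is one simple way to control for it, not necessarily the approach used in the talk.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after subtracting its least-squares line on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable
temperature = [rng.uniform(5, 35) for _ in range(1000)]
# Hypothetical model: both quantities depend on temperature plus noise only.
ice_cream = [10 * t + rng.gauss(0, 40) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

r_raw = pearson_r(ice_cream, drownings)
r_controlled = pearson_r(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)
print(f"raw correlation: {r_raw:.2f}")
print(f"after removing temperature: {r_controlled:.2f}")
```

The raw correlation is clearly positive, while the correlation of the residuals stays close to zero: the association comes entirely from the shared cause.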

15. More Lurking Variables

16. More Lurking Variables
[Figure: plot with axes Damage caused by fire and Firefighters deployed]

17. More Lurking Variables
[Figure: plot with axes Damage caused by fire and Firefighters deployed]
Fire severity?

18. Correlation and causation

19. Correlation and causation
• A causes B, or B causes A
• A and B both cause C
• C causes A and B
• A causes C, and C causes B
• No connection between A and B

20. http://www.tylervigen.com/spurious-correlations

21. http://www.tylervigen.com/spurious-correlations

22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

25. LIES, DAMNED LIES,
SLICING AND DICING

27. University of California, Berkeley

28. University of California, Berkeley
Gender bias?

29. University of California, Berkeley

30. University of California, Berkeley

31. University of California, Berkeley

32. University of California, Berkeley
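
The Berkeley admissions figures are widely cited as an example of Simpson's paradox, where a trend seen in every subgroup reverses once the subgroups are pooled. The slide images did not survive in this transcript, so the sketch below uses made-up numbers (not the Berkeley data) and two hypothetical departments to show how slicing can flip a conclusion.

```python
# Hypothetical counts per department: group -> (applied, admitted).
# These are NOT the Berkeley figures; they only illustrate the effect.
applications = {
    "Dept A": {"men": (800, 500), "women": (100, 70)},
    "Dept B": {"men": (200, 40), "women": (900, 225)},
}


def rate(applied: int, admitted: int) -> float:
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied


# Admission rate of each group inside each department.
per_department = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in applications.items()
}

# Admission rate of each group with all departments pooled together.
overall = {}
for group in ("men", "women"):
    applied = sum(groups[group][0] for groups in applications.values())
    admitted = sum(groups[group][1] for groups in applications.values())
    overall[group] = rate(applied, admitted)

for dept, rates in per_department.items():
    print(dept, {g: f"{r:.1%}" for g, r in rates.items()})
print("Overall", {g: f"{r:.1%}" for g, r in overall.items()})
```

In this toy data women have the higher admission rate in both departments, yet the lower rate overall, because most of them applied to the department that rejects most applicants.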

33. LIES, DAMNED LIES
AND SAMPLING BIAS

34. Sampling

35. Sampling
• A selection of a subset of individuals
• Purpose: estimate something about the whole population
• Hello Big Data!

36. Bias

37. Bias
• Prejudice? Intuition?
• Cultural context?
• In science: a systematic error

38. “Dewey defeats Truman”

39. “Dewey defeats Truman”
https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

40. “Dewey defeats Truman”
• The Chicago Tribune printed the wrong headline on
election night
• The editor trusted the results of the phone survey
• … in 1948, a sample of phone users was not
representative of the general population
https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
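
The effect of a non-representative sample can be sketched with a simulation. The electorate below is entirely hypothetical: the share of phone owners and the voting preferences are invented numbers chosen to make the bias visible, not historical estimates.

```python
import random

rng = random.Random(1948)  # fixed seed so the run is repeatable

# Hypothetical electorate: each voter is (has_phone, votes_dewey).
# Assumption: 45% own a phone; owners lean Dewey (60%), others lean Truman.
population = []
for _ in range(100_000):
    has_phone = rng.random() < 0.45
    p_dewey = 0.60 if has_phone else 0.40
    population.append((has_phone, rng.random() < p_dewey))


def dewey_share(voters) -> float:
    """Fraction of the given voters who vote for Dewey."""
    return sum(votes_dewey for _, votes_dewey in voters) / len(voters)


true_share = dewey_share(population)

# Biased survey: only phone owners can be reached.
phone_owners = [voter for voter in population if voter[0]]
phone_estimate = dewey_share(rng.sample(phone_owners, 1000))

# Unbiased survey: every voter is equally likely to be sampled.
random_estimate = dewey_share(rng.sample(population, 1000))

print(f"true share:         {true_share:.1%}")
print(f"phone survey:       {phone_estimate:.1%}")
print(f"random-sample poll: {random_estimate:.1%}")
```

Both surveys use the same sample size; only the sampling frame differs. A larger phone survey would give a more precise estimate of the wrong quantity.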

41. Survivorship Bias

42. Survivorship Bias
• Bill Gates, Steve Jobs, Mark Zuckerberg
are all college drop-outs
• … should you quit studying?

43. LIES, DAMNED LIES
AND DATAVIZ

44. “A picture is worth
a thousand words”

45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet
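
Anscombe's quartet consists of four small datasets with nearly identical summary statistics that look completely different when plotted. A minimal sketch that computes the mean of y and the correlation for each dataset; the values are the standard published quartet, and the helper names are illustrative.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Datasets I to III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# name -> (mean of x, mean of y, correlation between x and y)
summary = {
    name: (mean(xs), mean(ys), pearson_r(xs, ys)) for name, (xs, ys) in quartet.items()
}
for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The printed rows are almost indistinguishable, which is the point: the differences only show up in a scatter plot.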

52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html

53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

57. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

58. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

59. LIES, DAMNED LIES
AND SIGNIFICANCE

60. Significant = Important
?

61. Statistically Significant Results

62. Statistically Significant Results
• We are quite sure they are reliable (not by chance)
• Maybe they’re not “big”
• Maybe they’re not important
• Maybe they’re not useful for decision making
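
The gap between "significant" and "important" can be made concrete: with enough data, even a negligible difference becomes statistically significant. A sketch using a two-sample z-test with known standard deviation, chosen here for simplicity and not taken from the slides; the effect size and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test for a difference between two group means.

    Assumes both groups have the same known standard deviation `sd`
    and the same size `n_per_group`.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical effect: a difference of 0.01 standard deviations,
# far too small to matter in most practical settings.
tiny_effect = 0.01

p_small_sample = two_sided_p_value(tiny_effect, sd=1.0, n_per_group=100)
p_huge_sample = two_sided_p_value(tiny_effect, sd=1.0, n_per_group=1_000_000)

print(f"n = 100 per group:       p = {p_small_sample:.3f}")
print(f"n = 1,000,000 per group: p = {p_huge_sample:.2e}")
```

The effect is identical in both cases; only the sample size changed. The p-value reports how surprising the data are under the null hypothesis, not how big or useful the effect is.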

63. p-values

64. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

65. p-values
• Probability of observing our results (or more
extreme) when the null hypothesis is true
• Probability, not certainty
• Often p < 0.05 (arbitrary)
• Can we afford to be fooled by randomness
every 1 time out of 20?
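
The definition above can be worked through with a coin. Suppose a coin lands heads 60 times in 100 flips; the one-sided p-value is the probability of seeing 60 or more heads if the coin is fair (the null hypothesis). The scenario is an illustration and does not come from the slides.

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """One-sided p-value: probability of `heads` or more heads in `flips`
    flips when each flip lands heads with probability `p_heads`."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p_60 = binomial_p_value(60, 100)  # fairly surprising for a fair coin
p_50 = binomial_p_value(50, 100)  # not surprising at all

print(f"P(at least 60 heads | fair coin) = {p_60:.4f}")
print(f"P(at least 50 heads | fair coin) = {p_50:.4f}")
print("below the 0.05 threshold?", p_60 < 0.05)
```

A result below 0.05 does not mean the coin is certainly biased: a fair coin produces results this extreme some of the time, which is exactly what the threshold tolerates.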

66. Data dredging

67. Data dredging
• a.k.a. data fishing or p-hacking
• Convention: formulate hypothesis, collect data,
prove/disprove hypothesis
• Data dredging: look for patterns until something
statistically significant comes up
• Looking for patterns is OK;
testing the hypothesis on the same data set is not
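
Data dredging can be reproduced on pure noise. The sketch below runs 200 comparisons between two groups drawn from the same distribution, so every "significant" result is a false positive. The number of experiments, group size, and the z-test with known variance are all choices made for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def noise_experiment_p_value(rng: random.Random, n: int = 30) -> float:
    """Compare two groups drawn from the SAME standard normal distribution
    and return the two-sided p-value of a z-test (known variance of 1)."""
    group_a = [rng.gauss(0, 1) for _ in range(n)]
    group_b = [rng.gauss(0, 1) for _ in range(n)]
    z = (mean(group_a) - mean(group_b)) / sqrt(2 / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(2019)  # fixed seed so the run is repeatable
p_values = [noise_experiment_p_value(rng) for _ in range(200)]
false_positives = sum(p < 0.05 for p in p_values)

print(f"{false_positives} of {len(p_values)} noise-only experiments have p < 0.05")
```

With a 0.05 threshold, roughly 1 in 20 tests on noise is expected to come out significant. Reporting only those, without testing them on fresh data, is the p-hacking the slide warns about.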

68. LIES, DAMNED LIES
AND CELEBRITIES ON

70. P(mosquito|death)
P(death|mosquito)
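
The two conditional probabilities on this slide are easy to confuse and can differ by orders of magnitude. The slide gives no numbers, so the counts below are hypothetical, with "mosquito" read as "was bitten by a mosquito" and "death" as "died"; the last step checks the result against Bayes' theorem.

```python
# Hypothetical counts for a population (invented for illustration).
population = 1_000_000
bitten = 500_000          # people bitten by a mosquito
died = 10_000             # people who died, from any cause
bitten_and_died = 1_000   # people who were bitten and also died

# P(mosquito | death): among those who died, how many had been bitten?
p_mosquito_given_death = bitten_and_died / died

# P(death | mosquito): among those bitten, how many died?
p_death_given_mosquito = bitten_and_died / bitten

# Bayes' theorem links the two: P(D|M) = P(M|D) * P(D) / P(M).
p_death = died / population
p_mosquito = bitten / population
via_bayes = p_mosquito_given_death * p_death / p_mosquito

print(f"P(mosquito | death) = {p_mosquito_given_death:.3f}")
print(f"P(death | mosquito) = {p_death_given_mosquito:.3f}")
print(f"P(death | mosquito) via Bayes = {via_bayes:.3f}")
```

Quoting one of these probabilities as if it were the other is a common way for a true statistic to support a false conclusion.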

71. SUMMARY

72. — Dr. House
“Everybody lies”

73. • Good Science™ vs. Big headlines
• Nobody is immune