Marco Bonzanini
April 19, 2018

# Lies, Damned Lies and Statistics @ PyCon Italia 2018

Statistics show that eating ice cream causes death by drowning.

If this sounds baffling, this talk will help you to understand correlation, bias, statistical significance and other statistical techniques that are commonly (mis)used to support an argument that leads, by accident or on purpose, to drawing the wrong conclusions.

The casual observer is exposed to statistics and probability in everyday life, yet it is extremely easy to fall victim to a statistical fallacy, even for professional users.

The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

## Transcript

1. Lies, Damned Lies
and Statistics
@MarcoBonzanini
PyCon Nove
Florence, Italy
April 2018

2. In the Vatican City
there are 5.88 popes
per square mile
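The figure on this slide is plain division, which is exactly why it sounds precise while saying nothing useful. A minimal sketch of the arithmetic, assuming an area of about 0.17 square miles for Vatican City (the slide does not state the area, so that value is an assumption):

```python
# Assumption: Vatican City covers roughly 0.17 square miles.
# The slide does not give the area; this value is only illustrative.
VATICAN_AREA_SQ_MILES = 0.17
POPES = 1

# A "density" computed from a single pope
popes_per_sq_mile = POPES / VATICAN_AREA_SQ_MILES
print(f"{popes_per_sq_mile:.2f} popes per square mile")
```

The number is arithmetically correct and still misleading: there is one pope, not a population of popes spread over the land.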

3. This talk is about:
• The misuse of statistics in everyday life
• How (not) to lie with statistics
This talk is not about:
• Python
• Advanced Statistical Models
The audience (you!):
• Good citizens
• An interest in statistical literacy
(without an advanced Math degree?)

4. LIES, DAMNED LIES
AND CORRELATION

5. Correlation

6. Correlation
• Informal: a connection between two things
• Measure the strength of the association
between two variables
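One common way to measure the strength of a linear association is Pearson's correlation coefficient. The sketch below computes it by hand; the temperature and sales figures are made up for illustration, and `pearson_r` is a hypothetical helper name, not something from the talk.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

# Made-up data: daily temperature (°C) and ice cream sales (arbitrary units)
temperature = [14, 16, 19, 22, 25, 28, 31]
sales = [215, 325, 410, 520, 610, 700, 790]

r_positive = pearson_r(temperature, sales)                 # close to +1
r_negative = pearson_r(temperature, [-s for s in sales])   # close to -1
print(f"positive: {r_positive:.3f}, negative: {r_negative:.3f}")
```

Values near +1 or -1 indicate a strong linear association, values near 0 a weak one. The coefficient says nothing about why the two variables move together.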

7. Linear Correlation

8. Linear Correlation
Positive Negative

9. Correlation Example

10. Correlation Example
Temperature vs. Ice Cream Sales (\$\$\$)

11. “Correlation
does not imply
causation”

12. Deaths by drowning vs. Ice Cream Sales (\$\$\$)

13. Lurking Variable

14. Lurking Variable
Temperature vs. Ice Cream Sales (\$\$\$)
Temperature vs. Deaths by drowning
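The lurking-variable effect can be reproduced with a small simulation. In this sketch temperature drives both ice cream sales and drownings, and the two never influence each other. All coefficients and noise levels are invented for illustration, and `pearson_r` and `residuals` are hypothetical helper names.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Invented model: temperature is the only common cause
temperature = [rng.uniform(5, 35) for _ in range(1000)]
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

raw = pearson_r(ice_cream, drownings)
controlled = pearson_r(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"after removing the effect of temperature: {controlled:.2f}")
```

The raw correlation is strong, but it largely disappears once the temperature trend is subtracted from both series. This is what one would expect if temperature is the only link between them.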

15. More Lurking Variables

16. More Lurking Variables
Damage caused by fire vs. Firefighters deployed

17. More Lurking Variables
Damage caused by fire vs. Firefighters deployed
Fire severity?

18. Correlation and causation

19. Correlation and causation
• A causes B, or B causes A
• A and B both cause C
• C causes A and B
• A causes C, and C causes B
• No connection between A and B

20. http://www.tylervigen.com/spurious-correlations

21. http://www.tylervigen.com/spurious-correlations

22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

25. LIES, DAMNED LIES,
SLICING AND DICING

26. Simpson’s

27. University of California, Berkeley

28. University of California, Berkeley
Gender bias?

29. University of California, Berkeley

30. University of California, Berkeley

31. University of California, Berkeley

32. University of California, Berkeley
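The reversal behind this kind of admissions example can be shown with a few numbers. The sketch below uses invented figures for two hypothetical departments. They are not the real Berkeley data and were chosen only to make the effect visible.

```python
# Invented numbers: (applicants, admitted) per department and gender
applications = {
    "Department A": {"men": (800, 480), "women": (100, 70)},
    "Department B": {"men": (200, 40), "women": (900, 270)},
}

# Admission rate inside each department
per_department = {
    dept: {gender: admitted / applied
           for gender, (applied, admitted) in groups.items()}
    for dept, groups in applications.items()
}

# Admission rate after pooling both departments together
overall = {}
for gender in ("men", "women"):
    applied = sum(groups[gender][0] for groups in applications.values())
    admitted = sum(groups[gender][1] for groups in applications.values())
    overall[gender] = admitted / applied

for dept, rates in per_department.items():
    print(f"{dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")
print(f"Overall: men {overall['men']:.0%}, women {overall['women']:.0%}")
```

In these made-up numbers women have the higher admission rate in each department, yet the lower rate overall, because most of them applied to the department that rejects most applicants. Slicing the same data differently supports opposite conclusions.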

33. LIES, DAMNED LIES
AND SAMPLING BIAS

34. Sampling

35. Sampling
• A selection of a subset of individuals
• Purpose: estimate about the whole population
• Hello Big Data!

36. Bias

37. Bias
• Prejudice? Intuition?
• Cultural context?
• In science: a systematic error

38. “Dewey defeats Truman”

39. “Dewey defeats Truman”
https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

40. “Dewey defeats Truman”
https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
• The Chicago Tribune printed the wrong headline on election night
• The editor trusted the results of the phone survey
• … in 1948, a sample of phone users was not representative of the general population
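A rough simulation shows how a poll restricted to phone owners can miss the overall result. Every percentage here (phone ownership, voting preferences) is invented for illustration and is not historical data; `dewey_share` is a hypothetical helper name.

```python
import random

rng = random.Random(1948)  # fixed seed so the run is repeatable

# Invented population: 40% own a phone, and phone owners lean towards Dewey
population = []
for _ in range(100_000):
    has_phone = rng.random() < 0.4
    votes_dewey = rng.random() < (0.6 if has_phone else 0.4)
    population.append((has_phone, votes_dewey))

def dewey_share(voters):
    """Fraction of the given voters who choose Dewey."""
    return sum(votes_dewey for _, votes_dewey in voters) / len(voters)

true_share = dewey_share(population)

# Biased sample: only people who can be reached by phone
phone_owners = [voter for voter in population if voter[0]]
phone_poll = dewey_share(rng.sample(phone_owners, 2000))

# Unbiased sample: every voter is equally likely to be picked
random_poll = dewey_share(rng.sample(population, 2000))

print(f"true share: {true_share:.1%}")
print(f"phone poll: {phone_poll:.1%}")
print(f"random poll: {random_poll:.1%}")
```

Both polls have the same size, so a larger sample would not fix the phone poll. The error is systematic: it comes from who can be reached, not from how many were asked.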

41. Survivorship Bias

42. Survivorship Bias
• Bill Gates, Steve Jobs, Mark Zuckerberg are all college drop-outs
• … should you quit studying?

43. LIES, DAMNED LIES
AND DATAVIZ

44. “A picture is worth
a thousand words”

45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet
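Anscombe's quartet is a set of four small data sets with nearly identical summary statistics that look completely different when plotted. The sketch below computes the summaries only; `pearson_r` is a hypothetical helper name and the plotting step is left out.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

# Anscombe's quartet; data sets I to III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# (mean of x, mean of y, correlation) for each data set
summary = {name: (mean(xs), mean(ys), pearson_r(xs, ys))
           for name, (xs, ys) in quartet.items()}

for name, (mean_x, mean_y, r) in summary.items():
    print(f"{name:>3}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.3f}")
```

The printed rows are almost indistinguishable, which is the point of the quartet: the summaries alone hide the differences, so the data needs to be plotted.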

46. 46

47. 47

48. 48

49. 49

50. 50

51. 51

52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html

53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

57. LIES, DAMNED LIES
AND SIGNIFICANCE

58. Significant = Important?

59. Statistically Significant Results

60. Statistically Significant Results
• We are quite sure they are reliable (not by chance)
• Maybe they’re not “big”
• Maybe they’re not important
• Maybe they’re not useful for decision making
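A result can be statistically significant and still tiny. The sketch below runs a two-sample z-test on an invented scenario with a very large sample. The group means, spread and sample size are all assumptions made for illustration, and `two_sided_p_value` is a hypothetical helper name.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

# Invented scenario: two groups of one million people, known spread of 15 points
n_per_group = 1_000_000
std_dev = 15.0
mean_a, mean_b = 100.0, 100.2

difference = mean_b - mean_a
standard_error = std_dev * math.sqrt(2 / n_per_group)
z = difference / standard_error

p_value = two_sided_p_value(z)
effect_size = difference / std_dev  # standardised difference (Cohen's d)

print(f"z = {z:.2f}, p-value = {p_value:.2e}")
print(f"effect size = {effect_size:.3f} standard deviations")
```

With a sample this large the p-value is far below 0.05, yet the difference is about a hundredth of a standard deviation. Whether that matters for a decision is a separate question that the p-value cannot answer.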

61. p-values

62. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

63. p-values
• Probability of observing our results (or more extreme) when the null hypothesis is true
• Probability, not certainty
• Often p < 0.05 (arbitrary)
• Can we afford to be fooled by randomness 1 time out of every 20?
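The definition above can be made concrete with a coin-flip example. This sketch computes an exact two-sided p-value under the null hypothesis that the coin is fair. The scenario (100 flips) and the helper name `binomial_p_value` are assumptions for illustration.

```python
from math import comb

def binomial_p_value(heads, flips):
    """Probability, for a fair coin, of a result at least as far from
    flips / 2 as the observed number of heads (two-sided, exact)."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every sequence of flips whose head count is at least as extreme
    extreme_outcomes = sum(comb(flips, k) for k in range(flips + 1)
                           if abs(k - expected) >= observed_gap)
    return extreme_outcomes / 2 ** flips

p_60_heads = binomial_p_value(60, 100)  # just above the 0.05 threshold
p_61_heads = binomial_p_value(61, 100)  # just below the 0.05 threshold
print(f"60 heads out of 100: p = {p_60_heads:.4f}")
print(f"61 heads out of 100: p = {p_61_heads:.4f}")
```

A single extra head moves the result across the conventional threshold, which shows how arbitrary the 0.05 cut-off is. In both cases the p-value is the probability of the data given a fair coin, not the probability that the coin is fair.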

64. Data dredging

65. 65

66. Data dredging
• a.k.a. data fishing or p-hacking
• Convention: formulate hypothesis, collect data, prove/disprove hypothesis
• Data dredging: look for patterns until something statistically significant comes up
• Looking for patterns is OK; testing the hypothesis on the same data set is not
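Data dredging can be simulated by testing many hypotheses where nothing is going on. In this sketch every coin is fair, so each "significant" result is a false positive. The number of experiments and flips are arbitrary choices, and `binomial_p_value` is a hypothetical helper name.

```python
import random
from functools import lru_cache
from math import comb

FLIPS = 100
EXPERIMENTS = 1000

@lru_cache(maxsize=None)
def binomial_p_value(heads, flips=FLIPS):
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    extreme_outcomes = sum(comb(flips, k) for k in range(flips + 1)
                           if abs(k - expected) >= observed_gap)
    return extreme_outcomes / 2 ** flips

rng = random.Random(2018)  # fixed seed so the run is repeatable

# Run many experiments on fair coins and keep only the "significant" ones
false_positives = 0
for _ in range(EXPERIMENTS):
    heads = sum(rng.random() < 0.5 for _ in range(FLIPS))
    if binomial_p_value(heads) < 0.05:
        false_positives += 1

print(f"{false_positives} of {EXPERIMENTS} fair coins look 'significantly' biased")
```

Reporting only the coins that passed the test would look like a discovery. A pattern found this way needs to be checked against fresh data before it is treated as a result.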

67. SUMMARY

68. — Dr. House
“Everybody lies”

69.
• Good Science™ vs. big headlines
• Nobody is immune
• Ask questions: What is the context? Who’s paying? What’s missing?
• … “so what?”

70. THANK YOU
@MarcoBonzanini
speakerdeck.com/marcobonzanini
GitHub.com/bonzanini
marcobonzanini.com