Marco Bonzanini
September 14, 2019

# Lies, Damned Lies, and Statistics @ PyCon UK 2019

Statistics show that eating ice cream causes death by drowning. If this sounds baffling, this talk will help you to understand correlation, bias, statistical significance and other statistical techniques that are commonly (mis)used to support an argument that leads, by accident or on purpose, to drawing the wrong conclusions.

The casual observer is exposed to the use of statistics and probability in everyday life, but it is extremely easy to fall victim of a statistical fallacy, even for professional users. The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Agenda:
- Correlation and causation
- Sampling bias and polluted surveys
- Data visualisation gone wild
- Statistical significance (and Data dredging a.k.a. p-hacking)


## Transcript

3. ### This talk is about: the misuse of stats in everyday life • This talk is NOT about: Python • The audience (you!): good citizens, with an interest in statistical literacy (without an advanced Math degree?)

6. ### Correlation • Informal: a connection between two things • Measure the strength of the association between two variables

(Slide figures: diagrams relating variables A, B and C; caption “Gender bias?”)
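The ice-cream/drowning example from the abstract can be simulated to show how correlation arises without causation. This is a sketch with invented data and a hand-rolled `pearson_r` helper (neither is from the talk): a hidden confounder (temperature) drives both variables, so they correlate strongly even though neither causes the other.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hypothetical data: hot weather (the confounder) drives both
# ice cream sales and drownings.
temperature = [random.uniform(10, 35) for _ in range(100)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 2) for t in temperature]

r = pearson_r(ice_cream, drownings)
print(round(r, 2))  # strongly positive, yet neither causes the other
```

Controlling for temperature (e.g. comparing days with similar weather) would make the apparent association between the two variables largely disappear.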

35. ### Sampling • A selection of a subset of individuals • Purpose: estimate properties of the whole population • Hello Big Data!
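The point that a modest random sample can stand in for an entire population (no "Big Data" required) is easy to demonstrate. A minimal sketch with a synthetic population (the income figures are invented for illustration):

```python
import random

random.seed(42)
# Hypothetical population: 100,000 skewed "incomes".
population = [random.lognormvariate(10, 0.5) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# A simple random sample of 1,000 individuals estimates the
# population mean with a small relative error.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

relative_error = abs(sample_mean - true_mean) / true_mean
print(round(relative_error, 3))  # small: the sample tracks the population
```

The catch, as the next slides show, is that this only works if the sample is actually representative.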

37. ### Bias • Prejudice? Intuition? • Cultural context? • In science: a systematic error
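The "systematic error" definition can be illustrated by simulation: random noise averages out as the sample grows, but a systematic offset never does. A hypothetical measurement example (the true value and the +2.0 offset are invented):

```python
import random

random.seed(3)
true_value = 50.0
n = 10_000

# Random error: zero-mean noise around the true value.
noisy = [true_value + random.gauss(0, 5) for _ in range(n)]
# Systematic error: the same noise PLUS a constant +2.0 offset,
# e.g. from a miscalibrated instrument or a skewed sampling method.
biased = [true_value + 2.0 + random.gauss(0, 5) for _ in range(n)]

mean_noisy = sum(noisy) / n
mean_biased = sum(biased) / n
print(round(mean_noisy, 1))   # converges to the true value
print(round(mean_biased, 1))  # converges to the WRONG value: bias remains
```

More data shrinks the random error but leaves the bias untouched, which is why collecting more of a badly sampled dataset does not fix it.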

40. ### “Dewey Defeats Truman” • The Chicago Tribune printed the wrong headline on election night • The editor trusted the results of the phone survey • … in 1948, a sample of phone users was not representative of the general population https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

42. ### Survivorship Bias • Bill Gates, Steve Jobs, Mark Zuckerberg are all college drop-outs • … should you quit studying?

62. ### Statistically Significant Results • We are quite sure they are reliable (not by chance) • Maybe they’re not “big” • Maybe they’re not important • Maybe they’re not useful for decision making

65. ### p-values • Probability of observing our results (or more extreme) when the null hypothesis is true • Probability, not certainty • Often p < 0.05 (arbitrary) • Can we afford to be fooled by randomness every 1 time out of 20?
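The "1 time out of 20" point can be checked by simulation: when the null hypothesis is true, a p < 0.05 threshold still flags roughly 5% of experiments as significant. A minimal sketch (the `z_test_p` helper is a normal approximation written for illustration; a real analysis would use something like `scipy.stats.ttest_ind`):

```python
import math
import random

def z_test_p(a, b):
    """Approximate two-sided p-value for equal means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
# Both groups are drawn from the SAME distribution, so the null
# hypothesis is true and every "significant" result is a false positive.
trials = 1000
false_positives = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if z_test_p(a, b) < 0.05:
        false_positives += 1

rate = false_positives / trials
print(rate)  # close to 0.05: fooled roughly 1 time in 20
```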

67. ### Data dredging • a.k.a. data fishing or p-hacking • Convention: formulate hypothesis, collect data, prove/disprove hypothesis • Data dredging: look for patterns until something statistically significant comes up • Looking for patterns is OK • Testing the hypothesis on the same data set is not
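Data dredging itself is easy to reproduce: test enough hypotheses on one dataset and some will cross p < 0.05 purely by chance. A hypothetical sketch (the data, the random "attributes" and the `z_test_p` helper are all invented for illustration):

```python
import math
import random

def z_test_p(a, b):
    """Approximate two-sided p-value for equal means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(7)
# One outcome with no real structure at all.
scores = [random.gauss(0, 1) for _ in range(200)]

# Dredge: invent 200 random ways to split the subjects and test each one.
# At p < 0.05 per test, roughly 10 splits will look "significant" by chance.
significant = []
for attribute in range(200):
    labels = [random.random() < 0.5 for _ in scores]
    group_a = [s for s, l in zip(scores, labels) if l]
    group_b = [s for s, l in zip(scores, labels) if not l]
    if z_test_p(group_a, group_b) < 0.05:
        significant.append(attribute)

print(len(significant))  # a handful of purely spurious "discoveries"
```

This is why the convention on the slide matters: a pattern found by dredging must be tested on fresh data, not on the dataset that suggested it.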