Lies, Damned Lies, and Statistics @ PyCon UK 2019

Marco Bonzanini
September 14, 2019


Statistics show that eating ice cream causes death by drowning. If this sounds baffling, this talk will help you to understand correlation, bias, statistical significance and other statistical techniques that are commonly (mis)used to support an argument that leads, by accident or on purpose, to drawing the wrong conclusions.

The casual observer is exposed to the use of statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users. The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

- Correlation and causation
- Simpson’s Paradox
- Sampling bias and polluted surveys
- Data visualisation gone wild
- Statistical significance (and Data dredging a.k.a. p-hacking)


  1. This talk is about: the misuse of stats in everyday life.
     This talk is NOT about: Python.
     The audience (you!): good citizens, with an interest in statistical literacy (without an advanced Math degree?)
  2. Correlation
     • Informal: a connection between two things
     • Measures the strength of the association between two variables
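A minimal sketch of this idea, using Pearson's correlation coefficient and made-up monthly figures for the talk's ice-cream-and-drowning example (the numbers are hypothetical; both series simply rise and fall with summer):

```python
# Pearson correlation: strength of the linear association between
# two variables, in [-1, 1]. Pure-stdlib sketch, not a stats library.
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

ice_cream = [20, 30, 45, 60, 80, 75, 50]  # hypothetical monthly sales
drownings = [2, 3, 5, 7, 9, 8, 6]         # hypothetical monthly deaths

r = pearson(ice_cream, drownings)
print(r)  # close to +1: strong correlation, but no causation -- both track summer
```

A high `r` here says nothing about causation: a lurking variable (warm weather) drives both series.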
  3. Sampling
     • A selection of a subset of individuals
     • Purpose: make estimates about the whole population
     • Hello Big Data!
  4. “Dewey Defeats Truman”
     • The Chicago Tribune printed the wrong headline on election night
     • The editor trusted the results of the phone survey
     • … in 1948, a sample of phone users was not representative of the general population
     https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
  5. Survivorship Bias
     • Bill Gates, Steve Jobs, Mark Zuckerberg: all college drop-outs
     • … should you quit studying?
  6. Statistically Significant Results
     • We are quite sure they are reliable (not by chance)
     • Maybe they’re not “big”
     • Maybe they’re not important
     • Maybe they’re not useful for decision making
  7. p-values
     • Probability of observing our results (or more extreme) when the null hypothesis is true
     • Probability, not certainty
     • Often p < 0.05 (arbitrary)
     • Can we afford to be fooled by randomness every 1 time out of 20?
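The definition above can be made concrete with a coin-flip experiment (a sketch, not part of the original talk): under the null hypothesis that the coin is fair, the p-value is the probability of seeing a result at least as extreme as the one observed.

```python
# One-sided p-value for a coin-flip experiment.
# H0: the coin is fair. We observe 16 heads in 20 flips.
from math import comb

def p_value_heads(heads, flips):
    """P(X >= heads) for X ~ Binomial(flips, 0.5): chance of a result
    at least this extreme if the coin really is fair."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

p = p_value_heads(16, 20)
print(p < 0.05)  # True: "significant" -- but it is a probability, not certainty
```

Note that p < 0.05 does not prove the coin is biased; a fair coin produces a result this extreme about 0.6% of the time, and with enough experiments one of them eventually will.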
  8. Data dredging
     • a.k.a. Data fishing or p-hacking
     • Convention: formulate hypothesis, collect data, prove/disprove hypothesis
     • Data dredging: look for patterns until something statistically significant comes up
     • Looking for patterns is OK; testing the hypothesis on the same data set is not
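Why dredging works against you can be shown with a small simulation (a hedged sketch, assuming fair-coin "studies" on pure noise): run enough tests and some will clear p < 0.05 by chance alone.

```python
# Data dredging in miniature: 1000 'studies' of pure noise,
# each reporting a crude two-sided binomial p-value.
import random
from math import comb

random.seed(42)  # fixed seed so the run is reproducible

def fake_experiment(n=30):
    """One study on noise: flip n fair coins, return a two-sided
    p-value for the observed number of heads under H0 (fair coin)."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    tail = min(heads, n - heads)
    p = 2 * sum(comb(n, k) for k in range(tail + 1)) / 2 ** n
    return min(p, 1.0)

significant = sum(fake_experiment() < 0.05 for _ in range(1000))
print(significant)  # roughly 1 in 20 noise-only studies look "significant"
```

There is no effect anywhere in this data, yet dozens of "discoveries" appear. Reporting only those, without the other runs, is exactly the p-hacking the slide warns about.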
  9. Good Science™ vs. Big headlines
     • Nobody is immune
     • Ask questions: What is the context? Who’s paying? What’s missing?
     • … “so what?”