Lies, Damned Lies and Statistics @ PyCon Italia 2018

Statistics show that eating ice cream causes death by drowning.

If this sounds baffling, this talk will help you understand correlation, bias, statistical significance and other statistical concepts that are commonly (mis)used to support arguments that lead, by accident or on purpose, to the wrong conclusions.

The casual observer is exposed to the use of statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users.

The purpose of this talk is to help the audience understand how to recognise and avoid these fallacies, by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Marco Bonzanini

April 19, 2018

Transcript

  1. Lies, Damned Lies and Statistics
    @MarcoBonzanini
    PyCon Nove
    Florence, Italy
    April 2018

  2. In the Vatican City there are 5.88 popes per square mile

  3. This talk is about:
    • The misuse of statistics in everyday life
    • How (not) to lie with statistics
    This talk is not about:
    • Python
    • Advanced Statistical Models
    The audience (you!):
    • Good citizens
    • An interest in statistical literacy (without an advanced Math degree?)

  4. LIES, DAMNED LIES AND CORRELATION

  5. Correlation

  6. Correlation
    • Informal: a connection between two things
    • Measure the strength of the association
    between two variables
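
One common way to measure the strength of a linear association is the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below is a minimal illustration, not code from the talk: the `pearson_r` helper and all three data series are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence has zero variance and would divide by zero.
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative data (assumption, not from the talk)
temperature = [14, 16, 20, 23, 27, 30]
ice_cream_sales = [120, 150, 210, 260, 330, 380]  # rises with temperature
heating_bills = [90, 80, 60, 45, 25, 10]          # falls with temperature

r_positive = pearson_r(temperature, ice_cream_sales)
r_negative = pearson_r(temperature, heating_bills)
print(f"positive: {r_positive:.3f}, negative: {r_negative:.3f}")
```

A value close to +1 or -1 only says the points lie near a straight line; it says nothing about why.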

  7. Linear Correlation

  8. Linear Correlation
    [Figure panels: Positive, Negative]

  9. Correlation Example

  10. Correlation Example
    [Figure labels: Temperature, Ice Cream Sales ($$$)]

  11. “Correlation does not imply causation”

  12. [Figure labels: Deaths by drowning, Ice Cream Sales ($$$)]

  13. Lurking Variable

  14. [Figure labels: Temperature and Ice Cream Sales ($$$); Temperature and Deaths by drowning]
    Lurking Variable
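
The lurking-variable effect can be reproduced with a small simulation. This is a sketch under stated assumptions: the coefficients, noise levels, and sample size are invented, and `pearson_r` and `residuals` are helpers written for this example. Temperature drives both quantities; neither influences the other. Removing the linear effect of temperature from each series (one simple way to "control" for it) makes the association disappear.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical model: temperature is the only common cause.
temperature = [rng.uniform(5, 35) for _ in range(1000)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_raw = pearson_r(ice_cream, drownings)
r_controlled = pearson_r(residuals(ice_cream, temperature),
                         residuals(drownings, temperature))
print(f"raw: {r_raw:.2f}, after controlling for temperature: {r_controlled:.2f}")
```

The raw correlation is strong even though the model contains no link between ice cream and drowning.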


  15. More Lurking Variables

  16. More Lurking Variables
    [Figure labels: Damage caused by fire, Firefighters deployed]

  17. More Lurking Variables
    [Figure labels: Damage caused by fire, Firefighters deployed]
    Fire severity?

  18. Correlation and causation

  19. Correlation and causation
    • A causes B, or B causes A
    • A and B both cause C
    • C causes A and B
    • A causes C, and C causes B
    • No connection between A and B

  20. http://www.tylervigen.com/spurious-correlations

  21. http://www.tylervigen.com/spurious-correlations

  22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

  25. LIES, DAMNED LIES, SLICING AND DICING YOUR DATA

  26. Simpson’s Paradox

  27. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  28. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox
    Gender bias?

  29. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  30. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  31. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  32. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox
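
Simpson's paradox is easiest to see with small numbers. The sketch below uses invented figures for two hypothetical departments; they are NOT the Berkeley data, only an arrangement chosen to show the reversal. Women have the higher admission rate in each department, yet the lower rate overall, because most women applied to the more selective department.

```python
# Made-up numbers (assumption): gender -> (applied, admitted) per department
applications = {
    "Department X": {"men": (800, 480), "women": (100, 70)},
    "Department Y": {"men": (200, 40), "women": (900, 225)},
}

# Admission rate per department and gender
per_department = {
    dept: {gender: admitted / applied
           for gender, (applied, admitted) in groups.items()}
    for dept, groups in applications.items()
}

# Admission rate per gender after pooling both departments
overall = {}
for gender in ("men", "women"):
    applied = sum(groups[gender][0] for groups in applications.values())
    admitted = sum(groups[gender][1] for groups in applications.values())
    overall[gender] = admitted / applied

for dept, rates in per_department.items():
    print(dept, {g: f"{r:.1%}" for g, r in rates.items()})
print("Overall", {g: f"{r:.1%}" for g, r in overall.items()})
```

The aggregate hides the fact that the two groups applied to departments with very different admission rates.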


  33. LIES, DAMNED LIES AND SAMPLING BIAS

  34. Sampling

  35. Sampling
    • A selection of a subset of individuals
    • Purpose: estimate properties of the whole population
    • Hello Big Data!

  36. Bias

  37. Bias
    • Prejudice? Intuition?
    • Cultural context?
    • In science: a systematic error


  38. “Dewey defeats Truman”

  39. “Dewey defeats Truman”
    https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

  40. “Dewey defeats Truman”
    https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
    • The Chicago Tribune printed the wrong headline on
    election night
    • The editor trusted the results of the phone survey
    • … in 1948, a sample of phone users was not
    representative of the general population
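
The effect of a non-representative sample can be sketched with a toy simulation. Every number here is an assumption made for illustration (share of phone owners, candidate support in each group, sample sizes); none of it is 1948 polling data. Surveying only phone owners systematically overestimates support for "candidate A", while a random sample of the same size lands close to the true value.

```python
import random

rng = random.Random(1948)  # fixed seed so the run is repeatable

# Hypothetical electorate: 30% own a phone, and phone owners
# lean towards candidate A (60%) more than everyone else (40%).
population = []
for _ in range(100_000):
    has_phone = rng.random() < 0.30
    votes_a = rng.random() < (0.60 if has_phone else 0.40)
    population.append((has_phone, votes_a))


def support_for_a(people):
    """Fraction of the given people who vote for candidate A."""
    return sum(votes_a for _, votes_a in people) / len(people)


true_support = support_for_a(population)

# Biased sample: only people who can be reached by phone
phone_owners = [person for person in population if person[0]]
phone_survey = rng.sample(phone_owners, 2000)

# Unbiased sample: everyone has the same chance of being picked
random_survey = rng.sample(population, 2000)

print(f"true support:  {true_support:.1%}")
print(f"phone survey:  {support_for_a(phone_survey):.1%}")
print(f"random survey: {support_for_a(random_survey):.1%}")
```

A larger phone survey would not help: the error is systematic, so more data from the same biased source only gives a more precise wrong answer.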


  41. Survivorship Bias

  42. Survivorship Bias
    • Bill Gates, Steve Jobs, Mark Zuckerberg are all college drop-outs
    • … should you quit studying?
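
A quick back-of-the-envelope calculation shows why a list of famous drop-outs is not evidence for dropping out. The counts below are entirely made up for illustration; the point is only that the question "how many drop-outs succeeded?" needs the denominator that a list of survivors leaves out.

```python
# Entirely made-up counts (assumption), chosen only to illustrate the fallacy
cohort = {
    "dropouts": {"total": 10_000, "hugely_successful": 5},
    "graduates": {"total": 100_000, "hugely_successful": 200},
}

# Success rate needs the whole group, not just the famous names
success_rate = {
    group: counts["hugely_successful"] / counts["total"]
    for group, counts in cohort.items()
}

# What a "famous founders" list never shows: everyone who did not make it
invisible_dropouts = (cohort["dropouts"]["total"]
                      - cohort["dropouts"]["hugely_successful"])

for group, rate in success_rate.items():
    print(f"{group}: {rate:.2%} hugely successful")
print(f"dropouts missing from the success stories: {invisible_dropouts}")
```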

  43. LIES, DAMNED LIES AND DATAVIZ

  44. “A picture is worth a thousand words”

  45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet

  46. https://venngage.com/blog/misleading-graphs/

  47. https://venngage.com/blog/misleading-graphs/

  48. https://venngage.com/blog/misleading-graphs/

  49. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  50. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  51. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html

  53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  57. LIES, DAMNED LIES AND SIGNIFICANCE

  58. Significant = Important?

  59. Statistically Significant Results

  60. Statistically Significant Results
    • We are quite sure they are reliable (unlikely to be due to chance alone)
    • Maybe they’re not “big”
    • Maybe they’re not important
    • Maybe they’re not useful for decision making
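
The gap between "significant" and "big" can be illustrated with a two-sample z-test. This is a simplified sketch, assuming a known common standard deviation; the difference of 0.1 on a scale with standard deviation 15 and both sample sizes are made up. The same tiny effect is nowhere near significant with 100 observations per group and overwhelmingly significant with a million.

```python
import math
from statistics import NormalDist


def two_sided_p_value(mean_difference, std_dev, n_per_group):
    """p-value of a two-sample z-test (known, common standard deviation)."""
    standard_error = std_dev * math.sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical effect: groups differ by 0.1 on a scale with std dev 15
mean_difference = 0.1
std_dev = 15
effect_size = mean_difference / std_dev  # standardised difference (Cohen's d)

p_small_sample = two_sided_p_value(mean_difference, std_dev, n_per_group=100)
p_large_sample = two_sided_p_value(mean_difference, std_dev, n_per_group=1_000_000)

print(f"effect size: {effect_size:.4f}")
print(f"p-value with 100 per group:       {p_small_sample:.3f}")
print(f"p-value with 1,000,000 per group: {p_large_sample:.2e}")
```

The effect size is identical in both cases; only the amount of data changed.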


  61. p-values

  62. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

  63. p-values
    • Probability of observing our results (or more
    extreme) when the null hypothesis is true
    • Probability, not certainty
    • Often p < 0.05 (arbitrary)
    • Can we afford to be fooled by randomness
    every 1 time out of 20?
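
The "1 time out of 20" figure can be checked by simulation. In the sketch below the null hypothesis is true by construction: every sample is pure noise with mean 0. The sample size, number of experiments, and the use of a simple z-test with known standard deviation are assumptions made for the example. Roughly 5% of the experiments still come out "significant" at p < 0.05.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(20)  # fixed seed so the run is repeatable


def p_value_under_null(n=30):
    """Draw n noise values with true mean 0 and z-test the sample mean."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) * math.sqrt(n)  # known std dev of 1
    return 2 * (1 - NormalDist().cdf(abs(z)))


experiments = 2000
false_positives = sum(p_value_under_null() < 0.05 for _ in range(experiments))
false_positive_rate = false_positives / experiments

print(f"{false_positives} of {experiments} null experiments had p < 0.05 "
      f"({false_positive_rate:.1%})")
```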

  64. Data dredging

  65. [Image-only slide]

  66. Data dredging
    • a.k.a. Data fishing or p-hacking
    • Convention: formulate hypothesis, collect data,
    prove/disprove hypothesis
    • Data dredging: look for patterns until something
    statistically significant comes up
    • Looking for patterns is ok
    Testing the hypothesis on the same data set is not
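
Data dredging can be demonstrated on pure noise. The sketch below correlates 200 random, unrelated "predictors" with a random outcome and keeps the one with the smallest p-value; it then re-tests that same hypothesis on freshly generated data. All sizes are arbitrary assumptions, and the p-value uses the Fisher z-transformation, which is an approximation rather than an exact test.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(66)  # fixed seed so the run is repeatable
N = 50            # observations per variable
CANDIDATES = 200  # unrelated variables to dredge through


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def noise(n):
    return [rng.gauss(0, 1) for _ in range(n)]


# Dredging: test every candidate against the same outcome
outcome = noise(N)
predictors = [noise(N) for _ in range(CANDIDATES)]
p_values = [correlation_p_value(pearson_r(x, outcome), N) for x in predictors]
significant = [i for i, p in enumerate(p_values) if p < 0.05]
best = min(range(CANDIDATES), key=p_values.__getitem__)

# Honest check: one hypothesis, tested once, on new data
p_fresh = correlation_p_value(pearson_r(noise(N), noise(N)), N)

print(f"{len(significant)} of {CANDIDATES} noise variables have p < 0.05")
print(f"best dredged p-value: {p_values[best]:.4f}")
print(f"same hypothesis on fresh data: p = {p_fresh:.4f}")
```

Because there is no real effect anywhere in this data, every "discovery" in the first step is a false positive, and the fresh-data test is the only one whose p-value means what it claims.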

  67. SUMMARY


  68. “Everybody lies” (Dr. House)

  69.
    • Good Science ™ vs. Big headlines
    • Nobody is immune
    • Ask questions: What is the context? Who’s paying? What’s missing?
    • … “so what?”

  70. THANK YOU
    @MarcoBonzanini
    speakerdeck.com/marcobonzanini
    GitHub.com/bonzanini
    marcobonzanini.com
