Lies, Damned Lies and Statistics @ PyLondinium 2019

Slides for my presentation on "Lies, Damned Lies and Statistics" at PyLondinium 2019
https://pylondinium.org/talks/talk-23.html

Abstract
Statistics show that eating ice cream causes death by drowning. If this sounds baffling, this talk will help you understand correlation, bias, statistical significance and other statistical concepts that are commonly (mis)used to support an argument and that lead, by accident or on purpose, to the wrong conclusions.

The casual observer is exposed to statistics and probability in everyday life, but it is extremely easy to fall victim to a statistical fallacy, even for professional users. The purpose of this talk is to help the audience recognise and avoid these fallacies by combining an introduction to statistics with examples of lies and damned lies, in a way that is approachable for beginners.

Agenda:
- Correlation and causation
- Simpson’s Paradox
- Sampling bias and polluted surveys
- Data visualisation gone wild
- Statistical significance (and Data dredging a.k.a. p-hacking)
- Bonus: Celebrities on Twitter

Marco Bonzanini

June 16, 2019

Transcript

  1. Lies, Damned Lies and Statistics
    @MarcoBonzanini
    PyLondinium 2019

  2. In the Vatican City there are 5.88 popes per square mile

  3. This talk is about: the misuse of stats in everyday life
    This talk is NOT about: Python
    The audience (you!): good citizens, with an interest in
    statistical literacy (without an advanced Math degree?)

  4. LIES, DAMNED LIES AND CORRELATION

  5. Correlation

  6. Correlation
    • Informal: a connection between two things
    • Measure the strength of the association between two variables

  7. Linear Correlation

  8. Linear Correlation
    [Figure: two panels, Positive and Negative, each with axes x and y]
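The slides do not show how the strength of a linear association is computed. One common measure is the Pearson correlation coefficient; the sketch below is an illustration under that assumption, not the speaker's code. The helper name `pearson` and the sample data are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative data only: y rises with x, then falls with x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson(x, rising))   # close to +1: positive linear correlation
print(pearson(x, falling))  # close to -1: negative linear correlation
```

The coefficient lies between -1 and +1; its sign matches the "Positive" and "Negative" panels, and values near 0 indicate little linear association.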

  9. Correlation Example

  10. Correlation Example
    [Figure: axes Temperature and Ice Cream Sales ($$$)]

  11. “Correlation does not imply causation”

  12. [Figure: axes Deaths by drowning and Ice Cream Sales ($$$)]

  13. Lurking Variable

  14. Lurking Variable
    [Figure: two panels, Ice Cream Sales ($$$) and Deaths by drowning,
    each plotted with Temperature]
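The lurking-variable effect can be reproduced with a small simulation. This is a sketch with invented coefficients and noise levels, not real sales or mortality data: temperature drives both series, and neither series influences the other. The helpers `pearson` and `residuals` are hypothetical names for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(42)

# Invented model: both quantities depend on temperature plus independent noise.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream_sales, drownings)
# Remove the part of each series explained by temperature, then correlate.
adjusted_r = pearson(residuals(ice_cream_sales, temperature),
                     residuals(drownings, temperature))

print(f"raw correlation: {raw_r:.2f}")
print(f"after removing the temperature effect: {adjusted_r:.2f}")
```

The raw correlation is strong even though the simulated ice cream sales never enter the drowning formula; once temperature is accounted for, the association largely disappears.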

  15. More Lurking Variables

  16. More Lurking Variables
    [Figure: axes Damage caused by fire and Firefighters deployed]

  17. More Lurking Variables
    [Figure: axes Damage caused by fire and Firefighters deployed]
    Fire severity?

  18. Correlation and causation

  19. Correlation and causation
    • A causes B, or B causes A
    • A and B both cause C
    • C causes A and B
    • A causes C, and C causes B
    • No connection between A and B

  20. http://www.tylervigen.com/spurious-correlations

  21. http://www.tylervigen.com/spurious-correlations

  22. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  23. https://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations

  24. http://www.nejm.org/doi/full/10.1056/NEJMon1211064

  25. LIES, DAMNED LIES, SLICING AND DICING YOUR DATA

  26. Simpson’s Paradox

  27. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  28. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox
    Gender bias?

  29. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  30. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  31. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox

  32. University of California, Berkeley
    Graduate school admissions in 1973
    https://en.wikipedia.org/wiki/Simpson%27s_paradox
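The admissions tables from these slides are not in the transcript, so the sketch below uses invented numbers (not the Berkeley figures) to show how Simpson's Paradox can arise: one group has the higher admission rate in every department, yet the lower rate overall, because it applies mostly to the more selective department. Department names and counts are hypothetical.

```python
# Invented figures: (applicants, admitted) per department and group.
applications = {
    "Dept X": {"men": (800, 480), "women": (100, 70)},
    "Dept Y": {"men": (200, 20), "women": (900, 135)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction."""
    return admitted / applicants


def overall_rate(group):
    """Admission rate for one group, pooled over all departments."""
    applicants = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return rate(applicants, admitted)


# Per-department rates: women are ahead in both departments.
for dept, groups in applications.items():
    for group, (applicants, admitted) in groups.items():
        print(f"{dept} {group}: {rate(applicants, admitted):.1%}")

# Pooled rates: the ordering flips.
print(f"overall men: {overall_rate('men'):.1%}")
print(f"overall women: {overall_rate('women'):.1%}")
```

In this made-up data, most women apply to the department that rejects most applicants of either group, which is what drags their pooled rate down.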

  33. LIES, DAMNED LIES AND SAMPLING BIAS

  34. Sampling

  35. Sampling
    • A selection of a subset of individuals
    • Purpose: make estimates about the whole population
    • Hello Big Data!

  36. Bias

  37. Bias
    • Prejudice? Intuition?
    • Cultural context?
    • In science: a systematic error

  38. “Dewey defeats Truman”

  39. “Dewey defeats Truman”
    https://en.wikipedia.org/wiki/Dewey_Defeats_Truman

  40. “Dewey defeats Truman”
    • The Chicago Tribune printed the wrong headline on
      election night
    • The editor trusted the results of the phone survey
    • … in 1948, a sample of phone users was not
      representative of the general population
    https://en.wikipedia.org/wiki/Dewey_Defeats_Truman
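A phone-only survey of the kind described on slide 40 can be mimicked with a toy simulation. All proportions below are invented for illustration and are not historical 1948 figures: phone owners are a minority of the population and lean towards one candidate.

```python
import random

rng = random.Random(1948)

# Invented population: each voter is (owns_phone, votes_for_dewey).
population = []
for _ in range(50_000):
    owns_phone = rng.random() < 0.4          # assumption: 40% own a phone
    p_dewey = 0.6 if owns_phone else 0.4     # assumption: owners lean Dewey
    population.append((owns_phone, rng.random() < p_dewey))


def dewey_share(voters):
    """Fraction of the given voters who vote for Dewey."""
    return sum(votes_dewey for _, votes_dewey in voters) / len(voters)


true_share = dewey_share(population)

# Biased sample: only people who can be reached by phone.
phone_owners = [voter for voter in population if voter[0]]
phone_estimate = dewey_share(rng.sample(phone_owners, 1000))

# Unbiased sample of the same size, drawn from everyone.
random_estimate = dewey_share(rng.sample(population, 1000))

print(f"true Dewey share:       {true_share:.1%}")
print(f"phone survey estimate:  {phone_estimate:.1%}")
print(f"random sample estimate: {random_estimate:.1%}")
```

Both samples have the same size; only the phone sample is systematically off, and drawing more phone owners would not fix it.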

  41. Survivorship Bias

  42. Survivorship Bias
    • Bill Gates, Steve Jobs, Mark Zuckerberg
      are all college drop-outs
    • … should you quit studying?

  43. LIES, DAMNED LIES AND DATAVIZ

  44. “A picture is worth a thousand words”

  45. https://en.wikipedia.org/wiki/Anscombe%27s_quartet

  46. https://venngage.com/blog/misleading-graphs/

  47. https://venngage.com/blog/misleading-graphs/

  48. https://venngage.com/blog/misleading-graphs/

  49. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  50. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  51. http://www.businessinsider.com/gun-deaths-in-florida-increased-with-stand-your-ground-2014-2?IR=T

  52. https://www.raiplay.it/video/2016/04/Agor224-del-08042016-4d84cebb-472c-442c-82e0-df25c7e4d0ce.html

  53. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  54. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  55. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  56. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  57. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  58. https://www.theguardian.com/news/datablog/2014/may/12/lies-election-leaflets-five-tricks-european-elections

  59. LIES, DAMNED LIES AND SIGNIFICANCE

  60. Significant = Important?

  61. Statistically Significant Results

  62. Statistically Significant Results
    • We are quite sure they are reliable (not by chance)
    • Maybe they’re not “big”
    • Maybe they’re not important
    • Maybe they’re not useful for decision making
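The gap between "significant" and "big" can be shown with a back-of-the-envelope calculation. The sketch below assumes a two-sample z-test with a known standard deviation; the sample size, standard deviation and mean difference are invented to make the point.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Invented scenario: two huge groups whose means differ by a tiny amount.
n_per_group = 1_000_000
std_dev = 15.0
mean_difference = 0.1

# Standard error of the difference between two independent group means.
standard_error = std_dev * math.sqrt(2 / n_per_group)
z = mean_difference / standard_error
p_value = two_sided_p_from_z(z)

# Standardised effect size (Cohen's d): the difference in units of std_dev.
effect_size = mean_difference / std_dev

print(f"z = {z:.2f}, p = {p_value:.2g}")
print(f"effect size = {effect_size:.4f}")
```

With enough data the p-value falls far below 0.05, while the effect size stays under one hundredth of a standard deviation: reliable, but arguably neither "big" nor useful for a decision.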

  63. p-values

  64. https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

  65. p-values
    • Probability of observing our results (or more
      extreme) when the null hypothesis is true
    • Probability, not certainty
    • Often p < 0.05 (arbitrary)
    • Can we afford to be fooled by randomness
      1 time out of every 20?
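The definition on slide 65 can be made concrete with a coin-flip example, chosen here for illustration and not taken from the talk. The null hypothesis is "the coin is fair", the observed result is 60 heads in 100 flips, and the one-sided p-value is the probability of 60 or more heads under that null.

```python
from math import comb


def p_value_at_least(heads, flips):
    """One-sided p-value: P(X >= heads) for a fair coin flipped `flips` times."""
    # Count the outcomes with at least `heads` heads, out of 2**flips in total.
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


p = p_value_at_least(60, 100)
print(f"p-value for 60 or more heads in 100 flips: {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

The result is a probability computed under the assumption that the null hypothesis is true; it is not the probability that the coin is fair.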

  66. Data dredging

  67. (no text on this slide)

  68. Data dredging
    • a.k.a. Data fishing or p-hacking
    • Convention: formulate hypothesis, collect data,
      prove/disprove hypothesis
    • Data dredging: look for patterns until something
      statistically significant comes up
    • Looking for patterns is ok
      Testing the hypothesis on the same data set is not
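Data dredging can be demonstrated on data that contains no pattern at all. The sketch below runs many comparisons between groups of pure random noise and counts how many come out "significant" at p < 0.05. The number of comparisons, the group size and the use of a z-test with known variance are assumptions made for this illustration.

```python
import math
import random

rng = random.Random(2019)


def noise_test_p_value(n=50):
    """p-value of a z-test comparing two groups of pure N(0, 1) noise."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # Both groups have known variance 1, so the standard error is sqrt(2 / n).
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))


# 400 "hypotheses", none of which is true by construction.
p_values = [noise_test_p_value() for _ in range(400)]
false_positives = sum(p < 0.05 for p in p_values)

print(f"{false_positives} of {len(p_values)} comparisons have p < 0.05")
```

Roughly 1 in 20 comparisons is expected to cross the threshold by chance alone, so reporting only those would present noise as a finding.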

  69. LIES, DAMNED LIES AND CELEBRITIES ON TWITTER

  70. https://twitter.com/billgates/status/1118196606975787008

  71. P(mosquito|death)
    P(death|mosquito)
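The two conditional probabilities on slide 71 are easy to confuse, so a numeric sketch may help. The counts below are invented for a made-up population and do not come from the tweet or from any real dataset; "mosquito" stands for "was bitten" and "death" for "died".

```python
# Invented counts for a made-up population.
population = 1_000_000
bitten = 800_000            # people bitten by a mosquito
deaths = 100                # people who died
bitten_and_died = 80        # people who were bitten and died

# Of those who died, how many had been bitten?
p_mosquito_given_death = bitten_and_died / deaths

# Of those who were bitten, how many died?
p_death_given_mosquito = bitten_and_died / bitten

# Bayes' theorem links the two: P(D|M) = P(M|D) * P(D) / P(M).
p_death = deaths / population
p_mosquito = bitten / population
via_bayes = p_mosquito_given_death * p_death / p_mosquito

print(f"P(mosquito|death) = {p_mosquito_given_death:.2f}")
print(f"P(death|mosquito) = {p_death_given_mosquito:.6f}")
print(f"P(death|mosquito) via Bayes = {via_bayes:.6f}")
```

In this toy population the first probability is large and the second is tiny; quoting one as if it were the other changes the message completely.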

  72. SUMMARY

  73. “Everybody lies” (Dr. House)

  74. • Good Science™ vs. Big headlines
    • Nobody is immune
    • Ask questions:
      What is the context?
      Who’s paying?
      What’s missing?
    • … “so what?”

  75. THANK YOU
    @MarcoBonzanini
    @PyDataLondon

  76. PyData London Conference
    12-14 July 2019
    @PyDataLondon