Statistical Thinking for Data Science

Chris Fonnesbeck
February 08, 2015

PyTennessee 2015 Keynote Address

Transcript

  1. Statistical Thinking for Data Science, Chris Fonnesbeck, Vanderbilt University

  4. 21/22 falling 7+ stories survived

  5. 2 fell together

  6. 40% at night

  7. “Even more surprising, the longer the fall, the greater the chance of survival.”
  8. 2 to 32 stories (average = 5.5)

  9. ?

  10. "... 132 such victims were admitted to the Animal Medical Center on 62nd Street in Manhattan ..."
  11. "Found" Data

  12. convenience sample

  13. Missing Data

  14. Representative

  15. Statistical Issues

  16. Big Data

  17. “With enough data, the numbers speak for themselves” Chris Anderson, Wired
  18. Alfred Landon

  19. Literary Digest Straw Poll

  20. "Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totalled."
  21. 2.4 million returns

  22. 41 - 55

  24. George Gallup

  25. Sampled 50,000

  26. 66%

  27. Random Sampling
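
The contrast between a very large self-selected poll and a small random sample can be sketched with a toy simulation. Everything here is an assumption made for illustration: the population size, the true support of 60%, and the response rates are hypothetical and are not the 1936 poll data.

```python
import numpy as np


def compare_polls(true_support=0.6, population=1_000_000,
                  random_n=5_000, seed=0):
    """Compare a small random sample with a large self-selected one.

    All parameters are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # 1 = supports candidate A, 0 = supports candidate B
    votes = (rng.random(population) < true_support).astype(int)

    # Small simple random sample: every voter is equally likely to be drawn.
    random_sample = rng.choice(votes, size=random_n, replace=False)

    # Large self-selected sample: supporters of B respond twice as often.
    response_prob = np.where(votes == 1, 0.10, 0.20)
    responded = rng.random(population) < response_prob
    biased_sample = votes[responded]

    return random_sample.mean(), biased_sample.mean(), biased_sample.size


random_est, biased_est, biased_n = compare_polls()
print(f"random sample (n=5000): {random_est:.3f}")
print(f"self-selected sample (n={biased_n}): {biased_est:.3f}")
```

Under these assumptions the self-selected sample is far larger yet lands well below the true 60%, while the small random sample stays close to it.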

  29. Bias

  32. Self-selection Bias

  34. For some estimate $\hat{\theta}$ of unknown quantity $\theta$, the bias is $E[\hat{\theta}] - \theta$

  35. import numpy as np

      p = 0.5  # probability that a negative value goes missing
      sample_sizes = [10, 100, 1000, 10000, 100000]
      replicates = 1000
      biases = []
      for n in sample_sizes:
          bias = np.empty(replicates)
          for i in range(replicates):
              true_sample = np.random.normal(size=n)
              negative_values = true_sample < 0
              missing = np.random.binomial(1, p, n).astype(bool)
              # drop only the values that are both negative and flagged missing
              observed_sample = true_sample[~(negative_values & missing)]
              # the true mean is 0, so the observed mean is the bias
              bias[i] = observed_sample.mean()
          biases.append(bias)
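
The simulation above can be summarised to show the point it appears to be making: the spread of the estimate shrinks as the sample grows, but the bias does not. The sketch below is a compact variant under the same assumptions (standard normal data, negative values dropped with probability 0.5); the function name `simulate_bias` and the reduced replicate count are choices made here, not part of the slide.

```python
import numpy as np


def simulate_bias(n, p=0.5, replicates=200, seed=0):
    """Return the observed sample means when negatives go missing w.p. p."""
    rng = np.random.default_rng(seed)
    means = np.empty(replicates)
    for i in range(replicates):
        true_sample = rng.normal(size=n)
        missing = rng.random(n) < p
        observed = true_sample[~((true_sample < 0) & missing)]
        means[i] = observed.mean()
    return means


summary = {}
for n in [100, 1000, 10000]:
    means = simulate_bias(n)
    # True mean is 0, so the average observed mean is the estimated bias.
    summary[n] = (means.mean(), means.std())
    print(f"n={n:>6}  bias={means.mean():.3f}  sd={means.std():.3f}")
```

For this missingness mechanism the expected observed mean works out to about 0.27 at every sample size, so collecting more data narrows the estimate around the wrong value.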
  37. Accuracy Mean Squared Error
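
For reference, the standard decomposition of mean squared error ties accuracy to both variance and bias. Here $\hat{\theta}$ denotes an estimate and $\theta$ the unknown quantity it targets; this is the textbook identity, not a formula recovered from the slide.

```latex
\mathrm{MSE}(\hat{\theta})
  = E\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big(E[\hat{\theta}] - \theta\big)^2
```

A larger sample reduces the variance term, but the squared-bias term is untouched by sample size.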

  38. “The numbers have no way of speaking for themselves” Nate Silver
  39. White House Big Data Partners Workshop

  40. White House Big Data Partners Workshop: 19 participants, 0 statisticians

  41. NSF Working Group on Big Data

  42. NSF Working Group on Big Data: 100 experts convened, 0 statisticians
  43. Moore Foundation Data Science Environments

  44. Moore Foundation Data Science Environments: 0 directors with statistical expertise

  45. NIH BD2K Executive Committee

  46. NIH BD2K Executive Committee: 17 committee members, 0 statisticians

  47. Feeling left out?

  48. It's our own fault

  49. “Almost everything you learned in your college statistics course was wrong”
  54. Typical introductory statistics syllabus
      1. Descriptive statistics and plotting
      2. Basic probability
      3. Hypothesis testing
      4. Experimental design
      5. ANOVA
  55. Statistical Hypothesis Testing

  58. Test Statistic

  59. T-statistic

  63. p-value

  66. false positive rate

  67. "The value for which $P = 0.05$, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not." R.A. Fisher
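
The link between 1.96 and "1 in 20" can be checked directly, assuming the deviation in question is a standard normal variate and the test is two-sided. The helper name `two_sided_normal_p` is introduced here for illustration.

```python
import math


def two_sided_normal_p(z):
    """P(|Z| > z) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


p_at_196 = two_sided_normal_p(1.96)
print(f"P(|Z| > 1.96) = {p_at_196:.4f}")
```

The result is very close to 0.05, which is where the conventional threshold comes from.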
  68. p-value

  69. the probability that the observed differences are due to chance

  71. a measure of the reliability of the result

  73. the probability that the null hypothesis is true

  75. "If an experiment were repeated infinitely, p represents the proportion of values more extreme than the observed value, given that the null hypothesis is true."
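
That definition can be made concrete with a permutation test, which repeats the experiment many times under the null hypothesis by shuffling group labels. This is a minimal sketch, not a method taken from the talk: the function name, the choice of a difference in means as the test statistic, and the simulated data are all assumptions. It counts values at least as extreme as the observed one.

```python
import numpy as np


def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Under the null, group labels are exchangeable, so shuffle them.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm


rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, size=30)  # hypothetical data, same distribution
group_b = rng.normal(loc=0.0, size=30)
p_null = permutation_p_value(group_a, group_b)
print(f"p-value when the null is true: {p_null:.3f}")
```

Note what the number is not: it says nothing about the probability that the null hypothesis is true, only how often data this extreme would arise if it were.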
  76. H0: Mean duckling body mass did not differ among years.
  78. H0: The prevalence of autism spectrum disorder for males and females were equal.
  80. H0: The density of large trees in logged and unlogged forest stands were equal
  82. Statistical Straw Man

  83. Statistical hypotheses are not interesting

  84. Hypothesis tests are not decision support tools

  85. Multiple Comparisons

  87. Family-wise Error Rate
      >>> 1. - (1. - 0.05) ** 20
      0.6415140775914581
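
The calculation above generalises to any number of tests. The sketch below assumes the tests are independent, which real analyses rarely guarantee, and adds a Bonferroni-adjusted threshold as one common correction; the talk itself does not name a specific correction, so treat that part as an illustrative addition.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


alpha = 0.05
m = 20
uncorrected = family_wise_error_rate(alpha, m)
# Bonferroni: test each hypothesis at alpha / m instead of alpha.
bonferroni = family_wise_error_rate(alpha / m, m)

print(f"uncorrected: {uncorrected:.4f}")
print(f"Bonferroni:  {bonferroni:.4f}")
```

With the adjusted threshold the family-wise rate drops back below 0.05, at the cost of less power for each individual test.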
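  88. import numpy as np
      import pandas as pd
      import seaborn as sb

      n = 20  # points per panel
      r = 36  # number of pure-noise replicates
      # x and y are drawn independently, so any fitted slope is noise
      df = pd.concat([pd.DataFrame({'y': np.random.normal(size=n),
                                    'x': np.random.random(n),
                                    'replicate': [i] * n})
                      for i in range(r)])
      sb.lmplot(x='x', y='y', data=df, col='replicate', col_wrap=6)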
  90. Statistically Significant!
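
How often pure noise produces a "significant" slope can be estimated by simulation. This sketch reuses the setup of 20 points per regression with independent x and y, and tests the correlation with a t-statistic on 18 degrees of freedom; the critical value 2.101 is the two-sided 5% point for that distribution. The function name and the number of replicates are assumptions made here.

```python
import numpy as np


def false_positive_fraction(replicates=20_000, n=20, t_crit=2.101, seed=1):
    """Fraction of pure-noise regressions whose slope looks significant."""
    rng = np.random.default_rng(seed)
    x = rng.random((replicates, n))
    y = rng.normal(size=(replicates, n))  # independent of x by construction
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    # Row-wise Pearson correlation between x and y.
    r = (xc * yc).sum(axis=1) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    # t-statistic for testing zero correlation, with n - 2 degrees of freedom.
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return float(np.mean(np.abs(t) > t_crit))


fraction = false_positive_fraction()
print(f"fraction significant at 0.05: {fraction:.3f}")
```

Roughly one regression in twenty clears the threshold even though no relationship exists, so a grid of 36 noise panels will usually contain at least one.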

  92. "Despite a large statistical literature for multiple testing corrections, usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding."
  93. What's the Alternative?

  94. Build models and use them to estimate things we care about
  95. Effect size estimation
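
One way to report an effect size with its uncertainty, rather than a bare p-value, is a bootstrap interval for a difference in means. This is a minimal sketch under stated assumptions: the data are simulated, the function name is hypothetical, and the percentile bootstrap is just one simple choice, not necessarily the approach used in the talk.

```python
import numpy as np


def bootstrap_mean_difference(a, b, n_boot=5_000, level=0.95, seed=0):
    """Point estimate and percentile bootstrap interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and recompute the difference.
        resample_a = rng.choice(a, size=a.size, replace=True)
        resample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = resample_a.mean() - resample_b.mean()
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(diffs, [tail, 1.0 - tail])
    return a.mean() - b.mean(), lower, upper


rng = np.random.default_rng(1)
treated = rng.normal(loc=1.0, scale=2.0, size=50)  # hypothetical data
control = rng.normal(loc=0.0, scale=2.0, size=50)
estimate, lower, upper = bootstrap_mean_difference(treated, control)
print(f"difference = {estimate:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

The output answers "how big is the difference, and how sure are we?" instead of "is it zero?".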

  96. Data-generating Model

  99. Florida manatee Trichechus manatus

  103. occupied?

  104. occupied? available?

  105. occupied? available? seen?

  107. Estimating visibility
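
The idea that a raw count understates the population when animals may be unavailable or unseen can be illustrated with a toy simulation. This is not the model from the talk, whose details are not in the transcript: the population size, the availability and detection probabilities, and the simple correction are all hypothetical, and the sketch treats those probabilities as known when in practice they must be estimated.

```python
import numpy as np


def simulate_survey(true_n=500, p_available=0.6, p_seen=0.7,
                    surveys=2_000, seed=0):
    """Average raw and visibility-corrected counts over repeated surveys."""
    rng = np.random.default_rng(seed)
    # Each animal is available (e.g. near the surface) with p_available.
    available = rng.binomial(true_n, p_available, size=surveys)
    # Each available animal is then seen with probability p_seen.
    counted = rng.binomial(available, p_seen)
    # Correct the raw count by the overall probability of being counted.
    corrected = counted / (p_available * p_seen)
    return counted.mean(), corrected.mean()


raw_mean, corrected_mean = simulate_survey()
print(f"mean raw count:       {raw_mean:.1f}")
print(f"mean corrected count: {corrected_mean:.1f}")
```

Under these assumptions the raw count averages well under half of the true 500, while the corrected count recovers it.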

  115. Bayesian Statistics

  118. Bayes' Formula
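
The formula itself did not survive in the transcript. In its usual form, with $\theta$ standing for the unknown quantities and $y$ for the observed data (notation assumed here), it reads:

```latex
\Pr(\theta \mid y) = \frac{\Pr(y \mid \theta)\,\Pr(\theta)}{\Pr(y)}
```

The posterior $\Pr(\theta \mid y)$ combines the likelihood $\Pr(y \mid \theta)$ with the prior $\Pr(\theta)$, normalised by the marginal probability of the data.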

  119. Probabilistic Modeling

  120. Evidence-based Medicine

  121. ASD Interventions Research: 19 independent studies, 27 different interventions

  133. “While everyone is looking at the polls and the storm, Romney’s slipping into the presidency.”
  135. Hierarchical modeling

  136. Pollster effects

  141. Data Science

  142. Data

  143. Science

  144. Those who ignore statistics are condemned to re-invent it. -- Brad Efron