Statistical Thinking for Data Science
Chris Fonnesbeck
February 08, 2015
PyTennessee 2015 Keynote Address
Transcript
Statistical Thinking for Data Science
Chris Fonnesbeck, Vanderbilt University
21/22 falling 7+ stories survived
2 fell together
40% at night
“Even more surprising, the longer the fall, the greater the chance of survival.”
2 to 32 stories (average = 5.5)
?
"... 132 such victims were admitted to the Animal Medical Center on 62nd Street in Manhattan ..."
"Found" Data
convenience sample
Missing Data
Representative
Statistical Issues
Big Data
“With enough data, the numbers speak for themselves” -- Chris Anderson, Wired
Alfred Landon
Literary Digest Straw Poll
"Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totalled."
2.4 million returns
41 - 55
George Gallup
Sampled 50,000
66%
Random Sampling
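The contrast between the Literary Digest poll and Gallup's much smaller sample can be illustrated with a short simulation. This is only a sketch under made-up assumptions: the electorate size, the 60% true support, and the response probabilities are hypothetical and are not historical figures.

```python
import numpy as np

rng = np.random.default_rng(1936)  # seed chosen only for reproducibility

# Hypothetical electorate: True = supports candidate A, False = candidate B.
population_size = 1_000_000
true_support = 0.60  # assumed value, not a historical figure
population = rng.random(population_size) < true_support

# Self-selected "big" sample: supporters of A respond half as often
# as supporters of B (assumed response rates of 10% and 20%).
response_prob = np.where(population, 0.10, 0.20)
responded = rng.random(population_size) < response_prob
biased_estimate = population[responded].mean()

# Small simple random sample, drawn without replacement.
random_sample = rng.choice(population, size=5_000, replace=False)
random_estimate = random_sample.mean()

print(f"true support:              {population.mean():.3f}")
print(f"self-selected (n={responded.sum()}): {biased_estimate:.3f}")
print(f"random sample (n=5000):    {random_estimate:.3f}")
```

Under these assumptions the self-selected sample is far larger yet lands well away from the true value, while the small random sample stays close to it. More data does not reduce the error when the selection mechanism is biased.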
Bias
Self-selection Bias
For some estimate $\hat{\theta}$ of an unknown quantity $\theta$, $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$
import numpy as np

p = 0.5  # probability that a negative value goes unobserved
sample_sizes = [10, 100, 1000, 10000, 100000]
replicates = 1000
biases = []
for n in sample_sizes:
    bias = np.empty(replicates)
    for i in range(replicates):
        true_sample = np.random.normal(size=n)
        negative_values = true_sample < 0
        missing = np.random.binomial(1, p, n).astype(bool)
        # Keep everything except negative values flagged as missing
        observed_sample = true_sample[~(negative_values & missing)]
        # The true mean is 0, so the observed mean is the bias
        bias[i] = observed_sample.mean()
    biases.append(bias)
Accuracy: Mean Squared Error
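Mean squared error combines the two ways an estimator can be wrong: MSE = bias² + variance. The sketch below checks this identity numerically for two estimators of a variance. The true variance, sample size, and number of replicates are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

true_variance = 4.0   # assumed true value of the quantity being estimated
n = 10                # observations per simulated data set
replicates = 100_000  # number of simulated data sets

samples = rng.normal(0.0, np.sqrt(true_variance), size=(replicates, n))

def summarize(estimates, truth):
    """Return (bias, variance, mean squared error) of a set of estimates."""
    bias = estimates.mean() - truth
    variance = estimates.var()
    mse = np.mean((estimates - truth) ** 2)
    return bias, variance, mse

results = {}
for label, ddof in [("divide by n", 0), ("divide by n - 1", 1)]:
    estimates = samples.var(axis=1, ddof=ddof)
    results[label] = summarize(estimates, true_variance)
    bias, variance, mse = results[label]
    print(f"{label:16s} bias={bias:+.3f} variance={variance:.3f} "
          f"mse={mse:.3f} bias^2+variance={bias**2 + variance:.3f}")
```

In this setup the estimator that divides by n is biased but has the smaller mean squared error, which is why accuracy is judged on both components rather than on bias alone.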
“The numbers have no way of speaking for themselves” -- Nate Silver
White House Big Data Partners Workshop: 19 participants, 0 statisticians
NSF Working Group on Big Data: 100 experts convened, 0 statisticians
Moore Foundation Data Science Environments: 0 directors with statistical expertise
NIH BD2K Executive Committee: 17 committee members, 0 statisticians
Feeling left out?
It's our own fault
“Almost everything you learned in your college statistics course was wrong”
Typical introductory statistics syllabus
1. Descriptive statistics and plotting
2. Basic probability
3. Hypothesis testing
4. Experimental design
5. ANOVA
Statistical Hypothesis Testing
Test Statistic
T-statistic
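The formula on this slide did not survive in the transcript. As one common instance of a test statistic, the sketch below computes the one-sample Student t-statistic, t = (x̄ − μ₀) / (s / √n). The choice of the one-sample form, the function name, and the data are assumptions made for illustration.

```python
import numpy as np

def one_sample_t(x, mu0=0.0):
    """One-sample t-statistic for the null hypothesis that the mean is mu0."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    standard_error = x.std(ddof=1) / np.sqrt(n)
    return (x.mean() - mu0) / standard_error

# Hypothetical measurements, tested against a null mean of 5.0
data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
t = one_sample_t(data, mu0=5.0)
print(f"t = {t:.3f} on {len(data) - 1} degrees of freedom")
```

The statistic measures how many standard errors the sample mean lies from the hypothesized value.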
p-value
false positive rate
"The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not." R.A. Fisher
p-value
the probability that the observed differences are due to chance
a measure of the reliability of the result
the probability that the null hypothesis is true
"If an experiment were repeated infinitely, p represents the proportion of values more extreme than the observed value, given that the null hypothesis is true."
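The definition above can be approximated directly by repeating the experiment many times under the null hypothesis. The sketch below does this with a permutation test for a difference in group means. It is one possible illustration, not a method taken from the talk, and the function name and data are hypothetical. It counts values at least as extreme as the observed one, a common convention.

```python
import numpy as np

rng = np.random.default_rng(7)

def permutation_p_value(a, b, n_permutations=10_000, rng=rng):
    """Two-sided permutation p-value for a difference in means."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            count += 1
    # Proportion of null replicates at least as extreme as the observation
    return count / n_permutations

# Hypothetical measurements for two groups
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [11.0, 12.0, 13.0, 14.0, 15.0]
print(permutation_p_value(group_a, group_b))
```

The result is a statement about the data under the null hypothesis. It says nothing about the probability that the null hypothesis itself is true.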
H0 : Mean duckling body mass did not differ among years.
H0 : The prevalence of autism spectrum disorder for males and females were equal.
H0 : The density of large trees in logged and unlogged forest stands were equal
Statistical Straw Man
Statistical hypotheses are not interesting
Hypothesis tests are not decision support tools
Multiple Comparisons
Family-wise Error Rate
>>> 1. - (1. - 0.05) ** 20
0.6415140775914581
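The calculation above gives the chance of at least one false positive across 20 independent tests at the 0.05 level. The sketch below checks that figure by simulation and shows the effect of a Bonferroni correction. The correction is used here only as a familiar example, and the number of simulated experiments is arbitrary.

```python
import numpy as np

def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive in n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

rng = np.random.default_rng(20)
n_tests, alpha, n_experiments = 20, 0.05, 50_000

# When the null hypothesis is true, p-values are uniform on [0, 1]
p_values = rng.random((n_experiments, n_tests))

# Fraction of experiments with at least one "significant" result
simulated = np.mean((p_values < alpha).any(axis=1))

# Bonferroni: test each hypothesis at alpha / n_tests instead
bonferroni = np.mean((p_values < alpha / n_tests).any(axis=1))

print(f"analytic FWER:   {family_wise_error_rate(n_tests, alpha):.4f}")
print(f"simulated FWER:  {simulated:.4f}")
print(f"with Bonferroni: {bonferroni:.4f}")
```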
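import numpy as np
import pandas as pd
import seaborn as sb

n = 20  # observations per replicate
r = 36  # number of replicates
# x and y are generated independently, so any apparent trend is noise
df = pd.concat([pd.DataFrame({'y': np.random.normal(size=n),
                              'x': np.random.random(n),
                              'replicate': [i] * n})
                for i in range(r)])
sb.lmplot(data=df, x='x', y='y', col='replicate', col_wrap=6)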
Statistically Significant!
"Despite a large statistical literature for multiple testing corrections, usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding."
What's the Alternative?
Build models and use them to estimate things we care about
Effect size estimation
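Estimating an effect size means reporting how large a difference is, with its uncertainty, rather than only whether it differs from zero. The sketch below uses a percentile bootstrap interval for a difference in means. The bootstrap is a stand-in chosen for brevity, not necessarily the approach used in the talk, and the data are simulated under assumed values.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: the treated group is shifted upward by 1 unit
control = rng.normal(10.0, 2.0, 200)
treated = rng.normal(11.0, 2.0, 200)

def bootstrap_interval(a, b, n_boot=5_000, rng=rng):
    """95% percentile bootstrap interval for mean(b) - mean(a)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and recompute the difference
        diffs[i] = rng.choice(b, b.size).mean() - rng.choice(a, a.size).mean()
    return np.percentile(diffs, [2.5, 97.5])

estimate = treated.mean() - control.mean()
lower, upper = bootstrap_interval(control, treated)
print(f"estimated effect: {estimate:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```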
Data-generating Model
Florida manatee (Trichechus manatus)
occupied? available? seen?
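The three questions above describe a chain of conditions: an animal is recorded at a site only if the site is occupied, the animal is available to be detected, and the observer actually sees it. The sketch below simulates that chain. All three probabilities and the number of sites are hypothetical, and the two detection probabilities are treated as known, which a real analysis would have to estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

n_sites = 5_000      # hypothetical number of surveyed sites
p_occupied = 0.5     # hypothetical probability that a site is occupied
p_available = 0.7    # hypothetical P(available | occupied)
p_seen = 0.6         # hypothetical P(seen | available)

# Each stage can only succeed if the previous stage did
occupied = rng.random(n_sites) < p_occupied
available = occupied & (rng.random(n_sites) < p_available)
seen = available & (rng.random(n_sites) < p_seen)

# Raw detections understate occupancy by a factor of p_available * p_seen
naive_occupancy = seen.mean()
corrected_occupancy = seen.mean() / (p_available * p_seen)

print(f"true occupancy:      {occupied.mean():.3f}")
print(f"naive estimate:      {naive_occupancy:.3f}")
print(f"corrected estimate:  {corrected_occupancy:.3f}")
```

Modeling the data-generating process makes the gap between what was counted and what is present explicit, so it can be corrected rather than ignored.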
Estimating visibility
Bayesian Statistics
Bayes' Formula
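Bayes' formula updates the probability of a hypothesis H after seeing data D: P(H | D) = P(D | H) P(H) / P(D). The worked example below applies it to a binary hypothesis. The function name and the numbers are illustrative assumptions and do not come from the talk.

```python
def posterior_probability(prior, likelihood_if_true, likelihood_if_false):
    """Posterior P(H | D) for a binary hypothesis, via Bayes' formula."""
    # P(D): total probability of the data under both possibilities
    evidence = prior * likelihood_if_true + (1.0 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical screening test: 1% prevalence, 95% sensitivity,
# 5% false positive rate
posterior = posterior_probability(
    prior=0.01, likelihood_if_true=0.95, likelihood_if_false=0.05
)
print(f"P(condition | positive test) = {posterior:.3f}")
```

With these assumed numbers, most positive results come from the much larger group without the condition, so the posterior stays low despite an apparently accurate test.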
Probabilistic Modeling
Evidence-based Medicine
ASD Interventions Research: 19 independent studies, 27 different interventions
“While everyone is looking at the polls and the storm, Romney’s slipping into the presidency.”
Hierarchical modeling
Pollster effects
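A hierarchical model treats each pollster's systematic lean as a draw from a shared distribution, so pollsters with few polls borrow strength from the rest. The sketch below shows the resulting partial pooling in the simplest normal-normal case. The poll counts and both standard deviations are hypothetical and treated as known. A full hierarchical model would estimate them from the data, typically with a probabilistic programming library.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical setup: 8 pollsters with very different numbers of polls
polls_per_pollster = np.array([2, 3, 5, 8, 12, 20, 30, 40])
tau = 1.0    # assumed sd of true pollster effects (percentage points)
sigma = 3.0  # assumed sampling sd of a single poll (percentage points)

true_effects = rng.normal(0.0, tau, polls_per_pollster.size)

# Raw estimate: each pollster's average deviation from the consensus
standard_errors = sigma / np.sqrt(polls_per_pollster)
raw_estimates = true_effects + rng.normal(0.0, standard_errors)

# Posterior mean when effect ~ Normal(0, tau^2) and
# raw_estimate ~ Normal(effect, standard_error^2)
shrinkage = tau**2 / (tau**2 + standard_errors**2)
pooled_estimates = shrinkage * raw_estimates

for n, raw, pooled, w in zip(polls_per_pollster, raw_estimates,
                             pooled_estimates, shrinkage):
    print(f"polls={n:2d} raw={raw:+.2f} pooled={pooled:+.2f} weight={w:.2f}")
```

Pollsters with few polls are pulled strongly toward zero, while those with many polls keep most of their raw estimate.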
Data Science
Data
Science
Those who ignore statistics are condemned to re-invent it. -- Brad Efron