Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
720
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
20240420 Global Azure 2024 | Azure Migrate でデータセンターのサーバーを評価&移行してみる
olivia_0707
2
900
機械学習による確率推定とカリブレーション/probabilistic-calibration-on-classification-model
ktgrstsh
2
240
生成AI による論文執筆サポートの手引き(ワークショップ) / A guide to supporting dissertation writing with generative AI (workshop)
ks91
PRO
0
250
いまAI組織が求める企画開発エンジニアとは?
roadroller
2
1.3k
ultraArmをモニター提供してもらった話
miura55
0
190
ABEMAの効果検証事例〜効果の異質性を考える〜
s1ok69oo
4
2.1k
Spectral Sparsification of Hypergraphs
tasusu
0
170
重複排除・高速バックアップ・ランサムウェア対策 三拍子そろったExaGrid × Veeam連携セミナー
climbteam
0
110
Mechanistic Interpretability の紹介
sohtakahashi
0
350
ベイズ最適化をゼロから
brainpadpr
2
810
Introduction to Graph Neural Networks
joisino
PRO
4
2.1k
As We May Interact: Challenges and Opportunities for Next-Generation Human-Information Interaction
signer
PRO
0
160
Featured
See All Featured
Designing the Hi-DPI Web
ddemaree
280
34k
Fontdeck: Realign not Redesign
paulrobertlloyd
82
5.2k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
44
2.2k
jQuery: Nuts, Bolts and Bling
dougneiner
61
7.5k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
364
24k
How To Stay Up To Date on Web Technology
chriscoyier
788
250k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
126
18k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
229
52k
Teambox: Starting and Learning
jrom
133
8.8k
Documentation Writing (for coders)
carmenintech
65
4.4k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
26
1.4k
Fireside Chat
paigeccino
34
3k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks