Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
740
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
インフラだけではない MLOps の話 @事例でわかるMLOps 機械学習の成果をスケールさせる処方箋 発売記念
icoxfog417
PRO
2
720
多次元展開法を用いた 多値バイクラスタリング モデルの提案
kosugitti
0
250
統計的因果探索: 背景知識とデータにより因果仮説を探索する
sshimizu2006
3
700
SciPyDataJapan 2025
schwalbe10
0
140
大規模言語モデルの論理構造の把握能力と予測モデルの生成
fuyu_quant0
0
110
Valuable Lessons Learned on Kaggle’s ARC AGI LLM Challenge (PyDataGlobal 2024)
ianozsvald
0
250
04_石井クンツ昌子_お茶の水女子大学理事_副学長_D_I社会実現へ向けて.pdf
sip3ristex
0
210
[第62回 CV勉強会@関東] Long-CLIP: Unlocking the Long-Text Capability of CLIP / kantoCV 62th ECCV 2024
lychee1223
1
870
ほたるのひかり/RayTracingCamp10
kugimasa
1
550
構造設計のための3D生成AI-最新の取り組みと今後の展開-
kojinishiguchi
0
920
非同期コミュニケーションの構造 -チャットツールを用いた組織における情報の流れの設計について-
koisono
0
220
Spectral Sparsification of Hypergraphs
tasusu
0
260
Featured
See All Featured
Understanding Cognitive Biases in Performance Measurement
bluesmoon
27
1.6k
Large-scale JavaScript Application Architecture
addyosmani
511
110k
Reflections from 52 weeks, 52 projects
jeffersonlam
348
20k
Speed Design
sergeychernyshev
28
830
ReactJS: Keep Simple. Everything can be a component!
pedronauck
666
120k
Faster Mobile Websites
deanohume
306
31k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.2k
Art, The Web, and Tiny UX
lynnandtonic
298
20k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
Why You Should Never Use an ORM
jnunemaker
PRO
55
9.2k
Code Review Best Practice
trishagee
67
18k
Scaling GitHub
holman
459
140k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks