Big data and reproducibility
Jeff L.
June 28, 2014
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
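The budget arithmetic on this slide can be sketched in a few lines. This is only an illustration: the function name, the budget, and the per-sample costs are hypothetical values chosen for the example, not figures from the talk.

```python
def affordable_sample_size(budget_dollars, cost_per_sample_dollars):
    """Return N = ($ you have) / ($ per sample), rounded down to whole samples."""
    if cost_per_sample_dollars <= 0:
        raise ValueError("cost per sample must be positive")
    return int(budget_dollars // cost_per_sample_dollars)


# Hypothetical budget and per-sample costs, for illustration only.
budget = 100_000
for cost in (50_000, 1_500, 100):
    n = affordable_sample_size(budget, cost)
    print(f"${cost} per sample -> N = {n}")
```

As the per-sample cost falls, the same budget buys a larger N, which is the trend the slides that follow describe.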
[Figure: $ per (human) genome, by year]
RNA-seq sample sizes:
2008: N≈2
2010: N≈70
2013: N≈900
PMIDs: 19056941, 20220758, 24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
what went wrong? 2 things
what went wrong? transparency: The data/code weren't reproducible
what went wrong? transparency: There was a lack of cooperation
what went wrong? expertise: They used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
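One way to see why the rule quoted above is a poor probability rule is to plug in extreme inputs. The sketch below evaluates the formula exactly as written on the slide; the function name and the test inputs are illustrative assumptions, not part of the original analysis.

```python
def combination_rule(p_f, p_e, p_c):
    """Evaluate Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4 as quoted on the slide."""
    return 5 / 8 * (p_f + p_e + p_c) - 1 / 4


# Illustrative inputs: each argument is a valid probability in [0, 1].
for probs in [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]:
    value = combination_rule(*probs)
    is_valid = 0.0 <= value <= 1.0
    print(probs, "->", value, "valid probability" if is_valid else "NOT a probability")
```

With all three inputs at 0 the rule returns a negative number, and with all three at 1 it returns a value above 1, so the output is not guaranteed to be a probability at all.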
what went wrong? expertise: They had study design problems (batch effects)
what went wrong? expertise: Their predictions weren't locked down (today: Pr(FEC) = 0.8; tomorrow: Pr(FEC) = 0.1)
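A minimal sketch of what "locking down" a predictor could look like, assuming the instability comes from a random step in model fitting. The talk does not prescribe this approach; the toy model, the data, the seed, and the function names are all hypothetical.

```python
import hashlib
import json
import random


def fit_toy_model(data, seed):
    """Fit a toy model whose only parameter depends on a random subsample."""
    rng = random.Random(seed)  # fixed seed makes the random step repeatable
    subsample = rng.sample(data, k=len(data) // 2)
    return {"threshold": sum(subsample) / len(subsample)}


def fingerprint(model):
    """Return a SHA-256 hash of the model parameters, to record alongside results."""
    payload = json.dumps(model, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical data, for illustration only.
data = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55]

locked_model = fit_toy_model(data, seed=2014)
locked_fingerprint = fingerprint(locked_model)

# Refitting later with the same data and seed should give the same fingerprint.
refit_fingerprint = fingerprint(fit_toy_model(data, seed=2014))
print("prediction rule unchanged:", refit_fingerprint == locked_fingerprint)
```

Recording the fingerprint before the predictor is applied to new samples gives a simple check that today's rule is the same one used tomorrow.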
At the end of the day the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research?
[Diagram, slide courtesy R. Peng. Parties: Original Investigator, Reproducers, Scientists, General Public. Statements shown: "The truth is A", "The truth is B", "The truth is not A", "I don't care", "???"]
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is Data Analysis? (Slide courtesy R. Peng)
Raw data
Cleaning / validation
Pre-processing
Exploratory data analysis
Statistical model development
Sensitivity analysis
Finalize results / report
Statistics!
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
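The shoe size and literacy association is commonly used as a confounding example, with age driving both variables. Assuming that is the point of this slide, the simulation below is a rough sketch: the age range, effect sizes, and noise levels are made-up values, and the helper functions are written out only to keep the example self-contained.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical children aged 5 to 15; age drives both shoe size and literacy.
ages = [rng.uniform(5, 15) for _ in range(2000)]
shoe_sizes = [0.8 * age + rng.gauss(0, 1) for age in ages]
literacy = [10 * age + rng.gauss(0, 10) for age in ages]

raw_corr = pearson(shoe_sizes, literacy)
# Remove the part of each variable explained by age, then correlate what is left.
adjusted_corr = pearson(residuals(shoe_sizes, ages), residuals(literacy, ages))

print(f"shoe size vs literacy, ignoring age: {raw_corr:.2f}")
print(f"shoe size vs literacy, adjusted for age: {adjusted_corr:.2f}")
```

In this simulated setup the raw correlation is strong while the age-adjusted correlation is close to zero, and collecting more observations would not change that conclusion.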
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, 1 month long, every month
[Figure: Cumulative Enrollment]
jtleek.com/talks