Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
760
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
創薬における機械学習技術について
kanojikajino
16
5.3k
Healthcare Innovation through Business Entrepreneurship
clintwinters
0
220
Collective Predictive Coding Hypothesis and Beyond (@Japanese Association for Philosophy of Science, 26th October 2024)
tanichu
0
130
Explanatory material
yuki1986
0
300
Online Feedback Optimization
floriandoerfler
0
1.9k
MoveItを使った産業用ロボット向け動作作成方法の紹介 / Introduction to creating motion for industrial robots using MoveIt
ry0_ka
0
480
[第62回 CV勉強会@関東] Long-CLIP: Unlocking the Long-Text Capability of CLIP / kantoCV 62th ECCV 2024
lychee1223
1
940
Transport information Geometry: Current and Future II
lwc2017
0
150
データベース08: 実体関連モデルとは?
trycycle
PRO
0
660
LayerXにおける業務の完全自動運転化に向けたAI技術活用事例 / layerx-ai-jsai2025
shimacos
1
1k
IWASAKI Hideo
genomethica
0
100
機械学習 - DBSCAN
trycycle
PRO
0
880
Featured
See All Featured
Build The Right Thing And Hit Your Dates
maggiecrowley
36
2.7k
VelocityConf: Rendering Performance Case Studies
addyosmani
329
24k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
181
53k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
43
2.4k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
4
170
Code Review Best Practice
trishagee
68
18k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.5k
Bash Introduction
62gerente
614
210k
[RailsConf 2023] Rails as a piece of cake
palkan
55
5.6k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.8k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
8
780
Rails Girls Zürich Keynote
gr2m
94
14k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks