Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
720
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
Science of Scienceおよび科学計量学に関する研究論文の俯瞰可視化_ポスター版
hayataka88
0
120
Raccoon Roundworm
uni_of_nomi
0
160
ベイズ最適化をゼロから
brainpadpr
2
740
機械学習を支える連続最適化
nearme_tech
PRO
1
120
LIMEを用いた判断根拠の可視化
kentaitakura
0
320
解説!データ基盤の進化を後押しする手順とタイミング
shomaekawa
1
330
All-in-One Bioinformatics Platform Realized with Snowflake ~ From In Silico Drug Discovery, Disease Variant Analysis, to Single-Cell RNA-seq
ktatsuya
0
220
Презентация программы бакалавриата СПбГУ "Искусственный интеллект и наука о данных"
dscs
0
690
Презентация программы магистратуры СПбГУ "Искусственный интеллект и наука о данных"
dscs
0
370
第61回コンピュータビジョン勉強会「BioCLIP: A Vision Foundation Model for the Tree of Life」
x_ttyszk
1
1.5k
Direct Preference Optimization
zchenry
0
270
The Incredible Machine: Developer Productivity and the Impact of AI
tomzimmermann
0
380
Featured
See All Featured
How to Think Like a Performance Engineer
csswizardry
19
1.1k
Visualization
eitanlees
144
15k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
47
5k
Rebuilding a faster, lazier Slack
samanthasiow
79
8.6k
Adopting Sorbet at Scale
ufuk
73
9k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
364
22k
VelocityConf: Rendering Performance Case Studies
addyosmani
325
24k
The Language of Interfaces
destraynor
154
24k
The World Runs on Bad Software
bkeepers
PRO
65
11k
Bootstrapping a Software Product
garrettdimon
PRO
305
110k
Designing on Purpose - Digital PM Summit 2013
jponch
115
6.9k
The Cult of Friendly URLs
andyhume
78
6k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks