Big data and reproducibility
Jeff L.
June 28, 2014
Science
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = sample size
N = ($ you have) / ($ per sample)
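Reading the slide as dollars available divided by cost per sample, the relationship can be sketched in a few lines of Python. The function name and the dollar figures below are hypothetical and chosen only for illustration; they are not from the talk.

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample: float) -> int:
    """Return N = ($ you have) / ($ per sample), rounded down to whole samples."""
    if budget_dollars < 0:
        raise ValueError("budget_dollars must be non-negative")
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical fixed budget: as the per-sample cost drops, N grows.
budget = 100_000
for cost in (10_000, 1_000, 100):
    print(f"${cost} per sample -> N = {affordable_sample_size(budget, cost)}")
```

With the budget held fixed, every tenfold drop in per-sample cost buys ten times as many samples, which is the trend the sample-size slides that follow describe.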
[Figure: $ per (human) genome, by year]
RNA-seq sample sizes: 2008, N≈2; 2010, N≈70; 2013, N≈900 (PMIDs: 19056941, 20220758, 24092820)
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
What went wrong? 2 things
What went wrong? Transparency: the data/code weren’t reproducible
What went wrong? Transparency: there was a lack of cooperation
What went wrong? Expertise: they used silly prediction rules (Pr(FEC) = 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
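One way to see why this rule is a poor probability model is to evaluate it at the extremes. The sketch below only transcribes the formula quoted on the slide; the function name `pr_fec` and the test inputs are illustrative assumptions, not part of the original analysis.

```python
def pr_fec(pr_f: float, pr_e: float, pr_c: float) -> float:
    """Combination rule as quoted on the slide: 5/8 * [Pr(F) + Pr(E) + Pr(C)] - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


def is_valid_probability(p: float) -> bool:
    """A probability must lie in the closed interval [0, 1]."""
    return 0.0 <= p <= 1.0


# Illustrative inputs: each individual probability is itself valid.
for inputs in [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]:
    value = pr_fec(*inputs)
    print(inputs, value, "valid" if is_valid_probability(value) else "not a probability")
```

When all three inputs are 0 the rule returns -0.25, and when all three are 1 it returns 1.625, so valid inputs can produce an output that is not a probability at all.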
What went wrong? Expertise: they had study design problems (batch effects)
What went wrong? Expertise: their predictions weren’t locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1
At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who reproduces research?
[Diagram. Parties: original investigator, reproducers, scientists, general public. Statements: "The truth is A", "The truth is B", "The truth is not A", "I don’t care", "???". Slide courtesy R. Peng]
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is data analysis?
Raw data → Cleaning / validation → Pre-processing → Exploratory data analysis → Statistical model development → Sensitivity analysis → Finalize results / report
Statistics! (Slide courtesy R. Peng)
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
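The slide gives only this title, but shoe size and literacy is the classic textbook illustration of confounding: among children, both grow with age, so they are correlated even though neither causes the other. The simulation below is a minimal sketch under that assumption; the age range, effect sizes, noise levels, and function names are all invented for illustration and are not data from the talk.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_children(n, seed=0):
    """Simulate (age, shoe_size, literacy); both outcomes depend only on age plus noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rows = []
    for _ in range(n):
        age = rng.randint(5, 12)               # hypothetical age range in years
        shoe = age + rng.gauss(0, 1)           # hypothetical shoe-size scale
        literacy = 10 * age + rng.gauss(0, 10) # hypothetical literacy score
        rows.append((age, shoe, literacy))
    return rows


children = simulate_children(5000)

# Correlation across all ages: driven by the shared dependence on age.
overall = pearson([r[1] for r in children], [r[2] for r in children])

# Correlation within a single age group: the confounder is held fixed.
age_8 = [r for r in children if r[0] == 8]
within = pearson([r[1] for r in age_8], [r[2] for r in age_8])

print(f"all ages:   r = {overall:.2f}")
print(f"age 8 only: r = {within:.2f}")
```

Pooling all ages gives a strong positive correlation, while restricting to one age group leaves a correlation close to zero. Stratifying on a suspected confounder is a traditional epidemiology idea that applies regardless of how many rows the data set has.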
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, 1 month long, every month
[Figure: cumulative enrollment]
jtleek.com/talks