Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
760
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
学術講演会中央大学学員会府中支部
tagtag
0
300
02_西村訓弘_プログラムディレクター_人口減少を機にひらく未来社会.pdf
sip3ristex
0
540
生成検索エンジン最適化に関する研究の紹介
ynakano
2
1.2k
3次元点群を利用した植物の葉の自動セグメンテーションについて
kentaitakura
2
1.3k
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
140
SpatialBiologyWestCoastUS2024
lcolladotor
0
150
Trend Classification of InSAR Displacement Time Series Using SAE–CNN
satai
3
510
IWASAKI Hideo
genomethica
0
120
SciPyDataJapan 2025
schwalbe10
0
250
地表面抽出の方法であるSMRFについて紹介
kentaitakura
1
780
Ignite の1年間の軌跡
ktombow
0
140
オンプレミス環境にKubernetesを構築する
koukimiura
0
300
Featured
See All Featured
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
35
2.5k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
The Pragmatic Product Professional
lauravandoore
36
6.8k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.6k
Mobile First: as difficult as doing things right
swwweet
223
9.8k
Done Done
chrislema
185
16k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
34
5.9k
How to train your dragon (web standard)
notwaldorf
96
6.1k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.6k
The Power of CSS Pseudo Elements
geoffreycrofte
77
5.9k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
357
30k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.9k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks