Big data and reproducibility
Jeff L.
June 28, 2014
Science
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
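A minimal sketch of that relation, reading the slide as sample size = budget divided by cost per sample. The function name and the dollar figures are hypothetical, chosen only to show how N grows as the per-sample cost falls.

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample: float) -> int:
    """Largest whole number of samples the budget covers."""
    if budget_dollars < 0 or cost_per_sample <= 0:
        raise ValueError("budget must be non-negative and cost per sample positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical numbers: the same budget buys more samples as unit cost drops.
budget = 100_000
for cost in (50_000, 1_000, 100):
    print(f"${cost} per sample -> N = {affordable_sample_size(budget, cost)}")
```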
Figure: $ per (human) genome, by year
RNA-seq sample sizes: 2008 N≈2; 2010 N≈70; 2013 N≈900 (PMIDs: 19056941, 20220758, 24092820)
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
What went wrong? 2 things
What went wrong? Transparency: the data/code weren't reproducible
What went wrong? Transparency: there was a lack of cooperation
What went wrong? Expertise: they used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
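One way to see why a rule of the form 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4 is a poor way to combine probabilities: it can return values below 0 or above 1. The sketch below only evaluates the formula as quoted on the slide over a small grid of inputs. The function names and the grid are illustrative assumptions, not part of the original analysis.

```python
from itertools import product


def combined_rule(p_f: float, p_e: float, p_c: float) -> float:
    """The rule as quoted on the slide: 5/8 * (Pr(F) + Pr(E) + Pr(C)) - 1/4."""
    return 5 / 8 * (p_f + p_e + p_c) - 1 / 4


def is_valid_probability(p: float) -> bool:
    """A probability must lie in the closed interval [0, 1]."""
    return 0.0 <= p <= 1.0


# Illustrative grid of component probabilities (an assumption for the demo).
grid = [0.0, 0.5, 1.0]
invalid = [
    (p_f, p_e, p_c, combined_rule(p_f, p_e, p_c))
    for p_f, p_e, p_c in product(grid, repeat=3)
    if not is_valid_probability(combined_rule(p_f, p_e, p_c))
]

for p_f, p_e, p_c, value in invalid:
    print(f"Pr(F)={p_f}, Pr(E)={p_e}, Pr(C)={p_c} -> rule gives {value}")
```

With all three inputs at 0 the rule gives -0.25, and with all three at 1 it gives 1.625. Neither is a probability.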
What went wrong? Expertise: they had study design problems (batch effects)
What went wrong? Expertise: their predictions weren't locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1
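One possible way to "lock down" a prediction rule is to freeze its parameters and record a checksum, so that any later change is detected before a prediction is made. This is a sketch under stated assumptions, not the procedure used in the case discussed in the talk. The logistic model, the parameter values, and every function name here are hypothetical.

```python
import hashlib
import json
import math


def fingerprint(params: dict) -> str:
    """SHA-256 digest of the parameters in a canonical JSON form."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def predict(params: dict, features: list) -> float:
    """Hypothetical logistic prediction rule with fixed coefficients."""
    score = params["intercept"] + sum(
        w * x for w, x in zip(params["weights"], features)
    )
    return 1.0 / (1.0 + math.exp(-score))


def locked_predict(params: dict, locked_fingerprint: str, features: list) -> float:
    """Refuse to predict if the parameters differ from the locked version."""
    if fingerprint(params) != locked_fingerprint:
        raise RuntimeError("model parameters changed after lock-down")
    return predict(params, features)


# Hypothetical parameters, locked once before any validation data is seen.
params = {"intercept": -1.0, "weights": [0.8, 0.3]}
lock = fingerprint(params)

today = locked_predict(params, lock, [1.0, 2.0])
tomorrow = locked_predict(params, lock, [1.0, 2.0])
print(today == tomorrow)  # same locked rule, same input, same prediction
```

Publishing the fingerprint alongside the code gives a reader a way to check that the rule applied to new samples is the one that was originally reported.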
At the end of the day the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? (diagram; slide courtesy R. Peng)
Groups shown: Original Investigator, Reproducers, Scientists, General Public
Statements shown: "The truth is A", "The truth is B", "The truth is not A", "I don't care", "???"
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is Data Analysis? (slide courtesy R. Peng)
1. Raw data
2. Cleaning / validation
3. Pre-processing
4. Exploratory data analysis
5. Statistical model development
6. Sensitivity analysis
7. Finalize results / report
Callout on the slide: "Statistics!"
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
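The slide names the association without explaining it. The usual textbook reading of this example is confounding: among children, age drives both shoe size and reading ability. The simulation below is a sketch under that assumption, with made-up effect sizes and noise levels. It shows a strong raw correlation that largely disappears once age is regressed out of both variables.

```python
import random
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y: list, x: list) -> list:
    """Residuals from a least-squares fit of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


# Simulated children (all numbers are illustrative assumptions).
rng = random.Random(0)
age = [rng.uniform(5, 12) for _ in range(5000)]
shoe_size = [1.0 * a + rng.gauss(0, 1) for a in age]    # depends on age only
literacy = [10.0 * a + rng.gauss(0, 10) for a in age]   # depends on age only

raw = pearson(shoe_size, literacy)
adjusted = pearson(residuals(shoe_size, age), residuals(literacy, age))

print(f"raw correlation:          {raw:.2f}")
print(f"correlation given age:    {adjusted:.2f}")
```

In this simulation shoe size has no direct effect on literacy, so the raw association comes entirely from the shared dependence on age.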
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, each 1 month long, offered every month
Cumulative Enrollment
jtleek.com/talks