Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
770
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
蔵本モデルが解き明かす同期と相転移の秘密 〜拍手のリズムはなぜ揃うのか?〜
syotasasaki593876
1
130
知能とはなにかーヒトとAIのあいだー
tagtag
0
110
Text-to-SQLの既存の評価指標を問い直す
gotalab555
1
120
NASの容量不足のお悩み解決!災害対策も兼ねた「Wasabi Cloud NAS」はここがスゴイ
climbteam
1
220
学術講演会中央大学学員会府中支部
tagtag
0
320
データマイニング - ノードの中心性
trycycle
PRO
0
300
Hakonwa-Quaternion
hiranabe
1
150
データベース08: 実体関連モデルとは?
trycycle
PRO
0
980
コンピュータビジョンによるロボットの視覚と判断:宇宙空間での適応と課題
hf149
1
420
機械学習 - ニューラルネットワーク入門
trycycle
PRO
0
880
動的トリートメント・レジームを推定するDynTxRegimeパッケージ
saltcooky12
0
220
主成分分析に基づく教師なし特徴抽出法を用いたコラーゲン-グリコサミノグリカンメッシュの遺伝子発現への影響
tagtag
0
110
Featured
See All Featured
Making Projects Easy
brettharned
120
6.4k
Side Projects
sachag
455
43k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
34
2.5k
Unsuck your backbone
ammeep
671
58k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.5k
How to Ace a Technical Interview
jacobian
280
24k
Product Roadmaps are Hard
iamctodd
PRO
55
12k
4 Signs Your Business is Dying
shpigford
186
22k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
[RailsConf 2023] Rails as a piece of cake
palkan
57
6.1k
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
31
2.7k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks