$30 off During Our Annual Pro Sale. View Details »
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
770
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
Text-to-SQLの既存の評価指標を問い直す
gotalab555
1
140
KH Coderチュートリアル(スライド版)
koichih
1
53k
イロレーティングを活用した関東大学サッカーの定量的実力評価 / A quantitative performance evaluation of Kanto University Football Association using Elo rating
konakalab
0
130
データベース08: 実体関連モデルとは?
trycycle
PRO
0
1k
サイコロで理解する原子核崩壊と拡散現象 〜単純化されたモデルで本質を理解する〜
syotasasaki593876
0
130
機械学習 - K-means & 階層的クラスタリング
trycycle
PRO
0
1.2k
機械学習 - pandas入門
trycycle
PRO
0
380
機械学習 - 決定木からはじめる機械学習
trycycle
PRO
0
1.2k
ランサムウェア対策にも考慮したVMware、Hyper-V、Azure、AWS間のリアルタイムレプリケーション「Zerto」を徹底解説
climbteam
0
170
生成AIと学ぶPythonデータ分析再入門-Pythonによるクラスタリング・可視化をサクサク実施-
datascientistsociety
PRO
4
1.9k
AIに仕事を奪われる 最初の医師たちへ
ikora128
0
1k
Hakonwa-Quaternion
hiranabe
1
160
Featured
See All Featured
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
1.8k
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.8k
BBQ
matthewcrist
89
9.9k
Reflections from 52 weeks, 52 projects
jeffersonlam
355
21k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
22k
Embracing the Ebb and Flow
colly
88
4.9k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1.1k
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
Rails Girls Zürich Keynote
gr2m
95
14k
Building a Scalable Design System with Sketch
lauravandoore
463
34k
How to Think Like a Performance Engineer
csswizardry
28
2.3k
How to train your dragon (web standard)
notwaldorf
97
6.4k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks