Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Jeff L.
June 28, 2014
Science
1
730
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.6k
Other Decks in Science
See All in Science
AI科学の何が“哲学”の問題になるのか ~問いマッピングの試み~
rmaruy
1
2.3k
白金鉱業Meetup Vol.16_数理最適化案件のはじめかた・すすめかた
brainpadpr
3
990
butterfly_effect/butterfly_effect_in-house
florets1
1
110
多次元展開法を用いた 多値バイクラスタリング モデルの提案
kosugitti
0
200
3次元点群を利用した植物の葉の自動セグメンテーションについて
kentaitakura
2
650
山形とさくらんぼに関するレクチャー(YG-900)
07jp27
1
240
ABEMAの効果検証事例〜効果の異質性を考える〜
s1ok69oo
4
2.1k
Causal discovery based on non-Gaussianity and nonlinearity
sshimizu2006
0
200
最適化超入門
tkm2261
14
3.3k
20240420 Global Azure 2024 | Azure Migrate でデータセンターのサーバーを評価&移行してみる
olivia_0707
2
930
白金鉱業Meetup Vol.16_【初学者向け発表】 数理最適化のはじめの一歩 〜身近な問題で学ぶ最適化の面白さ〜
brainpadpr
10
1.7k
学術講演会中央大学学員会いわき支部
tagtag
0
110
Featured
See All Featured
Designing for humans not robots
tammielis
250
25k
Embracing the Ebb and Flow
colly
84
4.5k
Stop Working from a Prison Cell
hatefulcrawdad
267
20k
A Philosophy of Restraint
colly
203
16k
Testing 201, or: Great Expectations
jmmastey
41
7.1k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
49k
RailsConf 2023
tenderlove
29
940
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.3k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
3
180
Being A Developer After 40
akosma
88
590k
The Invisible Side of Design
smashingmag
298
50k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
38
1.9k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks