Big data and reproducibility
Jeff L.
June 28, 2014
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
[Plot: $ per (human) genome, by year]
RNA-seq sample sizes: 2008 N≈2; 2010 N≈70; 2013 N≈900 (PMIDs: 19056941, 20220758, 24092820)
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
what went wrong? 2 things
what went wrong? transparency: The data/code weren’t reproducible
what went wrong? transparency: There was a lack of cooperation
what went wrong? expertise: They used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
what went wrong? expertise: They had study design problems (batch effects)
what went wrong? expertise: Their predictions weren’t locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1
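One way to see why the quoted combination rule looks questionable is to evaluate it at the extremes. The sketch below assumes the rule is applied exactly as written on the slide, with Pr(F), Pr(E) and Pr(C) each lying between 0 and 1. The function name `pr_fec` and the range check are illustrative only and do not come from the talk.

```python
def pr_fec(pr_f: float, pr_e: float, pr_c: float) -> float:
    """Rule as quoted on the slide: 5/8 * (Pr(F) + Pr(E) + Pr(C)) - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


# Evaluate the rule when all three inputs share the same value (illustrative inputs).
for p in (0.0, 0.5, 1.0):
    value = pr_fec(p, p, p)
    is_valid = 0.0 <= value <= 1.0
    print(f"all inputs = {p}: rule gives {value} (valid probability: {is_valid})")
```

With all inputs at 0 the rule returns -0.25, and with all inputs at 1 it returns 1.625, so the output is not guaranteed to be a probability.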
At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research?
[Diagram. Roles: Original Investigator, Reproducers, Scientists, General Public. Statements: "The truth is A", "I don’t care", "The truth is B", "The truth is not A", "The truth is A", "???"]
Slide courtesy R. Peng
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is Data Analysis?
Raw data → Cleaning / validation → Pre-processing → Exploratory data analysis → Statistical model development → Sensitivity analysis → Finalize results / report
Statistics!
Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
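The shoe size and literacy association is usually read as a confounding example, with age driving both variables. The slide does not spell this out, so treat that reading as an assumption. The sketch below simulates it with made-up numbers: every coefficient, range, and sample size is hypothetical, and the helper names `pearson` and `residuals` are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residuals(ys, xs):
    """Residuals from a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical data: age (the assumed confounder) drives both outcomes.
ages = [rng.uniform(5, 15) for _ in range(5000)]
shoe_size = [0.5 * a + rng.gauss(0, 1) for a in ages]
literacy = [2.0 * a + rng.gauss(0, 2) for a in ages]

raw_r = pearson(shoe_size, literacy)
# Correlation after removing the linear effect of age from both variables.
adjusted_r = pearson(residuals(shoe_size, ages), residuals(literacy, ages))

print(f"raw correlation:          {raw_r:.3f}")
print(f"age-adjusted correlation: {adjusted_r:.3f}")
```

In this simulated setup the raw correlation is strong while the age-adjusted one is close to zero, which is the kind of traditional check that still applies however large the data set is.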
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, 1 month long, every month
Cumulative Enrollment
jtleek.com/talks