Big data and reproducibility
Jeff L.
June 28, 2014
Science
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
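Reading the slide's relationship as budget divided by per-sample cost, the arithmetic can be sketched as below. The budget and the per-sample costs are hypothetical values chosen only for illustration, and `affordable_sample_size` is a name invented for this sketch.

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample: float) -> int:
    """Largest whole number of samples the budget covers."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical budget and per-sample costs (not taken from the talk).
budget = 100_000
for cost in (50_000, 1_000, 100):
    print(f"${cost} per sample -> N = {affordable_sample_size(budget, cost)}")
```

As the cost per sample falls, the same budget buys a larger N, which is the trend the following slides describe.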
[Figure: $ per (human) genome, by year]
RNA-seq sample sizes: 2008, N≈2; 2010, N≈70; 2013, N≈900 (PMIDs: 19056941, 20220758, 24092820)
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
What went wrong? Two things
What went wrong? Transparency: the data/code weren’t reproducible
What went wrong? Transparency: there was a lack of cooperation
What went wrong? Expertise: they used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
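One way to see why such a rule is problematic, offered as a sketch and not as the speaker's own argument: whatever F, E, and C denote, evaluating the quoted formula at the extremes gives values outside [0, 1], so its output cannot always be a probability. The function name `slide_rule` is invented for this example.

```python
def slide_rule(p_f: float, p_e: float, p_c: float) -> float:
    """The rule quoted on the slide: 5/8 * (Pr(F) + Pr(E) + Pr(C)) - 1/4."""
    return 5 / 8 * (p_f + p_e + p_c) - 1 / 4


# Evaluate at the two extremes of the inputs.
high = slide_rule(1.0, 1.0, 1.0)  # all three component probabilities equal to 1
low = slide_rule(0.0, 0.0, 0.0)   # all three component probabilities equal to 0

print(high)  # 1.625, larger than 1
print(low)   # -0.25, smaller than 0
```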
What went wrong? Expertise: they had study design problems (batch effects)
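A batch effect can be sketched with a small simulation. This is an illustration under stated assumptions, not a reconstruction of the study in question: there is no true difference between cases and controls, and the second batch adds a hypothetical offset `BATCH_SHIFT` to every measurement. All names and numbers are invented for the sketch.

```python
import random
import statistics

rng = random.Random(42)
BATCH_SHIFT = 2.0  # hypothetical measurement offset added by batch 1


def measure(batch: int) -> float:
    """One measurement with no true group effect: noise plus a batch offset."""
    return rng.gauss(0.0, 1.0) + (BATCH_SHIFT if batch == 1 else 0.0)


def group_difference(case_batches, control_batches) -> float:
    """Mean of the cases minus mean of the controls."""
    cases = [measure(b) for b in case_batches]
    controls = [measure(b) for b in control_batches]
    return statistics.fmean(cases) - statistics.fmean(controls)


n = 500
# Confounded design: every case in batch 1, every control in batch 0.
confounded = group_difference([1] * n, [0] * n)
# Balanced design: both groups are split evenly across the two batches.
balanced = group_difference([0, 1] * (n // 2), [0, 1] * (n // 2))

print(f"confounded design: {confounded:.2f}")
print(f"balanced design:   {balanced:.2f}")
```

In the confounded design the batch offset shows up as an apparent group difference; balancing the groups across batches removes it.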
What went wrong? Expertise: their predictions weren’t locked down (today: Pr(FEC) = 0.8; tomorrow: Pr(FEC) = 0.1)
At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer).
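The distinction between the two terms can be sketched in code. In this illustration a random seed stands in for one run of an experiment; the simulated measurements, the function names, and the choice of a mean as the "analysis" are all assumptions made for the example.

```python
import random
import statistics


def run_experiment(seed: int, n: int = 200) -> list[float]:
    """Simulate collecting n measurements; each seed is one run of the experiment."""
    rng = random.Random(seed)
    return [rng.gauss(10.0, 2.0) for _ in range(n)]


def analyze(data: list[float]) -> float:
    """The 'analysis': here simply the sample mean."""
    return statistics.fmean(data)


original_data = run_experiment(seed=1)
original_result = analyze(original_data)

# Reproducible: the same data and the same code give exactly the same result.
reproduced_result = analyze(original_data)

# Replicable: a new run of the experiment gives an answer that agrees within noise.
replicated_result = analyze(run_experiment(seed=2))

print(original_result == reproduced_result)
print(abs(original_result - replicated_result))
```

Reproduction checks the code and data; replication checks the finding itself, which is why an analysis can be reproducible and still wrong.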
Who Reproduces Research? [Diagram with the groups Original Investigator, Reproducers, Scientists, and General Public, and the positions “The truth is A”, “The truth is B”, “The truth is not A”, “I don’t care”, and “???”] Slide courtesy R. Peng
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is Data Analysis?
1. Raw data
2. Cleaning / validation
3. Pre-processing
4. Exploratory data analysis
5. Statistical model development
6. Sensitivity analysis
7. Finalize results / report
Statistics!
Slide courtesy R. Peng
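The stages on this slide can be sketched as a chain of small functions. This is a toy illustration only: the raw values, the cleaning rule, and the use of a mean as the "model" are hypothetical stand-ins, not anything taken from the talk.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value, -1.0 an invalid one.
raw = [4.1, 5.0, None, 4.7, -1.0, 5.3, 4.9, 12.0]


def clean(values):
    """Cleaning / validation: drop missing and impossible (negative) values."""
    return [v for v in values if v is not None and v >= 0]


def preprocess(values):
    """Pre-processing: center the values on their median."""
    med = statistics.median(values)
    return [v - med for v in values]


def explore(values):
    """Exploratory data analysis: basic summaries to inspect before modeling."""
    return {"n": len(values), "min": min(values), "max": max(values)}


def fit_model(values):
    """Statistical model development: a mean, standing in for a real model."""
    return statistics.fmean(values)


def sensitivity(values):
    """Sensitivity analysis: refit after dropping the largest value."""
    return fit_model(sorted(values)[:-1])


cleaned = clean(raw)
processed = preprocess(cleaned)
summary = explore(processed)
estimate = fit_model(processed)
estimate_without_max = sensitivity(processed)

# Finalize results / report.
report = {"summary": summary, "estimate": estimate, "without_max": estimate_without_max}
print(report)
```

Only `fit_model` corresponds to statistical modeling; every other function is a separate step that can change the final estimate.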
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, each 1 month long, offered every month
Cumulative Enrollment
jtleek.com/talks