Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Jeff L.
June 28, 2014
Science
1
770
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.7k
Other Decks in Science
See All in Science
安心・効率的な医療現場の実現へ ~オンプレAI & ノーコードワークフローで進める業務改革~
siyoo
0
450
KH Coderチュートリアル(スライド版)
koichih
1
58k
やるべきときにMLをやる AIエージェント開発
fufufukakaka
2
1.1k
タンパク質間相互作⽤を利⽤した⼈⼯知能による新しい薬剤遺伝⼦-疾患相互作⽤の同定
tagtag
PRO
0
140
機械学習 - SVM
trycycle
PRO
1
980
Text-to-SQLの既存の評価指標を問い直す
gotalab555
1
170
Distributional Regression
tackyas
0
340
機械学習 - 決定木からはじめる機械学習
trycycle
PRO
0
1.2k
データマイニング - グラフ埋め込み入門
trycycle
PRO
1
160
PPIのみを用いたAIによる薬剤–遺伝子–疾患 相互作用の同定
tagtag
PRO
0
160
(2025) Balade en cyclotomie
mansuy
0
450
検索と推論タスクに関する論文の紹介
ynakano
1
140
Featured
See All Featured
Music & Morning Musume
bryan
47
7.1k
Are puppies a ranking factor?
jonoalderson
1
2.7k
Unlocking the hidden potential of vector embeddings in international SEO
frankvandijk
0
170
So, you think you're a good person
axbom
PRO
2
1.9k
Agile that works and the tools we love
rasmusluckow
331
21k
Bridging the Design Gap: How Collaborative Modelling removes blockers to flow between stakeholders and teams @FastFlow conf
baasie
0
450
Winning Ecommerce Organic Search in an AI Era - #searchnstuff2025
aleyda
0
1.9k
Prompt Engineering for Job Search
mfonobong
0
160
Making the Leap to Tech Lead
cromwellryan
135
9.7k
Designing Experiences People Love
moore
144
24k
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
0
110
For a Future-Friendly Web
brad_frost
182
10k
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks