Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Jeff L.
June 28, 2014
Science
1
780
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
Tweet
Share
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.7k
Other Decks in Science
See All in Science
2025-06-11-ai_belgium
sofievl
1
250
Rashomon at the Sound: Reconstructing all possible paleoearthquake histories in the Puget Lowland through topological search
cossatot
0
730
俺たちは本当に分かり合えるのか? ~ PdMとスクラムチームの “ずれ” を科学する
bonotake
2
2.1k
Accelerating operator Sinkhorn iteration with overrelaxation
tasusu
0
240
タンパク質間相互作⽤を利⽤した⼈⼯知能による新しい薬剤遺伝⼦-疾患相互作⽤の同定
tagtag
PRO
0
180
ITTF卓球世界ランキングのポイント比を用いた試合結果予測モデルの性能評価 / Performance evaluation of match result prediction models using the point ratio of the ITTF Table Tennis World Ranking
konakalab
0
110
Algorithmic Aspects of Quiver Representations
tasusu
0
230
論文紹介 音源分離:SCNET SPARSE COMPRESSION NETWORK FOR MUSIC SOURCE SEPARATION
kenmatsu4
0
580
AIに仕事を奪われる 最初の医師たちへ
ikora128
0
1k
AIPシンポジウム 2025年度 成果報告会 「因果推論チーム」
sshimizu2006
3
420
Navigating Weather and Climate Data
rabernat
0
140
YouTubeにおける撤回論文の参照実態 / metascience-meetup2026
corgies
1
210
Featured
See All Featured
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1.1k
The Cult of Friendly URLs
andyhume
79
6.8k
New Earth Scene 8
popppiees
1
1.8k
Leveraging Curiosity to Care for An Aging Population
cassininazir
1
200
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
49
9.9k
30 Presentation Tips
portentint
PRO
1
260
Speed Design
sergeychernyshev
33
1.6k
Bioeconomy Workshop: Dr. Julius Ecuru, Opportunities for a Bioeconomy in West Africa
akademiya2063
PRO
1
75
Building Flexible Design Systems
yeseniaperezcruz
330
40k
YesSQL, Process and Tooling at Scale
rocio
174
15k
Chasing Engaging Ingredients in Design
codingconduct
0
150
Building a A Zero-Code AI SEO Workflow
portentint
PRO
0
410
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks