Big data and reproducibility
Jeff L.
June 28, 2014
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
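The budget identity on this slide can be sketched in a few lines of Python. The dollar figures below are invented placeholders, not numbers from the talk; the point is only that a fixed budget buys a larger N as the cost per sample falls.

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample: float) -> int:
    """Return N = budget / cost per sample, rounded down to whole samples."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical budget and per-sample costs (illustrative only).
budget = 100_000
for cost in (50_000, 1_000, 100):
    print(f"cost per sample ${cost}: N = {affordable_sample_size(budget, cost)}")
```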
[Figure: $ per (human) genome, by year]
RNA-seq: 2008, N≈2; 2010, N≈70; 2013, N≈900 (PMIDs: 19056941, 20220758, 24092820)
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
What went wrong? Two things.
What went wrong? Transparency: the data/code weren't reproducible.
What went wrong? Transparency: there was a lack of cooperation.
What went wrong? Expertise: they used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4).
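The slide does not spell out why the rule is silly. One reading, offered here as an assumption, is that the formula is not a valid probability: it can fall below 0 or rise above 1. The sketch below evaluates the rule as quoted on the slide; the function names are hypothetical, and the independence product is included only as a familiar point of comparison, not as the rule the analysis should have used.

```python
def quoted_rule(p_f: float, p_e: float, p_c: float) -> float:
    """Rule as quoted on the slide: 5/8 * (Pr(F) + Pr(E) + Pr(C)) - 1/4."""
    return 5 / 8 * (p_f + p_e + p_c) - 1 / 4


def independent_joint(p_f: float, p_e: float, p_c: float) -> float:
    """Joint probability IF F, E and C were independent (assumption, for comparison)."""
    return p_f * p_e * p_c


for probs in [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]:
    rule = quoted_rule(*probs)
    valid = 0.0 <= rule <= 1.0
    print(probs, "rule:", rule, "valid probability:", valid,
          "independence product:", independent_joint(*probs))
```

With all three inputs at 0 the rule returns -0.25, and with all three at 1 it returns 1.625, so neither extreme yields a value in [0, 1].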
What went wrong? Expertise: they had study design problems (batch effects).
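The slide names batch effects without detail. As a hedged illustration, here is a toy simulation (all numbers invented, function name hypothetical) of a measurement with no true difference between cases and controls. When every case is processed in one batch and every control in another, the batch shift shows up as a spurious group difference; when batch is assigned at random, it mostly averages out.

```python
import random
from statistics import mean


def group_difference(confounded: bool, n: int = 200,
                     batch_shift: float = 2.0, seed: int = 0) -> float:
    """Return mean(cases) - mean(controls) for a feature with NO true group effect."""
    rng = random.Random(seed)
    cases, controls = [], []
    for i in range(n):
        is_case = i % 2 == 0
        if confounded:
            batch = 1 if is_case else 0   # every case lands in batch 1
        else:
            batch = rng.randint(0, 1)     # batch assigned at random
        value = rng.gauss(0.0, 1.0) + batch_shift * batch
        (cases if is_case else controls).append(value)
    return mean(cases) - mean(controls)


print("confounded design:", round(group_difference(confounded=True), 2))
print("randomized design:", round(group_difference(confounded=False), 2))
```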
What went wrong? Expertise: their predictions weren't locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1.
At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer).
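The distinction on this slide can be sketched with a toy experiment. Everything here is an assumption made for illustration: `run_experiment` stands in for data collection, the fixed seed stands in for an archived dataset, and `analyze` stands in for the shared analysis code. Reproducing means rerunning the same code on the same data and getting the identical result; replicating means collecting new data and getting a consistent answer.

```python
import random
from statistics import mean


def run_experiment(seed: int, n: int = 1000, true_mean: float = 5.0) -> list:
    """Simulate collecting n noisy measurements (hypothetical experiment)."""
    rng = random.Random(seed)
    return [rng.gauss(true_mean, 1.0) for _ in range(n)]


def analyze(data: list) -> float:
    """The 'analysis code': here, just the sample mean."""
    return round(mean(data), 6)


original = analyze(run_experiment(seed=1))
reproduced = analyze(run_experiment(seed=1))   # same data, same code
replicated = analyze(run_experiment(seed=2))   # new experiment, same code

print("reproducible:", original == reproduced)
print("replicable (within 0.2):", abs(original - replicated) < 0.2)
```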
Who Reproduces Research? [Diagram labels: Original Investigator, Reproducers, Scientists, General Public; speech labels: "The truth is A", "I don't care", "The truth is B", "The truth is not A", "The truth is A", "???"] Slide courtesy R. Peng.
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is Data Analysis? Raw data; cleaning / validation; pre-processing; exploratory data analysis; statistical model development; sensitivity analysis; finalize results / report. Statistics! Slide courtesy R. Peng.
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
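The slide gives only this phrase. In the usual textbook telling, which is assumed here, the association arises among children because age drives both shoe size and reading ability. The simulation below uses invented numbers to show a strong overall correlation that largely disappears once age is held fixed.

```python
import random


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
ages = [rng.randint(5, 15) for _ in range(2000)]          # hypothetical children
shoe = [a + rng.gauss(0, 1) for a in ages]                # shoe size depends on age only
literacy = [10 * a + rng.gauss(0, 10) for a in ages]      # literacy depends on age only

overall = pearson(shoe, literacy)

# Hold the confounder fixed: look only at ten-year-olds.
ten = [(s, l) for a, s, l in zip(ages, shoe, literacy) if a == 10]
within_age = pearson([s for s, _ in ten], [l for _, l in ten])

print("overall correlation:", round(overall, 2))
print("correlation among ten-year-olds:", round(within_age, 2))
```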
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes. 1 month long. Every month.
[Figure: cumulative enrollment]
jtleek.com/talks