Big data and reproducibility
Jeff L.
June 28, 2014
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
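The budget arithmetic on this slide can be sketched in a few lines. The dollar figures below are hypothetical placeholders, not numbers from the talk, and the function name is ours; the only point is that N grows as the cost per sample falls.

```python
def affordable_sample_size(budget_usd: float, cost_per_sample_usd: float) -> int:
    """Return the largest whole number of samples the budget covers."""
    if cost_per_sample_usd <= 0:
        raise ValueError("cost per sample must be positive")
    return int(budget_usd // cost_per_sample_usd)


# Hypothetical budget and per-sample costs, for illustration only.
budget = 100_000
for cost in (10_000, 1_000, 100):
    print(f"${cost} per sample -> N = {affordable_sample_size(budget, cost)}")
```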
[Figure: $ per (human) genome, by year]
RNA-seq sample sizes: 2008, N≈2; 2010, N≈70; 2013, N≈900 (PMIDs: 19056941, 20220758, 24092820)
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
What went wrong? 2 things
What went wrong? Transparency: the data/code weren't reproducible
What went wrong? Transparency: there was a lack of cooperation
What went wrong? Expertise: they used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
What went wrong? Expertise: they had study design problems (batch effects)
What went wrong? Expertise: their predictions weren't locked down (today: Pr(FEC) = 0.8; tomorrow: Pr(FEC) = 0.1)
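One way to see why the prediction rule quoted on the slide is a poor probability model: it is not guaranteed to return a value between 0 and 1. A minimal sketch, taking the formula exactly as written on the slide and treating Pr(F), Pr(E), and Pr(C) as plain numbers in [0, 1] (the function name is ours, not from the talk):

```python
def pr_fec(pr_f: float, pr_e: float, pr_c: float) -> float:
    """Combination rule as written on the slide: 5/8 * (sum of the three) - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


# Valid input probabilities can produce an output that is not a probability.
print(pr_fec(1.0, 1.0, 1.0))  # 1.625, above 1
print(pr_fec(0.0, 0.0, 0.0))  # -0.25, below 0
```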
At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
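The "reproducible" half of this definition can be checked mechanically: the same data and the same code should give the same result every time. The sketch below is a toy illustration under our own assumptions (a made-up data set, a bootstrap mean as the "analysis", and a fixed random seed); it says nothing about replicability, which requires running the experiment again.

```python
import hashlib
import json
import random


def fingerprint(data: list[float]) -> str:
    """SHA-256 of a JSON encoding of the data, to confirm the input is unchanged."""
    return hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()


def analysis(data: list[float], seed: int) -> float:
    """Toy analysis: bootstrap estimate of the mean, with a fixed seed."""
    rng = random.Random(seed)
    means = []
    for _ in range(200):
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


data = [2.1, 3.4, 1.9, 4.8, 3.3]  # hypothetical measurements
first = analysis(data, seed=2014)
second = analysis(data, seed=2014)
print("data fingerprint:", fingerprint(data)[:12])
print("identical results on rerun:", first == second)
```

Without the fixed seed, two runs of the same code on the same data would generally differ, which is one small reason shared code alone is not always enough.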
Who reproduces research?
[Diagram. Roles: Original Investigator, Reproducers, Scientists, General Public. Speech bubbles: "The truth is A", "I don't care", "The truth is B", "The truth is not A", "The truth is A", "???"]
Slide courtesy R. Peng
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is data analysis?
Raw data
Cleaning / validation
Pre-processing
Exploratory data analysis
Statistical model development
Sensitivity analysis
Finalize results / report
Statistics!
Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
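The slide gives only the association itself. A common textbook reading, which we assume here rather than take from the talk, is that age confounds the relationship: older children have both bigger feet and better reading skills. The simulation below uses made-up coefficients in which shoe size and literacy each depend on age and on nothing else, then compares the raw correlation with the correlation after removing the linear effect of age from both.

```python
import random


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y: list[float], x: list[float]) -> list[float]:
    """Residuals from a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Simulated children; all coefficients are hypothetical.
rng = random.Random(0)
age = [rng.uniform(5, 15) for _ in range(2000)]
shoe = [0.5 * a + rng.gauss(0, 1) for a in age]      # depends on age only
literacy = [2.0 * a + rng.gauss(0, 4) for a in age]  # depends on age only

raw = pearson(shoe, literacy)
adjusted = pearson(residuals(shoe, age), residuals(literacy, age))
print(f"raw correlation: {raw:.2f}")
print(f"after adjusting for age: {adjusted:.2f}")
```

In this setup the raw correlation is strong while the age-adjusted one is close to zero, even though shoe size has no effect on literacy by construction.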
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, 1 month long, every month
Cumulative Enrollment
jtleek.com/talks