Big data and reproducibility
Jeff L.
June 28, 2014
Talk at JHU summer institute.
Transcript
Big data and reproducibility
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
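The relation on this slide, read as budget divided by cost per sample (the fraction bar is an assumption about the original layout), can be sketched in a few lines. The function name and the dollar amounts below are hypothetical and chosen only for illustration; they are not figures from the talk.

```python
def affordable_sample_size(budget_dollars, cost_per_sample):
    """Return the largest whole number of samples the budget can pay for."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical budget and per-sample costs, for illustration only.
budget = 100_000
for cost in (50_000, 1_000, 100):
    print(f"${cost} per sample -> N = {affordable_sample_size(budget, cost)}")
```

With a fixed budget, N grows as the per-sample cost falls, which is the sense in which cheaper measurement produces bigger data.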
[Figure: $ per (human) genome, by year]
RNA-seq sample sizes:
2008: N≈2
2010: N≈70
2013: N≈900
PMIDs: 19056941, 20220758, 24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
what went wrong? 2 things
what went wrong? transparency: The data/code weren't reproducible
what went wrong? transparency: There was a lack of cooperation
what went wrong? expertise: They used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
what went wrong? expertise: They had study design problems (batch effects)
what went wrong? expertise: Their predictions weren't locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1
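One way to see a problem with the prediction rule quoted above, Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4, is to evaluate it at the extremes. The sketch below is an illustration, not the original analysis code; the function names are hypothetical. It shows that the rule can return values outside the interval [0, 1], so its output is not always a valid probability.

```python
def pr_fec(pr_f, pr_e, pr_c):
    """Evaluate the rule as quoted on the slide: 5/8 * (sum of the three) - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


def is_valid_probability(p):
    """A probability must lie between 0 and 1 inclusive."""
    return 0.0 <= p <= 1.0


all_certain = pr_fec(1.0, 1.0, 1.0)     # every single-drug probability is 1
all_impossible = pr_fec(0.0, 0.0, 0.0)  # every single-drug probability is 0

print(all_certain, is_valid_probability(all_certain))        # 1.625 False
print(all_impossible, is_valid_probability(all_impossible))  # -0.25 False
```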
At the end of the day the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer).
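The "reproducible" half of this goal can be checked mechanically: rerunning the same code on the same data should give the same result. The following is a minimal sketch under stated assumptions. The data values, the seed, and the bootstrap analysis are all hypothetical stand-ins, and the data fingerprint is just one possible way to confirm that the input has not changed. Replicability cannot be shown this way, since it requires running the experiment again.

```python
import hashlib
import json
import random


def fingerprint(data):
    """SHA-256 digest of the data, used to confirm the same input is being analysed."""
    payload = json.dumps(list(data)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def bootstrap_mean(data, seed, n_resamples=200):
    """Average of bootstrap resample means, with an explicit seed so reruns match."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


# Hypothetical measurements, for illustration only.
data = [2.1, 3.4, 1.9, 4.2, 3.3]

first_run = bootstrap_mean(data, seed=2014)
second_run = bootstrap_mean(data, seed=2014)

print("data fingerprint:", fingerprint(data))
print("runs agree:", first_run == second_run)
```

Without the explicit seed, two runs of a randomised analysis would generally differ, which is one common way a result fails to be reproducible even when the code and data are shared.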
Who Reproduces Research?
[Diagram: Original Investigator, Reproducers, Scientists, General Public. Positions shown: "The truth is A", "The truth is B", "The truth is not A", "I don't care", "???"]
Slide courtesy R. Peng
https://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the process
What is Data Analysis?
Raw Data
Cleaning / Validation
Pre-processing
Exploratory data analysis
Statistical model development
Sensitivity analysis
Finalize results / report
Statistics!
Slide courtesy R. Peng
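The stages named on this slide can be sketched as a small pipeline in which the statistical model is a single function among several. This is a toy illustration under stated assumptions: the raw values, the centimetre-to-metre conversion, and the use of a sample mean as the "model" are all hypothetical and are not taken from the talk.

```python
from statistics import mean


def clean(raw):
    """Cleaning / validation: keep only numeric entries."""
    return [float(x) for x in raw if isinstance(x, (int, float))]


def preprocess(values_cm):
    """Pre-processing: convert centimetres to metres (hypothetical units)."""
    return [v / 100 for v in values_cm]


def explore(values):
    """Exploratory data analysis: a few summary numbers to inspect."""
    return {"n": len(values), "min": min(values), "max": max(values)}


def fit_model(values):
    """Statistical model development: here just the sample mean."""
    return mean(values)


def sensitivity(values):
    """Sensitivity analysis: refit with each observation left out in turn."""
    estimates = [fit_model(values[:i] + values[i + 1:]) for i in range(len(values))]
    return min(estimates), max(estimates)


def run_analysis(raw):
    """Finalize results: run every stage in order and collect a report."""
    values = preprocess(clean(raw))
    return {
        "summary": explore(values),
        "estimate": fit_model(values),
        "leave_one_out_range": sensitivity(values),
    }


# Hypothetical raw data with a missing value and a malformed entry.
raw_data = [170, 165, None, "n/a", 180, 175]
report = run_analysis(raw_data)
print(report)
```

Only `fit_model` corresponds to what is usually called statistics; every other function is work that has to happen before or after it.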
3rd Discussion Point: Analysis is (often) an afterthought
http://bit.ly/OgW3xv
4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
association between shoe size and literacy
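An association like this one is a standard illustration of confounding. The simulation below is a sketch under an assumption the slide does not state: that age drives both shoe size and literacy. All coefficients and noise levels are hypothetical. It compares the raw correlation with the correlation that remains after the linear effect of age is removed from both variables.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def residuals(y, x):
    """Residuals of y after fitting a least-squares line on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(0)  # fixed seed so the simulation can be rerun exactly

# Assumed data-generating process: age influences both variables,
# and shoe size has no direct effect on literacy.
ages = [rng.uniform(4, 12) for _ in range(2000)]
shoe_size = [15 + 1.0 * age + rng.gauss(0, 1) for age in ages]
literacy = [10 * age + rng.gauss(0, 10) for age in ages]

raw_r = pearson(shoe_size, literacy)
adjusted_r = pearson(residuals(shoe_size, ages), residuals(literacy, ages))

print(f"raw correlation:          {raw_r:.2f}")
print(f"correlation given age:    {adjusted_r:.2f}")
```

In this simulated setting the raw correlation is strong while the age-adjusted one is close to zero, and collecting more rows would not change that.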
1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
jhudatascience.org
9 classes, 1 month long, every month
[Figure: Cumulative Enrollment]
jtleek.com/talks