Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Big data and reproducibility
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Jeff L.
June 28, 2014
Science
790
1
Share
Big data and reproducibility
Talk at JHU summer institute.
Jeff L.
June 28, 2014
More Decks by Jeff L.
See All by Jeff L.
Data Science at JHSPH
jtleek
2
110
We are all statisticians now
jtleek
2
1.7k
Other Decks in Science
See All in Science
データベース03: 関係データモデル
trycycle
PRO
1
500
機械学習 - K-means & 階層的クラスタリング
trycycle
PRO
0
1.5k
Rashomon at the Sound: Reconstructing all possible paleoearthquake histories in the Puget Lowland through topological search
cossatot
0
920
AIを用いた PID制御で部屋 の温度制御をしてみた
nearme_tech
PRO
0
110
会社でMLモデルを作るとは @電気通信大学 データアントレプレナーフェロープログラム
yuto16
1
670
検索と推論タスクに関する論文の紹介
ynakano
1
210
生成AIと司法書士の未来.pdf
tagtag
PRO
0
110
力学系から見た現代的な機械学習
hanbao
4
4.1k
Testing the Longevity Bottleneck Hypothesis
chinson03
0
290
大黒市で発生した大規模インシデント の ポストモーテムから読み解く、 記憶媒体消去の大切さ
shucho0103
0
150
Deep Space Network (abreviated)
tonyrice
0
150
因果推論と機械学習
sshimizu2006
1
1.1k
Featured
See All Featured
Building Applications with DynamoDB
mza
96
7k
Marketing to machines
jonoalderson
1
5.3k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
The SEO identity crisis: Don't let AI make you average
varn
0
460
Design in an AI World
tapps
1
210
Leo the Paperboy
mayatellez
7
1.8k
Joys of Absence: A Defence of Solitary Play
codingconduct
1
360
Faster Mobile Websites
deanohume
310
31k
Applied NLP in the Age of Generative AI
inesmontani
PRO
4
2.2k
Public Speaking Without Barfing On Your Shoes - THAT 2023
reverentgeek
1
390
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
23k
Lightning talk: Run Django tests with GitHub Actions
sabderemane
0
180
Transcript
Big data and reproducibility
None
N = SAMPLE SIZE
N = ($ YOU HAVE) ($ PER SAMPLE)
Year $ per (human) Genome
rna-seq 2008 N≈2 2010 N≈70 2013 N≈900 PMIDS: 19056941, 20220758,
24092820
www.geni.com
http://erlichlab.wi.mit.edu/familinx/index.html
None
None
None
None
what went wrong? 2 things
what went wrong? transparency The data/code weren’t reproducible
what went wrong? transparency There was a lack of cooperation
what went wrong? expertise They used silly prediction rules (Pr(FEC)
= 5/8[Pr(F) + Pr(E) + Pr(C)] – ¼)
what went wrong? expertise They had study design problems (Batch
effects)
what went wrong? expertise Their predictions weren’t locked down Today:
Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
At the end of the day the Potti analysis was
fully reproducible The problem is that the analysis was wrong
1st Discussion Point: What is reproducibility?
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
The goal: a result that is reproducible (the code and
data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
Who Reproduces Research? The truth is A I
don’t care The truth is B The truth is not A Original InvesRgator Reproducers The truth is A ScienRsts General Public ??? Slide courtesy R. Peng
hVps://github.com/jtleek/datasharing
2nd Discussion Point: Statistical modeling is only part of the
process
What is Data Analysis? Raw Data Cleaning /
ValidaRon Pre-‐processing Exploratory data analysis StaRsRcal model development SensiRvity analysis Finalize results / report StaRsRcs! Slide courtesy R. Peng
3rd Discussion Point: Analysis is (often) an afterthought
hVp://bit.ly/OgW3xv
None
4th Discussion Point: Traditional statistics & epidemiology ideas still matter
for big data
association between shoe size and literacy
None
None
None
1. Reproducibility by data sharing 2. Big data is not
just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
jhudatascience.org
9 classes 1 month long Every month
Cumulative Enrollment
jtleek.com/talks