Slide 1

Slide 1 text

Big data and reproducibility

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

N = SAMPLE SIZE

Slide 4

Slide 4 text

N = ($ YOU HAVE) / ($ PER SAMPLE)
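
Reading the relationship on this slide as budget divided by cost per sample, the arithmetic can be sketched as below. The function name and the dollar figures are hypothetical, chosen only to show that a falling per-sample cost raises N for a fixed budget.

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample_dollars: float) -> int:
    """Return the largest whole number of samples the budget covers."""
    if cost_per_sample_dollars <= 0:
        raise ValueError("cost per sample must be positive")
    return int(budget_dollars // cost_per_sample_dollars)


# Hypothetical numbers for illustration only (not taken from the talk).
budget = 100_000
for cost in (10_000, 1_000, 100):
    n = affordable_sample_size(budget, cost)
    print(f"${cost} per sample -> N = {n}")
```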

Slide 5

Slide 5 text

Plot: $ per (human) genome, by year

Slide 6

Slide 6 text

rna-seq
2008: N≈2
2010: N≈70
2013: N≈900
PMIDs: 19056941, 20220758, 24092820

Slide 7

Slide 7 text

www.geni.com

Slide 8

Slide 8 text

http://erlichlab.wi.mit.edu/familinx/index.html

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

what went wrong? 2 things

Slide 14

Slide 14 text

what went wrong? transparency The data/code weren’t reproducible

Slide 15

Slide 15 text

what went wrong? transparency There was a lack of cooperation

Slide 16

Slide 16 text

what went wrong? expertise They used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
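
One way to see why a rule of this shape is a poor probability model is to evaluate it at the extremes. The sketch below implements the formula exactly as written on the slide; the function name `pr_fec` and the chosen inputs are illustrative assumptions.

```python
def pr_fec(pr_f: float, pr_e: float, pr_c: float) -> float:
    """Combination rule from the slide: 5/8 * (Pr(F) + Pr(E) + Pr(C)) - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


# Inputs are valid probabilities, yet the output can leave the [0, 1] range.
low = pr_fec(0.0, 0.0, 0.0)
high = pr_fec(1.0, 1.0, 1.0)

print(f"all inputs 0 -> {low}")   # negative "probability"
print(f"all inputs 1 -> {high}")  # "probability" above 1
```

With every input at 0 the rule returns a negative value, and with every input at 1 it returns a value above 1, so the output cannot be interpreted as a probability without further constraints.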

Slide 17

Slide 17 text

what went wrong? expertise They had study design problems (Batch effects)

Slide 18

Slide 18 text

what went wrong? expertise Their predictions weren’t locked down Today: Pr(FEC) = 0.8 Tomorrow: Pr(FEC) = 0.1
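
A minimal sketch of what "locking down" a prediction can mean in code, assuming the instability comes from an unseeded random component. The predictor, the feature values, and the seed are all hypothetical and are not the method used in the analysis discussed here.

```python
import random


def noisy_prediction(features, rng):
    """Hypothetical predictor with a random component (stand-in for a stochastic fit)."""
    base = sum(features) / len(features)
    jitter = rng.uniform(-0.3, 0.3)
    return min(1.0, max(0.0, base + jitter))


features = [0.4, 0.5, 0.6]  # illustrative inputs

# Not locked down: each run draws from a fresh, unseeded generator,
# so the same patient can get a different answer on different days.
today = noisy_prediction(features, random.Random())
tomorrow = noisy_prediction(features, random.Random())

# Locked down: a fixed seed makes the prediction repeatable.
locked_today = noisy_prediction(features, random.Random(2014))
locked_tomorrow = noisy_prediction(features, random.Random(2014))

print("unseeded:", today, tomorrow)
print("seeded:  ", locked_today, locked_tomorrow)
```

Fixing the seed is only one piece of locking down a predictor; the model, its parameters, and the preprocessing would also need to be frozen before it is applied to new samples.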

Slide 19

Slide 19 text

At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.

Slide 20

Slide 20 text

1st Discussion Point: What is reproducibility?

Slide 21

Slide 21 text

The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)

Slide 22

Slide 22 text

The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)

Slide 23

Slide 23 text

Who Reproduces Research?
Diagram labels: Original Investigator, Reproducers, Scientists, General Public
Statements shown: "The truth is A", "The truth is B", "The truth is not A", "I don’t care", "???"
Slide courtesy R. Peng

Slide 24

Slide 24 text

https://github.com/jtleek/datasharing

Slide 25

Slide 25 text

2nd Discussion Point: Statistical modeling is only part of the process

Slide 26

Slide 26 text

What is Data Analysis?
Raw Data → Cleaning / Validation → Pre-processing → Exploratory data analysis → Statistical model development → Sensitivity analysis → Finalize results / report
Statistics!
Slide courtesy R. Peng

Slide 27

Slide 27 text

3rd Discussion Point: Analysis is (often) an afterthought

Slide 28

Slide 28 text

http://bit.ly/OgW3xv

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data

Slide 31

Slide 31 text

association between shoe size and literacy
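
The shoe size and literacy association is a classic illustration of confounding. The simulation below is a sketch under the assumption that age drives both variables; the coefficients, noise levels, sample size, and seed are invented for illustration and are not data from the talk.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated children: age causes both shoe size and literacy score;
# shoe size has no direct effect on literacy.
ages = [rng.uniform(5, 15) for _ in range(2000)]
shoe_size = [0.5 * a + rng.gauss(0, 1) for a in ages]
literacy = [2.0 * a + rng.gauss(0, 3) for a in ages]

raw = pearson(shoe_size, literacy)
# Partial correlation: correlate what is left after removing the age trend.
adjusted = pearson(residuals(shoe_size, ages), residuals(literacy, ages))

print(f"raw correlation:          {raw:.2f}")
print(f"adjusted for age:         {adjusted:.2f}")
```

In this setup the raw correlation is strong while the age-adjusted correlation is close to zero, which is the pattern expected when a third variable explains the association. A larger dataset does not remove the confounder; it only estimates the confounded association more precisely.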

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter

Slide 36

Slide 36 text

jhudatascience.org

Slide 37

Slide 37 text

9 classes
1 month long
Every month

Slide 38

Slide 38 text

Cumulative Enrollment

Slide 39

Slide 39 text

jtleek.com/talks