Big data and reproducibility

Slide 1

Slide 1 text

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

N = SAMPLE SIZE

Slide 4

Slide 4 text

N = ($ YOU HAVE) / ($ PER SAMPLE)
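
The relationship on this slide is plain arithmetic: the sample size you can afford is your budget divided by the cost of one sample. A minimal sketch, assuming whole samples only; the function name and the dollar figures are made up for illustration and do not come from the talk:

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample: float) -> int:
    """Return the largest whole number of samples the budget covers."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical budget and per-sample costs, for illustration only.
budget = 100_000
for cost in (10_000, 1_000, 100):
    print(f"${cost} per sample -> N = {affordable_sample_size(budget, cost)}")
```

With a fixed budget, every tenfold drop in per-sample cost buys a tenfold larger N.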

Slide 5

Slide 5 text

[Plot: $ per (human) genome, by year]

Slide 6

Slide 6 text

RNA-seq sample sizes: 2008, N≈2; 2010, N≈70; 2013, N≈900 (PMIDs: 19056941, 20220758, 24092820)

Slide 7

Slide 7 text

www.geni.com

Slide 8

Slide 8 text

http://erlichlab.wi.mit.edu/familinx/index.html

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

what went wrong? 2 things

Slide 14

Slide 14 text

What went wrong? Transparency: the data/code weren’t reproducible

Slide 15

Slide 15 text

What went wrong? Transparency: there was a lack of cooperation

Slide 16

Slide 16 text

What went wrong? Expertise: they used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
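
One way to see why a rule of this form is suspect is to evaluate it at the extremes. The sketch below implements the formula exactly as quoted on the slide; the function name is a hypothetical label, and treating the three inputs as probabilities in [0, 1] is an assumption:

```python
def combined_probability(pr_f: float, pr_e: float, pr_c: float) -> float:
    """Rule quoted on the slide: 5/8 * [Pr(F) + Pr(E) + Pr(C)] - 1/4."""
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


# All three inputs at their maximum: the "probability" exceeds 1.
print(combined_probability(1.0, 1.0, 1.0))  # 1.625

# All three inputs at their minimum: the "probability" is negative.
print(combined_probability(0.0, 0.0, 0.0))  # -0.25
```

A rule that can return values outside [0, 1] is not a valid probability, whatever its inputs mean.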

Slide 17

Slide 17 text

What went wrong? Expertise: they had study design problems (batch effects)
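
A batch effect becomes a study design problem when the batch lines up with the comparison of interest. The simulation below is a toy illustration under made-up assumptions (no true case/control difference, a constant shift of 2.0 for one batch, Gaussian noise); it is not a reconstruction of the study discussed in the talk:

```python
import random
from statistics import mean


def group_difference(confounded: bool, n: int = 200,
                     batch_shift: float = 2.0, seed: int = 0) -> float:
    """Mean(cases) - mean(controls) when there is NO true biological difference."""
    rng = random.Random(seed)
    cases, controls = [], []
    for i in range(n):
        if confounded:
            # Every case is run in batch 1 and every control in batch 0.
            case_batch, control_batch = 1, 0
        else:
            # Balanced design: both groups are split evenly across batches.
            case_batch = control_batch = i % 2
        cases.append(rng.gauss(0, 1) + batch_shift * case_batch)
        controls.append(rng.gauss(0, 1) + batch_shift * control_batch)
    return mean(cases) - mean(controls)


print("confounded design:", round(group_difference(True), 2))
print("balanced design:  ", round(group_difference(False), 2))
```

In the confounded design the batch shift shows up as an apparent group difference; in the balanced design it cancels.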

Slide 18

Slide 18 text

What went wrong? Expertise: their predictions weren’t locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1.
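
One common way a prediction drifts from day to day is unseeded randomness inside the procedure; the slide does not say whether that was the cause here, so the sketch below only illustrates the general idea. The predictor, the input values, and the seed are all hypothetical:

```python
import random


def predict(features, seed=None):
    """Toy stochastic predictor (hypothetical): averages a random subsample."""
    rng = random.Random(seed)  # seed=None draws from system entropy
    subsample = rng.sample(features, k=5)
    return sum(subsample) / len(subsample)


features = [0.05, 0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9, 0.95, 1.0]

# Unseeded: two runs on the same input may disagree.
print(predict(features), predict(features))

# Locked down: a fixed seed makes the prediction repeatable.
LOCKED_SEED = 2024  # arbitrary value chosen for this sketch
print(predict(features, LOCKED_SEED), predict(features, LOCKED_SEED))
```

Locking down a prediction means fixing everything it depends on (code, parameters, seeds) before it is applied to new samples.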

Slide 19

Slide 19 text

At the end of the day, the Potti analysis was fully reproducible. The problem is that the analysis was wrong.

Slide 20

Slide 20 text

1st Discussion Point: What is reproducibility?

Slide 21

Slide 21 text

The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
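
The two terms can be told apart with a small simulation. In this sketch, `collect_data` stands in for running an experiment and `analyze` for the analysis code; both names, the population parameters, and the seeds are assumptions made for illustration:

```python
import random
from statistics import mean


def collect_data(seed, n=1000):
    """Stand-in for running an experiment: n draws from one fixed population."""
    rng = random.Random(seed)
    return [rng.gauss(10.0, 2.0) for _ in range(n)]


def analyze(data):
    """Stand-in for the analysis code: here, just the sample mean."""
    return mean(data)


original_data = collect_data(seed=1)
published = analyze(original_data)

# Reproducible: the same data and the same code give exactly the same number.
reproduced = analyze(original_data)

# Replicable: a fresh experiment gives a similar, though not identical, answer.
replicated = analyze(collect_data(seed=2))

print(published, reproduced, replicated)
```

Reproducibility is a check on the code and data; replicability is a check on the scientific claim.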

Slide 23

Slide 23 text

Who Reproduces Research? (Diagram labels: Original Investigator, Reproducers, Scientists, General Public. Statements: "The truth is A", "The truth is B", "The truth is not A", "I don’t care", "???") Slide courtesy R. Peng

Slide 24

Slide 24 text

https://github.com/jtleek/datasharing

Slide 25

Slide 25 text

2nd Discussion Point: Statistical modeling is only part of the process

Slide 26

Slide 26 text

What is Data Analysis?
Raw data
Cleaning / validation
Pre-processing
Exploratory data analysis
Statistical model development
Sensitivity analysis
Finalize results / report
Statistics!
Slide courtesy R. Peng
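
The stages listed on this slide can be sketched as a chain of small functions, with the statistical model as only one link. Everything below is a toy under stated assumptions: the raw readings are invented, the "model" is just a sample mean, and the pre-processing stage is omitted for brevity:

```python
from statistics import mean, median

# Hypothetical raw readings, including blanks, text, and a suspicious value.
raw = ["4.1", "3.9", "", "4.3", "n/a", "40.0", "4.0"]


def clean(values):
    """Cleaning / validation: keep only entries that parse as numbers."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except ValueError:
            pass  # drop blanks and non-numeric text
    return out


def explore(values):
    """Exploratory analysis: simple summaries that expose the outlier."""
    return {"n": len(values), "min": min(values),
            "max": max(values), "median": median(values)}


def fit(values):
    """Statistical model development: here, just the sample mean."""
    return mean(values)


def sensitivity(values):
    """Sensitivity analysis: refit with the largest value removed."""
    return fit(sorted(values)[:-1])


data = clean(raw)
print(explore(data))
print("estimate:", fit(data))
print("estimate without largest value:", sensitivity(data))
```

Most of the code, and most of the decisions that change the answer, sit outside the `fit` step.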

Slide 27

Slide 27 text

3rd Discussion Point: Analysis is (often) an afterthought

Slide 28

Slide 28 text

http://bit.ly/OgW3xv

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data

Slide 31

Slide 31 text

association between shoe size and literacy
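
The usual reading of this example is confounding: among children, age drives both shoe size and reading ability. The slide gives only the title, so that interpretation, along with every number in the simulation below, is an assumption made for illustration:

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to avoid dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
ages = [rng.randint(5, 15) for _ in range(5000)]
# Shoe size and literacy each depend on age, but not on each other.
shoe_size = [age + rng.gauss(0, 1) for age in ages]
literacy = [age + rng.gauss(0, 1) for age in ages]

overall = pearson(shoe_size, literacy)

# Stratify: look only at ten-year-olds, so age is held fixed.
idx = [i for i, age in enumerate(ages) if age == 10]
within_age = pearson([shoe_size[i] for i in idx], [literacy[i] for i in idx])

print("all children:       ", round(overall, 2))
print("ten-year-olds only: ", round(within_age, 2))
```

The strong overall correlation largely disappears once age is held fixed, which is the kind of check traditional epidemiology builds in by default.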

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

1. Reproducibility by data sharing
2. Big data is not just statistics
3. Analysis is often an afterthought
4. Traditional ideas still matter
