Workflows for reproducible data science

Slide 1

Slide 1 text

Workflows for reproducible data science
Mine Çetinkaya-Rundel
Duke University + RStudio
mine-cetinkaya-rundel [email protected] @minebocek
🔗 bit.ly/repro-ds-21

Slide 2

Slide 2 text

The results in Table 1 don’t seem to correspond to those in Figure 2!

Slide 3

Slide 3 text


Slide 4

Slide 4 text

More than 70 percent of researchers have tried and failed to reproduce another scientist's experiments.
Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.

Slide 5

Slide 5 text

More than 50 percent of researchers have tried and failed to reproduce their own experiments.
Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.

Slide 6

Slide 6 text

Google Scholar yields 965 results containing the term "reproducibility crisis" just in 2021.
Google Scholar search, September 2021.

Slide 7

Slide 7 text

1992: earliest reference to reproducibility research* that I could find…
Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

Slide 8

Slide 8 text

Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Michelle Lewis, et al. "Replication Study: Transcriptional amplification in tumor cells with elevated c-Myc." eLife 7 (2018): e30274.

Slide 11

Slide 11 text

setting the stage
Photo by Alexander Dummer on Unsplash.

Slide 12

Slide 12 text

replicability: same research question, new data, same results
reproducibility: same research question, same data, same results

Slide 13

Slide 13 text

“Your closest collaborator is you six months ago, but you don’t reply to emails.”
– Mark Holder

Slide 14

Slide 14 text

e.g.

Term          Estimate   Std. Error   Statistic   p-value
(Intercept)       9.06        0.929        9.76   1.13e-17
Sepal.Width      -1.74        0.301       -5.77   4.51e-8

Table 1. Regression output for predicting petal length from sepal width.

Figure 2. Relationship between petal length and sepal width

Slide 15

Slide 15 text

Slide 16

Slide 16 text

analysis report

Term          Estimate   Std. Error   Statistic   p-value
(Intercept)       9.06        0.929        9.76   1.13e-17
Sepal.Width      -1.74        0.301       -5.77   4.51e-8

Table 1. Regression output for predicting petal length from sepal width.

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

making research reproducible

Slide 21

Slide 21 text

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
– David Donoho, paraphrasing Jon Claerbout
Jonathan Buckheit and David Donoho. "Wavelab and reproducible research." Wavelets and statistics. Springer, New York, NY, 1995. 55-81.

Slide 22

Slide 22 text

make available and accessible:
- raw data
- code & documentation to reproduce the analysis
- specifications of your computational environment
Peng, Roger. "The reproducibility crisis in science: A statistical counterattack." Significance 12.3 (2015): 30-32.
Gentleman, Robert, and Duncan Temple Lang. "Statistical analyses and reproducible research." Journal of Computational and Graphical Statistics 16.1 (2007): 1-23.

Slide 23

Slide 23 text

“The most important tool is the mindset, when starting, that the end product will be reproducible.”
– Keith Baggerly

Slide 24

Slide 24 text

nobody, not even yourself, can recreate any part of your analysis
push button reproducibility in published work 💃 🎯

Slide 25

Slide 25 text

“There’s no one-size-fits-all solution for computational reproducibility.”
Perkel, Jeffrey M. "A toolkit for data transparency takes shape." Nature 560 (2018): 513-515.

Slide 26

Slide 26 text

but the following might help… 8 principles

Slide 27

Slide 27 text

1 organize your project

Slide 28

Slide 28 text

“File organization and naming are powerful weapons against chaos.”
– Jenny Bryan

Slide 29

Slide 29 text

level of organization

Slide 30

Slide 30 text

simpler analysis
raw-data
processed-data
manuscript
|- manuscript.Rmd

more complex analysis
raw-data
processed-data
scripts
manuscript
|- manuscript.Rmd
figures

stick with the conventions of your peers
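A folder layout like the "more complex analysis" one can be set up by a short script so every new project starts the same way. The following is a minimal sketch in Python (the slides themselves point to R tooling); the function name `create_skeleton` is hypothetical, and it assumes `manuscript.Rmd` lives inside the `manuscript` folder.

```python
import tempfile
from pathlib import Path

# Folder names taken from the "more complex analysis" layout on the slide.
FOLDERS = ["raw-data", "processed-data", "scripts", "manuscript", "figures"]


def create_skeleton(root):
    """Create the project folders under root and return their names, sorted."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Assumption: the manuscript source file sits inside manuscript/.
    (root / "manuscript" / "manuscript.Rmd").touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(tmp)
    manuscript_exists = (Path(tmp) / "manuscript" / "manuscript.Rmd").exists()

print(created)
```

Because `mkdir` is called with `exist_ok=True`, the script can be re-run on an existing project without overwriting anything.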

Slide 31

Slide 31 text

Lucy D’Agostino McGowan. “One year to dissertate”, livefreeordichotomize.com/2018/09/14/one-year-to-dissertate. 📖 recommended reading

Slide 32

Slide 32 text

2 write READMEs liberally

Slide 33

Slide 33 text

raw-data
|- README.md
|- airlines.csv
|- airports.csv
|- flights.csv
|- planes.csv
|- weather.csv
processed-data
scripts
manuscript
figures

# README
This folder contains the raw data for the project. All datasets were downloaded from openflights.org/data.html on 2019-04-01.
- airlines: Airline names
- airports: Airports metadata
- flights: Flight data
- planes: Plane metadata
- weather: Hourly weather data
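A README like this one drifts out of date as files are added. One way to catch that is a small check that every data file is mentioned in the README. This is a sketch in Python, not something from the talk; the function name `undocumented_files` is hypothetical, and the check is a plain substring match on the file stem.

```python
import tempfile
from pathlib import Path


def undocumented_files(folder):
    """Return the CSV files in folder whose stem never appears in README.md."""
    folder = Path(folder)
    readme = (folder / "README.md").read_text(encoding="utf-8")
    return sorted(p.name for p in folder.glob("*.csv") if p.stem not in readme)


# Demo: a README that describes two of the three files present.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "README.md").write_text(
        "# README\n- airlines: Airline names\n- airports: Airports metadata\n",
        encoding="utf-8",
    )
    for name in ("airlines.csv", "airports.csv", "weather.csv"):
        (raw / name).touch()
    missing = undocumented_files(raw)

print(missing)
```

A substring match is deliberately crude: it would accept `flights` because of a word like `openflights`, so a stricter project might parse the list entries instead.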

Slide 34

Slide 34 text

3 keep data tidy & machine readable

Slide 35

Slide 35 text

Student            Exam Grade
Name               1             2      Major
Barney Donaldson   89            76     Data Science, Public Policy
Clay Whelan        67            83     Public Policy
Simran Bass        82            90     Statistics
Chante Munro       45            72     Political Science, Statistics
Gabrielle Cherry   32            79     .
Kush Piper         98            sick   Statistics
Faizan Ratliff     82            75     Data Science
Torin Ruiz         70            80     Sociology, Statistics
Reiss Richardson   missed exam   34     Neuroscience
Ajwa Cochran       50            65     Data Science

Low participation

name               exam_1   exam_2   first_major         second_major    participation
Barney Donaldson   89       76       Data Science        Public Policy   ok
Clay Whelan        67       83       Public Policy       NA              ok
Simran Bass        82       90       Statistics          NA              ok
Chante Munro       45       72       Political Science   Statistics      low
Gabrielle Cherry   32       79       NA                  NA              ok
Kush Piper         98       NA       Statistics          NA              ok
Faizan Ratliff     82       75       Data Science        NA              ok
Torin Ruiz         70       80       Sociology           Statistics      ok
Reiss Richardson   NA       34       Neuroscience        NA              low
Ajwa Cochran       50       65       Data Science        NA              low

record code + document non-code steps + write tests
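The cleaning steps that turn the first table into the second can be recorded as code and paired with a simple test. The sketch below is in Python with the standard library only. The four rows are copied from the slide, but encoding the "Low participation" marking as a boolean and requiring scores to fall between 0 and 100 are assumptions made for illustration. All function names are hypothetical.

```python
# (name, exam 1, exam 2, major, marked as low participation)
MESSY = [
    ("Barney Donaldson", "89", "76", "Data Science, Public Policy", False),
    ("Gabrielle Cherry", "32", "79", ".", False),
    ("Kush Piper", "98", "sick", "Statistics", False),
    ("Reiss Richardson", "missed exam", "34", "Neuroscience", True),
]


def parse_score(cell):
    """Return the score as an int, or None when the cell holds a note."""
    cell = cell.strip()
    return int(cell) if cell.isdigit() else None


def split_majors(cell):
    """Split 'A, B' into (first_major, second_major); '.' means missing."""
    parts = [p.strip() for p in cell.split(",") if p.strip() not in ("", ".")]
    parts += [None] * (2 - len(parts))
    return parts[0], parts[1]


def tidy(record):
    """Turn one messy spreadsheet row into one tidy row (a dict)."""
    name, exam_1, exam_2, major, low = record
    first_major, second_major = split_majors(major)
    return {
        "name": name,
        "exam_1": parse_score(exam_1),
        "exam_2": parse_score(exam_2),
        "first_major": first_major,
        "second_major": second_major,
        "participation": "low" if low else "ok",
    }


def check(rows):
    """A small test: scores are missing or in 0..100 (assumed range)."""
    for row in rows:
        for exam in ("exam_1", "exam_2"):
            score = row[exam]
            assert score is None or 0 <= score <= 100, (row["name"], exam)
        assert row["participation"] in {"ok", "low"}


TIDY = [tidy(record) for record in MESSY]
check(TIDY)
for row in TIDY:
    print(row)
```

Notes such as "sick" and "missed exam" become missing values here; a project that needs the reason could keep it in a separate column rather than in the score cell.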

Slide 36

Slide 36 text

Mark Ziemann, Yotam Eren, and Assam El-Osta. "Gene name errors are widespread in the scientific literature." Genome biology 17.1 (2016): 177. doi.org/10.1186/s13059-016-1044-7. 📖 recommended reading

Slide 37

Slide 37 text

Karl Broman and Kara Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10. doi.org/10.1080/00031305.2017.1375989. 📖 recommended reading

Slide 38

Slide 38 text

4 comment your code

Slide 39

Slide 39 text

🤷

Slide 40

Slide 40 text

5 use literate programming

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Yihui Xie, JJ Allaire, and Garrett Grolemund. “R Markdown: The Definitive Guide”, bookdown.org/yihui/rmarkdown. 📖 recommended reading

Slide 45

Slide 45 text

Yihui Xie, Christophe Dervieux, and Emily Riederer. “R Markdown Cookbook”, bookdown.org/yihui/rmarkdown-cookbook. 📖 recommended reading

Slide 46

Slide 46 text

6 use version control

Slide 47

Slide 47 text

changes tracked by hosted on

Slide 48

Slide 48 text

Jenny Bryan, et al. “Happy Git with R”, happygitwithr.com. 📖 recommended reading

Slide 49

Slide 49 text

7 automate your process
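The core idea behind build tools such as Make is that an output is rebuilt only when it is missing or older than its inputs. The sketch below illustrates that rule in Python; it is not Make or the targets package, and the names `is_stale`, `build`, `flights.csv`, and `summary.txt` are hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(s) > target_time for s in sources)


def build(target, sources, action):
    """Run action only when target is stale; report whether it ran."""
    if is_stale(target, sources):
        action()
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "flights.csv")
    target = os.path.join(tmp, "summary.txt")
    with open(source, "w") as f:
        f.write("carrier,delay\nAA,5\nUA,12\n")

    def summarise():
        # Stand-in for a real analysis step: count the data rows.
        with open(source) as src, open(target, "w") as out:
            out.write(f"{len(src.readlines()) - 1} rows\n")

    first_run = build(target, [source], summarise)   # target missing: runs
    second_run = build(target, [source], summarise)  # target fresh: skipped

    # Simulate editing the source after the summary was written.
    later = os.path.getmtime(target) + 10
    os.utime(source, (later, later))
    third_run = build(target, [source], summarise)   # source newer: runs

print(first_run, second_run, third_run)
```

Timestamps are a simple proxy for "has anything changed"; dedicated tools add dependency graphs across many steps, which is where they pay off in larger analyses.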

Slide 50

Slide 50 text

Slide 51

Slide 51 text

Karl Broman. “Minimal Make”, kbroman.org/minimal_make. 📖 recommended reading

Slide 52

Slide 52 text

Will Landau. “The targets R Package User Manual”, books.ropensci.org/targets. 📖 recommended reading

Slide 53

Slide 53 text

8 share computing environment
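The lightest form of sharing a computing environment is recording which interpreter and package versions produced the results. Below is a minimal sketch for a Python-based project, using only the standard library; the function name `environment_snapshot` and the package names passed to it are hypothetical, and this does not replace a container such as Docker.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages=()):
    """Collect the Python version, platform, and versions of named packages."""
    snapshot = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot


# The second name is deliberately one that should not exist.
snapshot = environment_snapshot(["pip", "a-package-that-is-not-installed"])
print(json.dumps(snapshot, indent=2))
```

Saving this output next to the results gives a reader a starting point for rebuilding the environment, even when a full container image is not available.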

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Carl Boettiger. "An introduction to Docker for reproducible research." ACM SIGOPS Operating Systems Review 49.1 (2015): 71-79.
Ben Marwick, Carl Boettiger, and Lincoln Mullen. "Packaging data analytical work reproducibly using R (and friends)." The American Statistician 72.1 (2018): 80-88.
📖 recommended reading

Slide 60

Slide 60 text

1 organize your project
2 write READMEs liberally
3 keep data tidy & machine readable
4 comment your code
5 use literate programming
6 use version control
7 automate your process
8 share computing environment

Slide 61

Slide 61 text

Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. "Good enough practices in scientific computing." PLoS Computational Biology 13.6 (2017): e1005510. 📖 recommended reading

Slide 62

Slide 62 text

painful bits

Slide 63

Slide 63 text

> coming up with good names
> stages of data cleaning
> going back and redoing stuff
> adding interim steps
> keeping track of the order of things
> clutter of unneeded old stuff
Karl Broman, tools4RR. kbroman.org/Tools4RR

Slide 64

Slide 64 text

looking into the future
Photo by Sweet Ice Cream Photography on Unsplash.

Slide 65

Slide 65 text

“The most important tool is the mindset, when starting, that the end product will be reproducible.”
– Keith Baggerly

Slide 66

Slide 66 text

“The second most important tool is training.”
– Karl Broman

Slide 67

Slide 67 text

#1: Convince data scientists to adopt a reproducible data analysis workflow
#2: Train new data scientists who don’t have any other workflow

Slide 68

Slide 68 text

statistics and data science educators who teach data analysis should be instilling best practices in students before they set out to do research

Slide 69

Slide 69 text


Slide 70

Slide 70 text


Slide 71

Slide 71 text

Workflows for reproducible data science
mine-cetinkaya-rundel [email protected] @minebocek
🔗 bit.ly/repro-ds-21