Slide 1

Slide 1 text

Reproducible Research with R
Kara Woo
Research Scientist, Data Curation
Sage Bionetworks
2018-11-29

Slide 2

Slide 2 text

reproducible research

Slide 3

Slide 3 text

Given my data and code, you should be able to come to the same conclusions.

Slide 4

Slide 4 text

Given my data and code, you should be able to come to the same conclusions.*

* In contrast to replicability: re-doing an experiment and getting the same results.

Slide 5

Slide 5 text

reproducible research

Slide 11

Slide 11 text

reproducible research
• “Uhhh which version of the data did I use, again?”
  ➡ Verify results
• “Help! My collaborator joined the circus and left me to finish the manuscript.”
  ➡ Collaborate with others
• “My boss wants these figures updated ASAP and I have concert tickets tonight.”
  ➡ Save time in the long run

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

we’re often not taught how to make projects reproducible…

Slide 15

Slide 15 text

…or we believe we don’t have time.

Slide 16

Slide 16 text

Marwick et al. 2017 https://doi.org/10.31235/osf.io/72n8g

Slide 17

Slide 17 text

“The most important tool is the mindset, when starting, that the end product will be reproducible.”
– Keith Baggerly

Slide 18

Slide 18 text

1. Organize!

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

my-project/
├── README.md
│
├── data/
│   ├── raw/
│   └── processed/
│
├── code/
│   └── survival_analysis.R
│
├── figures/
│   ├── fig1.png
│   └── fig2.png
│
└── manuscript/
    ├── project_manuscript.Rmd
    ├── project_manuscript.docx
    └── project_manuscript.pdf

Slide 25

Slide 25 text

• You don’t need any special tools to do this: organization and informative file names go a long way
• But some R tools can take it to the next level
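
As a rough illustration of the "no special tools" point, the layout from the earlier slide can be created with base R alone. This is only a sketch: the project is placed in a temporary directory, and the README text is a made-up placeholder.

```r
# Hypothetical project root inside a temporary directory (an assumption for
# this sketch; a real project would live somewhere permanent).
project_root <- file.path(tempdir(), "my-project")

# Sub-directories matching the layout shown on the slide.
sub_dirs <- c("data/raw", "data/processed", "code", "figures", "manuscript")
for (d in file.path(project_root, sub_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Add a README stub unless one already exists.
readme <- file.path(project_root, "README.md")
if (!file.exists(readme)) {
  writeLines(
    c("# my-project", "", "Describe the data, the code, and how to re-run the analysis."),
    readme
  )
}

# Show what was created.
print(list.files(project_root, recursive = TRUE, include.dirs = TRUE))
```

Keeping raw and processed data in separate folders makes it easier to leave the original files untouched.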

Slide 26

Slide 26 text

usethis
• Sets up commonly used components for projects and R packages
• create_project() - set up a project
• use_readme_md()
• …
usethis package by Jenny Bryan and Hadley Wickham: https://github.com/r-lib/usethis
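
A minimal sketch of how these two functions might be used together, assuming the usethis package is installed (version 1.6.0 or later for with_project()). The project name "usethis-demo" and the temporary location are placeholders chosen for illustration.

```r
# Placeholder location for the demo project (an assumption for this sketch).
project_path <- file.path(tempdir(), "usethis-demo")

if (requireNamespace("usethis", quietly = TRUE)) {
  # Create the project skeleton without opening a new session.
  usethis::create_project(project_path, rstudio = FALSE, open = FALSE)

  # Add a README.md stub inside the new project.
  usethis::with_project(project_path, usethis::use_readme_md(open = FALSE))

  print(list.files(project_path))
} else {
  message("usethis is not installed; skipping the demo.")
}
```

In an interactive session the open = FALSE arguments can be dropped, in which case usethis opens the new project and the README for editing.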

Slide 27

Slide 27 text

2. Script

Slide 28

Slide 28 text

Record everything you did, because you will have to do it again.

Slide 29

Slide 29 text

“Has this happened to you? You open an Excel file and start typing and nothing happens, and then you select a cell and you can start typing. Where did all of that initial text go? Well, sometimes it got entered into some random cell, to be discovered later during data analysis.”
– Broman & Woo, 2017
https://doi.org/10.1080/00031305.2017.1375989

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Where did this data even come from?

Slide 33

Slide 33 text

• Sometimes spreadsheets and hand-entered data are what I’ve got
• Sometimes data is coming from a database or website

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

Access data programmatically
• Recreate the exact steps, query parameters, etc. to retrieve the data
• No need to keep track of which files came from where
• Original data remains untouched
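
One way to picture this, as a sketch rather than a recommended implementation: keep the query parameters in the script, build the request from them, and download into data/raw/ only once. The endpoint https://example.org/api/measurements and the helpers build_query_url() and fetch_raw() are hypothetical names invented for this example.

```r
# Query parameters live in one place, so the retrieval can be repeated exactly.
query <- list(country = "US", city = "Corvallis", parameter = "pm25", limit = 10)

# Build a query URL from a base address and a named list of parameters.
build_query_url <- function(base_url, params) {
  values <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(params), "=", values, collapse = "&"))
}

# Placeholder endpoint (an assumption for illustration, not a real API).
url <- build_query_url("https://example.org/api/measurements", query)

# Download into the raw-data folder only if the file is not already there,
# so the original copy is never overwritten or edited by hand.
fetch_raw <- function(url, dest) {
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    utils::download.file(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Not run here because the endpoint above is a placeholder:
# fetch_raw(url, file.path("data", "raw", "measurements.json"))
print(url)
```

Because the parameters are written down in code, there is no separate record to keep of where a file came from.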

Slide 37

Slide 37 text

ropenaq
access air quality data from openaq

> aq_measurements(country = "US", city = "Corvallis", parameter = "pm25",
                  limit = 10, page = 1)
# A tibble: 10 x 12
   location           parameter value unit  country city     latitude longitude
 1 Corvallis - Circl… pm25       14.2 µg/m³ US      Corvall…     44.6     -123.
 2 Corvallis - Circl… pm25       15.3 µg/m³ US      Corvall…     44.6     -123.
 3 Corvallis - Circl… pm25       16   µg/m³ US      Corvall…     44.6     -123.
 4 Corvallis - Circl… pm25       14.9 µg/m³ US      Corvall…     44.6     -123.
 5 Corvallis - Circl… pm25       13.5 µg/m³ US      Corvall…     44.6     -123.
 6 Corvallis - Circl… pm25       12.5 µg/m³ US      Corvall…     44.6     -123.
 7 Corvallis - Circl… pm25       10.5 µg/m³ US      Corvall…     44.6     -123.
 8 Corvallis - Circl… pm25       12.5 µg/m³ US      Corvall…     44.6     -123.
 9 Corvallis - Circl… pm25       10.5 µg/m³ US      Corvall…     44.6     -123.
10 Corvallis - Circl… pm25        8.4 µg/m³ US      Corvall…     44.6     -123.
# ... with 4 more variables: dateUTC, dateLocal, cityURL, locationURL

ropenaq package by Maëlle Salmon: https://github.com/ropensci/ropenaq

Slide 38

Slide 38 text

3. Write reproducible reports

Slide 39

Slide 39 text

R Markdown
• Combine text, code, and figures into reproducible reports
• “We surveyed `r nrow(survey_data)` participants.”
• Papers, blog posts, presentations, books
• No endless copy/pasting figures into Word!
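
Inline code like the example above is evaluated when the report is rendered, so the number in the sentence always matches the data. The plain R sketch below mimics that behavior without rendering a document. Here survey_data is made-up example data and survey_sentence() is a hypothetical helper, not part of R Markdown.

```r
# Made-up survey data standing in for the real thing.
survey_data <- data.frame(
  participant = c("p01", "p02", "p03", "p04"),
  response    = c(3, 5, 4, 2)
)

# Compute the sentence from the data instead of typing the number by hand.
survey_sentence <- function(data) {
  sprintf("We surveyed %d participants.", nrow(data))
}

before <- survey_sentence(survey_data)

# A new participant arrives; re-running the code updates the sentence.
survey_data <- rbind(
  survey_data,
  data.frame(participant = "p05", response = 4)
)
after <- survey_sentence(survey_data)

print(before)
print(after)
```

In an actual R Markdown file the same effect comes from the inline expression itself, with no helper function needed.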

Slide 40

Slide 40 text

huskydown
• Write your thesis in R Markdown
• Uses UW thesis template for formatting
huskydown package by Ben Marwick: https://github.com/benmarwick/huskydown

Slide 42

Slide 42 text

4. Automate from end to end

Slide 43

Slide 43 text

Projects with many moving pieces
• Accessing, wrangling, and analyzing data can be many steps
• Time- and computation-intensive to re-run
• Order matters, but can be hard to remember

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

drake
• Workflow manager for R projects
• Define a drake_plan() for all steps in the project
• Use make() to run all steps, skipping things that haven’t changed
• Supports parallel computing
drake package by Will Landau: https://github.com/ropensci/drake/
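
A small sketch of what a plan could look like, assuming the drake package is installed. The target names (raw, fit, slope) and the toy data are invented for illustration. The cache is written to a temporary folder so the demo leaves the current project untouched.

```r
slope_value <- NULL

if (requireNamespace("drake", quietly = TRUE)) {
  library(drake)

  # Work in a throwaway folder so drake's cache does not land in a real project.
  demo_dir <- file.path(tempdir(), "drake-demo")
  dir.create(demo_dir, showWarnings = FALSE)
  old_wd <- setwd(demo_dir)

  # Each target is one step; drake works out the order from the dependencies.
  plan <- drake_plan(
    raw   = data.frame(x = 1:10, y = 2 * (1:10) + 1),
    fit   = lm(y ~ x, data = raw),
    slope = coef(fit)[["x"]]
  )

  make(plan)  # first run builds every target
  make(plan)  # second run finds everything up to date and rebuilds nothing

  # Read a finished target back from the cache.
  slope_value <- readd(slope)
  print(slope_value)

  setwd(old_wd)
} else {
  message("drake is not installed; skipping the demo.")
}
```

Changing one command in the plan and calling make() again would rebuild only that target and the ones that depend on it.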

Slide 46

Slide 46 text

Transforming science through open data and software

Slide 49

Slide 49 text

rOpenSci
• >300 R packages for data retrieval, data extraction, database access, data munging, data deposition, reproducibility, geospatial data, text analysis
• Resources for developers
  https://ropensci.github.io/dev_guide/
• Code review & support
  https://github.com/ropensci/onboarding/issues/230

Slide 51

Slide 51 text

rOpenSci
• Discoverability
  https://ropensci.org/packages/
  https://ropensci.org/blog/
• Community

Slide 52

Slide 52 text

“rOpenSci combines expertise and approachability, and its community inspires people to collaborate as the best versions of themselves.”
– Will Landau

Slide 53

Slide 53 text

What I left out:
• Data formats and best practices
• Version control
• Turning repeated code into functions
• Turning projects into R packages
• Licensing

Slide 54

Slide 54 text

Summary
• Reproducibility begins in the ~* m i n d *~
• Reproducibility is a spectrum: many practices and tools can support it
• The R ecosystem is rich with tools to help
• rOpenSci supports many such tools, and a vibrant community

Slide 55

Slide 55 text

thanks!
karawoo.com
@kara_woo