Reproducible Research with R

Kara Woo
November 29, 2018

Reproducibility (or lack thereof) of research findings is a growing concern, but fortunately there are many tools and resources to aid analysts in developing transparent and reproducible projects. Kara will discuss the landscape of some of these tools, and how the rOpenSci community is advancing open, reproducible science through software and community.


Transcript

  2. Given my data and code, you should be able to come to the same conclusions.*
    *In contrast to replicability: re-doing an experiment and getting the same results.
  6. reproducible research
    • “Uhhh which version of the data did I use, again?” ➡ Verify results
    • “Help! My collaborator joined the circus and left me to finish the manuscript.” ➡ Collaborate with others
    • “My boss wants these figures updated ASAP and I have concert tickets tonight.” ➡ Save time in the long run
  7. “The most important tool is the mindset, when starting, that the end product will be reproducible.” – Keith Baggerly
  8. my-project/
     ├── README.md
     │
     ├── data/
     │   ├── raw/
     │   └── processed/
     │
     ├── code/
     │   └── survival_analysis.R
     │
     ├── figures/
     │   ├── fig1.png
     │   └── fig2.png
     │
     └── manuscript/
         ├── project_manuscript.Rmd
         ├── project_manuscript.docx
         └── project_manuscript.pdf
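
The layout on the slide can be created by hand, but a few lines of base R do the same thing repeatably. This is a minimal sketch, not something shown in the talk: `project_root` is a hypothetical location (a temporary folder here), and the folder names are copied from the slide.

```r
# Create the project skeleton from the slide in a temporary location.
# `project_root` is a hypothetical path; point it wherever the project should live.
project_root <- file.path(tempdir(), "my-project")

subdirs <- c("data/raw", "data/processed", "code", "figures", "manuscript")

for (d in subdirs) {
  # recursive = TRUE also creates parent folders such as data/
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# An empty README, to be filled in with a description of the project
file.create(file.path(project_root, "README.md"))

# List the folders that now exist, relative to the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(sort(created[created != ""]))
```

Keeping `data/raw` separate from `data/processed` is the part that matters most: scripts read from the first and write only to the second.
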
  10. • You don’t need any special tools to do this: organization and informative file names go a long way
    • But some R tools can take it to the next level
  11. usethis
    • Sets up commonly used components for projects and R packages
    • create_project() - set up a project
    • use_readme_md()
    • …
    usethis package by Jenny Bryan and Hadley Wickham: https://github.com/r-lib/usethis
  12. “Has this happened to you? You open an Excel file and start typing and nothing happens, and then you select a cell and you can start typing. Where did all of that initial text go? Well, sometimes it got entered into some random cell, to be discovered later during data analysis.” – Broman & Woo, 2017
    https://doi.org/10.1080/00031305.2017.1375989
  13. • Sometimes spreadsheets and hand-entered data are what I’ve got
    • Sometimes data is coming from a database or website
  14. Access data programmatically
    • Recreate the exact steps, query parameters, etc. to retrieve the data
    • No need to keep track of which files came from where
    • Original data remains untouched
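
One way to make "recreate the exact steps, query parameters, etc." concrete is to keep the parameters in code and save them next to the raw data. The sketch below uses only base R. The endpoint `https://example.org/api/measurements`, the helper `build_query_url()`, and the file name `query_provenance.R` are all hypothetical, chosen for illustration, and the download itself is left commented out for that reason.

```r
# Query parameters live in one place so the request can be re-created exactly.
# The endpoint below is a placeholder, not a real service.
base_url <- "https://example.org/api/measurements"
query <- list(country = "US", city = "Corvallis", parameter = "pm25", limit = 10)

# Hypothetical helper: assemble "key=value" pairs into a query string
build_query_url <- function(base_url, query) {
  values <- vapply(
    query,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(query), "=", values, collapse = "&"))
}

request_url <- build_query_url(base_url, query)

# Save the request details in the folder where the raw data would go
raw_dir <- file.path(tempdir(), "data", "raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
provenance <- c(
  query,
  list(url = request_url, retrieved = format(Sys.time(), tz = "UTC", usetz = TRUE))
)
dput(provenance, file = file.path(raw_dir, "query_provenance.R"))

# The download is commented out because the endpoint is hypothetical:
# download.file(request_url, file.path(raw_dir, "measurements.csv"))
print(request_url)
```

Reading the record back with `dget()` recovers the parameters, so a later run can repeat the same request without relying on memory.
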
  15. ropenaq: access air quality data from OpenAQ

          > aq_measurements(country = "US", city = "Corvallis", parameter = "pm25",
                            limit = 10, page = 1)
          # A tibble: 10 x 12
             location           parameter value unit  country city     latitude longitude
             <chr>              <chr>     <dbl> <chr> <chr>   <chr>       <dbl>     <dbl>
           1 Corvallis - Circl… pm25       14.2 µg/m³ US      Corvall…     44.6     -123.
           2 Corvallis - Circl… pm25       15.3 µg/m³ US      Corvall…     44.6     -123.
           3 Corvallis - Circl… pm25       16   µg/m³ US      Corvall…     44.6     -123.
           4 Corvallis - Circl… pm25       14.9 µg/m³ US      Corvall…     44.6     -123.
           5 Corvallis - Circl… pm25       13.5 µg/m³ US      Corvall…     44.6     -123.
           6 Corvallis - Circl… pm25       12.5 µg/m³ US      Corvall…     44.6     -123.
           7 Corvallis - Circl… pm25       10.5 µg/m³ US      Corvall…     44.6     -123.
           8 Corvallis - Circl… pm25       12.5 µg/m³ US      Corvall…     44.6     -123.
           9 Corvallis - Circl… pm25       10.5 µg/m³ US      Corvall…     44.6     -123.
          10 Corvallis - Circl… pm25        8.4 µg/m³ US      Corvall…     44.6     -123.
          # ... with 4 more variables: dateUTC <dttm>, dateLocal <dttm>, cityURL <chr>,
          #   locationURL <chr>

    ropenaq package by Maëlle Salmon: https://github.com/ropensci/ropenaq
  16. R Markdown
    • Combine text, code, and figures into reproducible reports
    • “We surveyed `r nrow(survey_data)` participants.”
    • Papers, blog posts, presentations, books
    • No endless copy/pasting figures into Word!
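
The inline expression on the slide is evaluated when the report is rendered, so the number in the sentence always matches the data. A plain-R sketch of the same idea follows. `survey_data` here is made-up illustration data, and `sprintf()` stands in for what R Markdown does with the inline code.

```r
# Hypothetical survey data; in a real project this would be read from data/.
survey_data <- data.frame(
  id = 1:5,
  response = c("yes", "no", "yes", "yes", "no")
)

# Plain-R equivalent of the inline expression `r nrow(survey_data)`:
# the count is computed from the data at run time, not typed by hand.
sentence <- sprintf("We surveyed %d participants.", nrow(survey_data))
print(sentence)
```

If rows are added to or removed from the data, the sentence updates the next time the code runs, which is the point of generating text from code.
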
  17. huskydown
    • Write your thesis in R Markdown
    • Uses UW thesis template for formatting
    huskydown package by Ben Marwick: https://github.com/benmarwick/huskydown
  19. Projects with many moving pieces
    • Accessing, wrangling, and analyzing data can be many steps
    • Time- and computation-intensive to re-run
    • Order matters, but can be hard to remember
  20. drake
    • Workflow manager for R projects
    • Define a drake_plan() for all steps in the project
    • Use make() to run all steps, skipping things that haven’t changed
    • Supports parallel computing
    drake package by Will Landau: https://github.com/ropensci/drake/
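
A small plan shows how `drake_plan()` and `make()` fit together. This sketch is not from the talk. It assumes the drake package is installed (the code does nothing otherwise), uses made-up data, and stores the cache in a temporary folder rather than the default `.drake/` directory so the demonstration leaves nothing behind.

```r
if (requireNamespace("drake", quietly = TRUE)) {
  # Each target is a named step; drake works out the order from the dependencies
  plan <- drake::drake_plan(
    raw = data.frame(x = 1:10, y = 2 * (1:10) + 1),  # made-up data
    fit = lm(y ~ x, data = raw),
    slope = coef(fit)[["x"]]
  )

  # Keep the cache in a temporary folder for this demonstration
  cache <- drake::new_cache(path = tempfile("drake-cache-"))

  drake::make(plan, cache = cache)  # first run builds raw, fit, and slope
  drake::make(plan, cache = cache)  # second run should find nothing to rebuild

  # Retrieve a finished target from the cache
  slope <- drake::readd(slope, cache = cache)
  print(slope)
}
```

Editing the command for `fit` and calling `make()` again would rebuild `fit` and `slope` but leave `raw` alone, which is where the time savings on larger projects come from.
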
  23. rOpenSci
    • >300 R packages for data retrieval, data extraction, database access, data munging, data deposition, reproducibility, geospatial data, text analysis
    • Resources for developers: https://ropensci.github.io/dev_guide/
    • Code review & support: https://github.com/ropensci/onboarding/issues/230
  24. “rOpenSci combines expertise and approachability, and its community inspires people to collaborate as the best versions of themselves.” – Will Landau
  25. What I left out:
    • Data formats and best practices
    • Version control
    • Turning repeated code into functions
    • Turning projects into R packages
    • Licensing
  26. Summary
    • Reproducibility begins in the ~* m i n d *~
    • Reproducibility is a spectrum: many practices and tools can support it
    • The R ecosystem is rich with tools to help
    • rOpenSci supports many such tools, and a vibrant community