Reproducible Research with R

Kara Woo
November 29, 2018

Reproducibility (or lack thereof) of research findings is a growing concern, but fortunately there are many tools and resources to aid analysts in developing transparent and reproducible projects. Kara will discuss the landscape of some of these tools, and how the rOpenSci community is advancing open, reproducible science through software and community.

Transcript

  1. Reproducible
    Research with R
    Kara Woo

    Research Scientist, Data Curation

    Sage Bionetworks

    2018-11-29

  2. reproducible research

  4. Given my data and code, you should be
    able to come to the same conclusions.*
    * in contrast to replicability: re-doing an
    experiment and getting the same results

  5. reproducible research

  11. reproducible research
    • “Uhhh which version of the data did I use, again?”
    ➡ Verify results
    • “Help! My collaborator joined the circus and left me to
    finish the manuscript.”
    ➡ Collaborate with others
    • “My boss wants these figures updated ASAP and I
    have concert tickets tonight.”
    ➡ Save time in the long run

  12. we’re often not taught how
    to make projects reproducible…

  13. …or we believe we
    don’t have time.

  14. Marwick et al. 2017 https://doi.org/10.31235/osf.io/72n8g

  15. “The most important tool is the mindset,
    when starting, that the end product
    will be reproducible.”
    – Keith Baggerly

  16. 1. Organize!

  17. my-project/
    ├── README.md
    │
    ├── data/
    │   ├── raw/
    │   └── processed/
    │
    ├── code/
    │   └── survival_analysis.R
    │
    ├── figures/
    │   ├── fig1.png
    │   └── fig2.png
    │
    └── manuscript/
        ├── project_manuscript.Rmd
        ├── project_manuscript.docx
        └── project_manuscript.pdf

  19. • You don’t need any special tools to do this:
    organization and informative file names go a long way
    • But some R tools can take it to the next level
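
As a minimal sketch of the "no special tools" route, the directory layout from the earlier slide can be created with base R alone. The helper name `create_layout` and the temporary directory are assumptions made for illustration; they are not from the talk.

```r
# Create the project skeleton shown earlier using only base R.
# `create_layout` is a hypothetical helper name.
create_layout <- function(root) {
  dirs <- file.path(
    root,
    c("data/raw", "data/processed", "code", "figures", "manuscript")
  )
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  # Add a stub README only if one does not already exist
  readme <- file.path(root, "README.md")
  if (!file.exists(readme)) {
    writeLines(paste("#", basename(root)), readme)
  }
  invisible(root)
}

# A temporary directory stands in for a real project location
root <- create_layout(file.path(tempdir(), "my-project"))
list.files(root, recursive = TRUE, include.dirs = TRUE)
```

Because existing directories and an existing README are left alone, the function can be re-run safely on a project that is already set up.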

  20. usethis
    • Sets up commonly used components for projects
    and R packages
    • create_project() - set up a project
    • use_readme_md()
    • …
    usethis package by Jenny Bryan and Hadley Wickham: https://github.com/r-lib/usethis
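
A hedged sketch of how the two functions named on this slide might be combined. The project path is hypothetical, `open = FALSE` only keeps the example non-interactive, and `with_project()` is one possible way (an assumption here, not something shown in the slides) to run `use_readme_md()` inside the newly created project.

```r
# Hypothetical project location; a temporary directory is used for illustration
proj <- file.path(tempdir(), "my-usethis-project")

if (requireNamespace("usethis", quietly = TRUE)) {
  # Create the project skeleton without opening it in a new session
  usethis::create_project(proj, open = FALSE)

  # Add a README.md, temporarily treating `proj` as the active project
  usethis::with_project(proj, usethis::use_readme_md(open = FALSE))
}
```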

  21. Record everything you did, because you will
    have to do it again.

  22. “Has this happened to you? You open an
    Excel file and start typing and nothing
    happens, and then you select a cell and you
    can start typing. Where did all of that initial
    text go? Well, sometimes it got entered into
    some random cell, to be discovered later
    during data analysis.”
    – Broman & Woo, 2017
    https://doi.org/10.1080/00031305.2017.1375989

  23. Where did this data even come from?

  24. • Sometimes spreadsheets and hand-entered data are
    what I’ve got
    • Sometimes data is coming from a database or
    website

  25. Access data
    programmatically
    • Recreate the exact steps, query parameters, etc. to
    retrieve the data
    • No need to keep track of which files came from
    where
    • Original data remains untouched
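
One way to read these points in code, as a sketch under stated assumptions: the endpoint URL, the output path, and the helper names `build_query_url` and `retrieve_raw` are all hypothetical. The idea is that the query parameters live in the script, so the retrieval can be repeated exactly, and the raw file is written once and never overwritten.

```r
# Hypothetical endpoint, used only for illustration
base_url <- "https://example.org/api/measurements"

# The exact query parameters are recorded in code rather than remembered
query <- list(country = "US", city = "Corvallis", parameter = "pm25",
              limit = 10, page = 1)

# Build the full request URL from the recorded parameters
build_query_url <- function(base_url, query) {
  values <- vapply(
    query,
    function(x) utils::URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(query), "=", values, collapse = "&"))
}

# Download once into the raw data folder; an existing file is left untouched
retrieve_raw <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    utils::download.file(url, dest, mode = "wb")
  }
  invisible(dest)
}

url <- build_query_url(base_url, query)

# Not run here because the URL is hypothetical:
# retrieve_raw(url, file.path("data", "raw", "measurements.json"))
```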

  26. ropenaq: access air quality data from OpenAQ
    > aq_measurements(country = "US", city = "Corvallis", parameter = "pm25",
    +                 limit = 10, page = 1)
    # A tibble: 10 x 12
       location           parameter value unit  country city     latitude longitude
     1 Corvallis - Circl… pm25       14.2 µg/m³ US      Corvall…     44.6     -123.
     2 Corvallis - Circl… pm25       15.3 µg/m³ US      Corvall…     44.6     -123.
     3 Corvallis - Circl… pm25       16   µg/m³ US      Corvall…     44.6     -123.
     4 Corvallis - Circl… pm25       14.9 µg/m³ US      Corvall…     44.6     -123.
     5 Corvallis - Circl… pm25       13.5 µg/m³ US      Corvall…     44.6     -123.
     6 Corvallis - Circl… pm25       12.5 µg/m³ US      Corvall…     44.6     -123.
     7 Corvallis - Circl… pm25       10.5 µg/m³ US      Corvall…     44.6     -123.
     8 Corvallis - Circl… pm25       12.5 µg/m³ US      Corvall…     44.6     -123.
     9 Corvallis - Circl… pm25       10.5 µg/m³ US      Corvall…     44.6     -123.
    10 Corvallis - Circl… pm25        8.4 µg/m³ US      Corvall…     44.6     -123.
    # ... with 4 more variables: dateUTC, dateLocal, cityURL,
    #   locationURL
    ropenaq package by Maëlle Salmon: https://github.com/ropensci/ropenaq

  27. 3. Write reproducible
    reports

  28. R Markdown
    • Combine text, code, and figures into reproducible
    reports
    • “We surveyed `r nrow(survey_data)`
    participants.”
    • Papers, blog posts, presentations, books
    • No endless copy/pasting figures into Word!
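
To make the inline-code idea concrete, here is a small sketch that writes a complete R Markdown file from R and renders it if the tooling is available. The file name, chunk labels, and the toy `survey_data` are assumptions for illustration; only the inline expression comes from the slide.

```r
# Three backticks, built programmatically so the chunk fences in the
# generated file do not clash with the fence around this example
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Survey report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r load-data, include=FALSE}"),
  "# Toy data standing in for a real survey",
  "survey_data <- data.frame(id = 1:25, score = seq(50, 98, by = 2))",
  fence,
  "",
  "We surveyed `r nrow(survey_data)` participants.",
  "",
  paste0(fence, "{r score-histogram, echo=FALSE}"),
  "hist(survey_data$score, main = \"Scores\")",
  fence
)

# A temporary directory stands in for the project's manuscript/ folder
rmd_path <- file.path(tempdir(), "survey_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering requires the rmarkdown package and pandoc
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

When the data changes, re-rendering updates both the participant count in the sentence and the figure, with no manual copying.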

  29. huskydown
    • Write your thesis in R Markdown
    • Uses UW thesis template for formatting
    huskydown package by Ben Marwick: https://github.com/benmarwick/huskydown

  31. 4. Automate from end
    to end

  32. Projects with many moving
    pieces
    • Accessing, wrangling, and analyzing data can be
    many steps
    • Time- and computation-intensive to re-run
    • Order matters, but can be hard to remember

  33. drake
    • Workflow manager for R projects
    • Define a drake_plan() for all steps in the project
    • Use make() to run all steps, skipping things that
    haven’t changed
    • Supports parallel computing
    drake package by Will Landau: https://github.com/ropensci/drake/
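
A minimal sketch of the `drake_plan()` and `make()` workflow described on this slide. The target names (`raw`, `model`, `slope`) and the toy data are made up for illustration, and the example runs in a temporary directory so the cache that drake creates does not land in a real project.

```r
if (requireNamespace("drake", quietly = TRUE)) {
  # Each argument is a target; drake infers that `model` depends on `raw`
  # and `slope` depends on `model`
  plan <- drake::drake_plan(
    raw   = data.frame(x = 1:10, y = 2 * (1:10) + 1),
    model = lm(y ~ x, data = raw),
    slope = unname(coef(model)["x"])
  )

  # Work in a temporary directory for this illustration only
  old_wd <- setwd(tempdir())

  drake::make(plan)  # first call builds every target
  drake::make(plan)  # second call: nothing has changed, so nothing should be rebuilt

  # Read a finished target back from the cache
  slope_value <- drake::readd(slope)

  setwd(old_wd)
}
```

In a real project the plan would usually live in its own script and `make()` would be run from the project root.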

  34. Transforming science through
    open data and software

  37. rOpenSci
    • >300 R packages for data retrieval, data extraction,
    database access, data munging, data deposition,
    reproducibility, geospatial data, text analysis
    • Resources for developers

    https://ropensci.github.io/dev_guide/
    • Code review & support

    https://github.com/ropensci/onboarding/issues/230

  39. rOpenSci
    • Discoverability

    https://ropensci.org/packages/

    https://ropensci.org/blog/
    • Community

  40. “rOpenSci combines expertise and
    approachability, and its community inspires
    people to collaborate as the best versions
    of themselves.”
    – Will Landau

  41. What I left out:
    • Data formats and best practices
    • Version control
    • Turning repeated code into functions
    • Turning projects into R packages
    • Licensing

  42. Summary
    • Reproducibility begins in the ~* m i n d *~
    • Reproducibility is a spectrum — many practices and
    tools can support it
    • The R ecosystem is rich with tools to help
    • rOpenSci supports many such tools, and a vibrant
    community

  43. thanks!
    karawoo.com

    @kara_woo
