The workflowr R package: a framework for reproducible and collaborative data science

The workflowr R package helps scientists organize their research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly adopt workflowr, which has four key features: (1) workflowr automatically creates a directory structure for organizing data, code, and results; (2) workflowr uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, workflowr automatically includes code version information in the webpages that display results; and (4) workflowr facilitates web hosting (e.g., GitHub Pages) to share results. Our goal is for workflowr to make it easier for scientists to organize and communicate reproducible research results. Documentation and source code are available at https://github.com/jdblischak/workflowr.

John Blischak

July 11, 2018

Transcript

  1. The workflowr R package: a framework for reproducible and collaborative data science
    John Blischak (@jdblischak)
    2018-07-11
    useR! 2018, Brisbane, Australia
    github.com/jdblischak/workflowr
  2. My computational challenges: Organizing files; Tracking intermediate results; Sharing results
    John Blischak - github.com/jdblischak/workflowr
  3. John Blischak - github.com/jdblischak/workflowr
  4. Literate programming
    Source code: file.Rmd
    Results: file.html
    yihui.name/knitr, rmarkdown.rstudio.com
  5. R Markdown websites
    rmarkdown.rstudio.com
  6. Version control
    version: 2rko6xn, message: Start new…
    version: d1zyskv, message: Update parameters…
    version: z6o3b97, message: Label axes…
    git-scm.com, github.com/ropensci/git2r
  7. Version control terminology
    repository: the tracked files and their revision history
    commit: a snapshot of the current state of the files
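
The two terms above can be illustrated directly with git2r, the package workflowr uses to talk to Git. This is a minimal sketch, not part of the talk: the temporary directory, the file name `analysis.R`, the placeholder Git identity, and the commit messages are all hypothetical.

```r
library(git2r)

# Hypothetical throwaway directory that will hold the repository.
repo_dir <- tempfile("git2r-demo")
dir.create(repo_dir)

# A repository: the tracked files and their revision history.
repo <- init(repo_dir)
config(repo, user.name = "Example User", user.email = "user@example.com")

# First commit: a snapshot of the current state of the files.
writeLines("x <- 1", file.path(repo_dir, "analysis.R"))
add(repo, "analysis.R")
commit(repo, message = "First snapshot")

# Change the file and record a second snapshot.
writeLines("x <- 2", file.path(repo_dir, "analysis.R"))
add(repo, "analysis.R")
commit(repo, message = "Second snapshot")

# The revision history, most recent commit first.
history <- commits(repo)
print(history)
```

workflowr wraps calls like these so that users do not have to issue them by hand.
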
  8. Web hosting
    GitHub Pages: hosts one website per code repository
    pages.github.com
  9. workflowr: Version-controlled websites
    Organized, Reproducible, Shareable
  10. Organized
  11. Start a new project
    > wflow_start("myproject")
    1. Creates directory with template files
    2. Changes working directory
    3. Initiates Git repository and commits files
    Also available as RStudio Project Template
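
The steps listed on this slide can be tried out as follows. This is a sketch under a few assumptions that are not in the talk: the project is created in a temporary directory rather than as `"myproject"` in the current directory, `change_wd = FALSE` is set so the session's working directory is left alone (step 2 is therefore skipped), and the Git identity is a placeholder.

```r
library(workflowr)

# Hypothetical location: a throwaway directory instead of "myproject".
project_dir <- tempfile("myproject")

# change_wd = FALSE keeps the current working directory; the default (TRUE)
# moves into the new project, as described on the slide.
# user.name and user.email are placeholder values for the project's Git config.
wflow_start(project_dir, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# List the template files and directories that were created and committed.
created <- list.files(project_dir, all.files = TRUE, no.. = TRUE)
print(created)
```
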
  12. Organized directory structure
    R Markdown files, HTML files, Website options
  13. Reproducible
  14. Run code in clean environment
    > wflow_build(c("f1.Rmd", "f2.Rmd"))
    f1.Rmd, f2.Rmd
    github.com/r-lib/callr
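
A self-contained version of this call might look like the sketch below. It assumes pandoc is available, and it builds the template files `index.Rmd` and `about.Rmd` that `wflow_start()` creates, standing in for the slide's `f1.Rmd` and `f2.Rmd`. The temporary project directory and Git identity are placeholders.

```r
library(workflowr)

# Hypothetical throwaway project so the example has something to build.
project_dir <- tempfile("myproject")
wflow_start(project_dir, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# Template R Markdown files created by wflow_start().
rmd_files <- file.path(project_dir, "analysis", c("index.Rmd", "about.Rmd"))

# By default each file is rendered in its own external R process, so objects
# from the interactive session or from the other file cannot leak into the results.
# view = FALSE avoids opening a browser.
wflow_build(rmd_files, view = FALSE, project = project_dir)

# The rendered pages are written to the docs/ directory.
html_files <- file.path(project_dir, "docs", c("index.html", "about.html"))
print(file.exists(html_files))
```
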
  15. Tracking intermediate results
    > wflow_publish("analysis/file.Rmd")
    Performs 3 steps:
    1. Commits analysis/file.Rmd
    2. Builds analysis/file.Rmd
    3. Commits docs/file.html and figure files
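
The three steps can be observed end to end with the sketch below, which assumes pandoc is available. The temporary project, the contents of `file.Rmd`, the commit message, and the Git identity are all hypothetical. The Git history is inspected with git2r afterwards.

```r
library(workflowr)

# Hypothetical throwaway project.
project_dir <- tempfile("myproject")
wflow_start(project_dir, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# Write a small, made-up analysis file. The chunk fence is assembled with
# strrep() only to avoid nesting literal fences inside this example.
fence <- strrep("`", 3)
rmd <- file.path(project_dir, "analysis", "file.Rmd")
writeLines(c("---", "title: \"Example analysis\"", "---", "",
             paste0(fence, "{r}"), "summary(cars)", fence), rmd)

# Commit the Rmd, build it, then commit the resulting HTML and figure files.
wflow_publish(rmd, message = "Add example analysis", view = FALSE,
              project = project_dir)

# Inspect the outcome: the rendered page and the commit history.
html <- file.path(project_dir, "docs", "file.html")
history <- git2r::commits(git2r::repository(project_dir))
print(file.exists(html))
print(history)
```

Committing the source before building is what lets each HTML page be tied to the exact version of the code that produced it.
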
  16. Combining rmarkdown and Git
    Source code, Results: 1ong9jt, ln412fy
    Source code, Results: wr1q7bk, 3tg6lse
  17. View past results
  18. Other reproducibility features
    output: workflowr::wflow_html
    Records the session information at the end
    Sets a seed prior to running code
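
In plain R terms, the two features on this slide correspond roughly to the sketch below. It is an illustration of the idea rather than workflowr's own implementation, and the seed value is a hypothetical choice.

```r
# Hypothetical seed; any fixed integer would do.
seed <- 20180711

# Setting the same seed before running code makes random draws repeatable.
set.seed(seed)
first_run <- rnorm(3)

set.seed(seed)
second_run <- rnorm(3)

print(identical(first_run, second_run))

# Recording the session information captures the R version, platform,
# and attached packages alongside the results.
info <- sessionInfo()
print(info)
```
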
  19. Reproducibility report
  20. Shareable
  21. Distribute results for sharing
    Create new GitHub repository
    > wflow_git_push()
    © 2018 GitHub Inc.
    pages.github.com
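
Before pushing, the local repository needs to know where the GitHub repository lives. The sketch below registers a remote with `wflow_git_remote()` and then previews the push with `dry_run = TRUE`, so nothing is sent to GitHub. The account name `myname`, the repository name `myproject`, the temporary project, and the Git identity are placeholders.

```r
library(workflowr)

# Hypothetical throwaway project.
project_dir <- tempfile("myproject")
wflow_start(project_dir, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# Point the project at a (made-up) GitHub repository.
wflow_git_remote(remote = "origin", user = "myname", repo = "myproject",
                 project = project_dir)

# Confirm the URL that was registered for the remote.
remote_url <- git2r::remote_url(git2r::repository(project_dir), "origin")
print(remote_url)

# Preview the push; drop dry_run = TRUE to actually upload the commits.
wflow_git_push(dry_run = TRUE, project = project_dir)
```
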
  22. Installation
    1. Install R
       ◦ (Recommended) Install RStudio
       ◦ (Optional) Install pandoc
       ◦ (Optional) Install Git
    2. Install workflowr from CRAN
       ◦ install.packages("workflowr")
    3. Create an account on GitHub
    Documentation: https://jdblischak.github.io/workflowr
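
One way to see which of these pieces are already in place is the small check below. It is a sketch that installs nothing; the variable names are arbitrary, and looking for a `git` executable on the search path is only an approximation of "Git is installed".

```r
# Is the workflowr package installed?
has_workflowr <- requireNamespace("workflowr", quietly = TRUE)

# Can rmarkdown find pandoc (for example, the copy bundled with RStudio)?
has_pandoc <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

# Is a git executable on the search path? (Optional, per the slide.)
has_git <- unname(nzchar(Sys.which("git")))

print(c(workflowr = has_workflowr, pandoc = has_pandoc, git = has_git))

if (!has_workflowr) {
  message('workflowr is not installed; run install.packages("workflowr")')
}
```
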
  23. In summary, using workflowr…
    Enables you to start working reproducibly immediately
    Allows you to focus on your analysis
    Shares your results online
  24. Acknowledgements
    Co-authors: Peter Carbonetto, Matthew Stephens
    Early adopters for testing and feedback
    Authors and contributors to knitr, rmarkdown, git2r, callr
  25. workflowr: Version-controlled websites
    Organized, Reproducible, Shareable