The workflowr R package: a framework for reproducible and collaborative data science

The workflowr R package helps scientists organize their research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt workflowr, which includes four key features: (1) workflowr automatically creates a directory structure for organizing data, code, and results; (2) workflowr uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, workflowr automatically includes code version information in the webpages displaying results; and (4) workflowr facilitates online web hosting (e.g., GitHub Pages) to share results. Our goal is that workflowr will make it easier for scientists to organize and communicate reproducible research results. Documentation and source code are available at https://github.com/jdblischak/workflowr.

John Blischak

July 11, 2018

Transcript

  1. The workflowr R package: a framework for reproducible and collaborative data science
    John Blischak (@jdblischak)
    2018-07-11
    useR! 2018 Brisbane, Australia
    github.com/jdblischak/workflowr

  2. My computational challenges
    Organizing files
    Tracking intermediate results
    Sharing results
    John Blischak - github.com/jdblischak/workflowr

  4. Literate programming
    Source code: file.Rmd
    Results: file.html
    yihui.name/knitr
    rmarkdown.rstudio.com

  5. R Markdown websites
    rmarkdown.rstudio.com

  6. Version control
    version: 2rko6xn, message: Start new…
    version: d1zyskv, message: Update parameters…
    version: z6o3b97, message: Label axes…
    git-scm.com
    github.com/ropensci/git2r

  7. Version control terminology
    repository – the tracked files and their revision history
    commit – a snapshot of the current state of the files
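
As a rough illustration of these two terms, the sketch below uses git2r (the package the slides cite for Git access) to create a repository in a temporary directory and record two commits. The file name, commit messages, and user details are hypothetical.

```r
# Create an empty repository in a throwaway directory.
path <- tempfile("repo-")
dir.create(path)
repo <- git2r::init(path)

# Set a repository-local identity so committing works without a global
# Git configuration (placeholder name and email).
git2r::config(repo, user.name = "Example User", user.email = "user@example.com")

# First commit: a snapshot of the file in its initial state.
writeLines("x <- 1", file.path(path, "analysis.R"))
git2r::add(repo, "analysis.R")
git2r::commit(repo, "Start new analysis")

# Second commit: a snapshot after the file was edited.
writeLines("x <- 2", file.path(path, "analysis.R"))
git2r::add(repo, "analysis.R")
git2r::commit(repo, "Update parameters")

# The repository now holds the file plus its revision history.
history <- git2r::commits(repo)
length(history)
```

Each element of `history` is one commit, so the repository can be read as the working files plus this list of snapshots.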

  8. Web hosting
    GitHub Pages – hosts one website per code repository
    pages.github.com

  9. workflowr
    Organized
    Reproducible
    Shareable
    Version-controlled websites

  10. Organized

  11. Start a new project
    > wflow_start("myproject")
    1. Creates directory with template files
    2. Changes working directory
    3. Initializes Git repository and commits files
    Also available as an RStudio Project Template
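
The steps listed for `wflow_start()` can be tried without touching an existing project. The sketch below assumes workflowr is installed and creates the project inside a temporary directory; the project path and the Git identity are placeholders, and `change_wd = FALSE` is used only so the example leaves the current working directory alone (the default, as on the slide, is to move into the new project).

```r
library(workflowr)

# Hypothetical project location inside a throwaway directory.
base <- tempfile("wflow-")
dir.create(base)
project <- file.path(base, "myproject")

# Create the project, initialize Git, and commit the template files.
# user.name and user.email set a project-local Git identity (placeholders).
wflow_start(project, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# Inspect the template directories and files that were created.
list.files(project)
list.files(file.path(project, "analysis"))
```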

  12. Organized directory structure
    R Markdown files
    HTML files
    Website options

  13. Reproducible

  14. Run code in clean environment
    > wflow_build(c("f1.Rmd", "f2.Rmd"))
    f1.Rmd
    f2.Rmd
    github.com/r-lib/callr
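
Why a clean environment matters can be seen with callr on its own. This is a minimal sketch of the isolation idea, not workflowr's internal code: a function run in a fresh R process cannot see objects that happen to exist in the current session. The object name is hypothetical.

```r
library(callr)

# An object left over in the current session (hypothetical name).
leftover_object <- "defined only in this session"

# The function runs in a new R process, which starts with an empty
# global environment, so it cannot see leftover_object.
seen_in_child <- r(function() exists("leftover_object"))

exists("leftover_object")  # checked in the current session
seen_in_child              # result of the same check in the fresh process
```

An analysis that only works because of such a leftover object would fail when built this way, which is the point: the failure shows up before the results are shared.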

  15. Tracking intermediate results
    > wflow_publish("analysis/file.Rmd")
    Performs 3 steps:
    1. Commits analysis/file.Rmd
    2. Builds analysis/file.Rmd
    3. Commits docs/file.html and figure files
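
The three steps can be previewed before anything is committed or built. The sketch below assumes workflowr is installed, sets up a throwaway project with a placeholder Git identity, and calls `wflow_publish()` with `dry_run = TRUE` on the template home page; the commit message is hypothetical.

```r
library(workflowr)

# Throwaway project with a placeholder Git identity.
base <- tempfile("wflow-")
dir.create(base)
project <- file.path(base, "myproject")
wflow_start(project, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# Preview publishing the template home page. dry_run = TRUE reports the
# planned commit, build, and commit steps without performing them.
preview <- wflow_publish(file.path(project, "analysis", "index.Rmd"),
                         message = "Publish the home page",
                         dry_run = TRUE, project = project)
preview
```

Dropping `dry_run = TRUE` performs the steps for real, which additionally requires pandoc to render the HTML.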

  16. Combining rmarkdown and Git
    Source code: 1ong9jt, Results: ln412fy
    Source code: wr1q7bk, Results: 3tg6lse

  17. View  past  results

  18. Other reproducibility features
    output: workflowr::wflow_html
    Records the session information at the end
    Sets a seed prior to running code
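
What these two features buy can be shown in base R. The sketch below illustrates the idea only and is not workflowr's internal code; the helper name and the seed value are arbitrary.

```r
# Hypothetical analysis step that depends on random numbers.
simulate <- function(seed) {
  set.seed(seed)  # fix the random number generator state first
  rnorm(3)
}

# Same seed, same draws: the result can be regenerated exactly.
first_run  <- simulate(20180711)
second_run <- simulate(20180711)
identical(first_run, second_run)

# Session information: R version, platform, and attached packages.
sessionInfo()
```

Setting the seed before the code runs makes stochastic results repeatable, and the session information documents the software versions that produced them.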

  19. Reproducibility  report

  20. Shareable

  21. Distribute results for sharing
    Create new GitHub repository
    > wflow_git_push()
    © 2018 GitHub Inc.
    pages.github.com
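
Before the first push, the local project needs to know where the GitHub repository lives. The sketch below assumes workflowr is installed and uses a throwaway project; `myname` and `myproject` stand in for a real GitHub account and repository, and `dry_run = TRUE` previews the push without contacting GitHub.

```r
library(workflowr)

# Throwaway project with a placeholder Git identity.
base <- tempfile("wflow-")
dir.create(base)
project <- file.path(base, "myproject")
wflow_start(project, change_wd = FALSE,
            user.name = "Example User", user.email = "user@example.com")

# Register the GitHub repository as the remote "origin".
# "myname" and "myproject" are placeholders.
wflow_git_remote(remote = "origin", user = "myname", repo = "myproject",
                 project = project)

# Preview the push without sending anything.
wflow_git_push(dry_run = TRUE, project = project)
```

Dropping `dry_run = TRUE` performs the push, which requires that the repository already exists on GitHub and that credentials are available.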

  22. Installation
    1. Install R
    ◦ (Recommended) Install RStudio
    ◦ (Optional) Install pandoc
    ◦ (Optional) Install Git
    2. Install workflowr from CRAN
    ◦ install.packages("workflowr")
    3. Create an account on GitHub
    Documentation: https://jdblischak.github.io/workflowr

  23. In summary, using workflowr…
    Enables you to start working reproducibly immediately
    Allows you to focus on your analysis
    Shares your results online

  24. Acknowledgements
    Co-authors: Peter Carbonetto, Matthew Stephens
    Early adopters for testing and feedback
    Authors and contributors to knitr, rmarkdown, git2r, callr

  25. workflowr
    Organized
    Reproducible
    Shareable
    Version-controlled websites