2019-09-17 Creating Reproducible Data Science

Alex Gold
September 17, 2019

A webinar for the John Deere company on creating reproducible data science; code at github.com/akgold/2019-09-17_john_deere_webinar.

Transcript

  1. Creating Reproducible Data Science Alex K. Gold Solutions Engineer @alexkgold

    github.com/akgold/2019-09-17_john_deere_webinar
  2. Why care about reproducibility? Or portability?

  3. Why care about reproducibility? Or portability? Sharing!

    • With Colleagues • With Future You (they don’t answer ☎)
  4. How much reproducibility is enough?

    Not at all, Somewhat, Fully: More Reproducible means More Work
  5. A Taxonomy Of Irreproducibility

  6. Code that won’t run on someone else’s machine

    c://Documents/, /usr/home/agold, ~/agold
  7. Difficulty Finding Latest Version

    ninja_analysis_final2_alex_final.Rmd “Well, the version I have…” “Let me just send you the newest…”
  8. Things that are tedious and you’ll need again

    “Well, I had to set up these variables like this…”
    model <- glm(outcome ~ input, family = "Gamma", data = my_dat, weights = wgt, subset = my_set, na.action = "na.exclude", offset = 7, model = TRUE)
  9. Environmental factors that will break

    “Oh, so you just need to get this system dependency configured…” “Right, I used version 0.7.6 of the package, not 0.7.8…” “Oh yeah, this kinda doesn’t work on Windows…”
  10. Solutions

  11. Code that breaks on someone else’s machine

  12. Code that breaks on someone else’s machine Project-based Workflow

  13. If the first line of your R script is setwd("C:\Users\jenny\path\that\only\I\have")

    I will come into your office and SET YOUR COMPUTER ON FIRE
  14. If the first line of your R script is rm(list = ls())

    I will come into your office and SET YOUR COMPUTER ON FIRE
  15. Avoiding Computer fires Project-based Workflow and here::here

  16. Avoiding Computer fires Demo!

  17. Code Structure

  18. Project-Based Workflow Learn More tidyverse.org/articles/2017/12/workflow-vs-script
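
The project-based workflow and here::here from the slides above might look roughly like this in a script. It is a minimal sketch, assuming the script runs somewhere inside a project directory; `project_path()` and the `data/my_data.csv` file name are hypothetical, and the base-R fallback is only an approximation used when the here package is not installed.

```r
# Hypothetical helper: build paths relative to the project root instead of
# calling setwd() with a path that exists on only one machine.
project_path <- function(...) {
  if (requireNamespace("here", quietly = TRUE)) {
    # here::here() resolves paths against the detected project root
    here::here(...)
  } else {
    # Fallback assumption: the working directory is already the project root
    file.path(getwd(), ...)
  }
}

# "data/my_data.csv" is an illustrative file name, not taken from the webinar repo
data_path <- project_path("data", "my_data.csv")
print(data_path)
```

Because the path is assembled from the project root at run time, the same script should work unchanged for a colleague who clones the project to a different location. Restarting the R session, rather than calling rm(list = ls()), is the usual companion habit.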

  19. Difficulty Finding Latest Version

  20. Difficulty Finding Latest Version Version Control

  21. Version Control

  22. Version Control: Why would I use it?

    1. Distinct versions. 2. Remote backups. 3. Collaborating on code.
  23. Version Control A few terms. (Sorry) Origin

  24. Cute dog break.

  25. Version Control Branching Master “Feature”

  26. Git vs GitHub

  27. Where can I git? In RStudio, On GitHub, Command Line

  28. Version Control Demo!

  29. Version Control Command Review

  30. Version Control Branching: master, fix_model

    git checkout -b fix_model
    git merge
  31. Version Control Learn More www.happygitwithr.com

  32. Things that are tedious and you’ll need again

  33. Things that are tedious and you’ll need again R Tools

  34. When you’ve used the same function, RMarkdown document, or boilerplate Shiny code 3x, write a package
  35. Code Snippets, Functions, And Templates Demo!

  36. R Packages r-pkgs.had.co.nz
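
As an illustration of the "used it 3x" rule, a tedious model call like the glm() example on slide 8 could be wrapped in a function before it graduates to a package. This is a sketch only: `fit_gamma_glm()`, the log link, and the simulated `toy` data are assumptions made for the example, not code from the webinar.

```r
# Hypothetical helper: fix the settings that had to be re-typed every time.
fit_gamma_glm <- function(data, formula = outcome ~ input) {
  glm(
    formula,
    family = Gamma(link = "log"),  # assumed link, chosen for the example
    data = data,
    na.action = na.exclude         # keep rows aligned when values are missing
  )
}

# Simulated data so the sketch runs on its own
set.seed(42)
toy <- data.frame(input = runif(50))
toy$outcome <- rgamma(50, shape = 2, rate = 2 / exp(1 + toy$input))

model <- fit_gamma_glm(toy)
print(coef(model))
```

Once a helper like this is needed across several projects, moving it into a package gives it one home with documentation and tests.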

  37. Environmental factors that will break

  38. Environmental factors that will break Controlling Environments

  39. Why would my environment break?

  40. Reproducing Environments (Advanced Reproducibility) Save package state using packrat/renv

  41. Reproducing Environments (Advanced Reproducibility)

  42. Reproducing Environments Learn More environments.rstudio.com
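
To make the "version 0.7.6, not 0.7.8" problem concrete, the sketch below records package versions using base R only. It is not how packrat or renv work internally; `record_versions()` is a hypothetical helper, and the renv calls appear only as comments because they modify the project they are run in.

```r
# Hypothetical helper: note which package versions an analysis ran with.
record_versions <- function(pkgs) {
  data.frame(
    package = pkgs,
    version = vapply(
      pkgs,
      function(p) as.character(packageVersion(p)),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Base packages are used here so the sketch runs on any R installation
versions <- record_versions(c("stats", "utils"))
print(versions)

# With renv installed, the equivalent project-level workflow is roughly:
#   renv::init()      # set up a project-local library
#   renv::snapshot()  # write the package versions to renv.lock
#   renv::restore()   # reinstall those versions on another machine
```

A lockfile checked into version control alongside the code lets a collaborator restore the same package versions instead of discovering the mismatch by trial and error.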

  43. The End

    Code only for your machine: Project-based workflow (tidyverse.org/articles/2017/12/workflow-vs-script)
    Trying to find and coordinate versions: Use version control (happygitwithr.com)
    Reusing tedious work: Write a package (r-pkgs.had.co.nz)
    Reproducing environments: packrat/renv (environments.rstudio.com)
    github.com/akgold/2019-09-17_john_deere_webinar