2019-09-17 Creating Reproducible Data Science

Alex Gold
September 17, 2019

A webinar for the John Deere company on creating reproducible data science; code at github.com/akgold/2019-09-17_john_deere_webinar.


Transcript

  1. 1.
  2. 3.

    Why care about reproducibility? Or portability? Sharing!
    • With Colleagues
    • With Future You (they don’t answer ☎)
  3. 8.

    Things that are tedious and you’ll need again
    “Well, I had to set up these variables like this…”
    model <- glm(outcome ~ input, family = Gamma, data = my_dat, weights = wgt, subset = my_set, na.action = na.exclude, offset = 7, model = TRUE)
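One way to stop re-typing a setup like this is to wrap it in a function. The following is a minimal sketch, not code from the talk: `fit_gamma_model()` is a hypothetical helper name, the log link is an assumption chosen so the toy fit is well behaved, and the data are simulated for illustration only.

```r
# Hypothetical helper (name and defaults are assumptions, not from the talk):
# the tedious arguments live in one place and callers only supply the data.
fit_gamma_model <- function(data, formula = outcome ~ input) {
  stats::glm(
    formula,
    family = stats::Gamma(link = "log"),  # assumed link, for illustration
    data = data,
    na.action = stats::na.exclude,
    model = TRUE
  )
}

# Simulated data, purely so the sketch runs end to end.
set.seed(42)
my_dat <- data.frame(input = runif(50))
my_dat$outcome <- rgamma(50, shape = 2, rate = 2 / exp(1 + my_dat$input))

model <- fit_gamma_model(my_dat)
```

Once a helper like this is needed in more than one project, it becomes a natural candidate for a package.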
  4. 9.

    Environmental factors that will break
    “Oh, so you just need to get this system dependency configured…”
    “Right, I used version 0.7.6 of the package, not 0.7.8…”
    “Oh yeah, this kinda doesn’t work on Windows…”
  5. 10.
  6. 13.

    If the first line of your R script is
    setwd("C:\Users\jenny\path\that\only\I\have")
    I will come into your office and SET YOUR COMPUTER ON FIRE
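A common alternative to a hard-coded `setwd()` is to build paths relative to the project root. This is a rough sketch under assumptions: the `here` package is not mentioned in the slides, and the `data/my_dat.csv` file name is hypothetical. If `here` is not installed, the sketch falls back to a plain relative path.

```r
# Assumption: the script is run from inside a project directory.
# The file name below is hypothetical.
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() builds the path from the project root, wherever that is
  # on the current machine.
  data_path <- here::here("data", "my_dat.csv")
} else {
  # Fallback: a relative path, resolved against the working directory.
  data_path <- file.path("data", "my_dat.csv")
}
```

Either way, no path that exists on only one machine ends up in the script.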
  7. 14.

    If the first line of your R script is
    rm(list = ls())
    I will come into your office and SET YOUR COMPUTER ON FIRE
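The trouble with `rm(list = ls())` is that it only deletes objects; it does not give you a fresh R session. The sketch below illustrates this with a scratch environment standing in for the global environment (an assumption made so the example does not wipe your workspace). The variable names are illustrative.

```r
# A scratch environment stands in for the global environment here.
scratch <- new.env()
assign("x", 1, envir = scratch)

# Side effects a script might leave behind.
old_opts <- options(digits = 3)  # change a global option
library(stats4)                  # attach a package (stats4 ships with R)

# The equivalent of rm(list = ls()), applied to the scratch environment.
rm(list = ls(scratch), envir = scratch)

objects_left     <- ls(scratch)                      # the objects are gone
option_survived  <- getOption("digits") == 3         # the option is still changed
package_survived <- "package:stats4" %in% search()   # the package is still attached

options(old_opts)  # restore the option so the sketch cleans up after itself
```

Restarting R and rerunning the script from the top is the only way to be sure it does not depend on leftover state.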
  8. 34.

    When you’ve used the same
    • function
    • RMarkdown document
    • boilerplate Shiny code
    3x, write a package
  9. 43.

    The End
    • Code only for your machine: project-based workflow (tidyverse.org/articles/2017/12/workflow-vs-script)
    • Trying to find and coordinate versions: use Git (happygitwithr.com)
    • Reusing tedious work: write a package (r-pkgs.had.co.nz)
    • Reproducing environments: packrat/renv (environments.rstudio.com)
    github.com/akgold/2019-09-17_john_deere_webinar
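To make the "reproducing environments" point concrete, here is a small sketch. The base-R part records which package versions are in use; `record_versions()` is a hypothetical helper, not part of any package. The commented lines show the usual renv calls, which are left unrun here because they modify the project directory.

```r
# Hypothetical helper: look up the installed version of each named package.
record_versions <- function(pkgs) {
  data.frame(
    package = pkgs,
    version = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

versions <- record_versions(c("stats", "utils"))

# With renv (not run here, since these calls change the project):
# renv::init()      # set up a project-local library and lockfile
# renv::snapshot()  # record the package versions in use in renv.lock
# renv::restore()   # reinstall the versions recorded in renv.lock
```

Committing the lockfile alongside the code lets a colleague, or future you, restore the same package versions.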