2019-09-17 Creating Reproducible Data Science

Alex Gold
September 17, 2019

A webinar for the John Deere company on creating reproducible data science; code at github.com/akgold/2019-09-17_john_deere_webinar.

Transcript

  1. Why care about reproducibility? Or portability? Sharing! • With colleagues • With future you • They don’t answer ☎
  2. Things that are tedious and you’ll need again: “Well, I had to set up these variables like this…” model <- glm(outcome ~ input, family = Gamma, data = my_dat, weights = wgt, subset = my_set, na.action = na.exclude, offset = 7, model = TRUE)
  3. Environmental factors that will break: “Oh, so you just need to get this system dependency configured…” “Right, I used version 0.7.6 of the package, not 0.7.8…” “Oh yeah, this kinda doesn’t work on Windows…”
  4. If the first line of your R script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your office and SET YOUR COMPUTER ON FIRE
  5. If the first line of your R script is rm(list = ls()), I will come into your office and SET YOUR COMPUTER ON FIRE
  6. When you’ve used the same • function • R Markdown document • boilerplate Shiny code 3x, write a package
  7. The End • Code only for your machine: project-based workflow (tidyverse.org/articles/2017/12/workflow-vs-script) • Trying to find and coordinate versions: use Git (happygitwithr.com) • Reusing tedious work: write a package (r-pkgs.had.co.nz) • Reproducing environments: packrat/renv (environments.rstudio.com) • Code for this webinar: github.com/akgold/2019-09-17_john_deere_webinar
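
The advice in slides 2 and 6 (capture tedious setup once, then reuse it) can be sketched in a few lines of R. This is only an illustration, not code from the webinar repository: the function name `fit_gamma_model`, the log link, and the simulated `example_dat` are all assumptions made for the example, and a real package would add documentation and tests around such a function.

```r
# Hypothetical helper: keeps the fiddly glm() settings in one place so they
# do not have to be remembered or retyped. The name and defaults are
# assumptions for this sketch.
fit_gamma_model <- function(data, formula = outcome ~ input) {
  stats::glm(
    formula,
    family = stats::Gamma(link = "log"),  # assumed link; adjust as needed
    data = data,
    na.action = stats::na.exclude         # keep NA rows aligned in residuals
  )
}

# Simulated data, purely for illustration.
set.seed(42)
example_dat <- data.frame(input = runif(50))
example_dat$outcome <- rgamma(
  50,
  shape = 2,
  rate = 2 / exp(1 + example_dat$input)   # mean = exp(1 + input)
)

model <- fit_gamma_model(example_dat)
coef(model)
```

Once the same helper has been copied into a third script, moving it into a package, as slide 6 suggests, means every analysis picks up the same settings.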