Slide 1

Slide 1 text

Creating Reproducible Data Science
Alex K. Gold, Solutions Engineer
@alexkgold
github.com/akgold/2019-09-17_john_deere_webinar

Slide 2

Slide 2 text

Why care about reproducibility? Or portability?

Slide 3

Slide 3 text

Why care about reproducibility? Or portability? Sharing! • With Colleagues • With Future You (they don’t answer ☎)

Slide 4

Slide 4 text

How much reproducibility is enough? [Figure: a scale running from "Not at all" through "Somewhat" to "Fully" reproducible; more reproducible means more work.]

Slide 5

Slide 5 text

A Taxonomy Of Irreproducibility

Slide 6

Slide 6 text

Code that won’t run on someone else’s machine c://Documents/ /usr/home/agold ~/agold

Slide 7

Slide 7 text

Difficulty Finding Latest Version ninja_analysis_final2_alex_final.Rmd “Well, the version I have…” “Let me just send you the newest…”

Slide 8

Slide 8 text

Things that are tedious and you’ll need again “Well, I had to set up these variables like this…” model <- glm(outcome ~ input, family = Gamma, data = my_dat, weights = wgt, subset = my_set, na.action = na.exclude, offset = 7, model = TRUE)

Slide 9

Slide 9 text

Environmental factors that will break “Oh, so you just need to get this system dependency configured…” “Right, I used version 0.7.6 of the package, not 0.7.8…” “Oh yeah, this kinda doesn’t work on Windows…”

Slide 10

Slide 10 text

Solutions

Slide 11

Slide 11 text

Code that breaks on someone else’s machine

Slide 12

Slide 12 text

Code that breaks on someone else’s machine Project-based Workflow

Slide 13

Slide 13 text

If the first line of your R script is setwd("C:\Users\jenny\path\that\only\I\have") I will come into your office and SET YOUR COMPUTER ON FIRE

Slide 14

Slide 14 text

If the first line of your R script is rm(list = ls()) I will come into your office and SET YOUR COMPUTER ON FIRE
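One reason to avoid rm(list = ls()) as a "clean slate": it removes objects from the workspace but leaves attached packages and changed options in place, so the session is not actually fresh. A small base-R sketch, where the option, package, and object name are arbitrary choices for illustration:

```r
options(digits = 3)   # a global option changed earlier in the session
library(tools)        # a package attached earlier in the session
scratch <- 1:10       # an object created earlier in the session

rm(list = ls())       # the "clean slate" line from the slide

# The object is gone, but the rest of the session state is untouched.
exists("scratch")                # FALSE
getOption("digits")              # still 3
"package:tools" %in% search()    # still TRUE
```

Restarting the R session is what resets all three.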

Slide 15

Slide 15 text

Avoiding Computer fires Project-based Workflow and here::here
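A minimal sketch of what here::here replaces setwd() with, assuming the here package is installed and the script lives inside a project. The data/survey.csv file is hypothetical, and the fallback branch is only there so the sketch runs without the package:

```r
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() builds the path from the project root, wherever that
  # root happens to live on the current machine.
  data_path <- here::here("data", "survey.csv")
} else {
  # Without the package, fall back to a path relative to the working
  # directory, which an RStudio project sets to the project root.
  data_path <- file.path("data", "survey.csv")
}

data_path
# survey <- read.csv(data_path)   # uncomment once the file exists
```

Either way, no path in the script mentions a specific user's home directory.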

Slide 16

Slide 16 text

Avoiding Computer fires Demo!

Slide 17

Slide 17 text

Code Structure

Slide 18

Slide 18 text

Project-Based Workflow Learn More tidyverse.org/articles/2017/12/workflow-vs-script

Slide 19

Slide 19 text

Difficulty Finding Latest Version

Slide 20

Slide 20 text

Difficulty Finding Latest Version Version Control

Slide 21

Slide 21 text

Version Control

Slide 22

Slide 22 text

Version Control: Why would I use it? 1. Distinct versions. 2. Remote backups. 3. Collaborating on code.

Slide 23

Slide 23 text

Version Control A few terms. (Sorry) Origin

Slide 24

Slide 24 text

Cute dog break.

Slide 25

Slide 25 text

Version Control Branching Master “Feature”

Slide 26

Slide 26 text

Git vs GitHub

Slide 27

Slide 27 text

Where can I git? In RStudio On GitHub Command Line

Slide 28

Slide 28 text

Version Control Demo!

Slide 29

Slide 29 text

Version Control Command Review

Slide 30

Slide 30 text

Version Control Branching master fix_model git checkout -b fix_model git merge
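The branch-and-merge cycle can be walked through end to end in a throwaway repository. This is a sketch, not the demo from the talk: it drives command-line git from R with system2() so the example stays in R, it needs git on the PATH, and the run_git() helper, file name, and commit messages are all made up for illustration.

```r
# Hypothetical helper: run one git command inside `repo`, stop on failure.
run_git <- function(repo, ...) {
  out <- system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
  status <- attr(out, "status")
  if (!is.null(status) && status != 0) stop(paste(out, collapse = "\n"))
  out
}

merged_script <- NULL
if (nzchar(Sys.which("git"))) {
  repo <- tempfile("demo_repo_")
  dir.create(repo)
  script <- file.path(repo, "model.R")

  run_git(repo, "init")
  run_git(repo, "config", "user.name", shQuote("Demo User"))
  run_git(repo, "config", "user.email", "demo@example.com")
  run_git(repo, "config", "commit.gpgsign", "false")

  # First commit on the default branch (master on the slides).
  writeLines("model <- lm(y ~ x, data = d)", script)
  run_git(repo, "add", "model.R")
  run_git(repo, "commit", "-m", shQuote("Add model script"))
  default_branch <- utils::tail(run_git(repo, "rev-parse", "--abbrev-ref", "HEAD"), 1)

  # Create and switch to the feature branch, then commit the fix there.
  run_git(repo, "checkout", "-b", "fix_model")
  writeLines("model <- glm(y ~ x, data = d)", script)
  run_git(repo, "commit", "-a", "-m", shQuote("Switch to glm"))

  # Go back to the default branch and merge the fix in.
  run_git(repo, "checkout", default_branch)
  run_git(repo, "merge", "fix_model")

  merged_script <- readLines(script)
}
```

After the merge, the default branch contains the change that was committed on fix_model.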

Slide 31

Slide 31 text

Version Control Learn More www.happygitwithr.com

Slide 32

Slide 32 text

Things that are tedious and you’ll need again

Slide 33

Slide 33 text

Things that are tedious and you’ll need again R Tools

Slide 34

Slide 34 text

When you’ve used the same • function • R Markdown document • boilerplate Shiny code 3x, write a package
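The step before a package is usually a function. As a sketch, a tedious glm() call like the one on slide 8 could be wrapped once and reused. The helper name fit_outcome_model(), the log link, and the simulated data are assumptions made for illustration, not the talk's code:

```r
# Hypothetical helper: the modelling choices are written down once here
# instead of being retyped in every analysis script.
fit_outcome_model <- function(data) {
  stats::glm(
    outcome ~ input,
    family = stats::Gamma(link = "log"),
    data = data,
    na.action = stats::na.exclude
  )
}

# Made-up data, only to show the helper being reused.
set.seed(42)
my_dat <- data.frame(input = runif(50))
my_dat$outcome <- exp(0.5 + my_dat$input) * rgamma(50, shape = 5, rate = 5)
my_dat$outcome[3] <- NA   # na.exclude keeps residuals aligned with the rows

model <- fit_outcome_model(my_dat)
coef(model)
```

Once a few helpers like this exist, collecting them in a package makes them installable by colleagues rather than copied between scripts.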

Slide 35

Slide 35 text

Code Snippets, Functions, And Templates Demo!

Slide 36

Slide 36 text

R Packages r-pkgs.had.co.nz

Slide 37

Slide 37 text

Environmental factors that will break

Slide 38

Slide 38 text

Environmental factors that will break Controlling Environments

Slide 39

Slide 39 text

Why would my environment break?

Slide 40

Slide 40 text

Reproducing Environments (Advanced Reproducibility) Save package state using packrat/renv
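To make "package state" concrete, here is a deliberately simplified base-R stand-in for the idea: record which package versions an analysis ran with, then check later whether the installed versions have drifted. It is not how packrat or renv are implemented (renv's snapshot() and restore() handle this properly, with a lockfile and a per-project library), and both function names are hypothetical.

```r
# Record the installed version of each named package.
record_versions <- function(pkgs) {
  vapply(pkgs, function(p) as.character(utils::packageVersion(p)), character(1))
}

# Return the names of packages whose installed version differs from the record.
changed_packages <- function(recorded) {
  current <- record_versions(names(recorded))
  names(recorded)[unname(current != recorded)]
}

recorded <- record_versions(c("stats", "utils"))  # base packages, always installed
changed_packages(recorded)                        # character(0): nothing has drifted

# Pretend the analysis was originally run against a different version of stats.
stale <- recorded
stale["stats"] <- "0.7.6"
changed_packages(stale)                           # "stats"
```

A check like this only detects drift; restoring the recorded versions is the part that packrat and renv automate.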

Slide 41

Slide 41 text

Reproducing Environments (Advanced Reproducibility)

Slide 42

Slide 42 text

Reproducing Environments Learn More environments.rstudio.com

Slide 43

Slide 43 text

The End
• Code only for your machine: Project-based workflow (tidyverse.org/articles/2017/12/workflow-vs-script)
• Trying to find and coordinate versions: Use Git (happygitwithr.com)
• Reusing tedious work: Write a package (r-pkgs.had.co.nz)
• Reproducing environments: packrat/renv (environments.rstudio.com)
github.com/akgold/2019-09-17_john_deere_webinar