workflow: you should have one

Slide 1

Slide 1 text

Jennifer Bryan
RStudio, University of British Columbia
@JennyBryan @jennybc
bit.ly/jenny-earl
Go here for useful links to stuff mentioned in this talk!!

Slide 2

Slide 2 text

workflow you should have one

Slide 3

Slide 3 text

- decision fatigue
- unique and special ❆❄❅
- predictability
- proficiency
- access to help

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Here’s my highly polished blog post about deep learning. Here’s how I organized the files and wrangled the data.

Slide 6

Slide 6 text

Import → Tidy → Transform → Visualise → Model → Communicate

Slide 7

Slide 7 text

Import → Tidy → Transform → Visualise → Model → Communicate

Slide 8

Slide 8 text

Everything that exists in R is an object. Everything that happens in R is a function call. Interfaces to other software are part of R. — John Chambers

Slide 9

Slide 9 text

Import → Tidy → Transform → Visualise → Model → Communicate

Slide 10

Slide 10 text

Import → Tidy → Transform → Visualise → Model → Communicate

Slide 11

Slide 11 text

http://readxl.tidyverse.org readxl www.rstudio.com
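As a rough illustration of the Import step with readxl, the sketch below reads one worksheet from an example workbook that ships with the package. The choice of the "iris" sheet and the object names are only for illustration and are not taken from the talk.

```r
library(readxl)

# readxl ships a few example workbooks; readxl_example() returns the path to one.
path <- readxl_example("datasets.xlsx")

# List the worksheets in the workbook, then read one of them into a tibble.
sheets  <- excel_sheets(path)
iris_df <- read_excel(path, sheet = "iris")

head(iris_df)
```

No Java, Perl, or Excel installation is needed for this; readxl reads .xls and .xlsx files directly.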

Slide 12

Slide 12 text

http://googledrive.tidyverse.org

Slide 13

Slide 13 text

googlesheets + googledrive = googlesheets4

Slide 14

Slide 14 text

What is your development environment? How do you organize a project? How do you manage a project over time? What about collaboration?

Slide 15

Slide 15 text

What is your default data receptacle? How do you manipulate data? How do you iterate?

Slide 16

Slide 16 text

http://stat545.com

Slide 17

Slide 17 text

Good enough practices in scientific computing
Wilson, Bryan, Cranston, Kitzes, Nederbragt, Teal
https://doi.org/10.1371/journal.pcbi.1005510
http://bit.ly/good-enuff

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Excuse me, do you have a moment to talk about version control? https://doi.org/10.7287/peerj.preprints.3159v2

Slide 20

Slide 20 text

happygitwithr.com

Slide 21

Slide 21 text

http://reprex.tidyverse.org

Slide 22

Slide 22 text

workflow example #1

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

One folder per project

That folder is an
• RStudio Project (package? website? whatever)
• Git repo, with associated GitHub remote

Work on multiple projects at once w/ multiple instances of RStudio
• Each gets own child R process
• R & file browser have sane working directory

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

If the first line of your #rstats script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your lab and SET YOUR COMPUTER ON FIRE.
Mash-up of rage tweets by @jennybc and @tpoi.

Slide 28

Slide 28 text

Use here package to build paths within a Project

Paths are robust to different working directories within the Project
• Render .R and .Rmd that live in sub-folders!
• Write paths in tests and vignettes w/o fear!

here wraps the more powerful rprojroot package

Slide 29

Slide 29 text

library(here)
#> here() starts at /here-demo

system("tree")
#> .
#> └── one
#>     └── two
#>         └── awesome.txt

here("one", "two", "awesome.txt")
#> [1] "/here-demo/one/two/awesome.txt"
cat(readLines(here("one", "two", "awesome.txt")))
#> OMG this is so awesome!

setwd(here("one"))
getwd()
#> [1] "/here-demo/one"
here("one", "two", "awesome.txt")
#> [1] "/here-demo/one/two/awesome.txt"
cat(readLines(here("one", "two", "awesome.txt")))
#> OMG this is so awesome!

Slide 30

Slide 30 text

workflow example #2

Slide 31

Slide 31 text

list-columns EmbRAce tHe aWkwArd

Slide 32

Slide 32 text

#rstats lists via lego

Slide 33

Slide 33 text

purrr::map(.x, .f, ...)

Slide 34

Slide 34 text

map(.x, .f, ...)
for every element of .x apply .f
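A minimal sketch of "for every element of .x apply .f", using a small made-up list rather than anything from the talk:

```r
library(purrr)

# A small list of numeric vectors (made-up data for illustration).
x <- list(a = 1:3, b = c(10, 20), c = 5)

# map() always returns a list with one element per element of .x.
lens <- map(x, length)

# Typed variants such as map_dbl() return an atomic vector instead of a list.
means <- map_dbl(x, mean)
```

The names of `x` carry over to the result, so `lens$a` and `means[["b"]]` both work.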

Slide 35

Slide 35 text

.x = minis

Slide 36

Slide 36 text

map(minis, antennate)

Slide 37

Slide 37 text

.x = minis

Slide 38

Slide 38 text

map(minis, "pants")
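When .f is a string, purrr treats it as shorthand for "extract the element with this name". The sketch below uses a hypothetical stand-in for `minis` (a list of named lists), since the slides show the idea with Lego figures rather than data:

```r
library(purrr)

# Hypothetical stand-in for `minis`: a list of named lists.
minis <- list(
  list(name = "pirate", pants = "striped"),
  list(name = "knight", pants = "armored"),
  list(name = "chef",   pants = "checked")
)

# A string for .f extracts the element called "pants" from each mini.
pants <- map(minis, "pants")

# map_chr() does the same extraction but returns a character vector.
pants_chr <- map_chr(minis, "pants")
```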

Slide 39

Slide 39 text

.x = minis
.y = hair

Slide 40

Slide 40 text

map2(minis, hair, enhair)

Slide 41

Slide 41 text

.x = minis
.y = weapons

Slide 42

Slide 42 text

map2(minis, weapons, arm)

Slide 43

Slide 43 text

minis %>% map2(hair, enhair) %>% map2(weapons, arm)
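To make the chained map2() calls concrete, here is a sketch with invented data. The helpers `enhair()` and `arm()` are hypothetical implementations written for this example; the talk shows them only as pictures.

```r
library(purrr)

# Made-up inputs: two minis, plus one hair colour and one weapon for each.
minis   <- list(list(name = "pirate"), list(name = "knight"))
hair    <- c("red", "blond")
weapons <- c("cutlass", "sword")

# Hypothetical helpers: each returns the mini with one extra field attached.
enhair <- function(mini, h) { mini$hair <- h; mini }
arm    <- function(mini, w) { mini$weapon <- w; mini }

# map2() walks .x and .y in parallel; piping feeds each result into the next step.
equipped <- minis %>%
  map2(hair, enhair) %>%
  map2(weapons, arm)
```

Each step returns a list of the same length as `minis`, which is what makes the steps composable.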

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

this is a data frame!
atomic vector
list column
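A small sketch of a data frame that mixes an ordinary atomic column with a list-column. The column names and values are invented for illustration:

```r
library(tibble)

df <- tibble(
  name = c("pirate", "knight"),               # atomic (character) vector
  gear = list(c("hat", "cutlass"), "sword")   # list-column: one list element per row
)

# Each row's entry in the list-column can have a different length or type.
lengths(df$gear)
```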

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

data frame
nested data frame

Slide 48

Slide 48 text

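gap_nested <- gapminder %>%
  group_by(country) %>%
  nest()

gap_nested
#> # A tibble: 142 × 2
#>        country              data
#>         <fctr>            <list>
#> 1  Afghanistan <tibble [12 × 5]>
#> 2      Albania <tibble [12 × 5]>
#> 3      Algeria <tibble [12 × 5]>
#> 4       Angola <tibble [12 × 5]>
#> 5    Argentina <tibble [12 × 5]>
#> 6    Australia <tibble [12 × 5]>
#> 7      Austria <tibble [12 × 5]>
#> 8      Bahrain <tibble [12 × 5]>
#> 9   Bangladesh <tibble [12 × 5]>
#> 10     Belgium <tibble [12 × 5]>
#> # ... with 132 more rows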

Slide 49

Slide 49 text

gap_fits <- gap_nested %>%
  mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))

gap_fits %>% tail(3)
#> # A tibble: 3 × 3
#>       country              data      fit
#>        <fctr>            <list>   <list>
#> 1 Yemen, Rep. <tibble [12 × 5]> <S3: lm>
#> 2      Zambia <tibble [12 × 5]> <S3: lm>
#> 3    Zimbabwe <tibble [12 × 5]> <S3: lm>

canada <- which(gap_fits$country == "Canada")
summary(gap_fits$fit[[canada]])
#> . . .
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3.583e+02  8.252e+00  -43.42 1.01e-12 ***
#> year         2.189e-01  4.169e-03   52.50 1.52e-13 ***
#> . . .
#> Residual standard error: 0.2492 on 10 degrees of freedom
#> Multiple R-squared: 0.9964, Adjusted R-squared: 0.996
#> F-statistic: 2757 on 1 and 10 DF, p-value: 1.521e-1

Slide 50

Slide 50 text

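gap_fits %>%
  mutate(rsq = map_dbl(fit, ~ summary(.x)[["r.squared"]])) %>%
  arrange(rsq)
#> # A tibble: 142 × 4
#>             country              data      fit        rsq
#>              <fctr>            <list>   <list>      <dbl>
#> 1            Rwanda <tibble [12 × 5]> <S3: lm> 0.01715964
#> 2          Botswana <tibble [12 × 5]> <S3: lm> 0.03402340
#> 3          Zimbabwe <tibble [12 × 5]> <S3: lm> 0.05623196
#> 4            Zambia <tibble [12 × 5]> <S3: lm> 0.05983644
#> 5         Swaziland <tibble [12 × 5]> <S3: lm> 0.06821087
#> 6           Lesotho <tibble [12 × 5]> <S3: lm> 0.08485635
#> 7     Cote d'Ivoire <tibble [12 × 5]> <S3: lm> 0.28337240
#> 8      South Africa <tibble [12 × 5]> <S3: lm> 0.31246865
#> 9            Uganda <tibble [12 × 5]> <S3: lm> 0.34215382
#> 10 Congo, Dem. Rep. <tibble [12 × 5]> <S3: lm> 0.34820278
#> # ... with 132 more rows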

Slide 51

Slide 51 text

gap_fits %>%
  mutate(coef = map(fit, broom::tidy)) %>%
  unnest(coef)
#> # A tibble: 284 × 6
#>        country        term      estimate    std.error  statistic
#>         <fctr>       <chr>         <dbl>        <dbl>      <dbl>
#> 1  Afghanistan (Intercept)  -507.5342716 40.484161954 -12.536613
#> 2  Afghanistan        year     0.2753287  0.020450934  13.462890
#> 3      Albania (Intercept)  -594.0725110 65.655359062  -9.048348
#> 4      Albania        year     0.3346832  0.033166387  10.091036
#> 5      Algeria (Intercept) -1067.8590396 43.802200843 -24.379118
#> 6      Algeria        year     0.5692797  0.022127070  25.727749
#> 7       Angola (Intercept)  -376.5047531 46.583370599  -8.082385
#> 8       Angola        year     0.2093399  0.023532003   8.895964
#> 9    Argentina (Intercept)  -389.6063445  9.677729641 -40.258031
#> 10   Argentina        year     0.2317084  0.004888791  47.395847
#> # ... with 274 more rows, and 1 more variables: p.value <dbl>

Slide 52

Slide 52 text

How do you do such things today?

maybe you don’t, because it’s too painful
for loops
apply(), [slvmt]apply(), split(), by()
with plyr: [adl][adl_]ply()
with dplyr: df %>% group_by() %>% do()
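For comparison with the nested data frame approach, here is a sketch of two of the base R options listed above (a for loop, and split() plus lapply()) applied to "fit one model per group". It uses the built-in mtcars data rather than gapminder so that it runs without extra packages. The model mpg ~ wt is an arbitrary choice for illustration.

```r
# Split the data frame into a named list of data frames, one per cylinder count.
by_cyl <- split(mtcars, mtcars$cyl)

# Option 1: a for loop that fills a preallocated, named list of models.
fits_loop <- vector("list", length(by_cyl))
names(fits_loop) <- names(by_cyl)
for (k in names(by_cyl)) {
  fits_loop[[k]] <- lm(mpg ~ wt, data = by_cyl[[k]])
}

# Option 2: lapply() over the pieces gives the same models with less bookkeeping.
fits_apply <- lapply(by_cyl, function(d) lm(mpg ~ wt, data = d))

# Pull one number (the slope on wt) out of each model.
slopes <- vapply(fits_apply, function(m) coef(m)[["wt"]], numeric(1))
```

Both options work, but the models end up in a list that is detached from the grouping variable and from the data they were fit to. Keeping them side by side in one data frame, as a list-column does, avoids that bookkeeping.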