workflow: you should have one

Keynote talk at EARL London 2017 on the value of developing an intentional workflow. Find links to all the goodies here: https://github.com/jennybc/earl-london-2017-bryan#readme

Jennifer (Jenny) Bryan

September 13, 2017

Transcript

  1. Jennifer Bryan, RStudio, University of British Columbia. @JennyBryan @jennybc bit.ly/jenny-earl Go here for useful links to stuff mentioned in this talk!!
  2. workflow you should have one

  3. ‑decision fatigue ‑unique and special ❆❄❅ ‐ predictability ‐ proficiency ‐ access to help
  4. None
  5. Here’s my highly polished blog post about deep learning. Here’s how I organized the files and wrangled the data.
  6. Import Tidy Communicate Transform Visualise Model

  7. Import Tidy Communicate Transform Visualise Model

  8. Everything that exists in R is an object. Everything that happens in R is a function call. Interfaces to other software are part of R. (John Chambers)
  9. Import Tidy Communicate Transform Visualise Model

  10. Import Tidy Communicate Transform Visualise Model

  11. http://readxl.tidyverse.org readxl www.rstudio.com

  12. http://googledrive.tidyverse.org

  13. googlesheets + googledrive = googlesheets4

  14. What is your development environment? How do you organize a project? How do you manage a project over time? What about collaboration?
  15. What is your default data receptacle? How do you manipulate data? How do you iterate?
  16. http://stat545.com

  17. Good enough practices in scientific computing. Wilson, Bryan, Cranston, Kitzes, Nederbragt, Teal. https://doi.org/10.1371/journal.pcbi.1005510 http://bit.ly/good-enuff
  18. None
  19. Excuse me, do you have a moment to talk about version control? https://doi.org/10.7287/peerj.preprints.3159v2
  20. happygitwithr.com

  21. http://reprex.tidyverse.org

  22. workflow example #1

  23. None
  24. None
  25. One folder per project. That folder is an
    • RStudio Project (package? website? whatever)
    • Git repo, with associated GitHub remote
    Work on multiple projects at once w/ multiple instances of RStudio
    • Each gets own child R process
    • R & file browser have sane working directory
  26. None
  27. If the first line of your #rstats script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your lab and SET YOUR COMPUTER ON FIRE. (Mash-up of rage tweets by @jennybc and @tpoi.)
  28. Use here package to build paths within a Project. Paths are robust to different working directories within the Project
    • Render .R and .Rmd that live in sub-folders!
    • Write paths in tests and vignettes w/o fear!
    here wraps the more powerful rprojroot package
  29. library(here)
 #> here() starts at <snip, snip>/here-demo
 system("tree")
 #> .
 #> └── one
 #>     └── two
 #>         └── awesome.txt
 here("one", "two", "awesome.txt")
 #> [1] "<snip, snip>/here-demo/one/two/awesome.txt"
 cat(readLines(here("one", "two", "awesome.txt")))
 #> OMG this is so awesome!
 setwd(here("one"))
 getwd()
 #> [1] "<snip, snip>/here-demo/one"
 here("one", "two", "awesome.txt")
 #> [1] "<snip, snip>/here-demo/one/two/awesome.txt"
 cat(readLines(here("one", "two", "awesome.txt")))
 #> OMG this is so awesome!
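The reason the demo above survives a setwd() is that paths are built from a project root, not from the current working directory. A minimal base-R sketch of that idea follows. It is a simplified stand-in, not how here or rprojroot are actually implemented: `find_root()` is a hypothetical helper, the `here-sketch` folder is a throwaway created in `tempdir()`, and only a `.here` marker file is checked.

```r
# Hypothetical helper (not part of here or rprojroot): walk up from `start`
# until a directory containing the file `marker` is found.
find_root <- function(start, marker) {
  dir <- normalizePath(start, winslash = "/")
  repeat {
    if (file.exists(file.path(dir, marker))) return(dir)
    parent <- dirname(dir)
    if (identical(parent, dir)) stop("no project root found")
    dir <- parent
  }
}

# Build a throwaway project: <tmp>/here-sketch/one/two/awesome.txt
root <- file.path(tempdir(), "here-sketch")
dir.create(file.path(root, "one", "two"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, ".here"))  # marker file at the project root
writeLines("OMG this is so awesome!",
           file.path(root, "one", "two", "awesome.txt"))

# The root is the same whether we start at the top or deep in a sub-folder
from_root <- find_root(root, ".here")
from_sub  <- find_root(file.path(root, "one", "two"), ".here")
identical(from_root, from_sub)

# So a path built from the root works from anywhere inside the project
awesome <- readLines(file.path(from_sub, "one", "two", "awesome.txt"))
awesome
```

In real projects the here package does this lookup for you, so `here("one", "two", "awesome.txt")` is all a script needs.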
  30. workflow example #2

  31. list-columns EmbRAce tHe aWkwArd

  32. #rstats lists via lego

  33. purrr::map(.x, .f, ...)

  34. map(.x, .f, ...) for every element of .x apply .f
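"For every element of .x apply .f" can be tried on a tiny list. This is a sketch on made-up data (the list `x` is an assumption, not from the talk), and it shows that `map()` always returns a list, just like base `lapply()`, while the typed variant `map_dbl()` returns a numeric vector.

```r
library(purrr)

# Made-up input: a named list of two integer vectors
x <- list(a = 1:3, b = 4:6)

# map() applies mean() to each element and always returns a list
means_list <- map(x, mean)

# map_dbl() does the same but returns a double vector (or errors if it can't)
means_dbl <- map_dbl(x, mean)

means_list
means_dbl
```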

  35. .x = minis

  36. map(minis, antennate)

  37. .x = minis

  38. map(minis, "pants")
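When `.f` is a string, as in `map(minis, "pants")`, purrr extracts the element with that name from each item of `.x`. A small sketch, assuming `minis` is a list of named lists (the names and values here are invented stand-ins for the lego figures on the slides):

```r
library(purrr)

# Invented stand-in for the lego minis: each mini is a named list
minis <- list(
  list(name = "pilot", pants = "blue"),
  list(name = "chef",  pants = "white")
)

# A string as .f is shorthand for "pull out the element with this name"
pants_list <- map(minis, "pants")

# The same extraction written out as an explicit function
pants_long <- map(minis, function(m) m[["pants"]])

# Typed variant: returns a character vector instead of a list
pants_chr <- map_chr(minis, "pants")
pants_chr
```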

  39. .y = hair .x = minis

  40. map2(minis, hair, enhair)

  41. .y = weapons .x = minis

  42. map2(minis, weapons, arm)

  43. minis %>% map2(hair, enhair) %>% map2(weapons, arm)
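The piped `map2()` calls above walk two inputs in parallel: the i-th mini is paired with the i-th hair piece, then with the i-th weapon. Here is a runnable sketch of that pipeline. The talk does not define `enhair()` or `arm()`, so the versions below are hypothetical, as is all the data.

```r
library(purrr)

# Invented data: two minis, plus one hair colour and one weapon for each
minis   <- list(list(name = "pilot"), list(name = "chef"))
hair    <- c("red", "black")
weapons <- c("sword", "whisk")

# Hypothetical helpers: each takes one mini plus one accessory
enhair <- function(mini, colour) { mini$hair   <- colour; mini }
arm    <- function(mini, weapon) { mini$weapon <- weapon; mini }

# map2() calls .f(.x[[i]], .y[[i]]) for each i; the pipe feeds the
# result of the first map2() in as .x of the second
kitted <- minis %>%
  map2(hair, enhair) %>%
  map2(weapons, arm)

str(kitted)
```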

  44. None
  45. this is a data frame! atomic vector list column

  46. None
  47. data frame nested data frame

  48. gap_nested <- gapminder %>%
 group_by(country) %>%
 nest()
 gap_nested
 #> # A tibble: 142 × 2
 #> country data
 #> <fctr> <list>
 #> 1 Afghanistan <tibble [12 × 5]>
 #> 2 Albania <tibble [12 × 5]>
 #> 3 Algeria <tibble [12 × 5]>
 #> 4 Angola <tibble [12 × 5]>
 #> 5 Argentina <tibble [12 × 5]>
 #> 6 Australia <tibble [12 × 5]>
 #> 7 Austria <tibble [12 × 5]>
 #> 8 Bahrain <tibble [12 × 5]>
 #> 9 Bangladesh <tibble [12 × 5]>
 #> 10 Belgium <tibble [12 × 5]>
 #> # ... with 132 more rows
  49. gap_fits <- gap_nested %>%
 mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
 
 gap_fits %>% tail(3)
 #> # A tibble: 3 × 3
 #> country data fit
 #> <fctr> <list> <list>
 #> 1 Yemen, Rep. <tibble [12 × 5]> <S3: lm>
 #> 2 Zambia <tibble [12 × 5]> <S3: lm>
 #> 3 Zimbabwe <tibble [12 × 5]> <S3: lm>
 canada <- which(gap_fits$country == "Canada")
 summary(gap_fits$fit[[canada]])
 #> . . .
 #> Coefficients:
 #> Estimate Std. Error t value Pr(>|t|) 
 #> (Intercept) -3.583e+02 8.252e+00 -43.42 1.01e-12 ***
 #> year 2.189e-01 4.169e-03 52.50 1.52e-13 ***
 #> . . . 
 #> Residual standard error: 0.2492 on 10 degrees of freedom
 #> Multiple R-squared: 0.9964, Adjusted R-squared: 0.996 
 #> F-statistic: 2757 on 1 and 10 DF, p-value: 1.521e-1
  50. gap_fits %>%
 mutate(rsq = map_dbl(fit, ~ summary(.x)[["r.squared"]])) %>%
 arrange(rsq)
 #> # A tibble: 142 × 4
 #> country data fit rsq
 #> <fctr> <list> <list> <dbl>
 #> 1 Rwanda <tibble [12 × 5]> <S3: lm> 0.01715964
 #> 2 Botswana <tibble [12 × 5]> <S3: lm> 0.03402340
 #> 3 Zimbabwe <tibble [12 × 5]> <S3: lm> 0.05623196
 #> 4 Zambia <tibble [12 × 5]> <S3: lm> 0.05983644
 #> 5 Swaziland <tibble [12 × 5]> <S3: lm> 0.06821087
 #> 6 Lesotho <tibble [12 × 5]> <S3: lm> 0.08485635
 #> 7 Cote d'Ivoire <tibble [12 × 5]> <S3: lm> 0.28337240
 #> 8 South Africa <tibble [12 × 5]> <S3: lm> 0.31246865
 #> 9 Uganda <tibble [12 × 5]> <S3: lm> 0.34215382
 #> 10 Congo, Dem. Rep. <tibble [12 × 5]> <S3: lm> 0.34820278
 #> # ... with 132 more rows
  51. gap_fits %>%
 mutate(coef = map(fit, broom::tidy)) %>%
 unnest(coef)
 #> # A tibble: 284 × 6
 #> country term estimate std.error statistic
 #> <fctr> <chr> <dbl> <dbl> <dbl>
 #> 1 Afghanistan (Intercept) -507.5342716 40.484161954 -12.536613
 #> 2 Afghanistan year 0.2753287 0.020450934 13.462890
 #> 3 Albania (Intercept) -594.0725110 65.655359062 -9.048348
 #> 4 Albania year 0.3346832 0.033166387 10.091036
 #> 5 Algeria (Intercept) -1067.8590396 43.802200843 -24.379118
 #> 6 Algeria year 0.5692797 0.022127070 25.727749
 #> 7 Angola (Intercept) -376.5047531 46.583370599 -8.082385
 #> 8 Angola year 0.2093399 0.023532003 8.895964
 #> 9 Argentina (Intercept) -389.6063445 9.677729641 -40.258031
 #> 10 Argentina year 0.2317084 0.004888791 47.395847
 #> # ... with 274 more rows, and 1 more variables: p.value <dbl>
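The nest, fit, summarise pattern from the last few slides can be tried without installing gapminder or broom. The sketch below swaps in the built-in `mtcars` data and plain `coef()` in place of `broom::tidy()`. Those substitutions are assumptions made for the sake of a self-contained example, not what the talk used.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per cylinder count, with that group's rows stored in a list-column
fits <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # Fit one linear model per nested data frame
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Pull a single number out of each model with typed map variants
    rsq   = map_dbl(fit, ~ summary(.x)[["r.squared"]]),
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  ) %>%
  ungroup()

# Models, data, and summaries travel together in one data frame
fits %>%
  select(cyl, rsq, slope) %>%
  arrange(rsq)
```

Keeping the fitted models in a list-column next to their data means ordinary dplyr verbs such as `arrange()` and `filter()` can be used to find the interesting fits.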
  52. How do you do such things today?
    • maybe you don’t, because it’s too painful
    • for loops
    • apply(), [slvmt]apply(), split(), by()
    • with plyr: [adl][adl_]ply()
    • with dplyr: df %>% group_by() %>% do()
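For comparison, here is how two of the older approaches listed above look for a per-group model fit. This is a base-R sketch on the built-in `mtcars` data (an assumption; the talk uses gapminder), and both versions compute the same per-group slopes.

```r
# Split the data frame into one piece per cylinder count
by_cyl <- split(mtcars, mtcars$cyl)

# Approach 1: a for loop, with manual pre-allocation and naming
slopes_loop <- numeric(length(by_cyl))
names(slopes_loop) <- names(by_cyl)
for (i in seq_along(by_cyl)) {
  fit <- lm(mpg ~ wt, data = by_cyl[[i]])
  slopes_loop[i] <- coef(fit)[["wt"]]
}

# Approach 2: split() + sapply(), which handles the bookkeeping itself
slopes_sapply <- sapply(by_cyl, function(d) coef(lm(mpg ~ wt, data = d))[["wt"]])

slopes_loop
slopes_sapply
```

Both work, but the results are detached from the data that produced them, which is the bookkeeping the list-column approach avoids.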
  53. Many other worked examples here: https://jennybc.github.io/purrr-tutorial/

  54. @JennyBryan @jennybc   bit.ly/jenny-earl