workflow: you should have one

Keynote talk at EARL London 2017 on the value of developing an intentional workflow. Find links to all the goodies here: https://github.com/jennybc/earl-london-2017-bryan#readme

Jennifer (Jenny) Bryan

September 13, 2017

Transcript

  1. Jennifer Bryan
    RStudio, University of British Columbia
    @JennyBryan @jennybc
    bit.ly/jenny-earl
    Go here for useful links to stuff
    mentioned in this talk!!

  2. workflow
    you should have one

  3. - decision fatigue
    - unique and special ❆❄❅
    - predictability
    - proficiency
    - access to help

  4. (no text on this slide)

  5. Here’s my highly
    polished blog post
    about deep learning.
    Here’s how I
    organized the files and
    wrangled the data.

  6. Import
    Tidy
    Transform
    Visualise
    Model
    Communicate

  7. Import
    Tidy
    Transform
    Visualise
    Model
    Communicate

  8. Everything that exists in R is an object.
    Everything that happens in R is a function call.
    Interfaces to other software are part of R.
    — John Chambers
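
The first two sentences of the quote can be checked directly at the console. The following is a minimal base-R sketch; the variable names (`x`, `ops`, and so on) are illustrative and do not come from the talk.

```r
# Operators, indexing, and even control flow are ordinary function calls.
x <- c(10, 20, 30)

sum_infix  <- 1 + 2
sum_prefix <- `+`(1, 2)            # the same call, written in prefix form

second_infix  <- x[2]
second_prefix <- `[`(x, 2)         # subsetting is a call to the `[` function

branch <- `if`(length(x) > 2, "long", "short")  # `if` is a function too

# Functions are themselves objects: they can be stored in a list and passed around.
ops <- list(add = `+`, mul = `*`)
results <- sapply(ops, function(op) op(3, 4))

print(sum_prefix)
print(second_prefix)
print(branch)
print(results)
```

Writing the calls in prefix form is rarely useful in practice, but it shows why functions such as `purrr::map()` can accept operators and extractors as arguments.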

  9. Import
    Tidy
    Transform
    Visualise
    Model
    Communicate

  10. Import
    Tidy
    Transform
    Visualise
    Model
    Communicate

  11. http://readxl.tidyverse.org
    readxl
    www.rstudio.com

  12. http://googledrive.tidyverse.org

  13. googlesheets + googledrive = googlesheets4

  14. What is your development environment?
    How do you organize a project?
    How do you manage a project over time?
    What about collaboration?

  15. What is your default data receptacle?
    How do you manipulate data?
    How do you iterate?

  16. http://stat545.com

  17. Good enough practices in scientific computing
    Wilson, Bryan, Cranston, Kitzes, Nederbragt, Teal
    https://doi.org/10.1371/journal.pcbi.1005510
    http://bit.ly/good-enuff

  18. (no text on this slide)

  19. Excuse me, do you have a moment
    to talk about version control?
    https://doi.org/10.7287/peerj.preprints.3159v2

  20. happygitwithr.com

  21. http://reprex.tidyverse.org

  22. workflow
    example #1

  23. (no text on this slide)

  24. (no text on this slide)

  25. One folder per project
    That folder is an
    • RStudio Project (package? website? whatever)
    • Git repo, with associated GitHub remote
    Work on multiple projects at once w/ multiple
    instances of RStudio
    • Each gets own child R process
    • R & file browser have sane working directory

  26. (no text on this slide)

  27. If the first line of your #rstats script is
    setwd("C:\Users\jenny\path\that\only\I\have"),
    I will come into your lab and SET YOUR COMPUTER ON FIRE.
    — Mash-up of rage tweets by @jennybc and @tpoi.

  28. Use here package to build paths within a Project
    Paths are robust to different working directories
    within the Project
    • Render .R and .Rmd that live in sub-folders!
    • Write paths in tests and vignettes w/o fear!
    here wraps the more powerful rprojroot package

  29. library(here)
    #> here() starts at /here-demo
    system("tree")
    #> .
    #> └── one
    #>     └── two
    #>         └── awesome.txt
    here("one", "two", "awesome.txt")
    #> [1] "/here-demo/one/two/awesome.txt"
    cat(readLines(here("one", "two", "awesome.txt")))
    #> OMG this is so awesome!
    setwd(here("one"))
    getwd()
    #> [1] "/here-demo/one"
    here("one", "two", "awesome.txt")
    #> [1] "/here-demo/one/two/awesome.txt"
    cat(readLines(here("one", "two", "awesome.txt")))
    #> OMG this is so awesome!
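
Since here wraps rprojroot, the root-finding step can be sketched with rprojroot directly. This is a rough illustration, not the internals of here: the throwaway project under `tempdir()`, the folder name `here-sketch`, and the file contents are all assumptions made so the block runs anywhere.

```r
library(rprojroot)

# Build a throwaway project: a root folder marked by an .Rproj file,
# with a nested sub-folder that holds one text file.
root <- file.path(tempdir(), "here-sketch")   # hypothetical project location
dir.create(file.path(root, "one", "two"), recursive = TRUE, showWarnings = FALSE)
writeLines("Version: 1.0", file.path(root, "here-sketch.Rproj"))
writeLines("OMG this is so awesome!", file.path(root, "one", "two", "awesome.txt"))

# Start from deep inside the project and walk upwards until a directory
# containing an RStudio project file is found.
found <- find_root(is_rstudio_project, path = file.path(root, "one", "two"))

# Paths are then built relative to the discovered root, not to getwd().
target <- file.path(found, "one", "two", "awesome.txt")
print(readLines(target))
```

Because the search starts from a given directory and moves upwards, the same path is produced whether the code runs from the project root or from a sub-folder, which is the property the slide demonstrates with `setwd()`.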

  30. workflow
    example #2

  31. list-columns
    EmbRAce
    tHe
    aWkwArd

  32. #rstats lists via lego

  33. purrr::map(.x, .f, ...)

  34. map(.x, .f, ...)
    for every element of .x
    apply .f

  35. .x = minis

  36. map(minis, antennate)

  37. .x = minis

  38. map(minis, "pants")
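
The two forms of `map()` shown so far can be tried on a small list. In this sketch, `minis` and `antennate()` are hypothetical stand-ins for the lego figures on the slides; only `map()`, `map_chr()`, and `map_lgl()` are real purrr functions.

```r
library(purrr)

# Hypothetical stand-ins for the lego mini-figures: one named list per figure.
minis <- list(
  list(name = "pirate", pants = "striped"),
  list(name = "knight", pants = "armoured"),
  list(name = "wizard", pants = "robe")
)

# Hypothetical .f: returns the figure with an antenna added.
antennate <- function(mini) {
  mini$antenna <- TRUE
  mini
}

with_antennae <- map(minis, antennate)  # .f is a function: apply it to every element
pants <- map(minis, "pants")            # .f is a string: extract the element of that name
pants_chr <- map_chr(minis, "pants")    # typed variant returns a character vector

print(pants_chr)
```

`map()` always returns a list of the same length as `.x`; the typed variants such as `map_chr()` are useful when each result is a single value.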

  39. .y = hair
    .x = minis

  40. map2(minis, hair, enhair)

  41. .y = weapons
    .x = minis

  42. map2(minis, weapons, arm)

  43. minis %>%
    map2(hair, enhair) %>%
    map2(weapons, arm)
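
A runnable version of this pipeline might look like the sketch below. The data (`minis`, `hair`, `weapons`) and the helpers `enhair()` and `arm()` are hypothetical; they exist only to show how `map2()` walks two inputs in parallel.

```r
library(purrr)

# Hypothetical inputs: two figures, plus one hair colour and one weapon for each.
minis   <- list(list(name = "pirate"), list(name = "knight"))
hair    <- c("red", "black")
weapons <- c("cutlass", "sword")

# Hypothetical two-argument helpers: each takes a figure (.x) and an accessory (.y).
enhair <- function(mini, colour) {
  mini$hair <- colour
  mini
}
arm <- function(mini, weapon) {
  mini$weapon <- weapon
  mini
}

# map2() pairs the i-th element of .x with the i-th element of .y.
kitted <- minis %>%
  map2(hair, enhair) %>%
  map2(weapons, arm)

str(kitted)
```

The pipe passes the result of each step as `.x` to the next call, so every figure keeps its hair when the weapon is added.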

  44. (no text on this slide)

  45. this is a data frame!
    atomic vector
    list column
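
A data frame with both kinds of column can be built in a few lines. This is a small sketch with made-up column names and values, using `tibble()` because it accepts a list as a column without extra wrapping.

```r
library(tibble)

df <- tibble(
  id    = 1:3,                    # atomic vector column (integer)
  label = c("a", "b", "c"),       # atomic vector column (character)
  stuff = list(                   # list-column: one arbitrary object per row
    1:5,
    c("x", "y"),
    data.frame(a = 1:2)
  )
)

print(df)
print(sapply(df$stuff, class))
```

Each cell of `stuff` holds a whole object, which is what later lets one row carry a nested data frame or a fitted model.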

  46. (no text on this slide)

  47. data frame nested data frame

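  48. gap_nested <- gapminder %>%
      group_by(country) %>%
      nest()
    gap_nested
    #> # A tibble: 142 × 2
    #>        country              data
    #>         <fctr>            <list>
    #> 1  Afghanistan <tibble [12 × 5]>
    #> 2      Albania <tibble [12 × 5]>
    #> 3      Algeria <tibble [12 × 5]>
    #> 4       Angola <tibble [12 × 5]>
    #> 5    Argentina <tibble [12 × 5]>
    #> 6    Australia <tibble [12 × 5]>
    #> 7      Austria <tibble [12 × 5]>
    #> 8      Bahrain <tibble [12 × 5]>
    #> 9   Bangladesh <tibble [12 × 5]>
    #> 10     Belgium <tibble [12 × 5]>
    #> # ... with 132 more rows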

  49. gap_fits <- gap_nested %>%
      mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
    gap_fits %>% tail(3)
    #> # A tibble: 3 × 3
    #>       country              data      fit
    #>        <fctr>            <list>   <list>
    #> 1 Yemen, Rep. <tibble [12 × 5]> <S3: lm>
    #> 2      Zambia <tibble [12 × 5]> <S3: lm>
    #> 3    Zimbabwe <tibble [12 × 5]> <S3: lm>
    canada <- which(gap_fits$country == "Canada")
    summary(gap_fits$fit[[canada]])
    #> . . .
    #> Coefficients:
    #>               Estimate Std. Error t value Pr(>|t|)
    #> (Intercept) -3.583e+02  8.252e+00  -43.42 1.01e-12 ***
    #> year         2.189e-01  4.169e-03   52.50 1.52e-13 ***
    #> . . .
    #> Residual standard error: 0.2492 on 10 degrees of freedom
    #> Multiple R-squared: 0.9964, Adjusted R-squared: 0.996
    #> F-statistic: 2757 on 1 and 10 DF, p-value: 1.521e-1

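  50. gap_fits %>%
      mutate(rsq = map_dbl(fit, ~ summary(.x)[["r.squared"]])) %>%
      arrange(rsq)
    #> # A tibble: 142 × 4
    #>             country              data      fit        rsq
    #>              <fctr>            <list>   <list>      <dbl>
    #> 1            Rwanda <tibble [12 × 5]> <S3: lm> 0.01715964
    #> 2          Botswana <tibble [12 × 5]> <S3: lm> 0.03402340
    #> 3          Zimbabwe <tibble [12 × 5]> <S3: lm> 0.05623196
    #> 4            Zambia <tibble [12 × 5]> <S3: lm> 0.05983644
    #> 5         Swaziland <tibble [12 × 5]> <S3: lm> 0.06821087
    #> 6           Lesotho <tibble [12 × 5]> <S3: lm> 0.08485635
    #> 7     Cote d'Ivoire <tibble [12 × 5]> <S3: lm> 0.28337240
    #> 8      South Africa <tibble [12 × 5]> <S3: lm> 0.31246865
    #> 9            Uganda <tibble [12 × 5]> <S3: lm> 0.34215382
    #> 10 Congo, Dem. Rep. <tibble [12 × 5]> <S3: lm> 0.34820278
    #> # ... with 132 more rows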

  51. gap_fits %>%
      mutate(coef = map(fit, broom::tidy)) %>%
      unnest(coef)
    #> # A tibble: 284 × 6
    #>        country        term      estimate    std.error  statistic
    #>         <fctr>       <chr>         <dbl>        <dbl>      <dbl>
    #> 1  Afghanistan (Intercept)  -507.5342716 40.484161954 -12.536613
    #> 2  Afghanistan        year     0.2753287  0.020450934  13.462890
    #> 3      Albania (Intercept)  -594.0725110 65.655359062  -9.048348
    #> 4      Albania        year     0.3346832  0.033166387  10.091036
    #> 5      Algeria (Intercept) -1067.8590396 43.802200843 -24.379118
    #> 6      Algeria        year     0.5692797  0.022127070  25.727749
    #> 7       Angola (Intercept)  -376.5047531 46.583370599  -8.082385
    #> 8       Angola        year     0.2093399  0.023532003   8.895964
    #> 9    Argentina (Intercept)  -389.6063445  9.677729641 -40.258031
    #> 10   Argentina        year     0.2317084  0.004888791  47.395847
    #> # ... with 274 more rows, and 1 more variables: p.value

  52. How do you do such things today?
    for loops
    apply(), [slvmt]apply(), split(), by()
    with plyr: [adl][adl_]ply()
    with dplyr: df %>% group_by() %>% do()
    maybe you don’t, because it’s too painful
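
For comparison, here is a sketch of the base-R route through `split()` and the apply family. It is not from the talk: it assumes the built-in `mtcars` data so that it runs without the gapminder package, with `cyl` standing in for country and `mpg ~ wt` standing in for `lifeExp ~ year`.

```r
# Split the data frame into one piece per group.
pieces <- split(mtcars, mtcars$cyl)

# Fit one linear model per piece; the result is a named list of lm objects.
fits <- lapply(pieces, function(d) lm(mpg ~ wt, data = d))

# Pull one number out of each model; vapply() checks the type and length.
rsq <- vapply(fits, function(f) summary(f)$r.squared, numeric(1))

# Reassemble the results into a data frame by hand.
result <- data.frame(cyl = names(rsq), rsq = unname(rsq))
print(result[order(result$rsq), ])
```

The steps match the nested data frame workflow, but the pieces, the models, and the summaries live in three separate objects that have to be kept aligned manually.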

  53. Many other worked examples here:
    https://jennybc.github.io/purrr-tutorial/

  54. @JennyBryan
    @jennybc


    bit.ly/jenny-earl
