
Programming in the tidyverse

Lionel Henry
February 23, 2019

Transcript

  1. Programming in the Tidyverse

  2. Data analysis programming • Structure specific tasks: data manipulation, data cleaning, plotting, ... • Interactivity and iteration • Reproducibility by few users

    Production programming • Structure repeated computations: functional programming, generic programming, metaprogramming, ... • Flexibility and robustness • Reusability by many users r-lib tidyverse
  3. Programming in the tidyverse • Tidyverse optimised for interactive analyses

    • Moving towards code reusability • How to program with the tidyverse • Demystifying tidy evaluation
  4. What is the tidyverse?

  5. What is the tidyverse • It is a set of

    packages (dplyr, tidyr, purrr, ggplot2, ...) • Website: https://tidyverse.org • Meta-package: library(tidyverse)
  6. What is the tidyverse • It is a set of

    principles • Human centered Computers + people • Consistent Reuse small set of ideas • Composable Solve larger problems • Inclusive Diverse community https://principles.tidyverse.org
  8. Human centered Tidyverse packages ⟶ • Domain oriented • Language-like

    interface • Data is the important scope
  9. • Domain oriented • Language-like interface • Data is the

    important scope Set of verbs for data manipulation • select() • filter() • arrange() • mutate() • group_by() • summarise()
  10. flights # A tibble: 336,776 x 19 year month day

    dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 4 2013 1 1 544 545 -1 1004 # … with 336,772 more rows, and 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
  11. flights %>% filter(month == 10, day == 5) # A

    tibble: 687 x 19 year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 10 5 453 500 -7 624 2 2013 10 5 525 515 10 747 3 2013 10 5 541 545 -4 827 4 2013 10 5 542 545 -3 813 # … with 683 more rows, and 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
  12. flights %>% arrange(desc(month), desc(day)) # A tibble: 336,776 x 19

    year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 12 31 13 2359 14 439 2 2013 12 31 18 2359 19 449 3 2013 12 31 26 2245 101 129 4 2013 12 31 459 500 -1 655 # … with 336,772 more rows, and 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, # minute <dbl>, time_hour <dttm>
  13. flights %>% select(year, month, day) # A tibble: 336,776 x

    3 year month day <int> <int> <int> 1 2013 1 1 2 2013 1 1 3 2013 1 1 4 2013 1 1 # … with 336,772 more rows
  14. flights %>% select(year:day) # A tibble: 336,776 x 3 year

    month day <int> <int> <int> 1 2013 1 1 2 2013 1 1 3 2013 1 1 4 2013 1 1 # … with 336,772 more rows Equivalent to select(1:3)
  15. flights %>% select(ends_with("_time")) # A tibble: 336,776 x 5 dep_time

    sched_dep_time arr_time sched_arr_time air_time <int> <int> <int> <int> <dbl> 1 517 515 830 819 227 2 533 529 850 830 227 3 542 540 923 850 160 4 544 545 1004 1022 183 # … with 336,772 more rows
  16. flights %>% mutate( gain = arr_delay - dep_delay, gain_per_hour =

    gain / (air_time / 60) ) # A tibble: 336,776 x 21 year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 4 2013 1 1 544 545 -1 1004 # … with 336,772 more rows, and 14 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
  17. flights %>% transmute( gain = arr_delay - dep_delay, gain_per_hour =

    gain / (air_time / 60) ) # A tibble: 336,776 x 2 gain gain_per_hour <dbl> <dbl> 1 9 2.38 2 16 4.23 3 31 11.6 4 -17 -5.57 # … with 336,772 more rows • mutate() adds columns • transmute() creates a new tibble
  18. flights %>% group_by(month) %>% summarise(avg_delay = mean(arr_delay, na.rm = TRUE))

    # A tibble: 12 x 2 month avg_delay <int> <dbl> 1 1 6.13 2 2 5.61 3 3 5.81 4 4 11.2 5 5 3.52 6 6 16.5 7 7 16.7 8 8 6.04 9 9 -4.02 10 10 -0.167 11 11 0.461 12 12 14.9 • group_by() only affects future computations • summarise() makes one summary per level
  19. starwars[starwars$height < 200 & starwars$gender == "male", ] starwars %>%

    filter( height < 200, gender == "male" ) • Domain oriented • Language-like interface • Data is the important scope Data masking
  20. Data masking data %>% fill(year) %>% spread(key, count) starwars %>%

    ggplot(aes(height, mass)) + geom_point() + facet_wrap(vars(hair_color)) starwars %>% filter( height < 200, gender == "male" )
  21. Data masking starwars %>% base::subset(height < 150, name:mass) %>% base::transform(height

    = height / 100) starwars %>% stats::lm(formula = mass ~ height) In base R too! • Inspiration for dplyr • By R core member
 Peter Dalgaard
  22. Data masking • Unique feature of R • Great for

    data analysis • Focus on the task not the subsetting
 • Programming is a bit more involved
 Tidy eval framework
  23. Tidy eval • Powers data masking from the rlang package

    • Flexible and robust programming • Strange syntax: !! and !!!, enquo(), etc • Requires learning new concepts
  24. Tidy eval

  25. Tidy eval

  26. Demystifying tidy eval 1. Why tidy evaluation? 2. Do you

    actually need it? 3. Can it be used by laypeople?
  27. Why Tidy Eval?

  28. Why Tidy Eval starwars[starwars$height < 200 & starwars$gender == "male",

    ] starwars %>% filter( height < 200, gender == "male" ) Change the context of computation
  29. Why Tidy Eval starwars %>% filter( height < 200, gender

    == "male" ) <SQL> SELECT * FROM `starwars` WHERE ((`height` < 200.0) AND (`gender` = 'male')) Change the context of computation
  30. Why Tidy Eval ⟶ Need to delay computations list( height

    < 200, gender == "male" ) Error: object 'height' not found starwars %>% filter( height < 200, gender == "male" )
  31. Why Tidy Eval How it works • Delay computations by

    quoting • Change the context and resume computation starwars %>% filter( height < 200, gender == "male" )
  32. Quoted code is like a blueprint vars( height < 200,

    gender == "male" ) [[1]] <quosure> expr: ^height < 200 env: global [[2]] <quosure> expr: ^gender == "male" env: global • vars() is a fundamental quoting function • Returns blueprints of 
 delayed computations
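
The quote-then-resume idea can be sketched with rlang directly. This is only a minimal illustration under stated assumptions, not a description of how dplyr is implemented: quo() and eval_tidy() are rlang functions, while the small `df` data frame and its values are made up for the example.

```r
library(rlang)

# Hypothetical stand-in for the `starwars` data used on the slides.
df <- data.frame(
  height = c(172, 167, 202),
  gender = c("male", "none", "male")
)

# Quote the expression: nothing is computed yet, we only hold a blueprint.
blueprint <- quo(height < 200)

# Resume the computation in a new context, with `df` acting as the data mask.
result <- eval_tidy(blueprint, data = df)
print(result)
```

Evaluating `height < 200` without the mask would fail with "object 'height' not found", which is the error shown on slide 30.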
  33. Quoted code is like a blueprint Flip side: Harder to

    reuse with different inputs • Loops • Functions
  34. columns <- c("hair_color", "skin_color") out <- rep(list(NULL), 2) for (i

    in seq_along(columns)) { out[[i]] <- starwars %>% summarise(avg = mean(columns[[i]], na.rm = TRUE)) }
  35. columns <- c("hair_color", "skin_color") out <- rep(list(NULL), 2) for (i

    in seq_along(columns)) { out[[i]] <- starwars %>% summarise(avg = mean(columns[[i]], na.rm = TRUE)) } out[[1]]
 
 # A tibble: 1 x 1 avg <dbl> 1 NA
  36. columns <- c("hair_color", "skin_color") out <- rep(list(NULL), 2) for (i

    in seq_along(columns)) { out[[i]] <- starwars %>% summarise(avg = mean(columns[[i]], na.rm = TRUE)) } out[[1]]
 
 # A tibble: 1 x 1 avg <dbl> 1 NA mean("hair_color", na.rm = TRUE) [1] NA Warning message: argument is not numeric or logical: returning NA
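
The loop fails because `columns[[i]]` evaluates to a string, so mean() receives text rather than a column. One possible repair, sketched here ahead of the talk's own treatment, is to look the column up through the `.data` pronoun. The tibble `df` and its numeric columns are assumptions made for the example (averaging colour columns would not be meaningful anyway).

```r
library(dplyr)

# Hypothetical data with numeric columns.
df <- tibble(
  height = c(172, 167, NA),
  mass   = c(77, 75, 32)
)

columns <- c("height", "mass")
out <- vector("list", length(columns))

for (i in seq_along(columns)) {
  # .data[[name]] retrieves the column whose name is stored in the string.
  out[[i]] <- df %>%
    summarise(avg = mean(.data[[columns[[i]]]], na.rm = TRUE))
}

print(out[[1]])
```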
  37. average <- function(data, x) { data %>% summarise(avg = mean(x,

    na.rm = TRUE)) }
  38. average <- function(data, x) { data %>% summarise(avg = mean(x,

    na.rm = TRUE)) } average(starwars, "hair_color") # A tibble: 1 x 1 avg <dbl> 1 NA Warning message: argument is not numeric or logical: returning NA
  39. average <- function(data, x) { data %>% summarise(avg = mean(x,

    na.rm = TRUE)) } average(starwars, hair_color) Error: object 'hair_color' not found • Data masking is not transitive • x is masked instead of hair_color
  40. Quoted code is like a blueprint Programming requires modifying the

    blueprint • !! and !!! are surgery operators for blueprints • Need blueprint material: sym(), enquo(), ... Metaprogramming skills
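
As a rough sketch of what "modifying the blueprint" can look like, the example below builds a symbol from a string with sym() and inserts it with !!. The tibble `df` and the variable name `col` are hypothetical; sym(), expr() and !! come from rlang.

```r
library(dplyr)
library(rlang)

# Hypothetical data for the example.
df <- tibble(height = c(172, 167, 96))

# Blueprint material: a symbol created from a column name held as a string.
col <- sym("height")

# Inspect the modified blueprint without running it.
print(expr(mean(!!col, na.rm = TRUE)))

# Use the same surgery inside a data-masking verb.
res <- df %>% summarise(avg = mean(!!col, na.rm = TRUE))
print(res)
```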
  41. But before we get there... Do you need tidy eval?

    • Fixed column names • Columnwise mapping
  42. • No need for tidy eval when column names are

    fixed! • Trivial to implement Fixed column names
  43. data %>% transmute(bmi = mass / height^2) compute_bmi <- function(data)

    { data %>% transmute(bmi = mass / height^2) } compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) { stop("`data` must contain `mass` and `height` columns") } data %>% transmute(bmi = mass / height^2) } 1. Repeated code
  44. data %>% transmute(bmi = mass / height^2) compute_bmi <- function(data)

    { data %>% transmute(bmi = mass / height^2) } compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) { stop("`data` must contain `mass` and `height` columns") } data %>% transmute(bmi = mass / height^2) } 2. Wrap pipeline
  45. data %>% transmute(bmi = mass / height^2) 3. Check inputs

    compute_bmi <- function(data) { data %>% transmute(bmi = mass / height^2) } compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) { stop("`data` must contain `mass` and `height` columns") } data %>% transmute(bmi = mass / height^2) }
  46. compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) {

    stop("`data` must contain `mass` and `height` columns") } mean_height <- round(mean(data$height, na.rm = TRUE), 1) if (mean_height > 3) { warning(glue::glue( "Average height is { mean_height }, is it scaled in meters?" )) } data %>% transmute(bmi = mass / height^2) } 4. Check inputs!
  47. iris %>% compute_bmi() Error: `data` must contain `mass` and `height`

    columns
  48. starwars %>% compute_bmi() # A tibble: 87 x 1 bmi

    <dbl> 1 0.00260 2 0.00269 3 0.00347 4 0.00333 # … with 83 more rows Warning message: Average height is 174.4, is it scaled in meters?
  49. starwars %>% mutate(height = height / 100) %>% compute_bmi() #

    A tibble: 87 x 1 bmi <dbl> 1 26.0 2 26.9 3 34.7 4 33.3 # … with 83 more rows
  50. • For specific tasks where fixed names make sense •

    Callers must ensure existence of these columns • Input checking is important • Domain logic may have greater payoff Fixed column names
  51. Columnwise mapping • Repeat operations across columns by mapping •

    Important notion in R (apply family) • The purrr package is all about mapping!
  52. map() loops a function automatically map(data, n_distinct) $group [1] 3

    $value [1] 5 n_distinct(data$group) [1] 3 n_distinct(data$value) [1] 5
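
A self-contained version of the map() call might look like the following. The contents of `data` are an assumption chosen so the counts match the slide (3 distinct groups, 5 distinct values); map_int() is the purrr variant that returns an integer vector instead of a list.

```r
library(purrr)
library(dplyr)

# Hypothetical data frame matching the counts shown on the slide.
data <- data.frame(
  group = c("a", "b", "c", "a", "b"),
  value = 1:5
)

# map() applies n_distinct() to each column and returns a list.
counts <- map(data, n_distinct)

# map_int() does the same but simplifies to a named integer vector.
counts_int <- map_int(data, n_distinct)
print(counts_int)
```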
  53. Scoped variants for dplyr verbs • Map functions over a

    selection of columns • _all suffix ⟶ Map over all columns • _if suffix ⟶ Map over columns selected by a predicate • _at suffix ⟶ Map over a custom selection • Full dplyr features, including groups support Columnwise mapping
  54. Columnwise mapping No data masking • Take objects not blueprints

    • Easy to program with
  55. Columnwise mapping mutate_all() mutate_at() mutate_if() summarise_all() summarise_at() summarise_if() group_by_all() group_by_at()

    group_by_if() filter_all() filter_at() filter_if()
  56. mtcars %>% mutate_all(function(x) x / sd(x)) mtcars %>% mutate_all(~ .

    / sd(.)) # A tibble: 32 x 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 3.48 3.36 1.29 1.60 7.29 2.68 9.21 0 2.00 5.42 2.48 2 3.48 3.36 1.29 1.60 7.29 2.94 9.52 0 2.00 5.42 2.48 3 3.78 2.24 0.871 1.36 7.20 2.37 10.4 1.98 2.00 5.42 0.619 4 3.55 3.36 2.08 1.60 5.76 3.29 10.9 1.98 0 4.07 0.619 # … with 28 more rows Mapping a function
 over all columns Supports purrr formulas for anonymous functions
  57. iris %>% mutate_if(is.numeric, ~ . / sd(.)) # A tibble:

    150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 0.793 0.262 setosa 2 5.92 6.88 0.793 0.262 setosa 3 5.68 7.34 0.736 0.262 setosa 4 5.56 7.11 0.850 0.262 setosa # … with 146 more rows Mapping a function
 over predicate selection This function determines
 which columns are changed
  58. iris %>% mutate_at(1:4, ~ . / sd(.)) # A tibble:

    150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 0.793 0.262 setosa 2 5.92 6.88 0.793 0.262 setosa 3 5.68 7.34 0.736 0.262 setosa 4 5.56 7.11 0.850 0.262 setosa # … with 146 more rows Mapping a function
 over custom selection Numeric vector
 of column positions
  59. nms <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") iris %>% mutate_at(nms, ~

    . / sd(.)) # A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 0.793 0.262 setosa 2 5.92 6.88 0.793 0.262 setosa 3 5.68 7.34 0.736 0.262 setosa 4 5.56 7.11 0.850 0.262 setosa # … with 146 more rows Mapping a function
 over custom selection Character vector
 of column names
  60. iris %>% mutate_at(vars(Sepal.Length, Sepal.Width), ~ . / sd(.)) iris %>%

    mutate_at(vars(starts_with("Sepal")), ~ . / sd(.)) # A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 1.4 0.2 setosa 2 5.92 6.88 1.4 0.2 setosa 3 5.68 7.34 1.3 0.2 setosa 4 5.56 7.11 1.5 0.2 setosa # … with 146 more rows Mapping a function
 over custom selection Pass selection helpers
 with vars()
  61. iris %>% summarise_if(is.numeric, mean) # A tibble: 1 x 4

    Sepal.Length Sepal.Width Petal.Length Petal.Width <dbl> <dbl> <dbl> <dbl> 1 5.84 3.06 3.76 1.20 Consistent behaviour across variants
  62. iris %>% group_by(Species) %>% summarise_all(mean) # A tibble: 3 x

    5 Species Sepal.Length Sepal.Width Petal.Length Petal.Width <fct> <dbl> <dbl> <dbl> <dbl> 1 setosa 5.01 3.43 1.46 0.246 2 versicolor 5.94 2.77 4.26 1.33 3 virginica 6.59 2.97 5.55 2.03 Consistent behaviour across variants
  63. iris %>% group_by_if(is.factor) %>% summarise_at(vars(starts_with("Sepal")), mean) # A tibble: 3

    x 3 Species Sepal.Length Sepal.Width <fct> <dbl> <dbl> 1 setosa 5.01 3.43 2 versicolor 5.94 2.77 3 virginica 6.59 2.97 Consistent behaviour across variants
  64. Columnwise mapping • Scoped variants can be incredibly useful •

    Reuse skills from purrr and apply functions • No tidy eval needed for looping
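
Because scoped variants take ordinary objects rather than blueprints, wrapping one in a function needs no quoting at all. The helper name `scale_columns` below is hypothetical; it simply forwards a character vector of column names to mutate_at(), as in the examples above.

```r
library(dplyr)

# Hypothetical helper: divide the chosen columns by their standard deviation.
scale_columns <- function(data, cols) {
  # `cols` is a plain character vector, so it can be passed along as is.
  data %>% mutate_at(cols, ~ . / sd(.))
}

scaled <- scale_columns(iris, c("Sepal.Length", "Sepal.Width"))
print(head(scaled, 4))
```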
  65. Tidy eval, the easy parts

  66. 1. Pass the dots 2. Subset .data 3. Interpolate Tidy

    eval, the easy parts
  68. Passing the dots Easiest way to create a tidy eval

    function! starwars %>% group_by(gender) %>% summarise(n = n())
  69. my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n

    = n()) } Easiest way to create a tidy eval function! Passing the dots
  70. starwars %>% my_count_by(gender) # A tibble: 5 x 2 gender

    n <chr> <int> 1 NA 3 2 female 19 3 hermaphrodite 1 4 male 62 5 none 2 my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  71. Recipient of dots takes care of everything! • No need

    to quote / delay blueprints • Properties of the function are inherited Passing the dots
  72. Two flavours of 
 tidy evaluation

  73. Two flavours starwars %>% mutate(birth_year - 100) starwars %>% group_by(birth_year)

    starwars %>% select(birth_year) starwars %>% filter(birth_year < 50) One of these things is not like the other things!
  74. Two flavours starwars %>% mutate(birth_year - 100) starwars %>% group_by(birth_year)

    starwars %>% select(birth_year) starwars %>% filter(birth_year < 50) One of these things is not like the other things! Action Selection
  75. tmp <- starwars$birth_year - 100 starwars$`birth_year - 100` <- tmp

    starwars %>% mutate(birth_year - 100) Most verbs take actions 1. New vectors are created 2. The data frame is modified
  76. Some verbs take selections 1. The position of columns is

    looked up 2. The data frame is reorganised starwars %>% select(birth_year) tmp <- match("birth_year", colnames(starwars)) starwars[, tmp]
  77. starwars %>% select(c(1, height)) starwars %>% select(1:height) starwars %>% select(-1,

    -height) Selections have special properties 1. c(), `-` and `:` understand positions and names 2. Selection helpers know about current variables
  78. starwars %>% select(ends_with("color")) starwars %>% select(matches("^[nm]a")) starwars %>% select(10, everything())

    1. c(), `-` and `:` understand positions and names 2. Selection helpers know about current variables Selections have special properties
  79. Sometimes they appear to work the same way... starwars %>%

    select(height) # A tibble: 87 x 1 height <int> 1 172 2 167 3 96 # … with 84 more rows starwars %>% transmute(height) # A tibble: 87 x 1 height <int> 1 172 2 167 3 96 # … with 84 more rows
  80. starwars %>% select(1) # A tibble: 87 x 1 name

    <chr> 1 Luke Skywalker 2 C-3PO 3 R2-D2 # … with 84 more rows starwars %>% transmute(1) # A tibble: 87 x 1 `1` <dbl> 1 1 2 1 3 1 # … with 84 more rows Sometimes they appear to work the same way...
  81. What about group_by()? starwars %>% group_by(gender) # A tibble: 87

    x 13 # Groups: gender [5] name height mass hair_color skin_color eye_color <chr> <int> <dbl> <chr> <chr> <chr> 1 Luke… 172 77 blond fair blue 2 C-3PO 167 75 NA gold yellow 3 R2-D2 96 32 NA white, bl… red # … with 84 more rows, and 7 more variables
  82. starwars %>% group_by(ends_with("color")) Error: No tidyselect variables were registered What

    about group_by()? It takes actions!
  83. What about group_by()? It takes actions! starwars %>% group_by(height >

    170) %>% summarise(n()) # A tibble: 3 x 2 `height > 170` `n()` <lgl> <int> 1 FALSE 27 2 TRUE 54 3 NA 6
  84. Tip: Use the _at dplyr variants to pass selections! starwars

    %>% group_by_at(vars(ends_with("color")))
  85. starwars %>% my_count_by(gender) # A tibble: 5 x 2 gender

    n <chr> <int> 1 NA 3 2 female 19 3 hermaphrodite 1 4 male 62 5 none 2 my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  86. starwars %>% my_count_by(GENDER = toupper(gender)) # A tibble: 5 x

    2 GENDER n <chr> <int> 1 NA 3 2 FEMALE 19 3 HERMAPHRODITE 1 4 MALE 62 5 NONE 2 my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  87. starwars %>% my_count_by(ends_with("_color")) Error: No tidyselect variables were registered my_count_by

    <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  88. my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n

    = n()) }
  89. my_count_by <- function(data, ...) { data %>% group_by_at(vars(...)) %>% summarise(n

    = n()) }
  90. starwars %>% my_count_by(ends_with("_color"), -hair_color) # A tibble: 53 x 3

    # Groups: skin_color [31] skin_color eye_color n <chr> <chr> <int> 1 blue blue 1 2 blue hazel 1 3 blue, grey yellow 2 4 brown blue 1 # … with 49 more rows my_count_by <- function(data, ...) { data %>% group_by_at(vars(...)) %>% summarise(n = n()) } Passing the dots
  91. • Dots can be passed to aes() or vars() in

    ggplot2! • They both take actions Passing the dots
  92. plot + facet_wrap(~ gender + hair_color) plot + facet_wrap(vars(gender, hair_color))

    Faceting with formulas versus vars() • Facets historically take formulas but vars() has more features • You can pass dots to vars() • vars() accepts names for facet titles
  93. my_wrap <- function(...) { facet_wrap(vars(...), labeller = label_both) } •

    labeller controls label titles • Here, both variable name and facet category Passing the dots
  94. plot <- ggplot(mtcars, aes(disp, drat)) + geom_point() plot + my_wrap(

    cut_number(wt, 3), cyl ) Actions! [Plot: scatter of drat against disp, faceted by cut_number(wt, 3) and cyl, with panel labels such as "cut_number(wt, 3): (2.81,3.5]" and "cyl: 6"]
  95. plot <- ggplot(mtcars, aes(disp, drat)) + geom_point() plot + my_wrap(

    Weight = cut_number(wt, 3), Cylinder = cyl ) Named actions! [Plot: scatter of drat against disp, faceted by weight bin and cylinder count, with panel labels such as "Weight: (2.81,3.5]" and "Cylinder: 6"]
  96. We just wrapped a single component ⟶ Can be composed

    in pipelines with +
 Two more examples • Entire ggplot2 pipeline • Multiple ggplot2 components Passing the dots
  97. Passing the dots scatter_wrap <- function(data, mapping = aes(), ...)

    { ggplot(data, mapping) + geom_point() + facet_wrap(vars(...), labeller = label_both) } Entire pipeline
  98. Passing the dots mtcars %>% scatter_wrap( aes(disp, drat), Cylinder =

    cyl ) [Plot: scatter of drat against disp, with one panel each for "Cylinder: 4", "Cylinder: 6" and "Cylinder: 8"]
  99. Passing the dots scatter_wrap <- function(...) { geom_point() + facet_wrap(vars(...),

    labeller = label_both) } Multiple components
  100. Passing the dots scatter_wrap <- function(...) { geom_point() + facet_wrap(vars(...),

    labeller = label_both) } Multiple components An addition pipeline must start with ggplot()
  101. Passing the dots scatter_wrap <- function(...) { list( geom_point(), facet_wrap(vars(...),

    labeller = label_both) ) } Multiple components
  102. Passing the dots ggplot(mtcars, aes(disp, drat)) + scatter_wrap(Cylinder = cyl)

    [Plot: scatter of drat against disp, with one panel each for "Cylinder: 4", "Cylinder: 6" and "Cylinder: 8"]
  103. • Easy way of creating data masking functions • Useful

    when pipeline has only one variable part • Do you need actions or selections? • See my RStudio::conf 2019 talk for more ideas Passing the dots
  104. 1. Pass the dots 2. Subset .data 3. Interpolate Tidy

    eval, the easy parts
  105. Subsetting .data What if you have multiple pipeline inputs? starwars

    %>% group_by(gender) %>% summarise(avg = mean(mass, na.rm = TRUE))
  106. .data is a pronoun that represents the data Subsetting .data

    starwars %>% group_by(.data$gender) %>% summarise(avg = mean(.data$mass, na.rm = TRUE))
  107. my_average <- function(data, grp_var, avg_var) { data %>% group_by(.data[[grp_var]]) %>%

    summarise(avg = mean(.data[[avg_var]], na.rm = TRUE)) } Subsetting .data Just pass column names to .data[[ ]]
  108. Subsetting .data starwars %>% my_average("gender", "mass") # A tibble: 5

    x 2 gender avg <chr> <dbl> 1 NA 46.3 2 female 54.0 3 hermaphrodite 1358 4 male 81.0 5 none 140 Take strings! No data masking
  109. Subsetting .data starwars %>% my_average(gender, mass) Error: object 'gender' not

    found Take strings! No data masking
  110. 1. Pass the dots 2. Subset .data 3. Interpolate Tidy

    eval, the easy parts
  111. • Now getting into the meat of tidy eval •

    Interpolation is a simple pattern • Delay a blueprint by quoting with enquo() • Insert it back in another blueprint by unquoting with !! • Forwards a blueprint across functions Interpolation
  112. Simple tidy eval pattern • Delay a blueprint with enquo()

    • Insert it back with !! my_average <- function(data, grp_var, avg_var) { data %>% group_by(.data[[grp_var]]) %>% summarise(avg = mean(.data[[avg_var]], na.rm = TRUE)) } Interpolation
  113. my_average <- function(data, grp_var, avg_var) { data %>% group_by(!!enquo(grp_var)) %>%

    summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE)) } Simple tidy eval pattern • Delay a blueprint with enquo() • Insert it back with !! Interpolation
  114. starwars %>% my_average(gender, height) # A tibble: 5 x 2

    gender avg <chr> <dbl> 1 NA 120 2 female 165. 3 hermaphrodite 175 4 male 179. 5 none 200 Interpolation
  115. starwars %>% my_average(gender, height / 100) # A tibble: 5

    x 2 gender avg <chr> <dbl> 1 NA 1.2 2 female 1.65 3 hermaphrodite 1.75 4 male 1.79 5 none 2 • Full data masking • Create vectors on the fly Interpolation
  116. Planned syntax: interpolation with {{ arg }} Inspiration from the

    glue package thing <- "FOOBAR" glue::glue("Let's interpolate this { thing } right here") [1] "Let's interpolate this FOOBAR right here" Interpolation
  117. Planned syntax: interpolation with {{ arg }} my_average <- function(data,

    grp_var, avg_var) { data %>% group_by(!!enquo(grp_var)) %>% summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE)) } Interpolation
  118. Planned syntax: interpolation with {{ arg }} my_average <- function(data,

    grp_var, avg_var) { data %>% group_by({{ grp_var }}) %>% summarise(avg = mean({{ avg_var }}, na.rm = TRUE)) } Interpolation
  119. • Simple pattern but quickly gets more complicated • What

    to unquote • Delayed blueprints with enquo() • Custom blueprint material: symbols, function calls, ... • Unquoting variants such as !!! • Simple interpolation should cover many cases Interpolation
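
To give a flavour of the more advanced cases, here is a small sketch combining custom blueprint material with the !!! variant. The function name `count_by_names` and the tibble `df` are hypothetical; syms() and !!! come from rlang. It groups by any number of columns given as strings.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: count rows per combination of the named columns.
count_by_names <- function(data, names) {
  # Turn each string into a symbol, i.e. a small blueprint.
  groups <- syms(names)
  # !!! splices the list of symbols in as separate arguments.
  data %>%
    group_by(!!!groups) %>%
    summarise(n = n())
}

# Hypothetical data for the example.
df <- tibble(
  gender  = c("male", "female", "male"),
  species = c("Human", "Human", "Droid")
)

res <- count_by_names(df, c("gender", "species"))
print(res)
```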
  120. • Data masking is a unique R feature • Great

    for data analysis • Harder to program with • You might not need tidy eval • Fixed column names • Map functions on columns • Easy tidy eval techniques • Pass the dots • Subset .data • Quote and unquote (soon interpolate)