Programming in the tidyverse

Lionel Henry
February 23, 2019

Transcript

  1. Data analysis programming • Structure specific tasks: data manipulation, data cleaning, plotting, ... • Interactivity and iteration • Reproducibility by few users

    Production programming • Structure repeated computations: functional programming, generic programming, metaprogramming, ... • Flexibility and robustness • Reusability by many users r-lib tidyverse
  2. Programming in the tidyverse • Tidyverse optimised for interactive analyses

    • Moving towards code reusability • How to program with the tidyverse • Demystifying tidy evaluation
  3. What is the tidyverse • It is a set of

    packages (dplyr, tidyr, purrr, ggplot2, ...) • Website: https://tidyverse.org • Meta-package: library(tidyverse)
  4. What is the tidyverse • It is a set of

    principles • Human centered Computers + people • Consistent Reuse small set of ideas • Composable Solve larger problems • Inclusive Diverse community https://principles.tidyverse.org
  6. • Domain oriented • Language-like interface • Data is the

    important scope Set of verbs for data manipulation • select() • filter() • arrange() • mutate() • group_by() • summarise()
  7. flights # A tibble: 336,776 x 19 year month day

    dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 4 2013 1 1 544 545 -1 1004 # … with 336,772 more rows, and 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
  8. flights %>% filter(month == 10, day == 5) # A

    tibble: 687 x 19 year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 10 5 453 500 -7 624 2 2013 10 5 525 515 10 747 3 2013 10 5 541 545 -4 827 4 2013 10 5 542 545 -3 813 # … with 683 more rows, and 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
  9. flights %>% arrange(desc(month), desc(day)) # A tibble: 336,776 x 19

    year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 12 31 13 2359 14 439 2 2013 12 31 18 2359 19 449 3 2013 12 31 26 2245 101 129 4 2013 12 31 459 500 -1 655 # … with 336,772 more rows, and 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, # minute <dbl>, time_hour <dttm>
  10. flights %>% select(year, month, day) # A tibble: 336,776 x

    3 year month day <int> <int> <int> 1 2013 1 1 2 2013 1 1 3 2013 1 1 4 2013 1 1 # … with 336,772 more rows
  11. flights %>% select(year:day) # A tibble: 336,776 x 3 year

    month day <int> <int> <int> 1 2013 1 1 2 2013 1 1 3 2013 1 1 4 2013 1 1 # … with 336,772 more rows Equivalent to select(1:3)
  12. flights %>% select(ends_with("_time")) # A tibble: 336,776 x 5 dep_time

    sched_dep_time arr_time sched_arr_time air_time <int> <int> <int> <int> <dbl> 1 517 515 830 819 227 2 533 529 850 830 227 3 542 540 923 850 160 4 544 545 1004 1022 183 # … with 336,772 more rows
  13. flights %>% mutate( gain = arr_delay - dep_delay, gain_per_hour =

    gain / (air_time / 60) ) # A tibble: 336,776 x 21 year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 4 2013 1 1 544 545 -1 1004 # … with 336,772 more rows, and 14 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
  14. flights %>% transmute( gain = arr_delay - dep_delay, gain_per_hour =

    gain / (air_time / 60) ) # A tibble: 336,776 x 2 gain gain_per_hour <dbl> <dbl> 1 9 2.38 2 16 4.23 3 31 11.6 4 -17 -5.57 # … with 336,772 more rows • mutate() adds columns • transmute() creates a new tibble
  15. flights %>% group_by(month) %>% summarise(avg_delay = mean(arr_delay, na.rm = TRUE))

    # A tibble: 12 x 2 month avg_delay <int> <dbl> 1 1 6.13 2 2 5.61 3 3 5.81 4 4 11.2 5 5 3.52 6 6 16.5 7 7 16.7 8 8 6.04 9 9 -4.02 10 10 -0.167 11 11 0.461 12 12 14.9 • group_by() only affects future computations • summarise() makes one summary per level
  16. starwars[starwars$height < 200 & starwars$gender == "male", ] starwars %>%

    filter( height < 200, gender == "male" ) • Domain oriented • Language-like interface • Data is the important scope Data masking
  17. Data masking data %>% fill(year) %>% spread(key, count) starwars %>%

    ggplot(aes(height, mass)) + geom_point() + facet_wrap(vars(hair_color)) starwars %>% filter( height < 200, gender == "male" )
  18. Data masking starwars %>% base::subset(height < 150, name:mass) %>% base::transform(height

    = height / 100) starwars %>% stats::lm(formula = mass ~ height) In base R too! • Inspiration for dplyr • By R core member Peter Dalgaard
  19. Data masking • Unique feature of R • Great for

    data analysis • Focus on the task, not the subsetting • Programming is a bit more involved: Tidy eval framework
  20. Tidy eval • Powers data masking from the rlang package

    • Flexible and robust programming • Strange syntax: !! and !!!, enquo(), etc • Requires learning new concepts
  21. Demystifying tidy eval 1. Why tidy evaluation? 2. Do you

    actually need it? 3. Can it be used by laypeople?
  22. Why Tidy Eval starwars[starwars$height < 200 & starwars$gender == "male",

    ] starwars %>% filter( height < 200, gender == "male" ) Change the context of computation
  23. Why Tidy Eval starwars %>% filter( height < 200, gender

    == "male" ) <SQL> SELECT * FROM `starwars` WHERE ((`height` < 200.0) AND (`gender` = 'male')) Change the context of computation
  24. Why Tidy Eval ⟶ Need to delay computations list( height

    < 200, gender == "male" ) Error: object 'height' not found starwars %>% filter( height < 200, gender == "male" )
  25. Why Tidy Eval How it works • Delay computations by

    quoting • Change the context and resume computation starwars %>% filter( height < 200, gender == "male" )
  26. Quoted code is like a blueprint vars( height < 200,

    gender == "male" ) [[1]] <quosure> expr: ^height < 200 env: global [[2]] <quosure> expr: ^gender == "male" env: global • vars() is a fundamental quoting function • Returns blueprints of delayed computations
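The quote-then-resume idea can be sketched directly with rlang. This is a minimal illustration, not how dplyr is implemented: the toy data frame `people` and the variable names are made up here, and `quo()` and `eval_tidy()` are rlang's general-purpose quoting and evaluation functions rather than anything shown on the slides.

```r
library(rlang)

# Hypothetical toy data, standing in for starwars
people <- data.frame(height = c(150, 210, 180))

# Quote: capture the expression as a blueprint instead of running it.
# Running `height < 200` directly would fail unless an object called
# `height` happened to exist in the current environment.
blueprint <- quo(height < 200)

# Resume: evaluate the blueprint with the data frame as the context,
# so that `height` refers to the column.
result <- eval_tidy(blueprint, data = people)
result
```

Data masking verbs such as filter() do both steps for you, which is why the column name can be typed bare.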
  27. Quoted code is like a blueprint Flip side: Harder to

    reuse with different inputs • Loops • Functions
  28. columns <- c("hair_color", "skin_color") out <- rep(list(NULL), 2) for (i

    in seq_along(columns)) { out[[i]] <- starwars %>% summarise(avg = mean(columns[[i]], na.rm = TRUE)) }
  29. columns <- c("hair_color", "skin_color") out <- rep(list(NULL), 2) for (i

    in seq_along(columns)) { out[[i]] <- starwars %>% summarise(avg = mean(columns[[i]], na.rm = TRUE)) } out[[1]]
 
 # A tibble: 1 x 1 avg <dbl> 1 NA
  30. columns <- c("hair_color", "skin_color") out <- rep(list(NULL), 2) for (i

    in seq_along(columns)) { out[[i]] <- starwars %>% summarise(avg = mean(columns[[i]], na.rm = TRUE)) } out[[1]]
 
 # A tibble: 1 x 1 avg <dbl> 1 NA mean("hair_color", na.rm = TRUE) [1] NA Warning message: argument is not numeric or logical: returning NA
  31. average <- function(data, x) { data %>% summarise(avg = mean(x,

    na.rm = TRUE)) } average(starwars, "hair_color") # A tibble: 1 x 1 avg <dbl> 1 NA Warning message: argument is not numeric or logical: returning NA
  32. average <- function(data, x) { data %>% summarise(avg = mean(x,

    na.rm = TRUE)) } average(starwars, hair_color) Error: object 'hair_color' not found • Data masking is not transitive • x is masked instead of hair_color
  33. Quoted code is like a blueprint Programming requires modifying the

    blueprint • !! and !!! are surgery operators for blueprints • Need blueprint material: sym(), enquo(), ... Metaprogramming skills
  34. But before we get there... Do you need tidy eval?

    • Fixed column names • Columnwise mapping
  35. • No need for tidy eval when column names are

    fixed! • Trivial to implement Fixed column names
  36. data %>% transmute(bmi = mass / height^2) compute_bmi <- function(data)

    { data %>% transmute(bmi = mass / height^2) } compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) { stop("`data` must contain `mass` and `height` columns") } data %>% transmute(bmi = mass / height^2) } 1. Repeated code
  37. data %>% transmute(bmi = mass / height^2) compute_bmi <- function(data)

    { data %>% transmute(bmi = mass / height^2) } compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) { stop("`data` must contain `mass` and `height` columns") } data %>% transmute(bmi = mass / height^2) } 2. Wrap pipeline
  38. data %>% transmute(bmi = mass / height^2) 3. Check inputs

    compute_bmi <- function(data) { data %>% transmute(bmi = mass / height^2) } compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) { stop("`data` must contain `mass` and `height` columns") } data %>% transmute(bmi = mass / height^2) }
  39. compute_bmi <- function(data) { if (!all(c("mass", "height") %in% names(data))) {

    stop("`data` must contain `mass` and `height` columns") } mean_height <- round(mean(data$height, na.rm = TRUE), 1) if (mean_height > 3) { warning(glue::glue( "Average height is { mean_height }, is it scaled in meters?" )) } data %>% transmute(bmi = mass / height^2) } 4. Check inputs!
  40. starwars %>% compute_bmi() # A tibble: 87 x 1 bmi

    <dbl> 1 0.00260 2 0.00269 3 0.00347 4 0.00333 # … with 83 more rows Warning message: Average height is 174.4, is it scaled in meters?
  41. starwars %>% mutate(height = height / 100) %>% compute_bmi() #

    A tibble: 87 x 1 bmi <dbl> 1 26.0 2 26.9 3 34.7 4 33.3 # … with 83 more rows
  42. • For specific tasks where fixed names make sense •

    Callers must ensure existence of these columns • Input checking is important • Domain logic may have greater payoff Fixed column names
  43. Columnwise mapping • Repeat operations across columns by mapping •

    Important notion in R (apply family) • The purrr package is all about mapping!
  44. map() loops a function automatically map(data, n_distinct) $group [1] 3

    $value [1] 5 n_distinct(data$group) [1] 3 n_distinct(data$value) [1] 5
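The slide does not show how `data` was built, so here is a self-contained sketch with a made-up data frame that happens to give the same counts. It only illustrates that map() applies the function once per column.

```r
library(dplyr)  # for n_distinct()
library(purrr)  # for map()

# Hypothetical data chosen to have 3 distinct groups and 5 distinct values
data <- data.frame(
  group = c("a", "b", "c", "a", "b"),
  value = c(1, 2, 3, 4, 5)
)

# map() calls n_distinct() on each column and collects the results in a list
counts <- map(data, n_distinct)

# The same computation written out by hand, one column at a time
by_hand <- list(
  group = n_distinct(data$group),
  value = n_distinct(data$value)
)
```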
  45. Scoped variants for dplyr verbs • Map functions over a

    selection of columns • _all suffix ⟶ Map over all columns • _if suffix ⟶ Map over columns selected by a predicate • _at suffix ⟶ Map over a custom selection • Full dplyr features, including groups support Columnwise mapping
  46. mtcars %>% mutate_all(function(x) x / sd(x)) mtcars %>% mutate_all(~ .

    / sd(.)) # A tibble: 32 x 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 3.48 3.36 1.29 1.60 7.29 2.68 9.21 0 2.00 5.42 2.48 2 3.48 3.36 1.29 1.60 7.29 2.94 9.52 0 2.00 5.42 2.48 3 3.78 2.24 0.871 1.36 7.20 2.37 10.4 1.98 2.00 5.42 0.619 4 3.55 3.36 2.08 1.60 5.76 3.29 10.9 1.98 0 4.07 0.619 # … with 28 more rows • Mapping a function over all columns • Supports purrr formulas for anonymous functions
  47. iris %>% mutate_if(is.numeric, ~ . / sd(.)) # A tibble:

    150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 0.793 0.262 setosa 2 5.92 6.88 0.793 0.262 setosa 3 5.68 7.34 0.736 0.262 setosa 4 5.56 7.11 0.850 0.262 setosa # … with 146 more rows • Mapping a function over predicate selection • The predicate function (here is.numeric) determines which columns are changed
  48. iris %>% mutate_at(1:4, ~ . / sd(.)) # A tibble:

    150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 0.793 0.262 setosa 2 5.92 6.88 0.793 0.262 setosa 3 5.68 7.34 0.736 0.262 setosa 4 5.56 7.11 0.850 0.262 setosa # … with 146 more rows • Mapping a function over custom selection • Numeric vector of column positions
  49. nms <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") iris %>% mutate_at(nms, ~

    . / sd(.)) # A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 0.793 0.262 setosa 2 5.92 6.88 0.793 0.262 setosa 3 5.68 7.34 0.736 0.262 setosa 4 5.56 7.11 0.850 0.262 setosa # … with 146 more rows • Mapping a function over custom selection • Character vector of column names
  50. iris %>% mutate_at(vars(Sepal.Length, Sepal.Width), ~ . / sd(.)) iris %>%

    mutate_at(vars(starts_with("Sepal")), ~ . / sd(.)) # A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 6.16 8.03 1.4 0.2 setosa 2 5.92 6.88 1.4 0.2 setosa 3 5.68 7.34 1.3 0.2 setosa 4 5.56 7.11 1.5 0.2 setosa # … with 146 more rows • Mapping a function over custom selection • Pass selection helpers with vars()
  51. iris %>% summarise_if(is.numeric, mean) # A tibble: 1 x 4

    Sepal.Length Sepal.Width Petal.Length Petal.Width <dbl> <dbl> <dbl> <dbl> 1 5.84 3.06 3.76 1.20 Consistent behaviour across variants
  52. iris %>% group_by(Species) %>% summarise_all(mean) # A tibble: 3 x

    5 Species Sepal.Length Sepal.Width Petal.Length Petal.Width <fct> <dbl> <dbl> <dbl> <dbl> 1 setosa 5.01 3.43 1.46 0.246 2 versicolor 5.94 2.77 4.26 1.33 3 virginica 6.59 2.97 5.55 2.03 Consistent behaviour across variants
  53. iris %>% group_by_if(is.factor) %>% summarise_at(vars(starts_with("Sepal")), mean) # A tibble: 3

    x 3 Species Sepal.Length Sepal.Width <fct> <dbl> <dbl> 1 setosa 5.01 3.43 2 versicolor 5.94 2.77 3 virginica 6.59 2.97 Consistent behaviour across variants
  54. Columnwise mapping • Scoped variants can be incredibly useful •

    Reuse skills from purrr and apply functions • No tidy eval needed for looping
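To make the link with the apply family concrete, here is a small sketch comparing a scoped variant with a hand-written lapply() version. The data frame `df` and its column names are invented for the example. The base R half is only one possible equivalent, not a description of how dplyr works internally.

```r
library(dplyr)

# Hypothetical data: two numeric columns and one character column
df <- data.frame(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  label = c("a", "b", "c")
)

# Scoped variant: rescale every numeric column by its standard deviation
scaled <- df %>% mutate_if(is.numeric, ~ . / sd(.))

# Roughly the same operation with base R mapping tools
by_hand <- df
is_num <- vapply(df, is.numeric, logical(1))
by_hand[is_num] <- lapply(df[is_num], function(col) col / sd(col))
```

No quoting or unquoting is involved in either version, since the function being mapped never refers to a column by name.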
  55. Passing the dots Easiest way to create a tidy eval

    function! starwars %>% group_by(gender) %>% summarise(n = n())
  56. my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n

    = n()) } Easiest way to create a tidy eval function! Passing the dots
  57. starwars %>% my_count_by(gender) # A tibble: 5 x 2 gender

    n <chr> <int> 1 NA 3 2 female 19 3 hermaphrodite 1 4 male 62 5 none 2 my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  58. Recipient of dots takes care of everything! • No need

    to quote / delay blueprints • Properties of the function are inherited Passing the dots
  59. Two flavours starwars %>% mutate(birth_year - 100) starwars %>% group_by(birth_year)

    starwars %>% select(birth_year) starwars %>% filter(birth_year < 50) One of these things is not like the other things!
  60. Two flavours starwars %>% mutate(birth_year - 100) starwars %>% group_by(birth_year)

    starwars %>% select(birth_year) starwars %>% filter(birth_year < 50) One of these things is not like the other things! Action Selection
  61. tmp <- starwars$birth_year - 100 starwars$`birth_year - 100` <- tmp

    starwars %>% mutate(birth_year - 100) Most verbs take actions 1. New vectors are created 2. The data frame is modified
  62. Some verbs take selections 1. The position of columns is

    looked up 2. The data frame is reorganised starwars %>% select(birth_year) tmp <- match("birth_year", colnames(starwars)) starwars[, tmp]
  63. starwars %>% select(c(1, height)) starwars %>% select(1:height) starwars %>% select(-1,

    -height) Selections have special properties 1. c(), `-` and `:` understand positions and names 2. Selection helpers know about current variables
  64. starwars %>% select(ends_with("color")) starwars %>% select(matches("^[nm]a")) starwars %>% select(10, everything())

    1. c(), `-` and `:` understand positions and names 2. Selection helpers know about current variables Selections have special properties
  65. Sometimes they appear to work the same way... starwars %>%

    select(height) # A tibble: 87 x 1 height <int> 1 172 2 167 3 96 # … with 84 more rows starwars %>% transmute(height) # A tibble: 87 x 1 height <int> 1 172 2 167 3 96 # … with 84 more rows
  66. starwars %>% select(1) # A tibble: 87 x 1 name

    <chr> 1 Luke Skywalker 2 C-3PO 3 R2-D2 # … with 84 more rows starwars %>% transmute(1) # A tibble: 87 x 1 `1` <dbl> 1 1 2 1 3 1 # … with 84 more rows Sometimes they appear to work the same way...
  67. What about group_by()? starwars %>% group_by(gender) # A tibble: 87

    x 13 # Groups: gender [5] name height mass hair_color skin_color eye_color <chr> <int> <dbl> <chr> <chr> <chr> 1 Luke… 172 77 blond fair blue 2 C-3PO 167 75 NA gold yellow 3 R2-D2 96 32 NA white, bl… red # … with 84 more rows, and 7 more variables
  68. What about group_by()? It takes actions! starwars %>% group_by(height >

    170) %>% summarise(n()) # A tibble: 3 x 2 `height > 170` `n()` <lgl> <int> 1 FALSE 27 2 TRUE 54 3 NA 6
  69. Tip: Use the _at dplyr variants to pass selections! starwars

    %>% group_by_at(vars(ends_with("color")))
  70. starwars %>% my_count_by(gender) # A tibble: 5 x 2 gender

    n <chr> <int> 1 NA 3 2 female 19 3 hermaphrodite 1 4 male 62 5 none 2 my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  71. starwars %>% my_count_by(GENDER = toupper(gender)) # A tibble: 5 x

    2 GENDER n <chr> <int> 1 NA 3 2 FEMALE 19 3 HERMAPHRODITE 1 4 MALE 62 5 NONE 2 my_count_by <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  72. starwars %>% my_count_by(ends_with("_color")) Error: No tidyselect variables were registered my_count_by

    <- function(data, ...) { data %>% group_by(...) %>% summarise(n = n()) } Passing the dots
  73. starwars %>% my_count_by(ends_with("_color"), -hair_color) # A tibble: 53 x 3

    # Groups: skin_color [31] skin_color eye_color n <chr> <chr> <int> 1 blue blue 1 2 blue hazel 1 3 blue, grey yellow 2 4 brown blue 1 # … with 49 more rows my_count_by <- function(data, ...) { data %>% group_by_at(vars(...)) %>% summarise(n = n()) } Passing the dots
  74. • Dots can be passed to aes() or vars() in

    ggplot2! • They both take actions Passing the dots
  75. plot + facet_wrap(~ gender + hair_color) plot + facet_wrap(vars(gender, hair_color))

    Facetting with formulas versus vars() • Facets historically take formulas but vars() has more features • You can pass dots to vars() • vars() accepts names for facet titles
  76. my_wrap <- function(...) { facet_wrap(vars(...), labeller = label_both) } •

    labeller controls label titles • Here, both variable name and facet category Passing the dots
  77. plot <- ggplot(mtcars, aes(disp, drat)) + geom_point() plot + my_wrap(

    cut_number(wt, 3), cyl ) Actions! [Plot: drat against disp, one panel per combination of cut_number(wt, 3) and cyl; panel titles read like "cut_number(wt, 3): (2.81,3.5]" and "cyl: 6"]
  78. plot <- ggplot(mtcars, aes(disp, drat)) + geom_point() plot + my_wrap(

    Weight = cut_number(wt, 3), Cylinder = cyl ) Named actions! [Plot: drat against disp, one panel per combination of Weight and Cylinder; panel titles read like "Weight: (2.81,3.5]" and "Cylinder: 6"]
  79. We just wrapped a single component ⟶ Can be composed

    in pipelines with +
 Two more examples • Entire ggplot2 pipeline • Multiple ggplot2 components Passing the dots
  80. Passing the dots scatter_wrap <- function(data, mapping = aes(), ...)

    { ggplot(data, mapping) + geom_point() + facet_wrap(vars(...), labeller = label_both) } Entire pipeline
  81. Passing the dots mtcars %>% scatter_wrap( aes(disp, drat), Cylinder =

    cyl ) [Plot: drat against disp, one panel per cylinder count, titled "Cylinder: 4", "Cylinder: 6" and "Cylinder: 8"]
  82. Passing the dots scatter_wrap <- function(...) { geom_point() + facet_wrap(vars(...),

    labeller = label_both) } Multiple components An addition pipeline must start with ggplot()
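As far as I know, two ggplot2 components cannot be combined with `+` unless the chain starts from a ggplot object, which is what the note on the slide points at. One common workaround, sketched here as an assumption rather than as the speaker's code, is to return the components in a list, which ggplot2 accepts on the right-hand side of `+`.

```r
library(ggplot2)

# Assumed alternative to the `+` version: bundle the components in a list
# so the function body does not need to start with ggplot()
scatter_wrap <- function(...) {
  list(
    geom_point(),
    facet_wrap(vars(...), labeller = label_both)
  )
}

# The list is added to an existing plot like any single component
p <- ggplot(mtcars, aes(disp, drat)) + scatter_wrap(Cylinder = cyl)
```

The dots are still forwarded to vars(), so named actions such as `Cylinder = cyl` keep working.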
  83. Passing the dots ggplot(mtcars, aes(disp, drat)) + scatter_wrap(Cylinder = cyl)

    [Plot: drat against disp, one panel per cylinder count, titled "Cylinder: 4", "Cylinder: 6" and "Cylinder: 8"]
  84. • Easy way of creating data masking functions • Useful

    when pipeline has only one variable part • Do you need actions or selections? • See my RStudio::conf 2019 talk for more ideas Passing the dots
  85. Subsetting .data What if you have multiple pipeline inputs? starwars

    %>% group_by(gender) %>% summarise(avg = mean(mass, na.rm = TRUE))
  86. .data is a pronoun that represents the data Subsetting .data

    starwars %>% group_by(.data$gender) %>% summarise(avg = mean(.data$mass, na.rm = TRUE))
  87. my_average <- function(data, grp_var, avg_var) { data %>% group_by(.data[[grp_var]]) %>%

    summarise(avg = mean(.data[[avg_var]], na.rm = TRUE)) } Subsetting .data Just pass column names to .data[[
  88. Subsetting .data starwars %>% my_average("gender", "mass") # A tibble: 5

    x 2 gender avg <chr> <dbl> 1 NA 46.3 2 female 54.0 3 hermaphrodite 1358 4 male 81.0 5 none 140 Take strings! No data masking
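Because .data[[ takes strings, the same idea also repairs the kind of loop over column names shown earlier in the deck. The sketch below uses a made-up data frame `df` with numeric columns so that the averages are meaningful. It is one way to write the loop, not the only one.

```r
library(dplyr)

# Hypothetical data with two numeric columns
df <- data.frame(
  mass = c(60, 80, NA),
  height = c(1.6, 1.8, 1.7)
)

columns <- c("mass", "height")
out <- vector("list", length(columns))

for (i in seq_along(columns)) {
  # .data[[ looks the string up as a column of the current data frame,
  # instead of passing the string itself to mean()
  out[[i]] <- df %>%
    summarise(avg = mean(.data[[columns[[i]]]], na.rm = TRUE))
}
```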
  89. • Now getting into the meat of tidy eval •

    Interpolation is a simple pattern • Delay a blueprint by quoting with enquo() • Insert it back in another blueprint by unquoting with !! • Forwards a blueprint across functions Interpolation
  90. Simple tidy eval pattern • Delay a blueprint with enquo()

    • Insert it back with !! my_average <- function(data, grp_var, avg_var) { data %>% group_by(.data[[grp_var]]) %>% summarise(avg = mean(.data[[avg_var]], na.rm = TRUE)) } Interpolation
  91. my_average <- function(data, grp_var, avg_var) { data %>% group_by(!!enquo(grp_var)) %>%

    summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE)) } Simple tidy eval pattern • Delay a blueprint with enquo() • Insert it back with !! Interpolation
  92. starwars %>% my_average(gender, height) # A tibble: 5 x 2

    gender avg <chr> <dbl> 1 NA 120 2 female 165. 3 hermaphrodite 175 4 male 179. 5 none 200 Interpolation
  93. starwars %>% my_average(gender, height / 100) # A tibble: 5

    x 2 gender avg <chr> <dbl> 1 NA 1.2 2 female 1.65 3 hermaphrodite 1.75 4 male 1.79 5 none 2 • Full data masking • Create vectors on the fly Interpolation
  94. Planned syntax: interpolation with {{ arg }} Inspiration from the

    glue package thing <- "FOOBAR" glue::glue("Let's interpolate this { thing } right here") [1] "Let's interpolate this FOOBAR right here" Interpolation
  95. Planned syntax: interpolation with {{ arg }} my_average <- function(data,

    grp_var, avg_var) { data %>% group_by(!!enquo(grp_var)) %>% summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE)) } Interpolation
  96. Planned syntax: interpolation with {{ arg }} my_average <- function(data,

    grp_var, avg_var) { data %>% group_by({{ grp_var }}) %>% summarise(avg = mean({{ avg_var }}, na.rm = TRUE)) } Interpolation
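The {{ }} syntax described as planned here has since been released (rlang 0.4.0 and later), so the function can be tried as written. The usage below is a sketch on an invented data frame `scores`; the column names are placeholders.

```r
library(dplyr)

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by({{ grp_var }}) %>%
    summarise(avg = mean({{ avg_var }}, na.rm = TRUE))
}

# Hypothetical data
scores <- data.frame(
  team = c("a", "a", "b"),
  score = c(1, 3, 10)
)

# Bare column names and expressions are both forwarded to the dplyr verbs
res <- my_average(scores, team, score / 10)
```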
  97. • Simple pattern but quickly gets more complicated • What

    to unquote • Delayed blueprints with enquo() • Custom blueprint material: symbols, function calls, ... • Unquoting variants such as !!! • Simple interpolation should cover many cases Interpolation
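As a hedged illustration of custom blueprint material and of the !!! variant, the sketch below builds symbols from strings with sym() and syms() and unquotes them inside dplyr verbs. The data frame and column names are made up, and this is only one of several ways to reach the same result (subsetting .data is another).

```r
library(dplyr)
library(rlang)

# Hypothetical data with two grouping columns and one value column
df <- data.frame(
  g1 = c("a", "a", "b"),
  g2 = c("x", "y", "y"),
  v = 1:3
)

# Blueprint material built from strings
grp <- syms(c("g1", "g2"))  # a list of symbols
val <- sym("v")             # a single symbol

res <- df %>%
  group_by(!!!grp) %>%            # !!! splices each symbol as its own argument
  summarise(total = sum(!!val))   # !! inserts a single blueprint
```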
  98. • Data masking is a unique R feature • Great

    for data analysis • Harder to program with • You might not need tidy eval • Fixed column names • Map functions on columns • Easy tidy eval techniques • Pass the dots • Subset .data • Quote and unquote (soon interpolate)