Programming in the tidyverse

Data analysis programming
• Structure specific tasks: data manipulation, data cleaning, plotting, ...
• Interactivity and iteration
• Reproducibility by few users
Production programming
• Structure repeated computations: functional programming, generic programming, metaprogramming, ...
• Flexibility and robustness
• Reusability by many users
r-lib, tidyverse

Programming in the tidyverse
• Tidyverse optimised for interactive analyses
• Moving towards code reusability
• How to program with the tidyverse
• Demystifying tidy evaluation

What is the tidyverse?

What is the tidyverse
• It is a set of packages (dplyr, tidyr, purrr, ggplot2, ...)
• Website: https://tidyverse.org
• Meta-package: library(tidyverse)

What is the tidyverse
• It is a set of principles
• Human centered: computers + people
• Consistent: reuse a small set of ideas
• Composable: solve larger problems
• Inclusive: diverse community
https://principles.tidyverse.org

Human centered
Tidyverse packages ⟶
• Domain oriented
• Language-like interface
• Data is the important scope
Set of verbs for data manipulation
• select()
• filter()
• arrange()
• mutate()
• group_by()
• summarise()

    flights
    # A tibble: 336,776 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013     1     1      517            515         2      830
    2  2013     1     1      533            529         4      850
    3  2013     1     1      542            540         2      923
    4  2013     1     1      544            545        -1     1004
    # … with 336,772 more rows, and 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …

    flights %>% filter(month == 10, day == 5)
    # A tibble: 687 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013    10     5      453            500        -7      624
    2  2013    10     5      525            515        10      747
    3  2013    10     5      541            545        -4      827
    4  2013    10     5      542            545        -3      813
    # … with 683 more rows, and 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …

    flights %>% arrange(desc(month), desc(day))
    # A tibble: 336,776 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013    12    31       13           2359        14      439
    2  2013    12    31       18           2359        19      449
    3  2013    12    31       26           2245       101      129
    4  2013    12    31      459            500        -1      655
    # … with 336,772 more rows, and 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    #   minute <dbl>, time_hour <dttm>

    flights %>% select(year, month, day)
    # A tibble: 336,776 x 3
       year month   day
      <int> <int> <int>
    1  2013     1     1
    2  2013     1     1
    3  2013     1     1
    4  2013     1     1
    # … with 336,772 more rows

    flights %>% select(year:day)
    # A tibble: 336,776 x 3
       year month   day
      <int> <int> <int>
    1  2013     1     1
    2  2013     1     1
    3  2013     1     1
    4  2013     1     1
    # … with 336,772 more rows
Equivalent to select(1:3)

    flights %>% select(ends_with("_time"))
    # A tibble: 336,776 x 5
      dep_time sched_dep_time arr_time sched_arr_time air_time
         <int>          <int>    <int>          <int>    <dbl>
    1      517            515      830            819      227
    2      533            529      850            830      227
    3      542            540      923            850      160
    4      544            545     1004           1022      183
    # … with 336,772 more rows

    flights %>%
      mutate(
        gain = arr_delay - dep_delay,
        gain_per_hour = gain / (air_time / 60)
      )
    # A tibble: 336,776 x 21
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013     1     1      517            515         2      830
    2  2013     1     1      533            529         4      850
    3  2013     1     1      542            540         2      923
    4  2013     1     1      544            545        -1     1004
    # … with 336,772 more rows, and 14 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …

    flights %>%
      transmute(
        gain = arr_delay - dep_delay,
        gain_per_hour = gain / (air_time / 60)
      )
    # A tibble: 336,776 x 2
       gain gain_per_hour
      <dbl>         <dbl>
    1     9          2.38
    2    16          4.23
    3    31         11.6
    4   -17         -5.57
    # … with 336,772 more rows
• mutate() adds columns
• transmute() creates a new tibble

    flights %>%
      group_by(month) %>%
      summarise(avg_delay = mean(arr_delay, na.rm = TRUE))
    # A tibble: 12 x 2
       month avg_delay
       <int>     <dbl>
     1     1     6.13
     2     2     5.61
     3     3     5.81
     4     4    11.2
     5     5     3.52
     6     6    16.5
     7     7    16.7
     8     8     6.04
     9     9    -4.02
    10    10    -0.167
    11    11     0.461
    12    12    14.9
• group_by() only affects future computations
• summarise() makes one summary per level

• Domain oriented
• Language-like interface
• Data is the important scope
    starwars[starwars$height < 200 & starwars$gender == "male", ]
    starwars %>%
      filter(
        height < 200,
        gender == "male"
      )
Data masking

Data masking
    data %>%
      fill(year) %>%
      spread(key, count)
    starwars %>%
      ggplot(aes(height, mass)) +
      geom_point() +
      facet_wrap(vars(hair_color))
    starwars %>%
      filter(
        height < 200,
        gender == "male"
      )

Data masking
In base R too!
    starwars %>%
      base::subset(height < 150, name:mass) %>%
      base::transform(height = height / 100)
    starwars %>%
      stats::lm(formula = mass ~ height)
• Inspiration for dplyr
• By R core member Peter Dalgaard

Data masking
• Unique feature of R
• Great for data analysis
• Focus on the task, not the subsetting
• Programming is a bit more involved ⟶ Tidy eval framework

Tidy eval
• Powers data masking (from the rlang package)
• Flexible and robust programming
• Strange syntax: !! and !!!, enquo(), etc.
• Requires learning new concepts

Tidy eval

Demystifying tidy eval
1. Why tidy evaluation?
2. Do you actually need it?
3. Can it be used by laypeople?

Why Tidy Eval?

Why Tidy Eval
Change the context of computation
    starwars[starwars$height < 200 & starwars$gender == "male", ]
    starwars %>%
      filter(
        height < 200,
        gender == "male"
      )

Why Tidy Eval
Change the context of computation
    starwars %>%
      filter(
        height < 200,
        gender == "male"
      )
    <SQL>
    SELECT *
    FROM `starwars`
    WHERE ((`height` < 200.0) AND (`gender` = 'male'))

Why Tidy Eval
⟶ Need to delay computations
    list(
      height < 200,
      gender == "male"
    )
    Error: object 'height' not found
    starwars %>%
      filter(
        height < 200,
        gender == "male"
      )

Why Tidy Eval
How it works
• Delay computations by quoting
• Change the context and resume computation
    starwars %>%
      filter(
        height < 200,
        gender == "male"
      )
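The two steps above (quote, then resume in a different context) can be sketched directly with rlang. This is a minimal illustration, not how filter() is implemented: the small data frame `df` is made up for the example, and quo() and eval_tidy() are used here as the simplest quoting and evaluation tools.

```r
library(rlang)

# Illustrative data, not from the slides.
df <- data.frame(
  height = c(150, 180, 210),
  gender = c("male", "female", "male")
)

# Step 1: quote. The expression is captured, not evaluated, so it does not
# matter that no `height` or `gender` object exists outside `df`.
blueprint <- quo(height < 200 & gender == "male")

# Step 2: change the context and resume. `df` is used as a data mask, so
# `height` and `gender` are looked up among its columns first.
keep <- eval_tidy(blueprint, data = df)

df[keep, ]
```

Only the first row satisfies both conditions, so `keep` is TRUE for that row alone.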

Quoted code is like a blueprint
    vars(
      height < 200,
      gender == "male"
    )
    [[1]]
    <quosure>
    expr: ^height < 200
    env:  global
    [[2]]
    <quosure>
    expr: ^gender == "male"
    env:  global
• vars() is a fundamental quoting function
• Returns blueprints of delayed computations

Quoted code is like a blueprint
Flip side: harder to reuse with different inputs
• Loops
• Functions

    columns <- c("hair_color", "skin_color")
    out <- rep(list(NULL), 2)
    for (i in seq_along(columns)) {
      out[[i]] <- starwars %>%
        summarise(avg = mean(columns[[i]], na.rm = TRUE))
    }
    out[[1]]
    # A tibble: 1 x 1
        avg
      <dbl>
    1    NA
    mean("hair_color", na.rm = TRUE)
    [1] NA
    Warning message:
    argument is not numeric or logical: returning NA

    average <- function(data, x) {
      data %>%
        summarise(avg = mean(x, na.rm = TRUE))
    }
    average(starwars, "hair_color")
    # A tibble: 1 x 1
        avg
      <dbl>
    1    NA
    Warning message:
    argument is not numeric or logical: returning NA
    average(starwars, hair_color)
    Error: object 'hair_color' not found
• Data masking is not transitive
• x is masked instead of hair_color

Quoted code is like a blueprint
Programming requires modifying the blueprint
• !! and !!! are surgery operators for blueprints
• Need blueprint material: sym(), enquo(), ...
Metaprogramming skills
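As a rough sketch of what "blueprint material" plus a "surgery operator" looks like in practice, the average() function can be made to accept a column name as a string by turning it into a symbol with sym() and unquoting it with !!. The data frame `df` and this variant of average() are illustrative assumptions, not code from the slides.

```r
library(dplyr)
library(rlang)

# Illustrative data, not from the slides.
df <- tibble(height = c(150, 180, NA), mass = c(50, 80, 70))

# Hypothetical variant of average() taking the column name as a string.
average <- function(data, x) {
  col <- sym(x)  # blueprint material: the string "height" becomes the symbol `height`
  data %>%
    summarise(avg = mean(!!col, na.rm = TRUE))  # !! inserts the symbol into the blueprint
}

res <- average(df, "height")
res$avg
```

With the symbol unquoted, summarise() sees `mean(height, na.rm = TRUE)` rather than `mean("height", na.rm = TRUE)`, so the column is found through the data mask.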

But before we get there... Do you need tidy eval?
• Fixed column names
• Columnwise mapping

Fixed column names
• No need for tidy eval when column names are fixed!
• Trivial to implement

1. Repeated code
    data %>% transmute(bmi = mass / height^2)
2. Wrap pipeline
    compute_bmi <- function(data) {
      data %>% transmute(bmi = mass / height^2)
    }
3. Check inputs
    compute_bmi <- function(data) {
      if (!all(c("mass", "height") %in% names(data))) {
        stop("`data` must contain `mass` and `height` columns")
      }
      data %>% transmute(bmi = mass / height^2)
    }

4. Check inputs!
    compute_bmi <- function(data) {
      if (!all(c("mass", "height") %in% names(data))) {
        stop("`data` must contain `mass` and `height` columns")
      }
      mean_height <- round(mean(data$height, na.rm = TRUE), 1)
      if (mean_height > 3) {
        warning(glue::glue(
          "Average height is { mean_height }, is it scaled in meters?"
        ))
      }
      data %>% transmute(bmi = mass / height^2)
    }

    iris %>% compute_bmi()
    Error: `data` must contain `mass` and `height` columns

    starwars %>% compute_bmi()
    # A tibble: 87 x 1
          bmi
        <dbl>
    1 0.00260
    2 0.00269
    3 0.00347
    4 0.00333
    # … with 83 more rows
    Warning message:
    Average height is 174.4, is it scaled in meters?

    starwars %>%
      mutate(height = height / 100) %>%
      compute_bmi()
    # A tibble: 87 x 1
        bmi
      <dbl>
    1  26.0
    2  26.9
    3  34.7
    4  33.3
    # … with 83 more rows

Fixed column names
• For specific tasks where fixed names make sense
• Callers must ensure existence of these columns
• Input checking is important
• Domain logic may have greater payoff

Columnwise mapping
• Repeat operations across columns by mapping
• Important notion in R (apply family)
• The purrr package is all about mapping!

map() loops a function automatically
    map(data, n_distinct)
    $group
    [1] 3
    $value
    [1] 5
    n_distinct(data$group)
    [1] 3
    n_distinct(data$value)
    [1] 5
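The slide does not show how `data` was built. A self-contained sketch, assuming a small tibble with a `group` column of three distinct values and a `value` column of five, reproduces the same counts:

```r
library(purrr)
library(dplyr)  # for tibble() and n_distinct()

# Assumed stand-in for the `data` object on the slide.
data <- tibble(
  group = c("a", "b", "c", "a", "b"),
  value = c(1, 2, 3, 4, 5)
)

# map() calls n_distinct() once per column and returns a named list.
counts <- map(data, n_distinct)
counts

# The same thing written out by hand, one column at a time.
n_distinct(data$group)
n_distinct(data$value)
```

A data frame is a list of columns, which is why map() iterates over columns rather than rows.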

Columnwise mapping
Scoped variants for dplyr verbs
• Map functions over a selection of columns
• _all suffix ⟶ Map over all columns
• _if suffix ⟶ Map over columns selected by a predicate
• _at suffix ⟶ Map over a custom selection
• Full dplyr features, including groups support

Columnwise mapping
No data masking
• Take objects not blueprints
• Easy to program with

Columnwise mapping
    mutate_all()     mutate_at()     mutate_if()
    summarise_all()  summarise_at()  summarise_if()
    group_by_all()   group_by_at()   group_by_if()
    filter_all()     filter_at()     filter_if()

Mapping a function over all columns
    mtcars %>% mutate_all(function(x) x / sd(x))
    mtcars %>% mutate_all(~ . / sd(.))
    # A tibble: 32 x 11
        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  3.48  3.36 1.29   1.60  7.29  2.68  9.21  0     2.00  5.42 2.48
    2  3.48  3.36 1.29   1.60  7.29  2.94  9.52  0     2.00  5.42 2.48
    3  3.78  2.24 0.871  1.36  7.20  2.37 10.4   1.98  2.00  5.42 0.619
    4  3.55  3.36 2.08   1.60  5.76  3.29 10.9   1.98  0     4.07 0.619
    # … with 28 more rows
Supports purrr formulas for anonymous functions

Mapping a function over predicate selection
    iris %>% mutate_if(is.numeric, ~ . / sd(.))
    # A tibble: 150 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
             <dbl>       <dbl>        <dbl>       <dbl> <fct>
    1         6.16        8.03        0.793       0.262 setosa
    2         5.92        6.88        0.793       0.262 setosa
    3         5.68        7.34        0.736       0.262 setosa
    4         5.56        7.11        0.850       0.262 setosa
    # … with 146 more rows
This function (is.numeric) determines which columns are changed

Mapping a function over custom selection
    iris %>% mutate_at(1:4, ~ . / sd(.))
    # A tibble: 150 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
             <dbl>       <dbl>        <dbl>       <dbl> <fct>
    1         6.16        8.03        0.793       0.262 setosa
    2         5.92        6.88        0.793       0.262 setosa
    3         5.68        7.34        0.736       0.262 setosa
    4         5.56        7.11        0.850       0.262 setosa
    # … with 146 more rows
Numeric vector of column positions

Mapping a function over custom selection
    nms <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
    iris %>% mutate_at(nms, ~ . / sd(.))
    # A tibble: 150 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
             <dbl>       <dbl>        <dbl>       <dbl> <fct>
    1         6.16        8.03        0.793       0.262 setosa
    2         5.92        6.88        0.793       0.262 setosa
    3         5.68        7.34        0.736       0.262 setosa
    4         5.56        7.11        0.850       0.262 setosa
    # … with 146 more rows
Character vector of column names

Mapping a function over custom selection
    iris %>% mutate_at(vars(Sepal.Length, Sepal.Width), ~ . / sd(.))
    iris %>% mutate_at(vars(starts_with("Sepal")), ~ . / sd(.))
    # A tibble: 150 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
             <dbl>       <dbl>        <dbl>       <dbl> <fct>
    1         6.16        8.03          1.4         0.2 setosa
    2         5.92        6.88          1.4         0.2 setosa
    3         5.68        7.34          1.3         0.2 setosa
    4         5.56        7.11          1.5         0.2 setosa
    # … with 146 more rows
Pass selection helpers with vars()

Consistent behaviour across variants
    iris %>% summarise_if(is.numeric, mean)
    # A tibble: 1 x 4
      Sepal.Length Sepal.Width Petal.Length Petal.Width
             <dbl>       <dbl>        <dbl>       <dbl>
    1         5.84        3.06         3.76        1.20

Consistent behaviour across variants
    iris %>%
      group_by(Species) %>%
      summarise_all(mean)
    # A tibble: 3 x 5
      Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
      <fct>             <dbl>       <dbl>        <dbl>       <dbl>
    1 setosa             5.01        3.43         1.46       0.246
    2 versicolor         5.94        2.77         4.26       1.33
    3 virginica          6.59        2.97         5.55       2.03

Consistent behaviour across variants
    iris %>%
      group_by_if(is.factor) %>%
      summarise_at(vars(starts_with("Sepal")), mean)
    # A tibble: 3 x 3
      Species    Sepal.Length Sepal.Width
      <fct>             <dbl>       <dbl>
    1 setosa             5.01        3.43
    2 versicolor         5.94        2.77
    3 virginica          6.59        2.97

Columnwise mapping
• Scoped variants can be incredibly useful
• Reuse skills from purrr and apply functions
• No tidy eval needed for looping
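Because the _at variants take ordinary objects (positions or names) rather than blueprints, wrapping them in a function needs no quoting at all. The helper below, `scale_columns()`, is a hypothetical name chosen for this sketch. Recent dplyr releases document the scoped variants as superseded in favour of across(), although they remain available.

```r
library(dplyr)

# Hypothetical helper: `cols` is a plain character vector of column names.
scale_columns <- function(data, cols) {
  data %>% mutate_at(cols, ~ . / sd(.))
}

scaled <- iris %>% scale_columns(c("Sepal.Length", "Sepal.Width"))

# The selected columns now have unit standard deviation;
# the other columns are untouched.
sd(scaled$Sepal.Length)
identical(scaled$Petal.Length, iris$Petal.Length)
```

The column names could equally come from a configuration file or a loop, since they are just strings.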

Tidy eval, the easy parts

Tidy eval, the easy parts
1. Pass the dots
2. Subset .data
3. Interpolate

Passing the dots
Easiest way to create a tidy eval function!
    starwars %>%
      group_by(gender) %>%
      summarise(n = n())

Passing the dots
Easiest way to create a tidy eval function!
    my_count_by <- function(data, ...) {
      data %>%
        group_by(...) %>%
        summarise(n = n())
    }

Passing the dots
    my_count_by <- function(data, ...) {
      data %>%
        group_by(...) %>%
        summarise(n = n())
    }
    starwars %>% my_count_by(gender)
    # A tibble: 5 x 2
      gender            n
      <chr>         <int>
    1 NA                3
    2 female           19
    3 hermaphrodite     1
    4 male             62
    5 none              2

Passing the dots
Recipient of dots takes care of everything!
• No need to quote / delay blueprints
• Properties of the function are inherited

Two flavours of tidy evaluation

Two flavours
    starwars %>% mutate(birth_year - 100)
    starwars %>% group_by(birth_year)
    starwars %>% select(birth_year)
    starwars %>% filter(birth_year < 50)
One of these things is not like the other things!
• Action: mutate(), group_by(), filter()
• Selection: select()

Most verbs take actions
1. New vectors are created
2. The data frame is modified
    starwars %>% mutate(birth_year - 100)
    tmp <- starwars$birth_year - 100
    starwars$`birth_year - 100` <- tmp

Some verbs take selections
1. The position of columns is looked up
2. The data frame is reorganised
    starwars %>% select(birth_year)
    tmp <- match("birth_year", colnames(starwars))
    starwars[, tmp]

Selections have special properties
1. c(), `-` and `:` understand positions and names
    starwars %>% select(c(1, height))
    starwars %>% select(1:height)
    starwars %>% select(-1, -height)
2. Selection helpers know about current variables
    starwars %>% select(ends_with("color"))
    starwars %>% select(matches("^[nm]a"))
    starwars %>% select(10, everything())

Sometimes they appear to work the same way...
    starwars %>% select(height)
    # A tibble: 87 x 1
      height
       <int>
    1    172
    2    167
    3     96
    # … with 84 more rows
    starwars %>% transmute(height)
    # A tibble: 87 x 1
      height
       <int>
    1    172
    2    167
    3     96
    # … with 84 more rows
...but a position is a selection in select() and a constant in transmute():
    starwars %>% select(1)
    # A tibble: 87 x 1
      name
      <chr>
    1 Luke Skywalker
    2 C-3PO
    3 R2-D2
    # … with 84 more rows
    starwars %>% transmute(1)
    # A tibble: 87 x 1
        `1`
      <dbl>
    1     1
    2     1
    3     1
    # … with 84 more rows

What about group_by()?
    starwars %>% group_by(gender)
    # A tibble: 87 x 13
    # Groups:   gender [5]
      name  height  mass hair_color skin_color eye_color
      <chr>  <int> <dbl> <chr>      <chr>      <chr>
    1 Luke…    172    77 blond      fair       blue
    2 C-3PO    167    75 NA         gold       yellow
    3 R2-D2     96    32 NA         white, bl… red
    # … with 84 more rows, and 7 more variables

What about group_by()? It takes actions!
    starwars %>% group_by(ends_with("color"))
    Error: No tidyselect variables were registered

What about group_by()? It takes actions!
    starwars %>%
      group_by(height > 170) %>%
      summarise(n())
    # A tibble: 3 x 2
      `height > 170` `n()`
      <lgl>          <int>
    1 FALSE             27
    2 TRUE              54
    3 NA                 6

Tip: Use the _at dplyr variants to pass selections!
    starwars %>% group_by_at(vars(ends_with("color")))


Passing the dots
    my_count_by <- function(data, ...) {
      data %>%
        group_by(...) %>%
        summarise(n = n())
    }
    starwars %>% my_count_by(GENDER = toupper(gender))
    # A tibble: 5 x 2
      GENDER            n
      <chr>         <int>
    1 NA                3
    2 FEMALE           19
    3 HERMAPHRODITE     1
    4 MALE             62
    5 NONE              2

Passing the dots
    my_count_by <- function(data, ...) {
      data %>%
        group_by(...) %>%
        summarise(n = n())
    }
    starwars %>% my_count_by(ends_with("_color"))
    Error: No tidyselect variables were registered

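Passing the dots to group_by() forwards actions:
    my_count_by <- function(data, ...) {
      data %>%
        group_by(...) %>%
        summarise(n = n())
    }
Passing them to group_by_at(vars(...)) forwards selections instead:
    my_count_by <- function(data, ...) {
      data %>%
        group_by_at(vars(...)) %>%
        summarise(n = n())
    }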

Passing the dots
    my_count_by <- function(data, ...) {
      data %>%
        group_by_at(vars(...)) %>%
        summarise(n = n())
    }
    starwars %>% my_count_by(ends_with("_color"), -hair_color)
    # A tibble: 53 x 3
    # Groups:   skin_color [31]
      skin_color eye_color     n
      <chr>      <chr>     <int>
    1 blue       blue          1
    2 blue       hazel         1
    3 blue, grey yellow        2
    4 brown      blue          1
    # … with 49 more rows

Passing the dots
• Dots can be passed to aes() or vars() in ggplot2!
• They both take actions

Facetting with formulas versus vars()
    plot + facet_wrap(~ gender + hair_color)
    plot + facet_wrap(vars(gender, hair_color))
• Facets historically take formulas, but vars() has more features
• You can pass dots to vars()
• vars() accepts names for facet titles

Passing the dots
    my_wrap <- function(...) {
      facet_wrap(vars(...), labeller = label_both)
    }
• labeller controls label titles
• Here, both variable name and facet category

    plot <- ggplot(mtcars, aes(disp, drat)) + geom_point()
    plot + my_wrap(
      cut_number(wt, 3),
      cyl
    )
Actions!
[Figure: scatter plot of drat against disp in six facets, each titled with both variables, e.g. "cut_number(wt, 3): (2.81,3.5]" and "cyl: 6"]

    plot <- ggplot(mtcars, aes(disp, drat)) + geom_point()
    plot + my_wrap(
      Weight = cut_number(wt, 3),
      Cylinder = cyl
    )
Named actions!
[Figure: the same six-facet scatter plot of drat against disp, now titled with the supplied names, e.g. "Weight: (2.81,3.5]" and "Cylinder: 6"]

Passing the dots
We just wrapped a single component ⟶ Can be composed in pipelines with +
Two more examples
• Entire ggplot2 pipeline
• Multiple ggplot2 components

Passing the dots
Entire pipeline
    scatter_wrap <- function(data, mapping = aes(), ...) {
      ggplot(data, mapping) +
        geom_point() +
        facet_wrap(vars(...), labeller = label_both)
    }

Passing the dots
    mtcars %>%
      scatter_wrap(
        aes(disp, drat),
        Cylinder = cyl
      )
[Figure: scatter plot of drat against disp in three facets titled "Cylinder: 4", "Cylinder: 6" and "Cylinder: 8"]

Passing the dots
Multiple components
    scatter_wrap <- function(...) {
      geom_point() +
        facet_wrap(vars(...), labeller = label_both)
    }
An addition pipeline must start with ggplot()

Passing the dots
Multiple components
    scatter_wrap <- function(...) {
      list(
        geom_point(),
        facet_wrap(vars(...), labeller = label_both)
      )
    }

Passing the dots
    ggplot(mtcars, aes(disp, drat)) +
      scatter_wrap(Cylinder = cyl)
[Figure: scatter plot of drat against disp in three facets titled "Cylinder: 4", "Cylinder: 6" and "Cylinder: 8"]

Passing the dots
• Easy way of creating data masking functions
• Useful when pipeline has only one variable part
• Do you need actions or selections?
• See my RStudio::conf 2019 talk for more ideas

Subsetting .data
What if you have multiple pipeline inputs?
    starwars %>%
      group_by(gender) %>%
      summarise(avg = mean(mass, na.rm = TRUE))

Subsetting .data
.data is a pronoun that represents the data
    starwars %>%
      group_by(.data$gender) %>%
      summarise(avg = mean(.data$mass, na.rm = TRUE))

Subsetting .data
Just pass column names to .data[[
    my_average <- function(data, grp_var, avg_var) {
      data %>%
        group_by(.data[[grp_var]]) %>%
        summarise(avg = mean(.data[[avg_var]], na.rm = TRUE))
    }

Subsetting .data
    starwars %>% my_average("gender", "mass")
    # A tibble: 5 x 2
      gender           avg
      <chr>          <dbl>
    1 NA              46.3
    2 female          54.0
    3 hermaphrodite 1358
    4 male            81.0
    5 none           140
Take strings! No data masking

Subsetting .data
    starwars %>% my_average(gender, mass)
    Error: object 'gender' not found
Take strings! No data masking
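Since .data[[ takes strings, it also offers one way to repair the kind of loop over column names shown earlier in the deck. The sketch below uses a small made-up tibble with numeric columns, so the averages are meaningful.

```r
library(dplyr)

# Illustrative data, not from the slides.
df <- tibble(height = c(150, 180, NA), mass = c(50, 80, 70))

columns <- c("height", "mass")
out <- vector("list", length(columns))

for (i in seq_along(columns)) {
  # .data[[ ]] looks the string up among the columns of the current data,
  # instead of passing the string itself to mean().
  out[[i]] <- df %>%
    summarise(avg = mean(.data[[columns[[i]]]], na.rm = TRUE))
}

out[[1]]$avg
out[[2]]$avg
```

Each iteration yields a one-row tibble holding the mean of the named column.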

Interpolation
• Now getting into the meat of tidy eval
• Interpolation is a simple pattern
• Delay a blueprint by quoting with enquo()
• Insert it back in another blueprint by unquoting with !!
• Forwards a blueprint across functions

Interpolation
Simple tidy eval pattern
• Delay a blueprint with enquo()
• Insert it back with !!
The .data version, which takes strings:
    my_average <- function(data, grp_var, avg_var) {
      data %>%
        group_by(.data[[grp_var]]) %>%
        summarise(avg = mean(.data[[avg_var]], na.rm = TRUE))
    }
The interpolating version, which takes blueprints:
    my_average <- function(data, grp_var, avg_var) {
      data %>%
        group_by(!!enquo(grp_var)) %>%
        summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE))
    }

Interpolation
    starwars %>% my_average(gender, height)
    # A tibble: 5 x 2
      gender          avg
      <chr>         <dbl>
    1 NA             120
    2 female         165.
    3 hermaphrodite  175
    4 male           179.
    5 none           200

Interpolation
    starwars %>% my_average(gender, height / 100)
    # A tibble: 5 x 2
      gender          avg
      <chr>         <dbl>
    1 NA             1.2
    2 female         1.65
    3 hermaphrodite  1.75
    4 male           1.79
    5 none           2
• Full data masking
• Create vectors on the fly

Interpolation
Planned syntax: interpolation with {{ arg }}
Inspiration from the glue package
    thing <- "FOOBAR"
    glue::glue("Let's interpolate this { thing } right here")
    [1] "Let's interpolate this FOOBAR right here"

Interpolation
Planned syntax: interpolation with {{ arg }}
With enquo() and !!:
    my_average <- function(data, grp_var, avg_var) {
      data %>%
        group_by(!!enquo(grp_var)) %>%
        summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE))
    }
With {{ }}:
    my_average <- function(data, grp_var, avg_var) {
      data %>%
        group_by({{ grp_var }}) %>%
        summarise(avg = mean({{ avg_var }}, na.rm = TRUE))
    }
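The {{ }} operator described here as planned was subsequently released in rlang 0.4.0. Assuming that version or later, with a dplyr that builds on it, the interpolating my_average() can be tried on a small made-up tibble:

```r
library(dplyr)  # assumes rlang >= 0.4.0 is installed for `{{ }}`

# Illustrative data, not from the slides.
df <- tibble(
  g      = c("a", "a", "b"),
  height = c(150, 170, 200)
)

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by({{ grp_var }}) %>%
    summarise(avg = mean({{ avg_var }}, na.rm = TRUE))
}

# Both arguments are data-masked: bare column names and expressions work.
res <- df %>% my_average(g, height / 100)
res
```

Group "a" averages 1.5 and 1.7, and group "b" has the single value 2, so `res` has one row per group with those means.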

Interpolation
• Simple pattern but quickly gets more complicated
• What to unquote
  • Delayed blueprints with enquo()
  • Custom blueprint material: symbols, function calls, ...
• Unquoting variants such as !!!
• Simple interpolation should cover many cases
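As one example of an unquoting variant, !!! splices a list of blueprints into a call as separate arguments. The sketch below combines it with syms() to group by any number of columns given as strings. The function name `count_by_names()` and the data are assumptions made for illustration.

```r
library(dplyr)
library(rlang)

# Illustrative data, not from the slides.
df <- tibble(
  gender  = c("male", "male", "female", "male"),
  species = c("Human", "Droid", "Human", "Human")
)

# Hypothetical helper: `cols` is a character vector of column names.
count_by_names <- function(data, cols) {
  data %>%
    group_by(!!!syms(cols)) %>%  # syms() makes a list of symbols; !!! splices them
    summarise(n = n())
}

res <- count_by_names(df, c("gender", "species"))
res
```

Here group_by() receives `gender` and `species` as two separate arguments, exactly as if they had been typed by hand.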

• Data masking is a unique R feature
  • Great for data analysis
  • Harder to program with
• You might not need tidy eval
  • Fixed column names
  • Map functions on columns
• Easy tidy eval techniques
  • Pass the dots
  • Subset .data
  • Quote and unquote (soon interpolate)

Programming in the tidyverse

More Decks by Lionel Henry
