Programming in the tidyverse

Slide 1

Slide 1 text

Programming in the Tidyverse

Slide 2

Slide 2 text

Data analysis programming (tidyverse)
• Structure specific tasks: data manipulation, data cleaning, plotting, ...
• Interactivity and iteration
• Reproducibility by few users

Production programming (r-lib)
• Structure repeated computations: functional programming, generic programming, metaprogramming, ...
• Flexibility and robustness
• Reusability by many users

Slide 3

Slide 3 text

Programming in the tidyverse
• Tidyverse optimised for interactive analyses
• Moving towards code reusability
• How to program with the tidyverse
• Demystifying tidy evaluation

Slide 4

Slide 4 text

What is the tidyverse?

Slide 5

Slide 5 text

What is the tidyverse
• It is a set of packages (dplyr, tidyr, purrr, ggplot2, ...)
• Website: https://tidyverse.org
• Meta-package: library(tidyverse)

Slide 6

Slide 6 text

What is the tidyverse
• It is a set of principles
• Human centered: computers + people
• Consistent: reuse a small set of ideas
• Composable: solve larger problems
• Inclusive: diverse community
https://principles.tidyverse.org

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Human centered ⟶ Tidyverse packages
• Domain oriented
• Language-like interface
• Data is the important scope

Slide 9

Slide 9 text

• Domain oriented
• Language-like interface
• Data is the important scope

Set of verbs for data manipulation
• select()
• filter()
• arrange()
• mutate()
• group_by()
• summarise()

Slide 10

Slide 10 text

flights
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
# … with 336,772 more rows, and 12 more variables: sched_arr_time,
#   arr_delay, carrier, flight, tailnum, origin, dest, air_time,
#   distance, hour, …

Slide 11

Slide 11 text

flights %>% filter(month == 10, day == 5)
# A tibble: 687 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
1  2013    10     5      453            500        -7      624
2  2013    10     5      525            515        10      747
3  2013    10     5      541            545        -4      827
4  2013    10     5      542            545        -3      813
# … with 683 more rows, and 12 more variables: sched_arr_time,
#   arr_delay, carrier, flight, tailnum, origin, dest, air_time,
#   distance, hour, …

Slide 12

Slide 12 text

flights %>% arrange(desc(month), desc(day))
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
1  2013    12    31       13           2359        14      439
2  2013    12    31       18           2359        19      449
3  2013    12    31       26           2245       101      129
4  2013    12    31      459            500        -1      655
# … with 336,772 more rows, and 12 more variables: sched_arr_time,
#   arr_delay, carrier, flight, tailnum, origin, dest, air_time,
#   distance, hour, minute, time_hour

Slide 13

Slide 13 text

flights %>% select(year, month, day)
# A tibble: 336,776 x 3
   year month   day
1  2013     1     1
2  2013     1     1
3  2013     1     1
4  2013     1     1
# … with 336,772 more rows

Slide 14

Slide 14 text

flights %>% select(year:day)
# A tibble: 336,776 x 3
   year month   day
1  2013     1     1
2  2013     1     1
3  2013     1     1
4  2013     1     1
# … with 336,772 more rows

Equivalent to select(1:3)

Slide 15

Slide 15 text

flights %>% select(ends_with("_time"))
# A tibble: 336,776 x 5
  dep_time sched_dep_time arr_time sched_arr_time air_time
1      517            515      830            819      227
2      533            529      850            830      227
3      542            540      923            850      160
4      544            545     1004           1022      183
# … with 336,772 more rows

Slide 16

Slide 16 text

flights %>%
  mutate(
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  )
# A tibble: 336,776 x 21
   year month   day dep_time sched_dep_time dep_delay arr_time
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
# … with 336,772 more rows, and 14 more variables: sched_arr_time,
#   arr_delay, carrier, flight, tailnum, origin, dest, air_time,
#   distance, hour, …

Slide 17

Slide 17 text

flights %>%
  transmute(
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  )
# A tibble: 336,776 x 2
   gain gain_per_hour
1     9          2.38
2    16          4.23
3    31         11.6
4   -17         -5.57
# … with 336,772 more rows

• mutate() adds columns
• transmute() creates a new tibble

Slide 18

Slide 18 text

flights %>%
  group_by(month) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE))
# A tibble: 12 x 2
   month avg_delay
 1     1     6.13
 2     2     5.61
 3     3     5.81
 4     4    11.2
 5     5     3.52
 6     6    16.5
 7     7    16.7
 8     8     6.04
 9     9    -4.02
10    10    -0.167
11    11     0.461
12    12    14.9

• group_by() only affects future computations
• summarise() makes one summary per level

Slide 19

Slide 19 text

starwars[starwars$height < 200 & starwars$gender == "male", ]

starwars %>%
  filter(
    height < 200,
    gender == "male"
  )

• Domain oriented
• Language-like interface
• Data is the important scope
Data masking

Slide 20

Slide 20 text

Data masking

data %>%
  fill(year) %>%
  spread(key, count)

starwars %>%
  ggplot(aes(height, mass)) +
  geom_point() +
  facet_wrap(vars(hair_color))

starwars %>%
  filter(
    height < 200,
    gender == "male"
  )

Slide 21

Slide 21 text

Data masking: in base R too!

starwars %>%
  base::subset(height < 150, name:mass) %>%
  base::transform(height = height / 100)

starwars %>%
  stats::lm(formula = mass ~ height)

• Inspiration for dplyr
• By R core member Peter Dalgaard

Slide 22

Slide 22 text

Data masking
• Unique feature of R
• Great for data analysis
• Focus on the task, not the subsetting
• Programming is a bit more involved ⟶ Tidy eval framework

Slide 23

Slide 23 text

Tidy eval
• Powers data masking, from the rlang package
• Flexible and robust programming
• Strange syntax: !! and !!!, enquo(), etc.
• Requires learning new concepts

Slide 24

Slide 24 text

Tidy eval

Slide 25

Slide 25 text

Tidy eval

Slide 26

Slide 26 text

Demystifying tidy eval
1. Why tidy evaluation?
2. Do you actually need it?
3. Can it be used by laypeople?

Slide 27

Slide 27 text

Why Tidy Eval?

Slide 28

Slide 28 text

Why Tidy Eval
Change the context of computation

starwars[starwars$height < 200 & starwars$gender == "male", ]

starwars %>%
  filter(
    height < 200,
    gender == "male"
  )

Slide 29

Slide 29 text

Why Tidy Eval
Change the context of computation

starwars %>%
  filter(
    height < 200,
    gender == "male"
  )

SELECT *
FROM `starwars`
WHERE ((`height` < 200.0) AND (`gender` = 'male'))

Slide 30

Slide 30 text

Why Tidy Eval ⟶ Need to delay computations

list(
  height < 200,
  gender == "male"
)
Error: object 'height' not found

starwars %>%
  filter(
    height < 200,
    gender == "male"
  )

Slide 31

Slide 31 text

Why Tidy Eval
How it works
• Delay computations by quoting
• Change the context and resume computation

starwars %>%
  filter(
    height < 200,
    gender == "male"
  )
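The two steps above (quote, then resume in a different context) can be sketched directly with rlang. This is a minimal illustration, not how filter() is implemented internally; the `people` data frame and its values are made up for the example.

```r
library(rlang)

# Hypothetical stand-in for starwars (made-up values, illustration only)
people <- data.frame(
  name = c("A", "B", "C"),
  height = c(172, 202, 96)
)

# Step 1: quote. The expression is captured, not computed.
blueprint <- expr(height < 200)

# Step 2: change the context and resume. With `people` supplied as the
# data mask, `height` is looked up among its columns.
keep <- eval_tidy(blueprint, data = people)

people[keep, ]
```

The same blueprint could be evaluated against any other data frame that has a `height` column, which is what makes delayed computations reusable across contexts.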

Slide 32

Slide 32 text

Quoted code is like a blueprint

vars(
  height < 200,
  gender == "male"
)
[[1]]
expr: ^height < 200
env:  global

[[2]]
expr: ^gender == "male"
env:  global

• vars() is a fundamental quoting function
• Returns blueprints of delayed computations

Slide 33

Slide 33 text

Quoted code is like a blueprint
Flip side: harder to reuse with different inputs
• Loops
• Functions

Slide 34

Slide 34 text

columns <- c("hair_color", "skin_color")
out <- rep(list(NULL), 2)

for (i in seq_along(columns)) {
  out[[i]] <- starwars %>%
    summarise(avg = mean(columns[[i]], na.rm = TRUE))
}

Slide 35

Slide 35 text

Slide 36

Slide 36 text

columns <- c("hair_color", "skin_color")
out <- rep(list(NULL), 2)

for (i in seq_along(columns)) {
  out[[i]] <- starwars %>%
    summarise(avg = mean(columns[[i]], na.rm = TRUE))
}

out[[1]]
# A tibble: 1 x 1
    avg
1    NA

mean("hair_color", na.rm = TRUE)
[1] NA
Warning message:
argument is not numeric or logical: returning NA

Slide 37

Slide 37 text

average <- function(data, x) {
  data %>% summarise(avg = mean(x, na.rm = TRUE))
}

Slide 38

Slide 38 text

average <- function(data, x) {
  data %>% summarise(avg = mean(x, na.rm = TRUE))
}

average(starwars, "hair_color")
# A tibble: 1 x 1
    avg
1    NA
Warning message:
argument is not numeric or logical: returning NA

Slide 39

Slide 39 text

average <- function(data, x) {
  data %>% summarise(avg = mean(x, na.rm = TRUE))
}

average(starwars, hair_color)
Error: object 'hair_color' not found

• Data masking is not transitive
• x is masked instead of hair_color

Slide 40

Slide 40 text

Quoted code is like a blueprint
Programming requires modifying the blueprint
• !! and !!! are surgery operators for blueprints
• Need blueprint material: sym(), enquo(), ...
⟶ Metaprogramming skills
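As a sketch of what "modifying the blueprint" looks like in practice, the failing loop from earlier can be repaired by turning each column name into a symbol with sym() and unquoting it with !!. The `people` tibble and the choice of numeric columns are assumptions made so that the example is self-contained.

```r
library(dplyr)
library(rlang)

# Hypothetical data (made-up values, illustration only)
people <- tibble(
  height = c(172, 202, 96),
  mass = c(77, 136, NA)
)

columns <- c("height", "mass")
out <- vector("list", length(columns))

for (i in seq_along(columns)) {
  # sym() creates blueprint material from a string ...
  col <- sym(columns[[i]])

  # ... and !! inserts it into the blueprint captured by summarise()
  out[[i]] <- people %>%
    summarise(avg = mean(!!col, na.rm = TRUE))
}

out
```

Here summarise() sees `mean(height, na.rm = TRUE)` and then `mean(mass, na.rm = TRUE)`, rather than the mean of a character string.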

Slide 41

Slide 41 text

But before we get there...
Do you need tidy eval?
• Fixed column names
• Columnwise mapping

Slide 42

Slide 42 text

Fixed column names
• No need for tidy eval when column names are fixed!
• Trivial to implement

Slide 43

Slide 43 text

1. Repeated code

data %>% transmute(bmi = mass / height^2)

compute_bmi <- function(data) {
  data %>% transmute(bmi = mass / height^2)
}

compute_bmi <- function(data) {
  if (!all(c("mass", "height") %in% names(data))) {
    stop("`data` must contain `mass` and `height` columns")
  }

  data %>% transmute(bmi = mass / height^2)
}

Slide 44

Slide 44 text

Slide 45

Slide 45 text

3. Check inputs

data %>% transmute(bmi = mass / height^2)

compute_bmi <- function(data) {
  data %>% transmute(bmi = mass / height^2)
}

compute_bmi <- function(data) {
  if (!all(c("mass", "height") %in% names(data))) {
    stop("`data` must contain `mass` and `height` columns")
  }

  data %>% transmute(bmi = mass / height^2)
}

Slide 46

Slide 46 text

4. Check inputs!

compute_bmi <- function(data) {
  if (!all(c("mass", "height") %in% names(data))) {
    stop("`data` must contain `mass` and `height` columns")
  }

  mean_height <- round(mean(data$height, na.rm = TRUE), 1)
  if (mean_height > 3) {
    warning(glue::glue(
      "Average height is { mean_height }, is it scaled in meters?"
    ))
  }

  data %>% transmute(bmi = mass / height^2)
}

Slide 47

Slide 47 text

iris %>% compute_bmi()
Error: `data` must contain `mass` and `height` columns

Slide 48

Slide 48 text

starwars %>% compute_bmi()
# A tibble: 87 x 1
      bmi
1 0.00260
2 0.00269
3 0.00347
4 0.00333
# … with 83 more rows
Warning message:
Average height is 174.4, is it scaled in meters?

Slide 49

Slide 49 text

starwars %>%
  mutate(height = height / 100) %>%
  compute_bmi()
# A tibble: 87 x 1
    bmi
1  26.0
2  26.9
3  34.7
4  33.3
# … with 83 more rows

Slide 50

Slide 50 text

Fixed column names
• For specific tasks where fixed names make sense
• Callers must ensure existence of these columns
• Input checking is important
• Domain logic may have greater payoff

Slide 51

Slide 51 text

Columnwise mapping
• Repeat operations across columns by mapping
• Important notion in R (apply family)
• The purrr package is all about mapping!

Slide 52

Slide 52 text

map() loops a function automatically

map(data, n_distinct)
$group
[1] 3

$value
[1] 5

n_distinct(data$group)
[1] 3

n_distinct(data$value)
[1] 5
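The slide does not show how `data` is defined. A self-contained sketch, assuming a small data frame with made-up `group` and `value` columns that have 3 and 5 distinct values respectively, could look like this:

```r
library(purrr)
library(dplyr)  # provides n_distinct()

# Hypothetical data with 3 distinct groups and 5 distinct values
data <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 5)
)

# map() applies n_distinct() to each column and returns a list
counts <- map(data, n_distinct)

# map_int() does the same but simplifies to a named integer vector
counts_int <- map_int(data, n_distinct)
```

Because a data frame is a list of columns, map() needs no column names at all, which is why no tidy eval is involved.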

Slide 53

Slide 53 text

Columnwise mapping
Scoped variants for dplyr verbs
• Map functions over a selection of columns
• _all suffix ⟶ Map over all columns
• _if suffix ⟶ Map over columns selected by a predicate
• _at suffix ⟶ Map over a custom selection
• Full dplyr features, including groups support

Slide 54

Slide 54 text

Columnwise mapping
No data masking
• Take objects, not blueprints
• Easy to program with

Slide 55

Slide 55 text

Columnwise mapping
mutate_all()      mutate_at()      mutate_if()
summarise_all()   summarise_at()   summarise_if()
group_by_all()    group_by_at()    group_by_if()
filter_all()      filter_at()      filter_if()

Slide 56

Slide 56 text

Mapping a function over all columns

mtcars %>% mutate_all(function(x) x / sd(x))

mtcars %>% mutate_all(~ . / sd(.))
# A tibble: 32 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
1  3.48  3.36 1.29   1.60  7.29  2.68  9.21  0     2.00  5.42 2.48
2  3.48  3.36 1.29   1.60  7.29  2.94  9.52  0     2.00  5.42 2.48
3  3.78  2.24 0.871  1.36  7.20  2.37 10.4   1.98  2.00  5.42 0.619
4  3.55  3.36 2.08   1.60  5.76  3.29 10.9   1.98  0     4.07 0.619
# … with 28 more rows

Supports purrr formulas for anonymous functions

Slide 57

Slide 57 text

Mapping a function over predicate selection

iris %>% mutate_if(is.numeric, ~ . / sd(.))
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1         6.16        8.03        0.793       0.262 setosa
2         5.92        6.88        0.793       0.262 setosa
3         5.68        7.34        0.736       0.262 setosa
4         5.56        7.11        0.850       0.262 setosa
# … with 146 more rows

This function (is.numeric) determines which columns are changed

Slide 58

Slide 58 text

Mapping a function over custom selection

iris %>% mutate_at(1:4, ~ . / sd(.))
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1         6.16        8.03        0.793       0.262 setosa
2         5.92        6.88        0.793       0.262 setosa
3         5.68        7.34        0.736       0.262 setosa
4         5.56        7.11        0.850       0.262 setosa
# … with 146 more rows

Numeric vector of column positions

Slide 59

Slide 59 text

Mapping a function over custom selection

nms <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
iris %>% mutate_at(nms, ~ . / sd(.))
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1         6.16        8.03        0.793       0.262 setosa
2         5.92        6.88        0.793       0.262 setosa
3         5.68        7.34        0.736       0.262 setosa
4         5.56        7.11        0.850       0.262 setosa
# … with 146 more rows

Character vector of column names

Slide 60

Slide 60 text

Mapping a function over custom selection

iris %>% mutate_at(vars(Sepal.Length, Sepal.Width), ~ . / sd(.))

iris %>% mutate_at(vars(starts_with("Sepal")), ~ . / sd(.))
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1         6.16        8.03          1.4         0.2 setosa
2         5.92        6.88          1.4         0.2 setosa
3         5.68        7.34          1.3         0.2 setosa
4         5.56        7.11          1.5         0.2 setosa
# … with 146 more rows

Pass selection helpers with vars()

Slide 61

Slide 61 text

Consistent behaviour across variants

iris %>% summarise_if(is.numeric, mean)
# A tibble: 1 x 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1         5.84        3.06         3.76        1.20

Slide 62

Slide 62 text

Consistent behaviour across variants

iris %>%
  group_by(Species) %>%
  summarise_all(mean)
# A tibble: 3 x 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33
3 virginica          6.59        2.97         5.55       2.03

Slide 63

Slide 63 text

Consistent behaviour across variants

iris %>%
  group_by_if(is.factor) %>%
  summarise_at(vars(starts_with("Sepal")), mean)
# A tibble: 3 x 3
  Species    Sepal.Length Sepal.Width
1 setosa             5.01        3.43
2 versicolor         5.94        2.77
3 virginica          6.59        2.97

Slide 64

Slide 64 text

Columnwise mapping
• Scoped variants can be incredibly useful
• Reuse skills from purrr and apply functions
• No tidy eval needed for looping

Slide 65

Slide 65 text

Tidy eval, the easy parts

Slide 66

Slide 66 text

Tidy eval, the easy parts
1. Pass the dots
2. Subset .data
3. Interpolate

Slide 67

Slide 67 text

Tidy eval, the easy parts
1. Pass the dots
2. Subset .data
3. Interpolate

Slide 68

Slide 68 text

Passing the dots
Easiest way to create a tidy eval function!

starwars %>%
  group_by(gender) %>%
  summarise(n = n())

Slide 69

Slide 69 text

Passing the dots
Easiest way to create a tidy eval function!

my_count_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(n = n())
}

Slide 70

Slide 70 text

Passing the dots

my_count_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(n = n())
}

starwars %>% my_count_by(gender)
# A tibble: 5 x 2
  gender            n
1 NA                3
2 female           19
3 hermaphrodite     1
4 male             62
5 none              2

Slide 71

Slide 71 text

Passing the dots
Recipient of dots takes care of everything!
• No need to quote / delay blueprints
• Properties of the function are inherited

Slide 72

Slide 72 text

Two flavours of tidy evaluation

Slide 73

Slide 73 text

Two flavours

starwars %>% mutate(birth_year - 100)
starwars %>% group_by(birth_year)
starwars %>% select(birth_year)
starwars %>% filter(birth_year < 50)

One of these things is not like the other things!

Slide 74

Slide 74 text

Slide 75

Slide 75 text

Most verbs take actions
1. New vectors are created
2. The data frame is modified

starwars %>% mutate(birth_year - 100)

tmp <- starwars$birth_year - 100
starwars$`birth_year - 100` <- tmp

Slide 76

Slide 76 text

Some verbs take selections
1. The position of columns is looked up
2. The data frame is reorganised

starwars %>% select(birth_year)

tmp <- match("birth_year", colnames(starwars))
starwars[, tmp]

Slide 77

Slide 77 text

Selections have special properties
1. c(), `-` and `:` understand positions and names
2. Selection helpers know about current variables

starwars %>% select(c(1, height))
starwars %>% select(1:height)
starwars %>% select(-1, -height)

Slide 78

Slide 78 text

Selections have special properties
1. c(), `-` and `:` understand positions and names
2. Selection helpers know about current variables

starwars %>% select(ends_with("color"))
starwars %>% select(matches("^[nm]a"))
starwars %>% select(10, everything())

Slide 79

Slide 79 text

Sometimes they appear to work the same way...

starwars %>% select(height)
# A tibble: 87 x 1
  height
1    172
2    167
3     96
# … with 84 more rows

starwars %>% transmute(height)
# A tibble: 87 x 1
  height
1    172
2    167
3     96
# … with 84 more rows

Slide 80

Slide 80 text

Sometimes they appear to work the same way...

starwars %>% select(1)
# A tibble: 87 x 1
  name
1 Luke Skywalker
2 C-3PO
3 R2-D2
# … with 84 more rows

starwars %>% transmute(1)
# A tibble: 87 x 1
    `1`
1     1
2     1
3     1
# … with 84 more rows

Slide 81

Slide 81 text

What about group_by()?

starwars %>% group_by(gender)
# A tibble: 87 x 13
# Groups:   gender [5]
  name  height  mass hair_color skin_color eye_color
1 Luke…    172    77 blond      fair       blue
2 C-3PO    167    75 NA         gold       yellow
3 R2-D2     96    32 NA         white, bl… red
# … with 84 more rows, and 7 more variables

Slide 82

Slide 82 text

What about group_by()? It takes actions!

starwars %>% group_by(ends_with("color"))
Error: No tidyselect variables were registered

Slide 83

Slide 83 text

What about group_by()? It takes actions!

starwars %>%
  group_by(height > 170) %>%
  summarise(n())
# A tibble: 3 x 2
  `height > 170` `n()`
1 FALSE             27
2 TRUE              54
3 NA                 6

Slide 84

Slide 84 text

Tip: Use the _at dplyr variants to pass selections!

starwars %>% group_by_at(vars(ends_with("color")))

Slide 85

Slide 85 text

Slide 86

Slide 86 text

Passing the dots

my_count_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(n = n())
}

starwars %>% my_count_by(GENDER = toupper(gender))
# A tibble: 5 x 2
  GENDER            n
1 NA                3
2 FEMALE           19
3 HERMAPHRODITE     1
4 MALE             62
5 NONE              2

Slide 87

Slide 87 text

Passing the dots

my_count_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(n = n())
}

starwars %>% my_count_by(ends_with("_color"))
Error: No tidyselect variables were registered

Slide 88

Slide 88 text

my_count_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(n = n())
}

Slide 89

Slide 89 text

my_count_by <- function(data, ...) {
  data %>%
    group_by_at(vars(...)) %>%
    summarise(n = n())
}

Slide 90

Slide 90 text

Passing the dots

my_count_by <- function(data, ...) {
  data %>%
    group_by_at(vars(...)) %>%
    summarise(n = n())
}

starwars %>% my_count_by(ends_with("_color"), -hair_color)
# A tibble: 53 x 3
# Groups:   skin_color [31]
  skin_color eye_color     n
1 blue       blue          1
2 blue       hazel         1
3 blue, grey yellow        2
4 brown      blue          1
# … with 49 more rows

Slide 91

Slide 91 text

Passing the dots
• Dots can be passed to aes() or vars() in ggplot2!
• They both take actions

Slide 92

Slide 92 text

Faceting with formulas versus vars()

plot + facet_wrap(~ gender + hair_color)
plot + facet_wrap(vars(gender, hair_color))

• Facets historically take formulas, but vars() has more features
• You can pass dots to vars()
• vars() accepts names for facet titles

Slide 93

Slide 93 text

Passing the dots

my_wrap <- function(...) {
  facet_wrap(vars(...), labeller = label_both)
}

• labeller controls label titles
• Here, both variable name and facet category

Slide 94

Slide 94 text

plot <- ggplot(mtcars, aes(disp, drat)) + geom_point()

plot + my_wrap(
  cut_number(wt, 3),
  cyl
)

Actions!

[Plot: drat against disp, faceted by cut_number(wt, 3) and cyl. Each facet label shows the variable name and its value, e.g. "cut_number(wt, 3): (2.81,3.5]" and "cyl: 6".]

Slide 95

Slide 95 text

plot <- ggplot(mtcars, aes(disp, drat)) + geom_point()

plot + my_wrap(
  Weight = cut_number(wt, 3),
  Cylinder = cyl
)

Named actions!

[Plot: drat against disp, faceted by Weight and Cylinder. Each facet label shows the supplied name and its value, e.g. "Weight: (2.81,3.5]" and "Cylinder: 6".]

Slide 96

Slide 96 text

Passing the dots
We just wrapped a single component ⟶ Can be composed in pipelines with +

Two more examples
• Entire ggplot2 pipeline
• Multiple ggplot2 components

Slide 97

Slide 97 text

Passing the dots
Entire pipeline

scatter_wrap <- function(data, mapping = aes(), ...) {
  ggplot(data, mapping) +
    geom_point() +
    facet_wrap(vars(...), labeller = label_both)
}

Slide 98

Slide 98 text

Passing the dots

mtcars %>%
  scatter_wrap(
    aes(disp, drat),
    Cylinder = cyl
  )

[Plot: drat against disp, one facet per value of Cylinder (4, 6, 8).]

Slide 99

Slide 99 text

Passing the dots
Multiple components

scatter_wrap <- function(...) {
  geom_point() +
    facet_wrap(vars(...), labeller = label_both)
}

Slide 100

Slide 100 text

Passing the dots
Multiple components

scatter_wrap <- function(...) {
  geom_point() +
    facet_wrap(vars(...), labeller = label_both)
}

An addition pipeline must start with ggplot()

Slide 101

Slide 101 text

Passing the dots
Multiple components

scatter_wrap <- function(...) {
  list(
    geom_point(),
    facet_wrap(vars(...), labeller = label_both)
  )
}
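The difference between the two versions of scatter_wrap() can be checked in isolation. The sketch below assumes only ggplot2 and the built-in mtcars data: adding two components to each other signals an error, while a list of components can be added to a plot started with ggplot().

```r
library(ggplot2)

# Adding two components together, without a ggplot() on the left-hand
# side, signals an error. tryCatch() captures it instead of stopping.
res <- tryCatch(
  geom_point() + facet_wrap(vars(cyl)),
  error = function(e) e
)
failed <- inherits(res, "error")

# A list of components can be added to a plot in one step
components <- list(
  geom_point(),
  facet_wrap(vars(cyl), labeller = label_both)
)
p <- ggplot(mtcars, aes(disp, drat)) + components
```

Returning a list is therefore a convenient way for a helper function to bundle several components.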

Slide 102

Slide 102 text

Passing the dots

ggplot(mtcars, aes(disp, drat)) +
  scatter_wrap(Cylinder = cyl)

[Plot: drat against disp, one facet per value of Cylinder (4, 6, 8).]

Slide 103

Slide 103 text

Passing the dots
• Easy way of creating data masking functions
• Useful when pipeline has only one variable part
• Do you need actions or selections?
• See my RStudio::conf 2019 talk for more ideas

Slide 104

Slide 104 text

Tidy eval, the easy parts
1. Pass the dots
2. Subset .data
3. Interpolate

Slide 105

Slide 105 text

Subsetting .data
What if you have multiple pipeline inputs?

starwars %>%
  group_by(gender) %>%
  summarise(avg = mean(mass, na.rm = TRUE))

Slide 106

Slide 106 text

Subsetting .data
.data is a pronoun that represents the data

starwars %>%
  group_by(.data$gender) %>%
  summarise(avg = mean(.data$mass, na.rm = TRUE))

Slide 107

Slide 107 text

Subsetting .data
Just pass column names to .data[[

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by(.data[[grp_var]]) %>%
    summarise(avg = mean(.data[[avg_var]], na.rm = TRUE))
}
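Because this version of my_average() takes plain strings, it composes with ordinary loops and apply functions. A small sketch, using a made-up `people` tibble in place of starwars:

```r
library(dplyr)

# Hypothetical data (made-up values, illustration only)
people <- tibble(
  gender = c("male", "female", "male", "female"),
  height = c(172, 150, 202, 165),
  mass = c(77, 49, 136, NA)
)

# Same definition as on the slide
my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by(.data[[grp_var]]) %>%
    summarise(avg = mean(.data[[avg_var]], na.rm = TRUE))
}

# Column names are ordinary strings, so looping is straightforward
columns <- c("height", "mass")
out <- lapply(columns, function(col) my_average(people, "gender", col))
```

Each element of `out` is a two-row summary, one row per gender, for the corresponding column.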

Slide 108

Slide 108 text

Subsetting .data
Take strings! No data masking

starwars %>% my_average("gender", "mass")
# A tibble: 5 x 2
  gender           avg
1 NA              46.3
2 female          54.0
3 hermaphrodite 1358
4 male            81.0
5 none           140

Slide 109

Slide 109 text

Subsetting .data
Take strings! No data masking

starwars %>% my_average(gender, mass)
Error: object 'gender' not found

Slide 110

Slide 110 text

Tidy eval, the easy parts
1. Pass the dots
2. Subset .data
3. Interpolate

Slide 111

Slide 111 text

Interpolation
• Now getting into the meat of tidy eval
• Interpolation is a simple pattern
  • Delay a blueprint by quoting with enquo()
  • Insert it back in another blueprint by unquoting with !!
• Forwards a blueprint across functions

Slide 112

Slide 112 text

Interpolation
Simple tidy eval pattern
• Delay a blueprint with enquo()
• Insert it back with !!

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by(.data[[grp_var]]) %>%
    summarise(avg = mean(.data[[avg_var]], na.rm = TRUE))
}

Slide 113

Slide 113 text

Interpolation
Simple tidy eval pattern
• Delay a blueprint with enquo()
• Insert it back with !!

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by(!!enquo(grp_var)) %>%
    summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE))
}
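To see what enquo() delays, it can help to capture an argument and evaluate it by hand. In this sketch the helper name `capture()` and the data frame are hypothetical; enquo(), is_quosure() and eval_tidy() are rlang functions.

```r
library(rlang)

# Hypothetical helper: returns the blueprint of its argument
# instead of computing it
capture <- function(x) {
  enquo(x)
}

# `height` does not exist here, yet nothing fails:
# the computation is only delayed
blueprint <- capture(height / 100)
is_quosure(blueprint)

# Resume the computation later, with a data frame as context
result <- eval_tidy(blueprint, data = data.frame(height = c(150, 200)))
```

Inside my_average(), `!!` plays the role of handing this blueprint over to group_by() and summarise(), which then evaluate it against the data themselves.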

Slide 114

Slide 114 text

Interpolation

starwars %>% my_average(gender, height)
# A tibble: 5 x 2
  gender          avg
1 NA             120
2 female         165.
3 hermaphrodite  175
4 male           179.
5 none           200

Slide 115

Slide 115 text

Interpolation

starwars %>% my_average(gender, height / 100)
# A tibble: 5 x 2
  gender          avg
1 NA             1.2
2 female         1.65
3 hermaphrodite  1.75
4 male           1.79
5 none           2

• Full data masking
• Create vectors on the fly

Slide 116

Slide 116 text

Interpolation
Planned syntax: interpolation with {{ arg }}
Inspiration from the glue package

thing <- "FOOBAR"
glue::glue("Let's interpolate this { thing } right here")
[1] "Let's interpolate this FOOBAR right here"

Slide 117

Slide 117 text

Interpolation
Planned syntax: interpolation with {{ arg }}

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by(!!enquo(grp_var)) %>%
    summarise(avg = mean(!!enquo(avg_var), na.rm = TRUE))
}

Slide 118

Slide 118 text

Interpolation
Planned syntax: interpolation with {{ arg }}

my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by({{ grp_var }}) %>%
    summarise(avg = mean({{ avg_var }}, na.rm = TRUE))
}
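The {{ }} syntax has since been released in rlang 0.4.0, so with a recent dplyr the definition above should run as written. A usage sketch on a made-up `people` tibble (the data values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data (made-up values, illustration only)
people <- tibble(
  gender = c("male", "female", "male", "female"),
  height = c(172, 150, 202, 165)
)

# Same definition as on the slide; requires rlang >= 0.4.0
my_average <- function(data, grp_var, avg_var) {
  data %>%
    group_by({{ grp_var }}) %>%
    summarise(avg = mean({{ avg_var }}, na.rm = TRUE))
}

# Full data masking: bare column names and expressions both work
res <- people %>% my_average(gender, height / 100)
```

`{{ arg }}` is shorthand for the `!!enquo(arg)` pattern, so callers get the same behaviour as with the earlier definition.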

Slide 119

Slide 119 text

Interpolation
• Simple pattern but quickly gets more complicated
  • What to unquote
    • Delayed blueprints with enquo()
    • Custom blueprint material: symbols, function calls, ...
  • Unquoting variants such as !!!
• Simple interpolation should cover many cases
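As one example of the more advanced cases listed above, several column names can be turned into symbols with syms() and spliced into a call with !!!. This is a sketch on made-up data, not a pattern shown in the talk:

```r
library(dplyr)
library(rlang)

# Hypothetical data (made-up values, illustration only)
people <- tibble(
  gender = c("male", "female", "male", "female"),
  species = c("Human", "Human", "Droid", "Human"),
  height = c(172, 150, 202, 165)
)

group_cols <- c("gender", "species")

# syms() builds a list of symbols (custom blueprint material);
# !!! splices them in as if they had been typed as separate arguments
counts <- people %>%
  group_by(!!!syms(group_cols)) %>%
  summarise(n = n())
```

The call is equivalent to writing `group_by(gender, species)` by hand, but the grouping columns can now come from a character vector.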

Slide 120

Slide 120 text

• Data masking is a unique R feature
  • Great for data analysis
  • Harder to program with
• You might not need tidy eval
  • Fixed column names
  • Map functions on columns
• Easy tidy eval techniques
  • Pass the dots
  • Subset .data
  • Quote and unquote (soon interpolate)