New features in dplyr 1.1.0, plus an intro to ivs

dplyr 1.1.0 Davis Vaughan January 2023

dplyr 1.1.0 ivs

Per-operation grouping Joins New vector functions

🔗 dplyr.tidyverse.org # A tibble: 6 × 3 company year
revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company
year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org Total yearly revenue of each company? transactions #
A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org transactions %>% group_by(company, year) %>% summarise(revenue = sum(revenue))
#> `summarise()` has grouped output by 'company'. #> You can override using the `.groups` argument. #> # A tibble: 4 × 3 #> # Groups: company [2] #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 70 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 30

🔗 dplyr.tidyverse.org transactions[, .(revenue = sum(revenue)), by = .(company, year)]

🔗 dplyr.tidyverse.org transactions %>% group_by(company, year) %>% summarise(revenue = sum(revenue))
transactions %>% summarise(revenue = sum(revenue), .by = c(company, year))

🔗 dplyr.tidyverse.org transactions %>% summarise(revenue = sum(revenue), .by = c(company,
year)) #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 70 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 30

.by group_by() Per-operation grouping Persistent grouping Selects variables using tidy-select
Computes expressions using data-masking (i.e., like mutate())

🔗 dplyr.tidyverse.org transactions %>% slice_max(revenue, n = 2, by =
company) #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 50 #> 2 A 2019 20 #> 3 B 2023 18 #> 4 B 2023 12 Top two transactions per company?

mutate() summarise() filter() slice_*()

So what about group_by()?

🔗 dplyr.tidyverse.org # A tibble: 2 × 2 id name
<chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org inner_join(transactions, companies, by = c(company = "id")) #
A tibble: 2 × 2 id name <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id)) # A tibble:
2 × 2 id name <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org # A tibble: 3 × 3 id since
name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)
#> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning.

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)
#> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning. 😭

year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies

year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE

year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE year >= since

year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE year >= since Inequality Condition

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))
#> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))
#> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon 😭

🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022
is TRUE transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies

is TRUE

is TRUE 2022 is closer to 2023 than 1916 (107 years) (1 year)

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, closest(year >= since)))
#> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 2022 Amazon #> 6 B 2023 18 2022 Amazon

🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, closest(year >= since)))
#> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 2022 Amazon #> 6 B 2023 18 2022 Amazon Rolling Join

🔗 dplyr.tidyverse.org •Equality Joins •join_by(id, a == b) •Inequality Joins
•join_by(id, date >= since) •Rolling Joins •join_by(id, closest(date >= since)) •Overlap Joins •join_by(id, between(date, y_lower, y_upper)) •join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)) •join_by(id, within(x_lower, x_upper, y_lower, y_upper))

🔗 dplyr.tidyverse.org •Equality Joins •join_by(id, a == b) •Inequality Joins
•join_by(id, date >= since) •Rolling Joins •join_by(id, closest(date >= since)) •Overlap Joins •join_by(id, between(date, y_lower, y_upper)) •join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)) •join_by(id, within(x_lower, x_upper, y_lower, y_upper)) Cross Joins cross_join()

consecutive_id() case_match()

🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text
<chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript

<chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript Flatten each continuous thought to one line?

<chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript .by = name?

🔗 dplyr.tidyverse.org transcript %>% mutate(id = consecutive_id(name)) #> # A
tibble: 8 × 3 #> name text id #> <chr> <chr> <int> #> 1 Hadley I'll never learn Python. 1 #> 2 Davis But aren't you speaking at PyCon? 2 #> 3 Hadley So? 3 #> 4 Hadley That doesn't influence my decision. 3 #> 5 Hadley I'm not budging! 3 #> 6 Mara Typical, Hadley. Stubborn as always. 4 #> 7 Davis Fair enough! 5 #> 8 Davis Let's move on. 5

🔗 dplyr.tidyverse.org transcript %>% mutate(id = consecutive_id(name)) %>% summarise( text
= stringr::str_flatten(text, collapse = " "), .by = c(id, name) ) #> # A tibble: 5 × 3 #> id name text #> <int> <chr> <chr> #> 1 1 Hadley I'll never learn Python. #> 2 2 Davis But aren't you speaking at PyCon? #> 3 3 Hadley So? That doesn't influence my decision. I'm not budging! #> 4 4 Mara Typical, Hadley. Stubborn as always. #> 5 5 Davis Fair enough! Let's move on.

🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,
0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA )

0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA ) #> Error: `NA` must be <character>, not <logical>. 😭 NA_character_

0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA ) #> [1] "small" "large" NA "small" NA NA "small"

0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", is.na(x) ~ "missing", TRUE ~ "other" ) #> [1] "small" "large" "other" "small" "other" "missing" "small"

0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", is.na(x) ~ "missing", .default = "other" ) #> [1] "small" "large" "other" "small" "other" "missing" "small"

🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,
"Mexico", "Russia") case_when( x %in% c("USA", "Canada", "Mexico") ~ "North America", x %in% c("France", "UK") ~ "Europe", x %in% "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA

"Mexico", "Russia") case_match( x, c("USA", "Canada", "Mexico") ~ "North America", c("France", "UK") ~ "Europe", "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA

"Mexico", "Russia") case_match( x, c("USA", "Canada", "Mexico") ~ "North America", c("France", "UK") ~ "Europe", "China" ~ "Asia", NA ~ "missing", .default = "unknown" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" "missing" #> [7] "North America" "unknown"

🔗 davisvaughan.github.io/ivs # A tibble: 7 × 3 name start
end <chr> <int> <int> 1 Davis 2014 2016 2 Greg 2015 2017 3 Greg 2019 2021 4 Ben 2019 2020 5 Robby 2016 2017 6 Robby 2019 2021 7 Robby 2022 2023 apartment

🔗 davisvaughan.github.io/ivs # A tibble: 7 × 3 name start
end <chr> <int> <int> 1 Davis 2014 2016 2 Greg 2015 2017 3 Greg 2019 2021 4 Ben 2019 2020 5 Robby 2016 2017 6 Robby 2019 2021 7 Robby 2022 2023 apartment # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment iv(start, end)

🔗 davisvaughan.github.io/ivs

🔗 davisvaughan.github.io/ivs Number of people in the apartment per year?

🔗 davisvaughan.github.io/ivs 1 2 2 0 3 0 2 0
1 Number of people in the apartment per year?

🔗 davisvaughan.github.io/ivs rolling %>% mutate(count = iv_count_between(year, apartment$range)) #> #
A tibble: 9 × 2 #> year count #> <int> <int> #> 1 2014 1 #> 2 2015 2 #> 3 2016 2 #> 4 2017 0 #> 5 2018 0 #> 6 2019 3 #> 7 2020 2 #> 8 2021 0 #> 9 2022 1 # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment

🔗 davisvaughan.github.io/ivs When was the apartment occupied?

🔗 davisvaughan.github.io/ivs apartment %>% reframe(range = iv_groups(range)) #> # A
tibble: 3 × 1 #> range #> <iv<int>> #> 1 [2014, 2017) #> 2 [2019, 2021) #> 3 [2022, 2023) # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment

🔗 davisvaughan.github.io/ivs When was the apartment empty?

🔗 davisvaughan.github.io/ivs apartment %>% reframe(range = iv_set_complement(range)) #> # A
tibble: 2 × 1 #> range #> <iv<int>> #> 1 [2017, 2019) #> 2 [2021, 2022) # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment

dplyr.tidyverse.org davisvaughan.github.io/ivs Thank you!

New features in dplyr 1.1.0, plus an intro to ivs

New features in dplyr 1.1.0, plus an intro to ivs

More Decks by Davis Vaughan

Featured

Transcript