Upgrade to Pro — share decks privately, control downloads, hide ads and more …

New features in dplyr 1.1.0, plus an intro to ivs

Davis Vaughan
January 13, 2023
1k

New features in dplyr 1.1.0, plus an intro to ivs

This talk introduces some of the big new features in dplyr 1.1.0, including:
- Per-operation grouping
- Join improvements
- New vectors functions

We also spend some time at the end of the talk discussing a new package for working with interval vectors, ivs.

Davis Vaughan

January 13, 2023
Tweet

Transcript

  1. 🔗 dplyr.tidyverse.org # A tibble: 6 × 3 company year

    revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  2. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  3. 🔗 dplyr.tidyverse.org Total yearly revenue of each company? transactions #

    A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  4. 🔗 dplyr.tidyverse.org transactions %>% group_by(company, year) %>% summarise(revenue = sum(revenue))

    #> `summarise()` has grouped output by 'company'. #> You can override using the `.groups` argument. #> # A tibble: 4 × 3 #> # Groups: company [2] #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 70 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 30
  5. 🔗 dplyr.tidyverse.org transactions %>% group_by(company, year) %>% summarise(revenue = sum(revenue))

    transactions %>% summarise(revenue = sum(revenue), .by = c(company, year))
  6. 🔗 dplyr.tidyverse.org transactions %>% summarise(revenue = sum(revenue), .by = c(company,

    year)) #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 70 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 30
  7. .by group_by() Per-operation grouping Persistent grouping Selects variables using tidy-select

    Computes expressions using data-masking (i.e., like mutate())
  8. 🔗 dplyr.tidyverse.org transactions %>% slice_max(revenue, n = 2, by =

    company) #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 50 #> 2 A 2019 20 #> 3 B 2023 18 #> 4 B 2023 12 Top two transactions per company?
  9. 🔗 dplyr.tidyverse.org # A tibble: 2 × 2 id name

    <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  10. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, by = c(company = "id")) #

    A tibble: 2 × 2 id name <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  11. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id)) # A tibble:

    2 × 2 id name <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  12. 🔗 dplyr.tidyverse.org # A tibble: 3 × 3 id since

    name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  13. 🔗 dplyr.tidyverse.org # A tibble: 3 × 3 id since

    name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  14. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)

    #> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning.
  15. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)

    #> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning.
  16. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)

    #> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning. 😭
  17. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  18. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  19. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE
  20. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE year >= since
  21. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE year >= since Inequality Condition
  22. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))

    #> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon
  23. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))

    #> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon
  24. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))

    #> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon 😭
  25. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  26. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  27. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE 2022 is closer to 2023 than 1916 (107 years) (1 year)
  28. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, closest(year >= since)))

    #> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 2022 Amazon #> 6 B 2023 18 2022 Amazon
  29. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, closest(year >= since)))

    #> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 2022 Amazon #> 6 B 2023 18 2022 Amazon Rolling Join
  30. 🔗 dplyr.tidyverse.org •Equality Joins •join_by(id, a == b) •Inequality Joins

    •join_by(id, date >= since) •Rolling Joins •join_by(id, closest(date >= since)) •Overlap Joins •join_by(id, between(date, y_lower, y_upper)) •join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)) •join_by(id, within(x_lower, x_upper, y_lower, y_upper))
  31. 🔗 dplyr.tidyverse.org •Equality Joins •join_by(id, a == b) •Inequality Joins

    •join_by(id, date >= since) •Rolling Joins •join_by(id, closest(date >= since)) •Overlap Joins •join_by(id, between(date, y_lower, y_upper)) •join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)) •join_by(id, within(x_lower, x_upper, y_lower, y_upper)) Cross Joins cross_join()
  32. 🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text

    <chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript
  33. 🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text

    <chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript Flatten each continuous thought to one line?
  34. 🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text

    <chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript .by = name?
  35. 🔗 dplyr.tidyverse.org transcript %>% mutate(id = consecutive_id(name)) #> # A

    tibble: 8 × 3 #> name text id #> <chr> <chr> <int> #> 1 Hadley I'll never learn Python. 1 #> 2 Davis But aren't you speaking at PyCon? 2 #> 3 Hadley So? 3 #> 4 Hadley That doesn't influence my decision. 3 #> 5 Hadley I'm not budging! 3 #> 6 Mara Typical, Hadley. Stubborn as always. 4 #> 7 Davis Fair enough! 5 #> 8 Davis Let's move on. 5
  36. 🔗 dplyr.tidyverse.org transcript %>% mutate(id = consecutive_id(name)) %>% summarise( text

    = stringr::str_flatten(text, collapse = " "), .by = c(id, name) ) #> # A tibble: 5 × 3 #> id name text #> <int> <chr> <chr> #> 1 1 Hadley I'll never learn Python. #> 2 2 Davis But aren't you speaking at PyCon? #> 3 3 Hadley So? That doesn't influence my decision. I'm not budging! #> 4 4 Mara Typical, Hadley. Stubborn as always. #> 5 5 Davis Fair enough! Let's move on.
  37. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA )
  38. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA ) #> Error: `NA` must be <character>, not <logical>. 😭 NA_character_
  39. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA ) #> [1] "small" "large" NA "small" NA NA "small"
  40. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", is.na(x) ~ "missing", TRUE ~ "other" ) #> [1] "small" "large" "other" "small" "other" "missing" "small"
  41. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", is.na(x) ~ "missing", .default = "other" ) #> [1] "small" "large" "other" "small" "other" "missing" "small"
  42. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_when( x %in% c("USA", "Canada", "Mexico") ~ "North America", x %in% c("France", "UK") ~ "Europe", x %in% "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA
  43. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_when( x %in% c("USA", "Canada", "Mexico") ~ "North America", x %in% c("France", "UK") ~ "Europe", x %in% "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA
  44. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_match( x, c("USA", "Canada", "Mexico") ~ "North America", c("France", "UK") ~ "Europe", "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA
  45. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_match( x, c("USA", "Canada", "Mexico") ~ "North America", c("France", "UK") ~ "Europe", "China" ~ "Asia", NA ~ "missing", .default = "unknown" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" "missing" #> [7] "North America" "unknown"
  46. ivs

  47. 🔗 davisvaughan.github.io/ivs # A tibble: 7 × 3 name start

    end <chr> <int> <int> 1 Davis 2014 2016 2 Greg 2015 2017 3 Greg 2019 2021 4 Ben 2019 2020 5 Robby 2016 2017 6 Robby 2019 2021 7 Robby 2022 2023 apartment
  48. 🔗 davisvaughan.github.io/ivs # A tibble: 7 × 3 name start

    end <chr> <int> <int> 1 Davis 2014 2016 2 Greg 2015 2017 3 Greg 2019 2021 4 Ben 2019 2020 5 Robby 2016 2017 6 Robby 2019 2021 7 Robby 2022 2023 apartment # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment iv(start, end)
  49. 🔗 davisvaughan.github.io/ivs 1 2 2 0 3 0 2 0

    1 Number of people in the apartment per year?
  50. 🔗 davisvaughan.github.io/ivs rolling %>% mutate(count = iv_count_between(year, apartment$range)) #> #

    A tibble: 9 × 2 #> year count #> <int> <int> #> 1 2014 1 #> 2 2015 2 #> 3 2016 2 #> 4 2017 0 #> 5 2018 0 #> 6 2019 3 #> 7 2020 2 #> 8 2021 0 #> 9 2022 1 # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment
  51. 🔗 davisvaughan.github.io/ivs apartment %>% reframe(range = iv_groups(range)) #> # A

    tibble: 3 × 1 #> range #> <iv<int>> #> 1 [2014, 2017) #> 2 [2019, 2021) #> 3 [2022, 2023) # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment
  52. 🔗 davisvaughan.github.io/ivs apartment %>% reframe(range = iv_set_complement(range)) #> # A

    tibble: 2 × 1 #> range #> <iv<int>> #> 1 [2017, 2019) #> 2 [2021, 2022) # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment