Upgrade to Pro — share decks privately, control downloads, hide ads and more …

New features in dplyr 1.1.0, plus an intro to ivs

Davis Vaughan
January 13, 2023
930

New features in dplyr 1.1.0, plus an intro to ivs

This talk introduces some of the big new features in dplyr 1.1.0, including:
- Per-operation grouping
- Join improvements
- New vectors functions

We also spend some time at the end of the talk discussing a new package for working with interval vectors, ivs.

Davis Vaughan

January 13, 2023
Tweet

Transcript

  1. dplyr 1.1.0 Davis Vaughan January 2023

  2. dplyr 1.1.0 ivs

  3. Per-operation grouping Joins New vector functions

  4. 🔗 dplyr.tidyverse.org # A tibble: 6 × 3 company year

    revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  5. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  6. 🔗 dplyr.tidyverse.org Total yearly revenue of each company? transactions #

    A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  7. 🔗 dplyr.tidyverse.org transactions %>% group_by(company, year) %>% summarise(revenue = sum(revenue))

    #> `summarise()` has grouped output by 'company'. #> You can override using the `.groups` argument. #> # A tibble: 4 × 3 #> # Groups: company [2] #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 70 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 30
  8. 🔗 dplyr.tidyverse.org transactions[, .(revenue = sum(revenue)), by = .(company, year)]

  9. 🔗 dplyr.tidyverse.org transactions %>% group_by(company, year) %>% summarise(revenue = sum(revenue))

    transactions %>% summarise(revenue = sum(revenue), .by = c(company, year))
  10. 🔗 dplyr.tidyverse.org transactions %>% summarise(revenue = sum(revenue), .by = c(company,

    year)) #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 70 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 30
  11. .by group_by() Per-operation grouping Persistent grouping Selects variables using tidy-select

    Computes expressions using data-masking (i.e., like mutate())
  12. 🔗 dplyr.tidyverse.org transactions %>% slice_max(revenue, n = 2, by =

    company) #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 50 #> 2 A 2019 20 #> 3 B 2023 18 #> 4 B 2023 12 Top two transactions per company?
  13. mutate() summarise() filter() slice_*()

  14. So what about group_by()?

  15. Joins

  16. 🔗 dplyr.tidyverse.org # A tibble: 2 × 2 id name

    <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  17. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, by = c(company = "id")) #

    A tibble: 2 × 2 id name <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  18. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id)) # A tibble:

    2 × 2 id name <chr> <chr> 1 A Apple 2 B Boeing companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  19. 🔗 dplyr.tidyverse.org # A tibble: 3 × 3 id since

    name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  20. 🔗 dplyr.tidyverse.org # A tibble: 3 × 3 id since

    name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18
  21. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)

    #> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning.
  22. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)

    #> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning.
  23. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id), keep = TRUE)

    #> # A tibble: 9 × 6 #> company year revenue id since name #> <chr> <dbl> <dbl> <chr> <dbl> <chr> #> 1 A 2019 20 A 1976 Apple #> 2 A 2019 50 A 1976 Apple #> 3 A 2020 4 A 1976 Apple #> 4 B 2021 10 B 1916 Boeing #> 5 B 2021 10 B 2022 Amazon #> 6 B 2023 12 B 1916 Boeing #> 7 B 2023 12 B 2022 Amazon #> 8 B 2023 18 B 1916 Boeing #> 9 B 2023 18 B 2022 Amazon #> Warning: #> Each row in `x` is expected to match at most 1 row in `y`. #> ℹ Row 4 of `x` matches multiple rows. #> ℹ If multiple matches are expected, set `multiple = "all"` #> to silence this warning. 😭
  24. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  25. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  26. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE
  27. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE year >= since
  28. 🔗 dplyr.tidyverse.org transactions # A tibble: 6 × 3 company

    year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies 2021 >= 1916 is TRUE 2021 >= 2022 is FALSE year >= since Inequality Condition
  29. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))

    #> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon
  30. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))

    #> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon
  31. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, year >= since))

    #> # A tibble: 8 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 1916 Boeing #> 6 B 2023 12 2022 Amazon #> 7 B 2023 18 1916 Boeing #> 8 B 2023 18 2022 Amazon 😭
  32. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  33. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE transactions # A tibble: 6 × 3 company year revenue <chr> <dbl> <dbl> 1 A 2019 20 2 A 2019 50 3 A 2020 4 4 B 2021 10 5 B 2023 12 6 B 2023 18 # A tibble: 3 × 3 id since name <chr> <dbl> <chr> 1 A 1976 Apple 2 B 1916 Boeing 3 B 2022 Amazon companies
  34. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE
  35. 🔗 dplyr.tidyverse.org 2023 >= 1916 is TRUE 2023 >= 2022

    is TRUE 2022 is closer to 2023 than 1916 (107 years) (1 year)
  36. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, closest(year >= since)))

    #> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 2022 Amazon #> 6 B 2023 18 2022 Amazon
  37. 🔗 dplyr.tidyverse.org inner_join(transactions, companies, join_by(company == id, closest(year >= since)))

    #> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 20 1976 Apple #> 2 A 2019 50 1976 Apple #> 3 A 2020 4 1976 Apple #> 4 B 2021 10 1916 Boeing #> 5 B 2023 12 2022 Amazon #> 6 B 2023 18 2022 Amazon Rolling Join
  38. 🔗 dplyr.tidyverse.org •Equality Joins •join_by(id, a == b) •Inequality Joins

    •join_by(id, date >= since) •Rolling Joins •join_by(id, closest(date >= since)) •Overlap Joins •join_by(id, between(date, y_lower, y_upper)) •join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)) •join_by(id, within(x_lower, x_upper, y_lower, y_upper))
  39. 🔗 dplyr.tidyverse.org •Equality Joins •join_by(id, a == b) •Inequality Joins

    •join_by(id, date >= since) •Rolling Joins •join_by(id, closest(date >= since)) •Overlap Joins •join_by(id, between(date, y_lower, y_upper)) •join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)) •join_by(id, within(x_lower, x_upper, y_lower, y_upper)) Cross Joins cross_join()
  40. consecutive_id() case_match()

  41. 🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text

    <chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript
  42. 🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text

    <chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript Flatten each continuous thought to one line?
  43. 🔗 dplyr.tidyverse.org # A tibble: 8 × 2 name text

    <chr> <chr> 1 Hadley I'll never learn Python. 2 Davis But aren't you speaking at PyCon? 3 Hadley So? 4 Hadley That doesn't influence my decision. 5 Hadley I'm not budging! 6 Mara Typical, Hadley. Stubborn as always. 7 Davis Fair enough! 8 Davis Let's move on. transcript .by = name?
  44. 🔗 dplyr.tidyverse.org transcript %>% mutate(id = consecutive_id(name)) #> # A

    tibble: 8 × 3 #> name text id #> <chr> <chr> <int> #> 1 Hadley I'll never learn Python. 1 #> 2 Davis But aren't you speaking at PyCon? 2 #> 3 Hadley So? 3 #> 4 Hadley That doesn't influence my decision. 3 #> 5 Hadley I'm not budging! 3 #> 6 Mara Typical, Hadley. Stubborn as always. 4 #> 7 Davis Fair enough! 5 #> 8 Davis Let's move on. 5
  45. 🔗 dplyr.tidyverse.org transcript %>% mutate(id = consecutive_id(name)) %>% summarise( text

    = stringr::str_flatten(text, collapse = " "), .by = c(id, name) ) #> # A tibble: 5 × 3 #> id name text #> <int> <chr> <chr> #> 1 1 Hadley I'll never learn Python. #> 2 2 Davis But aren't you speaking at PyCon? #> 3 3 Hadley So? That doesn't influence my decision. I'm not budging! #> 4 4 Mara Typical, Hadley. Stubborn as always. #> 5 5 Davis Fair enough! Let's move on.
  46. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA )
  47. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA ) #> Error: `NA` must be <character>, not <logical>. 😭 NA_character_
  48. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", x < 0 ~ NA ) #> [1] "small" "large" NA "small" NA NA "small"
  49. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", is.na(x) ~ "missing", TRUE ~ "other" ) #> [1] "small" "large" "other" "small" "other" "missing" "small"
  50. 🔗 dplyr.tidyverse.org x <- c(1, 12, -5, 6, -2, NA,

    0) case_when( x >= 10 ~ "large", x >= 0 ~ "small", is.na(x) ~ "missing", .default = "other" ) #> [1] "small" "large" "other" "small" "other" "missing" "small"
  51. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_when( x %in% c("USA", "Canada", "Mexico") ~ "North America", x %in% c("France", "UK") ~ "Europe", x %in% "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA
  52. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_when( x %in% c("USA", "Canada", "Mexico") ~ "North America", x %in% c("France", "UK") ~ "Europe", x %in% "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA
  53. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_match( x, c("USA", "Canada", "Mexico") ~ "North America", c("France", "UK") ~ "Europe", "China" ~ "Asia" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" NA #> [7] "North America" NA
  54. 🔗 dplyr.tidyverse.org x <- c("USA", "Canada", "France", "UK", "China", NA,

    "Mexico", "Russia") case_match( x, c("USA", "Canada", "Mexico") ~ "North America", c("France", "UK") ~ "Europe", "China" ~ "Asia", NA ~ "missing", .default = "unknown" ) #> [1] "North America" "North America" "Europe" "Europe" "Asia" "missing" #> [7] "North America" "unknown"
  55. ivs

  56. 🔗 davisvaughan.github.io/ivs # A tibble: 7 × 3 name start

    end <chr> <int> <int> 1 Davis 2014 2016 2 Greg 2015 2017 3 Greg 2019 2021 4 Ben 2019 2020 5 Robby 2016 2017 6 Robby 2019 2021 7 Robby 2022 2023 apartment
  57. 🔗 davisvaughan.github.io/ivs # A tibble: 7 × 3 name start

    end <chr> <int> <int> 1 Davis 2014 2016 2 Greg 2015 2017 3 Greg 2019 2021 4 Ben 2019 2020 5 Robby 2016 2017 6 Robby 2019 2021 7 Robby 2022 2023 apartment # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment iv(start, end)
  58. 🔗 davisvaughan.github.io/ivs

  59. 🔗 davisvaughan.github.io/ivs Number of people in the apartment per year?

  60. 🔗 davisvaughan.github.io/ivs 1 2 2 0 3 0 2 0

    1 Number of people in the apartment per year?
  61. 🔗 davisvaughan.github.io/ivs rolling %>% mutate(count = iv_count_between(year, apartment$range)) #> #

    A tibble: 9 × 2 #> year count #> <int> <int> #> 1 2014 1 #> 2 2015 2 #> 3 2016 2 #> 4 2017 0 #> 5 2018 0 #> 6 2019 3 #> 7 2020 2 #> 8 2021 0 #> 9 2022 1 # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment
  62. 🔗 davisvaughan.github.io/ivs When was the apartment occupied?

  63. 🔗 davisvaughan.github.io/ivs When was the apartment occupied?

  64. 🔗 davisvaughan.github.io/ivs apartment %>% reframe(range = iv_groups(range)) #> # A

    tibble: 3 × 1 #> range #> <iv<int>> #> 1 [2014, 2017) #> 2 [2019, 2021) #> 3 [2022, 2023) # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment
  65. 🔗 davisvaughan.github.io/ivs When was the apartment empty?

  66. 🔗 davisvaughan.github.io/ivs When was the apartment empty?

  67. 🔗 davisvaughan.github.io/ivs apartment %>% reframe(range = iv_set_complement(range)) #> # A

    tibble: 2 × 1 #> range #> <iv<int>> #> 1 [2017, 2019) #> 2 [2021, 2022) # A tibble: 7 × 2 name range <chr> <iv<int>> 1 Davis [2014, 2016) 2 Greg [2015, 2017) 3 Greg [2019, 2021) 4 Ben [2019, 2020) 5 Robby [2016, 2017) 6 Robby [2019, 2021) 7 Robby [2022, 2023) apartment
  68. dplyr.tidyverse.org davisvaughan.github.io/ivs Thank you!