今こそ、data.tableを学ぼう! / datatable1130

今こそ、data.tableを学ぼう! / datatable1130

Tokyo.R#87 http://tokyor.connpass.com/event/182672/ での発表スライド

D12a80cab206033a820ccff8319f957b?s=128

Uryu Shinya

August 01, 2020
Tweet

Transcript

  1. data.table Shinya Uryu ( @u_ribo uribo) 20200801 Tokyo.R#87 =Λֶ΅͏ʂ ύοέʔδ

    ࠓͦ͜ɺ DT[i, j, by] fifelse fcase .SD .N nafill .GRP
  2. (JU)VC4QPOTPST ͷօ͞· IUUQTHJUIVCDPNTQPOTPSTVSJCP ఏڙ @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet

    @ ito4303 @ ak9782427
  3. ໨࣍   σʔλૢ࡞ ⁠ EBUBUBCMFͷ֦͕Γ ⁞ EBUBUBCMFͷར఺

  4. MJCSBSZ EBUBUBCMF WFSTJPO  4JODFPO$3"/ ⁞

  5. EBUBUBCMF &YUFOTJPOPGEBUBGSBNF ʢେن໛σʔλͷʣޮ཰తͳॲཧɺ ߴ଎ͳಡΈॻ͖͕ಘҙ EBUBUBCMFΫϥε %5 Λఏڙ EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞ IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF

  6. group_by join data.table 13s 92s dplyr 157s 392s datatable 72s

    1301s pandas 97s 1514s dask 127s out of memory DataFrames.jl 43s 5755s spark 37s 369s ੑೳଌఆ IUUQTIPBJHJUIVCJPECCFODINBSL (#   ߦྻ
  7. ੑೳଌఆ .#   ߦྻ time data.table::fread() 2.85s fread()ʢcolClassࢦఆʣ 2.57s

    readr::read_csv() 14.0s fst::read_fst() 2.44s
  8. EQMZSͱͷൺֱ ґଘ͕গͳ͍ ϝϞϦར༻ޮ཰ʹҧ͍ อकతͳ։ൃʢWTϦϦʔε଎౓ʣ EQMZSΑΓ΋ྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك <J K CZ>ܗࣜͷૢ࡞ੑ͸Ұ؏

  9. EQMZSͱͷൺֱ ґଘͷগͳ͞ tools::package_dependencies("data.table", recursive = TRUE)[[1]] #> [1] "methods" tools::package_dependencies("dplyr",

    recursive = TRUE)[[1]] #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr" #> [6] "methods" "R6" "rlang" "tibble" "tidyselect" #> [11] "utils" "vctrs" "cli" "crayon" "fansi" #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat" #> [21] "grDevices" "utf8" "tools" ⁞
  10. EQMZSͱͷൺֱ ϝϞϦར༻ޮ཰   $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu 5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUFEBUBUBCMF@NFNPSZ ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕ Θ͔ͣʹ༏ΕΔ நग़ͷաఔͰ͸

    άϧʔϓͷू໿޻ఔ͸ඇޮ཰ʁ EBUBUBCMF͕উΔ IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT
  11. EQMZSͱͷൺֱ ϝϞϦར༻ޮ཰   d <- data.table( x = rnorm(10)) address(d)

    #> [1] “0x7f8afc694400" df_tbl <- copy(d) %>% as_tibble() address(df_tbl) #> [1] "0x7f8ace2fb1c8" d[, z := rnorm(10)] address(d) #> [1] "0x7f8afc694400" ม਺ʹॲཧΛՃ͑ͯ΋ ΦϒδΣΫτ͕ڞ༗͞ΕΔ ϝϞϦΞυϨε͕มΘΒͳ͍ df_tbl <- df_tbl %>% mutate(z = rnorm(1e6)) address(df_tbl) #> [1] "0x7f8afc68eea8" UJEZWFSTF ϝϞϦΞυϨε͕มΘΔ
  12. ͳͥޮ཰͕ྑ͘ɺߴ଎ͳͷ͔ $ݴޠ $ Ͱ͸ͳ͍ ʹΑΔ࣮૷ TIBMMPXίϐʔͰͷྻͷ஋ͷ ௥Ճɾߋ৽ɾ࡟আ Ωʔར༻ͱόΠφϦαʔν ʹΑΔݕࡧ

  13.   BSSBOHF GJMUFS DT[i, j, by] EBUBUBCMFߏจ NVUBUF TFMFDU HSPVQ@CZ

    TVNNBSJTF ର৅ ૢ࡞ ू߹
  14. df %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop")

    id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 20 2 16 JEྻͷ஋͝ͱʹWBMΛ߹ܭ͢Δ ྫ ⁞ DT[, .(val1_sum = sum(val1)), by = id]
  15. id code val1 1 c 1 1 b 2 1

    c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 12 2 16 DPEFCͷ৚݅Ͱྫ⁞ͱಉ͡ॲཧ ྫ   df %>% filter(code != "b") %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") DT[code != "b", .(val1_sum = sum(val1)), by = id]
  16. id val1_sum val2_sum 1 12 36 2 16 40 DPEFCͷ৚݅ɺJE͝ͱʹWBMྻͷ߹ܭ

    ྫ ⁠ id code val1 val2 1 c 1 9 1 b 2 10 1 c 3 11 2 a 4 12 2 a 5 13 1 b 6 14 2 a 7 15 1 c 8 16 DT[code != "b", lapply(.SD, sum), by = id, .SDcols = patterns("^val")] 4VCTFUPGUIF%BUBUBCMF df %>% filter(code != "b") %>% group_by(id) %>% summarise(across(starts_with("val"), sum), .groups = "drop")
  17. df %>% filter(val1 < 3) df %>% filter(val1 < 3

    | val2 > 9) df %>% filter(val1 < 3 & val2 > 9) Jߦͷநग़ pMUFS DT[val1 < 3] DT[val1 < 3 | val2 > 9] # or৚݅ DT[val1 < 3 & val2 > 9] # and৚݅ df %>% filter(code %in% c("a", “d")) df %>% filter(!code %in% c("a", "b")) df %>% filter( stringr::str_detect(code, "^c$")) # on۟ DT[c("a", "d"),on = "code",nomatch = 0] # a,bҎ֎ DT[!code %in% c("a", "b")] # ਖ਼نදݱͰͷࢦఆ DT[like(code, "^c$")] ਺஋ จࣈྻ
  18. Jߦͷฒͼସ͑ BSSBOHF DT[order(id)] # খ͍͞ॱ(ঢॱ) #> id code val1 val2

    #> 1: 1 c 1 9 #> 2: 1 b 2 10 #> 3: 1 c 3 11 #> 4: 1 b 6 14 #> 5: 1 c 8 16 #> 6: 2 a 4 12 #> 7: 2 a 5 13 #> 8: 2 a 7 15 df %>% arrange(id) df %>% arrange(desc(id)) setorder(DT, -id, na.last = TRUE) DT # େ͖͍ॱ(߱ॱ) #> id code val1 val2 #> 1: 2 a 4 12 #> 2: 2 a 5 13 #> 3: 2 a 7 15 #> 4: 1 c 1 9 #> 5: 1 b 2 10 #> 6: 1 c 3 11 #> 7: 1 b 6 14 #> 8: 1 c 8 16
  19. Kྻͷબ୒ TFMFDU ୯Ұ ෳ਺ ෳ਺ DT[, "id"] DT[, list(id)] DT[,

    .(id)] DT[, 1] cols <- c("val1", "val2") DT[, c(1, 2)] DT[, ..cols] DT[, cols, with = FALSE] DT[, list(val1, val2)] DT[, .(val1, val2)] จࣈྻϕΫτϧΛ ༩͑Δࡍʹ͸ඞཁ ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ df %>% select(id) df %>% select(val1, val2) ෳ਺ ࡟আ DT[, !c("id")] DT[, !cols, with = FALSE] ˠ࣍ͷεϥΠυ Ͱߦͬͨํ͕ޮ཰త  ͸MJTU ͷγϣʔτΧοτ
  20. K஋ͷ௥ՃɾՃ޻ɾ࡟আ NVUBUF ηΠ΢νPSΞγΧԋࢉࢠ 1ZUIPOͰ΋ಋೖ +PFM(BSMJDI.JMMFS  64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO DT[, code_title =

    toupper(code)] #> Error in `[.data.table`(DT, , code_title = toupper(code)): ࢖ΘΕ͍ͯͳ͍Ҿ਺ (code_title = toupper(code)) DT[, code_title := toupper(code)] ݁Ռ͸ग़ྗ͞Εͳ͍ ෭࡞༻͕͋Δ dt_copy <- copy(DT) address(DT) #> [1] "0x7f953bc4b800" address(dt_copy) #> [1] "0x7f953bdb8800" EBUBUBCMFͰͷEFFQDPQZ
  21. K஋ͷ௥ՃɾՃ޻ɾ࡟আ NVUBUF DT[, code_title := toupper(code)] # ௥Ճ DT[, code

    := toupper(code)] # ߋ৽ DT[, code := NULL] # ࡟আ df %>% mutate(code_title = toupper(code)) df %>% mutate(code = toupper(code)) df %>% mutate(code_title = NULL) df %>% mutate(val1 = val1 + 2, val2 = val2 + val1) ୯Ұ ෳ਺ DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)] DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)] ࠨลӈล ৚݅ DT[c("a", "c"), val2 := val2 * 2, on = c("code")] DPEFྻͷ஋͕B DͰ͋Ε͹WBM 
  22. CZू໿ TVNNBSJTF DT[, .(val1_sum = sum(val1)), by = code] DT[,

    id := NULL][, lapply(.SD, sum), by = code, .SDcols = is.numeric] df %>% group_by(code) %>% summarise(val1_sum = sum(val1), .groups = “drop") df %>% select(!id) %>% group_by(code) %>% summarise(across(where(is.numeric), sum), .groups = "drop") EQMZSHSPVQ@CZ ͱಉ౳ 4%DPMTʹ༩͑ΒΕͨྻ͕ 4%ʹ౉͞ΕΔ ୯Ұ ෳ਺ .SDcols = val1:val2 .SDcols = c("val1", "val2") .SDcols = patterns("^val") ͜ΕΒͷࢦఆํ๏Ͱ΋0,
  23. DT[, .(val1_sum = sum(val1)), keyby = code] #> code val1_sum

    #> 1: a 16 #> 2: b 8 #> 3: c 12 άϧʔϓԽͷิॿػೳ / /(31 (31 DT[, .(val1_sum = sum(val1)), by = code] #> code val1_sum #> 1: c 12 #> 2: b 8 #> 3: a 16 LFZCZʹࢦఆͨ͠ม਺ͷฒͼͰฒͼସ͑Δ DT[, .N, by = id] #> id N #> 1: 1 5 #> 2: 2 3 DT[, .GRP, by = id] #> id GRP #> 1: 1 1 #> 2: 2 2 άϧʔϓม਺ͷΧ΢ϯτ DT[, .NGRP, by = id] #> id GRP #> 1: 1 2 #> 2: 2 2 άϧʔϓ JE ͷݸ਺ άϧʔϓม਺ͷ֬ೝ LFZCZ
  24. ݁߹ KPJO PO۟΋͘͠͸NFSHF inner_join( flights, planes, by = "tailnum") library(nycflights13)

    flights_dt <- as.data.table(flights) planes_dt <- as.data.table(planes) left_join( flights, planes, by = "tailnum") merge( flights_dt, planes_dt, all.x = TRUE, all.y = FALSE, by = "tailnum") flights_dt[planes_dt, on = "tailnum"]
  25. ॎԣม׵ library(tidyr) data("fish_encounters") #> # A tibble: 114 x 3

    #> fish station seen #> <fct> <fct> <int> #> 1 4842 Release 1 #> 2 4842 I80_1 1 #> 3 4842 Lisbon 1 #> 4 4842 Rstr 1 #> 5 4842 Base_TD 1 #> 6 4842 BCE 1 #> 7 4842 BCW 1 #> 8 4842 BCE2 1 #> 9 4842 BCW2 1 #> 10 4842 MAE 1 #> # … with 104 more rows #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1 #> 1: 4842 1 1 1 1 1 1 1 1 1 #> 2: 4843 1 1 1 1 1 1 1 1 1 #> 3: 4844 1 1 1 1 1 1 1 1 1 …((লུ) fish_encounters %>% as.data.table() %>% dcast(... ~ station + seen, value.var = "seen") fish_encounters %>% pivot_wider(names_from = station, values_from = seen) ͋Μ·ΓΠέͯͳ͍ XJEFS ⁞
  26. data("billboard") #> # A tibble: 317 x 79 #> artist

    track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 #> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 #> # … with 314 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>, billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE) #> artist track date.entered week rank #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81 #> 4: 3 Doors Down Loser 2000-10-21 wk1 76 #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57 #> --- #> 5303: Creed Higher 1999-09-11 wk63 50 #> 5304: Lonestar Amazed 1999-06-05 wk63 45 #> 5305: Creed Higher 1999-09-11 wk64 50 #> 5306: Lonestar Amazed 1999-06-05 wk64 50 #> 5307: Creed Higher 1999-09-11 wk65 49 billboard %>% as.data.table() %>% melt(id = c("artist", "track", "date.entered"), measure = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE) ৑௕ʜ ॎԣม׵ MPOHFS   UJEZGBTUͰੋඇ
  27. ϕΫτϧૢ࡞ؔ਺ fcase(x < 5L, 1L, x >= 5L, 3L) #

    dplyr::case_when #> [1] 1 1 1 1 3 3 3 3 NA 3 GDBTF x <- c(seq_len(8), NA, 10L) fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE pGFMTF
  28. ϕΫτϧૢ࡞ؔ਺ fcoalesce(x, seq_len(10)) # dplyr::coalesce #> [1] 1 2 3

    4 5 6 7 8 9 10 GDPBMFTDF OBpMM nafill(x, fill = 0) #> [1] 1 2 3 4 5 6 7 8 0 10
  29.  ؔ࿈ύοέʔδ

  30. EUQMZS όοΫάϥ΢ϯυʹEBUBUBCMF EQMZS EBUBUBCMF ࣮ߦ଎౓͸ EBUBUBCMFEUQMZSEQMZS EQMZSߏจ͕࢖͑ͯ EBUBUBCMFʹ͍ۙύϑΥʔϚϯε

  31. EUQMZS dt_starwars %>% group_by(species) %>% summarise(height = mean(height, na.rm =

    TRUE), .groups = “drop") library(dtplyr) dt_starwars <- lazy_dt(starwars) #> Source: local data table [?? x 3] #> Call: `_DT1`[, .(height = mean(height, #> na.rm = TRUE), .groups = "drop"), #> keyby = .(species)] #> #> species height .groups #> <chr> <dbl> <chr> #> 1 <NA> 181. drop #> 2 Aleena 79 drop #> 3 Besalisk 198 drop #> 4 Cerean 198 drop #> 5 Chagrian 196 drop #> 6 Clawdite 168 drop #> # Use as.data.table()/as.data.frame()/as_tibble() to access results .Last.value %>% as_tibble() EQMZSͰͷॲཧ ࣮ߦ݁ՌΛ ؀ڥʹฦ͢
  32. "OE.PSFʜ UJEZUBCMF EUQMZS͕αϙʔτ͢Δͷ͸EQMZS UJEZSͷؔ਺΋ؚΊΔ UJEZGBTUʜUJEZSͷQJWPU@ ͳͲ UJEZGTUʜGTUΛ࢖͍ͬͯͯߴ଎ ʢͬΆ͍ʣ

  33. ·ͱΊ EBUBUBCMF   σʔλͷૢ࡞͸<ԋࢉࢠ಺Ͱ ⁠ UJEZWFSTFʹྼΒͳ͍ػೳ ⁞ ߴ଎ɺϝϞϦར༻ޮ཰΋ྑ͍ ؔ࿈ɺ֦ுύοέʔδ΋ॆ࣮ ⁡

  34. Enjoy ☝ ऴ IUUQTHJUIVCDPNTQPOTPSTVSJCP