Slide 1

Slide 1 text

data.table Shinya Uryu ( @u_ribo uribo) 20200801 Tokyo.R#87 =Λֶ΅͏ʂ ύοέʔδ ࠓͦ͜ɺ DT[i, j, by] fifelse fcase .SD .N nafill .GRP

Slide 2

Slide 2 text

(JU)VC4QPOTPST ͷօ͞· IUUQTHJUIVCDPNTQPOTPSTVSJCP ఏڙ @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet @ ito4303 @ ak9782427

Slide 3

Slide 3 text

໨࣍   σʔλૢ࡞ ⁠ EBUBUBCMFͷ֦͕Γ ⁞ EBUBUBCMFͷར఺

Slide 4

Slide 4 text

MJCSBSZ EBUBUBCMF WFSTJPO 4JODFPO$3"/ ⁞

Slide 5

Slide 5 text

EBUBUBCMF &YUFOTJPOPGEBUBGSBNF ʢେن໛σʔλͷʣޮ཰తͳॲཧɺ ߴ଎ͳಡΈॻ͖͕ಘҙ EBUBUBCMFΫϥε %5 Λఏڙ EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞ IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF

Slide 6

Slide 6 text

group_by join data.table 13s 92s dplyr 157s 392s datatable 72s 1301s pandas 97s 1514s dask 127s out of memory DataFrames.jl 43s 5755s spark 37s 369s ੑೳଌఆ IUUQTIPBJHJUIVCJPECCFODINBSL (# ߦྻ

Slide 7

Slide 7 text

ੑೳଌఆ .# ߦྻ time data.table::fread() 2.85s fread()ʢcolClassࢦఆʣ 2.57s readr::read_csv() 14.0s fst::read_fst() 2.44s

Slide 8

Slide 8 text

EQMZSͱͷൺֱ ґଘ͕গͳ͍ ϝϞϦར༻ޮ཰ʹҧ͍ อकతͳ։ൃʢWTϦϦʔε଎౓ʣ EQMZSΑΓ΋ྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك ܗࣜͷૢ࡞ੑ͸Ұ؏

Slide 9

Slide 9 text

EQMZSͱͷൺֱ ґଘͷগͳ͞ tools::package_dependencies("data.table", recursive = TRUE)[[1]] #> [1] "methods" tools::package_dependencies("dplyr", recursive = TRUE)[[1]] #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr" #> [6] "methods" "R6" "rlang" "tibble" "tidyselect" #> [11] "utils" "vctrs" "cli" "crayon" "fansi" #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat" #> [21] "grDevices" "utf8" "tools" ⁞

Slide 10

Slide 10 text

EQMZSͱͷൺֱ ϝϞϦར༻ޮ཰   $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu 5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUFEBUBUBCMF@NFNPSZ ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕ Θ͔ͣʹ༏ΕΔ நग़ͷաఔͰ͸ άϧʔϓͷू໿޻ఔ͸ඇޮ཰ʁ EBUBUBCMF͕উΔ IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT

Slide 11

Slide 11 text

EQMZSͱͷൺֱ ϝϞϦར༻ޮ཰   d <- data.table( x = rnorm(10)) address(d) #> [1] “0x7f8afc694400" df_tbl <- copy(d) %>% as_tibble() address(df_tbl) #> [1] "0x7f8ace2fb1c8" d[, z := rnorm(10)] address(d) #> [1] "0x7f8afc694400" ม਺ʹॲཧΛՃ͑ͯ΋ ΦϒδΣΫτ͕ڞ༗͞ΕΔ ϝϞϦΞυϨε͕มΘΒͳ͍ df_tbl <- df_tbl %>% mutate(z = rnorm(1e6)) address(df_tbl) #> [1] "0x7f8afc68eea8" UJEZWFSTF ϝϞϦΞυϨε͕มΘΔ

Slide 12

Slide 12 text

ͳͥޮ཰͕ྑ͘ɺߴ଎ͳͷ͔ $ݴޠ $Ͱ͸ͳ͍ ʹΑΔ࣮૷ TIBMMPXίϐʔͰͷྻͷ஋ͷ ௥Ճɾߋ৽ɾ࡟আ Ωʔར༻ͱόΠφϦαʔν ʹΑΔݕࡧ

Slide 13

Slide 13 text

  BSSBOHF GJMUFS DT[i, j, by] EBUBUBCMFߏจ NVUBUF TFMFDU HSPVQ@CZ TVNNBSJTF ର৅ ૢ࡞ ू߹

Slide 14

Slide 14 text

df %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 20 2 16 JEྻͷ஋͝ͱʹWBMΛ߹ܭ͢Δ ྫ ⁞ DT[, .(val1_sum = sum(val1)), by = id]

Slide 15

Slide 15 text

id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 12 2 16 DPEFCͷ৚݅Ͱྫ⁞ͱಉ͡ॲཧ ྫ   df %>% filter(code != "b") %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") DT[code != "b", .(val1_sum = sum(val1)), by = id]

Slide 16

Slide 16 text

id val1_sum val2_sum 1 12 36 2 16 40 DPEFCͷ৚݅ɺJE͝ͱʹWBMྻͷ߹ܭ ྫ ⁠ id code val1 val2 1 c 1 9 1 b 2 10 1 c 3 11 2 a 4 12 2 a 5 13 1 b 6 14 2 a 7 15 1 c 8 16 DT[code != "b", lapply(.SD, sum), by = id, .SDcols = patterns("^val")] 4VCTFUPGUIF%BUBUBCMF df %>% filter(code != "b") %>% group_by(id) %>% summarise(across(starts_with("val"), sum), .groups = "drop")

Slide 17

Slide 17 text

df %>% filter(val1 < 3) df %>% filter(val1 < 3 | val2 > 9) df %>% filter(val1 < 3 & val2 > 9) Jߦͷநग़ pMUFS DT[val1 < 3] DT[val1 < 3 | val2 > 9] # or৚݅ DT[val1 < 3 & val2 > 9] # and৚݅ df %>% filter(code %in% c("a", “d")) df %>% filter(!code %in% c("a", "b")) df %>% filter( stringr::str_detect(code, "^c$")) # on۟ DT[c("a", "d"),on = "code",nomatch = 0] # a,bҎ֎ DT[!code %in% c("a", "b")] # ਖ਼نදݱͰͷࢦఆ DT[like(code, "^c$")] ਺஋ จࣈྻ

Slide 18

Slide 18 text

Jߦͷฒͼସ͑ BSSBOHF DT[order(id)] # খ͍͞ॱ(ঢॱ) #> id code val1 val2 #> 1: 1 c 1 9 #> 2: 1 b 2 10 #> 3: 1 c 3 11 #> 4: 1 b 6 14 #> 5: 1 c 8 16 #> 6: 2 a 4 12 #> 7: 2 a 5 13 #> 8: 2 a 7 15 df %>% arrange(id) df %>% arrange(desc(id)) setorder(DT, -id, na.last = TRUE) DT # େ͖͍ॱ(߱ॱ) #> id code val1 val2 #> 1: 2 a 4 12 #> 2: 2 a 5 13 #> 3: 2 a 7 15 #> 4: 1 c 1 9 #> 5: 1 b 2 10 #> 6: 1 c 3 11 #> 7: 1 b 6 14 #> 8: 1 c 8 16

Slide 19

Slide 19 text

Kྻͷબ୒ TFMFDU ୯Ұ ෳ਺ ෳ਺ DT[, "id"] DT[, list(id)] DT[, .(id)] DT[, 1] cols <- c("val1", "val2") DT[, c(1, 2)] DT[, ..cols] DT[, cols, with = FALSE] DT[, list(val1, val2)] DT[, .(val1, val2)] จࣈྻϕΫτϧΛ ༩͑Δࡍʹ͸ඞཁ ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ df %>% select(id) df %>% select(val1, val2) ෳ਺ ࡟আ DT[, !c("id")] DT[, !cols, with = FALSE] ˠ࣍ͷεϥΠυ Ͱߦͬͨํ͕ޮ཰త ͸MJTU ͷγϣʔτΧοτ

Slide 20

Slide 20 text

K஋ͷ௥ՃɾՃ޻ɾ࡟আ NVUBUF ηΠ΢νPSΞγΧԋࢉࢠ 1ZUIPOͰ΋ಋೖ +PFM(BSMJDI.JMMFS 64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO DT[, code_title = toupper(code)] #> Error in `[.data.table`(DT, , code_title = toupper(code)): ࢖ΘΕ͍ͯͳ͍Ҿ਺ (code_title = toupper(code)) DT[, code_title := toupper(code)] ݁Ռ͸ग़ྗ͞Εͳ͍ ෭࡞༻͕͋Δ dt_copy <- copy(DT) address(DT) #> [1] "0x7f953bc4b800" address(dt_copy) #> [1] "0x7f953bdb8800" EBUBUBCMFͰͷEFFQDPQZ

Slide 21

Slide 21 text

K஋ͷ௥ՃɾՃ޻ɾ࡟আ NVUBUF DT[, code_title := toupper(code)] # ௥Ճ DT[, code := toupper(code)] # ߋ৽ DT[, code := NULL] # ࡟আ df %>% mutate(code_title = toupper(code)) df %>% mutate(code = toupper(code)) df %>% mutate(code_title = NULL) df %>% mutate(val1 = val1 + 2, val2 = val2 + val1) ୯Ұ ෳ਺ DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)] DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)] ࠨลӈล ৚݅ DT[c("a", "c"), val2 := val2 * 2, on = c("code")] DPEFྻͷ஋͕B DͰ͋Ε͹WBM

Slide 22

Slide 22 text

CZू໿ TVNNBSJTF DT[, .(val1_sum = sum(val1)), by = code] DT[, id := NULL][, lapply(.SD, sum), by = code, .SDcols = is.numeric] df %>% group_by(code) %>% summarise(val1_sum = sum(val1), .groups = “drop") df %>% select(!id) %>% group_by(code) %>% summarise(across(where(is.numeric), sum), .groups = "drop") EQMZSHSPVQ@CZ ͱಉ౳ 4%DPMTʹ༩͑ΒΕͨྻ͕ 4%ʹ౉͞ΕΔ ୯Ұ ෳ਺ .SDcols = val1:val2 .SDcols = c("val1", "val2") .SDcols = patterns("^val") ͜ΕΒͷࢦఆํ๏Ͱ΋0,

Slide 23

Slide 23 text

DT[, .(val1_sum = sum(val1)), keyby = code] #> code val1_sum #> 1: a 16 #> 2: b 8 #> 3: c 12 άϧʔϓԽͷิॿػೳ / /(31 (31 DT[, .(val1_sum = sum(val1)), by = code] #> code val1_sum #> 1: c 12 #> 2: b 8 #> 3: a 16 LFZCZʹࢦఆͨ͠ม਺ͷฒͼͰฒͼସ͑Δ DT[, .N, by = id] #> id N #> 1: 1 5 #> 2: 2 3 DT[, .GRP, by = id] #> id GRP #> 1: 1 1 #> 2: 2 2 άϧʔϓม਺ͷΧ΢ϯτ DT[, .NGRP, by = id] #> id GRP #> 1: 1 2 #> 2: 2 2 άϧʔϓ JE ͷݸ਺ άϧʔϓม਺ͷ֬ೝ LFZCZ

Slide 24

Slide 24 text

݁߹ KPJO PO۟΋͘͠͸NFSHF inner_join( flights, planes, by = "tailnum") library(nycflights13) flights_dt <- as.data.table(flights) planes_dt <- as.data.table(planes) left_join( flights, planes, by = "tailnum") merge( flights_dt, planes_dt, all.x = TRUE, all.y = FALSE, by = "tailnum") flights_dt[planes_dt, on = "tailnum"]

Slide 25

Slide 25 text

ॎԣม׵ library(tidyr) data("fish_encounters") #> # A tibble: 114 x 3 #> fish station seen #> #> 1 4842 Release 1 #> 2 4842 I80_1 1 #> 3 4842 Lisbon 1 #> 4 4842 Rstr 1 #> 5 4842 Base_TD 1 #> 6 4842 BCE 1 #> 7 4842 BCW 1 #> 8 4842 BCE2 1 #> 9 4842 BCW2 1 #> 10 4842 MAE 1 #> # … with 104 more rows #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1 #> 1: 4842 1 1 1 1 1 1 1 1 1 #> 2: 4843 1 1 1 1 1 1 1 1 1 #> 3: 4844 1 1 1 1 1 1 1 1 1 …((লུ) fish_encounters %>% as.data.table() %>% dcast(... ~ station + seen, value.var = "seen") fish_encounters %>% pivot_wider(names_from = station, values_from = seen) ͋Μ·ΓΠέͯͳ͍ XJEFS ⁞

Slide 26

Slide 26 text

data("billboard") #> # A tibble: 317 x 79 #> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 #> #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 #> # … with 314 more rows, and 68 more variables: wk9 , wk10 , billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE) #> artist track date.entered week rank #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81 #> 4: 3 Doors Down Loser 2000-10-21 wk1 76 #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57 #> --- #> 5303: Creed Higher 1999-09-11 wk63 50 #> 5304: Lonestar Amazed 1999-06-05 wk63 45 #> 5305: Creed Higher 1999-09-11 wk64 50 #> 5306: Lonestar Amazed 1999-06-05 wk64 50 #> 5307: Creed Higher 1999-09-11 wk65 49 billboard %>% as.data.table() %>% melt(id = c("artist", "track", "date.entered"), measure = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE) ৑௕ʜ ॎԣม׵ MPOHFS   UJEZGBTUͰੋඇ

Slide 27

Slide 27 text

ϕΫτϧૢ࡞ؔ਺ fcase(x < 5L, 1L, x >= 5L, 3L) # dplyr::case_when #> [1] 1 1 1 1 3 3 3 3 NA 3 GDBTF x <- c(seq_len(8), NA, 10L) fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE pGFMTF

Slide 28

Slide 28 text

ϕΫτϧૢ࡞ؔ਺ fcoalesce(x, seq_len(10)) # dplyr::coalesce #> [1] 1 2 3 4 5 6 7 8 9 10 GDPBMFTDF OBpMM nafill(x, fill = 0) #> [1] 1 2 3 4 5 6 7 8 0 10

Slide 29

Slide 29 text

ؔ࿈ύοέʔδ

Slide 30

Slide 30 text

EUQMZS όοΫάϥ΢ϯυʹEBUBUBCMF EQMZSEBUBUBCMF ࣮ߦ଎౓͸ EBUBUBCMFEUQMZSEQMZS EQMZSߏจ͕࢖͑ͯ EBUBUBCMFʹ͍ۙύϑΥʔϚϯε

Slide 31

Slide 31 text

EUQMZS dt_starwars %>% group_by(species) %>% summarise(height = mean(height, na.rm = TRUE), .groups = “drop") library(dtplyr) dt_starwars <- lazy_dt(starwars) #> Source: local data table [?? x 3] #> Call: `_DT1`[, .(height = mean(height, #> na.rm = TRUE), .groups = "drop"), #> keyby = .(species)] #> #> species height .groups #> #> 1 181. drop #> 2 Aleena 79 drop #> 3 Besalisk 198 drop #> 4 Cerean 198 drop #> 5 Chagrian 196 drop #> 6 Clawdite 168 drop #> # Use as.data.table()/as.data.frame()/as_tibble() to access results .Last.value %>% as_tibble() EQMZSͰͷॲཧ ࣮ߦ݁ՌΛ ؀ڥʹฦ͢

Slide 32

Slide 32 text

"OE.PSFʜ UJEZUBCMF EUQMZS͕αϙʔτ͢Δͷ͸EQMZS UJEZSͷؔ਺΋ؚΊΔ UJEZGBTUʜUJEZSͷQJWPU@ ͳͲ UJEZGTUʜGTUΛ࢖͍ͬͯͯߴ଎ ʢͬΆ͍ʣ

Slide 33

Slide 33 text

·ͱΊ EBUBUBCMF   σʔλͷૢ࡞͸<ԋࢉࢠ಺Ͱ ⁠ UJEZWFSTFʹྼΒͳ͍ػೳ ⁞ ߴ଎ɺϝϞϦར༻ޮ཰΋ྑ͍ ؔ࿈ɺ֦ுύοέʔδ΋ॆ࣮ ⁡

Slide 34

Slide 34 text

Enjoy ☝ ऴ IUUQTHJUIVCDPNTQPOTPSTVSJCP