Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up
for free
今こそ、data.tableを学ぼう! / datatable1130
Uryu Shinya
August 01, 2020
Programming
5
4.8k
今こそ、data.tableを学ぼう! / datatable1130
Tokyo.R#87
http://tokyor.connpass.com/event/182672/
での発表スライド
Uryu Shinya
August 01, 2020
Tweet
Share
More Decks by Uryu Shinya
See All by Uryu Shinya
s_uryu
2
1.8k
s_uryu
1
1.3k
s_uryu
2
310
s_uryu
1
3k
s_uryu
6
3.9k
s_uryu
40
9.1k
s_uryu
3
2k
s_uryu
4
1k
s_uryu
2
2.4k
Other Decks in Programming
See All in Programming
yumcyawiz
4
640
shigeruoda
0
480
taoshotaro
1
370
rshindo
2
300
showwin
0
130
ufoo68
1
180
akatsukinewgrad
0
210
line_developers_tw
1
490
junmikai
0
290
hr01
0
1.6k
azdaroth
0
140
hanhan1978
0
300
Featured
See All Featured
chriscoyier
145
19k
jasonvnalue
82
8.1k
destraynor
223
47k
malarkey
192
8.6k
swwweet
206
6.8k
paulrobertlloyd
71
1.4k
andyhume
62
3.4k
ufuk
56
5.4k
davidbonilla
70
3.5k
chrislema
231
16k
bkeepers
52
4.1k
addyosmani
310
21k
Transcript
data.table Shinya Uryu ( @u_ribo uribo) 20200801 Tokyo.R#87 =Λֶ΅͏ʂ ύοέʔδ
ࠓͦ͜ɺ DT[i, j, by] fifelse fcase .SD .N nafill .GRP
(JU)VC4QPOTPST ͷօ͞· IUUQTHJUIVCDPNTQPOTPSTVSJCP ఏڙ @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet
@ ito4303 @ ak9782427
࣍ σʔλૢ࡞ EBUBUBCMFͷ֦͕Γ ⁞ EBUBUBCMFͷར
MJCSBSZ EBUBUBCMF WFSTJPO 4JODFPO$3"/ ⁞
EBUBUBCMF &YUFOTJPOPGEBUBGSBNF ʢେنσʔλͷʣޮతͳॲཧɺ ߴͳಡΈॻ͖͕ಘҙ EBUBUBCMFΫϥε %5 Λఏڙ EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞ IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF
group_by join data.table 13s 92s dplyr 157s 392s datatable 72s
1301s pandas 97s 1514s dask 127s out of memory DataFrames.jl 43s 5755s spark 37s 369s ੑೳଌఆ IUUQTIPBJHJUIVCJPECCFODINBSL (# ߦྻ
ੑೳଌఆ .# ߦྻ time data.table::fread() 2.85s fread()ʢcolClassࢦఆʣ 2.57s
readr::read_csv() 14.0s fst::read_fst() 2.44s
EQMZSͱͷൺֱ ґଘ͕গͳ͍ ϝϞϦར༻ޮʹҧ͍ อकతͳ։ൃʢWTϦϦʔεʣ EQMZSΑΓྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك <J K CZ>ܗࣜͷૢ࡞ੑҰ؏
EQMZSͱͷൺֱ ґଘͷগͳ͞ tools::package_dependencies("data.table", recursive = TRUE)[[1]] #> [1] "methods" tools::package_dependencies("dplyr",
recursive = TRUE)[[1]] #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr" #> [6] "methods" "R6" "rlang" "tibble" "tidyselect" #> [11] "utils" "vctrs" "cli" "crayon" "fansi" #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat" #> [21] "grDevices" "utf8" "tools" ⁞
EQMZSͱͷൺֱ ϝϞϦར༻ޮ $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu 5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUFEBUBUBCMF@NFNPSZ ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕ Θ͔ͣʹ༏ΕΔ நग़ͷաఔͰ
άϧʔϓͷूఔඇޮʁ EBUBUBCMF͕উΔ IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT
EQMZSͱͷൺֱ ϝϞϦར༻ޮ d <- data.table( x = rnorm(10)) address(d)
#> [1] “0x7f8afc694400" df_tbl <- copy(d) %>% as_tibble() address(df_tbl) #> [1] "0x7f8ace2fb1c8" d[, z := rnorm(10)] address(d) #> [1] "0x7f8afc694400" มʹॲཧΛՃ͑ͯ ΦϒδΣΫτ͕ڞ༗͞ΕΔ ϝϞϦΞυϨε͕มΘΒͳ͍ df_tbl <- df_tbl %>% mutate(z = rnorm(1e6)) address(df_tbl) #> [1] "0x7f8afc68eea8" UJEZWFSTF ϝϞϦΞυϨε͕มΘΔ
ͳͥޮ͕ྑ͘ɺߴͳͷ͔ $ݴޠ $ Ͱͳ͍ ʹΑΔ࣮ TIBMMPXίϐʔͰͷྻͷͷ Ճɾߋ৽ɾআ Ωʔར༻ͱόΠφϦαʔν ʹΑΔݕࡧ
BSSBOHF GJMUFS DT[i, j, by] EBUBUBCMFߏจ NVUBUF TFMFDU HSPVQ@CZ
TVNNBSJTF ର ૢ࡞ ू߹
df %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop")
id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 20 2 16 JEྻͷ͝ͱʹWBMΛ߹ܭ͢Δ ྫ ⁞ DT[, .(val1_sum = sum(val1)), by = id]
id code val1 1 c 1 1 b 2 1
c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 12 2 16 DPEFCͷ݅Ͱྫ⁞ͱಉ͡ॲཧ ྫ df %>% filter(code != "b") %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") DT[code != "b", .(val1_sum = sum(val1)), by = id]
id val1_sum val2_sum 1 12 36 2 16 40 DPEFCͷ݅ɺJE͝ͱʹWBMྻͷ߹ܭ
ྫ id code val1 val2 1 c 1 9 1 b 2 10 1 c 3 11 2 a 4 12 2 a 5 13 1 b 6 14 2 a 7 15 1 c 8 16 DT[code != "b", lapply(.SD, sum), by = id, .SDcols = patterns("^val")] 4VCTFUPGUIF%BUBUBCMF df %>% filter(code != "b") %>% group_by(id) %>% summarise(across(starts_with("val"), sum), .groups = "drop")
df %>% filter(val1 < 3) df %>% filter(val1 < 3
| val2 > 9) df %>% filter(val1 < 3 & val2 > 9) Jߦͷநग़ pMUFS DT[val1 < 3] DT[val1 < 3 | val2 > 9] # or݅ DT[val1 < 3 & val2 > 9] # and݅ df %>% filter(code %in% c("a", “d")) df %>% filter(!code %in% c("a", "b")) df %>% filter( stringr::str_detect(code, "^c$")) # on۟ DT[c("a", "d"),on = "code",nomatch = 0] # a,bҎ֎ DT[!code %in% c("a", "b")] # ਖ਼نදݱͰͷࢦఆ DT[like(code, "^c$")] จࣈྻ
Jߦͷฒͼସ͑ BSSBOHF DT[order(id)] # খ͍͞ॱ(ঢॱ) #> id code val1 val2
#> 1: 1 c 1 9 #> 2: 1 b 2 10 #> 3: 1 c 3 11 #> 4: 1 b 6 14 #> 5: 1 c 8 16 #> 6: 2 a 4 12 #> 7: 2 a 5 13 #> 8: 2 a 7 15 df %>% arrange(id) df %>% arrange(desc(id)) setorder(DT, -id, na.last = TRUE) DT # େ͖͍ॱ(߱ॱ) #> id code val1 val2 #> 1: 2 a 4 12 #> 2: 2 a 5 13 #> 3: 2 a 7 15 #> 4: 1 c 1 9 #> 5: 1 b 2 10 #> 6: 1 c 3 11 #> 7: 1 b 6 14 #> 8: 1 c 8 16
Kྻͷબ TFMFDU ୯Ұ ෳ ෳ DT[, "id"] DT[, list(id)] DT[,
.(id)] DT[, 1] cols <- c("val1", "val2") DT[, c(1, 2)] DT[, ..cols] DT[, cols, with = FALSE] DT[, list(val1, val2)] DT[, .(val1, val2)] จࣈྻϕΫτϧΛ ༩͑Δࡍʹඞཁ ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ df %>% select(id) df %>% select(val1, val2) ෳ আ DT[, !c("id")] DT[, !cols, with = FALSE] ˠ࣍ͷεϥΠυ Ͱߦͬͨํ͕ޮత MJTU ͷγϣʔτΧοτ
KͷՃɾՃɾআ NVUBUF ηΠνPSΞγΧԋࢉࢠ 1ZUIPOͰಋೖ +PFM(BSMJDI.JMMFS 64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO DT[, code_title =
toupper(code)] #> Error in `[.data.table`(DT, , code_title = toupper(code)): ΘΕ͍ͯͳ͍Ҿ (code_title = toupper(code)) DT[, code_title := toupper(code)] ݁Ռग़ྗ͞Εͳ͍ ෭࡞༻͕͋Δ dt_copy <- copy(DT) address(DT) #> [1] "0x7f953bc4b800" address(dt_copy) #> [1] "0x7f953bdb8800" EBUBUBCMFͰͷEFFQDPQZ
KͷՃɾՃɾআ NVUBUF DT[, code_title := toupper(code)] # Ճ DT[, code
:= toupper(code)] # ߋ৽ DT[, code := NULL] # আ df %>% mutate(code_title = toupper(code)) df %>% mutate(code = toupper(code)) df %>% mutate(code_title = NULL) df %>% mutate(val1 = val1 + 2, val2 = val2 + val1) ୯Ұ ෳ DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)] DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)] ࠨลӈล ݅ DT[c("a", "c"), val2 := val2 * 2, on = c("code")] DPEFྻͷ͕B DͰ͋ΕWBM
CZू TVNNBSJTF DT[, .(val1_sum = sum(val1)), by = code] DT[,
id := NULL][, lapply(.SD, sum), by = code, .SDcols = is.numeric] df %>% group_by(code) %>% summarise(val1_sum = sum(val1), .groups = “drop") df %>% select(!id) %>% group_by(code) %>% summarise(across(where(is.numeric), sum), .groups = "drop") EQMZSHSPVQ@CZ ͱಉ 4%DPMTʹ༩͑ΒΕͨྻ͕ 4%ʹ͞ΕΔ ୯Ұ ෳ .SDcols = val1:val2 .SDcols = c("val1", "val2") .SDcols = patterns("^val") ͜ΕΒͷࢦఆํ๏Ͱ0,
DT[, .(val1_sum = sum(val1)), keyby = code] #> code val1_sum
#> 1: a 16 #> 2: b 8 #> 3: c 12 άϧʔϓԽͷิॿػೳ / /(31 (31 DT[, .(val1_sum = sum(val1)), by = code] #> code val1_sum #> 1: c 12 #> 2: b 8 #> 3: a 16 LFZCZʹࢦఆͨ͠มͷฒͼͰฒͼସ͑Δ DT[, .N, by = id] #> id N #> 1: 1 5 #> 2: 2 3 DT[, .GRP, by = id] #> id GRP #> 1: 1 1 #> 2: 2 2 άϧʔϓมͷΧϯτ DT[, .NGRP, by = id] #> id GRP #> 1: 1 2 #> 2: 2 2 άϧʔϓ JE ͷݸ άϧʔϓมͷ֬ೝ LFZCZ
݁߹ KPJO PO۟͘͠NFSHF inner_join( flights, planes, by = "tailnum") library(nycflights13)
flights_dt <- as.data.table(flights) planes_dt <- as.data.table(planes) left_join( flights, planes, by = "tailnum") merge( flights_dt, planes_dt, all.x = TRUE, all.y = FALSE, by = "tailnum") flights_dt[planes_dt, on = "tailnum"]
ॎԣม library(tidyr) data("fish_encounters") #> # A tibble: 114 x 3
#> fish station seen #> <fct> <fct> <int> #> 1 4842 Release 1 #> 2 4842 I80_1 1 #> 3 4842 Lisbon 1 #> 4 4842 Rstr 1 #> 5 4842 Base_TD 1 #> 6 4842 BCE 1 #> 7 4842 BCW 1 #> 8 4842 BCE2 1 #> 9 4842 BCW2 1 #> 10 4842 MAE 1 #> # … with 104 more rows #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1 #> 1: 4842 1 1 1 1 1 1 1 1 1 #> 2: 4843 1 1 1 1 1 1 1 1 1 #> 3: 4844 1 1 1 1 1 1 1 1 1 …((লུ) fish_encounters %>% as.data.table() %>% dcast(... ~ station + seen, value.var = "seen") fish_encounters %>% pivot_wider(names_from = station, values_from = seen) ͋Μ·ΓΠέͯͳ͍ XJEFS ⁞
data("billboard") #> # A tibble: 317 x 79 #> artist
track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 #> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 #> # … with 314 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>, billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE) #> artist track date.entered week rank #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81 #> 4: 3 Doors Down Loser 2000-10-21 wk1 76 #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57 #> --- #> 5303: Creed Higher 1999-09-11 wk63 50 #> 5304: Lonestar Amazed 1999-06-05 wk63 45 #> 5305: Creed Higher 1999-09-11 wk64 50 #> 5306: Lonestar Amazed 1999-06-05 wk64 50 #> 5307: Creed Higher 1999-09-11 wk65 49 billboard %>% as.data.table() %>% melt(id = c("artist", "track", "date.entered"), measure = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE) ʜ ॎԣม MPOHFS UJEZGBTUͰੋඇ
ϕΫτϧૢ࡞ؔ fcase(x < 5L, 1L, x >= 5L, 3L) #
dplyr::case_when #> [1] 1 1 1 1 3 3 3 3 NA 3 GDBTF x <- c(seq_len(8), NA, 10L) fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE pGFMTF
ϕΫτϧૢ࡞ؔ fcoalesce(x, seq_len(10)) # dplyr::coalesce #> [1] 1 2 3
4 5 6 7 8 9 10 GDPBMFTDF OBpMM nafill(x, fill = 0) #> [1] 1 2 3 4 5 6 7 8 0 10
ؔ࿈ύοέʔδ
EUQMZS όοΫάϥϯυʹEBUBUBCMF EQMZS EBUBUBCMF ࣮ߦ EBUBUBCMFEUQMZSEQMZS EQMZSߏจ͕͑ͯ EBUBUBCMFʹ͍ۙύϑΥʔϚϯε
EUQMZS dt_starwars %>% group_by(species) %>% summarise(height = mean(height, na.rm =
TRUE), .groups = “drop") library(dtplyr) dt_starwars <- lazy_dt(starwars) #> Source: local data table [?? x 3] #> Call: `_DT1`[, .(height = mean(height, #> na.rm = TRUE), .groups = "drop"), #> keyby = .(species)] #> #> species height .groups #> <chr> <dbl> <chr> #> 1 <NA> 181. drop #> 2 Aleena 79 drop #> 3 Besalisk 198 drop #> 4 Cerean 198 drop #> 5 Chagrian 196 drop #> 6 Clawdite 168 drop #> # Use as.data.table()/as.data.frame()/as_tibble() to access results .Last.value %>% as_tibble() EQMZSͰͷॲཧ ࣮ߦ݁ՌΛ ڥʹฦ͢
"OE.PSFʜ UJEZUBCMF EUQMZS͕αϙʔτ͢ΔͷEQMZS UJEZSͷؚؔΊΔ UJEZGBTUʜUJEZSͷQJWPU@ ͳͲ UJEZGTUʜGTUΛ͍ͬͯͯߴ ʢͬΆ͍ʣ
·ͱΊ EBUBUBCMF σʔλͷૢ࡞<ԋࢉࢠͰ UJEZWFSTFʹྼΒͳ͍ػೳ ⁞ ߴɺϝϞϦར༻ޮྑ͍ ؔ࿈ɺ֦ுύοέʔδॆ࣮
Enjoy ☝ ऴ IUUQTHJUIVCDPNTQPOTPSTVSJCP