Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up for free
今こそ、data.tableを学ぼう! / datatable1130
Uryu Shinya
August 01, 2020
Programming
6
7.8k
今こそ、data.tableを学ぼう! / datatable1130
Tokyo.R#87
http://tokyor.connpass.com/event/182672/
での発表スライド
Uryu Shinya
August 01, 2020
Tweet
Share
More Decks by Uryu Shinya
See All by Uryu Shinya
データ分析入門 / tokupon-ds2022
s_uryu
3
3.4k
Quartoを使ってみませんか / quarto_get_started
s_uryu
4
820
OSSベースでのRパッケージ開発のすすめ / rjpusers2021rpkgdev
s_uryu
0
640
{targets}でワークフローを管理せよ / Workflow management with targets
s_uryu
2
910
地図を描画する / ggplot_map
s_uryu
2
590
R 4.1.0で導入された パイプ演算子 (|>)の紹介 / r_native_pipe
s_uryu
2
1.7k
Rで家を光らせる / tokyor90_r2hue
s_uryu
2
530
コロナ禍における住居探し / iekaitai202001
s_uryu
0
550
tidymodelsで覚えるRでのモデル構築と運用 / tidymodels2020
s_uryu
8
3.9k
Other Decks in Programming
See All in Programming
Enzyme から React Native Testing Library に移行した経緯 / 2022-07-20
tamago3keran
1
160
OSS貢献を気軽にしたい Let's Go Talk #1
yuyaabo
2
230
20220706_Google Apps Scriptを実演で学ぶ~ GAS × Slack ~
apachan
2
610
ふんわり理解するcontext
rukiadia
1
170
Google I/O 2022 Android関連概要 / Google I/O 2022 Android summary
phicdy
0
360
AWS Config Custom Rule、ノーコードでできるかな?
watany
0
250
More Than Micro Frontends: 3 Further Use Cases for Module Federation @DWX 2022
manfredsteyer
PRO
0
340
リーダブルテストコード / #vstat
jnchito
46
34k
There's an API for that!
mariatta
PRO
0
100
プロダクトのタイプ別 GraphQL クライアントの選び方
shozawa
0
8.1k
Pluggable Storage in PostgreSQL
sira
1
180
Lookerとdbtの共存
ttccddtoki
0
610
Featured
See All Featured
Bash Introduction
62gerente
598
210k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
655
120k
Designing Experiences People Love
moore
130
22k
Adopting Sorbet at Scale
ufuk
63
7.6k
Embracing the Ebb and Flow
colly
73
3.4k
Why Our Code Smells
bkeepers
PRO
324
55k
It's Worth the Effort
3n
172
26k
Building an army of robots
kneath
299
40k
Why You Should Never Use an ORM
jnunemaker
PRO
47
7.6k
How GitHub (no longer) Works
holman
297
140k
The Invisible Side of Design
smashingmag
290
48k
Designing with Data
zakiwarfel
91
4k
Transcript
data.table Shinya Uryu ( @u_ribo uribo) 20200801 Tokyo.R#87 =Λֶ΅͏ʂ ύοέʔδ
ࠓͦ͜ɺ DT[i, j, by] fifelse fcase .SD .N nafill .GRP
(JU)VC4QPOTPST ͷօ͞· IUUQTHJUIVCDPNTQPOTPSTVSJCP ఏڙ @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet
@ ito4303 @ ak9782427
࣍ σʔλૢ࡞ EBUBUBCMFͷ֦͕Γ ⁞ EBUBUBCMFͷར
MJCSBSZ EBUBUBCMF WFSTJPO 4JODFPO$3"/ ⁞
EBUBUBCMF &YUFOTJPOPGEBUBGSBNF ʢେنσʔλͷʣޮతͳॲཧɺ ߴͳಡΈॻ͖͕ಘҙ EBUBUBCMFΫϥε %5 Λఏڙ EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞ IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF
group_by join data.table 13s 92s dplyr 157s 392s datatable 72s
1301s pandas 97s 1514s dask 127s out of memory DataFrames.jl 43s 5755s spark 37s 369s ੑೳଌఆ IUUQTIPBJHJUIVCJPECCFODINBSL (# ߦྻ
ੑೳଌఆ .# ߦྻ time data.table::fread() 2.85s fread()ʢcolClassࢦఆʣ 2.57s
readr::read_csv() 14.0s fst::read_fst() 2.44s
EQMZSͱͷൺֱ ґଘ͕গͳ͍ ϝϞϦར༻ޮʹҧ͍ อकతͳ։ൃʢWTϦϦʔεʣ EQMZSΑΓྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك <J K CZ>ܗࣜͷૢ࡞ੑҰ؏
EQMZSͱͷൺֱ ґଘͷগͳ͞ tools::package_dependencies("data.table", recursive = TRUE)[[1]] #> [1] "methods" tools::package_dependencies("dplyr",
recursive = TRUE)[[1]] #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr" #> [6] "methods" "R6" "rlang" "tibble" "tidyselect" #> [11] "utils" "vctrs" "cli" "crayon" "fansi" #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat" #> [21] "grDevices" "utf8" "tools" ⁞
EQMZSͱͷൺֱ ϝϞϦར༻ޮ $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu 5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUFEBUBUBCMF@NFNPSZ ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕ Θ͔ͣʹ༏ΕΔ நग़ͷաఔͰ
άϧʔϓͷूఔඇޮʁ EBUBUBCMF͕উΔ IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT
EQMZSͱͷൺֱ ϝϞϦར༻ޮ d <- data.table( x = rnorm(10)) address(d)
#> [1] “0x7f8afc694400" df_tbl <- copy(d) %>% as_tibble() address(df_tbl) #> [1] "0x7f8ace2fb1c8" d[, z := rnorm(10)] address(d) #> [1] "0x7f8afc694400" มʹॲཧΛՃ͑ͯ ΦϒδΣΫτ͕ڞ༗͞ΕΔ ϝϞϦΞυϨε͕มΘΒͳ͍ df_tbl <- df_tbl %>% mutate(z = rnorm(1e6)) address(df_tbl) #> [1] "0x7f8afc68eea8" UJEZWFSTF ϝϞϦΞυϨε͕มΘΔ
ͳͥޮ͕ྑ͘ɺߴͳͷ͔ $ݴޠ $ Ͱͳ͍ ʹΑΔ࣮ TIBMMPXίϐʔͰͷྻͷͷ Ճɾߋ৽ɾআ Ωʔར༻ͱόΠφϦαʔν ʹΑΔݕࡧ
BSSBOHF GJMUFS DT[i, j, by] EBUBUBCMFߏจ NVUBUF TFMFDU HSPVQ@CZ
TVNNBSJTF ର ૢ࡞ ू߹
df %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop")
id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 20 2 16 JEྻͷ͝ͱʹWBMΛ߹ܭ͢Δ ྫ ⁞ DT[, .(val1_sum = sum(val1)), by = id]
id code val1 1 c 1 1 b 2 1
c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 12 2 16 DPEFCͷ݅Ͱྫ⁞ͱಉ͡ॲཧ ྫ df %>% filter(code != "b") %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") DT[code != "b", .(val1_sum = sum(val1)), by = id]
id val1_sum val2_sum 1 12 36 2 16 40 DPEFCͷ݅ɺJE͝ͱʹWBMྻͷ߹ܭ
ྫ id code val1 val2 1 c 1 9 1 b 2 10 1 c 3 11 2 a 4 12 2 a 5 13 1 b 6 14 2 a 7 15 1 c 8 16 DT[code != "b", lapply(.SD, sum), by = id, .SDcols = patterns("^val")] 4VCTFUPGUIF%BUBUBCMF df %>% filter(code != "b") %>% group_by(id) %>% summarise(across(starts_with("val"), sum), .groups = "drop")
df %>% filter(val1 < 3) df %>% filter(val1 < 3
| val2 > 9) df %>% filter(val1 < 3 & val2 > 9) Jߦͷநग़ pMUFS DT[val1 < 3] DT[val1 < 3 | val2 > 9] # or݅ DT[val1 < 3 & val2 > 9] # and݅ df %>% filter(code %in% c("a", “d")) df %>% filter(!code %in% c("a", "b")) df %>% filter( stringr::str_detect(code, "^c$")) # on۟ DT[c("a", "d"),on = "code",nomatch = 0] # a,bҎ֎ DT[!code %in% c("a", "b")] # ਖ਼نදݱͰͷࢦఆ DT[like(code, "^c$")] จࣈྻ
Jߦͷฒͼସ͑ BSSBOHF DT[order(id)] # খ͍͞ॱ(ঢॱ) #> id code val1 val2
#> 1: 1 c 1 9 #> 2: 1 b 2 10 #> 3: 1 c 3 11 #> 4: 1 b 6 14 #> 5: 1 c 8 16 #> 6: 2 a 4 12 #> 7: 2 a 5 13 #> 8: 2 a 7 15 df %>% arrange(id) df %>% arrange(desc(id)) setorder(DT, -id, na.last = TRUE) DT # େ͖͍ॱ(߱ॱ) #> id code val1 val2 #> 1: 2 a 4 12 #> 2: 2 a 5 13 #> 3: 2 a 7 15 #> 4: 1 c 1 9 #> 5: 1 b 2 10 #> 6: 1 c 3 11 #> 7: 1 b 6 14 #> 8: 1 c 8 16
Kྻͷબ TFMFDU ୯Ұ ෳ ෳ DT[, "id"] DT[, list(id)] DT[,
.(id)] DT[, 1] cols <- c("val1", "val2") DT[, c(1, 2)] DT[, ..cols] DT[, cols, with = FALSE] DT[, list(val1, val2)] DT[, .(val1, val2)] จࣈྻϕΫτϧΛ ༩͑Δࡍʹඞཁ ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ df %>% select(id) df %>% select(val1, val2) ෳ আ DT[, !c("id")] DT[, !cols, with = FALSE] ˠ࣍ͷεϥΠυ Ͱߦͬͨํ͕ޮత MJTU ͷγϣʔτΧοτ
KͷՃɾՃɾআ NVUBUF ηΠνPSΞγΧԋࢉࢠ 1ZUIPOͰಋೖ +PFM(BSMJDI.JMMFS 64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO DT[, code_title =
toupper(code)] #> Error in `[.data.table`(DT, , code_title = toupper(code)): ΘΕ͍ͯͳ͍Ҿ (code_title = toupper(code)) DT[, code_title := toupper(code)] ݁Ռग़ྗ͞Εͳ͍ ෭࡞༻͕͋Δ dt_copy <- copy(DT) address(DT) #> [1] "0x7f953bc4b800" address(dt_copy) #> [1] "0x7f953bdb8800" EBUBUBCMFͰͷEFFQDPQZ
KͷՃɾՃɾআ NVUBUF DT[, code_title := toupper(code)] # Ճ DT[, code
:= toupper(code)] # ߋ৽ DT[, code := NULL] # আ df %>% mutate(code_title = toupper(code)) df %>% mutate(code = toupper(code)) df %>% mutate(code_title = NULL) df %>% mutate(val1 = val1 + 2, val2 = val2 + val1) ୯Ұ ෳ DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)] DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)] ࠨลӈล ݅ DT[c("a", "c"), val2 := val2 * 2, on = c("code")] DPEFྻͷ͕B DͰ͋ΕWBM
CZू TVNNBSJTF DT[, .(val1_sum = sum(val1)), by = code] DT[,
id := NULL][, lapply(.SD, sum), by = code, .SDcols = is.numeric] df %>% group_by(code) %>% summarise(val1_sum = sum(val1), .groups = “drop") df %>% select(!id) %>% group_by(code) %>% summarise(across(where(is.numeric), sum), .groups = "drop") EQMZSHSPVQ@CZ ͱಉ 4%DPMTʹ༩͑ΒΕͨྻ͕ 4%ʹ͞ΕΔ ୯Ұ ෳ .SDcols = val1:val2 .SDcols = c("val1", "val2") .SDcols = patterns("^val") ͜ΕΒͷࢦఆํ๏Ͱ0,
DT[, .(val1_sum = sum(val1)), keyby = code] #> code val1_sum
#> 1: a 16 #> 2: b 8 #> 3: c 12 άϧʔϓԽͷิॿػೳ / /(31 (31 DT[, .(val1_sum = sum(val1)), by = code] #> code val1_sum #> 1: c 12 #> 2: b 8 #> 3: a 16 LFZCZʹࢦఆͨ͠มͷฒͼͰฒͼସ͑Δ DT[, .N, by = id] #> id N #> 1: 1 5 #> 2: 2 3 DT[, .GRP, by = id] #> id GRP #> 1: 1 1 #> 2: 2 2 άϧʔϓมͷΧϯτ DT[, .NGRP, by = id] #> id GRP #> 1: 1 2 #> 2: 2 2 άϧʔϓ JE ͷݸ άϧʔϓมͷ֬ೝ LFZCZ
݁߹ KPJO PO۟͘͠NFSHF inner_join( flights, planes, by = "tailnum") library(nycflights13)
flights_dt <- as.data.table(flights) planes_dt <- as.data.table(planes) left_join( flights, planes, by = "tailnum") merge( flights_dt, planes_dt, all.x = TRUE, all.y = FALSE, by = "tailnum") flights_dt[planes_dt, on = "tailnum"]
ॎԣม library(tidyr) data("fish_encounters") #> # A tibble: 114 x 3
#> fish station seen #> <fct> <fct> <int> #> 1 4842 Release 1 #> 2 4842 I80_1 1 #> 3 4842 Lisbon 1 #> 4 4842 Rstr 1 #> 5 4842 Base_TD 1 #> 6 4842 BCE 1 #> 7 4842 BCW 1 #> 8 4842 BCE2 1 #> 9 4842 BCW2 1 #> 10 4842 MAE 1 #> # … with 104 more rows #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1 #> 1: 4842 1 1 1 1 1 1 1 1 1 #> 2: 4843 1 1 1 1 1 1 1 1 1 #> 3: 4844 1 1 1 1 1 1 1 1 1 …((লུ) fish_encounters %>% as.data.table() %>% dcast(... ~ station + seen, value.var = "seen") fish_encounters %>% pivot_wider(names_from = station, values_from = seen) ͋Μ·ΓΠέͯͳ͍ XJEFS ⁞
data("billboard") #> # A tibble: 317 x 79 #> artist
track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 #> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 #> # … with 314 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>, billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE) #> artist track date.entered week rank #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81 #> 4: 3 Doors Down Loser 2000-10-21 wk1 76 #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57 #> --- #> 5303: Creed Higher 1999-09-11 wk63 50 #> 5304: Lonestar Amazed 1999-06-05 wk63 45 #> 5305: Creed Higher 1999-09-11 wk64 50 #> 5306: Lonestar Amazed 1999-06-05 wk64 50 #> 5307: Creed Higher 1999-09-11 wk65 49 billboard %>% as.data.table() %>% melt(id = c("artist", "track", "date.entered"), measure = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE) ʜ ॎԣม MPOHFS UJEZGBTUͰੋඇ
ϕΫτϧૢ࡞ؔ fcase(x < 5L, 1L, x >= 5L, 3L) #
dplyr::case_when #> [1] 1 1 1 1 3 3 3 3 NA 3 GDBTF x <- c(seq_len(8), NA, 10L) fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE pGFMTF
ϕΫτϧૢ࡞ؔ fcoalesce(x, seq_len(10)) # dplyr::coalesce #> [1] 1 2 3
4 5 6 7 8 9 10 GDPBMFTDF OBpMM nafill(x, fill = 0) #> [1] 1 2 3 4 5 6 7 8 0 10
ؔ࿈ύοέʔδ
EUQMZS όοΫάϥϯυʹEBUBUBCMF EQMZS EBUBUBCMF ࣮ߦ EBUBUBCMFEUQMZSEQMZS EQMZSߏจ͕͑ͯ EBUBUBCMFʹ͍ۙύϑΥʔϚϯε
EUQMZS dt_starwars %>% group_by(species) %>% summarise(height = mean(height, na.rm =
TRUE), .groups = “drop") library(dtplyr) dt_starwars <- lazy_dt(starwars) #> Source: local data table [?? x 3] #> Call: `_DT1`[, .(height = mean(height, #> na.rm = TRUE), .groups = "drop"), #> keyby = .(species)] #> #> species height .groups #> <chr> <dbl> <chr> #> 1 <NA> 181. drop #> 2 Aleena 79 drop #> 3 Besalisk 198 drop #> 4 Cerean 198 drop #> 5 Chagrian 196 drop #> 6 Clawdite 168 drop #> # Use as.data.table()/as.data.frame()/as_tibble() to access results .Last.value %>% as_tibble() EQMZSͰͷॲཧ ࣮ߦ݁ՌΛ ڥʹฦ͢
"OE.PSFʜ UJEZUBCMF EUQMZS͕αϙʔτ͢ΔͷEQMZS UJEZSͷؚؔΊΔ UJEZGBTUʜUJEZSͷQJWPU@ ͳͲ UJEZGTUʜGTUΛ͍ͬͯͯߴ ʢͬΆ͍ʣ
·ͱΊ EBUBUBCMF σʔλͷૢ࡞<ԋࢉࢠͰ UJEZWFSTFʹྼΒͳ͍ػೳ ⁞ ߴɺϝϞϦར༻ޮྑ͍ ؔ࿈ɺ֦ுύοέʔδॆ࣮
Enjoy ☝ ऴ IUUQTHJUIVCDPNTQPOTPSTVSJCP