Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up for free
今こそ、data.tableを学ぼう! / datatable1130
Uryu Shinya
August 01, 2020
Programming
6
8k
今こそ、data.tableを学ぼう! / datatable1130
Tokyo.R#87
http://tokyor.connpass.com/event/182672/
での発表スライド
Uryu Shinya
August 01, 2020
Tweet
Share
More Decks by Uryu Shinya
See All by Uryu Shinya
Rによる大規模データの処理
s_uryu
2
620
R言語入門 (R-4.2.2 2022年11月版) / introduction to r
s_uryu
2
1k
textlintr: Rユーザのためのtextlintによる校正
s_uryu
0
270
Rによるデータ可視化と地図表現
s_uryu
1
1.4k
データ分析入門 / tokupon-ds2022
s_uryu
3
4.3k
Quartoを使ってみませんか / quarto_get_started
s_uryu
4
1.9k
OSSベースでのRパッケージ開発のすすめ / rjpusers2021rpkgdev
s_uryu
0
800
{targets}でワークフローを管理せよ / Workflow management with targets
s_uryu
2
1.7k
地図を描画する / ggplot_map
s_uryu
2
730
Other Decks in Programming
See All in Programming
OSC大阪 パスワード認証は人類には早すぎる ~ IDaaSを使ったソーシャルログインのすすめ ~
authyasan
4
570
Use KMM to call the API of the National Tax Agency
akkeylab
0
290
監視せなあかんし、五大紙だけにオオカミってな🐺🐺🐺🐺🐺
sadnessojisan
2
1.3k
OSSから学んだPR Descriptionの書き方
fugakkbn
4
120
OIDC仕様に準拠した Makuake ID連携基盤構築の裏側
ymtdzzz
0
150
僕が考えた超最強のKMMアプリの作り方
spbaya0141
0
180
ipa-medit: Memory search and patch tool for IPA without Jailbreaking/ipa-medit-bh2022-europe
tkmru
0
130
AWSとCPUのムフフな関係
cmdemura
0
450
ちょうぜつ改め21世紀ふつうのソフトウェア設計
tanakahisateru
7
6.2k
WordPress(再)入門 - 基礎知識・環境編
oleindesign
1
120
domain層のモジュール化 / MoT TechTalk #15
mot_techtalk
0
100
Spring BootとKubernetesで実現する今どきのDevOps入門
xblood
0
340
Featured
See All Featured
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
31
20k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
318
19k
Producing Creativity
orderedlist
PRO
335
37k
What's in a price? How to price your products and services
michaelherold
233
9.7k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
349
27k
Imperfection Machines: The Place of Print at Facebook
scottboms
254
12k
A Philosophy of Restraint
colly
193
15k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
270
12k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
120
29k
Gamification - CAS2011
davidbonilla
75
4.1k
Stop Working from a Prison Cell
hatefulcrawdad
263
18k
Three Pipe Problems
jasonvnalue
89
8.9k
Transcript
data.table Shinya Uryu ( @u_ribo uribo) 20200801 Tokyo.R#87 =Λֶ΅͏ʂ ύοέʔδ
ࠓͦ͜ɺ DT[i, j, by] fifelse fcase .SD .N nafill .GRP
(JU)VC4QPOTPST ͷօ͞· IUUQTHJUIVCDPNTQPOTPSTVSJCP ఏڙ @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet
@ ito4303 @ ak9782427
࣍ σʔλૢ࡞ EBUBUBCMFͷ֦͕Γ ⁞ EBUBUBCMFͷར
MJCSBSZ EBUBUBCMF WFSTJPO 4JODFPO$3"/ ⁞
EBUBUBCMF &YUFOTJPOPGEBUBGSBNF ʢେنσʔλͷʣޮతͳॲཧɺ ߴͳಡΈॻ͖͕ಘҙ EBUBUBCMFΫϥε %5 Λఏڙ EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞ IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF
group_by join data.table 13s 92s dplyr 157s 392s datatable 72s
1301s pandas 97s 1514s dask 127s out of memory DataFrames.jl 43s 5755s spark 37s 369s ੑೳଌఆ IUUQTIPBJHJUIVCJPECCFODINBSL (# ߦྻ
ੑೳଌఆ .# ߦྻ time data.table::fread() 2.85s fread()ʢcolClassࢦఆʣ 2.57s
readr::read_csv() 14.0s fst::read_fst() 2.44s
EQMZSͱͷൺֱ ґଘ͕গͳ͍ ϝϞϦར༻ޮʹҧ͍ อकతͳ։ൃʢWTϦϦʔεʣ EQMZSΑΓྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك <J K CZ>ܗࣜͷૢ࡞ੑҰ؏
EQMZSͱͷൺֱ ґଘͷগͳ͞ tools::package_dependencies("data.table", recursive = TRUE)[[1]] #> [1] "methods" tools::package_dependencies("dplyr",
recursive = TRUE)[[1]] #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr" #> [6] "methods" "R6" "rlang" "tibble" "tidyselect" #> [11] "utils" "vctrs" "cli" "crayon" "fansi" #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat" #> [21] "grDevices" "utf8" "tools" ⁞
EQMZSͱͷൺֱ ϝϞϦར༻ޮ $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu 5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUF
[email protected]
ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕ Θ͔ͣʹ༏ΕΔ நग़ͷաఔͰ
άϧʔϓͷूఔඇޮʁ EBUBUBCMF͕উΔ IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT
EQMZSͱͷൺֱ ϝϞϦར༻ޮ d <- data.table( x = rnorm(10)) address(d)
#> [1] “0x7f8afc694400" df_tbl <- copy(d) %>% as_tibble() address(df_tbl) #> [1] "0x7f8ace2fb1c8" d[, z := rnorm(10)] address(d) #> [1] "0x7f8afc694400" มʹॲཧΛՃ͑ͯ ΦϒδΣΫτ͕ڞ༗͞ΕΔ ϝϞϦΞυϨε͕มΘΒͳ͍ df_tbl <- df_tbl %>% mutate(z = rnorm(1e6)) address(df_tbl) #> [1] "0x7f8afc68eea8" UJEZWFSTF ϝϞϦΞυϨε͕มΘΔ
ͳͥޮ͕ྑ͘ɺߴͳͷ͔ $ݴޠ $ Ͱͳ͍ ʹΑΔ࣮ TIBMMPXίϐʔͰͷྻͷͷ Ճɾߋ৽ɾআ Ωʔར༻ͱόΠφϦαʔν ʹΑΔݕࡧ
BSSBOHF GJMUFS DT[i, j, by] EBUBUBCMFߏจ NVUBUF TFMFDU
[email protected]
TVNNBSJTF ର ૢ࡞ ू߹
df %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop")
id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 20 2 16 JEྻͷ͝ͱʹWBMΛ߹ܭ͢Δ ྫ ⁞ DT[, .(val1_sum = sum(val1)), by = id]
id code val1 1 c 1 1 b 2 1
c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 12 2 16 DPEFCͷ݅Ͱྫ⁞ͱಉ͡ॲཧ ྫ df %>% filter(code != "b") %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") DT[code != "b", .(val1_sum = sum(val1)), by = id]
id val1_sum val2_sum 1 12 36 2 16 40 DPEFCͷ݅ɺJE͝ͱʹWBMྻͷ߹ܭ
ྫ id code val1 val2 1 c 1 9 1 b 2 10 1 c 3 11 2 a 4 12 2 a 5 13 1 b 6 14 2 a 7 15 1 c 8 16 DT[code != "b", lapply(.SD, sum), by = id, .SDcols = patterns("^val")] 4VCTFUPGUIF%BUBUBCMF df %>% filter(code != "b") %>% group_by(id) %>% summarise(across(starts_with("val"), sum), .groups = "drop")
df %>% filter(val1 < 3) df %>% filter(val1 < 3
| val2 > 9) df %>% filter(val1 < 3 & val2 > 9) Jߦͷநग़ pMUFS DT[val1 < 3] DT[val1 < 3 | val2 > 9] # or݅ DT[val1 < 3 & val2 > 9] # and݅ df %>% filter(code %in% c("a", “d")) df %>% filter(!code %in% c("a", "b")) df %>% filter( stringr::str_detect(code, "^c$")) # on۟ DT[c("a", "d"),on = "code",nomatch = 0] # a,bҎ֎ DT[!code %in% c("a", "b")] # ਖ਼نදݱͰͷࢦఆ DT[like(code, "^c$")] จࣈྻ
Jߦͷฒͼସ͑ BSSBOHF DT[order(id)] # খ͍͞ॱ(ঢॱ) #> id code val1 val2
#> 1: 1 c 1 9 #> 2: 1 b 2 10 #> 3: 1 c 3 11 #> 4: 1 b 6 14 #> 5: 1 c 8 16 #> 6: 2 a 4 12 #> 7: 2 a 5 13 #> 8: 2 a 7 15 df %>% arrange(id) df %>% arrange(desc(id)) setorder(DT, -id, na.last = TRUE) DT # େ͖͍ॱ(߱ॱ) #> id code val1 val2 #> 1: 2 a 4 12 #> 2: 2 a 5 13 #> 3: 2 a 7 15 #> 4: 1 c 1 9 #> 5: 1 b 2 10 #> 6: 1 c 3 11 #> 7: 1 b 6 14 #> 8: 1 c 8 16
Kྻͷબ TFMFDU ୯Ұ ෳ ෳ DT[, "id"] DT[, list(id)] DT[,
.(id)] DT[, 1] cols <- c("val1", "val2") DT[, c(1, 2)] DT[, ..cols] DT[, cols, with = FALSE] DT[, list(val1, val2)] DT[, .(val1, val2)] จࣈྻϕΫτϧΛ ༩͑Δࡍʹඞཁ ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ df %>% select(id) df %>% select(val1, val2) ෳ আ DT[, !c("id")] DT[, !cols, with = FALSE] ˠ࣍ͷεϥΠυ Ͱߦͬͨํ͕ޮత MJTU ͷγϣʔτΧοτ
KͷՃɾՃɾআ NVUBUF ηΠνPSΞγΧԋࢉࢠ 1ZUIPOͰಋೖ +PFM(BSMJDI.JMMFS 64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO DT[, code_title =
toupper(code)] #> Error in `[.data.table`(DT, , code_title = toupper(code)): ΘΕ͍ͯͳ͍Ҿ (code_title = toupper(code)) DT[, code_title := toupper(code)] ݁Ռग़ྗ͞Εͳ͍ ෭࡞༻͕͋Δ dt_copy <- copy(DT) address(DT) #> [1] "0x7f953bc4b800" address(dt_copy) #> [1] "0x7f953bdb8800" EBUBUBCMFͰͷEFFQDPQZ
KͷՃɾՃɾআ NVUBUF DT[, code_title := toupper(code)] # Ճ DT[, code
:= toupper(code)] # ߋ৽ DT[, code := NULL] # আ df %>% mutate(code_title = toupper(code)) df %>% mutate(code = toupper(code)) df %>% mutate(code_title = NULL) df %>% mutate(val1 = val1 + 2, val2 = val2 + val1) ୯Ұ ෳ DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)] DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)] ࠨลӈล ݅ DT[c("a", "c"), val2 := val2 * 2, on = c("code")] DPEFྻͷ͕B DͰ͋ΕWBM
CZू TVNNBSJTF DT[, .(val1_sum = sum(val1)), by = code] DT[,
id := NULL][, lapply(.SD, sum), by = code, .SDcols = is.numeric] df %>% group_by(code) %>% summarise(val1_sum = sum(val1), .groups = “drop") df %>% select(!id) %>% group_by(code) %>% summarise(across(where(is.numeric), sum), .groups = "drop") EQMZS
[email protected]
ͱಉ 4%DPMTʹ༩͑ΒΕͨྻ͕ 4%ʹ͞ΕΔ ୯Ұ ෳ .SDcols = val1:val2 .SDcols = c("val1", "val2") .SDcols = patterns("^val") ͜ΕΒͷࢦఆํ๏Ͱ0,
DT[, .(val1_sum = sum(val1)), keyby = code] #> code val1_sum
#> 1: a 16 #> 2: b 8 #> 3: c 12 άϧʔϓԽͷิॿػೳ / /(31 (31 DT[, .(val1_sum = sum(val1)), by = code] #> code val1_sum #> 1: c 12 #> 2: b 8 #> 3: a 16 LFZCZʹࢦఆͨ͠มͷฒͼͰฒͼସ͑Δ DT[, .N, by = id] #> id N #> 1: 1 5 #> 2: 2 3 DT[, .GRP, by = id] #> id GRP #> 1: 1 1 #> 2: 2 2 άϧʔϓมͷΧϯτ DT[, .NGRP, by = id] #> id GRP #> 1: 1 2 #> 2: 2 2 άϧʔϓ JE ͷݸ άϧʔϓมͷ֬ೝ LFZCZ
݁߹ KPJO PO۟͘͠NFSHF inner_join( flights, planes, by = "tailnum") library(nycflights13)
flights_dt <- as.data.table(flights) planes_dt <- as.data.table(planes) left_join( flights, planes, by = "tailnum") merge( flights_dt, planes_dt, all.x = TRUE, all.y = FALSE, by = "tailnum") flights_dt[planes_dt, on = "tailnum"]
ॎԣม library(tidyr) data("fish_encounters") #> # A tibble: 114 x 3
#> fish station seen #> <fct> <fct> <int> #> 1 4842 Release 1 #> 2 4842 I80_1 1 #> 3 4842 Lisbon 1 #> 4 4842 Rstr 1 #> 5 4842 Base_TD 1 #> 6 4842 BCE 1 #> 7 4842 BCW 1 #> 8 4842 BCE2 1 #> 9 4842 BCW2 1 #> 10 4842 MAE 1 #> # … with 104 more rows #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1 #> 1: 4842 1 1 1 1 1 1 1 1 1 #> 2: 4843 1 1 1 1 1 1 1 1 1 #> 3: 4844 1 1 1 1 1 1 1 1 1 …((লུ) fish_encounters %>% as.data.table() %>% dcast(... ~ station + seen, value.var = "seen") fish_encounters %>% pivot_wider(names_from = station, values_from = seen) ͋Μ·ΓΠέͯͳ͍ XJEFS ⁞
data("billboard") #> # A tibble: 317 x 79 #> artist
track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 #> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 #> # … with 314 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>, billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE) #> artist track date.entered week rank #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81 #> 4: 3 Doors Down Loser 2000-10-21 wk1 76 #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57 #> --- #> 5303: Creed Higher 1999-09-11 wk63 50 #> 5304: Lonestar Amazed 1999-06-05 wk63 45 #> 5305: Creed Higher 1999-09-11 wk64 50 #> 5306: Lonestar Amazed 1999-06-05 wk64 50 #> 5307: Creed Higher 1999-09-11 wk65 49 billboard %>% as.data.table() %>% melt(id = c("artist", "track", "date.entered"), measure = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE) ʜ ॎԣม MPOHFS UJEZGBTUͰੋඇ
ϕΫτϧૢ࡞ؔ fcase(x < 5L, 1L, x >= 5L, 3L) #
dplyr::case_when #> [1] 1 1 1 1 3 3 3 3 NA 3 GDBTF x <- c(seq_len(8), NA, 10L) fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE pGFMTF
ϕΫτϧૢ࡞ؔ fcoalesce(x, seq_len(10)) # dplyr::coalesce #> [1] 1 2 3
4 5 6 7 8 9 10 GDPBMFTDF OBpMM nafill(x, fill = 0) #> [1] 1 2 3 4 5 6 7 8 0 10
ؔ࿈ύοέʔδ
EUQMZS όοΫάϥϯυʹEBUBUBCMF EQMZS EBUBUBCMF ࣮ߦ EBUBUBCMFEUQMZSEQMZS EQMZSߏจ͕͑ͯ EBUBUBCMFʹ͍ۙύϑΥʔϚϯε
EUQMZS dt_starwars %>% group_by(species) %>% summarise(height = mean(height, na.rm =
TRUE), .groups = “drop") library(dtplyr) dt_starwars <- lazy_dt(starwars) #> Source: local data table [?? x 3] #> Call: `_DT1`[, .(height = mean(height, #> na.rm = TRUE), .groups = "drop"), #> keyby = .(species)] #> #> species height .groups #> <chr> <dbl> <chr> #> 1 <NA> 181. drop #> 2 Aleena 79 drop #> 3 Besalisk 198 drop #> 4 Cerean 198 drop #> 5 Chagrian 196 drop #> 6 Clawdite 168 drop #> # Use as.data.table()/as.data.frame()/as_tibble() to access results .Last.value %>% as_tibble() EQMZSͰͷॲཧ ࣮ߦ݁ՌΛ ڥʹฦ͢
"OE.PSFʜ UJEZUBCMF EUQMZS͕αϙʔτ͢ΔͷEQMZS UJEZSͷؚؔΊΔ UJEZGBTUʜUJEZSͷ
[email protected]
ͳͲ UJEZGTUʜGTUΛ͍ͬͯͯߴ ʢͬΆ͍ʣ
·ͱΊ EBUBUBCMF σʔλͷૢ࡞<ԋࢉࢠͰ UJEZWFSTFʹྼΒͳ͍ػೳ ⁞ ߴɺϝϞϦར༻ޮྑ͍ ؔ࿈ɺ֦ுύοέʔδॆ࣮
Enjoy ☝ ऴ IUUQTHJUIVCDPNTQPOTPSTVSJCP