Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
今こそ、data.tableを学ぼう! / datatable1130
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Uryu Shinya
August 01, 2020
Programming
9.3k
5
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
今こそ、data.tableを学ぼう! / datatable1130
Tokyo.R#87
http://tokyor.connpass.com/event/182672/
での発表スライド
Uryu Shinya
August 01, 2020
More Decks by Uryu Shinya
See All by Uryu Shinya
安全なAI利用のためのLLM(大規模言語モデル)の利用と評価 / japanr2025
s_uryu
0
84
生成AIサービスを用いた研究活動の支援
s_uryu
0
220
R研究集会(2024)のご案内
s_uryu
1
750
生成AIを用いたサービスの紹介
s_uryu
1
240
生成AIの基礎的事項と社会に与える影響
s_uryu
0
96
Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜 / machine_learning_with_r2024
s_uryu
0
1.5k
地理空間データの機械学習への適用 / machine_learning_for_spatial_data
s_uryu
0
390
mandaRa: R言語ユーザのための新しい知識共有の場 / mandara_tokyor111
s_uryu
2
790
R言語入門 (R-4.3.3 2024年4月版) / introduction to r
s_uryu
7
7.2k
Other Decks in Programming
See All in Programming
DynamoDBには集計系のクエリがないけどなんとかしたい
musan
1
140
生成AI時代にこそ効くGo | Why Go Works in the Age of Generative AI
mom0tomo
8
3.2k
「AIで開発し、AIを届ける」をEvalでつなぐ 〜AIネイティブに始めるプロダクト開発の実践〜 / Connecting "Develop with AI, deliver AI" with Eval
rkaga
4
5.1k
OSもどきOS
arkw
0
570
Strategic Design in the Frontend: Moduliths & Micro Frontends @DDDEurope
manfredsteyer
PRO
0
100
RTSPクライアントを自作してみた話
simotin13
0
610
そのテスト、説明できますか?~LWテスト戦略FW~のご紹介
nakahara
0
140
Language Server 使ってる? 〜VSCode と Zed の場合〜 / Are you using a Language Server? ~For VS Code and Zed~
handlename
0
790
Webフレームワークの ベンチマークについて
yusukebe
0
170
New "Type" system on PicoRuby
pocke
1
960
Creating Composable Callables in Contemporary C++
rollbear
0
140
Lemonade + Foundry Toolkit でお手軽アプリ開発
seosoft
1
340
Featured
See All Featured
Building the Perfect Custom Keyboard
takai
2
800
Heart Work Chapter 1 - Part 1
lfama
PRO
7
36k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.5k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.9k
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
860
Building Adaptive Systems
keathley
44
3.1k
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
150
Learning to Love Humans: Emotional Interface Design
aarron
275
41k
From π to Pie charts
rasagy
0
210
How to Ace a Technical Interview
jacobian
281
24k
Tips & Tricks on How to Get Your First Job In Tech
honzajavorek
1
540
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.7k
Transcript
data.table Shinya Uryu ( @u_ribo uribo) 20200801 Tokyo.R#87 =Λֶ΅͏ʂ ύοέʔδ
ࠓͦ͜ɺ DT[i, j, by] fifelse fcase .SD .N nafill .GRP
(JU)VC4QPOTPST ͷօ͞· IUUQTHJUIVCDPNTQPOTPSTVSJCP ఏڙ @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet
@ ito4303 @ ak9782427
࣍ σʔλૢ࡞ EBUBUBCMFͷ֦͕Γ ⁞ EBUBUBCMFͷར
MJCSBSZ EBUBUBCMF WFSTJPO 4JODFPO$3"/ ⁞
EBUBUBCMF &YUFOTJPOPGEBUBGSBNF ʢେنσʔλͷʣޮతͳॲཧɺ ߴͳಡΈॻ͖͕ಘҙ EBUBUBCMFΫϥε %5 Λఏڙ EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞ IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF
group_by join data.table 13s 92s dplyr 157s 392s datatable 72s
1301s pandas 97s 1514s dask 127s out of memory DataFrames.jl 43s 5755s spark 37s 369s ੑೳଌఆ IUUQTIPBJHJUIVCJPECCFODINBSL (# ߦྻ
ੑೳଌఆ .# ߦྻ time data.table::fread() 2.85s fread()ʢcolClassࢦఆʣ 2.57s
readr::read_csv() 14.0s fst::read_fst() 2.44s
EQMZSͱͷൺֱ ґଘ͕গͳ͍ ϝϞϦར༻ޮʹҧ͍ อकతͳ։ൃʢWTϦϦʔεʣ EQMZSΑΓྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك <J K CZ>ܗࣜͷૢ࡞ੑҰ؏
EQMZSͱͷൺֱ ґଘͷগͳ͞ tools::package_dependencies("data.table", recursive = TRUE)[[1]] #> [1] "methods" tools::package_dependencies("dplyr",
recursive = TRUE)[[1]] #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr" #> [6] "methods" "R6" "rlang" "tibble" "tidyselect" #> [11] "utils" "vctrs" "cli" "crayon" "fansi" #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat" #> [21] "grDevices" "utf8" "tools" ⁞
EQMZSͱͷൺֱ ϝϞϦར༻ޮ $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu 5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUFEBUBUBCMF@NFNPSZ ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕ Θ͔ͣʹ༏ΕΔ நग़ͷաఔͰ
άϧʔϓͷूఔඇޮʁ EBUBUBCMF͕উΔ IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT
EQMZSͱͷൺֱ ϝϞϦར༻ޮ d <- data.table( x = rnorm(10)) address(d)
#> [1] “0x7f8afc694400" df_tbl <- copy(d) %>% as_tibble() address(df_tbl) #> [1] "0x7f8ace2fb1c8" d[, z := rnorm(10)] address(d) #> [1] "0x7f8afc694400" มʹॲཧΛՃ͑ͯ ΦϒδΣΫτ͕ڞ༗͞ΕΔ ϝϞϦΞυϨε͕มΘΒͳ͍ df_tbl <- df_tbl %>% mutate(z = rnorm(1e6)) address(df_tbl) #> [1] "0x7f8afc68eea8" UJEZWFSTF ϝϞϦΞυϨε͕มΘΔ
ͳͥޮ͕ྑ͘ɺߴͳͷ͔ $ݴޠ $ Ͱͳ͍ ʹΑΔ࣮ TIBMMPXίϐʔͰͷྻͷͷ Ճɾߋ৽ɾআ Ωʔར༻ͱόΠφϦαʔν ʹΑΔݕࡧ
BSSBOHF GJMUFS DT[i, j, by] EBUBUBCMFߏจ NVUBUF TFMFDU HSPVQ@CZ
TVNNBSJTF ର ૢ࡞ ू߹
df %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop")
id code val1 1 c 1 1 b 2 1 c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 20 2 16 JEྻͷ͝ͱʹWBMΛ߹ܭ͢Δ ྫ ⁞ DT[, .(val1_sum = sum(val1)), by = id]
id code val1 1 c 1 1 b 2 1
c 3 2 a 4 2 a 5 1 b 6 2 a 7 1 c 8 id val1_sum 1 12 2 16 DPEFCͷ݅Ͱྫ⁞ͱಉ͡ॲཧ ྫ df %>% filter(code != "b") %>% group_by(id) %>% summarise(val1_sum = sum(val1), .groups = "drop") DT[code != "b", .(val1_sum = sum(val1)), by = id]
id val1_sum val2_sum 1 12 36 2 16 40 DPEFCͷ݅ɺJE͝ͱʹWBMྻͷ߹ܭ
ྫ id code val1 val2 1 c 1 9 1 b 2 10 1 c 3 11 2 a 4 12 2 a 5 13 1 b 6 14 2 a 7 15 1 c 8 16 DT[code != "b", lapply(.SD, sum), by = id, .SDcols = patterns("^val")] 4VCTFUPGUIF%BUBUBCMF df %>% filter(code != "b") %>% group_by(id) %>% summarise(across(starts_with("val"), sum), .groups = "drop")
df %>% filter(val1 < 3) df %>% filter(val1 < 3
| val2 > 9) df %>% filter(val1 < 3 & val2 > 9) Jߦͷநग़ pMUFS DT[val1 < 3] DT[val1 < 3 | val2 > 9] # or݅ DT[val1 < 3 & val2 > 9] # and݅ df %>% filter(code %in% c("a", “d")) df %>% filter(!code %in% c("a", "b")) df %>% filter( stringr::str_detect(code, "^c$")) # on۟ DT[c("a", "d"),on = "code",nomatch = 0] # a,bҎ֎ DT[!code %in% c("a", "b")] # ਖ਼نදݱͰͷࢦఆ DT[like(code, "^c$")] จࣈྻ
Jߦͷฒͼସ͑ BSSBOHF DT[order(id)] # খ͍͞ॱ(ঢॱ) #> id code val1 val2
#> 1: 1 c 1 9 #> 2: 1 b 2 10 #> 3: 1 c 3 11 #> 4: 1 b 6 14 #> 5: 1 c 8 16 #> 6: 2 a 4 12 #> 7: 2 a 5 13 #> 8: 2 a 7 15 df %>% arrange(id) df %>% arrange(desc(id)) setorder(DT, -id, na.last = TRUE) DT # େ͖͍ॱ(߱ॱ) #> id code val1 val2 #> 1: 2 a 4 12 #> 2: 2 a 5 13 #> 3: 2 a 7 15 #> 4: 1 c 1 9 #> 5: 1 b 2 10 #> 6: 1 c 3 11 #> 7: 1 b 6 14 #> 8: 1 c 8 16
Kྻͷબ TFMFDU ୯Ұ ෳ ෳ DT[, "id"] DT[, list(id)] DT[,
.(id)] DT[, 1] cols <- c("val1", "val2") DT[, c(1, 2)] DT[, ..cols] DT[, cols, with = FALSE] DT[, list(val1, val2)] DT[, .(val1, val2)] จࣈྻϕΫτϧΛ ༩͑Δࡍʹඞཁ ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ df %>% select(id) df %>% select(val1, val2) ෳ আ DT[, !c("id")] DT[, !cols, with = FALSE] ˠ࣍ͷεϥΠυ Ͱߦͬͨํ͕ޮత MJTU ͷγϣʔτΧοτ
KͷՃɾՃɾআ NVUBUF ηΠνPSΞγΧԋࢉࢠ 1ZUIPOͰಋೖ +PFM(BSMJDI.JMMFS 64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO DT[, code_title =
toupper(code)] #> Error in `[.data.table`(DT, , code_title = toupper(code)): ΘΕ͍ͯͳ͍Ҿ (code_title = toupper(code)) DT[, code_title := toupper(code)] ݁Ռग़ྗ͞Εͳ͍ ෭࡞༻͕͋Δ dt_copy <- copy(DT) address(DT) #> [1] "0x7f953bc4b800" address(dt_copy) #> [1] "0x7f953bdb8800" EBUBUBCMFͰͷEFFQDPQZ
KͷՃɾՃɾআ NVUBUF DT[, code_title := toupper(code)] # Ճ DT[, code
:= toupper(code)] # ߋ৽ DT[, code := NULL] # আ df %>% mutate(code_title = toupper(code)) df %>% mutate(code = toupper(code)) df %>% mutate(code_title = NULL) df %>% mutate(val1 = val1 + 2, val2 = val2 + val1) ୯Ұ ෳ DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)] DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)] ࠨลӈล ݅ DT[c("a", "c"), val2 := val2 * 2, on = c("code")] DPEFྻͷ͕B DͰ͋ΕWBM
CZू TVNNBSJTF DT[, .(val1_sum = sum(val1)), by = code] DT[,
id := NULL][, lapply(.SD, sum), by = code, .SDcols = is.numeric] df %>% group_by(code) %>% summarise(val1_sum = sum(val1), .groups = “drop") df %>% select(!id) %>% group_by(code) %>% summarise(across(where(is.numeric), sum), .groups = "drop") EQMZSHSPVQ@CZ ͱಉ 4%DPMTʹ༩͑ΒΕͨྻ͕ 4%ʹ͞ΕΔ ୯Ұ ෳ .SDcols = val1:val2 .SDcols = c("val1", "val2") .SDcols = patterns("^val") ͜ΕΒͷࢦఆํ๏Ͱ0,
DT[, .(val1_sum = sum(val1)), keyby = code] #> code val1_sum
#> 1: a 16 #> 2: b 8 #> 3: c 12 άϧʔϓԽͷิॿػೳ / /(31 (31 DT[, .(val1_sum = sum(val1)), by = code] #> code val1_sum #> 1: c 12 #> 2: b 8 #> 3: a 16 LFZCZʹࢦఆͨ͠มͷฒͼͰฒͼସ͑Δ DT[, .N, by = id] #> id N #> 1: 1 5 #> 2: 2 3 DT[, .GRP, by = id] #> id GRP #> 1: 1 1 #> 2: 2 2 άϧʔϓมͷΧϯτ DT[, .NGRP, by = id] #> id GRP #> 1: 1 2 #> 2: 2 2 άϧʔϓ JE ͷݸ άϧʔϓมͷ֬ೝ LFZCZ
݁߹ KPJO PO۟͘͠NFSHF inner_join( flights, planes, by = "tailnum") library(nycflights13)
flights_dt <- as.data.table(flights) planes_dt <- as.data.table(planes) left_join( flights, planes, by = "tailnum") merge( flights_dt, planes_dt, all.x = TRUE, all.y = FALSE, by = "tailnum") flights_dt[planes_dt, on = "tailnum"]
ॎԣม library(tidyr) data("fish_encounters") #> # A tibble: 114 x 3
#> fish station seen #> <fct> <fct> <int> #> 1 4842 Release 1 #> 2 4842 I80_1 1 #> 3 4842 Lisbon 1 #> 4 4842 Rstr 1 #> 5 4842 Base_TD 1 #> 6 4842 BCE 1 #> 7 4842 BCW 1 #> 8 4842 BCE2 1 #> 9 4842 BCW2 1 #> 10 4842 MAE 1 #> # … with 104 more rows #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1 #> 1: 4842 1 1 1 1 1 1 1 1 1 #> 2: 4843 1 1 1 1 1 1 1 1 1 #> 3: 4844 1 1 1 1 1 1 1 1 1 …((লུ) fish_encounters %>% as.data.table() %>% dcast(... ~ station + seen, value.var = "seen") fish_encounters %>% pivot_wider(names_from = station, values_from = seen) ͋Μ·ΓΠέͯͳ͍ XJEFS ⁞
data("billboard") #> # A tibble: 317 x 79 #> artist
track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 #> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 #> # … with 314 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>, billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE) #> artist track date.entered week rank #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91 #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81 #> 4: 3 Doors Down Loser 2000-10-21 wk1 76 #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57 #> --- #> 5303: Creed Higher 1999-09-11 wk63 50 #> 5304: Lonestar Amazed 1999-06-05 wk63 45 #> 5305: Creed Higher 1999-09-11 wk64 50 #> 5306: Lonestar Amazed 1999-06-05 wk64 50 #> 5307: Creed Higher 1999-09-11 wk65 49 billboard %>% as.data.table() %>% melt(id = c("artist", "track", "date.entered"), measure = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE) ʜ ॎԣม MPOHFS UJEZGBTUͰੋඇ
ϕΫτϧૢ࡞ؔ fcase(x < 5L, 1L, x >= 5L, 3L) #
dplyr::case_when #> [1] 1 1 1 1 3 3 3 3 NA 3 GDBTF x <- c(seq_len(8), NA, 10L) fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE pGFMTF
ϕΫτϧૢ࡞ؔ fcoalesce(x, seq_len(10)) # dplyr::coalesce #> [1] 1 2 3
4 5 6 7 8 9 10 GDPBMFTDF OBpMM nafill(x, fill = 0) #> [1] 1 2 3 4 5 6 7 8 0 10
ؔ࿈ύοέʔδ
EUQMZS όοΫάϥϯυʹEBUBUBCMF EQMZS EBUBUBCMF ࣮ߦ EBUBUBCMFEUQMZSEQMZS EQMZSߏจ͕͑ͯ EBUBUBCMFʹ͍ۙύϑΥʔϚϯε
EUQMZS dt_starwars %>% group_by(species) %>% summarise(height = mean(height, na.rm =
TRUE), .groups = “drop") library(dtplyr) dt_starwars <- lazy_dt(starwars) #> Source: local data table [?? x 3] #> Call: `_DT1`[, .(height = mean(height, #> na.rm = TRUE), .groups = "drop"), #> keyby = .(species)] #> #> species height .groups #> <chr> <dbl> <chr> #> 1 <NA> 181. drop #> 2 Aleena 79 drop #> 3 Besalisk 198 drop #> 4 Cerean 198 drop #> 5 Chagrian 196 drop #> 6 Clawdite 168 drop #> # Use as.data.table()/as.data.frame()/as_tibble() to access results .Last.value %>% as_tibble() EQMZSͰͷॲཧ ࣮ߦ݁ՌΛ ڥʹฦ͢
"OE.PSFʜ UJEZUBCMF EUQMZS͕αϙʔτ͢ΔͷEQMZS UJEZSͷؚؔΊΔ UJEZGBTUʜUJEZSͷQJWPU@ ͳͲ UJEZGTUʜGTUΛ͍ͬͯͯߴ ʢͬΆ͍ʣ
·ͱΊ EBUBUBCMF σʔλͷૢ࡞<ԋࢉࢠͰ UJEZWFSTFʹྼΒͳ͍ػೳ ⁞ ߴɺϝϞϦར༻ޮྑ͍ ؔ࿈ɺ֦ுύοέʔδॆ࣮
Enjoy ☝ ऴ IUUQTHJUIVCDPNTQPOTPSTVSJCP