$30 off During Our Annual Pro Sale. View Details »

今こそ、data.tableを学ぼう! / datatable1130

今こそ、data.tableを学ぼう! / datatable1130

Tokyo.R#87 http://tokyor.connpass.com/event/182672/ での発表スライド

Uryu Shinya

August 01, 2020
Tweet

More Decks by Uryu Shinya

Other Decks in Programming

Transcript

  1. data.table
    Shinya Uryu ( @u_ribo uribo)
    20200801 Tokyo.R#87
    =Λֶ΅͏ʂ
    ύοέʔδ
    ࠓͦ͜ɺ
    DT[i, j, by]
    fifelse
    fcase
    .SD
    .N
    nafill
    .GRP

    View Slide

  2. (JU)VC4QPOTPST
    ͷօ͞·
    IUUQTHJUIVCDPNTQPOTPSTVSJCP
    ఏڙ
    @yutannihilation @katsurakob @kanji14134 @siero5335 @ niszet
    @ ito4303 @ ak9782427

    View Slide

  3. ໨࣍
      σʔλૢ࡞
    ⁠ EBUBUBCMFͷ֦͕Γ
    ⁞ EBUBUBCMFͷར఺

    View Slide

  4. MJCSBSZ EBUBUBCMF

    WFSTJPO

    4JODFPO$3"/

    View Slide

  5. EBUBUBCMF
    &YUFOTJPOPGEBUBGSBNF
    ʢେن໛σʔλͷʣޮ཰తͳॲཧɺ
    ߴ଎ͳಡΈॻ͖͕ಘҙ
    EBUBUBCMFΫϥε %5
    Λఏڙ
    EBUBGSBNFʹࣅͨ<Λ༻͍ͨૢ࡞
    IUUQTSEBUBUBCMFHJUMBCJPEBUBUBCMF

    View Slide

  6. group_by join
    data.table 13s 92s
    dplyr 157s 392s
    datatable 72s 1301s
    pandas 97s 1514s
    dask 127s out of memory
    DataFrames.jl 43s 5755s
    spark 37s 369s
    ੑೳଌఆ
    IUUQTIPBJHJUIVCJPECCFODINBSL
    (#
    ߦྻ

    View Slide

  7. ੑೳଌఆ .#
    ߦྻ
    time
    data.table::fread() 2.85s
    fread()ʢcolClassࢦఆʣ 2.57s
    readr::read_csv() 14.0s
    fst::read_fst() 2.44s



    View Slide

  8. EQMZSͱͷൺֱ
    ґଘ͕গͳ͍
    ϝϞϦར༻ޮ཰ʹҧ͍
    อकతͳ։ൃʢWTϦϦʔε଎౓ʣ
    EQMZSΑΓ΋ྺ࢙ͷ͋Δ։ൃྺ͕ͩ҆ఆత
    େ෯ʹڍಈ͕มΘΔ͜ͱ͕ك
    ܗࣜͷૢ࡞ੑ͸Ұ؏

    View Slide

  9. EQMZSͱͷൺֱ ґଘͷগͳ͞
    tools::package_dependencies("data.table", recursive = TRUE)[[1]]
    #> [1] "methods"
    tools::package_dependencies("dplyr", recursive = TRUE)[[1]]
    #> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr"
    #> [6] "methods" "R6" "rlang" "tibble" "tidyselect"
    #> [11] "utils" "vctrs" "cli" "crayon" "fansi"
    #> [16] "pillar" "pkgconfig" "purrr" "digest" "assertthat"
    #> [21] "grDevices" "utf8" "tools"

    View Slide

  10. EQMZSͱͷൺֱ ϝϞϦར༻ޮ཰

    $PNQBSJOH&GpDJFODZBOE4QFFEPGAEBUBUBCMFA"EEJOHWBSJBCMFT pMUFSJOHSPXT BOETVNNBSJ[JOHCZHSPVQu
    5ZTPO#BSSFUUDPNIUUQTUZTPOCBSSFUUDPNKFLZMMVQEBUFEBUBUBCMF@NFNPSZ
    ྻૢ࡞ʹ͓͍ͯEBUBUBCMFͷํ͕
    Θ͔ͣʹ༏ΕΔ
    நग़ͷաఔͰ͸
    άϧʔϓͷू໿޻ఔ͸ඇޮ཰ʁ
    EBUBUBCMF͕উΔ
    IUUQTUXJUUFSDPNIFBMUIBOETUBUTTUBUVT

    View Slide

  11. EQMZSͱͷൺֱ ϝϞϦར༻ޮ཰

    d <- data.table(
    x = rnorm(10))
    address(d)
    #> [1] “0x7f8afc694400"
    df_tbl <-
    copy(d) %>%
    as_tibble()
    address(df_tbl)
    #> [1] "0x7f8ace2fb1c8"
    d[, z := rnorm(10)]
    address(d)
    #> [1] "0x7f8afc694400"
    ม਺ʹॲཧΛՃ͑ͯ΋
    ΦϒδΣΫτ͕ڞ༗͞ΕΔ
    ϝϞϦΞυϨε͕มΘΒͳ͍
    df_tbl <-
    df_tbl %>%
    mutate(z = rnorm(1e6))
    address(df_tbl)
    #> [1] "0x7f8afc68eea8"
    UJEZWFSTF
    ϝϞϦΞυϨε͕มΘΔ

    View Slide

  12. ͳͥޮ཰͕ྑ͘ɺߴ଎ͳͷ͔
    $ݴޠ $Ͱ͸ͳ͍
    ʹΑΔ࣮૷
    TIBMMPXίϐʔͰͷྻͷ஋ͷ
    ௥Ճɾߋ৽ɾ࡟আ
    Ωʔར༻ͱόΠφϦαʔν
    ʹΑΔݕࡧ

    View Slide


  13. BSSBOHF
    GJMUFS
    DT[i, j, by]
    EBUBUBCMFߏจ
    NVUBUF
    TFMFDU
    HSPVQ@CZ
    TVNNBSJTF
    ର৅ ૢ࡞ ू߹

    View Slide

  14. df %>%
    group_by(id) %>%
    summarise(val1_sum = sum(val1),
    .groups = "drop")
    id code val1
    1 c 1
    1 b 2
    1 c 3
    2 a 4
    2 a 5
    1 b 6
    2 a 7
    1 c 8
    id val1_sum
    1 20
    2 16
    JEྻͷ஋͝ͱʹWBMΛ߹ܭ͢Δ
    ྫ ⁞
    DT[,
    .(val1_sum = sum(val1)),
    by = id]

    View Slide

  15. id code val1
    1 c 1
    1 b 2
    1 c 3
    2 a 4
    2 a 5
    1 b 6
    2 a 7
    1 c 8
    id val1_sum
    1 12
    2 16
    DPEFCͷ৚݅Ͱྫ⁞ͱಉ͡ॲཧ
    ྫ  
    df %>%
    filter(code != "b") %>%
    group_by(id) %>%
    summarise(val1_sum = sum(val1),
    .groups = "drop")
    DT[code != "b",
    .(val1_sum = sum(val1)),
    by = id]

    View Slide

  16. id val1_sum val2_sum
    1 12 36
    2 16 40
    DPEFCͷ৚݅ɺJE͝ͱʹWBMྻͷ߹ܭ
    ྫ ⁠
    id code val1 val2
    1 c 1 9
    1 b 2 10
    1 c 3 11
    2 a 4 12
    2 a 5 13
    1 b 6 14
    2 a 7 15
    1 c 8 16
    DT[code != "b",
    lapply(.SD, sum),
    by = id,
    .SDcols = patterns("^val")]
    4VCTFUPGUIF%BUBUBCMF
    df %>%
    filter(code != "b") %>%
    group_by(id) %>%
    summarise(across(starts_with("val"), sum),
    .groups = "drop")

    View Slide

  17. df %>%
    filter(val1 < 3)
    df %>%
    filter(val1 < 3 | val2 > 9)
    df %>%
    filter(val1 < 3 & val2 > 9)
    Jߦͷநग़ pMUFS

    DT[val1 < 3]
    DT[val1 < 3 | val2 > 9] # or৚݅
    DT[val1 < 3 & val2 > 9] # and৚݅
    df %>%
    filter(code %in% c("a", “d"))
    df %>%
    filter(!code %in% c("a", "b"))
    df %>%
    filter(
    stringr::str_detect(code, "^c$"))
    # on۟
    DT[c("a", "d"),on = "code",nomatch = 0]
    # a,bҎ֎
    DT[!code %in% c("a", "b")]
    # ਖ਼نදݱͰͷࢦఆ
    DT[like(code, "^c$")]
    ਺஋
    จࣈྻ

    View Slide

  18. Jߦͷฒͼସ͑ BSSBOHF

    DT[order(id)] # খ͍͞ॱ(ঢॱ)
    #> id code val1 val2
    #> 1: 1 c 1 9
    #> 2: 1 b 2 10
    #> 3: 1 c 3 11
    #> 4: 1 b 6 14
    #> 5: 1 c 8 16
    #> 6: 2 a 4 12
    #> 7: 2 a 5 13
    #> 8: 2 a 7 15
    df %>%
    arrange(id)
    df %>%
    arrange(desc(id))
    setorder(DT, -id, na.last = TRUE)
    DT # େ͖͍ॱ(߱ॱ)
    #> id code val1 val2
    #> 1: 2 a 4 12
    #> 2: 2 a 5 13
    #> 3: 2 a 7 15
    #> 4: 1 c 1 9
    #> 5: 1 b 2 10
    #> 6: 1 c 3 11
    #> 7: 1 b 6 14
    #> 8: 1 c 8 16

    View Slide

  19. Kྻͷબ୒ TFMFDU

    ୯Ұ
    ෳ਺
    ෳ਺
    DT[, "id"] DT[, list(id)]
    DT[, .(id)]
    DT[, 1]
    cols <- c("val1", "val2") DT[, c(1, 2)]
    DT[, ..cols]
    DT[, cols, with = FALSE]
    DT[, list(val1, val2)]
    DT[, .(val1, val2)] จࣈྻϕΫτϧΛ
    ༩͑Δࡍʹ͸ඞཁ
    ʢDPMTΛྻ໊ѻ͍͠ͳ͍ʣ
    df %>%
    select(id)
    df %>%
    select(val1, val2)
    ෳ਺
    ࡟আ DT[, !c("id")]
    DT[, !cols, with = FALSE]
    ˠ࣍ͷεϥΠυ
    Ͱߦͬͨํ͕ޮ཰త

    ͸MJTU
    ͷγϣʔτΧοτ

    View Slide

  20. K஋ͷ௥ՃɾՃ޻ɾ࡟আ NVUBUF

    ηΠ΢νPSΞγΧԋࢉࢠ
    1ZUIPOͰ΋ಋೖ
    +PFM(BSMJDI.JMMFS
    64'JTIBOE8JMEMJGF4FSWJDF1VCMJDEPNBJO
    DT[, code_title = toupper(code)]
    #> Error in `[.data.table`(DT, , code_title = toupper(code)):
    ࢖ΘΕ͍ͯͳ͍Ҿ਺ (code_title = toupper(code))
    DT[, code_title := toupper(code)]
    ݁Ռ͸ग़ྗ͞Εͳ͍
    ෭࡞༻͕͋Δ
    dt_copy <-
    copy(DT)
    address(DT)
    #> [1] "0x7f953bc4b800"
    address(dt_copy)
    #> [1] "0x7f953bdb8800"
    EBUBUBCMFͰͷEFFQDPQZ

    View Slide

  21. K஋ͷ௥ՃɾՃ޻ɾ࡟আ NVUBUF

    DT[, code_title := toupper(code)] # ௥Ճ
    DT[, code := toupper(code)] # ߋ৽
    DT[, code := NULL] # ࡟আ
    df %>%
    mutate(code_title = toupper(code))
    df %>%
    mutate(code = toupper(code))
    df %>%
    mutate(code_title = NULL)
    df %>%
    mutate(val1 = val1 + 2,
    val2 = val2 + val1)
    ୯Ұ
    ෳ਺
    DT[, c("val1", "val2") := list(val1 + 2, val2 + val1)]
    DT[, ':=' (val1 = val1 + 2, val2 = val2 + val1)]
    ࠨลӈล
    ৚݅
    DT[c("a", "c"), val2 := val2 * 2, on = c("code")]
    DPEFྻͷ஋͕B DͰ͋Ε͹WBM

    View Slide

  22. CZू໿ TVNNBSJTF

    DT[,
    .(val1_sum = sum(val1)),
    by = code]
    DT[, id := NULL][,
    lapply(.SD, sum),
    by = code,
    .SDcols = is.numeric]
    df %>%
    group_by(code) %>%
    summarise(val1_sum = sum(val1),
    .groups = “drop")
    df %>%
    select(!id) %>%
    group_by(code) %>%
    summarise(across(where(is.numeric), sum),
    .groups = "drop")
    EQMZSHSPVQ@CZ
    ͱಉ౳
    4%DPMTʹ༩͑ΒΕͨྻ͕
    4%ʹ౉͞ΕΔ
    ୯Ұ
    ෳ਺
    .SDcols = val1:val2
    .SDcols = c("val1", "val2")
    .SDcols = patterns("^val")
    ͜ΕΒͷࢦఆํ๏Ͱ΋0,

    View Slide

  23. DT[,
    .(val1_sum = sum(val1)),
    keyby = code]
    #> code val1_sum
    #> 1: a 16
    #> 2: b 8
    #> 3: c 12
    άϧʔϓԽͷิॿػೳ
    / /(31
    (31
    DT[,
    .(val1_sum = sum(val1)),
    by = code]
    #> code val1_sum
    #> 1: c 12
    #> 2: b 8
    #> 3: a 16
    LFZCZʹࢦఆͨ͠ม਺ͷฒͼͰฒͼସ͑Δ
    DT[, .N, by = id]
    #> id N
    #> 1: 1 5
    #> 2: 2 3
    DT[, .GRP, by = id]
    #> id GRP
    #> 1: 1 1
    #> 2: 2 2
    άϧʔϓม਺ͷΧ΢ϯτ
    DT[, .NGRP, by = id]
    #> id GRP
    #> 1: 1 2
    #> 2: 2 2
    άϧʔϓ JE
    ͷݸ਺
    άϧʔϓม਺ͷ֬ೝ
    LFZCZ

    View Slide

  24. ݁߹ KPJO

    PO۟΋͘͠͸NFSHF

    inner_join(
    flights,
    planes,
    by = "tailnum")
    library(nycflights13)
    flights_dt <- as.data.table(flights)
    planes_dt <- as.data.table(planes)
    left_join(
    flights,
    planes,
    by = "tailnum")
    merge(
    flights_dt,
    planes_dt,
    all.x = TRUE,
    all.y = FALSE,
    by = "tailnum")
    flights_dt[planes_dt, on = "tailnum"]

    View Slide

  25. ॎԣม׵
    library(tidyr)
    data("fish_encounters")
    #> # A tibble: 114 x 3
    #> fish station seen
    #>
    #> 1 4842 Release 1
    #> 2 4842 I80_1 1
    #> 3 4842 Lisbon 1
    #> 4 4842 Rstr 1
    #> 5 4842 Base_TD 1
    #> 6 4842 BCE 1
    #> 7 4842 BCW 1
    #> 8 4842 BCE2 1
    #> 9 4842 BCW2 1
    #> 10 4842 MAE 1
    #> # … with 104 more rows
    #> fish Release_1 I80_1_1 Lisbon_1 Rstr_1 Base_TD_1 BCE_1 BCW_1 BCE2_1 BCW2_1
    #> 1: 4842 1 1 1 1 1 1 1 1 1
    #> 2: 4843 1 1 1 1 1 1 1 1 1
    #> 3: 4844 1 1 1 1 1 1 1 1 1
    …((লུ)
    fish_encounters %>%
    as.data.table() %>%
    dcast(... ~ station + seen, value.var = "seen")
    fish_encounters %>%
    pivot_wider(names_from = station, values_from = seen)
    ͋Μ·ΓΠέͯͳ͍
    XJEFS

    View Slide

  26. data("billboard")
    #> # A tibble: 317 x 79
    #> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
    #>
    #> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
    #> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA
    #> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
    #> # … with 314 more rows, and 68 more variables: wk9 , wk10 ,
    billboard %>%
    pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE)
    #> artist track date.entered week rank
    #> 1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
    #> 2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
    #> 3: 3 Doors Down Kryptonite 2000-04-08 wk1 81
    #> 4: 3 Doors Down Loser 2000-10-21 wk1 76
    #> 5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57
    #> ---
    #> 5303: Creed Higher 1999-09-11 wk63 50
    #> 5304: Lonestar Amazed 1999-06-05 wk63 45
    #> 5305: Creed Higher 1999-09-11 wk64 50
    #> 5306: Lonestar Amazed 1999-06-05 wk64 50
    #> 5307: Creed Higher 1999-09-11 wk65 49
    billboard %>%
    as.data.table() %>%
    melt(id = c("artist", "track", "date.entered"),
    measure = patterns("^wk"),
    variable.name = "week",
    value.name = "rank",
    na.rm = TRUE)
    ৑௕ʜ
    ॎԣม׵ MPOHFS

    UJEZGBTUͰੋඇ

    View Slide

  27. ϕΫτϧૢ࡞ؔ਺
    fcase(x < 5L, 1L,
    x >= 5L, 3L) # dplyr::case_when
    #> [1] 1 1 1 1 3 3 3 3 NA 3
    GDBTF
    x <- c(seq_len(8), NA, 10L)
    fifelse(x < 5L, TRUE, FALSE) # dplyr::if_else
    #> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA FALSE
    pGFMTF

    View Slide

  28. ϕΫτϧૢ࡞ؔ਺
    fcoalesce(x, seq_len(10)) # dplyr::coalesce
    #> [1] 1 2 3 4 5 6 7 8 9 10
    GDPBMFTDF
    OBpMM
    nafill(x, fill = 0)
    #> [1] 1 2 3 4 5 6 7 8 0 10

    View Slide


  29. ؔ࿈ύοέʔδ

    View Slide

  30. EUQMZS
    όοΫάϥ΢ϯυʹEBUBUBCMF
    EQMZSEBUBUBCMF
    ࣮ߦ଎౓͸
    EBUBUBCMFEUQMZSEQMZS
    EQMZSߏจ͕࢖͑ͯ
    EBUBUBCMFʹ͍ۙύϑΥʔϚϯε

    View Slide

  31. EUQMZS
    dt_starwars %>%
    group_by(species) %>%
    summarise(height = mean(height, na.rm = TRUE),
    .groups = “drop")
    library(dtplyr)
    dt_starwars <-
    lazy_dt(starwars)
    #> Source: local data table [?? x 3]
    #> Call: `_DT1`[, .(height = mean(height,
    #> na.rm = TRUE), .groups = "drop"),
    #> keyby = .(species)]
    #>
    #> species height .groups
    #>
    #> 1 181. drop
    #> 2 Aleena 79 drop
    #> 3 Besalisk 198 drop
    #> 4 Cerean 198 drop
    #> 5 Chagrian 196 drop
    #> 6 Clawdite 168 drop
    #> # Use as.data.table()/as.data.frame()/as_tibble() to access results
    .Last.value %>%
    as_tibble()
    EQMZSͰͷॲཧ
    ࣮ߦ݁ՌΛ
    ؀ڥʹฦ͢

    View Slide

  32. "OE.PSFʜ
    UJEZUBCMF
    EUQMZS͕αϙʔτ͢Δͷ͸EQMZS
    UJEZSͷؔ਺΋ؚΊΔ
    UJEZGBTUʜUJEZSͷQJWPU@
    ͳͲ
    UJEZGTUʜGTUΛ࢖͍ͬͯͯߴ଎
    ʢͬΆ͍ʣ

    View Slide

  33. ·ͱΊ
    EBUBUBCMF
      σʔλͷૢ࡞͸<ԋࢉࢠ಺Ͱ
    ⁠ UJEZWFSTFʹྼΒͳ͍ػೳ
    ⁞ ߴ଎ɺϝϞϦར༻ޮ཰΋ྑ͍
    ؔ࿈ɺ֦ுύοέʔδ΋ॆ࣮

    View Slide

  34. Enjoy


    IUUQTHJUIVCDPNTQPOTPSTVSJCP

    View Slide