
vroom

File import in R could be considered a solved problem, with multiple widely used packages (data.table, readr, and others) providing fast, robust import of common formats in addition to the functions available in base R.

However, I feel there is still room for improvement in existing approaches. vroom is able to index and then query multi-gigabyte files, including those with categorical, text and temporal data, in near real-time. This is a huge boon for interactive data analysis, as you can jump directly into exploratory analysis without sampling or long waits for full import. vroom leverages the Altrep framework introduced in R 3.5, along with lazy, just-in-time parsing of the data, to provide this improved latency without requiring changes to existing data manipulation code.

Jim Hester

July 12, 2019

Transcript

  1. SPEAKERDECK.COM/JIMHESTER/VROOM
    Vroom
    Life’s Too Short To Drive Slow
    Jim Hester
    @jimhester  @jimhester_
    Read
    Photo by Joe Neric on Unsplash

  2. Photo by Joshua Reddekopp on Unsplash

  3. 0.1 second
    1 second
    10 seconds
    source: https://www.nngroup.com/articles/response-times-3-important-limits

  4. source: https://twitter.com/amasad/status/1141066550180552704

  5. Taxi Trip Fare
    Photo by Andre Benz on Unsplash
    Observations: 14,776,615
    Variables: 11
    File Size: 1.55G
    chr [4]: medallion, hack_license, v
    dbl [6]: fare_amount, surcharge,
    dttm [1]: pickup_datetime

  12. 1.8 seconds

  17. Photo by Tim Mossholder on Unsplash
    Memory mapped
    Multi-threaded
    strcspn()
    Altrep
    Importance
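
The slide names the ingredients but not how they fit together. As a rough illustration only (this is base R, not vroom's C++ code, and the helper name `index_delimiters` is made up for this sketch), indexing amounts to scanning the raw bytes once and recording where each delimiter and newline sits. That is the kind of work a `strcspn()`-style scan does in C:

```r
# Hypothetical helper: record the byte offsets of field and line breaks
# in a raw buffer, without converting any field to an R value.
index_delimiters <- function(bytes, delim = ",") {
  stopifnot(is.raw(bytes))
  delim_byte <- charToRaw(delim)
  newline <- charToRaw("\n")
  list(
    delims = which(bytes == delim_byte),
    newlines = which(bytes == newline)
  )
}

# A tiny in-memory stand-in for a memory-mapped CSV file.
csv <- charToRaw("id,fare\n1,5.5\n2,12.25\n")
idx <- index_delimiters(csv)
idx$delims    # offsets of the commas
idx$newlines  # offsets of the line endings
```

With the offsets in hand, any single field can later be located and parsed on its own, which is what makes deferring the parsing possible.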

  23. Photo by Callum Shaw on Unsplash
    Altrep
    Alternative representation
    R 3.5+
    Custom memory storage
    Transparent to R & C/C++
    On-demand parsing
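
Altrep classes are written in C or C++, so they cannot be shown directly in R code. The idea of on-demand parsing can still be sketched with a closure, assuming a made-up `lazy_double()` constructor: the text is kept as-is, and conversion to numbers happens only on first access. Unlike a real Altrep vector, this object does not look like an ordinary vector to the rest of R.

```r
# Hypothetical lazy column: keeps the raw text and converts it to numbers
# only the first time a value is requested, then reuses the parsed result.
lazy_double <- function(text) {
  parsed <- NULL
  n_parses <- 0L
  list(
    length = function() length(text),  # known without parsing
    get = function(i) {
      if (is.null(parsed)) {
        parsed <<- as.numeric(text)
        n_parses <<- n_parses + 1L
      }
      parsed[i]
    },
    parse_count = function() n_parses
  )
}

fare <- lazy_double(c("5.5", "12.25", "7"))
fare$length()       # 3, nothing parsed yet
fare$parse_count()  # 0
fare$get(2)         # 12.25, triggers the parse
fare$get(3)         # 7, served from the cached values
fare$parse_count()  # 1
```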

  29. Global string pool
    x, y
    "a" "abc" "d" "efghi" "xy"
    +++ less memory
    --- hash lookup
    --- single threaded
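
The memory benefit of the string pool can be seen from plain R. This is a small sketch using `object.size()`, which only approximates memory use but does account for strings shared within a single character vector; the vectors here are invented for the example.

```r
# 10,000 copies of one string versus 10,000 distinct strings.
repeated <- rep("medallion", 1e4)
distinct <- sprintf("medallion_%05d", seq_len(1e4))

object.size(repeated)
object.size(distinct)

# The repeated vector is far smaller: its elements all point at
# a single pooled string instead of 10,000 separate ones.
object.size(repeated) < object.size(distinct)
```

The costs listed on the slide are the other side of the same design: every new string has to be looked up in the pool's hash table, and the pool can only be touched from one thread.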

  30. Cost of laziness

  32. Maximum cost

  33. All doubles

  35. 0.47 seconds

  37. All characters

  38. Less raw data than dbl
    Much slower than dbl

  39. 0.36 seconds

  40. Other features

  44. Selection
    library(vroom)
    library(dplyr)  # for %>% and the selection helpers
    # read everything, then select with dplyr
    vroom("taxi.csv") %>%
      dplyr::select(medallion, pickup_datetime, ends_with("amount"))
    # select while reading
    vroom("taxi.csv",
      col_select = list(medallion, pickup_datetime, ends_with("amount")))
    # remove
    vroom("taxi.csv", col_select = -hack_license)
    # rename
    vroom("taxi.csv", col_select = list(taxi = medallion, everything()))

  47. Multiple files
    vroom(c("taxi_1.csv", "taxi_2.csv"), id = "path")
    12.5 seconds
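
To try the multiple-file call without the taxi data, here is a minimal sketch that assumes the vroom package is installed. The two files and their contents are invented for the example; only the `vroom()` call with `id = "path"` mirrors the slide.

```r
library(vroom)

# Two small stand-in files (made up for this example) with the same columns.
dir <- tempfile("taxi_")
dir.create(dir)
files <- file.path(dir, c("taxi_1.csv", "taxi_2.csv"))
writeLines(c("medallion,fare_amount", "A1,5.5", "B2,7"), files[1])
writeLines(c("medallion,fare_amount", "C3,12.25"), files[2])

# id = "path" adds a column recording which file each row came from.
trips <- vroom(
  files,
  delim = ",",
  id = "path",
  col_types = list(medallion = "c", fare_amount = "d")
)

nrow(trips)                   # rows from both files combined
basename(unique(trips$path))  # the source file of each row
```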

  48. Fixed width files
    480,174 x 156 - 1.05 GB file
    vroom_fwf(altrep = TRUE)     740 ms
    vroom_fwf(altrep = FALSE)    10.1 sec
    readr::read_fwf()            24.3 sec
    read.fwf()                   17.8 min

  49. Counting lines
    14,776,616 lines - 1.55 GB file
    length(vroom_lines(file))                      501 ms
    length(data.table::fread(file, sep = "\n",
      header = FALSE)[[1]])                        9.19 sec
    length(readr::read_lines(file))                11.82 sec
    length(readLines(file))                        20.72 sec

  52. Writing
    vroom_write(df, "taxi.tsv.gz", delim = "\t")
    gzip support in data.table devel
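
A round trip makes the writing call concrete. This sketch assumes the vroom package is installed; the data frame and the temporary file name are invented for illustration.

```r
library(vroom)

df <- data.frame(
  medallion = c("A1", "B2"),
  fare_amount = c(5.5, 12.25)
)

# The ".gz" extension asks vroom_write() to gzip the output.
out <- tempfile(fileext = ".tsv.gz")
vroom_write(df, out, delim = "\t")

# Read it back; vroom() decompresses based on the file extension.
back <- vroom(
  out,
  delim = "\t",
  col_types = list(medallion = "c", fare_amount = "d")
)
back
```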

  53. vroom
    install.packages("vroom")
    vroom.r-lib.org
    bit.ly/vroom-yt
    github.com/r-lib/vroom/issues
    @jimhester  @jimhester_