vroom

File import in R could be considered a solved problem, with multiple widely used packages (data.table, readr, and others) providing fast, robust import of common formats in addition to the functions available in base R.

However, I feel there is still room for improvement in existing approaches. vroom can index and then query multi-gigabyte files, including those with categorical, text, and temporal data, in near real time. This is a huge boon for interactive data analysis, as you can jump directly into exploratory analysis without sampling or waiting for a full import. vroom leverages the Altrep framework introduced in R 3.5, along with lazy, just-in-time parsing of the data, to provide this improved latency without requiring changes to existing data manipulation code.
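As a rough illustration of the "index first, parse on demand" idea, here is a conceptual sketch in base R. It is not how vroom is implemented: it uses plain closures instead of Altrep vectors, reads the lines into memory instead of memory-mapping the file, and splits fields naively on commas. The names `lazy_csv` and `get_column` are hypothetical and exist only for this example.

```r
# Conceptual sketch only: base R closures standing in for Altrep vectors.
# lazy_csv() and get_column() are hypothetical names, not part of vroom.
lazy_csv <- function(path) {
  # "Index" step: keep the raw lines, but do not parse any field yet.
  lines <- readLines(path)
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
  body <- lines[-1]
  cache <- list()     # parsed columns are stored here on first use
  parse_count <- 0L   # how many columns have actually been parsed

  get_column <- function(name) {
    if (is.null(cache[[name]])) {
      j <- match(name, header)
      if (is.na(j)) stop("unknown column: ", name)
      # Pull field j out of every line (naive split, no quoting support).
      fields <- vapply(strsplit(body, ",", fixed = TRUE), `[`, character(1), j)
      # type.convert() guesses the column type (numeric, character, ...).
      cache[[name]] <<- type.convert(fields, as.is = TRUE)
      parse_count <<- parse_count + 1L
    }
    cache[[name]]
  }

  list(names = header, get = get_column, parsed = function() parse_count)
}

# Usage on a tiny made-up file: only the column that is touched gets parsed.
path <- tempfile(fileext = ".csv")
writeLines(c("medallion,fare_amount", "A1,6.5", "B2,12", "C3,4.5"), path)

taxi <- lazy_csv(path)
taxi$parsed()                    # 0: nothing parsed yet
mean(taxi$get("fare_amount"))    # parses fare_amount only
taxi$parsed()                    # 1
```

The point of the design is that creating the object is cheap, and the parsing cost is paid per column, the first time that column is used.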

Jim Hester

July 12, 2019

Transcript

  1. Vroom: Life’s Too Short To Drive [Read] Slow (SPEAKERDECK.COM/JIMHESTER/VROOM)

    Jim Hester  @jimhester  @jimhester_
    Photo by Joe Neric on Unsplash
  2. Photo by Joshua Reddekopp on Unsplash

  3. 0.1 second, 1 second, 10 seconds (source: https://www.nngroup.com/articles/response-times-3-important-limits)

  4. source: https://twitter.com/amasad/status/1141066550180552704

  5. Taxi Trip Fare (Photo by Andre Benz on Unsplash)

    Observations: 14,776,615
    Variables: 11
    File Size: 1.55G
    chr [4]: medallion, hack_license, v
    dbl [6]: fare_amount, surcharge,
    dttm [1]: pickup_datetime
  12. 1.8 seconds

  13. Photo by Tim Mossholder on Unsplash

  17. Importance (Photo by Tim Mossholder on Unsplash)

    Memory mapped
    Multi-threaded
    strcspn()
    Altrep
  23. Altrep (Photo by Callum Shaw on Unsplash)

    Alternative representation
    R 3.5+
    Custom memory storage
    Transparent to R & C/C++
    On-demand parsing
  29. Global string pool

    x, y: "a" "abc" "d" "efghi" "xy"
    +++ less memory
    --- hash lookup
    --- single threaded
  30. Cost of laziness

  32. Maximum cost

  33. All doubles

  35. 0.47 seconds

  37. All characters

  38. Less raw data than dbl; much slower than dbl

  39. 0.36 seconds

  40. Other features

  41. Selection

    vroom("taxi.csv") %>%
      dplyr::select(medallion, pickup_datetime, ends_with("amount"))

    # select
    vroom("taxi.csv", col_select = list(medallion, pickup_datetime, ends_with("amount")))

    # remove
    vroom("taxi.csv", col_select = -hack_license)

    # rename
    vroom("taxi.csv", col_select = list(taxi = medallion, everything()))
  47. Multiple files

    vroom(c("taxi_1.csv", "taxi_2.csv"), id = "path")

    12.5 seconds
  48. Fixed width files (480,174 x 156, 1.05 Gb file)

    vroom_fwf(altrep = TRUE)     740 ms
    vroom_fwf(altrep = FALSE)    10.1 sec
    readr::read_fwf()            24.3 sec
    read.fwf()                   17.8 min
  49. Counting lines (14,776,616 lines, 1.55 Gb file)

    length(vroom_lines(file))                                           501 ms
    length(data.table::fread(file, sep = "\n", header = FALSE)[[1]])    9.19 sec
    length(readr::read_lines(file))                                     11.82 sec
    length(readLines(file))                                             20.72 sec
  52. Writing

    vroom_write(df, "taxi.tsv.gz", delim = "\t")

    gzip support in data.table devel
  53. vroom

    install.packages("vroom")
    vroom.r-lib.org
    bit.ly/vroom-yt
    github.com/r-lib/vroom/issues
    @jimhester  @jimhester_
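
The calls shown on the slides can be tried end to end on a toy dataset. The sketch below is only an illustration: the two small files and their values are made up (the column names are borrowed from the slides), the variable names `files`, `taxi`, `fares`, and `out` are placeholders, and the vroom calls run only if the package is installed.

```r
# Build two tiny comma-separated files with base R so the example is
# self-contained. Values are invented for illustration.
dir <- tempfile("taxi_")
dir.create(dir)
files <- file.path(dir, c("taxi_1.csv", "taxi_2.csv"))
write.csv(
  data.frame(medallion = c("A1", "B2"), hack_license = c("h1", "h2"),
             fare_amount = c(6.5, 12)),
  files[1], row.names = FALSE
)
write.csv(
  data.frame(medallion = "C3", hack_license = "h3", fare_amount = 4.5),
  files[2], row.names = FALSE
)

if (requireNamespace("vroom", quietly = TRUE)) {
  # Multiple files: rows are combined and the source file is recorded
  # in a column named "path".
  taxi <- vroom::vroom(files, id = "path", delim = ",")

  # Selection: drop a column at read time.
  fares <- vroom::vroom(files[1], delim = ",", col_select = -hack_license)

  # Writing: tab-separated output; the .gz extension requests compression.
  out <- file.path(dir, "taxi.tsv.gz")
  vroom::vroom_write(taxi, out, delim = "\t")
}
```

On files this small the timings from the talk will not be reproduced; the example is only meant to show the shape of the calls.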