
vroom


File import in R could be considered a solved problem, with multiple widely used packages (data.table, readr, and others) providing fast, robust import of common formats in addition to the functions available in base R.

However, I feel there is still room for improvement in existing approaches. vroom is able to index and then query multi-gigabyte files, including those with categorical, text, and temporal data, in near real time. This is a huge boon for interactive data analysis, as you can jump directly into exploratory analysis without sampling or long waits for full import. vroom leverages the Altrep framework introduced in R 3.5, along with lazy, just-in-time parsing of the data, to provide this improved latency without requiring changes to existing data manipulation code.
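
As a minimal sketch of that workflow (assuming the vroom package is installed; the file contents and column names below are made up for illustration), the call returns once the file has been indexed, and ordinary R code then works on the result unchanged:

```r
library(vroom)

# Hypothetical sample data written to a temporary CSV file.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "medallion,fare_amount,pickup_datetime",
  "A1,6.5,2013-01-01 15:11:48",
  "B2,12.0,2013-01-06 00:18:35",
  "A1,9.5,2013-01-05 18:49:41"
), path)

# vroom() indexes the file; where Altrep is in use (R >= 3.5),
# values are parsed only when they are first accessed.
fares <- vroom(path, delim = ",")

# Touching a column forces parsing of whatever part of it was still lazy.
mean_fare <- mean(fares$fare_amount)
```

On a file this small the laziness is invisible; the latency benefit described above only shows up on large inputs.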

Jim Hester

July 12, 2019

Transcript

  1. SPEAKERDECK.COM/JIMHESTER/VROOM Vroom Life’s Too Short To Drive Slow

    Jim Hester  @jimhester  @jimhester_ Read Photo by Joe Neric on Unsplash
  2. Photo by Joshua Reddekopp on Unsplash

  3. 0.1 second, 1 second, 10 seconds (source: https://www.nngroup.com/articles/response-times-3-important-limits)

  4. source: https://twitter.com/amasad/status/1141066550180552704

  5-10. Taxi Trip Fare. Photo by Andre Benz on Unsplash

    Observations: 14,776,615
    Variables: 11
    File Size: 1.55G
    chr [4]: medallion, hack_license, v
    dbl [6]: fare_amount, surcharge,
    dttm [1]: pickup_datetime
  12. 1.8 seconds

  13-17. Photo by Tim Mossholder on Unsplash

    Importance: Memory mapped, Multi-threaded, strcspn(), Altrep
  18-23. Altrep. Photo by Callum Shaw on Unsplash

    Alternative representation
    R 3.5+
    Custom memory storage
    Transparent to R & C/C++
    On-demand parsing
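
Base R itself uses Altrep for integer sequences, which gives a small way to see the idea without vroom. This is only a sketch, assuming R 3.5 or later; `.Internal(inspect())` is an undocumented debugging helper, and its output format may differ between R versions:

```r
# 1:n is stored as a compact Altrep sequence (start and length only),
# so creating it does not allocate a million integers up front.
x <- 1:1e6

# The internal inspector labels such an object as "compact".
info <- capture.output(.Internal(inspect(x)))
is_compact <- any(grepl("compact", info))

# Ordinary R code still sees a normal integer vector.
len <- length(x)
third <- x[3]
```

vroom applies the same mechanism to columns of a file: the vector looks ordinary to R and C/C++ code, while the values behind it are produced on demand.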
  25-29. Global string pool

    x, y
    "a" "abc" "d" "efghi" "xy"
    +++ less memory
    --- hash lookup
    --- single threaded
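
The memory benefit of the string pool can be observed from base R. A rough sketch follows; the vectors are made up for illustration, and `object.size()` is only an estimate that counts each distinct string once:

```r
n <- 1e5

# One distinct string repeated n times: n pointers into the pool,
# but only a single copy of the characters.
repeated <- rep("efghi", n)

# n distinct strings: n pointers plus n separate pooled strings.
distinct <- paste0("id", seq_len(n))

size_repeated <- as.numeric(object.size(repeated))
size_distinct <- as.numeric(object.size(distinct))

# Ratio of the two estimates; the repeated vector is much smaller.
size_ratio <- size_distinct / size_repeated
```

The costs listed on the slide are the other side of this design: every new string needs a hash lookup into the pool, and the pool is accessed from a single thread.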
  30. Cost of laziness

  32. Maximum cost

  33. All doubles

  35. 0.47 seconds

  37. All characters

  38. less raw data than dbl; Much slower than dbl

  39. 0.36 seconds

  40. Other features

  41-44. Selection

    vroom("taxi.csv") %>%
      dplyr::select(medallion, pickup_datetime, ends_with("amount"))

    # select
    vroom("taxi.csv", col_select = list(medallion, pickup_datetime, ends_with("amount")))

    # remove
    vroom("taxi.csv", col_select = -hack_license)

    # rename
    vroom("taxi.csv", col_select = list(taxi = medallion, everything()))
  45-47. Multiple files

    vroom(c("taxi_1.csv", "taxi_2.csv"), id = "path")

    12.5 seconds
  48. Fixed width files: 480,174 x 156, 1.05 Gb file

    vroom_fwf(altrep = TRUE)     740 ms
    vroom_fwf(altrep = FALSE)    10.1 sec
    readr::read_fwf()            24.3 sec
    read.fwf()                   17.8 min
  49. Counting lines: 14,776,616 lines, 1.55 Gb file

    length(vroom_lines(file))                                           501 ms
    length(data.table::fread(file, sep = "\n", header = FALSE)[[1]])    9.19 sec
    length(readr::read_lines(file))                                     11.82 sec
    length(readLines(file))                                             20.72 sec
  50-52. Writing

    vroom_write(df, "taxi.tsv.gz", delim = "\t")

    gzip support in data.table devel
  53. vroom

    install.packages("vroom")
    vroom.r-lib.org
    bit.ly/vroom-yt
    github.com/r-lib/vroom/issues
    @jimhester  @jimhester_
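
The multiple-file read and the compressed write from the slides can be tried end to end on small files. This is a sketch assuming the vroom package is installed; the data, directory, and file contents are invented for illustration and are not the taxi data from the talk:

```r
library(vroom)

dir <- tempfile("taxi")
dir.create(dir)
files <- file.path(dir, c("taxi_1.csv", "taxi_2.csv"))

# Two hypothetical files sharing the same columns.
writeLines(c("medallion,fare_amount", "A1,6.5", "B2,12"), files[1])
writeLines(c("medallion,fare_amount", "C3,9.5"), files[2])

# Read both at once; id = "path" adds a column recording the source file.
taxi <- vroom(files, delim = ",", id = "path")

# Write tab-separated output; the .gz extension requests gzip compression.
out <- file.path(dir, "taxi.tsv.gz")
vroom_write(taxi, out, delim = "\t")

# Read the compressed file back to confirm the round trip.
taxi_back <- vroom(out, delim = "\t")
```

Keeping the source path as a column makes it possible to group or filter by file after the combined read.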