vroom

File import in R could be considered a solved problem, with multiple widely used packages (data.table, readr, and others) providing fast, robust import of common formats in addition to the functions available in base R.

However, I think existing approaches still leave room for improvement. vroom can index and then query multi-gigabyte files, including those with categorical, text, and temporal data, in near real time. This is a huge boon for interactive data analysis, as you can jump directly into exploratory analysis without sampling or waiting for a full import. vroom leverages the ALTREP framework introduced in R 3.5, along with lazy, just-in-time parsing of the data, to provide this improved latency without requiring changes to existing data manipulation code.
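
To make the workflow concrete, here is a minimal sketch, assuming the vroom package is installed. The file contents and the names `path`, `trips`, and `mean_fare` are made up for illustration, and `altrep = TRUE` is the argument spelling in recent vroom releases, so it may differ in older versions.

```r
library(vroom)

# Write a tiny, made-up CSV so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "medallion,fare_amount",
  "A1,6.5",
  "B2,12.0",
  "C3,8.5"
), path)

# vroom() indexes the file; with altrep = TRUE, columns can be backed by
# ALTREP vectors whose values are parsed when they are accessed.
# col_types = "cd" declares one character and one double column.
trips <- vroom(path, delim = ",", col_types = "cd", altrep = TRUE)

# Ordinary data-frame code works unchanged on the result.
mean_fare <- mean(trips$fare_amount)
print(mean_fare)
```

On a file this small the laziness is invisible; the point is only that the calling code looks the same as with any other reader.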

Jim Hester

July 12, 2019

Transcript

  1. Vroom: Life’s Too Short To Drive Slow

    Jim Hester @jimhester @jimhester_ Read Photo by Joe Neric on Unsplash SPEAKERDECK.COM/JIMHESTER/VROOM
  2. Taxi Trip Fare (Photo by Andre Benz on Unsplash)

    Observations: 14,776,615 Variables: 11 File Size: 1.55G chr [4]: medallion, hack_license, v dbl [6]: fare_amount, surcharge, dttm [1]: pickup_datetime
  8. Photo by Tim Mossholder on Unsplash

    Memory mapped Multi-threaded strcspn() Altrep Importance
  11. Altrep (Photo by Callum Shaw on Unsplash)

    Alternative representation; R 3.5+; Custom memory storage; Transparent to R & C/C++; On-demand parsing
  15. Global string pool: x "a" "abc" "d" "efghi" "xy" y

    +++ less memory --- hash lookup --- single threaded
  16. Selection

    vroom("taxi.csv") %>%
      dplyr::select(medallion, pickup_datetime, ends_with("amount"))

    # select
    vroom("taxi.csv", col_select = list(medallion, pickup_datetime, ends_with("amount")))

    # remove
    vroom("taxi.csv", col_select = -hack_license)

    # rename
    vroom("taxi.csv", col_select = list(taxi = medallion, everything()))
  20. Fixed width files: 480,174 x 156, 1.05 GB file

    vroom_fwf(altrep = TRUE) 740 ms
    vroom_fwf(altrep = FALSE) 10.1 sec
    readr::read_fwf() 24.3 sec
    read.fwf() 17.8 min
  21. Counting lines: 14,776,616 lines, 1.55 GB file

    length(vroom_lines(file)) 501 ms
    length(data.table::fread(file, sep = "\n", header = FALSE)[[1]]) 9.19 sec
    length(readr::read_lines(file)) 11.82 sec
    length(readLines(file)) 20.72 sec
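
The line-counting comparison can be tried on a small scale. This is a sketch on a made-up temporary file, assuming the vroom package is installed; the names `path`, `n_vroom`, and `n_base` are illustrative. It only checks that the two approaches agree on the count and says nothing about timings.

```r
library(vroom)

# Create a throwaway text file with a known number of lines.
path <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), path)

# Count lines with vroom and with base R.
n_vroom <- length(vroom_lines(path))
n_base <- length(readLines(path))

print(c(vroom = n_vroom, base = n_base))
```

To compare speed, wrap each call in `system.time()` and point `path` at a file large enough for the difference to show.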