Slide 1

Slide 1 text

SPEAKERDECK.COM/JIMHESTER/VROOM Vroom Life’s Too Short To Drive Slow Jim Hester @jimhester @jimhester_ Read Photo by Joe Neric on Unsplash

Slide 2

Slide 2 text

SPEAKERDECK.COM/JIMHESTER/VROOM SPEAKERDECK.COM/JIMHESTER/VROOM Photo by Joshua Reddekopp on Unsplash

Slide 3

Slide 3 text

0.1 second, 1 second, 10 seconds
source: https://www.nngroup.com/articles/response-times-3-important-limits

Slide 4

Slide 4 text

source: https://twitter.com/amasad/status/1141066550180552704

Slide 5

Slide 5 text

Taxi Trip Fare
Observations: 14,776,615
Variables: 11
File Size: 1.55G
chr [4]: medallion, hack_license, v
dbl [6]: fare_amount, surcharge,
dttm [1]: pickup_datetime
Photo by Andre Benz on Unsplash

Slide 12

Slide 12 text

1.8 seconds

Slide 13

Slide 13 text

Photo by Tim Mossholder on Unsplash

Slide 14

Slide 14 text

Photo by Tim Mossholder on Unsplash. Importance: Memory mapped

Slide 15

Slide 15 text

Photo by Tim Mossholder on Unsplash. Importance: Memory mapped, Multi-threaded

Slide 16

Slide 16 text

Photo by Tim Mossholder on Unsplash. Importance: Memory mapped, Multi-threaded, strcspn()

Slide 17

Slide 17 text

Photo by Tim Mossholder on Unsplash. Importance: Memory mapped, Multi-threaded, strcspn(), Altrep

Slide 18

Slide 18 text

Photo by Callum Shaw on Unsplash. Altrep

Slide 19

Slide 19 text

Photo by Callum Shaw on Unsplash. Altrep: Alternative representation

Slide 20

Slide 20 text

Photo by Callum Shaw on Unsplash. Altrep: Alternative representation; R 3.5+

Slide 21

Slide 21 text

Photo by Callum Shaw on Unsplash. Altrep: Alternative representation; R 3.5+; Custom memory storage

Slide 22

Slide 22 text

Photo by Callum Shaw on Unsplash. Altrep: Alternative representation; R 3.5+; Custom memory storage; Transparent to R & C/C++

Slide 23

Slide 23 text

Photo by Callum Shaw on Unsplash. Altrep: Alternative representation; R 3.5+; Custom memory storage; Transparent to R & C/C++; On-demand parsing
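
The Altrep points above can be tried out with a small sketch. It assumes the vroom package is installed; the temporary file and its two columns are made up for illustration and are not the taxi data from the talk. It reads the same file once with `altrep = TRUE` and once with `altrep = FALSE`, then uses both results like ordinary vectors.

```r
library(vroom)

# Hypothetical toy data; the file and its columns are invented for this sketch.
tmp <- tempfile(fileext = ".csv")
writeLines(c("medallion,fare_amount", "A1,6.5", "B2,12.0", "C3,4.5"), tmp)

spec <- cols(medallion = col_character(), fare_amount = col_double())

# altrep = TRUE: columns may be backed by Altrep vectors and parsed when accessed.
lazy <- vroom(tmp, delim = ",", col_types = spec, altrep = TRUE)

# altrep = FALSE: every column is parsed into an ordinary R vector up front.
eager <- vroom(tmp, delim = ",", col_types = spec, altrep = FALSE)

# From R code both results are used like normal vectors.
mean(lazy$fare_amount)
all(lazy$medallion == eager$medallion)
```

On a file this small no timing difference should be expected; the sketch only shows that the two modes are interchangeable from the caller's side.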

Slide 25

Slide 25 text

Global string pool
Vectors: x
Strings: "a" "abc" "d" "efghi" "xy"

Slide 26

Slide 26 text

Global string pool
Vectors: x, y
Strings: "a" "abc" "d" "efghi" "xy"

Slide 27

Slide 27 text

Global string pool
Vectors: x, y
Strings: "a" "abc" "d" "efghi" "xy"
+++ less memory

Slide 28

Slide 28 text

Global string pool
Vectors: x, y
Strings: "a" "abc" "d" "efghi" "xy"
+++ less memory
--- hash lookup

Slide 29

Slide 29 text

Global string pool
Vectors: x, y
Strings: "a" "abc" "d" "efghi" "xy"
+++ less memory
--- hash lookup
--- single threaded
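
The "less memory" side of the string pool can be seen from base R alone. The sketch below is an illustration, not something taken from the talk: it compares a character vector that repeats one string with a vector of distinct strings of the same length. It relies on `object.size()` taking sharing among the elements of a character vector into account, so the reported sizes are rough indications only.

```r
# 100,000 copies of one string: every element points at the same pooled string.
repeated <- rep("abc", 1e5)

# 100,000 distinct strings: each element needs its own entry in the pool.
distinct <- paste0("s", seq_len(1e5))

size_repeated <- object.size(repeated)
size_distinct <- object.size(distinct)

print(size_repeated, units = "Kb")
print(size_distinct, units = "Kb")
```

The repeated vector should report a noticeably smaller size, because only the pointers are stored per element.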

Slide 30

Slide 30 text

Cost of laziness

Slide 32

Slide 32 text

Maximum cost

Slide 33

Slide 33 text

All doubles

Slide 35

Slide 35 text

0.47 seconds

Slide 37

Slide 37 text

All characters

Slide 38

Slide 38 text

Less raw data than dbl
Much slower than dbl

Slide 39

Slide 39 text

0.36 seconds

Slide 40

Slide 40 text

Other features

Slide 41

Slide 41 text

Selection
vroom("taxi.csv") %>%
  dplyr::select(medallion, pickup_datetime, ends_with("amount"))
# select
vroom("taxi.csv", col_select = list(medallion, pickup_datetime, ends_with("amount")))
# remove
vroom("taxi.csv", col_select = -hack_license)
# rename
vroom("taxi.csv", col_select = list(taxi = medallion, everything()))
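
Since taxi.csv itself is not part of this transcript, the three `col_select` forms can be tried on a stand-in file. The sketch below assumes vroom is installed; the rows and the `tip_amount` column are hypothetical, added only so that `ends_with("amount")` matches more than one column.

```r
library(vroom)

# Hypothetical stand-in for taxi.csv; rows and the tip_amount column are invented.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "medallion,hack_license,pickup_datetime,fare_amount,tip_amount",
  "A1,H1,2013-01-01 00:00:00,6.5,1.0",
  "B2,H2,2013-01-01 00:05:00,12.0,0.0"
), tmp)

# select: keep only the named and matched columns
picked <- vroom(tmp, delim = ",",
  col_select = list(medallion, pickup_datetime, ends_with("amount")))

# remove: keep everything except hack_license
dropped <- vroom(tmp, delim = ",", col_select = -hack_license)

# rename: medallion becomes taxi, the remaining columns follow
renamed <- vroom(tmp, delim = ",",
  col_select = list(taxi = medallion, everything()))

names(picked)
names(dropped)
names(renamed)
```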

Slide 45

Slide 45 text

Multiple files
vroom(c("taxi_1.csv", "taxi_2.csv"), id = "path")

Slide 47

Slide 47 text

Multiple files
vroom(c("taxi_1.csv", "taxi_2.csv"), id = "path")
12.5 seconds
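
A self-contained version of the multiple-file call might look like the sketch below. It assumes vroom is installed; the two files are tiny hypothetical stand-ins written to a temporary directory, so the 12.5 seconds from the slide does not apply here.

```r
library(vroom)

# Two hypothetical files with identical columns, written to a temporary directory.
tmp_dir <- tempfile("taxi_")
dir.create(tmp_dir)
f1 <- file.path(tmp_dir, "taxi_1.csv")
f2 <- file.path(tmp_dir, "taxi_2.csv")
writeLines(c("medallion,fare_amount", "A1,6.5", "B2,12.0"), f1)
writeLines(c("medallion,fare_amount", "C3,4.5"), f2)

# id = "path" adds a column recording which file each row came from.
trips <- vroom(c(f1, f2), delim = ",", id = "path",
  col_types = cols(medallion = col_character(), fare_amount = col_double()))

nrow(trips)
table(basename(trips$path))
```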

Slide 48

Slide 48 text

Fixed width files (480,174 x 156, 1.05 GB file)
vroom_fwf(altrep = TRUE)    740 ms
vroom_fwf(altrep = FALSE)   10.1 sec
readr::read_fwf()           24.3 sec
read.fwf()                  17.8 min
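
The benchmark file is not available here, but the two `vroom_fwf()` variants from the table can be sketched on a toy file. This assumes vroom is installed; the column widths, names and values are hypothetical.

```r
library(vroom)

# Hypothetical fixed width data: 2 characters of id, then a 6 character amount.
tmp <- tempfile(fileext = ".txt")
writeLines(c("A1   6.5", "B2  12.0", "C3   4.5"), tmp)

positions <- fwf_widths(c(2, 6), col_names = c("medallion", "fare_amount"))
types <- cols(medallion = col_character(), fare_amount = col_double())

# Same file, with and without Altrep-backed columns.
lazy <- vroom_fwf(tmp, positions, col_types = types, altrep = TRUE)
eager <- vroom_fwf(tmp, positions, col_types = types, altrep = FALSE)

all.equal(lazy$fare_amount, eager$fare_amount)
```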

Slide 49

Slide 49 text

Counting lines (14,776,616 lines, 1.55 GB file)
length(vroom_lines(file))    501 ms
length(data.table::fread(file, sep = "\n", header = FALSE)[[1]])    9.19 sec
length(readr::read_lines(file))    11.82 sec
length(readLines(file))    20.72 sec
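
As a minimal sketch of the line-counting call, assuming vroom is installed and using a hypothetical three-line file instead of the 1.55 GB one:

```r
library(vroom)

# Hypothetical three-line file (one header line plus two records).
tmp <- tempfile(fileext = ".csv")
writeLines(c("medallion,fare_amount", "A1,6.5", "B2,12.0"), tmp)

lines <- vroom_lines(tmp)

# The header counts as a line, as it does with readLines().
length(lines)
length(readLines(tmp))
```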

Slide 50

Slide 50 text

Writing
vroom_write(df, "taxi.tsv.gz", delim = "\t")

Slide 52

Slide 52 text

Writing
vroom_write(df, "taxi.tsv.gz", delim = "\t")
gzip support in data.table devel
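
A round trip shows the writer and reader working together on a gzip-compressed file. The sketch assumes vroom is installed and uses the built-in `mtcars` data frame as a stand-in for `df`, since the taxi data is not included in this transcript.

```r
library(vroom)

# The .gz extension asks vroom_write() to compress the output.
tmp <- tempfile(fileext = ".tsv.gz")
vroom_write(mtcars, tmp, delim = "\t")

# vroom() reads the compressed file back; all mtcars columns are numeric.
back <- vroom(tmp, delim = "\t", col_types = cols(.default = col_double()))

dim(back)
all.equal(back$mpg, mtcars$mpg)
```

Row names are not written, so `back` has the 11 data columns of `mtcars` and no car names.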

Slide 53

Slide 53 text

vroom
install.packages("vroom")
vroom.r-lib.org
bit.ly/vroom-yt
github.com/r-lib/vroom/issues
@jimhester @jimhester_