Slide 1

Slide 1 text

You Love Big Data! So Does R.
Alex K Gold
@alexkgold #rstats
https://github.com/akgold/bdl_2019

Slide 2

Slide 2 text

R For Big Data
Single-Threaded
Just Slow
TRUE
In-Memory
DON’T MATTER

Slide 3

Slide 3 text

Single-Threaded
● doFuture
● RStudio Server Pro Launcher
Just Slow
● Fast Enough (profvis)
● Rcpp
In-Memory
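For the single-threading point, a minimal sketch of what running a loop in parallel with doFuture can look like. The squaring task and the choice of two workers are illustrative assumptions, not part of the talk; the foreach, future, and doFuture packages are assumed to be installed.

```r
library(foreach)
library(future)
library(doFuture)

# Route foreach's %dopar% through the future framework
registerDoFuture()

# Assumption: two background R sessions on the local machine
plan(multisession, workers = 2)

# Each iteration may run on a different worker; results are combined with c()
res <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

# Return to ordinary sequential evaluation and release the workers
plan(sequential)
```

Because the backend is chosen with `plan()`, the loop body stays the same if the workers later move from local sessions to another backend.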

Slide 4

Slide 4 text

In-Memory
[Diagram labels: Database, Dev Machine, To Scale]

Slide 5

Slide 5 text

Big Data Strategies for R

Slide 6

Slide 6 text

Strategy 1: Sample and Model
Use your favorite R modeling package (caret/parsnip/rsample).
Really good for iterating/prototyping.
☹ Requires care for sampling and scaling.
☹ Not good for BI tasks.
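A minimal sketch of sample-and-model in base R. The data frame `big` is simulated here as a hypothetical stand-in for a table too large to model in full, and `lm()` stands in for whichever modeling package is preferred; none of these names come from the talk's demo.

```r
set.seed(42)  # make the sample reproducible

# Hypothetical stand-in for the full dataset
n <- 1e5
big <- data.frame(x = runif(n))
big$y <- 2 + 3 * big$x + rnorm(n, sd = 0.5)

# Draw a simple random sample of rows, then fit on the sample only
idx <- sample(nrow(big), size = 5000)
sampled <- big[idx, ]
fit <- lm(y ~ x, data = sampled)

coef(fit)
```

A simple random sample is the easiest case. If the outcome is rare or the data is grouped, the sample would likely need to be stratified, which is the "requires care" caveat on the slide.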

Slide 7

Slide 7 text

Data: London Planning Permissions 2006-current
Demo!
Model

Slide 8

Slide 8 text

Strategy 2: Chunk and Pull
Great when discrete chunks exist.
Facilitates parallelization.
☹ Can’t have interactions between chunks.
☹ Eventually pull in all data...
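A minimal sketch of chunk-and-pull in base R. The `borough` key, the simulated data, and the helper `pull_chunk()` are hypothetical; in practice `pull_chunk()` would be a query that fetches one chunk from the database.

```r
set.seed(1)

# Hypothetical data with a natural chunking key
dat <- data.frame(
  borough = rep(c("A", "B", "C"), each = 100),
  x = runif(300)
)
dat$y <- 1 + 2 * dat$x + rnorm(300, sd = 0.1)

# Stand-in for a database query that returns a single chunk
pull_chunk <- function(key) dat[dat$borough == key, ]

# Pull one chunk, fit a model on it, and keep only a small summary
fit_chunk <- function(key) {
  chunk <- pull_chunk(key)
  fit <- lm(y ~ x, data = chunk)
  data.frame(borough = key, slope = coef(fit)[["x"]], n = nrow(chunk))
}

# Process the chunks one at a time and stack the summaries
results <- do.call(rbind, lapply(c("A", "B", "C"), fit_chunk))
results
```

Since each chunk is processed independently, the `lapply()` call is the natural place to swap in a parallel map.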

Slide 9

Slide 9 text

Demo!

Slide 10

Slide 10 text

Strategy 3: Push Compute to Data
Take advantage of database strengths.
Get whole dataset, but move less data.
☹ Operations might not be permitted in database.
☹ Maybe your database is slow?
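A minimal sketch of pushing compute to the data with DBI. An in-memory SQLite database stands in for a real one, and the `permissions` table with its `borough` and `units` columns is made up for illustration; the DBI and RSQLite packages are assumed to be installed.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the real database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, loaded here only so the example is self-contained
dbWriteTable(con, "permissions", data.frame(
  borough = c("A", "A", "B", "B", "B"),
  units   = c(10, 20, 5, 15, 25)
))

# The aggregation runs inside the database; only summary rows return to R
summary_df <- dbGetQuery(con, "
  SELECT borough, COUNT(*) AS n, AVG(units) AS mean_units
  FROM permissions
  GROUP BY borough
  ORDER BY borough
")

dbDisconnect(con)
summary_df
```

The query sees every row, but R only receives one row per group, which is the "whole dataset, less data moved" trade-off.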

Slide 11

Slide 11 text

Demo!

Slide 12

Slide 12 text

3 Big Data Strategies for R
1. Sample and Model
2. Chunk and Pull
3. Push Compute to Data
➔ None are perfect!
But why R?

Slide 13

Slide 13 text

[Diagram: Import → Tidy → Discovery Cycle (Visualize, Transform, Model) → Communicate or Automate]

Slide 14

Slide 14 text

What about deployment?
Open-Source (Free!)
• Build-your-own
• Shiny Server
Enterprise Products
• RStudio Connect
• Evals

Slide 15

Slide 15 text

Recommendation Summary

Problem: Single-Threading
Solution:
● Many R packages
  ○ My favorite: doFuture
● RStudio Server Pro Job Launcher

Problem: R is Slow
Solution:
● Profile with profvis
● Write in a faster language, call from R (Rcpp)

Problem: In-Memory Data
Solution:
● Adopt a big data paradigm for R
  1. Sample and Model
  2. Chunk and Pull
  3. Push Compute to Data

@alexkgold
https://github.com/akgold/bdl_2019
db.rstudio.com
spark.rstudio.com
therinspark.com