R and Big Data (Databases)

You Love Big Data! So Does R.
Alex K Gold, @alexkgold, #rstats
https://github.com/akgold/bdl_2019

R For Big Data
Single-Threaded Just Slow TRUE In-Memory DON’T MATTER

Single-Threaded
• doFuture
• RStudio Server Pro Launcher
Just Slow
• Fast Enough (profvis)
• Rcpp
In-Memory

In-Memory Database Dev Machine To Scale

Big Data Strategies for R

Strategy 1: Sample and Model
Use your favorite R modeling package (caret/parsnip/rsample).
Really good for iterating/prototyping.
☹ Requires care for sampling and scaling.
☹ Not good for BI tasks.
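A minimal sketch of Sample and Model, not the demo from the talk. It assumes an in-memory SQLite database stands in for the real one; the table name `permissions`, its columns, and the simulated rows are all hypothetical, and base R's `glm()` stands in for whichever modeling package you prefer.

```r
library(DBI)

# Hypothetical stand-in for a large database table (simulated data).
set.seed(42)
n <- 10000
big <- data.frame(
  id        = seq_len(n),
  days_open = round(rexp(n, rate = 1 / 60)),
  borough   = sample(c("A", "B", "C"), n, replace = TRUE)
)
big$approved <- rbinom(n, 1, plogis(1 - big$days_open / 100))

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "permissions", big)

# Pull a random sample of rows instead of the whole table.
# ORDER BY RANDOM() is SQLite syntax; other databases sample differently.
train <- dbGetQuery(
  con,
  "SELECT * FROM permissions ORDER BY RANDOM() LIMIT 1000"
)

# Fit the model on the sample only.
fit <- glm(approved ~ days_open + borough, data = train, family = binomial())

dbDisconnect(con)
```

A simple random sample like this can under-represent rare classes, which is one reason the slide asks for care with sampling.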

Demo!
Data: London Planning Permissions 2006-current
Model

Strategy 2: Chunk and Pull
Great when discrete chunks exist.
Facilitates parallelization.
☹ Can’t have interactions between chunks.
☹ Eventually pull in all data...
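A minimal sketch of Chunk and Pull, again assuming an in-memory SQLite database as a stand-in. The `permissions` table, the `borough` chunking column, and the helper `summarise_chunk()` are hypothetical names chosen for illustration, and the rows are simulated.

```r
library(DBI)

# Hypothetical stand-in table with three discrete chunks (simulated data).
set.seed(1)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "permissions", data.frame(
  borough   = rep(c("A", "B", "C"), each = 200),
  days_open = rexp(600, rate = 1 / 60)
))

# Discover the chunks first; only the chunk keys are pulled here.
boroughs <- dbGetQuery(con, "SELECT DISTINCT borough FROM permissions")$borough

# Pull one chunk into memory, process it, and return a small result.
summarise_chunk <- function(b) {
  chunk <- dbGetQuery(
    con,
    "SELECT days_open FROM permissions WHERE borough = ?",
    params = list(b)
  )
  data.frame(borough = b, n = nrow(chunk), mean_days = mean(chunk$days_open))
}

# Chunks are independent, so they can be processed one at a time.
results <- do.call(rbind, lapply(boroughs, summarise_chunk))

dbDisconnect(con)
```

Because no chunk depends on another, the `lapply()` call could be swapped for a parallel map (for example with doFuture), provided each worker opens its own database connection.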

Demo!

Strategy 3: Push Compute to Data
Take advantage of database strengths.
Get whole dataset, but move less data.
☹ Operations might not be permitted in database.
☹ Maybe your database is slow?
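A minimal sketch of Push Compute to Data using dplyr against a database table (this needs the dbplyr package installed). As before, the in-memory SQLite database, the `permissions` table, and its columns are hypothetical stand-ins with simulated rows.

```r
library(DBI)
library(dplyr)  # dbplyr must be installed for database translation

# Hypothetical stand-in table (simulated data).
set.seed(1)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "permissions", data.frame(
  borough   = rep(c("A", "B", "C"), each = 200),
  days_open = rexp(600, rate = 1 / 60)
))

# The pipeline is translated to SQL and runs inside the database.
# Only the small aggregated result is moved into R by collect().
summary_tbl <- tbl(con, "permissions") %>%
  group_by(borough) %>%
  summarise(n = n(), mean_days = mean(days_open, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```

The aggregation covers every row, yet only one row per group crosses the wire. Replacing `collect()` with `show_query()` prints the generated SQL, which helps when checking whether an operation can run in the database at all.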

Demo!

3 Big Data Strategies for R
1. Sample and Model
2. Chunk and Pull
3. Push Compute to Data
But why R?
➔ None are perfect!

Import Tidy Visualize Transform Model Communicate or Automate Discovery Cycle

What about deployment?
Open-Source (Free!)
• Build-your-own
• Shiny Server
Enterprise Products
• RStudio Connect
• Evals

Recommendation Summary
Problem: Single-Threading
Solution:
• Many R packages
  ◦ My favorite: doFuture
• RStudio Server Pro Job Launcher
Problem: R is Slow
Solution:
• Profile with profvis
• Write in a faster language, call from R (Rcpp)
Problem: In-Memory Data
Solution:
• Adopt a big data paradigm for R
  1. Sample and Model
  2. Chunk and Pull
  3. Push Compute to Data
https://github.com/akgold/bdl_2019
db.rstudio.com
spark.rstudio.com
therinspark.com
