R and Big Data (Databases)

Alex Gold
November 14, 2019


Presentation for Big Data London, 2019 on using big data in R via database connections.




  1. You Love Big Data! So Does R. https://github.com/akgold/bdl_2019 @alexkgold Alex K Gold #rstats
  2. R For Big Data Single-Threaded Just Slow TRUE In-Memory DON’T MATTER
  3. Single-Threaded: • doFuture • RStudio Server Pro Launcher. Just Slow: • Fast Enough (profvis) • Rcpp. In-Memory
  4. In-Memory Database Dev Machine To Scale

  5. Big Data Strategies for R

  6. Strategy 1: Sample and Model. Use your favorite R modeling package (caret/parsnip/rsample). Really good for iterating/prototyping. ☹ Requires care for sampling and scaling. ☹ Not good for BI tasks.
  7. Demo! Data: London Planning Permissions, 2006-current. Model

  8. Strategy 2: Chunk and Pull. Great when discrete chunks exist. Facilitates parallelization. ☹ Can’t have interactions between chunks. ☹ Eventually pull in all data...
  9. Demo!

  10. Strategy 3: Push Compute to Data. Take advantage of database strengths. Get whole dataset, but move less data. ☹ Operations might not be permitted in database. ☹ Maybe your database is slow?
  11. Demo!

  12. 3 Big Data Strategies for R: 1. Sample and Model 2. Chunk and Pull 3. Push Compute to Data. But why R? ➔ None are perfect!
  13. Discovery Cycle (diagram): Import, Tidy, Visualize, Transform, Model, Communicate or Automate

  14. What about deployment? Open-Source (Free!): • Build-your-own • Shiny Server. Enterprise Products: • RStudio Connect • Evals
  15. Recommendation Summary

    Problem: Single-Threading. Solution: • Many R packages ◦ My favorite: doFuture • RStudio Server Pro Job Launcher

    Problem: R is Slow. Solution: • Profile with profvis • Write in a faster language, call from R (Rcpp)

    Problem: In-Memory Data. Solution: • Adopt a big data paradigm for R: 1. Sample and Model 2. Chunk and Pull 3. Push Compute to Data

    https://github.com/akgold/bdl_2019 db.rstudio.com spark.rstudio.com therinspark.com
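
The three strategies listed in the summary can be sketched against a small database connection. What follows is a minimal illustration, not the code from the talk's demos: the `permits` table, its columns, and the simulated values are hypothetical stand-ins for the London planning data, and an in-memory SQLite database (via the DBI and RSQLite packages) is assumed in place of a real database server.

```r
library(DBI)

set.seed(42)

# Hypothetical stand-in data; the talk's demos use London planning permissions.
n <- 10000
permits <- data.frame(
  borough = sample(c("Camden", "Hackney", "Lambeth"), n, replace = TRUE),
  units   = rpois(n, lambda = 3),
  year    = sample(2006:2019, n, replace = TRUE)
)
permits$approved <- rbinom(n, 1, plogis(-0.5 + 0.2 * permits$units))

# Assumption: an in-memory SQLite database stands in for the "big" database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "permits", permits)

# Strategy 1: Sample and Model.
# Pull a random sample of rows and fit the model locally.
sampled <- dbGetQuery(
  con, "SELECT * FROM permits ORDER BY RANDOM() LIMIT 1000"
)
fit_sample <- glm(approved ~ units, data = sampled, family = binomial())

# Strategy 2: Chunk and Pull.
# Pull one discrete chunk (here, one borough) at a time and model each chunk.
boroughs <- dbGetQuery(con, "SELECT DISTINCT borough FROM permits")$borough
fit_chunk <- function(b) {
  chunk <- dbGetQuery(
    con, "SELECT * FROM permits WHERE borough = ?", params = list(b)
  )
  coef(glm(approved ~ units, data = chunk, family = binomial()))
}
chunk_coefs <- t(sapply(boroughs, fit_chunk))

# Strategy 3: Push Compute to Data.
# Let the database aggregate; only the small summary table moves into R.
by_borough <- dbGetQuery(con, "
  SELECT borough, COUNT(*) AS n, AVG(approved) AS approval_rate
  FROM permits
  GROUP BY borough
  ORDER BY borough
")

dbDisconnect(con)
```

The chunk loop is written with `sapply()` for brevity; because each chunk is independent, the same function could in principle be handed to a parallel backend such as doFuture. The aggregate query in the third step could also be generated from dplyr verbs through dbplyr rather than written as SQL by hand.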