2019-06-12 Big Data Toronto: Spark

A presentation for Big Data Toronto 2019 on using R with big data via sparklyr.

Code at https://github.com/akgold/big_data_2019

Alex Gold

June 12, 2019

Transcript

  1. You’re Not Afraid of Big Data ...neither is R.
    Alex K Gold, @alexkgold, rstd.io/big_data_19
  2. R For Big Data
    Single-Threaded • Just Slow • In-Memory: TRUE. DON’T MATTER.
  3. Single-Threaded: • doFuture • RStudio Server Pro Launcher
    Just Slow: • Fast Enough (profvis) • Rcpp
    In-Memory
  4. In-Memory: Database, Dev Machine (To Scale)

  5. R Big Data Paradigms

  6. Paradigm 1: Sample and Model
    Use favorite R modeling package (caret/parsnip/rsample). Really good for iterating/prototyping.
    ☹ Requires care for sampling and scaling. ☹ Not good for BI tasks.
  7. Paradigm 2: Chunk and Pull
    Great when discrete chunks exist. Facilitates parallelization.
    ☹ Can’t have interactions between chunks. ☹ Eventually pull in all data...
  8. Paradigm 3: Push Compute to Data
    Take advantage of database strengths. Get whole dataset, but move less data.
    ☹ Operations might not be permitted in database. ☹ Maybe your database is slow?
  9. 3 Big Data Paradigms for R
    1. Sample and Model 2. Chunk and Pull 3. Push Compute to Data
    ➔ None are perfect! ➔ Use more than one!
    But why R?
  10. Diagram: Import, Tidy, Visualize, Transform, Model, Communicate or Automate (Discovery Cycle)
  11. Demo!
    ➔ Clean data using RMarkdown script ➔ Explore with Spark SQL ➔ Fit a (bad) model using Spark ML ➔ Visualize model quality (bad) with Shiny
  12. Spark
    • General purpose distributed computation. • APIs for Scala, Python, and Java, and ...
    Otherwise…
    Connect via: • DBI database connectors (github.com/r-dbi): ◦ SQLite ◦ Postgres ◦ MariaDB ◦ MySQL ◦ Google BigQuery ◦ ODBC
    Process via: • dbplyr - run dplyr code in database • modeldb - fit model in database • tidypredict - predict in database • dbplot - plot in database
  13. What about deployment?
    Open-Source (Free!): • Build-your-own • Shiny Server
    Enterprise Products: • RStudio Connect • Free 45 Day Eval • Quickstart
  14. Recommendation Summary
    Problem: Single-Threading. Solution: • Many R packages ◦ My favorite: doFuture • RStudio Server Pro Job Launcher
    Problem: R is Slow. Solution: • Profile with profvis • Write in a faster language, call from R (Rcpp)
    Problem: In-Memory Data. Solution: • Adopt a big data paradigm for R: 1. Sample and Model 2. Chunk and Pull 3. Push Compute to Data
    @alexkgold rstd.io/big_data_19 db.rstudio.com spark.rstudio.com
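
The three paradigms from the slides can be contrasted in a few lines. The talk's own code is in R (see the repository linked above); what follows is only a rough sketch using Python's standard `sqlite3` module as a stand-in for a remote database. The `trips` table, its columns, the generated data, and all function names are hypothetical and do not come from the talk.

```python
import random
import sqlite3

# Hypothetical in-memory table standing in for a large remote database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (region TEXT, fare REAL)")
rng = random.Random(42)
rows = [
    (rng.choice(["east", "west", "north"]), round(rng.uniform(5, 50), 2))
    for _ in range(3000)
]
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)


def sample_and_model(conn, n):
    """Paradigm 1: pull a random sample of n rows and estimate on it locally."""
    sample = conn.execute(
        "SELECT fare FROM trips ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()
    fares = [fare for (fare,) in sample]
    return sum(fares) / len(fares)


def chunk_and_pull(conn):
    """Paradigm 2: pull one discrete chunk (a region) at a time, process locally."""
    regions = [
        region
        for (region,) in conn.execute(
            "SELECT DISTINCT region FROM trips ORDER BY region"
        )
    ]
    result = {}
    for region in regions:
        fares = [
            fare
            for (fare,) in conn.execute(
                "SELECT fare FROM trips WHERE region = ?", (region,)
            )
        ]
        result[region] = sum(fares) / len(fares)
    return result


def push_compute(conn):
    """Paradigm 3: the database aggregates; only the summary rows are moved."""
    return dict(conn.execute("SELECT region, AVG(fare) FROM trips GROUP BY region"))


if __name__ == "__main__":
    print("sampled mean fare:", sample_and_model(conn, 200))
    print("per-region (chunked):", chunk_and_pull(conn))
    print("per-region (pushed):", push_compute(conn))
```

In this sketch, chunk and pull eventually transfers every row, one region at a time, while push compute transfers one summary row per region; sample and model transfers only `n` rows and returns an estimate rather than the exact value.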