Hi, I’m Kelly from RStudio Solutions Engineering. I’m a co-organizer of the Washington DC R-Ladies Meetup. I work in a cross-functional group at RStudio, serving our customers, sales team, support team, and product engineering teams. My favorite thing we make is RStudio Connect!
Sparklyr Resources
spark.rstudio.com
RStudio Webinars On Demand:
1. Introducing an R interface for Apache Spark
2. Extending Spark using sparklyr and R
3. Advanced Features of sparklyr
4. Understanding Spark and sparklyr deployment modes
resources.rstudio.com/working-with-spark
Coming Soon! The R in Spark, now Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
- Analysis
- Modeling
- Pipelines
- Clusters
- Connections
- Data
- Tuning
- Extensions
- Distributed R
- Streaming
- Contributing
Plan for Today
- Get access to our infrastructure (RStudio Server Pro)
- Introduction to Spark and R
- Analysis with sparklyr (DBI, dplyr review)
- Modeling (super basic)
Sparklyr Skills:
- Patterns for reading and writing data
- Working with streaming data
- The mechanics of distributing R code
What is Spark? “Apache Spark is a unified analytics engine for large-scale data processing.” -- spark.apache.org
- Unified: Spark supports many libraries, cluster technologies, and storage systems
- Analytics: the discovery and interpretation of data to communicate information
- Engine: Spark is expected to be efficient and generic
- Large-Scale: “cluster-scale”, a set of connected computers working together
Exercise: Install Spark Locally and Connect
● Load sparklyr
● Install Spark 2.3
● Check installed versions
● Connect!
● Disconnect!
FAQ: What is a local Spark cluster good for?
● Getting started
● Testing code
● Troubleshooting
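A minimal sketch of the exercise steps, assuming a plain local install of the Spark version named on the slide:

```r
library(sparklyr)

spark_install(version = "2.3")    # install Spark 2.3 locally
spark_installed_versions()        # check which versions are installed

sc <- spark_connect(master = "local")   # connect to the local Spark cluster
spark_disconnect(sc)                    # disconnect when finished
```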
Exercise: RStudio Connections Pane
● Reconnect to Spark Local through the RStudio Connections Pane
● Copy mtcars to Spark Local
● Preview mtcars in the Connections Pane
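The same steps can also be run from the console; a sketch (the Connections Pane generates an equivalent spark_connect() call):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")                  # reconnect to the local cluster
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)    # copy mtcars into Spark

glimpse(mtcars_tbl)    # preview the remote table from R as well
```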
Tour: Spark Tools in the Connections Pane
Spark UI - Opens the Spark web interface; a shortcut to spark_ui(sc)
Log - Opens the Spark web logs; a shortcut to spark_log(sc)
SQL - Opens a new SQL query
Help - Opens the reference documentation in a new web browser window
Disconnect - Disconnects from Spark; a shortcut to spark_disconnect(sc)
Exercise: Built-in Functions
What if an operation isn't available through dplyr & sparklyr?
● Look for a built-in function available in Spark
● dplyr passes functions it doesn't recognize to the query engine as-is
Example: Spark's percentile() function expects a column name and a percentile value, and returns the exact percentile of a column in a group.
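A sketch of the pass-through behavior, assuming the mtcars_tbl copied to Spark in the earlier exercise; percentile() is a Spark built-in, not an R function:

```r
# percentile() is unknown to dplyr, so it is sent to Spark SQL as-is
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mpg_median = percentile(mpg, 0.5))
```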
Exercise: Correlation - the corrr package
A common EDA task is to calculate and visualize correlations, to find out what kind of statistical relationship exists between paired sets of variables.
The Spark function to calculate correlations across an entire dataset is sparklyr::ml_corr().
The corrr R package contains a backend for Spark, so when a Spark object is used in corrr, the computation happens in Spark: the correlate() function runs sparklyr::ml_corr().
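A sketch with the mtcars table from the earlier exercise:

```r
library(corrr)

# correlate() detects the Spark table and delegates to sparklyr::ml_corr(),
# so the correlation matrix is computed in Spark and only the small result returns to R
mtcars_tbl %>%
  correlate(use = "pairwise.complete.obs", method = "pearson") %>%
  shave() %>%   # keep the lower triangle
  rplot()       # visualize the correlations
```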
Visualizations: Push Compute, Collect Results
1. Ensure the transformation operations happen in Spark
2. Bring the results back into R after the transformations have been performed
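A minimal sketch of the pattern, assuming the mtcars table in Spark:

```r
library(dplyr)
library(ggplot2)

mtcars_tbl %>%
  group_by(cyl) %>%                                  # 1. aggregate in Spark
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  collect() %>%                                      # 2. bring the small result into R
  ggplot(aes(x = factor(cyl), y = mpg_mean)) +
  geom_col()
```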
Visualizations: Using dbplot
dbplot provides helper functions for plotting with remote data.
The dbplot_histogram() function makes Spark calculate the bins and the count per bin, and outputs a ggplot object.
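A sketch with the mtcars table; the binning happens in Spark and only the bin counts come back to R:

```r
library(dbplot)

mtcars_tbl %>%
  dbplot_histogram(mpg, binwidth = 5)   # returns a ggplot object
```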
Problematic Visualizations: Scatter Plots
No amount of “pushing the computation” to Spark will help here, because every individual point must be plotted.
The alternative is to find a plot type that represents the x/y relationship and concentration in a way that is easy to perceive and practical to plot.
Use dbplot_raster() to create a scatter-like plot in Spark, while only collecting a small, aggregated subset of the remote dataset.
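A sketch: Spark aggregates the x/y pairs into a grid, and only the grid cells (not every row) are collected:

```r
mtcars_tbl %>%
  dbplot_raster(wt, mpg, resolution = 100)   # scatter-like plot built from aggregated cells
```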
Challenge: Explore dbplot compute functions
Use dbplot to retrieve the raw aggregates and create alternative visualizations.
To retrieve the aggregates but not the plots, use:
● db_compute_bins()
● db_compute_count()
● db_compute_raster()
● db_compute_boxplot()
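A sketch of the compute-only variant; the result is a small local data frame of aggregates that can be plotted with any geometry (the exact column names depend on dbplot's output):

```r
mpg_bins <- mtcars_tbl %>%
  db_compute_bins(mpg)   # bins and counts computed in Spark, returned to R

mpg_bins                 # inspect the aggregates, then build a custom ggplot
```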
Spark Caching Example
Big data can stress your Spark session when transformations feed directly into modeling, because the lazy query is re-executed each time the model touches the data.
Before fitting a model: the compute() function takes the end of a dplyr piped set of transformations and saves the results to Spark memory.
Understand Spark Caching: spark.rstudio.com/guides/caching
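A sketch of caching prepared data before modeling; the derived column and table name are just for illustration:

```r
cached_cars <- mtcars_tbl %>%
  mutate(hp_per_wt = hp / wt) %>%   # example transformation
  compute("mtcars_prepared")        # materialize the result in Spark memory

# the model now reads the cached table instead of re-running the transformations
fit <- ml_linear_regression(cached_cars, mpg ~ hp_per_wt + cyl)
```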
Reading Data: Paths
When pointed at a folder, Spark assumes that every file in that directory is part of the same dataset.
Loading multiple files into a single data object:
- In R: load each file individually into your R session
- In Spark: treat the folder as the dataset and pass the path containing all the files
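A sketch, with a hypothetical folder of CSV files that share the same layout:

```r
# every file under data/flights/ is read as one dataset (path is hypothetical)
flights_tbl <- spark_read_csv(sc, name = "flights", path = "data/flights/")
```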
Reading Data: Schema
Spark detects column names and types (the schema), but at a cost in time.
For medium-to-large datasets, or files that are read many times, that cost accumulates.
Provide a `columns` argument to describe the schema up front.
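A sketch; the column names and types here are hypothetical and must match the file:

```r
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights/",
  columns = c(carrier = "character", flight = "integer", distance = "double"),
  infer_schema = FALSE    # skip schema detection since the schema is supplied
)
```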
Reading Data: CSV Parsing Tools
Spark modes for parsing malformed CSV files (use them in sparklyr by passing to the options argument):
- Permissive: inserts NULL values for missing tokens
- Drop Malformed: drops lines that are malformed
- Fail Fast: aborts if it encounters any malformed line
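A sketch of passing the mode through options (the mode names follow Spark's CSV reader; the path is hypothetical):

```r
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights/",
  options = list(mode = "DROPMALFORMED")   # or "PERMISSIVE" / "FAILFAST"
)
```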
Reading Data: Working with JSON
How to extract data from nested JSON files:
- Option 1: use a combination of get_json_object() and to_json()
- Option 2: use the sparklyr.nested package and sdf_unnest()
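A sketch of option 1, assuming a Spark table people_tbl with a nested address column (both names are hypothetical); the two functions are Spark built-ins that dplyr passes through as-is:

```r
# to_json() serializes the nested struct, get_json_object() extracts a field by path
people_tbl %>%
  mutate(city = get_json_object(to_json(address), "$.city"))
```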
Connecting Spark with Alternative File Systems
Spark defaults to the file system on which it is currently running.
The file system protocol can be changed when reading or writing via the sparklyr path argument.
Other file system protocols:
● Databricks: dbfs://
● Amazon S3: s3a://
● Microsoft Azure: wasb://
● Google Storage: gs://
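A sketch of reading from S3 by switching the protocol in the path; the bucket and key are hypothetical, and the cluster must already be configured with credentials for that file system:

```r
flights_s3 <- spark_read_csv(
  sc,
  name = "flights_s3",
  path = "s3a://my-bucket/flights/"
)
```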
Reading Data Pro-Tip: memory = FALSE
“Map” the files without copying data into Spark’s distributed memory.
Use case: big data where not all columns are needed (use dplyr to select just the ones you want)
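A sketch: map the folder, select the needed columns, then cache only that subset (paths and column names are hypothetical):

```r
flights_all <- spark_read_csv(
  sc,
  name = "flights_all",
  path = "data/flights/",
  memory = FALSE            # map the files, do not load them into Spark memory
)

flights_small <- flights_all %>%
  select(carrier, distance) %>%   # keep only the columns that are needed
  compute("flights_small")        # cache just the subset
```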
Spark Streaming
Read: stream_read_*() functions - define one or more sources
Transform: dplyr, SQL, feature transformers, scoring pipelines, distributed R code
Write: stream_write_*() functions - define one or more sinks
Exercise: The Simplest Stream
Continuously copy text files from a source to a sink (destination):
● Create a source path
● Generate a test stream
● Write to the destination path
The stream starts running with stream_write_*() and will monitor the source path, processing data into the destination path as it arrives.
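A sketch of the exercise, using a local connection and relative paths:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

dir.create("source")                     # create the source path
stream_generate_test(path = "source")    # generate a test stream of text files

stream <- stream_read_text(sc, path = "source") %>%
  stream_write_text(path = "destination")   # starting the sink starts the stream
```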
Exercise: The Simplest Stream
Use stream_generate_test() to produce a file every second containing lines of text that follow a given distribution.
Use stream_view() to track the rows per second processed in the source and destination, and their latest values over time.
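Continuing the sketch above; stream_view() opens an interactive monitor of the running stream:

```r
stream_generate_test(path = "source", interval = 1)   # one file per second
stream_view(stream)      # rows per second in and out, updated live

stream_stop(stream)      # stop the stream when finished
```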
Spark Streaming: Shiny
Shiny’s reactive framework is well suited to streaming information:
- Display real-time data from Spark using reactiveSpark()
- Design the Shiny application
- While the application is running, start the test stream in another R process - Local Job Launcher!
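A minimal sketch of a Shiny app that re-renders whenever the stream updates; the source path is carried over from the earlier streaming exercise:

```r
library(sparklyr)
library(shiny)
library(dplyr)

sc <- spark_connect(master = "local")

ui <- fluidPage(tableOutput("stream_data"))

server <- function(input, output, session) {
  # reactiveSpark() wraps a Spark stream as a reactive data source
  stream_data <- stream_read_text(sc, path = "source") %>%
    reactiveSpark()

  output$stream_data <- renderTable(head(stream_data(), 10))
}

shinyApp(ui, server)
```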
Exercise: Build K-Means Streaming Application Adapt this Shiny Gallery example to display real-time data with reactiveSpark() shiny.rstudio.com/gallery
Run Shiny App as a Background Job
- Create a “Run App” script
- Run the script as a Local Job
- Use rstudioapi::viewer() to open a window for the app
- Start the stream!
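A sketch; the script name, app directory, and port are hypothetical:

```r
# run-app.R, launched from the RStudio Jobs pane as a Local Job
shiny::runApp("kmeans-stream-app", port = 8080, launch.browser = FALSE)
```

Back in the interactive session, once the job is running:

```r
rstudioapi::viewer("http://localhost:8080")   # open the app in the Viewer pane
# then start the test stream against the source path
```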