Spark + R WSDS

kellobri
October 03, 2019

A Workshop for the Women in Statistics and Data Science 2019 Conference
Transcript

  1. Introduction to Spark + R Women in Statistics and Data

    Science 2019
  2. Hi, I’m Kelly from RStudio Solutions Engineering. I’m a co-organizer of the Washington DC R-Ladies Meetup. I work in a cross-functional group at RStudio, serving our customers, sales team, support team, and product engineering teams. My favorite thing we make is RStudio Connect!
  3. Sparklyr Resources: spark.rstudio.com
    RStudio Webinars On Demand:
    1. Introducing an R interface for Apache Spark
    2. Extending Spark using sparklyr and R
    3. Advanced Features of sparklyr
    4. Understanding Spark and sparklyr deployment modes
    resources.rstudio.com/working-with-spark
  4. Sparklyr Video Resources Webinars Javier’s Youtube Channel

  5. Coming Soon! The R in Spark > Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
    Chapters: Analysis, Modeling, Pipelines, Clusters, Connections, Data, Tuning, Extensions, Distributed R, Streaming, Contributing
  6. Plan for Today
    - Get access to our infrastructure (RStudio Server Pro)
    - Introduction to Spark and R
    - Analysis with sparklyr (DBI, dplyr review)
    - Modeling (super basic)
    - Sparklyr skills: patterns for reading and writing data, working with streaming data, the mechanics of distributing R code
  7. Resource: RStudio Webinars

  8. Resource: RStudio Webinars

  9. Resource: RStudio Webinars

  10. What is Spark? “Apache Spark is a unified analytics engine for large-scale data processing.” -- spark.apache.org
    - Unified: Spark supports many libraries, cluster technologies, and storage systems
    - Analytics: the discovery and interpretation of data to communicate information
    - Engine: Spark is expected to be efficient and generic
    - Large-Scale: “cluster-scale” - a set of connected computers working together
  11. Resource: RStudio Webinars

  12. Analysis with Sparklyr

  13. Exercise: Install Spark Locally and Connect
    - Load sparklyr
    - Install Spark 2.3
    - Check installed versions
    - Connect!
    - Disconnect!
    FAQ: What is a local Spark cluster good for? Getting started, testing code, and troubleshooting.
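The steps above might look like this in R (a minimal sketch; adjust the Spark version to whatever you install):

```r
library(sparklyr)

# Download and install a local copy of Spark 2.3
spark_install(version = "2.3")

# Check which Spark versions are installed locally
spark_installed_versions()

# Connect to a local Spark cluster...
sc <- spark_connect(master = "local", version = "2.3")

# ...and disconnect when finished
spark_disconnect(sc)
```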
  14. Exercise: RStudio Connections Pane
    - Reconnect to Spark Local through the RStudio Connections Pane
    - Copy mtcars to Spark Local
    - Preview mtcars in the Connections Pane
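Copying a local data frame into Spark can be sketched like this (connection setup assumed as in the previous exercise):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# copy_to() ships the local mtcars data frame into Spark and
# returns a dplyr reference (tbl_spark) to the remote table
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Printing the reference shows a preview of the remote data
mtcars_tbl
```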
  15. Tour: Spark Tools in the Connections Pane
    - Spark UI: opens the Spark web interface; a shortcut to spark_ui(sc)
    - Log: opens the Spark web logs; a shortcut to spark_log(sc)
    - SQL: opens a new SQL query
    - Help: opens the reference documentation in a new web browser window
    - Disconnect: disconnects from Spark; a shortcut to spark_disconnect(sc)
  16. Exercise: SQL Files from Connections Pane • Create a SQL

    file from the Connections Pane • Craft a test query • Save and preview the result
  17. Exercise: Constructing Queries - DBI vs. dplyr
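One way to sketch the contrast (assuming mtcars has already been copied to Spark):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# DBI: you write the SQL yourself, and the full result is
# returned to R immediately as a data frame
DBI::dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

# dplyr: the same query built from verbs; SQL is generated and
# run inside Spark, and results stay remote until collected
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))
```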

  18. Basic Analysis: Communicate with R Markdown
    R for Data Science workflow: set up an R Markdown document to take notes and run exercises in.
  19. Exercise: dplyr Group By
    Reference: R for Data Science, Chapter 5
  20. Exercise: Built-in Functions
    The percentile() function expects a column name and returns the exact percentile of a column in a group.
    What if the operation isn't available through dplyr & sparklyr?
    - Look for a built-in function available in Spark
    - dplyr passes functions it doesn't recognize to the query engine as-is
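The pass-through behavior can be sketched like this - percentile() is a Spark SQL built-in, not an R function:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr does not recognize percentile(), so the call is passed
# through verbatim to Spark SQL, which evaluates it per group
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mpg_p50 = percentile(mpg, 0.5))
```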
  21. Exercise: Correlation - the corrr package
    A common EDA task is to calculate and visualize correlations - to find out what kind of statistical relationship exists between paired sets of variables.
    Spark function to calculate correlations across an entire dataset: sparklyr::ml_corr()
    The corrr R package contains a backend for Spark, so when a Spark object is used with corrr, the computation happens in Spark. The correlate() function runs sparklyr::ml_corr().
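Both routes can be sketched as follows (a minimal example on mtcars):

```r
library(sparklyr)
library(dplyr)
library(corrr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Correlation matrix computed entirely inside Spark
ml_corr(mtcars_tbl)

# corrr recognizes the Spark object and delegates to ml_corr(),
# returning a tidy correlation data frame
correlate(mtcars_tbl)
```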
  22. Visualizations: Push Compute, Collect Results
    1. Ensure the transformation operations happen in Spark
    2. Bring the results back into R after the transformations have been performed
  23. Visualizations: Using dbplot
    dbplot provides helper functions for plotting with remote data. The dbplot_histogram() function makes Spark calculate the bins and the count per bin, and outputs a ggplot object.
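A minimal dbplot sketch - Spark computes the bins; only the aggregates come back to R:

```r
library(sparklyr)
library(dplyr)
library(dbplot)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Spark calculates the bins and the count per bin; the collected
# aggregates are drawn as a ggplot histogram in R
mtcars_tbl %>%
  dbplot_histogram(mpg, binwidth = 5)
```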
  24. Problematic Visualizations: Scatter Plots
    No amount of “pushing the computation” to Spark will help with this problem, because the data must be plotted as individual dots.
    The alternative is to find a plot type that represents the x/y relationship and concentration in a way that is easy to perceive and to “physically” plot.
    Use dbplot_raster() to create a scatter-like plot in Spark, while only collecting a small subset of the remote dataset.
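A raster sketch under the same setup - Spark aggregates the x/y pairs into grid cells, so only cell counts are collected:

```r
library(sparklyr)
library(dplyr)
library(dbplot)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# A scatter-like plot without collecting individual points:
# Spark bins weight vs. mpg into a grid and counts each cell
mtcars_tbl %>%
  dbplot_raster(wt, mpg, resolution = 50)
```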
  25. Challenge: Explore dbplot compute functions
    Use dbplot to retrieve the raw data and create alternative visualizations. To retrieve the aggregates but not plots, use:
    - db_compute_bins()
    - db_compute_count()
    - db_compute_raster()
    - db_compute_boxplot()
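For example, the binned aggregates behind a histogram can be retrieved without plotting (same mtcars setup as before):

```r
library(sparklyr)
library(dplyr)
library(dbplot)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Returns the bins and counts as a data frame, ready for a
# custom ggplot2 (or other) visualization
mtcars_tbl %>%
  db_compute_bins(mpg)
```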
  26. More Resources Visit spark.rstudio.com/dplyr

  27. Modeling

  28. Exercise: K-Means Modeling on the Iris Dataset

  29. Exercise: Plot Predictions - Cluster Membership Don’t type all this

    out! - Open plot-kmeans.R
  30. Spark Caching Example
    Big data can stress your Spark session if transformations occur before modeling. Before fitting a model, the compute() function takes the end of a dplyr piped set of transformations and saves the results to Spark memory.
    Understand Spark caching: spark.rstudio.com/guides/caching
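A caching sketch (the transformation itself is illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# compute() materializes the piped transformations into Spark
# memory, so the model fit below does not re-run the pipeline
prepared_tbl <- mtcars_tbl %>%
  filter(!is.na(mpg)) %>%
  mutate(tons = wt / 2) %>%
  compute("mtcars_prepared")

model <- ml_linear_regression(prepared_tbl, mpg ~ tons)
```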
  31. Data: Read/Write

  32. Reading Data: Paths
    Spark assumes that every file in a directory is part of the same dataset. To load multiple files into a single data object:
    - In R: load each file individually into your R session
    - In Spark: treat the folder as a dataset, and pass the path containing all the files
  33. Reading Data: Schema
    Spark detects column names and types (the schema), but at a cost (time). For medium to large datasets, or files that are read many times, the cost accumulates. Provide a `columns` argument to describe the schema.
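Supplying the schema up front might look like this (the path, column names, and types are hypothetical):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# With infer_schema = FALSE, Spark skips the scan it would
# otherwise run to detect column names and types
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights/",   # a folder: every file inside is one dataset
  columns = c(
    flight_id = "integer",
    carrier   = "character",
    delay     = "double"
  ),
  infer_schema = FALSE
)
```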
  34. Reading Data: CSV Parsing Tools
    Spark modes for parsing malformed CSV files (use in sparklyr by passing to the options argument):
    - Permissive: inserts NULL values for missing tokens
    - Drop Malformed: drops lines that are malformed
    - Fail Fast: aborts if it encounters any malformed line
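Passing a parse mode through the options argument might look like this (the file path is hypothetical):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Drop malformed lines instead of failing the whole read;
# use "PERMISSIVE" or "FAILFAST" for the other behaviors
messy_tbl <- spark_read_csv(
  sc,
  name = "messy",
  path = "data/messy.csv",
  options = list(mode = "DROPMALFORMED")
)
```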
  35. Reading Data: Working with JSON
    How to extract data from nested files:
    - Option 1: use a combination of get_json_object() and to_json()
    - Option 2: use the sparklyr.nested package and sdf_unnest()
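A hedged sketch of both options (the file, column, and nested field names are hypothetical):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
people_tbl <- spark_read_json(sc, name = "people", path = "data/people.json")

# Option 1: serialize the nested struct back to a JSON string,
# then extract one field with Spark's built-in get_json_object()
people_tbl %>%
  mutate(city = get_json_object(to_json(address), "$.city"))

# Option 2: flatten the nested column with sparklyr.nested
library(sparklyr.nested)
people_flat <- sdf_unnest(people_tbl, "address")
```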
  36. File Formats - Optimized for Performance
    CSV and JSON are useful, common standards, but they are not optimized for performance. Binary file formats reduce storage space and improve performance:
    - Apache Parquet
    - Apache ORC
    - Apache Avro
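For example, a dataset can be written once to Parquet and read back far more cheaply than re-parsing CSV (paths are hypothetical):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Write the Spark table to the columnar Parquet format...
spark_write_parquet(mtcars_tbl, "data/mtcars_parquet")

# ...and read it back with the schema already embedded in the files
mtcars_pq <- spark_read_parquet(sc, name = "mtcars_pq",
                                path = "data/mtcars_parquet")
```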
  37. Exercise: Benchmark Writing from Spark Don’t type all this out

    - open filewrite-benchmark.R
  38. Connecting Spark with Alternative File Systems
    Spark defaults to the file system on which it is currently running. The file system protocol can be changed when reading or writing via the sparklyr path argument. Other file system protocols:
    - Databricks: dbfs://
    - Amazon S3: s3a://
    - Microsoft Azure: wasb://
    - Google Storage: gs://
  39. Example: Access the AWS s3a Protocol Accessing the “s3a” protocol

    requires adding a package to the sparklyr.connect.packages configuration setting:
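A connection sketch under that setting (the package version, bucket, and prefix are illustrative; match the version to your Spark/Hadoop build):

```r
library(sparklyr)

# Add the Hadoop AWS package via the configuration setting
config <- spark_config()
config$sparklyr.connect.packages <- "org.apache.hadoop:hadoop-aws:2.7.7"

sc <- spark_connect(master = "local", config = config)

# Bucket and prefix are hypothetical
logs_tbl <- spark_read_csv(sc, name = "logs", path = "s3a://my-bucket/logs/")
```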
  40. Reading Data Pro-Tip: memory = FALSE
    “Map” files without copying data into Spark’s distributed memory.
    Use case: big data where not all columns are needed (use dplyr to select).
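A sketch of mapping a large file and narrowing it before caching (the path and column names are hypothetical):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# memory = FALSE maps the file without loading it into
# Spark's distributed memory
big_tbl <- spark_read_csv(
  sc,
  name = "big",
  path = "data/big.csv",
  memory = FALSE
)

# Select only the needed columns, then cache just that subset
slim_tbl <- big_tbl %>%
  select(id, delay) %>%
  compute("slim")
```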
  41. Streaming Data

  42. Spark Streaming
    Read: stream_read_*() functions define one or more sources
    Transform: dplyr, SQL, feature transformers, scoring pipelines, distributed R code
    Write: stream_write_*() functions define one or more sinks
  43. Exercise: The Simplest Stream
    Continuously copy text files from source to sink (destination):
    - Create a source path
    - Generate a test stream
    - Write to destination path
    The stream starts running with stream_write_*() and will monitor the source path, processing data into the destination path as it arrives.
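The whole exercise can be sketched like this (paths are illustrative; run the test generator from a second R process):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
dir.create("source", showWarnings = FALSE)

# Start the stream: monitor "source" and continuously copy
# arriving text files into "destination"
stream <- stream_read_text(sc, "source") %>%
  stream_write_text("destination")

# In another R process: write a test file into "source" every second
stream_generate_test(interval = 1, path = "source")

# Monitor rows per second at the source and destination
stream_view(stream)

# Stop the stream when finished
stream_stop(stream)
```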
  44. Exercise: The Simplest Stream
    Use stream_generate_test() to produce a file every second containing lines of text that follow a given distribution. Use stream_view() to track rows per second processed in the source and destination, and their latest values over time.
  45. Spark Streaming: Shiny
    Shiny’s reactive framework is well suited to support streaming information: display real-time data from Spark using reactiveSpark().
    - Design the Shiny application
    - While the application is running, start the test stream in another R process - Local Job Launcher!
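A minimal Shiny sketch using reactiveSpark() (the stream source path and the `line` column follow the text-stream example; treat the details as assumptions):

```r
library(shiny)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

ui <- fluidPage(
  plotOutput("hist")
)

server <- function(input, output, session) {
  # reactiveSpark() wraps a Spark stream as a reactive data frame
  # that invalidates as new data arrives
  stream_data <- stream_read_text(sc, "source") %>%
    reactiveSpark()

  output$hist <- renderPlot({
    hist(nchar(stream_data()$line), main = "Line lengths")
  })
}

shinyApp(ui, server)
```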
  46. Exercise: Build K-Means Streaming Application Adapt this Shiny Gallery example

    to display real-time data with reactiveSpark() shiny.rstudio.com/gallery
  47. Run Shiny App as a Background Job
    - Create a “Run App” script
    - Run the script as a Local Job
    - Use rstudioapi::viewer() to open a window for the app
    - Start the stream!
  48. More Spark Streaming Resources Visit spark.rstudio.com/guides/streaming

  49. Reference Material

  52. Resource: RStudio Webinars

  53. Deployment Mode Resources
    - Standalone in AWS: https://spark.rstudio.com/examples/stand-alone-aws/
    - YARN EMR in AWS: https://spark.rstudio.com/examples/yarn-cluster-emr/
    - Cloudera: https://spark.rstudio.com/examples/cloudera-aws/
    - Databricks: https://docs.databricks.com/spark/latest/sparkr/rstudio.html
  54. RStudio Databricks Integration
    RStudio on Databricks: https://docs.databricks.com/spark/latest/sparkr/rstudio.html
    library(sparklyr)
    sparkR.session()
    sc <- spark_connect(method = "databricks")
  55. Resource: RStudio Webinars

  56. Resource: RStudio Webinars

  57. Extending Spark with Sparklyr

  58. Resource: RStudio Webinars

  59. Resource: RStudio Webinars