Slide 1

Slide 1 text

Introduction to Spark + R
Women in Statistics and Data Science 2019

Slide 2

Slide 2 text

Hi, I'm Kelly from RStudio Solutions Engineering. I'm a co-organizer of the Washington DC R-Ladies Meetup. I work in a cross-functional group at RStudio, serving our customers, sales team, support team, and product engineering teams. My favorite thing we make is RStudio Connect!

Slide 3

Slide 3 text

Sparklyr Resources
spark.rstudio.com
RStudio Webinars On Demand (resources.rstudio.com/working-with-spark):
1. Introducing an R interface for Apache Spark
2. Extending Spark using sparklyr and R
3. Advanced Features of sparklyr
4. Understanding Spark and sparklyr deployment modes

Slide 4

Slide 4 text

Sparklyr Video Resources: Webinars, Javier's YouTube Channel

Slide 5

Slide 5 text

Coming Soon! The R in Spark > Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
Chapters: Analysis, Modeling, Pipelines, Clusters, Connections, Data, Tuning, Extensions, Distributed R, Streaming, Contributing

Slide 6

Slide 6 text

Plan for Today
- Get access to our infrastructure (RStudio Server Pro)
- Introduction to Spark and R
- Analysis with sparklyr (DBI, dplyr review)
- Modeling (super basic)
- Sparklyr skills:
  - Patterns for reading and writing data
  - Working with streaming data
  - The mechanics of distributing R code

Slide 7

Slide 7 text

Resource: RStudio Webinars

Slide 8

Slide 8 text

Resource: RStudio Webinars

Slide 9

Slide 9 text

Resource: RStudio Webinars

Slide 10

Slide 10 text

What is Spark? "Apache Spark is a unified analytics engine for large-scale data processing." -- spark.apache.org
- Unified: Spark supports many libraries, cluster technologies, and storage systems
- Analytics: the discovery and interpretation of data to communicate information
- Engine: Spark is expected to be efficient and generic
- Large-Scale: "cluster-scale," i.e. a set of connected computers working together

Slide 11

Slide 11 text

Resource: RStudio Webinars

Slide 12

Slide 12 text

Analysis with Sparklyr

Slide 13

Slide 13 text

Exercise: Install Spark Locally and Connect
● Load sparklyr
● Install Spark 2.3
● Check installed versions
● Connect!
● Disconnect!
FAQ: What is a local Spark cluster good for?
● Getting started
● Testing code
● Troubleshooting
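
A minimal sketch of this exercise, assuming a local install of Spark 2.3 (the version used in the workshop environment may differ):

library(sparklyr)

# Install a local copy of Spark 2.3 (one-time step)
spark_install(version = "2.3")

# Check which Spark versions are installed
spark_installed_versions()

# Connect to a local Spark cluster
sc <- spark_connect(master = "local", version = "2.3")

# ...work with sc...

# Disconnect when finished
spark_disconnect(sc)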

Slide 14

Slide 14 text

Exercise: RStudio Connections Pane ● Reconnect to Spark Local through the RStudio Connections Pane ● Copy mtcars to Spark Local ● Preview mtcars in the Connections Pane
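
The same steps can also be done from the console; a rough equivalent of what the Connections Pane does, assuming the connection object is named sc:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3")

# Copy mtcars into Spark; the result is a tbl_spark reference, not local data
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Preview the first rows (the Connections Pane preview shows something similar)
head(mtcars_tbl)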

Slide 15

Slide 15 text

Tour: Spark Tools in the Connections Pane
Spark UI - Opens the Spark web interface; a shortcut to spark_web(sc)
Log - Opens the Spark web logs; a shortcut to spark_log(sc)
SQL - Opens a new SQL query
Help - Opens the reference documentation in a new web browser window
Disconnect - Disconnects from Spark; a shortcut to spark_disconnect(sc)

Slide 16

Slide 16 text

Exercise: SQL Files from Connections Pane ● Create a SQL file from the Connections Pane ● Craft a test query ● Save and preview the result

Slide 17

Slide 17 text

Exercise: Constructing Queries - DBI vs. dplyr
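
A sketch of the two approaches side by side, assuming mtcars was copied to Spark earlier as a table named "mtcars":

library(DBI)
library(dplyr)

# DBI: send a SQL string, get an R data frame back immediately
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

# dplyr: build the same query lazily; Spark only runs it when results are needed
tbl(sc, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))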

Slide 18

Slide 18 text

Basic Analysis: Communicate with R Markdown
R for Data Science workflow: set up an R Markdown document to take notes in and run the exercises.

Slide 19

Slide 19 text

Exercise: dplyr Group By Reference: R for Data Science - Chapter 5
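
A minimal group-by sketch against the Spark copy of mtcars (mtcars_tbl is the reference created earlier):

library(dplyr)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    avg_mpg = mean(mpg, na.rm = TRUE)
  ) %>%
  arrange(cyl)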

Slide 20

Slide 20 text

Exercise: Built-in Functions The percentile() function expects a column name, and returns the exact percentile of a column in a group. What if the operation isn't available through dplyr & sparklyr? ● Look for a built-in function available in Spark ● dplyr passes functions it doesn't recognize to the query engine as-is
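
For example, percentile() is a Spark built-in rather than a dplyr verb, yet it can be used directly inside summarise() (a sketch using the mtcars_tbl reference from earlier):

library(dplyr)

# percentile() is not an R function; dplyr passes it to Spark SQL as-is
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mpg_median = percentile(mpg, 0.5))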

Slide 21

Slide 21 text

Exercise: Correlation - the corrr package
A common EDA task: calculate and visualize correlations to find out what kind of statistical relationship exists between paired sets of variables.
The Spark function to calculate correlations across an entire dataset is sparklyr::ml_corr().
The corrr R package contains a backend for Spark, so when a Spark object is used in corrr, the computation happens in Spark; its correlate() function runs sparklyr::ml_corr().
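
A sketch of both routes, using the mtcars_tbl reference from earlier:

library(sparklyr)
library(corrr)

# Correlation matrix computed entirely in Spark
ml_corr(mtcars_tbl)

# Same computation through corrr's Spark backend, then plotted locally
mtcars_tbl %>%
  correlate(use = "pairwise.complete.obs", method = "pearson") %>%
  shave() %>%
  rplot()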

Slide 22

Slide 22 text

Visualizations: Push Compute, Collect Results 1. Ensure the transformation operations happen in Spark 2. Bring the results back into R after the transformations have been performed
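
For example (a sketch with mtcars_tbl): aggregate in Spark, collect() the small result, then plot locally:

library(dplyr)
library(ggplot2)

per_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()   # only the aggregated rows travel back to R

ggplot(per_cyl, aes(as.factor(cyl), avg_mpg)) +
  geom_col()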

Slide 23

Slide 23 text

Visualizations: Using dbplot
dbplot - helper functions for plotting with remote data
The dbplot_histogram() function makes Spark calculate the bins and the count per bin and outputs a ggplot object
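
For example (a sketch with mtcars_tbl; the binwidth is arbitrary):

library(dbplot)
library(ggplot2)

# Spark computes the bins and counts; only that summary is collected,
# and the result is a ggplot object we can style as usual
mtcars_tbl %>%
  dbplot_histogram(mpg, binwidth = 3) +
  labs(title = "MPG distribution, binned in Spark")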

Slide 24

Slide 24 text

Problematic Visualizations: Scatter Plots
No amount of "pushing the computation" to Spark will help here, because every individual point has to be plotted.
The alternative is to find a plot type that represents the x/y relationship and concentration in a way that is easy to perceive and to "physically" plot.
Use dbplot_raster() to create a scatter-like plot in Spark while collecting only a small subset of the remote dataset.
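
For example (a sketch with mtcars_tbl; the resolution is arbitrary):

library(dbplot)

# Spark aggregates x/y pairs into a grid of cells, so only the cell
# counts are collected instead of every individual point
mtcars_tbl %>%
  dbplot_raster(mpg, wt, resolution = 16)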

Slide 25

Slide 25 text

Challenge: Explore dbplot compute functions
Use dbplot to retrieve the computed aggregates and create alternative visualizations.
To retrieve the aggregates but not the plots, use:
● db_compute_bins()
● db_compute_count()
● db_compute_raster()
● db_compute_boxplot()
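
For example, each db_compute_*() call returns a small local data frame of aggregates that can feed a hand-built ggplot (a sketch with mtcars_tbl):

library(dbplot)

mtcars_tbl %>% db_compute_bins(mpg, bins = 10)
mtcars_tbl %>% db_compute_count(cyl)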

Slide 26

Slide 26 text

More Resources Visit spark.rstudio.com/dplyr

Slide 27

Slide 27 text

Modeling

Slide 28

Slide 28 text

Exercise: K-Means Modeling on the Iris Dataset
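
A sketch of the exercise in the style of the spark.rstudio.com K-means example (argument names have varied across sparklyr versions, so treat this as one possible form):

library(sparklyr)
library(dplyr)

# Copy iris to Spark; "." in column names becomes "_"
iris_tbl <- sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)

# Fit k-means with 3 clusters on the petal measurements
kmeans_model <- iris_tbl %>%
  ml_kmeans(k = 3, features = c("Petal_Length", "Petal_Width"))

# Score the table and bring the cluster assignments back to R
predicted <- ml_predict(kmeans_model, iris_tbl) %>% collect()
table(predicted$Species, predicted$prediction)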

Slide 29

Slide 29 text

Exercise: Plot Predictions - Cluster Membership Don’t type all this out! - Open plot-kmeans.R

Slide 30

Slide 30 text

Spark Caching Example
Big data can stress your Spark session if heavy transformations run right before modeling.
Before fitting a model, the compute() function takes the end of a dplyr pipeline of transformations and saves the result to Spark memory.
Understand Spark caching: spark.rstudio.com/guides/caching
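
For example (a sketch with mtcars_tbl; the transformations are made up):

library(dplyr)

# Run the dplyr pipeline in Spark and cache the result in Spark memory
# under its own name, so fitting a model does not re-run the pipeline
prepared_tbl <- mtcars_tbl %>%
  filter(!is.na(mpg)) %>%
  mutate(wt_tons = wt / 2) %>%
  compute("prepared_mtcars")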

Slide 31

Slide 31 text

Data: Read/Write

Slide 32

Slide 32 text

Reading Data: Paths
Spark treats a folder as a dataset: it assumes every file in the directory is part of the same dataset.
Loading multiple files into a single data object:
- In R, load each file individually into your R session
- In Spark, pass the path of the folder containing all the files
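
A sketch of the folder-as-a-dataset pattern, assuming a directory of CSV files that share the same layout (the path is hypothetical):

library(sparklyr)

# Point Spark at the folder, not at an individual file;
# every file inside is read as part of one table
flights_tbl <- spark_read_csv(sc, name = "flights", path = "data/flights/")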

Slide 33

Slide 33 text

Reading Data: Schema
Spark can detect column names and types (the schema), but detection costs time. For medium-to-large datasets, or files that are read many times, that cost accumulates.
Provide a `columns` argument to describe the schema up front.
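
For example (a sketch; the column names and types are made up for illustration):

library(sparklyr)

# Describing the schema up front lets Spark skip the inference pass
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights/",
  infer_schema = FALSE,
  columns = list(
    flight_id = "integer",
    carrier   = "character",
    dep_delay = "double",
    dep_time  = "timestamp"
  )
)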

Slide 34

Slide 34 text

Reading Data: CSV Parsing Tools
Spark modes for parsing malformed CSV files (use in sparklyr by passing the mode to the options argument):
- Permissive: inserts NULL values for missing tokens
- Drop Malformed: drops lines that are malformed
- Fail Fast: aborts if it encounters any malformed line
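
For example (a sketch; the path is hypothetical):

library(sparklyr)

# Drop malformed lines instead of failing the whole read
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights/",
  options = list(mode = "DROPMALFORMED")
)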

Slide 35

Slide 35 text

Reading Data: Working with JSON How to extract data from nested files: - Option 1: Use a combination of get_json_object() and to_json() - Option 2: Use the sparklyr.nested package and sdf_unnest()
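
Two rough sketches, assuming a nested JSON file with a top-level "name" field and a nested "address" object (the file, field, and column names are made up):

library(sparklyr)
library(dplyr)

# Option 1: read the raw JSON as text and extract fields with Hive's
# get_json_object(), which dplyr passes through to Spark SQL
people_raw <- spark_read_text(sc, name = "people_raw", path = "data/people.json")
people_raw %>%
  mutate(name = get_json_object(line, "$.name"))

# Option 2: read as JSON and flatten nested columns with sparklyr.nested
library(sparklyr.nested)
people_tbl <- spark_read_json(sc, name = "people", path = "data/people.json")
people_flat <- sdf_unnest(people_tbl, "address")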

Slide 36

Slide 36 text

File Formats - Optimized for Performance
CSV, JSON: useful, common standards, but not optimized for performance
Binary file formats reduce storage space and improve performance:
- Apache Parquet
- Apache ORC
- Apache Avro
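
For example, writing the mtcars copy out as Parquet and reading it back (paths are illustrative):

library(sparklyr)

# Write a Spark table to Parquet, a compressed columnar binary format
spark_write_parquet(mtcars_tbl, path = "data/mtcars_parquet")

# Read it back; the schema travels with the files, so no inference is needed
mtcars_pq <- spark_read_parquet(sc, name = "mtcars_pq", path = "data/mtcars_parquet")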

Slide 37

Slide 37 text

Exercise: Benchmark Writing from Spark Don’t type all this out - open filewrite-benchmark.R

Slide 38

Slide 38 text

Connecting Spark with Alternative File Systems
Spark defaults to the file system it is currently running on. The file system protocol can be changed when reading or writing via the sparklyr path argument.
Other file system protocols:
● Databricks: dbfs://
● Amazon S3: s3a://
● Microsoft Azure: wasb://
● Google Storage: gs://

Slide 39

Slide 39 text

Example: Access the AWS s3a Protocol Accessing the “s3a” protocol requires adding a package to the sparklyr.connect.packages configuration setting:
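
A sketch of what that configuration can look like; the hadoop-aws version and the bucket name are assumptions and must match your Spark/Hadoop build:

library(sparklyr)

config <- spark_config()
# Pull in the Hadoop AWS connector so Spark understands s3a:// paths
config$sparklyr.connect.packages <- "org.apache.hadoop:hadoop-aws:2.7.7"

sc <- spark_connect(master = "local", version = "2.3", config = config)

# Then read directly from S3 (bucket name is hypothetical)
flights_s3 <- spark_read_csv(sc, name = "flights", path = "s3a://my-bucket/flights/")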

Slide 40

Slide 40 text

Reading Data Pro Tip: memory = FALSE
"Map" files without copying data into Spark's distributed memory.
Use case: big data where not all columns are needed (use dplyr to select).
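
For example (a sketch; the path and column names are hypothetical): map the files, select only what is needed, then cache that subset:

library(sparklyr)
library(dplyr)

# memory = FALSE maps the files without loading them into Spark memory
flights_all <- spark_read_csv(sc, name = "flights_all", path = "data/flights/",
                              memory = FALSE)

# Keep only the needed columns, then cache just that subset in Spark
flights_small <- flights_all %>%
  select(carrier, dep_delay) %>%
  compute("flights_small")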

Slide 41

Slide 41 text

Streaming Data

Slide 42

Slide 42 text

Spark Streaming
Read: stream_read_*() functions define one or more sources
Transform: dplyr, SQL, feature transformers, scoring pipelines, distributed R code
Write: stream_write_*() functions define one or more sinks
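
A minimal sketch of the read-transform-write pattern (the folder and column names are made up):

library(sparklyr)
library(dplyr)

stream <- stream_read_csv(sc, path = "source/") %>%   # source
  filter(dep_delay > 15) %>%                          # transform with dplyr
  stream_write_csv(path = "destination/")             # sink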

Slide 43

Slide 43 text

Exercise: The Simplest Stream
Continuously copy text files from source to sink (destination):
● Create a source path
● Generate a test stream
● Write to a destination path
The stream starts running with stream_write_*() and will monitor the source path, processing data into the destination path as it arrives.
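
A sketch close to the sparklyr streaming examples: start the copy stream, then generate test files for it to pick up:

library(sparklyr)

dir.create("source")

# Start the stream: copy whatever text lands in source/ into destination/
stream <- stream_read_text(sc, path = "source/") %>%
  stream_write_text(path = "destination/")

# Produce test files in source/ for the running stream to process
stream_generate_test(path = "source", interval = 1)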

Slide 44

Slide 44 text

Exercise: The Simplest Stream
Use stream_generate_test() to produce a file every second containing lines of text that follow a given distribution.
Use stream_view() to track rows per second processed in the source and destination, and their latest values over time.
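
For example (assuming the stream object from the previous exercise):

# Open the monitoring gadget showing rows per second in and out of the stream
stream_view(stream)

# Stop the stream when done
stream_stop(stream)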

Slide 45

Slide 45 text

Spark Streaming: Shiny
Shiny's reactive framework is well suited to streaming information: display real-time data from Spark using reactiveSpark().
Design the Shiny application; then, while the application is running, start the test stream in another R process (Local Job Launcher!).

Slide 46

Slide 46 text

Exercise: Build K-Means Streaming Application Adapt this Shiny Gallery example to display real-time data with reactiveSpark() shiny.rstudio.com/gallery

Slide 47

Slide 47 text

Run Shiny App as a Background Job - Create a “Run App” script - Run script as a Local Job - Use rstudioapi::viewer() to open a window for the app - Start the stream!

Slide 48

Slide 48 text

More Spark Streaming Resources Visit spark.rstudio.com/guides/streaming

Slide 49

Slide 49 text

Reference Material

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Resource: RStudio Webinars

Slide 53

Slide 53 text

Deployment Mode Resources
- Standalone in AWS: https://spark.rstudio.com/examples/stand-alone-aws/
- YARN (EMR in AWS): https://spark.rstudio.com/examples/yarn-cluster-emr/
- Cloudera: https://spark.rstudio.com/examples/cloudera-aws/
- Databricks: https://docs.databricks.com/spark/latest/sparkr/rstudio.html

Slide 54

Slide 54 text

RStudio Databricks Integration
RStudio on Databricks: https://docs.databricks.com/spark/latest/sparkr/rstudio.html
library(sparklyr)
sparkR.session()  # from the SparkR package preinstalled on Databricks
sc <- spark_connect(method = "databricks")

Slide 55

Slide 55 text

Resource: RStudio Webinars

Slide 56

Slide 56 text

Resource: RStudio Webinars

Slide 57

Slide 57 text

Extending Spark with Sparklyr

Slide 58

Slide 58 text

Resource: RStudio Webinars

Slide 59

Slide 59 text

Resource: RStudio Webinars