
Spark + R WSDS

kellobri
October 03, 2019

A Workshop for the Women in Statistics and Data Science 2019 Conference





  1. Introduction to Spark + R: Women in Statistics and Data Science 2019
  2. Hi, I’m Kelly from RStudio Solutions Engineering
    - I’m a co-organizer of the Washington DC R-Ladies Meetup.
    - I work in a cross-functional group at RStudio, serving our customers, sales team, support team and product engineering teams.
    - My favorite thing we make is RStudio Connect!
  3. Sparklyr Resources
    spark.rstudio.com
    RStudio Webinars On Demand:
    1. Introducing an R interface for Apache Spark
    2. Extending Spark using sparklyr and R
    3. Advanced Features of sparklyr
    4. Understanding Spark and sparklyr deployment modes
    resources.rstudio.com/working-with-spark
  4. Sparklyr Video Resources: Webinars, Javier’s YouTube Channel

  5. Coming Soon! The R in Spark > Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
    - Analysis
    - Modeling
    - Pipelines
    - Clusters
    - Connections
    - Data
    - Tuning
    - Extensions
    - Distributed R
    - Streaming
    - Contributing
  6. Plan for Today
    - Get access to our infrastructure (RStudio Server Pro)
    - Introduction to Spark and R
    - Analysis with Sparklyr (DBI, dplyr review)
    - Modeling (super basic)
    - Sparklyr Skills:
      - Patterns for reading and writing data
      - Working with streaming data
      - The mechanics of distributing R code
  7. Resource: RStudio Webinars

  8. Resource: RStudio Webinars

  9. Resource: RStudio Webinars

  10. What is Spark?
    “Apache Spark is a unified analytics engine for large-scale data processing.” -- spark.apache.org
    - Unified: Spark supports many libraries, cluster technologies and storage systems
    - Analytics: The discovery and interpretation of data to communicate information
    - Engine: Spark is expected to be efficient and generic
    - Large-Scale: “cluster-scale”, a set of connected computers working together
  11. Resource: RStudio Webinars

  12. Analysis with Sparklyr

  13. Exercise: Install Spark Locally and Connect
    • Load sparklyr
    • Install Spark 2.3
    • Check installed versions
    • Connect!
    • Disconnect!
    FAQ: What is a local Spark cluster good for?
    • Getting started
    • Testing code
    • Troubleshooting
    What should this be?
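
The exercise steps above might look roughly like the following. This is a minimal sketch, assuming sparklyr is installed from CRAN and that Java is available for a local Spark; the first `spark_install()` call downloads Spark.

```r
library(sparklyr)

# Install a local copy of Spark 2.3 (downloads on first run).
spark_install(version = "2.3")

# Check which Spark versions are installed locally.
spark_installed_versions()

# Connect to a local, single-machine Spark cluster.
sc <- spark_connect(master = "local", version = "2.3")

# Confirm the connection is live, then disconnect.
connected <- connection_is_open(sc)
spark_disconnect(sc)
```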
  14. Exercise: RStudio Connections Pane
    • Reconnect to Spark Local through the RStudio Connections Pane
    • Copy mtcars to Spark Local
    • Preview mtcars in the Connections Pane
  15. Tour: Spark Tools in the Connections Pane
    - Spark UI: Opens the Spark web interface; a shortcut to spark_web(sc)
    - Log: Opens the Spark web logs; a shortcut to spark_log(sc)
    - SQL: Opens a new SQL query
    - Help: Opens the reference documentation in a new web browser window
    - Disconnect: Disconnects from Spark; a shortcut to spark_disconnect(sc)
  16. Exercise: SQL Files from Connections Pane
    • Create a SQL file from the Connections Pane
    • Craft a test query
    • Save and preview the result
  17. Exercise: Constructing Queries - DBI vs. dplyr
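
One way to compare the two approaches is to run the same row count through DBI and through dplyr. This is a sketch, assuming a local Spark connection and the `mtcars` table copied in the earlier exercise; the column alias `n` is chosen here for illustration.

```r
library(sparklyr)
library(dplyr)
library(DBI)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# DBI: send a SQL string to Spark and get an R data frame back.
n_sql <- dbGetQuery(sc, "SELECT count(*) AS n FROM mtcars")

# dplyr: write R verbs; sparklyr translates them to Spark SQL.
n_dplyr <- cars %>% count() %>% collect()

# Inspect the SQL that dplyr generated for the same question.
cars %>% count() %>% show_query()

spark_disconnect(sc)
```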

  18. Basic Analysis: Communicate with R Markdown
    R for Data Science workflow
    - Set up an R Markdown document to take notes and run exercises in
  19. Exercise: dplyr Group By
    Reference: R for Data Science, Chapter 5
  20. Exercise: Built-in Functions
    The percentile() function expects a column name, and returns the exact percentile of a column in a group.
    What if the operation isn't available through dplyr & sparklyr?
    • Look for a built-in function available in Spark
    • dplyr passes functions it doesn't recognize to the query engine as-is
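
As an illustration of the pass-through behavior, `percentile()` below is not an R function; dplyr hands it to Spark SQL unchanged. This is a sketch assuming a local connection and `mtcars`; the 0.25 percentage and the output column name are arbitrary choices.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# percentile() is a Spark SQL built-in; dplyr does not recognize it,
# so it is passed to the query engine as-is.
mpg_quartiles <- cars %>%
  group_by(cyl) %>%
  summarise(mpg_p25 = percentile(mpg, 0.25)) %>%
  collect()

spark_disconnect(sc)
```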
  21. Exercise: Correlation - the corrr package
    Common EDA task: calculate and visualize correlations to find out what kind of statistical relationship exists between paired sets of variables.
    Spark function to calculate correlations across an entire dataset: sparklyr::ml_corr()
    The corrr R package contains a backend for Spark, so when a Spark object is used in corrr, the computation happens in Spark. The correlate() function runs sparklyr::ml_corr()
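
Both routes can be sketched side by side. This assumes the corrr package is installed alongside sparklyr and that `mtcars` stands in for the workshop data; the `use` and `method` arguments shown are ordinary corrr options, not something the slide specifies.

```r
library(sparklyr)
library(dplyr)
library(corrr)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Direct call: Spark computes the full correlation matrix.
cor_direct <- ml_corr(cars)

# Via corrr: same computation in Spark, returned as a corrr object
# that works with corrr helpers such as shave() and rplot().
cor_corrr <- correlate(cars, use = "pairwise.complete.obs", method = "pearson")

spark_disconnect(sc)
```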
  22. Visualizations: Push Compute, Collect Results
    1. Ensure the transformation operations happen in Spark
    2. Bring the results back into R after the transformations have been performed
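
The two steps can be sketched as follows, assuming a local connection and `mtcars`; the choice of summing `mpg` by `cyl` is only an illustration.

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Step 1: aggregate inside Spark.
# Step 2: collect() only the small summary into R.
car_group <- cars %>%
  group_by(cyl) %>%
  summarise(mpg = sum(mpg, na.rm = TRUE)) %>%
  collect()

# Plot the collected summary locally with ggplot2.
p <- ggplot(car_group, aes(x = as.factor(cyl), y = mpg)) +
  geom_col()

spark_disconnect(sc)
```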
  23. Visualizations: Using dbplot
    dbplot provides helper functions for plotting with remote data.
    The dbplot_histogram() function makes Spark calculate the bins and the count per bin, and outputs a ggplot object.
  24. Problematic Visualizations: Scatter Plots
    No amount of “pushing the computation” to Spark will help with this problem, because the data must be plotted as individual dots.
    The alternative is to find a plot type that represents the x/y relationship and concentration in a way that is easy to perceive and to “physically” plot.
    Use dbplot_raster() to create a scatter-like plot in Spark, while only collecting a small subset of the remote dataset.
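
A raster plot of two `mtcars` columns might be sketched like this, assuming the dbplot package is installed; the `resolution` value is an arbitrary choice for a small dataset.

```r
library(sparklyr)
library(dplyr)
library(dbplot)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Spark bins mpg and wt into a 16 x 16 grid and counts rows per cell;
# only the per-cell aggregates are collected for plotting.
p <- dbplot_raster(cars, mpg, wt, resolution = 16)

spark_disconnect(sc)
```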
  25. Challenge: Explore dbplot compute functions
    Use dbplot to retrieve the raw data and create alternative visualizations.
    To retrieve the aggregates but not plots, use:
    • db_compute_bins()
    • db_compute_count()
    • db_compute_raster()
    • db_compute_boxplot()
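
As one possible starting point for the challenge, the sketch below retrieves histogram bins without plotting them. It assumes dbplot is installed and uses `mtcars`; the bin width is arbitrary.

```r
library(sparklyr)
library(dplyr)
library(dbplot)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Spark computes the bins; the result is a small local data frame
# that can feed any ggplot2 geometry.
mpg_bins <- cars %>%
  db_compute_bins(mpg, binwidth = 3)

spark_disconnect(sc)
```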
  26. More Resources Visit spark.rstudio.com/dplyr

  27. Modeling

  28. Exercise: K-Means Modeling on the Iris Dataset
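
A minimal sketch of this exercise, assuming a local connection; the two petal features and `k = 3` are choices made here for illustration, not necessarily the ones used in the workshop files.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# copy_to() replaces dots in column names, so Petal.Length
# becomes Petal_Length in Spark.
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit k-means in Spark on two features.
kmeans_model <- iris_tbl %>%
  ml_kmeans(formula = ~ Petal_Width + Petal_Length, k = 3)

# Score the data in Spark and compare clusters against species.
cluster_counts <- ml_predict(kmeans_model, iris_tbl) %>%
  count(Species, prediction) %>%
  collect()

spark_disconnect(sc)
```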

  29. Exercise: Plot Predictions - Cluster Membership
    Don’t type all this out! Open plot-kmeans.R
  30. Spark Caching Example
    Big data can stress your Spark session if transformations occur before modeling.
    Before fitting a model: the compute() function takes the end of a dplyr piped set of transformations and saves the results to Spark memory.
    Understand Spark Caching: spark.rstudio.com/guides/caching
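
The caching pattern can be sketched as below. The transformation, the table name `cached_cars` and the regression formula are all hypothetical stand-ins for whatever precedes the model in a real analysis.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Run the transformations once and save the result in Spark memory.
cached_cars <- cars %>%
  mutate(cyl = paste0("cyl_", cyl)) %>%
  compute("cached_cars")

# The model reads the cached table instead of re-running the pipeline.
fit <- ml_linear_regression(cached_cars, mpg ~ hp + wt)

tables <- DBI::dbListTables(sc)
spark_disconnect(sc)
```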
  31. Data: Read/Write

  32. Reading Data: Paths
    Spark assumes that every file in a directory is part of the same dataset.
    Loading multiple files into a single data object:
    - In R: load each file individually into your R session
    - In Spark: treat the folder as a dataset and pass the path containing all the files
  33. Reading Data: Schema
    Spark detects column names and types (the schema), but at a cost in time. For medium-large datasets, or files that are read many times, the cost accumulates.
    Provide a `columns` argument to describe the schema.
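
Reading a folder as one dataset with an explicit schema might look like this. The sample files, folder name and table name are hypothetical and created only so the sketch runs on its own.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical sample data: two CSV files in one folder.
csv_dir <- file.path(tempdir(), "letters-csv")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
          file.path(csv_dir, "part-1.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6, y = c("d", "e", "f")),
          file.path(csv_dir, "part-2.csv"), row.names = FALSE)

# Pass the folder path, and describe the schema up front so Spark
# skips the extra pass over the data that inference would need.
letters_tbl <- spark_read_csv(
  sc, name = "letters_tbl", path = csv_dir,
  columns = c(x = "integer", y = "character"),
  infer_schema = FALSE
)

n_rows <- sdf_nrow(letters_tbl)
spark_disconnect(sc)
```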
  34. Reading Data: CSV Parsing Tools
    Spark modes for parsing malformed CSV files (use in sparklyr by passing to the options argument):
    - Permissive: Inserts NULL values for missing tokens
    - Drop Malformed: Drops lines that are malformed
    - Fail Fast: Aborts if it encounters any malformed line
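
A sketch of the Drop Malformed mode, assuming Spark's option value `DROPMALFORMED` (the other two are `PERMISSIVE` and `FAILFAST`). The test file and the column name `foo` are made up for illustration.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical bad file: a header line, three integers, one bad line.
bad_csv <- tempfile(fileext = ".csv")
writeLines(c("foo", "1", "2", "3", "broken"), bad_csv)

# With an integer schema, the "broken" line cannot be parsed
# and is dropped under DROPMALFORMED.
clean <- spark_read_csv(
  sc, name = "bad_csv", path = bad_csv,
  columns = c(foo = "integer"),
  infer_schema = FALSE,
  options = list(mode = "DROPMALFORMED")
) %>%
  collect()

spark_disconnect(sc)
```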
  35. Reading Data: Working with JSON
    How to extract data from nested files:
    - Option 1: Use a combination of get_json_object() and to_json()
    - Option 2: Use the sparklyr.nested package and sdf_unnest()
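
Option 1 can be sketched on a tiny nested record. The JSON content and field names are hypothetical; `get_json_object()` and `to_json()` are Spark SQL functions that dplyr passes through.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical nested JSON: field b is a struct with f1 and f3.
json_path <- tempfile(fileext = ".json")
writeLines('{"a": 1, "b": {"f1": 2, "f3": 3}}', json_path)

simple_json <- spark_read_json(sc, name = "simple_json", path = json_path)

# Serialize the struct back to JSON text, then pull out one field.
extracted <- simple_json %>%
  transmute(z = get_json_object(to_json(b), "$.f1")) %>%
  collect()

# Option 2 (needs the sparklyr.nested package) would instead be:
# sparklyr.nested::sdf_unnest(simple_json, "b")

spark_disconnect(sc)
```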
  36. File Formats - Optimized for Performance
    CSV, JSON: useful, common standards, not optimized for performance
    Binary file formats reduce storage space and improve performance:
    - Apache Parquet
    - Apache ORC
    - Apache Avro
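
Writing and re-reading Parquet might look like the sketch below. The output folder and table name are hypothetical; ORC has analogous `spark_write_orc()` and `spark_read_orc()` functions.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Write the Spark table to a Parquet folder, then read it back.
parquet_path <- file.path(tempdir(), "cars-parquet")
spark_write_parquet(cars, parquet_path, mode = "overwrite")
cars_back <- spark_read_parquet(sc, name = "cars_back", path = parquet_path)

n_back <- sdf_nrow(cars_back)
spark_disconnect(sc)
```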
  37. Exercise: Benchmark Writing from Spark
    Don’t type all this out - open filewrite-benchmark.R
  38. Connecting Spark with Alternative File Systems
    Spark defaults to the file system on which it is currently running. The file system protocol can be changed when reading or writing via the sparklyr path argument.
    Other file system protocols:
    • Databricks: dbfs://
    • Amazon S3: s3a://
    • Microsoft Azure: wasb://
    • Google Storage: gs://
  39. Example: Access the AWS s3a Protocol
    Accessing the “s3a” protocol requires adding a package to the sparklyr.connect.packages configuration setting.
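
The configuration could be sketched as follows. The `hadoop-aws` version shown is an assumption and should match the Hadoop version your Spark build uses; the bucket and file in the commented lines are hypothetical, and AWS credentials must be configured separately.

```r
library(sparklyr)

# Start from the default configuration and add the package that
# provides the s3a:// file system (version is an assumption).
conf <- spark_config()
conf$sparklyr.connect.packages <- "org.apache.hadoop:hadoop-aws:2.7.7"

# Connecting and reading would then look like this (not run here):
# sc <- spark_connect(master = "local", config = conf)
# my_file <- spark_read_csv(sc, "my_file", path = "s3a://my-bucket/my-file.csv")
```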
  40. Reading Data Pro-Tip: memory = FALSE
    “Map” files without copying data into Spark’s distributed memory.
    Use Cases:
    - Big Data where not all columns are needed (use dplyr to select)
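
A sketch of the mapping pattern, using a small CSV written from `mtcars` as a stand-in for a large file; the table names are hypothetical.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical input file.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# memory = FALSE maps the file without caching it in Spark memory.
mapped_cars <- spark_read_csv(sc, name = "mapped_cars", path = csv_path,
                              memory = FALSE)

# Select only the needed columns, then cache just that subset.
cars_subset <- mapped_cars %>%
  select(mpg, cyl) %>%
  compute("cars_subset")

n_cols <- sdf_ncol(cars_subset)
n_rows <- sdf_nrow(cars_subset)
spark_disconnect(sc)
```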
  41. Streaming Data

  42. Spark Streaming
    - Read: stream_read_*() functions define one or more sources
    - Transform: dplyr, SQL, feature transformers, scoring pipelines, distributed R code
    - Write: stream_write_*() functions define one or more sinks
  43. Exercise: The Simplest Stream
    Continuously copy text files from source to sink (destination)
    • Create a source path
    • Generate a test stream
    • Write to destination path
    The stream starts running with stream_write_*() and will monitor the source path, processing data into the destination path as it arrives.
  44. Exercise: The Simplest Stream
    Use stream_generate_test() to produce a file every second containing lines of text that follow a given distribution.
    Use stream_view() to track rows per second processed in the source, destination and their latest values over time.
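
A source-to-sink text stream might be sketched like this. To keep the sketch self-contained it seeds the source folder with one hand-written file rather than calling `stream_generate_test()`; the folder names and the ten-second wait are arbitrary assumptions.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical source and sink folders under the session temp directory.
source_dir <- file.path(tempdir(), "stream-source")
sink_dir <- file.path(tempdir(), "stream-sink")
dir.create(source_dir, showWarnings = FALSE)

# Seed the source with one text file.
writeLines(c("first line", "second line"),
           file.path(source_dir, "batch-1.txt"))

# Read, apply no transformation, write.
# The stream starts running at stream_write_text().
stream <- stream_read_text(sc, source_dir) %>%
  stream_write_text(sink_dir)

# Give Spark a moment to process the file, then inspect the sink.
Sys.sleep(10)
written <- list.files(sink_dir, pattern = "^part-")

stream_stop(stream)
spark_disconnect(sc)
```

In an interactive session, calling `stream_view(stream)` before stopping the stream opens the monitoring view described above.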
  45. Spark Streaming: Shiny
    Shiny’s reactive framework is well suited to support streaming information
    - Display real-time data from Spark using reactiveSpark()
    - Design the Shiny application
    - While the application is running, start the test stream running in another R process: Local Job Launcher!
  46. Exercise: Build K-Means Streaming Application
    Adapt this Shiny Gallery example to display real-time data with reactiveSpark(): shiny.rstudio.com/gallery
  47. Run Shiny App as a Background Job
    - Create a “Run App” script
    - Run script as a Local Job
    - Use rstudioapi::viewer() to open a window for the app
    - Start the stream!
  48. More Spark Streaming Resources Visit spark.rstudio.com/guides/streaming

  49. Reference Material

  52. Resource: RStudio Webinars

  53. Deployment Mode Resources
    - Standalone in AWS: https://spark.rstudio.com/examples/stand-alone-aws/
    - YARN EMR in AWS: https://spark.rstudio.com/examples/yarn-cluster-emr/
    - Cloudera: https://spark.rstudio.com/examples/cloudera-aws/
    - Databricks: https://docs.databricks.com/spark/latest/sparkr/rstudio.html
  54. RStudio Databricks Integration
    RStudio on Databricks: https://docs.databricks.com/spark/latest/sparkr/rstudio.html
        library(sparklyr)
        SparkR::sparkR.session()
        sc <- spark_connect(method = "databricks")
  55. Resource: RStudio Webinars

  56. Resource: RStudio Webinars

  57. Extending Spark with Sparklyr

  58. Resource: RStudio Webinars

  59. Resource: RStudio Webinars