
Spark + R WSDS

kellobri
October 03, 2019

A Workshop for the Women in Statistics and Data Science 2019 Conference

Transcript

  1. Introduction to Spark + R
    Women in Statistics and Data Science 2019


  2. Hi, I’m Kelly from
    RStudio Solutions Engineering
    I’m a co-organizer of the Washington DC R-Ladies Meetup
    I work in a cross-functional group at RStudio, serving our customers, sales team, support team, and product engineering teams.
    My favorite thing we make is RStudio Connect!


  3. Sparklyr Resources
    spark.rstudio.com
    RStudio Webinars On Demand:
    1. Introducing an R interface for Apache Spark
    2. Extending Spark using sparklyr and R
    3. Advanced Features of sparklyr
    4. Understanding Spark and sparklyr deployment modes
    resources.rstudio.com/working-with-spark


  4. Sparklyr Video Resources
    Webinars; Javier’s YouTube Channel


  5. Coming Soon!
    Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling (formerly The R in Spark)
    - Analysis
    - Modeling
    - Pipelines
    - Clusters
    - Connections
    - Data
    - Tuning
    - Extensions
    - Distributed R
    - Streaming
    - Contributing


  6. Plan for Today
    Get access to our infrastructure (RStudio Server Pro)
    Introduction to Spark and R
    Analysis with Sparklyr (DBI, dplyr review)
    Modeling (super basic)
    Sparklyr Skills:
    - Patterns for reading and writing data
    - Working with streaming data
    - The mechanics of distributing R code


  7. Resource: RStudio Webinars


  8. Resource: RStudio Webinars


  9. Resource: RStudio Webinars


  10. What is Spark?
    “Apache Spark is a unified analytics engine for large-scale data processing.” -- spark.apache.org
    Unified: Spark supports many libraries, cluster technologies and storage systems
    Analytics: the discovery and interpretation of data to communicate information
    Engine: Spark is expected to be efficient and generic
    Large-Scale: “cluster-scale” - a set of connected computers working together


  11. Resource: RStudio Webinars


  12. Analysis with Sparklyr


  13. Exercise: Install Spark Locally and Connect
    ● Load sparklyr
    ● Install Spark 2.3
    ● Check installed versions
    ● Connect!
    ● Disconnect!
    FAQ: What is a local Spark cluster good for?
    ● Getting started
    ● Testing code
    ● Troubleshooting
    What should this be?
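    A minimal sketch of this exercise (the local master and the version string are the ones used in the workshop; adjust to what you have installed):
    library(sparklyr)
    spark_install("2.3")               # download and install a local copy of Spark 2.3
    spark_installed_versions()         # check which versions are installed
    sc <- spark_connect(master = "local", version = "2.3")
    spark_disconnect(sc)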


  14. Exercise: RStudio Connections Pane
    ● Reconnect to Spark Local through the RStudio Connections Pane
    ● Copy mtcars to Spark Local
    ● Preview mtcars in the Connections Pane
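    A sketch of the copy step (assumes the local connection `sc` from the previous exercise):
    library(dplyr)
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)   # the table then shows up in the Connections Pane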


  15. Tour: Spark Tools in the Connections Pane
    Spark UI - Opens the Spark web interface; a shortcut to spark_web(sc)
    Log - Opens the Spark logs; a shortcut to spark_log(sc)
    SQL - Opens a new SQL query
    Help - Opens the reference documentation in a new web browser window
    Disconnect - Disconnects from Spark; a shortcut to spark_disconnect(sc)


  16. Exercise: SQL Files from Connections Pane
    ● Create a SQL file from the Connections Pane
    ● Craft a test query
    ● Save and preview the result


  17. Exercise: Constructing Queries - DBI vs. dplyr
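    One way this exercise can look, assuming `sc` and the mtcars table copied earlier (the grouping column is just an example):
    library(DBI)
    library(dplyr)
    # DBI: write the SQL yourself
    dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")
    # dplyr: build the same query with verbs; show_query() reveals the generated SQL
    tbl(sc, "mtcars") %>%
      group_by(cyl) %>%
      summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
      show_query()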


  18. Basic Analysis: Communicate with R Markdown
    R for Data Science workflow
    - Set up an R Markdown document to take notes and run exercises in


  19. Exercise: dplyr Group By
    Reference: R for Data Science - Chapter 5


  20. Exercise: Built-in Functions
    What if an operation isn't available through dplyr & sparklyr?
    ● Look for a built-in function available in Spark
    ● dplyr passes functions it doesn't recognize to the query engine as-is
    Example: the percentile() function expects a column name and returns the exact percentile of a column in a group.
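    A sketch of the pass-through pattern (assumes `mtcars_tbl` from earlier; percentile() is a Spark built-in, not a dplyr verb):
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(mpg_p50 = percentile(mpg, 0.5))   # dplyr sends percentile() to Spark SQL as-is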


  21. Exercise: Correlation - the corrr package
    Common EDA task: calculate and visualize correlations to find out what kind of statistical relationship exists between paired sets of variables.
    Spark function to calculate correlations across an entire dataset: sparklyr::ml_corr()
    The corrr R package contains a backend for Spark, so when a Spark object is used with corrr, the computation happens in Spark.
    The correlate() function runs sparklyr::ml_corr()
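    A sketch, assuming `mtcars_tbl` from earlier and that corrr is installed:
    library(corrr)
    sparklyr::ml_corr(mtcars_tbl)   # full correlation matrix computed in Spark
    mtcars_tbl %>%
      correlate(use = "pairwise.complete.obs", method = "pearson")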


  22. Visualizations: Push Compute, Collect Results
    1. Ensure the transformation operations happen in Spark
    2. Bring the results back into R after the transformations have been performed
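    The pattern in code (a sketch built on `mtcars_tbl`; the aggregation and plot are only examples):
    library(ggplot2)
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(mpg = mean(mpg, na.rm = TRUE)) %>%   # 1. aggregation runs in Spark
      collect() %>%                                  # 2. only the small summary comes back to R
      ggplot(aes(factor(cyl), mpg)) + geom_col()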


  23. Visualizations: Using dbplot
    dbplot - helper functions for plotting with remote data
    The dbplot_histogram() function makes Spark calculate the bins and the count per bin, and outputs a ggplot object
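    A minimal sketch with `mtcars_tbl` (the column and binwidth are arbitrary):
    library(dbplot)
    mtcars_tbl %>%
      dbplot_histogram(mpg, binwidth = 3)   # bins and counts computed in Spark, plotted with ggplot2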


  24. Problematic Visualizations: Scatter Plots
    No amount of “pushing the computation” to Spark will help with this problem, because the data must be plotted as individual dots
    The alternative is to find a plot type that represents the x/y relationship and concentration in a way that is easy to perceive and to “physically” plot
    Use dbplot_raster() to create a scatter-like plot in Spark, while only collecting a small subset of the remote dataset


  25. Challenge: Explore dbplot compute functions
    Use dbplot to retrieve the raw data and create alternative visualizations
    To retrieve the aggregates without the plots, use the functions below (a short sketch follows the list):
    ● db_compute_bins()
    ● db_compute_count()
    ● db_compute_raster()
    ● db_compute_boxplot()
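    A sketch of both approaches, again on `mtcars_tbl` (columns and resolution are arbitrary):
    # Raster plot: Spark aggregates x/y into a grid, R only plots the grid
    mtcars_tbl %>% dbplot_raster(mpg, wt, resolution = 50)
    # Same aggregation without the plot, if you want to build your own ggplot
    mtcars_tbl %>% db_compute_raster(mpg, wt, resolution = 50)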


  26. More Resources
    Visit spark.rstudio.com/dplyr


  27. Modeling


  28. Exercise: K-Means Modeling on the Iris Dataset
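    A minimal sketch of the exercise (assumes `sc`; note that copy_to() converts the dots in the iris column names to underscores):
    iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
    kmeans_model <- iris_tbl %>%
      ml_kmeans(~ Petal_Length + Petal_Width, k = 3)
    predictions <- ml_predict(kmeans_model, iris_tbl) %>% collect()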


  29. Exercise: Plot Predictions - Cluster Membership
    Don’t type all this out! - Open plot-kmeans.R


  30. Spark Caching Example
    Big data can stress your Spark session if transformations occur before modeling
    Before fitting a model: the compute() function takes the end of a dplyr pipeline of transformations and saves the results to Spark memory
    Understand Spark Caching: spark.rstudio.com/guides/caching
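    A sketch of the caching pattern (the transformations are placeholders; the table name is arbitrary):
    cached_tbl <- mtcars_tbl %>%
      filter(hp > 100) %>%
      mutate(wt_tons = wt / 2) %>%
      compute("mtcars_prepared")   # materializes the result in Spark memory before modeling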


  31. Data: Read/Write


  32. Reading Data: Paths
    When given a folder path, Spark assumes that every file in the directory is part of the same dataset
    Loading multiple files into a single data object:
    - In R: load each file individually into your R session
    - In Spark: treat the folder as a dataset and pass the path containing all the files
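    A sketch of the folder-as-dataset pattern (the folder name is hypothetical):
    cars <- spark_read_csv(sc, name = "cars", path = "cars-data/")   # reads every file in the folder as one dataset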


  33. Reading Data: Schema
    Spark detects column names and types (the schema), but at a cost (time). For medium-to-large datasets, or files that are read many times, the cost accumulates.
    Provide a `columns` argument to describe the schema.
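    A sketch of supplying the schema up front (column names and types are hypothetical):
    spec <- c(mpg = "double", cyl = "integer", gear = "integer")
    cars <- spark_read_csv(sc, "cars", path = "cars-data/",
                           columns = spec, infer_schema = FALSE)   # skips the schema-detection pass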


  34. Reading Data: CSV Parsing Tools
    Spark modes for parsing malformed CSV files (use in sparklyr by passing to the options argument):
    - Permissive: inserts NULL values for missing tokens
    - Drop Malformed: drops lines that are malformed
    - Fail Fast: aborts if it encounters any malformed line
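    For example, to drop malformed lines instead of failing (a sketch; the path is hypothetical):
    cars <- spark_read_csv(sc, "cars", path = "cars-data/",
                           options = list(mode = "DROPMALFORMED"))   # or "PERMISSIVE" / "FAILFAST"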


  35. Reading Data: Working with JSON
    How to extract data from nested files:
    - Option 1: Use a combination of get_json_object() and to_json()
    - Option 2: Use the sparklyr.nested package and sdf_unnest()
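    A rough sketch of Option 1 (the file, column, and JSON path are hypothetical):
    people <- spark_read_text(sc, "people", "people.json")
    people %>%
      mutate(city = get_json_object(line, "$.address.city")) %>%   # get_json_object() is a Spark SQL function passed through by dplyr
      select(city)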


  36. File Formats - Optimized for Performance
    CSV, JSON - Useful, common standards, not optimized for performance
    Binary file formats reduce storage space and improve performance:
    - Apache Parquet
    - Apache ORC
    - Apache Avro
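    For example, round-tripping a table through Parquet (a sketch; the paths are hypothetical):
    spark_write_parquet(cars, "cars-parquet/")                      # columnar, compressed
    cars_pq <- spark_read_parquet(sc, "cars_pq", "cars-parquet/")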


  37. Exercise: Benchmark Writing from Spark
    Don’t type all this out - open filewrite-benchmark.R


  38. Connecting Spark with Alternative File Systems
    Spark defaults to the file system on which it is currently running. The file system protocol can be changed when reading or writing via the sparklyr path argument.
    Other file system protocols:
    ● Databricks: dbfs://
    ● Amazon S3: s3a://
    ● Microsoft Azure: wasb://
    ● Google Storage: gs://
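    The read functions stay the same; only the protocol in the path changes (the bucket name is hypothetical):
    cars_s3 <- spark_read_csv(sc, "cars_s3", path = "s3a://my-bucket/cars-data/")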


  39. Example: Access the AWS s3a Protocol
    Accessing the “s3a” protocol requires adding a package to the sparklyr.connect.packages configuration setting:
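    A sketch of the configuration (the hadoop-aws version is an assumption; match it to your Spark/Hadoop build):
    config <- spark_config()
    config$sparklyr.connect.packages <- "org.apache.hadoop:hadoop-aws:2.7.7"
    sc <- spark_connect(master = "local", config = config)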


  40. Reading Data Pro-Tip: Memory = FALSE
    “Map” files without copying data into Spark’s distributed memory
    Use Cases:
    - Big data where not all columns are needed (use dplyr to select only the ones you want)
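    A sketch of the mapping pattern (folder and columns are hypothetical):
    cars <- spark_read_csv(sc, "cars", path = "cars-data/", memory = FALSE)   # map the files, don't cache them
    cars %>% select(mpg, cyl) %>% collect()                                   # only the selected columns are scanned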


  41. Streaming Data


  42. Spark Streaming
    Read: stream_read_*() functions - define one or more sources
    Transform: dplyr, SQL, feature transformers, scoring pipelines, distributed R code
    Write: stream_write_*() functions - define one or more sinks
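    The three stages in one pipeline (a sketch; the paths and the filter column are placeholders):
    stream_read_csv(sc, "source/") %>%
      filter(x > 700) %>%                  # the transformation runs on every micro-batch
      stream_write_csv("destination/")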


  43. Exercise: The Simplest Stream
    Continuously copy text files from a source to a sink (destination)
    ● Create a source path
    ● Generate a test stream
    ● Write to a destination path
    The stream starts running with stream_write_*() and will monitor the source path, processing data into the destination path as it arrives.


  44. Exercise: The Simplest Stream
    Use stream_generate_test() to produce a file every second containing lines of text that follow a given distribution. Use stream_view() to track the rows per second processed in the source and destination, and their latest values over time.
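    A sketch of the whole exercise (paths follow stream_generate_test()'s default "source" folder; in practice the generator would run in a separate process, e.g. a Local Job):
    dir.create("source")
    writeLines("Hello world", "source/hello.txt")         # seed file so the stream has something to read
    stream <- stream_read_text(sc, "source/") %>%
      stream_write_text("destination/")
    stream_generate_test(interval = 1, iterations = 30)   # writes a new test file into "source" every second
    stream_view(stream)                                   # rows-per-second dashboard
    stream_stop(stream)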


  45. Spark Streaming: Shiny
    Shiny’s reactive framework is well suited to streaming information - display real-time data from Spark using reactiveSpark()
    - Design the Shiny application
    While the application is running, start the test stream in another R process
    - Local Job Launcher!
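    A heavily simplified sketch of such an app (the source path and table output are assumptions, not the workshop's actual application):
    library(sparklyr)
    library(shiny)
    sc <- spark_connect(master = "local")
    ui <- fluidPage(tableOutput("recent"))
    server <- function(input, output, session) {
      # reactiveSpark() re-collects the stream whenever new data arrives
      ticks <- stream_read_text(sc, "source/") %>%
        reactiveSpark()
      output$recent <- renderTable(head(ticks(), 10))
    }
    shinyApp(ui, server)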


  46. Exercise: Build K-Means Streaming Application
    Adapt this Shiny Gallery example to display real-time data with reactiveSpark()
    shiny.rstudio.com/gallery


  47. Run Shiny App as a Background Job
    - Create a “Run App” script
    - Run the script as a Local Job
    - Use rstudioapi::viewer() to open a window for the app
    - Start the stream!


  48. More Spark Streaming Resources
    Visit spark.rstudio.com/guides/streaming


  49. Reference Material



  52. Resource: RStudio Webinars


  53. Deployment Mode Resources
    Standalone in AWS https://spark.rstudio.com/examples/stand-alone-aws/
    YARN EMR in AWS https://spark.rstudio.com/examples/yarn-cluster-emr/
    Cloudera https://spark.rstudio.com/examples/cloudera-aws/
    Databricks https://docs.databricks.com/spark/latest/sparkr/rstudio.html


  54. RStudio Databricks Integration
    RStudio on Databricks
    https://docs.databricks.com/spark/latest/sparkr/rstudio.html
    # With SparkR: library(SparkR); sparkR.session()
    # With sparklyr:
    library(sparklyr)
    sc <- spark_connect(method = "databricks")


  55. Resource: RStudio Webinars


  56. Resource: RStudio Webinars


  57. Extending Spark with Sparklyr


  58. Resource: RStudio Webinars


  59. Resource: RStudio Webinars
