Moonshot Spark

Rose Toomey
November 15, 2019

Moonshot Spark: Serverless Apache Spark with GraalVM
Scale By The Bay 2019

An overview of the current state of serverless Apache Spark, where the future might go, and a look at how GraalVM could benefit the Apache Spark ecosystem.


Transcript

  1. About me NYC. Finance. Technology. Code. • My job is
    to write code, but I keep finding myself drawn back to working with the data - Lead API Developer at Gemini Trust - Director at Novus Partners • Now: coding and working with data full time - software engineer at Coatue Management (Spark At Scale In The Cloud, Spark+AI Summit Europe 2019) 2
  2. Apache Spark business as usual Things an Apache Spark cluster
    loves: • Processing very large datasets using - Lots of memory - Fast local storage - RPC communications • Granular control over cluster nodes - Scheduling - Dynamic allocation 3
  3. Serverless the very best bits • Reduce costs • Scale

    easily • Quick deployments and updates • Stateless functions are very, well, functional • Easy to reason about • Easy to compose • Microservices are appealing • Raise your hand if you have been the compile-time victim of a mono-Repo Of Unusual Size • Infrastructure as code • DevOps! NoOps! SomebodyElseOps! 4
  4. Apache Spark + Serverless? join me on holiday from reality

    I want • The ability to run an Apache Spark job • Without worrying about the details of provisioning the cluster • On input data of any size • Knowing horizontal auto-scaling will - just work - reliably reduce the costs of my job over time • And it should be simpler than whatever I do now • Plus integrate out of the box with popular logging and monitoring tools 5
  5. Serverless In an ideal world, serverless architecture lets us focus

    on our application while abstracting over the resources necessary to run it. • Your code is a function - That may have external dependencies like cloud storage or third-party APIs • That runs in a stateless container - Triggered by some external event • And "somebody else" handles the plumbing 7
  6. What you control Staying focused on this ideal world, you

    would get billed by the sub-second for only those resources your function actually consumes. 1. The amount of memory you allocate (which may also control the number of cores) 2. The function timeout, relative to the maximum allowed function runtime 3. Cloud provider access policies, so the containers can access external resources 4. From this point on, you're limited to customizing your container 8
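To make knobs 1 and 2 concrete, here is a hedged sketch using the standard AWS CLI `update-function-configuration` command (the function name is a placeholder):

```shell
# Sketch: the two main dials you control on AWS Lambda.
# "my-spark-fn" is a hypothetical function name.
# --memory-size also scales the CPU share allocated to the function;
# --timeout is capped at the 15-minute (900 second) maximum.
aws lambda update-function-configuration \
  --function-name my-spark-fn \
  --memory-size 2048 \
  --timeout 900
```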
  7. What you don't control 1. How long external services take

    to respond 2. Reliability of the underlying infrastructure 3. The maximum amount of time your cloud provider will allow a "function as a service" (FaaS) to run 9
  8. Obstacles • Memory efficiency • The JVM's slow startup relative to the

    allowed "function as a service" (FaaS) runtime • Spark keeps local state in memory and on disk - Where will shuffle keep its data now? ‣ Remember timeouts? Shuffle will effectively behave like a third party API now, because it will directly depend on external resources • Scheduling 10
  9. Stateless Functions All limits are provided as of November 2019
    - please consult the docs for the most current details
    Amazon Web Services (AWS) Lambda - Memory: 128 - 3008 MB - Function duration: up to 900 seconds (15 minutes) - Deployment size: 250 MB unzipped, including layers - https://docs.aws.amazon.com/lambda/latest/dg/limits.html
    Google Cloud Platform (GCP) Functions - Memory: up to 2048 MB - Function duration: 540 seconds (9 minutes) - Deployment size: 500 MB uncompressed source plus modules - https://cloud.google.com/functions/quotas
    Microsoft Azure Functions (limits vary by hosting plan) - Memory: up to 1.5 GB - Function duration: 10 minutes - Deployment size: 500 MB uncompressed source plus modules - https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale 12
  10. Stateless Functions All limits are provided as of November 2019
    - please consult the docs for the most current details
    IBM Cloud Functions - Memory: up to 2048 MB - Function duration: up to 600,000 ms (10 minutes) - Deployment size: 48 MB base64 encoded (custom Docker image with dependencies allowed) - https://cloud.ibm.com/docs/openwhisk?topic=cloud-functions-limits - Open sourced: Apache OpenWhisk
    Oracle Cloud Functions - Memory: up to 1024 MB - Function duration: up to 120 seconds (2 minutes) - Deployment size: 6 MB - https://docs.cloud.oracle.com/iaas/Content/Functions/Tasks/functionscustomizing.htm - Open sourced: Fn Project 13
  11. Stateless Spark, you say? Spark evolved to process very large

    datasets. Using Spark inside stateless functions - introduces time and memory constraints - plus now we have to worry about the size of the deployable - depending on how often the function is invoked, it might not cost less than running jobs on a traditional cluster But what happens when you have small, intermittent datasets that you want to process using the same Spark code you used to handle much bigger datasets? 14
  12. Embarrassingly parallel Certain kinds of standalone Spark server jobs are

    embarrassingly parallel: • The input data is small (doesn't need to be split up) • The analysis consists of many operations that are - Independent - Idempotent Instead of a single standalone Spark server running N operations, now you have N stateless functions, each running a standalone Spark server that executes a single operation. Results written externally. If it fails, try again. No doubt we lose time in invocation costs, but we might still save money relative to keeping a Spark cluster idle for "infrequent" incoming requests. 15
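The fan-out shape above can be sketched in a few lines of stdlib Python. Everything here is illustrative: `run_operation` is a hypothetical stand-in for one independent, idempotent Spark operation, and a thread pool stands in for parallel stateless function invocations.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one independent, idempotent operation: in the
# serverless version, each call would be a separate stateless function
# invocation running its own standalone Spark server.
def run_operation(op_id: int, data: list) -> tuple:
    result = sum(x * op_id for x in data)  # placeholder "analysis"
    return op_id, result                   # results would be written externally

data = [1, 2, 3]  # small input: no need to split it up

# Instead of one Spark server running N operations, fan out N invocations.
# Each is independent and idempotent, so retrying a failed one is safe.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(lambda i: run_operation(i, data), range(1, 5)))

print(results)  # {1: 6, 2: 12, 3: 18, 4: 24}
```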
  13. What if it won't fit in a single function? What

    if • There are multiple discrete functions that need to transform the data in order? - I was promised I could compose functions • Our input data isn't small enough to be processed by a single function and needs to be split up first • We have multiple Spark jobs that need to run in a defined order These cases aren't really the same, but they share the need for orchestration of some kind. 17
  14. Reductio ad absurdum half steppin cluster management OK, we have

    too much data for a single stateless function and stateless functions can't directly communicate with each other. So let's use other cloud provider services to knit things together: • The event trigger now invokes a step orchestration service instead of a serverless function • Need to coordinate data between stages using relatively slow/expensive external means of communication, e.g. cloud storage or key-value stores
    Step orchestration - Kicks off a stateless function that can break up the input data into discrete granular pieces small enough to be processed by a stateless function without running out of memory or timing out - Then kicks off the actual functions that process the data, in parallel
    Ugly, error-prone, hard to manage, but the seed of some useful ideas 18
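That two-stage shape can be sketched in stdlib Python. This is a toy model, not a real orchestrator: the chunk limit and stage names are hypothetical, and a thread pool stands in for the step service fanning out stateless functions.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK = 100  # hypothetical per-invocation limit (memory/timeout proxy)

def split_stage(records: list, max_chunk: int = MAX_CHUNK) -> list:
    """Stage 1: a stateless function that breaks the input into pieces
    small enough for a single function invocation."""
    return [records[i:i + max_chunk] for i in range(0, len(records), max_chunk)]

def process_stage(chunk: list) -> int:
    """Stage 2: one stateless function per chunk. In the real thing the
    result goes to external storage, since functions can't talk directly."""
    return sum(chunk)

records = list(range(250))
chunks = split_stage(records)       # 3 chunks: 100 + 100 + 50 records
with ThreadPoolExecutor() as pool:  # the orchestrator fans out stage 2
    partials = list(pool.map(process_stage, chunks))
total = sum(partials)               # final reduce over the external results
print(len(chunks), total)
```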
  15. Viable options 1. Run it on a serverless container instead

    - No time limit - More resources 2. Serverless workflow: orchestrate the job using cloud provider services 3. Use a purpose-built cloud provider offering - If you already went all in on the cloud provider ecosystem, might as well use their helper services to nail it all together 4. Delegate cluster management 19
  16. Serverless containers Hybrid approach to get around limitations of functions

    Use a serverless function to launch a standalone Spark cluster in a serverless container. 20 AWS Fargate Run containers without managing servers or clusters. https://aws.amazon.com/fargate/ Google Cloud Run Run stateless containers on a fully managed environment or on Anthos. https://cloud.google.com/run/ Microsoft Azure Container Instances Run containers on Azure without managing servers. https://azure.microsoft.com/services/container-instances/
  17. Serverless orchestration an interesting off-shoot Another hybrid approach: it's the

    coordination of the pipeline that's serverless. We're agnostic about what the actual steps do, except that a stateless function can launch them. 21 AWS Step Functions Coordinate multiple AWS services into serverless workflows https://aws.amazon.com/step-functions/ Microsoft Azure Logic Apps Coordinate multiple Microsoft services into serverless workflows https://azure.microsoft.com/services/logic-apps/
  18. Serverless ETL, batteries included just add code* Microsoft Data Accelerator
    for Apache Spark Plug and play managed data pipeline toolkit https://github.com/microsoft/data-accelerator
    Google Cloud Dataproc Fully managed cloud service for running Spark and Hadoop clusters https://cloud.google.com/dataproc/ See also: Google Cloud Dataflow, for stream and batch processing
    AWS Glue Fully managed ETL service https://aws.amazon.com/glue/ See also: AWS Data Pipeline, to process and move data between different AWS compute and storage services
    * if you already bought into the rest of the pipeline ecosystem 22
  19. Delegating cluster management Not for the first time, Kubernetes to

    the rescue. • Spark 2.3+ has an experimental native Kubernetes scheduler - Supports plain old spark-submit, although monitoring and fault tolerance aren't polished - But also Spark Operator, which extends the Kubernetes operator pattern using Custom Resource Definitions to provide better lifecycle management • Spark 3.0 improvements on the horizon - Ongoing Spark Operator improvements - [SPARK-27963][core] Allow dynamic allocation without a shuffle service - [SPARK-24793] Make spark-submit more useful with k8s - Dynamic resource allocation - External shuffle 23
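For reference, plain old spark-submit against the native Kubernetes scheduler looks roughly like this (a sketch: the API server URL, image tag, and jar path are placeholders; the `spark.kubernetes.*` keys come from the Spark 2.4 running-on-kubernetes docs):

```shell
# Sketch: submitting the SparkPi example to the experimental k8s scheduler.
# The cluster URL, container image, and jar location are placeholders.
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-registry/spark:v2.4.4 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
```

The `local://` scheme tells the driver the jar is already inside the container image, so nothing needs to be uploaded at submit time.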
  20. Cluster management built on top of cloud providers In

    addition to cloud providers' own offerings, there are third parties making some interesting changes to how Spark clusters work in the cloud. Databricks Runtime Core (serverless) Available for: AWS, Azure Introducing Databricks Optimized Autoscaling on Apache Spark™ Qubole Qubole Announces Apache Spark on AWS Lambda Available for: AWS, GCP, Azure, Oracle 24
  21. The Shuffle problem • Shuffle blocks are stored locally on

    the node where the executor is running • An external shuffle service, which is a separate process from the executor, still serves up those shuffle blocks from the node's local storage • Since functions might lack local storage and can't communicate with each other, serverless Spark implementations focus on writing shuffle data to remote resources (cloud storage, key-value stores, even messaging queues) - Worst case, shuffle could generate so many intermediate files that it exceeds the resource limitations on a function 26
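The externalized-shuffle idea can be sketched with a toy map/reduce in stdlib Python, where a temp directory stands in for cloud storage and a file-naming convention replaces node-local block lookup. All names here are illustrative, not Spark's actual shuffle format.

```python
import os
import tempfile
from collections import defaultdict

# Toy "shuffle to remote storage": map tasks hash-partition their records
# and write each partition to an external store, so reduce tasks never need
# to contact the node that produced the data.
def map_task(task_id: int, records: list, n_reducers: int, store: str) -> None:
    partitions = defaultdict(list)
    for r in records:
        partitions[hash(r) % n_reducers].append(r)
    for p, rows in partitions.items():
        # One "shuffle block" per (map task, reduce partition) pair.
        with open(os.path.join(store, f"map{task_id}-part{p}.txt"), "w") as f:
            f.write("\n".join(rows))

def reduce_task(p: int, n_mappers: int, store: str) -> list:
    rows = []
    for m in range(n_mappers):
        path = os.path.join(store, f"map{m}-part{p}.txt")
        if os.path.exists(path):  # a mapper may emit nothing for partition p
            with open(path) as f:
                rows.extend(f.read().splitlines())
    return rows

store = tempfile.mkdtemp()  # stand-in for a cloud storage bucket
map_task(0, ["a", "b", "c"], n_reducers=2, store=store)
map_task(1, ["d", "e"], n_reducers=2, store=store)
all_rows = sorted(reduce_task(0, 2, store) + reduce_task(1, 2, store))
print(all_rows)  # every record lands in exactly one partition
```

The worst case from the slide is visible here too: the store holds up to mappers × reducers block files, which is exactly the intermediate-file explosion that can blow through a function's resource limits.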
  22. The auto-scaling problem • Scaling up is straightforward •
    But scaling down is much more complex - Even with an external shuffle service, shuffle blocks are stored locally on the node - Even with dynamic allocation, the node won't be shut down (it will just free up to do work on a different job) So fixing auto-scaling means fixing shuffle: - Use fast external storage that scales nicely ‣ NFS keeps turning up here ‣ See also Apache Crail (incubating), which uses remote direct memory access - Clever improvements to write data locally and offload to external storage only when there are no remaining executors on the node 27
  23. The scheduler problem But now we have another problem! •
    When you touch resource allocation, like wanting to improve auto-scaling... • When you delegate cluster management ‣ you touch the scheduler, which is both complex and non-public Getting around this requires either • Delegating cluster management to something which uses a different scheduler - Note Kubernetes already allows extending the stock scheduler • Forking Spark 28
  24. Spark 3.0 and serverless Spark core developers are having some

    very interesting discussions about these topics [SPARK-27941] Serverless Spark in the Cloud [SPARK-25299] Use remote storage for persisting shuffle data / SPIP / Deep dive into shuffle reliability [SPARK-19700] Design an API for pluggable scheduler implementations 29
  25. More Abstraction Apache Beam Implement batch and streaming data processing
    jobs that run on any execution engine. SDKs in Java, Python, or Go Available on Google Cloud Dataflow https://beam.apache.org/ 30
  26. Faster, leaner, serverless JVM on serverless has a high overhead:

    • Container startup speed • CPU and memory (giving more memory improves startup time, but increases costs) • And we already discussed increased runtime due to remote storage overhead - which shuffle improvements will help with in time. But what can we do to improve container cold start time and resource usage? 32
  27. GraalVM
 JVM speedups for the intrepid You've probably heard about

    using GraalVM to compile JVM code faster. GraalVM Native Image (Early Adopter Technology) • Supports Ahead of Time (AOT) compilation - Improves startup time - Classes can be initialized at build time for shorter startup times (although some classes may require runtime initialization for a properly working app) • Executable is smaller (uses less memory) - Substrate VM only includes what will actually be used at runtime
    Running Apache Spark as a native image would start up quicker, run faster, and use less memory. 33
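Building such a native image would look roughly like this. A hedged sketch only: the jar, main class, and output name are placeholders, and a real Spark build would need extensive reflection and resource configuration beyond these flags.

```shell
# Sketch: AOT-compiling a JVM application with GraalVM Native Image.
# my-spark-job.jar and com.example.MySparkJob are hypothetical names.
# --no-fallback fails the build rather than silently falling back to
# bundling a full JVM, so you know you really got a native image.
native-image \
  --no-fallback \
  -cp my-spark-job.jar \
  com.example.MySparkJob \
  my-spark-job

# The result is a standalone executable with fast startup and a smaller
# memory footprint than the same code running on the JVM.
./my-spark-job
```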
  28. JIT vs AOT Just in time (JIT) compilation works at

    runtime: • When the application starts up, everything needs to be loaded into memory, parsed/verified, and initialized (slow) • Eventually*, the JIT compiler will profile and then compile bytecode that runs many times ("hot") into optimized native code Ahead of Time (AOT) compilation optimizes everything before execution time. It can perform optimizations too costly (slow!) for a JIT compiler. For long-running applications, JIT is great. It's responsive and lazy (why optimize code that nobody ever runs?). But for short-lived applications, the faster startup can be worth the longer compile time and certain limitations of compiling a native image. JEP 295: Ahead-of-Time Compilation * eventually could be a long time relative to the runtime of the function 34
  29. Polyglot Spark Madame, parlez-vous Python? GraalVM Polyglot API has support

    for running other languages in JVM-based applications. Python users in the Spark world can experience a number of interoperability slowdowns. Conventional wisdom online says that - even though the Python interoperability is slow relative to Scala! - it's not the slowest part. But don't developer costs count for something too? What kind of speedup would polyglot support give these users? 35
  30. Just begun As of November 2019, the idea of building

    a native image of Apache Spark is in its infancy. Only rumors. But it's such a good idea I predict it will happen within a year or two. And for people who compile Apache Spark - it's a lot faster with GraalVM GraalVM 19.1: Compiling Faster (Thomas Wuerthinger) 36
  31. Interested? • What we do: data engineering @ Coatue -

    Terabyte scale, billions of rows - Lambda architecture - Functional programming • Stack - Scala (cats, shapeless, fs2, http4s) - Spark / Hadoop / EMR / Databricks - Data warehouses - Python / R / Tableau - Chat with me or email: [email protected] - Twitter: @prasinous 37
  32. Papers • Serverless Computing: One Step Forward, Two Steps Back

    (J. Hellerstein et al., CIDR 2019) • From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers (S. Fouladi et al., USENIX ATC '19) • Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure (Q. Pu, S. Venkataraman, and I. Stoica, USENIX NSDI '19) • Towards Practical Serverless Analytics (Q. Pu, UC Berkeley Technical Report No. UCB/EECS-2019-105) • Occupy the Cloud: Distributed computing for the 99% (E. Jonas, S. Venkataraman, I. Stoica, B. Recht, arXiv 1702.0402) 39
  33. Blog posts • Cloud Dataproc Spark Jobs on GKE: How

    to get started (Christopher Crosbie and Patrick Clay) • AWS Lambda - does it fit in with data processing? (Bartosz Konieczny) • Benchmarks, Spark and Graal (Phil Phil) • Instant Netty Startup using GraalVM Native Image Generation (Codrut Stancu) • Small & fast Docker images using GraalVM’s native-image (Adam Warski) • Updates on Class Initialization in GraalVM Native Image Generation (Christian Wimmer) • Mastering Java Cold Start On AWS Lambda Volume 1 (Serkan Özal) • How To Manage And Monitor Apache Spark On Kubernetes - Part 1 Part 2 (Chaoran You and Stavros Kontopoulos) • Spark scheduling in Kubernetes (Palantir) 40
  34. Presentations • Improving Apache Spark Downscaling (Christopher Crosbie and Ben

    Sidham, Spark+AI Summit Europe 2019) • Reliable Performance at Scale with Apache Spark on Kubernetes (Will Manning and Matt Cheah, Spark+AI Summit Europe 2019) • Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters (Prakhar Jain and Venkata Krishnan Sowrirajan, Spark+AI Summit Europe 2019) • Maximizing Performance with GraalVM (Thomas Wuerthinger, Oracle Code 2019) • Twitter’s Quest for a Wholly Graal Runtime (Chris Thalinger) • Improving GraalVM Native Image (Christian Wimmer, JVM Language Summit 2019) • Efficient Management of Ephemeral Data in Serverless Computing (Patrick Stuedi, Fourth International Workshop on Serverless Computing) • Adopting GraalVM (Petr Zapletal, Scale By The Bay 2018) • Apache Spark on Kubernetes (Anirudh Ramanathan and Tim Chen, Spark+AI Summit 2017) • Lambda Architecture in the Cloud with Azure Databricks (Andrei Varanovich, Spark+AI Summit DEV6) 41
  35. Use Cases • Serverless Model Serving: OpenWhisk, Apache Spark, and

    MLeap (Jowanza Joseph) • How We Implemented a Fully Serverless Recommender System Using GCP (Will Fuks) • Serverless Data Engineering: AWS Glue + Lambda + Athena + QuickSight (Peter Begle) • Implementing ETL job using AWS Glue (Vijayendra Bhati) • Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy (Tanzir Musabbir) • ETL with standalone Spark containers for ingesting small files (Joe Corey) • Event-driven serverless ETL using AWS Lambda, Step Functions, EMR and Fargate (Siddarth Sharma) • Using Graal AOT for a realistic project? (Jurgen Voorneveld) 42
  36. Code • https://github.com/qubole/spark-on-lambda • https://github.com/pywren/pywren • https://github.com/lambci/docker-lambda • https://github.com/graalvm/graalvm-demos •

    https://github.com/palantir/spark-tpcds-benchmark • https://github.com/GoogleCloudPlatform/spark-on-k8s-operator • https://github.com/palantir/k8s-spark-scheduler • https://github.com/CloudScala/scala-graalvm-docker 43
  37. Image Credits Title slide: Moon - North Pole (NASA ID:

    PIA00126) Looking ahead slide: Astronaut Edward White during first EVA performed during Gemini 4 flight (NASA ID: S65-34635) Resources slide: Apollo 11 Mission image - View of moon limb, with Earth on the horizon (NASA ID: as11-44-6551) Architecture icons: • AWS (https://aws.amazon.com/architecture/icons/) • GCP (https://cloud.google.com/icons/) • Azure (https://www.microsoft.com/en-us/download/details.aspx?id=41937) • IBM Cloud Functions (https://github.com/ibm-functions) • Oracle Cloud Functions (https://docs.cloud.oracle.com/iaas/Content/Functions/Concepts/functionsoverview.htm#limits) • Apache OpenWhisk (https://github.com/apache/openwhisk-website/tree/master/images/logo) • Apache Beam (https://beam.apache.org/community/logos/) • Kubernetes (https://github.com/kubernetes/community/tree/master/icons) • Databricks (https://databricks.com/wp-content/uploads/) 44