Moonshot Spark: Serverless Apache Spark with GraalVM
Scale By The Bay 2019
An overview of the current state of serverless Apache Spark, where the future might go, and a look at how GraalVM could benefit the Apache Spark ecosystem.
About me: I wanted to write code, but I keep finding myself back working with the data
- Lead API Developer at Gemini Trust
- Director at Novus Partners
• Now: coding and working with data full time - software engineer at Coatue Management
Previous talk: Spark At Scale In The Cloud, Spark+AI Summit Europe 2019
Spark loves:
• Processing very large datasets using
  - Lots of memory
  - Fast local storage
  - RPC communications
• Granular control over cluster nodes
  - Scheduling
  - Dynamic allocation
Serverless scales easily:
• Quick deployments and updates
• Stateless functions are very, well, functional
  - Easy to reason about
  - Easy to compose
• Microservices are appealing
  - Raise your hand if you have been the compile-time victim of a mono-Repo Of Unusual Size
• Infrastructure as code
• DevOps! NoOps! SomebodyElseOps!
I want:
• The ability to run an Apache Spark job
  - Without worrying about the details of provisioning the cluster
  - On input data of any size
• Knowing horizontal auto-scaling will
  - Just work
  - Reliably reduce the costs of my job over time
• And it should be simpler than whatever I do now
• Plus integrate out of the box with popular logging and monitoring tools
Serverless lets us focus on our application while abstracting over the resources necessary to run it.
• Your code is a function
  - That may have external dependencies like cloud storage or third-party APIs
• That runs in a stateless container
  - Triggered by some external event
• And "somebody else" handles the plumbing
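The "function in a stateless container" shape can be sketched in a few lines. This is an illustrative stand-in, not any particular provider's SDK: the `handler(event, context)` signature mirrors the common FaaS convention, and the payload fields are made up:

```python
import json

def handler(event, context=None):
    """A minimal FaaS-style entry point: stateless and event-triggered.

    `event` carries the trigger payload (e.g. an object-store notification).
    Any state the function needs must arrive in the event or come from
    external services; nothing survives between invocations.
    """
    records = event.get("records", [])
    # A pure transformation over the payload: easy to reason about,
    # easy to compose with other functions.
    total = sum(r.get("amount", 0) for r in records)
    return {"statusCode": 200, "body": json.dumps({"total": total})}
```

Because the function owns no state, two invocations with the same event produce the same result, which is what makes blind retries safe.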
You get billed by the sub-second for only those resources your function actually consumes. What you configure:
1. The amount of memory you allocate (which may also control the number of cores)
2. The function timeout, up to the maximum allowed function runtime
3. Cloud provider access policies so the containers can access external resources
From this point on, you're limited to customizing your container.
What limits how long a function can run:
1. How long the caller is willing to wait for it to respond
2. Reliability of the underlying infrastructure
3. The maximum amount of time your cloud provider will allow a "function as a service" (FaaS) to run
allowed "function as a service" (FaaS) runtime • Spark keeps local state in memory and on disk - Where will shuffle keep its data now? ‣ Remember timeouts? Shuffle will effectively behave like a third party API now, because it will directly depend on external resources • Scheduling 10
Limits change often - please consult the docs for the most current details

Amazon Web Services (AWS) Lambda
Memory: 128 - 3008 MB
Function duration: up to 900 seconds (15 minutes)
Deployment size: 250 MB unzipped, including layers
https://docs.aws.amazon.com/lambda/latest/dg/limits.html

Google Cloud Platform (GCP) Functions
Memory: up to 2048 MB
Function duration: 540 seconds (9 minutes)
Deployment size: 500 MB uncompressed source plus modules
https://cloud.google.com/functions/quotas

Microsoft Azure Functions (limits vary by hosting plan)
Memory: up to 1.5 GB
Function duration: 10 minutes
Deployment size: 500 MB uncompressed source plus modules
https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale
Limits change often - please consult the docs for the most current details

IBM Cloud Functions
Memory: up to 2048 MB
Function duration: up to 600,000 ms (10 minutes)
Deployment size: 48 MB base64-encoded (custom Docker image with dependencies allowed)
https://cloud.ibm.com/docs/openwhisk?topic=cloud-functions-limits
Open sourced: Apache OpenWhisk

Oracle Cloud Functions
Memory: up to 1024 MB
Function duration: up to 120 seconds (2 minutes)
Deployment size: 6 MB
https://docs.cloud.oracle.com/iaas/Content/Functions/Tasks/functionscustomizing.htm
Open sourced: Fn Project
Using Spark inside stateless functions:
- Introduces time and memory constraints
- Plus now we have to worry about the size of the deployable
- Depending on how often the function is invoked, might not cost less than running jobs on a traditional cluster
But what happens when you have small, intermittent datasets that you want to process using the same Spark code you used to handle much bigger datasets?
This can work when the problem is embarrassingly parallel:
• The input data is small (doesn't need to be split up)
• The analysis consists of many operations that are
  - Independent
  - Idempotent
Instead of a single standalone Spark server running N operations, you now have N stateless functions, each running a standalone Spark server that executes a single operation. Results are written externally. If it fails, try again.
No doubt we lose time in invocation costs, but we might still save money relative to keeping a Spark cluster idle for "infrequent" incoming requests.
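The fan-out-with-retry pattern can be sketched as follows. Everything here is simulated: `flaky_spark_op` is a hypothetical stand-in for one invocation of a single-operation Spark job, and a plain dict plays the role of the external result store:

```python
import random

# Simulated external result store (in reality: cloud storage or a KV store).
RESULT_STORE = {}

def flaky_spark_op(op_id, data):
    """Stand-in for one independent, idempotent operation: imagine each
    invocation spins up a local-mode Spark server, runs one transformation,
    and returns the result. Fails randomly to model transient errors."""
    if random.random() < 0.3:
        raise RuntimeError(f"transient failure in op {op_id}")
    return sum(data)

def run_with_retry(op_id, data, attempts=5):
    # Idempotence makes blind retries safe: re-running an op overwrites
    # the externally written result with the same value.
    for _ in range(attempts):
        try:
            RESULT_STORE[op_id] = flaky_spark_op(op_id, data)
            return
        except RuntimeError:
            continue
    raise RuntimeError(f"op {op_id} failed after {attempts} attempts")

def fan_out(ops):
    # In a real deployment each op would be a separate function invocation,
    # running in parallel; a sequential loop keeps the sketch simple.
    for op_id, data in ops.items():
        run_with_retry(op_id, data)
    return RESULT_STORE
```

The invocation overhead is real, but so is the saving: nothing runs, and nothing bills, between requests.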
What if:
• There are multiple discrete functions that need to transform the data in order?
  - I was promised I could compose functions
• Our input data isn't small enough to be processed by a single function and needs to be split up first?
• We have multiple Spark jobs that need to run in a defined order?
These cases aren't really the same, but they share the need for orchestration of some kind.
Now we have too much data for a single stateless function, and stateless functions can't directly communicate with each other. So let's use other cloud provider services to knit things together:
• The event trigger now invokes a step orchestration service instead of a serverless function
• We need to coordinate data between stages using relatively slow/expensive external means of communication, e.g. cloud storage or key-value stores
Step orchestration:
- Kicks off a stateless function that breaks the input data into discrete granular pieces small enough to be processed by a stateless function without running out of memory or timing out
- Then kicks off the actual functions that process the data, in parallel
Ugly, error prone, hard to manage - but the seed of some useful ideas.
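A minimal sketch of that split-then-fan-out shape. A dict stands in for cloud storage (the only channel the stages may use to communicate), and ordinary function calls stand in for the orchestration service's triggers; all names are hypothetical:

```python
# Simulated cloud storage: stages can only communicate through it.
STORAGE = {}

def split_stage(input_key, chunk_size):
    """First stateless function: break the input into pieces small enough
    for a single function to process without timing out or running out of
    memory. Returns the keys of the chunks it wrote."""
    data = STORAGE[input_key]
    chunk_keys = []
    for i in range(0, len(data), chunk_size):
        key = f"{input_key}/chunk-{i // chunk_size}"
        STORAGE[key] = data[i:i + chunk_size]
        chunk_keys.append(key)
    return chunk_keys

def process_stage(chunk_key):
    """One of N parallel stateless functions; it reads its input from
    storage and writes its result back to storage."""
    STORAGE[chunk_key + "/out"] = sum(STORAGE[chunk_key])
    return chunk_key + "/out"

def orchestrate(input_key, chunk_size=2):
    # The step-orchestration service: trigger the split, fan out the
    # processing functions, then combine the externally stored results.
    out_keys = [process_stage(k) for k in split_stage(input_key, chunk_size)]
    return sum(STORAGE[k] for k in out_keys)
```

Every arrow in the pipeline is now a round trip through external storage, which is exactly the slow/expensive coordination cost noted above.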
Our options:
1. Serverless containers: launch a standalone Spark cluster in a serverless container
   - No time limit
   - More resources
2. Serverless workflow: orchestrate the job using cloud provider services
3. Use a purpose-built cloud provider offering
   - If you already went all in on the cloud provider ecosystem, might as well use their helper services to nail it all together
4. Delegate cluster management
Use a serverless function to launch a standalone Spark cluster in a serverless container.

AWS Fargate
Run containers without managing servers or clusters.
https://aws.amazon.com/fargate/

Google Cloud Run
Run stateless containers on a fully managed environment or on Anthos.
https://cloud.google.com/run/

Microsoft Azure Container Instances
Run containers on demand without managing servers.
https://azure.microsoft.com/en-us/services/container-instances/
Here only the coordination of the pipeline is serverless. We're agnostic about what the actual steps do, except that a stateless function can launch them.

AWS Step Functions
Coordinate multiple AWS services into serverless workflows
https://aws.amazon.com/step-functions/

Microsoft Azure Logic Apps
Coordinate multiple Microsoft services into serverless workflows
https://azure.microsoft.com/en-us/services/logic-apps/
Data Accelerator for Apache Spark
Plug-and-play managed data pipeline toolkit
https://github.com/microsoft/data-accelerator

Google Cloud Dataproc
Fully managed cloud service for running Spark and Hadoop clusters
https://cloud.google.com/dataproc/
See also: Google Cloud Dataflow, for stream and batch processing

AWS Glue
Fully managed ETL service
https://aws.amazon.com/glue/
See also: AWS Data Pipeline, to process and move data between different AWS compute and storage services

* if you already bought in to the rest of the pipeline ecosystem
Kubernetes to the rescue.
• Spark 2.3+ has an experimental native Kubernetes scheduler
  - Supports plain old spark-submit, although monitoring and fault tolerance aren't polished
  - But also the Spark Operator, which extends the Kubernetes operator pattern using Custom Resource Definitions to provide better lifecycle management
• Spark 3.0 improvements on the horizon
  - Ongoing Spark Operator improvements
  - [SPARK-24793] Make spark-submit more useful with k8s
  - Dynamic resource allocation and external shuffle: [SPARK-27963][core] Allow dynamic allocation without a shuffle service
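For reference, the plain spark-submit path against the native Kubernetes scheduler looks roughly like this. The helper below just assembles the documented flags; the API server address, image name, jar path, and class name are placeholders:

```python
def k8s_spark_submit(api_server, image, app_jar, main_class, executors=2):
    """Build a spark-submit argv for Spark's native Kubernetes scheduler
    (Spark 2.3+). All concrete values passed in are illustrative."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to schedule against the given
        # Kubernetes API server instead of a standalone/YARN master.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # Driver and executor pods run from this container image.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]

cmd = k8s_spark_submit(
    "https://example.invalid:6443",   # placeholder API server
    "myrepo/spark:2.4.4",             # placeholder image
    "local:///opt/app/app.jar",       # jar baked into the image
    "com.example.Main",               # placeholder main class
)
```

The Spark Operator replaces this command line with a Kubernetes resource you `kubectl apply`, which is what makes monitoring and restart behavior feel native to the cluster.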
In addition to cloud providers' own offerings, there are third parties making some interesting changes to how Spark clusters work in the cloud.

Databricks Runtime Core (serverless)
Available for: AWS, Azure
Introducing Databricks Optimized Autoscaling on Apache Spark™

Qubole
Qubole Announces Apache Spark on AWS Lambda
Available for: AWS, GCP, Azure, Oracle
• Shuffle blocks are written to local storage on the node where the executor is running
• An external shuffle service, which is a separate process from the executor, still serves up those shuffle blocks from the node's local storage
• Since functions might lack local storage and can't communicate with each other, serverless Spark implementations focus on writing shuffle data to remote resources (cloud storage, key-value stores, even messaging queues)
  - Worst case, shuffle could generate so many intermediate files that it exceeds the resource limitations of a function
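The remote-shuffle idea can be sketched concretely. A dict stands in for the remote store (cloud storage or a key-value service), and blocks are keyed by (map task, reduce partition) so any reducer can locate its inputs without contacting the mapper's node; this is a toy model, not Spark's actual shuffle machinery:

```python
# Simulated remote store standing in for cloud storage / a KV service.
REMOTE_STORE = {}

def write_shuffle_block(map_id, reduce_id, records):
    """Map side: instead of spilling to the executor's local disk (which a
    stateless function may not have), publish the block remotely under a
    key that identifies the producing map task and target partition."""
    REMOTE_STORE[(map_id, reduce_id)] = list(records)

def read_shuffle_partition(reduce_id, num_mappers):
    """Reduce side: fetch this partition's blocks from every map task.
    Each fetch is effectively a third-party API call: it can be slow, it
    can fail, and it counts against the function's timeout."""
    partition = []
    for map_id in range(num_mappers):
        partition.extend(REMOTE_STORE.get((map_id, reduce_id), []))
    return partition

def shuffle(map_outputs, num_reducers):
    # Hash-partition each map task's output, write the blocks remotely,
    # then assemble each reduce partition from the remote store.
    for map_id, records in enumerate(map_outputs):
        for r in range(num_reducers):
            block = [x for x in records if hash(x) % num_reducers == r]
            write_shuffle_block(map_id, r, block)
    return [read_shuffle_partition(r, len(map_outputs))
            for r in range(num_reducers)]
```

Note the blow-up: M map tasks times R reduce partitions means M×R blocks in the remote store, which is the "so many intermediate files" worst case mentioned above.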
But scaling down is much more complex:
- Even with an external shuffle service, shuffle blocks are stored locally on the node
- Even with dynamic allocation, the node won't be shut down (it will just free up to do work on a different job)
So fixing auto-scaling means fixing shuffle:
- Use fast external storage that scales nicely
  ‣ NFS keeps turning up here
  ‣ See also Apache Crail (incubating), which uses remote direct memory access
- Clever improvements to write data locally and offload to external storage only when there are no remaining executors on the node
You run into Spark's internals:
• When you touch resource allocation, like wanting to improve auto-scaling
• When you delegate cluster management
  ‣ You touch the scheduler, which is both complex and non-public
Getting around this requires either:
• Delegating cluster management to something which uses a different scheduler
  - Note Kubernetes already allows extending the stock scheduler
• Forking Spark
There are very interesting discussions about these topics in the Spark issue tracker:
[SPARK-27941] Serverless Spark in the Cloud
[SPARK-25299] Use remote storage for persisting shuffle data / SPIP / Deep dive into shuffle reliability
[SPARK-19700] Design an API for pluggable scheduler implementations
Apache Beam: define data processing jobs that run on any execution engine.
Written in Scala / Functions in Java, Python, or Go
Available on Google Cloud Dataflow
https://beam.apache.org/
Remaining costs:
• Container startup speed
• CPU and memory (giving more memory improves startup time, but increases costs)
• And we already discussed increased runtime due to remote storage overhead, which shuffle improvements will help in time
But what can we do to improve container cold start time and resource usage?
What about using GraalVM?

GraalVM Native Image (Early Adopter Technology)
• Supports Ahead-of-Time (AOT) compilation
  - Improves startup time
  - Classes can be initialized at build time for shorter startup times (although some classes may require runtime initialization for a properly working app)
• Executable is smaller (uses less memory)
  - Substrate VM only includes what will actually be used at runtime
Running Apache Spark as a native image would start up quicker, run faster, and use less memory.
The JVM's Just-in-Time (JIT) compiler works at runtime:
• When the application starts up, everything needs to be loaded into memory, parsed/verified, and initialized (slow)
• Eventually*, the JIT compiler will profile and then compile bytecode that runs many times ("hot") into optimized native code
Ahead-of-Time (AOT) compilation optimizes everything before execution time. It can perform optimizations too costly (slow!) for a JIT compiler.
For long-running applications, JIT is great. It's responsive and lazy (why optimize code that nobody ever runs?). But for short-lived applications, the longer compile time and certain limitations of compiling a native image are worth the tradeoff.
JEP 295: Ahead-of-Time Compilation
* eventually could be a long time relative to the runtime of the function
GraalVM also has polyglot support for running other languages in JVM-based applications. Python users in the Spark world can experience a number of interoperability slowdowns. Conventional wisdom online says that - even though the Python interoperability is slow relative to Scala! - it's not the slowest part. But aren't developer costs something too? What kind of speedup would polyglot support give these users?
Work on a native image of Apache Spark is in its infancy - only rumors. But it's such a good idea that I predict it will happen within a year or two.
And for people who compile Apache Spark itself: it's a lot faster with GraalVM.
GraalVM 19.1: Compiling Faster (Thomas Wuerthinger)
Papers:
• Serverless Computing: One Step Forward, Two Steps Back (J. Hellerstein et al., CIDR 2019)
• From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers (S. Fouladi et al., USENIX ATC '19)
• Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure (Q. Pu, S. Venkataraman, and I. Stoica, USENIX NSDI '19)
• Towards Practical Serverless Analytics (Q. Pu, UC Berkeley Technical Report No. UCB/EECS-2019-105)
• Occupy the Cloud: Distributed Computing for the 99% (E. Jonas, S. Venkataraman, I. Stoica, B. Recht, arXiv 1702.0402)
Blog posts:
• … to get started (Christopher Crosbie and Patrick Clay)
• AWS Lambda - does it fit in with data processing? (Bartosz Konieczny)
• Benchmarks, Spark and Graal (Phil Phil)
• Instant Netty Startup using GraalVM Native Image Generation (Codrut Stancu)
• Small & fast Docker images using GraalVM's native-image (Adam Warski)
• Updates on Class Initialization in GraalVM Native Image Generation (Christian Wimmer)
• Mastering Java Cold Start On AWS Lambda Volume 1 (Serkan Özal)
• How To Manage And Monitor Apache Spark On Kubernetes - Part 1, Part 2 (Chaoran You and Stavros Kontopoulos)
• Spark scheduling in Kubernetes (Palantir)
Talks:
• … (Sidham, Spark+AI Summit Europe 2019)
• Reliable Performance at Scale with Apache Spark on Kubernetes (Will Manning and Matt Cheah, Spark+AI Summit Europe 2019)
• Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters (Prakhar Jain and Venkata Krishnan Sowrirajan, Spark+AI Summit Europe 2019)
• Maximizing Performance with GraalVM (Thomas Wuerthinger, Oracle Code 2019)
• Twitter's Quest for a Wholly Graal Runtime (Chris Thalinger)
• Improving GraalVM Native Image (Christian Wimmer, JVM Language Summit 2019)
• Efficient Management of Ephemeral Data in Serverless Computing (Patrick Stuedi, Fourth International Workshop on Serverless Computing)
• Adopting GraalVM (Petr Zapletal, Scale By The Bay 2018)
• Apache Spark on Kubernetes (Anirudh Ramanathan and Tim Chen, Spark+AI Summit 2017)
• Lambda Architecture in the Cloud with Azure Databricks (Andrei Varanovich, Spark+AI Summit DEV6)