Moonshot Spark

Rose Toomey
November 15, 2019

Moonshot Spark: Serverless Apache Spark with GraalVM
Scale By The Bay 2019

An overview of the current state of serverless Apache Spark, where the future might go, and a look at how GraalVM could benefit the Apache Spark ecosystem.


Transcript

  1. About me NYC. Finance. Technology. Code. • My job is
    to write code, but I keep finding myself drawn back to working with the data - Lead API Developer at Gemini Trust - Director at Novus Partners • Now: coding and working with data full time - software engineer at Coatue Management (Spark At Scale In The Cloud, Spark+AI Summit Europe 2019) 2
  2. Apache Spark business as usual Things an Apache Spark cluster
    loves: • Processing very large datasets using - Lots of memory - Fast local storage - RPC communications • Granular control over cluster nodes - Scheduling - Dynamic allocation 3
  3. Serverless the very best bits • Reduce costs • Scale

    easily • Quick deployments and updates • Stateless functions are very, well, functional • Easy to reason about • Easy to compose • Microservices are appealing • Raise your hand if you have been the compile-time victim of a mono-Repo Of Unusual Size • Infrastructure as code • DevOps! NoOps! SomebodyElseOps! 4
  4. Apache Spark + Serverless? join me on holiday from reality

    I want • The ability to run an Apache Spark job • Without worrying about the details of provisioning the cluster • On input data of any size • Knowing horizontal auto-scaling will - just work - reliably reduce the costs of my job over time • And it should be simpler than whatever I do now • Plus integrate out of the box with popular logging and monitoring tools 5
  5. Serverless In an ideal world, serverless architecture lets us focus

    on our application while abstracting over the resources necessary to run it. • Your code is a function - That may have external dependencies like cloud storage or third-party APIs • That runs in a stateless container - Triggered by some external event • And "somebody else" handles the plumbing 7
  6. What you control Staying focused on this ideal world, you

    would get billed by the sub-second for only those resources your function actually consumes. 1. The amount of memory you allocate (which may also control the number of cores) 2. The function timeout, relative to the maximum allowed function runtime 3. Cloud provider access policies, so the containers can access external resources 4. From this point on, you're limited to customizing your container 8
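To make knobs 1 and 2 concrete, here is a hedged sketch using the standard AWS CLI `update-function-configuration` command (the function name is a placeholder):

```shell
# Sketch: the two main dials you control on AWS Lambda.
# "my-spark-fn" is a hypothetical function name.
# --memory-size also scales the CPU share allocated to the function;
# --timeout is capped at the 15-minute (900 second) maximum.
aws lambda update-function-configuration \
  --function-name my-spark-fn \
  --memory-size 2048 \
  --timeout 900
```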
  7. What you don't control 1. How long external services take

    to respond 2. Reliability of the underlying infrastructure 3. The maximum amount of time your cloud provider will allow a "function as a service" (FaaS) to run 9
  8. Obstacles • Memory efficiency • The JVM's slow startup relative to the

    allowed "function as a service" (FaaS) runtime • Spark keeps local state in memory and on disk - Where will shuffle keep its data now? ‣ Remember timeouts? Shuffle will effectively behave like a third party API now, because it will directly depend on external resources • Scheduling 10
  9. Stateless Functions All limits are provided as of November 2019
    - please consult the docs for the most current details
    Amazon Web Services (AWS) Lambda - Memory: 128 - 3008 MB - Function duration: up to 900 seconds (15 minutes) - Deployment size: 250 MB unzipped, including layers - https://docs.aws.amazon.com/lambda/latest/dg/limits.html
    Google Cloud Platform (GCP) Functions - Memory: up to 2048 MB - Function duration: 540 seconds (9 minutes) - Deployment size: 500 MB uncompressed source plus modules - https://cloud.google.com/functions/quotas
    Microsoft Azure Functions (limits vary by hosting plan) - Memory: up to 1.5 GB - Function duration: 10 minutes - Deployment size: 500 MB uncompressed source plus modules - https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale 12
  10. Stateless Functions All limits are provided as of November 2019
    - please consult the docs for the most current details
    IBM Cloud Functions - Memory: up to 2048 MB - Function duration: up to 600,000 ms (10 minutes) - Deployment size: 48 MB base64 encoded (custom Docker image with dependencies allowed) - https://cloud.ibm.com/docs/openwhisk?topic=cloud-functions-limits - Open sourced: Apache OpenWhisk
    Oracle Cloud Functions - Memory: up to 1024 MB - Function duration: up to 120 seconds (2 minutes) - Deployment size: 6 MB - https://docs.cloud.oracle.com/iaas/Content/Functions/Tasks/functionscustomizing.htm - Open sourced: Fn Project 13
  11. Stateless Spark, you say? Spark evolved to process very large

    datasets. Using Spark inside stateless functions - introduces time and memory constraints - plus now we have to worry about the size of the deployable - depending on how often the function is invoked, it might not cost less than running jobs on a traditional cluster But what happens when you have small, intermittent datasets that you want to process using the same Spark code you used to handle much bigger datasets? 14
  12. Embarrassingly parallel Certain kinds of standalone Spark server jobs are

    embarrassingly parallel: • The input data is small (doesn't need to be split up) • The analysis consists of many operations that are - Independent - Idempotent Instead of a single standalone Spark server running N operations, now you have N stateless functions, each running a standalone Spark server that executes a single operation. Results written externally. If it fails, try again. No doubt we lose time in invocation costs, but we might still save money relative to keeping a Spark cluster idle for "infrequent" incoming requests. 15
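The fan-out shape above can be sketched in a few lines of stdlib Python. Everything here is illustrative: `run_operation` is a hypothetical stand-in for one independent, idempotent Spark operation, and a thread pool stands in for parallel stateless function invocations.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one independent, idempotent operation: in the
# serverless version, each call would be a separate stateless function
# invocation running its own standalone Spark server.
def run_operation(op_id: int, data: list) -> tuple:
    result = sum(x * op_id for x in data)  # placeholder "analysis"
    return op_id, result                   # results would be written externally

data = [1, 2, 3]  # small input: no need to split it up

# Instead of one Spark server running N operations, fan out N invocations.
# Each is independent and idempotent, so retrying a failed one is safe.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(lambda i: run_operation(i, data), range(1, 5)))

print(results)  # {1: 6, 2: 12, 3: 18, 4: 24}
```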
  13. What if it won't fit in a single function? What

    if • There are multiple discrete functions that need to transform the data in order? - I was promised I could compose functions • Our input data isn't small enough to be processed by a single function and needs to be split up first • We have multiple Spark jobs that need to run in a defined order These cases aren't really the same, but they share the need for orchestration of some kind. 17
  14. Reductio ad absurdum half steppin cluster management OK, we have

    too much data for a single stateless function and stateless functions can't directly communicate with each other. So let's use other cloud provider services to knit things together: • The event trigger now invokes a step orchestration service instead of a serverless function • Need to coordinate data between stages using relatively slow/expensive external means of communication, e.g. cloud storage or key-value stores
    Step orchestration - Kicks off a stateless function that can break up the input data into discrete granular pieces small enough to be processed by a stateless function without running out of memory or timing out - Then kicks off the actual functions that process the data, in parallel
    Ugly, error-prone, hard to manage, but the seed of some useful ideas 18
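That two-stage shape can be sketched in stdlib Python. This is a toy model, not a real orchestrator: the chunk limit and stage names are hypothetical, and a thread pool stands in for the step service fanning out stateless functions.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK = 100  # hypothetical per-invocation limit (memory/timeout proxy)

def split_stage(records: list, max_chunk: int = MAX_CHUNK) -> list:
    """Stage 1: a stateless function that breaks the input into pieces
    small enough for a single function invocation."""
    return [records[i:i + max_chunk] for i in range(0, len(records), max_chunk)]

def process_stage(chunk: list) -> int:
    """Stage 2: one stateless function per chunk. In the real thing the
    result goes to external storage, since functions can't talk directly."""
    return sum(chunk)

records = list(range(250))
chunks = split_stage(records)       # 3 chunks: 100 + 100 + 50 records
with ThreadPoolExecutor() as pool:  # the orchestrator fans out stage 2
    partials = list(pool.map(process_stage, chunks))
total = sum(partials)               # final reduce over the external results
print(len(chunks), total)
```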
  15. Viable options 1. Run it on a serverless container instead

    - No time limit - More resources 2. Serverless workflow: orchestrate the job using cloud provider services 3. Use a purpose-built cloud provider offering - If you already went all in on the cloud provider ecosystem, might as well use their helper services to nail it all together 4. Delegate cluster management 19
  16. Serverless containers Hybrid approach to get around limitations of functions

    Use a serverless function to launch a standalone Spark cluster in a serverless container. 20 AWS Fargate Run containers without managing servers or clusters. https://aws.amazon.com/fargate/ Google Cloud Run Run stateless containers on a fully managed environment or on Anthos. https://cloud.google.com/run/ Microsoft Azure Container Instances Run containers on Azure without managing servers. https://azure.microsoft.com/services/container-instances/
  17. Serverless orchestration an interesting off-shoot Another hybrid approach: it's the

    coordination of the pipeline that's serverless. We're agnostic about what the actual steps do, except that a stateless function can launch them. 21 AWS Step Functions Coordinate multiple AWS services into serverless workflows https://aws.amazon.com/step-functions/ Microsoft Azure Logic Apps Coordinate multiple Microsoft services into serverless workflows https://azure.microsoft.com/services/logic-apps/
  18. Serverless ETL, batteries included just add code* Microsoft Data Accelerator
    for Apache Spark Plug and play managed data pipeline toolkit https://github.com/microsoft/data-accelerator
    Google Cloud Dataproc Fully managed cloud service for running Spark and Hadoop clusters https://cloud.google.com/dataproc/ See also: Google Cloud Dataflow, for stream and batch processing
    AWS Glue Fully managed ETL service https://aws.amazon.com/glue/ See also: AWS Data Pipeline, to process and move data between different AWS compute and storage services
    * if you already bought into the rest of the pipeline ecosystem 22
  19. Delegating cluster management Not for the first time, Kubernetes to

    the rescue. • Spark 2.3+ has an experimental native Kubernetes scheduler - Supports plain old spark-submit, although monitoring and fault tolerance aren't polished - But also Spark Operator, which extends the Kubernetes operator pattern using Custom Resource Definitions to provide better lifecycle management • Spark 3.0 improvements on the horizon - Ongoing Spark Operator improvements - [SPARK-27963][core] Allow dynamic allocation without a shuffle service - [SPARK-24793] Make spark-submit more useful with k8s - Dynamic resource allocation - External shuffle 23
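For reference, plain old spark-submit against the native Kubernetes scheduler looks roughly like this (a sketch: the API server URL, image tag, and jar path are placeholders; the `spark.kubernetes.*` keys come from the Spark 2.4 running-on-kubernetes docs):

```shell
# Sketch: submitting the SparkPi example to the experimental k8s scheduler.
# The cluster URL, container image, and jar location are placeholders.
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-registry/spark:v2.4.4 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
```

The `local://` scheme tells the driver the jar is already inside the container image, so nothing needs to be uploaded at submit time.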
  20. Cluster management built on top of cloud providers In

    addition to cloud providers' own offerings, there are third parties making some interesting changes to how Spark clusters work in the cloud. Databricks Runtime Core (serverless) Available for: AWS, Azure Introducing Databricks Optimized Autoscaling on Apache Spark™ Qubole Qubole Announces Apache Spark on AWS Lambda Available for: AWS, GCP, Azure, Oracle 24
  21. The Shuffle problem • Shuffle blocks are stored locally on

    the node where the executor is running • An external shuffle service, which is a separate process from the executor, still serves up those shuffle blocks from the node's local storage • Since functions might lack local storage and can't communicate with each other, serverless Spark implementations focus on writing shuffle data to remote resources (cloud storage, key-value stores, even messaging queues) - Worst case, shuffle could generate so many intermediate files that it exceeds the resource limitations on a function 26
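The externalized-shuffle idea can be sketched with a toy map/reduce in stdlib Python, where a temp directory stands in for cloud storage and a file-naming convention replaces node-local block lookup. All names here are illustrative, not Spark's actual shuffle format.

```python
import os
import tempfile
from collections import defaultdict

# Toy "shuffle to remote storage": map tasks hash-partition their records
# and write each partition to an external store, so reduce tasks never need
# to contact the node that produced the data.
def map_task(task_id: int, records: list, n_reducers: int, store: str) -> None:
    partitions = defaultdict(list)
    for r in records:
        partitions[hash(r) % n_reducers].append(r)
    for p, rows in partitions.items():
        # One "shuffle block" per (map task, reduce partition) pair.
        with open(os.path.join(store, f"map{task_id}-part{p}.txt"), "w") as f:
            f.write("\n".join(rows))

def reduce_task(p: int, n_mappers: int, store: str) -> list:
    rows = []
    for m in range(n_mappers):
        path = os.path.join(store, f"map{m}-part{p}.txt")
        if os.path.exists(path):  # a mapper may emit nothing for partition p
            with open(path) as f:
                rows.extend(f.read().splitlines())
    return rows

store = tempfile.mkdtemp()  # stand-in for a cloud storage bucket
map_task(0, ["a", "b", "c"], n_reducers=2, store=store)
map_task(1, ["d", "e"], n_reducers=2, store=store)
all_rows = sorted(reduce_task(0, 2, store) + reduce_task(1, 2, store))
print(all_rows)  # every record lands in exactly one partition
```

The worst case from the slide is visible here too: the store holds up to mappers × reducers block files, which is exactly the intermediate-file explosion that can blow through a function's resource limits.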
  22. The auto-scaling problem • Scaling up is straightforward •
    But scaling down is much more complex - Even with an external shuffle service, shuffle blocks are stored locally on the node - Even with dynamic allocation, the node won't be shut down (it will just free up to do work on a different job) So fixing auto-scaling means fixing shuffle: - Use fast external storage that scales nicely ‣ NFS keeps turning up here ‣ See also Apache Crail (incubating), which uses remote direct memory access - Clever improvements to write data locally and offload to external storage only when there are no remaining executors on the node 27
  23. The scheduler problem But now we have another problem! •
    When you touch resource allocation, like wanting to improve auto-scaling... • When you delegate cluster management ‣ you touch the scheduler, which is both complex and non-public Getting around this requires either • Delegating cluster management to something which uses a different scheduler - Note Kubernetes already allows extending the stock scheduler • Forking Spark 28
  24. Spark 3.0 and serverless Spark core developers are having some

    very interesting discussions about these topics [SPARK-27941] Serverless Spark in the Cloud [SPARK-25299] Use remote storage for persisting shuffle data / SPIP / Deep dive into shuffle reliability [SPARK-19700] Design an API for pluggable scheduler implementations 29
  25. More Abstraction Apache Beam Implement batch and streaming data processing
    jobs that run on any execution engine. SDKs in Java, Python, or Go Available on Google Cloud Dataflow https://beam.apache.org/ 30
  26. Faster, leaner, serverless JVM on serverless has a high overhead:

    • Container startup speed • CPU and memory (giving more memory improves startup time, but increases costs) • And we already discussed increased runtime due to remote storage overhead - which shuffle improvements will help with in time. But what can we do to improve container cold start time and resource usage? 32
  27. GraalVM
 JVM speedups for the intrepid You've probably heard about

    using GraalVM to compile JVM code faster. GraalVM Native Image (Early Adopter Technology) • Supports Ahead of Time (AOT) compilation - Improves startup time - Classes can be initialized at build time for shorter startup times (although some classes may require runtime initialization for a properly working app) • Executable is smaller (uses less memory) - Substrate VM only includes what will actually be used at runtime
    Running Apache Spark as a native image would start up quicker, run faster, and use less memory. 33
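Building such a native image would look roughly like this. A hedged sketch only: the jar, main class, and output name are placeholders, and a real Spark build would need extensive reflection and resource configuration beyond these flags.

```shell
# Sketch: AOT-compiling a JVM application with GraalVM Native Image.
# my-spark-job.jar and com.example.MySparkJob are hypothetical names.
# --no-fallback fails the build rather than silently falling back to
# bundling a full JVM, so you know you really got a native image.
native-image \
  --no-fallback \
  -cp my-spark-job.jar \
  com.example.MySparkJob \
  my-spark-job

# The result is a standalone executable with fast startup and a smaller
# memory footprint than the same code running on the JVM.
./my-spark-job
```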
  28. JIT vs AOT Just in time (JIT) compilation works at

    runtime: • When the application starts up, everything needs to be loaded into memory, parsed/verified, and initialized (slow) • Eventually*, the JIT compiler will profile and then compile bytecode that runs many times ("hot") into optimized native code Ahead of Time (AOT) compilation optimizes everything before execution time. It can perform optimizations too costly (slow!) for a JIT compiler. For long-running applications, JIT is great. It's responsive and lazy (why optimize code that nobody ever runs?). But for short-lived applications, the faster startup can be worth the longer compile time and certain limitations of compiling a native image. JEP 295: Ahead-of-Time Compilation * eventually could be a long time relative to the runtime of the function 34
  29. Polyglot Spark Madame, parlez-vous Python? GraalVM Polyglot API has support

    for running other languages in JVM-based applications. Python users in the Spark world can experience a number of interoperability slowdowns. Conventional wisdom online says that - even though the Python interoperability is slow relative to Scala! - it's not the slowest part. But don't developer costs count for something too? What kind of speedup would polyglot support give these users? 35
  30. Just begun As of November 2019, the idea of building

    a native image of Apache Spark is in its infancy. Only rumors. But it's such a good idea I predict it will happen within a year or two. And for people who compile Apache Spark - it's a lot faster with GraalVM GraalVM 19.1: Compiling Faster (Thomas Wuerthinger) 36
  31. Interested? • What we do: data engineering @ Coatue -

    Terabyte scale, billions of rows - Lambda architecture - Functional programming • Stack - Scala (cats, shapeless, fs2, http4s) - Spark / Hadoop / EMR / Databricks - Data warehouses - Python / R / Tableau - Chat with me or email: [email protected] - Twitter: @prasinous 37
  32. Papers • Serverless Computing: One Step Forward, Two Steps Back

    (J. Hellerstein et al., CIDR 2019) • From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers (S. Fouladi et al., USENIX ATC '19) • Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure (Q. Pu, S. Venkataraman, and I. Stoica, USENIX NSDI '19) • Towards Practical Serverless Analytics (Q. Pu, UC Berkeley Technical Report No. UCB/EECS-2019-105) • Occupy the Cloud: Distributed computing for the 99% (E. Jonas, S. Venkataraman, I. Stoica, B. Recht, arXiv 1702.0402) 39
  33. Blog posts • Cloud Dataproc Spark Jobs on GKE: How

    to get started (Christopher Crosbie and Patrick Clay) • AWS Lambda - does it fit in with data processing? (Bartosz Konieczny) • Benchmarks, Spark and Graal (Phil Phil) • Instant Netty Startup using GraalVM Native Image Generation (Codrut Stancu) • Small & fast Docker images using GraalVM’s native-image (Adam Warski) • Updates on Class Initialization in GraalVM Native Image Generation (Christian Wimmer) • Mastering Java Cold Start On AWS Lambda Volume 1 (Serkan Özal) • How To Manage And Monitor Apache Spark On Kubernetes - Part 1 Part 2 (Chaoran You and Stavros Kontopoulos) • Spark scheduling in Kubernetes (Palantir) 40
  34. Presentations • Improving Apache Spark Downscaling (Christopher Crosbie and Ben

    Sidham, Spark+AI Summit Europe 2019) • Reliable Performance at Scale with Apache Spark on Kubernetes (Will Manning and Matt Cheah, Spark+AI Summit Europe 2019) • Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters (Prakhar Jain and Venkata Krishnan Sowrirajan, Spark+AI Summit Europe 2019) • Maximizing Performance with GraalVM (Thomas Wuerthinger, Oracle Code 2019) • Twitter’s Quest for a Wholly Graal Runtime (Chris Thalinger) • Improving GraalVM Native Image (Christian Wimmer, JVM Language Summit 2019) • Efficient Management of Ephemeral Data in Serverless Computing (Patrick Stuedi, Fourth International Workshop on Serverless Computing) • Adopting GraalVM (Petr Zapletal, Scale By The Bay 2018) • Apache Spark on Kubernetes (Anirudh Ramanathan and Tim Chen, Spark+AI Summit 2017) • Lambda Architecture in the Cloud with Azure Databricks (Andrei Varanovich, Spark+AI Summit DEV6) 41
  35. Use Cases • Serverless Model Serving: OpenWhisk, Apache Spark, and

    MLeap (Jowanza Joseph) • How We Implemented a Fully Serverless Recommender System Using GCP (Will Fuks) • Serverless Data Engineering: AWS Glue + Lambda + Athena + QuickSight (Peter Begle) • Implementing ETL job using AWS Glue (Vijayendra Bhati) • Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy (Tanzir Musabbir) • ETL with standalone Spark containers for ingesting small files (Joe Corey) • Event-driven serverless ETL using AWS Lambda, Step Functions, EMR and Fargate (Siddarth Sharma) • Using Graal AOT for a realistic project? (Jurgen Voorneveld) 42
  36. Code • https://github.com/qubole/spark-on-lambda • https://github.com/pywren/pywren • https://github.com/lambci/docker-lambda • https://github.com/graalvm/graalvm-demos •

    https://github.com/palantir/spark-tpcds-benchmark • https://github.com/GoogleCloudPlatform/spark-on-k8s-operator • https://github.com/palantir/k8s-spark-scheduler • https://github.com/CloudScala/scala-graalvm-docker 43
  37. Image Credits Title slide: Moon - North Pole (NASA ID:

    PIA00126) Looking ahead slide: Astronaut Edward White during first EVA performed during Gemini 4 flight (NASA ID: S65-34635) Resources slide: Apollo 11 Mission image - View of moon limb, with Earth on the horizon (NASA ID: as11-44-6551) Architecture icons: • AWS (https://aws.amazon.com/architecture/icons/) • GCP (https://cloud.google.com/icons/) • Azure (https://www.microsoft.com/en-us/download/details.aspx?id=41937) • IBM Cloud Functions (https://github.com/ibm-functions) • Oracle Cloud Functions (https://docs.cloud.oracle.com/iaas/Content/Functions/Concepts/functionsoverview.htm#limits) • Apache OpenWhisk (https://github.com/apache/openwhisk-website/tree/master/images/logo) • Apache Beam (https://beam.apache.org/community/logos/) • Kubernetes (https://github.com/kubernetes/community/tree/master/icons) • Databricks (https://databricks.com/wp-content/uploads/) 44