Slide 1

Slide 1 text

A Data-Centric Lens on Cloud Programming and Serverless Computing JOE HELLERSTEIN, UC BERKELEY

Slide 2

Slide 2 text

Hydro: Stateful Serverless and Beyond Avoiding Coordination Serverless Computing The CALM Theorem

Slide 3

Slide 3 text

Sea Changes in Computing: Minicomputers (PDP-11, 1970), Supercomputers (Cray-1, 1976), Personal Computers (Macintosh, 1984), Smart Phones (iPhone, 2007)

Slide 4

Slide 4 text

New Platform + New Language = Innovation: Minicomputers (PDP-11, 1970), Supercomputers (Cray-1, 1976), Personal Computers (Macintosh, 1984), Smart Phones (iPhone, 2007)

Slide 5

Slide 5 text

The Big Question: How will folks program the cloud? In a way that fosters unexpected innovation. Distributed programming is hard! • Parallelism, consistency, partial failure, … Autoscaling makes it harder!

Slide 6

Slide 6 text

We’ve been talking about this for a while!

Slide 7

Slide 7 text

Industry Response: Serverless Computing (industry finally woke up!)

Slide 8

Slide 8 text

Serverless 101: Functions-as-a-Service (FaaS). Enables developers (outside of AWS, Azure, Google, etc.) to program the cloud: access to 1000s of cores and PBs of RAM, fine-grained resource usage and efficiency, and new economic and pricing models.

Slide 9

Slide 9 text

Serverless & the Three Promises of the Cloud: Autoscaling ✅; Massive Data Processing; Unbounded Distributed Computing. One step forward, two steps back.

Slide 10

Slide 10 text

Three Limitations of Current FaaS (e.g. AWS Lambda). I/O bottlenecks: 10-100x higher latency than SSD disks, charges for each I/O. 15-minute lifetimes: functions routinely fail, can’t assume any session context. No inbound network communication: instead, “communicate” through global services on every call.

Slide 11

Slide 11 text

Still, Serverless Opens the Conversation: Small steps to Big Questions

Slide 12

Slide 12 text

Hydro: Stateful Serverless and Beyond Avoiding Coordination Serverless Computing + Autoscaling — Latency-Sensitive Data Access — Distributed Computing The CALM Theorem

Slide 13

Slide 13 text

A First Step: Embracing State. Program state: local data that is managed across invocations. Challenge 1: Data Gravity. Expensive to move state around; this policy problem is not so hard. Challenge 2: Distributed Consistency. This correctness problem is difficult and unavoidable!

Slide 14

Slide 14 text

The Challenge: Consistency. Ensure that distant agents agree (or will agree) on common knowledge. Classic example: data replication. How do we know if they agree on the value of a mutable variable x? x = ❤

Slide 15

Slide 15 text

The Challenge: Consistency. Ensure that distant agents agree (or will agree) on common knowledge. Classic example: data replication. How do we know if they agree on the value of a mutable variable x? If they disagree now, what could happen later?

Slide 16

Slide 16 text

Classical Consistency Mechanisms: Coordination. Consensus (Paxos, etc.), Commit (Two-Phase Commit, etc.)

Slide 17

Slide 17 text

Coordination Avoidance (a poem): “the first principle of successful scalability is to batter the consistency mechanisms down to a minimum move them off the critical path hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them” —James Hamilton (IBM, MS, Amazon) in Birman, Chockler: “Toward a Cloud Computing Research Agenda”, LADIS 2009

Slide 18

Slide 18 text

Why Avoid Coordination? Waiting for control is bad: tail latency of a quorum of machines can be very high (straggler effects). Waiting leads to slowdown cascades: it’s not just “your” problem!
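To make the straggler point concrete, here is a minimal simulation sketch (mine, not the talk’s) comparing a coordination round that waits on every participant against acting on local state; the 1% straggler probability and the latency numbers are made-up parameters.

import random

def rpc_latency_ms():
    # One participant's response time: usually ~1 ms, occasionally a 100 ms straggler.
    return 100.0 if random.random() < 0.01 else random.uniform(0.5, 1.5)

def coordination_round(n_participants):
    # e.g. a two-phase-commit prepare phase: wait for every acknowledgment.
    return max(rpc_latency_ms() for _ in range(n_participants))

def coordination_free():
    # No waiting on others: act on local state.
    return rpc_latency_ms()

trials = 10_000
waited = sorted(coordination_round(10) for _ in range(trials))
local = sorted(coordination_free() for _ in range(trials))
print("p95 waiting on 10 participants:", round(waited[int(0.95 * trials)], 1), "ms")
print("p95 acting locally:            ", round(local[int(0.95 * trials)], 1), "ms")

With these parameters, waiting on ten machines pushes the occasional 100 ms hiccup into the 95th percentile, while the coordination-free path almost never sees it.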

Slide 19

Slide 19 text

Towards a Solution. Traditional distributed systems work is all about I/O. What if we reason about application semantics? With thanks to Peter Bailis…

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Hydro: Stateful Serverless and Beyond • Autoscaling stateful? Avoid Coordination! • Semantics to the rescue? Avoiding Coordination Serverless Computing + Autoscaling — Latency-Sensitive Data Access — Distributed Computing The CALM Theorem

Slide 22

Slide 22 text

CALM: Consistency As Logical Monotonicity. Theorem (CALM): A distributed program P has a consistent, coordination-free distributed implementation if and only if it is monotonic. References: Hellerstein JM. The Declarative Imperative: Experiences and conjectures in distributed logic. ACM PODS Keynote, June 2010, and ACM SIGMOD Record, Sep 2010. Ameloot TJ, Neven F, Van den Bussche J. Relational transducers for declarative networking. JACM, Apr 2013. Ameloot TJ, Ketsman B, Neven F, Zinn D. Weaker forms of monotonicity for declarative networking: a more fine-grained answer to the CALM-conjecture. ACM TODS, Feb 2016. Hellerstein JM, Alvaro P. Keeping CALM: When Distributed Consistency is Easy. To appear, CACM 2020.

Slide 23

Slide 23 text

We’ll need some formal definitions

Slide 24

Slide 24 text

Intuitively… Consistency: a unique outcome guaranteed regardless of network shenanigans. Monotonicity: the set of outcomes only grows during execution; emit outputs without regret! Coordination: responses we await even though we have all the data.
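As a concrete illustration of these definitions (my sketch, not from the talk): a monotone query can emit partial outputs as messages trickle in and never retract them, while a non-monotone query may have to take an early answer back. The “likes”/“dislikes” messages below are invented for the example.

from itertools import permutations

messages = [("likes", "alice"), ("likes", "bob"), ("dislikes", "alice")]

def monotone_query(seen):
    # Who has sent a 'likes' message? The answer only grows as messages arrive.
    return {who for (kind, who) in seen if kind == "likes"}

def non_monotone_query(seen):
    # Who has liked but NOT disliked? Negation means an early answer can be wrong.
    likes = {who for (kind, who) in seen if kind == "likes"}
    dislikes = {who for (kind, who) in seen if kind == "dislikes"}
    return likes - dislikes

for order in permutations(messages):
    seen, emitted_early = [], set()
    for msg in order:
        seen.append(msg)
        emitted_early |= non_monotone_query(seen)  # emit partial answers eagerly
    retracted = emitted_early - non_monotone_query(seen)
    if retracted:
        print("delivery order", order, "emitted", retracted, "and then had to retract it")
# The monotone query never needs retraction: its partial outputs are always a
# subset of the final answer {'alice', 'bob'}, whatever the delivery order.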

Slide 25

Slide 25 text

Two Canonical Examples. Distributed Deadlock: once you observe the existence of a waits-for cycle, you can (autonomously) declare deadlock; more information will not change the result. Garbage Collection: suspecting garbage (the non-existence of a path from root) is not enough; more information may change the result, hence you are required to check all nodes for information (under any assignment of objects to nodes!).
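A small sketch of the deadlock half of this contrast (mine, not the talk’s code): a waits-for cycle found in a partial, locally known edge set stays a cycle no matter which edges arrive later, so “Deadlock!” can be declared without coordination; “garbage” has no such property.

def has_cycle(edges):
    # Detect a cycle in a directed waits-for graph given as a set of (waiter, holder) edges.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True                      # back edge: a waits-for cycle
        if node in done:
            return False
        visiting.add(node)
        found = any(dfs(nbr) for nbr in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(dfs(n) for n in list(graph))

partial_view = {("t1", "t2"), ("t2", "t1")}   # edges one site happens to know about
later_edges = {("t3", "t4"), ("t4", "t5")}    # edges learned afterwards
assert has_cycle(partial_view)                 # deadlock declared early...
assert has_cycle(partial_view | later_edges)   # ...and never invalidated by more data
# Garbage collection is the non-monotone case: "no path from the root to o" in a
# partial view can be refuted by a later edge, so it needs a coordinated global check.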

Slide 26

Slide 26 text

That’s interesting. Who cares? CALM thinking inspires crazy-fast, infinitely-scalable systems: no coordination = insane parallelism and smooth scalability (e.g. we’ll see the Anna KVS in a few slides). We can actually check monotonicity syntactically in a logic language, e.g. in SQL or Bloom. But who writes distributed programs in logic?! CALM explains CAP, the times when we get Safety+Liveness. A conversation for another day… http://bit.ly/calm-cacm

Slide 27

Slide 27 text

Hydro: Stateful Serverless and Beyond • Autoscaling stateful? Avoid Coordination! • Semantics to the rescue? Avoiding Coordination Serverless Computing + Autoscaling — Latency-Sensitive Data Access — Distributed Computing The CALM Theorem Monotonicity is the “bright line” between what can and cannot be done coordination-free

Slide 28

Slide 28 text

Hydro: A Platform for Programming the Cloud. Anna: autoscaling multi-tier KVS (ICDE18, VLDB19). Cloudburst: Stateful FaaS (https://arxiv.org/abs/2001.04592). HydroLogic: a disorderly IR. Hydrolysis: a cloud compiler toolkit. Possible front-end programming models (?): Logic Programming, Functional Reactive, Actors, Futures.

Slide 29

Slide 29 text

Anna Serverless KVS • Anyscale: perform like Redis, scale like S3 • CALM consistency levels via simple lattices • Autoscaling & multi-tier serverless storage • Won best-of-conference at ICDE, VLDB [1, 2]. [1] Wu, Chenggang, et al. “Anna: A KVS for any scale.” IEEE Transactions on Knowledge and Data Engineering (2019). [2] Wu, Chenggang, Vikram Sreekanti, and Joseph M. Hellerstein. “Autoscaling tiered cloud storage in Anna.” PVLDB 12.6 (2019): 624-638.

Slide 30

Slide 30 text

Anna Performance. Shared-nothing at all scales (even across threads); crazy fast under contention. Up to 700x faster than Masstree within a multicore machine. Up to 10x faster than Cassandra in a geo-distributed deployment. Coordination-free consistency: no atomics, no locks, no waiting ever!

Slide 31

Slide 31 text

CALM Consistency. Simple, clean lattice composition gives a range of consistency levels. [Chart: lines of C++ code modified, by system component.] KEEP CALM AND WRITE(X)
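Anna itself is written in C++; what follows is a minimal Python sketch (class names mine, not Anna’s API) of the composition idea: each lattice exposes only a commutative, associative, idempotent merge, and larger lattices are built by nesting smaller ones, so replicas converge no matter how updates are ordered or duplicated.

class MaxIntLattice:
    def __init__(self, value=0):
        self.value = value
    def merge(self, other):
        self.value = max(self.value, other.value)

class SetLattice:
    def __init__(self, elements=()):
        self.elements = set(elements)
    def merge(self, other):
        self.elements |= other.elements

class MapLattice:
    # Composition: a map whose values are themselves lattices, merged pointwise.
    def __init__(self):
        self.entries = {}
    def merge(self, other):
        for key, val in other.entries.items():
            if key in self.entries:
                self.entries[key].merge(val)
            else:
                self.entries[key] = val

# Two replicas see updates in different orders yet converge after merging:
r1, r2 = MapLattice(), MapLattice()
r1.entries["followers"] = SetLattice({"alice"})
r2.entries["followers"] = SetLattice({"bob"})
r2.entries["last_login_ts"] = MaxIntLattice(42)
r1.merge(r2)
r2.merge(r1)
assert r1.entries["followers"].elements == {"alice", "bob"} == r2.entries["followers"].elements

Different compositions of such building blocks yield different consistency levels, from last-writer-wins registers up toward causal consistency, which is the point of the slide’s “simple lattice composition” claim.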

Slide 32

Slide 32 text

Autoscaling & Multi-Tier Cost Tradeoffs: 350x the performance of DynamoDB for the same price!

Slide 33

Slide 33 text

Cloudburst: A Stateful Serverless Platform. Main challenge: cache consistency! Hydrocache: new consistency protocols for distributed client “sessions” spanning the compute and storage tiers.

Slide 34

Slide 34 text

Multiple Consistency Levels Here Too. Read Atomic transactions: AFT [1], a fault-tolerance shim layer between any FaaS and any object store, currently evaluated between AWS Lambda and AWS S3! Multisite Transactional Causal Consistency (MTCC) [2]. Causal: preserve Lamport’s happened-before relation. Multisite transactional: nested functions running across multiple machines. [1] Sreekanti, Vikram, et al. A Fault-Tolerance Shim for Serverless Computing. To appear, Eurosys (2020). [2] Wu, Chenggang, et al. Transactional Causal Consistency for Serverless Computing. To appear, ACM SIGMOD (2020).
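For intuition only (not the MTCC protocol itself), here is a sketch of the causal bookkeeping such protocols build on: vector clocks encode Lamport’s happened-before relation, letting a cache decide whether one version of a key must become visible before another or whether the two are concurrent. The function-name keys below are hypothetical.

def happened_before(vc_a, vc_b):
    # vc_a happened-before vc_b iff vc_a <= vc_b componentwise and they differ somewhere.
    keys = set(vc_a) | set(vc_b)
    all_leq = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    some_less = any(vc_a.get(k, 0) < vc_b.get(k, 0) for k in keys)
    return all_leq and some_less

def concurrent(vc_a, vc_b):
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)

write_x = {"f1": 1}                       # a write made by function f1
write_y = {"f1": 1, "f2": 1}              # made by f2 after it observed write_x
assert happened_before(write_x, write_y)  # a causal cache must expose x before y

assert concurrent({"f1": 2}, {"f3": 1})   # unrelated writes: either order is causally fine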

Slide 35

Slide 35 text

Running a Twitter Clone on Cloudburst. [Chart: read and write latency (ms, log scale) for Cloudburst (LWW), Cloudburst (Causal), and Redis.]

Slide 36

Slide 36 text

Prediction Serving on Cloudburst. [Chart: latency (ms) for native Python, Cloudburst, AWS SageMaker, and AWS Lambda.]

Slide 37

Slide 37 text

Applications on Cloudburst. Hydro: Anna (autoscaling multi-tier KVS; ICDE18, VLDB19), Cloudburst (Stateful FaaS; https://arxiv.org/abs/2001.04592), HydroLogic (a disorderly IR), Hydrolysis (a cloud compiler toolkit). Possible front-end programming models (?): Logic Programming, Functional Reactive, Actors, Futures.

Slide 38

Slide 38 text

Applications on Cloudburst. Hydro: Anna (autoscaling multi-tier KVS; ICDE18, VLDB19), Cloudburst (Stateful FaaS; https://arxiv.org/abs/2001.04592). Serverless Data Science, Robot Motion Planning, ML Prediction: ModelZoo. Charles Lin, Devin Petersohn, Simon Mo, Rehan Durrani, Aditya Ramkumar, Avinash Arjavalingam, Jeffrey Ichnowski

Slide 39

Slide 39 text

ModelZoo on Cloudburst

Slide 40

Slide 40 text

Why Serverless Jupyter? Large Jupyter deployments! ⋅ Berkeley DataHub: Jupyter deployment that serves over 37,000 students! ⋅ Scaling issues ⋅ Resource efficiency issues

Slide 41

Slide 41 text

A single user’s compute demands. [Chart: resource demands over time; spikes while running a cell, near zero while typing, thinking, or not at the computer.]

Slide 42

Slide 42 text

[Chart: individual users’ resource-demand curves summed across 37,000 users.]

Slide 43

Slide 43 text

[Chart: aggregate resource demands over time, with deadline rushes and periods of light utilization.]

Slide 44

Slide 44 text

Jupyter on Cloudburst ⋅ A prototype Jupyter notebook that has been ported to execute on Cloudburst ⋅ Each cell is a serverless function execution ⋅ Notebooks hold zero provisioned compute when not running a cell!
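A toy sketch of the idea in the bullets above (a hypothetical stand-in, not Cloudburst’s actual client API): each cell runs as a function whose inputs are fetched from a key-value store and whose definitions are written back, so no kernel process has to stay provisioned between cells.

kvs = {}   # stand-in for Anna: notebook state lives here, not in a kernel process

def run_cell(cell_source, needed_names):
    # Execute one cell "serverlessly": pull its inputs, run it, push its definitions back.
    env = {name: kvs[name] for name in needed_names if name in kvs}
    exec(cell_source, env)
    for name, value in env.items():
        if not name.startswith("__"):      # skip the builtins entry exec adds
            kvs[name] = value

run_cell("x = 3", needed_names=[])
run_cell("y = x + 1", needed_names=["x"])
assert kvs["x"] == 3 and kvs["y"] == 4

The real system additionally caches hot keys next to the executors (Hydrocache) and enforces the consistency protocols discussed earlier.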

Slide 45

Slide 45 text

Cloudburst

Slide 46

Slide 46 text

Cloudburst

Slide 47

Slide 47 text

Cloudburst x: 3

Slide 48

Slide 48 text

Cloudburst x: 3 Program state stored in Anna. Each cell retrieves only the definitions it needs
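One way such a system could decide which definitions a cell needs (a hypothetical sketch, not the Hydro implementation): parse the cell and collect the names it reads but does not define, then fetch only those keys from Anna.

import ast

def names_needed(cell_source):
    # A simple approximation: names loaded in the cell that the cell does not itself assign.
    # Real dependency tracking must also handle control flow, functions, imports, etc.
    assigned, needed = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                needed.add(node.id)
    return needed - assigned

print(names_needed("z = x + y\nw = z * 2"))   # {'x', 'y'}: only x and y are fetched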

Slide 49

Slide 49 text

Demo: Jupyter on Cloudburst

Slide 50

Slide 50 text

Beyond operational benefits ⋅ Serverless architecture does much more than just address Jupyter’s scaling and cost problems ⋅ Also enables new directions for Jupyter!

Slide 51

Slide 51 text

Demo: Spiking memory usage

Slide 52

Slide 52 text

This is only a taste of what’s possible. Future work: Choosing instance types for cells

Slide 53

Slide 53 text

Cell returns immediately! `table` loads in background. This is only a taste of what’s possible. Future work: Automatic asynchronous evaluation

Slide 54

Slide 54 text

Cloudburst a: np.array(...) This is only a taste of what’s possible. Future work: True notebook sharing

Slide 55

Slide 55 text

Scalable Serverless Robot Motion Planning with Cloudburst and Anna Jeffrey Ichnowski, Chenggang Wu, Jackson Chui, Raghav Anand, Samuel Paradis, Vikram Sreekanti, Joseph Hellerstein, Joseph E. Gonzalez, Ion Stoica, Ken Goldberg AUTOLAB

Slide 56

Slide 56 text

Robot Wants to Get Around Room

Slide 57

Slide 57 text

Motion Planning Compute Requirements. Different problems require different amounts of computing, from low CPU to high CPU: navigation (“Go to office desk”) vs. manipulation (“Declutter desk”).

Slide 58

Slide 58 text

Motion Planning Compute Requirements. [Chart: robot’s CPU usage over time; spikes for a simple planning problem and a complex planning problem, low usage while the robot is moving.]

Slide 59

Slide 59 text

Motion Planning in Serverless Computing. Requirements: • Low-latency simultaneous launch of 100s of functions • Fast sharing of best path between functions • Conflict resolution based on path cost

Slide 60

Slide 60 text

Lambda Communication. Originally built on AWS Lambda. AWS API Endpoint: “start 8 lambdas”. How do the lambdas communicate?

Slide 61

Slide 61 text

Lambda Communication. Originally built on AWS Lambda. AWS API Endpoint: “start 8 lambdas”. How do the lambdas communicate?

Slide 62

Slide 62 text

Lambda Communication. AWS API Endpoint: “start 8 lambdas”; coordinator (EC2 1-core instance); “my IP is 192.168.0.54”

Slide 63

Slide 63 text

Communication Bottleneck. AWS API Endpoint: “start 8 lambdas”; coordinator (EC2 1-core instance); “my IP is 192.168.0.54”. Bottleneck on number of lambdas.

Slide 64

Slide 64 text

Overcoming Communication Bottleneck. Cloudburst API Endpoint: “start 8 executors”; Anna: “use Anna key: (…)”

Slide 65

Slide 65 text

Before Anna Lattices. [Figure: each of λa, λb, λc produces a stream of updates (τa1, τa2, …; τb1, …; τc1, …); a coordinator (EC2 1-core instance) relays every other lambda’s full update stream to each lambda.]

Slide 66

Slide 66 text

After Anna Lattices. [Figure: λa, λb, λc publish their updates to the Anna KVS, which merges them as lattices; each lambda reads merged state rather than every individual update from its peers.]
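A sketch of the conflict-resolution requirement from a few slides back (hypothetical code, not the paper’s): the shared “best motion plan so far” is a lattice whose merge keeps the cheaper plan, so hundreds of executors can race on the same Anna key and still converge without a coordinator.

import random

class MinCostPlanLattice:
    def __init__(self, cost=float("inf"), plan=None):
        self.cost, self.plan = cost, plan
    def merge(self, other):
        # Commutative, associative, idempotent: always keep the lower-cost plan.
        if other.cost < self.cost:
            self.cost, self.plan = other.cost, other.plan

def executor(executor_id, shared_best, tries=50):
    # Each serverless executor samples candidate plans and merges its best into shared state.
    for i in range(tries):
        candidate_cost = random.uniform(30, 100)   # stand-in for a sum of joint-angle changes
        shared_best.merge(MinCostPlanLattice(candidate_cost, plan=(executor_id, i)))

best = MinCostPlanLattice()                        # the value stored under one Anna key
for executor_id in range(100):                     # "100 concurrent Cloudburst functions"
    executor(executor_id, best)
print("best cost found:", round(best.cost, 1), "from plan", best.plan)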

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

Decluttering with Cloudburst + Anna Bottle to grasp

Slide 70

Slide 70 text

Cloudburst + Anna: Motion Plan Cost Over Time. [Chart: cost (sum of joint angle changes) over 60 seconds for 10 vs. 100 concurrent Cloudburst functions.] Cost with 10 functions after 60 seconds = cost with 100 functions after 2 seconds.

Slide 71

Slide 71 text

ROBOT DEMO

Slide 72

Slide 72 text

• Anna: CALM anyscale KVS • Stateful FaaS: cache consistency • Stateful FaaS applications Hydro: Stateful Serverless and Beyond • Autoscaling stateful? Avoid Coordination! • Semantics to the rescue? Avoiding Coordination Serverless Computing + Autoscaling — Latency-Sensitive Data Access — Distributed Computing The CALM Theorem Monotonicity is the “bright line” between what can and cannot be done coordination-free

Slide 73

Slide 73 text

Did We Answer the Big Question? Not Yet. We’re pushing the state of the art in FaaS, but Stateful FaaS is a limited API: Python functions + explicit storage, with limited contracts from the PL. The developer must reason about consistency guarantees and decide when app logic needs to coordinate. The real dream takes time!

Slide 74

Slide 74 text

Research Futures. Hydro: Anna (autoscaling multi-tier KVS; ICDE18, VLDB19), Cloudburst (Stateful FaaS; https://arxiv.org/abs/2001.04592), HydroLogic (a disorderly IR), Hydrolysis (a cloud compiler toolkit). Possible front-end programming models (?): Logic Programming, Functional Reactive, Actors, Futures.

Slide 75

Slide 75 text

More Information. Hydro: https://github.com/hydro-project Bloom: http://bloom-lang.net RiseLab: https://rise.cs.berkeley.edu [email protected] @joe_hellerstein Chenggang Wu, Vikram Sreekanti, Joseph Gonzalez, Johann Schleier-Smith, Charles Lin. Music composed and performed by: