Making Distributed Computing Easy (Ion Stoica, Anyscale)

Slide 1

Slide 1 text

Making Distributed Computing Easy
Ion Stoica
Co-Founder, Executive Chairman & President, Anyscale
Professor, UC Berkeley

Slide 2

Slide 2 text

This Talk
01 Distributed apps are becoming the norm
02 Building distributed apps is very hard
03 Ray and Anyscale make developing, deploying and managing distributed apps easy

Slide 3

Slide 3 text

Distributed apps becoming the norm
01 Apps increasingly incorporate AI
02 AI workloads are becoming distributed

Slide 4

Slide 4 text

Distributed apps becoming the norm
01 Apps increasingly incorporate AI
02 AI workloads are becoming distributed

Slide 5

Slide 5 text

Apps increasingly incorporate AI

Slide 6

Slide 6 text

Distributed apps becoming the norm
01 Apps increasingly incorporate AI
02 AI workloads are becoming distributed

Slide 7

Slide 7 text

Compute demands, 2012-2019 (AI)
35x every 18 months
https://openai.com/blog/ai-and-compute/

Slide 8

Slide 8 text

Compute demands, 2012-2020 (AI)
35x every 18 months
2020: GPT-3
No sign of slowdown...
https://devblogs.nvidia.com/training-bert-with-gpus/

Slide 9

Slide 9 text

Growing gap between demand and supply
35x every 18 months
2020: GPT-3
CPU: Moore’s Law (2x every 18 months)
https://openai.com/blog/ai-and-compute/
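
Back-of-the-envelope sketch of the gap, using only the two rates on this slide (35x and 2x every 18 months). The 36-month horizon and the helper name growth_factor are illustrative assumptions, not from the talk.

```python
def growth_factor(rate_per_period, months, period_months=18):
    """Total multiplicative growth after `months`, compounding once per period."""
    return rate_per_period ** (months / period_months)


# Assumed horizon for illustration: two 18-month periods.
months = 36
demand = growth_factor(35, months)  # AI compute demand: 35x per period
supply = growth_factor(2, months)   # Moore's Law: 2x per period
gap = demand / supply

print(f"demand grew {demand:.0f}x, supply grew {supply:.0f}x, gap {gap:.2f}x")
```

Over two periods demand compounds to 35^2 = 1225x while supply reaches 2^2 = 4x, so the gap itself widens by 17.5x per period.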

Slide 10

Slide 10 text

Specialized hardware is not enough
35x every 18 months
2020: GPT-3
CPU: Moore’s Law (2x every 18 months)
GPU*, TPU*
No way out but to distribute these apps!
https://openai.com/blog/ai-and-compute/

Slide 11

Slide 11 text

Building distributed apps very hard!
01 Apps becoming more and more complex
02 Development ⇒ production is challenging

Slide 12

Slide 12 text

Building distributed apps very hard!
01 Apps becoming more and more complex
02 Development ⇒ production is challenging

Slide 13

Slide 13 text

Online learning
Improve inference accuracy in fast-changing environments
Examples: recommendations, financial predictions, resource allocations
Diagram: Data Ingestion & Featurization, Training, Serving

Slide 14

Slide 14 text

Online learning
Solution: stitch together a bunch of distributed systems
Diagram: Data Ingestion & Featurization, Training, Serving, Serve

Slide 15

Slide 15 text

Online learning
Diagram: Data Ingestion & Featurization, Training, Serving, Serve, Hyper tuning

Slide 16

Slide 16 text

Reinforcement learning
Examples: industry automation, self-driving, trading & finance, system optimizations, recommendations, etc.
Solution: stitch together a bunch of distributed systems (e.g., Facebook’s Horizon)
Diagram: Training, Serving, Serve, Simulations; Environment and Agent exchange State/reward and Action
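
The Environment/Agent loop in the diagram can be sketched in a few lines. This is a toy, single-process illustration only: LineWorld, GreedyAgent, and run_episode are hypothetical names, and the reward values are arbitrary. A real system would distribute simulation, training, and serving, as the slide describes.

```python
class LineWorld:
    """Toy environment: the agent walks along the integers toward a goal."""

    def __init__(self, start=0, goal=3):
        self.position = start
        self.goal = goal

    def step(self, action):
        """Apply an action (+1 or -1) and return (state, reward, done)."""
        self.position += action
        done = self.position == self.goal
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at goal
        return self.position, reward, done


class GreedyAgent:
    """Toy agent: always moves in the direction of the goal."""

    def __init__(self, goal):
        self.goal = goal

    def act(self, state):
        return 1 if state < self.goal else -1


def run_episode(env, agent, max_steps=20):
    """Run the state -> action -> state/reward loop until done or max_steps."""
    state = env.position
    total_reward = 0.0
    for step_count in range(1, max_steps + 1):
        action = agent.act(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            return step_count, total_reward
    return max_steps, total_reward


steps, total = run_episode(LineWorld(start=0, goal=3), GreedyAgent(goal=3))
print(steps, round(total, 2))
```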

Slide 17

Slide 17 text

Reinforcement learning
Solution: build it from scratch (e.g., DeepMind’s Acme)
Diagram: Training, Serving, Simulations; Environment and Agent exchange State/reward and Action

Slide 18

Slide 18 text

Backend: Business Logic & Inference
Solution: stitch together a bunch of distributed systems
Diagram: Backend (Business Logic, Serving / Inference), request, reply, Serving

Slide 19

Slide 19 text

Challenges with building from scratch
High performance, but very expensive
● Time
● People
Few companies can afford it, e.g., Google, Facebook, ...

Slide 20

Slide 20 text

Challenges with stitching together
Hard to develop: different APIs
Hard to deploy & manage: impedance mismatch
Slow: high overhead of moving data between different systems
Diagram: Data Processing, Training, Serving, Hyper. Tuning, Business Logic, Simulations, Serving, KFServing

Slide 21

Slide 21 text

Ray unifies distributed workloads

Slide 22

Slide 22 text

universal framework for distributed computing
Ray ecosystem + Native: Data Processing, Training, Serving, Hyper. Tuning, Business Logic, Others

Slide 23

Slide 23 text

Best ecosystem of distributed libraries
Instead of stitching systems, call libraries in same system
Easy to develop, manage, and deploy
Ray ecosystem + Native: Data Processing, Training, Serving, Hyper. Tuning, Business Logic, Others

Slide 24

Slide 24 text

Real examples at Ray Summit
Online learning
Reinforcement learning
Business Logic & Inference
Framework for online learning

Slide 25

Slide 25 text

Building distributed apps very hard!
01 Apps becoming more and more complex
02 Development ⇒ production is challenging

Slide 26

Slide 26 text

Application Lifecycle
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 27

Slide 27 text

What do developers want?
Develop on your laptop
Test and deploy on the cluster/cloud
Development (Laptop): Edit, Run, Debug
Production (Cluster/cloud): Test / staging, Deploy

Slide 28

Slide 28 text

Develop on your laptop?
Huge barrier
- Local development and tests cannot reproduce cluster deployment
- Need to package/dockerize app
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 29

Slide 29 text

Develop on the cluster?
Hard to develop on the cluster/cloud: no good tools, slow to launch nodes, expensive.
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 30

Slide 30 text

Anyscale simplifies development, deployment and management of Ray apps

Slide 31

Slide 31 text

universal framework for distributed computing
Ray ecosystem + Native: Data Processing, Training, Serving, Hyper. Tuning, Business Logic & Simulations, Others

Slide 32

Slide 32 text

Anyscale: Best of both worlds
Laptop development experience and cloud scale
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 33

Slide 33 text

Development: Infinite laptop
Can start developing on your laptop...
Development: Edit, Run, Debug

Slide 34

Slide 34 text

Development: Infinite laptop
... then transparently move to the cloud
Like your laptop but with “infinite” resources!
Development: Edit, Run, Debug

Slide 35

Slide 35 text

Infinite laptop: How?
1. Sync up local environment and files to the cloud (NEW)
Development: Edit, Run, Debug

Slide 36

Slide 36 text

Infinite laptop: How?
1. Sync up local environment and files to the cloud (NEW)
2. Run program in the cloud with no code changes (NEW)

ray_prog.py:
    import ray
    ray.client().connect()
    ...

Run on laptop:
    > python ray_prog.py

Run in the cloud:
    > RAY_ADDRESS="anyscale://" python ray_prog.py
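
The "no code changes" idea is that the script stays the same and the RAY_ADDRESS environment variable decides where it runs. A minimal sketch of that dispatch pattern, using only the standard library. describe_target is a hypothetical helper for illustration and is not part of Ray or Anyscale; only the RAY_ADDRESS variable and the anyscale:// prefix come from the slide.

```python
import os


def describe_target(environ=None):
    """Say where a script would run, based only on RAY_ADDRESS (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    address = environ.get("RAY_ADDRESS", "")
    if not address:
        return "laptop"  # no address set: run locally
    if address.startswith("anyscale://"):
        return "cloud"  # address scheme shown on the slide
    return "other cluster"  # any other address (assumption for completeness)


# Same function, different environment, different target.
print(describe_target({}))
print(describe_target({"RAY_ADDRESS": "anyscale://"}))
```

The program text never changes between the two runs; only the environment it is launched in does.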

Slide 37

Slide 37 text

Infinite laptop: How?
1. Sync up local environment and files to the cloud (NEW)
2. Run program in the cloud with no code changes (NEW)
3. Serverless experience
Development: Edit, Run, Debug

Slide 38

Slide 38 text

Infinite laptop: How?
1. Sync up local environment and files to the cloud (NEW)
2. Run program in the cloud with no code changes (NEW)
3. Serverless experience
4. Debug programs like on your laptop (NEW)
Development: Edit, Run, Debug

Slide 39

Slide 39 text

Development → Production
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 40

Slide 40 text

Development → Production
1. App packaging (NEW)
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 41

Slide 41 text

Development → Production
1. App packaging (NEW)
2. SDK & REST APIs (NEW)
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 42

Slide 42 text

Development → Production
1. App packaging (NEW)
2. SDK & REST APIs (NEW)
3. Monitoring & observability
Development: Edit, Run, Debug
Production: Test / staging, Deploy

Slide 43

Slide 43 text

Demo