Slide 1

Ray Essentials: Introduction to Ray for machine learning
Jules S. Damji, Ray Team @Anyscale
X: @2twitme
LinkedIn: https://www.linkedin.com/in/dmatrix/

Slide 2

A quick poll…

Slide 3

$whoami
● Lead Developer Advocate, Anyscale & Ray Team
● Sr. Developer Advocate, Databricks, Apache Spark/MLflow Team
● Led Developer Advocacy, Hortonworks
● Held SWE positions:
  ○ Sun Microsystems
  ○ Netscape
  ○ @Home
  ○ Loudcloud/Opsware
  ○ Verisign

Slide 4

Who do I work for …
Who we are: Original creators of Ray, a unified general-purpose framework for scalable distributed computing
What we do: Scalable compute for AI as a managed service, with Ray at its core, and the best platform to develop & run AI apps
Why we do it: Scaling is a necessity, scaling is hard; make distributed computing easy and simple for everyone

Slide 5

Anyscale Platform for AI Infra
[Diagram: Anyscale cloud infrastructure running the Ray libraries (Data, Train, Tune, RL, Serve) with Workspaces, Jobs, and Services, plus Anyscale Endpoints for serving and fine-tuning OS LLMs]

Slide 6

🗓 Today’s agenda
● Why & What’s Ray & Ray Ecosystem
● Ray Architecture & Components
● Ray Core Design & Scaling Patterns & APIs
● Demo
● Wrap up…

Slide 7

Why Ray + What’s Ray

Slide 8

Why Ray?
● Machine learning is pervasive
● Distributed computing is a necessity
● Python is the default language for DS/ML

Slide 9

Blessings of scale ….

Slide 10

Blessings of scale ….
1. Models are getting larger
   - Model size is increasing exponentially.
   - Models are too large to fit into a single GPU.
   - We need to shard the models across multiple GPUs for training, e.g. ZeRO, model parallel, pipeline parallel.
   BERT (2019): 336M params (1.34 GB) → Llama-2 (2023): 70B params (280 GB), ~200x → GPT-4: ~1800B params, >5,000x

Slide 11

Supply-demand problem
[Chart: compute demand for large models (GPT-3, GPT-4; * Llama 2, Falcon, PaLM, etc.) growing ~35x every 18 months over 2020-2023, far outpacing CPU, GPU*, and TPU capability. Source: https://openai.com/blog/ai-and-compute/]

Slide 12

Supply-demand problem
[Same chart as the previous slide]
No way out but to distribute!

Slide 13

Python DS/ML Ecosystem

Slide 14

What’s Ray?
● A simple/general-purpose library for distributed computing
● An ecosystem of Python Ray AI libraries (for scaling ML & more)
● Runs on laptop, public cloud, K8s, on-premise
● Easy to install and get started …. pip install ray[default]
A layered cake of functionality and capabilities for scaling ML workloads
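A minimal sketch of getting started, assuming a local machine with Ray installed via the pip command above:

# Start a local Ray runtime and inspect its resources.
import ray

ray.init()                          # on a cluster, pass an address instead
print(ray.cluster_resources())      # e.g. {'CPU': 8.0, 'memory': ...}
ray.shutdown()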

Slide 15

A layered cake and ecosystem
Ray AI Libraries enable simple scaling of AI workloads.

Slide 16

A Layered Cake and Ecosystem
Ray AI Libraries enable simple scaling of AI workloads.

Slide 17

AI libraries

Slide 18

Ray architecture & components

Slide 19

Anatomy of a Ray cluster
[Diagram: the Head Node runs the Driver, worker processes, the Global Control Store (GCS, unique to Ray), and a Raylet (scheduler + object store); Worker Nodes #1 … #N each run worker processes and their own Raylet (scheduler + object store)]

Slide 20

Anatomy of a Ray worker process

Slide 21

Creation of a Ray task …

# Driver code
A_ref = A.remote()

# Worker 1 code (running task A)
B_ref = B.remote()
return B_ref
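A minimal runnable sketch of the sequence above; the task names A and B are placeholders taken from the slide:

import ray

ray.init()

@ray.remote
def B():
    return "result from B"

@ray.remote
def A():
    # A task can itself submit tasks; B is scheduled elsewhere in the cluster.
    return ray.get(B.remote())

A_ref = A.remote()      # driver submits task A
print(ray.get(A_ref))   # 'result from B'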

Slide 22

Creation of a Ray object …

Slide 23

Creation of a Ray Actor …
handle_A = ActorA.remote()
handle_B = ActorB.remote()

Slide 24

Actor creation sequence …

Slide 25

Ray core design & scaling patterns

Slide 26

Ray basic design patterns
● Ray Parallel Tasks
  ○ Functions as stateless units of execution
  ○ Functions distributed across the cluster as tasks
● Ray Objects as Futures
  ○ Distributed immutable objects stored in the cluster
  ○ Fetched when materialized
  ○ Enable massive asynchronous parallelism
● Ray Actors
  ○ Stateful services on a cluster
  ○ Enable message passing
● Patterns for Parallel Programming
● Ray Distributed Library Integration Patterns

Slide 27

Scaling design patterns
[Diagram: workloads arranged along compute and data axes]
● Different data / same function → Batch Training / Inference
● Same data / different function → AutoML
● Different data / same function, different hyperparameters per job → Batch Tuning
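A hedged sketch of two of these patterns using plain Ray tasks; train_model and its arguments are illustrative, not from the slides:

import ray

ray.init()

@ray.remote
def train_model(data, lr):
    # stand-in for a real training routine
    return {"lr": lr, "score": sum(data) * lr}

# Different data / same function: one task per data shard.
shards = [[1, 2], [3, 4], [5, 6]]
refs = [train_model.remote(shard, 0.1) for shard in shards]

# Same data / different hyperparameters per job (batch tuning).
refs += [train_model.remote(shards[0], lr) for lr in (0.01, 0.1, 1.0)]

print(ray.get(refs))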

Slide 28

Python → Ray API

Function → Task:
def f(x):
    # do something with x:
    y = …
    return y

v = f(x)

@ray.remote
def f(x):
    # do something with x:
    y = …
    return y

v_ref = f.remote(x)

Class → Actor:
class Cls():
    def __init__(self, x): …
    def f(self, a): …
    def g(self, a): …

@ray.remote
class Cls():
    def __init__(self, x): …
    def f(self, a): …
    def g(self, a): …

cls = Cls.remote()
cls.f.remote(a)

NumPy array → Distributed immutable object:
import numpy as np
a = np.arange(1, 10e6)
b = a * 2

import numpy as np
a = np.arange(1, 10e6)
obj_a = ray.put(a)
b = ray.get(obj_a) * 2

Slide 29

Ray Task: a function remotely executed in a cluster

# Regular Python function, executed sequentially
def f(a, b):
    return a + b

f(1, 2)   # Result = 3
f(2, 3)   # Result = 5
f(3, 4)   # Result = 7
f(4, 5)   # Result = 9

# Ray task, executed in parallel across the cluster
@ray.remote(num_cpus=3)
def f(a, b):
    return a + b

f.remote(1, 2)   # ObjectRef; ray.get() returns 3
f.remote(2, 3)   # ObjectRef; ray.get() returns 5
f.remote(3, 4)   # ObjectRef; ray.get() returns 7
f.remote(4, 5)   # ObjectRef; ray.get() returns 9
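A runnable version of the task example above, for reference:

import ray

ray.init()

@ray.remote
def f(a, b):
    return a + b

refs = [f.remote(a, b) for a, b in [(1, 2), (2, 3), (3, 4), (4, 5)]]
print(ray.get(refs))    # [3, 5, 7, 9]; the four additions run in parallel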

Slide 30

Ray Actor: a class remotely executed in a cluster

@ray.remote(num_gpus=4)
class HostActor:
    def __init__(self):
        self.model = load_model("s3://model_checkpoint")
        self.num_devices = os.environ["CUDA_VISIBLE_DEVICES"]

    def inference(self, data):
        return self.model(data)

    def f(self, output):
        return f"{output} {self.num_devices}"

actor = HostActor.remote()       # Create an actor
actor.f.remote("hi")             # ObjectRef; ray.get() returns "hi 0,1,2,3"
actor.inference.remote(data)     # ObjectRef; ray.get() returns predictions…

[Diagram: clients call actor methods; the actor process on its host keeps the local state]
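A minimal runnable sketch of the actor pattern; it swaps the GPU-hosted model for a simple counter so it runs anywhere, and the class name is not from the slide:

import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0          # state lives inside the actor process

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()                             # create the actor
refs = [counter.increment.remote() for _ in range(3)]  # method calls hit the same state
print(ray.get(refs))                                   # [1, 2, 3]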

Slide 31

Distributed objects

Slide 32

Distributed object store

a = read_array(file1)
b = read_array(file2)
a_x = ray.put(a)
b_x = ray.put(b)
s_x = sum.remote(a_x, b_x)
val = ray.get(s_x)
print(val)

[Diagram: objects a, b, and s live in the shared object store; the sum() task reads a and b and produces s]
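A runnable sketch of the same flow; read_array on the slide is a placeholder, so NumPy arrays stand in for the file contents, and the remote function is named add because Python's built-in sum cannot be decorated directly:

import numpy as np
import ray

ray.init()

@ray.remote
def add(x, y):
    return x + y

a = np.arange(10)
b = np.arange(10)
a_x = ray.put(a)               # place the arrays in the shared object store
b_x = ray.put(b)
s_x = add.remote(a_x, b_x)     # the task receives the materialized arrays
print(ray.get(s_x))            # [ 0  2  4  6  8 10 12 14 16 18]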

Slide 33

When to use Ray

Slide 34

When to use Ray & Ray AI Libraries?
● Scale a single type of workload
  ○ Data ingestion for ML
  ○ Batch inference at scale
  ○ Distributed training
  ○ Only serving or online inference
● Scale end-to-end ML applications
● Run ecosystem libraries using a unified API (Data, Train, Tune, Serve)
● Build a custom ML platform
  ○ Spotify, Instacart
  ○ Pinterest & DoorDash
  ○ Samsara & Niantic
  ○ Uber Eats & LinkedIn

Slide 35

When to use Ray & Ray AI Libraries?
Scale a single type of workload
● Data ingestion for ML
● Batch inference at scale
● Distributed training

Slide 36

Ray for Data ingest
● Ray Datasets as a common data format
● Easily read from disk/cloud, or from other formats (images, CSV, Parquet, HF, etc.)
● Fully distributed
  ○ Can handle data too big to fit on one node, or even in the entire cluster
[Diagram: a Dataset feeding the Trainer's workers via Trainer.fit]

Slide 37

Ray Data overview: high-performance distributed IO

Read from storage:
ds = ray.data.read_parquet("s3://some/bucket")
ds = ray.data.read_csv("/tmp/some_file.csv")
● Leverages Apache Arrow’s high-performance IO
● Parallelized using Ray’s high-throughput task execution or actor pool execution
● Scales to PiB-scale jobs in production (Amazon)

Transform data:
ds = ds.map_batches(batch_func)
ds = ds.map(func)

Consume data:
ds.iter_batches()  # -> Iterator
ds.write_parquet("s3://some/bucket")
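A small runnable sketch of the read → transform → consume flow, using an in-memory dataset in place of the S3 paths on the slide:

import ray

ray.init()

# Stand-in for read_parquet / read_csv: build a Dataset from Python items.
ds = ray.data.from_items([{"value": i} for i in range(8)])

def double_batch(batch):
    batch["value"] = batch["value"] * 2
    return batch

ds = ds.map_batches(double_batch)          # transform in parallel batches

for batch in ds.iter_batches(batch_size=4):
    print(batch)                           # consume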

Slide 38

Simple batch inference example using user-defined functions (UDFs)
[Diagram: logical data flow across CPU stages]
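Since the slide's code isn't captured in this transcript, here is a hedged sketch of batch inference with a UDF; the "model" is a trivial stand-in, not a real checkpoint:

import ray

ray.init()

def predict_batch(batch):
    # pretend model: positive prediction when the feature is >= 5
    batch["prediction"] = batch["feature"] >= 5
    return batch

ds = ray.data.from_items([{"feature": i} for i in range(10)])
predictions = ds.map_batches(predict_batch)
print(predictions.take(3))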

Slide 39

A simple batch inference example

Slide 40

Multi-stage (heterogeneous) pipeline
[Diagram: Read → Preprocess (CPU) → Inference (GPU) → Save]

Slide 41

Heterogeneous pipeline (CPU + GPU)
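A hedged sketch of the CPU + GPU pipeline (Ray 2.x; the exact map_batches arguments vary a little across versions). The Predictor class, data, and output path are illustrative, and num_gpus=1 assumes a GPU node; drop it to run on CPU only:

import ray

ray.init()

class Predictor:
    def __init__(self):
        self.model = lambda x: x * 2                  # stand-in for a model loaded onto a GPU

    def __call__(self, batch):
        batch["prediction"] = self.model(batch["feature"])
        return batch

def preprocess(batch):
    batch["feature"] = batch["feature"] + 1           # CPU preprocessing stage
    return batch

ds = ray.data.from_items([{"feature": i} for i in range(100)])   # Read
ds = ds.map_batches(preprocess)                                  # Preprocess (CPU)
ds = ds.map_batches(
    Predictor,                                        # Inference (GPU actor pool)
    compute=ray.data.ActorPoolStrategy(size=2),
    num_gpus=1,
)
ds.write_parquet("/tmp/predictions")                             # Save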

Slide 42

Ray Train: Distributed ML/DL training
Ray Train is a library for developing, orchestrating, and scaling distributed deep learning applications.

Slide 43

Scaling across cluster … Ray Train: Integrates with deep learning frameworks
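A hedged sketch of scaling a training loop with Ray Train's TorchTrainer (Ray 2.x API; the train function here is a trivial stand-in rather than a real model):

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # Each of the num_workers processes runs this loop; a real job would build a
    # model, wrap it with ray.train.torch.prepare_model, and iterate over data.
    for epoch in range(config["epochs"]):
        ray.train.report({"epoch": epoch})

trainer = TorchTrainer(
    train_func,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)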

Slide 44

When to use Ray & Ray AI Libraries?
Scale end-to-end ML applications: Data, Train, Tune, Serve

Slide 45

Ray for end-to-end ML applications
[Diagram: Preprocessing → Training → Scoring → Serving, with Storage and Tracking spanning all stages]
Ray AI libraries can provide that … a single end-to-end application

Slide 46

Who’s using Ray ….
● 27,000+ GitHub stars
● 5,000+ repositories depend on Ray
● 1,000+ organizations using Ray
● 870+ community contributors

Slide 47

Who’s using Ray ….
● 27,000+ GitHub stars
● 5,000+ repositories depend on Ray
● 1,000+ organizations using Ray
● 870+ community contributors

Slide 48

🍎 Recap: Today we learned…
🚂 Why Ray & What’s the Ray Ecosystem
● Architecture components
● When to use Ray
● Who uses Ray at scale
🔬 Ray Design & Scaling Patterns & APIs
● Ray Tasks, Actors, Objects

Slide 49

🔗 Reading list
● Ray Education GitHub: access bonus notebooks and scripts about Ray
● Ray documentation: API references and user guides
● Anyscale Blogs: real-world use cases and announcements
● YouTube Tutorials: video walkthroughs about learning LLMs with Ray

Slide 50

🔗 Resources
● How to fine tune and serve LLMs simply, quickly and cost effectively using Ray + DeepSpeed + HuggingFace
● Get started with DeepSpeed and Ray
● Training 175B Parameter Language Models at 1000 GPU scale with Alpa and Ray
● Fast, flexible, and scalable data loading for ML training with Ray Data
● Ray Serve: Tackling the cost and complexity of serving AI in production
● Scaling Model Batch Inference in Ray: Using Actors, ActorPool, and Ray Data
● Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications (part-1)
● Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2 (part-2)

Slide 51

Ray Demo