Slide 1

Slide 1 text

Modern compute stack for scaling large ML/AI/LLM workloads Jules S. Damji Anyscale (Ray Team)

Slide 2

Slide 2 text

A quick poll …

Slide 3

Slide 3 text

$whoami
● Lead Developer Advocate, Anyscale & Ray Team
● Sr. Developer Advocate, Databricks, Apache Spark/MLflow Team
● Led Developer Advocacy, Hortonworks
● Held SWE positions:
○ Sun Microsystems
○ Netscape
○ @Home
○ Loudcloud/Opsware
○ Verisign

Slide 4

Slide 4 text

Who do I work for …
Who we are: Original creators of Ray, a unified, general-purpose framework for scalable distributed computing
What we do: Scalable compute for AI as a managed service, with Ray at its core, and the best platform to develop & run AI apps
Why we do it: Scaling is a necessity, and scaling is hard; we make distributed computing easy and simple for everyone

Slide 5

Slide 5 text

Agenda
● Challenges with the existing ML/AI stack
○ Scaling large AI/ML workloads
○ Infrastructure management
● What is Ray & why the Ray AI Libraries?
○ Ray Data & Ray Train
● Emerging modern stack for LLMs
○ Challenges of distributed training for LLMs
○ 🤗 + Ray AI Libraries == easy distributed training
● Demo
○ Fine-tuning & scaling an LLM with 🤗 + Ray AI Libraries

Slide 6

Slide 6 text

Challenges 🙄 with existing ML/AI stack

Slide 7

Slide 7 text

Challenge #1 Still not easy to go from dev to prod at scale. preprocess.py train.py eval.py run_workflow.py

Slide 8

Slide 8 text

Challenge #2 What happens when your ML infra gets out of date? preprocess.py train.py eval.py run_workflow.py

Slide 9

Slide 9 text

Key problems of existing ML infrastructure
● Scaling is hard, especially for data scientists
● Platform solutions can limit flexibility
● But custom infrastructure is too hard

Slide 10

Slide 10 text

How do we fix these problems?
1. Increase developer velocity
2. Manage complex infrastructure
3. Scale end-to-end ML pipelines
● We want simplicity with the blessings of scale!

Slide 11

Slide 11 text

Analogy of simpler times … the good ole days! A filesystem and a single sklearn script

Slide 12

Slide 12 text

What we desire … simplicity & scale
Preprocessing, Training, Scoring, Serving, Storage and Tracking, all as a single application
Ray AI Libraries can provide that …

Slide 13

Slide 13 text

What is Ray … ?
● A simple, general-purpose library for distributed computing
● Comes with the unified Python Ray AI Libraries (for scaling ML and more)
● Runs on a laptop, public cloud, K8s, or on-premise
A layered cake of functionality and capability for scaling ML workloads

Slide 14

Slide 14 text

A layered cake & ecosystem Library + app ecosystem Ray core

Slide 15

Slide 15 text

A layered cake & ecosystem

Slide 16

Slide 16 text

Who’s using Ray …
27,000+ GitHub stars
5,000+ repositories depend on Ray
1,000+ organizations using Ray
870+ community contributors

Slide 17

Slide 17 text

Ray AI Libraries: Ray Data + Ray Train …

Slide 18

Slide 18 text

What’s the ML/AI/LLM stack?
● ZeRO-3 for fine-tuning OSS LLMs
● For orchestration, scaling & accelerators (GPUs, TPUs, Inferentia 1/2, Trainium)
● Ray Data

Slide 19

Slide 19 text

When to use Ray AI Libraries?
● Scale a single type of workload
● Scale end-to-end ML applications
● Run ecosystem libraries using a unified API
● Build a custom ML platform

Slide 20

Slide 20 text

Ray AI Libraries: Ray Data ingest
● Ray Datasets as a common data format
● Easily read from disk/cloud, or from other formats (images, CSV, Parquet, Hugging Face, etc.)
● Fully distributed
● Can handle data too big to fit on one node, or even the entire cluster
(Diagram: a Dataset feeding a Trainer and its workers via Trainer.fit)

Slide 21

Slide 21 text

Ray Data overview
High-performance distributed IO
● Leverages Apache Arrow’s high-performance IO
● Parallelized using Ray’s high-throughput task execution or actor-pool execution
● Scales to PiB-scale jobs in production (Amazon)
Read from storage:
ds = ray.data.read_parquet("s3://some/bucket")
ds = ray.data.read_csv("/tmp/some_file.csv")
Transform data:
ds = ds.map_batches(batch_func)
ds = ds.map(func)
Consume data:
ds.iter_batches() -> Iterator
ds.write_parquet("s3://some/bucket")
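Putting the three steps together, a minimal end-to-end sketch might look like the following (the bucket paths, the "value" column, and the normalize_batch UDF are illustrative placeholders, not from the slides):

import ray

# Read from storage: creates a distributed Dataset backed by Arrow blocks
ds = ray.data.read_parquet("s3://some/bucket")  # hypothetical input path

# Transform data: a vectorized UDF applied to each batch (a dict of NumPy arrays)
def normalize_batch(batch):
    batch["value"] = batch["value"] / batch["value"].max()  # assumes a numeric "value" column
    return batch

ds = ds.map_batches(normalize_batch)

# Consume data: iterate locally or write back out to storage
for batch in ds.iter_batches(batch_size=1024):
    pass  # e.g., feed each batch to a model

ds.write_parquet("s3://some/other/bucket")  # hypothetical output path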

Slide 22

Slide 22 text

Ray Data: Preprocessors
● Ray Data provides out-of-the-box preprocessors for common ML tasks
● Write your own UDFs and apply them with the map APIs (see the sketch below)
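A minimal sketch of one built-in preprocessor, assuming the StandardScaler from ray.data.preprocessors and an illustrative numeric "value" column:

import ray
from ray.data.preprocessors import StandardScaler

# A toy dataset with a single numeric column (column name is illustrative)
ds = ray.data.from_items([{"value": float(i)} for i in range(100)])

# Fit the scaler on the dataset, then apply it
scaler = StandardScaler(columns=["value"])
ds = scaler.fit_transform(ds)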

Slide 23

Slide 23 text

Simple batch inference example
Using user-defined functions (UDFs)
Logical data flow:

Slide 24

Slide 24 text

A simple batch inference example
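A minimal sketch of what this could look like with a stateful, class-based UDF and map_batches; the Predictor class, its toy model, and the "value" column are illustrative stand-ins rather than the code from the slide:

import ray

class Predictor:
    def __init__(self):
        # Load the model once per actor (placeholder for a real model)
        self.model = lambda x: x * 2

    def __call__(self, batch):
        # batch is a dict of NumPy arrays; add a predictions column
        batch["prediction"] = self.model(batch["value"])
        return batch

ds = ray.data.from_items([{"value": float(i)} for i in range(1000)])
# Run inference in parallel across an actor pool
# (the concurrency argument assumes a recent Ray version; older releases used compute=ActorPoolStrategy)
preds = ds.map_batches(Predictor, batch_size=128, concurrency=2)
preds.show(3)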

Slide 25

Slide 25 text

Multi-stage (heterogeneous) pipeline
Read → Preprocess → Inference → Save, mixing CPU and GPU stages

Slide 26

Slide 26 text

Heterogeneous pipeline (CPU + GPU)
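A minimal sketch of such a pipeline, with a CPU-based preprocessing UDF and a GPU-backed inference stage; the preprocess function, GpuPredictor model, and bucket paths are placeholders, and num_gpus tells Ray Data to schedule the inference actors on GPU resources:

import ray

def preprocess(batch):
    # CPU stage: cheap, stateless transformation run as Ray tasks
    batch["value"] = batch["value"] / 255.0
    return batch

class GpuPredictor:
    def __init__(self):
        # GPU stage: load the model once per actor (placeholder model here)
        self.model = lambda x: x + 1

    def __call__(self, batch):
        batch["prediction"] = self.model(batch["value"])
        return batch

ds = ray.data.read_parquet("s3://some/bucket")      # hypothetical input path
ds = ds.map_batches(preprocess)                     # runs on CPU
ds = ds.map_batches(GpuPredictor, num_gpus=1, batch_size=64, concurrency=2)  # runs on GPUs
ds.write_parquet("s3://some/output/bucket")         # hypothetical output path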

Slide 27

Slide 27 text

Ray Train: Distributed ML/DL training Ray Train is a library for developing, orchestrating, and scaling distributed deep learning applications.

Slide 28

Slide 28 text

Scaling across a cluster … Compatibility: Integrates with deep learning frameworks

Slide 29

Slide 29 text

Compatibility: Integrates with deep learning frameworks
PyTorch:
● Set up the distributed env
● Set up the DDP model
● Set up the distributed sampler
● Move batches to GPU
(see the Ray Train sketch below)
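With Ray Train, those four manual steps are largely handled for you by prepare_model and prepare_data_loader. A minimal sketch of a per-worker training loop, assuming a tiny hypothetical model and dataloader and import paths from recent Ray versions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    model = nn.Linear(10, 1)
    # Wraps the model in DDP and moves it to the right device
    model = ray.train.torch.prepare_model(model)

    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32)
    # Adds a DistributedSampler and moves batches to the GPU/CPU automatically
    loader = ray.train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for epoch in range(2):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()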

Slide 30

Slide 30 text

Compatibility: Integrates with deep learning frameworks Hugging Face Transformers

Slide 31

Slide 31 text

Compatibility: Integrates with persistent storage
Define a storage_path to persist checkpoints and artifacts to:
● Cloud storage: AWS S3, GCS, …
● Shared file systems: Hadoop HDFS, AWS EFS, GCP Filestore, …
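A minimal sketch of setting this on a trainer via RunConfig; the bucket, run name, and stubbed training loop are placeholders:

import ray
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    pass  # training loop omitted; see the earlier Ray Train sketch

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    # Checkpoints and artifacts are persisted under this path (placeholder bucket and run name)
    run_config=RunConfig(storage_path="s3://my-bucket/ray-results", name="finetune-run"),
)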

Slide 32

Slide 32 text

Ray Train: Trainer parallelization
● PyTorch DDP on a Ray Cluster
○ FSDP and DeepSpeed also supported
● Abstracts away infrastructure
● Supports CPU, GPU, TPU, etc. workers
(Diagram: a Trainer coordinating Worker 1–4)
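The worker count and per-worker resources are expressed declaratively through ScalingConfig; a minimal sketch with illustrative resource values:

from ray.train import ScalingConfig

# Four training workers, each pinned to one GPU and eight CPUs (illustrative values)
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    resources_per_worker={"GPU": 1, "CPU": 8},
)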

Slide 33

Slide 33 text

🤗 + Ray Train: Training, Scaling & Fine-Tuning LLMs

Slide 34

Slide 34 text

Trends & challenges in deep learning training
● Large datasets → data parallelism
● Large models → model parallelism
● Both lead to distributed training

Slide 35

Slide 35 text

Challenges in distributed training
● Compatibility ⚙
● Scalability 🚀
● Large model training
An OSS, production-ready solution as part of the stack

Slide 36

Slide 36 text

LLM support: DeepSpeed and Accelerate
Supported distributed strategies include:
● ZeRO
● Pipeline Parallelism
● Tensor Parallelism
(Diagram: model and data partitioned across workers; ZeRO config sketch below)
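With Hugging Face Transformers, DeepSpeed ZeRO can be enabled directly through TrainingArguments. A minimal sketch with an illustrative ZeRO stage 3 config; the specific values and output directory are placeholders, not tuned settings:

from transformers import TrainingArguments

# Illustrative ZeRO-3 config; "auto" lets DeepSpeed inherit values from TrainingArguments
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    "gpt-j-finetune",              # hypothetical output directory
    per_device_train_batch_size=1,
    deepspeed=deepspeed_config,    # enables DeepSpeed ZeRO-3 under the HF Trainer
)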

Slide 37

Slide 37 text

What’s the LLM stack? Fine-tuning OSS pretrained models
● ZeRO-3 for fine-tuning OSS LLMs
● For orchestration, scaling & accelerators (GPUs, TPUs, Inferentia 1/2, Trainium)
● Ray Data

Slide 38

Slide 38 text

Ray AI + 🤗 Trainer: Implementation
● Distributed Data Parallel/FSDP + DeepSpeed training on a Ray Cluster
○ Takes advantage of PyTorch DDP & Hugging Face support for it
● Runs user-defined Hugging Face code without any changes
● Automatically converts Ray Datasets to the format expected by Hugging Face

Slide 39

Slide 39 text

Fine-tuning GPT-J-6B with Ray AIR + HF + DeepSpeed
An easy way to fine-tune an OSS LLM …
● Base model: EleutherAI GPT-J-6B, pretrained on the Pile dataset (825 GB)
● Fine-tuning data: tiny-shakespeare (40K lines, ~1.2 MB)
● Result: fine-tuned EleutherAI GPT-J-6B (Pile + tiny-shakespeare)

Slide 40

Slide 40 text

🤗 training workflow

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model_checkpoint = "bert-base-cased"
dataset = load_dataset("yelp_review_full")
train_dataset, eval_dataset = dataset["train"], dataset["test"]

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

Slide 41

Slide 41 text

🤗 training workflow, distributed with Ray AI Libraries

dataset = load_dataset("yelp_review_full")
train_dataset, eval_dataset = dataset["train"], dataset["test"]

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
    training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
    return trainer

Slide 42

Slide 42 text

🤗 training workflow, distributed with Ray AI Libraries

dataset = load_dataset("yelp_review_full")
train_dataset, eval_dataset = dataset["train"], dataset["test"]

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
    training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
    return trainer

trainer = TorchTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=3, use_gpu=True),
    datasets={"train": ray.data.from_huggingface(train_dataset),
              "evaluation": ray.data.from_huggingface(eval_dataset)},
)

Slide 43

Slide 43 text

🤗 training workflow, distributed with Ray AI Libraries

dataset = load_dataset("yelp_review_full")
train_dataset, eval_dataset = dataset["train"], dataset["test"]

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
    training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
    return trainer

trainer = TorchTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=3, use_gpu=True),
    datasets={"train": ray.data.from_huggingface(train_dataset),
              "evaluation": ray.data.from_huggingface(eval_dataset)},
)

result = trainer.fit()
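The returned Result object gives access to the final metrics and the last persisted checkpoint; a minimal sketch, with attribute names as in recent Ray versions:

# Inspect the outcome of training
print(result.metrics)      # final reported metrics, e.g. loss per epoch
print(result.checkpoint)   # handle to the last persisted checkpoint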

Slide 44

Slide 44 text

🤗 + Ray AI Libraries + LLM Demo

Slide 45

Slide 45 text

Try it out … https://bit.ly/ray-llm-examples

Slide 46

Slide 46 text

Demo recording …

Slide 47

Slide 47 text

Summary
● Outlined & explored existing challenges & trends in scaling workloads
● Offered an opinionated, emerging modern stack for ML/AI/LLMs
● Provided insight and intuition into Ray Data + Ray Train
● Demonstrated the modern stack to fine-tune an LLM

Slide 48

Slide 48 text

Ray + LLM Workshop … Friday

Slide 49

Slide 49 text

Resources
● How to fine-tune and serve LLMs simply, quickly and cost-effectively using Ray + DeepSpeed + HuggingFace
● Get started with DeepSpeed and Ray
● Training 175B Parameter Language Models at 1000 GPU scale with Alpa and Ray
● Fast, flexible, and scalable data loading for ML training with Ray Data
● Ray Serve: Tackling the cost and complexity of serving AI in production
● Scaling Model Batch Inference in Ray: Using Actors, ActorPool, and Ray Data
● Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications (part 1)
● Fine-Tuning LLMs: LoRA or Full-Parameter? An In-depth Analysis with Llama 2 (part 2)

Slide 50

Slide 50 text

Remember to vote and share feedback on the QCon App. Any questions? Email: [email protected] X: @2twitme