Modern Compute Stack for Scaling Large AI/ML/LLM Workloads

Advanced machine learning (ML) models, particularly large language models (LLMs), require scaling beyond a single machine. As open-source LLMs become more prevalent on platforms and model hubs like HuggingFace (HF), ML practitioners and GenAI developers are increasingly inclined to fine-tune these models with their private data to suit their specific needs.

However, several concerns arise: which compute infrastructure should be used for distributed fine-tuning and training? How can ML workloads be effectively scaled for data ingestion, training/tuning, or inference? How can large models be accommodated within a cluster? And how can CPUs and GPUs be optimally utilized?

Fortunately, an opinionated stack is emerging among ML practitioners, leveraging open-source libraries.

This session focuses on the integration of HuggingFace and Ray AI Runtime (AIR), enabling scaling of model training and data loading. We’ll delve into implementation details, explore the Transformer APIs, and demonstrate how Ray AIR facilitates an end-to-end ML workflow, encompassing data ingestion, training/tuning, and inference.

By exploring the integration between HF and Ray AI Libraries, we’ll discuss how Ray’s orchestration capabilities fulfill computation and memory requirements. Also, we’ll showcase how existing HF Transformer APIs, DeepSpeed, and Accelerate code can seamlessly integrate with Ray Trainers and demonstrate its capabilities within this emerging component stack.

Finally, we’ll demonstrate how to fine-tune an open-source LLM using HF Transformer APIs with Ray Data and Ray Trainers.

Anyscale

October 08, 2023

Transcript

  1. $whoami • Lead Developer Advocate, Anyscale & Ray Team •

    Sr. Developer Advocate, Databricks, Apache Spark/MLflow Team • Led Developer Advocacy, Hortonworks • Held SWE positions: ◦ Sun Microsystems ◦ Netscape ◦ @Home ◦ Loudcloud/Opsware ◦ Verisign
  2. Who do I work for … Who we are: Original creators

    of Ray, a unified general-purpose framework for scalable distributed computing What we do: Scalable compute for AI as managed service, with Ray at its core, and the best platform to develop & run AI apps Why we do it: Scaling is a necessity, scaling is hard; make distributed computing easy and simple for everyone
  3. Agenda • Challenges with existing ML/AI stack ◦ Scaling AI/ML

    large workloads ◦ Infrastructure management • What is Ray & Why Ray AI Libraries? ◦ Ray Data & Ray Trainers • Emerging modern stack for LLMs ◦ Challenges of distributed training for LLMs ◦ 🤗 + Ray AI Libraries == easy distributed training • Demo ◦ Fine-tuning & scaling an LLM model with 🤗 + Ray AI Libraries
  4. Challenge #1 Still not easy to go from dev to

    prod at scale. preprocess.py train.py eval.py run_workflow.py
  5. Challenge #2 What happens when your ML infra gets out

    of date? preprocess.py train.py eval.py run_workflow.py
  6. Key problems of existing ML infrastructure • Scaling is hard, especially

     for data scientists • Platform solutions can limit flexibility • But custom infrastructure is too hard
  7. How do we fix these problems? We want to address

     them! 1. Increase developer velocity 2. Manage complex infrastructure 3. Scale end-to-end ML pipelines • We want simplicity with the blessings of scale!
  8. What we desire… simplicity & scale [pipeline: Storage and Tracking •

     Preprocessing • Training • Scoring • Serving] Ray AI libraries can provide that as a single application
  9. What is Ray… ? • A simple/general-purpose library for distributed

    computing • Comes w/ unified Python Ray AI Libraries (for scaling ML and more) • Runs on laptop, public cloud, K8s, on-premise A layered cake of functionality and capability for scaling ML workloads
  10. Who’s using Ray… • 27,000+ GitHub stars •

     5,000+ repositories depend on Ray • 1,000+ organizations using Ray • 870+ community contributors
  11. What’s the ML/AI/LLM stack? ZeRO-3 for fine-tuning

     OSS LLMs • Ray for orchestration, scaling & accelerators (GPUs, TPUs, Inferentia 1/2, Trainium) • Ray Data for data ingest
  12. When to use Ray AI Libraries? Scale a single type

    of workload Scale end-to-end ML applications Run ecosystem libraries using a unified API Build a custom ML platform
  13. Ray AI Libraries: Ray Data ingest • Ray Datasets

     as a common data format • Easily read from disk/cloud, or from other formats (images, CSV, Parquet, HF, etc.) • Fully distributed • Can handle data too big to fit on one node or even the entire cluster [diagram: Dataset → Trainer.fit → Trainer with four workers]
  14. Ray Data overview High performance distributed IO

     Read from storage:
     ds = ray.data.read_parquet("s3://some/bucket")
     ds = ray.data.read_csv("/tmp/some_file.csv")
     Transform data:
     ds = ds.map_batches(batch_func)
     ds = ds.map(func)
     Consume data:
     ds.iter_batches() -> Iterator
     ds.write_parquet("s3://some/bucket")
     Leverages Apache Arrow’s high-performance IO • Parallelized using Ray’s high-throughput task execution or actor pool execution • Scales to PiB-scale jobs in production (Amazon)
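The read → transform → consume pattern above can be pictured without a cluster at all. The following is a plain-Python, Ray-free sketch of what a batch-level transformation like `map_batches` does conceptually (the helper names `iter_batches`, `map_batches`, and `normalize` are illustrative, not Ray's implementation):

```python
# Conceptual, Ray-free sketch of the pattern above:
# records -> map_batches(batch_func) -> iterate results.
# A real Ray Dataset runs these steps distributed across a cluster.

def iter_batches(records, batch_size):
    """Yield fixed-size batches from a sequence of records."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def map_batches(records, batch_func, batch_size=2):
    """Apply batch_func to each batch and flatten the results."""
    out = []
    for batch in iter_batches(records, batch_size):
        out.extend(batch_func(batch))
    return out

# A batch-level UDF: normalize a batch of raw values.
def normalize(batch):
    return [x / 100.0 for x in batch]

rows = [10, 20, 30, 40, 50]      # stand-in for rows read from Parquet/CSV
processed = map_batches(rows, normalize)
print(processed)                 # [0.1, 0.2, 0.3, 0.4, 0.5]
```

Because the UDF sees whole batches rather than single rows, the work splits naturally across parallel tasks or actors, which is what makes the batch-oriented API scale.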
  15. Ray Data: Preprocessors • Ray Data provides out-of-the-box preprocessors

     for common ML tasks • Write your own UDFs for the map/apply APIs
  16. Ray Train: Distributed ML/DL training Ray Train is a library

    for developing, orchestrating, and scaling distributed deep learning applications.
  17. Compatibility: Integrates with deep learning frameworks • PyTorch setup

     steps: set up the distributed env • set up the DDP model • set up the distributed sampler • move batches to GPU
  18. Define a storage_path to persist checkpoints and artifacts to •

    Cloud storage: AWS S3, GCS, … • Shared file systems: Hadoop HDFS, AWS EFS, GCP Filestore, … Compatibility: Integrates with persistent storage
  19. Ray Train: Trainer parallelization • PyTorch DDP on a Ray

     Cluster ◦ FSDP, DeepSpeed supported • Abstracts away infrastructure • Supports CPU, GPU, TPU, etc. workers [diagram: Trainer → Workers 1–4]
  20. Trends & challenges in deep learning training: Large Datasets →

     Data Parallelism • Large Models → Model Parallelism • both require Distributed Training
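The data-parallel strategy above (the approach PyTorch DDP takes) can be illustrated with a toy linear model: each worker computes a gradient on its own shard of the data, the gradients are averaged (an all-reduce), and every replica applies the same update so weights stay in sync. This is a plain-Python illustration, not the Ray or PyTorch API:

```python
# Toy data parallelism: each "worker" computes the gradient of a
# squared loss for a linear model y = w * x on its own data shard;
# the gradients are averaged (all-reduce) so every replica applies
# an identical update and the model weights stay synchronized.

def shard_gradient(w, shard):
    # d/dw of mean((w*x - y)^2) over this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    grads = [shard_gradient(w, s) for s in shards]  # runs in parallel in reality
    avg_grad = sum(grads) / len(grads)              # all-reduce (mean)
    return w - lr * avg_grad

data = [(x, 2.0 * x) for x in range(1, 9)]  # ground truth: w = 2
shards = [data[:4], data[4:]]               # one shard per worker
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # converges toward 2.0
```

The same structure underlies DDP at scale: only the gradient exchange crosses the network, while each worker's forward/backward pass touches just its own slice of the batch.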
  21. Challenges in distributed training: Compatibility ⚙ • Scalability 🚀 •

     Large Model Training • DeepSpeed: an OSS, production-ready solution that is part of the stack
  22. LLM support: DeepSpeed and Accelerate • Supported distributed

     strategies include: • ZeRO • Pipeline Parallelism • Tensor Parallelism [diagram: model-parallel vs. data-parallel axes]
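Of the strategies listed, tensor parallelism is the easiest to picture with a toy matrix multiply: the weight matrix is split column-wise across workers, each worker computes its slice of the output from the columns it holds, and the slices are concatenated (an all-gather). A plain-Python sketch of the idea, not the DeepSpeed API (the helpers `matvec` and `split_columns` are illustrative):

```python
# Toy tensor parallelism: y = x @ W with W split column-wise across
# two "workers". Each worker holds only its columns of W, computes a
# partial output, and the partials are concatenated (an all-gather).

def matvec(x, W):
    """x: length-n vector; W: n x m matrix (list of rows). Returns length-m vector."""
    m = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(m)]

def split_columns(W, parts):
    """Split matrix W column-wise into `parts` equal shards."""
    m = len(W[0])
    step = m // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, 2)                       # each worker gets 2 columns
partials = [matvec(x, shard) for shard in shards]  # computed in parallel
y = [v for partial in partials for v in partial]   # all-gather / concatenate
assert y == matvec(x, W)                           # same result as unsharded
print(y)
```

No single worker ever materializes the full weight matrix, which is exactly why this family of strategies (alongside ZeRO's partitioned optimizer states) lets models larger than one GPU's memory be trained.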
  23. What’s the LLM stack? Fine-tuning OSS pretrained models: ZeRO-3

     for fine-tuning OSS LLMs • Ray for orchestration, scaling & accelerators (GPUs, TPUs, Inferentia 1/2, Trainium) • Ray Data for data ingest
  24. Ray AI + 🤗 Trainer: Implementation • Distributed Data

     Parallel/FSDP + DeepSpeed training on a Ray Cluster ◦ Takes advantage of PyTorch DDP & Hugging Face support for it • Runs user-defined Hugging Face code without any changes • Automatically converts Ray Datasets to the format expected by Hugging Face
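The automatic conversion mentioned above can be pictured as a row-oriented → column-oriented reshaping: Ray Data batches are often row-oriented records, while Hugging Face pipelines expect a columnar dict of lists. A hypothetical plain-Python sketch of that idea (the function `rows_to_columns` is illustrative, not the actual ray.data internals):

```python
# Conceptual sketch: convert row-oriented records (how a batch of a
# Ray Dataset can be viewed) into the column-oriented dict-of-lists
# shape that Hugging Face datasets/tokenizers expect.
# Illustrative only -- the real conversion is inside Ray's HF integration.

def rows_to_columns(rows):
    """[{'text': ..., 'label': ...}, ...] -> {'text': [...], 'label': [...]}"""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

rows = [
    {"text": "great food", "label": 4},
    {"text": "terrible service", "label": 0},
]
columns = rows_to_columns(rows)
print(columns)
# {'text': ['great food', 'terrible service'], 'label': [4, 0]}
```

Because the integration performs this translation for you, the Trainer code a user writes against Hugging Face datasets needs no changes to consume Ray Datasets.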
  25. Fine-tuning GPT-J-6B: Ray AIR + HF + DeepSpeed •

     An easy way to fine-tune an OSS LLM: EleutherAI GPT-J-6B (pretrained on the 825 GB Pile dataset) + tiny-shakespeare (40K lines, ~1.2 MB) → fine-tuned EleutherAI GPT-J-6B (Pile + tiny-shakespeare)
  26. 🤗 training workflow

     dataset = load_dataset("yelp_review_full")
     train_dataset, eval_dataset = dataset["train"], dataset["test"]
     model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
     training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
     trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
     trainer.train()
  27. 🤗 training workflow, distributed with Ray AI Libraries

     dataset = load_dataset("yelp_review_full")
     train_dataset, eval_dataset = dataset["train"], dataset["test"]

     def trainer_init_per_worker(train_dataset, eval_dataset, **config):
         model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
         training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
         trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
         return trainer
  28. 🤗 training workflow, distributed with Ray AI Libraries

     dataset = load_dataset("yelp_review_full")
     train_dataset, eval_dataset = dataset["train"], dataset["test"]

     def trainer_init_per_worker(train_dataset, eval_dataset, **config):
         model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
         training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
         trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
         return trainer

     trainer = TorchTrainer(
         trainer_init_per_worker=trainer_init_per_worker,
         scaling_config=ScalingConfig(num_workers=3, use_gpu=True),
         datasets={"train": ray.data.from_huggingface(train_dataset),
                   "evaluation": ray.data.from_huggingface(eval_dataset)},
     )
  29. 🤗 training workflow, distributed with Ray AI Libraries

     dataset = load_dataset("yelp_review_full")
     train_dataset, eval_dataset = dataset["train"], dataset["test"]

     def trainer_init_per_worker(train_dataset, eval_dataset, **config):
         model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
         training_args = TrainingArguments(f"{model_checkpoint}-yelp", evaluation_strategy="epoch")
         trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
         return trainer

     trainer = TorchTrainer(
         trainer_init_per_worker=trainer_init_per_worker,
         scaling_config=ScalingConfig(num_workers=3, use_gpu=True),
         datasets={"train": ray.data.from_huggingface(train_dataset),
                   "evaluation": ray.data.from_huggingface(eval_dataset)},
     )
     result = trainer.fit()
  30. Summary • Outlined & explored existing challenges & trends in

    scaling workloads • Offered an opinionated emerging modern stack for ML/AI/LLMs • Provided insight and intuition into Ray Data + Ray Train • Demonstrated the modern stack to fine-tune an LLM
  31. Resources • How to fine tune and serve LLMs simply,

    quickly and cost effectively using Ray + DeepSpeed + HuggingFace • Get started with DeepSpeed and Ray • Training 175B Parameter Language Models at 1000 GPU scale with Alpa and Ray • Fast, flexible, and scalable data loading for ML training with Ray Data • Ray Serve: Tackling the cost and complexity of serving AI in production • Scaling Model Batch Inference in Ray: Using Actors, ActorPool, and Ray Data • Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications (part-1) • Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2 (part-2)
  32. Remember to vote and share feedback on the QCon App.

     Any questions? Email: [email protected] X: @2twitme