
I'm Getting Dizzy, I Tell You!: PALI, PALI PALI and PALANG - 황은진

Lablup Inc.

November 27, 2024

Transcript

1. AI Inference
    Enables the utilization of AI models to make predictions and generate outputs based on new, unseen data
2. Generative AI and Inference
    Characteristics of Generative AI models
     – Language models, image generators…
     – Require specialized inference approaches
    Different resource requirements
     – Often have large memory footprints and computational demands
    Importance of efficient inference
     – Crucial for generating high-quality outputs in real time
3. Challenges of AI Inference
    Diverse model architectures & frameworks
    Low latency & high throughput
    Scalability & resource management
    Varying hardware requirements & constraints
4. Inference Toolkit / Serving Solutions
    TensorFlow Serving (Google, 2016~), Triton Inference Server (NVIDIA, 2018~), OpenVINO (Intel, 2018~), ONNXRuntime (Microsoft, 2018~), RedisAI (RedisAI, 2019~), TorchServe (Facebook, 2020~), Seldon Core (SeldonIO, 2018~), KServe (Google, 2020~)
     – Categories: framework-specific, multi-model format, model server wrapper, K8s-specific
    LLM-specific: Triton (OpenAI, 2023~), TensorRT-LLM (NVIDIA, 2023~), Llama.cpp / ggml (ggml, 2023~), vLLM (2023~), Keras-model (Google, 2023~)
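To make the LLM-specific group concrete, here is a minimal offline-inference sketch using vLLM, one of the engines listed above. The checkpoint name is only an example; any Hugging Face model that vLLM supports could be substituted.

```python
from vllm import LLM, SamplingParams

# Load a small example checkpoint; any vLLM-supported model can be used.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in a single call.
outputs = llm.generate(["Explain AI inference in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```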
5. Puzzles of Inference
    Rapid updates of models and model serving software
     – Difficulty in keeping up with frequent releases of improved model serving software
    Challenges in deploying model + model serving software as container images
     – Difficulty in adapting to fast-paced changes in the ecosystem
    Complexity of integrating model serving software optimized for various AI accelerator hardware with models
     – Challenges in managing optimized combinations of software and models for each hardware platform
    Complexity of managing the interoperability matrix between various model serving software and supported models
     – Difficulty in managing and testing the matrix of supported model and software combinations
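As a hedged illustration of the interoperability-matrix problem described above, the sketch below models supported engine/model combinations as a small lookup table. The engine names, versions, and model names are hypothetical placeholders, not an actual support matrix.

```python
# Hypothetical interoperability matrix: which serving engine (and version)
# can serve which model. All names and version pins are illustrative only.
SUPPORTED = {
    ("vllm", "0.4"): {"llama-3-8b", "mistral-7b"},
    ("tensorrt-llm", "0.9"): {"llama-3-8b"},
    ("onnxruntime", "1.17"): {"bert-base", "resnet-50"},
}

def compatible_engines(model_name: str) -> list[tuple[str, str]]:
    """Return every (engine, version) pair known to serve the given model."""
    return [key for key, models in SUPPORTED.items() if model_name in models]

print(compatible_engines("llama-3-8b"))
# [('vllm', '0.4'), ('tensorrt-llm', '0.9')]
```

Even this toy table shows why the matrix is hard to maintain by hand: every new engine release or model architecture multiplies the combinations that must be tested.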
6. Solving Puzzles of Inference
    Backend.AI's approach: separating models and serving software, providing automatic conversion and integration
     – Enabling independent updates and management of models and serving software
     – Providing unified management and abstraction for various AI accelerator hardware
7. Inference at Scale: Super-small to Ultra-large
    Scalability: a critical aspect of inference
    Handles inference at various scales
     – From NVIDIA Jetson Nano & RPi4 to 2,000-node clusters
    Covers both efficient resource utilization and automatic scaling
8. Other Technical Difficulties
    Reduce resource consumption
     – Integrates model optimization, quantization, and compression toolkits
     – Reduces the memory footprint and computational requirements of AI models
    Minimize latency & maximize throughput
     – Smart model caching mechanisms
     – Efficient data loading strategies
    Complex & resource-intensive models
     – Backend.AI offers seamless and performant inference
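One concrete footprint-reduction technique of the kind mentioned above is post-training dynamic quantization. The PyTorch sketch below is a generic, minimal example on a toy model; it is illustrative and not a depiction of Backend.AI's integrated toolkits.

```python
import torch
import torch.nn as nn

# A toy model standing in for a real inference workload.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear layers are swapped for int8
# equivalents, shrinking the memory footprint of the weights.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```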
9. Bring the Best of AI Training to AI Inference
    Leveraging the advancements in AI training for inference
     – Can the same techniques optimized for training large models be used for inference?
    Goals
     – Applying best practices and techniques from training to optimize inference performance
     – Ensuring seamless integration between training and inference pipelines
10. Backend.AI Suite
    Model service
     – Backend.AI 23.03 (Alpha) ~ 24.03 (Official)
     – Model session management: Sokovan-controlled inference
     – Model traffic: Backend.AI AppProxy v4
     – Model fine-tuning pipeline: FastTrack
    Flexibility, performance, and ease of use
     – Dynamic inference: combining models and inference engines based on runtime requests
     – vLLM, TensorRT-LLM + Triton, ONNXRuntime as pre-built model engines
     – Automatic model loading, dynamic batching, and request prioritization
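Dynamic batching, listed above, can be illustrated with a generic sketch: incoming requests are collected until a batch fills up or a short timeout expires, and the whole batch is then handed to the engine in one call. The queue, batch size, and timeout values below are illustrative assumptions, not Backend.AI internals.

```python
import queue
import time

MAX_BATCH = 8      # illustrative maximum batch size
MAX_WAIT_S = 0.01  # illustrative batching window

def collect_batch(requests: "queue.Queue[str]") -> list[str]:
    """Drain up to MAX_BATCH requests, waiting at most MAX_WAIT_S overall."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH and time.monotonic() < deadline:
        try:
            batch.append(requests.get(timeout=max(deadline - time.monotonic(), 0.0)))
        except queue.Empty:
            break
    return batch

# Usage: each collected batch would go to the engine in a single generate()
# call instead of one call per request, improving throughput.
q: "queue.Queue[str]" = queue.Queue()
for prompt in ["hi", "what is PALI?", "summarize this deck"]:
    q.put(prompt)
print(collect_batch(q))  # ['hi', 'what is PALI?', 'summarize this deck']
```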
11. PALI: Make AI Inference Scalable
    PALI: Performant AI Launcher for Inference
     – Combination of Backend.AI Model Player, model storage, and pre-defined models
     – Scalability and performance benefits of PALI
    (Diagram: Model Player – Hugging Face, NVIDIA NIM, ION Model Store)
12. PALI2: Scalable AI H/W Infrastructure
    PALI PALI: hardware appliance with built-in PALI
     – Performant AI Launcher for Inference on Portable AI Landing Infrastructure
     – Scalability through easy connection of multiple units
     – Optimized for AI workloads, offering high performance and low latency
    Partners: Kyocera Mirai, Instant.AI – more to come!
    (Shown: PALI PALI with NVIDIA GH200)
13. PALANG: PALI for Language Model Management
    PALANG: pre-configured PALI platform for LANGuage models
     – Ready-to-use setup for inference and fine-tuning
     – Simplified deployment and management of language models
     – Helmsman
       ✓ NLP interface for your model fine-tuning
     – Talkativot
       ✓ Chat with your model, simplified
14. Helmsman: Chat with PALI
    Agent Helmsman
     – NLP interface for PALANG model fine-tuning & control
       ✓ Translates user intents into actionable commands
       ✓ Autonomously handles complex workflows
       ✓ Specific use cases:
         ◦ Creating container-based sessions with allocated resources
         ◦ Executing code on the user's behalf in a remote session
         ◦ Launching LLM fine-tuning
         ◦ Deploying the fine-tuned model for inference
     – All done with a text prompt!
    Helmsman: the person responsible for steering and controlling the direction of a ship
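A rough sketch of the intent-to-command idea behind Helmsman follows. In the real system an LLM performs the parsing, whereas this placeholder uses naive keyword routing; the action names and parameters are assumptions for illustration, not the actual Helmsman protocol.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str    # e.g. "create_session", "launch_finetune", "deploy_model"
    params: dict

def parse_intent(prompt: str) -> Action:
    """Stand-in for an LLM-backed intent parser: naive keyword routing."""
    text = prompt.lower()
    if "fine-tune" in text or "finetune" in text:
        return Action("launch_finetune", {"base_model": "llama-3-8b"})
    if "deploy" in text:
        return Action("deploy_model", {"replicas": 1})
    return Action("create_session", {"resources": {"gpu": 1}})

print(parse_intent("Please fine-tune my model on the new dataset"))
# Action(name='launch_finetune', params={'base_model': 'llama-3-8b'})
```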
15. Goal with PALI: WYTIWYG
    WYTIWYG: What You Think/Talk is What You Get
     – Lablup's goal: translating AI ideas into reality
    Seamless and intuitive experience for deploying and utilizing AI models
    Empowering users to bring their AI visions to life effortlessly
16. Conclusion
    Challenges of AI inference
    Key benefits of Backend.AI's inference solutions
    Importance of scalable and efficient AI inference
    The PALI suite for all AI inference needs