
I'm Getting Dizzy, I Tell You!: PALI, PALI PALI and PALANG - 황은진

Lablup Inc.

November 27, 2024

Transcript

1. AI Inference
    Enables the utilization of AI models to make predictions and generate outputs based on new, unseen data
2. Generative AI and Inference
    Characteristics of Generative AI models
     – Language models, image generators…
     – Require specialized inference approaches
    Different resource requirements
     – Often have large memory footprints and computational demands
    Importance of efficient inference
     – Crucial for generating high-quality outputs in real time
3. Challenges of AI Inference
    Diverse model architectures & frameworks
    Low latency & high throughput
    Scalability & resource management
    Varying hardware requirements & constraints
4. Inference Toolkit / Serving Solutions
    TensorFlow Serving (Google, 2016~), Triton Inference Server (NVIDIA, 2018~), OpenVINO (Intel, 2018~), ONNXRuntime (Microsoft, 2018~), RedisAI (RedisAI, 2019~), TorchServe (Facebook, 2020~), Seldon Core (SeldonIO, 2018~), KServe (Google, 2020~)
     – Categories: framework-specific, multi-model format, model server wrapper, K8s-specific
    LLM-specific: Triton (OpenAI, 2023~), TensorRT-LLM (NVIDIA, 2023~), Llama.cpp / ggml (ggml, 2023~), vLLM (2023~), Keras-model (Google, 2023~)
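To make the LLM-specific group concrete, here is a minimal offline-inference sketch using vLLM, one of the engines listed above. The checkpoint name is only an example; any Hugging Face model that vLLM supports could be substituted.

```python
from vllm import LLM, SamplingParams

# Load a small example checkpoint; any vLLM-supported model can be used.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in a single call.
outputs = llm.generate(["Explain AI inference in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```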
5. Puzzles of Inference
    Rapid updates of models and model serving software
     – Difficulty in keeping up with frequent releases of improved model serving software
    Challenges in deploying model + model serving software as container images
     – Difficulty in adapting to fast-paced changes in the ecosystem
    Complexity of integrating model serving software optimized for various AI accelerator hardware with models
     – Challenges in managing optimized combinations of software and models for each hardware platform
    Complexity of managing the interoperability matrix between various model serving software and supported models
     – Difficulty in managing and testing the matrix of supported model and software combinations
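As a hedged illustration of the interoperability-matrix problem described above, the sketch below models supported engine/model combinations as a small lookup table. The engine names, versions, and model names are hypothetical placeholders, not an actual support matrix.

```python
# Hypothetical interoperability matrix: which serving engine (and version)
# can serve which model. All names and version pins are illustrative only.
SUPPORTED = {
    ("vllm", "0.4"): {"llama-3-8b", "mistral-7b"},
    ("tensorrt-llm", "0.9"): {"llama-3-8b"},
    ("onnxruntime", "1.17"): {"bert-base", "resnet-50"},
}

def compatible_engines(model_name: str) -> list[tuple[str, str]]:
    """Return every (engine, version) pair known to serve the given model."""
    return [key for key, models in SUPPORTED.items() if model_name in models]

print(compatible_engines("llama-3-8b"))
# [('vllm', '0.4'), ('tensorrt-llm', '0.9')]
```

Even this toy table shows why the matrix is hard to maintain by hand: every new engine release or model architecture multiplies the combinations that must be tested.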
6. Solving Puzzles of Inference
    Backend.AI's approach: separating models and serving software, providing automatic conversion and integration
     – Enabling independent updates and management of models and serving software
     – Providing unified management and abstraction for various AI accelerator hardware
7. Inference at Scale: Super-small to Ultra-large
    Scalability: a critical aspect of inference
    Handles inference at various scales
     – From NVIDIA Jetson Nano & RPi4 to 2,000-node clusters
    Covers both efficient resource utilization and automatic scaling
8. Other Technical Difficulties
    Reduce resource consumption
     – Integrates model optimization, quantization, and compression toolkits
     – Reduces the memory footprint and computational requirements of AI models
    Minimize latency & maximize throughput
     – Smart model caching mechanisms
     – Efficient data loading strategies
    Complex & resource-intensive models
     – Backend.AI offers seamless and performant inference
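One concrete footprint-reduction technique of the kind mentioned above is post-training dynamic quantization. The PyTorch sketch below is a generic, minimal example on a toy model; it is illustrative and not a depiction of Backend.AI's integrated toolkits.

```python
import torch
import torch.nn as nn

# A toy model standing in for a real inference workload.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear layers are swapped for int8
# equivalents, shrinking the memory footprint of the weights.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```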
9. Bring the Best of AI Training to AI Inference
    Leveraging the advancements in AI training for inference
     – Can the same techniques optimized for training large models be used for inference?
    Goals
     – Applying best practices and techniques from training to optimize inference performance
     – Ensuring seamless integration between training and inference pipelines
10. Backend.AI Suite
    Model service
     – Backend.AI 23.03 (Alpha) ~ 24.03 (Official)
     – Model session management: Sokovan-controlled inference
     – Model traffic: Backend.AI AppProxy v4
     – Model fine-tuning pipeline: FastTrack
    Flexibility, performance, and ease of use
     – Dynamic inference: combining models and inference engines based on runtime requests
     – vLLM, TensorRT-LLM + Triton, ONNXRuntime as pre-built model engines
     – Automatic model loading, dynamic batching, and request prioritization
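Dynamic batching, listed above, can be illustrated with a generic sketch: incoming requests are collected until a batch fills up or a short timeout expires, and the whole batch is then handed to the engine in one call. The queue, batch size, and timeout values below are illustrative assumptions, not Backend.AI internals.

```python
import queue
import time

MAX_BATCH = 8      # illustrative maximum batch size
MAX_WAIT_S = 0.01  # illustrative batching window

def collect_batch(requests: "queue.Queue[str]") -> list[str]:
    """Drain up to MAX_BATCH requests, waiting at most MAX_WAIT_S overall."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH and time.monotonic() < deadline:
        try:
            batch.append(requests.get(timeout=max(deadline - time.monotonic(), 0.0)))
        except queue.Empty:
            break
    return batch

# Usage: each collected batch would go to the engine in a single generate()
# call instead of one call per request, improving throughput.
q: "queue.Queue[str]" = queue.Queue()
for prompt in ["hi", "what is PALI?", "summarize this deck"]:
    q.put(prompt)
print(collect_batch(q))  # ['hi', 'what is PALI?', 'summarize this deck']
```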
11. PALI: Make AI Inference Scalable
    PALI: Performant AI Launcher for Inference
     – Combination of Backend.AI Model Player, model storage, and pre-defined models
     – Scalability and performance benefits of PALI
    (Diagram: Model Player – Hugging Face, NVIDIA NIM, ION Model Store)
12. PALI2: Scalable AI H/W Infrastructure
    PALI PALI: hardware appliance with built-in PALI
     – Performant AI Launcher for Inference on Portable AI Landing Infrastructure
     – Scalability through easy connection of multiple units
     – Optimized for AI workloads, offering high performance and low latency
    Partners: Kyocera Mirai, Instant.AI – more to come!
    (Shown: PALI PALI with NVIDIA GH200)
13. PALANG: PALI for Language Model Management
    PALANG: pre-configured PALI platform for LANGuage models
     – Ready-to-use setup for inference and fine-tuning
     – Simplified deployment and management of language models
     – Helmsman
       ✓ NLP interface for your model fine-tuning
     – Talkativot
       ✓ Chat with your model, simplified
14. Helmsman: Chat with PALI
    Agent Helmsman
     – NLP interface for PALANG model fine-tuning & control
       ✓ Translates user intents into actionable commands
       ✓ Autonomously handles complex workflows
       ✓ Specific use cases:
         ◦ Creating container-based sessions with allocated resources
         ◦ Executing code on the user's behalf in a remote session
         ◦ Launching LLM fine-tuning
         ◦ Deploying the fine-tuned model for inference
     – All done with a text prompt!
    Helmsman: the person responsible for steering and controlling the direction of a ship
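A rough sketch of the intent-to-command idea behind Helmsman follows. In the real system an LLM performs the parsing, whereas this placeholder uses naive keyword routing; the action names and parameters are assumptions for illustration, not the actual Helmsman protocol.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str    # e.g. "create_session", "launch_finetune", "deploy_model"
    params: dict

def parse_intent(prompt: str) -> Action:
    """Stand-in for an LLM-backed intent parser: naive keyword routing."""
    text = prompt.lower()
    if "fine-tune" in text or "finetune" in text:
        return Action("launch_finetune", {"base_model": "llama-3-8b"})
    if "deploy" in text:
        return Action("deploy_model", {"replicas": 1})
    return Action("create_session", {"resources": {"gpu": 1}})

print(parse_intent("Please fine-tune my model on the new dataset"))
# Action(name='launch_finetune', params={'base_model': 'llama-3-8b'})
```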
15. Goal with PALI: WYTIWYG
    WYTIWYG: What You Think/Talk is What You Get
     – Lablup's goal: translating AI ideas into reality
    Seamless and intuitive experience for deploying and utilizing AI models
    Empowering users to bring their AI visions to life effortlessly
16. Conclusion
    Challenges of AI inference
    Key benefits of Backend.AI's inference solutions
    Importance of scalable and efficient AI inference
    The PALI suite for all AI inference needs