LLM Development Knowledge Sharing
Presented at AIMET.tech on 21 May 2024

Kamolphan Liwprasert

May 21, 2024

Transcript

  1. Fine Tuning: Use Pre-trained Model + New Data
    • Load dataset: load the data from Hugging Face into a Ray Dataset.
    • Preprocess dataset: tokenize the data with a Ray Dataset transformation.
    • Fine-tune model: use Ray Train together with the Hugging Face training loop to fine-tune the foundation model.
    • Tune model: Ray provides a tuning function (Ray Tune) for hyperparameter tuning.
    A minimal sketch of these four steps follows below.
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
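    The sketch below maps the four steps onto Ray's Python APIs. It is a minimal illustration, not the blog's exact code: it assumes Ray 2.x (ray[data,train,tune]) plus transformers and datasets are installed, and the base model (distilgpt2), the dataset (imdb), and all hyperparameters are placeholders.

    import ray
    from ray import tune
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    MODEL_NAME = "distilgpt2"  # small placeholder base model

    # 1) Load dataset: pull a Hugging Face dataset into a Ray Dataset.
    import datasets
    hf_ds = datasets.load_dataset("imdb", split="train[:1%]")  # placeholder dataset
    ray_ds = ray.data.from_huggingface(hf_ds)

    # 2) Preprocess dataset: tokenize with a Ray Data transformation.
    def tokenize(batch):
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained(MODEL_NAME)
        tok.pad_token = tok.eos_token  # GPT-2-style tokenizers have no pad token by default
        out = tok(list(batch["text"]), truncation=True, padding="max_length", max_length=128)
        return {"input_ids": out["input_ids"], "attention_mask": out["attention_mask"]}

    tokenized_ds = ray_ds.map_batches(tokenize)

    # 3) Fine-tune model: run a Hugging Face model inside a Ray Train worker loop.
    def train_loop_per_worker(config):
        import torch
        from ray import train
        from ray.train.torch import prepare_model
        from transformers import AutoModelForCausalLM

        model = prepare_model(AutoModelForCausalLM.from_pretrained(MODEL_NAME))
        optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
        shard = train.get_dataset_shard("train")
        for _ in range(config["epochs"]):
            for batch in shard.iter_torch_batches(batch_size=8, dtypes=torch.long):
                loss = model(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"],
                             labels=batch["input_ids"]).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
            train.report({"loss": loss.item()})

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 2e-5, "epochs": 1},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # set use_gpu=False if no GPUs
        datasets={"train": tokenized_ds},
    )
    result = trainer.fit()

    # 4) Tune model: wrap the same trainer with Ray Tune for hyperparameter search.
    tuner = tune.Tuner(
        trainer,
        param_space={"train_loop_config": {"lr": tune.loguniform(1e-5, 1e-4), "epochs": 1}},
    )
    results = tuner.fit()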
  2. PEFT: Parameter-Efficient Fine-Tuning
    🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model's parameters, because that is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters - significantly decreasing computational and storage costs - while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware. PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference. (Text copied from source)
    https://huggingface.co/docs/peft/en/index
  3. PEFT: LoRA - Low-Rank Adaptation
    LoRA is a low-rank decomposition method to reduce the number of trainable parameters, which speeds up fine-tuning large models and uses less memory. In PEFT, using LoRA is as easy as setting up a LoraConfig and wrapping it with get_peft_model() to create a trainable PeftModel. (Text copied from source)
    https://huggingface.co/docs/peft/en/task_guides/lora_based_methods
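    A minimal sketch of that LoraConfig + get_peft_model() flow; the base model and the LoRA hyperparameters below are illustrative placeholders, not values from the talk.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model, TaskType

    base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # placeholder base model

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor for the LoRA update
        lora_dropout=0.05,
        target_modules=["c_attn"],  # attention projection in GPT-2-style models
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # only the small LoRA adapter weights are trainable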
  4. PEFT: Quantization
    Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially when it comes to large language models (LLMs). There are several ways to quantize a model, including:
    • optimizing which model weights are quantized with the AWQ algorithm
    • independently quantizing each row of a weight matrix with the GPTQ algorithm
    • quantizing to 8-bit and 4-bit precision with the bitsandbytes library
    • quantizing to as low as 2-bit precision with the AQLM algorithm
    However, after a model is quantized it isn't typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, QLoRA is a method that quantizes a model to 4-bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU! (Text copied from source)
    https://huggingface.co/docs/peft/main/en/developer_guides/quantization
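    To make the QLoRA idea concrete, here is a minimal sketch that loads a model in 4-bit with bitsandbytes and attaches a LoRA adapter via PEFT; the model name and hyperparameters are placeholders, and a CUDA GPU with bitsandbytes installed is assumed.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",      # placeholder (gated); any causal LM works
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)   # freeze base weights, enable stable k-bit training
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))
    model.print_trainable_parameters()               # only the LoRA adapter is trained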
  5. Example Fine Tuning
    Fine-tuning SCB's Typhoon model with medical data (https://huggingface.co/datasets/Thaweewat/thai-med-pack). Feel free to try it with other datasets as well. The training config is set up so that it can train on a P100 GPU on Kaggle (bitsandbytes + LoRA), and the notebook also pushes the fine-tuned version of the model to Hugging Face.
    Fine tuning: https://www.kaggle.com/code/batprem/typhoon-fine-tuning-based-line
    Inference: https://www.kaggle.com/code/batprem/typhoon-load-model-and-test
    Training visualization with Weights & Biases: https://www.kaggle.com/code/batprem/typhoon-wandb-visualization?scriptVersionId=173192354
    https://www.facebook.com/photo?fbid=852664400209645&set=a.419487336860689
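    For the "push the fine-tuned model to Hugging Face" step, the standard push_to_hub API is enough; the checkpoint path and repo name below are placeholders, and a Hugging Face token (huggingface-cli login or HF_TOKEN) is assumed.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-checkpoint")  # placeholder path
    tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-checkpoint")
    model.push_to_hub("your-username/typhoon-medical-finetune")      # placeholder repo id
    tokenizer.push_to_hub("your-username/typhoon-medical-finetune")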
  6. Why self-host LLM?
    1. Cost-efficient in the long term (if chosen wisely)
       ➢ Need to tune the latency to make the model faster
    2. Customization & fine-tuning
       ➢ No lock-in to a particular model
    3. Security compliance & data residency / privacy
  7. Artificial Analysis: Comparing API Prices
    [Chart: API prices for Llama 3 Instruct across LLM API providers (from the previous section)]
    https://artificialanalysis.ai/
  8. LangChain 🦜🔗
    LangChain 🦜🔗 is a framework (Python / JS library) for developing applications powered by large language models (LLMs). The main values of LangChain are:
    ✓ Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy to use, whether you are using the rest of the LangChain framework or not.
    ✓ Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks.
    https://www.langchain.com/langchain
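    As a small illustration of the components-and-chains idea, here is a minimal sketch using the LangChain Expression Language; it assumes the langchain-openai integration package is installed, an OPENAI_API_KEY is set, and the model name is just a placeholder.

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_openai import ChatOpenAI

    # Components: a prompt template, a chat model, and an output parser.
    prompt = ChatPromptTemplate.from_template("Summarize this in one sentence: {text}")
    llm = ChatOpenAI(model="gpt-4o-mini")          # placeholder model; needs OPENAI_API_KEY
    chain = prompt | llm | StrOutputParser()       # chain: compose the components with "|"

    print(chain.invoke({"text": "LangChain lets you compose LLM components into chains."}))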
  9. Semantic Kernel from Microsoft
    Semantic Kernel is an SDK that integrates Large Language Models (LLMs) like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. Semantic Kernel achieves this by allowing you to define plugins that can be chained together in just a few lines of code. What makes Semantic Kernel special, however, is its ability to automatically orchestrate plugins with AI. With Semantic Kernel planners, you can ask an LLM to generate a plan that achieves a user's unique goal. Afterwards, Semantic Kernel will execute the plan for the user.
    https://github.com/microsoft/semantic-kernel
  10. What is a Vector DB?
    A database that stores embedding representations and uses an Approximate Nearest Neighbor (ANN) algorithm to retrieve the stored embeddings closest to a query embedding.
    https://www.pinecone.io/learn/vector-database/
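    As a toy illustration of the idea (not a real vector DB), the sketch below stores unit-normalized "embeddings" and returns the nearest ones to a query by exact cosine similarity; a production vector DB replaces this brute-force search with an ANN index such as HNSW, and the random 4-dimensional vectors here are placeholders for real embeddings.

    import numpy as np

    docs = ["how to fine-tune an LLM", "vector databases explained", "ray serve tutorial"]
    doc_vecs = np.random.default_rng(0).normal(size=(len(docs), 4))   # pretend embeddings
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)       # unit-normalize rows

    def query(query_vec, k=2):
        query_vec = query_vec / np.linalg.norm(query_vec)
        scores = doc_vecs @ query_vec              # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]              # indices of the k most similar docs
        return [(docs[i], float(scores[i])) for i in top]

    print(query(np.random.default_rng(1).normal(size=4)))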
  11. LLMOps & MLOps
    LLM unique characteristics:
    ✔ Requirements of managing LLMs
    ✔ Large size
    ✔ Complex training requirements
    ✔ High computational demands
    LLMOps = MLOps + handling these LLM-specific characteristics
    https://cloud.google.com/discover/what-is-llmops
  12. What can LLMOps do?
    LLMOps involves a comprehensive set of activities, including:
    • Model deployment and maintenance: deploying and managing LLMs on cloud platforms or on-premises infrastructure
    • Data management: curating and preparing training data, as well as monitoring and maintaining data quality
    • Model training and fine-tuning: training and refining LLMs to improve their performance on specific tasks
    • Monitoring and evaluation: tracking LLM performance, identifying errors, and optimizing models
    • Security and compliance: ensuring the security and regulatory compliance of LLM operations
    https://cloud.google.com/discover/what-is-llmops
  13. Ray LLM: Inference Landscape
    Behind the scenes of top-tier companies such as OpenAI, Uber, and Cohere that deploy LLM models is Anyscale, a platform that lets developers serve LLM models. It was built by the Ray team, and their tool Ray-LLM addresses three technical problems that come up when serving LLMs, covered in the next slides. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
  14. Ray LLM: Optimized batching
    Continuous batching: with static (naive) batching, GPUs are underutilised when the sequences in a batch have different lengths. Continuous batching instead concatenates new input sequence tokens onto the end of the token batch as slots free up, which increases the throughput of the system. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
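    A toy scheduler illustrating the continuous-batching idea (not Ray-LLM's or vLLM's actual implementation): batch slots are refilled from the request queue as soon as a sequence finishes instead of waiting for the whole batch, and decode_one_step is a made-up stand-in for a model's per-token forward pass.

    import random
    from collections import deque

    BATCH_SLOTS = 4

    def decode_one_step(seq):
        # Pretend to generate one token; the sequence finishes at its target length.
        seq["generated"] += 1
        seq["done"] = seq["generated"] >= seq["target_len"]

    queue = deque({"id": i, "generated": 0, "done": False,
                   "target_len": random.randint(3, 12)} for i in range(16))
    slots = [None] * BATCH_SLOTS
    completed = []

    while queue or any(slots):
        # Refill free slots from the waiting queue (this is the "continuous" part).
        for i, s in enumerate(slots):
            if s is None and queue:
                slots[i] = queue.popleft()
        # One batched decode step over every occupied slot.
        for i, s in enumerate(slots):
            if s is not None:
                decode_one_step(s)
                if s["done"]:
                    completed.append(s["id"])
                    slots[i] = None   # slot immediately available for a new request

    print("finished order:", completed)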
  15. Ray LLM: Speculative Decoding
    Speculative decoding: a small model speculates K tokens ahead, and a large model then verifies them; a drafted token is emitted only if the large model agrees it is correct, otherwise it is replaced. This allows faster forward passes per token and reduces latency, since the large model only verifies and can check the drafted tokens in parallel. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
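    A toy sketch of the accept/reject loop with greedy acceptance (a simplification of the full rejection-sampling scheme); draft_next and target_argmax are made-up stand-ins for the small and large models.

    K = 4

    def draft_next(context):
        # Placeholder: the cheap draft model's guess for the next token.
        return (sum(context) + 1) % 10

    def target_argmax(context):
        # Placeholder: the expensive target model's preferred next token.
        return (sum(context) * 3 + 1) % 10

    def speculative_step(context):
        # 1) Draft model proposes K tokens autoregressively (cheap).
        draft, ctx = [], list(context)
        for _ in range(K):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target model verifies the proposals; in a real system this is a single
        #    batched forward pass over all K positions, which is where the speed-up comes from.
        accepted, ctx = [], list(context)
        for t in draft:
            best = target_argmax(ctx)
            if t == best:
                accepted.append(t)      # draft token accepted and emitted
                ctx.append(t)
            else:
                accepted.append(best)   # first mismatch: emit the target's token, discard the rest
                break
        return context + accepted

    print(speculative_step([1, 2, 3]))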
  16. Ray LLM: Hybrid routing
    Hybrid routing: a supervised classifier classifies the incoming query and selects a suitable model before the query is fed into an LLM. This helps when building LLM-based agents, since each LLM has its own strengths for each task, so it can be better to let the model with the most relevant context answer. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
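    A minimal sketch of the routing idea; the keyword-based classify function and the model names are made-up placeholders for a trained supervised classifier and real model endpoints.

    ROUTES = {
        "code": "code-specialist-llm",       # placeholder model names
        "medical": "medical-llm",
        "general": "general-purpose-llm",
    }

    def classify(query: str) -> str:
        # Stand-in for a supervised classifier over query intents.
        q = query.lower()
        if any(w in q for w in ("python", "bug", "function")):
            return "code"
        if any(w in q for w in ("symptom", "diagnosis", "drug")):
            return "medical"
        return "general"

    def route(query: str) -> str:
        label = classify(query)
        model = ROUTES[label]
        # Here we would call the selected model's endpoint; we just report the choice.
        return f"routing {label!r} query to {model}"

    print(route("How do I fix this Python function?"))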
  17. vLLM: Model serving for LLM
    Easy, fast, and cheap LLM serving for everyone. vLLM is fast with:
    ✅ State-of-the-art serving throughput
    ✅ Efficient management of attention key and value memory with PagedAttention
    ✅ Continuous batching of incoming requests
    ✅ Fast model execution with CUDA/HIP graph
    ✅ Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
    ✅ Optimized CUDA kernels
    https://github.com/vllm-project/vllm
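    For reference, offline inference with vLLM looks roughly like this (a minimal sketch assuming vllm is installed on a GPU machine; the model name is a small placeholder).

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")                    # small placeholder model
    params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["Explain continuous batching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)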
  18. Candle vLLM
    Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server. Based on Hugging Face Candle (Rust).
    https://github.com/EricLBuehler/candle-vllm
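    Because the server speaks the OpenAI-compatible API, it can be queried with the standard OpenAI Python client; in this sketch the base_url/port, model name, and dummy API key are assumptions for illustration, not candle-vllm defaults.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:2000/v1",   # assumed local server address
                    api_key="not-needed")                  # local servers usually ignore the key
    resp = client.chat.completions.create(
        model="llama-3-8b-instruct",   # placeholder: whatever model the server loaded
        messages=[{"role": "user", "content": "Hello from candle-vllm!"}],
    )
    print(resp.choices[0].message.content)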