LLM Development Knowledge Sharing
Presented at AIMET.tech on 21 May 2024

Kamolphan Liwprasert

May 21, 2024

Transcript

  1. Fine Tuning: Use Pre-trained Model + New Data
    • Load dataset: load the data from Hugging Face into a Ray Dataset.
    • Preprocess dataset: tokenize the data with a Ray Dataset transformation.
    • Fine-tune model: use Ray Train together with the Hugging Face training loop to fine-tune the foundation model.
    • Tune model: Ray provides a tuning function (Ray Tune) for hyperparameter tuning.
    A minimal sketch of these four steps follows below.
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
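    The sketch below maps the four steps onto Ray's Python APIs. It is a minimal illustration, not the blog's exact code: it assumes Ray 2.x (ray[data,train,tune]) plus transformers and datasets are installed, and the base model (distilgpt2), the dataset (imdb), and all hyperparameters are placeholders.

    import ray
    from ray import tune
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    MODEL_NAME = "distilgpt2"  # small placeholder base model

    # 1) Load dataset: pull a Hugging Face dataset into a Ray Dataset.
    import datasets
    hf_ds = datasets.load_dataset("imdb", split="train[:1%]")  # placeholder dataset
    ray_ds = ray.data.from_huggingface(hf_ds)

    # 2) Preprocess dataset: tokenize with a Ray Data transformation.
    def tokenize(batch):
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained(MODEL_NAME)
        tok.pad_token = tok.eos_token  # GPT-2-style tokenizers have no pad token by default
        out = tok(list(batch["text"]), truncation=True, padding="max_length", max_length=128)
        return {"input_ids": out["input_ids"], "attention_mask": out["attention_mask"]}

    tokenized_ds = ray_ds.map_batches(tokenize)

    # 3) Fine-tune model: run a Hugging Face model inside a Ray Train worker loop.
    def train_loop_per_worker(config):
        import torch
        from ray import train
        from ray.train.torch import prepare_model
        from transformers import AutoModelForCausalLM

        model = prepare_model(AutoModelForCausalLM.from_pretrained(MODEL_NAME))
        optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
        shard = train.get_dataset_shard("train")
        for _ in range(config["epochs"]):
            for batch in shard.iter_torch_batches(batch_size=8, dtypes=torch.long):
                loss = model(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"],
                             labels=batch["input_ids"]).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
            train.report({"loss": loss.item()})

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 2e-5, "epochs": 1},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # set use_gpu=False if no GPUs
        datasets={"train": tokenized_ds},
    )
    result = trainer.fit()

    # 4) Tune model: wrap the same trainer with Ray Tune for hyperparameter search.
    tuner = tune.Tuner(
        trainer,
        param_space={"train_loop_config": {"lr": tune.loguniform(1e-5, 1e-4), "epochs": 1}},
    )
    results = tuner.fit()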
  2. PEFT: Parameter-Efficient Fine-Tuning
    🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model's parameters, because that is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters - significantly decreasing computational and storage costs - while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware. PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference. (Text copied from source)
    https://huggingface.co/docs/peft/en/index
  3. PEFT: LoRA - Low-Rank Adaptation
    LoRA is a low-rank decomposition method to reduce the number of trainable parameters, which speeds up fine-tuning large models and uses less memory. In PEFT, using LoRA is as easy as setting up a LoraConfig and wrapping it with get_peft_model() to create a trainable PeftModel. (Text copied from source)
    https://huggingface.co/docs/peft/en/task_guides/lora_based_methods
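    A minimal sketch of that LoraConfig + get_peft_model() flow; the base model and the LoRA hyperparameters below are illustrative placeholders, not values from the talk.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model, TaskType

    base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # placeholder base model

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor for the LoRA update
        lora_dropout=0.05,
        target_modules=["c_attn"],  # attention projection in GPT-2-style models
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # only the small LoRA adapter weights are trainable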
  4. PEFT: Quantization
    Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially when it comes to large language models (LLMs). There are several ways to quantize a model, including:
    • optimizing which model weights are quantized with the AWQ algorithm
    • independently quantizing each row of a weight matrix with the GPTQ algorithm
    • quantizing to 8-bit and 4-bit precision with the bitsandbytes library
    • quantizing to as low as 2-bit precision with the AQLM algorithm
    However, after a model is quantized it isn't typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, QLoRA is a method that quantizes a model to 4-bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU! (Text copied from source)
    https://huggingface.co/docs/peft/main/en/developer_guides/quantization
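    To make the QLoRA idea concrete, here is a minimal sketch that loads a model in 4-bit with bitsandbytes and attaches a LoRA adapter via PEFT; the model name and hyperparameters are placeholders, and a CUDA GPU with bitsandbytes installed is assumed.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",      # placeholder (gated); any causal LM works
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)   # freeze base weights, enable stable k-bit training
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))
    model.print_trainable_parameters()               # only the LoRA adapter is trained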
  5. Example Fine Tuning
    Fine-tuning SCB's Typhoon model with medical data (https://huggingface.co/datasets/Thaweewat/thai-med-pack). Feel free to try it with other datasets as well. The training config is set up so that it can train on a P100 GPU on Kaggle (bitsandbytes + LoRA), and the notebook also pushes the fine-tuned version of the model to Hugging Face.
    Fine tuning: https://www.kaggle.com/code/batprem/typhoon-fine-tuning-based-line
    Inference: https://www.kaggle.com/code/batprem/typhoon-load-model-and-test
    Training visualization with Weights & Biases: https://www.kaggle.com/code/batprem/typhoon-wandb-visualization?scriptVersionId=173192354
    https://www.facebook.com/photo?fbid=852664400209645&set=a.419487336860689
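    For the "push the fine-tuned model to Hugging Face" step, the standard push_to_hub API is enough; the checkpoint path and repo name below are placeholders, and a Hugging Face token (huggingface-cli login or HF_TOKEN) is assumed.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-checkpoint")  # placeholder path
    tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-checkpoint")
    model.push_to_hub("your-username/typhoon-medical-finetune")      # placeholder repo id
    tokenizer.push_to_hub("your-username/typhoon-medical-finetune")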
  6. Why self-host LLM?
    1. Cost-efficient in the long term (if chosen wisely)
       ➢ Need to tune the latency to make the model faster
    2. Customization & fine-tuning
       ➢ No lock-in to a particular model
    3. Security compliance & data residency / privacy
  7. Artificial Analysis: Comparing API Prices
    [Chart: API prices for Llama 3 Instruct across LLM API providers (from the previous section)]
    https://artificialanalysis.ai/
  8. LangChain 🦜🔗
    LangChain 🦜🔗 is a framework (Python / JS library) for developing applications powered by large language models (LLMs). The main values of LangChain are:
    ✓ Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy to use, whether you are using the rest of the LangChain framework or not.
    ✓ Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks.
    https://www.langchain.com/langchain
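    As a small illustration of the components-and-chains idea, here is a minimal sketch using the LangChain Expression Language; it assumes the langchain-openai integration package is installed, an OPENAI_API_KEY is set, and the model name is just a placeholder.

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_openai import ChatOpenAI

    # Components: a prompt template, a chat model, and an output parser.
    prompt = ChatPromptTemplate.from_template("Summarize this in one sentence: {text}")
    llm = ChatOpenAI(model="gpt-4o-mini")          # placeholder model; needs OPENAI_API_KEY
    chain = prompt | llm | StrOutputParser()       # chain: compose the components with "|"

    print(chain.invoke({"text": "LangChain lets you compose LLM components into chains."}))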
  9. Semantic Kernel from Microsoft
    Semantic Kernel is an SDK that integrates Large Language Models (LLMs) like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. Semantic Kernel achieves this by allowing you to define plugins that can be chained together in just a few lines of code. What makes Semantic Kernel special, however, is its ability to automatically orchestrate plugins with AI. With Semantic Kernel planners, you can ask an LLM to generate a plan that achieves a user's unique goal. Afterwards, Semantic Kernel will execute the plan for the user.
    https://github.com/microsoft/semantic-kernel
  10. What is a Vector DB?
    A database that stores embedding representations and uses an Approximate Nearest Neighbor (ANN) algorithm to retrieve the stored embeddings closest to a query embedding.
    https://www.pinecone.io/learn/vector-database/
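    As a toy illustration of the idea (not a real vector DB), the sketch below stores unit-normalized "embeddings" and returns the nearest ones to a query by exact cosine similarity; a production vector DB replaces this brute-force search with an ANN index such as HNSW, and the random 4-dimensional vectors here are placeholders for real embeddings.

    import numpy as np

    docs = ["how to fine-tune an LLM", "vector databases explained", "ray serve tutorial"]
    doc_vecs = np.random.default_rng(0).normal(size=(len(docs), 4))   # pretend embeddings
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)       # unit-normalize rows

    def query(query_vec, k=2):
        query_vec = query_vec / np.linalg.norm(query_vec)
        scores = doc_vecs @ query_vec              # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]              # indices of the k most similar docs
        return [(docs[i], float(scores[i])) for i in top]

    print(query(np.random.default_rng(1).normal(size=4)))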
  11. LLMOps & MLOps
    LLM unique characteristics:
    ✔ Requirements of managing LLMs
    ✔ Large size
    ✔ Complex training requirements
    ✔ High computational demands
    LLMOps = MLOps + handling these LLM-specific characteristics
    https://cloud.google.com/discover/what-is-llmops
  12. What can LLMOps do?
    LLMOps involves a comprehensive set of activities, including:
    • Model deployment and maintenance: deploying and managing LLMs on cloud platforms or on-premises infrastructure
    • Data management: curating and preparing training data, as well as monitoring and maintaining data quality
    • Model training and fine-tuning: training and refining LLMs to improve their performance on specific tasks
    • Monitoring and evaluation: tracking LLM performance, identifying errors, and optimizing models
    • Security and compliance: ensuring the security and regulatory compliance of LLM operations
    https://cloud.google.com/discover/what-is-llmops
  13. Ray LLM: Inference Landscape
    Behind the scenes of top-tier companies such as OpenAI, Uber, and Cohere that deploy LLM models is Anyscale, a platform that lets developers serve LLM models. It was built by the Ray team, and their tool Ray-LLM addresses three technical problems that come up when serving LLMs, covered in the next slides. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
  14. Ray LLM: Optimized batching
    Continuous batching: with static (naive) batching, GPUs are underutilised when the sequences in a batch have different lengths. Continuous batching instead concatenates new input sequence tokens onto the end of the token batch as slots free up, which increases the throughput of the system. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
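    A toy scheduler illustrating the continuous-batching idea (not Ray-LLM's or vLLM's actual implementation): batch slots are refilled from the request queue as soon as a sequence finishes instead of waiting for the whole batch, and decode_one_step is a made-up stand-in for a model's per-token forward pass.

    import random
    from collections import deque

    BATCH_SLOTS = 4

    def decode_one_step(seq):
        # Pretend to generate one token; the sequence finishes at its target length.
        seq["generated"] += 1
        seq["done"] = seq["generated"] >= seq["target_len"]

    queue = deque({"id": i, "generated": 0, "done": False,
                   "target_len": random.randint(3, 12)} for i in range(16))
    slots = [None] * BATCH_SLOTS
    completed = []

    while queue or any(slots):
        # Refill free slots from the waiting queue (this is the "continuous" part).
        for i, s in enumerate(slots):
            if s is None and queue:
                slots[i] = queue.popleft()
        # One batched decode step over every occupied slot.
        for i, s in enumerate(slots):
            if s is not None:
                decode_one_step(s)
                if s["done"]:
                    completed.append(s["id"])
                    slots[i] = None   # slot immediately available for a new request

    print("finished order:", completed)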
  15. Ray LLM: Speculative Decoding
    Speculative decoding: a small model speculates K tokens ahead, and a large model then verifies them; a drafted token is emitted only if the large model agrees it is correct, otherwise it is replaced. This allows faster forward passes per token and reduces latency, since the large model only verifies and can check the drafted tokens in parallel. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
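    A toy sketch of the accept/reject loop with greedy acceptance (a simplification of the full rejection-sampling scheme); draft_next and target_argmax are made-up stand-ins for the small and large models.

    K = 4

    def draft_next(context):
        # Placeholder: the cheap draft model's guess for the next token.
        return (sum(context) + 1) % 10

    def target_argmax(context):
        # Placeholder: the expensive target model's preferred next token.
        return (sum(context) * 3 + 1) % 10

    def speculative_step(context):
        # 1) Draft model proposes K tokens autoregressively (cheap).
        draft, ctx = [], list(context)
        for _ in range(K):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target model verifies the proposals; in a real system this is a single
        #    batched forward pass over all K positions, which is where the speed-up comes from.
        accepted, ctx = [], list(context)
        for t in draft:
            best = target_argmax(ctx)
            if t == best:
                accepted.append(t)      # draft token accepted and emitted
                ctx.append(t)
            else:
                accepted.append(best)   # first mismatch: emit the target's token, discard the rest
                break
        return context + accepted

    print(speculative_step([1, 2, 3]))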
  16. Ray LLM: Hybrid routing
    Hybrid routing: a supervised classifier classifies the incoming query and selects a suitable model before the query is fed into an LLM. This helps when building LLM-based agents, since each LLM has its own strengths for each task, so it can be better to let the model with the most relevant context answer. (Text copied from blog post)
    https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
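    A minimal sketch of the routing idea; the keyword-based classify function and the model names are made-up placeholders for a trained supervised classifier and real model endpoints.

    ROUTES = {
        "code": "code-specialist-llm",       # placeholder model names
        "medical": "medical-llm",
        "general": "general-purpose-llm",
    }

    def classify(query: str) -> str:
        # Stand-in for a supervised classifier over query intents.
        q = query.lower()
        if any(w in q for w in ("python", "bug", "function")):
            return "code"
        if any(w in q for w in ("symptom", "diagnosis", "drug")):
            return "medical"
        return "general"

    def route(query: str) -> str:
        label = classify(query)
        model = ROUTES[label]
        # Here we would call the selected model's endpoint; we just report the choice.
        return f"routing {label!r} query to {model}"

    print(route("How do I fix this Python function?"))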
  17. vLLM: Model serving for LLM
    Easy, fast, and cheap LLM serving for everyone. vLLM is fast with:
    ✅ State-of-the-art serving throughput
    ✅ Efficient management of attention key and value memory with PagedAttention
    ✅ Continuous batching of incoming requests
    ✅ Fast model execution with CUDA/HIP graph
    ✅ Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
    ✅ Optimized CUDA kernels
    https://github.com/vllm-project/vllm
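    For reference, offline inference with vLLM looks roughly like this (a minimal sketch assuming vllm is installed on a GPU machine; the model name is a small placeholder).

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")                    # small placeholder model
    params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["Explain continuous batching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)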
  18. Candle vLLM
    Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server. Based on Hugging Face Candle (Rust).
    https://github.com/EricLBuehler/candle-vllm
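    Because the server speaks the OpenAI-compatible API, it can be queried with the standard OpenAI Python client; in this sketch the base_url/port, model name, and dummy API key are assumptions for illustration, not candle-vllm defaults.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:2000/v1",   # assumed local server address
                    api_key="not-needed")                  # local servers usually ignore the key
    resp = client.chat.completions.create(
        model="llama-3-8b-instruct",   # placeholder: whatever model the server loaded
        messages=[{"role": "user", "content": "Hello from candle-vllm!"}],
    )
    print(resp.choices[0].message.content)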