Slide 1

LLM App Development 21 May 2024 Fon :)

Slide 2

Development Steps

Slide 3

Diagram by ChatGPT

Slide 4

LLM Development https://docs.ray.io/en/latest/ray-overview/use-cases.html

Slide 5

Fine-Tuning: Use a Pre-trained Model + New Data
● Load dataset: load data from Hugging Face into a Ray Dataset.
● Preprocess dataset: use Ray Data transformations to tokenize the data for preprocessing.
● Fine-tune model: use Ray Train with the Hugging Face training function to fine-tune the foundation model.
● Tune model: Ray provides Ray Tune for hyperparameter tuning.
https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
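A minimal sketch of these four steps, assuming Ray 2.x with ray[data,train], datasets, and transformers installed; the dataset and tokenizer names are illustrative, not from the slides:

```python
import numpy as np
import ray
from datasets import load_dataset
from transformers import AutoTokenizer

# 1) Load: pull a Hugging Face dataset into a Ray Dataset.
ds = ray.data.from_huggingface(load_dataset("imdb", split="train[:1000]"))

# 2) Preprocess: tokenize with a Ray Data transformation.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch: dict) -> dict:
    enc = tokenizer(list(batch["text"]), truncation=True,
                    padding="max_length", max_length=128)
    return {k: np.array(v) for k, v in enc.items()}

ds = ds.map_batches(tokenize)

# 3) Fine-tune: wrap the Hugging Face training loop in Ray Train
#    (ray.train.torch.TorchTrainer), then
# 4) Tune: sweep hyperparameters with ray.tune.Tuner on top of that trainer.
```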

Slide 6

PEFT: Parameter-Efficient Fine-Tuning

🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model's parameters because it is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters - significantly decreasing computational and storage costs - while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware. PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference. (Text copied from source)
https://huggingface.co/docs/peft/en/index

Slide 7

PEFT: LoRA - Low-Rank Adaptation

LoRA is a low-rank decomposition method that reduces the number of trainable parameters, which speeds up fine-tuning of large models and uses less memory. In PEFT, using LoRA is as easy as setting up a LoraConfig and wrapping it with get_peft_model() to create a trainable PeftModel. (Text copied from source)
https://huggingface.co/docs/peft/en/task_guides/lora_based_methods
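A minimal sketch of that API, assuming peft and transformers are installed; the model name and hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)  # wraps the base model as a PeftModel
model.print_trainable_parameters()     # only the adapter weights are trainable
```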

Slide 8

PEFT: Quantization

Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially when it comes to large language models (LLMs). There are several ways to quantize a model, including:
● optimizing which model weights are quantized with the AWQ algorithm
● independently quantizing each row of a weight matrix with the GPTQ algorithm
● quantizing to 8-bit and 4-bit precision with the bitsandbytes library
● quantizing to as low as 2-bit precision with the AQLM algorithm

However, after a model is quantized it isn't typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, QLoRA is a method that quantizes a model to 4-bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU! (Text copied from source)
https://huggingface.co/docs/peft/main/en/developer_guides/quantization
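A hedged sketch of that QLoRA recipe, combining 4-bit bitsandbytes quantization with a LoRA adapter; the model name is illustrative and gated models require access approval:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # illustrative base model
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # frozen 4-bit base + trainable LoRA
```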

Slide 9

QLoRA https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html

Slide 10

Example Fine-Tuning

Fine-tuning SCB's Typhoon model with medical data (https://huggingface.co/datasets/Thaweewat/thai-med-pack). Feel free to try it with other datasets. For training, I added a config so it can train on a P100 GPU on Kaggle (bitsandbytes + LoRA), and the notebooks also cover pushing our fine-tuned version of the model to Hugging Face.

Fine-tuning: https://www.kaggle.com/code/batprem/typhoon-fine-tuning-based-line
Inference: https://www.kaggle.com/code/batprem/typhoon-load-model-and-test
Training visualization with Weights & Biases: https://www.kaggle.com/code/batprem/typhoon-wandb-visualization?scriptVersionId=173192354
https://www.facebook.com/photo?fbid=852664400209645&set=a.419487336860689
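As a hedged sketch of that last step (pushing a fine-tuned adapter to Hugging Face), assuming a trained PEFT model and tokenizer from a notebook like the ones above; the repo name is illustrative:

```python
from huggingface_hub import login

login(token="hf_...")  # or run `huggingface-cli login` beforehand

# `model` and `tokenizer` are the fine-tuned objects from the training notebook.
model.push_to_hub("your-username/typhoon-7b-med-lora")      # adapter weights
tokenizer.push_to_hub("your-username/typhoon-7b-med-lora")
```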

Slide 11

Platform

Slide 12

Nvidia NeMo: End-to-End Generative AI Platform

Slide 13

Nvidia NeMo Framework (cont.) https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

Slide 14

Language Model API providers

Slide 15

Artificial Analysis https://artificialanalysis.ai/

Slide 16

Artificial Analysis: Quality vs Price https://artificialanalysis.ai/

Slide 17

Artificial Analysis: API Prices https://artificialanalysis.ai/

Slide 18

Artificial Analysis: Throughput https://artificialanalysis.ai/

Slide 19

Closed-source vs Open-weight models https://twitter.com/maximelabonne/status/1790519226677026831/photo/1

Slide 20

🏆 LMSYS Chatbot Arena Leaderboard https://chat.lmsys.org/?leaderboard

Slide 21

Services to Host Language Models

Slide 22

Why self-host an LLM?
1. Cost-efficient in the long term (if chosen wisely)
➢ Latency needs tuning to make the model fast enough
2. Customization & fine-tuning
➢ No lock-in to a particular model
3. Security compliance & data residency / privacy

Slide 23

Artificial Analysis: Hosting Llama 3 Instruct https://artificialanalysis.ai/

Slide 24

Artificial Analysis: Hosting Llama 3 API Prices https://artificialanalysis.ai/

Slide 25

Artificial Analysis: Comparing Llama 3 Instruct API Prices (hosting services vs. the LM API providers from the previous section) https://artificialanalysis.ai/

Slide 26

groq - LPU Inference Engine https://groq.com/

Slide 27

LLM Development Frameworks

Slide 28

LangChain 🦜🔗

LangChain is a framework (Python / JS library) for developing applications powered by large language models (LLMs). The main values of LangChain are:
✓ Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy to use, whether or not you are using the rest of the LangChain framework.
✓ Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks
https://www.langchain.com/langchain
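A minimal sketch of chaining those components, assuming langchain-openai is installed and OPENAI_API_KEY is set; the model name is illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Components: a prompt template, a chat model, and an output parser...
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
llm = ChatOpenAI(model="gpt-3.5-turbo")

# ...chained into a pipeline with the | operator (LCEL).
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "LangChain composes LLM components into chains."}))
```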

Slide 29

LlamaIndex Turn your enterprise data into production-ready LLM applications https://www.llamaindex.ai/

Slide 30

Semantic Kernel from Microsoft

Semantic Kernel is an SDK that integrates Large Language Models (LLMs) like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. Semantic Kernel achieves this by allowing you to define plugins that can be chained together in just a few lines of code.

What makes Semantic Kernel special, however, is its ability to automatically orchestrate plugins with AI. With Semantic Kernel planners, you can ask an LLM to generate a plan that achieves a user's unique goal. Afterwards, Semantic Kernel will execute the plan for the user.
https://github.com/microsoft/semantic-kernel

Slide 31

Langroid (Recommended) https://langroid.github.io/langroid/

Slide 32

RAG Concept: Retrieval-Augmented Generation

Slide 33

RAG: Ask → Retrieve from DB → Generate Answer

Slide 34

Steps in RAG

Slide 35

Document Search Example: Vector DB + Embedding Model
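A hedged sketch of the ask → retrieve → generate flow, using sentence-transformers as the embedding model and a plain in-memory array standing in for the vector DB; names and documents are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Ray Train distributes fine-tuning across a cluster.",
    "LoRA adds small low-rank adapters to a frozen model.",
    "vLLM serves LLMs with continuous batching.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

context = "\n".join(retrieve("How does LoRA work?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How does LoRA work?"
# answer = llm.generate(prompt)  # hand the prompt to any LLM of your choice
```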

Slide 36

Vector DB

Slide 37

What is a Vector DB?

A database that stores embedding representations and uses an Approximate Nearest Neighbor (ANN) algorithm to retrieve the stored vectors closest to a query embedding.
https://www.pinecone.io/learn/vector-database/
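A minimal sketch with Chroma as an example vector DB (it embeds documents with a built-in default model); the collection name and documents are illustrative:

```python
import chromadb

client = chromadb.Client()              # in-memory instance
col = client.create_collection("docs")

col.add(
    ids=["1", "2"],
    documents=["LoRA adds low-rank adapters.", "vLLM uses PagedAttention."],
)
# ANN query: returns the stored documents nearest to the query embedding.
print(col.query(query_texts=["What is LoRA?"], n_results=1)["documents"])
```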

Slide 38

Vector DB Landscape https://blog.det.life/why-you-shouldnt-invest-in-vector-databases-c0cd3f59d23c

Slide 39

Knowledge Graph + RAG using LlamaIndex
https://medium.com/@transformergpt/unleashing-the-power-of-knowledge-graphs-in-retrieval-augmented-generation-rag-step-by-step-84c2adc66c1c
Example of a knowledge graph: (diagram)

Slide 40

Graph DB: Neo4j and Vector Index https://neo4j.com/developer-blog/knowledge-graph-rag-application/

Slide 41

Data?

Slide 42

Deep Lake: Database for AI https://www.deeplake.ai/

Slide 43

LLMOps

Slide 44

LLMOps & MLOps

LLMOps = MLOps + the requirements of managing LLMs' unique characteristics:
✔ Large size
✔ Complex training requirements
✔ High computational demands
https://cloud.google.com/discover/what-is-llmops

Slide 45

What can LLMOps do?

LLMOps involves a comprehensive set of activities, including:
● Model deployment and maintenance: deploying and managing LLMs on cloud platforms or on-premises infrastructure
● Data management: curating and preparing training data, as well as monitoring and maintaining data quality
● Model training and fine-tuning: training and refining LLMs to improve their performance on specific tasks
● Monitoring and evaluation: tracking LLM performance, identifying errors, and optimizing models
● Security and compliance: ensuring the security and regulatory compliance of LLM operations
https://cloud.google.com/discover/what-is-llmops

Slide 46

Roles & Responsibilities https://aws.amazon.com/blogs/machine-learning/fmops-llmops-operationalize-generative-ai-and-differences-with-mlops/

Slide 47

Who uses LLMOps? https://aws.amazon.com/blogs/machine-learning/fmops-llmops-operationalize-generative-ai-and-differences-with-mlops/

Slide 48

Ray for LLMOps

Slide 49

Ray AI Runtime & Model Lifecycle https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a (basically the same image)

Slide 50

Ray LLM: Inference Landscape

Behind the scenes of top-tier companies such as OpenAI, Uber, and Cohere, helping them deploy LLM models, is Anyscale. The platform, developed by the Ray team, lets developers serve LLM models; one of its tools is Ray-LLM. Ray-LLM solves the technical problems of serving LLMs, which can be divided into three issues. (Text adapted from blog post)
https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a

Slide 51

Ray LLM: Optimized Batching

Continuous batching: GPUs are underutilized with naive (static) batching of different-length word sequences; this method instead concatenates new input sequence tokens onto the end of the token batch to keep the batch full, which increases the throughput of the system. (Text adapted from blog post)
https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
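A toy sketch of the scheduling idea (not Ray-LLM's actual implementation): finished sequences free their batch slot immediately, and new requests are admitted every decode step, so the batch stays full:

```python
import random
from collections import deque

random.seed(0)
MAX_BATCH = 4
waiting = deque(f"req-{i}" for i in range(1, 7))
running: dict[str, int] = {}     # request id -> decode steps still needed

while waiting or running:
    # Admit new requests into any free batch slots (continuous, not static).
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(1, 3)
    # One decode step for every sequence currently in the batch.
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:    # finished: evict now, don't wait for others
            del running[rid]
    print("step done, batch =", list(running))
```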

Slide 52

Ray LLM: Speculative Decoding

Speculative decoding: This method uses a small model to speculate K tokens ahead and a large model to verify them; drafted tokens are emitted only up to the first one the large model rejects. This allows faster forward passes per token, which reduces latency, since the large model only verifies and can do so in parallel. (Text adapted from blog post)
https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
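A toy sketch of the draft-then-verify loop with strings standing in for models; a real implementation compares token probabilities rather than exact strings:

```python
K = 4  # number of tokens the small model drafts per step

def draft_propose(prefix: list[str]) -> list[str]:
    return ["the", "cat", "sat", "down"][:K]      # stand-in small/draft model

def target_verify(prefix: list[str], proposal: list[str]) -> list[str]:
    truth = ["the", "cat", "sat", "on"]           # stand-in large/target model
    accepted: list[str] = []
    for tok, gold in zip(proposal, truth):
        if tok != gold:
            accepted.append(gold)  # the large model supplies the correction
            break
        accepted.append(tok)       # drafted token accepted
    return accepted

prefix: list[str] = []
prefix += target_verify(prefix, draft_propose(prefix))
print(prefix)  # ['the', 'cat', 'sat', 'on'] - four tokens for one target pass
```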

Slide 53

Ray LLM: Hybrid Routing

Hybrid routing: A supervised classifier categorizes each incoming query and selects a suitable model before the query is fed to an LLM. This helps when building LLM-based agents: each LLM has its own strengths per task, so it can be better to let the model with the most relevant context answer. (Text adapted from blog post)
https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a

Slide 54

Inference / Serving

Slide 55

vLLM: Model Serving for LLMs

Easy, fast, and cheap LLM serving for everyone. vLLM is fast with:
✅ State-of-the-art serving throughput
✅ Efficient management of attention key and value memory with PagedAttention
✅ Continuous batching of incoming requests
✅ Fast model execution with CUDA/HIP graph
✅ Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
✅ Optimized CUDA kernels
https://github.com/vllm-project/vllm
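A minimal offline-inference sketch with vLLM (requires a CUDA GPU and pip install vllm; the model name is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches incoming prompts continuously under the hood.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```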

Slide 56

Ray Serve

Slide 57

Text Generation Inference https://huggingface.co/docs/text-generation-inference/index

Slide 58

Trends

Slide 59

Candle - ML framework for Rust https://github.com/huggingface/candle

Slide 60

Candle vLLM

Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server. Based on Hugging Face Candle (Rust).
https://github.com/EricLBuehler/candle-vllm

Slide 61

Bend - massively parallel programming language with Pythonic syntax https://github.com/HigherOrderCO/Bend

Slide 62

Speculative Decoding https://arxiv.org/abs/2211.17192

Slide 63

KAN - Future of Neural Networks? https://arxiv.org/abs/2404.19756

Slide 64

KAN - Kolmogorov-Arnold Networks https://medium.com/@isaakmwangi2018/a-simplified-explanation-of-the-new-kolmogorov-arnold-network-kan-from-mit-cbb59793a040

Slide 65

KAN-GPT: Not now, but it’s pretty new https://github.com/AdityaNG/kan-gpt

Slide 66

GaLore for LLM Training (not Fine-Tuning) https://arxiv.org/abs/2403.03507

Slide 67

Responsible AI

Slide 68

Google AI Principles https://ai.google/responsibility/principles/

Slide 69

Safety Rating https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes

Slide 70

Microsoft Responsible AI Principles https://www.microsoft.com/en-us/ai/principles-and-approach/

Slide 71

Resources

Slide 72

Awesome LLMOps https://github.com/tensorchord/Awesome-LLMOps