Slide 1

Slide 1 text

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Presented by Kamolphan Liwprasert, 2024-08-19
Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003

Slide 2

Slide 2 text

http://arxiv.org/abs/2407.18003

Slide 3

Slide 3 text

Introduction

Slide 4

Slide 4 text

Background
● LLMs are now widely used, but their efficiency is challenged by the Transformer architecture's struggle with handling long texts.
● KV-Cache has emerged as a pivotal solution to this issue:
✅ Converts the time complexity of token generation from quadratic to linear
❌ Increases GPU memory overhead in proportion to conversation length
Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003

Slide 5

Slide 5 text

Goals
1. Optimizing KV-Cache space usage of LLMs across the pre-training, deployment, and inference phases
2. Reviewing the landscape of LLM optimization
3. Metrics for evaluating the long-text capabilities of large language models, from both efficiency and capability perspectives

Slide 6

Slide 6 text

Challenges of LLMs
🔥 The decoder-only Transformer architecture has quadratic time complexity when processing text sequences.
🔥 During inference, the auto-regressive decoding mechanism amplifies this issue, as it repeats the process for every generated token.

Slide 7

Slide 7 text

What is KV-Cache? KV-Cache stores the key and value tensors generated by past tokens in the attention module, reducing the time complexity of generating each token to linear and greatly improving inference efficiency. It is a mechanism that leverages the causal masking property of MHA (Multi-Head Attention) to store and reuse intermediate computations, thereby improving the efficiency of LLMs, especially when dealing with long sequences.
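To make the mechanism concrete, here is a minimal single-head sketch in NumPy (dimensions, weights, and the in-place cache are made up for illustration; real implementations are batched and multi-headed): each step computes K/V only for the new token and appends them to the cache.

```python
import numpy as np

d = 64                          # head dimension (illustrative)
k_cache = np.zeros((0, d))      # cached keys of past tokens
v_cache = np.zeros((0, d))      # cached values of past tokens

def attend_one_token(x, Wq, Wk, Wv):
    """Attention output for one new token, reusing cached K/V of past tokens."""
    global k_cache, v_cache
    q = x @ Wq                              # query for the new token only
    k_cache = np.vstack([k_cache, x @ Wk])  # append this token's key ...
    v_cache = np.vstack([v_cache, x @ Wv])  # ... and value to the cache
    scores = (k_cache @ q) / np.sqrt(d)     # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                      # weighted sum of cached values

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
for _ in range(3):                          # feed three tokens one at a time
    attend_one_token(rng.standard_normal(d), Wq, Wk, Wv)
print(k_cache.shape)                        # (3, 64): one cached row per past token
```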

Slide 8

Slide 8 text

How does KV-Cache work?
1. Token production: each token produces intermediate K and V tensors.
2. Token generation: when generating subsequent tokens, the KV tensors of preceding tokens are required to compute self-attention.
3. KV caching: these K and V tensors are cached on GPUs, which is known as the KV-Cache.
Gao, B., He, Z., Sharma, P., Kang, Q., Jevdjic, D., Deng, J., ... & Zuo, P. (2024). AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. arXiv preprint arXiv:2403.19708.
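In practice, inference libraries expose this cache directly. A hedged sketch using the Hugging Face transformers API (model name chosen only for illustration; newer versions wrap the cache in a Cache object, but the pattern is the same): the prompt is encoded once, and each later step feeds only the newest token plus the returned past_key_values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("KV caching keeps inference", return_tensors="pt").input_ids
past = None
for _ in range(5):
    with torch.no_grad():
        out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
    past = out.past_key_values                         # cached K/V for every layer
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    input_ids = next_id                                # only the new token is fed next
print(tok.decode(next_id[0]))
```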

Slide 9

Slide 9 text

Challenges of KV-Cache 🔥 The KV-Cache grows linearly with sequence length, so the memory it requires becomes larger and larger, especially for giant models like GPT-3. 🔥 It is also hard to reuse the cache across repeated dialogues, and the low memory bandwidth of GPUs compared to their compute makes the cache a bottleneck.
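A back-of-the-envelope calculation (assuming LLaMA2-7B-like dimensions: 32 layers, hidden size 4096, fp16) shows why this growth matters; it reproduces the roughly 0.5 MB-per-token figure cited on the evaluation slide.

```python
layers, hidden, bytes_per_value = 32, 4096, 2            # fp16 = 2 bytes
per_token = 2 * layers * hidden * bytes_per_value         # K and V for every layer
print(per_token / 2**20, "MiB per token")                 # -> 0.5 MiB
print(per_token * 4096 / 2**30, "GiB for a 4k context")   # -> 2.0 GiB per sequence
```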

Slide 10

Slide 10 text

Benefits of optimizing KV-Cache
✅ Reducing memory usage, which leads to cost reduction and less energy consumption
✅ Improving LLM serving efficiency
✅ Enhancing LLMs' performance with longer contexts

Slide 11

Slide 11 text

GitHub: Awesome-KV-Cache
Related research papers on KV-Cache: https://github.com/zcli-charlie/Awesome-KV-Cache

Slide 12

Slide 12 text

Optimization

Slide 13

Slide 13 text

https://github.com/zcli-charlie/Awesome-KV-Cache/blob/main/assets/Main.png

Slide 14

Slide 14 text

Optimization: 3 Main Stages
Training Stage
● Most effective
● Requires architecture changes
● Not suitable for modifying existing models
● Not suitable for low computational power
Deployment Stage
● Optimizing KV-Cache in the serving framework
Post-Training Stage
● Eviction
● Quantization

Slide 15

Slide 15 text

1. Training Stage

Slide 16

Slide 16 text

https://github.com/zcli-charlie/Awesome-KV-Cache/blob/main/assets/Main.png 1

Slide 17

Slide 17 text

Training Stage
● The most effective class of KV-Cache compression methods.
● Involves changes to the LLM architecture.
● Cannot be applied to existing models.
MHA (Multi-Head Attention) → MQA (Multi-Query Attention) → GQA (Grouped-Query Attention)

Slide 18

Slide 18 text

Comparison between MHA, MQA and GQA
● MHA (Multi-Head Attention): every query head has its own key/value head.
● MQA (Multi-Query Attention): all query heads share a single key/value head.
● GQA (Grouped-Query Attention): query heads are divided into groups, each group sharing one key/value head.
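A small sketch of what the choice of attention variant means for the cache (head dimension, layer count, and group count are assumed, LLaMA2-7B-like, for illustration): only the number of cached K/V heads changes, so MQA and GQA shrink the cache by the sharing factor.

```python
head_dim, bytes_fp16, layers = 128, 2, 32     # assumed, LLaMA2-7B-like dimensions

def kv_bytes_per_token(kv_heads):
    # 2 accounts for storing both K and V per layer
    return 2 * kv_heads * head_dim * bytes_fp16 * layers

print("MHA, 32 KV heads:", kv_bytes_per_token(32) // 1024, "KiB/token")  # 512 KiB
print("GQA,  8 KV heads:", kv_bytes_per_token(8) // 1024, "KiB/token")   # 128 KiB
print("MQA,  1 KV head :", kv_bytes_per_token(1) // 1024, "KiB/token")   #  16 KiB
```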

Slide 19

Slide 19 text

2. Deployment Stage

Slide 20

Slide 20 text

https://github.com/zcli-charlie/Awesome-KV-Cache/blob/main/assets/Main.png 2

Slide 21

Slide 21 text

Deployment Stage
● PagedAttention (Kwon et al., 2023): the vLLM framework for high-performance LLM serving.
● DistAttention / DistKV-LLM (Lin et al., 2024): enables distributed deployment of the KV-Cache across multiple servers, significantly improving the efficiency of serving LLMs on large-scale cloud infrastructure.
● ChunkAttention (Ye et al., 2024): avoids repeated computation of shared tokens in the pre-fill stage, speeding up the response of the deployment system.
● InfLLM (Jin et al., 2024): allows large models to achieve near-infinite context without additional training while using very little additional KV-Cache.
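To illustrate the PagedAttention idea from the list above, here is a hypothetical block-table sketch in the spirit of virtual-memory paging; the class, names, and sizes are made up for illustration and are not vLLM's actual data structures.

```python
BLOCK_SIZE = 16                                   # tokens per physical KV block

class BlockTable:
    """Maps a sequence's logical token index to (physical block id, offset)."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks            # shared pool of GPU KV blocks
        self.blocks = []                          # this sequence's blocks, in order

    def append_token(self, logical_idx):
        if logical_idx % BLOCK_SIZE == 0:         # current block is full
            self.blocks.append(self.free_blocks.pop())
        return self.blocks[logical_idx // BLOCK_SIZE], logical_idx % BLOCK_SIZE

table = BlockTable(free_blocks=list(range(1024)))
print(table.append_token(0))                      # new block allocated, offset 0
print(table.append_token(1))                      # same block, offset 1
```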

Slide 22

Slide 22 text

vLLM: LLM Serving with PagedAttention
https://github.com/vllm-project/vllm
https://arxiv.org/abs/2309.06180
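A minimal usage sketch based on vLLM's public quickstart (model name chosen only for illustration; check the project README for the current interface): PagedAttention manages the KV-Cache in fixed-size blocks behind this API.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                      # small model for illustration
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```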

Slide 23

Slide 23 text

3. Post-Training Stage

Slide 24

Slide 24 text

https://github.com/zcli-charlie/Awesome-KV-Cache/blob/main/assets/Main.png 3

Slide 25

Slide 25 text

Post-Training Stage Eviction methods concern the policies used to discard unnecessary tokens from the cache. Two lines of approach exist: static policies, which are designed before inference and remain the same across every inference request, and dynamic policies, which use information generated during inference to identify important tokens.
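As a concrete illustration of a static policy (a generic sketch, not a specific method from the survey), the snippet below keeps a few initial "sink" tokens plus a recent sliding window and evicts everything in between; the sizes are arbitrary.

```python
def evict(cached_positions, n_sink=4, window=1020):
    """Keep the first n_sink positions plus the most recent `window` positions."""
    if len(cached_positions) <= n_sink + window:
        return cached_positions                   # cache still fits, nothing to drop
    return cached_positions[:n_sink] + cached_positions[-window:]

cache = list(range(5000))                         # token positions currently cached
print(len(evict(cache)))                          # -> 1024 entries kept
```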

Slide 26

Slide 26 text

Eviction Policy

Slide 27

Slide 27 text

Evaluation

Slide 28

Slide 28 text

Datasets
● LongBench (Bai et al., 2023) is the first bilingual (English and Chinese) multitask benchmark for long-context understanding. It consists of 21 datasets across 6 task categories: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
● Passkey retrieval (Mohtashami & Jaggi, 2023): models are required to retrieve a random passkey hidden in a long document.
● Needle in a Haystack / BABILong (Kuratov et al., 2024): a benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. BABILong hides algorithmically generated question-answering and reasoning problems inside a corpus of book texts and consists of 20 tasks evaluating basic aspects of reasoning.
● Few-shot testing (Brown et al., 2020): long inputs are constructed in a few-shot format or by simulating multi-turn dialogues, in order to test the model's capabilities with long texts. Furthermore, for some reasoning-type tests, the Chain-of-Thought (CoT) strategy proposed in Wei et al. (2022) can be adopted to further increase the length of few-shot texts.

Slide 29

Slide 29 text

Evaluation Metrics
● Per-token GPU memory usage: for the KV-Cache, the most intuitive optimization indicator is the memory space occupied by each token. The LLaMA2-7B model, as a typical example, theoretically occupies 0.5 MB of memory per token of KV-Cache.
● Throughput and latency: throughput, usually measured in tokens per second (token/s), represents how many new tokens the model can generate per second. In the decoding phase, latency is usually the time required to generate each new token, typically in milliseconds.
● Perplexity (PPL): for each token, the model computes the natural log-likelihood of its prediction given the previous tokens; these are averaged (ANLL, the average natural log-likelihood), and PPL is e raised to the negative of this average. PPL provides a rough reference for changes in model performance: if PPL rises sharply, it usually means the model's ability has significantly degraded, e.g., it has completely lost language ability.
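Written out, the perplexity definition above is the standard formula, with N the sequence length and x_{<i} the preceding tokens:

\[
\mathrm{ANLL} = \frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right),
\qquad
\mathrm{PPL} = e^{-\mathrm{ANLL}}
\]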

Slide 30

Slide 30 text

Conclusion

Slide 31

Slide 31 text

Key Takeaways
● Principles of KV-Cache optimization: the main goal of KV-Cache optimization is to reduce memory usage by compressing the Keys and Values in the KV pairs.
● Trade-offs in deletion vs. compression: there is a trade-off between deleting less important KV pairs to save memory and compressing the entire KV-Cache without deletion; the former may impact model performance, while the latter focuses on retaining information.
● Extremes in KV-Cache management: a potential future direction is to store the KV-Cache externally, turning KV-Cache management into a retrieval task.
● Future directions in storage and retrieval technologies: the future of LLMs will likely see storage and retrieval technologies becoming as important as the computational models themselves, opening new possibilities for LLM efficiency and versatility.

Slide 32

Slide 32 text

https://github.com/zcli-charlie/Awesome-KV-Cache/blob/main/assets/Main.png

Slide 33

Slide 33 text

Reference
Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003

Slide 34

Slide 34 text

GitHub: Awesome-KV-Cache https://github.com/zcli-charlie/Awesome-KV-Cache