comes at the cost of higher memory use, which becomes a bottleneck for processing requests concurrently. For Llama 3.1 8B with an 8K context length on an NVIDIA A100 (40 GB), you can only serve about 24 requests concurrently [2]. As a result, a lot of research has emerged on managing KV cache memory more efficiently [3].

KV caching
    Memory available          40 GB
    Model weights (FP16)      16 GB
    KV cache per request       1 GB

1. https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
2. https://lmcache.ai/kv_cache_calculator.html
3. https://arxiv.org/abs/2309.06180, https://lmcache.ai/tech_report.pdf
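To make the arithmetic behind these numbers concrete, here is a minimal sketch of the calculation. The Llama 3.1 8B architecture parameters used below (32 layers, 8 KV heads under grouped-query attention, head dimension 128) are assumptions drawn from the public model configuration, not from this text.

```python
# Sketch: estimate KV cache size per request and how many requests fit
# alongside the model weights on a 40 GB A100. Architecture parameters
# for Llama 3.1 8B are assumed from the public model config.

N_LAYERS = 32        # transformer layers
N_KV_HEADS = 8       # KV heads (grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # FP16
CONTEXT_LEN = 8192   # 8K-token context

# Each layer stores one key and one value vector per KV head per token.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_bytes_per_request = kv_bytes_per_token * CONTEXT_LEN

print(f"KV cache per token:   {kv_bytes_per_token / 1024:.0f} KiB")     # ~128 KiB
print(f"KV cache per request: {kv_bytes_per_request / 2**30:.1f} GiB")  # ~1 GiB

# Memory left for KV caches after loading the FP16 weights.
MEMORY_GB = 40
WEIGHTS_GB = 16
max_concurrent = (MEMORY_GB - WEIGHTS_GB) // (kv_bytes_per_request / 2**30)
print(f"Max concurrent requests: {int(max_concurrent)}")                # ~24
```

With roughly 1 GiB of KV cache per 8K-token request and 24 GB of memory left over after the weights, the cache, not compute, caps concurrency at about 24 requests.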