Slide 8
How does KV-Cache work?
1. Token production: Each token produces intermediate K and V tensors.
2. Token generation: When generating subsequent tokens, the KV tensors of preceding tokens are required to compute self-attention.
3. KV caching: These K and V tensors are cached in GPUs, which is known as the KV cache (see the sketch below).
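A minimal sketch of these three steps, assuming a single attention head, NumPy, and illustrative names (Wq, Wk, Wv, decode_step); it is not the AttentionStore implementation, only an illustration of autoregressive decoding with a KV cache:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d = 8  # head dimension (illustrative)
rng = np.random.default_rng(0)
# Hypothetical projection weights for one attention head.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the KV cache: one K and one V row per processed token

def decode_step(x):
    """Process one token embedding x, reusing the cached K/V of preceding tokens."""
    q = x @ Wq
    k = x @ Wk            # step 1: the token produces intermediate K ...
    v = x @ Wv            # ... and V tensors
    k_cache.append(k)     # step 3: cache them so they need not be recomputed later
    v_cache.append(v)
    K = np.stack(k_cache)                  # (seq_len, d)
    V = np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))   # step 2: attend over all cached tokens
    return attn @ V

# Autoregressive decoding over a toy sequence of token embeddings.
for x in rng.standard_normal((5, d)):
    out = decode_step(x)
print(len(k_cache), "tokens cached")
```

Without the cache, every decode step would have to recompute K and V for the entire prefix; with it, each step only projects the newest token.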
Gao, B., He, Z., Sharma, P., Kang, Q., Jevdjic, D., Deng, J., ... & Zuo, P. (2024). AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. arXiv preprint arXiv:2403.19708.