[re:Invent 2025] Distributed inference on AWS: Deep dive into inference optimizations (AIM353)

As Large Language Models (LLMs) become integral to applications, optimizing inference performance is paramount, especially at scale. This session delves into cutting-edge techniques, research, and frameworks that enhance LLM serving efficiency, focusing on KV-cache management, PagedAttention mechanisms, and disaggregated serving architectures. We will cover best practices for self-managed inference on AWS: use cases with architecture diagrams, a deep dive into how ML inference works on accelerated compute, and common optimization patterns seen in the field. Attendees will gain a comprehensive understanding of the current landscape and future directions in inference optimization, along with the knowledge to implement these advancements in distributed systems.

Keita Watanabe

January 05, 2026

Transcript

  1. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. AIM353 Aman Shanbhag GenAI Specialist Solutions Architect WWSO AIML Frameworks AWS Keita Watanabe GenAI Specialist Solutions Architect WWSO AIML Frameworks AWS Distributed Inference on AWS: A Deep Dive into Inference Optimizations
  2. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Challenges in LLM Inference Is apple a fruit? Yes it is. User ChatBot
  3. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Request flow between User and ChatBot: "Is apple a fruit?" → Tokenization → [BOS, Is, apple, a, Fruit, ?] → Text generation → [Yes, it, is, EOS] → Detokenization → "Yes it is" https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1
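The request path on this slide (tokenize, generate until EOS, detokenize) maps directly onto a few lines of code. A minimal sketch using the Hugging Face transformers API; the model id and generation settings here are illustrative assumptions, not from the deck:

```python
# Minimal sketch of the tokenize -> generate -> detokenize path from the slide.
# The model id and max_new_tokens are illustrative, not from the deck.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Is apple a fruit?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)   # tokenization

# Text generation: the model emits tokens one at a time until EOS (or max_new_tokens)
output_ids = model.generate(**inputs, max_new_tokens=16)

# Detokenization: map the generated token ids back to text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```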
  4. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Challenges in LLM Inference What exactly is going on? Which hardware to use? How can I streamline? How to scale big model inference? Developer https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1
  5. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. NVIDIA GPUS AND AWS ML ACCELERATORS Accelerated computing portfolio GPUs: A10G, T4, L4, L40S, A100, H100, H200, B200 (instances: G5, G6, G6e, P4d, P4de, P5, P5e, P5en, P6-B200, P6e-GB200) AWS ML chips: Trainium accelerator (Trn1, Trn2, Trn3), Inferentia accelerator (Inf2)
  6. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Agenda Basics of Text Generation Inference: Transformers Architecture; Prefill vs. Decode and KV Cache; Key Metrics; Memory/Compute Requirement; Roofline Model. LLM Inference Optimization Strategies: Model Architecture Optimization; Quantization; Attention Mechanism; Operator Fusion; KV Cache Management; Scheduling & Batching; Distributed Inference (Data Parallel, Pipeline Parallel, Tensor Parallel, Context Parallel, Expert Parallel, Disaggregated Serving); System Optimization
  7. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Basics of Text Generation Inference
  8. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Anatomy of LLM inference is EOS it Is apple a Fruit? Yes LLM Iteration 1 Yes LLM Iteration 2 it LLM Iteration 3 is LLM Iteration 4 https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1 BOS
  9. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Anatomy of LLM inference is EOS it Is apple a Fruit? Yes LLM Iteration 1 Yes LLM Iteration 2 it LLM Iteration 3 is LLM Iteration 4 https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1 BOS Time to First Token (TTFT) Time per Output Token (TPOT)
  10. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Anatomy of LLM inference is EOS it Is apple a Fruit? Yes LLM Iteration 1 Yes LLM Iteration 2 it LLM Iteration 3 is LLM Iteration 4 https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1 BOS KV Cache 1 KV Cache 2 KV Cache 3
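To make the iteration-by-iteration picture concrete, here is a hand-rolled greedy decoding loop, a sketch assuming a Hugging Face causal LM as in the snippet above. `past_key_values` is the KV cache carried between iterations; the first forward pass corresponds to TTFT, and each later single-token pass to TPOT:

```python
# Hand-rolled greedy decoding loop (sketch). Assumes `model` and `tokenizer`
# from the previous snippet (any Hugging Face causal LM).
import time
import torch

input_ids = tokenizer("Is apple a fruit?", return_tensors="pt").input_ids.to(model.device)

past_key_values = None            # KV cache: empty before prefill
generated = input_ids
t0 = time.perf_counter()
with torch.no_grad():
    for step in range(16):
        out = model(
            input_ids=generated if past_key_values is None else generated[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,       # ask the model to return / extend its KV cache
        )
        past_key_values = out.past_key_values        # KV Cache 1, 2, 3, ...
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if step == 0:
            print(f"TTFT ~ {time.perf_counter() - t0:.3f}s")   # prefill + first token
        if next_token.item() == tokenizer.eos_token_id:        # stop at EOS
            break
print(tokenizer.decode(generated[0]))
```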
  11. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. So, what exactly is going on at the prefill stage?
  12. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Architecture Transformer Layer 1 Transformer Layer 𝑙 Transformer Layer 𝐿 Prediction Head … … Normalization Norm(𝐻𝑙−1 ) Hidden States 𝐻𝑙−1 [𝑆, 𝐸] Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S,E] 𝑂MLP [S,E] = 𝐻𝑙 [𝑆, 𝐸] 𝑂attn [S,E] 𝐼MLP [S,E] 𝑂MLP [S,E] + + + symbol dimension S Input Sequence Length E Embedding Dimension Is apple a Fruit? S E Yes BOS
  13. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 Attention Block: Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) 𝐼attn [S,E] 𝑂Attn [S,E] Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Mask × Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S,E] 𝑂attn [S,E] 𝐼MLP [S,E] 𝑂MLP [S,E] + + u + + 𝐻𝑙−1 [S,E] 𝐻𝑙−1 [S,E] ※ Positional Encoding is ignored Softmax
  14. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 MLP Block: 𝑀𝐿𝑃(𝐼MLP; 𝑊Gate, WUp , 𝑊Down) Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S,E] 𝑂attn [S,E] 𝐼MLP [S,E] 𝑂MLP [S,E] + + u + SiLU Gate 𝑾𝑮 Up 𝑾𝑼 × 𝐼MLP [S,E] Down 𝑾𝑫 𝐻𝑙−1 [S,E] + 𝑂attn [S,E] 𝑂MLP [S,E]
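Slides 12 to 14 can be collapsed into one small forward pass. A self-contained sketch of a single pre-norm transformer layer over hidden states of shape [S, E]; a single attention head, random weights, RMSNorm and a SwiGLU MLP are assumed here, and positional encoding is ignored as on the slide:

```python
# One pre-norm transformer layer over hidden states H of shape [S, E] (prefill).
# Single attention head for brevity; weights are random and purely illustrative.
import torch
import torch.nn.functional as F

S, E, F_dim = 5, 64, 256                       # sequence length, embedding dim, MLP width
H = torch.randn(S, E)                          # hidden states H_{l-1}
Wq, Wk, Wv, Wo = (torch.randn(E, E) * E**-0.5 for _ in range(4))
Wg, Wu = torch.randn(E, F_dim) * E**-0.5, torch.randn(E, F_dim) * E**-0.5
Wd = torch.randn(F_dim, E) * F_dim**-0.5

def rmsnorm(x):                                # Normalization block
    return x / x.pow(2).mean(-1, keepdim=True).add(1e-6).sqrt()

# --- Attention block: Attn(I_attn; Wq, Wk, Wv, Wo) ---
I_attn = rmsnorm(H)
Q, K, V = I_attn @ Wq, I_attn @ Wk, I_attn @ Wv            # [S, E] each
scores = (Q @ K.T) / E**0.5                                # [S, S]
mask = torch.triu(torch.full((S, S), float("-inf")), 1)    # causal mask
O_attn = (F.softmax(scores + mask, dim=-1) @ V) @ Wo       # [S, E]
H = H + O_attn                                             # residual add

# --- MLP block: SwiGLU MLP(I_mlp; Wg, Wu, Wd) ---
I_mlp = rmsnorm(H)
O_mlp = (F.silu(I_mlp @ Wg) * (I_mlp @ Wu)) @ Wd           # [S, E]
H_l = H + O_mlp                                            # residual add -> H_l
print(H_l.shape)                                           # torch.Size([5, 64])
```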
  15. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How is the decode phase different?
  16. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Architecture Transformer Layer 1 Transformer Layer 𝑙 Transformer Layer 𝐿 Prediction Head … … Normalization Norm(𝐻𝑙−1 ) Hidden States 𝐻𝑙−1 [𝑆 = 1, 𝐸] Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 , | 𝐾cache, 𝑉cache) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S=1,E] 𝑂MLP [S=1,E] = 𝐻𝑙 [𝑆 = 1, 𝐸] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + + symbol dimension S Input Sequence Length E Embedding Dimension S = 1 E Yes it
  17. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 Attention Block: Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) 𝐼attn [S=1,E] 𝑂Attn [S=1,E] Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Mask × Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 | 𝐾cache, 𝑉cache) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S=1,E] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + u + + 𝐻𝑙−1 [S=1,E] 𝐻𝑙−1 [S=1,E] 𝐾cache 𝑉cache Concat Concat Softmax
  18. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 MLP Block: 𝑀𝐿𝑃(𝐼MLP; 𝑊Gate, WUp , 𝑊Down) Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S=1,E] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + u + SiLU Gate 𝑾𝑮 Up 𝑾𝑼 × 𝐼MLP [S=1,E] Down 𝑾𝑫 𝐻𝑙−1 [S=1,E] + 𝑂attn [S=1,E] 𝑂MLP [S=1,E]
  19. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝐾cache 𝑉cache Concat Concat Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Mask × Prefill phase Decode phase 𝐼attn [S=1,E] 𝑂Attn [S=1,E] 𝐼attn [S,E] 𝑂Attn [S,E] Softmax Mask
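The prefill/decode contrast above, written out for one attention head: prefill projects all S prompt tokens, applies the causal mask, and writes the K/V cache; decode projects only the newest token, concatenates it onto the cache, and needs no mask. A sketch with toy shapes and random weights:

```python
# Prefill vs. decode for one attention head (sketch; single head, no batching).
import torch
import torch.nn.functional as F

S, E = 5, 64
Wq, Wk, Wv, Wo = (torch.randn(E, E) * E**-0.5 for _ in range(4))

# Prefill: all S prompt tokens at once, causal mask, write the K/V cache
x = torch.randn(S, E)
Q, K, V = x @ Wq, x @ Wk, x @ Wv
mask = torch.triu(torch.full((S, S), float("-inf")), 1)
O = (F.softmax((Q @ K.T) / E**0.5 + mask, dim=-1) @ V) @ Wo
k_cache, v_cache = K, V                         # write cache: [S, E] each

# Decode: a single new token (S = 1); read + extend the cache, no mask needed
x_new = torch.randn(1, E)
q = x_new @ Wq                                  # [1, E]
k_cache = torch.cat([k_cache, x_new @ Wk])      # [S+1, E]
v_cache = torch.cat([v_cache, x_new @ Wv])      # [S+1, E]
o = (F.softmax((q @ k_cache.T) / E**0.5, dim=-1) @ v_cache) @ Wo   # [1, E]
print(O.shape, o.shape)
```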
  20. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Key-value caching mechanism (×: matrix multiply; HBM: high-bandwidth memory). Prefill: Q [S, E] × K^T [E, S] × V [S, E] = Result [S, E]; the computed K and V are written to the K cache / V cache in HBM. Decode step 1: Q [1, E] × K^T [E, S+1] × V [S+1, E] = Result [1, E]; the K/V cache is read from HBM and extended by one entry. symbol dimension: S Sequence Length, E Embedding Dimension https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
  21. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Memory Usage in LLM Inference Example (Llama3 70B, FP8): Parameters 70 GB, KV Cache 10 GB, Other 7 GB VRAM consumption. Parameters: FP32: 4 bytes per parameter; BF16: 2 bytes per parameter; FP8: 1 byte per parameter; FP4: 0.5 byte per parameter. KV Cache Formula: Total size of KV cache (bytes) ≈ batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(precision). Others: activations and other overhead, 10~15% of the parameter footprint. Example assumes Llama3 70B, batch size 1, sequence length 8K (l=80, h=8192), FP8.
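The slide's numbers fall straight out of the formula. A quick check assuming the stated shape (80 layers, hidden size 8192), batch 1, an 8K sequence, and FP8; note the formula charges the full hidden size to K and V, whereas with GQA the term becomes num_kv_heads × head_dim, which is why production Llama 3 caches are smaller:

```python
# Reproducing the slide's VRAM estimate for Llama3 70B at FP8 (sketch).
bytes_per_value = 1                  # FP8: 1 byte per value
num_params = 70e9
batch_size, seq_len = 1, 8192        # batch size 1, 8K sequence
num_layers, hidden_size = 80, 8192   # l=80, h=8192 per the slide

weights = num_params * bytes_per_value                       # parameter footprint
kv_cache = (batch_size * seq_len * 2                         # 2 = one K and one V tensor
            * num_layers * hidden_size * bytes_per_value)    # slide formula
other = 0.10 * weights                                       # activations etc., ~10-15%

GB = 1e9
print(f"parameters ~ {weights / GB:.0f} GB")    # ~70 GB
print(f"kv cache   ~ {kv_cache / GB:.1f} GB")   # ~10.7 GB
print(f"other      ~ {other / GB:.0f} GB")      # ~7 GB
```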
  22. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How fast/efficient would LLM inference be on my hardware?
  23. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. NVIDIA GPU instances (P5: NVIDIA H100/H200 Tensor Core; P6: NVIDIA B200 Tensor Core) https://aws.amazon.com/ec2/instance-types/p5/
    Instance Size | ACC | NUM ACC | ACC Memory | Acc. P2P BW | EFA
    P5.48xlarge | H100 | 8 | 640 GB | 900 GB/s | 3200 Gbps EFAv2
    P5e.48xlarge | H200 | 8 | 1128 GB | 900 GB/s | 3200 Gbps EFAv2
    P5en.48xlarge | H200 | 8 | 1128 GB | 900 GB/s | 3200 Gbps EFAv3
    P6-B200.48xlarge | B200 | 8 | 1440 GB | 1.8 TB/s | 3200 Gbps EFAv4
    P6e-GB200.36xlarge | GB200 | 4 | 740 GB | 1.8 TB/s | 3200 Gbps EFAv4
    u-p6e-gb200x36 | GB200 | 36 | 6.7 TB | 1.8 TB/s | 14400 Gbps EFAv4
    u-p6e-gb200x72 | GB200 | 72 | 13.3 TB | 1.8 TB/s | 28800 Gbps EFAv4
  24. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝐾cache 𝑉cache Concat Concat Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × Prefill phase Decode phase Mask Mask
  25. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Why Memory Matters as Much as Compute
  26. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Rooflines for all the accelerators
  27. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Model Architecture Optimization
  28. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dive deeper into the Attention Block
  29. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Multi-head attention – AKA the vanilla attention symbol dimension B Batch size S Sequence length E Embedding dimension H Attention head dimension N Number of query key value heads 𝑾𝑸𝟏 [𝑬, 𝑯] 𝑾𝑸𝟎 [𝑬, 𝑯] 𝑾𝑸𝑵 [𝐄, 𝐇] … 𝐼attn [𝑆, 𝐸] 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵 𝑺, 𝑯 … 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] × × × = = = 𝑾𝑲𝟏 [𝑬, 𝑯] 𝑾𝑲𝟎 [𝑬, 𝑯] 𝑾𝑲𝑵 [𝐄, 𝐇] … 𝐼attn [𝑆, 𝐸] 𝑲𝟎 𝑺, 𝑯 𝑲𝟏 𝑺, 𝑯 𝑲𝑵 𝑺, 𝑯 … 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] × × × = = = 𝑾𝑽𝟏 [𝑬, 𝑯] 𝑾𝑽𝟎 [𝑬, 𝑯] 𝑾𝑽𝑵 [𝐄, 𝐇] … 𝐼attn [𝑆, 𝐸] 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵 𝑺, 𝑯 … 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] × × × = = =
  30. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Multi-head attention – AKA the vanilla attention symbol dimension B Batch size S Sequence length E Embedding dimension H Attention head dimension N Number of query key value heads 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 𝑲𝟏 𝑻 𝑯, 𝑺 𝑲𝑵 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵 𝑯, 𝑬 … × × ×
  31. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. KV Cache bottleneck in LLM inference (chart: GPU memory consumption vs. sequence length). GPU Memory Usage: the KV cache size scales with sequence length, often consuming the majority of GPU memory. Latency: loading the massive KV cache from HBM for each generated token slows down decoding. Scalability Limit: poses a hard limit on the context lengths and batch sizes that can be feasibly deployed. KV Cache Formula: Total size of KV cache (bytes) ≈ batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(precision), where hidden_size = num_heads × dim_head.
  32. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Multi Query Attention 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 𝑲𝟏 𝑻 𝑯, 𝑺 𝑲𝑵−𝟏 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵−𝟏 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 𝑲𝑵 𝑻 𝑯, 𝑺 𝑽𝑵 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × symbol dimensio n B Batch size S Sequence Length E Embeddi ng Dimensio n H Attention head N Number of query heads K Number of key/value heads G q heads per kv head (N//K) 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × Multi Head Attention (MHA) Multi Query Attention (MQA) G=2 MQA → K=1
  33. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Grouped Query Attention 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 𝑲𝟏 𝑻 𝑯, 𝑺 𝑲𝑵−𝟏 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵−𝟏 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 𝑲𝑵 𝑻 𝑯, 𝑺 𝑽𝑵 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × symbol dimensio n B Batch size S Sequence Length E Embeddi ng Dimensio n H Attention head N Number of query heads K Number of key/value heads G q heads per kv head (N//K) 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 𝑲𝑲 𝑻 𝑯, 𝑺 𝑽𝑲 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × Multi Head Attention (MHA) Grouped Query Attention (GQA) G=2 MQA → K=1
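The MHA to GQA change amounts to projecting K and V with fewer heads and letting each K/V head serve a group of G = N // K query heads, which shrinks the KV cache by a factor of N / K. A sketch with toy shapes (K = 1 would be MQA, K = N would be plain MHA):

```python
# Grouped Query Attention (sketch): N query heads share K key/value heads.
import torch
import torch.nn.functional as F

S, H = 6, 32          # sequence length, per-head dimension
N, K = 8, 2           # query heads, kv heads (K=1 -> MQA, K=N -> MHA)
G = N // K            # query heads per kv head

q = torch.randn(N, S, H)          # [N, S, H]
k = torch.randn(K, S, H)          # [K, S, H]  <- only K heads are cached
v = torch.randn(K, S, H)

# Broadcast each K/V head to its group of G query heads
k_exp = k.repeat_interleave(G, dim=0)            # [N, S, H]
v_exp = v.repeat_interleave(G, dim=0)

scores = q @ k_exp.transpose(-2, -1) / H**0.5    # [N, S, S]
out = F.softmax(scores, dim=-1) @ v_exp          # [N, S, H]

# KV cache holds 2*K*S*H values instead of 2*N*S*H, i.e. N/K = 4x smaller here
print(out.shape)
```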
  34. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Low Precision Inference
  35. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. FLOPS table: on modern hardware, coarser (lower-precision) formats run faster.
    Device | FP32 (TFLOPs) | TF32 Tensor (TFLOPs) | BF16 Tensor (TFLOPs) | FP8 Tensor (TFLOPs)
    NVIDIA H200 | 67 | 495 | 989 | 1979
    NVIDIA H100 | 67 | 495 | 989 | 1979
    NVIDIA B200 | 75 | 1125 | 2250 | 4500
    NVIDIA L4 | 30.3 | 60 | 121 | 243
    NVIDIA L40S | 91.6 | 183 | 362 | 733
    RTX Pro 4500 Blackwell | 55 | 105 | 211 | 422
  36. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Weight Activation Quantization (roofline plot: FLOPS vs. arithmetic intensity in FLOPS/Byte, with the Attention Block and MLP Block shown at FP16 and FP8) ① The attention block, especially with KV-cache traffic, is strongly memory-bound. When we quantize weights and activations to FP8, each value is half the size, so we move more FLOPs per byte: arithmetic intensity increases, and the point moves up along the memory roof. ② The MLP block is compute-bound: it already has high FLOPs/byte; FP8 lets us use faster low-bit tensor cores, so the compute roof lifts and the MLP moves up to a higher plateau. https://arxiv.org/pdf/2310.19102
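The roofline argument is easy to sanity-check numerically. Using illustrative H100-class figures that are assumptions rather than deck numbers: roughly 989 TFLOP/s of BF16 compute and about 3.35 TB/s of HBM bandwidth put the ridge point near 300 FLOPs/byte; a decode-time matrix-vector product sits around 1 FLOP/byte and is memory-bound, a large prefill GEMM sits far above the ridge, and halving the bytes with FP8 roughly doubles the intensity of the memory-bound case:

```python
# Back-of-envelope roofline check (sketch; hardware figures are illustrative).
peak_flops = 989e12            # ~H100-class BF16 tensor compute, FLOP/s
hbm_bandwidth = 3.35e12        # ~H100-class HBM bandwidth, bytes/s
ridge = peak_flops / hbm_bandwidth
print(f"ridge point ~ {ridge:.0f} FLOPs/byte")   # below this intensity: memory-bound

def gemm_intensity(m, k, n, bytes_per_el):
    """Arithmetic intensity of an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n                                  # one multiply-add per pair
    traffic = bytes_per_el * (m * k + k * n + m * n)       # read A and B, write C
    return flops / traffic

# Decode: one token against an 8192x8192 weight matrix (a GEMV) -> memory-bound
print(f"decode GEMV, BF16: {gemm_intensity(1, 8192, 8192, 2):.1f} FLOPs/byte")
print(f"decode GEMV, FP8 : {gemm_intensity(1, 8192, 8192, 1):.1f} FLOPs/byte")

# Prefill: 8192 tokens against the same matrix -> far above the ridge, compute-bound
print(f"prefill GEMM, BF16: {gemm_intensity(8192, 8192, 8192, 2):.0f} FLOPs/byte")
```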
  37. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Remove attention bottleneck
  38. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU Architecture Streaming Multiprocessor (SM) L1 Cache Streaming Multiprocessor (SM) L1 Cache Streaming Multiprocessor (SM) L1 Cache Streaming Multiprocessor (SM) L1 Cache L2 Cache High-Bandwidth Memory (HBM) GPU … The L1 Cache is a small, very fast on-chip cache that can be programmer controlled. The L2 Cache is a relatively large hardware-controlled cache with higher bandwidth than HBM. DRAM or HBM stores parameters, activations, optimizer state, etc. The SM is the GPU's compute block that bundles execution units (CUDA/Tensor cores) with warp schedulers, registers, and on-chip memory (L1 Cache). https://jax-ml.github.io/scaling-book/gpus/
  39. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU Memory Hierarchy (e.g., A100) HBM SRAM
  40. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU Memory Hierarchy (e.g., A100) HBM SRAM How can we minimize data movement between SRAM <> HBM?
  41. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Streaming Multiprocessor (SM), HBM, SRAM. Standard Attention: load Q, K; write S = QK^T; load S; write P = Softmax(S); load P, V; write O = PV (every intermediate matrix round-trips through HBM). Flash Attention: for each tile, load K_j, V_j and load Q_i, O_i, l_i, m_i into SRAM; compute S_ij = Q_i K_j^T; m' = rowmax(S_ij); P = exp(S_ij − m'); l = rowsum(P); m = max(m', m); calculate O from l and m; write back O_i, l_i, m_i. https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention
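The online-softmax bookkeeping on the Flash Attention side of the slide can be written out directly. A reference sketch in plain PyTorch (not the fused kernel) that streams K/V one tile at a time, keeps the running row-max m and row-sum l, and never materializes the full S × S score matrix; it reproduces standard attention:

```python
# FlashAttention-style tiled attention with an online softmax (reference sketch,
# not the fused CUDA kernel). K/V are streamed block by block, so the full S x S
# score matrix never has to be written out.
import torch
import torch.nn.functional as F

S, D, BLOCK = 128, 64, 32
Q, K, V = torch.randn(S, D), torch.randn(S, D), torch.randn(S, D)
scale = D ** -0.5

m = torch.full((S,), float("-inf"))    # running row-max
l = torch.zeros(S)                     # running softmax denominator
acc = torch.zeros(S, D)                # running (unnormalized) output

for j in range(0, S, BLOCK):           # stream one K/V tile at a time ("in SRAM")
    Kj, Vj = K[j:j + BLOCK], V[j:j + BLOCK]
    Sij = (Q @ Kj.T) * scale                          # [S, BLOCK] partial scores
    m_new = torch.maximum(m, Sij.max(dim=-1).values)
    P = torch.exp(Sij - m_new[:, None])
    correction = torch.exp(m - m_new)                 # rescale old statistics
    l = l * correction + P.sum(dim=-1)
    acc = acc * correction[:, None] + P @ Vj
    m = m_new

O = acc / l[:, None]
O_ref = F.softmax((Q @ K.T) * scale, dim=-1) @ V      # standard attention
print(torch.allclose(O, O_ref, atol=1e-5))            # should print True
```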
  42. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How can we manage multiple requests?
  43. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Attention Score Calculation × 𝑸 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝑴𝒂𝒔𝒌 𝐼attn [S,E] S E E H × 𝐼attn 𝑊𝑄 / 𝑊𝐾 / 𝑊𝑉 𝑄 S H symbol dimension S Sequence length E Embedding dimension H Attention head dimension 𝐾 S H 𝑉 S H
  44. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Attention Score Calculation 𝑸 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝑴𝒂𝒔𝒌 𝐼attn [S,E] symbol dimension S Sequence length E Embedding dimension H Attention head dimension S S ⋅ 𝑄 S H 𝐾𝑇 S H Softmax ×
  45. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. S H symbol dimension S Sequence length E Embedding dimension H Attention head dimension 𝐾𝑇 1 + 1 = BOS 𝑄 S H 1 + 1 = BOS S H 𝑉𝑇 Key observation attention mask makes K/V pairs “invisible” https://huggingface.co/blog/continuous_batching 1 + 1 = BOS
  46. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. 𝑄 𝐾𝑇/ 𝑉𝑇 Input Prompts: [“1+1=”, “The best Japanese Food is”, “Shoyu is made of”] Step1 2 1 + 1 = BOS 1 + 1 = BOS Continuous batching
  47. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. 𝑄 𝐾cache 𝑇 / 𝑉cache 𝑇 Step2 EOS 𝐾𝑇/ 𝑉𝑇 BOS The best Japanese 2 BOS The best Japanese 2 1 + 1 = BOS Input Prompts: [“1+1=”, “The best Japanese Food is”, “Shoyu is made of”] Continuous batching Chunked Prefill!
  48. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. 𝑄 𝐾cache 𝑇 / 𝑉cache 𝑇 Step3 Sushi 𝐾𝑇/ 𝑉𝑇 is BOS Shoyu is food Input Prompts: [“1+1=”, “The best Japanese Food is”, “Shoyu is made of”] BOS The best Japanese is BOS Shoyu is food Continuous batching
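A toy scheduler loop capturing the behavior in steps 1 to 3: each engine step builds one mixed batch out of chunked prefill work and one decode token per running request, and a request frees its slot the moment it finishes so a waiting request can join immediately. Everything here (chunk size, capacity, completion rule) is an illustrative assumption rather than any engine's actual policy:

```python
# Toy continuous-batching scheduler (sketch; model execution is faked).
from collections import deque

CHUNK = 4                # chunked prefill: at most 4 prompt tokens per request per step
MAX_RUNNING = 2          # stand-in for limited KV-cache capacity
MAX_NEW_TOKENS = 3       # fake stopping rule instead of a real EOS

waiting = deque([
    {"id": "A", "prompt": ["1", "+", "1", "="],                       "prefilled": 0, "generated": 0},
    {"id": "B", "prompt": ["The", "best", "Japanese", "food", "is"],  "prefilled": 0, "generated": 0},
    {"id": "C", "prompt": ["Shoyu", "is", "made", "of"],              "prefilled": 0, "generated": 0},
])
running, step = [], 0

while waiting or running:
    step += 1
    while waiting and len(running) < MAX_RUNNING:       # admit whenever a slot is free
        running.append(waiting.popleft())

    batch = []
    for r in running:
        remaining = len(r["prompt"]) - r["prefilled"]
        if remaining > 0:                                # prefill phase (chunked)
            n = min(CHUNK, remaining)
            batch.append((r["id"], f"prefill {n}"))
            r["prefilled"] += n
        else:                                            # decode phase: 1 token per step
            batch.append((r["id"], "decode 1"))
            r["generated"] += 1

    print(f"step {step:2d}: {batch}")
    running = [r for r in running if r["generated"] < MAX_NEW_TOKENS]   # finished leave now
```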
  49. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to manage the KV cache in HBM?
  50. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Memory waste in KV cache https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf KV cache: Request A [BOS 1 + 1 = 2 EOS resv …], Request B [BOS The best Japanese Food is sushi …], Max Sequence Length = 2048. In-use slots vs. reserved slots: reserved slots that are never used cause internal fragmentation; the gaps left between requests cause external fragmentation.
  51. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Paged Attention https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf Logical Block 0 BOS 1 + 1 Logical Block 1 = Logical Block 2 Logical Block 3 Physical Block 0 BOS 1 + 1 Physical Block 1 Physical Block 2 Physical Block 3 Physical Block 4 = Physical Block 5 Logical KV blocks Physical KV blocks on GPU HBM Physical block number #filled 1 4 5 1 Logical KV blocks
  52. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Paged Attention https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf Logical Block 0 BOS 1 + 1 Logical Block 1 = 2 Logical Block 2 Logical Block 3 Physical Block 0 BOS 1 + 1 Physical Block 1 Physical Block 2 Physical Block 3 Physical Block 4 = 2 resv resv Physical Block 5 Logical KV blocks Physical KV blocks on GPU HBM Physical block number #filled 1 4 5 1 → 2 Logical KV blocks Minimizes internal fragmentation!
  53. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Serving Multiple Requests https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf Request A logical KV blocks: Logical Block 0 BOS 1 + 1, Logical Block 1 = 2, Logical Block 2, Logical Block 3. Request B logical KV blocks: Logical Block 0 BOS The best, Logical Block 1, Logical Block 2, Logical Block 3. Physical KV blocks on GPU HBM: Physical Block 0 BOS 1 + 1, Physical Block 1 BOS The best, Physical Block 2, Physical Block 3, Physical Block 4 = 2, Physical Block 5.
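A minimal paged KV-cache allocator in the spirit of slides 50 to 53, a sketch that stores token ids instead of real key/value tensors: the cache is carved into fixed-size physical blocks, each request keeps a block table from logical to physical blocks, and a new physical block is taken from the free list only when the previous one fills, so at most one partially filled block per request is wasted:

```python
# Minimal paged KV-cache allocator (sketch; stores token ids, not real K/V tensors).
BLOCK_SIZE = 4

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.blocks = [[None] * BLOCK_SIZE for _ in range(num_physical_blocks)]
        self.free = list(range(num_physical_blocks))     # free physical block ids
        self.tables = {}                                 # request -> [[phys_block, filled], ...]

    def append(self, req, token):
        table = self.tables.setdefault(req, [])
        if not table or table[-1][1] == BLOCK_SIZE:      # last block full -> allocate a new one
            table.append([self.free.pop(0), 0])
        phys, filled = table[-1]
        self.blocks[phys][filled] = token                # write into the physical block
        table[-1][1] += 1

    def free_request(self, req):                         # blocks return to the free list at once
        self.free.extend(phys for phys, _ in self.tables.pop(req))

cache = PagedKVCache(num_physical_blocks=6)
for tok in ["BOS", "1", "+", "1", "=", "2"]:
    cache.append("A", tok)
for tok in ["BOS", "The", "best"]:
    cache.append("B", tok)
print(cache.tables)   # e.g. {'A': [[0, 4], [1, 2]], 'B': [[2, 3]]}
```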
  54. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed Inference
  55. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Large Memory Needs for Large Models
  56. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed Inference!
  57. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Node topology: 8 GPUs (GPU1–GPU8) interconnected through NVLink and NVLink Switches, attached to CPU0 and CPU1 via PCIe Switches.
  58. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Demystifying ML Software stack on AWS
  59. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. THE LARGEST SCALE ML INFRASTRUCTURE IN THE CLOUD Second-generation EC2 UltraClusters Up to 20,000 H200/H100 GPUs (P5) or 100,000 Trainium Accelerators (Trn2) Nonblocking petabit-scale network infrastructure Redesigned for 16x larger scale and lower latency with third-gen EFA High-throughput, low-latency storage from Amazon FSx for Lustre *Diagram example showing EC2 UltraCluster with Trn2: up to 100,000 Trainium chips; petabytes per second throughput, billions of IOPS; 3,200 Gbps Elastic Fabric Adapter (EFA); petabit-scale nonblocking network infrastructure; scalable low-latency storage
  60. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. How it works: Scalable Reliable Datagram (SRD) AWS-designed protocol that uses the many paths within the AWS network simultaneously Designed into the AWS Nitro System Hardware
  61. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Scalable Reliable Datagram Protocol Thanks to Wikipedia and Peter Ashwood-Smith for the snappy animated GIF explaining ECMP Elastic Fabric Adapter OS bypass GPUdirect and RDMA Libfabric core supports wide array of MPIs and NCCL Scalable Reliable Datagram ECMP-enabled packet spraying Cloud-scale congestion control Fast recovery from packet loss or link failure
  62. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. https://aws.amazon.com/jp/blogs/machine-learning/train-and-deploy-ai-models-at-trillion-parameter-scale-with-amazon-sagemaker-hyperpod-support-for-p6e-gb200-ultraservers/
  63. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU1 All Reduce Tensor Parallelism (Attention Block) Mask 𝐼attn [S,E] 𝑂Attn [S,E] Q 𝑾𝑸 [: , ∶ 𝒉/𝟐] K 𝑾𝑲 [: , ∶ 𝒉/𝟐] O 𝑾𝑶 [:h/2,:] V 𝑾𝑽 [: , ∶ 𝒉/𝟐] × Softmax × 𝐼attn [S,E] Q 𝑾𝑸 [: , ∶ 𝒉/𝟐] K 𝑾𝑲 [: , ∶ 𝒉/𝟐] O 𝑾𝑶 [:h/2,:] V 𝑾𝑽 [: , ∶ 𝒉/𝟐] × Softmax × Mask GPU0
  64. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU1 GPU0 All Reduce Tensor Parallelism (MLP Block) SiLU × 𝐼MLP [S,E] Gate 𝑾𝑮 [: , ∶ 𝑭/𝟐] Up 𝑾𝑼 [: , ∶ 𝑭/𝟐] Down 𝑾𝑫 [: 𝑭/𝟐, : ] SiLU × 𝐼MLP [S,E] Gate Up Down 𝑾𝑫 [𝑭/𝟐: , : ] 𝑾𝑮 [: , 𝑭/𝟐: ] 𝑾𝑼 [: , 𝑭/𝟐: ]
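The tensor-parallel math on these two slides can be verified in a single process: the gate/up weights are split by columns across ranks, the down projection by rows, each rank produces a partial [S, E] output, and summing the partials plays the role of the all-reduce. A sketch for the MLP block with two simulated ranks (the attention block splits by heads in the same spirit):

```python
# Tensor-parallel SwiGLU MLP across 2 simulated ranks (sketch; the final sum stands
# in for the all-reduce that NCCL would perform across GPUs).
import torch
import torch.nn.functional as F

S, E, F_dim, TP = 4, 32, 128, 2
x = torch.randn(S, E)
Wg = torch.randn(E, F_dim) * E**-0.5
Wu = torch.randn(E, F_dim) * E**-0.5
Wd = torch.randn(F_dim, E) * F_dim**-0.5

# Reference: unsharded MLP
ref = (F.silu(x @ Wg) * (x @ Wu)) @ Wd

# Column-parallel gate/up projections, row-parallel down projection
partials = []
for rank in range(TP):
    cols = slice(rank * F_dim // TP, (rank + 1) * F_dim // TP)
    h = F.silu(x @ Wg[:, cols]) * (x @ Wu[:, cols])   # [S, F/TP] stays local to the rank
    partials.append(h @ Wd[cols, :])                  # partial output [S, E]

out = sum(partials)                                   # "all-reduce" across ranks
print(torch.allclose(out, ref, atol=1e-5))            # should print True
```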
  65. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. https://arxiv.org/pdf/1909.08053 Pipeline Parallelism Transformer Layer 0 Transformer Layer 1 Transformer Layer 2 Pipeline Stage 0 (GPU 0) Transformer Layer 3 Transformer Layer 4 Transformer Layer 5 Pipeline Stage 1 (GPU 1) send/recv
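A toy two-stage pipeline to make the send/recv boundary concrete: the first three layers live on stage 0, the last three on stage 1, and the batch is cut into micro-batches so the two stages can work on different micro-batches at the same time. The handoff below stands in for the point-to-point send/recv, and the layer sizes are arbitrary assumptions:

```python
# Toy 2-stage pipeline with micro-batches (sketch; the handoff stands in for the
# point-to-point send/recv a Megatron-style pipeline would perform between GPUs).
import torch

layers = [torch.nn.Linear(32, 32) for _ in range(6)]
stage0 = torch.nn.Sequential(*layers[:3])   # "GPU 0": Transformer Layers 0-2
stage1 = torch.nn.Sequential(*layers[3:])   # "GPU 1": Transformer Layers 3-5

batch = torch.randn(8, 32)
micro_batches = batch.chunk(4)              # 4 micro-batches of 2 keep both stages busy

outputs = []
for mb in micro_batches:
    act = stage0(mb)                        # stage 0 forward
    # send/recv boundary: the activation [2, 32] crosses to the next stage here
    outputs.append(stage1(act))             # stage 1 forward
out = torch.cat(outputs)
print(out.shape)                            # torch.Size([8, 32])
```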
  66. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. MoE model MLP Block1 𝑀𝐿𝑃1 (𝐼MLP; 𝑊Gate1, WUp1 , 𝑊Down1) MLP Block2 MLP Block2 Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MoE Layer 𝐼attn [S=1,E] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + u + 𝐻𝑙−1 [S=1,E] Router 𝐼MLP [S=1,E] https://huggingface.co/blog/moe Normalization Norm(𝐻𝑙−1 ) × GPU0 AlltoAll
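A sketch of the router on this slide: each token's hidden state is scored against the experts, the top-k experts are picked, tokens are dispatched to those experts (this dispatch is what expert parallelism turns into an all-to-all across GPUs), and the expert outputs are combined with the normalized router weights. The shapes, expert count, and top-k value are toy assumptions:

```python
# Top-2 MoE routing (sketch; the dispatch loop stands in for the all-to-all that
# expert parallelism performs when experts live on different GPUs).
import torch
import torch.nn.functional as F

T, E, NUM_EXPERTS, TOP_K = 6, 32, 4, 2
x = torch.randn(T, E)                                   # one hidden state per token
router = torch.randn(E, NUM_EXPERTS) * E**-0.5
experts = [torch.nn.Sequential(torch.nn.Linear(E, 4 * E), torch.nn.SiLU(),
                               torch.nn.Linear(4 * E, E)) for _ in range(NUM_EXPERTS)]

logits = x @ router                                     # [T, NUM_EXPERTS] router scores
weights, idx = logits.topk(TOP_K, dim=-1)               # choose 2 experts per token
weights = F.softmax(weights, dim=-1)                    # normalize the combine weights

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    token_pos, slot = (idx == e).nonzero(as_tuple=True) # tokens routed to expert e
    if token_pos.numel():
        out[token_pos] += weights[token_pos, slot, None] * expert(x[token_pos])
print(out.shape)                                        # torch.Size([6, 32])
```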
  67. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Disaggregated serving https://hao-ai-lab.github.io/blogs/distserve/
  68. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Please complete the session survey in the mobile app © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.