[re:Invent 2025] Distributed inference on AWS: Deep dive into inference optimizations (AIM353)

As Large Language Models (LLMs) become integral to applications, optimizing inference performance is paramount, especially at scale. This session delves into cutting-edge techniques, research, and frameworks that enhance LLM serving efficiency, focusing on KV-cache management, PagedAttention mechanisms, and disaggregated serving architectures. We will cover best practices for self-managed inference on AWS: use cases with architecture diagrams, a deep dive into how ML inference works on accelerated compute, and common optimization patterns seen in the field. Attendees will gain a comprehensive understanding of the current landscape and future directions in inference optimization, along with the knowledge to implement these advancements in distributed systems.

Keita Watanabe

January 05, 2026

Transcript

  1. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. AIM353 Aman Shanbhag GenAI Specialist Solutions Architect WWSO AIML Frameworks AWS Keita Watanabe GenAI Specialist Solutions Architect WWSO AIML Frameworks AWS Distributed Inference on AWS: A Deep Dive into Inference Optimizations
  2. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Challenges in LLM Inference Is apple a fruit? Yes it is. User ChatBot
  3. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Request flow between User and ChatBot: "Is apple a fruit?" → Tokenization → [BOS, Is, apple, a, Fruit, ?] → Text generation → [Yes, it, is, EOS] → Detokenization → "Yes it is" https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1
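The request path on this slide (tokenize, generate until EOS, detokenize) maps directly onto a few lines of code. A minimal sketch using the Hugging Face transformers API; the model id and generation settings here are illustrative assumptions, not from the deck:

```python
# Minimal sketch of the tokenize -> generate -> detokenize path from the slide.
# The model id and max_new_tokens are illustrative, not from the deck.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Is apple a fruit?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)   # tokenization

# Text generation: the model emits tokens one at a time until EOS (or max_new_tokens)
output_ids = model.generate(**inputs, max_new_tokens=16)

# Detokenization: map the generated token ids back to text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```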
  4. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Challenges in LLM Inference What exactly is going on? Which hardware to use? How can I streamline? How to scale big model inference? Developer https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1
  5. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. NVIDIA GPUS AND AWS ML ACCELERATORS Accelerated computing portfolio GPUs: A10G, T4, L4, L40S, A100, H100, H200, B200 (instances: G5, G6, G6e, P4d, P4de, P5, P5e, P5en, P6-B200, P6e-GB200) AWS ML chips: Trainium accelerator (Trn1, Trn2, Trn3), Inferentia accelerator (Inf2)
  6. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Agenda Basics of Text Generation Inference: Transformers Architecture; Prefill vs. Decode and KV Cache; Key Metrics; Memory/Compute Requirement; Roofline Model. LLM Inference Optimization Strategies: Model Architecture Optimization; Quantization; Attention Mechanism; Operator Fusion; KV Cache Management; Scheduling & Batching; Distributed Inference (Data Parallel, Pipeline Parallel, Tensor Parallel, Context Parallel, Expert Parallel, Disaggregated Serving); System Optimization
  7. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Basics of Text Generation Inference
  8. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Anatomy of LLM inference is EOS it Is apple a Fruit? Yes LLM Iteration 1 Yes LLM Iteration 2 it LLM Iteration 3 is LLM Iteration 4 https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1 BOS
  9. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Anatomy of LLM inference is EOS it Is apple a Fruit? Yes LLM Iteration 1 Yes LLM Iteration 2 it LLM Iteration 3 is LLM Iteration 4 https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1 BOS Time to First Token (TTFT) Time per Output Token (TPOT)
  10. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Anatomy of LLM inference is EOS it Is apple a Fruit? Yes LLM Iteration 1 Yes LLM Iteration 2 it LLM Iteration 3 is LLM Iteration 4 https://www.researchgate.net/publication/382884366_Generative_AI_as_a_Service_in_6G_Edge-Cloud_Generation_Task_Offloading_by_In-context_Learning/figures?lo=1 BOS KV Cache 1 KV Cache 2 KV Cache 3
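To make the iteration-by-iteration picture concrete, here is a hand-rolled greedy decoding loop, a sketch assuming a Hugging Face causal LM as in the snippet above. `past_key_values` is the KV cache carried between iterations; the first forward pass corresponds to TTFT, and each later single-token pass to TPOT:

```python
# Hand-rolled greedy decoding loop (sketch). Assumes `model` and `tokenizer`
# from the previous snippet (any Hugging Face causal LM).
import time
import torch

input_ids = tokenizer("Is apple a fruit?", return_tensors="pt").input_ids.to(model.device)

past_key_values = None            # KV cache: empty before prefill
generated = input_ids
t0 = time.perf_counter()
with torch.no_grad():
    for step in range(16):
        out = model(
            input_ids=generated if past_key_values is None else generated[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,       # ask the model to return / extend its KV cache
        )
        past_key_values = out.past_key_values        # KV Cache 1, 2, 3, ...
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if step == 0:
            print(f"TTFT ~ {time.perf_counter() - t0:.3f}s")   # prefill + first token
        if next_token.item() == tokenizer.eos_token_id:        # stop at EOS
            break
print(tokenizer.decode(generated[0]))
```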
  11. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. So, what exactly is going on at the prefill stage?
  12. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Architecture Transformer Layer 1 Transformer Layer 𝑙 Transformer Layer 𝐿 Prediction Head … … Normalization Norm(𝐻𝑙−1 ) Hidden States 𝐻𝑙−1 [𝑆, 𝐸] Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S,E] 𝑂MLP [S,E] = 𝐻𝑙 [𝑆, 𝐸] 𝑂attn [S,E] 𝐼MLP [S,E] 𝑂MLP [S,E] + + + symbol dimension S Input Sequence Length E Embedding Dimension Is apple a Fruit? S E Yes BOS
  13. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 Attention Block: Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) 𝐼attn [S,E] 𝑂Attn [S,E] Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Mask × Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S,E] 𝑂attn [S,E] 𝐼MLP [S,E] 𝑂MLP [S,E] + + u + + 𝐻𝑙−1 [S,E] 𝐻𝑙−1 [S,E] ※ Positional Encoding is ignored Softmax
  14. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 MLP Block: 𝑀𝐿𝑃(𝐼MLP; 𝑊Gate, WUp , 𝑊Down) Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S,E] 𝑂attn [S,E] 𝐼MLP [S,E] 𝑂MLP [S,E] + + u + SiLU Gate 𝑾𝑮 Up 𝑾𝑼 × 𝐼MLP [S,E] Down 𝑾𝑫 𝐻𝑙−1 [S,E] + 𝑂attn [S,E] 𝑂MLP [S,E]
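Slides 12 to 14 can be collapsed into one small forward pass. A self-contained sketch of a single pre-norm transformer layer over hidden states of shape [S, E]; a single attention head, random weights, RMSNorm and a SwiGLU MLP are assumed here, and positional encoding is ignored as on the slide:

```python
# One pre-norm transformer layer over hidden states H of shape [S, E] (prefill).
# Single attention head for brevity; weights are random and purely illustrative.
import torch
import torch.nn.functional as F

S, E, F_dim = 5, 64, 256                       # sequence length, embedding dim, MLP width
H = torch.randn(S, E)                          # hidden states H_{l-1}
Wq, Wk, Wv, Wo = (torch.randn(E, E) * E**-0.5 for _ in range(4))
Wg, Wu = torch.randn(E, F_dim) * E**-0.5, torch.randn(E, F_dim) * E**-0.5
Wd = torch.randn(F_dim, E) * F_dim**-0.5

def rmsnorm(x):                                # Normalization block
    return x / x.pow(2).mean(-1, keepdim=True).add(1e-6).sqrt()

# --- Attention block: Attn(I_attn; Wq, Wk, Wv, Wo) ---
I_attn = rmsnorm(H)
Q, K, V = I_attn @ Wq, I_attn @ Wk, I_attn @ Wv            # [S, E] each
scores = (Q @ K.T) / E**0.5                                # [S, S]
mask = torch.triu(torch.full((S, S), float("-inf")), 1)    # causal mask
O_attn = (F.softmax(scores + mask, dim=-1) @ V) @ Wo       # [S, E]
H = H + O_attn                                             # residual add

# --- MLP block: SwiGLU MLP(I_mlp; Wg, Wu, Wd) ---
I_mlp = rmsnorm(H)
O_mlp = (F.silu(I_mlp @ Wg) * (I_mlp @ Wu)) @ Wd           # [S, E]
H_l = H + O_mlp                                            # residual add -> H_l
print(H_l.shape)                                           # torch.Size([5, 64])
```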
  15. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How is the decode phase different?
  16. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Architecture Transformer Layer 1 Transformer Layer 𝑙 Transformer Layer 𝐿 Prediction Head … … Normalization Norm(𝐻𝑙−1 ) Hidden States 𝐻𝑙−1 [𝑆 = 1, 𝐸] Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 , | 𝐾cache, 𝑉cache) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S=1,E] 𝑂MLP [S=1,E] = 𝐻𝑙 [𝑆 = 1, 𝐸] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + + symbol dimension S Input Sequence Length E Embedding Dimension S = 1 E Yes it
  17. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 Attention Block: Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) 𝐼attn [S=1,E] 𝑂Attn [S=1,E] Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Mask × Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 | 𝐾cache, 𝑉cache) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S=1,E] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + u + + 𝐻𝑙−1 [S=1,E] 𝐻𝑙−1 [S=1,E] 𝐾cache 𝑉cache Concat Concat Softmax
  18. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Transformer Layer: 𝑙 MLP Block: 𝑀𝐿𝑃(𝐼MLP; 𝑊Gate, WUp , 𝑊Down) Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MLP Block 𝑀𝐿𝑃(𝐼MLP; 𝑊G, WU , 𝑊D) 𝐼attn [S=1,E] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + u + SiLU Gate 𝑾𝑮 Up 𝑾𝑼 × 𝐼MLP [S=1,E] Down 𝑾𝑫 𝐻𝑙−1 [S=1,E] + 𝑂attn [S=1,E] 𝑂MLP [S=1,E]
  19. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝐾cache 𝑉cache Concat Concat Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Mask × Prefill phase Decode phase 𝐼attn [S=1,E] 𝑂Attn [S=1,E] 𝐼attn [S,E] 𝑂Attn [S,E] Softmax Mask
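The prefill/decode contrast above, written out for one attention head: prefill projects all S prompt tokens, applies the causal mask, and writes the K/V cache; decode projects only the newest token, concatenates it onto the cache, and needs no mask. A sketch with toy shapes and random weights:

```python
# Prefill vs. decode for one attention head (sketch; single head, no batching).
import torch
import torch.nn.functional as F

S, E = 5, 64
Wq, Wk, Wv, Wo = (torch.randn(E, E) * E**-0.5 for _ in range(4))

# Prefill: all S prompt tokens at once, causal mask, write the K/V cache
x = torch.randn(S, E)
Q, K, V = x @ Wq, x @ Wk, x @ Wv
mask = torch.triu(torch.full((S, S), float("-inf")), 1)
O = (F.softmax((Q @ K.T) / E**0.5 + mask, dim=-1) @ V) @ Wo
k_cache, v_cache = K, V                         # write cache: [S, E] each

# Decode: a single new token (S = 1); read + extend the cache, no mask needed
x_new = torch.randn(1, E)
q = x_new @ Wq                                  # [1, E]
k_cache = torch.cat([k_cache, x_new @ Wk])      # [S+1, E]
v_cache = torch.cat([v_cache, x_new @ Wv])      # [S+1, E]
o = (F.softmax((q @ k_cache.T) / E**0.5, dim=-1) @ v_cache) @ Wo   # [1, E]
print(O.shape, o.shape)
```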
  20. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Key-value caching mechanism (×: matrix multiply; HBM: high-bandwidth memory). Prefill: Q [S, E] × K^T [E, S] × V [S, E] = Result [S, E]; the computed K and V are written to the K cache / V cache in HBM. Decode step 1: Q [1, E] × K^T [E, S+1] × V [S+1, E] = Result [1, E]; the K/V cache is read from HBM and extended by one entry. symbol dimension: S Sequence Length, E Embedding Dimension https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
  21. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Memory Usage in LLM Inference Example (Llama3 70B, FP8): Parameters 70 GB, KV Cache 10 GB, Other 7 GB VRAM consumption. Parameters: FP32: 4 bytes per parameter; BF16: 2 bytes per parameter; FP8: 1 byte per parameter; FP4: 0.5 byte per parameter. KV Cache Formula: Total size of KV cache (bytes) ≈ batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(precision). Others: activations and other overhead, 10~15% of the parameter footprint. Example assumes Llama3 70B, batch size 1, sequence length 8K (l=80, h=8192), FP8.
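The slide's numbers fall straight out of the formula. A quick check assuming the stated shape (80 layers, hidden size 8192), batch 1, an 8K sequence, and FP8; note the formula charges the full hidden size to K and V, whereas with GQA the term becomes num_kv_heads × head_dim, which is why production Llama 3 caches are smaller:

```python
# Reproducing the slide's VRAM estimate for Llama3 70B at FP8 (sketch).
bytes_per_value = 1                  # FP8: 1 byte per value
num_params = 70e9
batch_size, seq_len = 1, 8192        # batch size 1, 8K sequence
num_layers, hidden_size = 80, 8192   # l=80, h=8192 per the slide

weights = num_params * bytes_per_value                       # parameter footprint
kv_cache = (batch_size * seq_len * 2                         # 2 = one K and one V tensor
            * num_layers * hidden_size * bytes_per_value)    # slide formula
other = 0.10 * weights                                       # activations etc., ~10-15%

GB = 1e9
print(f"parameters ~ {weights / GB:.0f} GB")    # ~70 GB
print(f"kv cache   ~ {kv_cache / GB:.1f} GB")   # ~10.7 GB
print(f"other      ~ {other / GB:.0f} GB")      # ~7 GB
```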
  22. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How fast/efficient would LLM inference be on my hardware?
  23. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. NVIDIA GPU instances (P5: NVIDIA H100/H200 Tensor Core; P6: NVIDIA B200 Tensor Core) https://aws.amazon.com/ec2/instance-types/p5/
    Instance Size | ACC | NUM ACC | ACC Memory | Acc. P2P BW | EFA
    P5.48xlarge | H100 | 8 | 640 GB | 900 GB/s | 3200 Gbps EFAv2
    P5e.48xlarge | H200 | 8 | 1128 GB | 900 GB/s | 3200 Gbps EFAv2
    P5en.48xlarge | H200 | 8 | 1128 GB | 900 GB/s | 3200 Gbps EFAv3
    P6-B200.48xlarge | B200 | 8 | 1440 GB | 1.8 TB/s | 3200 Gbps EFAv4
    P6e-GB200.36xlarge | GB200 | 4 | 740 GB | 1.8 TB/s | 3200 Gbps EFAv4
    u-p6e-gb200x36 | GB200 | 36 | 6.7 TB | 1.8 TB/s | 14400 Gbps EFAv4
    u-p6e-gb200x72 | GB200 | 72 | 13.3 TB | 1.8 TB/s | 28800 Gbps EFAv4
  24. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝐾cache 𝑉cache Concat Concat Q 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × Prefill phase Decode phase Mask Mask
  25. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Why Memory Matters as Much as Compute
  26. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Rooflines for all the accelerators
  27. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Model Architecture Optimization
  28. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dive deeper into the Attention Block
  29. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Multi-head attention – AKA the vanilla attention symbol dimension B Batch size S Sequence length E Embedding dimension H Attention head dimension N Number of query key value heads 𝑾𝑸𝟏 [𝑬, 𝑯] 𝑾𝑸𝟎 [𝑬, 𝑯] 𝑾𝑸𝑵 [𝐄, 𝐇] … 𝐼attn [𝑆, 𝐸] 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵 𝑺, 𝑯 … 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] × × × = = = 𝑾𝑲𝟏 [𝑬, 𝑯] 𝑾𝑲𝟎 [𝑬, 𝑯] 𝑾𝑲𝑵 [𝐄, 𝐇] … 𝐼attn [𝑆, 𝐸] 𝑲𝟎 𝑺, 𝑯 𝑲𝟏 𝑺, 𝑯 𝑲𝑵 𝑺, 𝑯 … 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] × × × = = = 𝑾𝑽𝟏 [𝑬, 𝑯] 𝑾𝑽𝟎 [𝑬, 𝑯] 𝑾𝑽𝑵 [𝐄, 𝐇] … 𝐼attn [𝑆, 𝐸] 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵 𝑺, 𝑯 … 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] 𝑰attn[𝑺, 𝑬] × × × = = =
  30. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Multi-head attention – AKA the vanilla attention symbol dimension B Batch size S Sequence length E Embedding dimension H Attention head dimension N Number of query key value heads 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 𝑲𝟏 𝑻 𝑯, 𝑺 𝑲𝑵 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵 𝑯, 𝑬 … × × ×
  31. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. KV Cache bottleneck in LLM inference (chart: GPU memory consumption vs. sequence length). GPU Memory Usage: the KV cache size scales with sequence length, often consuming the majority of GPU memory. Latency: loading the massive KV cache from HBM for each generated token slows down decoding. Scalability Limit: poses a hard limit on the context lengths and batch sizes that can be feasibly deployed. KV Cache Formula: Total size of KV cache (bytes) ≈ batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(precision), where hidden_size = num_heads × dim_head.
  32. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Multi Query Attention 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 𝑲𝟏 𝑻 𝑯, 𝑺 𝑲𝑵−𝟏 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵−𝟏 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 𝑲𝑵 𝑻 𝑯, 𝑺 𝑽𝑵 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × symbol dimensio n B Batch size S Sequence Length E Embeddi ng Dimensio n H Attention head N Number of query heads K Number of key/value heads G q heads per kv head (N//K) 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × Multi Head Attention (MHA) Multi Query Attention (MQA) G=2 MQA → K=1
  33. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Grouped Query Attention 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 𝑲𝟏 𝑻 𝑯, 𝑺 𝑲𝑵−𝟏 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 𝑽𝟏 𝑺, 𝑯 𝑽𝑵−𝟏 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 𝑲𝑵 𝑻 𝑯, 𝑺 𝑽𝑵 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × symbol dimensio n B Batch size S Sequence Length E Embeddi ng Dimensio n H Attention head N Number of query heads K Number of key/value heads G q heads per kv head (N//K) 𝑸𝟎 𝑺, 𝑯 𝑸𝟏 𝑺, 𝑯 𝑸𝑵−𝟏 𝑺, 𝑯 𝑲𝟎 𝑻 𝑯, 𝑺 Softmax 𝑽𝟎 𝑺, 𝑯 … … … × × × × × × 𝑾𝑶𝟏 𝑯, 𝑬 𝑾𝑶𝟐 𝑯, 𝑬 𝑾𝑶𝑵−𝟏 𝑯, 𝑬 … × × × 𝑸𝑵 𝑺, 𝑯 𝑲𝑲 𝑻 𝑯, 𝑺 𝑽𝑲 𝑺, 𝑯 × × 𝑾𝑶𝑵 𝑯, 𝑬 × Multi Head Attention (MHA) Grouped Query Attention (GQA) G=2 MQA → K=1
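The MHA to GQA change amounts to projecting K and V with fewer heads and letting each K/V head serve a group of G = N // K query heads, which shrinks the KV cache by a factor of N / K. A sketch with toy shapes (K = 1 would be MQA, K = N would be plain MHA):

```python
# Grouped Query Attention (sketch): N query heads share K key/value heads.
import torch
import torch.nn.functional as F

S, H = 6, 32          # sequence length, per-head dimension
N, K = 8, 2           # query heads, kv heads (K=1 -> MQA, K=N -> MHA)
G = N // K            # query heads per kv head

q = torch.randn(N, S, H)          # [N, S, H]
k = torch.randn(K, S, H)          # [K, S, H]  <- only K heads are cached
v = torch.randn(K, S, H)

# Broadcast each K/V head to its group of G query heads
k_exp = k.repeat_interleave(G, dim=0)            # [N, S, H]
v_exp = v.repeat_interleave(G, dim=0)

scores = q @ k_exp.transpose(-2, -1) / H**0.5    # [N, S, S]
out = F.softmax(scores, dim=-1) @ v_exp          # [N, S, H]

# KV cache holds 2*K*S*H values instead of 2*N*S*H, i.e. N/K = 4x smaller here
print(out.shape)
```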
  34. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Low Precision Inference
  35. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. FLOPS table: on modern hardware, coarser (lower-precision) formats run faster.
    Device | FP32 (TFLOPs) | TF32 Tensor (TFLOPs) | BF16 Tensor (TFLOPs) | FP8 Tensor (TFLOPs)
    NVIDIA H200 | 67 | 495 | 989 | 1979
    NVIDIA H100 | 67 | 495 | 989 | 1979
    NVIDIA B200 | 75 | 1125 | 2250 | 4500
    NVIDIA L4 | 30.3 | 60 | 121 | 243
    NVIDIA L40S | 91.6 | 183 | 362 | 733
    RTX Pro 4500 Blackwell | 55 | 105 | 211 | 422
  36. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Weight Activation Quantization (roofline plot: FLOPS vs. arithmetic intensity in FLOPS/Byte, with the Attention Block and MLP Block shown at FP16 and FP8) ① The attention block, especially with KV-cache traffic, is strongly memory-bound. When we quantize weights and activations to FP8, each value is half the size, so we move more FLOPs per byte: arithmetic intensity increases, and the point moves up along the memory roof. ② The MLP block is compute-bound: it already has high FLOPs/byte; FP8 lets us use faster low-bit tensor cores, so the compute roof lifts and the MLP moves up to a higher plateau. https://arxiv.org/pdf/2310.19102
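The roofline argument is easy to sanity-check numerically. Using illustrative H100-class figures that are assumptions rather than deck numbers: roughly 989 TFLOP/s of BF16 compute and about 3.35 TB/s of HBM bandwidth put the ridge point near 300 FLOPs/byte; a decode-time matrix-vector product sits around 1 FLOP/byte and is memory-bound, a large prefill GEMM sits far above the ridge, and halving the bytes with FP8 roughly doubles the intensity of the memory-bound case:

```python
# Back-of-envelope roofline check (sketch; hardware figures are illustrative).
peak_flops = 989e12            # ~H100-class BF16 tensor compute, FLOP/s
hbm_bandwidth = 3.35e12        # ~H100-class HBM bandwidth, bytes/s
ridge = peak_flops / hbm_bandwidth
print(f"ridge point ~ {ridge:.0f} FLOPs/byte")   # below this intensity: memory-bound

def gemm_intensity(m, k, n, bytes_per_el):
    """Arithmetic intensity of an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n                                  # one multiply-add per pair
    traffic = bytes_per_el * (m * k + k * n + m * n)       # read A and B, write C
    return flops / traffic

# Decode: one token against an 8192x8192 weight matrix (a GEMV) -> memory-bound
print(f"decode GEMV, BF16: {gemm_intensity(1, 8192, 8192, 2):.1f} FLOPs/byte")
print(f"decode GEMV, FP8 : {gemm_intensity(1, 8192, 8192, 1):.1f} FLOPs/byte")

# Prefill: 8192 tokens against the same matrix -> far above the ridge, compute-bound
print(f"prefill GEMM, BF16: {gemm_intensity(8192, 8192, 8192, 2):.0f} FLOPs/byte")
```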
  37. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Remove attention bottleneck
  38. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU Architecture Streaming Multiprocessor (SM) L1 Cache Streaming Multiprocessor (SM) L1 Cache Streaming Multiprocessor (SM) L1 Cache Streaming Multiprocessor (SM) L1 Cache L2 Cache High-Bandwidth Memory (HBM) GPU … The L1 Cache is a small, very fast on-chip cache that can be programmer controlled. The L2 Cache is a relatively large hardware-controlled cache with higher bandwidth than HBM. DRAM or HBM stores parameters, activations, optimizer state, etc. The SM is the GPU's compute block that bundles execution units (CUDA/Tensor cores) with warp schedulers, registers, and on-chip memory (L1 Cache). https://jax-ml.github.io/scaling-book/gpus/
  39. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU Memory Hierarchy (e.g., A100) HBM SRAM
  40. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU Memory Hierarchy (e.g., A100) HBM SRAM How can we minimize data movement between SRAM <> HBM?
  41. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Streaming Multiprocessor (SM), HBM, SRAM. Standard Attention: load Q, K; write S = QK^T; load S; write P = Softmax(S); load P, V; write O = PV (every intermediate matrix round-trips through HBM). Flash Attention: for each tile, load K_j, V_j and load Q_i, O_i, l_i, m_i into SRAM; compute S_ij = Q_i K_j^T; m' = rowmax(S_ij); P = exp(S_ij − m'); l = rowsum(P); m = max(m', m); calculate O from l and m; write back O_i, l_i, m_i. https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention
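The online-softmax bookkeeping on the Flash Attention side of the slide can be written out directly. A reference sketch in plain PyTorch (not the fused kernel) that streams K/V one tile at a time, keeps the running row-max m and row-sum l, and never materializes the full S × S score matrix; it reproduces standard attention:

```python
# FlashAttention-style tiled attention with an online softmax (reference sketch,
# not the fused CUDA kernel). K/V are streamed block by block, so the full S x S
# score matrix never has to be written out.
import torch
import torch.nn.functional as F

S, D, BLOCK = 128, 64, 32
Q, K, V = torch.randn(S, D), torch.randn(S, D), torch.randn(S, D)
scale = D ** -0.5

m = torch.full((S,), float("-inf"))    # running row-max
l = torch.zeros(S)                     # running softmax denominator
acc = torch.zeros(S, D)                # running (unnormalized) output

for j in range(0, S, BLOCK):           # stream one K/V tile at a time ("in SRAM")
    Kj, Vj = K[j:j + BLOCK], V[j:j + BLOCK]
    Sij = (Q @ Kj.T) * scale                          # [S, BLOCK] partial scores
    m_new = torch.maximum(m, Sij.max(dim=-1).values)
    P = torch.exp(Sij - m_new[:, None])
    correction = torch.exp(m - m_new)                 # rescale old statistics
    l = l * correction + P.sum(dim=-1)
    acc = acc * correction[:, None] + P @ Vj
    m = m_new

O = acc / l[:, None]
O_ref = F.softmax((Q @ K.T) * scale, dim=-1) @ V      # standard attention
print(torch.allclose(O, O_ref, atol=1e-5))            # should print True
```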
  42. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How can we manage multiple requests?
  43. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Attention Score Calculation × 𝑸 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝑴𝒂𝒔𝒌 𝐼attn [S,E] S E E H × 𝐼attn 𝑊𝑄 / 𝑊𝐾 / 𝑊𝑉 𝑄 S H symbol dimension S Sequence length E Embedding dimension H Attention head dimension 𝐾 S H 𝑉 S H
  44. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Attention Score Calculation 𝑸 𝑾𝑸 K 𝑾𝑲 O 𝑾𝑶 V 𝑾𝑽 × Softmax × 𝑴𝒂𝒔𝒌 𝐼attn [S,E] symbol dimension S Sequence length E Embedding dimension H Attention head dimension S S ⋅ 𝑄 S H 𝐾𝑇 S H Softmax ×
  45. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. S H symbol dimension S Sequence length E Embedding dimension H Attention head dimension 𝐾𝑇 1 + 1 = BOS 𝑄 S H 1 + 1 = BOS S H 𝑉𝑇 Key observation attention mask makes K/V pairs “invisible” https://huggingface.co/blog/continuous_batching 1 + 1 = BOS
  46. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. 𝑄 𝐾𝑇/ 𝑉𝑇 Input Prompts: [“1+1=”, “The best Japanese Food is”, “Shoyu is made of”] Step1 2 1 + 1 = BOS 1 + 1 = BOS Continuous batching
  47. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. 𝑄 𝐾cache 𝑇 / 𝑉cache 𝑇 Step2 EOS 𝐾𝑇/ 𝑉𝑇 BOS The best Japanese 2 BOS The best Japanese 2 1 + 1 = BOS Input Prompts: [“1+1=”, “The best Japanese Food is”, “Shoyu is made of”] Continuous batching Chunked Prefill!
  48. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. 𝑄 𝐾cache 𝑇 / 𝑉cache 𝑇 Step3 Sushi 𝐾𝑇/ 𝑉𝑇 is BOS Shoyu is food Input Prompts: [“1+1=”, “The best Japanese Food is”, “Shoyu is made of”] BOS The best Japanese is BOS Shoyu is food Continuous batching
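A toy scheduler loop capturing the behavior in steps 1 to 3: each engine step builds one mixed batch out of chunked prefill work and one decode token per running request, and a request frees its slot the moment it finishes so a waiting request can join immediately. Everything here (chunk size, capacity, completion rule) is an illustrative assumption rather than any engine's actual policy:

```python
# Toy continuous-batching scheduler (sketch; model execution is faked).
from collections import deque

CHUNK = 4                # chunked prefill: at most 4 prompt tokens per request per step
MAX_RUNNING = 2          # stand-in for limited KV-cache capacity
MAX_NEW_TOKENS = 3       # fake stopping rule instead of a real EOS

waiting = deque([
    {"id": "A", "prompt": ["1", "+", "1", "="],                       "prefilled": 0, "generated": 0},
    {"id": "B", "prompt": ["The", "best", "Japanese", "food", "is"],  "prefilled": 0, "generated": 0},
    {"id": "C", "prompt": ["Shoyu", "is", "made", "of"],              "prefilled": 0, "generated": 0},
])
running, step = [], 0

while waiting or running:
    step += 1
    while waiting and len(running) < MAX_RUNNING:       # admit whenever a slot is free
        running.append(waiting.popleft())

    batch = []
    for r in running:
        remaining = len(r["prompt"]) - r["prefilled"]
        if remaining > 0:                                # prefill phase (chunked)
            n = min(CHUNK, remaining)
            batch.append((r["id"], f"prefill {n}"))
            r["prefilled"] += n
        else:                                            # decode phase: 1 token per step
            batch.append((r["id"], "decode 1"))
            r["generated"] += 1

    print(f"step {step:2d}: {batch}")
    running = [r for r in running if r["generated"] < MAX_NEW_TOKENS]   # finished leave now
```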
  49. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to manage the KV cache in HBM?
  50. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Memory waste in KV cache https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf KV cache: Request A [BOS 1 + 1 = 2 EOS resv …], Request B [BOS The best Japanese Food is sushi …], Max Sequence Length = 2048. In-use slots vs. reserved slots: reserved slots that are never used cause internal fragmentation; the gaps left between requests cause external fragmentation.
  51. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Paged Attention https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf Logical Block 0 BOS 1 + 1 Logical Block 1 = Logical Block 2 Logical Block 3 Physical Block 0 BOS 1 + 1 Physical Block 1 Physical Block 2 Physical Block 3 Physical Block 4 = Physical Block 5 Logical KV blocks Physical KV blocks on GPU HBM Physical block number #filled 1 4 5 1 Logical KV blocks
  52. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Paged Attention https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf Logical Block 0 BOS 1 + 1 Logical Block 1 = 2 Logical Block 2 Logical Block 3 Physical Block 0 BOS 1 + 1 Physical Block 1 Physical Block 2 Physical Block 3 Physical Block 4 = 2 resv resv Physical Block 5 Logical KV blocks Physical KV blocks on GPU HBM Physical block number #filled 1 4 5 1 → 2 Logical KV blocks Minimizes internal fragmentation!
  53. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Serving Multiple Requests https://minjiazhang.github.io/courses/sp24-resource/vLLM-pre.pdf Request A logical KV blocks: Logical Block 0 BOS 1 + 1, Logical Block 1 = 2, Logical Block 2, Logical Block 3. Request B logical KV blocks: Logical Block 0 BOS The best, Logical Block 1, Logical Block 2, Logical Block 3. Physical KV blocks on GPU HBM: Physical Block 0 BOS 1 + 1, Physical Block 1 BOS The best, Physical Block 2, Physical Block 3, Physical Block 4 = 2, Physical Block 5.
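A minimal paged KV-cache allocator in the spirit of slides 50 to 53, a sketch that stores token ids instead of real key/value tensors: the cache is carved into fixed-size physical blocks, each request keeps a block table from logical to physical blocks, and a new physical block is taken from the free list only when the previous one fills, so at most one partially filled block per request is wasted:

```python
# Minimal paged KV-cache allocator (sketch; stores token ids, not real K/V tensors).
BLOCK_SIZE = 4

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.blocks = [[None] * BLOCK_SIZE for _ in range(num_physical_blocks)]
        self.free = list(range(num_physical_blocks))     # free physical block ids
        self.tables = {}                                 # request -> [[phys_block, filled], ...]

    def append(self, req, token):
        table = self.tables.setdefault(req, [])
        if not table or table[-1][1] == BLOCK_SIZE:      # last block full -> allocate a new one
            table.append([self.free.pop(0), 0])
        phys, filled = table[-1]
        self.blocks[phys][filled] = token                # write into the physical block
        table[-1][1] += 1

    def free_request(self, req):                         # blocks return to the free list at once
        self.free.extend(phys for phys, _ in self.tables.pop(req))

cache = PagedKVCache(num_physical_blocks=6)
for tok in ["BOS", "1", "+", "1", "=", "2"]:
    cache.append("A", tok)
for tok in ["BOS", "The", "best"]:
    cache.append("B", tok)
print(cache.tables)   # e.g. {'A': [[0, 4], [1, 2]], 'B': [[2, 3]]}
```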
  54. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed Inference
  55. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Large Memory Needs for Large Models
  56. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed Inference!
  57. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Node topology: 8 GPUs (GPU1–GPU8) interconnected through NVLink and NVLink Switches, attached to CPU0 and CPU1 via PCIe Switches.
  58. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Demystifying ML Software stack on AWS
  59. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. THE LARGEST SCALE ML INFRASTRUCTURE IN THE CLOUD Second-generation EC2 UltraClusters Up to 20,000 H200/H100 GPUs (P5) or 100,000 Trainium Accelerators (Trn2) Nonblocking petabit-scale network infrastructure Redesigned for 16x larger scale and lower latency with third-gen EFA High-throughput, low-latency storage from Amazon FSx for Lustre *Diagram example showing EC2 UltraCluster with Trn2: up to 100,000 Trainium chips; petabytes per second throughput, billions of IOPS; 3,200 Gbps Elastic Fabric Adapter (EFA); petabit-scale nonblocking network infrastructure; scalable low-latency storage
  60. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. How it works: Scalable Reliable Datagram (SRD) AWS-designed protocol that uses the many paths within the AWS network simultaneously Designed into the AWS Nitro System Hardware
  61. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Scalable Reliable Datagram Protocol Thanks to Wikipedia and Peter Ashwood-Smith for the snappy animated GIF explaining ECMP Elastic Fabric Adapter OS bypass GPUdirect and RDMA Libfabric core supports wide array of MPIs and NCCL Scalable Reliable Datagram ECMP-enabled packet spraying Cloud-scale congestion control Fast recovery from packet loss or link failure
  62. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. https://aws.amazon.com/jp/blogs/machine-learning/train-and-deploy-ai-models-at-trillion-parameter-scale-with-amazon-sagemaker-hyperpod-support-for-p6e-gb200-ultraservers/
  63. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU1 All Reduce Tensor Parallelism (Attention Block) Mask 𝐼attn [S,E] 𝑂Attn [S,E] Q 𝑾𝑸 [: , ∶ 𝒉/𝟐] K 𝑾𝑲 [: , ∶ 𝒉/𝟐] O 𝑾𝑶 [:h/2,:] V 𝑾𝑽 [: , ∶ 𝒉/𝟐] × Softmax × 𝐼attn [S,E] Q 𝑾𝑸 [: , ∶ 𝒉/𝟐] K 𝑾𝑲 [: , ∶ 𝒉/𝟐] O 𝑾𝑶 [:h/2,:] V 𝑾𝑽 [: , ∶ 𝒉/𝟐] × Softmax × Mask GPU0
  64. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. GPU1 GPU0 All Reduce Tensor Parallelism (MLP Block) SiLU × 𝐼MLP [S,E] Gate 𝑾𝑮 [: , ∶ 𝑭/𝟐] Up 𝑾𝑼 [: , ∶ 𝑭/𝟐] Down 𝑾𝑫 [: 𝑭/𝟐, : ] SiLU × 𝐼MLP [S,E] Gate Up Down 𝑾𝑫 [𝑭/𝟐: , : ] 𝑾𝑮 [: , 𝑭/𝟐: ] 𝑾𝑼 [: , 𝑭/𝟐: ]
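The tensor-parallel math on these two slides can be verified in a single process: the gate/up weights are split by columns across ranks, the down projection by rows, each rank produces a partial [S, E] output, and summing the partials plays the role of the all-reduce. A sketch for the MLP block with two simulated ranks (the attention block splits by heads in the same spirit):

```python
# Tensor-parallel SwiGLU MLP across 2 simulated ranks (sketch; the final sum stands
# in for the all-reduce that NCCL would perform across GPUs).
import torch
import torch.nn.functional as F

S, E, F_dim, TP = 4, 32, 128, 2
x = torch.randn(S, E)
Wg = torch.randn(E, F_dim) * E**-0.5
Wu = torch.randn(E, F_dim) * E**-0.5
Wd = torch.randn(F_dim, E) * F_dim**-0.5

# Reference: unsharded MLP
ref = (F.silu(x @ Wg) * (x @ Wu)) @ Wd

# Column-parallel gate/up projections, row-parallel down projection
partials = []
for rank in range(TP):
    cols = slice(rank * F_dim // TP, (rank + 1) * F_dim // TP)
    h = F.silu(x @ Wg[:, cols]) * (x @ Wu[:, cols])   # [S, F/TP] stays local to the rank
    partials.append(h @ Wd[cols, :])                  # partial output [S, E]

out = sum(partials)                                   # "all-reduce" across ranks
print(torch.allclose(out, ref, atol=1e-5))            # should print True
```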
  65. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. https://arxiv.org/pdf/1909.08053 Pipeline Parallelism Transformer Layer 0 Transformer Layer 1 Transformer Layer 2 Pipeline Stage 0 (GPU 0) Transformer Layer 3 Transformer Layer 4 Transformer Layer 5 Pipeline Stage 1 (GPU 1) send/recv
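A toy two-stage pipeline to make the send/recv boundary concrete: the first three layers live on stage 0, the last three on stage 1, and the batch is cut into micro-batches so the two stages can work on different micro-batches at the same time. The handoff below stands in for the point-to-point send/recv, and the layer sizes are arbitrary assumptions:

```python
# Toy 2-stage pipeline with micro-batches (sketch; the handoff stands in for the
# point-to-point send/recv a Megatron-style pipeline would perform between GPUs).
import torch

layers = [torch.nn.Linear(32, 32) for _ in range(6)]
stage0 = torch.nn.Sequential(*layers[:3])   # "GPU 0": Transformer Layers 0-2
stage1 = torch.nn.Sequential(*layers[3:])   # "GPU 1": Transformer Layers 3-5

batch = torch.randn(8, 32)
micro_batches = batch.chunk(4)              # 4 micro-batches of 2 keep both stages busy

outputs = []
for mb in micro_batches:
    act = stage0(mb)                        # stage 0 forward
    # send/recv boundary: the activation [2, 32] crosses to the next stage here
    outputs.append(stage1(act))             # stage 1 forward
out = torch.cat(outputs)
print(out.shape)                            # torch.Size([8, 32])
```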
  66. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. MoE model MLP Block1 𝑀𝐿𝑃1 (𝐼MLP; 𝑊Gate1, WUp1 , 𝑊Down1) MLP Block2 MLP Block2 Normalization Norm(𝐻𝑙−1 ) Attention Block Attn(𝐼attn; 𝑊𝑄 , 𝑊𝐾 , 𝑊𝑉 , 𝑊𝑂 ) Normalization Norm(𝑂attn) MoE Layer 𝐼attn [S=1,E] 𝑂attn [S=1,E] 𝐼MLP [S=1,E] 𝑂MLP [S=1,E] + + u + 𝐻𝑙−1 [S=1,E] Router 𝐼MLP [S=1,E] https://huggingface.co/blog/moe Normalization Norm(𝐻𝑙−1 ) × GPU0 AlltoAll
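A sketch of the router on this slide: each token's hidden state is scored against the experts, the top-k experts are picked, tokens are dispatched to those experts (this dispatch is what expert parallelism turns into an all-to-all across GPUs), and the expert outputs are combined with the normalized router weights. The shapes, expert count, and top-k value are toy assumptions:

```python
# Top-2 MoE routing (sketch; the dispatch loop stands in for the all-to-all that
# expert parallelism performs when experts live on different GPUs).
import torch
import torch.nn.functional as F

T, E, NUM_EXPERTS, TOP_K = 6, 32, 4, 2
x = torch.randn(T, E)                                   # one hidden state per token
router = torch.randn(E, NUM_EXPERTS) * E**-0.5
experts = [torch.nn.Sequential(torch.nn.Linear(E, 4 * E), torch.nn.SiLU(),
                               torch.nn.Linear(4 * E, E)) for _ in range(NUM_EXPERTS)]

logits = x @ router                                     # [T, NUM_EXPERTS] router scores
weights, idx = logits.topk(TOP_K, dim=-1)               # choose 2 experts per token
weights = F.softmax(weights, dim=-1)                    # normalize the combine weights

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    token_pos, slot = (idx == e).nonzero(as_tuple=True) # tokens routed to expert e
    if token_pos.numel():
        out[token_pos] += weights[token_pos, slot, None] * expert(x[token_pos])
print(out.shape)                                        # torch.Size([6, 32])
```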
  67. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Disaggregated serving https://hao-ai-lab.github.io/blogs/distserve/
  68. © 2025, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Please complete the session survey in the mobile app © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.