
Deploying PLaMo 2 with vLLM: A Practical Guide / vLLM roundup Community Meetup Tokyo

Presentation materials for the vLLM Roundup Community Meetup Tokyo. Shows how PLaMo™ 2 came to be officially supported in vLLM, from the model vendor's side, and how to use its various acceleration features.


Preferred Networks

June 17, 2025

Transcript

  1. Deploying PLaMo 2 with vLLM: A Practical Guide
     Shinichi Hemmi, LLM Inference Optimization Team, Preferred Networks
     vLLM roundup Community Meetup Tokyo (2025-06-16)
  2. Presenters
     Hemmi Shinichi (main speaker)
     - 2018.4 - 2024.3 Bachelor/Master @ UTokyo
     - 2022.11 - Committer @ Optuna
     - 2024.4 - Full-time @ Preferred Networks
     Sixue Wang
     - 2013.9 - 2017.6 Bachelor @ PKU
     - 2017.7 - 2019.12 Engineer @ Mobvoi, inc.
     - 2021.4 - 2023.3 Master @ Titech
     - 2023.4 - Full-time @ Preferred Networks
     Calvin Metzger
     - 2018.09 - 2024.3 Bachelor/Master @ TU
     - 2024.11 - Full-time @ Preferred Networks
  3. About Preferred Networks (PFN)
     Vertical integration of the AI value chain: PFN combines advanced software and hardware technologies in a vertically integrated approach, covering the entire AI value chain from chips to solutions and products.
     - Solutions & products: solutions and products for industries and consumers
     - Generative AI foundation models: PLaMo™ Prime (large language model), PLaMo™ Lite (small language model for edge devices), PFP (model for simulating material energy)
     - Computing infrastructure: GPU cluster, MN-3 (MN-Core™ cluster), cloud-based computing service powered by MN-Core™ 2
     - AI chips: MN-Core™, MN-Core™ 2, MN-Core™ L1000 (3rd-generation MN-Core, for LLM inference)
  4. Table of Contents
     Integrating PLaMo 2 with vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
     Accelerating Inference with vLLM
     1. torch.compile
     2. Chunked Prefill
  5. Integrating PLaMo 2 with vLLM
     Official Docs: Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  6. Integrating PLaMo 2 with vLLM
     Official Docs: Implementing a Basic Model — vLLM
     1. Bring your model code → Already public. See pfnet/plamo-2-1b and pfnet/plamo-2-8b!
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  7. Integrating PLaMo 2 with vLLM
     vLLM Class Hierarchy (Architecture Overview — vLLM)
     Modify PLaMo2ForCausalLM (for GPU):
     • Pass configs through VllmConfig
     • Propagate prefix properly
     • Separate logit computation into compute_logits()
     • Write sample() and load_model() explicitly
     (A structural sketch of these points follows below.)
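     A minimal structural sketch, assuming the model-class interface of recent vLLM versions (around v0.8/v0.9). The class name ToyForCausalLM and the omitted decoder stack are illustrative, not the actual PLaMo 2 implementation, and the module is meant to be constructed by the vLLM engine:

        import torch
        from torch import nn
        from vllm.config import VllmConfig
        from vllm.model_executor.layers.logits_processor import LogitsProcessor
        from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
        from vllm.model_executor.sampling_metadata import SamplingMetadata

        class ToyForCausalLM(nn.Module):
            """Structural sketch only; the decoder stack and forward() are omitted."""

            def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None:
                super().__init__()
                # All configs (model, cache, quantization, ...) arrive via VllmConfig.
                config = vllm_config.model_config.hf_config
                # Propagate `prefix` so submodule names match the checkpoint keys,
                # which matters for weight loading and quantization.
                self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size,
                                              prefix=f"{prefix}.lm_head")
                self.logits_processor = LogitsProcessor(config.vocab_size)

            def compute_logits(self, hidden_states: torch.Tensor,
                               sampling_metadata: SamplingMetadata) -> torch.Tensor:
                # Kept separate from forward() so vLLM can compute logits only for
                # the positions that actually need to be sampled.
                return self.logits_processor(self.lm_head, hidden_states,
                                             sampling_metadata)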
  8. Integrating PLaMo 2 with vLLM
     Modify the handling of:
     • Input parameters → treat as flat tensors without a batch dimension
     • Caches
       ◦ KV cache → vLLM's PagedAttention
       ◦ Mamba cache (V0 engine) → MambaCacheManager
     Note: Hybrid Memory Allocator #11382
     (The sketch below shows how the flat layout and the paged KV cache appear in model code.)
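     A hedged illustration of these two points (layer and argument names are illustrative): vLLM's Attention wrapper owns the paged KV cache, and the Mamba state is handled analogously by MambaCacheManager on the V0 engine.

        import torch
        from torch import nn
        from vllm.attention import Attention

        class ToyAttentionBlock(nn.Module):
            def __init__(self, num_heads: int, head_dim: int) -> None:
                super().__init__()
                # vLLM's Attention layer reads and writes the paged KV cache itself;
                # the model code never allocates or indexes KV blocks.
                self.attn = Attention(num_heads=num_heads, head_size=head_dim,
                                      scale=head_dim ** -0.5)

            def forward(self, q: torch.Tensor, k: torch.Tensor,
                        v: torch.Tensor) -> torch.Tensor:
                # q/k/v are flat: [num_tokens, num_heads * head_dim], with all
                # sequences of the batch concatenated (no batch dimension).
                return self.attn(q, k, v)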
  9. Integrating PLaMo 2 with vLLM
     Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  10. Tensor Parallel
     Replace each layer with a vLLM-native component. PLaMo 2 consists of the following blocks:
     - MLP, Attention: (almost) typical structure → similar to other model implementations
     - Mamba: differs from both the Mamba and Mamba2 structures → carefully combine both implementations
  11. Tensor Parallel: MLP
     Choose the linear layer type based on how the parallel computation actually runs (see the sketch below).
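     For example, a tensor-parallel MLP built from vLLM's linear layers might look like the following sketch. The ToyMLP name and the gate/up layout are illustrative, and the layers assume they are constructed inside a vLLM worker with the tensor-parallel groups initialized:

        import torch
        from torch import nn
        from vllm.model_executor.layers.activation import SiluAndMul
        from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                                       RowParallelLinear)

        class ToyMLP(nn.Module):
            def __init__(self, hidden_size: int, intermediate_size: int,
                         prefix: str = "") -> None:
                super().__init__()
                # Column-parallel: the output (gate/up) dimension is sharded, so each
                # rank computes a slice of the intermediate activation.
                self.gate_up_proj = MergedColumnParallelLinear(
                    hidden_size, [intermediate_size] * 2, bias=False,
                    prefix=f"{prefix}.gate_up_proj")
                self.act_fn = SiluAndMul()
                # Row-parallel: the input dimension is sharded; an all-reduce at the
                # end recombines the partial results.
                self.down_proj = RowParallelLinear(
                    intermediate_size, hidden_size, bias=False,
                    prefix=f"{prefix}.down_proj")

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                gate_up, _ = self.gate_up_proj(x)
                x = self.act_fn(gate_up)
                x, _ = self.down_proj(x)
                return x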
  12. Tensor Parallel: Attention
     Differences from a general Attention block:
     - Sliding Window Attention
       - Only supported by xFormers at the time of our implementation (currently supported by most backends)
     - Multi-head RMSNorm
       - vLLM's kernel only supports flat RMSNorm (see the per-head reshaping sketch below)
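     One hedged way to reuse the flat RMSNorm kernel for multi-head (per-head) normalization is to reshape so that the normalized axis is a single head. The helper below is illustrative, not the actual PLaMo 2 code:

        import torch
        from vllm.model_executor.layers.layernorm import RMSNorm

        def per_head_rmsnorm(x: torch.Tensor, norm: RMSNorm,
                             num_heads: int, head_dim: int) -> torch.Tensor:
            # x: [num_tokens, num_heads * head_dim]; norm = RMSNorm(head_dim)
            # Reshape so the normalized axis is a single head, run the flat RMSNorm
            # kernel over the last dimension, then restore the packed layout.
            num_tokens = x.shape[0]
            x = x.view(num_tokens, num_heads, head_dim)
            x = norm(x)
            return x.view(num_tokens, num_heads * head_dim)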
  13. Tensor Parallel: Mamba
     The Mamba layer in PLaMo 2 differs from standard Mamba and Mamba2:
     - Linear projection of the state parameters between the convolution and the scan
     - No RMSNorm after the selective scan
     (The sketch below shows this ordering in plain PyTorch.)
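     The following pure-PyTorch sketch shows only the operation ordering described above (convolution → project the state parameters → selective scan → output projection, with no RMSNorm after the scan). Dimensions, the projection layout, and the naive sequential scan are illustrative and not the optimized vLLM kernel path:

        import torch
        import torch.nn.functional as F
        from torch import nn

        class ToyPlamoStyleMamba(nn.Module):
            def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4) -> None:
                super().__init__()
                self.d_state = d_state
                self.in_proj = nn.Linear(d_model, d_model, bias=False)
                self.conv1d = nn.Conv1d(d_model, d_model, d_conv,
                                        groups=d_model, padding=d_conv - 1)
                # dt/B/C come from a linear projection *between* the convolution and
                # the scan, which is where PLaMo 2 deviates from Mamba/Mamba2.
                self.dt_bc_proj = nn.Linear(d_model, d_model + 2 * d_state, bias=False)
                self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
                self.out_proj = nn.Linear(d_model, d_model, bias=False)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # x: [seq_len, d_model] (flat layout, no batch dimension)
                seq_len, d_model = x.shape
                x = self.in_proj(x)
                # Depthwise causal convolution over the time axis.
                x = self.conv1d(x.t().unsqueeze(0))[0, :, :seq_len].t()
                dt, B, C = torch.split(self.dt_bc_proj(x),
                                       [d_model, self.d_state, self.d_state], dim=-1)
                dt = F.softplus(dt)
                A = -torch.exp(self.A_log)                     # [d_model, d_state]
                # Naive sequential selective scan, written out for clarity only.
                h = x.new_zeros(d_model, self.d_state)
                outputs = []
                for t in range(seq_len):
                    h = (torch.exp(dt[t, :, None] * A) * h
                         + dt[t, :, None] * B[t] * x[t, :, None])
                    outputs.append((h * C[t]).sum(-1))
                y = torch.stack(outputs)                       # [seq_len, d_model]
                # Note: no RMSNorm here before the output projection.
                return self.out_proj(y)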
  14. Integrating PLaMo 2 with vLLM
     Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  15. Implement the weight loading logic
     Reconcile with the Transformers implementation:
     - Weight reordering for TP compatibility
     - Log processing for Mamba layers
     - Handling of the RMSNorm offset
     - Name replacement, e.g.
       ▪ ".A_log" → ".A"
       ▪ ".XXX_norm_weight" → ".XXX_norm.weight"
     (A sketch of such a load_weights() routine follows below.)
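     A hedged sketch of what such a load_weights() routine can look like. The substitution rules and the A = -exp(A_log) step are illustrative of the "name replacement" and "log processing" points above, not the exact PLaMo 2 logic:

        from typing import Iterable, Tuple

        import torch
        from torch import nn
        from vllm.model_executor.model_loader.weight_utils import default_weight_loader

        def load_weights(model: nn.Module,
                         weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
            params = dict(model.named_parameters())
            for name, loaded_weight in weights:
                # Rename checkpoint keys to match the vLLM module hierarchy.
                name = name.replace(".A_log", ".A")
                name = name.replace("_norm_weight", "_norm.weight")
                # "Log processing" for Mamba layers: store A = -exp(A_log).
                if name.endswith(".A"):
                    loaded_weight = -torch.exp(loaded_weight.float())
                param = params[name]
                # Sharded parameters carry their own loader (handles TP reordering).
                weight_loader = getattr(param, "weight_loader", default_weight_loader)
                weight_loader(param, loaded_weight)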
  16. Integrating PLaMo 2 with vLLM
     Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  17. Register your model
     Registering a Model to vLLM — vLLM
     - Built-in models: directly modify the vLLM repository
       - Necessary when modifying the CUDA components
       - Submit a PR
     - Out-of-tree models: plug custom models into a vLLM installed from PyPI
       - Model definition files can be migrated directly to built-in models
       - Also useful for quick minor adjustments in production
     (See the out-of-tree registration sketch below.)
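     A hedged sketch of the out-of-tree route; the module and class names are placeholders (the real PLaMo 2 support now ships with vLLM itself), and the function would typically be exposed through vLLM's plugin entry-point mechanism described in the linked docs:

        from vllm import ModelRegistry

        def register() -> None:
            # A lazy "module:Class" string avoids importing the model code (and any
            # CUDA extensions) before the engine actually needs it.
            ModelRegistry.register_model(
                "MyPlamoLikeForCausalLM",
                "my_vllm_plugin.modeling:MyPlamoLikeForCausalLM",
            )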
  18. PLaMo 2 is now integrated with vLLM (starting with v0.8.5) 🎉
     vLLM Now Supports Your Model!
     $ pip install vllm
     $ vllm serve pfnet/plamo-2-1b --trust-remote-code --max-model-len 4096
     https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation
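     As a quick smoke test against the server started above (a hedged example; it assumes the default port 8000 and the OpenAI-compatible completions endpoint):

        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
        completion = client.completions.create(
            model="pfnet/plamo-2-1b",
            prompt="The capital of Japan is",
            max_tokens=32,
        )
        print(completion.choices[0].text)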
  19. Tips & Pitfalls
     Debugging tips
     - After completing the model registration process:
       - Compare layer-by-layer tensor inputs/outputs using torch.save (see the hook sketch below)
       - Perform comparisons using suitable generation benchmarks; pfnet-research/pgfgen-bench is particularly useful for this purpose
     - Editable install: takes time (and requires environment variable configuration)
       - If you only modify the Python components, use USE_VLLM_PREBUILD=1
     Implementation pitfalls
     - Module names are automatically renamed when loading a quantized model → weight reordering based on the original layer names causes output corruption
     - The layers_block_type attribute in the PretrainedConfig is assumed for hybrid models → we modify the public model accordingly
     - The calculation logic for config.head_dim is hardcoded
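     One hedged way to do the layer-by-layer comparison with torch.save is a forward hook that dumps every submodule's inputs and outputs from both the Transformers and the vLLM run, so the files can be diffed offline (the path and naming scheme are illustrative):

        import torch
        from torch import nn

        def attach_dump_hooks(model: nn.Module, out_dir: str) -> None:
            def make_hook(name: str):
                def hook(module, inputs, output):
                    # One file per module (overwritten on each call); diff these
                    # against the files produced by the reference run.
                    torch.save({"inputs": inputs, "output": output},
                               f"{out_dir}/{name.replace('.', '_')}.pt")
                return hook

            for name, module in model.named_modules():
                if name:  # skip the root module itself
                    module.register_forward_hook(make_hook(name))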
  20. torch.compile
     • torch.compile mainly reduces superfluous memory operations by fusing torch ops into single Triton kernels.
     • Decorate a torch.nn.Module using @support_torch_compile.
     • Use the piecewise compilation feature to only compile the Attention layers.
     • Compile specific batch sizes to generate faster kernels: compile_sizes=[1, 8, 16, 32, 64, 128, …]
     • Save compiled graphs in a persistent volume (VLLM_CACHE_ROOT) in order to reduce launch time.
     Slide code (abridged; the decoder stack alternates Mamba and Attention layers):

        @support_torch_compile
        class Plamo2AttentionMixer(nn.Module):
            ...

        class Plamo2DecoderLayer(nn.Module):
            # inside __init__: each layer holds either a Mamba or an Attention mixer
            self.mixer = Plamo2MambaMixer(...) if self.is_mamba else Plamo2AttentionMixer(...)
            ...

        class Plamo2Decoder(torch.nn.Module):
            # inside __init__: build the alternating layer stack
            self.layers = nn.ModuleList(Plamo2DecoderLayer(...) for _ in range(num_layers))
            ...
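     A hedged example of wiring these options up when constructing the engine; the exact CompilationConfig fields can differ across vLLM versions, and the cache path is illustrative:

        import os

        # Persist compiled artifacts across launches (e.g. on a mounted volume);
        # set before vLLM is imported/started.
        os.environ.setdefault("VLLM_CACHE_ROOT", "/mnt/cache/vllm")

        from vllm import LLM
        from vllm.config import CompilationConfig, CompilationLevel

        llm = LLM(
            model="pfnet/plamo-2-1b",
            trust_remote_code=True,
            max_model_len=4096,
            compilation_config=CompilationConfig(
                level=CompilationLevel.PIECEWISE,          # piecewise compilation
                compile_sizes=[1, 8, 16, 32, 64, 128],     # specialize these batch sizes
            ),
        )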
  21. torch.compile (continued)
     • Use combo_kernels=True to horizontally fuse independent torch ops into single kernels in order to increase occupancy, e.g. QK norm.
     [Figure: the ops inside the compiled Plamo2AttentionMixer (qkv_projection, q_norm, k_norm, rotary_embedding, attention, output_projection), shown before and after compilation; qk_norm and RoPE end up fused.]
     GPU   tokens/s improvement
     A100  5%
     L40s  5%
  22. Chunked Prefill
     • Combine compute-bound prefill and memory-bound decode requests in a single batch.
     • Increases throughput due to better GPU utilization.
     • But the Mamba kernels (chunked selective scan and causal conv1d) degrade in performance with chunked prefill.
     • Split the combined batch back into separate prefill and decode batches in the Mamba layer for better performance.
     GPU   tokens/s improvement
     A100  3.4%
     L40s  10.3%
     H100  5.5%
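     A hedged example of turning chunked prefill on explicitly when constructing the engine (defaults vary by vLLM version and model; the token budget shown is illustrative):

        from vllm import LLM

        llm = LLM(
            model="pfnet/plamo-2-1b",
            trust_remote_code=True,
            max_model_len=4096,
            enable_chunked_prefill=True,
            # Upper bound on prefill + decode tokens mixed into a single batch.
            max_num_batched_tokens=2048,
        )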