
Deploying PLaMo 2 with vLLM: A Practical Guide / vLLM roundup Community Meetup Tokyo

Presentation materials for the vLLM Roundup Community Meetup Tokyo. Shows how PLaMo™ 2 came to be officially supported in vLLM, from the model vendor's side, and how to use its various acceleration features.


Preferred Networks

June 17, 2025

Transcript

  1. Deploying PLaMo 2 with vLLM: A Practical Guide
     Shinichi Hemmi, LLM Inference Optimization Team, Preferred Networks
     vLLM roundup Community Meetup Tokyo (2025-06-16)
  2. Presenters
     Hemmi Shinichi (main speaker)
     - 2018.4 - 2024.3 Bachelor/Master @ UTokyo
     - 2022.11 - Committer @ Optuna
     - 2024.4 - Full-time @ Preferred Networks
     Sixue Wang
     - 2013.9 - 2017.6 Bachelor @ PKU
     - 2017.7 - 2019.12 Engineer @ Mobvoi, inc.
     - 2021.4 - 2023.3 Master @ Titech
     - 2023.4 - Full-time @ Preferred Networks
     Calvin Metzger
     - 2018.09 - 2024.3 Bachelor/Master @ TU
     - 2024.11 - Full-time @ Preferred Networks
  3. About Preferred Networks (PFN)
     Vertical integration of the AI value chain: PFN combines advanced software and hardware technologies in a vertically integrated approach, covering the entire AI value chain from chips to solutions and products.
     - Solutions & products: solutions and products for industries and consumers
     - Generative AI foundation models: PLaMo™ Prime (large language model), PLaMo™ Lite (small language model for edge devices), PFP (model for simulating material energy)
     - Computing infrastructure: GPU cluster, MN-3 (MN-Core™ cluster), cloud-based computing service powered by MN-Core™ 2
     - AI chips: MN-Core™, MN-Core™ 2, MN-Core™ L1000 (3rd-generation MN-Core, for LLM inference)
  4. Table of Contents
     Integrating PLaMo 2 with vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
     Accelerating Inference with vLLM
     1. torch.compile
     2. Chunked Prefill
  5. Integrating PLaMo 2 with vLLM
     Official Docs: Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  6. Integrating PLaMo 2 with vLLM
     Official Docs: Implementing a Basic Model — vLLM
     1. Bring your model code → Already public. See pfnet/plamo-2-1b and pfnet/plamo-2-8b!
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  7. Integrating PLaMo 2 with vLLM
     vLLM Class Hierarchy (Architecture Overview — vLLM)
     Modify PLaMo2ForCausalLM (for GPU):
     • Pass configs through VllmConfig
     • Propagate prefix properly
     • Separate logit computation into compute_logits()
     • Write sample() and load_model() explicitly
     (A structural sketch of these points follows below.)
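     A minimal structural sketch, assuming the model-class interface of recent vLLM versions (around v0.8/v0.9). The class name ToyForCausalLM and the omitted decoder stack are illustrative, not the actual PLaMo 2 implementation, and the module is meant to be constructed by the vLLM engine:

        import torch
        from torch import nn
        from vllm.config import VllmConfig
        from vllm.model_executor.layers.logits_processor import LogitsProcessor
        from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
        from vllm.model_executor.sampling_metadata import SamplingMetadata

        class ToyForCausalLM(nn.Module):
            """Structural sketch only; the decoder stack and forward() are omitted."""

            def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None:
                super().__init__()
                # All configs (model, cache, quantization, ...) arrive via VllmConfig.
                config = vllm_config.model_config.hf_config
                # Propagate `prefix` so submodule names match the checkpoint keys,
                # which matters for weight loading and quantization.
                self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size,
                                              prefix=f"{prefix}.lm_head")
                self.logits_processor = LogitsProcessor(config.vocab_size)

            def compute_logits(self, hidden_states: torch.Tensor,
                               sampling_metadata: SamplingMetadata) -> torch.Tensor:
                # Kept separate from forward() so vLLM can compute logits only for
                # the positions that actually need to be sampled.
                return self.logits_processor(self.lm_head, hidden_states,
                                             sampling_metadata)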
  8. Integrating PLaMo 2 with vLLM
     Modify the handling of:
     • Input parameters → treat as flat tensors without a batch dimension
     • Caches
       ◦ KV cache → vLLM's PagedAttention
       ◦ Mamba cache (V0 engine) → MambaCacheManager
     Note: Hybrid Memory Allocator #11382
     (The sketch below shows how the flat layout and the paged KV cache appear in model code.)
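     A hedged illustration of these two points (layer and argument names are illustrative): vLLM's Attention wrapper owns the paged KV cache, and the Mamba state is handled analogously by MambaCacheManager on the V0 engine.

        import torch
        from torch import nn
        from vllm.attention import Attention

        class ToyAttentionBlock(nn.Module):
            def __init__(self, num_heads: int, head_dim: int) -> None:
                super().__init__()
                # vLLM's Attention layer reads and writes the paged KV cache itself;
                # the model code never allocates or indexes KV blocks.
                self.attn = Attention(num_heads=num_heads, head_size=head_dim,
                                      scale=head_dim ** -0.5)

            def forward(self, q: torch.Tensor, k: torch.Tensor,
                        v: torch.Tensor) -> torch.Tensor:
                # q/k/v are flat: [num_tokens, num_heads * head_dim], with all
                # sequences of the batch concatenated (no batch dimension).
                return self.attn(q, k, v)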
  9. Integrating PLaMo 2 with vLLM
     Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  10. Tensor Parallel
     Replace each layer with a vLLM-native component. PLaMo 2 consists of the following blocks:
     - MLP, Attention: (almost) typical structure → similar to other model implementations
     - Mamba: differs from both the Mamba and Mamba2 structures → carefully combine both implementations
  11. Tensor Parallel: MLP
     Choose the linear layer type based on how the parallel computation actually runs (see the sketch below).
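     For example, a tensor-parallel MLP built from vLLM's linear layers might look like the following sketch. The ToyMLP name and the gate/up layout are illustrative, and the layers assume they are constructed inside a vLLM worker with the tensor-parallel groups initialized:

        import torch
        from torch import nn
        from vllm.model_executor.layers.activation import SiluAndMul
        from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                                       RowParallelLinear)

        class ToyMLP(nn.Module):
            def __init__(self, hidden_size: int, intermediate_size: int,
                         prefix: str = "") -> None:
                super().__init__()
                # Column-parallel: the output (gate/up) dimension is sharded, so each
                # rank computes a slice of the intermediate activation.
                self.gate_up_proj = MergedColumnParallelLinear(
                    hidden_size, [intermediate_size] * 2, bias=False,
                    prefix=f"{prefix}.gate_up_proj")
                self.act_fn = SiluAndMul()
                # Row-parallel: the input dimension is sharded; an all-reduce at the
                # end recombines the partial results.
                self.down_proj = RowParallelLinear(
                    intermediate_size, hidden_size, bias=False,
                    prefix=f"{prefix}.down_proj")

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                gate_up, _ = self.gate_up_proj(x)
                x = self.act_fn(gate_up)
                x, _ = self.down_proj(x)
                return x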
  12. Tensor Parallel: Attention
     Differences from a general Attention block:
     - Sliding Window Attention
       - Only supported by xFormers at the time of our implementation (currently supported by most backends)
     - Multi-head RMSNorm
       - vLLM's kernel only supports flat RMSNorm (see the per-head reshaping sketch below)
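     One hedged way to reuse the flat RMSNorm kernel for multi-head (per-head) normalization is to reshape so that the normalized axis is a single head. The helper below is illustrative, not the actual PLaMo 2 code:

        import torch
        from vllm.model_executor.layers.layernorm import RMSNorm

        def per_head_rmsnorm(x: torch.Tensor, norm: RMSNorm,
                             num_heads: int, head_dim: int) -> torch.Tensor:
            # x: [num_tokens, num_heads * head_dim]; norm = RMSNorm(head_dim)
            # Reshape so the normalized axis is a single head, run the flat RMSNorm
            # kernel over the last dimension, then restore the packed layout.
            num_tokens = x.shape[0]
            x = x.view(num_tokens, num_heads, head_dim)
            x = norm(x)
            return x.view(num_tokens, num_heads * head_dim)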
  13. Tensor Parallel: Mamba
     The Mamba layer in PLaMo 2 differs from standard Mamba and Mamba2:
     - Linear projection of the state parameters between the convolution and the scan
     - No RMSNorm after the selective scan
     (The sketch below shows this ordering in plain PyTorch.)
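     The following pure-PyTorch sketch shows only the operation ordering described above (convolution → project the state parameters → selective scan → output projection, with no RMSNorm after the scan). Dimensions, the projection layout, and the naive sequential scan are illustrative and not the optimized vLLM kernel path:

        import torch
        import torch.nn.functional as F
        from torch import nn

        class ToyPlamoStyleMamba(nn.Module):
            def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4) -> None:
                super().__init__()
                self.d_state = d_state
                self.in_proj = nn.Linear(d_model, d_model, bias=False)
                self.conv1d = nn.Conv1d(d_model, d_model, d_conv,
                                        groups=d_model, padding=d_conv - 1)
                # dt/B/C come from a linear projection *between* the convolution and
                # the scan, which is where PLaMo 2 deviates from Mamba/Mamba2.
                self.dt_bc_proj = nn.Linear(d_model, d_model + 2 * d_state, bias=False)
                self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
                self.out_proj = nn.Linear(d_model, d_model, bias=False)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # x: [seq_len, d_model] (flat layout, no batch dimension)
                seq_len, d_model = x.shape
                x = self.in_proj(x)
                # Depthwise causal convolution over the time axis.
                x = self.conv1d(x.t().unsqueeze(0))[0, :, :seq_len].t()
                dt, B, C = torch.split(self.dt_bc_proj(x),
                                       [d_model, self.d_state, self.d_state], dim=-1)
                dt = F.softplus(dt)
                A = -torch.exp(self.A_log)                     # [d_model, d_state]
                # Naive sequential selective scan, written out for clarity only.
                h = x.new_zeros(d_model, self.d_state)
                outputs = []
                for t in range(seq_len):
                    h = (torch.exp(dt[t, :, None] * A) * h
                         + dt[t, :, None] * B[t] * x[t, :, None])
                    outputs.append((h * C[t]).sum(-1))
                y = torch.stack(outputs)                       # [seq_len, d_model]
                # Note: no RMSNorm here before the output projection.
                return self.out_proj(y)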
  14. Integrating PLaMo 2 with vLLM
     Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  15. Implement the weight loading logic
     Reconcile with the Transformers implementation:
     - Weight reordering for TP compatibility
     - Log processing for Mamba layers
     - Handling of the RMSNorm offset
     - Name replacement, e.g.
       ▪ ".A_log" → ".A"
       ▪ ".XXX_norm_weight" → ".XXX_norm.weight"
     (A sketch of such a load_weights() routine follows below.)
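     A hedged sketch of what such a load_weights() routine can look like. The substitution rules and the A = -exp(A_log) step are illustrative of the "name replacement" and "log processing" points above, not the exact PLaMo 2 logic:

        from typing import Iterable, Tuple

        import torch
        from torch import nn
        from vllm.model_executor.model_loader.weight_utils import default_weight_loader

        def load_weights(model: nn.Module,
                         weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
            params = dict(model.named_parameters())
            for name, loaded_weight in weights:
                # Rename checkpoint keys to match the vLLM module hierarchy.
                name = name.replace(".A_log", ".A")
                name = name.replace("_norm_weight", "_norm.weight")
                # "Log processing" for Mamba layers: store A = -exp(A_log).
                if name.endswith(".A"):
                    loaded_weight = -torch.exp(loaded_weight.float())
                param = params[name]
                # Sharded parameters carry their own loader (handles TP reordering).
                weight_loader = getattr(param, "weight_loader", default_weight_loader)
                weight_loader(param, loaded_weight)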
  16. Integrating PLaMo 2 with vLLM
     Implementing a Basic Model — vLLM
     1. Bring your model code
     2. Make your code compatible with vLLM
     3. (Optional) Implement tensor parallelism and quantization support
     4. Implement the weight loading logic
     5. Register your model
  17. Register your model
     Registering a Model to vLLM — vLLM
     - Built-in models: directly modify the vLLM repository
       - Necessary when modifying the CUDA components
       - Submit a PR
     - Out-of-tree models: plug custom models into a vLLM installed from PyPI
       - Model definition files can be migrated directly to built-in models
       - Also useful for quick minor adjustments in production
     (See the out-of-tree registration sketch below.)
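     A hedged sketch of the out-of-tree route; the module and class names are placeholders (the real PLaMo 2 support now ships with vLLM itself), and the function would typically be exposed through vLLM's plugin entry-point mechanism described in the linked docs:

        from vllm import ModelRegistry

        def register() -> None:
            # A lazy "module:Class" string avoids importing the model code (and any
            # CUDA extensions) before the engine actually needs it.
            ModelRegistry.register_model(
                "MyPlamoLikeForCausalLM",
                "my_vllm_plugin.modeling:MyPlamoLikeForCausalLM",
            )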
  18. PLaMo 2 is now integrated with vLLM (starting with v0.8.5) 🎉
     vLLM Now Supports Your Model!
     $ pip install vllm
     $ vllm serve pfnet/plamo-2-1b --trust-remote-code --max-model-len 4096
     https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation
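     As a quick smoke test against the server started above (a hedged example; it assumes the default port 8000 and the OpenAI-compatible completions endpoint):

        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
        completion = client.completions.create(
            model="pfnet/plamo-2-1b",
            prompt="The capital of Japan is",
            max_tokens=32,
        )
        print(completion.choices[0].text)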
  19. Tips & Pitfalls
     Debugging tips
     - After completing the model registration process:
       - Compare layer-by-layer tensor inputs/outputs using torch.save (see the hook sketch below)
       - Perform comparisons using suitable generation benchmarks; pfnet-research/pgfgen-bench is particularly useful for this purpose
     - Editable install: takes time (and requires environment variable configuration)
       - If you only modify the Python components, use USE_VLLM_PREBUILD=1
     Implementation pitfalls
     - Module names are automatically renamed when loading a quantized model → weight reordering based on the original layer names causes output corruption
     - The layers_block_type attribute in the PretrainedConfig is assumed for hybrid models → we modify the public model accordingly
     - The calculation logic for config.head_dim is hardcoded
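     One hedged way to do the layer-by-layer comparison with torch.save is a forward hook that dumps every submodule's inputs and outputs from both the Transformers and the vLLM run, so the files can be diffed offline (the path and naming scheme are illustrative):

        import torch
        from torch import nn

        def attach_dump_hooks(model: nn.Module, out_dir: str) -> None:
            def make_hook(name: str):
                def hook(module, inputs, output):
                    # One file per module (overwritten on each call); diff these
                    # against the files produced by the reference run.
                    torch.save({"inputs": inputs, "output": output},
                               f"{out_dir}/{name.replace('.', '_')}.pt")
                return hook

            for name, module in model.named_modules():
                if name:  # skip the root module itself
                    module.register_forward_hook(make_hook(name))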
  20. torch.compile
     • torch.compile mainly reduces superfluous memory operations by fusing torch ops into single Triton kernels.
     • Decorate a torch.nn.Module using @support_torch_compile.
     • Use the piecewise compilation feature to only compile the Attention layers.
     • Compile specific batch sizes to generate faster kernels: compile_sizes=[1, 8, 16, 32, 64, 128, …]
     • Save compiled graphs in a persistent volume (VLLM_CACHE_ROOT) in order to reduce launch time.
     Slide code (abridged; the decoder stack alternates Mamba and Attention layers):

        @support_torch_compile
        class Plamo2AttentionMixer(nn.Module):
            ...

        class Plamo2DecoderLayer(nn.Module):
            # inside __init__: each layer holds either a Mamba or an Attention mixer
            self.mixer = Plamo2MambaMixer(...) if self.is_mamba else Plamo2AttentionMixer(...)
            ...

        class Plamo2Decoder(torch.nn.Module):
            # inside __init__: build the alternating layer stack
            self.layers = nn.ModuleList(Plamo2DecoderLayer(...) for _ in range(num_layers))
            ...
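     A hedged example of wiring these options up when constructing the engine; the exact CompilationConfig fields can differ across vLLM versions, and the cache path is illustrative:

        import os

        # Persist compiled artifacts across launches (e.g. on a mounted volume);
        # set before vLLM is imported/started.
        os.environ.setdefault("VLLM_CACHE_ROOT", "/mnt/cache/vllm")

        from vllm import LLM
        from vllm.config import CompilationConfig, CompilationLevel

        llm = LLM(
            model="pfnet/plamo-2-1b",
            trust_remote_code=True,
            max_model_len=4096,
            compilation_config=CompilationConfig(
                level=CompilationLevel.PIECEWISE,          # piecewise compilation
                compile_sizes=[1, 8, 16, 32, 64, 128],     # specialize these batch sizes
            ),
        )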
  21. torch.compile (continued)
     • Use combo_kernels=True to horizontally fuse independent torch ops into single kernels in order to increase occupancy, e.g. QK norm.
     [Figure: the ops inside the compiled Plamo2AttentionMixer (qkv_projection, q_norm, k_norm, rotary_embedding, attention, output_projection), shown before and after compilation; qk_norm and RoPE end up fused.]
     GPU   tokens/s improvement
     A100  5%
     L40s  5%
  22. Chunked Prefill
     • Combine compute-bound prefill and memory-bound decode requests in a single batch.
     • Increases throughput due to better GPU utilization.
     • But the Mamba kernels (chunked selective scan and causal conv1d) degrade in performance with chunked prefill.
     • Split the combined batch back into separate prefill and decode batches in the Mamba layer for better performance.
     GPU   tokens/s improvement
     A100  3.4%
     L40s  10.3%
     H100  5.5%
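     A hedged example of turning chunked prefill on explicitly when constructing the engine (defaults vary by vLLM version and model; the token budget shown is illustrative):

        from vllm import LLM

        llm = LLM(
            model="pfnet/plamo-2-1b",
            trust_remote_code=True,
            max_model_len=4096,
            enable_chunked_prefill=True,
            # Upper bound on prefill + decode tokens mixed into a single batch.
            max_num_batched_tokens=2048,
        )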