Matsuo Lab LLM Course 2025, Advanced Track Day 3 "Model Compression" lecture materials

Slide 1

Slide 1 text

©︎MATSUO-IWASAWA LAB, THE UNIVERSITY OF TOKYO. Advanced Track, Session 3: Model Compression. Large Language Model Course 2025. Photography and disclosure to third parties without permission are prohibited.

Slide 132

Slide 132 text

References

[1] Nouamane Tazi et al. (2025), “The Ultra-Scale Playbook: Training LLMs on GPU Clusters”, accessed 2025-11-05
[2] DeepSeek-AI, “Day 6: One More Thing, DeepSeek-V3/R1 Inference System Overview”, accessed 2025-10-30
[3] NVIDIA, “Product Carbon Footprint (PCF) Summary for NVIDIA HGX B200”, accessed 2025-10-15
[4] NUS School of Computing, “Data Representation and Number Systems”, accessed 2025-11-19
[5] Maarten Grootendorst, “A Visual Guide to Quantization”, accessed 2025-10-07
[6] Kazuki Fujii, “FP8 trainingを支える技術 1”, accessed 2025-10-07
[7] Hugging Face, “Quantization concepts”, accessed 2025-10-07
[8] Yuma Ichikawa, “NLPコロキウム20251022_超効率化への挑戦: LLM 1bit量子化のロードマップ”, accessed 2025-10-30
[9] Tim Dettmers et al. (2022), “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”, arXiv:2208.07339
[10] Guangxuan Xiao et al. (2022), “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”, arXiv:2211.10438
[11] Elias Frantar et al. (2022), “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”, arXiv:2210.17323
[12] Elias Frantar et al. (2022), “Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning”, arXiv:2208.11580
[13] Ji Lin et al. (2023), “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”, arXiv:2306.00978
[14] Eduardo Alvarez et al. (2025), “Introducing NVFP4 for Efficient and Accurate Low-Precision Inference”, accessed 2025-10-15
[15] Zirui Liu et al. (2024), “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache”, arXiv:2402.02750
[16] Coleman Hooper et al. (2024), “KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization”, arXiv:2401.18079
[17] Amir Gholami et al. (2021), “A Survey of Quantization Methods for Efficient Neural Network Inference”, arXiv:2103.13630
[18] Zechun Liu et al. (2023), “LLM-QAT: Data-Free Quantization Aware Training for Large Language Models”, arXiv:2305.17888
[19] Mengzhao Chen et al. (2024), “EfficientQAT: Efficient Quantization-Aware Training for Large Language Models”, arXiv:2407.11062
[20] Hongyu Wang et al. (2023), “BitNet: Scaling 1-bit Transformers for Large Language Models”, arXiv:2310.11453

Slide 133

Slide 133 text

References (continued)

[21] Shuming Ma et al. (2024), “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits”, arXiv:2402.17764
[22] https://github.com/bitsandbytes-foundation/bitsandbytes, accessed 2025-10-15
[23] https://github.com/vllm-project/llm-compressor, accessed 2025-10-15
[24] https://github.com/ggml-org/llama.cpp/wiki/Tensor-Encoding-Schemes, accessed 2025-10-15
[25] Meta, “Llama 3.2: Revolutionizing edge AI and vision with open, customizable models”, accessed 2025-10-18
[26] Shuji Suzuki, “Pruningによる小型LLM PLaMo 2 2Bの開発”, accessed 2025-10-18
[27] Song Han, “EfficientML.ai Lecture 03: Pruning and Sparsity”, accessed 2025-10-18
[28] Jeff Pool et al., “Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT”, accessed 2025-10-18
[29] Elias Frantar et al. (2023), “SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot”, arXiv:2301.00774
[30] Xin Men et al. (2024), “ShortGPT: Layers in Large Language Models are More Redundant Than You Expect”, arXiv:2403.03853
[31] Andrey Gromov et al. (2024), “The Unreasonable Ineffectiveness of the Deeper Layers”, arXiv:2403.17887
[32] Xudong Lu et al. (2024), “Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models”, arXiv:2402.14800
[33] Saurav Muralidharan et al. (2024), “Compact Language Models via Pruning and Knowledge Distillation”, arXiv:2407.14679
[34] Sharath Sreenivas et al. (2024), “How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model”, accessed 2025-10-18
[35] Jianping Gou et al. (2020), “Knowledge Distillation: A Survey”, arXiv:2006.05525
[36] Kevin Lu et al. (2025), “On-Policy Distillation”, accessed 2025-11-05
[37] Rishabh Agarwal et al. (2023), “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes”, arXiv:2306.13649
[38] Daya Guo et al. (2025), “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”, arXiv:2501.12948
[39] An Yang et al. (2025), “Qwen3 Technical Report”, arXiv:2505.09388

Slide 134

Slide 134 text

References (continued)

[40] Ashish Vaswani et al. (2017), “Attention Is All You Need”, arXiv:1706.03762
[41] Noam Shazeer (2019), “Fast Transformer Decoding: One Write-Head is All You Need”, arXiv:1911.02150
[42] shubham ashok gandhi, “Multi-Head vs Multi-Query vs Grouped Query Attention”, accessed 2025-10-21
[43] Joshua Ainslie et al. (2023), “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”, arXiv:2305.13245
[44] Aixin Liu et al. (2024), “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model”, arXiv:2405.04434
[45] Iz Beltagy et al. (2020), “Longformer: The Long-Document Transformer”, arXiv:2004.05150
[46] Sebastian Raschka, “Gemma 3 270M From Scratch”, accessed 2025-10-21
[47] DeepSeek-AI (2025), “DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention”, accessed 2025-10-21
[48] Aaron Blakeman et al. (2025), “Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models”, arXiv:2504.03624
[49] Omkaar Kamath, “The Illustrated LFM-2 (by Liquid AI)”, accessed 2025-10-21
[50] Qwen Team, “Qwen3-Next: Towards Ultimate Training & Inference Efficiency”, accessed 2025-10-21
[51] Weilin Cai et al. (2024), “A Survey on Mixture of Experts in Large Language Models”, arXiv:2407.06204
[52] Hugging Face, “meta-llama/Llama-4-Scout-17B-16E”, accessed 2025-10-23
[53] Aixin Liu et al. (2024), “DeepSeek-V3 Technical Report”, arXiv:2412.19437
[54] Felix Abecassis et al. (2025), “Pretraining Large Language Models with NVFP4”, arXiv:2509.25149
[55] Zeyu Han et al. (2024), “Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey”, arXiv:2403.14608
[56] Edward J. Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models”, arXiv:2106.09685
[57] Tim Dettmers et al. (2023), “QLoRA: Efficient Finetuning of Quantized LLMs”, arXiv:2305.14314
[58] Yixiao Li et al. (2023), “LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models”, arXiv:2310.08659
[59] Damjan Kalajdzievski (2023), “A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA”, arXiv:2312.03732
