al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the NeurIPS 2022.
[3] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the ICLR 2023.
[4] 高野, 齋藤, 石原. 2026. 『Kaggleではじめる大規模言語モデル入門』 [Introduction to Large Language Models with Kaggle]. 講談社 (Kodansha).
[5] https://github.com/huggingface/transformers
[6] https://github.com/huggingface/trl
[7] https://github.com/huggingface/sentence-transformers
[8] Edward J. Hu, Yelong Shen, Phillip Wallis, et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the ICLR 2022.
[9] Tim Dettmers, Mike Lewis, Younes Belkada, et al. 2022. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In Proceedings of the NeurIPS 2022.
[10] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, et al. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In Proceedings of the NeurIPS 2023.
[11] https://github.com/bitsandbytes-foundation/bitsandbytes
[12] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, et al. 2023. GPTQ: Accurate Quantization for Generative Pre-trained Transformers. In Proceedings of the ICLR 2023.
[13] https://github.com/ModelCloud/GPTQModel
[14] https://github.com/vllm-project/llm-compressor
[15] Ji Lin, Jiaming Tang, Haotian Tang, et al. 2024. AWQ: Activation-aware Weight Quantization for On-device LLM Compression and Acceleration. GetMobile: Mobile Computing and Communications, 28(4):12–17.
et al. 2024. Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. In Findings of the EMNLP 2024.
[18] https://github.com/intel/auto-round
[19] Jianping Gou, Baosheng Yu, Stephen J. Maybank, et al. 2021. Knowledge Distillation: A Survey. International Journal of Computer Vision, 129(6):1789–1819.
[20] https://medium.com/my-musings-with-llms/understanding-kv-cache-and-paged-attention-in-llms-a-deep-dive-into-efficient-inference-62fa372432ce
[21] Tri Dao, Daniel Y. Fu, Stefano Ermon, et al. 2022. FlashAttention: Fast and Memory-efficient Exact Attention with IO-awareness. In Proceedings of the NeurIPS 2022.
[22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the SOSP 2023.
[23] https://github.com/vllm-project/vllm
[24] https://www.kaggle.com/competitions/lmsys-chatbot-arena
[25] https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit
[26] https://www.kaggle.com/code/emiz6413/training-gemma-2-9b-4-bit-qlora-fine-tuning
[27] https://www.kaggle.com/competitions/map-charting-student-math-misunderstandings/overview
[28] https://huggingface.co/Qwen/Qwen3-8B
[29] https://www.kaggle.com/code/sinchir0/lb-0-942-infer-fullft-qwen3-8b-by-sfttrainer
[30] https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data
[31] https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/472221