and training of neural networks for efficient integer-arithmetic-only inference." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
• Zhou, Aojun, et al. "Incremental network quantization: Towards lossless CNNs with low-precision weights." arXiv preprint arXiv:1702.03044 (2017).
• Krishnamoorthi, Raghuraman. "Quantizing deep convolutional networks for efficient inference: A whitepaper." arXiv preprint arXiv:1806.08342 (2018).
• Nagel, Markus, et al. "A white paper on neural network quantization." arXiv preprint arXiv:2106.08295 (2021).
• Lin, Xiaofan, Cong Zhao, and Wei Pan. "Towards accurate binary convolutional neural network." Advances in neural information processing systems 30 (2017).
• Gholami, Amir, et al. "A survey of quantization methods for efficient neural network inference." Low-Power Computer Vision. Chapman and Hall/CRC, 2022. 291-326.
• Nagel, Markus, et al. "Data-free quantization through weight equalization and bias correction." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
• Nagel, Markus, et al. "Up or down? Adaptive rounding for post-training quantization." International Conference on Machine Learning. PMLR, 2020.
• Dettmers, Tim, et al. "LLM.int8(): 8-bit matrix multiplication for transformers at scale." Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022.
• Wei, Xiuying, et al. "Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling." arXiv preprint arXiv:2304.09145 (2023).
References
Accurate Quantization for Generative Pre-trained Transformers." The Eleventh International Conference on Learning Representations. 2023.
• Kim, Jeonghoon, et al. "Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization." arXiv preprint arXiv:2305.14152 (2023).
• Schaefer, Clemens JS, et al. "Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search." arXiv preprint arXiv:2302.01382 (2023).
• Pandey, Nilesh Prasad, et al. "A Practical Mixed Precision Algorithm for Post-Training Quantization." arXiv preprint arXiv:2302.05397 (2023).
• Wang, Kuan, et al. "HAQ: Hardware-aware automated quantization with mixed precision." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
• Koryakovskiy, Ivan, et al. "One-Shot Model for Mixed-Precision Quantization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
• Wu, Bichen, et al. "Mixed precision quantization of convnets via differentiable neural architecture search." arXiv preprint arXiv:1812.00090 (2018).
• Qian, Biao, et al. "Adaptive Data-Free Quantization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
• Liu, Zechun, et al. "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models." arXiv preprint arXiv:2305.17888 (2023).
• Tang, Chen, et al. "Mixed-precision neural network quantization via learned layer-wise importance." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
Accurate and efficient post-training quantization for large language models." International Conference on Machine Learning. PMLR, 2023.
• Dettmers, Tim, et al. "QLoRA: Efficient finetuning of quantized LLMs." Advances in neural information processing systems 36 (2023): 10088-10115.
• Dettmers, Tim. "8-bit approximations for parallelism in deep learning." arXiv preprint arXiv:1511.04561 (2015).
• Lin, Ji, et al. "AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration." Proceedings of Machine Learning and Systems 6 (2024): 87-100.
• Dettmers, Tim, et al. "8-bit Optimizers via Block-wise Quantization." International Conference on Learning Representations. 2022.
• Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." International Conference on Learning Representations. 2022.