Purpose of reading
• Identify which forms of "quantization" are usable in real libraries
• weight, activation, gradient, error
• 1-bit (XNOR), 2-bit, 4-bit, 8-bit
• Cross-check against surrounding trends (hardware etc., beyond the papers)
Training and Inference with Integers in Deep Neural Networks
https://arxiv.org/abs/1802.04680
Papers
• Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
  • https://arxiv.org/abs/1712.05877
• Quantizing deep convolutional networks for efficient inference: A whitepaper (the TensorFlow whitepaper)
  • https://arxiv.org/abs/1806.08342
• Post training 4-bit quantization of convolutional networks for rapid-deployment (NeurIPS 2019)
  • https://arxiv.org/abs/1810.05723
Post-training or Quantization-aware training
• Post-training quantization: ranges fixed up front (→ a cause of accuracy degradation)
• Quantization-aware training: ranges determined analytically through fine-tuning
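The difference between a pre-decided range and a data-driven one can be sketched in plain Python (our illustration, not code from any of the papers; the helper names are made up):

```python
# Minimal sketch of uniform affine quantization under two range strategies.

def quantize(x, lo, hi, bits=8):
    """Quantize x uniformly over [lo, hi], then dequantize back to float."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = round((min(max(x, lo), hi) - lo) / scale)  # clip, then round
    return q * scale + lo

data = [0.01, 0.2, -0.15, 0.5, -0.3]

# Fixed, pre-decided range: wastes resolution (or clips) when the real
# distribution does not match it -- the accuracy-loss risk noted above.
fixed = [quantize(x, -1.0, 1.0) for x in data]

# Range calibrated from observed values (what a representative dataset,
# or fine-tuning in QAT, effectively provides).
lo, hi = min(data), max(data)
calibrated = [quantize(x, lo, hi) for x in data]

err_fixed = sum((a - b) ** 2 for a, b in zip(data, fixed))
err_cal = sum((a - b) ** 2 for a, b in zip(data, calibrated))
print(err_cal < err_fixed)
```

The calibrated range covers only the span the data actually uses, so its quantization step is finer and its squared error smaller on this toy input.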
Quantization-aware training
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
• forward: FakeQuant
• backward: FP32
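The forward-FakeQuant / backward-FP32 split can be sketched in plain Python (our illustration; `fake_quant` and `fake_quant_grad` are made-up names, not the paper's API):

```python
# Fake quantization: the forward pass simulates INT-n rounding in float,
# while the backward pass keeps FP32 gradients (straight-through estimator).

def fake_quant(x, lo, hi, bits=8):
    """Forward: clip to the range, snap to the nearest of 2^bits levels."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    x = min(max(x, lo), hi)
    return round((x - lo) / scale) * scale + lo

def fake_quant_grad(upstream, x, lo, hi):
    """Backward: round() has zero gradient almost everywhere, so training
    pretends it is the identity inside [lo, hi] and blocks gradients
    outside the clipped range."""
    return upstream if lo <= x <= hi else 0.0

print(fake_quant(0.4999, -1.0, 1.0))            # snapped to an INT8 level
print(fake_quant_grad(1.0, 0.4999, -1.0, 1.0))  # FP32 gradient passes through
print(fake_quant_grad(1.0, 2.0, -1.0, 1.0))     # clipped region: no gradient
```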
Quantization-aware training
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
Weight quantization + activation quantization turn a Linear layer from
(weight FP32, bias FP32, x FP32, y FP32) into
(weight INT8, bias FP32, x INT8, y INT8)
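The resulting integer-arithmetic inference can be sketched as below, in the spirit of arXiv:1712.05877's scheme where a real value r is represented as r = S·(q − Z) with scale S and zero-point Z. A pure-Python toy; all scales, zero-points, and names are illustrative, and the bias is kept in higher precision as in the slide above:

```python
# Integer-only dot product for a tiny Linear layer: uint8 tensors,
# integer accumulation, then requantization to the output scale.

def quantize(r, scale, zero_point):
    return min(max(round(r / scale) + zero_point, 0), 255)  # uint8

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

# Per-tensor quantization parameters (would come from calibration / QAT).
sx, zx = 0.02, 128   # activation x
sw, zw = 0.01, 128   # weight
sy, zy = 0.05, 128   # output y

x = [0.1, -0.2, 0.3]
w = [0.5, 0.25, -0.5]
bias = 0.05

qx = [quantize(v, sx, zx) for v in x]
qw = [quantize(v, sw, zw) for v in w]

# Integer accumulation: sums of (qx - zx) * (qw - zw) fit in int32.
acc = sum((a - zx) * (b - zw) for a, b in zip(qx, qw))

# Requantize. On-device the multiplier sx*sw/sy is a fixed-point
# multiply-and-shift; plain float is used here for clarity.
qy = round(acc * (sx * sw) / sy + bias / sy) + zy

y_float = sum(a * b for a, b in zip(x, w)) + bias
print(dequantize(qy, sy, zy), y_float)  # close, up to quantization error
```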
Quantization with TensorFlow
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
• Conversion via TFLiteConverter
• The whitepaper describes 3 approaches (a, b, d)
• The TensorFlow 2.x documentation describes 4 (a, b, c, d)
• Approaches
  1. Post Training Quantization
    a. Weight quantization: no dataset required
    b. Full integer quantization: ranges calibrated with a dataset
    c. Float16 quantization: half precision
  2. Quantization-Aware Training
    d. weight + activation: fine-tuning with a dataset
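The post-training approaches map onto TFLiteConverter settings roughly as follows. This is a configuration sketch against the TF 2.x API, not runnable as-is: it assumes you already have a trained Keras `model` and a `representative_dataset` generator yielding sample inputs.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# a. Weight quantization: no dataset required
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# b. Full integer quantization: additionally supply a representative
#    dataset so activation ranges can be calibrated
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

# c. Float16 quantization: keep Optimize.DEFAULT and allow fp16 instead
# converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
```

Approach (d) additionally runs quantization-aware fine-tuning before conversion, via the separate tensorflow-model-optimization toolkit.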
Analytical Clipping for Integer Quantization (ACIQ) + per-channel bit allocation
Post training 4-bit quantization of convolutional networks for rapid-deployment
https://arxiv.org/abs/1810.05723
• Clipping ranges determined analytically, post training
• ResNet, post training: weight 4-bit, activation 4-bit
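ACIQ's core idea, choosing the clipping threshold that minimizes expected MSE under an assumed prior, can be sketched numerically (a pure-Python illustration; the MSE expression follows our reading of arXiv:1810.05723 for a Laplace(0, b) prior, and a brute-force scan stands in for the paper's closed-form solution):

```python
import math

def expected_mse(alpha, b, bits):
    """Expected error of clipping at alpha + uniform quantization,
    for values drawn from Laplace(0, b)."""
    clip_noise = 2 * b * b * math.exp(-alpha / b)   # tail mass beyond alpha
    quant_noise = alpha ** 2 / (3 * 4 ** bits)      # uniform rounding noise
    return clip_noise + quant_noise

def aciq_clip(b, bits):
    """MSE-optimal clipping value, found by scanning alpha in (0, 20b]."""
    grid = [b * i / 100 for i in range(1, 2001)]
    return min(grid, key=lambda a: expected_mse(a, b, bits))

# For 4-bit quantization the optimum lands near 5*b -- far tighter than
# clipping at the tensor's max, which a naive min/max range would use.
print(aciq_clip(1.0, 4))
```

More bits make rounding noise cheaper relative to clipping noise, so the optimal clip widens as the bit width grows.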
TVM: TensorCore INT4/INT1 support
[CODEGEN] Support cuda tensorcore subbyte int data type in auto tensorcore
https://github.com/apache/incubator-tvm/pull/4546
• NVIDIA Turing generation
• Experimental support for TensorCore INT4/INT1
• However, cuBLAS (NVIDIA's matrix-math library) only goes down to INT8
• Has to be replaced with CUTLASS (a library for mixed INT/FP operations), version 1.1 or later
• TVM got GPU INT4 support out ahead of NVIDIA
• In other words, implementations are catching up to "INT4 works in theory"
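The storage trick behind any sub-byte type like INT4 is bit-packing: two signed 4-bit values share one byte. A pure-Python sketch of the idea (our illustration only, unrelated to TVM's actual codegen):

```python
def pack_int4(values):
    """Pack signed 4-bit values (-8..7) two per byte, low nibble first."""
    out = []
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_int4(packed, count):
    """Inverse of pack_int4: split nibbles and sign-extend them."""
    vals = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            vals.append(nibble - 16 if nibble >= 8 else nibble)
    return vals[:count]

data = [3, -2, 7, -8, 1]
assert unpack_int4(pack_int4(data), len(data)) == data
print(len(pack_int4(data)))  # 3 bytes for five 4-bit values
```

Halving storage again relative to INT8 is exactly why TensorCore INT4 matters once libraries can actually generate code for it.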