Purpose of reading
• Identify which quantization techniques are actually usable in real libraries
• weight, activation, gradient, error
• 1 bit (XNOR), 2 bit, 4 bit, 8 bit
• Cross-check against surrounding (non-paper) trends such as hardware
Training and Inference with Integers in Deep Neural Networks
https://arxiv.org/abs/1802.04680
Slide 4
Papers
• Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
  • https://arxiv.org/abs/1712.05877
• Quantizing deep convolutional networks for efficient inference: A whitepaper (the TensorFlow whitepaper)
  • https://arxiv.org/abs/1806.08342
• Post training 4-bit quantization of convolutional networks for rapid-deployment (NeurIPS 2019)
  • https://arxiv.org/abs/1810.05723
Slide 11
Post-training or Quantization-aware training
• Post-training quantization: quantization ranges are fixed up front, which can cause accuracy degradation (see the sketch below)
• Quantization-aware training: ranges are determined through fine-tuning
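A minimal numpy sketch (my illustration, not from the papers; the outlier value is made up) of why a range fixed from observed min/max degrades accuracy: a single outlier stretches the 8-bit grid, leaving only a few quantization levels for the typical values.

import numpy as np

# Hypothetical data: mostly small values plus one large outlier.
x = np.concatenate([np.random.normal(0.0, 1.0, 10_000), [50.0]])

# Post-training style: fix the range once from observed min/max.
lo, hi = x.min(), x.max()
scale = (hi - lo) / 255                        # 8-bit (256-level) grid
x_q = np.round((x - lo) / scale) * scale + lo  # quantize, then dequantize

print("MSE with min/max range:", np.mean((x - x_q) ** 2))
# Clipping the range (as fine-tuning or ACIQ effectively does) shrinks this error.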
Slide 12
Quantization-aware training
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
forward: FakeQuant (quantization simulated in FP32)
backward: FP32 (gradients pass straight through)
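A minimal PyTorch sketch of this forward/backward split, assuming per-tensor affine quantization with a given FP32 range [x_min, x_max]; this FakeQuant class is my illustration, not the paper's or TensorFlow's actual implementation.

import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, x_min, x_max, num_bits=8):
        # Simulate INT8 on the FP32 graph: quantize, then dequantize.
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x_max - x_min) / (qmax - qmin)
        zero_point = round(qmin - x_min / scale)
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: the backward pass stays plain FP32.
        return grad_output, None, None, None

# usage: y = FakeQuant.apply(x, -1.0, 1.0)  # x_min/x_max from observed stats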
Slide 14
Quantization-aware training
Linear
weight
FP32
bias
FP32
x
FP32
y
FP32
Linear
weight
INT8
bias
FP32
x
INT8
y
INT8
weight quantization
activation quantization
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
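A numpy sketch of the diagram's INT8 Linear layer under the paper's affine scheme r = s(q - z); the function names, scales s_* and zero points z_* are illustrative, not a library API. The int32 accumulator is rescaled by s_x * s_w, the FP32 bias is added, and the output is requantized to INT8.

import numpy as np

def quantize(r, s, z, qmin=-128, qmax=127):
    # Affine quantization: r ≈ s * (q - z)
    return np.clip(np.round(r / s) + z, qmin, qmax).astype(np.int8)

def quantized_linear(x_q, w_q, b_fp32, s_x, z_x, s_w, z_w, s_y, z_y):
    # Integer matmul with int32 accumulation: x_q is (N, in), w_q is (out, in).
    acc = (x_q.astype(np.int32) - z_x) @ (w_q.astype(np.int32) - z_w).T
    y = s_x * s_w * acc + b_fp32   # back to real values, FP32 bias added
    return quantize(y, s_y, z_y)   # requantize the activation to INT8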
Slide 15
Post-training vs Quantization-aware training
• Accuracy (less degradation from the original model): Post < Aware
• Simplicity of the process: Post >>> Aware
  • Aware needs a dataset for fine-tuning
  • Aware needs FakeQuant layers inserted into the graph
Slide 16
Quantization in TensorFlow
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
• Conversion via TFLiteConverter
• The whitepaper covers three approaches (a, b, d)
• The TensorFlow 2.x documentation covers four (a, b, c, d)
• Approaches (see the sketch below)
1. Post Training Quantization
a. Weight Quantization: no dataset required
b. Full integer quantization: calibrates value distributions with a dataset
c. Float16 quantization: half precision
2. Quantization-Aware Training
d. weight + activation: fine-tuning with a dataset
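A sketch of approaches (a)-(c) with the TF 2.x TFLiteConverter API; "model_dir" and the random representative dataset are placeholders for a real SavedModel and real samples.

import tensorflow as tf

# (a) Weight quantization: no dataset needed.
converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
weight_quant_model = converter.convert()

# (b) Full integer quantization: a representative dataset
#     calibrates the activation ranges.
def representative_dataset():
    for _ in range(100):
        yield [tf.random.normal([1, 224, 224, 3])]

converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
full_int_model = converter.convert()

# (c) Float16 quantization: halve the precision of the weights.
converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()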
Slide 18
Quantization in PyTorch
• Experimental support since PyTorch 1.3
• Performance (FP32 -> INT8)
1. 1/4 the model size
2. 1/4 the memory bandwidth requirements
3. 2-4x faster (with hardware support for INT8 computation)
• Approaches (see the sketch below)
1. Dynamic Quantization: weights only, no dataset required
2. Post Training Quantization: weight + activation, calibrates distributions with a dataset
3. Quantization-Aware Training: weight + activation, fine-tuning with a dataset
https://pytorch.org/blog/pytorch-1-dot-3-adds-mobile-privacy-quantization-and-named-tensors/#quantization-experimental
LSTMs are the target (need to check the paper)
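A minimal sketch of approach 1 with PyTorch's dynamic quantization API (available since 1.3); the LSTM here is a stand-in for a real model.

import torch

# Dynamic quantization: weights are stored as INT8 up front,
# activations are quantized on the fly, so no calibration dataset is needed.
model = torch.nn.LSTM(input_size=32, hidden_size=64)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.qint8
)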
Slide 19
Can the bit width go below INT8?
• Papers exist
• Post training 4-bit quantization of convolutional networks for rapid-deployment
• https://arxiv.org/abs/1810.05723
• Whether it actually gets faster depends on the processor and framework in use
• For example, TFLite parallelizes INT8 arithmetic with ARM NEON SIMD instructions
• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/kernels/internal/optimized
Slide 20
Analytical Clipping for Integer Quantization (ACIQ)
+ Per-channel bit allocation
Post training 4-bit quantization of convolutional networks for rapid-deployment
https://arxiv.org/abs/1810.05723
Ranges are determined analytically at post-training time (see the sketch below)
ResNet, post-training quantization:
• weight 4 bit
• activation 4 bit
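A numpy sketch of the ACIQ idea, assuming the tensor is roughly Laplace-distributed; the clipping constants are the paper's optimal values (alpha* = 2.83b / 3.89b / 5.03b for 2/3/4 bits), and the function names are illustrative, not the authors' code.

import numpy as np

# Optimal clip value alpha* = c * b for a Laplace(b) distribution
# (constants from the ACIQ paper, keyed by bit width).
LAPLACE_CLIP = {2: 2.83, 3: 3.89, 4: 5.03}

def aciq_clip_range(x, num_bits):
    b = np.mean(np.abs(x - np.mean(x)))  # ML estimate of the Laplace scale b
    alpha = LAPLACE_CLIP[num_bits] * b
    return -alpha, alpha

def quantize_clipped(x, num_bits=4):
    lo, hi = aciq_clip_range(x, num_bits)
    scale = (hi - lo) / (2 ** num_bits - 1)
    q = np.round((np.clip(x, lo, hi) - lo) / scale)
    return q * scale + lo  # dequantized values on the num_bits-wide grid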
Slide 23
Quantization for TVM
Quantization for TVM (TVM Conference, Dec 12th 2018)
https://sampl.cs.washington.edu/tvmconf/slides/11-Ziheng-Jiang.pdf
Aimed at FPGAs?
Slide 24
TVM TensorCore INT4/INT1 support
[CODEGEN] Support cuda tensorcore subbyte int data type in auto tensorcore
https://github.com/apache/incubator-tvm/pull/4546
• NVIDIA Turing generation
• Experimental support for TensorCore INT4/INT1
• However, cuBLAS (the matrix computation library) only goes down to INT8
• It has to be replaced with CUTLASS (a library for mixed INT/FP computation) 1.1 or later
• TVM has added GPU INT4 support ahead of NVIDIA's own stack
• In other words, implementations are catching up with "INT4 works in theory"