Slide 1

Model Acceleration from an Architecture Perspective, 2019
Yusuke Uchida, AI System Department, DeNA Co., Ltd.

Slide 2

About Me
• Yusuke Uchida (Deputy General Manager, AI System Department, DeNA Co., Ltd.)
• Until 2017: research on image recognition and retrieval at a telecom carrier's research lab
• 2016: obtained a Ph.D. in information science and technology as a working student
• 2017–: joined DeNA mid-career; R&D on computer vision technologies centered on deep learning
Twitter: https://twitter.com/yu4u
GitHub: https://github.com/yu4u
Qiita: https://qiita.com/yu4u
SlideShare: https://www.slideshare.net/ren4yu
medium: https://medium.com/@yu4u

Slide 3

Scope
• This talk mainly covers methods that satisfy the following conditions:
• Can be realized without depending on specific hardware
• Target convolutional neural networks (CNNs)
• Target speeding up inference

Slide 4

What do we mean by "speeding up"?
• Reducing the number of model parameters
• Reducing FLOPs (MACs)
• Reducing model file size
• Reducing inference time
• Reducing training time
These are subtly different, so be aware of which one matters when you apply a technique, and of which one is actually improved when you read a paper.

Slide 5

FLOPs ≠ processing speed
• Only the conv part of the actual runtime is what FLOPs capture
N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design," in Proc. of ECCV, 2018.

Slide 6

Model acceleration
• Factorization of convolutions
• Pruning
• Neural Architecture Search (NAS)
• Early termination, dynamic computation graphs
• Distillation
• Quantization

Slide 7

Factorization of Convolutions

Slide 8

Computational cost of a convolutional layer
• Input feature map size: H x W x N
• Convolution kernel: K x K x N x M, denoted convKxK, M (e.g. conv3x3, 64)
• Output feature map size: H x W x M
• Cost of the convolution: H・W・N・K²・M (ignoring the bias term)
(Figure: input feature map (H x W x N), M convolution kernels (K x K x N), output feature map (H x W x M))
The cost of a convolutional layer is therefore proportional to
• the image/feature map size (HW)
• the numbers of input and output channels (NM)
• the kernel size (K²)
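As a quick sanity check, the cost formula can be written as a tiny helper (a minimal sketch; the function name and the example numbers are mine, not from the slides):

    def conv_flops(h, w, n, m, k):
        # Multiply-accumulate count of a single KxK convolution layer with
        # an HxW output map, N input channels and M output channels (bias ignored).
        return h * w * n * k * k * m

    # e.g. conv3x3, 64 -> 64 channels on a 56x56 map:
    print(conv_flops(56, 56, 64, 64, 3))  # 115605504, i.e. ~1.16e8 MACs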

Slide 9

Spatial factorization
• Factorize a large convolution kernel into smaller ones
• For example, factorize a 5x5 convolution into two 3x3 convolutions
• Both have the same receptive field size, but factorization reduces the cost in the ratio 25:18
• Inception-v2 [4] factorizes the first 7x7 convolution into three 3x3 convolutions
• This is also used in later implementations such as SENet and ShuffleNetV2 [18]
(Figure: feature map processed by conv5x5 vs. conv3x3 + conv3x3)
[4] C. Szegedy, et al., "Rethinking the Inception Architecture for Computer Vision," in Proc. of CVPR, 2016.
[18] T. He, et al., "Bag of Tricks for Image Classification with Convolutional Neural Networks," in Proc. of CVPR, 2019.
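The 25:18 ratio follows directly from the cost formula on the previous slide (spatial size and channel counts held fixed): a single 5x5 convolution costs $H \cdot W \cdot N \cdot 5^2 \cdot M = 25\,HWNM$, while two stacked 3x3 convolutions cost $2 \cdot H \cdot W \cdot N \cdot 3^2 \cdot M = 18\,HWNM$.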

Slide 10

Spatial factorization
• An nxn convolution can also be factorized into a 1xn and an nx1 convolution
[4] C. Szegedy, et al., "Rethinking the Inception Architecture for Computer Vision," in Proc. of CVPR, 2016.

Slide 11

SqueezeNet
• Strategy:
• Use 1x1 filters in place of 3x3 filters
• Reduce the number of channels fed into the 3x3 convolutions (dimensionality reduction with 1x1)
(Figure: Fire module — squeeze layer conv1x1 (s1x1) followed by an expand layer of conv1x1 (e1x1) and conv3x3 (e3x3), concatenated)
F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," in arXiv:1602.07360, 2016.

Slide 12

Factorization into spatial and channel directions (separable conv)
• Perform the spatial and the channel convolution independently
• Depthwise convolution (spatial direction)
• Convolves each channel of the feature map separately
• Cost: H・W・N・K²・M with M=N → H・W・K²・N
• Pointwise convolution (channel direction)
• A 1x1 convolution
• Cost: H・W・N・K²・M with K=1 → H・W・N・M
• Depthwise + pointwise (separable)
• Cost: H・W・N・(K² + M) ≈ H・W・N・M (since M >> K²)
• A large reduction from H・W・N・K²・M
(Figure: standard, depthwise, and pointwise convolutions)
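A minimal PyTorch-style sketch of a depthwise separable convolution (module and argument names are mine; the normalization/activation layers that most implementations insert between the two convolutions are omitted):

    import torch.nn as nn

    class SeparableConv2d(nn.Module):
        # Depthwise KxK conv (groups = #channels) followed by a pointwise 1x1 conv.
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                       groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

Relative to a standard KxK convolution, the cost drops by a factor of roughly K²・M / (K² + M), i.e. close to K² when M is large.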

Slide 13

Xception [6]
• A model that makes heavy use of separable convs
[6] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. of CVPR, 2017.

Slide 14

MobileNet [7]
• Makes heavy use of depthwise/pointwise convs
• Improved versions also exist: MobileNetV2 [13] and V3 [20]
(Figure: a standard convolution vs. one MobileNet building block)
[7] A. Howard, et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," in arXiv:1704.04861, 2017.
[13] M. Sandler, et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," in Proc. of CVPR, 2018.
[20] A. Howard, et al., "Searching for MobileNetV3," in Proc. of ICCV, 2019.

Slide 15

MobileNetV1 vs. V2
(Figure: MobileNetV1 block = depthwise conv → conv1x1; MobileNetV2 block = conv1x1 → depthwise conv → conv1x1, annotated with spatial/channel directions)
• V2 adopts a bottleneck structure, which relatively reduces the cost of the conv1x1 layers
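A rough PyTorch-style sketch of the MobileNetV2-style bottleneck block (names and the fixed expansion factor are my own; stride handling and the residual connection are omitted for brevity):

    import torch.nn as nn

    class InvertedBottleneck(nn.Module):
        # 1x1 expand -> 3x3 depthwise -> 1x1 project (linear bottleneck: no final activation).
        def __init__(self, in_ch, out_ch, expand=6):
            super().__init__()
            mid = in_ch * expand
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, mid, 1, bias=False),
                nn.BatchNorm2d(mid),
                nn.ReLU6(inplace=True),
                nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
                nn.BatchNorm2d(mid),
                nn.ReLU6(inplace=True),
                nn.Conv2d(mid, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )

        def forward(self, x):
            return self.block(x)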

Slide 16

MnasNet
• An architecture search method (described later)
• Adds an SE module to the mobile inverted bottleneck (MBConv)
• MBConv3 (k5x5) → the bottleneck expands the channels 3x and the depthwise kernel is 5x5
M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "MnasNet: Platform-Aware Neural Architecture Search for Mobile," in Proc. of CVPR, 2019.

Slide 17

MobileNetV3
• Optimized starting from MnasNet
• Larger SE modules (?) ← though both seem to use a /4 reduction...?
• Uses (h-)swish and optimizes the implementation
• Pruning with NetAdapt (described later)
• Compaction of the network's last stage
• The MBConv block with swish is also adopted as the basic module of EfficientNet, which has been hugely successful on Kaggle

Slide 18

EfficientNet
• Given a base network, finds the optimal allocation of increases in depth, width, and resolution when building a larger network from it
• The allocation is determined on EfficientNet-B0 (roughly MnasNet), and larger models then scale all three exponentially in the same way
M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," in Proc. of ICML, 2019.
(Recall that the cost of a convolutional layer is proportional to the feature map size (HW), the channel counts (NM), and the kernel size (K²).)
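For reference, the compound scaling rule as given in the paper: with a user-chosen coefficient φ, depth, width, and resolution are scaled as

    d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi},
    \qquad \text{s.t.}\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\ \ \alpha, \beta, \gamma \geq 1

Since FLOPs grow roughly with d・w²・r², each increment of φ about doubles the cost; α, β, γ are found once by a small grid search on the base network (EfficientNet-B0).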

Slide 19

ShuffleNet [8]
• Replaces the conv1x1, the bottleneck of MobileNet, with group conv1x1 + channel shuffle
• Group conv: split the input feature maps into G groups and convolve each group independently (cost: H・W・N・K²・M → H・W・N・K²・M / G)
• Channel shuffle: permute the channel order; can be implemented with a reshape + transpose
(Figure: gconv1x1 → channel shuffle → depthwise conv → gconv1x1, annotated with spatial/channel directions)
[8] X. Zhang, et al., "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," in arXiv:1707.01083, 2017.
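The channel shuffle operation really is just a reshape + transpose; a minimal PyTorch-style sketch (the function name is mine):

    import torch

    def channel_shuffle(x, groups):
        # x: (N, C, H, W). Split C into `groups`, swap the two group axes,
        # and flatten back so channels from different groups get interleaved.
        n, c, h, w = x.size()
        x = x.view(n, groups, c // groups, h, w)
        x = x.transpose(1, 2).contiguous()
        return x.view(n, c, h, w)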

Slide 20

ShuffleNet V2
• Argues that we should look at real speed on the target platform rather than FLOPs
• Proposes four guidelines for efficient network design:
1. Keep the input and output channel counts of conv1x1 equal to minimize memory access cost
2. Excessive group convolution increases memory access cost
3. Fragmenting modules too much reduces parallelism
4. The cost of element-wise operations (ReLU, add, etc.) is not negligible
• Their validity is shown experimentally with toy networks
N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design," in Proc. of ECCV, 2018.

Slide 21

ShuffleNet V2
• On top of these guidelines, a new architecture is proposed
N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design," in Proc. of ECCV, 2018.

Slide 22

ChannelNet [11]
• Performs one-dimensional convolutions along the channel direction
[11] H. Gao, Z. Wang, and S. Ji, "ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions," in Proc. of NIPS, 2018.

Slide 23

ShiftNet
• Builds its modules from a shift operation (0 FLOPs) that groups the channels and spatially shifts each group, combined with conv1x1
B. Wu, et al., "Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions," in arXiv:1711.08141, 2017.

Slide 24

OctConv (Octave Convolution) [19]

Slide 25

Other examples include:
G. Huang, S. Liu, L. Maaten, and K. Weinberger, "CondenseNet: An Efficient DenseNet using Learned Group Convolutions," in Proc. of CVPR, 2018.
T. Zhang, G. Qi, B. Xiao, and J. Wang, "Interleaved group convolutions for deep neural networks," in Proc. of ICCV, 2017.
G. Xie, J. Wang, T. Zhang, J. Lai, R. Hong, and G. Qi, "IGCV2: Interleaved Structured Sparse Convolutional Neural Networks," in Proc. of CVPR, 2018.
K. Sun, M. Li, D. Liu, and J. Wang, "IGCV3: Interleaved Low-Rank Group Convolutions for Efficient Deep Neural Networks," in BMVC, 2018.
J. Zhang, "Seesaw-Net: Convolution Neural Network With Uneven Group Convolution," in arXiv:1905.03672, 2019.

Slide 26

A cheat sheet of sorts
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d

Slide 27

Pruning

Slide 28

Pruning
• Reduces the number of parameters and the amount of computation by setting part of the weights of convolutional and fully connected layers to zero
• The typical flow is:
1. Train the network
2. Prune it (accuracy drops)
3. Retrain the network (accuracy recovers to some extent)

Slide 29

Unstructured vs. structured pruning
(Figure: the M output-channel filters of size K x K of a convolutional layer before pruning)
• Unstructured pruning: the computation vs. accuracy trade-off is excellent, but speed-ups require dedicated hardware
• Structured pruning (filter/channel pruning is the common form): the network can be rebuilt as one that simply has fewer channels, so it readily benefits from speed-ups

Slide 30

Optimal Brain Damage (OBD)
• Computes the importance of each weight from a diagonal approximation of the Hessian of the loss function
• Prunes the weights with low importance
• 60% of LeNet's parameters can be removed while maintaining accuracy

Slide 31

Optimal Brain Damage (OBD)
• Computes the importance of each weight from a diagonal approximation of the Hessian of the loss function
• Prunes the weights with low importance
• 60% of LeNet's parameters can be removed while maintaining accuracy
Y. LeCun, J. Denker, and S. Solla, "Optimal Brain Damage," in Proc. of NIPS, 1990.

Slide 32

Optimal Brain Damage (OBD)
• Computes the importance of each weight from a diagonal approximation of the Hessian of the loss function
• Prunes the weights with low importance
• 60% of LeNet's parameters can be removed while maintaining accuracy
• Accuracy is recovered by retraining after pruning
Y. LeCun, J. Denker, and S. Solla, "Optimal Brain Damage," in Proc. of NIPS, 1990.

Slide 33

Deep Compression [23, 25, 26]
• Unstructured pruning
• Trains with L2 regularization and sets weights with small absolute values to zero
• Dedicated hardware is required to actually run it fast [26]
[23] S. Han, et al., "Learning both Weights and Connections for Efficient Neural Networks," in Proc. of NIPS, 2015.
[25] S. Han, et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," in Proc. of ICLR, 2016.
[26] S. Han, et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. of ISCA, 2016.
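A minimal sketch of the magnitude-based unstructured pruning step (names and the default sparsity are mine; the actual method additionally retrains with the mask fixed and iterates):

    import torch

    def magnitude_prune(weight, sparsity=0.9):
        # Zero the `sparsity` fraction of entries with the smallest absolute value;
        # returns the pruned tensor and the binary keep-mask.
        k = max(1, int(weight.numel() * sparsity))
        threshold = weight.abs().flatten().kthvalue(k).values
        mask = (weight.abs() > threshold).float()
        return weight * mask, mask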

Slide 34

Pruning Filters for Efficient ConvNets [30]
• Structured pruning (channel-level pruning)
• For each layer, prunes the filters with the smallest sums of absolute weight values
• The pruning ratio of each layer is tuned by hand based on the layer's sensitivity to pruning
• The network is finetuned after pruning
[30] H. Li, et al., "Pruning Filters for Efficient ConvNets," in Proc. of ICLR, 2017.
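The ranking criterion is simply the per-filter L1 norm; a minimal sketch (the function name is mine):

    import torch

    def filters_by_l1(conv_weight):
        # conv_weight: (out_channels, in_channels, K, K).
        # Returns output-channel indices sorted by ascending L1 norm;
        # the first entries are the pruning candidates.
        l1 = conv_weight.abs().sum(dim=(1, 2, 3))
        return torch.argsort(l1)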

Slide 35

Network Slimming [33]
• Trains with an L1 penalty on the batch norm scale parameters γ
• After training, removes the channels with small γ and fine-tunes
(Batch normalization: each channel's input is normalized to zero mean and unit variance, then scaled and shifted with γ and β)
[33] Z. Liu, et al., "Learning Efficient Convolutional Networks through Network Slimming," in Proc. of ICCV, 2017.
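The sparsity-inducing term is just an L1 penalty on every BN scale parameter, added to the task loss; a minimal PyTorch-style sketch (the function name and coefficient value are mine):

    import torch.nn as nn

    def bn_gamma_l1(model, lam=1e-4):
        # L1 penalty on all BatchNorm scale parameters (gamma); channels whose
        # gamma is driven towards zero are the ones removed after training.
        penalty = sum(m.weight.abs().sum() for m in model.modules()
                      if isinstance(m, nn.BatchNorm2d))
        return lam * penalty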

Slide 36

Channel Pruning [34]
• Selects which channels of a feature map to remove so that the error of the next feature map is minimized
• The L0 selection problem is relaxed to a Lasso problem and solved
• The weights W are also adjusted by least squares
[34] Y. He, et al., "Channel Pruning for Accelerating Very Deep Neural Networks," in Proc. of ICCV, 2017.

Slide 37

ThiNet [35]
• Like the previous method, greedily removes the channels whose removal minimizes the error of the next feature map
• After removal, the convolution weights are adjusted to minimize the error → finetune
[35] J. Luo, et al., "ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression," in Proc. of ICCV, 2017.

Slide 38

AutoML for Model Compression and Acceleration (AMC) [41]
• Learns the optimal pruning ratio of each layer with reinforcement learning (off-policy actor-critic); the actual pruning uses existing methods
• The input is information about the target layer and the pruning results so far; the reward is -error rate x log(FLOPs) or log(#Params)
[41] Y. He, et al., "AMC - AutoML for Model Compression and Acceleration on Mobile Devices," in Proc. of ECCV, 2018.

Slide 39

NetAdapt
• Greedily prunes the layer that best satisfies the resource constraint defined at each step
• Resource usage is estimated with lookup tables (LUTs)
• Finetunes briefly at every step
• Once the final target resource budget is reached, finetunes for longer and stops
T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," in Proc. of ECCV, 2018.

Slide 40

Lottery Ticket Hypothesis (ICLR'19 Best Paper) [44]
• The hypothesis that a NN contains "winning" combinations of a sub-network structure and initial values, and that once you draw such a ticket, training becomes efficient
• Such structures and initial values could be found via unstructured pruning
https://www.slideshare.net/YosukeShinya/the-lottery-ticket-hypothesis-finding-small-trainable-neural-networks
[44] J. Frankle and M. Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks," in Proc. of ICLR, 2019.

Slide 41

Network Pruning as Architecture Search [45]
• Claims that training the structurally pruned network from scratch gives results equal to or better than finetuning it
• In other words, pruning is not so much a search for important weights as a Neural Architecture Search (NAS) over how many channels to allocate to each layer
• Notes that the Lottery Ticket Hypothesis experiments are unstructured, use only low learning rates, and cover only small networks
[45] Z. Liu, et al., "Rethinking the Value of Network Pruning," in Proc. of ICLR, 2019.

Slide 42

Slimmable Neural Networks*
• Trains a single model that can be run at multiple computation budgets (and accuracies)
• Incremental training does not reach good accuracy
• Naive joint training fails because the BN statistics differ across widths → give each switchable width its own BN layers!
• There are extensions to models whose width varies more continuously** and to greedy pruning on top of them (repeatedly removing the layer whose removal hurts accuracy the least)***
* J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, "Slimmable Neural Networks," in Proc. of ICLR, 2019.
** J. Yu and T. Huang, "Universally Slimmable Networks and Improved Training Techniques," in arXiv:1903.05134, 2019.
*** J. Yu and T. Huang, "Network Slimming by Slimmable Networks: Towards One-Shot Architecture Search for Channel Numbers," in arXiv:1903.11728, 2019.

Slide 43

MetaPruning
• Trains a PruningNet that outputs the weights of the pruned network
• The input to each block is the network encoding vector: random pruning ratios of the preceding and the target layers
• (Feeding in everything sounds better, but according to the authors it did not help)
• Can be trained end-to-end!
• Once training is done, models with a good accuracy vs. speed trade-off are searched for (any search method works); here a genetic algorithm (GA) is used
Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, T. Cheng, and J. Sun, "MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning," in Proc. of ICCV, 2019.

Slide 44

Neural Architecture Search (NAS)

Slide 45

Neural Architecture Search (NAS)
• Methods that automatically design NN architectures
• Roughly categorized by search space, search strategy, and accuracy estimation strategy
• Search space: global, cell-based
• Search strategy: reinforcement learning, evolutionary algorithms, gradient-based, random
• Accuracy estimation: full training, partial training, weight sharing, pruning of candidates during the search
T. Elsken, J. Metzen, and F. Hutter, "Neural Architecture Search: A Survey," in JMLR, 2019.
M. Wistuba, A. Rawat, and T. Pedapati, "A Survey on Neural Architecture Search," in arXiv:1905.01392, 2019.
https://github.com/D-X-Y/awesome-NAS

Slide 46

NAS with Reinforcement Learning
• Search space: global; search strategy: REINFORCE
• An RNN controller generates the network architecture
• It outputs the parameters of each convolutional layer and whether skip connections exist
• Each generated network is trained, and its accuracy is used as the reward
B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in Proc. of ICLR, 2017.

Slide 47

NAS with Reinforcement Learning
• The result of 800 GPUs running for 28 days
B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in Proc. of ICLR, 2017.

Slide 48

NASNet [52]
• Search space: cell; search strategy: reinforcement learning (Proximal Policy Optimization)
• Uses domain knowledge for the global design and automatically designs only the cells that compose it → drastically reduces the search space
• The network is a stack of N normal cells followed by a reduction cell
• The reduction cell first downsamples the feature map with strided ops
• The number of channels is doubled after each reduction cell
[52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. of CVPR, 2018.

Slide 49

How the NASNet controller works
1. Select two hidden states*1
2. Select the ops applied to them*2
3. Select the op that combines them (add or concat); the result becomes a new hidden state
*1 Hidden states: the green blocks plus h_i and h_{i-1}
*2 Candidate ops applied to hidden states
[52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. of CVPR, 2018.

Slide 50

How the NASNet controller works
1. Select two hidden states*1
2. Select the ops applied to them*2
3. Select the op that combines them (add or concat); the result becomes a new hidden state
*1 Hidden states: the green blocks plus h_i and h_{i-1}
*2 Candidate ops applied to hidden states
[52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. of CVPR, 2018.

Slide 51

How the NASNet controller works
1. Select two hidden states*1
2. Select the ops applied to them*2
3. Select the op that combines them (add or concat); the result becomes a new hidden state
*1 Hidden states: the green blocks plus h_i and h_{i-1}
*2 Candidate ops applied to hidden states
(Figure: sep 3x3 and avg 3x3 are selected as the ops)
[52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. of CVPR, 2018.

Slide 52

How the NASNet controller works
1. Select two hidden states*1
2. Select the ops applied to them*2
3. Select the op that combines them (add or concat); the result becomes a new hidden state
*1 Hidden states: the green blocks plus h_i and h_{i-1}
*2 Candidate ops applied to hidden states
(Figure: the outputs of sep 3x3 and avg 3x3 are combined by concat)
[52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. of CVPR, 2018.

Slide 53

ENAS [54]
• Search space: cell; search strategy: reinforcement learning (REINFORCE)
• Jointly trains an RNN controller that outputs cell structures and a single huge computation graph (network) that contains every network the controller can output as a subgraph → the generated networks no longer need to be trained individually (1 GPU for 0.45 days!)
• Single shot, weight sharing
• See the excellent slides* for details
[54] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient Neural Architecture Search via Parameter Sharing," in Proc. of ICML, 2018.
* https://www.slideshare.net/tkatojp/efficient-neural-architecture-search-via-parameters-sharing-icml2018

Slide 54

Training ENAS
• The controller parameters θ and the parameters w of the huge network are trained alternately
• Training w: fix θ, sample a subgraph, run forward-backward through it, and update w
• Training θ: fix w, sample a subgraph, measure its accuracy on the validation data as the reward, and update θ with REINFORCE

Slide 55

DARTS [57]
• Search space: cell; search strategy: gradient-based
• By expressing the choice of connections and ops with a softmax, the architecture search itself can be done with forward-backward passes
• Uses shared parameters like ENAS; w and the architecture are optimized alternately
[57] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable Architecture Search," in Proc. of ICLR, 2019.
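Concretely, each edge (i, j) of the cell computes a softmax-weighted mixture of all candidate operations, so the architecture parameters α receive gradients just like ordinary weights:

    \bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}}
        \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x)

After the search, each edge keeps only the operation with the largest α.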

Slide 56

FBNet [61]
• Gradient-based, like DARTS
• Keeps the measured on-device latency of each op in a lookup table
• Applies a loss that takes the latency into account (a cross-entropy term combined with a latency term)
• Each block can have a different structure
[61] B. Wu, et al., "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search," in Proc. of CVPR, 2019.

Slide 57

Random-search approaches
• Weight sharing + random search (ASHA) performs well*
• Asynchronous Successive Halving (ASHA): train many models in parallel, keeping only the promising ones and pruning the rest
• Available in Optuna!**
• Using the graphs produced by random DAG generators as the search space works surprisingly well***
* L. Li and A. Talwalkar, "Random search and reproducibility for neural architecture search," in arXiv:1902.07638, 2019.
** https://www.slideshare.net/shotarosano5/automl-in-neurips-2018
*** S. Xie, A. Kirillov, R. Girshick, and K. He, "Exploring Randomly Wired Neural Networks for Image Recognition," in arXiv:1904.01569, 2019.

Slide 58

See also:
[58] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware," in Proc. of ICLR, 2019.
[59] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "MnasNet: Platform-Aware Neural Architecture Search for Mobile," in Proc. of CVPR, 2019.
[60] X. Dai, et al., "ChamNet: Towards Efficient Network Design through Platform-Aware Model Adaptation," in Proc. of CVPR, 2019.
[62] D. Stamoulis, et al., "Single-Path NAS: Device-Aware Efficient ConvNet Design," in Proc. of ICMLW, 2019.

Slide 59

Early Termination, Dynamic Computation Graphs

Slide 60

Early termination
• Depending on the input, output a result partway through the network and skip the remaining computation (early termination)
• Depending on the input, change the network structure dynamically (dynamic computation graphs)
• These reduce the average processing time

Slide 61

BranchyNet [65]
• Adds intermediate exit (output) layers to the network
• During training, the losses of all exits are combined with appropriate weights
• At inference time, the network exits at a branch when the entropy of its softmax is below a threshold
[65] S. Teerapittayanon, et al., "BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks," in Proc. of ICPR, 2016.
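A minimal sketch of the exit criterion (the function name and threshold value are mine): a branch exits when its softmax entropy, used as a confidence proxy, is small enough.

    import torch
    import torch.nn.functional as F

    def should_exit(logits, threshold=0.5):
        # True for samples whose softmax entropy at this branch is below the
        # threshold, i.e. the branch is confident enough to stop here.
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
        return entropy < threshold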

Slide 62

Spatially Adaptive Computation Time (SACT) [66]
• ACT: each ResBlock outputs a halting score, and once the accumulated score exceeds 1, the remaining blocks are skipped (applying this per spatial position gives SACT)
• A penalty term on the amount of computation is added to the loss
[66] M. Figurnov, et al., "Spatially Adaptive Computation Time for Residual Networks," in Proc. of CVPR, 2017.

Slide 63

Runtime Neural Pruning [68]
• For each layer, an RNN that takes the feature maps computed so far as input decides which set of convolutional filters to use
• The RNN is trained with Q-learning, using the number of kept filters and (at the final layer) the original task loss as negative rewards
[68] J. Lin, et al., "Runtime Neural Pruning," in Proc. of NIPS, 2017.

Slide 64

BlockDrop [73]
• A policy network takes the image as input and outputs which ResBlocks to skip
• Only the kept ResBlocks are run in the forward pass
• The policy network is trained with a negative reward when recognition fails and a positive reward proportional to the skip ratio when it succeeds
[73] Z. Wu, et al., "BlockDrop: Dynamic Inference Paths in Residual Networks," in Proc. of CVPR, 2018.

Slide 65

Distillation

Slide 66

Distillation
• Use a large model or an ensemble of networks as the "teacher" and train a small "student" model
• Losses make the student mimic the teacher's outputs or intermediate features
1. Train an ensemble or a large model
2. Use the trained model to train a small model

Slide 67

Distilling the Knowledge in a Neural Network [77]
(Figure: a training image is fed to both the trained teacher model and the student model being trained)
• Soft targets are obtained by raising the temperature T of the softmax (normally T = 1)
• The student is trained using both the ground-truth labels (hard targets) and the teacher's outputs (soft targets)
[77] G. Hinton, et al., "Distilling the Knowledge in a Neural Network," in Proc. of NIPS Workshop, 2014.
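A minimal sketch of the combined loss (the weight alpha and temperature T are hyperparameters; the names are mine):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Hard-target cross-entropy plus soft-target KL divergence at temperature T.
        # The T*T factor keeps the soft-target gradient magnitude comparable.
        hard = F.cross_entropy(student_logits, labels)
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * T * T
        return alpha * hard + (1.0 - alpha) * soft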

Slide 68

FitNet [79]
• Trains a student that is deeper and thinner than the teacher
• Adds a regression loss so that the student's guided layer accurately mimics the output of the teacher's hint layer
[79] A. Romero, et al., "FitNets: Hints for Thin Deep Nets," in Proc. of ICLR, 2015.

Slide 69

A recent example (rough)
B. Heo, et al., "A Comprehensive Overhaul of Feature Distillation," in Proc. of ICCV, 2019.

Slide 70

Quantization

Slide 71

Quantization
• Quantizes the parameters and other tensors of a network to reduce model size and speed up training and inference
• What is quantized: weights, activations (feature maps), gradients, errors
• Quantization schemes: linear, log, nonlinear / scalar, vector, product quantization
• Bit widths: 1 bit (binary), ternary (-1, 0, 1), 8 bit, 16 bit, arbitrary bit widths
• Often requires dedicated hardware to get the benefit
• Half precision / mixed precision* is supported by general-purpose hardware and frameworks
* https://github.com/NVIDIA/apex
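For intuition, a minimal sketch of per-tensor linear (affine) 8-bit quantization, the simplest of the schemes listed above (function names are mine):

    import numpy as np

    def quantize_uint8(x):
        # x ≈ scale * (q - zero_point), with q stored as uint8.
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255.0 if hi > lo else 1.0
        zero_point = int(round(-lo / scale))
        q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)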

Slide 72

WAGE [96]
• Quantizes all of the weights (W), activations (A), gradients (G), and errors (E)
[96] S. Wu, et al., "Training and Inference with Integers in Deep Neural Networks," in Proc. of ICLR, 2018.

Slide 73

WAGE [96]
• Weights (W), activations (A), gradients (G), errors (E)
(Table annotation: binary)
[96] S. Wu, et al., "Training and Inference with Integers in Deep Neural Networks," in Proc. of ICLR, 2018.

Slide 74

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [97]
• Trains while simulating quantization so that inference runs mainly with uint8 arithmetic
• An official TensorFlow implementation exists*
[97] B. Jacob, et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," in Proc. of CVPR, 2018.
* https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md
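The "simulated quantization" idea amounts to a quantize-then-dequantize (fake quantization) applied to weights and activations during training; a rough sketch under that assumption (in real quantization-aware training, the non-differentiable rounding is handled with a straight-through estimator):

    import numpy as np

    def fake_quantize(x, num_bits=8):
        # Quantize and immediately dequantize so the forward pass sees the
        # quantization error while everything stays in floating point.
        qmax = 2 ** num_bits - 1
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / qmax if hi > lo else 1.0
        q = np.clip(np.round((x - lo) / scale), 0, qmax)
        return q * scale + lo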

Slide 75

Post-Training Integer Quantization
• Post-training quantization is also available
https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-post-training-integer-quantization-b4964a1ea9ba

Slide 76

EMS
"EMS: End-to-End Model Search for Network Architecture, Pruning and Quantization," ICLR'20 under review.

Slide 77

Summary

Slide 78

General-purpose acceleration techniques covered in this talk:
• Factorization of convolutions
• Pruning
• Neural Architecture Search (NAS)
• Early termination, dynamic computation graphs
• Distillation
• Quantization

Slide 79

The summary from two years ago

Slide 80

Summary
• NAS has become accessible to ordinary practitioners
• Single shot, weight sharing
• Optimizing real speed rather than FLOPs (mobile device-aware)
• The base modules (cells) are still hand-designed
• Ironically, earlier work auto-designed the cells instead (heavily wired cells turned out to be undesirable)
• The search still does not feel exhaustive (it feels like a greedy grid search)
• Module design, pruning, and NAS are merging into one
• Going forward:
• Not just reusing lightweight backbones, but architectures optimized for each task (these already exist, though)

Slide 81

A hundred selected papers (slightly dated)

Slide 82

Factorization of convolutions
[1] L. Sifre and S. Mallat, "Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination," in Proc. of CVPR, 2013.
[2] L. Sifre, "Rigid-motion Scattering for Image Classification," Ph.D. thesis, 2014.
[3] M. Lin, Q. Chen, and S. Yan, "Network in Network," in Proc. of ICLR, 2014.
[4] C. Szegedy, et al., "Rethinking the Inception Architecture for Computer Vision," in Proc. of CVPR, 2016.
[5] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," in arXiv:1602.07360, 2016.
[6] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. of CVPR, 2017.
[7] A. Howard, et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," in arXiv:1704.04861, 2017.
[8] X. Zhang, et al., "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," in arXiv:1707.01083, 2017.
[9] B. Wu, et al., "Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions," in arXiv:1711.08141, 2017.
[10] N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design," in Proc. of ECCV, 2018.
[11] H. Gao, Z. Wang, and S. Ji, "ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions," in Proc. of NIPS, 2018.
[12] G. Huang, S. Liu, L. Maaten, and K. Weinberger, "CondenseNet: An Efficient DenseNet using Learned Group Convolutions," in Proc. of CVPR, 2018.
[13] M. Sandler, et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," in Proc. of CVPR, 2018.
[14] G. Xie, J. Wang, T. Zhang, J. Lai, R. Hong, and G. Qi, "IGCV2: Interleaved Structured Sparse Convolutional Neural Networks," in Proc. of CVPR, 2018.

Slide 83

Factorization of convolutions
[15] T. Zhang, G. Qi, B. Xiao, and J. Wang, "Interleaved group convolutions for deep neural networks," in Proc. of ICCV, 2017.
[16] Z. Qin, Z. Zhang, X. Chen, and Y. Peng, "FD-MobileNet: Improved MobileNet with a Fast Downsampling Strategy," in Proc. of ICIP, 2018.
[17] K. Sun, M. Li, D. Liu, and J. Wang, "IGCV3: Interleaved Low-Rank Group Convolutions for Efficient Deep Neural Networks," in BMVC, 2018.
[18] T. He, et al., "Bag of Tricks for Image Classification with Convolutional Neural Networks," in Proc. of CVPR, 2019.
[19] Y. Chen, et al., "Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution," in arXiv:1904.05049, 2019.
[20] A. Howard, et al., "Searching for MobileNetV3," in arXiv:1905.02244, 2019.
[21] J. Zhang, "Seesaw-Net: Convolution Neural Network With Uneven Group Convolution," in arXiv:1905.03672, 2019.

Slide 84

Pruning
[22] Y. LeCun, J. Denker, and S. Solla, "Optimal Brain Damage," in Proc. of NIPS, 1990.
[23] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both Weights and Connections for Efficient Neural Networks," in Proc. of NIPS, 2015.
[24] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning Structured Sparsity in Deep Neural Networks," in Proc. of NIPS, 2016.
[25] S. Han, et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," in Proc. of ICLR, 2016.
[26] S. Han, J. Pool, J. Tran, and W. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. of ISCA, 2016.
[27] S. Anwar, K. Hwang, and W. Sung, "Structured Pruning of Deep Convolutional Neural Networks," in JETC, 2017.
[28] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The Power of Sparsity in Convolutional Neural Networks," in arXiv:1702.06257, 2017.
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, "Group Sparse Regularization for Deep Neural Networks," in Neurocomputing, 2017.
[30] H. Li, et al., "Pruning Filters for Efficient ConvNets," in Proc. of ICLR, 2017.
[31] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning Convolutional Neural Networks for Resource Efficient Inference," in Proc. of ICLR, 2017.
[32] D. Molchanov, A. Ashukha, and D. Vetrov, "Variational Dropout Sparsifies Deep Neural Networks," in Proc. of ICML, 2017.
[33] Z. Liu, et al., "Learning Efficient Convolutional Networks through Network Slimming," in Proc. of ICCV, 2017.
[34] Y. He, et al., "Channel Pruning for Accelerating Very Deep Neural Networks," in Proc. of ICCV, 2017.
[35] J. Luo, et al., "ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression," in Proc. of ICCV, 2017.
[36] C. Louizos, K. Ullrich, and M. Welling, "Bayesian Compression for Deep Learning," in Proc. of NIPS, 2017.

Slide 85

Pruning
[37] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov, "Structured Bayesian Pruning via Log-Normal Multiplicative Noise," in Proc. of NIPS, 2017.
[38] M. Zhu and S. Gupta, "To prune, or not to prune: exploring the efficacy of pruning for model compression," in Proc. of ICLRW, 2018.
[39] T. Yang, Y. Chen, and V. Sze, "Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning," in Proc. of CVPR, 2017.
[40] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks," in Proc. of IJCAI, 2018.
[41] Y. He, et al., "AMC - AutoML for Model Compression and Acceleration on Mobile Devices," in Proc. of ECCV, 2018.
[42] T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," in Proc. of ECCV, 2018.
[43] J. Luo and J. Wu, "AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference," in arXiv:1805.08941, 2018.
[44] J. Frankle and M. Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks," in Proc. of ICLR, 2019.
[45] Z. Liu, et al., "Rethinking the Value of Network Pruning," in Proc. of ICLR, 2019.
[46] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, "Slimmable Neural Networks," in Proc. of ICLR, 2019.
[47] S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang, and D. Doermann, "Towards Optimal Structured CNN Pruning via Generative Adversarial Learning," in Proc. of CVPR, 2019. (GAN-based)
[48] J. Yu and T. Huang, "Universally Slimmable Networks and Improved Training Techniques," in arXiv:1903.05134, 2019.
[49] J. Yu and T. Huang, "Network Slimming by Slimmable Networks: Towards One-Shot Architecture Search for Channel Numbers," in arXiv:1903.11728, 2019.
[50] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, T. Cheng, and J. Sun, "MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning," in arXiv:1903.10258, 2019.

Slide 86

Architecture search
[51] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in Proc. of ICLR, 2017.
[52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. of CVPR, 2018.
[53] C. Liu, et al., "Progressive Neural Architecture Search," in Proc. of ECCV, 2018.
[54] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient Neural Architecture Search via Parameter Sharing," in Proc. of ICML, 2018.
[55] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, "Hierarchical Representations for Efficient Architecture Search," in Proc. of ICLR, 2018.
[56] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized Evolution for Image Classifier Architecture Search," in Proc. of AAAI, 2019.
[57] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable Architecture Search," in Proc. of ICLR, 2019.
[58] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware," in Proc. of ICLR, 2019.
[59] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "MnasNet: Platform-Aware Neural Architecture Search for Mobile," in Proc. of CVPR, 2019.
[60] X. Dai, et al., "ChamNet: Towards Efficient Network Design through Platform-Aware Model Adaptation," in Proc. of CVPR, 2019.
[61] B. Wu, et al., "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search," in Proc. of CVPR, 2019.
[62] D. Stamoulis, et al., "Single-Path NAS: Device-Aware Efficient ConvNet Design," in Proc. of ICMLW, 2019.
[63] L. Li and A. Talwalkar, "Random search and reproducibility for neural architecture search," in arXiv:1902.07638, 2019.

Slide 87

Early termination, dynamic computation graphs
[64] Y. Guo, A. Yao, and Y. Chen, "Dynamic Network Surgery for Efficient DNNs," in Proc. of NIPS, 2016.
[65] S. Teerapittayanon, et al., "BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks," in Proc. of ICPR, 2016.
[66] M. Figurnov, et al., "Spatially Adaptive Computation Time for Residual Networks," in Proc. of CVPR, 2017.
[67] T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama, "Adaptive Neural Networks for Efficient Inference," in Proc. of ICML, 2017.
[68] J. Lin, et al., "Runtime Neural Pruning," in Proc. of NIPS, 2017.
[69] G. Huang, D. Chen, T. Li, F. Wu, L. Maaten, and K. Weinberger, "Multi-Scale Dense Networks for Resource Efficient Image Classification," in Proc. of ICLR, 2018.
[70] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. Gonzalez, "SkipNet: Learning Dynamic Routing in Convolutional Networks," in Proc. of ECCV, 2018.
[71] A. Veit and S. Belongie, "Convolutional Networks with Adaptive Inference Graphs," in Proc. of ECCV, 2018.
[72] L. Liu and J. Deng, "Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-Offs by Selective Execution," in Proc. of AAAI, 2018.
[73] Z. Wu, et al., "BlockDrop: Dynamic Inference Paths in Residual Networks," in Proc. of CVPR, 2018.
[74] R. Yu, et al., "NISP: Pruning Networks using Neuron Importance Score Propagation," in Proc. of CVPR, 2018.
[75] J. Kuen, X. Kong, Z. Lin, G. Wang, J. Yin, S. See, and Y. Tan, "Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks," in Proc. of CVPR, 2018.
[76] X. Gao, Y. Zhao, L. Dudziak, R. Mullins, and C. Xu, "Dynamic Channel Pruning: Feature Boosting and Suppression," in Proc. of ICLR, 2019.

Slide 88

Distillation
[77] G. Hinton, et al., "Distilling the Knowledge in a Neural Network," in Proc. of NIPS Workshop, 2014.
[78] J. Ba and R. Caruana, "Do Deep Nets Really Need to be Deep?," in Proc. of NIPS, 2014.
[79] A. Romero, et al., "FitNets: Hints for Thin Deep Nets," in Proc. of ICLR, 2015.
[80] T. Chen, I. Goodfellow, and J. Shlens, "Net2Net: Accelerating Learning via Knowledge Transfer," in Proc. of ICLR, 2016.
[81] G. Urban, et al., "Do Deep Convolutional Nets Really Need to be Deep and Convolutional?," in Proc. of ICLR, 2017.
[82] J. Yim, D. Joo, J. Bae, and J. Kim, "A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning," in Proc. of CVPR, 2017.
[83] A. Mishra and D. Marr, "Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy," in Proc. of ICLR, 2018.
[84] T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, "Born Again Neural Networks," in Proc. of ICML, 2018.
[85] Y. Zhang, T. Xiang, T. Hospedales, and H. Lu, "Deep Mutual Learning," in Proc. of CVPR, 2018.
[86] X. Lan, X. Zhu, and S. Gong, "Knowledge Distillation by On-the-Fly Native Ensemble," in Proc. of NIPS, 2018.
[87] W. Park, D. Kim, Y. Lu, and M. Cho, "Relational Knowledge Distillation," in Proc. of CVPR, 2019.

Slide 89

Quantization
[88] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training Deep Neural Networks with binary weights during propagations," in Proc. of NIPS, 2015.
[89] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized Neural Networks," in Proc. of NIPS, 2016.
[90] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," in Proc. of ECCV, 2016.
[91] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized Convolutional Neural Networks for Mobile Devices," in Proc. of CVPR, 2016.
[92] F. Li, B. Zhang, and B. Liu, "Ternary Weight Networks," in arXiv:1605.04711, 2016.
[93] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients," in arXiv:1606.06160, 2016.
[94] C. Zhu, S. Han, H. Mao, and W. Dally, "Trained Ternary Quantization," in Proc. of ICLR, 2017.
[95] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," in Proc. of ICLR, 2017.
[96] S. Wu, G. Li, F. Chen, and L. Shi, "Training and Inference with Integers in Deep Neural Networks," in Proc. of ICLR, 2018.
[97] B. Jacob, et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," in Proc. of CVPR, 2018.
[98] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng, "Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm," in Proc. of ECCV, 2018.
[99] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan, "Training Deep Neural Networks with 8-bit Floating Point Numbers," in Proc. of NIPS, 2018.
[100] G. Yang, et al., "SWALP: Stochastic Weight Averaging in Low-Precision Training," in Proc. of ICML, 2019.
