
Path-Level Network Transformation for Efficient Architecture Search (ICML2018読み会)

S.Shota
July 28, 2018


Slides presented at the ICML2018 reading group (https://connpass.com/event/92705/).

H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu, “Path-Level Network Transformation for Efficient Architecture Search,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 677–686.

http://proceedings.mlr.press/v80/cai18a.html



Transcript

  1. Shota Saito (Yokohama National University)
    July 28, 2018 | ICML2018 reading group
    Path-Level Network Transformation
    for Efficient Architecture Search
    H. Cai, J. Yang, W. Zhang, S. Han, Y. Yu | ICML 2018


  2. Self-introduction / Research Interests
    2
    • Name: Shota Saito
    • Yokohama National University, Graduate School of Environment and Information
    Sciences, Shirakawa Laboratory, M2 (second-year master's student)
    • Machine Learning (Deep Learning, Feature Selection)
    • Evolutionary Computation (Evolution Strategy)
    • ML × EC = Evolutionary Machine Learning
    [Figure: concept image of PEFS — each feature of the input vector is masked by a binary vector sampled from a Bernoulli distribution and fed to the model G(W, θ); the expected loss is used to update both the distribution and the model]
    Shota Saito, Shinichi Shirakawa, Youhei Akimoto: “Embedded Feature Selection Using Probabilistic Model-Based Optimization”,
    Student Workshop in Genetic and Evolutionary Computation Conference 2018 (GECCO 2018) , Kyoto, Japan, 15th-19th July (2018).
    Probabilistic model-based EC × Feature Selection

  3. Paper information / Why I chose this paper
    3
    • H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu,
    “Path-Level Network Transformation for
    Efficient Architecture Search,” in Proceedings of
    the 35th International Conference on Machine
    Learning, 2018, pp. 677–686.
    • Because it is research on neural network architecture search
    (especially reinforcement-learning-based methods)
    • It also serves as a survey of low-cost architecture search
    • e.g., ENAS [Pham et al., 2018] and DARTS [Liu et al., 2018]

  4. Overview
    4
    • Net2Net: a method that enlarges the architecture of an already-trained
    neural network (NN) and then retrains it
    • Architecture search with weight inheritance (parameter sharing)
    • NAS: a method that uses an RNN to output NN architectures and learns the
    architecture generator via reinforcement learning (REINFORCE)
    • In addition, the way Net2Net enlarges the architecture is refined
    Net2Net + Neural Architecture Search
    Building blocks of the proposed method
    [Chen et al., 2016] [Zoph & Le, 2017]

  5. Related Work and Background
    5

  6. Related Work and Background
    6
    1990–2000: Neuro-Evolution
    Optimize architectures and weights with evolutionary computation
    2010–2016: Bayesian Optimization
    Estimate the objective function with a Gaussian process and search according
    to an acquisition criterion
    2016–present: Reinforcement-learning-based methods
    Consider an agent that searches over architectures and update its policy with
    reinforcement learning
    Besides NAS, there is also a Q-learning-based method [Baker et al., 2017]

  7. • The NAS framework: Recurrent NN + REINFORCE
    Related Work and Background
    7
    Child Network 1
    (CNN)
    … Child Network N
    (CNN)
    Meta-Controller
    (RNN)
    (1) Sample multiple architectures according to the generation probability P
    learned by the RNN
    Reward 1 (Accuracy) … Reward N (Accuracy)
    (2) Train each child NN and compute its reward (validation accuracy)
    (3) Compute gradients based on the rewards
    (4) Update the controller's parameters using the gradients
    m: number of child networks, T: number of hyperparameters (architecture decisions), b: baseline
    $$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \bigl(R_k - b\bigr)\, \nabla_\theta \log P\bigl(a_t \mid a_{(t-1):1}; \theta\bigr)$$

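A minimal sketch of this REINFORCE estimate (my own illustration, not the authors' code); `log_prob_grads` and `rewards` are hypothetical containers holding, for each sampled child network, the per-decision score-function gradients and its validation-accuracy reward:

```python
# Sketch of the REINFORCE gradient estimate for the meta-controller.
# log_prob_grads[k][t] stands for grad_theta log P(a_t | a_(t-1):1; theta) of the
# k-th sampled child network; rewards[k] is its validation accuracy; baseline is b.
def reinforce_gradient(log_prob_grads, rewards, baseline):
    m = len(rewards)                          # number of sampled child networks
    grad = 0.0
    for k in range(m):
        advantage = rewards[k] - baseline     # (R_k - b): variance-reduced reward
        for g_t in log_prob_grads[k]:         # one term per decision t = 1..T
            grad = grad + advantage * g_t
    return grad / m                           # Monte-Carlo average over child networks
```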

  8. Net2Net that changes a layer's width: Net2WiderNet
    8
    • The Net2Net framework: architecture transformations that preserve the output
    [Figure: Net2WiderNet operator — original network x → y → z (weights a, b) and the widened network in which unit y is duplicated to y']
    (1) Randomly select the unit to copy
    (2) Copy its weights
    (3) Halve the outgoing weights (of the original unit and its copy)
    The output z is unchanged after the transformation
    (function-preserving transformation)
    (4) Train the transformed network

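A minimal numeric check of Net2WiderNet on two fully connected layers (a simplification of the convolutional case; my own illustration rather than the authors' code): duplicating a hidden unit and halving the outgoing weights leaves the output unchanged.

```python
import numpy as np

# z = b @ relu(a @ x); widen by duplicating hidden unit `unit`.
def net2wider(a, b, unit):
    a_new = np.vstack([a, a[unit:unit + 1, :]])   # copy the unit's incoming weights
    b_new = np.hstack([b, b[:, unit:unit + 1]])   # copy its outgoing weights
    b_new[:, unit] *= 0.5                         # halve the original outgoing weights
    b_new[:, -1] *= 0.5                           # halve the copy's outgoing weights
    return a_new, b_new

rng = np.random.default_rng(0)
a, b, x = rng.normal(size=(4, 3)), rng.normal(size=(2, 4)), rng.normal(size=3)
relu = lambda v: np.maximum(v, 0.0)
a2, b2 = net2wider(a, b, unit=1)
assert np.allclose(b @ relu(a @ x), b2 @ relu(a2 @ x))   # function preserved
```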

  9. Net2Net that changes the depth: Net2DeeperNet
    9
    [Figure: Net2DeeperNet operator — the original network x → y → z (weights a, b) and the deeper transformed network, in which a new identity-initialized layer is inserted]
    (1) Randomly select the layer to copy (where the new layer is inserted)
    (2) Initialize the new weights with an identity matrix
    The output z is unchanged after the transformation
    (3) Train the transformed network

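A minimal numeric check of Net2DeeperNet under the same simplification (fully connected layers with ReLU); `net2deeper` and the toy network are my own illustration, not the authors' code:

```python
import numpy as np

# Inserting a layer whose weights are the identity keeps the output unchanged,
# because relu(h) == h when h is already non-negative.
def net2deeper(hidden_dim):
    return np.eye(hidden_dim)      # identity-initialized weights of the new layer

rng = np.random.default_rng(0)
a, b, x = rng.normal(size=(4, 3)), rng.normal(size=(2, 4)), rng.normal(size=3)
relu = lambda v: np.maximum(v, 0.0)
h = relu(a @ x)
w_new = net2deeper(4)
assert np.allclose(b @ h, b @ relu(w_new @ h))   # function preserved
```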

  10. Efficient Architecture Search (EAS)
    10
    • An architecture search method proposed by the same authors at AAAI-18
    • Network transformations: layer insertion and increasing the number of units
    • In that paper, the search spaces are based on a CNN without skip connections
    and on DenseNet [Huang et al., 2017]
    • Reinforcement learning decides which network transformation operations to take
    Network transformation (Net2Net) + exploring the space with reinforcement learning
    Efficient Architecture Search [Cai et al., AAAI-2018]


  11. Efficient Architecture Search (EAS)
    11
    • Overview of EAS:
    [Figure from the EAS paper: "Overview of the RL based meta-controller in EAS, which consists of an encoder network for encoding the architecture and multiple separate actor networks (Net2Wider, Net2Deeper) for taking network transformation actions."]
    (1) Convert each layer's information into a low-dimensional feature with a fully
    connected NN (layer embedding)
    (2) Encode the whole architecture with a bi-directional LSTM
    (3) Apply a sigmoid to the outputs to decide whether to widen each layer
    (4) Use a decoder network to decide the structure of the layer to be inserted


  12. Method | Proposed Method
    12


  13. Path-Level Network Transformation
    13
    • Consider a multi-branch module that is functionally equivalent to a given layer
    (i.e., a structure that satisfies function preservation)
    • An ordinary convolution layer can be represented as follows:
    [Figure 1 of the paper: "Convolution layer and its equivalent multi-branch motifs." A convolution layer C(x) is equivalent to (i) a replication–add motif, in which x is replicated into two branches that share C and the outputs are added with weights 0.5 and 0.5, and (ii) a replication–concat motif, in which the branch outputs are concatenated]
    Equivalent representation via addition (copy the input, multiply each branch
    output by 0.5) and equivalent representation via concat

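A small numeric sketch of these equivalences (my own illustration, with a linear layer standing in for the convolution; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 5))       # stand-in for a conv layer: 5 input, 8 output channels
x = rng.normal(size=5)

y = C @ x                                         # original layer output
y_add = 0.5 * (C @ x) + 0.5 * (C @ x)             # replication + add motif
y_cat = np.concatenate([C[:4] @ x, C[4:] @ x])    # replication + concat motif
                                                  # (output filters split across branches)
assert np.allclose(y, y_add) and np.allclose(y, y_cat)
```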

  14. Path-Level Network Transformation
    14
    • The identity function (i.e., a layer that does nothing) can likewise be
    represented by a branched structure
    [Figure 2 of the paper: "Identity layer and its equivalent multi-branch motifs." The identity is equivalent to replicating x into two identity branches whose outputs are added with weights 0.5 and 0.5, and to splitting x channel-wise into x1, x2, passing each through an identity branch, and concatenating back to x = [x1, x2]]


  15. Path-Level Network Transformation
    15
    • Using the equivalent representations of the convolution layer and the identity
    function, the layer is branched repeatedly
    • Replacing an identity function on a branch with a convolution layer realizes
    Net2DeeperNet
    [Figure 3 of the paper: "An illustration of transforming a single layer to a tree-structured motif via path-level transformation operations, where we apply Net2DeeperNet operation to replace an identity mapping with a 3 × 3 depthwise-separable convolution in (c)." Panels (a)–(d) show a single layer C(x) progressively turned into a tree of Replication/Split, Add/Concat, Identity and Sep 3x3 edges]
    (An identity function is changed into a 3x3 depthwise-separable convolution.)


  16. Tree-Structured Architecture Space
    16
    • The architectures obtained by the transformations are represented as trees
    • The output N(x) of a node for an input feature map x, with m child nodes
    {N_i^c(·)} and m corresponding edges {E_i(·)}, is defined recursively from the
    outputs of its child nodes:
    $$z_i = \mathrm{allocation}(x, i), \qquad y_i = N_i^c(E_i(z_i)), \quad 1 \le i \le m, \qquad N(x) = \mathrm{merge}(y_1, \dots, y_m) \tag{1}$$
    allocation(x, i): the feature map allocated to the i-th child node
    E_i: the edge (primitive operation) along which the data is passed to the i-th
    child node, giving the output y_i
    merge: all child outputs are merged to form this node's output

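A rough sketch of Eq. (1) (my own reading, not the authors' code); `edges` and `children` are hypothetical callables standing in for the primitive operations and the child nodes:

```python
import numpy as np

def node_output(x, edges, children, allocation="replication", merge="add"):
    m = len(edges)
    if allocation == "replication":
        zs = [x for _ in range(m)]               # every child sees the full input
    else:                                        # "split": channel-wise split
        zs = np.array_split(x, m, axis=0)
    ys = [child(edge(z)) for edge, child, z in zip(edges, children, zs)]
    if merge == "add":
        return np.sum(ys, axis=0)                # element-wise addition
    return np.concatenate(ys, axis=0)            # channel-wise concatenation

# Example: leaf children are identities, edges are simple channel-wise scalings.
leaf = lambda y: y
out = node_output(np.ones(8), [lambda z: 2 * z, lambda z: 3 * z], [leaf, leaf],
                  allocation="split", merge="concat")   # -> [2,2,2,2,3,3,3,3]
```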

  17. Tree-Structured Architecture Space
    17
    • The operations used for allocation and merge are also selected
    • allocation: replication or channel-wise split
    • merge: addition (add) or concatenation
    • Existing layers and identity functions are replaced by a layer chosen from:
    • 1x1 convolution
    • Identity
    • 3x3, 5x5, or 7x7 depthwise-separable convolution
    • 3x3 average pooling
    • 3x3 max pooling
    Pooling layers have no parameters, so function preservation cannot be satisfied;
    the weights are adjusted via knowledge distillation (the extra cost is negligible)


  18. Architecture Search with Path-Level
    Operations
    18
    • Since the architecture is represented as a tree, a Tree-LSTM is adopted as the
    controller that generates architectures
    • The Tree-LSTM passes states from the child nodes up to the parent node
    (bottom-up)
    • It then passes states back down to each child node, taking the parent's and the
    other children's hidden states as input (top-down)
    [Figure 4 of the paper: calculation procedure of the bottom-up hidden states (children → parent) and the top-down hidden states (parent + siblings → child)]

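A rough, heavily simplified sketch of the two passes (my own illustration, not the paper's Tree-LSTM; `make_cell` replaces the LSTM cell with a single tanh layer so only the information flow is visible):

```python
import numpy as np

def make_cell(dim, rng):
    w = rng.normal(scale=0.1, size=(dim, dim))
    return lambda h: np.tanh(h @ w)              # stand-in for an LSTM cell

def bottom_up(node, cell):
    """Bottom-up state of a node = cell(sum of its children's bottom-up states)."""
    if not node["children"]:                     # leaf: start from its own embedding
        node["h_up"] = node["embedding"]
    else:
        node["h_up"] = cell(sum(bottom_up(c, cell) for c in node["children"]))
    return node["h_up"]

def top_down(node, cell, h_parent):
    """Propagate downward: each child sees the parent state and its siblings' states."""
    node["h_down"] = h_parent
    for child in node["children"]:
        siblings = sum((c["h_up"] for c in node["children"] if c is not child),
                       np.zeros_like(h_parent))
        top_down(child, cell, cell(h_parent + siblings))

rng = np.random.default_rng(0)
cell = make_cell(4, rng)
leaf = lambda: {"children": [], "embedding": rng.normal(size=4)}
root = {"children": [leaf(), leaf()], "embedding": None}
bottom_up(root, cell)
top_down(root, cell, np.zeros(4))    # the root has no parent: start from a zero state
```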

  19. Architecture Search with Path-Level
    Operations
    19
    • By applying a softmax to the Tree-LSTM's outputs, the types of nodes and
    operations are selected:
    1. For a parent node that has only a single (leaf) child, select a merge operation
    • merge: add, concatenation, or none (do nothing)
    2. For each leaf node, decide whether to expand it, i.e., insert a new leaf child
    connected via an identity mapping (which increases the depth)
    3. Replace identity edges with a convolution or pooling layer chosen from the set
    of primitive operations
    [Figure 4 of the paper: "Calculation procedure of bottom-up and top-down hidden states." Figure 5: illustration of the transformation decisions — (a) a node with a single leaf child is given multiple child nodes (merge scheme and branch number are predicted); (b) a new leaf node is added as a child, connected with an identity mapping; (c) an identity mapping is replaced by an operation from the set of possible primitives]


  20. Meta-Controller Training Procedure
    20
    Algorithm 1 Path-Level Efficient Architecture Search
    Input: base network baseNet, training set trainSet, validation set valSet,
           batch size B, maximum number of networks M
    1:  trained = 0   // Number of trained networks
    2:  Pnets = []    // Store results of trained networks
    3:  randomly initialize the meta-controller C
    4:  Gc = []       // Store gradients to be applied to C
    5:  while trained < M do
    6:      meta-controller C samples a tree-structured cell
    7:      if cell in Pnets then
    8:          get the validation accuracy accv of cell from Pnets
    9:      else
    10:         model = train(trans(baseNet, cell), trainSet)
    11:         accv = eval(model, valSet)
    12:         add (cell, accv) to Pnets
    13:         trained = trained + 1
    14:     end if
    15:     compute gradients according to (cell, accv) and add to Gc
    16:     if len(Gc) == B then
    17:         update C according to Gc
    18:         Gc = []
    19:     end if
    20: end while
    Slide annotations:
    • Provide a base architecture with a repeated structure, such as DenseNet
    • Sample a cell from the Tree-LSTM
    • Replace the base architecture's cells with the sampled cell, then train and evaluate
    • Estimate the gradients and update the Tree-LSTM's parameters

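A compact Python rendering of Algorithm 1 (a sketch: `controller`, `trans`, `train`, and `evaluate` are placeholders for the Tree-LSTM meta-controller, the cell substitution, child-network training, and validation, respectively):

```python
def path_level_eas(base_net, train_set, val_set, controller, trans, train, evaluate, B, M):
    trained, p_nets, grads = 0, {}, []        # memo of evaluated cells, gradient buffer
    while trained < M:
        cell = controller.sample()            # sample a tree-structured cell
        if cell in p_nets:                    # reuse the stored validation accuracy
            acc_v = p_nets[cell]
        else:                                 # transform the base net, train, evaluate
            model = train(trans(base_net, cell), train_set)
            acc_v = evaluate(model, val_set)
            p_nets[cell] = acc_v
            trained += 1
        grads.append(controller.gradient(cell, acc_v))
        if len(grads) == B:                   # update the controller every B samples
            controller.update(grads)
            grads = []
    return p_nets
```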

  21. Meta-Controller Training Procedure
    21
    (Same content as slide 20.)

  22. Meta-Controller Training Procedure
    22
    (Same content as slide 20.)

  23. Meta-Controller Training Procedure
    23
    (Same content as slide 20.)

  24. Experimental Details
    24
    • Evaluated on CIFAR-10 and ImageNet
    • For ImageNet, the cells found on CIFAR-10 are reused
    • CIFAR-10 uses data augmentation and standardization as preprocessing
    • Mirroring and shifting, plus per-channel standardization
    Settings for the LSTM (meta-controller)
    • Hidden units: 100
    • Optimizer: Adam
    • Gradient estimation: REINFORCE
    • Batch size: 10
    • Baseline: exponential moving average
    • Entropy regularization: 0.01
    • Architecture reward: tan(0.5π × accuracy)
    Settings for the CNN (child networks)
    • Epochs (weights are inherited): 20
    • Optimizer: SGD with Nesterov momentum
    • Batch size: 64
    • Weight decay: 0.0001

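A small sketch of the reward handling described above (the tan(0.5π × accuracy) transform and, per the paper, an exponential-moving-average baseline with decay 0.95); the function names are my own:

```python
import math

def reward(acc_v):
    # Map validation accuracy through tan(acc_v * pi / 2) before feeding it to REINFORCE.
    return math.tan(acc_v * math.pi / 2)

def update_baseline(baseline, r, decay=0.95):
    # Exponential moving average of previous rewards (decay 0.95, as in Cai et al. 2018).
    return decay * baseline + (1 - decay) * r
```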

  25. Experimental Details
    25
    • DenseNet-BC is used as the base architecture
    • 16 blocks, two 3x3 conv layers per block,
    growth rate (output channels of the 3x3 conv) 16
    • Group convolution is used
    • During architecture search, the 3x3 convs inside the dense blocks are replaced
    with the sampled cell and the model is trained
    • After the search, the convolution layers of DenseNet / PyramidNet are replaced
    with the discovered cell, then trained and evaluated
    • The training settings are unchanged from those used during the search
    • The final score is the test error after training for 300 or 600 epochs


  26. Results on CIFAR-10
    26
    Table 1. Test error rate (%) results of our best discovered architectures as well as state-of-the-art human-designed and automatically
    designed architectures on CIFAR-10. If "Reg" is checked, additional regularization techniques (e.g., Shake-Shake (Gastaldi, 2017),
    DropPath (Zoph et al., 2017) and Cutout (DeVries & Taylor, 2017)), along with a longer training schedule (600 epochs or 1800 epochs)
    are utilized when training the networks.
    Model | Reg | Params | Test error
    Human designed
    ResNeXt-29 (16 x 64d) (Xie et al., 2017) | | 68.1M | 3.58
    DenseNet-BC (N = 31, k = 40) (Huang et al., 2017b) | | 25.6M | 3.46
    PyramidNet-Bottleneck (N = 18, α = 270) (Han et al., 2017) | | 27.0M | 3.48
    PyramidNet-Bottleneck (N = 30, α = 200) (Han et al., 2017) | | 26.0M | 3.31
    ResNeXt + Shake-Shake (1800 epochs) (Gastaldi, 2017) | ✓ | 26.2M | 2.86
    ResNeXt + Shake-Shake + Cutout (1800 epochs) (DeVries & Taylor, 2017) | ✓ | 26.2M | 2.56
    Auto designed
    EAS (plain CNN) (Cai et al., 2018) | | 23.4M | 4.23
    Hierarchical (c0 = 128) (Liu et al., 2018) | | - | 3.63
    Block-QNN-A (N = 4) (Zhong et al., 2017) | | - | 3.60
    NAS v3 (Zoph & Le, 2017) | | 37.4M | 3.65
    NASNet-A (6, 32) + DropPath (600 epochs) (Zoph et al., 2017) | ✓ | 3.3M | 3.41
    NASNet-A (6, 32) + DropPath + Cutout (600 epochs) (Zoph et al., 2017) | ✓ | 3.3M | 2.65
    NASNet-A (7, 96) + DropPath + Cutout (600 epochs) (Zoph et al., 2017) | ✓ | 27.6M | 2.40
    Ours
    TreeCell-B with DenseNet (N = 6, k = 48, G = 2) | | 3.2M | 3.71
    TreeCell-A with DenseNet (N = 6, k = 48, G = 2) | | 3.2M | 3.64
    TreeCell-A with DenseNet (N = 16, k = 48, G = 2) | | 13.1M | 3.35
    TreeCell-B with PyramidNet (N = 18, α = 84, G = 2) | | 5.6M | 3.40
    TreeCell-A with PyramidNet (N = 18, α = 84, G = 2) | | 5.7M | 3.14
    TreeCell-A with PyramidNet (N = 18, α = 84, G = 2) + DropPath (600 epochs) | ✓ | 5.7M | 2.99
    TreeCell-A with PyramidNet (N = 18, α = 84, G = 2) + DropPath + Cutout (600 epochs) | ✓ | 5.7M | 2.49
    TreeCell-A with PyramidNet (N = 18, α = 150, G = 2) + DropPath + Cutout (600 epochs) | ✓ | 14.3M | 2.30


  27. Results on CIFAR-10
    27
    (Same Table 1 as the previous slide.)
    With only 200 GPU-hours of search, the method achieves a lower test error (2.30%)
    than NASNet (2.40%), which required 48,000 GPU-hours.


  28. Results on CIFAR-10
    28
    • The structure of the finally discovered cell:
    [Figure 6 of the paper: "Detailed structure of the best discovered cell on CIFAR-10 (TreeCell-A). 'GroupConv' denotes the group convolution; 'Conv' denotes the normal convolution; 'Sep' denotes the depthwise-separable convolution; 'Max' denotes the max pooling; 'Avg' denotes the average pooling."]


  29. Results on CIFAR-10
    29
    • To show that the controller actually learns good architectures, the search is
    compared with random search
    [Figure 7 of the paper: "Progress of the architecture search process and comparison between RL and random search (RS) on CIFAR-10."]


  30. Results on ImageNet
    30
    • For ImageNet, the cells found on CIFAR-10 are reused
    • Top-1 and Top-5 error are compared in the Mobile setting
    (at most 600M multiply-add operations)
    • Achieves better Top-1 and Top-5 accuracy than NASNet
    Table 2. Top-1 (%) and Top-5 (%) classification error rate results on ImageNet in the Mobile setting (≤ 600M multiply-add
    operations). "×+" denotes the number of multiply-add operations.
    Model | ×+ | Top-1 | Top-5
    1.0 MobileNet-224 (Howard et al., 2017) | 569M | 29.4 | 10.5
    ShuffleNet 2x (Zhang et al., 2017) | 524M | 29.1 | 10.2
    CondenseNet (G1 = G3 = 8) (Huang et al., 2017a) | 274M | 29.0 | 10.0
    CondenseNet (G1 = G3 = 4) (Huang et al., 2017a) | 529M | 26.2 | 8.3
    NASNet-A (N = 4) (Zoph et al., 2017) | 564M | 26.0 | 8.4
    NASNet-B (N = 4) (Zoph et al., 2017) | 448M | 27.2 | 8.7
    NASNet-C (N = 3) (Zoph et al., 2017) | 558M | 27.5 | 9.0
    TreeCell-A with CondenseNet (G1 = 4, G3 = 8) | 588M | 25.5 | 8.0
    TreeCell-B with CondenseNet (G1 = 4, G3 = 8) | 594M | 25.4 | 8.1


  31. Conclusion
    31
    • Proposes architecture transformations that satisfy function preservation
    for a given layer
    • Represents architectures as trees and learns good architectures with
    reinforcement learning and a Tree-LSTM
    • By applying the learned cell to PyramidNet, architectures that outperform
    NASNet were obtained at a small computational cost (200 GPU-hours)
    • Future work: combine with model compression techniques to search for
    more compact models


  32. Impressions
    32
    • A global structure such as DenseNet or PyramidNet must be provided as the
    base architecture
    • It probably cannot generate new architectures that depart from
    DenseNet / PyramidNet
    • It feels like a "local" architecture optimization that finds good structures
    among similar ones
    • The computational cost is still high compared with ENAS
    • With this tree-based encoding scheme, architecture optimization via
    Genetic Programming also seems feasible
    • CNN architecture optimization with Cartesian GP, which places nodes on a grid,
    already has prior examples
    [Suganuma et al., 2017] [Suganuma et al., 2018]


  33. References
    33
    • [Pham et al., 2018] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean,
    “Efficient Neural Architecture Search via Parameter Sharing,” in
    Proceedings of the 35th International Conference on Machine
    Learning, pp. 4092-4101, 2018.
    • [Liu et al., 2018] H. Liu, K. Simonyan, and Y. Yang, “DARTS:
    Differentiable Architecture Search,” in preprint
    arXiv:1806.09055v1, 2018.
    • [Chen et al. 2016] T. Chen, I. Goodfellow, and J. Shlens, “Net2Net:
    Accelerating Learning via Knowledge Transfer,” in Proceedings of
    4th International Conference on Learning Representations (ICLR’16),
    2016.
    • [Zoph & Le, 2017] B. Zoph and Q. V Le, “Neural Architecture Search
    with Reinforcement Learning,” in Proceedings of 5th International
    Conference on Learning Representations (ICLR'17), 2017.
    • [Real et al., 2017] E. Real, S. Moore, A. Selle, S. Saxena, Y. L.
    Suematsu, J. Tan, Q. Le, and A. Kurakin, “Large-Scale Evolution of
    Image Classifiers,” in Proceedings of the 34th International
    Conference on Machine Learning, vol. PMLR 70, pp. 2902–2911, Mar.
    2017.


  34. References
    34
    • [Huang et al., 2017] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q.
    Weinberger, “Densely Connected Convolutional Networks,” in 2017 IEEE
    Conference on Computer Vision and Pattern Recognition (CVPR), 2017,
    pp. 2261–2269.
    • [Baker et al., 2017] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing
    Neural Network Architectures Using Reinforcement Learning,” in
    Proceedings of 5th International Conference on Learning
    Representations (ICLR’17), 2017.
    • [Cai et al. AAAI-18] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wan,
    “Efficient Architecture Search by Network Transformation,” in Thirty-
    Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
    • [Suganuma et al., 2017] M. Suganuma, S. Shirakawa, and T. Nagao, “A
    genetic programming approach to designing convolutional neural
    network architectures,” in Proceedings of the Genetic and Evolutionary
    Computation Conference on - GECCO ’17, 2017, pp. 497–504.
    • [Suganuma et al., 2018] M. Suganuma, M. Ozay, and T. Okatani,
    “Exploiting the Potential of Standard Convolutional Autoencoders for
    Image Restoration by Evolutionary Search,” in Proceedings of the 35th
    International Conference on Machine Learning, PMLR 80:4778-4787,
    2018.
