## Slide 4

### Slide 4 text

Copyright © GREE, Inc. All Rights Reserved.

## Slide 5

### Slide 5 text

Hyperparameters in machine learning: adjustable parameters, belonging to the model itself or to the methods involved in training, that affect performance. Examples: the regularization coefficient, whose effect is illustrated for ln λ = −18 and ln λ = 0 (figure from Bishop, 2006), and the parameters of the Adam optimizer (Kingma and Ba 2015).

## Slide 6

### Slide 6 text

As models grow more complex, the number of hyperparameters grows too; manual tuning and simple methods can no longer handle the fine-grained adjustment required. (Figure: layer-by-layer architecture diagrams of VGG-19, a 34-layer plain network, and a 34-layer residual network, from He et al. 2016.)

## Slide 8

### Slide 8 text

Formulating hyperparameter optimization. The standard view is a blackbox optimization problem that minimizes a performance metric (the loss function):

Minimize f(λ) subject to λ ∈ Λ.

The objective function is not given explicitly as a formula; all we can observe are objective values corrupted by noise:

fϵ(λ) = f(λ) + ϵ, where ϵ ~ N(0, σn²) i.i.d.
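As a minimal sketch of this setup (the quadratic objective and the noise level σn below are hypothetical stand-ins), evaluating a configuration only ever returns a noisy value:

```python
import random

def f(lam):
    # the true objective f(lambda); in practice this is unknown to the optimizer
    return (lam - 0.3) ** 2

_rng = random.Random(0)

def f_eps(lam, sigma_n=0.05):
    # what we can actually observe: f(lambda) corrupted by i.i.d. Gaussian noise
    return f(lam) + _rng.gauss(0.0, sigma_n)

# repeated observations at the same point differ only by the noise term
observations = [f_eps(0.5) for _ in range(1000)]
```

Averaging repeated observations recovers f(0.5) = 0.04 up to an error of order σn/√1000, which is the motivation for the re-sampling noise countermeasures discussed later.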

## Slide 13

### Slide 13 text

Requirements a hyperparameter optimization method should satisfy (Falkner et al. 2018a):

- Strong Anytime Performance: delivers good performance under a tight budget
- Strong Final Performance: finds a very good configuration under a loose budget
- Effective Use of Parallel Resources: parallelizes efficiently
- Scalability: handles very large numbers of parameters without trouble
- Robustness & Flexibility: robust and flexible with respect to observation noise in the objective and to highly sensitive parameters

Satisfying all of these at once is hard, so in practice trade-offs must be made according to the goal.

## Slide 14

### Slide 14 text

Classifying methods (Dodge et al. 2017). Sequential methods choose the next point λk using the past observations {(λi, f(λi))}_{i=1}^{k−1}; open-loop methods choose λk using only the past points {λi}_{i=1}^{k−1}.

- Sequential methods (e.g. Bayesian optimization)
  - exploit observed objective values to optimize efficiently
  - tend to keep the number of evaluations low
- Open-loop methods (e.g. grid search, random search)
  - do not depend on objective values, so evaluations can run in parallel as far as resources allow
  - fit well with cloud compute, where billing by CPU time is the norm
  - tend to keep wall-clock time low
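The open-loop branch can be sketched as a plain random search: the candidates are drawn without looking at any objective value, so the evaluation step below could be farmed out in parallel (the objective here is a hypothetical stand-in):

```python
import random

def random_search(objective, bounds, n_trials, seed=0):
    """Open-loop random search: all candidates are drawn up front,
    so evaluating them is embarrassingly parallel."""
    rng = random.Random(seed)
    candidates = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                  for _ in range(n_trials)]
    results = [(objective(c), c) for c in candidates]  # parallelizable step
    return min(results)  # (best loss, best configuration)

# hypothetical objective: a quadratic bowl over two hyperparameters
best_loss, best_lam = random_search(
    lambda lam: (lam[0] - 0.3) ** 2 + (lam[1] - 0.7) ** 2,
    bounds=[(0.0, 1.0), (0.0, 1.0)], n_trials=200)
```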

## Slide 20

### Slide 20 text

Low Effective Dimensionality: only a few parameters matter for model performance, which makes grid search inefficient, and which parameters matter differs from dataset to dataset (Bergstra et al. 2012). If f(λ1, λ2) = g(λ1) + h(λ2) ≈ g(λ1), then λ1 is the important parameter and λ2 is unimportant. (Figure: grid search vs. random search over one important and one unimportant parameter.)
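A toy count of why this hurts grid search: on a 3×3 grid over (important, unimportant), only 3 distinct values of the important parameter are ever tried, while 9 random points try 9 (the grid values and unit ranges here are hypothetical):

```python
import itertools
import random

# 3x3 grid: the cross product reuses each important-parameter value 3 times
grid = list(itertools.product([0.0, 0.5, 1.0], repeat=2))
distinct_grid = {lam1 for lam1, _ in grid}

# 9 random points: almost surely 9 distinct important-parameter values
rng = random.Random(0)
rand = [(rng.random(), rng.random()) for _ in range(9)]
distinct_rand = {lam1 for lam1, _ in rand}
```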

## Slide 21

### Slide 21 text

Identifying important hyperparameters: recent research.

- Hutter et al. (2014): identify important hyperparameters with a functional-ANOVA-based approach
- Fawcett and Hoos (2016): ablation analysis that finds which parameters contribute most to the performance difference between two configurations
- Biedenkapp et al. (2017): speed up ablation analysis by using surrogates
- van Rijn and Hutter (2017a, b): large-scale functional-ANOVA analysis of hyperparameter importance across datasets

## Slide 46

### Slide 46 text

Nelder-Mead method (Nelder and Mead 1965): choosing the coefficients, subject to 0 < γs < 1 and −1 < δic < 0 < δoc < δr < δe.

- Standard choice: γs = 1/2, δic = −1/2, δoc = 1/2, δr = 1, δe = 2
- Adaptive coefficients (Gao and Han 2012): γs = 1 − 1/n, δic = −3/4 + 1/(2n), δoc = 3/4 − 1/(2n), δr = 1, δe = 1 + 2/n, where n ≥ 2
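The two coefficient choices can be written down directly; note that the adaptive coefficients of Gao and Han (2012) reduce to the standard ones at n = 2 (the dictionary layout is just an illustration):

```python
def standard_coefficients():
    # the standard Nelder-Mead coefficients
    return dict(gamma_s=0.5, delta_ic=-0.5, delta_oc=0.5,
                delta_r=1.0, delta_e=2.0)

def adaptive_coefficients(n):
    # dimension-dependent coefficients of Gao and Han (2012), valid for n >= 2
    assert n >= 2
    return dict(gamma_s=1 - 1 / n,
                delta_ic=-0.75 + 1 / (2 * n),
                delta_oc=0.75 - 1 / (2 * n),
                delta_r=1.0,
                delta_e=1 + 2 / n)
```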

## Slide 48

### Slide 48 text

Bayesian optimization.

- Sequential Model-based Optimization (SMBO): umbrella term for methods that repeatedly alternate function evaluations with updates of a surrogate (a model of the objective); covers Bayesian optimization and trust-region methods (Ghanbari and Scheinberg 2017)
- Bayesian optimization: the family of SMBO methods whose surrogate is built in a Bayesian way, modeling P(fϵ(λ) | λ)
- Surrogate choices:
  - Gaussian process (GP): the most standard choice; the best-known implementation is Spearmint (Snoek et al. 2012)
  - Random forest: SMAC (Hutter et al. 2011)
  - Tree Parzen Estimator (TPE) (Bergstra et al. 2011): models P(λ | fϵ(λ)) and P(fϵ(λ)) instead; implemented in Hyperopt
  - DNN (Snoek et al. 2015)

## Slide 49

### Slide 49 text

Bayesian optimization: the approach based on Gaussian process regression.

- Gaussian distribution: a distribution over scalars and vectors
- Gaussian process: a distribution over functions

(Figure: samples drawn from a Gaussian process, from Bishop, 2006.)

## Slide 52

### Slide 52 text

Bayesian optimization: choosing the covariance function (kernel) (Snoek et al. 2012).

- ARD squared exponential kernel: kse(λ, λ′) = θ0 exp(−½ r²(λ, λ′)), where r²(λ, λ′) = Σ_{d=1}^{D} (λd − λ′d)² / θd²
- ARD Matérn 5/2 kernel: k52(λ, λ′) = θ0 (1 + √(5 r²(λ, λ′)) + (5/3) r²(λ, λ′)) exp(−√(5 r²(λ, λ′)))
- The kernel's own hyperparameters are set dynamically from the data:
  - empirical Bayes (Bishop 2006)
  - Markov Chain Monte Carlo (MCMC) (Snoek et al. 2012)
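Under the definitions above, both kernels are a few lines each (the θ values passed in below are hypothetical):

```python
import math

def r2(lam, lam_p, theta):
    # ARD squared distance: sum_d (lam_d - lam'_d)^2 / theta_d^2
    return sum((a - b) ** 2 / t ** 2 for a, b, t in zip(lam, lam_p, theta))

def k_se(lam, lam_p, theta0, theta):
    # ARD squared exponential kernel
    return theta0 * math.exp(-0.5 * r2(lam, lam_p, theta))

def k_matern52(lam, lam_p, theta0, theta):
    # ARD Matern 5/2 kernel; with s = sqrt(5 r^2), the (5/3) r^2 term is s^2 / 3
    s = math.sqrt(5.0 * r2(lam, lam_p, theta))
    return theta0 * (1.0 + s + s ** 2 / 3.0) * math.exp(-s)
```

Both kernels equal θ0 at zero distance and decay with r², which is what makes θ0 the prior variance and the θd per-dimension length scales.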

## Slide 53

### Slide 53 text

Bayesian optimization: the effect of the kernel's hyperparameters, following PRML Chapter 6 (Bishop 2006), for the kernel

k(λ, λ′) = θ0 exp(−(θ1/2) ∥λ − λ′∥²) + θ2 + θ3 λ⊤λ′.

(Figure: samples drawn for (θ0, θ1, θ2, θ3) = (1, 4, 0, 0), (9, 4, 0, 0), (1, 64, 0, 0), (1, 0.25, 0, 0), (1, 4, 10, 0), and (1, 4, 0, 5).)

## Slide 54

### Slide 54 text

Bayesian optimization: once m and k are fixed, past observations let us predict the function value at an unobserved point. The result follows from properties of the Gaussian distribution and Schur's formula (Rasmussen and Williams 2005; Bishop 2006):

P(fϵ(λt+1) | λ1, λ2, …, λt+1) = N(µt(λt+1), σt²(λt+1) + σn²),
µt(λt+1) = k⊤ [K + σn² I]⁻¹ [f(λ1) f(λ2) ⋯ f(λt)]⊤,
σt²(λt+1) = k(λt+1, λt+1) − k⊤ [K + σn² I]⁻¹ k,

where k = [k(λt+1, λ1) k(λt+1, λ2) ⋯ k(λt+1, λt)]⊤ and K is the t × t matrix with entries Kij = k(λi, λj).

With no data the predictions are useless, so the model is initialized with points gathered by, e.g., random search.
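A small, stdlib-only sketch of these formulas; the unit-scale squared exponential kernel and the jitter value are hypothetical choices, and `solve` stands in for the `[K + σn² I]⁻¹` products:

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A small and well-conditioned)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                fac = M[r][col]
                M[r] = [v - fac * w for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def k(a, b):
    # unit-scale squared exponential kernel (a hypothetical choice)
    return math.exp(-0.5 * (a - b) ** 2)

def gp_posterior(x_new, xs, fs, sigma_n2=1e-6):
    """Posterior mean and variance at x_new given observations (xs, fs),
    following mu_t and sigma_t^2 above; sigma_n2 is the noise variance."""
    t = len(xs)
    K = [[k(xs[i], xs[j]) + (sigma_n2 if i == j else 0.0) for j in range(t)]
         for i in range(t)]
    kv = [k(x_new, xi) for xi in xs]
    alpha = solve(K, list(fs))           # [K + sigma_n^2 I]^{-1} f
    mu = sum(a * b for a, b in zip(kv, alpha))
    w = solve(K, kv)                     # [K + sigma_n^2 I]^{-1} k
    var = k(x_new, x_new) - sum(a * b for a, b in zip(kv, w))
    return mu, var
```

At an observed point the posterior mean reproduces the observation and the variance collapses toward the noise floor; between observations the variance grows, which is exactly what the acquisition functions later exploit.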

## Slide 56

### Slide 56 text

Bayesian optimization: choosing the next point to evaluate.

- The next point to evaluate is the maximizer of a score called the acquisition function
- The acquisition function manages the exploration-exploitation trade-off:
  - explore: evaluate points where the surrogate's variance is large
  - exploit: evaluate points where the surrogate's mean is small
- Example: GP-Upper Confidence Bound (GP-UCB) (Srinivas 2012), aUCB(λ) = −µ(λ) + ξσ(λ); the −µ(λ) appears because we are solving a loss minimization problem
- Many alternatives exist, e.g. Probability of Improvement (PI), Expected Improvement (EI), and Predictive Entropy Search (PES), and the choice strongly affects search performance
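A minimal sketch of acquisition-driven selection over a finite candidate set; `surrogate` is a hypothetical stand-in that returns the posterior mean and standard deviation at a point:

```python
def ucb(mu, sigma, xi=1.0):
    # GP-UCB for loss minimization: a low mean (exploitation) and a
    # high predictive deviation (exploration) both raise the score
    return -mu + xi * sigma

def next_point(candidates, surrogate, xi=1.0):
    """Pick the candidate maximizing the acquisition function;
    surrogate(x) must return (mu, sigma)."""
    return max(candidates, key=lambda x: ucb(*surrogate(x), xi))
```

With a pure-exploitation surrogate (σ = 0 everywhere) this picks the lowest predicted loss; with a flat mean it picks the most uncertain candidate.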

## Slide 58

### Slide 58 text

Reducing the surrogate's computational cost: recent research.

- The bottleneck of Gaussian process regression is computing [K + σn² I]⁻¹
- Approximate computations (Quiñonero-Candela et al. 2007; Titsias 2009)
- Surrogates with comparatively low cost:
  - random forest (Hutter et al. 2011)
  - DNN (Snoek et al. 2015)

## Slide 59

### Slide 59 text

Parallelizing Bayesian optimization: recent research.

- Shah and Ghahramani (2015): Parallel Predictive Entropy Search
- Gonzalez et al. (2016): Local Penalization
- Kathuria et al. (2016): DPP sampling
- Kandasamy et al. (2018): asynchronous parallel Thompson sampling
- Many others: Bergstra et al. (2011); Snoek et al. (2012); Contal et al. (2013); Desautels et al. (2014); Daxberger and Low (2017); Wang et al. (2017, 2018a); Rubin (2018)

## Slide 61

### Slide 61 text

Other methods with reported applications.

- CMA-ES: Watanabe and Le Roux (2014); Loshchilov and Hutter (2016)
- Particle Swarm Optimization (PSO): Meissner et al. (2006); Lin et al. (2009); Lorenzo et al. (2017); Ye (2017)
- Genetic Algorithm (GA): Leung et al. (2003); Young et al. (2015)
- Differential Evolution (DE): Fu et al. (2016a,b)
- Reinforcement learning: Hansen (2016); Bello et al. (2017); Dong et al. (2018)
- Gradient methods (not blackbox optimization; continuous parameters only): Maclaurin et al. (2015); Luketina et al. (2016); Pedregosa (2016); Franceschi (2017a,b,c, 2018a,b)

## Slide 63

### Slide 63 text

Early stopping: predict the learning curve over epochs and stop runs that show no prospect of reaching good performance.

- Domhan et al. (2015): model the learning curve as a weighted linear combination of 11 basis functions,
  fcomb = Σ_{i=1}^{k} wi fi(λ | θi) + ϵ, ϵ ~ N(0, σ²), with Σ_{i=1}^{k} wi = 1 and wi ≥ 0 for all i
- Using Bayesian neural networks (Klein et al. 2016)
- Exploiting data from past runs (Chandrashekaran and Lane 2017)

## Slide 64

### Slide 64 text

Increasing Image Sizes (IIS) (Hinz et al. 2018): start hyperparameter optimization on low-resolution images and raise the resolution gradually.

- After optimizing at several resolutions, analyze parameter importance with functional ANOVA
- Most of the important parameters and their values are the same regardless of resolution (e.g. learning rate, batch size)
- Resolution-sensitive parameters include the number of convolutional layers immediately followed by max-pooling (pooling reduces resolution) -> infer suitable initial values for high resolution from the low-resolution results
- Optimizing with 750 evaluations at 32×32, 500 at 64×64, and 250 at 128×128 loses no accuracy and finishes sooner than 1500 evaluations at 128×128

## Slide 65

### Slide 65 text

Hyperband (Li et al. 2016): adaptively allocate resources (e.g. training time, amount of training data).

- Successive Halving (Jamieson and Talwalkar 2015)
  - evaluate several candidate hyperparameter configurations
  - discard the lower-ranked candidates, reallocate more resources to the top candidates, and continue evaluating
- Issue: with n candidates and a budget B, the right trade-off between n and B/n is not obvious
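One Successive Halving bracket can be sketched as follows; `evaluate(config, resource)` is a hypothetical callback that trains a configuration with the given resource and returns a loss:

```python
def successive_halving(configs, evaluate, budget, eta=2):
    """One Successive Halving run: repeatedly keep the top 1/eta of the
    candidate pool and multiply the per-candidate resource by eta.
    evaluate(config, resource) must return a loss (lower is better)."""
    resource = budget // len(configs)   # initial resource per candidate
    pool = list(configs)
    while len(pool) > 1:
        scored = sorted(pool, key=lambda c: evaluate(c, resource))
        pool = scored[: max(1, len(pool) // eta)]  # drop the worse part
        resource *= eta                            # survivors get more resource
    return pool[0]
```

Hyperband addresses the n vs. B/n issue by running several such brackets with different initial pool sizes.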

## Slide 67

### Slide 67 text

Meta-learning and warm starts: recent research.

- Hypothesis: hyperparameter optimization results for similar datasets are similar
  - e.g. retraining a model because the training data has grown
- Meta-features
  - handmade: simple features (e.g. number of samples, dimensionality, number of classes); features based on statistics and information theory (e.g. skewness of a distribution); landmark features (the performance of simple machine-learning models such as decision trees)
  - learned with deep learning (Kim et al. 2017a,b)
- Warm-starting: initialize a method with the optimization results from nearby datasets
  - PSO (Gomes et al. 2012)
  - GA (Reif et al. 2012)
  - Bayesian optimization (Bardenet et al. 2013; Yogatama and Mann 2014; Feurer et al. 2014, 2015, 2018; Kim et al. 2017a,b)

## Slide 68

### Slide 68 text

Handling noise: recent research.

- Sampling (Arnold and Beyer 2006): evaluate a configuration n times and take the mean
- Threshold Selection Equipped with Re-evaluation (Markon et al. 2001; Beielstein and Markon 2002; Jin and Branke 2005; Goh and Tan 2007; Gießen and Kötzing 2016): apply sampling when the objective value improves on the best value by at least a threshold
- Value Suppression (Wang et al. 2018b): when the best-k configurations go unchanged for a while, re-sample them and correct their function values

## Slide 70

### Slide 70 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): the following settings are optimized with five methods.

Task: character recognition (10-class classification). Dataset: MNIST (LeCun and Cortes, 2010). Networks: LeNet (LeCun et al. 1998) and Batch-Normalized Maxout Network in Network (Chang and Chen 2015); MMLP stands for Maxout Multi Layer Perceptron. Integer parameters are marked with ∗.

LeNet:

| Name | Description | Range |
| --- | --- | --- |
| x1 | Learning rate (= 0.1^x1) | [1, 4] |
| x2 | Momentum (= 1 − 0.1^x2) | [0.5, 2] |
| x3 | L2 weight decay | [0.001, 0.01] |
| x4∗ | FC1 units | [256, 1024] |

Batch-Normalized Maxout Network in Network:

| Name | Description | Range |
| --- | --- | --- |
| x1 | Learning rate (= 0.1^x1) | [0.5, 2] |
| x2 | Momentum (= 1 − 0.1^x2) | [0.5, 2] |
| x3 | L2 weight decay | [0.001, 0.01] |
| x4 | Dropout 1 | [0.4, 0.6] |
| x5 | Dropout 2 | [0.4, 0.6] |
| x6 | Conv 1 initialization deviation | [0.01, 0.05] |
| x7 | Conv 2 initialization deviation | [0.01, 0.05] |
| x8 | Conv 3 initialization deviation | [0.01, 0.05] |
| x9 | MMLP 1-1 initialization deviation | [0.01, 0.05] |
| x10 | MMLP 1-2 initialization deviation | [0.01, 0.05] |
| x11 | MMLP 2-1 initialization deviation | [0.01, 0.05] |
| x12 | MMLP 2-2 initialization deviation | [0.01, 0.05] |
| x13 | MMLP 3-1 initialization deviation | [0.01, 0.05] |
| x14 | MMLP 3-2 initialization deviation | [0.01, 0.05] |

## Slide 72

### Slide 72 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (LeNet) results.

| Method | mean loss | min loss |
| --- | --- | --- |
| Random search | 0.005411 (±0.001413) | 0.002781 |
| Bayesian optimization | 0.004217 (±0.002242) | 0.000089 |
| CMA-ES | 0.000926 (±0.001420) | 0.000047 |
| Coordinate-search method | 0.000052 (±0.000094) | 0.000002 |
| Nelder-Mead method | 0.000029 (±0.000029) | 0.000004 |

| Method | mean accuracy (%) | accuracy with min loss (%) |
| --- | --- | --- |
| Random search | 98.98 (±0.08) | 99.06 |
| Bayesian optimization | 99.07 (±0.02) | 99.25 |
| CMA-ES | 99.20 (±0.08) | 99.30 |
| Coordinate-search method | 99.26 (±0.05) | 99.35 |
| Nelder-Mead method | 99.24 (±0.04) | 99.28 |

## Slide 73

### Slide 73 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (Batch-Normalized Maxout Network in Network) results. (Figure: mean loss of all executions for each method per iteration, Batch-Normalized Maxout Network in Network.)

## Slide 74

### Slide 74 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (Batch-Normalized Maxout Network in Network) results.

| Method | mean loss | min loss |
| --- | --- | --- |
| Random search | 0.045438 (±0.002142) | 0.042694 |
| Bayesian optimization | 0.045636 (±0.001197) | 0.044447 |
| CMA-ES | 0.045248 (±0.002537) | 0.042250 |
| Coordinate-search method | 0.045131 (±0.001088) | 0.043639 |
| Nelder-Mead method | 0.044549 (±0.001079) | 0.043238 |

| Method | mean accuracy (%) | accuracy with min loss (%) |
| --- | --- | --- |
| Random search | 99.56 (±0.02) | 99.58 |
| Bayesian optimization | 99.47 (±0.05) | 99.59 |
| CMA-ES | 99.49 (±0.14) | 99.59 |
| Coordinate-search method | 99.48 (±0.04) | 99.53 |
| Nelder-Mead method | 99.53 (±0.00) | 99.54 |

## Slide 75

### Slide 75 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017).

Dataset: Adience benchmark (Eran et al. 2014). Network: Gil and Tal (2015). Tasks: (1) gender estimation (2-class classification) and (2) age-group estimation (8-class classification). Integer parameters are marked with ∗.

| Name | Description | Range |
| --- | --- | --- |
| x1 | Learning rate (= 0.1^x1) | [1, 4] |
| x2 | Momentum (= 1 − 0.1^x2) | [0.5, 2] |
| x3 | L2 weight decay | [0.001, 0.01] |
| x4 | Dropout 1 | [0.4, 0.6] |
| x5 | Dropout 2 | [0.4, 0.6] |
| x6∗ | FC 1 units | [512, 1024] |
| x7∗ | FC 2 units | [256, 512] |
| x8 | Conv 1 initialization deviation | [0.01, 0.05] |
| x9 | Conv 2 initialization deviation | [0.01, 0.05] |
| x10 | Conv 3 initialization deviation | [0.01, 0.05] |
| x11 | FC 1 initialization deviation | [0.001, 0.01] |
| x12 | FC 2 initialization deviation | [0.001, 0.01] |
| x13 | FC 3 initialization deviation | [0.001, 0.01] |
| x14 | Conv 1 bias | [0, 1] |
| x15 | Conv 2 bias | [0, 1] |
| x16 | Conv 3 bias | [0, 1] |
| x17 | FC 1 bias | [0, 1] |
| x18 | FC 2 bias | [0, 1] |
| x19∗ | Normalization 1 localsize (= 2x19 + 3) | [0, 2] |
| x20∗ | Normalization 2 localsize (= 2x20 + 3) | [0, 2] |
| x21 | Normalization 1 alpha | [0.0001, 0.0002] |
| x22 | Normalization 2 alpha | [0.0001, 0.0002] |
| x23 | Normalization 1 beta | [0.5, 0.95] |
| x24 | Normalization 2 beta | [0.5, 0.95] |

## Slide 77

### Slide 77 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): gender estimation results.

| Method | mean loss | min loss |
| --- | --- | --- |
| Random search | 0.001732 (±0.000540) | 0.000984 |
| Bayesian optimization | 0.00183 (±0.000547) | 0.001097 |
| CMA-ES | 0.001804 (±0.000480) | 0.001249 |
| Coordinate-search method | 0.002240 (±0.001448) | 0.000378 |
| Nelder-Mead method | 0.000395 (±0.000129) | 0.000245 |

| Method | mean accuracy (%) | accuracy with min loss (%) |
| --- | --- | --- |
| Random search | 87.93 (±0.24) | 88.21 |
| Bayesian optimization | 88.07 (±0.27) | 87.85 |
| CMA-ES | 88.20 (±0.38) | 88.55 |
| Coordinate-search method | 87.04 (±0.52) | 87.72 |
| Nelder-Mead method | 88.38 (±0.47) | 88.83 |

## Slide 79

### Slide 79 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): age-group estimation results.

| Method | mean loss | min loss |
| --- | --- | --- |
| Random search | 0.035694 (±0.006958) | 0.026563 |
| Bayesian optimization | 0.024792 (±0.003076) | 0.020466 |
| CMA-ES | 0.031244 (±0.010834) | 0.016952 |
| Coordinate-search method | 0.032244 (±0.006109) | 0.024637 |
| Nelder-Mead method | 0.015492 (±0.002276) | 0.013556 |

| Method | mean accuracy (%) | accuracy with min loss (%) |
| --- | --- | --- |
| Random search | 57.18 (±0.96) | 57.90 |
| Bayesian optimization | 56.28 (±1.68) | 57.19 |
| CMA-ES | 57.17 (±0.80) | 58.19 |
| Coordinate-search method | 55.06 (±2.31) | 56.98 |
| Nelder-Mead method | 56.72 (±0.50) | 57.42 |

## Slide 80

### Slide 80 text

Hyperparameter optimization of CNNs (Ozaki et al. 2017): why did the local search methods do so well?

Hypothesis: the objective has many good local optima. -> The results support this: NM runs converged to different local optima, yet all reached good performance. (Figure: parallel coordinates plot of the optimized hyperparameters of the gender classification CNN.)

- Replication by Olof (2018)
  - NM indeed works well for CNNs; for RNNs it is less convincing
  - On average, TPE was best for both CNNs and RNNs (GP-based Bayesian optimization performed poorly)
  - The single best result across all experiments, for both CNNs and RNNs, was found by NM
  - Points out that a property shared by CNN loss functions does not hold for RNNs
- In the experiments of Snoek et al. (2012), GP-based Bayesian optimization was reported to outperform TPE

## Slide 81

### Slide 81 text

Computational experiments: various issues.

- Essentially every paper concludes that its proposed method is the best
  - assume the proposed method has been tuned carefully
- Reproducibility: method implementations (published source code), randomness, and tuning; sufficient compute is not always at hand
  - tabular datasets that record model evaluation results (Klein et al. 2018)
- Inconsistent experimental setups
  - HPOLib (Eggensperger et al. 2013)
- How to compare methods: the criterion (e.g. accuracy, AUC) and the ranking procedure (Dewancker et al. 2016)
- Overfitting to the validation data: in practice, split the data into training / validation / test, and after tuning check that test performance has not diverged too far

## Slide 83

### Slide 83 text

Conclusion: topics expected to heat up.

- Moving beyond grid search
  - use other methods, starting with random search
  - weigh advantages and drawbacks according to the situation
  - consult papers whose experimental setup is close to your own
- Research topics
  - optimization methods
  - related techniques (e.g. identifying important parameters, learning-curve prediction)
  - guaranteeing reproducibility and building benchmarks
  - applications (AutoML, e.g. the CASH problem and model architecture search); CASH = Combined Algorithm Selection and Hyperparameter optimization

## Slide 85

### Slide 85 text

Coordinate Search method: search built on the maximal positive basis (Conn et al., 2009; Audet and Hare, 2017),

D⊕ = {±ei : i = 1, 2, …, n}.

## Slide 91

### Slide 91 text

Coordinate Search method: at the iterate λk with step size δk, poll the set Pk = {λk + δk d : d ∈ D⊕} and accept any λ ∈ Pk with f(λ) < f(λk). (Figure: a sequence of iterates λ0, λ1, λ2, λ3.)
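A minimal sketch of this polling loop, using the maximal positive basis and opportunistic acceptance (the step-size schedule and stopping tolerance below are hypothetical choices):

```python
def coordinate_search(f, lam0, delta=1.0, tol=1e-3, max_iter=1000):
    """Poll lam +/- delta along each coordinate axis (the maximal positive
    basis D_plus), move to the first improving point, and halve delta
    whenever a whole poll round fails."""
    lam = list(lam0)
    best = f(lam)
    for _ in range(max_iter):
        improved = False
        for i in range(len(lam)):
            for step in (delta, -delta):
                cand = lam[:]
                cand[i] += step
                fc = f(cand)
                if fc < best:            # accept the first improving poll point
                    lam, best, improved = cand, fc, True
                    break
            if improved:
                break
        if not improved:
            delta /= 2                   # unsuccessful poll: shrink the step
            if delta < tol:
                break
    return lam, best
```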

## Slide 96

### Slide 96 text

Coordinate Search method: pros and cons.

- Good at finding local optima
- Only partially parallelizable
- Scales poorly with dimensionality, since the search proceeds iteratively along coordinate axes
- Performs no global search, so there is a risk of falling into a bad local optimum

For convergence results, failure cases, and improved variants, see Conn et al. (2009); Audet and Hare (2017).

## Slide 99

### Slide 99 text

Coordinate Search method: polling strategies (Audet and Hare 2017).

- Opportunistic polling: adopt an improving point as soon as one is found
  - in a fixed order
  - fully at random
  - starting from the most recently improving direction
- Complete polling (does not scale): evaluate every candidate at each iteration and pick the best value

## Slide 100

### Slide 100 text

Bayesian optimization: a kernel for handling categorical parameters.

- Weighted Hamming distance kernel (Hutter et al. 2011):

kmixed(λ, λ′) = exp(rcont(λ, λ′) + rcat(λ, λ′)),
rcont(λ, λ′) = Σ_{l∈Λcont} −θl (λl − λ′l)²,
rcat(λ, λ′) = Σ_{l∈Λcat} −θl (1 − δ(λl, λ′l)),

where δ is the Kronecker delta function.
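The mixed kernel above, written out (the split of parameters and the θ weights passed in are hypothetical):

```python
import math

def k_mixed(lam_cont, lamp_cont, lam_cat, lamp_cat, theta_cont, theta_cat):
    """Weighted Hamming kernel sketch for mixed continuous/categorical inputs."""
    r_cont = sum(-t * (a - b) ** 2
                 for a, b, t in zip(lam_cont, lamp_cont, theta_cont))
    r_cat = sum(-t * (0.0 if a == b else 1.0)   # 1 - Kronecker delta
                for a, b, t in zip(lam_cat, lamp_cat, theta_cat))
    return math.exp(r_cont + r_cat)
```

Identical configurations get kernel value 1; each mismatched categorical value multiplies the similarity by exp(−θl), just as a continuous gap would.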

## Slide 101

### Slide 101 text

Bayesian optimization: kernels for handling conditional parameters.

- Conditional kernel (Lévesque et al. 2017):

kc(λ, λ′) = k(λ, λ′) if λc = λ′c for all c ∈ C, and 0 otherwise,

where C is the set of indices of active conditional hyperparameters.

- A different kernel for conditional parameters (Swersky et al. 2014)

## Slide 102

### Slide 102 text

Bayesian optimization: the Gaussian process regression computation, concretely, with k(λ, λ′) = exp(−½ ∥λ − λ′∥²), noiseless observations f(λ1), f(λ2), and a prediction point λ3:

µ1(λ2) = k(λ2, λ1) f(λ1)

µ2(λ3) = [k(λ3, λ1) k(λ3, λ2)] [[1, k(λ1, λ2)], [k(λ2, λ1), 1]]⁻¹ [f(λ1) f(λ2)]⊤
= (1 / (1 − k(λ1, λ2)²)) [k(λ3, λ1) k(λ3, λ2)] [[1, −k(λ1, λ2)], [−k(λ2, λ1), 1]] [f(λ1) f(λ2)]⊤
= (1 / (1 − k(λ1, λ2)²)) [(k(λ3, λ1) − k(λ2, λ1) k(λ3, λ2)) (k(λ3, λ2) − k(λ2, λ1) k(λ3, λ1))] [f(λ1) f(λ2)]⊤
= ((k(λ3, λ1) − k(λ2, λ1) k(λ3, λ2)) f(λ1) + (k(λ3, λ2) − k(λ2, λ1) k(λ3, λ1)) f(λ2)) / (1 − k(λ1, λ2)²)

## Slide 103

### Slide 103 text

Bayesian optimization: more on acquisition functions.

- Probability of Improvement (PI) (Kushner 1964): aPI(λ) = P(f(λ) ≤ f(λ∗) − ξ) = Φ((f(λ∗) − ξ − µ(λ)) / σ(λ))
- Expected Improvement (EI) (Mockus et al. 1978): also accounts for the amount of improvement; widely used
- Predictive Entropy Search (PES) (Hernández-Lobato et al. 2014): maximizes information gain

(Figure: visualization of PI, from Brochu et al. 2010; the figure treats a maximization problem, so it differs slightly from the formula above.)
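PI and its improvement-weighted refinement EI, written in the minimization convention used above; the closed forms follow from the Gaussian posterior N(µ, σ²), and ξ is the usual trade-off margin:

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pi(mu, sigma, f_best, xi=0.0):
    """Probability of Improvement for minimization:
    P(f(lam) <= f(lam*) - xi) under the posterior N(mu, sigma^2)."""
    return norm_cdf((f_best - xi - mu) / sigma)

def ei(mu, sigma, f_best, xi=0.0):
    """Expected Improvement for minimization:
    E[max(f_best - xi - f(lam), 0)] under the posterior N(mu, sigma^2)."""
    z = (f_best - xi - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_best - xi - mu) * norm_cdf(z) + sigma * pdf
```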

## Slide 104

### Slide 104 text

Bayesian optimization: maximizing the acquisition function.

- Maximizing the acquisition function is itself a non-convex global optimization problem
- Methods in use:
  - Brochu (2010): DIRECT (Jones et al. 1993)
  - Bergstra (2011): Estimation of Distribution Algorithm (EDA) (Larrañaga and Lozano 2011)
  - Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen 2006)

## Slide 105

### Slide 105 text

Connections between Bayesian optimization and multi-armed bandits: recent research.

- Multi-armed bandits: sequentially search for the best among several candidates; the problem of maximizing the cumulative reward of slot machines
- Hyperparameter optimization can be cast as a continuous / infinitely-many-armed bandit, or as best-arm identification
- Bayesian optimization considers the average case, whereas bandit work typically considers worst-case regret minimization
- Related work: Srinivas et al. (2010, 2012); Bull (2011); Kandasamy et al. (2015, 2017), among others

## Slide 107

### Slide 107 text

References

- Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, 2006. ISBN 978-0-387-31073-2.
- Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014. URL http://arxiv.org/abs/1412.6980.
- Kaiming He, et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. Beyond Manual Tuning of Hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337, November 2015. doi: 10.1007/s13218-015-0381-0. URL http://link.springer.com/10.1007/s13218-015-0381-0.
- Stefan Falkner, Aaron Klein, and Frank Hutter. Practical hyperparameter optimization for deep learning, 2018a. URL https://openreview.net/forum?id=HJMudFkDf.
- Jesse Dodge, Kevin Jamieson, and Noah A. Smith. Open Loop Hyperparameter Optimization and Determinantal Point Processes. arXiv:1706.01566 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.01566.
- Jaak Simm. Survey of hyperparameter optimization in NIPS2014, 2015. URL https://github.com/jaak-s/nips2014-survey.
- Carl Staelin. Parameter selection for support vector machines. 2002. URL http://www.hpl.hp.com/techreports/2002/HPL-2002-354R1.html.
- James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, February 2012. URL http://dl.acm.org/citation.cfm?id=2188385.2188395.
- Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages I-754–I-762. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3044891.

## Slide 110

### Slide 110 text

References

- Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched Gaussian Process Bandit Optimization via Determinantal Point Processes. arXiv:1611.04088 [cs], November 2016. URL http://arxiv.org/abs/1611.04088.
- Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Parallelised bayesian optimisation via thompson sampling. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 133–142. PMLR, 2018. URL http://proceedings.mlr.press/v84/kandasamy18a.html.
- Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In ECML PKDD 2013, pages 225–240. Springer, 2013. doi: 10.1007/978-3-642-40988-2_15. URL http://dx.doi.org/10.1007/978-3-642-40988-2_15.
- Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html.
- Erik A. Daxberger and Bryan Kian Hsiang Low. Distributed batch Gaussian process optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 951–960. PMLR, 2017. URL http://proceedings.mlr.press/v70/daxberger17a.html.
- Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3656–3664. PMLR, 2017. URL http://proceedings.mlr.press/v70/wang17h.html.
- Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 745–754. PMLR, 2018b. URL http://proceedings.mlr.press/v84/wang18c.html.
- Ran Rubin. New Heuristics for Parallel and Scalable Bayesian Optimization. arXiv:1807.00373 [cs, stat], July 2018. URL http://arxiv.org/abs/1807.00373.
- Shinji Watanabe and Jonathan Le Roux. Black box optimization for automatic speech recognition. 2014.
- Ilya Loshchilov and Frank Hutter. CMA-ES for Hyperparameter Optimization of Deep Neural Networks. 2016.

## Slide 111

### Slide 111 text

References

- Michael Meissner, Michael Schmuker, and Gisbert Schneider. Optimized Particle Swarm Optimization (OPSO) and its application to artificial neural network training. BMC Bioinformatics, 7(1):125, March 2006. doi: 10.1186/1471-2105-7-125. URL https://doi.org/10.1186/1471-2105-7-125.
- Shih-Wei Lin, Shih-Chieh Chen, Wen-Jie Wu, and Chih-Hsien Chen. Parameter determination and feature selection for back-propagation network by particle swarm optimization. Knowledge and Information Systems, 21(2):249–266, November 2009. doi: 10.1007/s10115-009-0242-y. URL https://doi.org/10.1007/s10115-009-0242-y.
- Pablo Ribalta Lorenzo, Jakub Nalepa, Luciano Sanchez Ramos, and José Ranilla Pastor. Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1864–1871. ACM, 2017.
- Fei Ye. Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data. PLOS ONE, 12(12):1–36, 2017. doi: 10.1371/journal.pone.0188746. URL https://doi.org/10.1371/journal.pone.0188746.
- F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, February 2003. doi: 10.1109/tnn.2002.804317. URL http://dx.doi.org/10.1109/tnn.2002.804317.
- Steven R Young, Derek C Rose, Thomas P Karnowski, Seung-Hwan Lim, and Robert M Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, page 4. ACM, 2015.
- Wei Fu, Tim Menzies, and Xipeng Shen. Tuning for software analytics: Is it really necessary? Information and Software Technology, 76:135–146, 2016a. doi: 10.1016/j.infsof.2016.04.017. URL http://www.sciencedirect.com/science/article/pii/S0950584916300738.
- Wei Fu, Vivek Nair, and Tim Menzies. Why is Differential Evolution Better than Grid Search for Tuning Defect Predictors? arXiv:1609.02613 [cs, stat], September 2016b. URL http://arxiv.org/abs/1609.02613.
- Samantha Hansen. Using deep q-learning to control optimization hyperparameters. arXiv preprint arXiv:1602.04062, 2016.
- Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468, 2017.

## Slide 112

### Slide 112 text

References

- Xingping Dong, Jianbing Shen, Wenguan Wang, Yu Liu, Ling Shao, and Fatih Porikli. Hyperparameter optimization for tracking with continuous deep q-learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 518–527, 2018.
- Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 2113–2122. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343.
- Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 2952–2960. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045701.
- Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 737–746. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045469.
- Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. On hyperparameter optimization in learning systems. In Proceedings of the 5th International Conference on Learning Representations (Workshop Track), 2017a.
- Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. A Bridge Between Hyperparameter Optimization and Learning-to-learn. arXiv:1712.06283 [cs, stat], December 2017b. URL http://arxiv.org/abs/1712.06283.
- Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1165–1173. PMLR, 2017c. URL http://proceedings.mlr.press/v70/franceschi17a.html.
- Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1563–1572. PMLR, 2018a. URL http://proceedings.mlr.press/v80/franceschi18a.html.
- Luca Franceschi, Riccardo Grazzi, Massimiliano Pontil, Saverio Salzo, and Paolo Frasconi. Far-ho: A bilevel programming package for hyperparameter optimization and meta-learning. CoRR, abs/1806.04941, 2018b. URL http://arxiv.org/abs/1806.04941.
- Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3460–3468. AAAI Press, 2015. URL http://dl.acm.org/citation.cfm?id=2832581.2832731.

## Slide 113

### Slide 113 text

Copyright © GREE, Inc. All Rights Reserved. Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016. Akshay Chandrashekaran and Ian R. Lane. Speeding up Hyper-parameter Optimization by Extrapolation of Learning Curves Using Previous Builds. In Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, Celine Vens, and Sašo Džeroski, editors, Machine Learning and Knowledge Discovery in Databases, pages 477–492, Cham, 2017. Springer International Publishing. ISBN 978-3-319-71249-9. Tobias Hinz, Nicolás Navarro-Guerrero, Sven Magg, and Stefan Wermter. Speeding up the hyperparameter optimization of deep convolutional neural networks. International Journal of Computational Intelligence and Applications, page 1850008, 2018. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL http://jmlr.org/papers/v18/16-558.html. Hadrien Bertrand, Roberto Ardon, Matthieu Perrot, and Isabelle Bloch. Hyperparameter optimization of deep neural networks : Combining hyperband with bayesian model selection. 2017. Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and eﬃcient hyperparameter optimization at scale. In International Conference on Machine Learning, pages 1436–1445, 2018b. Jiazhuo Wang, Jason Xu, and Xuejun Wang. Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning. arXiv:1801.01596 [cs], January 2018a. URL http://arxiv.org/abs/1801.01596. arXiv: 1801.01596. Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to Warm-Start Bayesian Hyperparameter Optimization. ArXiv e-prints, October 2017. Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to transfer initializations for bayesian hyperparameter optimization. arXiv preprint arXiv: 1710.06219, 2017. 
T. Gomes, P. Miranda, R. Prudêncio, C. Soares, and A. Carvalho. Combining meta-learning and optimization algorithms for parameter selection. In 5th Planning to Learn Workshop (WS28) at ECAI 2012, page 6, 2012.

## Slide 114

### Slide 114 text

Matthias Reif, Faisal Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87(3):357–380, 2012.

Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michèle Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.

Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics, pages 1077–1085, 2014.

Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize Bayesian optimization of hyperparameters. In Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection, volume 1201, pages 3–10, 2014.

Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI, pages 1128–1135, 2015.

Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable meta-learning for Bayesian optimization. arXiv preprint arXiv:1802.02219, 2018.

Dirk V. Arnold and Hans-Georg Beyer. A general noise model and its effects on evolution strategy performance. IEEE Transactions on Evolutionary Computation, 10(4):380–391, 2006.

Sandor Markon, Dirk V. Arnold, Thomas Bäck, Thomas Beielstein, and Hans-Georg Beyer. Thresholding: a selection operator for noisy ES. In Proceedings of the 2001 Congress on Evolutionary Computation, volume 1, pages 465–472. IEEE, 2001.

Thomas Beielstein and Sandor Markon. Threshold selection, hypothesis tests, and DOE methods. In Proceedings of the 2002 Congress on Evolutionary Computation (CEC'02), volume 1, pages 777–782. IEEE, 2002.

Yaochu Jin and Jürgen Branke. Evolutionary optimization in uncertain environments: a survey. IEEE Transactions on Evolutionary Computation, 9(3):303–317, 2005.

## Slide 115

### Slide 115 text

Chi Keong Goh and Kay Chen Tan. An investigation on noisy environments in evolutionary multiobjective optimization. IEEE Transactions on Evolutionary Computation, 11(3):354–381, 2007.

Christian Gießen and Timo Kötzing. Robustness of populations in stochastic environments. Algorithmica, 75(3):462–489, 2016.

Hong Wang, Hong Qian, and Yang Yu. Noisy derivative-free optimization with value suppression. 2018b.

Yoshihiko Ozaki, Masaki Yano, and Masaki Onishi. Effective hyperparameter optimization using Nelder-Mead method in deep learning. IPSJ Transactions on Computer Vision and Applications, 9(1), December 2017. ISSN 1882-6695. doi: 10.1186/s41074-017-0030-7. URL https://ipsjcva.springeropen.com/articles/10.1186/s41074-017-0030-7.

Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized Maxout Network in Network. In Proceedings of the 33rd International Conference on Machine Learning, 2015. URL https://arxiv.org/abs/1511.02583.

Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170–2179, 2014.

Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2015. URL http://ieeexplore.ieee.org/document/7301352.

Olof Skogby Steinholtz. A comparative study of black-box optimization algorithms for tuning of hyper-parameters in deep neural networks, 2018.

## Slide 116

### Slide 116 text

Aaron Klein, Eric Christiansen, Kevin Murphy, and Frank Hutter. Towards reproducible neural architecture and hyperparameter search. 2018.

Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, and Kevin Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, volume 10, page 3, 2013.

Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexandra Johnson, and George Ke. A strategy for ranking optimization methods using multiple criteria. In Workshop on Automatic Machine Learning, pages 11–20, 2016.

Julien-Charles Lévesque, Audrey Durand, Christian Gagné, and Robert Sabourin. Bayesian optimization for conditional hyperparameter spaces. In Proc. of the International Joint Conference on Neural Networks (IJCNN). IEEE, May 2017.

Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and Michael A. Osborne. Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011, 2014a.

Harold J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97+, 1964. ISSN 0021-9223. doi: 10.1115/1.3653121. URL http://dx.doi.org/10.1115/1.3653121.

Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 1978.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 918–926, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2968826.2968929.

D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, October 1993. ISSN 1573-2878. doi: 10.1007/BF00941892. URL https://doi.org/10.1007/BF00941892.

Pedro Larrañaga and José A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA, 2001. ISBN 0792374665.