
Hyperparameter Optimization for Machine Learning Models


gree_tech

August 21, 2018

Transcript

  1. About the speaker
• Yoshihiko Ozaki
• Engineer, GREE, Inc.
• Web game development -> machine learning
• Designated research specialist at AIST (National Institute of Advanced Industrial Science and Technology)
• Black-box optimization
• Derivative-free optimization
• Hyperparameter optimization

  3. As models have grown more complex, the number of hyperparameters has also increased; fine-grained tuning is no longer manageable by hand or with simple methods.
[Figure: layer-by-layer architectures of VGG-19, a 34-layer plain network, and a 34-layer Residual Network (He et al. 2016).]

  4. Hyperparameter optimization research is booming, and it has grown into an indispensable tool for practical deep learning. Automating hyperparameter tuning is challenging as an optimization problem:
• the search space is vast
• function evaluations are expensive
• the objective function is noisy
• the variables are of diverse types
Research has advanced mainly around Bayesian optimization (Hutter et al. 2015).

  5. Formulating the hyperparameter optimization problem. It is standard to treat it as black-box optimization that minimizes a performance metric (loss function): Minimize f(λ) subject to λ ∈ Λ. The objective function is not given explicitly as a formula; all we can observe are noisy objective values: f_ε(λ) = f(λ) + ε, with ε iid ~ N(0, σ_n²).

  6. Black-box optimization: pros and cons.
Pros:
• only objective function values are required
• extremely general, with no dependence on the model or loss function
Cons:
• the nature of the objective function is unknown
• gradient information is unavailable (making it hard to design efficient optimization methods)
• derivative-free optimization methods are required

  7. Formulating the hyperparameter optimization problem. It is common to take, e.g., the k-fold cross-validation loss directly as the optimization target: f_ε(λ) = (1/k) Σ_{i=1}^{k} L(A_λ, D_train^i, D_valid^i).

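As a concrete illustration of such a black-box objective (a minimal sketch of my own, not from the talk), the snippet below exposes the mean k-fold cross-validation loss of an SVM as f_ε(λ); scikit-learn and its built-in breast_cancer dataset are assumed purely for demonstration, and λ is a vector of two continuous hyperparameters.

```python
# Minimal sketch of f_eps(lambda): a noisy, expensive, black-box k-fold CV loss.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(lam):
    """lam = (log10_C, log10_gamma); returns the mean 5-fold CV loss (1 - accuracy)."""
    C, gamma = 10.0 ** lam[0], 10.0 ** lam[1]
    model = SVC(C=C, gamma=gamma)
    # Each call retrains k models, so evaluations are expensive; fold variation makes them noisy.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    return 1.0 - scores.mean()

print(objective(np.array([0.0, -3.0])))  # one black-box evaluation
```
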
  8. Requirements a hyperparameter optimization method should satisfy (Falkner et al. 2018a):
• Strong anytime performance: good configurations under a tight budget
• Strong final performance: very good configurations under a loose budget
• Effective use of parallel resources: efficient parallelization
• Scalability: handling very large numbers of parameters without trouble
• Robustness & flexibility: robustness and flexibility against observation noise in the objective and against very sensitive parameters
Satisfying all of these is hard, so in practice trade-offs must be made according to the goal.

  9. A taxonomy of methods (Dodge et al. 2017): closed-loop methods choose λ_k using {(λ_i, f(λ_i))}_{i=1}^{k−1}, whereas open-loop methods choose λ_k from {λ_i}_{i=1}^{k−1} alone.
• Bayesian optimization and similar methods exploit observed objective values to optimize efficiently, and tend to need fewer evaluations.
• Grid search, random search, and similar methods do not depend on objective values, so they can evaluate in parallel as far as resources allow, pair well with cloud compute billed by CPU time, and tend to keep wall-clock time low.

  10. Grid search: pros and cons.
Pros:
• easy to parallelize and scalable with respect to compute resources
Cons:
• severely vulnerable to low effective dimensionality (described later)
• not scalable, since the cost is exponential in the number of parameters
• poor at finding local and global optima

  11. Design of Experiments (DOE): iteratively sample a narrower range centered on the best point found so far (Staelin 2002). [Figure: black = 2-level DOE, white = 3-level DOE; black = first iteration of a 2-level DOE, white = second iteration assuming the lower-left black point was the best.]

  12. Random search: pros and cons.
Pros:
• easy to parallelize and scalable with respect to compute resources
• scalable in the number of parameters
• robust to low effective dimensionality (described later)
Cons:
• poor at finding local and global optima

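A minimal random-search loop over the illustrative objective sketched earlier might look as follows (my own sketch; the bounds on log10_C and log10_gamma are arbitrary for demonstration). Each evaluation is independent, which is what makes the method trivially parallelizable.

```python
import numpy as np

rng = np.random.default_rng(0)
best_lam, best_loss = None, np.inf
for _ in range(50):
    lam = rng.uniform([-3.0, -6.0], [3.0, 0.0])   # sample uniformly in the search box
    loss = objective(lam)                         # black-box evaluation from the earlier sketch
    if loss < best_loss:
        best_lam, best_loss = lam, loss
print(best_lam, best_loss)
```
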
  13. Low effective dimensionality: only a few parameters matter for model performance, so grid search is inefficient, and which parameters matter differs from dataset to dataset (Bergstra et al. 2012). When f(λ1, λ2) = g(λ1) + h(λ2) ≈ g(λ1), only the important parameter λ1 effectively matters. [Figure: grid vs. random sampling over one important and one unimportant parameter.]

  14. Identifying important hyperparameters: recent research trends.
• Hutter et al. (2014): identify important hyperparameters with a functional ANOVA approach
• Fawcett and Hoos (2016): ablation analysis that determines which parameters contribute most to the performance difference between two configurations
• Biedenkapp et al. (2017): speed up ablation analysis using surrogates
• van Rijn and Hutter (2017a, b): large-scale analysis of hyperparameter importance across datasets using functional ANOVA

  15. Low-discrepancy sequences: using Sobol sequences or Latin Hypercube Sampling instead of uniform random sampling has been proposed, and in computational experiments the Sobol sequence looked promising (Bergstra et al. 2012); Dodge et al. (2017) propose using k-DPPs. [Figure: uniform vs. Sobol vs. LHS point sets.]

  16. Nelder-Mead method (Nelder and Mead 1965): optimizes by iteratively deforming a simplex; it is the default method of R's optim function. [Figure: 1-, 2-, and 3-dimensional simplices.]

  17. Nelder-Mead method (Nelder and Mead 1965). [Figure: simplex vertices λ0, λ1, λ2 with f(λ0) ≤ f(λ1) ≤ f(λ2), the centroid λc, and the candidate points λr, λe, λoc, λic.]

  18. Nelder-Mead method (Nelder and Mead 1965). Reflect: λr = λc + δr(λc − λn), where λc = (1/n) Σ_{i=0}^{n−1} λi is the centroid of all vertices except the worst one, λn.

  19. Nelder-Mead method (Nelder and Mead 1965). Expand: λe = λc + δe(λc − λn).

  20. Nelder-Mead method (Nelder and Mead 1965). Outside contract: λoc = λc + δoc(λc − λn).

  21. Nelder-Mead method (Nelder and Mead 1965). Inside contract: λic = λc + δic(λc − λn).

  22. Nelder-Mead method (Nelder and Mead 1965). Shrink: {λ0 + γs(λi − λ0) : i = 0, …, n} (the shrink points λ1s, λ2s in the figure).

  23. Nelder-Mead method (Nelder and Mead 1965). [Worked example: initial simplex λ0, λ1, λ2 with f(λ0) ≤ f(λ1) ≤ f(λ2).]

  24. Nelder-Mead method (Nelder and Mead 1965). [Reflect: compute λr.]

  25. Nelder-Mead method (Nelder and Mead 1965). [Since f(λr) < f(λ0): Expand, computing λe.]

  26. Nelder-Mead method (Nelder and Mead 1965). [Compare f(λr) and f(λe); the simplex becomes λ0, λ1, λe.]

  27. Nelder-Mead method (Nelder and Mead 1965). [Next iteration: reflected point λr of the simplex λ0, λ1, λ2.]

  28. Nelder-Mead method (Nelder and Mead 1965). [Since f(λ1) ≤ f(λr) < f(λ2): Outside contract, computing λoc.]

  29. Nelder-Mead method (Nelder and Mead 1965). [Since f(λoc) ≤ f(λ2): λoc is accepted in place of λ2.]

  30. Nelder-Mead method (Nelder and Mead 1965). [Next iteration: candidate points λr and λe.]

  31. Nelder-Mead method (Nelder and Mead 1965). [Updated simplex λ0, λ1, λ2.]

  32. Nelder-Mead method (Nelder and Mead 1965). [Updated simplex λ0, λ1, λ2.]

  33. Nelder-Mead method (Nelder and Mead 1965). [Since f(λr) ≥ f(λ2): Inside contract, computing λic.]

  34. Nelder-Mead method (Nelder and Mead 1965). [When neither Reflect nor Contract improves on λ2: Shrink.]

  35. Nelder-Mead method (Nelder and Mead 1965). [When neither Reflect nor Contract improves on λ2: Shrink.]

  36. Nelder-Mead method (Nelder and Mead 1965). [Simplex after the Shrink step.]

  37. Nelder-Mead method (Nelder and Mead 1965): pros and cons.
Pros:
• excellent at finding local optima
Cons:
• only partially parallelizable
• can get trapped in a poor local optimum
Convergence results, failure cases, and improved variants are covered in Conn et al. (2009); Audet and Hare (2017).

  38. Nelder-Mead method (Nelder and Mead 1965): choosing the coefficients, which must satisfy 0 < γs < 1 and −1 < δic < 0 < δoc < δr < δe.
• Standard choice: γs = 1/2, δic = −1/2, δoc = 1/2, δr = 1 and δe = 2.
• Adaptive coefficients (Gao and Han 2012): γs = 1 − 1/n, δic = −3/4 + 1/(2n), δoc = 3/4 − 1/(2n), δr = 1, δe = 1 + 2/n, where n ≥ 2.

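In practice, Nelder-Mead can be run through scipy.optimize.minimize; the sketch below (mine, reusing the illustrative objective from the earlier sketch) does so, and SciPy's adaptive option switches to the dimension-dependent coefficients of Gao and Han (2012).

```python
import numpy as np
from scipy.optimize import minimize

x0 = np.array([0.0, -3.0])                 # initial point (log10_C, log10_gamma)
result = minimize(
    objective, x0, method="Nelder-Mead",
    options={"maxfev": 100, "xatol": 1e-3, "fatol": 1e-4, "adaptive": True},
)
print(result.x, result.fun)
```
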
  39. Bayesian optimization.
• Sequential Model-based Optimization (SMBO): the umbrella term for methods that repeatedly alternate function evaluations with updates of a surrogate (a model of the objective); it covers Bayesian optimization and trust-region methods (Ghanbari and Scheinberg 2017).
• Bayesian optimization: the general term for SMBO in which the surrogate is built in a Bayesian way, modeling P(f_ε(λ) | λ).
• Surrogate types:
• Gaussian process (GP): the most standard choice; a well-known implementation is Spearmint (Snoek et al. 2012)
• random forest: SMAC (Hutter et al. 2011)
• Tree Parzen Estimator (TPE) (Bergstra et al. 2011): models P(λ | f_ε(λ)) and P(f_ε(λ)) instead; implemented in Hyperopt (see the sketch below)
• DNN (Snoek et al. 2015)

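As a usage sketch (mine, not from the talk), TPE via Hyperopt on the illustrative objective from the earlier sketch would look roughly like this; it assumes the hyperopt package is installed.

```python
from hyperopt import Trials, fmin, hp, tpe

space = [hp.uniform("log10_C", -3, 3), hp.uniform("log10_gamma", -6, 0)]
trials = Trials()                          # records every evaluated configuration and loss
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)
```
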
  40. Bayesian optimization based on Gaussian process regression.
• Gaussian distribution: a distribution over scalars and vectors.
• Gaussian process: a distribution over functions.
[Figure: samples drawn from a Gaussian process (Bishop 2006).]

  41. Bayesian optimization based on Gaussian process regression.
• Assume the objective follows a GP characterized by a mean function m and a covariance function k: f_ε(λ) ~ GP(m(λ), k(λ, λ′)).
• Taking the prior mean function to be m(λ) = 0 is standard.

  42. Bayesian optimization: covariance functions (kernels).
• The kernel characterizes the shape of the model
• it can be thought of as an abstraction of how close two points are
• with an appropriate kernel, categorical and conditional parameters can also be handled
[Figure: Exponentiated Quadratic and Matérn 5/2 kernels / covariance functions (PyMC3).]

  43. Bayesian optimization: choosing the covariance function (kernel) (Snoek et al. 2012).
• ARD squared exponential kernel: k_se(λ, λ′) = θ0 exp(−(1/2) r²(λ, λ′)), where r²(λ, λ′) = Σ_{d=1}^{D} (λd − λ′d)² / θd².
• ARD Matérn 5/2 kernel: k_52(λ, λ′) = θ0 (1 + √(5 r²(λ, λ′)) + (5/3) r²(λ, λ′)) exp(−√(5 r²(λ, λ′))).
• The kernel hyperparameters are determined dynamically from the data:
• empirical Bayes (Bishop 2006)
• Markov Chain Monte Carlo (MCMC) (Snoek et al. 2012)

  44. Bayesian optimization: the effect of the kernel hyperparameters, PRML Chapter 6 (Bishop 2006). k(λ, λ′) = θ0 exp(−(θ1/2) ∥λ − λ′∥²) + θ2 + θ3 λ⊤λ′. [Figure: GP samples for (θ0, θ1, θ2, θ3) = (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), and (1.00, 4.00, 0.00, 5.00).]

  45. Bayesian optimization. Once m and k are fixed, the function value at an unobserved point can be predicted from past observations; the result follows from properties of the Gaussian distribution and the Schur complement formula (Rasmussen and Williams 2005; Bishop 2006). Since the model cannot predict sensibly without data, it is initialized with observations collected by, e.g., random search.
P(f_ε(λ_{t+1}) | λ1, λ2, …, λ_{t+1}) = N(µ_t(λ_{t+1}), σ_t²(λ_{t+1}) + σ_n²),
µ_t(λ_{t+1}) = k⊤ [K + σ_n² I]⁻¹ [f(λ1) f(λ2) ⋯ f(λt)]⊤,
σ_t²(λ_{t+1}) = k(λ_{t+1}, λ_{t+1}) − k⊤ [K + σ_n² I]⁻¹ k,
where k = [k(λ_{t+1}, λ1) k(λ_{t+1}, λ2) ⋯ k(λ_{t+1}, λt)]⊤ and K is the t×t matrix with entries K_ij = k(λi, λj).

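The predictive equations above translate almost directly into NumPy; the following is a small sketch of mine using a squared exponential kernel with fixed, arbitrary hyperparameters and a few made-up observations.

```python
import numpy as np

def kernel(a, b, theta0=1.0, length=1.0):
    """Squared exponential kernel matrix between row-wise point sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return theta0 * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(lams, fvals, lam_new, sigma_n=1e-3):
    """Posterior mean and variance at lam_new given observations (lams, fvals)."""
    K = kernel(lams, lams)
    k = kernel(lams, lam_new)                              # the column vector k on the slide
    A = np.linalg.inv(K + sigma_n ** 2 * np.eye(len(lams)))
    mu = k.T @ A @ fvals
    var = kernel(lam_new, lam_new) - k.T @ A @ k
    return mu.item(), var.item() + sigma_n ** 2

# three observed configurations (e.g. from random-search initialization) and their losses
lams = np.array([[0.0, -3.0], [1.0, -2.0], [-1.0, -4.0]])
fvals = np.array([0.08, 0.12, 0.10])
print(gp_predict(lams, fvals, np.array([[0.5, -2.5]])))
```
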
  46. Bayesian optimization: choosing the next point to evaluate.
• The next point is the one that maximizes a criterion called the acquisition function.
• The acquisition function handles the trade-off between exploration and exploitation:
• evaluate points where the surrogate's variance is large (exploration)
• evaluate points where the surrogate's mean is small (exploitation)
• Example: GP-Upper Confidence Bound (GP-UCB) (Srinivas 2012): a_UCB(λ) = −µ(λ) + ξσ(λ); the −µ(λ) appears because we are solving a loss minimization problem.
• Many alternatives exist, e.g. Probability of Improvement (PI), Expected Improvement (EI), and Predictive Entropy Search (PES), and the choice strongly affects search performance.

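On top of the gp_predict sketch above, two acquisition functions can be written in a few lines (again my own illustration): GP-UCB exactly as on the slide, and Expected Improvement for comparison; the next candidate is then the maximizer over some pool of points.

```python
import numpy as np
from scipy.stats import norm

def ucb(lam_new, lams, fvals, xi=2.0):
    mu, var = gp_predict(lams, fvals, lam_new)
    return -mu + xi * np.sqrt(var)                 # larger is better (loss minimization)

def expected_improvement(lam_new, lams, fvals):
    mu, var = gp_predict(lams, fvals, lam_new)
    best, sigma = fvals.min(), np.sqrt(var)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# choose the next configuration from a random candidate pool by maximizing the acquisition
pool = np.random.default_rng(1).uniform([-3, -6], [3, 0], size=(256, 2))
scores = [ucb(p[None, :], lams, fvals) for p in pool]
print(pool[int(np.argmax(scores))])
```
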
  47. Bayesian optimization: pros and cons.
Pros:
• global search that accounts for the exploration-exploitation trade-off
• search that accounts for observation noise
Cons:
• sensitive to the covariance function and the acquisition function
• maximizing the acquisition function is itself a non-convex global optimization problem
• with Gaussian process regression, the computational cost is cubic in the number of observations
• hard to parallelize

  48. Reducing the surrogate's computational cost: recent research trends.
• The bottleneck of Gaussian process regression is computing [K + σ_n² I]⁻¹.
• Approximate computation (Quiñonero-Candela et al. 2007; Titsias 2009).
• Surrogates with comparatively low cost:
• random forests (Hutter et al. 2011)
• DNNs (Snoek et al. 2015)

  49. Parallelizing Bayesian optimization: recent research trends.
• Shah and Ghahramani (2015): Parallel Predictive Entropy Search
• Gonzalez et al. (2016): Local Penalization
• Kathuria et al. (2016): DPP sampling
• Kandasamy et al. (2018): asynchronous parallel Thompson sampling
• Many others: Bergstra et al. (2011); Snoek et al. (2012); Contal et al. (2013); Desautels et al. (2014); Daxberger and Low (2017); Wang et al. (2017, 2018a); Rubin (2018)

  50. Other methods with reported applications.
• CMA-ES: Watanabe and Le Roux (2014); Loshchilov and Hutter (2016)
• Particle Swarm Optimization (PSO): Meissner et al. (2006); Lin et al. (2009); Lorenzo et al. (2017); Ye (2017)
• Genetic Algorithm (GA): Leung et al. (2003); Young et al. (2015)
• Differential Evolution (DE): Fu et al. (2016a,b)
• Reinforcement learning: Hansen (2016); Bello et al. (2017); Dong et al. (2018)
• Gradient-based methods (not black-box optimization; continuous parameters only): Maclaurin et al. (2015); Luketina et al. (2016); Pedregosa (2016); Franceschi (2017a,b,c, 2018a,b)

  51. Early stopping: predict the learning curve over epochs and stop runs that show no prospect of reaching good performance.
• Domhan et al. (2015): model the learning curve as a weighted linear combination of 11 basis functions, f_comb = Σ_{i=1}^{k} w_i f_i(λ | θ_i) + ε, ε ~ N(0, σ²), with Σ_{i=1}^{k} w_i = 1 and w_i ≥ 0 for all i.
• Use Bayesian neural networks (Klein et al. 2016).
• Exploit data from previous runs (Chandrashekaran and Lane 2017).

  52. Increasing Image Sizes (IIS) (Hinz et al. 2018): start hyperparameter optimization with low-resolution images and gradually increase the resolution.
• After optimizing at different resolutions, important parameters were analyzed with functional ANOVA.
• Most important parameters and their values are the same regardless of resolution (e.g. learning rate, batch size).
• Parameters affected by resolution include the number of convolutional layers immediately followed by max-pooling (since pooling reduces the resolution) -> suitable initial values for higher resolutions are inferred from the low-resolution runs.
• Optimizing with 750 evaluations at 32×32, 500 at 64×64, and 250 at 128×128 loses no accuracy and finishes sooner than 1500 evaluations at 128×128.

  53. Hyperband (Li et al. 2016): adaptively allocate resources (e.g. training time, amount of training data).
• Successive Halving (Jamieson and Talwalkar 2015): evaluate several candidate hyperparameter configurations, discard the lower-ranked ones, reallocate more resources to the better ones, and continue (a minimal sketch follows below).
• Open issue: with n candidates and total budget B, the right trade-off between n and B/n is not obvious.

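A bare-bones Successive Halving loop (my sketch, not the authors' code) is shown below; partial_train(config, budget) is a hypothetical callback that trains config for the given budget and returns its validation loss, and for simplicity survivors are re-evaluated from scratch at each round.

```python
import numpy as np

def successive_halving(configs, partial_train, min_budget=1, eta=2):
    budget = min_budget
    while len(configs) > 1:
        losses = [partial_train(c, budget) for c in configs]  # evaluate all survivors
        keep = max(1, len(configs) // eta)                    # keep the top 1/eta fraction
        order = np.argsort(losses)[:keep]
        configs = [configs[i] for i in order]
        budget *= eta                                         # survivors get a larger budget
    return configs[0]
```
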
  54. Hyperband (Li et al. 2016). Proposed approach: try several n vs. B/n trade-offs, much like a grid search over brackets. It can be combined with random search or Bayesian optimization (Bertrand et al. 2017; Falkner et al. 2018; Wang et al. 2018).

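A simplified view of Hyperband's outer loop (my sketch, under the same hypothetical partial_train plus a sample_config callback): each bracket s picks a different point on the n vs. B/n trade-off and then runs Successive Halving inside.

```python
import math
import numpy as np

def hyperband(sample_config, partial_train, max_budget=81, eta=3):
    s_max = round(math.log(max_budget, eta))
    best_config, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):                     # one bracket per n vs. B/n trade-off
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))
        budget = max_budget / eta ** s                 # initial per-configuration budget
        configs = [sample_config() for _ in range(n)]
        while True:                                    # Successive Halving inside the bracket
            losses = [partial_train(c, budget) for c in configs]
            i_best = int(np.argmin(losses))
            if losses[i_best] < best_loss:
                best_config, best_loss = configs[i_best], losses[i_best]
            if len(configs) == 1 or budget >= max_budget:
                break
            keep = max(1, len(configs) // eta)
            configs = [configs[i] for i in np.argsort(losses)[:keep]]
            budget *= eta
    return best_config, best_loss
```
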
  55. Meta-learning and warm starting: recent research trends.
• Hypothesis: hyperparameter optimization results for similar datasets are similar, e.g. when a model is retrained because the training data has grown.
• Meta-features:
• handmade: simple features (e.g. number of samples, dimensions, classes), features based on statistics and information theory (e.g. skewness of a distribution), landmark features (the performance of simple machine learning models such as decision trees)
• learned with deep learning (Kim et al. 2017a,b)
• Warm start: initialize a method with the optimization results from nearby datasets:
• PSO (Gomes et al. 2012)
• GA (Reif et al. 2012)
• Bayesian optimization (Bardenet et al. 2013; Yogatama and Mann 2014; Feurer et al. 2014, 2015, 2018; Kim et al. 2017a,b)

  56. Dealing with noise: recent research trends.
• Sampling (Arnold and Beyer 2006): evaluate a configuration n times and take the mean.
• Threshold selection equipped with re-evaluation (Markon et al. 2001; Beielstein and Markon 2002; Jin and Branke 2005; Goh and Tan 2007; Gießen and Kötzing 2016): resample only when the objective value improves on the best value by at least a threshold.
• Value Suppression (Wang et al. 2018b): when the best-k configurations have not been updated for some period, resample them and correct their function values.

  57. Hyperparameter optimization of CNNs (Ozaki et al. 2017). The following settings are optimized with five methods. Dataset: MNIST (LeCun and Cortes, 2010); networks: LeNet (LeCun et al. 1998) and Batch-Normalized Maxout Network in Network (Chang and Chen 2015), where MMLP stands for Maxout Multi Layer Perceptron; task: character recognition (10-class classification). Integer parameters are marked with *.
LeNet search space: x1 learning rate (= 0.1^x1) [1, 4]; x2 momentum (= 1 − 0.1^x2) [0.5, 2]; x3 L2 weight decay [0.001, 0.01]; x4* FC1 units [256, 1024].
Batch-Normalized Maxout Network in Network search space: x1 learning rate (= 0.1^x1) [0.5, 2]; x2 momentum (= 1 − 0.1^x2) [0.5, 2]; x3 L2 weight decay [0.001, 0.01]; x4 dropout 1 [0.4, 0.6]; x5 dropout 2 [0.4, 0.6]; x6-x8 conv 1-3 initialization deviation [0.01, 0.05]; x9-x14 MMLP 1-1, 1-2, 2-1, 2-2, 3-1, 3-2 initialization deviation [0.01, 0.05].

  58. Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (LeNet) results. [Figure: mean loss of all executions for each method per iteration (LeNet).]

  59. Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (LeNet) results.
Method: mean loss / min loss
• Random search: 0.005411 (±0.001413) / 0.002781
• Bayesian optimization: 0.004217 (±0.002242) / 0.000089
• CMA-ES: 0.000926 (±0.001420) / 0.000047
• Coordinate-search method: 0.000052 (±0.000094) / 0.000002
• Nelder-Mead method: 0.000029 (±0.000029) / 0.000004
Method: mean accuracy (%) / accuracy with min loss (%)
• Random search: 98.98 (±0.08) / 99.06
• Bayesian optimization: 99.07 (±0.02) / 99.25
• CMA-ES: 99.20 (±0.08) / 99.30
• Coordinate-search method: 99.26 (±0.05) / 99.35
• Nelder-Mead method: 99.24 (±0.04) / 99.28

  60. Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (Batch-Normalized Maxout Network in Network) results. [Figure: mean loss of all executions for each method per iteration (Batch-Normalized Maxout Network in Network).]

  61. Hyperparameter optimization of CNNs (Ozaki et al. 2017): character recognition (Batch-Normalized Maxout Network in Network) results.
Method: mean loss / min loss
• Random search: 0.045438 (±0.002142) / 0.042694
• Bayesian optimization: 0.045636 (±0.001197) / 0.044447
• CMA-ES: 0.045248 (±0.002537) / 0.042250
• Coordinate-search method: 0.045131 (±0.001088) / 0.043639
• Nelder-Mead method: 0.044549 (±0.001079) / 0.043238
Method: mean accuracy (%) / accuracy with min loss (%)
• Random search: 99.56 (±0.02) / 99.58
• Bayesian optimization: 99.47 (±0.05) / 99.59
• CMA-ES: 99.49 (±0.14) / 99.59
• Coordinate-search method: 99.48 (±0.04) / 99.53
• Nelder-Mead method: 99.53 (±0.00) / 99.54

  62. Hyperparameter optimization of CNNs (Ozaki et al. 2017). Dataset: Adience benchmark (Eran et al. 2014); network: Gil and Tal (2015); tasks: (1) gender estimation (2-class classification), (2) age-group estimation (8-class classification). Search space (integer parameters marked with *): x1 learning rate (= 0.1^x1) [1, 4]; x2 momentum (= 1 − 0.1^x2) [0.5, 2]; x3 L2 weight decay [0.001, 0.01]; x4 dropout 1 [0.4, 0.6]; x5 dropout 2 [0.4, 0.6]; x6* FC 1 units [512, 1024]; x7* FC 2 units [256, 512]; x8-x10 conv 1-3 initialization deviation [0.01, 0.05]; x11-x13 FC 1-3 initialization deviation [0.001, 0.01]; x14-x16 conv 1-3 bias [0, 1]; x17, x18 FC 1-2 bias [0, 1]; x19*, x20* normalization 1-2 localsize (= 2x + 3) [0, 2]; x21, x22 normalization 1-2 alpha [0.0001, 0.0002]; x23, x24 normalization 1-2 beta [0.5, 0.95].

  63. Hyperparameter optimization of CNNs (Ozaki et al. 2017): gender estimation results. [Figure: mean loss of all executions for each method per iteration (gender classification CNN).]

  64. Hyperparameter optimization of CNNs (Ozaki et al. 2017): gender estimation results.
Method: mean loss / min loss
• Random search: 0.001732 (±0.000540) / 0.000984
• Bayesian optimization: 0.00183 (±0.000547) / 0.001097
• CMA-ES: 0.001804 (±0.000480) / 0.001249
• Coordinate-search method: 0.002240 (±0.001448) / 0.000378
• Nelder-Mead method: 0.000395 (±0.000129) / 0.000245
Method: mean accuracy (%) / accuracy with min loss (%)
• Random search: 87.93 (±0.24) / 88.21
• Bayesian optimization: 88.07 (±0.27) / 87.85
• CMA-ES: 88.20 (±0.38) / 88.55
• Coordinate-search method: 87.04 (±0.52) / 87.72
• Nelder-Mead method: 88.38 (±0.47) / 88.83

  65. Hyperparameter optimization of CNNs (Ozaki et al. 2017): age-group estimation results. [Figure: mean loss of all executions for each method per iteration (age classification CNN).]

  66. Hyperparameter optimization of CNNs (Ozaki et al. 2017): age-group estimation results.
Method: mean loss / min loss
• Random search: 0.035694 (±0.006958) / 0.026563
• Bayesian optimization: 0.024792 (±0.003076) / 0.020466
• CMA-ES: 0.031244 (±0.010834) / 0.016952
• Coordinate-search method: 0.032244 (±0.006109) / 0.024637
• Nelder-Mead method: 0.015492 (±0.002276) / 0.013556
Method: mean accuracy (%) / accuracy with min loss (%)
• Random search: 57.18 (±0.96) / 57.90
• Bayesian optimization: 56.28 (±1.68) / 57.19
• CMA-ES: 57.17 (±0.80) / 58.19
• Coordinate-search method: 55.06 (±2.31) / 56.98
• Nelder-Mead method: 56.72 (±0.50) / 57.42

  67. Hyperparameter optimization of CNNs (Ozaki et al. 2017). Why did the local search methods do so well? Hypothesis: the objective function has many good-quality local optima -> the results support this (Nelder-Mead converged to different local optima yet still achieved good performance). [Figure: parallel coordinates plot of the optimized hyperparameters of the gender classification CNN.]
• Follow-up experiments by Olof (2018):
• NM indeed works well for CNNs, but is less convincing for RNNs
• on average, TPE was the best for both CNNs and RNNs (GP-based Bayesian optimization did poorly)
• the single best result found across the experiments was obtained by NM for both CNNs and RNNs
• they point out that a property of the loss function shared by CNNs does not hold for RNNs
• by contrast, the experiments of Snoek et al. (2012) reported GP-based Bayesian optimization outperforming TPE

  68. Computational experiments: various issues.
• Essentially every paper concludes that its proposed method is the best; assume the proposed method has been tuned with care.
• Reproducibility: implementations (released source code), randomness, and tuning; sufficient compute is not always available, which motivates tabular datasets recording model evaluation results (Klein et al. 2018).
• Experimental setups vary widely: HPOLib (Eggensperger et al. 2013).
• How to compare methods: criteria (e.g. accuracy, AUC) and ranking procedures (Dewancker et al. 2016).
• Overfitting to the validation data: in practice, split the data into training / validation / test and check that the post-tuning performance on the test set does not diverge too much.

  69. Conclusion: topics expected to heat up.
• Moving beyond grid search: use other methods, starting with random search; weigh the pros and cons for your situation; consult papers whose experimental setups are close to your own.
• Research topics: optimization methods; related techniques (e.g. identifying important parameters, learning-curve prediction); reproducibility and benchmark infrastructure; applications (AutoML, e.g. the CASH problem, i.e. Combined Algorithm Selection and Hyperparameter optimization, and model architecture search).

  70. Coordinate search: search based on the maximal positive basis (Conn et al., 2009; Audet and Hare, 2017): D⊕ = {±e_i : i = 1, 2, …, n}.

  71. Coordinate search. Initialization: λ0 ∈ Λ (⊂ R^n), initial step size δ0 ∈ R with δ0 > 0, and tolerance ε ∈ [0, ∞). [Figure: initial point λ0.]

  72. Coordinate search. Poll set P_k = {λ_k + δ_k d : d ∈ D⊕}; move to any λ ∈ P_k with f(λ) < f(λ_k). [Figure: polling around λ0 and finding an improving point λ.]

  73. Coordinate search. P_k = {λ_k + δ_k d : d ∈ D⊕}; move to any λ ∈ P_k with f(λ) < f(λ_k). [Figure: step from λ0 to λ1.]

  74. Coordinate search. P_k = {λ_k + δ_k d : d ∈ D⊕}; move to any λ ∈ P_k with f(λ) < f(λ_k). [Figure: successive steps to λ2 and λ3.]

  75. Coordinate search. When no poll point improves, keep λ_{k+1} = λ_k and halve the step size: δ_{k+1} = δ_k / 2. [Figure: λ3 = λ4 = λ5.]

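The whole procedure fits in a few lines; the sketch below (mine, reusing the illustrative objective from the earlier sketch) polls along ±e_i opportunistically, moves on the first improvement, and halves the step size otherwise.

```python
import numpy as np

def coordinate_search(f, lam0, delta0=1.0, eps=1e-3, max_evals=200):
    lam = np.asarray(lam0, dtype=float)
    flam, evals, delta = f(lam), 1, delta0
    n = len(lam)
    directions = np.vstack([np.eye(n), -np.eye(n)])     # maximal positive basis D_plus
    while delta > eps and evals < max_evals:
        improved = False
        for d in directions:                            # opportunistic polling
            cand = lam + delta * d
            fc = f(cand)
            evals += 1
            if fc < flam:
                lam, flam, improved = cand, fc, True
                break
        if not improved:
            delta /= 2.0                                # shrink the step when polling fails
    return lam, flam

print(coordinate_search(objective, [0.0, -3.0]))
```
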
  76. Coordinate search: pros and cons.
Pros:
• good at finding local optima
Cons:
• only partially parallelizable
• scales poorly with the number of dimensions, since it searches iteratively along the coordinate axes
• risks getting trapped in a poor local optimum, since it performs no global search
Convergence results, failure cases, and improved variants are covered in Conn et al. (2009); Audet and Hare (2017).

  77. Coordinate search: normalizing the search space.
• If the hyperparameter scales differ too much, the search becomes inefficient
• prevent this by normalizing the search space to the unit hypercube in advance
• in practice, return a suitably large loss value when a configuration turns out to be invalid

  78. Coordinate search: initialization strategies, i.e. how to choose the initial point; effective against getting trapped in poor local optima.
• Initialize at the center of the search range
• run a few rounds of random search and initialize at the best point found
• multi-start from different initial points

  79. Coordinate search: polling strategies (Audet and Hare 2017).
• Opportunistic polling: accept as soon as an improving point is found; the polling order can be fixed, completely random, or start from the direction that most recently gave an improvement.
• Complete polling (does not scale): evaluate every candidate at each iteration and pick the best.

  80. Bayesian optimization: a kernel for handling categorical parameters.
• Weighted Hamming distance kernel (Hutter et al. 2011):
k_mixed(λ, λ′) = exp(r_cont + r_cat),
r_cont(λ, λ′) = Σ_{l ∈ Λcont} (−θ_l (λ_l − λ′_l)²),
r_cat(λ, λ′) = Σ_{l ∈ Λcat} −θ_l (1 − δ(λ_l, λ′_l)),
where δ is the Kronecker delta function.

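Written out directly (a sketch of mine with arbitrary weights), the weighted Hamming distance kernel is just a sum of the two penalty terms inside an exponential:

```python
import numpy as np

def k_mixed(lam, lam2, cont_idx, cat_idx, theta):
    """Weighted Hamming distance kernel; theta holds one positive weight per dimension."""
    r_cont = -sum(theta[l] * (lam[l] - lam2[l]) ** 2 for l in cont_idx)
    r_cat = -sum(theta[l] * (0.0 if lam[l] == lam2[l] else 1.0) for l in cat_idx)
    return np.exp(r_cont + r_cat)

# e.g. dimensions 0-1 continuous (learning rate, weight decay), dimension 2 categorical (optimizer id)
print(k_mixed(np.array([0.1, 1e-4, 0.0]), np.array([0.2, 1e-4, 1.0]),
              cont_idx=[0, 1], cat_idx=[2], theta=np.ones(3)))
```
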
  81. Bayesian optimization: kernels for handling conditional parameters.
• Conditional kernel (Lévesque et al. 2017): k_c(λ, λ′) = k(λ, λ′) if λ_c = λ′_c for all c ∈ C, and 0 otherwise, where C is the set of indices of active conditional hyperparameters.
• Another kernel for conditional parameters (Swersky et al. 2014).

  82. Bayesian optimization: a concrete Gaussian process regression calculation with k(λ, λ′) = exp(−(1/2) ∥λ − λ′∥²).
µ1(λ2) = k(λ2, λ1) f(λ1).
µ2(λ3) = [k(λ3, λ1) k(λ3, λ2)] [[1, k(λ1, λ2)], [k(λ2, λ1), 1]]⁻¹ [f(λ1) f(λ2)]⊤
= (1 / (1 − k(λ1, λ2)²)) [k(λ3, λ1) k(λ3, λ2)] [[1, −k(λ1, λ2)], [−k(λ2, λ1), 1]] [f(λ1) f(λ2)]⊤
= (1 / (1 − k(λ1, λ2)²)) [k(λ3, λ1) − k(λ2, λ1) k(λ3, λ2), k(λ3, λ2) − k(λ2, λ1) k(λ3, λ1)] [f(λ1) f(λ2)]⊤
= (1 / (1 − k(λ1, λ2)²)) ((k(λ3, λ1) − k(λ2, λ1) k(λ3, λ2)) f(λ1) + (k(λ3, λ2) − k(λ2, λ1) k(λ3, λ1)) f(λ2)).

  83. Bayesian optimization: more on acquisition functions.
• Probability of Improvement (PI) (Kushner 1964): a_PI(λ) = P(f(λ) ≤ f(λ*) − ξ) = Φ((f(λ*) − ξ − µ(λ)) / σ(λ)).
• Expected Improvement (EI) (Mockus et al. 1978): also accounts for the amount of improvement; widely used.
• Predictive Entropy Search (PES) (Hernández-Lobato et al. 2014): maximizes information gain.
[Figure: visualization of PI (Brochu et al. 2010); note that the figure assumes a maximization problem, so it differs slightly from the formula above.]

  84. Bayesian optimization: methods for maximizing the acquisition function.
• Maximizing the acquisition function is itself a non-convex global optimization problem.
• Optimization methods used:
• Brochu (2010): DIRECT (Jones et al. 1993)
• Bergstra (2011): Estimation of Distribution Algorithm (EDA) (Larrañaga and Lozano 2001) and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen 2006)

  85. The connection between Bayesian optimization and multi-armed bandits: recent research trends.
• Multi-armed bandits: sequentially search for the best among several candidates; the problem of maximizing the cumulative reward from slot machines.
• Hyperparameter optimization can be framed as a continuous / infinite-armed bandit or as best-arm identification.
• Bayesian optimization considers the average case, whereas bandit work usually considers worst-case regret minimization.
• Related work: Srinivas et al. (2010, 2012); Bull (2011); Kandasamy et al. (2015, 2017), among others.

  86. Copyright © GREE, Inc. All Rights Reserved. Christopher M. Bishop.

    Pattern recognition and machine learning. Information science and statistics. Springer, New York, 2006. ISBN 978-0-387-31073-2. Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014. URL http://arxiv.org/abs/ 1412.6980. arXiv:1412.6980. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. Beyond Manual Tuning of Hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337, November 2015. ISSN 0933-1875, 1610-1987. doi: 10.1007/s13218-015-0381-0. URL http://link.springer.com/10.1007/s13218-015-0381-0. Stefan Falkner, Aaron Klein, and Frank Hutter. Practical hyperparameter optimization for deep learning, 2018a. URL https://openreview.net/forum?id=HJMudFkDf. Jesse Dodge, Kevin Jamieson, and Noah A. Smith. Open Loop Hyperparameter Optimization and Determinantal Point Processes. arXiv:1706.01566 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.01566. arXiv: 1706.01566. Jaak Simm. Survey of hyperparameter optimization in NIPS2014, 2015. URL https://github.com/jaak-s/nips2014-survey. Carl Staelin. Parameter selection for support vector machines. 2002. URL http://www.hpl.hp.com/techreports/2002/HPL-2002-354R1.html. James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, February 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2188395. Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages I—754–I—762. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3044891. 参考文献
  87. Copyright © GREE, Inc. All Rights Reserved. Chris Fawcett and

    Holger H. Hoos. Analysing differences between algorithm configurations through ablation. Journal of Heuristics, 22(4):431–458, Aug 2016. ISSN 1572-9397. doi:10.1007/s10732-014-9275-9. URL https://doi.org/10.1007/s10732-014-9275-9. Andre Biedenkapp, Marius Lindauer, Katharina Eggensperger, Frank Hutter, ChrisFawcett, and Holger Hoos. Efficient parameter importance analysis via ablation with surrogates, 2017. URL https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14750. Jan N van Rijn and Frank Hutter. An empirical study of hyperparameter importance across datasets. In AutoML@PKDD/ECML, 2017a. Jan N van Rijn and Frank Hutter. Hyperparameter importance across datasets. arXiv preprint arXiv:1710.04725, 2017b. J. A. Nelder and R. Mead. A Simplex Method for Function Minimization. The Computer Journal, 7(4):308–313, January 1965. ISSN 0010-4620, 1460-2067. doi: 10.1093/comjnl/7.4.308. URL https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/7.4.308. Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics, January 2009. ISBN 978-0-89871-668-9 978-0-89871-876-8. doi: 10.1137/1.9780898718768. URL http://epubs.siam.org/doi/book/ 10.1137/1.9780898718768. Charles Audet and Warren Hare. Derivative-Free and Blackbox Optimization. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, Cham, 2017. ISBN 978-3-319-68912-8 978-3-319-68913-5. doi: 10.1007/978-3-319-68913-5. URL http:// link.springer.com/10.1007/978-3-319-68913-5. Fuchang Gao and Lixing Han. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Computational Optimization and Applications, 51(1):259–277, January 2012. ISSN 0926-6003, 1573-2894. doi: 10.1007/s10589-010-9329-3. URL http://link.springer.com/10.1007/ s10589-010-9329-3. Hiva Ghanbari and Katya Scheinberg. Black-Box Optimization in Machine Learning with Trust Region Based Derivative Free Algorithm. arXiv: 1703.06925 [cs], March 2017. URL http://arxiv.org/abs/1703.06925. arXiv: 1703.06925. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012. 参考文献
  88. Copyright © GREE, Inc. All Rights Reserved. Frank Hutter, Holger

    H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. In Carlos A. Coello Coello, editor, Learning and Intelligent Optimization, pages 507–523, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3-642-25566-3. James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2546–2554, USA, 2011. Curran Associates Inc. ISBN 978-1-61839-599-3. URL http://dl.acm.org/citation.cfm?id=2986459.2986743. Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat Prabhat, and Ryan P. Adams. Scalable bayesian optimization using deep neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 2171– 2180. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045349. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.32 Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv:1012.2599 [cs], December 2010. URL http://arxiv.org/abs/1012.2599. arXiv: 1012.2599. Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58:3250–3265, 2012. J. Quiñonero-Candela, CE. Rasmussen, and CKI. Williams. Approximation Methods for Gaussian Process Regression, pages 203–223. Neural Information Processing. MIT Press, Cambridge, MA, USA, September 2007. Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL http://proceedings.mlr.press/v5/titsias09a.html. Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3330–3338, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm? id=2969442.2969611. Javier Gonzalez, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 648–657, Cadiz, Spain, 09–11 May 2016. PMLR. URL http://proceedings.mlr.press/v51/gonzalez16a.html. 参考文献
  89. Copyright © GREE, Inc. All Rights Reserved. Tarun Kathuria, Amit

    Deshpande, and Pushmeet Kohli. Batched Gaussian Process Bandit Optimization via Determinantal Point Processes. arXiv:1611.04088 [cs], November 2016. URL http://arxiv.org/abs/1611.04088. arXiv: 1611.04088. Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Parallelised bayesian optimisation via thompson sampling. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 133–142, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/kandasamy18a.html. Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 8188, ECML PKDD 2013, pages 225–240, New York, NY, USA, 2013. Springer-Verlag New York, Inc. ISBN 978-3-642-40987-5. doi: 10.1007/978-3-642-40988-2_15. URL http://dx.doi.org/10.1007/978-3-642-40988-2_15. Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html. Erik A. Daxberger and Bryan Kian Hsiang Low. Distributed batch Gaussian process optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 951–960, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/daxberger17a.html. Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3656– 3664, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/wang17h.html. Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scalebayesian optimization in high-dimensional spaces. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First nternational Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 745–754, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018b. PMLR. URL http://proceedings.mlr.press/v84/wang18c.html. Ran Rubin. New Heuristics for Parallel and Scalable Bayesian Optimization. arXiv:1807.00373 [cs, stat], July 2018. URL http://arxiv.org/abs/1807.00373. arXiv: 1807.00373. Watanabe, Shinji, and Jonathan Le Roux. Black box optimization for automatic speech recognition. 2014. Loshchilov, Ilya, and Frank Hutter. CMA-ES for Hyperparameter Optimization of Deep Neural Networks. 2016. 参考文献
  90. Copyright © GREE, Inc. All Rights Reserved. Michael Meissner, Michael

    Schmuker, and Gisbert Schneider. Optimized Particle Swarm Optimization (OPSO) and its application to artificial neural network training. BMC Bioinformatics, 7(1):125, March 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-125. URL https://doi.org/10.1186/1471-2105-7-125. Shih-Wei Lin, Shih-Chieh Chen, Wen-Jie Wu, and Chih-Hsien Chen. Parameter determination and feature selection for back-propagation network by particle swarm optimization. Knowledge and Information Systems, 21(2):249–266, November 2009. ISSN 0219-3116. doi: 10.1007/s10115-009-0242-y. URL https://doi.org/10.1007/s10115-009-0242-y. Pablo Ribalta Lorenzo, Jakub Nalepa, Luciano Sanchez Ramos, and José Ranilla Pastor. Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1864–1871. ACM, 2017. Fei Ye. Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high- dimensional data. PLOS ONE, 12 (12):1–36, 2017. doi: 10.1371/journal.pone.0188746. URL https://doi.org/10.1371/journal.pone.0188746. F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. Neural Networks, IEEE Transactions on, 14(1):79–88, February 2003. doi: 10.1109/tnn.2002.804317. URL http://dx.doi.org/10.1109/tnn.2002.804317. Steven R Young, Derek C Rose, Thomas P Karnowski, Seung-Hwan Lim, and Robert M Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, page 4. ACM, 2015. Wei Fu, Tim Menzies, and Xipeng Shen. Tuning for software analytics: Is it really necessary? Information and Software Technology, 76:135 – 146, 2016a. ISSN 0950-5849. doi: https://doi.org/10.1016/j.infsof.2016.04.017. URL http://www.sciencedirect.com/science/article/pii/S0950584916300738. Wei Fu, Vivek Nair, and Tim Menzies. Why is Differential Evolution Better than Grid Search for Tuning Defect Predictors? arXiv:1609.02613 [cs, stat], September 2016b. URL http://arxiv.org/abs/1609.02613. arXiv: 1609.02613. Samantha Hansen. Using deep q-learning to control optimization hyperparameters. arXiv preprint arXiv:1602.04062, 2016. Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468, 2017. 参考文献
  91. Copyright © GREE, Inc. All Rights Reserved. Xingping Dong, Jianbing

    Shen, Wenguan Wang, Yu Liu, Ling Shao, and Fatih Porikli. Hyperparameter optimization for tracking with continuous deep q-learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 518–527, 2018. Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 2113–2122. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343. Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradientbased tuning of continuous regularization hyperparameters. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2952–2960. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm? id=3045390.3045701. Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 737–746. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045469. Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. On hyperparameter optimization in learning systems. In Proceedings of the 5th International Conference on Learning Representations (Workshop Track), 2017a. Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. A Bridge Between Hyperparameter Optimization and Larning-to-learn. arXiv:1712.06283 [cs, stat], December 2017b. URL http://arxiv.org/abs/1712.06283. arXiv: 1712.06283. Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1165–1173, International ConventionCentre, Sydney, Australia, 06–11 Aug 2017c. PMLR. URL http://proceedings.mlr. press/v70/franceschi17a.html. Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1563–1572, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018a. PMLR. URL http://proceedings.mlr.press/v80/franceschi18a.html. Luca Franceschi, Riccardo Grazzi, Massimiliano Pontil, Saverio Salzo, and Paolo Frasconi. Far-ho: A bilevel programming package for hyperparameter optimization and metalearning. CoRR, abs/1806.04941, 2018b. URL http://arxiv.org/abs/1806.04941. Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 3460–3468. AAAI Press, 2015. ISBN 978-1-57735-738-4. URL http://dl.acm.org/ citation.cfm?id=2832581.2832731. 参考文献
  92. Copyright © GREE, Inc. All Rights Reserved. Aaron Klein, Stefan

    Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016. Akshay Chandrashekaran and Ian R. Lane. Speeding up Hyper-parameter Optimization by Extrapolation of Learning Curves Using Previous Builds. In Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, Celine Vens, and Sašo Džeroski, editors, Machine Learning and Knowledge Discovery in Databases, pages 477–492, Cham, 2017. Springer International Publishing. ISBN 978-3-319-71249-9. Tobias Hinz, Nicolás Navarro-Guerrero, Sven Magg, and Stefan Wermter. Speeding up the hyperparameter optimization of deep convolutional neural networks. International Journal of Computational Intelligence and Applications, page 1850008, 2018. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL http://jmlr.org/papers/v18/16-558.html. Hadrien Bertrand, Roberto Ardon, Matthieu Perrot, and Isabelle Bloch. Hyperparameter optimization of deep neural networks : Combining hyperband with bayesian model selection. 2017. Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pages 1436–1445, 2018b. Jiazhuo Wang, Jason Xu, and Xuejun Wang. Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning. arXiv:1801.01596 [cs], January 2018a. URL http://arxiv.org/abs/1801.01596. arXiv: 1801.01596. Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to Warm-Start Bayesian Hyperparameter Optimization. ArXiv e-prints, October 2017. Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to transfer initializations for bayesian hyperparameter optimization. arXiv preprint arXiv: 1710.06219, 2017. T Gomes, P Miranda, R Prudêncio, C Soares, and A Carvalho. Combining meta-learning and optimization algorithms for parameter selection. In 5 th PLANNING TO LEARN WORKSHOP WS28 AT ECAI 2012, page 6. 2012. 参考文献
  93. Copyright © GREE, Inc. All Rights Reserved. Matthias Reif, Faisal

    Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine learning, 87(3):357– 380, 2012. Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013. Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics, pages 1077–1085, 2014. Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize bayesian optimization of hyperparameters. In Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection-Volume 1201, pages 3–10. 2014. Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing bayesian hyperparameter optimization via meta-learning. In AAAI, pages 1128–1135, 2015. Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable meta-learning for bayesian optimization. arXiv preprint arXiv:1802.02219, 2018. Dirk V Arnold and H-G Beyer. A general noise model and its effects on evolution strategy performance. IEEE Transactions on Evolutionary Computation, 10(4):380–391, 2006. Sandor Markon, Dirk V Arnold, Thomas Back, Thomas Beielstein, and H-G Beyer. Thresholding-a selection operator for noisy es. In Evolutionary Computation, 2001. Proceedings of the 2001 Congress on, volume 1, pages 465–472. IEEE, 2001. Thomas Beielstein and Sandor Markon. Threshold selection, hypothesis tests, and doe methods. In Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on, volume 1, pages 777–782. IEEE, 2002. Yaochu Jin and Jürgen Branke. Evolutionary optimization in uncertain environments-a survey. IEEE Transactions on evolutionary computation, 9(3): 303–317, 2005. 参考文献
  94. Copyright © GREE, Inc. All Rights Reserved. Chi Keong Goh

    and Kay Chen Tan. An investigation on noisy environments in evolutionary multiobjective optimization. IEEE Transactions on Evolutionary Computation, 11(3):354–381, 2007. Christian Gießen and Timo Kötzing. Robustness of populations in stochastic environments. Algorithmica, 75(3):462–489, 2016. Hong Wang, Hong Qian, and Yang Yu. Noisy derivative-free optimization with value suppression. 2018b. Yoshihiko Ozaki, Masaki Yano, and Masaki Onishi. Effective hyperparameter optimization using Nelder-Mead method in deep learning. IPSJ Transactions on Computer Vision and Applications, 9(1), December 2017. ISSN 1882-6695. doi: 10.1186/s41074-017-0030-7. URL https:// ipsjcva.springeropen.com/articles/10.1186/s41074-017-0030-7. LeCun Y, Cortes C MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. 2010. LeCun Y, Bottou L, Bengio Y, Patrick H Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324, 1998. Chang JR, Chen YS Batch-Normalized Maxout Network in Network. In: Proceedings of the 33rd International Conference on Machine Learning. 2015. https://arxiv.org/abs/1511.02583. Eran E, Roee E, Tal E Age and gender estimation of unfiltered faces. IEEE Trans Inf Forensic Secur 9(12):2170–2179, 2014. Gil L, Tal H Age and gender classification using convolutional neural networks. Computer Vision and Pattern Recognition Workshops (CVPRW). 2015. http://ieeexplore.ieee.org/document/7301352. Skogby Steinholtz Olof. A comparative study of black-box optimization algorithms for tuning of hyper-parameters in deep neural networks, 2018. 参考文献
  95. Copyright © GREE, Inc. All Rights Reserved. Aaron Klein, Eric

    Christiansen, Kevin Murphy, and Frank Hutter. Towards reproducible neural architecture and hyperparameter search. 2018. Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, and Kevin Leyton-Brown. Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10, page 3, 2013. Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexandra Johnson, and George Ke. A strategy for ranking optimization methods using multiple criteria. In Workshop on Automatic Machine Learning, pages 11–20, 2016. Julien-Charles Lévesque, Audrey Durand, Christian Gagné, and Robert Sabourin. Bayesian optimization for conditional hyperparameter spaces. In Proc. of the International Joint Conference on Neural Networks (IJCNN). IEEE, 05 2017. Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and Michael A Osborne. Raiders of the lost architecture: Kernels for bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011, 2014a. Harold J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering, 86(1):97+, 1964. ISSN 00219223. doi: 10.1115/1.3653121. URL http://dx.doi.org/10.1115/1.3653121. Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimization, 1978. José Miguel Henrández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black- box functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, pages 918–926, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2968826.2968929. D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, October 1993. ISSN 1573-2878. doi: 10.1007/BF00941892. URL https://doi.org/10.1007/BF00941892. Pedro Larraanaga and Jose A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA, 2001. ISBN 0792374665. 参考文献
  96. Copyright © GREE, Inc. All Rights Reserved. Nikolaus Hansen. The

    CMA Evolution Strategy: A Comparing Review. In Jose A. Lozano, Pedro Larrañaga, Iñaki Inza, and Endika Bengoetxea, editors, Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, pages 75–102. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-32494-2. doi: 10.1007/3-540-32494-1_4. URL https://doi.org/10.1007/3-540-32494-1_4. Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 1015– 1022, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL http://dl.acm.org/citation.cfm?id=3104322.3104451. Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58:3250–3265, 2012. Adam D. Bull. Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res., 12:2879–2904, November 2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1953048.2078198. Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional bayesian optimisation and bandits via additive models. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 295–304. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045151. Kirthevasan Kandasamy. Tuning hyper-parameters without grad students: Scaling up bandit optimisation. 2017. 参考文献