Planning to Explore via Self-Supervised World Models

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak, ICML 2020, https://arxiv.org/abs/2005.05960
69th Artificial General Intelligence Reading Group, 2020/08/26, Susumu Ota

Contents
• Self-introduction
• Reinforcement learning
• Intrinsic rewards
• Plan2Explore
  – Purpose
  – Method
    • World model
    • Zero-shot/few-shot adaptation
  – Results
• Summary

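Self-introduction
• Susumu Ota
  – Freelance engineer
  – Part-time lecturer, Tokyo University of Technology
• Interested in reinforcement learning since around 2018
  – Participated in the Whole Brain Architecture Hackathon 2018
    • 6 tasks on eye movement (upper-right figure)
    • Solved 5 of them; won the top prize (among 7 teams)
  – Participated in The Animal-AI Olympics 2019
    • 10 tasks on animal cognition (lower-right figure)
    • 18th overall (among 61 teams)
    • Category 7 (Internal Models) prize (awarded after moving up in the ranking)
• Sutton book reading group
• Reinforcement Learning Colloquium (paper reading group)
Figure sources: http://animalaiolympics.com/AAI/ https://wba-initiative.org/3401/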

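What is reinforcement learning?
• Reinforcement learning
  – Learns from direct interaction with the environment
  – Four elements: policy, reward, value function, and (optionally) a model of the environment
  – Rewards may arrive with a delay
• Formalized as a Markov decision process (MDP) (upper-right figure)
  – Describes the agent-environment interaction in terms of states, actions, and rewards
  – The next state and reward depend only on the current state and action
• Value function
  – The value of a state/action is the total reward expected in the future (the return)
  – Estimating a value function is not strictly necessary for solving reinforcement learning problems (e.g., evolutionary methods)
• Exploration-exploitation dilemma
  – Emphasizing exploration increases computation time or prevents convergence
  – Emphasizing exploitation leads to local optima
  – Methods such as ε-greedy and softmax (tuned by the temperature T) introduce random exploratory actions
Sutton et al., "Reinforcement Learning: An Introduction second edition", MIT Press, 2018.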
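
As a minimal sketch of the two exploration rules named on this slide, the Python below implements ε-greedy and softmax (Boltzmann) action selection over a vector of action-value estimates. The function names, the `q_values` argument, and the use of a NumPy random generator are illustrative assumptions, not taken from the slides or the cited book.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def softmax_policy(q_values, temperature):
    """Action probabilities proportional to exp(Q / T); a higher T explores more."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Lowering `temperature` concentrates probability on the greedy action, while raising it moves the policy toward uniform random choice.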

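Model-free and model-based reinforcement learning
• Goal
  – Improve the value function/policy
• Model-free reinforcement learning
  – Direct reinforcement learning
  – Simple
  – Not affected by biases in model design
• Model-based reinforcement learning
  – Indirect reinforcement learning (via a model)
  – Makes the most of limited experience
  – Can find a better policy from fewer interactions with the environment
Sutton et al., "Reinforcement Learning: An Introduction second edition", MIT Press, 2018.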

Atari-57 SotA
• Representative methods
  – DQN (2015)
  – Rainbow (2017)
  – Ape-X (2018)
  – R2D2 (2019)
  – MuZero (2019)
  – Agent57 (2020)
    • Not the Atari-57 SotA, but scores above human level on all 57 games
Figure: Atari-57 leaderboard (annotation: Rainbow). https://paperswithcode.com/sota/atari-games-on-atari-57
https://www.slideshare.net/SusumuOTA/distributed-rl-ota-20200622

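Montezuma’s Revenge SotA
• Human expert score
  – 4,753
• Exploration is hard
  – Rewards from the environment (extrinsic rewards) are sparse
• Ideas such as intrinsic motivation, intrinsic rewards, and curiosity were introduced
Figure: Montezuma’s Revenge leaderboard (annotations: pseudo-count, ICM, RND, ICM + RND). https://paperswithcode.com/sota/atari-games-on-atari-2600-montezumas-revenge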

Methods for generating intrinsic rewards
• Count-based
• Pseudo-count
• Prediction error of a forward model
• Random Network Distillation (RND)
• Prediction disagreement of an ensemble of models

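Count-based example: UCB1 (Upper Confidence Bound 1)
• One solution method for bandit problems
• Step count t, mean reward of action a Q_t(a), number of times action a was taken N_t(a)
• The optimal action has a mean reward of 1.55
• Early on, the exploration term (second term) is large
• Exploration gradually decreases
• As the number of samples grows, the confidence interval narrows
• The state space is tabular
• Continuous spaces are discretized
• Problems
  – Memory runs out as the dimensionality of the state space grows
  – 84 x 84 x 4 frames x number of colors x luminance levels
Formula annotations: exploitation (first term), exploration (second term)
Sutton et al., "Reinforcement Learning: An Introduction second edition", MIT Press, 2018.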
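
The selection rule whose two terms the slide annotates can be sketched as follows. This assumes the form used in the cited Sutton and Barto book, argmax over a of Q_t(a) + c * sqrt(ln t / N_t(a)); the function name, the default c = 2, and the rule of trying untried actions first are illustrative choices rather than details from the slide.

```python
import numpy as np


def ucb_action(q, n, t, c=2.0):
    """Pick argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]; untried actions go first."""
    q = np.asarray(q, dtype=float)
    n = np.asarray(n, dtype=float)
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / n)  # exploration term, shrinks as N_t(a) grows
    return int(np.argmax(q + bonus))
```

With c = 0 the rule reduces to pure exploitation of the current estimates; a rarely tried action receives a large bonus and is selected even when its estimate is slightly lower.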

Pseudo-count
• Without discretizing, estimates a density over the state space (image space) and derives a pseudo visit count from the change in probability density
• Problems
  – For example, a state counts as visited if its image merely resembles one seen before
https://www.slideshare.net/ssuser1ad085/rnd-124137638
Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation", NIPS 2016, 2016.
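
A hedged sketch of how a pseudo-count follows from the change in density: assuming `rho` is the density model's probability of a state before observing it and `rho_prime` is the probability after one more observation of that state, the pseudo-count in Bellemare et al. is rho * (1 - rho_prime) / (rho_prime - rho). The function name and argument names are hypothetical.

```python
def pseudo_count(rho, rho_prime):
    """Pseudo-count from the density before (rho) and after (rho_prime) one more visit."""
    if rho_prime <= rho:
        raise ValueError("the density must increase after observing the state")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)
```

For an exact empirical count model, where a state seen N times out of n has rho = N / n and rho_prime = (N + 1) / (n + 1), the formula recovers N.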

Prediction error of a forward model (Intrinsic Curiosity Module, ICM)
• Forward model
  – A model that predicts what happens at the next time step
• Inverse model
  – Learns the feature vector Φ, using the discrepancy between the predicted action and the actual action as the loss
https://www.slideshare.net/ssuser1ad085/rnd-124137638
Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", ICML 2017, 2017.
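
As a minimal sketch of the forward-model part, the intrinsic reward can be computed as the squared error between the predicted and the actual next-step features. The function name, the scaling factor `eta`, and the assumption that features arrive as plain vectors are illustrative; the feature encoder and the inverse model are not shown.

```python
import numpy as np


def forward_error_reward(phi_next_pred, phi_next, eta=1.0):
    """Intrinsic reward = (eta / 2) * squared error between predicted and actual features."""
    diff = np.asarray(phi_next_pred, dtype=float) - np.asarray(phi_next, dtype=float)
    return 0.5 * eta * float(np.sum(diff ** 2))
```

A transition the forward model predicts perfectly yields zero reward, so the agent is drawn to transitions it cannot yet predict.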

Problem with forward-model prediction error: the Noisy-TV problem
• When a TV whose screen switches at random comes into view, the agent stops in front of it
Burda et al., "Large-Scale Study of Curiosity-Driven Learning", ICLR 2019, 2019.
https://www.slideshare.net/ssuser1ad085/rnd-124137638

Random Network Distillation (RND)
• First to exceed the human score (4,753) on Montezuma's Revenge
• Not a forward model, so it avoids the Noisy-TV problem
Burda et al., "Exploration by Random Network Distillation", ICLR 2019, 2019.
https://www.slideshare.net/ssuser1ad085/rnd-124137638
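
The idea behind RND can be sketched with a fixed random target network and a predictor trained to imitate it; the predictor's error on an observation serves as the intrinsic reward. The code below is a deliberately simplified assumption: a single tanh layer as the target and a linear predictor, whereas the paper uses deep networks. The class name, method names, and hyperparameters are hypothetical.

```python
import numpy as np


class RND:
    """Simplified Random Network Distillation: fixed random target, trained linear predictor."""

    def __init__(self, obs_dim, feat_dim=8, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w_target = rng.normal(size=(obs_dim, feat_dim))  # fixed, never trained
        self.w_pred = np.zeros((obs_dim, feat_dim))  # trained to imitate the target
        self.lr = lr

    def target(self, obs):
        return np.tanh(obs @ self.w_target)

    def reward(self, obs):
        """Intrinsic reward: squared error between predictor and target features."""
        err = obs @ self.w_pred - self.target(obs)
        return float(np.sum(err ** 2))

    def update(self, obs):
        """One gradient step on the squared error for a single observation."""
        err = obs @ self.w_pred - self.target(obs)
        self.w_pred -= self.lr * np.outer(obs, err)
```

Because the target is a deterministic function of the current observation only, the error can be driven to zero for frequently visited observations, regardless of how random the environment's transitions are.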

Prediction disagreement of an ensemble of models: Method
• The variance of the ensemble's network outputs is used as the intrinsic reward
• The right-hand side does not depend on x_{t+1}
Pathak et al., "Self-Supervised Exploration via Disagreement", ICML 2019, 2019.
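
A minimal sketch of the disagreement reward, assuming each of K ensemble members has already produced a D-dimensional prediction of the next state from the current state and action. The function name, the (K, D) array layout, and the choice to average the per-dimension variance are illustrative assumptions.

```python
import numpy as np


def disagreement_reward(predictions):
    """Variance across ensemble members, averaged over feature dimensions.

    predictions: array of shape (K, D), one next-state prediction per member.
    """
    predictions = np.asarray(predictions, dtype=float)
    return float(predictions.var(axis=0).mean())
```

The inputs are predictions only, so the reward can be computed before the real next state is observed, which matches the slide's remark that the right-hand side does not depend on x_{t+1}.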

Prediction disagreement of an ensemble of models: Results
• Does not get trapped by the Noisy-TV problem (TV + remote)
• Takes an interest in objects (pushing, grasping) from intrinsic reward alone
  – 1.5k samples
  – No reward for touching objects
Deepak Pathak, "ICML 2019: Self-Supervised Exploration via Disagreement", YouTube, 2019. https://youtu.be/POlrWt32_ec
Pathak et al., "Self-Supervised Exploration via Disagreement", ICML 2019, 2019.

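Prediction disagreement of an ensemble of models: Avoiding the Noisy-TV problem
• Yellow: ICM, green: proposed method
• Less stochastic
  – Transitions from an image with label 0 to another image with label 0
• Highly stochastic
  – Transitions from an image with label 1 to a randomly chosen image with a label from 2 to 9
  – Corresponds to the Noisy-TV problem
Figure: Comparison of prediction error and ensemble disagreement in deterministic and stochastic environments
Deepak Pathak, "ICML 2020 Oral Talk: Planning to Explore via Self-Supervised World Models", YouTube, 2020. https://youtu.be/gan79mAVfq8
Pathak et al., "Self-Supervised Exploration via Disagreement", ICML 2019, 2019.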
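
The contrast between the two signals can be illustrated with a toy simulation rather than the MNIST setup on the slide. The assumptions are mine: the next observation is pure Gaussian noise, and each ensemble member is a constant predictor fit on its own data (its sample mean). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_per_model, noise_std = 5, 200, 1.0

# Noisy-TV-like transition: the next observation is pure noise around 0,
# so no model can predict individual outcomes.
data = rng.normal(0.0, noise_std, size=(K, n_per_model))

# Each ensemble member is "trained" on its own data; the least-squares
# constant predictor is simply the sample mean.
member_predictions = data.mean(axis=1)

# Prediction error of one member on fresh outcomes stays near noise_std**2.
fresh = rng.normal(0.0, noise_std, size=1000)
prediction_error = float(np.mean((fresh - member_predictions[0]) ** 2))

# Disagreement is the variance of the members' predictions; it shrinks as data grows.
disagreement = float(member_predictions.var())
```

In this toy setting the prediction error cannot fall below the noise variance, while the members converge to the same mean and their disagreement falls toward zero, which is why a disagreement-based reward loses interest in the noisy transition.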

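Agent57: Effect of intrinsic rewards (ICM+RND)
• Agent57 (March 2020): model-free reinforcement learning
  – Based on R2D2, adding intrinsic rewards (ICM+RND) and a meta-controller
• Above human level on all 57 games
• Roughly on par with R2D2 in median score
• Best in 5th percentile
  – Sort the 57 scores in ascending order and take the score at the bottom-5% position (position 2.8)
  – MuZero cannot solve some games at all, so its value is low
  – Some problems become solvable by using intrinsic rewards
Badia et al., "Agent57: Outperforming the Atari Human Benchmark", arXiv preprint arXiv:2003.13350, 2020.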
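
One way to read the "position 2.8" on this slide, offered as an assumption rather than the authors' stated procedure, is the zero-based interpolated rank 0.05 * (57 - 1) that NumPy's default linear percentile uses. The sketch below uses placeholder scores equal to their own rank so the position is visible directly.

```python
import numpy as np

# Placeholder scores 0..56 so that each value equals its zero-based rank.
ranks = np.arange(57, dtype=float)

# Default linear interpolation evaluates the sorted data at 0.05 * (57 - 1).
fifth_percentile = float(np.percentile(ranks, 5))
```

Under this reading the 5th percentile interpolates between the third and fourth lowest scores, so a few unsolved games are enough to pull it down.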

Plan2Explore: Overview
Deepak Pathak, "ICML 2020 Oral Talk: Planning to Explore via Self-Supervised World Models", YouTube, 2020. https://youtu.be/gan79mAVfq8
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Overview
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.
Deepak Pathak, "Planning to Explore via Self-Supervised World Models", YouTube, 2020. https://youtu.be/GftqnPWsCWw

Plan2Explore: Overview
• Task-agnostic exploration
  – World model
    • No task-specific extrinsic reward
    • Latent disagreement as the intrinsic reward
• Zero-shot/few-shot adaptation
  – Uses the world model to adapt quickly to multiple downstream tasks
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Purpose
• Challenges in reinforcement learning
  – Task specificity → learn a world model through task-agnostic exploration
  – Sample efficiency → adapt zero-shot/few-shot using the world model
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Method
• Planning with Dreamer (model-based reinforcement learning)
• Intrinsic reward generated from latent disagreement (prediction disagreement in latent space)
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Method (algorithm)
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Method (intrinsic reward)
• Ensemble predictors
  – K = 5, MLPs with 2 hidden layers
• Intrinsic reward
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.
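
The equation for the intrinsic reward did not survive in this transcript. As a hedged reconstruction based on the ensemble-variance idea described in this deck, the latent disagreement can be written as the variance across the K one-step predictors. The symbols \mu, w_k, s_t, and a_t reflect my reading of the paper's notation and should be checked against it.

```latex
\begin{aligned}
D(s_t, a_t) &= \operatorname{Var}\bigl(\{\mu(w_k, s_t, a_t) \mid k \in [1;K]\}\bigr)
             = \frac{1}{K-1} \sum_{k=1}^{K} \bigl(\mu(w_k, s_t, a_t) - \mu'\bigr)^2, \\
\mu' &= \frac{1}{K} \sum_{k=1}^{K} \mu(w_k, s_t, a_t)
\end{aligned}
```

Here \mu(w_k, s_t, a_t) is the mean prediction of the k-th ensemble member for the next latent feature and \mu' is the ensemble average.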

Plan2Explore: Results (DM Control Suite, zero-shot adaptation)
• Green is the proposed method
• Comparable to Dreamer (supervised, i.e., trained with task rewards; yellow)
• Exceeds Dreamer on Hopper Hop
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Results (DM Control Suite, zero-shot adaptation)
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.
Deepak Pathak, "Planning to Explore via Self-Supervised World Models", YouTube, 2020. https://youtu.be/GftqnPWsCWw

Plan2Explore: Results (DM Control Suite, few-shot adaptation)
• 1000 episodes of self-supervised exploration
• 150 episodes of supervised adaptation
• Catches up with Dreamer in about 20 episodes
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.

Plan2Explore: Results (DM Control Suite, few-shot adaptation)
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.
Deepak Pathak, "Planning to Explore via Self-Supervised World Models", YouTube, 2020. https://youtu.be/GftqnPWsCWw

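Plan2Explore: Summary
• Self-supervised learning
  – World model (task-agnostic exploration, latent disagreement as the intrinsic reward)
  – Zero-shot/few-shot adaptation using the world model
• Experiments on continuous control tasks
  – State of the art (SotA) among zero-shot learning methods
    • Competitive with Dreamer (a supervised method) on several tasks
  – As a few-shot learning method, also matches or exceeds supervised methods
• Scalable and highly sample-efficient across many different tasks
Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020.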

References
• Sekar et al., "Planning to Explore via Self-Supervised World Models", ICML 2020, 2020. https://arxiv.org/abs/2005.05960
• Pathak et al., "Self-Supervised Exploration via Disagreement", ICML 2019, 2019. https://arxiv.org/abs/1906.04161
• Burda et al., "Large-Scale Study of Curiosity-Driven Learning", ICLR 2019, 2019. https://openreview.net/forum?id=rJNwDjAqYX
• Burda et al., "Exploration by Random Network Distillation", ICLR 2019, 2019. https://openreview.net/forum?id=H1lJJnR5Ym
• Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction", ICML 2017, 2017. https://arxiv.org/abs/1705.05363
• Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation", NIPS 2016, 2016. https://arxiv.org/abs/1606.01868
• Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination", ICLR 2020, 2020. https://openreview.net/forum?id=S1lOTC4tDS
• Badia et al., "Agent57: Outperforming the Atari Human Benchmark", arXiv preprint arXiv:2003.13350, 2020. https://arxiv.org/abs/2003.13350
• Badia et al., "Never Give Up: Learning Directed Exploration Strategies", ICLR 2020, 2020. https://openreview.net/forum?id=Sye57xStvB
• Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017, 2017. https://arxiv.org/abs/1703.03400
• Baker et al., "Emergent Tool Use From Multi-Agent Autocurricula", ICLR 2020, 2020. https://openreview.net/forum?id=SkxpxJBKwS
• Chua et al., "Deep reinforcement learning in a handful of trials using probabilistic dynamics models", NIPS 2018, 2018. https://arxiv.org/abs/1805.12114
• Leibo et al., "Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research", arXiv preprint arXiv:1903.00742, 2019. https://arxiv.org/abs/1903.00742
• Sutton et al., "Reinforcement Learning: An Introduction second edition", MIT Press, 2018. http://incompleteideas.net/book/the-book-2nd.html
• Tassa et al., "dm_control: Software and Tasks for Continuous Control", arXiv preprint arXiv:2006.12983, 2020. https://arxiv.org/abs/2006.12983

Reference: Dreamer method
• A model-based reinforcement learning method that solves long-horizon tasks from image inputs purely by latent imagination (imagination in latent space)
https://www.slideshare.net/DeepLearningJP2016/dldream-to-control-learning-behaviors-by-latent-imagination-230172979
Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination", ICLR 2020, 2020.

Reference: Dreamer results
• Sample-efficient
• Comparable to a model-free method (D4PG, 2018)
• Dreamer averages 823 over 20 tasks
  – PlaNet: 332
  – D4PG: 786 at 1e8 steps
https://www.slideshare.net/DeepLearningJP2016/dldream-to-control-learning-behaviors-by-latent-imagination-230172979
Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination", ICLR 2020, 2020.

Reference: DeepMind Control Suite
Yuval Tassa, "DeepMind Control Suite", YouTube, 2018. https://youtu.be/rAai4QzcYbs
Tassa et al., "dm_control: Software and Tasks for Continuous Control", arXiv preprint arXiv:2006.12983, 2020. https://arxiv.org/abs/2006.12983
