[Paper Survey] Survey on Transformer and Online Reinforcement Learning

Survey on Transformer and Online Reinforcement Learning 1/12 2023/5/24
Transformers are Sample-Efficient World Models, Vincent Micheli, Eloi Alonso, François Fleuret (University of Geneva) [ICLR'23] (Cited by: 19)
Online Decision Transformer, Qinqing Zheng, Amy Zhang, Aditya Grover (Meta AI Research; University of California, Berkeley; University of California, Los Angeles) [ICML'22] (Cited by: 52)
Masked World Models for Visual Control, Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, Pieter Abbeel (KAIST; UC Berkeley; Google Research; University of Toronto) [CoRL'22] (Cited by: 19)

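Prerequisites 2/12
❏ Online and offline reinforcement learning
  ❏ Online reinforcement learning
    ❏ Learns while collecting data
    ❏ Interacts with the environment
  ❏ Offline reinforcement learning
    ❏ Learns from a dataset prepared in advance
    ❏ Does not interact with the environment
    ❏ Example: learning to clear a game task from gameplay commentary videos
Figure excerpted from the web page of "An Optimistic Perspective on Offline Reinforcement Learning"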
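The difference between the two settings can be sketched on a toy problem. The two-armed bandit, its payout probabilities, and all function names below are hypothetical choices made for illustration; they do not come from the surveyed papers. The online learner calls `env.step` while it learns, whereas the offline learner only reads a dataset logged beforehand.

```python
import random

class TwoArmedBandit:
    """Hypothetical toy environment: arm 1 pays off more often than arm 0."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.win_prob = [0.2, 0.8]

    def step(self, action):
        return 1.0 if self.rng.random() < self.win_prob[action] else 0.0

def update(q, counts, action, reward):
    """Incremental mean update of the action-value estimate."""
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]

def train_online(env, steps=2000, epsilon=0.1, seed=1):
    """Online RL: data is collected by the current policy while it learns."""
    rng = random.Random(seed)
    q, counts = [0.0, 0.0], [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)                      # explore
        else:
            action = max(range(2), key=lambda a: q[a])     # exploit
        reward = env.step(action)                          # environment interaction
        update(q, counts, action, reward)
    return q

def collect_dataset(env, steps=2000, seed=2):
    """Data logged in advance by another policy (here: uniformly random)."""
    rng = random.Random(seed)
    data = []
    for _ in range(steps):
        action = rng.randrange(2)
        data.append((action, env.step(action)))
    return data

def train_offline(dataset):
    """Offline RL: learn from the fixed dataset; env.step is never called."""
    q, counts = [0.0, 0.0], [0, 0]
    for action, reward in dataset:
        update(q, counts, action, reward)
    return q

q_online = train_online(TwoArmedBandit(seed=0))
q_offline = train_offline(collect_dataset(TwoArmedBandit(seed=0)))
```

Both learners should end up preferring arm 1. The difference is only where the data comes from.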

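Background 3/12
❏ Challenges of reinforcement learning
  ❏ Achieves high performance, but needs a large amount of data during training
  ❏ Poor sample efficiency
❏ World Models [NIPS'18]
  ❏ The policy is learned inside the world model (in imagination)
  ❏ Sample-efficient, because imagined trials can be repeated as often as needed without further real-environment interaction
Figure excerpted from "AGIRobots"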
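A minimal sketch of learning a policy in imagination, assuming a hypothetical 5-state chain environment and a tabular world model (a lookup table of observed transitions). This is a Dyna-style illustration of the idea, not the method of the World Models paper: a limited amount of real data builds the model, and all Q-learning updates then run on the model only.

```python
import random

N_STATES, GOAL = 5, 4

def real_step(state, action):
    """Hypothetical real environment: action 1 moves right, action 0 moves left."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def collect_real_data(episodes=20, max_steps=50, seed=0):
    """Build a tabular world model from a limited number of real transitions."""
    rng = random.Random(seed)
    model, n_real = {}, 0
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.randrange(2)
            s2, r, done = real_step(s, a)
            model[(s, a)] = (s2, r, done)   # remember what the environment did
            n_real += 1
            s = s2
            if done:
                break
    return model, n_real

def learn_in_imagination(model, updates=5000, gamma=0.9, alpha=0.5, seed=1):
    """Q-learning on transitions replayed from the model; real_step is never called."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in range(2)}
    seen = sorted(model)
    for _ in range(updates):
        s, a = rng.choice(seen)
        s2, r, done = model[(s, a)]
        target = r if done else r + gamma * max(q[(s2, 0)], q[(s2, 1)])
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q

model, n_real = collect_real_data()
q = learn_in_imagination(model)
policy = [max((0, 1), key=lambda a: q[(s, a)]) for s in range(GOAL)]
```

Here the number of imagined updates is larger than the number of real transitions. That gap is the sample-efficiency argument made on the slide.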

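Transformers are Sample-Efficient World Models, Vincent Micheli, Eloi Alonso, François Fleuret (University of Geneva) [ICLR'23] (Cited by: 19) 4/12
❏ Research goal
  ❏ Apply the Transformer to build a highly accurate world model
  ❏ Use the resulting world model for high-performing model-based reinforcement learning
❏ Overview
  ❏ Proposes a world model that combines a discrete autoencoder with a Transformer
  ❏ Achieves high performance on the Atari 100k benchmark with model-based reinforcement learning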

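Transformers are Sample-Efficient World Models 5/12
❏ Overview of the proposed world model
  1. The encoder E converts the initial frame x_0 into tokens z_0
  2. The decoder D reconstructs the tokens z_t into an image x̂_t
  3. The policy π samples an action a_t from the reconstructed image x̂_t
  4. The Transformer predicts the reward r̂_t, the episode termination d̂_t, and the next tokens z_{t+1}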
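The four steps above can be sketched as a rollout loop. Every component below (`toy_encoder`, `toy_decoder`, `make_toy_policy`, `toy_world_model`) is a hypothetical stand-in with made-up behaviour, used only to make the control flow runnable; the real components are neural networks and are not reproduced here.

```python
import random

def imagine_rollout(encoder, decoder, policy, world_model, x0, horizon):
    """Roll out an imagined trajectory following steps 1 to 4 of the slide."""
    z = encoder(x0)                       # step 1: initial frame -> tokens
    token_history, action_history = [z], []
    trajectory = []
    for _ in range(horizon):
        x_hat = decoder(z)                # step 2: tokens -> reconstructed image
        action = policy(x_hat)            # step 3: policy acts on the reconstruction
        action_history.append(action)
        # step 4: predict reward, termination, and next tokens
        r_hat, d_hat, z_next = world_model(token_history, action_history)
        trajectory.append((x_hat, action, r_hat, d_hat))
        if d_hat:
            break
        z = z_next
        token_history.append(z)
    return trajectory

# Hypothetical toy components (assumptions for illustration only).
def toy_encoder(frame):
    return tuple(pixel // 64 for pixel in frame)       # 4 discrete token values

def toy_decoder(tokens):
    return [t * 64 + 32 for t in tokens]               # centre of each bin

def make_toy_policy(seed=0):
    rng = random.Random(seed)
    return lambda frame: rng.randrange(2)

def toy_world_model(token_history, action_history):
    z, a = token_history[-1], action_history[-1]
    z_next = z[1:] + z[:1] if a == 1 else z            # action 1 rotates the tokens
    reward = 1.0 if z_next[0] == 3 else 0.0
    done = len(action_history) >= 5                    # toy episodes end after 5 steps
    return reward, done, z_next

trajectory = imagine_rollout(toy_encoder, toy_decoder, make_toy_policy(),
                             toy_world_model, x0=[0, 70, 130, 250], horizon=10)
```

Only the first frame `x0` comes from outside the loop. Every later frame the policy sees is a decoded prediction.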

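Transformers are Sample-Efficient World Models 6/12
❏ Experimental results
  ❏ Compared methods without lookahead search
    ❏ SimPLe, CURL, DrQ, SPR
  ❏ Compared methods with lookahead search
    ❏ MuZero, EfficientZero
  ❏ In Frostbite, some events are rare, so they cannot be experienced often enough in imagination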

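Transformers are Sample-Efficient World Models 7/12
❏ Analysis of world model performance (Pong, Breakout)
  ❏ Yellow box: frames where a positive reward is predicted
  ❏ Red box: frames where the end of the episode is predicted
Figure labels: information from the real environment; imagined results of the world model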

Online Decision Transformer, Qinqing Zheng, Amy Zhang, Aditya Grover (Meta AI Research; University of California, Berkeley; University of California, Los Angeles) [ICML'22] (Cited by: 52) 8/12
❏ Overview
  ❏ A Decision Transformer pretrained offline and then fine-tuned online
  ❏ Performs better than Decision Transformer on many tasks

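Masked World Models for Visual Control, Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, Pieter Abbeel (KAIST; UC Berkeley; Google Research; University of Toronto) [CoRL'22] (Cited by: 19) 9/12
❏ Observations are embedded with a Masked Autoencoder
❏ With the following changes:
  ❏ Patches are not cut directly from the image; the image is first embedded with convolutions and then split into patches
  ❏ The reward is predicted in addition to the reconstruction
Figure: [Masked Autoencoder]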

Masked World Models for Visual Control 10/12
❏ Experimental results (Meta-World)
  ❏ Improves over DreamerV2 in both performance and sample efficiency
Figure label: tasks that involve small objects

Masked World Models for Visual Control 11/12
❏ Ablation Studies
  ❏ Best performance with 75% feature masking combined with reward prediction
    ❏ Masking features instead of the image directly improves performance
    ❏ A 75% mask ratio gives the best performance
    ❏ Reward prediction improves performance
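A minimal sketch of masking convolutional features rather than raw image patches, using the 75% ratio reported in the ablation. The single-channel `conv_stem`, its averaging kernel, the 8x8 image, and the helper names are hypothetical simplifications; the actual model uses learned convolutions and a vision transformer, which are not shown.

```python
import random

def conv_stem(image, kernel, stride):
    """Toy strided 2D convolution (single channel, no padding)."""
    k = len(kernel)
    features = []
    for i in range(0, len(image) - k + 1, stride):
        row = []
        for j in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(kernel[di][dj] * image[i + di][j + dj]
                           for di in range(k) for dj in range(k)))
        features.append(row)
    return features

def split_masked_visible(num_tokens, mask_ratio=0.75, seed=0):
    """Randomly choose which feature tokens are hidden from the encoder."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])

# Hypothetical 8x8 single-channel observation.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
kernel = [[0.25, 0.25], [0.25, 0.25]]

feature_map = conv_stem(image, kernel, stride=2)     # 1) embed with convolution
tokens = [v for row in feature_map for v in row]     # 2) flatten into feature tokens
visible_idx, masked_idx = split_masked_visible(len(tokens))  # 3) mask 75% of them
encoder_input = [tokens[i] for i in visible_idx]     # only visible tokens are encoded
```

In this sketch the encoder would receive 4 of the 16 feature tokens. Reconstructing the frame and predicting the reward from them are the training targets named on the slides.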

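Summary 12/12
Transformers are Sample-Efficient World Models: a world model built with a Transformer
Online Decision Transformer: DT + stochastic policy + entropy maximization
Masked World Models for Visual Control: MAE for the image representation
❏ Trends and outlook
  ❏ Methods are likely to move toward difficult tasks with high computational cost
  ❏ Auxiliary predictions from other modalities, such as audio, are likely to be introduced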

Transformers are Sample-Efficient World Models 13/12
❏ Discrete autoencoder
  1. Encoder E: a CNN converts the input image x_t into the output y_t
  2. Decoder D: a CNN decoder reconstructs the tokens into an image x̂_t
❏ Training the discrete autoencoder
  ❏ Uses the collected frame data
  ❏ Loss function: L2 reconstruction, commitment, and perceptual losses, weighted equally
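A minimal sketch of the equally weighted loss, assuming a nearest-neighbour codebook lookup for the discretization step. The codebook, the `toy_features` extractor standing in for a perceptual network, and all function names are hypothetical; without automatic differentiation the stop-gradient structure of the commitment term is not represented, only its value.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(y, codebook):
    """Map each encoder output vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
            for vec in y]

def autoencoder_loss(x, x_hat, y, codebook, tokens, features):
    """Equally weighted sum of reconstruction, commitment, and perceptual terms."""
    recon = sq_dist(x, x_hat) / len(x)                       # L2 reconstruction
    commit = sum(sq_dist(vec, codebook[t])                   # encoder output vs. code
                 for vec, t in zip(y, tokens)) / len(y)
    fx, fx_hat = features(x), features(x_hat)
    percep = sq_dist(fx, fx_hat) / len(fx)                   # distance in feature space
    return recon + commit + percep

def toy_features(image):
    """Hypothetical stand-in for a perceptual feature extractor."""
    return [b - a for a, b in zip(image, image[1:])]

codebook = [[0.0, 0.0], [1.0, 1.0]]          # hypothetical 2-entry codebook
y = [[0.1, -0.1], [0.9, 1.2]]                # hypothetical encoder outputs
tokens = quantize(y, codebook)
x = [0.0, 1.0, 2.0, 3.0]
loss = autoencoder_loss(x, x_hat=x, y=y, codebook=codebook,
                        tokens=tokens, features=toy_features)
```

With a perfect reconstruction (`x_hat` equal to `x`), only the commitment term contributes to the loss.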

Transformers are Sample-Efficient World Models 14/12
❏ Transformer: G
  ❏ Learns a state transition model in latent space from the tokens produced by the discrete autoencoder
❏ Training the Transformer
  ❏ Loss function: cross-entropy for transition and termination; cross-entropy or mean squared error for reward
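The loss described above can be sketched for a single prediction step. The function names, the argument layout, and the `reward_as_class` switch are hypothetical; the sketch only shows how the three terms combine, with the reward term chosen as either cross-entropy or squared error.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def world_model_loss(token_logits, next_token, done_logits, done,
                     reward_pred, reward, reward_as_class=False):
    """Sum of transition, termination, and reward losses for one step."""
    loss = cross_entropy(token_logits, next_token)        # transition
    loss += cross_entropy(done_logits, int(done))         # termination
    if reward_as_class:
        loss += cross_entropy(reward_pred, reward)        # reward is a class index
    else:
        loss += (reward_pred - reward) ** 2               # squared error on the scalar
    return loss

loss_mse = world_model_loss(token_logits=[0.0, 0.0, 0.0, 0.0], next_token=2,
                            done_logits=[0.0, 0.0], done=False,
                            reward_pred=0.5, reward=1.0)
loss_ce = world_model_loss(token_logits=[0.0, 0.0, 0.0, 0.0], next_token=2,
                           done_logits=[0.0, 0.0], done=False,
                           reward_pred=[0.0, 0.0, 0.0], reward=1,
                           reward_as_class=True)
```

With uniform logits the cross-entropy terms reduce to the log of the number of classes, which makes the example easy to check by hand.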

tt1717
