OpenTalks.AI

Sample efficient Reinforcement Learning
Aleksandr I. Panov and Alexey Skrynnik
Artificial Intelligence Research Institute FRC CSC RAS
Moscow Institute of Physics and Technology

Sample efficiency in RL
• Deep RL has led to breakthroughs in many difficult domains: Atari, Go, Dota 2, Starcraft...
• But SOTA RL algorithms require an exponentially increasing number of samples
• We can't apply them to real-world problems, where environment samples are expensive: robotic manipulation, self-driving...
• Main reason: we should not use RL in isolation from the full agent architecture
J. E. Laird. The Soar Cognitive Architecture. MIT Press, 2012.
S. Emel’yanov et al. Multilayer cognitive architecture for UAV control. Cogn. Syst. Res., 2016.

Promising approaches
• More realistic simulators: Gym Robotics, Habitat, NVIDIA Isaac
• Hierarchical Reinforcement Learning:
  ◦ S. Levine et al. Data-Efficient Hierarchical Reinforcement Learning. NIPS 2018
  ◦ A. Skrynnik and A. I. Panov. Hierarchical Reinforcement Learning with Clustering Abstract Machines. RCAI 2019
• Imitation Learning and Learning from Demonstrations:
  ◦ Y. Gao et al. Reinforcement Learning from Imperfect Demonstrations. 2018
  ◦ W. H. Guss et al. The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. NIPS 2019
• Memory-based Reinforcement Learning:
  ◦ S. Levine et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. NIPS 2018
  ◦ A. Younes and A. I. Panov. Toward Faster Reinforcement Learning for Robotics: Using Gaussian Processes. RAAI Summer School 2019
• Lifelong learning and Transfer Learning

Motivation
• DQN, A3C and Rainbow DQN have been applied to Atari 2600 games and require from 44 to over 200 million frames (200 to over 900 hours) to achieve human-level performance
• OpenAI Five utilizes 11,000+ years of Dota 2 gameplay
• AlphaZero uses 4.9 million games of self-play in Go
• AlphaStar uses 200 years of Starcraft II gameplay

MineRL: Sample-efficient reinforcement learning in Minecraft

Competition: Round 1
1. Participants train their agents to play Minecraft. During the round, they submit trained models for evaluation to determine leaderboard ranks.
2. At the end of the round, participants submit source code. The models at the top of the leaderboard are re-trained (from scratch) for four days to compute the final score used for ranking.
3. Top 10 12 move on to Round 2.
https://www.aicrowd.com/challenges/neurips-2019-minerl-competition

Round 2
1. Participants may submit code up to four times. Each submission is trained for four days to compute its score. The final ranking is based on the best submission of each participant.
2. The top participants will present their work at a workshop at NeurIPS 2019.
3. New texture pack and random rotation actions.

MineRL Domain
• Based on Malmo Project
• Environment is rather slow (~40 steps per second)
• 2-3 times slower on a headless server
• xvfb-run works only with Docker
• Problems with parallelization

Baselines
Figure: Treechop baselines (ours), with reduction of action space and discretization
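The slide names action-space reduction and discretization but does not list the actions that were kept. The following is a minimal sketch of the general idea only: a small table of discrete indices, each expanded into a dictionary-style action. The key names, the 10-degree camera step, and the particular set of seven actions are illustrative assumptions, not the authors' configuration.

```python
import copy

# Hypothetical no-op template. The keys imitate a dictionary-style action
# space; "camera" is assumed to hold [pitch, yaw] deltas in degrees.
NOOP = {"forward": 0, "jump": 0, "attack": 0, "camera": [0.0, 0.0]}

# Assumed reduced action set: each entry overrides a few keys of NOOP.
DISCRETE_ACTIONS = [
    {"attack": 1},                           # 0: attack in place
    {"forward": 1, "attack": 1},             # 1: move forward while attacking
    {"forward": 1, "jump": 1},               # 2: jump forward
    {"camera": [0.0, -10.0], "attack": 1},   # 3: yaw by -10 degrees
    {"camera": [0.0, 10.0], "attack": 1},    # 4: yaw by +10 degrees
    {"camera": [-10.0, 0.0], "attack": 1},   # 5: pitch by -10 degrees
    {"camera": [10.0, 0.0], "attack": 1},    # 6: pitch by +10 degrees
]


def to_env_action(index):
    """Expand a discrete action index into a full action dictionary."""
    action = copy.deepcopy(NOOP)
    action.update(copy.deepcopy(DISCRETE_ACTIONS[index]))
    return action
```

With such a table the Q-network needs only `len(DISCRETE_ACTIONS)` outputs, and continuous camera control is replaced by a few fixed rotations.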

Deep Q-learning from Demonstrations (DQfD)
Loss terms: DQN loss, N-step return loss, margin loss, L2 regularization loss
Hester et al. "Deep Q-learning from Demonstrations", 2018.
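For reference, the four terms named on this slide combine as follows in Hester et al.; the formula on the slide itself did not survive extraction, so this is the form given in the cited paper rather than a transcription of the slide.

```latex
% Overall DQfD loss: DQN loss, N-step return loss, margin loss, L2 regularization
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)

% Large-margin classification loss on a demonstrated action a_E in state s,
% where l(a_E, a) = 0 if a = a_E and a positive margin otherwise
J_E(Q) = \max_{a \in A} \left[ Q(s, a) + l(a_E, a) \right] - Q(s, a_E)
```

The margin term pushes the value of the demonstrated action above the values of all other actions by at least the margin, and it is applied only to transitions that come from demonstrations.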

Deep Q-learning from Demonstrations (DQfD)
Figure: the two phases, pre-train and train

Option extraction
Trajectory Segmentation (demonstrations): Log, Plank, Stick, Wooden pickaxe, Cobblestone, Stone pickaxe, Iron ore, Furnace, Iron ingot, Iron pickaxe, ...
Extracted options:
• place: crafting table
• nearby craft: wooden pickaxe
• craft: planks
• craft: crafting table
• craft: stick
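The slide does not say how the demonstrations are cut into segments. One plausible reading, sketched below under that assumption, is to split a trajectory at the step where each item of interest first appears in the inventory. The function name, the per-step inventory dictionaries, and the half-open index convention are all hypothetical.

```python
def segment_by_first_item(inventories, items):
    """Split a trajectory into (item, start, end) segments.

    inventories: one dict per step, mapping item name to count.
    items: item names of interest.
    Each segment is the half-open step range [start, end) that ends right
    after the step on which the item count first becomes positive. If two
    items first appear on the same step, the later one gets an empty range.
    """
    segments = []
    obtained = set()
    start = 0
    for step, inventory in enumerate(inventories):
        for item in items:
            if item not in obtained and inventory.get(item, 0) > 0:
                obtained.add(item)
                segments.append((item, start, step + 1))
                start = step + 1
    return segments


# Toy demonstration: a log is obtained, then planks, then sticks.
demo = [
    {},
    {"log": 1},
    {"log": 1},
    {"log": 0, "planks": 4},
    {"planks": 2, "stick": 4},
]
segments = segment_by_first_item(demo, ["log", "planks", "stick"])
```

Each segment can then be treated as demonstration data for a separate per-item agent, which is one way to read "reduction of state space via extracting subtasks".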

Deep Q-learning from Imperfect Demonstrations
Figure: training scheme of the log agent. Recoverable labels: MineRLTreechop demonstrations, MineRLObtainDiamondDense, discretization, treechop demo replay, obtain diamond, agent replay, log agent, interaction, storing data, sampling (in proportions ⍴ and 1-⍴), training batch, update, weights, forgetting (⍴ → 1)
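The labels ⍴, 1-⍴ and "forgetting ⍴ → 1" suggest that each training batch mixes two replay buffers and that the mix shifts over time. The sketch below assumes that ⍴ is the share drawn from the agent's own replay, so that ⍴ → 1 gradually phases out the demonstrations; the slide does not spell out this assignment, and the linear schedule and function names are hypothetical.

```python
import random


def sample_mixed_batch(agent_replay, demo_replay, batch_size, rho, rng=random):
    """Draw a batch with a share rho from agent_replay and the rest from demo_replay.

    Assumption: rho is the agent-replay share. While the agent replay is
    still empty, the whole batch comes from the demonstrations.
    """
    n_agent = round(batch_size * rho) if agent_replay else 0
    n_demo = batch_size - n_agent
    if n_demo > 0 and not demo_replay:
        raise ValueError("demo_replay is empty but demonstration samples are needed")
    batch = [rng.choice(agent_replay) for _ in range(n_agent)]
    batch += [rng.choice(demo_replay) for _ in range(n_demo)]
    return batch


def rho_schedule(step, total_steps):
    """Hypothetical linear 'forgetting' schedule: rho grows from 0 to 1."""
    return min(1.0, max(0.0, step / total_steps))
```

Early in training the batch is dominated by demonstrations; as `rho_schedule` approaches 1, the agent learns almost entirely from its own interaction data.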

The log agent results from Round 2

ObtainIronPickaxe: <item> agents training
Figure: training scheme of the <item> agents. Recoverable labels: marking up trajectories, discretization, <item> replay, other <items> replay, sampling (0.5 / 0.5), training batch, <item> agent, ...
• The log agent is already trained
• No interaction with the environment!
• R → 0, λ2 → 0 (margin)

Round 2 results

The ingredients of a sample-efficient solution
• Proper discretization of the agent’s actions and reduction of the action space
• Reduction of the state space via extracting subtasks
• Structured replay buffer for <item> agents
• Semantic network of meta-actions
• DQfD with forgetting

Thank you for your attention
