OpenTalks.AI - Aleksandr Panov, Alexey Skrynnik, Efficient Reinforcement Learning

OpenTalks.AI
February 21, 2020

Transcript

  1. Sample efficient Reinforcement Learning

    Aleksandr I. Panov and Alexey Skrynnik
    Artificial Intelligence Research Institute FRC CSC RAS
    Moscow Institute of Physics and Technology
  2. Sample efficiency in RL

    • Deep RL has led to breakthroughs in many difficult domains: Atari, Go, Dota 2, Starcraft...
    • But SOTA RL algorithms require an exponentially increasing number of samples
    • We can't apply them in real-world problems, where environment samples are expensive: robotic manipulation, self-driving...
    • Main reason: we should not use RL in isolation from the full agent architecture
      ◦ J. E. Laird. The Soar Cognitive Architecture. MIT Press, 2012.
      ◦ S. Emel’yanov et al. Multilayer cognitive architecture for UAV control. Cogn. Syst. Res., 2016.
  3. Promising approaches

    • More realistic simulators: Gym Robotics, Habitat, NVIDIA Isaac
    • Hierarchical Reinforcement Learning:
      ◦ S. Levine et al. Data-Efficient Hierarchical Reinforcement Learning. NIPS 2018
      ◦ A. Skrynnik and A. I. Panov. Hierarchical Reinforcement Learning with Clustering Abstract Machines. RCAI 2019
    • Imitation Learning and Learning from Demonstrations:
      ◦ Y. Gao et al. Reinforcement Learning from Imperfect Demonstrations. 2018.
      ◦ W. H. Guss et al. The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. NIPS 2019.
    • Memory-based Reinforcement Learning:
      ◦ S. Levine et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. NIPS 2018
      ◦ A. Younes and A. I. Panov. Toward Faster Reinforcement Learning for Robotics: Using Gaussian Processes. RAAI Summer School 2019
    • Lifelong learning and Transfer Learning
  4. Motivation

    • DQN, A3C and Rainbow DQN have been applied to ATARI 2600 games and require from 44 to over 200 million frames (200 to over 900 hours) to achieve human-level performance
    • OpenAI Five utilizes 11,000+ years of Dota 2 gameplay
    • AlphaZero uses 4.9 million games of self-play in Go
    • AlphaStar uses 200 years of Starcraft II gameplay

    MineRL: Sample-efficient reinforcement learning in Minecraft
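
The frame counts and the hour figures on the Motivation slide can be related by a simple conversion. The sketch below assumes real-time play at 60 frames per second; that rate is an assumption and is not stated on the slide, and `frames_to_hours` is a hypothetical helper name.

```python
ATARI_FPS = 60  # assumption: frames per second of real-time play


def frames_to_hours(frames, fps=ATARI_FPS):
    """Convert a number of game frames into hours of real-time play."""
    seconds = frames / fps
    return seconds / 3600


for frames in (44_000_000, 200_000_000):
    print(f"{frames:,} frames = {frames_to_hours(frames):.0f} hours")
# 44,000,000 frames = 204 hours
# 200,000,000 frames = 926 hours
```

Under that assumption the results agree with the slide's range of 200 to over 900 hours.
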
  5. Competition

    Round 1
    1. Participants train their agents to play Minecraft. During the round, they submit trained models for evaluation to determine leaderboard ranks.
    2. At the end of the round, participants submit source code. The models at the top of the leaderboard are re-trained (from scratch) for four days to compute the final score used for ranking.
    3. Top 10 12 move on to Round 2.

    Round 2
    1. Participants may submit code up to four times. Each submission is trained for four days to compute score. Final ranking is based on best submission for each participant.
    2. The top participants will present their work at a workshop at NeurIPS 2019.
    3. New texture pack and random rotation actions.

    https://www.aicrowd.com/challenges/neurips-2019-minerl-competition
  6. MineRL Domain

    • Based on Malmo Project
    • Environment is rather slow (~40 steps per second)
    • 2-3 times slower on headless server
    • xvfb-run works only with docker
    • Problems with parallelization
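
To give a feeling for what ~40 steps per second means in wall-clock time, here is a rough estimate. The budget of one million environment steps is a hypothetical figure chosen only for illustration, and `collection_hours` is a hypothetical helper.

```python
def collection_hours(num_steps, steps_per_second=40.0, slowdown=1.0):
    """Hours needed to collect num_steps from a single environment instance.

    slowdown > 1 models a slower setup, e.g. 2-3 for a headless server.
    """
    return num_steps * slowdown / steps_per_second / 3600


budget = 1_000_000  # hypothetical sample budget, not taken from the talk
for slowdown in (1.0, 2.0, 3.0):
    hours = collection_hours(budget, slowdown=slowdown)
    print(f"slowdown x{slowdown:.0f}: {hours:.1f} hours")
```

With a single environment instance the time grows linearly with the slowdown, which is why the parallelization problems listed on the slide matter in practice.
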
  7. Baselines

    Treechop baselines (ours): reduction of action space and discretization
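
One way to picture "reduction of action space and discretization" is a small table of predefined composite actions that replaces the full dictionary action space. The sketch below is illustrative only: the key names follow the MineRL action dictionary as I understand it (with `camera` assumed to hold `[pitch, yaw]` deltas in degrees), and the particular set of six actions is hypothetical, not the authors' actual choice.

```python
# Default values for every key the reduced agent controls (hypothetical subset).
NOOP = {"attack": 0, "forward": 0, "jump": 0, "camera": [0.0, 0.0]}

# Hypothetical reduced action set: each entry overrides part of NOOP.
DISCRETE_ACTIONS = [
    {"attack": 1},                           # 0: attack in place
    {"forward": 1, "attack": 1},             # 1: move forward while attacking
    {"forward": 1, "jump": 1},               # 2: jump forward
    {"camera": [0.0, -10.0], "attack": 1},   # 3: change yaw by -10 degrees
    {"camera": [0.0, 10.0], "attack": 1},    # 4: change yaw by +10 degrees
    {"camera": [10.0, 0.0], "attack": 1},    # 5: change pitch by +10 degrees
]


def to_env_action(index):
    """Map a discrete action index to a full action dictionary."""
    action = dict(NOOP)
    action["camera"] = list(NOOP["camera"])  # copy so NOOP is never mutated
    action.update(DISCRETE_ACTIONS[index])
    return action


print(to_env_action(1))
```

A Q-network then only needs `len(DISCRETE_ACTIONS)` outputs instead of handling a mixed discrete and continuous action space.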

  8. Deep Q-learning from Demonstrations (DQfD)

    Loss terms: margin loss, N-step return loss, DQN loss, L2 regularization loss
    Hester et al. "Deep Q-learning from Demonstrations", 2018.
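
In Hester et al., the four terms are combined as J(Q) = J_DQ(Q) + λ1 J_n(Q) + λ2 J_E(Q) + λ3 J_L2(Q), where J_E is the large-margin loss on demonstration data. A minimal sketch for a single state follows; the default margin and weights are placeholder values, and the function names are hypothetical.

```python
def margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin loss J_E for one demonstration state.

    J_E = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) is 0 for the expert action and `margin` otherwise.
    """
    augmented = [
        q + (0.0 if a == expert_action else margin)
        for a, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


def dqfd_loss(j_dq, j_n, j_e, j_l2, lam1=1.0, lam2=1.0, lam3=1e-5):
    """Weighted sum of the DQN, N-step, margin and L2 terms (placeholder weights)."""
    return j_dq + lam1 * j_n + lam2 * j_e + lam3 * j_l2


q = [1.0, 2.0, 0.5]
print(margin_loss(q, expert_action=1))  # expert action already wins by the margin
print(margin_loss(q, expert_action=0))  # expert action is under-valued
```

The margin term is zero only when the expert action's Q-value exceeds every other action's value by at least the margin, which pushes the greedy policy towards the demonstrated actions.
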
  9. Deep Q-learning from Demonstrations (DQfD)

    Phases: pre-train, train

  10. Option extraction

    [Diagram: Trajectory Segmentation of demonstrations]
    • Action labels: place: crafting table; nearby craft: wooden pickaxe; craft: planks; craft: crafting table; craft: stick
    • Item labels: Log, Plank, Stick, Wooden pickaxe, Cobblestone, Stone pickaxe, Iron ore, Furnace, Iron ingot, Iron pickaxe, ...
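
Trajectory segmentation of this kind can be sketched as cutting a demonstration at every step where a new item first appears in the inventory, so that each segment becomes training data for one subtask. The input format (one inventory dictionary per timestep) and the function name are assumptions made for illustration; the slide does not describe the authors' exact procedure.

```python
def segment_by_new_items(inventories):
    """Split a trajectory at the steps where an item is obtained for the first time.

    inventories: list of dicts (item name -> count), one per timestep.
    Returns a list of (item, start, end) tuples with `end` exclusive.
    """
    segments = []
    seen = set()
    start = 0
    for t, inventory in enumerate(inventories):
        new_items = [
            name for name, count in inventory.items()
            if count > 0 and name not in seen
        ]
        for name in new_items:
            seen.add(name)
            segments.append((name, start, t + 1))
        if new_items:
            start = t + 1  # the next segment begins after this step
    return segments


demo = [
    {"log": 0},
    {"log": 1},
    {"log": 1, "planks": 0},
    {"log": 0, "planks": 4},
    {"log": 0, "planks": 2, "stick": 4},
]
print(segment_by_new_items(demo))
# [('log', 0, 2), ('planks', 2, 4), ('stick', 4, 5)]
```
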
  11. Deep Q-learning from Imperfect Demonstrations

    [Diagram: training of the log agent. Labels: MineRLTreechop, MineRLObtainDiamondDense, demonstrations, discretization, treechop demo replay, obtain diamond, agent replay, sampling, training batch (shares ρ and 1-ρ), update weights, log agent, interaction, storing data, forgetting (ρ→1)]
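
The batch composition with shares ρ and 1-ρ can be sketched as follows. This is only an illustration under two assumptions that the flattened diagram does not confirm: that ρ is the share of samples taken from the agent's own replay, and that forgetting is a linear increase of ρ to 1. All names are hypothetical.

```python
import random


def forgetting_rho(step, forget_steps):
    """Linear schedule: the agent share rho grows from 0 to 1 over forget_steps."""
    return min(1.0, step / forget_steps)


def sample_mixed_batch(agent_replay, demo_replay, batch_size, rho, rng=random):
    """Sample a batch with share rho from agent_replay and 1 - rho from demo_replay."""
    n_agent = round(rho * batch_size) if agent_replay else 0
    n_demo = batch_size - n_agent
    batch = []
    if n_agent > 0:
        batch += rng.choices(agent_replay, k=n_agent)  # sampling with replacement
    if n_demo > 0:
        batch += rng.choices(demo_replay, k=n_demo)
    return batch


rng = random.Random(0)
agent_replay = [("agent", i) for i in range(100)]
demo_replay = [("demo", i) for i in range(100)]

rho = forgetting_rho(step=250, forget_steps=1000)  # 0.25
batch = sample_mixed_batch(agent_replay, demo_replay, 32, rho, rng)
print(sum(1 for source, _ in batch if source == "agent"), "agent samples of", len(batch))
```

Once ρ reaches 1 the batch contains only the agent's own transitions, so imperfect demonstrations stop influencing the updates.
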
  12. The log agent results from Round 2

  13. <item> agents training

    [Diagram labels: ObtainIronPickaxe, discretization, marking up trajectories, <item> replay, other <items> replay, sampling (0.5 / 0.5), training batch, <item> agent]
    • log agent is already trained
    • ...
    • no interaction with the environment!
    • R → 0, λ2 → 0 (margin)
  14. Round 2 results

  15. The ingredients of a sample-efficient solution

    • Proper discretization of agent’s actions and reduction of action space
    • Reduction of state space via extracting subtasks
    • Structured replay buffer for <item> agents
    • Semantic network of meta-actions
    • DQfD with forgetting
  16. Thank you for your attention

    panov.ai@mipt.ru