
MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics (IJCAI'20)


Mohammadamin Barekatain, Ryo Yonetani, Masashi Hamaya, "MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics", IJCAI'20

paper: https://www.ijcai.org/Proceedings/2020/430
blog: https://medium.com/sinicx/multipolar-multi-source-policy-aggregation-for-transfer-reinforcement-learning-between-diverse-bc42a152b0f5
code: https://github.com/Mohammadamin-Barekatain/multipolar

OMRON SINIC X

March 30, 2021


Transcript

  1. © 2021 OMRON SINIC X Corporation. All Rights Reserved.
    MULTIPOLAR: Multi-source Policy Aggregation
    for Transfer Reinforcement Learning between
    Diverse Environmental Dynamics (IJCAI’20)
    Mohammadamin Barekatain (TUM), Ryo Yonetani, and Masashi Hamaya (OMRON SINIC X)
    International Joint Conference on Artificial Intelligence (IJCAI 2020)
    Jan. 13, 2021


  2. MULTIPOLAR:
    Multi-source Policy Aggregation for Transfer RL
    between Diverse Environmental Dynamics
    Mohammadamin Barekatain
    Technical University of Munich
    Ryo Yonetani
    OMRON SINIC X
    Masashi Hamaya
    OMRON SINIC X
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  3. Objective: Transfer RL between Heterogeneous Dynamics
    Video of real robot (two clips)
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.
    *See our presentation on YouTube https://youtu.be/adUnIj83RtU for video materials


  4. Problem Setting
    • Transfer from multiple source environment instances to a single target instance
    • Each instance shares the same state/action spaces and reward function but differs in state-transition dynamics (see the sketch below)
    • Only the source policies are shared with the target agent
    • No prior knowledge about the source policies or the source environmental dynamics is accessible


    Source agents Target agent
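    As an illustration of what "same spaces and reward, different dynamics" means in practice, here is a minimal Python sketch that samples Gym CartPole instances with perturbed physical parameters (the function name and parameter ranges are our assumptions, not the paper's):

        import gym
        import numpy as np

        def make_cartpole_instance(seed):
            """Sample a CartPole instance with unique state-transition dynamics.

            Observation/action spaces and the reward function stay the same;
            only the physics differs. Parameter ranges are illustrative guesses.
            """
            rng = np.random.default_rng(seed)
            env = gym.make("CartPole-v1").unwrapped
            env.masscart = rng.uniform(0.5, 2.0)
            env.masspole = rng.uniform(0.05, 0.5)
            env.length = rng.uniform(0.25, 1.0)    # half-length of the pole
            env.force_mag = rng.uniform(5.0, 15.0)
            # Refresh derived quantities that CartPoleEnv precomputes in __init__
            env.total_mass = env.masspole + env.masscart
            env.polemass_length = env.masspole * env.length
            return env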
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  5. Problem Setting
    • Transfer from multiple source environment instances to a single target instance
    • Each instance shares the same state/action spaces and reward function but differs in state-transition dynamics
    • Policy reuse [Fernández and Veloso, AAMAS’06; Zheng et al., NeurIPS’18; etc.] and option frameworks [Sutton
    et al., Art. Intel.’99; Bacon et al., AAAI‘17; etc.] cannot be adopted, as they assume all policies act under the same transition dynamics
    • Only the source policies are shared with the target agent
    • No prior knowledge about the source policies or the source environmental dynamics is accessible
    • Existing transfer RL or meta-RL methods that require access to the source environmental dynamics cannot be adopted
    [Lazaric et al., ICML’08; Chen et al., NeurIPS’18; Yu et al., ICLR’19; etc.]
    [Fernández and Veloso, AAMAS’06] “Probabilistic Policy Reuse in a Reinforcement Learning Agent”, AAMAS’06
    [Zheng et al., NeurIPS’18] “A Deep Bayesian Policy Reuse Approach against Non-Stationary Agents”, NeurIPS’18
    [Sutton et al., Art. Intel.’99] “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning”, Artificial Intelligence’99
    [Bacon et al., AAAI’17] “The Option-Critic Architecture”, AAAI’17
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  6. MULTIPOLAR: MULTI-source POLicy AggRegation
    Source policies: L = {μ_1, …, μ_K}
    M(s_t) ∈ ℝ^{K×D}: matrix whose k-th row is the action μ_k(s_t) predicted by source policy k
    Adaptive aggregation of source policies: F_agg(s_t; L, θ_agg) = (1/K) 1ᵀ (θ_agg ⊙ M(s_t))
    Auxiliary network for predicting residuals: F_aux(s_t; θ_aux)
    Their sum gives the target policy’s mean/logits: F(s_t; L, θ_agg, θ_aux) = F_agg(s_t; L, θ_agg) + F_aux(s_t; θ_aux)
    Continuous action space: π_target(a_t | s_t) ≡ N(F(s_t; L, θ_agg, θ_aux), Σ)
    Discrete action space: π_target(a_t | s_t) ≡ Softmax(F(s_t; L, θ_agg, θ_aux))
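    A minimal PyTorch sketch of this head for a continuous action space (class and variable names, and the auxiliary network size, are our assumptions; the official implementation is linked at the end of the deck):

        import torch
        import torch.nn as nn

        class MultipolarHead(nn.Module):
            """Sketch of MULTIPOLAR's F(s; L, θ_agg, θ_aux) for continuous actions."""

            def __init__(self, source_policies, state_dim, action_dim):
                super().__init__()
                self.sources = list(source_policies)  # K frozen black-box policies
                k = len(self.sources)
                # θ_agg: learnable aggregation parameters, one per (source, action dim)
                self.theta_agg = nn.Parameter(torch.ones(k, action_dim))
                # F_aux: small MLP predicting a state-dependent residual
                self.aux = nn.Sequential(
                    nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))

            def forward(self, state):  # state: (state_dim,); batching omitted
                # M(s_t): stack the K source actions; no gradient into the sources
                with torch.no_grad():
                    m = torch.stack([mu(state) for mu in self.sources])  # (K, D)
                f_agg = (self.theta_agg * m).mean(dim=0)  # (1/K) 1ᵀ (θ_agg ⊙ M(s_t))
                return f_agg + self.aux(state)  # mean of N(·, Σ); softmax if discrete

    θ_agg and the auxiliary network are trained end-to-end by the target agent’s RL algorithm; the source policies are only ever queried for actions.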
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  7. MULTIPOLAR: Advantages
    • Able to work on black-box source policies
    • No need to know source dynamics: aggregation is done adaptively to maximize the target task performance
    • No need to know the sources’ original task performance: the auxiliary network ensures policy expressiveness
    • Applicable to both continuous/discrete action spaces
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  8. Environments
    CartPole Acrobot Lunar Lander
    Roboschool Hopper Roboschool Ant Roboschool Inverted Pendulum Swingup
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  9. Source Tasks (RoboSchool Ant)
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  10. Results (RoboSchool Ant)
    MULTIPOLAR
    Learning from scratch
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  11. Results (RoboSchool Hopper)
    MULTIPOLAR
    Learning from scratch
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  12. Quantitative Evaluation
    • Experimental setups
    • 100 instances with unique dynamics for each environment
    • 3 random choices of 4 source instances for each of the 100 target instances
    • 3 random seeds
    • 100 targets × 3 source choices × 3 seeds = 900 sessions for each of 6 environments!
    • Baselines
    • MLP learned from scratch
    • Extension of Residual Policy Learning (RPL) [Johannink et al., ICRA’19]: single source policy + auxiliary network
    • A2T [Rajendran et al., ICLR’17]: 4 source policies + auxiliary network + state-dependent attention
    • Metrics
    • Average episodic reward over training samples [Henderson et al., AAAI’18]
    • Bootstrap mean + 95% confidence bounds (see the sketch below)
    [Johannink et al., ICRA’19] “Residual Reinforcement Learning for Robot Control”, ICRA’19
    [Rajendran et al., ICLR’17] “Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from Multiple Sources in the Same Domain”, ICLR’17
    [Henderson et al., AAAI’18] “Deep Reinforcement Learning That Matters”, AAAI’18
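    A minimal NumPy sketch of the bootstrap mean with 95% confidence bounds (resample count, seed, and function name are our assumptions):

        import numpy as np

        def bootstrap_mean_ci(session_rewards, n_resamples=10_000, seed=0):
            """Bootstrap mean and 95% confidence bounds of per-session rewards."""
            rng = np.random.default_rng(seed)
            x = np.asarray(session_rewards, dtype=float)
            # Resample sessions with replacement; take the mean of each resample
            idx = rng.integers(0, len(x), size=(n_resamples, len(x)))
            means = x[idx].mean(axis=1)
            lo, hi = np.percentile(means, [2.5, 97.5])
            return x.mean(), lo, hi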
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  13. Quantitative Evaluations: Learning Curves
    MULTIPOLAR (Ours)
    A2T [Rajendran et al., ICLR’17]
    RPL
    MLP
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  14. Learning Curves
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  15. Average Episodic Rewards
    Roboschool Ant / Roboschool Hopper
    (Bar plots comparing MLP, RPL, A2T, and MULTIPOLAR in each environment)
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  16. Analysis
    Number of source policies (Hopper): learning curves for K = 1, 4, 16
    Ablation study (Hopper): full model vs. uniform aggregation weights vs. state-independent residual
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  17. Analysis
    Aggregation weights over training timesteps (Hopper), plotted separately for high- and low-performing sources
    Source policies          Average episodic reward
    Random                   283
    4 high                   420
    2 high / 2 low           208
    4 low                    92
    Learning from scratch    92
    MULTIPOLAR avoids negative transfer by suppressing low-performing source policies
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.


  18. Summary
    • MULTIPOLAR: multi-source policy aggregation for transfer RL between
    diverse environmental dynamics
    • Able to work on black-box source policies
    • Able to work on both continuous/discrete action spaces
    • Confirmed effective on a variety of Gym environments
    • Code available!
    • https://github.com/Mohammadamin-Barekatain/multipolar
    • Interested in internship at OMRON SINIC X?
    • Contact us at [email protected] with your CV!
    (Video: learning from scratch vs. MULTIPOLAR)
    (c) 2021 OMRON SINIC X Corporation. All Rights Reserved.
