
Adaptive Selection of Auxiliary Tasks in UNREAL

Presented at the IJCAI2019 Scaling-Up Reinforcement Learning (SURL) Workshop.
37 pages.

Hidenori Itaya

June 14, 2023

Transcript

  1. Adaptive Selection of Auxiliary Tasks in UNREAL
     Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi (Chubu University)
     IJCAI2019 Scaling-Up Reinforcement Learning (SURL) Workshop
  2. Reinforcement Learning (RL)
     • Problems in which an agent learns by interacting with an environment
     • Application examples of RL [Gu+, ICRA2016] [Mnih+, Nature2015]
     (Figure: agent-environment loop; the agent sends an action, the environment returns a reward and the next state. A minimal sketch of this loop follows below.)
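A minimal sketch of the interaction loop on slide 2, assuming a Gym-style environment with reset()/step(); the random action stands in for a learned policy and none of this is the paper's code.

```python
import random

# Minimal sketch of the agent-environment loop: the agent sends an action,
# the environment returns a reward and the next state (Gym-style interface assumed).
def run_episode(env, n_actions):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = random.randrange(n_actions)    # stand-in for a learned policy pi(a|s)
        state, reward, done = env.step(action)  # environment: reward and next state
        total_reward += reward
    return total_reward
```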
  3. Asynchronous Advantage Actor-Critic (A3C) [Mnih+, 2016]
     • Asynchronous: each worker updates the global network parameters asynchronously
     • Advantage: the target error is calculated in each worker from rewards two or more steps ahead (see the sketch after this slide)
     • Actor-Critic: each worker estimates a policy π(a|s) and a state-value function V(s)
     (Figure: a global network with parameter θ and multiple workers; each worker holds a copy θ′ of the parameters, an actor π(a|s), a critic V(s), and its own environment)
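A minimal sketch of the "Advantage" bullet above: the n-step return each worker computes from rewards several steps ahead, minus the critic's value estimate. The list-based rollout format and the discount factor are assumptions for illustration.

```python
# Sketch of the n-step target/advantage used by each A3C worker.
# rewards/values are lists collected over one rollout (oldest first);
# bootstrap_value is the critic's estimate V(s_{t+n}) at the end of the rollout.
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    R = bootstrap_value
    advantages = []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R          # discounted n-step return (target)
        advantages.append(R - v)   # advantage = target - state value
    return list(reversed(advantages))
```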
  4. UNREAL [Jaderberg+, ICLR2017]
     • Introduces three auxiliary tasks into A3C
       - Pixel Control: trains actions that cause large changes in pixel values (see the sketch after this slide)
       - Value Function Replay: shuffles past experiences and retrains the state-value function
       - Reward Prediction: predicts future rewards from skewed samples of the replay buffer
     (Figure: main-task network (Conv, Conv, FC, LSTM, with the last reward and last action as inputs, outputting π(a|s) and V(s)) and a replay buffer feeding the Pixel Control (DeConv, Adv), Value Function Replay, and Reward Prediction heads)
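A sketch of the Pixel Control pseudo-reward that the first auxiliary task trains on: the average absolute pixel change within each cell of a grid over consecutive frames. The 4x4 cell size is an assumption, and the original UNREAL implementation also crops the observation first.

```python
import numpy as np

# Sketch of the Pixel Control pseudo-reward: the average absolute change in
# pixel intensity within each cell of a grid laid over consecutive frames.
def pixel_change_reward(frame, next_frame, cell=4):
    diff = np.abs(next_frame.astype(np.float32) - frame.astype(np.float32))
    if diff.ndim == 3:
        diff = diff.mean(axis=-1)                # average over colour channels
    h = diff.shape[0] - diff.shape[0] % cell     # trim so the grid divides evenly
    w = diff.shape[1] - diff.shape[1] % cell
    cells = diff[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return cells.mean(axis=(1, 3))               # one pseudo-reward per cell c
```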
  5.-7. UNREAL [Jaderberg+, ICLR2017] (slides 5-7 repeat the content of slide 4)
  8. Loss function of UNREAL
     • The sum of the main-task loss and the auxiliary-task losses
       - $L_{\mathrm{main}}$: main-task loss
       - $L_Q^{(c)}$: Pixel Control loss (for cell $c$)
       - $L_{\mathrm{VR}}$: Value Function Replay loss
       - $L_{\mathrm{RP}}$: Reward Prediction loss
     $L_{\mathrm{UNREAL}} = L_{\mathrm{main}} + \sum_c L_Q^{(c)} + L_{\mathrm{VR}} + L_{\mathrm{RP}}$  (1)
     $L_{\mathrm{UNREAL}} = L_{\mathrm{main}} + C_{\mathrm{PC}} \sum_c L_Q^{(c)} + C_{\mathrm{VR}} L_{\mathrm{VR}} + C_{\mathrm{RP}} L_{\mathrm{RP}}$  (2)
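A minimal sketch of Eq. (2), combining the main-task loss with the coefficient-weighted auxiliary-task losses. Treating the losses as plain scalars is an assumption; in practice they are tensors produced by the respective network heads.

```python
# Minimal sketch of Eq. (2): L_main plus the weighted auxiliary-task losses.
def unreal_loss(L_main, L_Q_cells, L_VR, L_RP, C_PC=1.0, C_VR=1.0, C_RP=1.0):
    L_PC = sum(L_Q_cells)  # Pixel Control loss summed over cells c
    return L_main + C_PC * L_PC + C_VR * L_VR + C_RP * L_RP
```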
  9. Preliminary experiment
     • Investigate whether each auxiliary task is effective or not
     • Environment: DeepMind Lab [Beattie+, arXiv2016]
       - nav_maze_static_01
       - seekavoid_arena_01
       - lt_horseshoe_color
     • Investigated configurations
       - Pixel Control only (PC)
       - Value Function Replay only (VR)
       - Reward Prediction only (RP)
       - All three auxiliary tasks (UNREAL)
  10. nav_maze_static_01
      • A first-person maze game
      • Actions: look left, look right, forward, backward, strafe left, strafe right
      • Rewards: apple +1, goal +10
  11. seekavoid_arena_01
      • A game of collecting apples while avoiding lemons
      • Actions: look left, look right, forward, backward, strafe left, strafe right
      • Rewards: apple +1, lemon -1
  12. Result (seekavoid_arena_01)
      • Value Function Replay is effective
        - Actions that change pixel values are not suitable for this game
        - Rewards are obtained frequently in seekavoid
      (Figure: visualization of pixel-value variation, from low to high)
  13. lt_horseshoe_color
      • A first-person shooting game
      • Actions: look left, look right, forward, backward, strafe left, strafe right, attack
      • Reward: kill an enemy +1
  14. Result (lt_horseshoe_color)
      • All auxiliary tasks are effective
        - Killing an enemy causes large changes in pixel values
        - The reward (killing an enemy) is acquired less frequently
  15. Summary of the preliminary experiment
      (Figure: score vs. global step for nav_maze_static_01, seekavoid_arena_01, and lt_horseshoe_color)
      • Optimal auxiliary task per game
        - nav_maze_static_01: Pixel Control, UNREAL
        - seekavoid_arena_01: Value Function Replay
        - lt_horseshoe_color: UNREAL
      → Need to select suitable auxiliary tasks for each game
  16. Purpose of the proposed method
      • Use only the auxiliary tasks suitable for the environment
        - Automatically select the suitable auxiliary tasks
      • Proposed method: Auxiliary Selection
        - Adaptive selection of the optimal auxiliary tasks
      (Figure: illustration on nav_maze_static_01, where Auxiliary Selection selects Pixel Control and rejects Value Function Replay and Reward Prediction)
  17. Auxiliary Selection
      • A new task that selects the auxiliary tasks suitable for the environment
        - Built as a network independent of the main-task network
      (Figure: an Auxiliary Selection head (Conv, Conv, FC) outputting π_AS(a|s) and V_AS(s), attached alongside the main task, Pixel Control, Value Function Replay, and Reward Prediction networks. A sketch of such a head follows below.)
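A PyTorch sketch of an Auxiliary Selection head that is independent of the main-task network, following the Conv / Conv / FC layout in the figure; every layer size here is an assumption, since the slide only gives the rough layout.

```python
import torch
import torch.nn as nn

# Sketch of an Auxiliary Selection head: Conv -> Conv -> FC producing a
# selection policy pi_AS over the 8 weight patterns and a value V_AS.
class AuxiliarySelectionHead(nn.Module):
    def __init__(self, in_channels=3, n_selections=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(256)                # FC layer; input size inferred lazily
        self.policy = nn.Linear(256, n_selections)  # logits of pi_AS
        self.value = nn.Linear(256, 1)              # V_AS

    def forward(self, observation):
        h = torch.relu(self.fc(self.features(observation)))
        return torch.softmax(self.policy(h), dim=-1), self.value(h)
```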
  18. Action of Auxiliary Selection
      • The action of Auxiliary Selection sets the weight of each auxiliary task, $C_{\mathrm{PC}}, C_{\mathrm{VR}}, C_{\mathrm{RP}}$
      • The actions form 8 patterns:
        $(C_{\mathrm{PC}}, C_{\mathrm{VR}}, C_{\mathrm{RP}}) \in \{0, 1\} \times \{0, 1\} \times \{0, 1\}$, i.e. from $(0, 0, 0)$ to $(1, 1, 1)$
      • The selected action is $a = \arg\max \pi_{\mathrm{AS}} = \{C_{\mathrm{PC}}, C_{\mathrm{VR}}, C_{\mathrm{RP}}\}$
      (Figure: each auxiliary-task loss is multiplied by 1 or 0 according to the selection; a decoding sketch follows below)
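A sketch of decoding the 8-way Auxiliary Selection action (taken as arg max of pi_AS) into the binary weights (C_PC, C_VR, C_RP); the particular bit ordering is an assumption.

```python
# Sketch: turn the 8-way selection action a = argmax pi_AS into (C_PC, C_VR, C_RP).
def decode_selection(action_index):
    assert 0 <= action_index < 8
    c_pc = (action_index >> 2) & 1
    c_vr = (action_index >> 1) & 1
    c_rp = action_index & 1
    return c_pc, c_vr, c_rp

# decode_selection(3) -> (0, 1, 1): Pixel Control off, Value Function Replay and
# Reward Prediction on, matching the {0, 1, 1} example on the next slides.
```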
  19. Loss of main and auxiliary tasks
      • Multiply the Auxiliary Selection outputs by the losses of the auxiliary tasks:
        $L_{\mathrm{UNREAL}} = L_{\mathrm{main}} + C_{\mathrm{PC}} \sum_c L_Q^{(c)} + C_{\mathrm{VR}} L_{\mathrm{VR}} + C_{\mathrm{RP}} L_{\mathrm{RP}}$
        with $(C_{\mathrm{PC}}, C_{\mathrm{VR}}, C_{\mathrm{RP}}) \in \{0, 1\}^3$
  20. Loss of main and auxiliary tasks (example)
      • With $\{C_{\mathrm{PC}}, C_{\mathrm{VR}}, C_{\mathrm{RP}}\} = \{0, 1, 1\}$, the Pixel Control loss $\sum_c L_Q^{(c)}$ is multiplied by 0 (turned off), while $L_{\mathrm{VR}}$ and $L_{\mathrm{RP}}$ are multiplied by 1 (kept)
  21. Loss function of Auxiliary Selection
      • The sum of the policy loss and the state-value-function loss:
        $L_{\mathrm{AS}} = L(\pi_{\mathrm{AS}}) + L(V_{\mathrm{AS}})$  (3)
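A sketch of Eq. (3): a standard actor-critic loss pair on the Auxiliary Selection head. Using a single scalar return as the learning signal and omitting any entropy bonus are assumptions for illustration.

```python
# Sketch of L_AS = L(pi_AS) + L(V_AS) for one selection decision.
def auxiliary_selection_loss(log_prob_selected, value_estimate, observed_return):
    advantage = observed_return - value_estimate
    policy_loss = -log_prob_selected * advantage  # L(pi_AS), policy-gradient term
    value_loss = 0.5 * advantage ** 2             # L(V_AS), value-regression term
    return policy_loss + value_loss
```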
  22. Experiment settings
      • Environment: DeepMind Lab [Beattie+, arXiv2016]
      • Training settings
        - Number of steps: 1.0×10^… (maze and seekavoid), 1.0×10^… (horseshoe)
        - Number of workers: 8
      • Comparison
        - A single auxiliary task only (PC, VR, RP)
        - All three auxiliary tasks (UNREAL)
        - Proposed method (proposed)
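A hypothetical configuration dictionary mirroring the settings on this slide; only values stated in the deck are filled in (the per-environment step counts are left out).

```python
# Hypothetical experiment configuration reflecting slide 22.
EXPERIMENT = {
    "environments": [
        "nav_maze_static_01",
        "seekavoid_arena_01",
        "lt_horseshoe_color",
    ],
    "num_workers": 8,
    "methods": ["PC", "VR", "RP", "UNREAL", "proposed"],
}
```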
  23. Result (nav_maze_static_01)
      → The proposed method achieves scores as high as UNREAL and PC
  24. Analysis of the selected auxiliary tasks (nav_maze_static_01)
      Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):

                     Pixel Control   Value Function Replay   Reward Prediction
        maze              48.3               54.1                  41.0
        seekavoid          0.1              100.0                   0.0
        horseshoe         94.9                0.1                  99.9

      → In the maze, all auxiliary tasks are selected roughly equally
  25. Analysis of the selected auxiliary tasks (seekavoid_arena_01)
      (Selection percentages as in the table on slide 24, averaged over 50 episodes)
      → Value Function Replay is stably selected (100.0%)
  26. Analysis of the selected auxiliary tasks (lt_horseshoe_color)
      (Selection percentages as in the table on slide 24, averaged over 50 episodes)
      → Pixel Control and Reward Prediction are stably selected (94.9% and 99.9%)
  27. Additional experiment
      • Investigate other combinations of auxiliary tasks
      • Environment: lt_horseshoe_color (DeepMind Lab)
      • Comparison: scores in horseshoe
        - All three auxiliary tasks (UNREAL)
        - Value Function Replay only (VR)
        - Pixel Control and Reward Prediction (PC+RP)
        - Main task only (main)
  28. Result
      → VR alone scores lower than the main task only
      → PC+RP achieves scores as high as UNREAL
  29. Conclusion
      • Auxiliary Selection
        - Achieves scores comparable to the optimal auxiliary tasks
        - Can select appropriate auxiliary tasks for each game
          · nav_maze_static_01: UNREAL, Pixel Control
          · seekavoid_arena_01: Value Function Replay
          · lt_horseshoe_color: Pixel Control + Reward Prediction
      • Future work
        - Evaluate the proposed method in various environments and with other auxiliary tasks