
Auxiliary selection: optimal selection of auxiliary tasks using deep reinforcement learning

Hidenori Itaya
March 12, 2024


This is the slide deck used for the presentation at IEVC2024.


Transcript

  1. Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi Chubu University

    IEVC2024 Session 6A: Computer Vision & 3D Image Processing (2) Auxiliary selection: optimal selection of auxiliary tasks using deep reinforcement learning
  2. 2 Deep Reinforcement Learning (DRL)
     • Problems involving an agent interacting with an environment
     (Figure: agent-environment loop: the agent sends an action, the environment returns a reward and the next state)
     (Figure: application examples of RL [Mnih+, 2015] [Levine+, 2016] [Chen+, 2017] [Elia+, 2023])
  3. 3 Asynchronous Advantage Actor-Critic (A3C) [Mnih+, 2016]
     • Asynchronous
       - Each worker updates the parameters asynchronously
     • Advantage
       - The target error is calculated considering rewards more than 2 steps ahead in each worker
     • Actor-Critic
       - Inference is done by two separate heads
       - Actor: policy π(a|s)
       - Critic: state-value function V(s)
     (Figure: a global network (parameters θ) and multiple workers (parameters θ'), each worker with Actor π(a|s) and Critic V(s) interacting with its own environment)
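A minimal sketch (not from the slides) of the Actor-Critic split described above, assuming PyTorch; the observation size, hidden width, and class name are illustrative.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with separate Actor (policy) and Critic (value) heads."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # Actor: pi(a|s)
        self.value_head = nn.Linear(hidden, 1)           # Critic: V(s)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        pi = torch.softmax(self.policy_head(h), dim=-1)  # action probabilities
        v = self.value_head(h).squeeze(-1)               # state value estimate
        return pi, v

# In A3C, each asynchronous worker keeps a copy of these parameters, collects its
# own rollout, and applies its gradients to the shared global network.
pi, v = ActorCritic(obs_dim=64, n_actions=6)(torch.randn(1, 64))
```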
  4. 4 UNREAL [Jaderberg+, ICLR2017]
     • Introducing 3 auxiliary tasks into A3C
       - Pixel Control: train actions that cause large changes in pixel values
       - Value Function Replay: shuffle past experiences and retrain the state-value function
       - Reward Prediction: predict future rewards from skewed sampling of the replay buffer
     (Figure: the UNREAL architecture: environment, replay buffer, shared Conv / Conv / FC / LSTM trunk with last reward and last action as inputs, π(a|s) and V(s) heads for the main task (A3C), deconvolutional value/advantage heads for Pixel Control, and an FC head for Reward Prediction fed by skewed sampling from the buffer)
  8. 8 Loss function of UNREAL
     • The sum of the main task loss and the auxiliary task losses
       - L_A3C: main task loss
       - L_PC^(c): Pixel Control loss
       - L_VR: Value Function Replay loss
       - L_RP: Reward Prediction loss
     L_UNREAL = L_A3C + λ_PC Σ_c L_PC^(c) + L_VR + L_RP
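A minimal sketch of how the combined loss can be computed once the individual losses are available, assuming PyTorch; the function name and the default λ_PC value are illustrative, not taken from the slides.

```python
import torch

def unreal_loss(l_a3c: torch.Tensor,
                l_pc_per_cell: list,
                l_vr: torch.Tensor,
                l_rp: torch.Tensor,
                lambda_pc: float = 0.01) -> torch.Tensor:
    """L_UNREAL = L_A3C + lambda_PC * sum_c L_PC^(c) + L_VR + L_RP."""
    return l_a3c + lambda_pc * torch.stack(l_pc_per_cell).sum() + l_vr + l_rp

# Example with placeholder scalar losses.
loss = unreal_loss(torch.tensor(1.0), [torch.tensor(0.2), torch.tensor(0.3)],
                   torch.tensor(0.4), torch.tensor(0.1))
```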
  9. 9 Preliminary experiment
     • Investigate whether each auxiliary task is effective or not
     • Environment: DeepMind Lab [Beattie+, arXiv2016]
       - nav_maze_static_01 (maze), seekavoid_arena_01 (seekavoid), lt_horseshoe_color (horseshoe)
     • Investigated auxiliary tasks
       - Pixel Control (PC)
       - Value Function Replay (VR)
       - Reward Prediction (RP)
       - 3 auxiliary tasks (UNREAL)
  10. 10 nav_maze_static_01 (maze)
      • A first-person viewpoint maze game
      • Action
        - Look left / Look right
        - Forward / Backward
        - Strafe left / Strafe right
      • Reward
        - Apple: +1
        - Goal: +10
  11. 11 Pre-experiment result – maze
      Pixel Control is effective
      → Actions that change pixel values promote movement
  12. 12 seekavoid_arena_01 (seekavoid)
      • A game in which the agent collects apples while avoiding lemons
      • Action
        - Look left / Look right
        - Forward / Backward
        - Strafe left / Strafe right
      • Reward
        - Apple: +1
        - Lemon: -1
  13. 13 Pre-experiment result – seekavoid
      Value Function Replay is effective
      - PC: actions that change pixel values are not suitable here
      - RP: the agent obtains rewards frequently in seekavoid, so reward prediction helps little
      (Figure: variation of pixel values, high to low)
  14. 14 lt_horseshoe_color (horseshoe)
      • A first-person shooting game
      • Action
        - Look left / Look right
        - Forward / Backward
        - Strafe left / Strafe right
        - Attack
      • Reward
        - Kill the enemy: +1
  15. 15 Pre-experiment result – horseshoe
      All auxiliary tasks are effective
      - Killing an enemy involves actions that change pixel values
      - The reward (killing an enemy) is acquired less frequently
  16. 16 Summary of pre-experiment
      → Auxiliary tasks suitable for each game need to be selected
      Optimal auxiliary task per game:
        - nav_maze_static_01 (maze): Pixel Control / UNREAL
        - seekavoid_arena_01 (seekavoid): Value Function Replay
        - lt_horseshoe_color (horseshoe): UNREAL
      (Figure: score vs. global step training curves for the three games)
  17. 17 Purpose of our research
      • Use only the auxiliary tasks suitable for the environment
        - Automatically select the suitable auxiliary tasks
      • Proposed method
        - Adaptive selection of optimal auxiliary tasks by using DRL
        - Construct a DRL agent that adaptively selects the optimal auxiliary task
      (Figure: a DRL agent observing the environment selects Pixel Control, while Value Function Replay and Reward Prediction are left unselected)
  18. 18 Auxiliary selection
      • A DRL agent that selects the auxiliary task suitable for the environment
        - The Auxiliary selection network is built independently from the main-task network
      • The branch outputs a selection policy π_AS and weights a_AS = (a_PC, a_VR, a_RP)
      • Total loss: L_total = L_A3C + λ_PC Σ_c L_PC^(c) + λ_VR L_VR + λ_RP L_RP
      (Figure: the UNREAL architecture as on slide 4, plus an Auxiliary selection branch with Conv, Conv, FC layers fed from the replay buffer)
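A minimal sketch of an independent Auxiliary selection branch as described above (two conv layers and an FC layer), assuming PyTorch, an 84x84 single-channel observation, and illustrative layer sizes; the class and attribute names are not from the slides.

```python
import torch
import torch.nn as nn

class AuxiliarySelection(nn.Module):
    """Independent Conv-Conv-FC branch: policy pi_AS over the 2^3 = 8
    combinations of (a_PC, a_VR, a_RP), plus a state value V_AS."""
    def __init__(self, in_channels: int = 1, n_combinations: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Linear(32 * 9 * 9, 256)              # assumes an 84x84 input
        self.policy_as = nn.Linear(256, n_combinations)   # pi_AS
        self.value_as = nn.Linear(256, 1)                 # V_AS

    def forward(self, obs: torch.Tensor):
        h = torch.relu(self.fc(self.features(obs)))
        return torch.softmax(self.policy_as(h), dim=-1), self.value_as(h).squeeze(-1)

pi_as, v_as = AuxiliarySelection()(torch.randn(1, 1, 84, 84))
```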
  19. 19 Actions of Auxiliary selection
      • Weight of each auxiliary task
        - a_PC, a_VR, a_RP, each taking a value in {0, 1}
      • Action of Auxiliary selection
        - a_AS = argmax π_AS
        - a_AS = (a_PC, a_VR, a_RP), ranging from (0, 0, 0) to (1, 1, 1)
      (Figure: the Auxiliary selection branch (Conv, Conv, FC) takes the state from the buffer and outputs π_AS and a_AS = (a_PC, a_VR, a_RP))
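A minimal sketch of decoding the selector's discrete action into the binary weight tuple (a_PC, a_VR, a_RP), assuming Python/PyTorch; the bit ordering of the 8 combinations is an assumption, since the slide does not fix it.

```python
import torch

def action_to_weights(action_index: int):
    """Map an action index in [0, 7] to (a_PC, a_VR, a_RP), each in {0, 1}."""
    assert 0 <= action_index < 8
    a_pc = (action_index >> 2) & 1
    a_vr = (action_index >> 1) & 1
    a_rp = action_index & 1
    return a_pc, a_vr, a_rp

# Greedy selection: a_AS = argmax pi_AS, then decode it into the three weights.
pi_as = torch.tensor([0.05, 0.05, 0.10, 0.30, 0.10, 0.10, 0.20, 0.10])
a_pc, a_vr, a_rp = action_to_weights(int(torch.argmax(pi_as)))
print(a_pc, a_vr, a_rp)  # 0 1 1
```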
  20. 20 Loss of main and auxiliary tasks
      • Multiply the outputs of Auxiliary selection by the losses of the auxiliary tasks
      L_UNREAL = L_A3C + a_PC Σ_c L_PC^(c) + a_VR L_VR + a_RP L_RP
      (Figure: the UNREAL architecture with the Auxiliary selection branch; its outputs a_PC, a_VR, a_RP ∈ {0, 1} gate the Pixel Control, Value Function Replay, and Reward Prediction loss terms)
  21. 21 Loss of main and auxiliary tasks
      • Multiply the outputs of Auxiliary selection by the losses of the auxiliary tasks
      L_UNREAL = L_A3C + a_PC Σ_c L_PC^(c) + a_VR L_VR + a_RP L_RP
      • Example: (a_PC, a_VR, a_RP) = (0, 1, 1)
        - the Pixel Control loss is multiplied by ×0, the Value Function Replay and Reward Prediction losses by ×1
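A minimal sketch of gating the auxiliary losses with the selected weights, continuing the (0, 1, 1) example; the numeric loss values are illustrative placeholders, and the function name is not from the slides.

```python
import torch

def selected_loss(l_a3c, l_pc_sum, l_vr, l_rp, a_pc, a_vr, a_rp):
    """L_UNREAL = L_A3C + a_PC * sum_c L_PC^(c) + a_VR * L_VR + a_RP * L_RP."""
    return l_a3c + a_pc * l_pc_sum + a_vr * l_vr + a_rp * l_rp

# With (a_PC, a_VR, a_RP) = (0, 1, 1) the Pixel Control term is multiplied by 0
# and drops out of the update; the other two auxiliary losses are kept.
loss = selected_loss(torch.tensor(1.2), torch.tensor(0.4),
                     torch.tensor(0.3), torch.tensor(0.1),
                     a_pc=0, a_vr=1, a_rp=1)
print(loss)  # tensor(1.6000)
```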
  22. 22 Loss function of Auxiliary selection
      • Sum of the policy loss and the state-value function loss
      L_AS = L_πAS + L_VAS
      (Figure: the UNREAL architecture with the Auxiliary selection branch (Conv, Conv, FC) producing the selector's policy π_AS and state value V_AS)
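A minimal sketch of summing the selector's policy and state-value losses, assuming an advantage actor-critic form in PyTorch; the exact loss shapes and the value coefficient are assumptions, as the slide only states that the two losses are added.

```python
import torch

def auxiliary_selection_loss(log_pi_a: torch.Tensor,
                             v_pred: torch.Tensor,
                             ret: torch.Tensor,
                             value_coef: float = 0.5) -> torch.Tensor:
    """L_AS = L_pi_AS + L_V_AS for the Auxiliary selection agent.

    log_pi_a: log pi_AS(a_AS | s) of the chosen task combination
    v_pred:   V_AS(s), the selector's state-value estimate
    ret:      the return used as the learning target
    """
    advantage = ret - v_pred
    policy_loss = -(log_pi_a * advantage.detach())  # L_pi_AS (policy-gradient term)
    value_loss = value_coef * advantage.pow(2)      # L_V_AS (squared error)
    return (policy_loss + value_loss).mean()
```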
  23. 23 Experiment settings
      • Environment: DeepMind Lab [Beattie+, arXiv2016]
        - maze, seekavoid, horseshoe
      • Training setting
        - # of steps: 1.0×10: steps (maze and seekavoid), 1.0×10; steps (horseshoe)
        - # of workers: 8
      • Comparison
        - Only one auxiliary task (PC, VR, RP)
        - 3 auxiliary tasks (UNREAL)
        - UNREAL + Auxiliary selection (proposed)
  24. 30 Analysis of the selected auxiliary tasks (maze)
      Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):
                   Pixel Control   Value Function Replay   Reward Prediction
      maze              48.3               54.1                  41.0
      seekavoid          0.1              100.0                   0.0
      horseshoe         94.9                0.1                  99.9
      → In maze, all auxiliary tasks are selected about equally
  25. 31 Analysis of the selected auxiliary tasks (seekavoid)
      Selection percentage in one episode [%] (average over 50 episodes), seekavoid row: PC 0.1, VR 100.0, RP 0.0
      → VR is stably selected
  26. 32 Analysis of the selected auxiliary tasks (horseshoe)
      Selection percentage in one episode [%] (average over 50 episodes), horseshoe row: PC 94.9, VR 0.1, RP 99.9
      → PC and RP are stably selected
  27. 33 Additional experiment
      • Investigate other combinations of auxiliary tasks
      • Environment: lt_horseshoe_color (horseshoe, DeepMind Lab)
      • Comparison: compare scores in horseshoe
        - Three auxiliary tasks (UNREAL)
        - Value Function Replay (VR)
        - Pixel Control and Reward Prediction (PC+RP)
        - Without auxiliary tasks, main task only (w/o aux)
  28. 36 Additional experiment result
      → VR scores lower than the main task alone
      → PC+RP achieves a score about as high as UNREAL
  29. 37 Conclusion
      • Introducing auxiliary tasks is expected to improve the accuracy of the main task
        - Unsuitable auxiliary tasks lead to reduced accuracy
        → Suitable auxiliary tasks need to be introduced to improve the accuracy of the main task
      • Auxiliary selection: adaptive selection of optimal auxiliary tasks by using DRL
        - Achieves a score comparable to the optimal auxiliary task
        - Can select appropriate auxiliary tasks for each game
          • nav_maze_static_01: UNREAL, Pixel Control
          • seekavoid_arena_01: Value Function Replay
          • lt_horseshoe_color: Pixel Control + Reward Prediction
      • Future work
        - Evaluate the proposed method in various environments with other auxiliary tasks