
Auxiliary selection: optimal selection of auxiliary tasks using deep reinforcement learning

Hidenori Itaya
March 12, 2024


This is the slide deck used for the presentation at IEVC2024.


Transcript

  1. Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi Chubu University

    IEVC2024 Session 6A: Computer Vision & 3D Image Processing (2) Auxiliary selection: optimal selection of auxiliary tasks using deep reinforcement learning
  2. 2 Deep Reinforcement Learning (DRL)
     • Problems involving an agent interacting with an environment
     (Figure: agent-environment loop: the agent sends an action, the environment returns a reward and the next state)
     (Figure: application examples of RL [Mnih+, 2015] [Levine+, 2016] [Chen+, 2017] [Elia+, 2023])
  3. 3 Asynchronous Advantage Actor-Critic (A3C) [Mnih+, 2016]
     • Asynchronous
       - Each worker updates the parameters asynchronously
     • Advantage
       - The target error is calculated considering rewards more than 2 steps ahead in each worker
     • Actor-Critic
       - Inference is done by two separate heads
       - Actor: policy π(a|s)
       - Critic: state-value function V(s)
     (Figure: a global network (parameters θ) and multiple workers (parameters θ'), each worker with Actor π(a|s) and Critic V(s) interacting with its own environment)
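A minimal sketch (not from the slides) of the Actor-Critic split described above, assuming PyTorch; the observation size, hidden width, and class name are illustrative.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with separate Actor (policy) and Critic (value) heads."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # Actor: pi(a|s)
        self.value_head = nn.Linear(hidden, 1)           # Critic: V(s)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        pi = torch.softmax(self.policy_head(h), dim=-1)  # action probabilities
        v = self.value_head(h).squeeze(-1)               # state value estimate
        return pi, v

# In A3C, each asynchronous worker keeps a copy of these parameters, collects its
# own rollout, and applies its gradients to the shared global network.
pi, v = ActorCritic(obs_dim=64, n_actions=6)(torch.randn(1, 64))
```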
  4. 4 UNREAL [Jaderberg+, ICLR2017]
     • Introducing 3 auxiliary tasks into A3C
       - Pixel Control: train actions that cause large changes in pixel values
       - Value Function Replay: shuffle past experiences and retrain the state-value function
       - Reward Prediction: predict future rewards from skewed sampling of the replay buffer
     (Figure: the UNREAL architecture: environment, replay buffer, shared Conv / Conv / FC / LSTM trunk with last reward and last action as inputs, π(a|s) and V(s) heads for the main task (A3C), deconvolutional value/advantage heads for Pixel Control, and an FC head for Reward Prediction fed by skewed sampling from the buffer)
  8. 8 Loss function of UNREAL
     • The sum of the main task loss and the auxiliary task losses
       - L_A3C: main task loss
       - L_PC^(c): Pixel Control loss
       - L_VR: Value Function Replay loss
       - L_RP: Reward Prediction loss
     L_UNREAL = L_A3C + λ_PC Σ_c L_PC^(c) + L_VR + L_RP
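A minimal sketch of how the combined loss can be computed once the individual losses are available, assuming PyTorch; the function name and the default λ_PC value are illustrative, not taken from the slides.

```python
import torch

def unreal_loss(l_a3c: torch.Tensor,
                l_pc_per_cell: list,
                l_vr: torch.Tensor,
                l_rp: torch.Tensor,
                lambda_pc: float = 0.01) -> torch.Tensor:
    """L_UNREAL = L_A3C + lambda_PC * sum_c L_PC^(c) + L_VR + L_RP."""
    return l_a3c + lambda_pc * torch.stack(l_pc_per_cell).sum() + l_vr + l_rp

# Example with placeholder scalar losses.
loss = unreal_loss(torch.tensor(1.0), [torch.tensor(0.2), torch.tensor(0.3)],
                   torch.tensor(0.4), torch.tensor(0.1))
```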
  9. 9 Preliminary experiment
     • Investigate whether each auxiliary task is effective or not
     • Environment: DeepMind Lab [Beattie+, arXiv2016]
       - nav_maze_static_01 (maze), seekavoid_arena_01 (seekavoid), lt_horseshoe_color (horseshoe)
     • Investigated auxiliary tasks
       - Pixel Control (PC)
       - Value Function Replay (VR)
       - Reward Prediction (RP)
       - 3 auxiliary tasks (UNREAL)
  10. 10 nav_maze_static_01 (maze)
      • A first-person viewpoint maze game
      • Action
        - Look left / Look right
        - Forward / Backward
        - Strafe left / Strafe right
      • Reward
        - Apple: +1
        - Goal: +10
  11. 11 Pre-experiment result – maze
      Pixel Control is effective
      → Actions that change pixel values promote movement
  12. 12 seekavoid_arena_01 (seekavoid)
      • A game in which the agent collects apples while avoiding lemons
      • Action
        - Look left / Look right
        - Forward / Backward
        - Strafe left / Strafe right
      • Reward
        - Apple: +1
        - Lemon: -1
  13. 13 Pre-experiment result – seekavoid
      Value Function Replay is effective
      - PC: actions that change pixel values are not suitable here
      - RP: the agent obtains rewards frequently in seekavoid, so reward prediction helps little
      (Figure: variation of pixel values, high to low)
  14. 14 lt_horseshoe_color (horseshoe)
      • A first-person shooting game
      • Action
        - Look left / Look right
        - Forward / Backward
        - Strafe left / Strafe right
        - Attack
      • Reward
        - Kill the enemy: +1
  15. 15 Pre-experiment result – horseshoe
      All auxiliary tasks are effective
      - Killing an enemy involves actions that change pixel values
      - The reward (killing an enemy) is acquired less frequently
  16. 16 Summary of pre-experiment
      → Auxiliary tasks suitable for each game need to be selected
      Optimal auxiliary task per game:
        - nav_maze_static_01 (maze): Pixel Control / UNREAL
        - seekavoid_arena_01 (seekavoid): Value Function Replay
        - lt_horseshoe_color (horseshoe): UNREAL
      (Figure: score vs. global step training curves for the three games)
  17. 17 Purpose of our research
      • Use only the auxiliary tasks suitable for the environment
        - Automatically select the suitable auxiliary tasks
      • Proposed method
        - Adaptive selection of optimal auxiliary tasks by using DRL
        - Construct a DRL agent that adaptively selects the optimal auxiliary task
      (Figure: a DRL agent observing the environment selects Pixel Control, while Value Function Replay and Reward Prediction are left unselected)
  18. 18 Auxiliary selection
      • A DRL agent that selects the auxiliary task suitable for the environment
        - The Auxiliary selection network is built independently from the main-task network
      • The branch outputs a selection policy π_AS and weights a_AS = (a_PC, a_VR, a_RP)
      • Total loss: L_total = L_A3C + λ_PC Σ_c L_PC^(c) + λ_VR L_VR + λ_RP L_RP
      (Figure: the UNREAL architecture as on slide 4, plus an Auxiliary selection branch with Conv, Conv, FC layers fed from the replay buffer)
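A minimal sketch of an independent Auxiliary selection branch as described above (two conv layers and an FC layer), assuming PyTorch, an 84x84 single-channel observation, and illustrative layer sizes; the class and attribute names are not from the slides.

```python
import torch
import torch.nn as nn

class AuxiliarySelection(nn.Module):
    """Independent Conv-Conv-FC branch: policy pi_AS over the 2^3 = 8
    combinations of (a_PC, a_VR, a_RP), plus a state value V_AS."""
    def __init__(self, in_channels: int = 1, n_combinations: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Linear(32 * 9 * 9, 256)              # assumes an 84x84 input
        self.policy_as = nn.Linear(256, n_combinations)   # pi_AS
        self.value_as = nn.Linear(256, 1)                 # V_AS

    def forward(self, obs: torch.Tensor):
        h = torch.relu(self.fc(self.features(obs)))
        return torch.softmax(self.policy_as(h), dim=-1), self.value_as(h).squeeze(-1)

pi_as, v_as = AuxiliarySelection()(torch.randn(1, 1, 84, 84))
```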
  19. 19 Actions of Auxiliary selection
      • Weight of each auxiliary task
        - a_PC, a_VR, a_RP, each taking a value in {0, 1}
      • Action of Auxiliary selection
        - a_AS = argmax π_AS
        - a_AS = (a_PC, a_VR, a_RP), ranging from (0, 0, 0) to (1, 1, 1)
      (Figure: the Auxiliary selection branch (Conv, Conv, FC) takes the state from the buffer and outputs π_AS and a_AS = (a_PC, a_VR, a_RP))
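A minimal sketch of decoding the selector's discrete action into the binary weight tuple (a_PC, a_VR, a_RP), assuming Python/PyTorch; the bit ordering of the 8 combinations is an assumption, since the slide does not fix it.

```python
import torch

def action_to_weights(action_index: int):
    """Map an action index in [0, 7] to (a_PC, a_VR, a_RP), each in {0, 1}."""
    assert 0 <= action_index < 8
    a_pc = (action_index >> 2) & 1
    a_vr = (action_index >> 1) & 1
    a_rp = action_index & 1
    return a_pc, a_vr, a_rp

# Greedy selection: a_AS = argmax pi_AS, then decode it into the three weights.
pi_as = torch.tensor([0.05, 0.05, 0.10, 0.30, 0.10, 0.10, 0.20, 0.10])
a_pc, a_vr, a_rp = action_to_weights(int(torch.argmax(pi_as)))
print(a_pc, a_vr, a_rp)  # 0 1 1
```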
  20. 20 Loss of main and auxiliary tasks
      • Multiply the outputs of Auxiliary selection by the losses of the auxiliary tasks
      L_UNREAL = L_A3C + a_PC Σ_c L_PC^(c) + a_VR L_VR + a_RP L_RP
      (Figure: the UNREAL architecture with the Auxiliary selection branch; its outputs a_PC, a_VR, a_RP ∈ {0, 1} gate the Pixel Control, Value Function Replay, and Reward Prediction loss terms)
  21. 21 Loss of main and auxiliary tasks
      • Multiply the outputs of Auxiliary selection by the losses of the auxiliary tasks
      L_UNREAL = L_A3C + a_PC Σ_c L_PC^(c) + a_VR L_VR + a_RP L_RP
      • Example: (a_PC, a_VR, a_RP) = (0, 1, 1)
        - the Pixel Control loss is multiplied by ×0, the Value Function Replay and Reward Prediction losses by ×1
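A minimal sketch of gating the auxiliary losses with the selected weights, continuing the (0, 1, 1) example; the numeric loss values are illustrative placeholders, and the function name is not from the slides.

```python
import torch

def selected_loss(l_a3c, l_pc_sum, l_vr, l_rp, a_pc, a_vr, a_rp):
    """L_UNREAL = L_A3C + a_PC * sum_c L_PC^(c) + a_VR * L_VR + a_RP * L_RP."""
    return l_a3c + a_pc * l_pc_sum + a_vr * l_vr + a_rp * l_rp

# With (a_PC, a_VR, a_RP) = (0, 1, 1) the Pixel Control term is multiplied by 0
# and drops out of the update; the other two auxiliary losses are kept.
loss = selected_loss(torch.tensor(1.2), torch.tensor(0.4),
                     torch.tensor(0.3), torch.tensor(0.1),
                     a_pc=0, a_vr=1, a_rp=1)
print(loss)  # tensor(1.6000)
```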
  22. 22 Loss function of Auxiliary selection
      • Sum of the policy loss and the state-value function loss
      L_AS = L_πAS + L_VAS
      (Figure: the UNREAL architecture with the Auxiliary selection branch (Conv, Conv, FC) producing the selector's policy π_AS and state value V_AS)
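A minimal sketch of summing the selector's policy and state-value losses, assuming an advantage actor-critic form in PyTorch; the exact loss shapes and the value coefficient are assumptions, as the slide only states that the two losses are added.

```python
import torch

def auxiliary_selection_loss(log_pi_a: torch.Tensor,
                             v_pred: torch.Tensor,
                             ret: torch.Tensor,
                             value_coef: float = 0.5) -> torch.Tensor:
    """L_AS = L_pi_AS + L_V_AS for the Auxiliary selection agent.

    log_pi_a: log pi_AS(a_AS | s) of the chosen task combination
    v_pred:   V_AS(s), the selector's state-value estimate
    ret:      the return used as the learning target
    """
    advantage = ret - v_pred
    policy_loss = -(log_pi_a * advantage.detach())  # L_pi_AS (policy-gradient term)
    value_loss = value_coef * advantage.pow(2)      # L_V_AS (squared error)
    return (policy_loss + value_loss).mean()
```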
  23. 23 Experiment settings
      • Environment: DeepMind Lab [Beattie+, arXiv2016]
        - maze, seekavoid, horseshoe
      • Training setting
        - # of steps: 1.0×10: steps (maze and seekavoid), 1.0×10; steps (horseshoe)
        - # of workers: 8
      • Comparison
        - Only one auxiliary task (PC, VR, RP)
        - 3 auxiliary tasks (UNREAL)
        - UNREAL + Auxiliary selection (proposed)
  24. 30 Analysis of the selected auxiliary tasks (maze)
      Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):
                   Pixel Control   Value Function Replay   Reward Prediction
      maze              48.3               54.1                  41.0
      seekavoid          0.1              100.0                   0.0
      horseshoe         94.9                0.1                  99.9
      → In maze, all auxiliary tasks are selected about equally
  25. 31 Analysis of the selected auxiliary tasks (seekavoid)
      Selection percentage in one episode [%] (average over 50 episodes), seekavoid row: PC 0.1, VR 100.0, RP 0.0
      → VR is stably selected
  26. 32 Analysis of the selected auxiliary tasks (horseshoe)
      Selection percentage in one episode [%] (average over 50 episodes), horseshoe row: PC 94.9, VR 0.1, RP 99.9
      → PC and RP are stably selected
  27. 33 Additional experiment
      • Investigate other combinations of auxiliary tasks
      • Environment: lt_horseshoe_color (horseshoe, DeepMind Lab)
      • Comparison: compare scores in horseshoe
        - Three auxiliary tasks (UNREAL)
        - Value Function Replay (VR)
        - Pixel Control and Reward Prediction (PC+RP)
        - Without auxiliary tasks, main task only (w/o aux)
  28. 36 Additional experiment result
      → VR scores lower than the main task alone
      → PC+RP achieves a score about as high as UNREAL
  29. 37 Conclusion
      • Introducing auxiliary tasks is expected to improve the accuracy of the main task
        - Unsuitable auxiliary tasks lead to reduced accuracy
        → Suitable auxiliary tasks need to be introduced to improve the accuracy of the main task
      • Auxiliary selection: adaptive selection of optimal auxiliary tasks by using DRL
        - Achieves a score comparable to the optimal auxiliary task
        - Can select appropriate auxiliary tasks for each game
          • nav_maze_static_01: UNREAL, Pixel Control
          • seekavoid_arena_01: Value Function Replay
          • lt_horseshoe_color: Pixel Control + Reward Prediction
      • Future work
        - Evaluate the proposed method in various environments with other auxiliary tasks