Slide 1

Slide 1 text

Auxiliary selection: optimal selection of auxiliary tasks using deep reinforcement learning
Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi (Chubu University)
IEVC2024, Session 6A: Computer Vision & 3D Image Processing (2)

Slide 2

Slide 2 text

2 Deep Reinforcement Learning (DRL)
• Problems involving an agent interacting with an environment: the agent takes an action in a state and receives a reward and the next state
[Figure: agent-environment loop (State, Action, Reward / Next state); application examples of RL [Mnih+, 2015] [Levine+, 2016] [Chen+, 2017] [Elia+, 2023]]
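As a concrete sketch of the interaction loop above, the snippet below runs a Gym-style agent-environment loop; the CartPole environment and the random action choice are placeholders for illustration, not the setup used in this work.

```python
import gymnasium as gym

# Agent-environment loop: observe a state, take an action,
# receive a reward and the next state from the environment.
env = gym.make("CartPole-v1")   # placeholder environment (not DeepMind Lab)
state, _ = env.reset()

for step in range(1000):
    action = env.action_space.sample()   # stand-in for the agent's policy
    next_state, reward, terminated, truncated, _ = env.step(action)
    state = next_state
    if terminated or truncated:
        state, _ = env.reset()

env.close()
```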

Slide 3

Slide 3 text

3 Asynchronous Advantage Actor-Critic (A3C) [Mnih+, 2016]
• Asynchronous
  - Each worker updates the global parameters asynchronously
• Advantage
  - The target error is calculated considering rewards more than two steps ahead in each worker
• Actor-Critic
  - Inference is split into two heads
    - Actor: policy π(a|s)
    - Critic: state-value function V(s)
[Figure: global network (Actor π(a|s), Critic V(s), parameters θ) and multiple workers, each with its own Actor/Critic (parameters θ') interacting with its own environment]
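Below is a minimal PyTorch sketch of the Actor-Critic split on this slide: a shared trunk with a policy head π(a|s) and a state-value head V(s). The layer sizes are illustrative; in A3C each worker holds such a network with parameters θ' and asynchronously applies its gradients to the global parameters θ.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with separate Actor (policy) and Critic (value) heads."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # Actor: pi(a|s)
        self.value_head = nn.Linear(hidden, 1)           # Critic: V(s)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        pi = torch.softmax(self.policy_head(h), dim=-1)  # action probabilities
        v = self.value_head(h).squeeze(-1)               # state value
        return pi, v
```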

Slide 4

Slide 4 text

4 UNREAL [Jaderberg+, ICLR2017]
• Introduces 3 auxiliary tasks into A3C
  - Pixel Control: train actions that cause large changes in pixel values
  - Value Function Replay: shuffle past experiences and train the state-value function
  - Reward Prediction: predict future rewards
[Figure: UNREAL architecture: Environment, Replay Buffer (skewed sampling), Conv/Conv/FC/LSTM encoder with π(a|s) and V(s) heads (main task, A3C), Deconv heads (Adv, V) for Pixel Control, plus Value Function Replay and Reward Prediction branches]
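As an illustration of the "skewed sampling" used for Reward Prediction in the figure, the sketch below draws short sequences from the replay buffer so that sequences ending in a reward and sequences ending without one are seen about equally often; the buffer layout (a list of (state, action, reward) tuples) is an assumption.

```python
import random

def sample_rp_sequence(buffer, seq_len=4, p_rewarding=0.5):
    """Skewed sampling for Reward Prediction: oversample rare rewarding frames.
    `buffer` is assumed to be a list of (state, action, reward) tuples."""
    rewarding = [i for i in range(seq_len - 1, len(buffer)) if buffer[i][2] != 0]
    neutral   = [i for i in range(seq_len - 1, len(buffer)) if buffer[i][2] == 0]
    # Pick a rewarding ending with probability p_rewarding when both pools exist.
    if rewarding and (not neutral or random.random() < p_rewarding):
        end = random.choice(rewarding)
    else:
        end = random.choice(neutral)
    return buffer[end - seq_len + 1 : end + 1]
```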

Slide 8

Slide 8 text

8 Loss function of UNREAL
• The sum of the main-task loss and the auxiliary-task losses
  L_UNREAL = L_A3C + λ_PC Σ_c L_PC^(c) + λ_VR L_VR + λ_RP L_RP
  - L_A3C: main task loss
  - L_PC^(c): Pixel Control loss (per cell c)
  - L_VR: Value Function Replay loss
  - L_RP: Reward Prediction loss
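A minimal sketch of combining the four terms above in code; the λ coefficients are hyperparameters and the individual loss values are assumed to be computed elsewhere.

```python
def unreal_loss(l_a3c, l_pc_per_cell, l_vr, l_rp,
                lambda_pc=1.0, lambda_vr=1.0, lambda_rp=1.0):
    """Total UNREAL loss: main-task (A3C) loss plus weighted auxiliary losses.
    `l_pc_per_cell` holds one Pixel Control loss value per spatial cell c."""
    return (l_a3c
            + lambda_pc * sum(l_pc_per_cell)   # sum over cells c
            + lambda_vr * l_vr
            + lambda_rp * l_rp)
```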

Slide 9

Slide 9 text

9 Preliminary experiment
• Investigate whether each auxiliary task is effective or not
• Environment: DeepMind Lab [Beattie+, arXiv2016]
  - nav_maze_static_01 (maze)
  - seekavoid_arena_01 (seekavoid)
  - lt_horseshoe_color (horseshoe)
• Investigated auxiliary tasks:
  - Pixel Control (PC)
  - Value Function Replay (VR)
  - Reward Prediction (RP)
  - 3 auxiliary tasks (UNREAL)

Slide 10

Slide 10 text

10 nav_maze_static_01 (maze)
• A first-person-view maze game
• Actions
  - Look left / Look right
  - Forward / Backward
  - Strafe left / Strafe right
• Rewards
  - Apple: +1
  - Goal: +10

Slide 11

Slide 11 text

11 Pre-experiment result – maze
Pixel Control is effective
→ Actions that change pixel values promote movement

Slide 12

Slide 12 text

12 seekavoid_arena_01 (seekavoid)
• A game of collecting apples while avoiding lemons
• Actions
  - Look left / Look right
  - Forward / Backward
  - Strafe left / Strafe right
• Rewards
  - Apple: +1
  - Lemon: -1

Slide 13

Slide 13 text

13 Pre-experiment result – seekavoid
[Figure: pixel variation map (high / low)]
Value Function Replay is effective
- PC: actions that change pixel values are not suitable
- RP: the agent obtains rewards frequently in seekavoid

Slide 14

Slide 14 text

14 lt_horseshoe_color (horseshoe)
• A first-person shooting game
• Actions
  - Look left / Look right
  - Forward / Backward
  - Strafe left / Strafe right
  - Attack
• Reward
  - Kill the enemy: +1

Slide 15

Slide 15 text

15 Pre-experiment result – horseshoe
All auxiliary tasks are effective
- Killing the enemy = actions that change pixel values
- The reward (killing the enemy) is acquired less frequently

Slide 16

Slide 16 text

16 Summary of pre-experiment
→ Need to select auxiliary tasks suitable for each game
[Plots: score vs. global step for nav_maze_static_01, seekavoid_arena_01, lt_horseshoe_color]
Optimal auxiliary task:
- nav_maze_static_01: Pixel Control, UNREAL
- seekavoid_arena_01: Value Function Replay
- lt_horseshoe_color: UNREAL

Slide 17

Slide 17 text

17 Purpose of our research
• Use only the auxiliary tasks suitable for the environment
  - Automatically select the suitable auxiliary tasks
• Proposed method
  - Adaptive selection of the optimal auxiliary tasks using DRL
  - Construct a DRL agent that adaptively selects the optimal auxiliary task
[Figure: a DRL agent selects Pixel Control from {Pixel Control, Value Function Replay, Reward Prediction} for the environment]

Slide 18

Slide 18 text

18 Auxiliary selection
• A DRL agent that selects the auxiliary tasks suitable for the environment
  - Built as a network independent of the main-task network
  L_UNREAL = L_A3C + w_PC Σ_c L_PC^(c) + w_VR L_VR + w_RP L_RP
[Figure: UNREAL architecture with an added Auxiliary selection branch (Conv. Conv. FC) that outputs π_AS and a_AS = (w_PC, w_VR, w_RP)]
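A sketch of what such an independent selection network could look like: a small Conv-Conv-FC encoder, separate from the main UNREAL network, with a policy head π_AS over the 2³ = 8 task combinations and a value head V_AS. The layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AuxiliarySelection(nn.Module):
    """Independent Conv-Conv-FC network that chooses which auxiliary tasks to train."""
    def __init__(self, in_channels: int = 3, n_combinations: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
        )
        self.policy_head = nn.Linear(256, n_combinations)  # pi_AS over 8 combinations
        self.value_head = nn.Linear(256, 1)                # V_AS

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)
```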

Slide 19

Slide 19 text

19 Actions of Auxiliary selection
• Weight of each auxiliary task
  (w_PC, w_VR, w_RP) ∈ {0,1} × {0,1} × {0,1}
• Actions of Auxiliary selection
  a_AS = argmax π_AS, with a_AS = (w_PC, w_VR, w_RP) ranging from (0,0,0) to (1,1,1)
[Figure: the Auxiliary selection network (Conv. Conv. FC) receives the state from the buffer and outputs π_AS and a_AS = (w_PC, w_VR, w_RP)]
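A small sketch of mapping the selected action to the binary weights: take the argmax of π_AS and read the index as a 3-bit pattern covering (0,0,0) through (1,1,1). The bit ordering (w_PC, w_VR, w_RP) is an assumption for illustration.

```python
import torch

def decode_selection(pi_as: torch.Tensor):
    """Map the argmax over the 8 selection actions to (w_PC, w_VR, w_RP)."""
    a = int(torch.argmax(pi_as).item())        # a in {0, ..., 7}
    return (a >> 2) & 1, (a >> 1) & 1, a & 1   # read index as a 3-bit pattern

# Example: action index 3 -> (0, 1, 1), i.e. train VR and RP, skip PC.
print(decode_selection(torch.tensor([0., 0., 0., 1., 0., 0., 0., 0.])))
```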

Slide 20

Slide 20 text

20 Loss of main and auxiliary tasks
• Multiply the outputs of Auxiliary selection by the losses of the auxiliary tasks
  L_UNREAL = L_A3C + w_PC Σ_c L_PC^(c) + w_VR L_VR + w_RP L_RP
[Figure: UNREAL architecture with the Auxiliary selection branch; its outputs (w_PC, w_VR, w_RP) ∈ {0,1} weight the auxiliary losses L_PC, L_VR, L_RP]

Slide 21

Slide 21 text

21 Loss of main and auxiliary tasks
• Multiply the outputs of Auxiliary selection by the losses of the auxiliary tasks
  L_UNREAL = L_A3C + w_PC Σ_c L_PC^(c) + w_VR L_VR + w_RP L_RP
  Example: (w_PC, w_VR, w_RP) = (0, 1, 1) → Pixel Control ×0, Value Function Replay ×1, Reward Prediction ×1
[Figure: the selected weights switch the Pixel Control loss off and keep Value Function Replay and Reward Prediction]
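A sketch of the gated loss on this and the previous slide: the binary Auxiliary selection outputs multiply the auxiliary losses, so a task with weight 0 contributes nothing to the update. The numeric loss values are placeholders.

```python
def total_loss(l_a3c, l_pc_sum, l_vr, l_rp, w_pc, w_vr, w_rp):
    """Main-task loss plus auxiliary losses gated by the selection weights."""
    return l_a3c + w_pc * l_pc_sum + w_vr * l_vr + w_rp * l_rp

# Example from the slide: (w_PC, w_VR, w_RP) = (0, 1, 1)
# -> Pixel Control is switched off, VR and RP are trained.
print(total_loss(l_a3c=1.2, l_pc_sum=0.8, l_vr=0.3, l_rp=0.5,
                 w_pc=0, w_vr=1, w_rp=1))   # 1.2 + 0.3 + 0.5 = 2.0
```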

Slide 22

Slide 22 text

22 Loss function of Auxiliary selection
• Sum of the policy loss and the state-value-function loss
  L_AS = L_π_AS + L_V_AS
[Figure: UNREAL architecture with the Auxiliary selection branch (Conv. Conv. FC) outputting π_AS and a_AS]
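The slide only states that L_AS is the sum of a policy loss and a state-value loss; below is a minimal actor-critic-style sketch under that assumption (the exact definitions used by the authors are not given on the slide).

```python
import torch
import torch.nn.functional as F

def auxiliary_selection_loss(log_prob, advantage, value_pred, value_target):
    """L_AS = L_pi_AS + L_V_AS (standard actor-critic form, assumed).
    The advantage is detached so it acts as a constant in the policy gradient."""
    policy_loss = -(log_prob * advantage.detach()).mean()    # L_pi_AS
    value_loss = F.mse_loss(value_pred, value_target)        # L_V_AS
    return policy_loss + value_loss
```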

Slide 23

Slide 23 text

23 Experiment settings
• Environment: DeepMind Lab [Beattie+, arXiv2016]
  - nav_maze_static_01 (maze), seekavoid_arena_01 (seekavoid), lt_horseshoe_color (horseshoe)
• Training settings
  - # of steps: 1.0×10^7 (maze and seekavoid), 1.0×10^8 (horseshoe)
  - # of workers: 8
• Comparison
  - Single auxiliary task (PC, VR, RP)
  - 3 auxiliary tasks (UNREAL)
  - UNREAL + Auxiliary selection (proposed)

Slide 24

Slide 24 text

24 Result – maze

Slide 25

Slide 25 text

25 Result – maze
→ The proposed method achieves a high score comparable to UNREAL and PC

Slide 26

Slide 26 text

26 Result – seekavoid

Slide 27

Slide 27 text

27 Result – seekavoid
→ The proposed method achieves a high score comparable to VR

Slide 28

Slide 28 text

28 Result – horseshoe

Slide 29

Slide 29 text

29 Result – horseshoe
→ The proposed method achieves a high score comparable to UNREAL

Slide 30

Slide 30 text

30 Analysis of the selected auxiliary tasks (maze)
Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):
             Pixel Control   Value Function Replay   Reward Prediction
maze              48.3               54.1                  41.0
seekavoid          0.1              100.0                   0.0
horseshoe         94.9                0.1                  99.9
→ In maze, all auxiliary tasks are selected roughly equally

Slide 31

Slide 31 text

31 Analysis of the selected auxiliary tasks (seekavoid)
Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):
             Pixel Control   Value Function Replay   Reward Prediction
maze              48.3               54.1                  41.0
seekavoid          0.1              100.0                   0.0
horseshoe         94.9                0.1                  99.9
→ In seekavoid, VR is stably selected

Slide 32

Slide 32 text

32 Analysis of the selected auxiliary tasks (horseshoe)
Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):
             Pixel Control   Value Function Replay   Reward Prediction
maze              48.3               54.1                  41.0
seekavoid          0.1              100.0                   0.0
horseshoe         94.9                0.1                  99.9
→ In horseshoe, PC and RP are stably selected

Slide 33

Slide 33 text

33 Additional experiment
• Investigate other combinations of auxiliary tasks
• Environment: lt_horseshoe_color (DeepMind Lab)
• Comparison: compare scores in horseshoe
  - Three auxiliary tasks (UNREAL)
  - Value Function Replay (VR)
  - Pixel Control and Reward Prediction (PC+RP)
  - Without auxiliary tasks, main task only (w/o aux)

Slide 34

Slide 34 text

34 Additional experiment result
[Plot: score vs. global step (w/o aux)]

Slide 35

Slide 35 text

35 Additional experiment result
→ VR scores lower than the main task alone

Slide 36

Slide 36 text

36 Additional experiment result
→ VR scores lower than the main task alone
→ PC+RP achieves a high score comparable to UNREAL

Slide 37

Slide 37 text

37 Conclusion
• Introducing auxiliary tasks is expected to improve main-task accuracy
  - Unsuitable auxiliary tasks lead to reduced accuracy
  → Suitable auxiliary tasks need to be introduced to improve the accuracy of the main task
• Auxiliary selection: adaptive selection of the optimal auxiliary tasks using DRL
  - Achieves a score comparable to the optimal auxiliary task
  - Can select appropriate auxiliary tasks for each game
    - nav_maze_static_01: UNREAL, Pixel Control
    - seekavoid_arena_01: Value Function Replay
    - lt_horseshoe_color: Pixel Control + Reward Prediction
• Future work
  - Evaluate the proposed method in various environments with other auxiliary tasks