Slide 1

Slide 1 text

Adaptive Selection of Auxiliary Tasks in UNREAL
Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi (Chubu University)
IJCAI 2019 Scaling-Up Reinforcement Learning (SURL) Workshop

Slide 2

Slide 2 text

Reinforcement Learning (RL)
• Problems in which an agent learns by interacting with an environment
• Application examples of RL [Gu+, ICRA2016] [Mnih+, Nature2015]
(Figure: the RL loop — the agent sends an action to the environment and receives a reward and the next state.)
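
The interaction loop above can be sketched in a few lines. This is only an illustrative sketch; `env` and `agent` are hypothetical placeholders, not a specific library API.

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical placeholders, not a specific library API.
def run_episode(env, agent):
    state = env.reset()                          # initial state from the environment
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                # agent chooses an action
        state, reward, done = env.step(action)   # environment returns reward and next state
        total_reward += reward
    return total_reward
```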

Slide 3

Slide 3 text

Asynchronous Advantage Actor-Critic (A3C) [Mnih+, 2016]
• Asynchronous
  - Each worker updates the global network parameters asynchronously
• Advantage
  - The target error is calculated using rewards two or more steps ahead in each worker
• Actor-Critic
  - Estimates both the policy π(a|s) and the state-value function V(s)
(Figure: a global network with actor and critic parameters, and multiple workers, each holding a copy of the parameters and interacting with its own environment.)
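
As a rough illustration (not the authors' code), the advantage target each worker uses can be written as an n-step discounted return minus the critic's value estimate; the discount factor below is an assumed value.

```python
# Sketch of the n-step return / advantage used by each A3C worker: the target
# for step t looks several steps ahead and bootstraps from the critic's value
# estimate at the end of the rollout. gamma=0.99 is an assumed discount factor.
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """rewards, values: per-step lists from one worker's rollout."""
    returns, ret = [], bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret                 # accumulate the discounted return backwards
        returns.append(ret)
    returns.reverse()
    # advantage A(s_t, a_t) = n-step return - V(s_t)
    return [ret - v for ret, v in zip(returns, values)]
```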

Slide 4

Slide 4 text

UNREAL [Jaderberg+, ICLR2017]
• Introduces three auxiliary tasks into A3C
  - Pixel Control: trains actions that cause large changes in pixel values
  - Value Function Replay: shuffles past experiences and retrains the state-value function
  - Reward Prediction: predicts future rewards
(Figure: the UNREAL architecture — Conv/Conv/FC/LSTM trunk with the last reward and last action as inputs, π(a|s) and V heads for the main task, DeConv heads for Pixel Control, and a replay buffer with skewed sampling feeding Value Function Replay and Reward Prediction.)
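
To make the Pixel Control idea concrete, here is a rough sketch of its auxiliary reward: the change in average pixel intensity within each spatial cell of the observation. The cell size and frame layout are illustrative assumptions, not the exact UNREAL settings.

```python
import numpy as np

# Rough sketch of the Pixel Control auxiliary reward: the absolute change in
# average intensity within each spatial cell between consecutive frames.
# cell=4 and the (H, W, C) frame layout are assumptions for illustration.
def pixel_control_rewards(obs, next_obs, cell=4):
    diff = np.abs(next_obs.astype(np.float32) - obs.astype(np.float32)).mean(axis=-1)
    h, w = diff.shape
    diff = diff[: h - h % cell, : w - w % cell]    # crop to a multiple of the cell size
    grid = diff.reshape(h // cell, cell, w // cell, cell)
    return grid.mean(axis=(1, 3))                  # one auxiliary reward per cell
```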

Slide 8

Slide 8 text

Loss function of UNREAL
• The sum of the main-task loss and the auxiliary-task losses:
  L_{UNREAL} = L_{main} + Σ_c L_Q^{(c)} + L_{VR} + L_{RP}
  - L_{main}: main-task loss
  - L_Q^{(c)}: Pixel Control loss
  - L_{VR}: Value Function Replay loss
  - L_{RP}: Reward Prediction loss
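
A minimal sketch of this combination, assuming the individual loss terms are computed elsewhere; the optional coefficients default to 1, which reproduces the plain sum above, and they reappear on later slides as the selection weights C_PC, C_VR, C_RP.

```python
# Sketch of the UNREAL objective: main-task loss plus the auxiliary losses.
# With the default coefficients of 1 this is exactly the sum shown above;
# the coefficients are reused later as binary selection weights.
def unreal_loss(l_main, l_pc_cells, l_vr, l_rp, c_pc=1.0, c_vr=1.0, c_rp=1.0):
    """l_pc_cells: iterable of per-cell Pixel Control losses L_Q^(c)."""
    return l_main + c_pc * sum(l_pc_cells) + c_vr * l_vr + c_rp * l_rp
```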

Slide 9

Slide 9 text

Preliminary experiment
• Investigate whether each auxiliary task is effective
• Environment: DeepMind Lab [Beattie+, arXiv2016]
  - nav_maze_static_01, seekavoid_arena_01, lt_horseshoe_color
• Configurations investigated
  - Pixel Control only (PC)
  - Value Function Replay only (VR)
  - Reward Prediction only (RP)
  - Three auxiliary tasks (UNREAL)

Slide 10

Slide 10 text

nav_maze_static_01
• A first-person viewpoint maze game
• Actions
  - Look left
  - Look right
  - Forward
  - Backward
  - Strafe left
  - Strafe right
• Rewards
  - Apple: +1
  - Goal: +10

Slide 11

Slide 11 text

Result (nav_maze_static_01)
• Pixel Control is effective
  - Actions that change pixel values promote movement

Slide 12

Slide 12 text

seekavoid_arena_01
• A game of collecting apples while avoiding lemons
• Actions
  - Look left
  - Look right
  - Forward
  - Backward
  - Strafe left
  - Strafe right
• Rewards
  - Apple: +1
  - Lemon: -1

Slide 13

Slide 13 text

Result (seekavoid_arena_01)
• Value Function Replay is effective
  - Actions that change pixel values are not suitable here
  - Rewards are obtained frequently in seekavoid
(Figure: pixel-variation map, from high to low variation.)

Slide 14

Slide 14 text

lt_horseshoe_color
• A first-person shooting game
• Actions
  - Look left
  - Look right
  - Forward
  - Backward
  - Strafe left
  - Strafe right
  - Attack
• Reward
  - Kill an enemy: +1

Slide 15

Slide 15 text

Result (lt_horseshoe_color)
• All auxiliary tasks are effective
  - Killing an enemy involves actions that change pixel values
  - The reward (killing an enemy) is acquired less frequently

Slide 16

Slide 16 text

Summary of the preliminary experiment
(Figure: score vs. global step learning curves for nav_maze_static_01, seekavoid_arena_01, and lt_horseshoe_color.)
• Optimal auxiliary task per environment
  - nav_maze_static_01: Pixel Control, UNREAL
  - seekavoid_arena_01: Value Function Replay
  - lt_horseshoe_color: UNREAL
→ Suitable auxiliary tasks need to be selected for each game

Slide 17

Slide 17 text

Purpose of the proposed method
• Use only the auxiliary tasks suited to the environment
  - Automatically select the suitable auxiliary tasks
• Proposed method: Auxiliary Selection
  - Adaptively selects the optimal auxiliary tasks
(Figure: in nav_maze_static_01, Auxiliary Selection picks Pixel Control and rejects Value Function Replay and Reward Prediction.)

Slide 18

Slide 18 text

Auxiliary Selection
• A novel task that selects the auxiliary tasks suited to the environment
  - Built as a network branch (Conv, Conv, FC) independent of the main-task network, with its own policy π_AS(a|s) and value V_AS(s)
(Figure: the UNREAL network — main task, Pixel Control, Value Function Replay, Reward Prediction — with the added Auxiliary Selection branch.)
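
A rough PyTorch sketch of such a branch, with assumed layer sizes (the slide does not give them): its own Conv-Conv-FC trunk, separate from the main-task network, feeding a softmax policy head over the task-selection actions and a scalar value head.

```python
import torch
import torch.nn as nn

# Rough sketch of an Auxiliary Selection branch. Layer sizes and input channels
# are assumptions; only the structure (Conv, Conv, FC -> pi_AS and V_AS)
# follows the slide.
class AuxiliarySelection(nn.Module):
    def __init__(self, in_channels=3, n_actions=8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
        )
        self.policy = nn.Linear(256, n_actions)   # logits of pi_AS(a|s)
        self.value = nn.Linear(256, 1)            # V_AS(s)

    def forward(self, obs):
        h = self.trunk(obs)
        return torch.softmax(self.policy(h), dim=-1), self.value(h)
```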

Slide 19

Slide 19 text

Action of Auxiliary Selection
• The action is a set of binary weights, one per auxiliary task: a = {C_PC, C_VR, C_RP}, each in {0, 1}
• 8 possible actions, from (C_PC, C_VR, C_RP) = (0, 0, 0) to (1, 1, 1)
• The action is chosen as the arg max of π_AS
(Figure: the network with the Auxiliary Selection branch; a weight of 1 keeps an auxiliary loss, a weight of 0 disables it.)
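
A small sketch of this action space: the 8 discrete actions can be enumerated and the arg max of π_AS decoded into binary coefficients. The ordering of the enumeration below is an assumption made only for illustration.

```python
from itertools import product

# The 8 Auxiliary Selection actions: every binary combination of
# (C_PC, C_VR, C_RP), from (0, 0, 0) to (1, 1, 1).
AS_ACTIONS = list(product((0, 1), repeat=3))

def decode_as_action(action_index):
    """Map the arg max of pi_AS (an index in [0, 8)) to (C_PC, C_VR, C_RP)."""
    return AS_ACTIONS[action_index]

# e.g. decode_as_action(3) -> (0, 1, 1): keep VR and RP, drop Pixel Control
```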

Slide 20

Slide 20 text

Loss of the main and auxiliary tasks
• Multiply the Auxiliary Selection outputs by the losses of the auxiliary tasks:
  L_{UNREAL} = L_{main} + C_PC Σ_c L_Q^{(c)} + C_VR L_{VR} + C_RP L_{RP}
  - Example: {C_PC, C_VR, C_RP} = {0, 1, 1} disables Pixel Control and keeps Value Function Replay and Reward Prediction
(Figure: the network with the Auxiliary Selection branch gating each auxiliary loss with ×1 or ×0.)

Slide 22

Slide 22 text

Loss function of Auxiliary Selection
• The sum of the policy loss and the state-value loss:
  L_{AS} = L(π_AS) + L(V_AS)
(Figure: the network with the Auxiliary Selection branch alongside the main task and auxiliary tasks.)
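
A minimal sketch of this loss, treating the selector as a standard actor-critic head: a policy-gradient term weighted by the advantage plus a squared value error. How the selector's return target is defined is not shown on this slide, so it is left as an input here.

```python
# Sketch of L_AS = L(pi_AS) + L(V_AS) for the Auxiliary Selection head,
# written as a plain actor-critic loss. The return target for the selector
# is an input because its definition is not given on this slide.
def auxiliary_selection_loss(log_prob_chosen, value_pred, return_target):
    advantage = return_target - value_pred
    policy_loss = -log_prob_chosen * advantage   # L(pi_AS): policy-gradient term
    value_loss = 0.5 * advantage ** 2            # L(V_AS): value regression term
    return policy_loss + value_loss
```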

Slide 23

Slide 23 text

Experiment settings
• Environment: DeepMind Lab [Beattie+, arXiv2016]
• Training settings
  - Number of steps: 1.0×10^? (maze and seekavoid), 1.0×10^? (horseshoe)
  - Number of workers: 8
• Comparison
  - Single auxiliary task only (PC, VR, RP)
  - Three auxiliary tasks (UNREAL)
  - Proposed method (proposed)

Slide 24

Slide 24 text

Result (nav_maze_static_01)

Slide 25

Slide 25 text

Result (nav_maze_static_01)
→ The proposed method achieves a score as high as UNREAL and PC

Slide 26

Slide 26 text

Result (seekavoid_arena_01)

Slide 27

Slide 27 text

Result (seekavoid_arena_01)
→ The proposed method achieves a score as high as VR

Slide 28

Slide 28 text

Result (lt_horseshoe_color)

Slide 29

Slide 29 text

Result (lt_horseshoe_color)
→ The proposed method achieves a score as high as UNREAL

Slide 30

Slide 30 text

Analysis of the selected auxiliary tasks (nav_maze_static_01)
Selection percentage of each auxiliary task within one episode [%] (※ averaged over 50 episodes)
             Pixel Control   Value Function Replay   Reward Prediction
maze              48.3               54.1                  41.0
seekavoid          0.1              100.0                   0.0
horseshoe         94.9                0.1                  99.9
→ In the maze, all auxiliary tasks are selected roughly equally often

Slide 31

Slide 31 text

Analysis of the selected auxiliary tasks (seekavoid_arena_01)
Selection percentage in seekavoid (※ averaged over 50 episodes; full table on Slide 30): Pixel Control 0.1%, Value Function Replay 100.0%, Reward Prediction 0.0%
→ VR is selected consistently

Slide 32

Slide 32 text

Analysis of the selected auxiliary tasks (lt_horseshoe_color)
Selection percentage in horseshoe (※ averaged over 50 episodes; full table on Slide 30): Pixel Control 94.9%, Value Function Replay 0.1%, Reward Prediction 99.9%
→ PC and RP are selected consistently

Slide 33

Slide 33 text

Additional experiment
• Investigate other combinations of auxiliary tasks
• Environment: lt_horseshoe_color (DeepMind Lab)
• Comparison: scores in horseshoe
  - Three auxiliary tasks (UNREAL)
  - Value Function Replay only (VR)
  - Pixel Control and Reward Prediction (PC+RP)
  - Main task only (main)

Slide 34

Slide 34 text

Result

Slide 35

Slide 35 text

Result
→ VR alone scores lower than the main task only

Slide 36

Slide 36 text

Result
→ VR alone scores lower than the main task only
→ PC+RP achieves a score as high as UNREAL

Slide 37

Slide 37 text

Conclusion
• Auxiliary Selection
  - Achieves scores comparable to the optimal auxiliary task
  - Can select appropriate auxiliary tasks for each game
    • nav_maze_static_01: UNREAL, Pixel Control
    • seekavoid_arena_01: Value Function Replay
    • lt_horseshoe_color: Pixel Control + Reward Prediction
• Future work
  - Evaluate the proposed method in more environments and with other auxiliary tasks