Adaptive Selection of Auxiliary Tasks in UNREAL

Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi (Chubu University)
IJCAI 2019 Scaling-Up Reinforcement Learning (SURL) Workshop

Reinforcement Learning (RL)
• Problems involving an agent interacting with an environment
• Application examples of RL [Gu+, ICRA2016] [Mnih+, Nature2015]
[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a reward and the next state.]

Asynchronous Advantage Actor-Critic (A3C) [Mnih+, 2016]
• Asynchronous
  - Each worker updates the parameters asynchronously
• Advantage
  - The target error is calculated considering rewards two or more steps ahead in each worker
• Actor-Critic
  - Estimates the policy $\pi(a|s)$ and the state-value function $V(s)$
[Figure: a global network holding the actor and critic parameters; each worker keeps its own copy of the parameters, with its own actor, critic, and environment.]
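The "Advantage" bullet above can be illustrated with a small sketch. It assumes the usual n-step form, where the return is bootstrapped from the critic's value of the last state and the advantage is the return minus the value estimate. The function names and the discount factor `gamma` are illustrative and do not come from the slides.

```python
from typing import List


def n_step_returns(rewards: List[float], bootstrap_value: float,
                   gamma: float = 0.99) -> List[float]:
    """Discounted returns R_t = r_t + gamma * R_{t+1}, seeded with V(s_T)."""
    returns = []
    running = bootstrap_value
    # Walk the rollout backwards so each return reuses the next one.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards: List[float], values: List[float],
               bootstrap_value: float, gamma: float = 0.99) -> List[float]:
    """Advantage A_t = R_t - V(s_t) for every step of the rollout."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    return [ret - value for ret, value in zip(returns, values)]
```

Each worker would compute these quantities on its own short rollout before sending gradients to the global network.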

UNREAL [Jaderberg+, ICLR2017]
• Introduces three auxiliary tasks into A3C
  - Pixel Control
    • Trains actions that cause large changes in pixel values
  - Value Function Replay
    • Shuffles past experiences and trains the state-value function
  - Reward Prediction
    • Predicts future rewards
[Figure: UNREAL architecture. The environment and a replay buffer feed a Conv, Conv, FC, LSTM network (with the last reward and last action as extra inputs) that outputs $V(s)$ and $\pi(a|s)$ for the main task. Auxiliary heads: Pixel Control (FC, then DeConv branches for V and Adv), Value Function Replay, and Reward Prediction (with skewed sampling).]
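As a rough illustration of what "large changes in pixel values" could mean, the sketch below computes the mean absolute intensity change per grid cell between two consecutive frames. The cell size, the grayscale nested-list frame format, and the function name `pixel_change` are assumptions made for this example; the slides do not specify them.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of intensities (assumed format)


def pixel_change(prev: Frame, curr: Frame, cell: int = 2) -> List[List[float]]:
    """Mean absolute intensity change inside each cell x cell block."""
    height, width = len(prev), len(prev[0])
    grid = []
    # Rows and columns that do not fill a whole cell are ignored.
    for top in range(0, height - height % cell, cell):
        row = []
        for left in range(0, width - width % cell, cell):
            total = 0.0
            for y in range(top, top + cell):
                for x in range(left, left + cell):
                    total += abs(curr[y][x] - prev[y][x])
            row.append(total / (cell * cell))
        grid.append(row)
    return grid
```

A Pixel Control head would then be trained to pick actions that maximize these per-cell changes, treating them as pseudo-rewards.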


Loss function of UNREAL
• The sum of the main task loss and the auxiliary task losses
  - $L_{main}$: Main task loss
  - $\sum_c L_Q^{(c)}$: Pixel Control loss
  - $L_{VR}$: Value Function Replay loss
  - $L_{RP}$: Reward Prediction loss

$L_{UNREAL} = L_{main} + \sum_c L_Q^{(c)} + L_{VR} + L_{RP}$

(first term: main task; remaining terms: auxiliary tasks)

Preliminary experiment
• Investigate whether each auxiliary task is effective or not
• Environment: DeepMind Lab [Beattie+, arXiv2016]
  - nav_maze_static_01
  - seekavoid_arena_01
  - lt_horseshoe_color
• Configurations investigated
  - Pixel Control (PC)
  - Value Function Replay (VR)
  - Reward Prediction (RP)
  - Three auxiliary tasks (UNREAL)

nav_maze_static_01
• A first-person maze game
• Action
  - Look left
  - Look right
  - Forward
  - Backward
  - Strafe left
  - Strafe right
• Reward
  - Apple: +1
  - Goal: +10

Result (nav_maze_static_01)
• Pixel Control is effective
  - Actions that change pixel values promote movement

seekavoid_arena_01
• A game of collecting apples while avoiding lemons
• Action
  - Look left
  - Look right
  - Forward
  - Backward
  - Strafe left
  - Strafe right
• Reward
  - Apple: +1
  - Lemon: -1

Result (seekavoid_arena_01)
• Value Function Replay is effective
  - Actions that change pixel values are not suitable
  - Rewards are obtained frequently in seekavoid
[Figure: variation of pixel values, shown on a scale from low to high]

lt_horseshoe_color
• A first-person shooting game
• Action
  - Look left
  - Look right
  - Forward
  - Backward
  - Strafe left
  - Strafe right
  - Attack
• Reward
  - Kill the enemy: +1

Result (lt_horseshoe_color)
• All auxiliary tasks are effective
  - Killing an enemy is an action that changes pixel values
  - The reward (killing an enemy) is acquired less frequently

Summary of pre-experiment
[Figure: score versus global step for nav_maze_static_01, seekavoid_arena_01, and lt_horseshoe_color]

Environment          Optimal auxiliary task
nav_maze_static_01   Pixel Control, UNREAL
seekavoid_arena_01   Value Function Replay
lt_horseshoe_color   UNREAL

→ A suitable auxiliary task needs to be selected for each game

Purpose of proposed method
• Use only the auxiliary tasks that suit the environment
  - Automatically select suitable auxiliary tasks
• Proposed method
  - Auxiliary Selection
    • Adaptive selection of the optimal auxiliary tasks
[Figure: for the nav_maze_static_01 environment, Auxiliary Selection selects Pixel Control and leaves Value Function Replay and Reward Prediction unselected.]

Auxiliary Selection
• A novel task that selects the auxiliary tasks suitable for the environment
  - Its network is built independently of the main task network
[Figure: UNREAL architecture (main task, Pixel Control, Value Function Replay, Reward Prediction) plus a separate Auxiliary Selection network (Conv, Conv, FC) that outputs $V_{AS}(s)$ and $\pi_{AS}(a|s)$.]

Action of Auxiliary Selection
• The action sets the weight of each auxiliary task: $C_{PC}, C_{VR}, C_{RP}$
• 8 action patterns: $(C_{PC}, C_{VR}, C_{RP}) = (\{0,1\}, \{0,1\}, \{0,1\})$, that is, $(0,0,0)$ to $(1,1,1)$
• The action is chosen as $a = \arg\max \pi_{AS}$, with $a = \{C_{PC}, C_{VR}, C_{RP}\}$
• A weight of 1 keeps the loss of that auxiliary task (×1); a weight of 0 drops it (×0)
[Figure: UNREAL architecture with the Auxiliary Selection network attached]
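The eight action patterns and the arg max choice can be sketched as follows. The ordering of the patterns (which index maps to which weight triple) is an assumption made here for illustration, as are the names `ACTIONS` and `select_weights`; the slides only state that the patterns range from (0, 0, 0) to (1, 1, 1).

```python
from itertools import product
from typing import Dict, Sequence

AUX_TASKS = ("PC", "VR", "RP")
# All on/off combinations, ordered (0,0,0), (0,0,1), ..., (1,1,1) (assumed order).
ACTIONS = list(product((0, 1), repeat=len(AUX_TASKS)))


def select_weights(policy_probs: Sequence[float]) -> Dict[str, int]:
    """Pick the most probable action and return it as per-task weights."""
    if len(policy_probs) != len(ACTIONS):
        raise ValueError("expected one probability per action pattern")
    # arg max over the Auxiliary Selection policy output.
    best = max(range(len(ACTIONS)), key=lambda i: policy_probs[i])
    return dict(zip(AUX_TASKS, ACTIONS[best]))
```

For instance, a policy output whose largest entry sits at index 3 yields `{"PC": 0, "VR": 1, "RP": 1}` under this ordering.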

Loss of main and auxiliary tasks
• Multiply the Auxiliary Selection outputs by the losses of the auxiliary tasks

$L_{UNREAL} = L_{main} + C_{PC} \sum_c L_Q^{(c)} + C_{VR} L_{VR} + C_{RP} L_{RP}$
$(C_{PC}, C_{VR}, C_{RP}) = (\{0,1\}, \{0,1\}, \{0,1\})$

Loss of main and auxiliary tasks (example)
• Multiply the Auxiliary Selection outputs by the losses of the auxiliary tasks
• Example: $\{C_{PC}, C_{VR}, C_{RP}\} = \{0, 1, 1\}$
  - $\sum_c L_Q^{(c)}$ ×0 (Pixel Control is not used)
  - $L_{VR}$ ×1
  - $L_{RP}$ ×1
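The masking of auxiliary losses by the selected weights amounts to a weighted sum. The sketch below assumes the Pixel Control loss has already been summed over cells c; the function name `unreal_loss` and the scalar inputs are illustrative.

```python
from typing import Tuple


def unreal_loss(l_main: float, l_pc: float, l_vr: float, l_rp: float,
                weights: Tuple[int, int, int] = (1, 1, 1)) -> float:
    """L_main + C_PC * L_PC + C_VR * L_VR + C_RP * L_RP.

    l_pc stands for the Pixel Control loss already summed over cells c.
    weights is (C_PC, C_VR, C_RP), each 0 or 1.
    """
    c_pc, c_vr, c_rp = weights
    return l_main + c_pc * l_pc + c_vr * l_vr + c_rp * l_rp


# With weights (0, 1, 1) the Pixel Control term is dropped.
masked = unreal_loss(1.0, 0.5, 0.25, 0.125, weights=(0, 1, 1))  # 1.375
full = unreal_loss(1.0, 0.5, 0.25, 0.125)                       # 1.875
```

Setting all weights to 1 recovers the original UNREAL loss, and setting all to 0 leaves only the main task loss.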

Loss function of Auxiliary Selection
• The sum of the policy loss and the state-value function loss

$L_{AS} = L(\pi_{AS}) + L(V_{AS})$

  - $L(\pi_{AS})$: Loss of policy
  - $L(V_{AS})$: Loss of state-value function
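The slides give only the sum $L_{AS} = L(\pi_{AS}) + L(V_{AS})$ and do not spell out the two terms. One plausible reading, assumed here, is the standard actor-critic form: a policy-gradient term weighted by the advantage and a squared-error value term. The function name and its arguments are hypothetical.

```python
import math


def auxiliary_selection_loss(action_prob: float, value_estimate: float,
                             target_return: float) -> float:
    """Single-step sketch of L_AS = L(pi_AS) + L(V_AS).

    Assumed forms (not stated in the slides):
      L(pi_AS) = -log pi_AS(a|s) * (R - V_AS(s))
      L(V_AS)  = 0.5 * (R - V_AS(s))^2
    """
    if not 0.0 < action_prob <= 1.0:
        raise ValueError("action_prob must be in (0, 1]")
    advantage = target_return - value_estimate
    # In an autodiff framework the advantage would be treated as a constant here.
    policy_loss = -math.log(action_prob) * advantage
    value_loss = 0.5 * advantage ** 2
    return policy_loss + value_loss
```

Because the Auxiliary Selection network is independent of the main task network, this loss would update only its own parameters.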

Experiment settings
• Environment: DeepMind Lab [Beattie+, arXiv2016]
• Training settings
  - Number of steps
    • 1.0×10^[exponent illegible] steps (maze and seekavoid)
    • 1.0×10^[exponent illegible] steps (horseshoe)
  - Number of workers
    • 8
• Comparison
  - A single auxiliary task only (PC, VR, RP)
  - Three auxiliary tasks (UNREAL)
  - Proposed method (proposed)

Result (nav_maze_static_01)
→ The proposed method achieves a score as high as UNREAL or PC

Result (seekavoid_arena_01)
→ The proposed method achieves a score as high as VR

Result (lt_horseshoe_color)
→ The proposed method achieves a score as high as UNREAL

Analysis of the selected auxiliary tasks (nav_maze_static_01)
Selection percentage of each auxiliary task (AT) in one episode [%] (average over 50 episodes)

            Pixel Control   Value Function Replay   Reward Prediction
maze        48.3            54.1                    41.0
seekavoid   0.1             100.0                   0.0
horseshoe   94.9            0.1                     99.9

→ All auxiliary tasks are selected about equally

Analysis of the selected auxiliary tasks (seekavoid_arena_01)
→ VR is stably selected (PC 0.1%, VR 100.0%, RP 0.0%; average over 50 episodes)

Analysis of the selected auxiliary tasks (lt_horseshoe_color)
→ PC and RP are stably selected (PC 94.9%, VR 0.1%, RP 99.9%; average over 50 episodes)
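The selection percentages reported in the analysis slides could be computed along these lines. The sketch assumes that the selected weight triple is logged at every step, that the percentage is taken per episode, and that the per-episode values are then averaged; the function name and the data layout are illustrative.

```python
from typing import List, Sequence, Tuple

Weights = Tuple[int, int, int]  # (C_PC, C_VR, C_RP) chosen at one step


def selection_percentages(
        episodes: Sequence[Sequence[Weights]]) -> Tuple[float, ...]:
    """Average over episodes of the per-episode selection rate [%] of each task."""
    per_episode: List[Tuple[float, ...]] = []
    for steps in episodes:
        if not steps:
            continue  # skip empty episodes
        per_episode.append(tuple(
            100.0 * sum(step[i] for step in steps) / len(steps)
            for i in range(3)))
    if not per_episode:
        raise ValueError("no non-empty episodes to average")
    return tuple(
        sum(rates[i] for rates in per_episode) / len(per_episode)
        for i in range(3))
```

With 50 evaluation episodes as input, the three returned values correspond to one row of the table (PC, VR, RP).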

Additional experiment
• Investigate other combinations of auxiliary tasks
• Environment: lt_horseshoe_color (DeepMind Lab)
• Comparison: compare scores in horseshoe
  - Three auxiliary tasks (UNREAL)
  - Value Function Replay (VR)
  - Pixel Control and Reward Prediction (PC+RP)
  - Only main task (main)

Result (additional experiment, lt_horseshoe_color)
→ VR scores lower than the main task alone
→ PC+RP achieves a score as high as UNREAL

Conclusion
• Auxiliary Selection
  - Achieves a score comparable to that of the optimal auxiliary task
  - Can select appropriate auxiliary tasks for each game
    • nav_maze_static_01: UNREAL, Pixel Control
    • seekavoid_arena_01: Value Function Replay
    • lt_horseshoe_color: Pixel Control + Reward Prediction
• Future work
  - Evaluate the proposed method in various environments and with other auxiliary tasks
