Reinforcement learning (RL): learning by interacting with an environment. At each step the agent observes a state, selects an action, and the environment returns a reward and the next state.
[Figure: agent-environment interaction loop (Agent, Environment, State, Action, Reward / Next state)]
Application examples of RL: [Elia+, 2023] [Mnih+, 2015] [Chen+, 2017] [Levine+, 2016]
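To make the interaction loop above concrete, here is a minimal, self-contained sketch of an agent-environment loop. The toy environment, the random agent, and all names in it (ToyEnv, RandomAgent) are illustrative placeholders introduced here, not part of the cited works.

```python
# Minimal sketch of the agent-environment loop (state -> action -> reward / next state).
# The environment and agent here are toy placeholders, not from the original slides.
import random

class ToyEnv:
    """1-D random walk: the agent is rewarded for reaching position +3."""
    def reset(self):
        self.pos = 0
        return self.pos                       # initial state

    def step(self, action):                   # action: -1 or +1
        self.pos += action
        done = abs(self.pos) >= 3
        reward = 1.0 if self.pos >= 3 else 0.0
        return self.pos, reward, done         # next state, reward, episode end

class RandomAgent:
    def act(self, state):
        return random.choice([-1, +1])        # pick an action given the state

env, agent = ToyEnv(), RandomAgent()
state, done, total = env.reset(), False, 0.0
while not done:
    action = agent.act(state)                 # agent acts on the current state
    state, reward, done = env.step(action)    # environment returns reward and next state
    total += reward
print("episode return:", total)
```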
UNREAL: auxiliary tasks added to the A3C
• Pixel Control: train actions that cause large changes in pixel values
• Value Function Replay: shuffle past experiences and retrain the state-value function
• Reward Prediction: predict future rewards from skewed samples of the replay buffer
[Figure: the UNREAL network. Environment → Replay Buffer; shared Conv → Conv → FC → LSTM trunk, also fed the last reward and last action; FC heads output the policy π(a|s) and value V; Deconv heads output the pixel-control value and advantage maps. Colors mark the main task (A3C), Pixel Control, Value Function Replay, and Reward Prediction; Reward Prediction uses skewed sampling from the buffer.]
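As a rough illustration of the architecture in the figure, the following PyTorch sketch wires up the shared Conv → Conv → FC → LSTM trunk, the A3C policy/value heads, and a deconvolutional pixel-control head. The layer sizes, the 84x84 input resolution, and the dueling aggregation are assumptions for illustration, not the exact UNREAL hyperparameters; the Value Function Replay head reuses the value output on replayed states, and the Reward Prediction classifier is omitted for brevity.

```python
# Hedged sketch of a shared UNREAL-style network: Conv -> Conv -> FC -> LSTM trunk,
# A3C policy/value heads, and a Deconv pixel-control head (dueling value + advantage).
import torch
import torch.nn as nn

class UnrealNet(nn.Module):
    def __init__(self, n_actions, in_channels=3):
        super().__init__()
        # Shared encoder: Conv -> Conv -> FC (assumes 84x84 observations)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        # LSTM also receives the last reward (scalar) and last action (one-hot)
        self.lstm = nn.LSTMCell(256 + 1 + n_actions, 256)
        # Main A3C heads: policy pi(a|s) and state value V(s)
        self.policy = nn.Linear(256, n_actions)
        self.value = nn.Linear(256, 1)
        # Pixel-control head: deconvolutions producing dueling value and advantage maps
        self.pc_fc = nn.Sequential(nn.Linear(256, 32 * 7 * 7), nn.ReLU())
        self.pc_value = nn.ConvTranspose2d(32, 1, 4, stride=2)
        self.pc_adv = nn.ConvTranspose2d(32, n_actions, 4, stride=2)

    def forward(self, obs, last_reward, last_action_onehot, hidden):
        x = self.fc(self.conv(obs).flatten(1))
        x = torch.cat([x, last_reward, last_action_onehot], dim=1)
        h, c = self.lstm(x, hidden)
        logits, value = self.policy(h), self.value(h)
        # Dueling aggregation of the pixel-control Q-maps
        pc = self.pc_fc(h).view(-1, 32, 7, 7)
        adv = self.pc_adv(pc)
        pc_q = self.pc_value(pc) + adv - adv.mean(1, keepdim=True)
        return logits, value, pc_q, (h, c)

# Toy usage: a single 84x84 RGB observation with 4 actions
net = UnrealNet(n_actions=4)
hidden = (torch.zeros(1, 256), torch.zeros(1, 256))
out = net(torch.zeros(1, 3, 84, 84), torch.zeros(1, 1), torch.zeros(1, 4), hidden)
```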
Total loss: main task loss plus auxiliary task losses

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{A3C}} + \lambda \sum_{c} \mathcal{L}_{\mathrm{PC}}^{(c)} + \mathcal{L}_{\mathrm{VR}} + \mathcal{L}_{\mathrm{RP}}$

• $\mathcal{L}_{\mathrm{A3C}}$: main task (A3C) loss
• $\mathcal{L}_{\mathrm{PC}}^{(c)}$: Pixel Control loss
• $\mathcal{L}_{\mathrm{VR}}$: Value Function Replay loss
• $\mathcal{L}_{\mathrm{RP}}$: Reward Prediction loss
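A minimal sketch of how these losses might be combined into a single objective, assuming the individual loss terms are computed elsewhere; the coefficient lambda_pc and its value are illustrative assumptions, not taken from the slides.

```python
import torch

def unreal_loss(loss_a3c, pc_losses, loss_vr, loss_rp, lambda_pc=0.05):
    """Combine the main A3C loss with the three auxiliary losses.

    pc_losses: one Pixel Control loss per cell c; lambda_pc is an assumed
    weighting coefficient (the value 0.05 is illustrative only).
    """
    return loss_a3c + lambda_pc * sum(pc_losses) + loss_vr + loss_rp

# Toy usage with dummy scalar losses so the snippet runs on its own
def dummy():
    return torch.rand((), requires_grad=True)

total = unreal_loss(dummy(), [dummy(), dummy()], dummy(), dummy())
total.backward()  # a single backward pass trains the shared network on all tasks
```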
Experiment: investigate whether each auxiliary task is effective or not
• Environment: DeepMind Lab [Beattie+, arXiv2016]
  • nav_maze_static_01 (maze)
  • seekavoid_arena_01 (seekavoid)
  • lt_horseshoe_color (horseshoe)
• Investigated auxiliary tasks:
  • Pixel Control (PC)
  • Value Function Replay (VR)
  • Reward Prediction (RP)
  • All 3 auxiliary tasks (UNREAL)
Goal: automatically select auxiliary tasks suited to the environment
• Proposed method: adaptive selection of the optimal auxiliary tasks using DRL
  • Construct a DRL agent that adaptively selects the optimal auxiliary task (see the sketch below)
[Figure: a DRL agent observes the environment and selects one of Pixel Control, Value Function Replay, or Reward Prediction as the auxiliary task.]
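The slides do not detail how the selector agent works, so the sketch below stands in with a simple epsilon-greedy bandit over the three auxiliary tasks, rewarded by a dummy feedback signal. It illustrates the idea of adaptive selection only; it is not the authors' actual DRL selector.

```python
# Assumed stand-in for an auxiliary-task selector: an epsilon-greedy bandit
# that estimates how much each auxiliary task helps the main task.
import random

AUX_TASKS = ["pixel_control", "value_replay", "reward_prediction"]

class AuxTaskSelector:
    def __init__(self, epsilon=0.1, lr=0.1):
        self.q = {t: 0.0 for t in AUX_TASKS}   # estimated benefit of each auxiliary task
        self.epsilon, self.lr = epsilon, lr

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(AUX_TASKS)     # explore
        return max(self.q, key=self.q.get)      # exploit the currently best task

    def update(self, task, reward):
        # reward: e.g. improvement of the main-task score since the last selection
        self.q[task] += self.lr * (reward - self.q[task])

# Toy usage: pretend Value Function Replay helps most in this environment
selector = AuxTaskSelector()
for _ in range(200):
    task = selector.select()
    score_gain = 1.0 if task == "value_replay" else 0.1   # dummy feedback signal
    selector.update(task, score_gain)
print(max(selector.q, key=selector.q.get))   # -> "value_replay"
```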
Selection percentage of each auxiliary task in one episode [%] (average over 50 episodes):

Environment   Pixel Control   Value Function Replay   Reward Prediction
maze               48.3                54.1                  41.0
seekavoid           0.1               100.0                   0.0
horseshoe          94.9                 0.1                  99.9

→ maze: all auxiliary tasks are selected roughly equally
→ seekavoid: Value Function Replay is stably selected
→ horseshoe: Pixel Control and Reward Prediction are stably selected
Score comparison in horseshoe
• Environment: lt_horseshoe_color (horseshoe, DeepMind Lab)
• Compared conditions:
  • All three auxiliary tasks (UNREAL)
  • Value Function Replay (VR)
  • Pixel Control and Reward Prediction (PC+RP)
  • No auxiliary tasks, main task only (w/o aux)
Conclusion: the suitability of auxiliary tasks affects the main task accuracy
• Unsuitable auxiliary tasks lead to reduced accuracy
  → Suitable auxiliary tasks need to be introduced to improve the accuracy of the main task
• Auxiliary selection: adaptive selection of the optimal auxiliary tasks using DRL
  • Achieves a score comparable to that of the best single auxiliary task
  • Selects appropriate auxiliary tasks for each game:
    • nav_maze_static_01: UNREAL, Pixel Control
    • seekavoid_arena_01: Value Function Replay
    • lt_horseshoe_color: Pixel Control + Reward Prediction
• Future work
  • Evaluate the proposed method in various environments with other auxiliary tasks