Slide 1

Slide 1 text

Takahashi Lab Model-Based RL Study Group, Session 2
Yu Ishihara
Twitter: https://twitter.com/xeonics
GitHub: https://github.com/yuishihara
Qiita: https://qiita.com/yuishihara

Slide 2

Slide 2 text

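Goals of the study group
Understand cutting-edge research in the field where "optimal control" × "reinforcement learning" meet.
• Understand the big picture of model-based RL
• Understand the representative model-based RL methods
  • GPS, PILCO, iLQR, IOC, …
• Understand other methods
  • TRPO (this is what we cover, already in the second session)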

Slide 3

Slide 3 text

Today's menu

Paper 1: Model Based Reinforcement Learning for Atari
Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, Henryk Michalewski
arXiv:1903.00374v1 [cs.LG] 1 Mar 2019

Abstract: Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction – substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.

1. Introduction: Human players can learn to play Atari games in minutes (Tsividis et al., 2017). However, our best model-free reinforcement learning algorithms require tens or hundreds of millions of time steps – the equivalent of several weeks of training in real time. How is it that humans can learn these games so much faster? Perhaps part of the puzzle is that humans possess an intuitive understanding of the physical processes that are represented in the game: we know that planes can fly, balls can roll, and bullets can destroy aliens. We can therefore predict the outcomes of our actions.

In this paper, we explore how learned video models can enable learning in the Atari Learning Environment (ALE) benchmark (Bellemare et al., 2015; Machado et al., 2017) with a budget restricted to 100K time steps – roughly to two hours of a play time. Although prior works have proposed training predictive models for next-frame, future-frame, as well as combined future-frame and reward predictions in Atari games (Oh et al., 2015; Chiappa et al., 2017; Leibfried et al., 2016), no prior work has successfully demonstrated model-based control via such predictive models that achieve results that are competitive with model-free RL. Indeed, in a recent survey by Machado et al. this was formulated as the following challenge: "So far, there has been no clear demonstration of successful planning with a learned model in the ALE" (Section 7.2 in Machado et al. (2017)).

Using models of environments, or informally giving the agent ability to predict its future, has a fundamental appeal for reinforcement learning. The spectrum of possible applications is vast, including learning policies from the model (Watter et al., 2015; Finn et al., 2016; Finn & Levine, 2016; Ebert et al., 2017; Hafner et al., 2018; Piergiovanni et al., 2018; Rybkin et al., 2018; Sutton & Barto, 2017, Chapter 8), capturing important details of the scene (Ha & Schmidhuber, 2018), encouraging exploration (Oh et al., 2015), creating intrinsic motivation (Schmidhuber, 2010) or counterfactual reasoning (Buesing et al., 2018). One of the exciting benefits of model-based learning is the promise to substantially improve sample efficiency of deep reinforcement learning (see Chapter 8 in (Sutton & Barto, 2017)).

Speaker notes on paper 1:
• A paper that trains an agent using a model (in effect, an emulator)
• PPO is used to learn the policy
• This was supposed to be today's main theme...

Paper 2: Trust Region Policy Optimization
John Schulman (JOSCHU@EECS.BERKELEY.EDU), Sergey Levine (SLEVINE@EECS.BERKELEY.EDU), Philipp Moritz (PCMORITZ@EECS.BERKELEY.EDU), Michael Jordan (JORDAN@CS.BERKELEY.EDU), Pieter Abbeel (PABBEEL@CS.BERKELEY.EDU)
University of California, Berkeley, Department of Electrical Engineering and Computer Sciences
arXiv ...477v5 [cs.LG] 20 Apr 2017

Abstract: We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

Introduction (excerpt): ... Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.

Speaker notes on paper 2:
• The paper that PPO's ideas are built on
• Without understanding it, the merits of PPO are hard to appreciate
• Today's hidden theme

Paper 3: Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OpenAI, {joschu, filip, prafulla, alec, oleg}@openai.com

Abstract (excerpt): We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gra...

Speaker notes on paper 3:
• A method that keeps the advantages of TRPO while improving mainly on the implementation side
• Recently it is widely used in place of TRPO

Slide 4

Slide 4 text

Model Based Reinforcement Learning for Atari (Kaiser et al., [cs.LG], 1 Mar 2019)

First up...

Slide 5

Slide 5 text

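Model-free vs. Model-based RL

Characteristics of model-free RL
• Naturally, no knowledge about state transitions is required
  • For example, neither state-transition probabilities nor a model of object motion is needed
• The policy is learned from sequences τ of states, actions, and rewards obtained by interacting with the environment
  • Example: Q-learning

Characteristics of model-based RL
• Uses knowledge about state transitions
  • Computes the policy gradient analytically (PILCO)
  • Plans actions according to future states (AlphaGo)
  • Learns from a model without acting in the real environment (this paper)
    Feels Dyna-like... probably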

Slide 6

Slide 6 text

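The problem this paper raises
• Model-free RL works well, but it needs a large number of interactions with the environment
• Humans can acquire a policy in far fewer trials
• Is that because humans can anticipate how the game will actually progress and how the screen will change?
• If we train with an image-prediction model, could a policy be acquired faster, without running trials in the real environment?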

Slide 7

Slide 7 text

Overview of Simulated Policy Learning (SimPLe, the proposed method)

[Figure 1: Main loop of SimPLe. 1) the agent starts interacting with the real environment following the latest policy (initialized to random). 2) the collected observations will be used to train (update) the current world model. 3) the agent updates the policy by acting inside the world model. The new policy will be evaluated to measure the performance of the agent as well as collecting more data (back to 1). Note that world model training is self-supervised for the observed states and supervised for the reward.]
Source: Kaiser et al., Model-Based Reinforcement Learning for Atari

Speaker notes on the loop:
• Evaluate in the real environment the policy that was learned with the World Model (simulator)
• Update the model from the real-environment data collected during evaluation
• Learn the policy using only the updated World Model (simulator), with PPO
• The amount of real-environment data used for model updates < the amount of data needed to train with PPO or similar directly in the real environment. So that makes it worthwhile! ...or so the argument seems to go.

Slide 8

Slide 8 text

SimPLe pseudocode

Algorithm 1: Pseudocode for SimPLe
  Initialize policy π
  Initialize model parameters θ of env'
  Initialize empty set D
  while not done do
    ▷ collect observations from real env.
    while not enough observations do
      a ← π(s)
      (s', r) ← env(a)
      D ← D ∪ (s, a, r, s')
      s ← s'
    end while
    ▷ update model using collected data.
    θ ← TRAIN_SUPERVISED(env', D)
    ▷ update policy using world model.
    π ← TRAIN_RL(π, θ)
  end while
Source: Kaiser et al., Model-Based Reinforcement Learning for Atari

Speaker notes:
• Because this is Atari, the model is trained to predict the next frame and the reward r(s, a, s')
• TRAIN_RL is where PPO is used
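The three steps of Algorithm 1 can be sketched on a toy problem. This is only an illustration of the loop structure under loud assumptions: `ChainEnv` is a hypothetical five-state environment (not Atari), the "world model" is a plain lookup table rather than a video-prediction network, and `train_rl` uses value iteration as a stand-in for PPO. None of these names come from the paper.

```python
import random


class ChainEnv:
    """Hypothetical toy 'real' environment: states 0..4, reward 1 on reaching state 4."""
    n_states, n_actions = 5, 2          # action 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, min(self.n_states - 1, self.s + (1 if a == 1 else -1)))
        reward = 1.0 if self.s == self.n_states - 1 else 0.0
        return self.s, reward


def collect(env, policy, n_steps, rng, eps=0.2):
    """Gather (s, a, r, s') tuples; a policy of None means uniformly random actions."""
    data, s = [], env.reset()
    for _ in range(n_steps):
        if policy is None or rng.random() < eps:
            a = rng.randrange(env.n_actions)
        else:
            a = policy[s]
        s_next, r = env.step(a)
        data.append((s, a, r, s_next))
        s = env.reset() if r > 0 else s_next    # the episode ends at the goal
    return data


def train_supervised(data):
    """Stand-in world model: a lookup table (s, a) -> (s', r) built from real data."""
    return {(s, a): (s_next, r) for s, a, r, s_next in data}


def train_rl(model, n_states, n_actions, gamma=0.9, sweeps=50):
    """Stand-in for PPO: value iteration that only ever queries the learned model."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for (s, a), (s_next, r) in model.items():
            q[s][a] = r + gamma * max(q[s_next])
    return [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]


def simple_loop(iterations=3, steps_per_iter=500, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    policy, data = None, []                     # initial policy: random
    for _ in range(iterations):
        data += collect(env, policy, steps_per_iter, rng)       # 1) act in the real env
        model = train_supervised(data)                          # 2) update the model
        policy = train_rl(model, env.n_states, env.n_actions)   # 3) train inside the model
    return policy


print(simple_loop())
```

The point of the sketch is that step 3 never touches `ChainEnv`; real interaction happens only in `collect`, which is what keeps the real-environment budget small.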

Slide 9

Slide 9 text

Model training

[Figure 2: Architecture of the proposed stochastic model with discrete latent. The input to the model is four stacked frames (as well as the action selected by the agent) while the output is the next predicted frame and expected reward. Input pixels and action are embedded using fully connected layers, and there is per-pixel softmax (256 colors) in the output. This model has two main components. First, the bottom part of the network which consists of a skip-connected convolutional encoder and decoder. To condition the output on the actions of the agent, the output of each layer in the decoder is multiplied with the (learned) embedded action. Second part of the model ...]
Source: Kaiser et al., Model-Based Reinforcement Learning for Atari

Speaker notes:
• At first, the model learns to predict frames with real-environment data as its input
• After that, predicted frames are gradually mixed into the input, which keeps the predictions from drifting
• In Atari environments the screen flickers and some parts are hidden from view, so this kind of randomness is learned as well. The random parameters are discretized before being output
• The Bit Predictor is trained to imitate those random parameters

Slide 10

Slide 10 text

Experimental results
• SimPLe interacts with the real environment 100K times
• PPO is trained against the model for a number of interactions in the millions (M)

[Figure 3: Comparison with Rainbow. Each bar illustrates the number of interactions with environment required by Rainbow to achieve the same score as our method (SimPLe). The red line indicates the 100K interactions threshold which is used by our method.]
[Figure 4: Comparison with PPO. Each bar illustrates the number of interactions with environment required by PPO to achieve the same score as our method (SimPLe). The red line indicates the 100K interactions threshold which is used by our method.]
Source: Kaiser et al., Model-Based Reinforcement Learning for Atari

Speaker notes:
• The bars show how many iterations existing methods need, when interacting with the real Atari environment, to reach the same performance as SimPLe
• They need far more than 100K iterations
• I am not sure whether this is a meaningful comparison or not...

Slide 11

Slide 11 text

Trust Region Policy Optimization (Schulman, Levine, Moritz, Jordan, Abbeel; University of California, Berkeley)

You are probably more interested in PPO than in SimPLe, so from here on we cover TRPO.

Slide 12

Slide 12 text

Review of policy gradient methods

Goal of reinforcement learning: obtain the policy that maximizes the expected return.

$$\arg\max_{\pi} \mathbb{E}[G_t], \qquad G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

Aim of policy gradient methods: adjust the parameters $\theta$ of a parameterized policy $\pi_\theta$ in the direction that maximizes the expected return. Example: the REINFORCE algorithm.

$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right]$$
$$= \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=0}^{T} r(s_t, a_t)\right)\right]$$
$$\approx \frac{1}{N} \sum_{i=1}^{N} \left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=0}^{T} r(s_{i,t}, a_{i,t})\right)\right]$$

Source: http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf
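The sample-based estimate in the last line of the gradient formula can be sketched in a few lines. The setup here is an assumption made for illustration only: a state-free softmax policy over two actions, a hypothetical reward of 1 for action 1 and 0 otherwise, and short episodes of fixed length. It is not taken from the slides or from the lecture they cite.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, a):
    """Gradient of log softmax(theta)[a] with respect to theta."""
    p = softmax(theta)
    return [(1.0 if k == a else 0.0) - p[k] for k in range(len(theta))]


def reinforce_gradient(theta, horizon, n_episodes, rng):
    """Estimate (1/N) * sum_i (sum_t grad log pi(a_it)) * (sum_t r_it)."""
    grad = [0.0] * len(theta)
    p = softmax(theta)
    for _ in range(n_episodes):
        score = [0.0] * len(theta)      # sum over t of grad log pi(a_t)
        ret = 0.0                       # sum over t of rewards
        for _ in range(horizon):
            a = rng.choices(range(len(theta)), weights=p)[0]
            score = [x + y for x, y in zip(score, grad_log_pi(theta, a))]
            ret += 1.0 if a == 1 else 0.0   # hypothetical reward: 1 for action 1
        grad = [g + s * ret for g, s in zip(grad, score)]
    return [g / n_episodes for g in grad]


estimate = reinforce_gradient([0.0, 0.0], horizon=3, n_episodes=20000,
                              rng=random.Random(0))
print(estimate)
```

For this toy setup the exact gradient at theta = (0, 0) is (-0.75, 0.75), since the expected return is 3 times the probability of action 1, so the printed estimate should land close to that. A gradient-ascent step would then move theta along the estimate.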

Slide 13

Slide 13 text

The problem with vanilla policy gradient

The update direction is chosen from a region of fixed size as measured by Euclidean distance in parameter space:

$$\frac{\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0(\tau)]}{\left\lVert \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0(\tau)] \right\rVert} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \arg\max_{d \ \text{s.t.}\ \lVert d \rVert \le \epsilon} \mathbb{E}_{\tau \sim \pi_{\theta + d}}[G_0(\tau)]$$

• The normal distributions in the figure have completely different shapes, yet the distance between their parameters is the same
• The difference between the distributions is not taken into account explicitly

Source: https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/
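A small numerical check makes the point about Gaussians concrete. The sketch below uses the closed-form KL divergence between univariate normal distributions; the specific means and standard deviations are arbitrary values picked for illustration, not numbers from the slides.

```python
import math


def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


def param_distance(mu1, sigma1, mu2, sigma2):
    """Euclidean distance between the parameter vectors (mu, sigma)."""
    return math.hypot(mu1 - mu2, sigma1 - sigma2)


narrow = (0.0, 0.2, 1.0, 0.2)   # two narrow Gaussians whose means are 1 apart
wide = (0.0, 2.0, 1.0, 2.0)     # two wide Gaussians whose means are 1 apart

print(param_distance(*narrow), kl_gauss(*narrow))
print(param_distance(*wide), kl_gauss(*wide))
```

Both pairs sit at parameter distance 1, but the KL divergence is about 12.5 for the narrow pair and about 0.125 for the wide pair. A step of fixed Euclidean size can therefore change the policy distribution a lot or very little, which is the motivation for constraining the KL divergence instead.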

Slide 14

Slide 14 text

The form TRPO aims for

$$\arg\max_{d \ \text{s.t.}\ \mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta + d}) \le \epsilon} \mathbb{E}_{\tau \sim \pi_{\theta + d}}[G_0(\tau)]$$

• Change the parameters only within a range that does not stray far from the distribution of the original policy
• KL denotes the Kullback-Leibler divergence

Slide 15

Slide 15 text

Derivation of TRPO

Expected return:
$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$$

State-action value function:
$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$$

State value function:
$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$$

Advantage function:
$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$$

where $a_t \sim \pi(a_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.

Slide 16

Slide 16 text

Derivation of TRPO

The following relation between expected return and policies is known:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right]$$

Here $\tilde{\pi}$ is the new policy, $\pi$ is the original policy, $A_\pi$ is the advantage of the original policy, and the expectation is taken over trajectories generated by the new policy.

In words: take the return of the original policy and add the discounted advantage of the original policy, averaged in the proportions in which the new policy visits states and actions. The result is the return of the new policy.

Slide 17

Slide 17 text

Derivation of TRPO

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right]$$
$$= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a)$$
$$= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) \qquad \text{(swap the sums over } s \text{ and } t\text{)}$$
$$= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, A_\pi(s, a) \qquad \text{(} \gamma^t \text{ does not depend on } a \text{, so move it out of the sum over } a\text{)}$$
$$= \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

Discounted visitation frequency:
$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$$
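The identity at the end of this derivation can be checked numerically on a tiny MDP. Everything in the sketch below is a made-up example for illustration (two states, two actions, the transition table `P`, the reward vector `R`, the start distribution `MU0`); the infinite sums are truncated after 1000 steps, which is far below floating-point resolution for a discount of 0.9.

```python
GAMMA = 0.9
R = [0.0, 1.0]                          # reward r(s), state-only as on the slides
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[s][a][s'] transition probabilities
     [[0.7, 0.3], [0.1, 0.9]]]
MU0 = [1.0, 0.0]                        # initial state distribution
STATES, ACTIONS = range(2), range(2)


def evaluate(pi, n_iter=1000):
    """Return (Q_pi, V_pi) by iterating the Bellman expectation equations."""
    v = [0.0, 0.0]
    for _ in range(n_iter):
        q = [[R[s] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
              for a in ACTIONS] for s in STATES]
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    return q, v


def eta(pi):
    """Expected discounted return of pi from the initial distribution."""
    _, v = evaluate(pi)
    return sum(MU0[s] * v[s] for s in STATES)


def discounted_visitation(pi, n_iter=1000):
    """rho_pi(s) = sum_t gamma^t P(s_t = s | pi), truncated after n_iter steps."""
    d, rho, discount = list(MU0), [0.0, 0.0], 1.0
    for _ in range(n_iter):
        rho = [rho[s] + discount * d[s] for s in STATES]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2] for s in STATES for a in ACTIONS)
             for s2 in STATES]
        discount *= GAMMA
    return rho


def eta_via_advantage(pi_old, pi_new):
    """eta(pi_old) + sum_s rho_new(s) sum_a pi_new(a|s) A_old(s, a)."""
    q, v = evaluate(pi_old)
    rho = discounted_visitation(pi_new)
    return eta(pi_old) + sum(
        rho[s] * sum(pi_new[s][a] * (q[s][a] - v[s]) for a in ACTIONS)
        for s in STATES)


pi_old = [[0.5, 0.5], [0.5, 0.5]]
pi_new = [[0.2, 0.8], [0.1, 0.9]]
print(eta(pi_new), eta_via_advantage(pi_old, pi_new))
```

The two printed numbers should agree up to rounding: the return of the new policy computed directly, and the same quantity rebuilt from the old policy's return, the old policy's advantages, and the new policy's discounted visitation frequencies.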

Slide 18

Slide 18 text

Derivation of TRPO

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

Here $\eta(\tilde{\pi})$ is the expected return of the new policy, $\eta(\pi)$ is the expected return of the original policy, and the second term is the difference in expected return. If that term is greater than 0, the policy has improved.

Updating the policy as

$$\tilde{\pi}(s) = \arg\max_{a} A_\pi(s, a)$$

always improves it (policy iteration). As it stands, however, this says nothing about how to update parameters.

Slide 19

Slide 19 text

Derivation of TRPO

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

Define a slightly modified function that uses the visitation frequency of the original policy instead:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

The two agree up to the first-order derivative term, so in a neighborhood where the parameter change is small they can be treated as the same:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}$$

If

$$L_{\pi_{\theta_0}}(\pi_{\theta_0 + \epsilon}) - L_{\pi_{\theta_0}}(\pi_{\theta_0}) > 0$$

holds within that neighborhood, the policy should be improving. How close does the neighborhood have to be for this to be OK?

Slide 20

Slide 20 text

TRPO derivation

Introducing the total variation divergence:

$$D_{TV}(p \,\|\, q) = \frac{1}{2} \sum_i |p_i - q_i|$$

Let

$$\alpha = D^{\max}_{TV}(\pi_{old}, \pi_{new}) = \max_s D_{TV}\big(\pi_{old}(\cdot|s) \,\|\, \pi_{new}(\cdot|s)\big)$$

Then the following holds:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \epsilon = \max_{s,a} |A_{\pi_{old}}(s,a)|$$

Using the relation $D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q)$:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - C\, D^{\max}_{TV}(\pi_{old}, \pi_{new})^2 \ge L_{\pi_{old}}(\pi_{new}) - C\, D^{\max}_{KL}(\pi_{old}, \pi_{new}), \qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$
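As a quick numeric check of the relation $D_{TV}(p\|q)^2 \le D_{KL}(p\|q)$ used on this slide, here is a minimal sketch in plain Python. The two distributions `pi_old` and `pi_new` are made-up values for a single state, not taken from the slides.

```python
import math

def tv_divergence(p, q):
    """Total variation divergence: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical action distributions of the old and new policy in one state.
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]

d_tv = tv_divergence(pi_old, pi_new)
d_kl = kl_divergence(pi_old, pi_new)

print("D_TV   =", d_tv)
print("D_TV^2 =", d_tv ** 2)
print("D_KL   =", d_kl)
```

For these values the squared total variation divergence is smaller than the KL divergence, which is what lets the bound be restated in terms of KL.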

Slide 21

Slide 21 text

TRPO derivation

Define

$$M_i(\pi) = L_{\pi_i}(\pi) - C\, D^{\max}_{KL}(\pi_i, \pi)$$

Since $L_{\pi_i}(\pi_i) = \eta(\pi_i)$ and the KL term vanishes at $\pi = \pi_i$:

$$\eta(\pi_{i+1}) \ge M_i(\pi_{i+1}), \qquad \eta(\pi_i) = M_i(\pi_i)$$

$$\Rightarrow\ \eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$$

If we choose

$$\pi_{i+1} = \arg\max_{\pi} \big[ L_{\pi_i}(\pi) - C\, D^{\max}_{KL}(\pi_i, \pi) \big]$$

then $M_i(\pi_{i+1}) \ge M_i(\pi_i)$, so the policy is guaranteed to improve.
(This assumes the advantage estimate is exact and the max KL divergence can be computed.)

Slide 22

Slide 22 text

TRPO derivation

From the discussion so far, the policy chosen as

$$\pi_{i+1} = \arg\max_{\pi} \big[ L_{\pi_i}(\pi) - C\, D^{\max}_{KL}(\pi_i, \pi) \big]$$

improves. In terms of the parameters, it is enough to find the $\theta$ that solves

$$\underset{\theta}{\text{maximize}}\ \big[ L_{\theta_{old}}(\theta) - C\, D^{\max}_{KL}(\theta_{old}, \theta) \big]$$

In practice $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ becomes too large and the parameter updates become too small, so a trust region constraint is used instead:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}^{\rho_{\theta_{old}}}_{KL}(\theta_{old}, \theta) \le \delta$$
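To get a feel for why the penalty coefficient is too large in practice, the following sketch evaluates $C = 4\epsilon\gamma/(1-\gamma)^2$ for illustrative values. The choices `epsilon=1.0`, `gamma=0.99` and the KL value `0.01` are assumptions made for this example only.

```python
def penalty_coefficient(epsilon, gamma):
    """C = 4 * epsilon * gamma / (1 - gamma)^2 from the lower bound."""
    return 4.0 * epsilon * gamma / (1.0 - gamma) ** 2

# Illustrative values: advantages bounded by 1, discount factor 0.99.
c = penalty_coefficient(epsilon=1.0, gamma=0.99)

# Penalty paid for a hypothetical (small) max KL of 0.01.
penalty = c * 0.01

print("C       =", c)
print("penalty =", penalty)
```

With a discount factor close to 1, even a tiny KL divergence is penalized by hundreds of objective units, so the penalized update barely moves the parameters. This is the motivation for switching to a constraint with a tunable $\delta$.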

Slide 23

Slide 23 text

Break

Slide 24

Slide 24 text

TRPO implementation

When actually implementing this, the following quantity is hard to compute:

$$D^{\max}_{KL}(\pi_{old}, \pi_{new}) = \max_s D_{KL}\big(\pi_{old}(\cdot|s) \,\|\, \pi_{new}(\cdot|s)\big)$$

so TRPO uses the following instead:

$$\bar{D}^{\rho_{\theta_{old}}}_{KL}(\pi_{old}, \pi_{new}) = \mathbb{E}_{s \sim \rho_{\theta_{old}}}\big[ D_{KL}\big(\pi_{old}(\cdot|s) \,\|\, \pi_{new}(\cdot|s)\big) \big]$$

In practice, therefore, TRPO solves the following optimization problem:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}^{\rho_{\theta_{old}}}_{KL}(\theta_{old}, \theta) \le \delta$$
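The difference between the max KL and the average KL can be sketched with a few sampled states. The per-state distributions below are invented for illustration; in a real run they would come from evaluating the old and new policies on states visited under the old policy.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical action distributions in three states sampled under the old policy.
old_policy = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
new_policy = [[0.5, 0.5], [0.6, 0.4], [0.25, 0.75]]

kls = [kl_divergence(p, q) for p, q in zip(old_policy, new_policy)]

max_kl = max(kls)               # needs every state: hard to compute in general
mean_kl = sum(kls) / len(kls)   # sample average: what the slide's D-bar estimates

print("per-state KL:", kls)
print("max KL  =", max_kl)
print("mean KL =", mean_kl)
```

The maximum requires looking at every state, while the mean can be estimated from whatever states were sampled, which is why the averaged constraint is the practical one.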

Slide 25

Slide 25 text

TRPO implementation

We want to maximize using samples, so we rewrite the problem

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}^{\rho_{\theta_{old}}}_{KL}(\theta_{old}, \theta) \le \delta$$

in expectation form. The term $\eta(\pi_{\theta_{old}})$ does not change when the parameters move, so it is ignored.

$$
\begin{aligned}
\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta)
&= \underset{\theta}{\text{maximize}}\ \eta(\pi_{\theta_{old}}) + \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_\theta(s,a)\, A_{\pi_{\theta_{old}}}(s,a) \\
&= \underset{\theta}{\text{maximize}}\ \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_\theta(s,a)\, A_{\pi_{\theta_{old}}}(s,a) \\
&= \underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\Big[ \sum_a \pi_\theta(s,a)\, A_{\pi_{\theta_{old}}}(s,a) \Big] \\
&= \underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\Big[ \sum_a \pi_{\theta_{old}}(s,a)\, \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, A_{\pi_{\theta_{old}}}(s,a) \Big] \\
&= \underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, A_{\pi_{\theta_{old}}}(s,a) \Big] \\
&= \underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) \Big]
\end{aligned}
$$

The last step replaces $A$ with $Q$; the two objectives differ only by a term that does not depend on $\theta$.
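The importance-sampling step in this chain (multiplying and dividing by $\pi_{\theta_{old}}$ so that actions can be drawn from the old policy) can be checked numerically. This is a minimal sketch for a single state; the policies, the values in `q_values`, the sample count and the seed are all invented for the example.

```python
import random

# Hypothetical single-state example with three actions.
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]
q_values = [1.0, 2.0, -1.0]

# Exact value of sum_a pi_new(a) * Q(a).
exact = sum(pn * q for pn, q in zip(pi_new, q_values))

# Monte Carlo estimate using actions sampled from the OLD policy,
# reweighted by the probability ratio pi_new / pi_old.
rng = random.Random(0)
n = 50000
actions = rng.choices(range(3), weights=pi_old, k=n)
estimate = sum(pi_new[a] / pi_old[a] * q_values[a] for a in actions) / n

print("exact    =", exact)
print("estimate =", estimate)
```

The estimate approaches the exact sum as the number of samples grows, even though no action was ever sampled from the new policy. This is what makes the objective computable from rollouts of $\pi_{\theta_{old}}$.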

Slide 26

Slide 26 text

TRPO implementation

We want the optimal parameters of

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) \Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \big] \le \delta$$

Write

$$L_{\pi_{\theta_{old}}}(\theta) = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) \Big], \qquad \bar{D}^{\rho_{\pi_{\theta_{old}}}}_{KL} = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \big]$$

and take a first-order and a second-order approximation, respectively:

$$L_{\pi_{\theta_{old}}}(\theta) \approx \nabla_\theta L_{\pi_{\theta_{old}}}\big|_{\theta=\theta_{old}} (\theta - \theta_{old}) + L_{\pi_{\theta_{old}}}(\theta_{old})$$

$$\bar{D}^{\rho_{\pi_{\theta_{old}}}}_{KL} \approx \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}), \qquad \text{where } H_{i,j} = \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j} \bar{D}^{\rho_{\pi_{\theta_{old}}}}_{KL}(\pi_{\theta_{old}}, \pi_\theta)\Big|_{\theta=\theta_{old}}$$

We then solve

$$\underset{\theta}{\text{maximize}}\ \nabla_\theta L_{\pi_{\theta_{old}}}(\theta)\big|_{\theta=\theta_{old}} (\theta - \theta_{old}) + L_{\pi_{\theta_{old}}}(\theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \le \delta$$

instead, to find a candidate solution.
See Appendix C of the TRPO paper for details.

Slide 27

Slide 27 text

TRPO implementation

$$\underset{\theta}{\text{maximize}}\ \nabla_\theta L_{\pi_{\theta_{old}}}(\theta)\big|_{\theta=\theta_{old}} (\theta - \theta_{old}) + L_{\pi_{\theta_{old}}}(\theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \le \delta$$

Rewrite this as

$$\underset{\theta}{\text{maximize}}\ g^T (\theta - \theta_{old}), \quad \text{where } g = \nabla_\theta L_{\pi_{\theta_{old}}}\big|_{\theta=\theta_{old}}, \qquad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \le \delta$$

Using the method of Lagrange multipliers, this optimization problem can be rewritten as

$$\underset{\theta}{\text{maximize}}\ g^T (\theta - \theta_{old}) - \lambda \big( (\theta - \theta_{old})^T H (\theta - \theta_{old}) - 2\delta \big)$$

and the maximizing parameters are

$$\theta^* = \frac{1}{2\lambda} H^{-1} g + \theta_{old}$$

At this point the value of the coefficient is still unknown, but we can see that moving in the direction $H^{-1} g$ is optimal.

Setting $H^{-1} g = s$ and $\frac{1}{2\lambda} = \beta$:

$$\frac{1}{2} (\theta^* - \theta_{old})^T H (\theta^* - \theta_{old}) = \frac{1}{2} (\beta s)^T H (\beta s) = \delta \ \Rightarrow\ \beta = \sqrt{\frac{2\delta}{s^T H s}}$$

Candidate solution:

$$\theta^* - \theta_{old} = \beta s = \sqrt{\frac{2\delta}{s^T H s}}\, H^{-1} g$$

See Appendix C of the TRPO paper for details.
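The candidate step $\beta s$ with $s = H^{-1} g$ and $\beta = \sqrt{2\delta / (s^T H s)}$ can be sketched as follows. The slides do not say how $H^{-1} g$ is obtained; this sketch assumes a small conjugate-gradient solver, and the matrix `H`, gradient `g` and `delta` are toy values chosen for the example.

```python
import math

def mat_vec(h, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in h]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(h, g, iters=10, tol=1e-12):
    """Approximately solve H s = g for a symmetric positive-definite H."""
    s = [0.0] * len(g)
    r = list(g)   # residual g - H s, with s = 0 initially
    p = list(g)   # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        hp = mat_vec(h, p)
        alpha = rr / dot(p, hp)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s

# Toy values (assumptions): a 2x2 positive-definite H, a gradient g, and delta.
H = [[2.0, 0.5], [0.5, 1.0]]
g = [1.0, -1.0]
delta = 0.01

s = conjugate_gradient(H, g)                             # s = H^{-1} g
beta = math.sqrt(2.0 * delta / dot(s, mat_vec(H, s)))    # step scale
step = [beta * si for si in s]                           # theta* - theta_old

# The quadratic KL approximation evaluated at the step should equal delta.
quad = 0.5 * dot(step, mat_vec(H, step))
print("step =", step)
print("0.5 * step^T H step =", quad)
```

By construction the step sits exactly on the boundary of the quadratic trust region, which is why the next slide still has to verify it against the true, sampled quantities.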

Slide 28

Slide 28 text

TRPO implementation

$$\theta^* = \theta_{old} + \sqrt{\frac{2\delta}{s^T H s}}\, H^{-1} g$$

is only an approximate solution of the original problem, so from here we compute

$$L_{\pi_{\theta_{old}}}(\theta) = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) \Big], \qquad \bar{D}^{\rho_{\pi_{\theta_{old}}}}_{KL} = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \big]$$

from the actual samples and confirm that the policy really improves.

Choose $\delta$.
If there is no improvement, shrink the coefficient and recompute.
If there is improvement, update the parameters with that coefficient.
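The check-and-shrink procedure on this slide can be sketched as a backtracking line search. The function name `backtracking_line_search`, the shrink factor `0.5`, the backtrack limit, and the toy `surrogate` and `mean_kl` functions are all assumptions for illustration; real code would evaluate both quantities on sampled trajectories.

```python
def backtracking_line_search(theta_old, full_step, surrogate, mean_kl, delta,
                             shrink=0.5, max_backtracks=10):
    """Return the first shrunken update that improves the surrogate and keeps
    the KL within delta; fall back to theta_old if none is found."""
    base = surrogate(theta_old)
    coef = 1.0
    for _ in range(max_backtracks):
        theta = [t + coef * d for t, d in zip(theta_old, full_step)]
        if surrogate(theta) > base and mean_kl(theta_old, theta) <= delta:
            return theta, coef
        coef *= shrink  # not improved or constraint violated: try a smaller step
    return list(theta_old), 0.0

# Toy 1-D stand-ins (assumptions) for the sampled surrogate and KL.
def surrogate(theta):
    return -(theta[0] - 1.0) ** 2

def mean_kl(theta_a, theta_b):
    return 0.5 * (theta_a[0] - theta_b[0]) ** 2

theta_new, coef = backtracking_line_search(
    theta_old=[0.0], full_step=[3.0],
    surrogate=surrogate, mean_kl=mean_kl, delta=0.02)

print("accepted coefficient:", coef)
print("new parameters:", theta_new)
```

In this toy run the full step overshoots, and the step is halved until both the improvement check and the KL check pass.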

Slide 29

Slide 29 text

Looking back at TRPO

This is a lot of work... Isn't there something simpler?

Slide 30

Slide 30 text

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OpenAI
{joschu, filip, prafulla, alec, oleg}@openai.com

Abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

[cs.LG] 28 Aug 2017

This is where PPO comes in!

Slide 31

Slide 31 text

PPO derivation

In TRPO, instead of solving the penalized optimization problem

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) - \beta\, D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \Big]$$

we solved the following constrained optimization problem:

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) \Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \big] \le \delta$$

The reason given was that it is hard to find a value of the penalty coefficient $\beta$ that works in general.

In fact, if there were a good coefficient, or something equivalent to one, we would rather solve the original penalized optimization problem, because it can be improved with plain gradient methods.
This is the goal PPO aims for.

Slide 32

Slide 32 text

PPO derivation

Define

$$\frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)} = r_t(\theta) \qquad (\text{note: } r_t(\theta_{old}) = 1)$$

We would like to solve

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\Big[ \frac{\pi_\theta(s,a)}{\pi_{\theta_{old}}(s,a)}\, Q_{\pi_{\theta_{old}}}(s,a) \Big] = \underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\big[ r_t(\theta)\, Q_{\pi_{\theta_{old}}}(s,a) \big]$$

but, from the TRPO discussion, this fails because the step becomes too large.
It fails when the updated policy differs greatly from the original policy.
That is why TRPO imposed a limit using the KL divergence.

Slide 33

Slide 33 text

PPO derivation

$$
r_t(\theta) = \frac{\pi_\theta(s, a)}{\pi_{\theta_{\text{old}}}(s, a)}
\qquad (\text{note: } r_t(\theta_{\text{old}}) = 1)
$$

Looking at this ratio, it moves away from 1 as the policy deviates from the original policy.
PPO therefore maximizes the objective below, in which the ratio is clipped:

$$
\underset{\theta}{\text{maximize}} \;
\mathbb{E}_{s \sim \rho_{\pi_{\theta_{\text{old}}}},\, a \sim \pi_{\theta_{\text{old}}}}
\left[ \min\!\left( r_t(\theta)\, Q_{\pi_{\theta_{\text{old}}}}(s, a),\;
\text{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) Q_{\pi_{\theta_{\text{old}}}}(s, a) \right) \right]
$$

Excerpt (Schulman et al., Proximal Policy Optimization Algorithms):

Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, so $r(\theta_{\text{old}}) = 1$. TRPO maximizes a "surrogate" objective

$$
L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right] = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right]. \tag{6}
$$

The superscript CPI refers to conservative policy iteration [KL02], where this objective was proposed. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move $r_t(\theta)$ away from 1.

The main objective we propose is the following:

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \text{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right] \tag{7}
$$

where epsilon is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1 - \epsilon, 1 + \epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. Note that $L^{CLIP}(\theta) = L^{CPI}(\theta)$ to first order around $\theta_{\text{old}}$ (i.e., where $r = 1$), however, they become different as $\theta$ moves away from $\theta_{\text{old}}$. Figure 1 plots a single term (i.e., a single $t$) in $L^{CLIP}$; note that the probability ratio $r$ is clipped at $1 - \epsilon$ or $1 + \epsilon$ depending on whether the advantage is positive or negative.

[Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function $L^{CLIP}$ as a function of the probability ratio $r$, for positive advantages (left, $A > 0$) and negative advantages (right, $A < 0$). The red circle on each …]

The excerpt writes $\hat{A}$, but please read it as $Q$.
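To make the clipping concrete, here is a minimal NumPy sketch of a single term of the clipped objective. The function name `clipped_surrogate` and the sample ratios are assumptions for illustration; `eps=0.2` follows the value suggested in the excerpt, and `q` stands in for $Q$ (or $\hat{A}$ in the paper's notation).

```python
import numpy as np

def clipped_surrogate(ratio, q, eps=0.2):
    """Per-sample clipped term: min(r * q, clip(r, 1 - eps, 1 + eps) * q).

    ratio : probability ratio r_t(theta) = pi_theta / pi_theta_old
    q     : Q value (or advantage estimate) for the same sample
    eps   : clipping range
    """
    ratio = np.asarray(ratio, dtype=float)
    q = np.asarray(q, dtype=float)
    unclipped = ratio * q
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * q
    # The minimum makes the result a pessimistic (lower) bound on r * q.
    return np.minimum(unclipped, clipped)

# Hypothetical ratios: below, at, and above the clipping interval.
ratios = np.array([0.5, 1.0, 1.5])

positive = clipped_surrogate(ratios, q=1.0)
negative = clipped_surrogate(ratios, q=-1.0)
print(positive)
print(negative)
```

With a positive `q` the value stops growing once the ratio exceeds `1 + eps`, and with a negative `q` it stops improving once the ratio drops below `1 - eps`, so there is no incentive to move the policy further in either case.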

Slide 34

Slide 34 text

Finally, the results of Model Based Reinforcement Learning for Atari
https://sites.google.com/view/modelbasedrlatari/home

Slide 35

Slide 35 text

References

Lukasz Kaiser et al., Model Based Reinforcement Learning for Atari
 https://arxiv.org/abs/…
 https://sites.google.com/view/modelbasedrlatari/home
Schulman et al., Trust Region Policy Optimization
 https://arxiv.org/abs/…
Schulman et al., Proximal Policy Optimization Algorithms
 https://arxiv.org/abs/…
CS… Deep Reinforcement Learning
 http://rail.eecs.berkeley.edu/deeprlcourse/
Natural Gradient Descent
 https://wiseodd.github.io/techblog/…/natural-gradient/
RL: Trust Region Policy Optimization Explained
 https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-…
RL: Trust Region Policy Optimization Explained, Part …
 https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-part-…