Model Based Reinforcement Learning for Atari

These are the slides from the second session of the Takahashi Lab Model Based RL study group.
They cover the following three papers:
Model Based Reinforcement Learning for Atari
Trust Region Policy Optimization
Proximal Policy Optimization Algorithms

Yu Ishihara

May 06, 2019

Transcript

  1. Today's menu   Model Based Reinforcement Learning for Atari

    Łukasz Kaiser * 1, Mohammad Babaeizadeh * 2 3, Piotr Miłoś * 4 5, Błażej Osiński * 4 5 3, Roy H Campbell 2, Konrad Czechowski 4, Dumitru Erhan 1, Chelsea Finn 1, Piotr Kozakowski 4, Sergey Levine 1, Ryan Sepassi 1, George Tucker 1, Henryk Michalewski 4 5

    Abstract. Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction – substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.

    1. Introduction. Human players can learn to play Atari games in minutes (Tsividis et al., 2017). However, our best model-free reinforcement learning algorithms require tens or hundreds of millions of time steps – the equivalent of several weeks of training in real time. How is it that humans can learn these games so much faster? Perhaps part of the puzzle is that […] processes that are represented in the game: we know that planes can fly, balls can roll, and bullets can destroy aliens. We can therefore predict the outcomes of our actions.
    In this paper, we explore how learned video models can enable learning in the Atari Learning Environment (ALE) benchmark (Bellemare et al., 2015; Machado et al., 2017) with a budget restricted to 100K time steps – roughly to two hours of a play time. Although prior works have proposed training predictive models for next-frame, future-frame, as well as combined future-frame and reward predictions in Atari games (Oh et al., 2015; Chiappa et al., 2017; Leibfried et al., 2016), no prior work has successfully demonstrated model-based control via such predictive models that achieve results that are competitive with model-free RL. Indeed, in a recent survey by Machado et al. this was formulated as the following challenge: “So far, there has been no clear demonstration of successful planning with a learned model in the ALE” (Section 7.2 in Machado et al. (2017)).

    Using models of environments, or informally giving the agent ability to predict its future, has a fundamental appeal for reinforcement learning. The spectrum of possible applications is vast, including learning policies from the model (Watter et al., 2015; Finn et al., 2016; Finn & Levine, 2016; Ebert et al., 2017; Hafner et al., 2018; Piergiovanni et al., 2018; Rybkin et al., 2018; Sutton & Barto, 2017, Chapter 8), capturing important details of the scene (Ha & Schmidhuber, 2018), encouraging exploration (Oh et al., 2015), creating intrinsic motivation (Schmidhuber, 2010) or counterfactual reasoning (Buesing et al., 2018). One of the exciting benefits of model-based learning is the promise to substantially improve sample efficiency of deep reinforcement learning (see Chapter 8 in (Sutton & Barto, 2017)).
    arXiv:1903.00374v1 [cs.LG] 1 Mar 2019

    Trust Region Policy Optimization

    John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
    University of California, Berkeley, Department of Electrical Engineering and Computer Sciences

    Abstract. We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

    […] Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005).
    Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.

    […]477v5 [cs.LG] 20 Apr 2017

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
    OpenAI, {joschu, filip, prafulla, alec, oleg}@openai.com

    Abstract. We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gra[…]

    Notes on the three papers:
    Model Based Reinforcement Learning for Atari: a paper that trains an agent using a model (in effect, an emulator). PPO is used to learn the policy. This was supposed to be today's main theme...
    Trust Region Policy Optimization: the paper on which the ideas behind PPO are based. Without understanding it, the merits of PPO are hard to appreciate. Today's hidden theme.
    Proximal Policy Optimization Algorithms: a method that keeps the advantages of TRPO while improving mainly on the implementation side. Recently it has been widely used in place of TRPO.
  2. Model Based Reinforcement Learning for Atari

    Title page of the paper by Kaiser et al., as on slide 1. First of all...
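  3. Model-free vs. model-based RL

    Characteristics of model-free RL
    • Naturally, no knowledge of the state transitions is required
    • For example, neither state transition probabilities nor a model of object motion is needed
    • The policy is learned from trajectories τ of states, actions, and rewards, obtained by interacting with the environment
    • For example, Q-learning

    Characteristics of model-based RL
    • Knowledge of the state transitions is used
    • Compute the policy gradient analytically (PILCO)
    • Plan actions according to future states (AlphaGo)
    • Learn with a model, without having to act in the real environment (this paper); Dyna-like, probably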
  4. Overview of Simulated Policy Learning (SimPLe, the proposed method)   Model-Based Reinforcement Learning for Atari

    [Figure 1 diagram. Labels: Observations, Policy, World Model, World Model Training (Self-Supervised*), RL Agent Training In World Model, Agent Evaluation In Real World (Interaction)]

    Figure 1: Main loop of SimPLe. 1) the agent starts interacting with the real environment following the latest policy (initialized to random). 2) the collected observations will be used to train (update) the current world model. 3) the agent updates the policy by acting inside the world model. The new policy will be evaluated to measure the performance of the agent as well as collecting more data (back to 1). Note that world model training is self-supervised for the observed states and supervised for the reward.

    [Kaiser et al., Model-Based Reinforcement Learning for Atari]

    • Evaluate, in the real environment, the policy that was learned with the world model (simulator)
    • Train the policy using only the updated world model (simulator); PPO is used here
    • Update the model from the real-environment data gathered during evaluation

    The amount of real-environment data used to update the model is smaller than the amount needed to train with PPO or similar directly in the real environment, so that is a win! Or so the argument seems to go...
  5. Pseudocode for SimPLe   Model-Based Reinforcement Learning for Atari

    Algorithm 1: Pseudocode for SimPLe
      Initialize policy π
      Initialize model parameters env′
      Initialize empty set D
      while not done do
        ▷ collect observations from real env.
        while not enough observations do
          a ← π(s)
          (s′, r) ← env(a)
          D ← D ∪ (s, a, r, s′)
          s ← s′
        end while
        ▷ update model using collected data.
        θ ← TRAIN_SUPERVISED(env′, D)
        ▷ update policy using world model.
        π ← TRAIN_RL(π, θ)
      end while

    Since this is Atari, the model is trained to predict the next screen and the reward r(s, a, s′).
    The policy update step (TRAIN_RL) is PPO.

    [Kaiser et al., Model-Based Reinforcement Learning for Atari]
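The loop in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "real environment" is a hypothetical 5-state chain, the world model is a lookup table rather than a video prediction network, and the policy update is value iteration inside the learned model rather than PPO. The names `real_env_step`, `collect`, `train_supervised`, `train_rl`, and `simple_loop` are invented for this sketch.

```python
import random

N_STATES, ACTIONS, GOAL = 5, (-1, +1), 4

def real_env_step(s, a):
    """Toy stand-in for the real environment: a chain with reward at the right end."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == GOAL else 0.0)

def collect(policy, n_steps, rng, eps=0.3):
    """Collect (s, a, r, s') tuples from the real environment (epsilon-greedy)."""
    data, s = [], 0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS) if rng.random() < eps else policy[s]
        s_next, r = real_env_step(s, a)
        data.append((s, a, r, s_next))
        s = 0 if s_next == GOAL else s_next  # restart the episode at the goal
    return data

def train_supervised(model, data):
    """Fit the world model; here simply a table (s, a) -> (s', r)."""
    for s, a, r, s_next in data:
        model[(s, a)] = (s_next, r)
    return model

def train_rl(model, gamma=0.9, sweeps=50):
    """Improve the policy using only the learned model (value iteration, not PPO)."""
    v = [0.0] * N_STATES

    def q(s, a):
        if (s, a) not in model:  # transitions never observed are valued at 0
            return 0.0
        s_next, r = model[(s, a)]
        return r + gamma * v[s_next]

    for _ in range(sweeps):
        v = [max(q(s, a) for a in ACTIONS) for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in range(N_STATES)]

def simple_loop(n_iters=5, steps_per_iter=200, seed=0):
    rng = random.Random(seed)
    policy = [rng.choice(ACTIONS) for _ in range(N_STATES)]  # initialize policy
    model, dataset = {}, []                                   # model and empty D
    for _ in range(n_iters):
        dataset += collect(policy, steps_per_iter, rng)       # real-env data
        model = train_supervised(model, dataset)              # update model
        policy = train_rl(model)                              # update policy in model
    return policy, model

policy, model = simple_loop()
```

The point of the structure is that `train_rl` never calls `real_env_step`: all real interaction happens in `collect`, which is what keeps the real-environment sample count small.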
  6. Model training   Model-Based Reinforcement Learning for Atari

    [Figure 2 diagram, shown for @training and @inference. Labels: 4 Input Frames, Input Action, Pixels Embedding, skip connections, Attention, Discretization, discrete latent, Bit Predictor, Per Pixel Logits, softmax, Predicted Next Frame, Predicted Reward. Legend: multiplication, dense, conv, deconv, recurrent, attention]

    Figure 2: Architecture of the proposed stochastic model with discrete latent. The input to the model is four stacked frames (as well as the action selected by the agent) while the output is the next predicted frame and expected reward. Input pixels and action are embedded using fully connected layers, and there is per-pixel softmax (256 colors) in the output. This model has two main components. First, the bottom part of the network which consists of a skip-connected convolutional encoder and decoder. To condition the output on the actions of the agent, the output of each layer in the decoder is multiplied with the (learned) embedded action. Second part of the model […]

    [Kaiser et al., Model-Based Reinforcement Learning for Atari]

    At first, the model learns to predict images with real-environment data as its input.
    After that, predicted images are gradually mixed into the input as well, which keeps the predictions from drifting.
    In Atari environments the screen flickers and some parts are hidden from view, so this kind of randomness is learned too. The random parameters are discretized before being output.
    The bit predictor is trained to imitate these random parameters.
  7. Experimental results   Model-Based Reinforcement Learning for Atari

    • SimPLe interacts with the real environment 100K times
    • PPO is trained against the model for […]M interactions

    Figure 3: Comparison with Rainbow. Each bar illustrates the number of interactions with environment required by Rainbow to achieve the same score as our method (SimPLe). The red line indicates the 100K interactions threshold which is used by our method.

    Figure 4: Comparison with PPO. Each bar illustrates the number of interactions with environment required by PPO to achieve the same score as our method (SimPLe). The red line indicates the 100K interactions threshold which is used by our method.

    [Kaiser et al., Model-Based Reinforcement Learning for Atari]

    It is hard to say whether this comparison is really meaningful...
    The bars show the number of iterations a conventional method needs, when interacting with the real Atari environment, to reach the same performance as SimPLe.
    Far more than 100K iterations are needed.
  8. Trust Region Policy Optimization

    Title page of the paper by Schulman et al., as on slide 1. You are probably more curious about PPO than about SimPLe, so from here on the topic is TRPO.
  9. Review of policy gradient methods

    Goal of reinforcement learning: obtain a policy that maximizes the expected return.
    $\arg\max_{\pi} \mathbb{E}[G_t]$
    $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$

    Aim of policy gradient methods: adjust the parameters of a parameterized policy $\pi_\theta$ in the direction that maximizes the expected return. Example: the REINFORCE algorithm.
    $\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)]$
    $= \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=0}^{T} r(s_t, a_t)\right)\right]$
    $\approx \frac{1}{N} \sum_{i=1}^{N} \left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=0}^{T} r(s_{i,t}, a_{i,t})\right)\right]$

    http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf
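The sample-based estimator on this slide can be written out directly. The following is a minimal sketch, assuming a state-independent softmax policy over two actions with logits `theta` and a hypothetical toy task in which action 1 pays reward 1 and action 0 pays 0; none of these names or settings come from the slides.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, action):
    """Gradient of log softmax(theta)[action] w.r.t. theta: one_hot(action) - pi."""
    pi = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(pi)]

def reinforce_gradient(trajectories, theta):
    """(1/N) * sum_i (sum_t grad log pi(a_it)) * (sum_t r_it)."""
    grad = [0.0] * len(theta)
    for traj in trajectories:            # traj is a list of (action, reward)
        total_return = sum(r for _, r in traj)
        score = [0.0] * len(theta)
        for a, _ in traj:
            score = [s + g for s, g in zip(score, grad_log_pi(theta, a))]
        grad = [g + s * total_return for g, s in zip(grad, score)]
    return [g / len(trajectories) for g in grad]

def sample_trajectory(theta, horizon, rng):
    """Hypothetical toy task: action 1 pays reward 1, action 0 pays 0."""
    pi = softmax(theta)
    traj = []
    for _ in range(horizon):
        a = 1 if rng.random() < pi[1] else 0
        traj.append((a, float(a)))
    return traj

def train(n_iters=50, n_samples=200, horizon=3, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(n_iters):
        batch = [sample_trajectory(theta, horizon, rng) for _ in range(n_samples)]
        grad = reinforce_gradient(batch, theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]  # gradient ascent
    return theta

theta_trained = train()
```

No baseline is subtracted here, so the estimator matches the formula above but has high variance; practical implementations usually subtract a baseline.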
  10. The problem with vanilla policy gradient   https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/

    The update is chosen from a region of the same size as measured by Euclidean distance in parameter space:
    $\frac{\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0(\tau)]}{\|\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0(\tau)]\|} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \arg\max_{d \ \mathrm{s.t.}\ \|d\| \le \epsilon} \mathbb{E}_{\tau \sim \pi_{\theta + d}}[G_0(\tau)]$

    These normal distributions have completely different shapes, yet the distance between their parameters is the same.
    The difference between the distributions themselves is not taken into account explicitly.
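The mismatch between parameter distance and distribution distance is easy to check numerically. A small sketch, using the closed-form KL divergence between two univariate Gaussians; the specific means and standard deviations are made up for illustration and are not the ones in the linked blog post.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in closed form."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

def param_distance(mu1, sigma1, mu2, sigma2):
    """Euclidean distance between the parameter vectors (mu, sigma)."""
    return math.hypot(mu1 - mu2, sigma1 - sigma2)

# Two pairs of Gaussians whose means differ by 1 in both cases.
dist_narrow = param_distance(0.0, 0.5, 1.0, 0.5)
dist_wide = param_distance(0.0, 5.0, 1.0, 5.0)
kl_narrow = kl_gaussian(0.0, 0.5, 1.0, 0.5)  # barely overlapping distributions
kl_wide = kl_gaussian(0.0, 5.0, 1.0, 5.0)    # almost identical distributions
```

Both pairs are the same Euclidean distance apart in parameter space, but the KL divergence of the narrow pair is two orders of magnitude larger, which is why a fixed step in parameter space can change the policy a lot or hardly at all.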
  11. The form TRPO aims for

    $\arg\max_{d \ \mathrm{s.t.}\ \mathrm{KL}(\pi_\theta \| \pi_{\theta + d}) \le \epsilon} \mathbb{E}_{\tau \sim \pi_{\theta + d}}[G_0(\tau)]$

    The parameters are changed only within a range that does not deviate much from the distribution of the original policy.
    KL denotes the Kullback-Leibler divergence.
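One simple way to respect a constraint of this form is to shrink a proposed step until the KL divergence to the old policy falls under the threshold. The sketch below assumes a softmax policy over a handful of actions and a given update direction; it is a plain backtracking search for illustration, not the conjugate-gradient procedure of the TRPO paper, and `kl_constrained_step` is a name invented here.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_constrained_step(theta, direction, epsilon, shrink=0.5, max_backtracks=30):
    """Take the largest step shrink**k * direction with KL(pi_theta || pi_new) <= epsilon."""
    p_old = softmax(theta)
    step = 1.0
    for _ in range(max_backtracks):
        candidate = [t + step * d for t, d in zip(theta, direction)]
        if kl_categorical(p_old, softmax(candidate)) <= epsilon:
            return candidate
        step *= shrink
    return list(theta)  # no acceptable step found: keep the old parameters

theta_old = [0.0, 0.0, 0.0]
theta_new = kl_constrained_step(theta_old, [3.0, 0.0, -3.0], epsilon=0.01)
```

The accepted step depends on how sensitive the distribution is to the parameters, not on the Euclidean length of the step, which is the property the slide asks for.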
  12. Derivation of TRPO

    Expected return:
    $\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$
    State-action value function:
    $Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$
    State value function:
    $V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$
    Advantage function:
    $A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$
    where
    $a_t \sim \pi(a_t|s_t), \quad s_{t+1} \sim P(s_{t+1}|s_t, a_t)$
  13. Derivation of TRPO

    The following relation between expected return and policies is known:
    $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right]$

    Here $\tilde{\pi}$ is the new policy, $\pi$ is the original policy, and $A_\pi$ is the advantage of the original policy; the expectation is taken over trajectories generated by the new policy.
    In words: adding the discounted advantage of the original policy, weighted by how often the new policy visits each state and action, to the return of the original policy gives the return of the new policy.
  14. Derivation of TRPO

    $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right]$
    $= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s|\tilde{\pi}) \sum_{a} \tilde{\pi}(a|s)\, \gamma^t A_\pi(s, a)$
    $= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} P(s_t = s|\tilde{\pi}) \sum_{a} \tilde{\pi}(a|s)\, \gamma^t A_\pi(s, a)$
    $= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} \gamma^t P(s_t = s|\tilde{\pi}) \sum_{a} \tilde{\pi}(a|s)\, A_\pi(s, a)$
    $= \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s)\, A_\pi(s, a)$

    Discounted visitation frequency:
    $\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$

    Second to third line: the sums over s and t are swapped.
    Third to fourth line: $\gamma^t$ does not depend on the sum over a, so it is moved outside that sum.
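The final identity of this derivation can be checked numerically on a tiny MDP. The sketch below assumes a made-up two-state, two-action MDP (transition table `P`, state rewards `REWARD`, initial distribution `RHO0`) and two arbitrary stochastic policies; all of these numbers are illustrative. Values and visitation frequencies are computed by plain fixed-point iteration.

```python
GAMMA = 0.9
REWARD = [0.0, 1.0]                      # r(s)
P = [[[0.9, 0.1], [0.2, 0.8]],           # P[s][a][s_next]
     [[0.7, 0.3], [0.1, 0.9]]]
RHO0 = [1.0, 0.0]                        # initial state distribution
S, A = range(2), range(2)

def state_values(pi, iters=1000):
    """V_pi by iterating V(s) = r(s) + gamma * E[V(s')]."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [REWARD[s] + GAMMA * sum(pi[s][a] * P[s][a][s2] * v[s2]
                                     for a in A for s2 in S) for s in S]
    return v

def advantages(pi):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s)."""
    v = state_values(pi)
    q = [[REWARD[s] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S) for a in A]
         for s in S]
    return [[q[s][a] - v[s] for a in A] for s in S]

def expected_return(pi):
    """eta(pi) = E_{s0 ~ rho0}[V_pi(s0)]."""
    v = state_values(pi)
    return sum(RHO0[s] * v[s] for s in S)

def discounted_visitation(pi, iters=1000):
    """rho_pi(s) = sum_t gamma^t P(s_t = s | pi), truncated after `iters` terms."""
    d, rho, discount = list(RHO0), [0.0, 0.0], 1.0
    for _ in range(iters):
        rho = [rho[s] + discount * d[s] for s in S]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2] for s in S for a in A) for s2 in S]
        discount *= GAMMA
    return rho

pi_old = [[0.5, 0.5], [0.5, 0.5]]        # pi[s][a]
pi_new = [[0.1, 0.9], [0.2, 0.8]]
adv = advantages(pi_old)
rho_new = discounted_visitation(pi_new)
lhs = expected_return(pi_new)
rhs = expected_return(pi_old) + sum(
    rho_new[s] * sum(pi_new[s][a] * adv[s][a] for a in A) for s in S)
```

Up to floating-point error, `lhs` and `rhs` agree, and the entries of `rho_new` sum to 1 / (1 - GAMMA), as expected for an unnormalized discounted visitation frequency.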
  15. Derivation of TRPO

    $\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s)\, A_\pi(s, a)$

    $\eta(\tilde{\pi})$ is the expected return of the new policy, $\eta(\pi)$ is the expected return of the original policy, and the remaining sum is the difference in expected return. If that difference is positive, the policy has improved.

    $\tilde{\pi}(s) = \arg\max_{a} A_\pi(s, a)$
    Updating the policy in this way always improves it (policy iteration).
    As it stands, however, this says nothing about how to update parameters.
  16. Derivation of TRPO

    $\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s)\, A_\pi(s, a)$
    $L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_\pi(s) \sum_{a} \tilde{\pi}(a|s)\, A_\pi(s, a)$

    Define a slightly modified function $L_\pi$, in which $\rho_{\tilde{\pi}}$ is replaced by $\rho_\pi$.
    It agrees with $\eta$ up to the first-order derivative term:
    $L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$
    $\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)|_{\theta = \theta_0}$
    In a neighborhood where the change in parameters is small, the two can therefore be regarded as the same.

    $L_{\pi_{\theta_0}}(\pi_{\theta_0 + \epsilon}) - L_{\pi_{\theta_0}}(\pi_{\theta_0}) > 0$
    If this holds within the neighborhood, the policy should be improving.
    How small does the neighborhood have to be for this to be OK?
  17. TRPO derivation   Introducing the total variation divergence

    $D_{TV}(p \| q) = \frac{1}{2} \sum_i |p_i - q_i|$

    With
    $\alpha = D_{TV}^{\max}(\pi_{old}, \pi_{new}) = \max_s D_{TV}(\pi_{old}(\cdot|s) \| \pi_{new}(\cdot|s))$
    the following holds:
    $\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2} \alpha^2$, where $\epsilon = \max_{s,a} |A_{\pi_{old}}(s, a)|$

    Using the relation
    $D_{TV}(p \| q)^2 \le D_{KL}(p \| q)$
    we get
    $\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - C D_{TV}^{\max}(\pi_{old}, \pi_{new})^2 \ge L_{\pi_{old}}(\pi_{new}) - C D_{KL}^{\max}(\pi_{old}, \pi_{new})$
    where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$
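    A small numerical check of the relation $D_{TV}(p \| q)^2 \le D_{KL}(p \| q)$ used above. This is only a sketch: the two distributions `p` and `q` are made-up values, and the helper names are not from the slides.

```python
import math

def tv_divergence(p, q):
    """Total variation divergence: 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical action distributions of an old and a new policy in one state.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = tv_divergence(p, q)
kl = kl_divergence(p, q)
print("TV^2 =", tv ** 2, " KL =", kl)
```

    For these values the squared TV divergence is smaller than the KL divergence, which is why the KL term can replace the TV term in the lower bound.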
  18. TRPO derivation

    Define
    $M_i(\pi) = L_{\pi_i}(\pi) - C D_{KL}^{\max}(\pi_i, \pi)$
    Since $L_{\pi_i}(\pi_i) = \eta(\pi_i)$ and the KL term vanishes at $\pi = \pi_i$,
    $\eta(\pi_{i+1}) \ge M_i(\pi_{i+1})$, $\eta(\pi_i) = M_i(\pi_i)$
    and therefore
    $\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$

    If the next policy is chosen as
    $\pi_{i+1} = \arg\max_\pi [L_{\pi_i}(\pi) - C D_{KL}^{\max}(\pi_i, \pi)]$
    the right-hand side is non-negative, so the policy is guaranteed to improve.
    This holds when the advantage estimate is exact and the max KL divergence can be computed.
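    The improvement argument above is a minorize-maximize scheme. The sketch below replays it on a one-dimensional toy problem, not on policies: `eta` is a hypothetical objective, and `surrogate` plays the role of $M_i$ (equal to `eta` at the current point, never above it elsewhere).

```python
def eta(x):
    """Toy 'true objective' (hypothetical stand-in for the expected return)."""
    return -(x - 2.0) ** 2

def eta_grad(x):
    return -2.0 * (x - 2.0)

C = 5.0  # penalty weight; any C >= 1 makes the surrogate a lower bound of this eta

def surrogate(x, x_i):
    """M_i(x): first-order model of eta at x_i minus a quadratic penalty."""
    return eta(x_i) + eta_grad(x_i) * (x - x_i) - C * (x - x_i) ** 2

x = 0.0
history = [eta(x)]
for _ in range(20):
    # Closed-form argmax of M_i: eta_grad(x_i) - 2 * C * (x - x_i) = 0
    x = x + eta_grad(x) / (2.0 * C)
    history.append(eta(x))

print(history[0], history[-1])
```

    Each step maximizes the lower bound, so `eta` never decreases, which mirrors $\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i) \ge 0$.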
  19. TRPO derivation

    From the discussion so far, the policy chosen as
    $\pi_{i+1} = \arg\max_\pi [L_{\pi_i}(\pi) - C D_{KL}^{\max}(\pi_i, \pi)]$
    improves, so in terms of the parameters it is enough to find the $\theta$ that solves
    $\underset{\theta}{\text{maximize}}\ [L_{\theta_{old}}(\theta) - C D_{KL}^{\max}(\theta_{old}, \theta)]$

    In an implementation, $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ becomes too large and the parameter updates become very small, so a trust region constraint is used instead:
    $\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta)$ s.t. $\bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$
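    To get a feel for how large $C$ becomes, here is a quick calculation. The value $\epsilon = 1$ (maximum absolute advantage) is an assumption picked only for illustration.

```python
def penalty_coefficient(epsilon, gamma):
    """C = 4 * epsilon * gamma / (1 - gamma)^2 from the lower bound."""
    return 4.0 * epsilon * gamma / (1.0 - gamma) ** 2

epsilon = 1.0  # assumed maximum absolute advantage
for gamma in (0.9, 0.99, 0.999):
    print(gamma, penalty_coefficient(epsilon, gamma))
```

    With a discount factor of 0.99 the coefficient is already in the tens of thousands, so even a tiny KL divergence is penalized heavily and the step size collapses.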
  20. TRPO implementation

    In an actual implementation, the following quantity is hard to compute:
    $D_{KL}^{\max}(\pi_{old}, \pi_{new}) = \max_s D_{KL}(\pi_{old}(\cdot|s) \| \pi_{new}(\cdot|s))$

    TRPO therefore uses the following instead:
    $\bar{D}_{KL}^{\rho_{\theta_{old}}}(\pi_{old}, \pi_{new}) = E_{s \sim \rho_{\theta_{old}}}[D_{KL}(\pi_{old}(\cdot|s) \| \pi_{new}(\cdot|s))]$

    So in practice TRPO solves the following optimization problem:
    $\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta)$ s.t. $\bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$
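    The difference between the two quantities can be sketched with tabular policies. Everything here is hypothetical: the three states, the action probabilities, and the list of visited states. The maximum needs every state, whereas the average only needs the states that were actually sampled.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical tabular policies: state -> action probabilities.
pi_old = {"s0": [0.5, 0.5], "s1": [0.9, 0.1], "s2": [0.2, 0.8]}
pi_new = {"s0": [0.6, 0.4], "s1": [0.9, 0.1], "s2": [0.5, 0.5]}

# States visited while running pi_old (an assumed sample of trajectories).
visited = ["s0", "s0", "s1", "s1", "s1", "s2"]

per_state = {s: kl(pi_old[s], pi_new[s]) for s in pi_old}
max_kl = max(per_state.values())                              # needs all states
mean_kl = sum(per_state[s] for s in visited) / len(visited)   # sample average

print("max KL:", max_kl, " mean KL:", mean_kl)
```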
  21. TRPO implementation

    We want to maximize using samples, so we rewrite
    $\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta)$ s.t. $\bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$
    in expectation form. The term $\eta(\pi_{\theta_{old}})$ does not change when the parameters move, so it is ignored:

    $\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta)$
    $= \underset{\theta}{\text{maximize}}\ \eta(\pi_{\theta_{old}}) + \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_\theta(s, a) A_{\pi_{\theta_{old}}}(s, a)$
    $= \underset{\theta}{\text{maximize}}\ \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_\theta(s, a) A_{\pi_{\theta_{old}}}(s, a)$
    $= \underset{\theta}{\text{maximize}}\ E_{s \sim \rho_{\pi_{\theta_{old}}}} \left[ \sum_a \pi_\theta(s, a) A_{\pi_{\theta_{old}}}(s, a) \right]$
    $= \underset{\theta}{\text{maximize}}\ E_{s \sim \rho_{\pi_{\theta_{old}}}} \left[ \sum_a \pi_{\theta_{old}}(s, a) \frac{\pi_\theta(s, a)}{\pi_{\theta_{old}}(s, a)} A_{\pi_{\theta_{old}}}(s, a) \right]$
    $= \underset{\theta}{\text{maximize}}\ E_{s \sim \rho_{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(s, a)}{\pi_{\theta_{old}}(s, a)} A_{\pi_{\theta_{old}}}(s, a) \right]$
    $= \underset{\theta}{\text{maximize}}\ E_{s \sim \rho_{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(s, a)}{\pi_{\theta_{old}}(s, a)} Q_{\pi_{\theta_{old}}}(s, a) \right]$

    The last step replaces $A$ with $Q$. Since $A = Q - V$ and $V$ does not depend on the action, this changes the objective only by a constant.
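    The importance-sampling step in this derivation can be checked numerically for a single state. This is a sketch with invented numbers: `pi_old`, `pi_new`, and `advantage` are hypothetical, and the sample size and seed are arbitrary.

```python
import random

# Hypothetical action probabilities and advantages in one state.
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]
advantage = [1.0, -0.5, 2.0]
actions = list(range(len(pi_old)))

# Left-hand side: sum_a pi_new(a) * A(a), which needs pi_new for every action.
direct = sum(pi_new[a] * advantage[a] for a in actions)

# Right-hand side: average of ratio * A over actions sampled from pi_old.
random.seed(0)
n = 100_000
sampled = random.choices(actions, weights=pi_old, k=n)
estimate = sum(pi_new[a] / pi_old[a] * advantage[a] for a in sampled) / n

print("exact:", direct, " sampled estimate:", estimate)
```

    The sampled estimate only uses actions drawn from the old policy, which is what makes the objective computable from collected trajectories.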
  22. TRPO implementation

    We want the optimal parameters of this problem:
    $\underset{\theta}{\text{maximize}}\ E_{s \sim \rho_{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(s, a)}{\pi_{\theta_{old}}(s, a)} Q_{\pi_{\theta_{old}}}(s, a) \right]$ s.t. $E_{s \sim \rho_{\pi_{\theta_{old}}}} [D_{KL}(\pi_{\theta_{old}}(\cdot|s) \| \pi_\theta(\cdot|s))] \le \delta$

    Set
    $L_{\pi_{\theta_{old}}} = E_{s \sim \rho_{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(s, a)}{\pi_{\theta_{old}}(s, a)} Q_{\pi_{\theta_{old}}}(s, a) \right]$
    $\bar{D}_{KL}^{\rho_{\pi_{\theta_{old}}}} = E_{s \sim \rho_{\pi_{\theta_{old}}}} [D_{KL}(\pi_{\theta_{old}}(\cdot|s) \| \pi_\theta(\cdot|s))]$
    and apply a first-order and a second-order approximation, respectively. The KL term has no constant or linear part because both its value and its gradient are zero at $\theta = \theta_{old}$:
    $L_{\pi_{\theta_{old}}} \approx \nabla_\theta L_{\pi_{\theta_{old}}}|_{\theta=\theta_{old}} (\theta - \theta_{old}) + L_{\pi_{\theta_{old}}}(\theta_{old})$
    $\bar{D}_{KL}^{\rho_{\pi_{\theta_{old}}}} \approx \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old})$
    where $H_{i,j} = \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j} \bar{D}_{KL}^{\rho_{\pi_{\theta_{old}}}}(\pi_{\theta_{old}}, \pi_\theta)|_{\theta=\theta_{old}}$

    We then solve
    $\underset{\theta}{\text{maximize}}\ \nabla_\theta L_{\pi_{\theta_{old}}}(\theta)|_{\theta=\theta_{old}} (\theta - \theta_{old}) + L_{\pi_{\theta_{old}}}(\theta_{old})$ s.t. $\frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \le \delta$
    instead, to find a candidate solution.
    See Appendix C of the TRPO paper for details.
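    The second-order approximation of the KL divergence can be checked on a very small case. The sketch assumes a softmax policy over three actions in a single state, parameterized directly by its logits; in that special case the Hessian $H$ has the closed form $\mathrm{diag}(p) - p p^T$. The logits and the step `d` are made-up values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_hessian_softmax(p):
    """Hessian of KL(p_old || softmax(theta)) at theta_old: diag(p) - p p^T."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def quad_form(d, H):
    n = len(d)
    return sum(d[i] * H[i][j] * d[j] for i in range(n) for j in range(n))

theta_old = [0.2, -0.1, 0.5]   # hypothetical logits
d = [0.01, -0.02, 0.015]       # small parameter change theta - theta_old
theta = [t + di for t, di in zip(theta_old, d)]

p_old = softmax(theta_old)
exact = kl(p_old, softmax(theta))
approx = 0.5 * quad_form(d, kl_hessian_softmax(p_old))
print("exact KL:", exact, " quadratic approximation:", approx)
```

    For a step this small the two values agree closely; the gap grows as the step gets larger, which is one reason the candidate solution is later re-checked on real samples.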
  23. TRPO implementation

    Rewrite
    $\underset{\theta}{\text{maximize}}\ \nabla_\theta L_{\pi_{\theta_{old}}}(\theta)|_{\theta=\theta_{old}} (\theta - \theta_{old}) + L_{\pi_{\theta_{old}}}(\theta_{old})$ s.t. $\frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \le \delta$
    as
    $\underset{\theta}{\text{maximize}}\ g^T (\theta - \theta_{old})$ where $g = \nabla_\theta L_{\pi_{\theta_{old}}}(\theta)|_{\theta=\theta_{old}}$, s.t. $\frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \le \delta$

    Using the method of Lagrange multipliers, this optimization problem can be rewritten as
    $\underset{\theta}{\text{maximize}}\ g^T (\theta - \theta_{old}) + \lambda \left( \delta - \frac{(\theta - \theta_{old})^T H (\theta - \theta_{old})}{2} \right)$
    and the maximizing parameter is
    $\theta^* = \frac{1}{\lambda} H^{-1} g + \theta_{old}$
    At this point the value of the coefficient is still unknown, but we can see that moving in the direction of $H^{-1} g$ is optimal.

    Setting $H^{-1} g = s$ and $\frac{1}{\lambda} = \beta$, the constraint at the optimum gives
    $\frac{1}{2} (\theta^* - \theta_{old})^T H (\theta^* - \theta_{old}) = \frac{1}{2} (\beta s)^T H (\beta s) = \delta$
    $\beta = \sqrt{\frac{2\delta}{s^T H s}}$
    and the candidate solution is
    $\theta^* = \theta_{old} + \beta s = \theta_{old} + \sqrt{\frac{2\delta}{s^T H s}} H^{-1} g$
    See Appendix C of the TRPO paper for details.
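    The step computation can be sketched in a few lines. The matrix `H`, the gradient `g`, and `delta` below are invented stand-ins, and solving $H s = g$ with conjugate gradient is a choice made here (the TRPO paper describes conjugate gradient for this step; the slide itself does not say how $H^{-1} g$ is obtained).

```python
import math

def mat_vec(H, v):
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(H, g, iters=10, tol=1e-12):
    """Solve H s = g for a symmetric positive-definite H without inverting it."""
    s = [0.0] * len(g)
    r = list(g)   # residual g - H s, with s = 0 initially
    p = list(r)   # search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Hp = mat_vec(H, p)
        alpha = rr / dot(p, Hp)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s

H = [[2.0, 0.5], [0.5, 1.0]]  # assumed positive-definite stand-in for the KL Hessian
g = [1.0, -1.0]               # assumed gradient of the surrogate objective
delta = 0.01                  # assumed trust region size

s = conjugate_gradient(H, g)                            # s = H^{-1} g
beta = math.sqrt(2.0 * delta / dot(s, mat_vec(H, s)))   # largest step on the boundary
step = [beta * si for si in s]                          # theta* - theta_old
print("direction:", s, " beta:", beta)
```

    By construction, `0.5 * step^T H step` equals `delta`, so the candidate sits exactly on the boundary of the approximated trust region.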
  24. TRPO implementation

    $\theta^* = \theta_{old} + \sqrt{\frac{2\delta}{s^T H s}} H^{-1} g$
    is an approximate solution of the original problem, so from here we compute
    $L_{\pi_{\theta_{old}}} = E_{s \sim \rho_{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(s, a)}{\pi_{\theta_{old}}(s, a)} Q_{\pi_{\theta_{old}}}(s, a) \right]$
    $\bar{D}_{KL}^{\rho_{\pi_{\theta_{old}}}} = E_{s \sim \rho_{\pi_{\theta_{old}}}} [D_{KL}(\pi_{\theta_{old}}(\cdot|s) \| \pi_\theta(\cdot|s))]$
    from actual samples and confirm that the policy really improves.

    Decide delta.
    If there is no improvement, make the coefficient smaller and compute again.
    If there is improvement, update the parameters with that coefficient.
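    The check-and-shrink loop above can be sketched as a backtracking line search. The function name, the shrink factor of 0.5, and the toy `surrogate` and `mean_kl` functions are assumptions for illustration. Checking the KL constraint in addition to the improvement follows the TRPO paper's description rather than this slide.

```python
def backtracking_line_search(theta_old, full_step, surrogate, mean_kl, delta,
                             shrink=0.5, max_backtracks=10):
    """Shrink the step until the surrogate improves and the KL limit holds."""
    base = surrogate(theta_old)
    coef = 1.0
    for _ in range(max_backtracks):
        theta = [t + coef * d for t, d in zip(theta_old, full_step)]
        if surrogate(theta) > base and mean_kl(theta) <= delta:
            return theta      # accept: update the parameters with this coefficient
        coef *= shrink        # reject: make the coefficient smaller and retry
    return theta_old          # no acceptable step found, keep the old parameters

# Toy stand-ins for the sampled surrogate objective and the sampled mean KL.
theta_old = [0.0]
surrogate = lambda th: -(th[0] - 1.0) ** 2
mean_kl = lambda th: 0.5 * (th[0] - theta_old[0]) ** 2

result = backtracking_line_search(theta_old, [3.0], surrogate, mean_kl, delta=0.5)
print(result)
```

    In this toy run the full step overshoots, the half step violates the KL limit, and the quarter step is accepted.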
  25. Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
    OpenAI
    {joschu, filip, prafulla, alec, oleg}@openai.com

    Abstract
    We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

    [cs.LG] 28 Aug 2017

    So, enter PPO!
  26. 110ಋग़   maximize ✓ E s⇠⇢⇡✓old ,a⇠⇡✓old  ⇡✓

In TRPO, instead of solving the penalized optimization problem

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s\sim\rho_{\pi_{\theta_{\mathrm{old}}}},\,a\sim\pi_{\theta_{\mathrm{old}}}}\left[\frac{\pi_\theta(s,a)}{\pi_{\theta_{\mathrm{old}}}(s,a)}\,Q_{\pi_{\theta_{\mathrm{old}}}}(s,a) - \beta\,D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right]$$

the following constrained optimization problem was solved:

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s\sim\rho_{\pi_{\theta_{\mathrm{old}}}},\,a\sim\pi_{\theta_{\mathrm{old}}}}\left[\frac{\pi_\theta(s,a)}{\pi_{\theta_{\mathrm{old}}}(s,a)}\,Q_{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]\quad\text{s.t.}\quad\mathbb{E}_{s\sim\rho_{\pi_{\theta_{\mathrm{old}}}}}\left[D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right]\le\delta$$

The reason given was that it is hard to find a value of the penalty coefficient $\beta$ that works across problems.
In fact, if a good coefficient (or something equivalent to it) were available, we would rather solve the original penalized optimization problem, because it can be improved with plain gradient methods.
This is the goal that PPO aims for.
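The two TRPO formulations above can be compared on a toy case. The following is a minimal sketch, assuming a single state with a discrete action set; the function names, the example probabilities, and the values of `beta` and `delta` are illustrative assumptions, not taken from the slides.

```python
import math

def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions."""
    return sum(po * math.log(po / pn)
               for po, pn in zip(p_old, p_new) if po > 0.0)

def surrogate(p_old, p_new, q_values):
    """E_{a ~ p_old}[ p_new(a) / p_old(a) * Q(a) ] for a single state."""
    return sum(po * (pn / po) * q
               for po, pn, q in zip(p_old, p_new, q_values) if po > 0.0)

def penalized_objective(p_old, p_new, q_values, beta):
    """Penalized form: surrogate minus beta times the KL divergence."""
    return surrogate(p_old, p_new, q_values) - beta * kl_divergence(p_old, p_new)

def satisfies_trust_region(p_old, p_new, delta):
    """Constrained form: the candidate is feasible if KL <= delta."""
    return kl_divergence(p_old, p_new) <= delta

# Hypothetical example: two actions, the first one has the higher Q-value.
p_old = [0.5, 0.5]
p_new = [0.6, 0.4]
q_values = [1.0, 0.0]

print(surrogate(p_old, p_new, q_values))
print(kl_divergence(p_old, p_new))
print(penalized_objective(p_old, p_new, q_values, beta=1.0))
print(satisfies_trust_region(p_old, p_new, delta=0.05))
```

The penalized form yields a single scalar that a gradient method can ascend directly, whereas the constrained form needs a separate feasibility check; the price is that `beta` has to be chosen.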
  27. PPO derivation

Define the probability ratio

$$r_t(\theta) = \frac{\pi_\theta(s,a)}{\pi_{\theta_{\mathrm{old}}}(s,a)}\qquad(\text{note: } r_t(\theta_{\mathrm{old}}) = 1)$$

so that

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s\sim\rho_{\pi_{\theta_{\mathrm{old}}}},\,a\sim\pi_{\theta_{\mathrm{old}}}}\left[\frac{\pi_\theta(s,a)}{\pi_{\theta_{\mathrm{old}}}(s,a)}\,Q_{\pi_{\theta_{\mathrm{old}}}}(s,a)\right] = \underset{\theta}{\text{maximize}}\ \mathbb{E}_{s\sim\rho_{\pi_{\theta_{\mathrm{old}}}},\,a\sim\pi_{\theta_{\mathrm{old}}}}\left[r_t(\theta)\,Q_{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]$$

We would like to solve this problem directly, but the TRPO discussion shows that it fails because the update step becomes too large.
It fails when the updated policy differs greatly from the original policy.
That is why TRPO restricted the update with the KL divergence.
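To illustrate why the unconstrained surrogate is fragile, here is a minimal sketch assuming a two-action softmax policy in a single state. The step sizes, Q-values, and function names are hypothetical choices for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ratios(logits_new, logits_old):
    """Probability ratio r(theta) = pi_theta(a) / pi_theta_old(a) per action."""
    p_new, p_old = softmax(logits_new), softmax(logits_old)
    return [pn / po for pn, po in zip(p_new, p_old)]

def surrogate_gradient(logits, q_values):
    """Gradient of sum_a pi_theta(a) * Q(a) with respect to the logits.

    This equals the gradient of E_{a ~ pi_old}[r(theta) * Q(a)], because
    pi_old cancels inside the expectation.
    """
    p = softmax(logits)
    baseline = sum(pa * qa for pa, qa in zip(p, q_values))
    return [pa * (qa - baseline) for pa, qa in zip(p, q_values)]

def ascent_step(logits, q_values, step_size):
    """One plain gradient-ascent step on the unconstrained surrogate."""
    grad = surrogate_gradient(logits, q_values)
    return [x + step_size * g for x, g in zip(logits, grad)]

logits_old = [0.0, 0.0]
q_values = [1.0, 0.0]

small = ascent_step(logits_old, q_values, step_size=0.1)
large = ascent_step(logits_old, q_values, step_size=20.0)

print(ratios(logits_old, logits_old))  # ratio is 1 at theta_old
print(ratios(small, logits_old))       # stays close to 1
print(ratios(large, logits_old))       # moves far away from 1
```

With the small step the ratio stays near 1, so the surrogate evaluated under the old policy is still a reasonable proxy. With the large step the new policy is almost deterministic and the ratio is nowhere near 1, which is the regime the KL restriction is meant to exclude.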
  28. PPO derivation

Looking at

$$r_t(\theta) = \frac{\pi_\theta(s,a)}{\pi_{\theta_{\mathrm{old}}}(s,a)}\qquad(\text{note: } r_t(\theta_{\mathrm{old}}) = 1)$$

we see that this ratio moves away from 1 as the policy deviates from the original policy.
PPO therefore maximizes the following clipped value:

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s\sim\rho_{\pi_{\theta_{\mathrm{old}}}},\,a\sim\pi_{\theta_{\mathrm{old}}}}\left[\min\big(r_t(\theta)\,Q_{\pi_{\theta_{\mathrm{old}}}}(s,a),\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,Q_{\pi_{\theta_{\mathrm{old}}}}(s,a)\big)\right]$$

Excerpt from the PPO paper:

Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$, so $r(\theta_{\mathrm{old}}) = 1$. TRPO maximizes a "surrogate" objective

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right]. \qquad (6)$$

The superscript CPI refers to conservative policy iteration [KL02], where this objective was proposed. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move $r_t(\theta)$ away from 1.

The main objective we propose is the following:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right] \qquad (7)$$

where epsilon is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1-\epsilon,\,1+\epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. Note that $L^{CLIP}(\theta) = L^{CPI}(\theta)$ to first order around $\theta_{\mathrm{old}}$ (i.e., where $r = 1$), however, they become different as $\theta$ moves away from $\theta_{\mathrm{old}}$. Figure 1 plots a single term (i.e., a single $t$) in $L^{CLIP}$; note that the probability ratio $r$ is clipped at $1-\epsilon$ or $1+\epsilon$ depending on whether the advantage is positive or negative.

[Figure 1: two panels plotting $L^{CLIP}$ against $r$, for $A > 0$ (left, clipped at $1+\epsilon$) and $A < 0$ (right, clipped at $1-\epsilon$).]
Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function $L^{CLIP}$ as a function of the probability ratio $r$, for positive advantages (left) and negative advantages (right). The red circle on each ...

The excerpt writes $A$; please read it as $Q$.
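The clipped objective in equation (7) can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming the ratios and advantage estimates are already computed; the function names and sample values are hypothetical, and the `advantage` argument corresponds to what the slides write as Q.

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))

def clipped_term(ratio, advantage, epsilon=0.2):
    """One term of L^CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)

def clipped_objective(ratios, advantages, epsilon=0.2):
    """Empirical mean of the clipped terms over a batch of samples."""
    terms = [clipped_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Hypothetical batch: ratios on both sides of 1, with both advantage signs.
ratios = [1.5, 0.5, 0.5, 1.5, 1.0]
advantages = [1.0, 1.0, -1.0, -1.0, 2.0]

for r, a in zip(ratios, advantages):
    print(r, a, clipped_term(r, a))
print(clipped_objective(ratios, advantages))
```

For a positive advantage the term stops growing once the ratio exceeds `1 + epsilon`, and for a negative advantage it stops growing once the ratio drops below `1 - epsilon`. In the other two cases the unclipped value is kept, which is what makes the result a pessimistic bound.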
  29. References

Lukasz Kaiser, et al. Model Based Reinforcement Learning for Atari
 https://arxiv.org/abs/...
 https://sites.google.com/view/modelbasedrlatari/home

Schulman, et al. Trust Region Policy Optimization
 https://arxiv.org/abs/...

Schulman, et al. Proximal Policy Optimization Algorithms
 https://arxiv.org/abs/...

CS... Deep Reinforcement Learning
 http://rail.eecs.berkeley.edu/deeprlcourse/

Natural Gradient Descent
 https://wiseodd.github.io/techblog/.../natural-gradient/

RL Trust Region Policy Optimization Explained
 https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-...

RL Trust Region Policy Optimization Explained Part ...
 https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-part-...