Slide 1

Slide 1 text

Deep Reinforcement Learning for NLP Tomohide Shibata Kyoto University 18/09/27

Slide 2

Slide 2 text

Deep Reinforcement Learning 2 RL + DL DL + RL + RL https://www.itmedia.co.jp/news/articles/1603/09/news142.html https://www.youtube.com/watch?v=V1eYniJ0Rnk https://github.com/tensorflow/nmt

Slide 3

Slide 3 text

Deep Reinforcement Learning for Games (AlphaGo)
• Win against the European champion, a professional (2015.10)
• Win against a top player (2016.3)
• Win against the world number 1 (2017.5)
• Learn from scratch (2017.10, AlphaGo Zero)
[Figure: policy network (position → move probabilities p) and value evaluation of a position]
https://www.reddit.com/r/baduk/comments/6ttyyz/better_graph_of_go_ai_strength_over_time/
https://syncedreview.com/2017/02/24/david-silver-google-deepmind-deep-reinforcement-learning/

Slide 4

Slide 4 text

Recent NLP Papers
• Seq2seq (encoder-decoder):
– NMT [Ranzato+ 15, Johnson+ 16, He+ 16, Bahdanau+ 17, Wu+ 17, Wu+ 18, …]
– Summarization [Paulus+ 18, Celikyilmaz+ 18, Li+ 18, Chen+ 18, …]
– Dialogue [Li+ 16, Li+ 17, …]
• Parsing [Lê+ 17], Coreference Resolution [Clark+ 16], Chinese zero anaphora resolution [Yin+ 18], …
• Knowledge Base Inference [Xiong+ 18, …]
• Machine Comprehension [He+ 17, Xiong+ 17]
• …
Delayed feedback, directly optimize end metric, …

Slide 5

Slide 5 text

Three Kinds of Machine Learning
• Supervised learning
– A teacher provides the desired output for a given input
• Reinforcement learning (RL)
– Delayed feedback from the environment, in the form of a reward, when taking action a at state s
• Unsupervised learning
– Clustering, …

Slide 6

Slide 6 text

Supervised vs RL
[Figure labels: ・・・ Black win; Feedback ・・・]
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.hokkoku.co.jp/articles/-/1436695

Slide 7

Slide 7 text

Table of Contents
1. An overview of RL
2. Policy gradient
3. RL in NLP tasks
4. Implementation

Slide 8

Slide 8 text

Reinforcement Learning
• A general-purpose framework for decision making
• An agent with the capacity to act
• Each action influences the agent's future state
• Success is measured by a scalar reward signal
• Goal: select actions to maximize future reward
https://deepanshut041.github.io/Reinforcement-Learning/notes/00_Introduction_to_rl/
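The agent-environment loop described on this slide can be sketched as follows. This is a minimal illustration, not part of the original deck: `CorridorEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names for a toy one-dimensional task in which the agent is rewarded only when it reaches the goal.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk from position 0 to `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is +1 (right) or -1 (left); the position never drops below 0
        self.state = max(0, self.state + action)
        done = self.state == self.goal
        reward = 1.0 if done else 0.0  # scalar reward, given only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the agent act until the episode ends; return the total reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent acts
        state, reward, done = env.step(action)  # the action changes the state
        total += reward
        if done:
            break
    return total


always_right = lambda state: 1
total_reward = run_episode(CorridorEnv(goal=3), always_right)
```

The reward arrives only on the final step, which is the delayed-feedback setting the earlier slides refer to.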

Slide 9

Slide 9 text

Value-based, Policy-based
• Value function: predicts the value of each state or state/action pair (discount factors are ignored)
– State value: $V(s) = \mathbb{E}[R_{t+1} + R_{t+2} + \cdots \mid S_t = s]$
– Action value: $Q(s, a) = \mathbb{E}[R_{t+1} + R_{t+2} + \cdots \mid S_t = s, A_t = a]$
• Policy: maps each state to an action
– Deterministic policy: $\pi(s) = a$
– Stochastic policy: $\pi(a \mid s) = P[a \mid s]$
In NLP tasks, policy-based methods are usually used.
• They are effective in high-dimensional action spaces.
• A model trained with the conventional maximum likelihood method already serves as a policy network.
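The two policy types and the undiscounted return inside the value functions can be sketched in a few lines. This is an illustrative sketch only: the function names, the lookup-table policy representation, and the convention that `rewards[t]` holds $R_{t+1}$ are assumptions made for the example.

```python
import random


def deterministic_policy(state, table):
    """pi(s) = a: look up the single action assigned to the state."""
    return table[state]


def stochastic_policy(state, probs, rng):
    """pi(a|s) = P[a|s]: sample an action from the state's distribution."""
    actions = list(probs[state])
    weights = [probs[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def undiscounted_return(rewards, t):
    """R_{t+1} + R_{t+2} + ... for one episode, with rewards[t] = R_{t+1}."""
    return sum(rewards[t:])


rng = random.Random(0)
action_probs = {"s0": {"left": 0.2, "right": 0.8}}
sampled_action = stochastic_policy("s0", action_probs, rng)
fixed_action = deterministic_policy("s0", {"s0": "right"})
```

Averaging `undiscounted_return` over many episodes that start in state s gives a Monte Carlo estimate of V(s); restricting to episodes whose first action is a gives an estimate of Q(s, a).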

Slide 10

Slide 10 text

Conventional Seq2seq Model
Training
• Encoder input: 私 は 学生 です EOS
• Decoder input (gold): I am a student
• Decoder output: I am a student EOS
• Loss: cross entropy loss against the gold

Slide 11

Slide 11 text

Conventional Seq2seq Model
Training
• Encoder input: 私 は 学生 です EOS
• Decoder input (gold): I am a student
• Decoder output: I am a student EOS
• Cross entropy loss against the gold, one term per output position:
– $\log p(\text{I} \mid \text{私}, \text{は}, \ldots, \text{EOS})$
– $\log p(\text{am} \mid \text{I}, \text{私}, \text{は}, \ldots, \text{EOS})$
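The word-level training loss on this slide can be sketched as a sum of negative log-probabilities of the gold words. The sketch assumes the model's per-step distributions are already available as dictionaries; `teacher_forced_nll` and the toy probabilities are hypothetical and stand in for a real decoder run with teacher forcing.

```python
import math


def teacher_forced_nll(step_probs, gold):
    """Sum of -log p(gold_t | gold_<t, source) over target positions.

    step_probs[t] is the model's distribution at position t, computed with
    the gold prefix as decoder input (teacher forcing).
    """
    return -sum(math.log(probs[word]) for probs, word in zip(step_probs, gold))


gold = ["I", "am"]
step_probs = [
    {"I": 0.5, "He": 0.5},   # stands in for p(. | source)
    {"am": 0.25, "is": 0.75},  # stands in for p(. | I, source)
]
loss = teacher_forced_nll(step_probs, gold)
```

Each term depends only on the gold prefix, so the model never conditions on its own predictions during training.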

Slide 12

Slide 12 text

Conventional Seq2seq Model
Testing
• Encoder input: 私 は 学生 です EOS
• Decoder input (system output fed back): I am an actor
• Decoder output (system output): I am an actor EOS
• Each word is chosen given the previous system outputs:
– $\arg\max_y p(y \mid \text{私}, \text{は}, \ldots, \text{EOS})$
– $\arg\max_y p(y \mid \text{I}, \text{私}, \text{は}, \ldots, \text{EOS})$
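Test-time generation as drawn on this slide can be sketched as greedy decoding: at each step the most probable word is taken and fed back as input. The `next_word_probs` callback and `toy_model` are hypothetical stand-ins for a trained decoder, not an API from any library.

```python
def greedy_decode(next_word_probs, source, max_len=20, eos="EOS"):
    """Pick argmax_y p(y | previous outputs, source) until EOS or max_len."""
    output = []
    for _ in range(max_len):
        probs = next_word_probs(source, output)  # conditions on system output
        word = max(probs, key=probs.get)
        output.append(word)
        if word == eos:
            break
    return output


def toy_model(source, prefix):
    """Hypothetical model that prefers a fixed script, one word per step."""
    script = ["I", "am", "an", "actor", "EOS"]
    best = script[min(len(prefix), len(script) - 1)]
    return {w: (0.6 if w == best else 0.1) for w in script}


source = ["私", "は", "学生", "です", "EOS"]
decoded = greedy_decode(toy_model, source)
```

Unlike training, every step here conditions on the model's own earlier choices, so an early error ("an actor" instead of "a student") is carried forward.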

Slide 13

Slide 13 text

Problems
• Training and testing mismatch
– Sequence generation:
  • Training: predict the next ground-truth word given the previous ground-truth words (teacher forcing)
  • Testing: generate an entire sequence
– Objective:
  • Training: word-level loss (e.g., cross entropy loss)
  • Testing: sentence-level evaluation (e.g., BLEU)
⇒ Reinforcement Learning
Data distribution mismatch (exposure bias): $p_{\pi^*}(o_t) \neq p_{\pi_\theta}(o_t)$
[Figure 1.1: Mismatch between the distribution of training and test inputs. Labels: expert trajectory; learned policy; no data on how to recover]
https://medium.com/@hridaym.211me129/understanding-a-few-state-only-imitation-papers-cce885794b43

Slide 14

Slide 14 text

Solution: RL
• Action: generate a word
• Reward: sentence-level evaluation (e.g., BLEU)
– In NLP, a reward is usually calculated using a gold reference
• The model relies only on its own output during generation → avoids exposure bias
• The model is optimized directly with the evaluation metric → avoids the mismatch between training and testing measures
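A sentence-level reward computed against a gold reference can be sketched as below. Note the substitution: this is not BLEU but a simplified stand-in (clipped unigram precision with a brevity penalty), chosen so the example stays short; `unigram_reward` is a hypothetical name.

```python
import math
from collections import Counter


def unigram_reward(hypothesis, reference):
    """Simplified stand-in for BLEU: clipped unigram precision x brevity penalty."""
    if not hypothesis:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Each hypothesis word is credited at most as often as it occurs in the gold
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = overlap / len(hypothesis)
    # Penalize hypotheses that are shorter than the reference
    if len(hypothesis) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * precision


gold = "I am a student".split()
reward_good = unigram_reward("I am a boy".split(), gold)
reward_bad = unigram_reward("He am a boy".split(), gold)
```

The reward is available only once the whole sentence has been generated, which is why the word-level actions receive delayed feedback.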

Slide 15

Slide 15 text

RL-based Seq2seq Model
Training: sample outputs with the stochastic policy and score each with a reward (BLEU calculated using a gold reference)
• 私 は 学生 です EOS → He am a boy EOS: reward 12.0
• 私 は 学生 です EOS → I am a boy EOS: reward 38.0
• ・・・
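Sampling with a stochastic policy, as opposed to taking the argmax, can be sketched as follows. `sample_sequence` and `toy_policy` are hypothetical; the toy policy only randomizes the first word so that the two outputs on the slide are the only possible samples, and its probabilities are made up for the example.

```python
import math
import random


def sample_sequence(next_word_probs, source, rng, max_len=20, eos="EOS"):
    """Sample y_t ~ pi(. | y_<t, source); return the words and their log-probability."""
    output, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_word_probs(source, output)
        words = list(probs)
        word = rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
        log_prob += math.log(probs[word])
        output.append(word)
        if word == eos:
            break
    return output, log_prob


def toy_policy(source, prefix):
    """Hypothetical policy: random first word, then a fixed continuation."""
    if not prefix:
        return {"I": 0.7, "He": 0.3}
    rest = ["am", "a", "boy", "EOS"]
    return {rest[min(len(prefix) - 1, len(rest) - 1)]: 1.0}


rng = random.Random(0)
source = ["私", "は", "学生", "です", "EOS"]
sampled, sampled_log_prob = sample_sequence(toy_policy, source, rng)
```

The accumulated log-probability is kept because policy-gradient methods weight it by the reward of the sampled sentence.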

Slide 16

Slide 16 text

Table of Contents
1. An overview of RL
2. Policy gradient
3. RL in NLP tasks
4. Implementation

Slide 17

Slide 17 text

Policy Gradient
• Goal: maximize $J(\theta) = \mathbb{E}_{y_1, \cdots, y_T}[R(y_{1:T})]$

Slide 18

Slide 18 text

Policy Gradient • Goal: maximize the expected reward $J(\theta) = \mathbb{E}_{y_1,\cdots,y_T}[R(y_{1:T})]$ • [Figure: an encoder-decoder reads the source sentence 私 は 学生 です ("I am a student") and generates candidate outputs such as "I am a boy EOS". Each candidate has a reward (38.0, 10.0, 8.0, ...) that is multiplied by the probability the model assigns to it (0.0002, 0.0008, 0.0001, ...); summing over all candidates gives the expected reward: 38.0 × 0.0002 + 10.0 × 0.0008 + 8.0 × 0.0001 + ... = 28.0]
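The probability-weighted sum in the figure can be sketched in a few lines of Python. This is only an illustration: the three rewards are taken from the slide, but the probabilities below are made-up values chosen so that they sum to 1, and the helper name `expected_reward` is hypothetical. In a real seq2seq model the set of possible outputs is far too large to enumerate like this, which is what motivates the sampling-based methods on the following slides.

```python
def expected_reward(probs, rewards):
    """Return sum_y pi(y) * R(y) over an explicitly enumerated set of outputs."""
    return sum(p * r for p, r in zip(probs, rewards))


# Rewards of three candidate outputs (values from the slide).
rewards = [38.0, 10.0, 8.0]
# Hypothetical output probabilities (assumption: only these three outputs exist).
probs = [0.5, 0.3, 0.2]

j = expected_reward(probs, rewards)
print(f"J = {j:.1f}")
```

Raising the probability of the high-reward output raises J, which is exactly what maximizing the objective asks the model to do.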

Slide 19

Slide 19 text

Policy Gradient • Goal: maximize $J(\theta) = \mathbb{E}_{y_1,\cdots,y_T}[R(y_{1:T})]$ • Parameter update with gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ • How can this gradient be computed when the reward is not differentiable?

Slide 20

Slide 20 text

Policy Gradient • Log-derivative trick:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{y_1,\cdots,y_T}[R(y_{1:T})] \\
&= \nabla_\theta \sum_{y_{1:T}} \pi_\theta(y_{1:T}) R(y_{1:T}) \\
&= \sum_{y_{1:T}} \pi_\theta(y_{1:T}) \frac{\nabla_\theta \pi_\theta(y_{1:T})}{\pi_\theta(y_{1:T})} R(y_{1:T}) \\
&= \sum_{y_{1:T}} \pi_\theta(y_{1:T}) \nabla_\theta \log \pi_\theta(y_{1:T}) R(y_{1:T}) \quad \text{(log-derivative trick)} \\
&= \mathbb{E}_{y_{1:T}}[\nabla_\theta \log \pi_\theta(y_{1:T}) R(y_{1:T})]
\end{aligned}
$$
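The identity can be checked numerically on a toy problem. The sketch below assumes a softmax policy over three outputs, each treated as a complete sequence, with arbitrary logits `theta` and the three rewards from the earlier figure; none of these choices come from the slides. It compares the score-function form of the gradient, which never differentiates the reward, with a finite-difference estimate of the gradient of J.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_y pi_theta(y) * R(y)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """E_y[grad log pi_theta(y) * R(y)], computed exactly by enumerating y."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for y, (p_y, r_y) in enumerate(zip(pi, rewards)):
        for j in range(len(theta)):
            # d log pi(y) / d theta_j for a softmax policy
            dlog = (1.0 if j == y else 0.0) - pi[j]
            grad[j] += p_y * dlog * r_y
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference estimate of grad J(theta)."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]      # arbitrary logits (assumption)
rewards = [38.0, 10.0, 8.0]   # treated as a black box: never differentiated
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print(g_score)
print(g_fd)
```

The two gradients agree up to finite-difference error, even though only $\log \pi_\theta$ is differentiated in the first one.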

Slide 21

Slide 21 text

REINFORCE [Williams 92] • Sample-based (Monte Carlo) method • Algorithm – Generate a trajectory $y^s_{1:T}$ according to the current policy – Approximate the expectation with the sampled trajectory: $\mathbb{E}_{y_{1:T}}[\nabla_\theta \log \pi_\theta(y_{1:T}) R(y_{1:T})] \approx \nabla_\theta \log \pi_\theta(y^s_{1:T}) R(y^s_{1:T})$ – Update parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(y^s_{1:T}) R(y^s_{1:T})$ • The parameters move most in the directions that favor actions yielding the highest return

Slide 22

Slide 22 text

Variance Reduction • REINFORCE suffers from high variance – The gradient is estimated from one (or a few) sampled trajectories • Introduce a baseline $R_b$: $\mathbb{E}_{y_{1:T}}\left[\nabla_\theta \log \pi_\theta(y_{1:T})\,\bigl(R(y_{1:T}) - R_b\bigr)\right]$ • Why can we subtract a baseline? [practice] • The probability of a sampled action is increased/decreased when the action is better/worse than expected
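The effect of the baseline can be checked numerically on a toy problem. The sketch below is not from the slides: it assumes a three-action softmax policy with made-up logits and rewards, and enumerates all actions so that the mean and total variance of the single-sample estimator $(R(a) - R_b)\,\nabla \log \pi(a)$ are exact rather than sampled. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot - probs."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]


def gradient_stats(logits, rewards, baseline):
    """Exact mean and total variance of (R(a) - baseline) * grad log pi(a)."""
    probs = softmax(logits)
    n = len(probs)
    samples = []
    for a in range(n):
        g = grad_log_prob(probs, a)
        samples.append([(rewards[a] - baseline) * gj for gj in g])
    mean = [sum(probs[a] * samples[a][j] for a in range(n)) for j in range(n)]
    variance = sum(
        probs[a] * sum((samples[a][j] - mean[j]) ** 2 for j in range(n))
        for a in range(n)
    )
    return mean, variance


# Toy values chosen for illustration only.
logits = [0.0, 0.0, 0.0]
rewards = [10.0, 11.0, 12.0]
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = gradient_stats(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(logits, rewards, baseline=expected_reward)
print("no baseline  :", mean_plain, var_plain)
print("with baseline:", mean_base, var_base)
```

The two means agree because $\mathbb{E}[\nabla \log \pi(a)] = 0$, so a constant baseline adds nothing in expectation. In this toy setting the variance shrinks when the baseline is the expected reward.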

Slide 23

Slide 23 text

How to Reduce Variance • Option 1: Mean of several sampled rewards, or of the rewards in a minibatch • Option 2: Actor-Critic (several variants exist) – Actor: policy network – Critic: estimator that computes the baseline reward • Option 3: Self-Critical [Rennie+ 16] – Baseline: the reward obtained by greedy search
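Options 1 and 3 can be written as two small helpers. This is a sketch under assumptions: rewards are plain floats that have already been computed, the function names are hypothetical, and Option 2 (a learned critic) is left out because it needs a trained model.

```python
def batch_mean_advantages(rewards):
    """Option 1: baseline is the mean reward of the sampled minibatch."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def self_critical_advantages(sampled_rewards, greedy_rewards):
    """Option 3: baseline is the reward of the greedy output for the same input."""
    return [rs - rg for rs, rg in zip(sampled_rewards, greedy_rewards)]


# Made-up rewards for three inputs (e.g. sentence-level metric scores).
sampled = [0.42, 0.30, 0.55]
greedy = [0.40, 0.35, 0.50]
print(batch_mean_advantages(sampled))
print(self_critical_advantages(sampled, greedy))
```

A positive value means the sampled output did better than the baseline, so its probability is pushed up. A negative value pushes it down.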

Slide 24

Slide 24 text

Supervised vs RL • Supervised (correct output $y_{1:T}$): $L(\theta) = -\log \pi_\theta(y_{1:T})$ • Reinforcement Learning (sampled output $y^s_{1:T}$): $L(\theta) = -\log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T})$
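The two losses differ only in which sequence is scored and in the reward weight. A minimal sketch, assuming the model's per-token probabilities for the chosen sequence are already available as a list (the names below are hypothetical, and the reward is treated as a constant):

```python
import math


def sequence_log_prob(token_probs):
    """log pi(y_{1:T}) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def supervised_loss(gold_token_probs):
    """L = -log pi(y_{1:T}) for the correct (gold) sequence."""
    return -sequence_log_prob(gold_token_probs)


def rl_loss(sampled_token_probs, reward):
    """L = -log pi(y^s_{1:T}) * R(y^s_{1:T}) for a sampled sequence."""
    return -sequence_log_prob(sampled_token_probs) * reward


probs = [0.5, 0.5]  # made-up probabilities of two tokens
print(supervised_loss(probs))
print(rl_loss(probs, reward=0.3))
```

With a reward of 1 the RL loss reduces to the supervised loss on the sampled sequence. With a reward of 0 the sample contributes no gradient.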

Slide 25

Slide 25 text

Table of Contents 1. An overview of RL 2. Policy gradient 3. RL in NLP tasks 4. Implementation 25

Slide 26

Slide 26 text

NLP Tasks using RL • Seq2seq – Summarization – MT – Dialogue • Machine comprehension 26

Slide 27

Slide 27 text

Summarization • Action: select a next token • Reward: ROUGE 27 https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization
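To illustrate a sequence-level reward of this kind, the sketch below scores a generated summary against a reference with a unigram-overlap F-score. It is a simplified stand-in in the spirit of ROUGE-1, not the official ROUGE implementation. The function name `unigram_f1` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F-score between two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection: each reference token can be matched at most once.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# The reward is only available once the whole summary has been generated.
print(unigram_f1("the cat", "the cat sat"))
```

The reward arrives only after the last token is chosen, which is the delayed-feedback setting that motivates RL here.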

Slide 28

Slide 28 text

Model 28 https://slcladal.github.io/reinfnlp.html

Slide 29

Slide 29 text

Pointer-Generator Network [See+ 17] 29 Policy Network https://arxiv.org/abs/1704.04368

Slide 30

Slide 30 text

Hybrid Learning Objective • Supervised learning with teacher forcing • Policy learning

Paper excerpts shown on the slide (cropped; "[…]" marks text cut off at the crop):

Supervised learning with teacher forcing: […] used method to train a decoder RNN for sequence generation, called the teacher forcing algorithm (Williams & Zipser, 1989), minimizes a maximum-likelihood loss […]. We define $y^* = \{y^*_1, y^*_2, \dots, y^*_{n'}\}$ as the ground-truth output sequence for […]. The maximum-likelihood training objective is the minimization of the […]:

$$L_{ml} = -\sum_{t=1}^{n'} \log p(y^*_t \mid y^*_1, \dots, y^*_{t-1}, x)$$

[…] minimizing $L_{ml}$ does not always produce the best results on discrete evaluation […] (Lin, 2004). This phenomenon has been observed with similar sequence generation […] captioning with CIDEr (Rennie et al., 2016) and machine translation w[…] (Norouzi et al., 2016). There are two main reasons for this discrepancy. […] exposure bias (Ranzato et al., 2015), comes from the fact that the network has k[…] sequence up to the next token during training but does not have such su[…], hence accumulating errors as it predicts the sequence. The second reason […] of potentially valid summaries, since there are more ways to arrange […] or different sentence orders. The ROUGE metrics take some of this […] the maximum-likelihood objective does not.

Policy learning: […] to remedy this is to learn a policy that maximizes a specific discrete metric […] the maximum-likelihood loss, which is made possible with reinforcement learning. […] we use the self-critical policy gradient training algorithm (Rennie et al., 2016). […] training algorithm, we produce two separate output sequences at each training iteration: $y^s$, obtained by sampling from the $p(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x)$ probability distribution at each […] step, and $\hat{y}$, the baseline output, obtained by maximizing the output probability distribution at each time step, essentially performing a greedy search. We define $r(y)$ as the reward function for an output sequence $y$, comparing it with the ground truth sequence $y^*$ with the evaluation […]:

$$L_{rl} = \bigl(r(\hat{y}) - r(y^s)\bigr) \sum_{t=1}^{n'} \log p(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x)$$

[…] that minimizing $L_{rl}$ is equivalent to maximizing the conditional likelihood of the sampled sequence $y^s$ if it obtains a higher reward than the baseline $\hat{y}$, thus increasing the reward […] model.

Mixed training objective function: […] this reinforcement training objective is that optimizing for a specific […] does not guarantee an increase in quality and readability of the […] discrete metrics and increase their score without an actual […] (Liu et al., 2016). While ROUGE measures the n-gram overlap […] a reference sequence, human-readability is better captured by […] measured by perplexity. […] maximum-likelihood training objective (Equation 14) is essentially a con[…] the probability of a token $y_t$ based on the previously predicted […] input sequence $x$, we hypothesize that it can assist our policy le[…] natural summaries. This motivates us to define a mixed learning […] Equations 14 and 15:

$$L_{mixed} = \gamma L_{rl} + (1 - \gamma) L_{ml}$$

[…] factor accounting for the difference in magnitude between $L_{rl}$ […].

Slide annotations: $\hat{y}$: greedy; $y^s$: sampling; $r$: Reward (ROUGE); 0.9984.
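The three objectives in the excerpt can be sketched as scalar functions. The sketch assumes per-token probabilities are given as lists and that rewards are already computed. The function names are hypothetical, and reading the value 0.9984 on the slide as the mixing weight $\gamma$ is an assumption.

```python
import math


def ml_loss(gold_token_probs):
    """L_ml = -sum_t log p(y*_t | y*_1..y*_{t-1}, x) under teacher forcing."""
    return -sum(math.log(p) for p in gold_token_probs)


def self_critical_rl_loss(sampled_token_probs, reward_sampled, reward_greedy):
    """L_rl = (r(y_hat) - r(y^s)) * sum_t log p(y^s_t | y^s_1..y^s_{t-1}, x)."""
    log_prob = sum(math.log(p) for p in sampled_token_probs)
    return (reward_greedy - reward_sampled) * log_prob


def mixed_loss(l_rl, l_ml, gamma=0.9984):
    """L_mixed = gamma * L_rl + (1 - gamma) * L_ml (gamma value assumed from the slide)."""
    return gamma * l_rl + (1.0 - gamma) * l_ml


# Made-up numbers: the sampled summary scores higher than the greedy one.
l_rl = self_critical_rl_loss([0.5, 0.5], reward_sampled=0.6, reward_greedy=0.4)
l_ml = ml_loss([0.9, 0.8])
print(l_rl, l_ml, mixed_loss(l_rl, l_ml))
```

When the sampled sequence beats the greedy baseline, the factor $r(\hat{y}) - r(y^s)$ is negative, so lowering $L_{rl}$ raises the log-probability of the sample. When it loses, the sign flips and the sample is made less likely.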

Slide 31

Slide 31 text

Experimental Results

CNN/Daily Mail
Model                                            ROUGE-1  ROUGE-2  ROUGE-L
Lead-3 (Nallapati et al., 2017)                  39.2     15.7     35.5
SummaRuNNer (Nallapati et al., 2017)             39.6     16.2     35.3
words-lvt2k-temp-att (Nallapati et al., 2016)    35.46    13.30    32.65
ML, no intra-attention, no trigram avoidance     35.15    13.28    32.13
ML, no intra-attention                           37.86    14.69    34.99
ML, with intra-attention                         38.30    14.81    35.49
RL, with intra-attention                         41.16    15.75    39.08
ML+RL, with intra-attention                      39.87    15.82    36.90
Table 1: Quantitative results for various models on the CNN/Daily Mail test dataset

NYT
Model                                            ROUGE-1  ROUGE-2  ROUGE-L
ML, no intra-attention, no trigram avoidance     42.85    26.22    39.09
ML, no intra-attention                           44.26    27.43    40.41
ML, with intra-attention                         43.86    27.10    40.11
RL, no intra-attention                           47.22    30.51    43.27
ML+RL, no intra-attention                        47.03    30.72    43.10
Table 2: Quantitative results for various models on the New York Times test dataset

Model                          R-1     R-2
First sentences                28.6    17.3
First k words                  35.7    21.6
Full (Durrett et al., 2016)    42.2    24.9
ML+RL, with intra-attn         42.94   26.02
Comparison of ROUGE recall scores for lead baselines, the extractive model (Durrett et al., 2016) and our model on their NYT dataset splits.

Model    Readability  Relevance  Perplexity
ML       6.76         7.14       84.46
RL       4.18         6.32       16417.68
ML+RL    7.04         7.45       121.07
Comparison of human readability scores on a random subset of the CNN/Daily Mail […]. All models are with intra-decoder attention.

Slide 32

Slide 32 text

Machine Translation [Johnson+ 16]

Paper excerpts shown on the slide (cropped and partly overlapping; "[…]" marks text cut off at the crop):

[…] parallel text containing N input-output sequence pairs, denoted $\mathcal{D} \equiv \{(X^{(i)}, Y^{*(i)})\}_{i=1}^{N}$. […] likelihood training aims at maximizing the sum of log probabilities of the ground-truth […] corresponding inputs:

$$O_{ML}(\theta) = \sum_{i=1}^{N} \log P_\theta(Y^{*(i)} \mid X^{(i)})$$

[…] this objective is that it does not reflect the task reward function as measured by […]. Further, this objective does not explicitly encourage a ranking among incor[…] outputs with higher BLEU scores should still obtain higher probabilities under […] outputs are never observed during training. In other words, using maximum-likelihood […] will not learn to be robust to errors made during decoding since they are n[…] a mismatch between the training and testing procedure.

[…] [33, 38, 31] have considered different ways of incorporating the task reward […] sequence-to-sequence models. In this work, we also attempt to refine a model […] maximum likelihood objective to directly optimize for the task reward. We show that, eve[…] of state-of-the-art maximum-likelihood models using task reward improves […] refinement using the expected reward objective (also used in [33]), which can […]:

$$O_{RL}(\theta) = \sum_{i=1}^{N} \sum_{Y \in \mathcal{Y}} P_\theta(Y \mid X^{(i)})\, r(Y, Y^{*(i)})$$

[…] denotes the per-sentence score, and we are computing an expectation over all of […] certain length.

[…] has some undesirable properties when used for single sentences, as it was […]. We therefore use a slightly different score for our RL experiments which […] the GLEU score, we record all sub-sequences of 1, 2, 3 or 4 tokens in output […]. We then compute a recall, which is the ratio of the number of matching […] n-grams in the target (ground truth) sequence, and a precision, which is […] matching n-grams to the number of total n-grams in the generated output sequence. […] the minimum of recall and precision. This GLEU score's range is always […] match) and it is symmetrical when switching output and target. Accor[…] (equation 7) and RL (equation 8) objectives as follows:

$$O_{Mixed}(\theta) = \alpha \cdot O_{ML}(\theta) + O_{RL}(\theta)$$

[…] is typically set to be 0.25.

[…] train a model using the maximum likelihood objective (equation 7) until convergence. […] model using a mixed maximum likelihood and expected reward objective (equation […]) […] development set is no longer improving. The second step is optional.

[…]ed in the previous section are optimized for log-likelihood of the next ste[…] well with translation quality, as discussed in section 5. We use RL tr[…] scores after normal maximum-likelihood training. […] RL fine-tuning on the best En→Fr and En→De models are presented […] the models with RL can improve BLEU scores. On WMT En→Fr […] score by close to 1 point. On En→De, RL-refinement slightly hurts th[…] observe about 0.4 BLEU points improvement on the development set. Th[…] the average of 8 independent models. We also note that there is an ov[…] refinement and the decoder fine-tuning (i.e., the introduction of length […]). On a less fine-tuned decoder (e.g., if the decoder does beam search […] RL would have been bigger (as is evident from comparing results in T[…].

[…] single model test BLEU scores, averaged over 8 runs, on WMT En→Fr […]
Dataset   Trained with log-likelihood   Refined with RL
En→Fr     38.95                         39.92
En→De     24.67                         24.60
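The GLEU description in the excerpt (n-grams of length 1 to 4, recall, precision, then the minimum) translates fairly directly into code. The sketch below is an illustration, not the paper's implementation. Counting matches as a clipped multiset intersection is an assumption, since the excerpt only says "matching n-grams", and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu(output_tokens, target_tokens, max_n=4):
    """Minimum of n-gram recall and n-gram precision, as described in the excerpt."""
    out = ngram_counts(output_tokens, max_n)
    tgt = ngram_counts(target_tokens, max_n)
    total_out, total_tgt = sum(out.values()), sum(tgt.values())
    if total_out == 0 or total_tgt == 0:
        return 0.0
    matching = sum((out & tgt).values())  # assumption: clipped match counts
    recall = matching / total_tgt
    precision = matching / total_out
    return min(recall, precision)


hyp = "the cat sat".split()
ref = "the cat sat down".split()
print(gleu(hyp, ref), gleu(ref, hyp))
```

Swapping the two arguments swaps recall and precision, so their minimum is unchanged. This is the symmetry the excerpt mentions.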

Slide 33

Slide 33 text

Deep Reinforcement Learning for Dialogue Generation [Li+ 16] https://arxiv.org/abs/1606.01541

Example 1, baseline mutual information model (Li et al. 2015):
A: Where are you going? (1)
B: I’m going to the restroom. (2)
A: See you later. (3)
B: See you later. (4)
A: See you later. (5)
B: See you later. (6)
...

Example 1, proposed reinforcement learning model:
A: Where are you going? (1)
B: I’m going to the police station. (2)
A: I’ll come with you. (3)
B: No, no, no, no, you’re not going anywhere. (4)
A: Why? (5)
B: I need you to stay here. (6)
A: I don’t know what you are talking about. (7)
...

Example 2, baseline mutual information model (Li et al. 2015):
A: How old are you? (1)
B: I’m 16. (2)
A: 16? (3)
B: I don’t know what you are talking about. (4)
A: You don’t know what you are saying. (5)
B: I don’t know what you are talking about. (6)
A: You don’t know what you are saying. (7)
...

Example 2, proposed reinforcement learning model:
A: How old are you? (1)
B: I’m 16. Why are you asking? (2)
A: I thought you were 12. (3)
B: What made you think so? (4)
A: I don’t know what you are talking about. (5)
B: You don’t know what you are saying. (6)
...

Table 1: Left Column: Dialogue simulation between two agents using a 4-layer LSTM encoder-decoder trained on the OpenSubtitles dataset. The first turn (index 1) is input by the authors. Then the two agents take turns conversing, taking as input the other agent’s prior generated turn. The output is generated using […]

Slide 34

Slide 34 text

Reward 1. Ease of answering 2. Information Flow 3. Semantic Coherence

Paper excerpts shown on the slide (cropped and partly overlapping; "[…]" marks text cut off at the crop):

Ease of answering: […] constructed a list of dull responses $S$ consisting of 8 turns such as “I don’t know what you are talking about”, “I have no idea”, etc., that we and others have found occur very frequently in SEQ2SEQ models of conversations. The reward function is given as follows:

$$r_1 = -\frac{1}{N_S} \sum_{s \in S} \frac{1}{N_s} \log p_{\text{seq2seq}}(s \mid a) \quad (1)$$

where $N_S$ denotes the cardinality of $S$ and $N_s$ denotes the number of tokens in the dull response $s$. Although of course there are more ways to generate dull responses than the list can cover, many of these responses are likely to fall into similar regions in the vector space computed by the model. A system less likely to generate utterances in the list is thus also less likely to generate other dull responses.

Information Flow: […] contribute new information at each turn to keep the dialogue moving and avoid repetitive sequences. We therefore propose penalizing semantic similarity between consecutive turns from the same agent. Let $h_{p_i}$ and $h_{p_{i+1}}$ denote representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$. The reward is given by the negative log of the cosine similarity between them:

$$r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\lVert h_{p_i} \rVert \, \lVert h_{p_{i+1}} \rVert} \quad (2)$$

Semantic Coherence: We also need to measure the adequacy of responses to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent. We therefore consider the mutual information between the action $a$ and previous turns in the history to ensure the generated responses are coherent and appropriate:

$$r_3 = \frac{1}{N_a} \log p_{\text{seq2seq}}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{\text{backward}}_{\text{seq2seq}}(q_i \mid a)$$

[…] (the definitions of the two probabilities are cut off at the crop)

The final reward: […]

$$r(a, [p_i, q_i]) = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$. We set $\lambda_1 = 0.25$, $\lambda_2 = 0.25$ and $\lambda_3 = 0.5$. A reward is observed af[…] agent reaches the end of each sentence.

4 Simulation: The central idea behind our approach is to si[…] the process of two virtual agents taking turns […] with each other, through which we can explo[…]

Slide annotations: $p_i$, $q_i$, $p_{i+1}$; $s$: dull utterance (“I have no idea”, …).
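The information-flow reward and the weighted combination are simple enough to sketch directly. The code below assumes encoder representations are plain lists of floats and are non-zero vectors. The clamp `eps` for non-positive cosine values is an assumption added here to keep the logarithm defined; it is not something the excerpt specifies. Function names are hypothetical.

```python
import math


def information_flow_reward(h_prev, h_next, eps=1e-8):
    """r2 = -log cos(h_prev, h_next) for two consecutive turns of the same agent."""
    dot = sum(a * b for a, b in zip(h_prev, h_next))
    norm = math.sqrt(sum(a * a for a in h_prev)) * math.sqrt(sum(b * b for b in h_next))
    cos = dot / norm
    # Assumption: clamp so that log is defined when the cosine is <= 0.
    return -math.log(max(cos, eps))


def combined_reward(r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """r = lambda1 * r1 + lambda2 * r2 + lambda3 * r3, weights as set in the excerpt."""
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3


# Made-up 2-d representations: identical turns give zero reward,
# dissimilar turns give a larger one.
print(information_flow_reward([1.0, 0.0], [1.0, 0.0]))
print(information_flow_reward([1.0, 0.0], [1.0, 1.0]))
print(combined_reward(r1=2.0, r2=0.35, r3=-1.2))
```

Repeating the previous turn gives a cosine of 1 and therefore no reward. Turns that add new content score higher.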

Slide 35

Slide 35 text

Adversarial Learning for Neural Dialogue Generation [Li+ 17] • [Li+ 16] manually defines three dialogue properties • Adversarial training: the model should generate utterances indistinguishable from human dialogues https://aclanthology.org/D17-1230/

Slide 36

Slide 36 text

Model • Generative model: – Generates a response y given dialogue history x • Discriminative model: – Input: a sequence of dialogue utterances {x, y} – Output: a label indicating whether the input was generated by humans or machines • Policy Gradient Training

Paper excerpt shown on the slide (cropped; "[…]" marks text cut off at the crop):

[…] is used as a reward for the generator, which is trained to maximize the expected reward of generated utterance(s) using the REINFORCE algorithm (Williams, 1992):

$$J(\theta) = \mathbb{E}_{y \sim p(y \mid x)}\bigl(Q_+(\{x, y\}) \mid \theta\bigr) \quad (1)$$

Given the input dialogue history x, the bot generates a dialogue utterance y by sampling from the […]

Slide annotation: $Q_+(\{x, y\})$ is the probability, assigned by the discriminator, that the dialogue is human-generated.

Slide 37

Slide 37 text

Machine Comprehension • Reinforced Mnemonic Reader for Machine Reading Comprehension [He+ 2017] • The standard maximum-likelihood method is used to optimize the exact-match score • In addition, we want to optimize the F1 measure → Reinforcement Learning

Slide 38

Slide 38 text

Answer Pointer

Paper excerpts shown on the slide (cropped and partly overlapping; "[…]" marks text cut off at the crop):

Standard Maximum Likelihood: […] where $\theta$ represents all trainable parameters. The standard maximum-likelihood (ML) training method is to maximize the log probabilities of the ground truth answer positions [Wang and Jiang, 2017]:

$$L_{ML}(\theta) = -\sum_k \left[ \log p_1(y^1_k) + \log p_2(y^2_k \mid y^1_k) \right] \quad (7)$$

Reinforcement Learning (Self-Critical Sequence Training): […] reward measured as word overlap between predicted answer and ground truth, is introduced to MRC [Xiong et al., 2017a]. A baseline $b$, which is obtained by running greedy inference with the current model, is used to normalize the reward and reduce variances. Such approach is known as the self-critical sequence training (SCST) [Rennie et al., 2016], which is first used in image caption. More specifically, let $R(A^s, A^*)$ denote the F1 score between a sampled answer $A^s$ and the ground truth $A^*$. The training objective is to minimize the negative expected reward by

$$L_{SCST}(\theta) = -\mathbb{E}_{A^s \sim p_\theta(A)}\left[ R(A^s) - R(\hat{A}) \right] \quad (8)$$

where we abbreviate the model distribution $p(A \mid C, Q; \theta)$ as $p_\theta(A)$, and the reward function $R(A^s, A^*)$ as $R(A^s)$. $\hat{A}$ is obtained by greedily maximizing the model distribution:

$$\hat{A} = \arg\max_A \, p(A \mid C, Q; \theta)$$

The expected gradient $\nabla_\theta L_{SCST}(\theta)$ can be computed according to the REINFORCE algorithm [Sutton and Barto, 1998] as

$$\nabla_\theta L_{SCST}(\theta) = -\mathbb{E}_{A^s \sim p_\theta(A)}\left[ (R(A^s) - b)\, \nabla_\theta \log p_\theta(A^s) \right] \approx -\left( R(A^s) - R(\hat{A}) \right) \nabla_\theta \log p_\theta(A^s) \quad (9)$$

Slide annotations: $\hat{A}$: greedy; $A^s$: sampling.
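The weight $R(A^s) - R(\hat{A})$ in the gradient can be illustrated with answer spans. The sketch below is a simplification: spans are inclusive `(start, end)` token positions and F1 is computed over positions, whereas the excerpt describes word overlap between answer strings. Function names are hypothetical.

```python
def span_f1(pred, gold):
    """Position-overlap F1 between two inclusive (start, end) spans."""
    pred_tokens = set(range(pred[0], pred[1] + 1))
    gold_tokens = set(range(gold[0], gold[1] + 1))
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def scst_weight(sampled_span, greedy_span, gold_span):
    """R(A^s) - R(A_hat): positive when the sample beats the greedy answer."""
    return span_f1(sampled_span, gold_span) - span_f1(greedy_span, gold_span)


gold = (2, 5)
greedy = (2, 3)   # made-up greedy prediction: too short
sampled = (2, 5)  # made-up sampled prediction: exact match
print(span_f1(greedy, gold), span_f1(sampled, gold))
print(scst_weight(sampled, greedy, gold))
```

A partially overlapping span still earns credit under F1, which the exact-match objective cannot express. A sample that scores above the greedy answer receives a positive weight.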

Slide 39

Slide 39 text

Table of Contents 1. An overview of RL 2. Policy gradient 3. RL in NLP tasks 4. Implementation 39

Slide 40

Slide 40 text

OpenAI: Gym 40 Game, Classic control, Robotics, … https://github.com/openai/gym

Slide 41

Slide 41 text

chainerrl 41 https://github.com/chainer/chainerrl

Slide 42

Slide 42 text

Policy Gradient • In NLP, it is difficult to use an off-the-shelf RL library/environment • An RL loss can be added to the standard ML training setup https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py $L(\theta) = -\log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T})$, where $y^s_{1:T}$ is obtained by sampling
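The loss above needs no RL library: sample an output, compute its reward, and take a gradient step. The sketch below is a toy stand-in written in plain Python rather than the linked PyTorch example. It assumes a single-step softmax policy over three "tokens", a made-up reward of 1 for one target token, and a hand-derived gradient. All names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One gradient-descent step on L = -log pi(y^s) * R(y^s)."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs)[0]  # sample y^s
    reward = reward_fn(action)
    # dL/dlogit_j = -R * (onehot_j - probs_j), so descent adds lr * R * (onehot_j - probs_j).
    return [
        z + lr * reward * ((1.0 if j == action else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
target = 2  # made-up "good" token


def reward_fn(action):
    return 1.0 if action == target else 0.0


for _ in range(500):
    logits = reinforce_step(logits, reward_fn, rng)

final_probs = softmax(logits)
print(final_probs)
```

In a framework with automatic differentiation, the same idea amounts to adding the term `-log_prob * reward` to the training loss and calling the usual backward pass. The hand-written gradient here only keeps the example free of dependencies.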