Deep Reinforcement Learning for NLP
Tomohide Shibata
Kyoto University
18/09/27
Deep Reinforcement Learning
RL + DL
DL + RL
+ RL
Deep Reinforcement Learning
for Games (AlphaGo)
https://www.reddit.com/r/baduk/comments/6ttyyz/better_graph_of_go_ai_strength_over_time/
[Figure: Go AI strength over time. Policy network: position → move probabilities p. Value: position → evaluation.]
• Beat the European champion (pro) (2015.10)
• Beat a top player (2016.3)
• Beat the world number 1 (2017.5)
• AlphaGo Zero: learned from scratch (2017.10)
Recent NLP Papers
• Seq2seq (encoder-decoder):
– NMT [Ranzato+ 15, Johnson+ 16, He+ 16, Bahdanau+ 17, Wu+ 17,
Wu+ 18, …]
– Summarization [Paulus+ 18, Celikyilmaz+ 18, Li+ 18, Chen+ 18, …]
– Dialogue [Li+ 16, Li+ 17, …]
• Parsing [Lê+ 17], Coreference Resolution [Clark+ 16],
Chinese zero anaphora resolution [Yin+ 18], …
• Knowledge Base Inference [Xiong+ 18, …]
• Machine Comprehension [He+ 17, Xiong+ 17]
• …
Delayed feedback, directly optimize end metric, …
Three Kinds of Machine Learning
• Supervised learning
– A teacher provides the desired output for a given input
• Reinforcement learning (RL)
– Delayed feedback from the environment, in the form of a reward for taking action a in state s
• Unsupervised learning
– Clustering, …
Supervised vs RL
[Figure: a Go game as a sequence of moves; in RL the feedback ("Black wins") arrives only at the end of the game.]
Table of Contents
1. An overview of RL
2. Policy gradient
3. RL in NLP tasks
4. Implementation
Reinforcement Learning
• A general purpose framework for decision
making
• An agent with the capacity to act
• Each action influences the agent’s future state
• Success is measured by a scalar reward signal
• Goal: select actions to maximize future reward
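The framework above (an agent acts, each action changes the state, and success is a scalar reward) can be sketched as a small interaction loop. This is only an illustration: the `Corridor` environment, its reward of 1.0 at the goal cell, and both policies are hypothetical and do not come from the slides.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the goal cell."""

    def __init__(self, goal=3):
        self.goal = goal
        self.state = 0

    def step(self, action):
        """Apply an action (-1 or +1) and return (next_state, reward, done)."""
        self.state = max(0, self.state + action)
        done = self.state == self.goal
        reward = 1.0 if done else 0.0  # scalar reward, given only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the agent act until the episode ends; return the total reward."""
    state, total = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the action changes the state
        total += reward
        if done:
            break
    return total


rng = random.Random(0)


def random_policy(state):
    """Pick a direction uniformly at random."""
    return rng.choice([-1, 1])


def always_right(state):
    """Always move toward the goal."""
    return 1


print(run_episode(Corridor(), always_right))
print(run_episode(Corridor(), random_policy))
```

The reward arrives only when the goal is reached, which is the delayed-feedback setting that distinguishes RL from supervised learning.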
Value-based, Policy-based
• Value function: predicts the value of each state or state/action pair (discount factors are ignored)
– State value: $V(s) = \mathbb{E}[R_{t+1} + R_{t+2} + \cdots \mid S_t = s]$
– Action value: $Q(s, a) = \mathbb{E}[R_{t+1} + R_{t+2} + \cdots \mid S_t = s, A_t = a]$
• Policy: maps each state to an action
– Deterministic policy: $\pi(s) = a$
– Stochastic policy: $\pi(a \mid s) = P[a \mid s]$
In NLP tasks, policy-based methods are the usual choice.
• They are effective in high-dimensional action spaces, such as a large output vocabulary.
• A model trained with conventional maximum likelihood already serves as a policy network.
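The difference between a deterministic and a stochastic policy can be made concrete with a small sketch. The three candidate words and their scores below are hypothetical stand-ins for what a network might output in some state; the softmax is one common way to turn scores into a distribution, not necessarily the one used in the papers cited here.

```python
import math
import random


def softmax(scores):
    """Turn real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_policy(scores, actions):
    """pi(s) = a: always return the highest-scoring action."""
    best = max(range(len(actions)), key=lambda i: scores[i])
    return actions[best]


def stochastic_policy(scores, actions, rng):
    """pi(a|s) = P[a|s]: sample an action from the softmax distribution."""
    probs = softmax(scores)
    return rng.choices(actions, weights=probs, k=1)[0]


# Hypothetical scores for three candidate next words in one state.
actions = ["student", "actor", "boy"]
scores = [2.0, 1.0, 0.5]

rng = random.Random(0)
greedy = deterministic_policy(scores, actions)
samples = [stochastic_policy(scores, actions, rng) for _ in range(1000)]

print(greedy)
print({a: samples.count(a) for a in actions})
```

The deterministic policy returns the same word every time, while the stochastic policy also visits lower-probability words, which is what makes exploration possible.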
Conventional Seq2seq Model
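[Figure: encoder-decoder model. Input: 私 は 学生 です EOS ("I am a student"). Decoder output: "I am a student EOS", trained with a cross entropy loss against the gold output "I am a student".]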
Cross Entropy Loss
= negative log likelihood
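As a minimal sketch of this identity, the snippet below sums the negative log probability of each gold word and checks that exponentiating the negated loss recovers the likelihood of the whole gold sequence. The per-step distributions are hypothetical numbers chosen for illustration, not model outputs from the talk.

```python
import math


def sequence_cross_entropy(step_probs, gold):
    """Sum of -log p(gold word) over the decoder steps."""
    return -sum(math.log(dist[word]) for dist, word in zip(step_probs, gold))


gold = ["I", "am", "a", "student", "EOS"]

# Hypothetical next-word distributions, one per decoder step, each
# conditioned on the source sentence and the previous gold words.
step_probs = [
    {"I": 0.7, "He": 0.3},
    {"am": 0.9, "is": 0.1},
    {"a": 0.6, "an": 0.4},
    {"student": 0.5, "actor": 0.3, "boy": 0.2},
    {"EOS": 0.95, "student": 0.05},
]

loss = sequence_cross_entropy(step_probs, gold)
likelihood = math.exp(-loss)  # probability of the whole gold sequence

print(loss)
print(likelihood)
```

Minimizing the loss is therefore the same as maximizing the likelihood of the gold sequence, one word at a time.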
Conventional Seq2seq Model
[Figure: encoder-decoder model with the input 私 は 学生 です EOS and the gold output "I am a student EOS"; at each decoder step the previous gold word is fed in.]
Training: cross entropy loss against the gold output, one term per target word:
$\log p(\text{I} \mid 私, は, \ldots, \text{EOS})$
$\log p(\text{am} \mid \text{I}, 私, は, \ldots, \text{EOS})$
Conventional Seq2seq Model
[Figure: encoder-decoder model at test time with the input 私 は 学生 です EOS; each predicted word is fed back into the decoder, and the system output is "I am an actor EOS".]
Testing: each word is chosen by an argmax given the input and the model's previous outputs:
$\operatorname{argmax}_y \, p(y \mid 私, は, \ldots, \text{EOS})$
$\operatorname{argmax}_y \, p(y \mid \text{I}, 私, は, \ldots, \text{EOS})$
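The contrast between training and testing can be sketched with a toy next-word table. Everything in `MODEL` is hypothetical (a real decoder conditions on the source sentence and the full prefix, not just the previous word); the probabilities are chosen so that a single wrong choice ("an") leads the free-running decoder to "actor", while teacher forcing hides that consequence.

```python
# Hypothetical next-word table p(next | previous word).
MODEL = {
    "<s>": {"I": 0.9, "He": 0.1},
    "I": {"am": 1.0},
    "He": {"am": 1.0},
    "am": {"an": 0.55, "a": 0.45},
    "a": {"student": 0.8, "boy": 0.2},
    "an": {"actor": 0.9, "student": 0.1},
    "student": {"EOS": 1.0},
    "actor": {"EOS": 1.0},
    "boy": {"EOS": 1.0},
}


def teacher_forced_predictions(model, gold):
    """Training-style prediction: condition each step on the gold prefix."""
    prev, preds = "<s>", []
    for word in gold:
        dist = model[prev]
        preds.append(max(dist, key=dist.get))
        prev = word  # feed the gold word, not the prediction
    return preds


def greedy_decode(model, max_len=10):
    """Test-style decoding: condition each step on the model's own output."""
    prev, out = "<s>", []
    while len(out) < max_len:
        dist = model[prev]
        word = max(dist, key=dist.get)
        out.append(word)
        if word == "EOS":
            break
        prev = word  # feed the prediction back in
    return out


gold = ["I", "am", "a", "student", "EOS"]
print(teacher_forced_predictions(MODEL, gold))
print(greedy_decode(MODEL))
```

Under teacher forcing the mistake at the third word stays local, because the next step still sees the gold word "a". In free-running decoding the same mistake changes every later input.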
Problems
• Training and testing mismatch
– Sequence generation:
• Training: predict the next ground-truth word given the previous ground-truth words (teacher forcing)
• Testing: generate an entire sequence, conditioning on the model's own previous outputs
– Objective:
• Training: word-level loss (e.g., cross entropy loss)
• Testing: sentence-level evaluation (e.g., BLEU)
⇒ Reinforcement Learning
Data Distribution Mismatch!
[Figure 1.1 (screenshot of a page; text clipped at the right edge): "Mismatch between the distribution of training and test inputs in … scenario." Labels: Expert trajectory; Learned Policy; No data on how to recover.]
The excerpt notes that supervised learning assumes inputs come from the same underlying distribution during both training and testing (Hastie et al., 2001), and that this assumption is violated in sequential tasks, where actions have consequences for future inputs.
$p_{\pi^*}(o_t) \neq p_{\pi_\theta}(o_t)$
exposure bias
Solution: RL
• Action: generate a word
• Reward: sentence-level evaluation (e.g., BLEU)
– In NLP, the reward is usually calculated against a gold reference
• The model relies only on its own outputs during generation
→ avoids exposure bias
• The model is optimized directly on the evaluation metric
→ avoids the mismatch between training and testing measures
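A minimal sketch of this setup: each action is a word sampled from the model's own distribution, and the reward is computed once for the finished sentence. The next-word table `MODEL` is hypothetical, and clipped unigram precision is used here only as a simple stand-in for a sentence-level metric; it is not BLEU.

```python
import random
from collections import Counter

# Hypothetical next-word table p(next | previous word).
MODEL = {
    "<s>": {"I": 0.9, "He": 0.1},
    "I": {"am": 1.0},
    "He": {"am": 1.0},
    "am": {"an": 0.55, "a": 0.45},
    "a": {"student": 0.8, "boy": 0.2},
    "an": {"actor": 0.9, "student": 0.1},
    "student": {"EOS": 1.0},
    "actor": {"EOS": 1.0},
    "boy": {"EOS": 1.0},
}


def sample_sequence(model, rng, max_len=10):
    """Action = generate a word, sampled from the model's own distribution."""
    prev, out = "<s>", []
    while len(out) < max_len:
        words = list(model[prev])
        weights = [model[prev][w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        out.append(word)
        if word == "EOS":
            break
        prev = word  # condition on the model's own output, not on gold
    return out


def unigram_precision(hyp, ref):
    """Sentence-level reward: clipped unigram precision (stand-in for BLEU)."""
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(hyp).items())
    return matched / len(hyp)


GOLD = ["I", "am", "a", "student", "EOS"]
rng = random.Random(0)
for _ in range(3):
    sample = sample_sequence(MODEL, rng)
    # The reward is available only after the whole sentence is generated.
    print(" ".join(sample), unigram_precision(sample, GOLD))
```

Because the model only ever sees its own samples, there is no gap between the inputs seen in training and in testing, and the quantity being scored is the sentence-level metric itself.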
RL-based Seq2seq Model
[Figure: training the RL-based model on the input 私 は 学生 です EOS. Outputs are sampled with the stochastic policy, and each sample receives a reward (BLEU calculated using a gold reference): "He am a boy EOS" → 12.0, "I am a boy EOS" → 38.0, …]
Table of Contents
1. An overview of RL
2. Policy gradient
3. RL in NLP tasks
4. Implementation
Policy Gradient
• Goal: maximize $J(\theta) = \mathbb{E}_{y_1, \cdots, y_T}[R(y_{1:T})]$
Policy Gradient
• Goal: maximize
  $J(\theta) = \mathbb{E}_{y_1,\cdots,y_T}[R(y_{1:T})]$
[Figure: for the source sentence 私 は 学生 です EOS ("I am a student"), candidate outputs such as "I am a boy EOS" receive rewards (38.0, 10.0, 8.0, ...); each reward is multiplied by the probability of its output (x 0.0002, x 0.0008, x 0.0001, ...) and the products are summed to give the expected reward (= 28.0).]
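As a minimal sketch of this objective, the expected reward can be computed exactly when the set of outputs is small enough to enumerate. The outputs, probabilities, and rewards below are hypothetical illustration values, not the numbers from the slide.

```python
# Toy illustration of J(theta) = E[R(y)] as a probability-weighted sum of rewards.
# All outputs, probabilities, and rewards below are hypothetical values.
candidates = [
    ("I am a student EOS", 0.5, 40.0),  # (output, probability, reward)
    ("I am a boy EOS", 0.25, 20.0),
    ("am I EOS", 0.25, 4.0),
]


def expected_reward(dist):
    """Return sum_y pi(y) * R(y) over an enumerable set of outputs."""
    return sum(prob * reward for _, prob, reward in dist)


J = expected_reward(candidates)
print(J)
```

In a real sequence model the sum runs over exponentially many outputs, which is why the later slides estimate it from samples instead.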
Policy Gradient
• Goal: maximize
  $J(\theta) = \mathbb{E}_{y_1,\cdots,y_T}[R(y_{1:T})]$
• Parameter update with gradient ascent (step size $\alpha$)
  $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
• How can this gradient be computed when the rewards are not differentiable?
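The gradient-ascent update can be sketched on a toy problem where the outputs can be enumerated, so the gradient of the expected reward is available in closed form. The one-step softmax policy, the three rewards, the step size, and the function names are all assumptions made for illustration.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta, rewards):
    """J(theta) = sum_y pi_theta(y) * R(y)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def exact_gradient(theta, rewards):
    """Closed-form gradient for a softmax policy: dJ/dtheta_j = pi_j * (R_j - J)."""
    pi = softmax(theta)
    j_value = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j_value) for p, r in zip(pi, rewards)]


def gradient_ascent(theta, rewards, alpha=0.1, steps=200):
    theta = list(theta)
    for _ in range(steps):
        grad = exact_gradient(theta, rewards)
        # theta <- theta + alpha * grad J(theta)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta


rewards = [38.0, 10.0, 8.0]  # toy rewards for three candidate outputs
theta0 = [0.0, 0.0, 0.0]     # uniform initial policy
theta1 = gradient_ascent(theta0, rewards)
print(objective(theta0, rewards), objective(theta1, rewards))
```

Note that the rewards are only ever multiplied by probabilities here; nothing differentiates through the reward itself.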
• Write the expectation as a sum over all output sequences
  $J(\theta) = \sum_{y_{1:T}} \pi_\theta(y_{1:T}) R(y_{1:T})$
• Differentiate, then multiply and divide by $\pi_\theta(y_{1:T})$
  $\nabla_\theta J(\theta) = \sum_{y_{1:T}} \pi_\theta(y_{1:T}) \frac{\nabla_\theta \pi_\theta(y_{1:T})}{\pi_\theta(y_{1:T})} R(y_{1:T})$
Policy Gradient
• Log-derivative trick: $\nabla_\theta \log \pi_\theta(y_{1:T}) = \nabla_\theta \pi_\theta(y_{1:T}) / \pi_\theta(y_{1:T})$
  $\nabla_\theta \mathbb{E}_{y_1,\cdots,y_T}[R(y_{1:T})]$
  $= \sum_{y_{1:T}} \pi_\theta(y_{1:T}) \nabla_\theta \log \pi_\theta(y_{1:T}) R(y_{1:T})$
  $= \mathbb{E}_{y_{1:T}}[\nabla_\theta \log \pi_\theta(y_{1:T}) R(y_{1:T})]$
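The log-derivative identity can be checked numerically. The sketch below compares the score-function form of the gradient with a central finite-difference estimate. It assumes a one-step softmax policy over three outputs, and the logits and rewards are toy values chosen for illustration.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta, rewards):
    """J(theta) = sum_y pi_theta(y) * R(y)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """sum_y pi(y) * grad log pi(y) * R(y).

    For a softmax policy, d log pi(y) / d theta_j = 1[y == j] - pi(j).
    """
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for y, (p_y, r_y) in enumerate(zip(pi, rewards)):
        for j in range(len(theta)):
            dlogp = (1.0 if y == j else 0.0) - pi[j]
            grad[j] += p_y * dlogp * r_y
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central differences on J(theta); uses no knowledge of the policy form."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]      # toy logits
rewards = [38.0, 10.0, 8.0]   # toy rewards
g_trick = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print(g_trick)
print(g_fd)
```

The two gradients should agree up to finite-difference error, even though the first never differentiates the reward.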
REINFORCE [Williams 92]
• Sample-based (Monte Carlo) method: approximate the expectation with a sampled trajectory $y^s_{1:T}$
  $\mathbb{E}_{y_{1:T}}[\nabla_\theta \log \pi_\theta(y_{1:T}) R(y_{1:T})] \approx \nabla_\theta \log \pi_\theta(y^s_{1:T}) R(y^s_{1:T})$
• Algorithm
– Generate a trajectory $y^s_{1:T}$ according to the current policy
– Update the parameters
  $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(y^s_{1:T}) R(y^s_{1:T})$
• The parameters move most in the directions that favor actions yielding the highest return
Variance Reduction
• REINFORCE suffers from high variance
– The gradient is estimated from one (or a few) sampled trajectories
• Introduce a baseline
• Why can we subtract a baseline? [exercise]
• Increase/decrease the probability of a sampled action when it is
better/worse than expected
$\mathbb{E}_{y_{1:T}}\big[\nabla_\theta \log \pi_\theta(y_{1:T})\,(R(y_{1:T}) - R_b)\big]$
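One way to answer the exercise, as a short derivation. It assumes that the baseline $R_b$ does not depend on the sampled sequence $y_{1:T}$ and writes $\pi_\theta$ for the policy (the subscript $\theta$ is notation added here):

```latex
\begin{aligned}
\mathbb{E}_{y_{1:T} \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(y_{1:T})\, R_b\big]
  &= R_b \sum_{y_{1:T}} \pi_\theta(y_{1:T})\, \nabla_\theta \log \pi_\theta(y_{1:T}) \\
  &= R_b \sum_{y_{1:T}} \nabla_\theta \pi_\theta(y_{1:T}) \\
  &= R_b \, \nabla_\theta \sum_{y_{1:T}} \pi_\theta(y_{1:T})
   = R_b \, \nabla_\theta 1 = 0
\end{aligned}
```

The subtracted term has zero expectation, so the baseline leaves the expected gradient unchanged and only affects its variance.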
How to Reduce Variance
• Option 1: Mean of several sampled rewards, or of the
rewards in a minibatch
• Option 2: Actor-Critic (several variants exist)
– Actor: policy network
– Critic: estimator of the baseline reward
• Option 3: Self-Critic [Rennie+ 16]
– Baseline: reward obtained by greedy search
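The effect of a baseline can be checked numerically. The sketch below is a toy setup, not taken from the slides: a three-action softmax policy with made-up rewards, where the baseline is the expected reward (roughly what Option 1 estimates with a sample mean). The names `gradient_samples`, `mean` and `variance` are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_samples(logits, rewards, baseline, n, rng):
    """Draw n single-sample REINFORCE estimates of dE[R]/d logits[0]."""
    probs = softmax(logits)
    actions = range(len(probs))
    samples = []
    for _ in range(n):
        a = rng.choices(actions, weights=probs)[0]
        # d/d logits[0] of log pi(a) for a softmax policy: 1[a == 0] - pi(0)
        score = (1.0 if a == 0 else 0.0) - probs[0]
        samples.append(score * (rewards[a] - baseline))
    return samples


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]      # uniform policy (toy assumption)
rewards = [10.0, 11.0, 12.0]  # made-up rewards with a large common offset
baseline = sum(p * r for p, r in zip(softmax(logits), rewards))

plain = gradient_samples(logits, rewards, 0.0, 20000, rng)
centered = gradient_samples(logits, rewards, baseline, 20000, rng)
print("no baseline  :", mean(plain), variance(plain))
print("with baseline:", mean(centered), variance(centered))
```

Both estimators target the same gradient; the baseline version should show a much smaller variance, because the large common offset in the rewards no longer multiplies the score function.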
Supervised vs RL
Supervised (correct output sequence):
$L(\theta) = -\log \pi_\theta(y_{1:T})$
Reinforcement Learning (sampled output sequence):
$L(\theta) = -\log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T})$
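The two losses differ only in where the sequence comes from and in the reward weight. A minimal sketch, assuming for simplicity that the per-step token distributions are fixed lists of probabilities (in a real decoder they depend on the previously generated tokens); `step_probs`, `reward_fn` and the function names are hypothetical:

```python
import math
import random


def sequence_log_prob(step_probs, tokens):
    """log pi(y_{1:T}) as the sum of per-step log-probabilities."""
    return sum(math.log(probs[tok]) for probs, tok in zip(step_probs, tokens))


def supervised_loss(step_probs, gold_tokens):
    """Supervised loss: negative log-likelihood of the correct sequence."""
    return -sequence_log_prob(step_probs, gold_tokens)


def rl_loss(step_probs, reward_fn, rng):
    """RL loss: negative log-likelihood of a sampled sequence times its reward."""
    sampled = [rng.choices(range(len(p)), weights=p)[0] for p in step_probs]
    loss = -sequence_log_prob(step_probs, sampled) * reward_fn(sampled)
    return loss, sampled


step_probs = [[0.5, 0.5], [0.25, 0.75]]  # toy distributions for T = 2 steps
gold = [0, 1]
print(supervised_loss(step_probs, gold))
# Toy reward: fraction of sampled tokens that match the gold sequence.
print(rl_loss(step_probs,
              lambda y: sum(a == b for a, b in zip(y, gold)) / len(gold),
              random.Random(0)))
```

In the supervised case the teacher supplies `gold`; in the RL case the model proposes its own sequence and only the scalar reward tells it how good that sequence was.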
NLP Tasks using RL
• Seq2seq
– Summarization
– Machine translation (MT)
– Dialogue
• Machine comprehension
Summarization
• Action: select the next token
• Reward: ROUGE
https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization
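To make the reward concrete, here is a simplified ROUGE-L F1 based on the longest common subsequence (LCS). It is only a sketch: it assumes whitespace tokenization and an unweighted F1, and omits the stemming and other options of the official ROUGE toolkit. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between a generated and a reference summary."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

The reward is only available once the whole summary has been generated, which is why token-by-token generation is treated as a sequence of actions with delayed feedback.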
Model
Pointer-Generator Network [See+ 17]
Policy Network
Hybrid Learning Objective
• Supervised learning with teacher forcing
• Policy learning
Teacher forcing: a widely used method to train a decoder RNN for sequence generation, called the teacher forcing algorithm (Williams & Zipser, 1989), minimizes a maximum-likelihood loss at each decoding step. We define $y^* = \{y^*_1, y^*_2, \dots, y^*_{n'}\}$ as the ground-truth output sequence for a given input sequence $x$. The maximum-likelihood training objective is the minimization of the loss
$L_{ml} = -\sum_{t=1}^{n'} \log p(y^*_t \mid y^*_1, \dots, y^*_{t-1}, x)$
Minimizing $L_{ml}$ does not always produce the best results on discrete evaluation metrics such as ROUGE (Lin, 2004). This phenomenon has been observed with similar sequence generation tasks like image captioning with CIDEr (Rennie et al., 2016) and machine translation with BLEU (Norouzi et al., 2016). There are two main reasons for this discrepancy. The first one, exposure bias (Ranzato et al., 2015), comes from the fact that the network has knowledge of the ground-truth sequence up to the next token during training but does not have such supervision at test time, hence accumulating errors as it predicts the sequence. The second reason is the large number of potentially valid summaries, since there are more ways to arrange tokens into paraphrases or different sentence orders. The ROUGE metrics take some of this flexibility into account, but the maximum-likelihood objective does not.
Policy learning: one way to remedy this is to learn a policy that maximizes a specific discrete metric instead of minimizing the maximum-likelihood loss, which is made possible with reinforcement learning. The model uses the self-critical policy gradient training algorithm (Rennie et al., 2016). For this training algorithm, two separate output sequences are produced at each training iteration: $y^s$, obtained by sampling from the $p(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x)$ probability distribution at each decoding time step, and $\hat{y}$, the baseline output, obtained by maximizing the output probability distribution at each time step, essentially performing a greedy search. We define $r(y)$ as the reward function for an output sequence $y$, comparing it with the ground-truth sequence $y^*$ with the chosen evaluation metric.
$L_{rl} = (r(\hat{y}) - r(y^s)) \sum_{t=1}^{n'} \log p(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x)$
($\hat{y}$: greedy; $y^s$: sampling; $r$: reward (ROUGE))
Minimizing $L_{rl}$ is equivalent to maximizing the conditional likelihood of the sampled sequence $y^s$ if it obtains a higher reward than the baseline $\hat{y}$, thus increasing the reward expectation of the model.
Mixed training objective function: one potential issue of this reinforcement training objective is that optimizing for a specific discrete metric like ROUGE does not guarantee an increase in quality and readability of the output. It is possible to game such discrete metrics and increase their score without an actual increase in readability or relevance (Liu et al., 2016). While ROUGE measures the n-gram overlap between the generated summary and a reference sequence, human readability is better captured by a language model, which is usually measured by perplexity.
Since the maximum-likelihood training objective is essentially a conditional language model, calculating the probability of a token $y_t$ based on the previously predicted sequence and the input sequence $x$, we hypothesize that it can assist the policy learning algorithm to generate more natural summaries. This motivates a mixed learning objective that combines the two losses:
$L_{mixed} = \gamma L_{rl} + (1 - \gamma) L_{ml}$
where $\gamma$ is a scaling factor accounting for the difference in magnitude between $L_{rl}$ and $L_{ml}$ ($\gamma = 0.9984$).
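The self-critical loss and the mixed objective on this slide reduce to a few lines once the per-token log-probabilities and the two rewards are available. A minimal sketch with hypothetical function and argument names; the default 0.9984 is the scaling factor shown on the slide:

```python
def ml_loss(gold_log_probs):
    """L_ml: negative sum of log p(y*_t | y*_1..y*_{t-1}, x) under teacher forcing."""
    return -sum(gold_log_probs)


def self_critical_loss(sample_log_probs, reward_sample, reward_greedy):
    """L_rl = (r(y_hat) - r(y^s)) * sum_t log p(y^s_t | y^s_1..y^s_{t-1}, x)."""
    return (reward_greedy - reward_sample) * sum(sample_log_probs)


def mixed_loss(l_rl, l_ml, gamma=0.9984):
    """L_mixed = gamma * L_rl + (1 - gamma) * L_ml."""
    return gamma * l_rl + (1.0 - gamma) * l_ml


# Toy numbers: the sampled summary scores higher than the greedy one,
# so minimizing L_rl pushes its log-probability up.
sample_log_probs = [-1.0, -2.0]
gold_log_probs = [-0.5, -0.7]
l_rl = self_critical_loss(sample_log_probs, reward_sample=0.5, reward_greedy=0.3)
l_ml = ml_loss(gold_log_probs)
print(l_rl, l_ml, mixed_loss(l_rl, l_ml))
```

In a framework with automatic differentiation, the two rewards would be treated as constants (no gradient flows through them); only the log-probabilities carry gradients.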
Experimental Results
CNN/Daily Mail
Model                                           ROUGE-1  ROUGE-2  ROUGE-L
Lead-3 (Nallapati et al., 2017)                 39.2     15.7     35.5
SummaRuNNer (Nallapati et al., 2017)            39.6     16.2     35.3
words-lvt2k-temp-att (Nallapati et al., 2016)   35.46    13.30    32.65
ML, no intra-attention, no trigram avoidance    35.15    13.28    32.13
ML, no intra-attention                          37.86    14.69    34.99
ML, with intra-attention                        38.30    14.81    35.49
RL, with intra-attention                        41.16    15.75    39.08
ML+RL, with intra-attention                     39.87    15.82    36.90
Table 1: Quantitative results for various models on the CNN/Daily Mail test dataset
NYT
Model                                           ROUGE-1  ROUGE-2  ROUGE-L
ML, no intra-attention, no trigram avoidance    42.85    26.22    39.09
ML, no intra-attention                          44.26    27.43    40.41
ML, with intra-attention                        43.86    27.10    40.11
RL, no intra-attention                          47.22    30.51    43.27
ML+RL, no intra-attention                       47.03    30.72    43.10
Table 2: Quantitative results for various models on the New York Times test dataset
Model                         R-1    R-2
First sentences               28.6   17.3
First k words                 35.7   21.6
Full (Durrett et al., 2016)   42.2   24.9
ML+RL, with intra-attn        42.94  26.02
Comparison of ROUGE recall scores for lead baselines, the extractive model of Durrett et al. (2016) and our model on their NYT dataset splits.
Model   Readability  Relevance  Perplexity
ML      6.76         7.14       84.46
RL      4.18         6.32       16417.68
ML+RL   7.04         7.45       121.07
Comparison of human readability scores on a random subset of the CNN/Daily Mail test dataset. All models are with intra-decoder attention.
Machine Translation [Johnson+ 16]
Given a dataset of parallel text containing $N$ input-output sequence pairs, denoted $\mathcal{D} \equiv \{(X^{(i)}, Y^{*(i)})\}_{i=1}^{N}$, standard maximum-likelihood training aims at maximizing the sum of log probabilities of the ground-truth outputs given the corresponding inputs,
$O_{ML}(\theta) = \sum_{i=1}^{N} \log P_\theta(Y^{*(i)} \mid X^{(i)})$
The main problem with this objective is that it does not reflect the task reward function as measured by the BLEU score in translation. Further, this objective does not explicitly encourage a ranking among incorrect output sequences (outputs with higher BLEU scores should still obtain higher probabilities under the model), since incorrect outputs are never observed during training. In other words, using maximum-likelihood training only, the model will not learn to be robust to errors made during decoding since they are never observed, which is a mismatch between the training and testing procedure.
Several recent papers [33, 38, 31] have considered different ways of incorporating the task reward into the optimization of neural sequence-to-sequence models. In this work, we also attempt to refine a model pre-trained on the maximum likelihood objective to directly optimize for the task reward. We show that, even on large datasets, refinement of state-of-the-art maximum-likelihood models using task reward improves results.
We consider model refinement using the expected reward objective (also used in [33]), which can be expressed as
$O_{RL}(\theta) = \sum_{i=1}^{N} \sum_{Y \in \mathcal{Y}} P_\theta(Y \mid X^{(i)})\, r(Y, Y^{*(i)})$
Here, $r(Y, Y^{*(i)})$ denotes the per-sentence score, and we are computing an expectation over all of the output sentences $Y$, up to a certain length.
The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. We therefore use a slightly different score for our RL experiments, which we call the GLEU score. For the GLEU score, we record all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then compute a recall, which is the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence, and a precision, which is the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence. The GLEU score is then simply the minimum of recall and precision. This GLEU score's range is always between 0 (no matches) and 1 (all match) and it is symmetrical when switching output and target.
To further stabilize training, we optimize a linear combination of the ML and RL objectives as follows:
$O_{Mixed}(\theta) = \alpha \cdot O_{ML}(\theta) + O_{RL}(\theta)$
$\alpha$ is typically set to be 0.25.
We first train a model using the maximum likelihood objective until convergence. We then refine this model using the mixed maximum likelihood and expected reward objective, until the BLEU score on a development set is no longer improving. The second step is optional.
The models trained in the previous section are optimized for log-likelihood of the next step's prediction, which may not correlate well with translation quality. We use RL training to fine-tune sentence BLEU scores after normal maximum-likelihood training.
The results of RL fine-tuning on the best En→Fr and En→De models show that fine-tuning the models with RL can improve BLEU scores. On WMT En→Fr, model refinement improves the BLEU score by close to 1 point. On En→De, RL-refinement slightly hurts the test performance even though we observe about 0.4 BLEU points improvement on the development set. The results are the average of 8 independent models. We also note that there is an overlap between the wins from the RL refinement and the decoder fine-tuning (i.e., the introduction of length normalization and coverage penalty). On a less fine-tuned decoder, the win from RL would have been bigger.
Single model test BLEU scores, averaged over 8 runs, on WMT En→Fr and En→De:
Dataset  Trained with log-likelihood  Refined with RL
En→Fr    38.95                        39.92
En→De    24.67                        24.60
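The GLEU score described on this slide can be sketched directly from its definition. The version below assumes pre-tokenized inputs and counts matching n-grams with clipped (multiset) intersection, which is one reading of "matching n-grams"; the function names are illustrative and this is not the authors' implementation.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all sub-sequences of 1..max_n tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu(output_tokens, target_tokens, max_n=4):
    """Minimum of n-gram recall and precision, in the range [0, 1]."""
    out_counts = ngram_counts(output_tokens, max_n)
    tgt_counts = ngram_counts(target_tokens, max_n)
    total_out = sum(out_counts.values())
    total_tgt = sum(tgt_counts.values())
    if total_out == 0 or total_tgt == 0:
        return 0.0
    matches = sum((out_counts & tgt_counts).values())  # clipped matches
    recall = matches / total_tgt
    precision = matches / total_out
    return min(recall, precision)


hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(gleu(hyp, ref), gleu(ref, hyp))
```

Taking the minimum of recall and precision penalizes both overly short and overly long outputs, and it makes the score symmetric in its two arguments, as the slide notes.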
Deep Reinforcement Learning for
Dialogue Generation [Li+ 16]
Baseline mutual information model (Li et al. 2015)    Proposed reinforcement learning model
A: Where are you going? (1)                           A: Where are you going? (1)
B: I’m going to the restroom. (2)                     B: I’m going to the police station. (2)
A: See you later. (3)                                 A: I’ll come with you. (3)
B: See you later. (4)                                 B: No, no, no, no, you’re not going anywhere. (4)
A: See you later. (5)                                 A: Why? (5)
B: See you later. (6)                                 B: I need you to stay here. (6)
...                                                   A: I don’t know what you are talking about. (7)
...                                                   ...

A: How old are you? (1)                               A: How old are you? (1)
B: I’m 16. (2)                                        B: I’m 16. Why are you asking? (2)
A: 16? (3)                                            A: I thought you were 12. (3)
B: I don’t know what you are talking about. (4)       B: What made you think so? (4)
A: You don’t know what you are saying. (5)            A: I don’t know what you are talking about. (5)
B: I don’t know what you are talking about. (6)       B: You don’t know what you are saying. (6)
A: You don’t know what you are saying. (7)            ...
...                                                   ...
Table 1: Left Column: Dialogue simulation between two agents using a 4-layer LSTM encoder-decoder trained on the OpenSubtitles dataset. The first turn (index 1) is input by the authors. Then the two agents take turns conversing, taking as input the other agent’s prior generated turn. The output is generated using …
Reward
1. Ease of answering
2. Information Flow
3. Semantic Coherence
Ease of answering: we constructed a list of dull responses $S$ consisting of 8 turns such as “I don’t know what you are talking about”, “I have no idea”, etc., that we and others have found occur very frequently in SEQ2SEQ models of conversations. The reward function is given as follows:
$r_1 = -\frac{1}{N_S} \sum_{s \in S} \frac{1}{N_s} \log p_{seq2seq}(s \mid a)$
($s$: dull utterance, e.g. “I have no idea”, …)
where $N_S$ denotes the cardinality of $S$ and $N_s$ denotes the number of tokens in the dull response $s$. Although of course there are more ways to generate dull responses than the list can cover, many of these responses are likely to fall into similar regions in the vector space computed by the model. A system less likely to generate utterances in the list is thus also less likely to generate other dull responses.
$p_{seq2seq}$ denotes the likelihood output by the SEQ2SEQ model. It differs from the stochastic policy function: the former is learned based on the SEQ2SEQ model, while the latter is optimized for long-term future reward.
Information flow: we want each agent to contribute new information at each turn to keep the dialogue moving and avoid repetitive sequences. We therefore propose penalizing semantic similarity between consecutive turns from the same agent. Let $h_{p_i}$ and $h_{p_{i+1}}$ denote representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$. The reward is given by the negative log of the cosine similarity between them:
$r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\|h_{p_i}\| \, \|h_{p_{i+1}}\|}$
Semantic coherence: we also need to measure the adequacy of responses to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent. We therefore consider the mutual information between the action $a$ and previous turns in the history to ensure the generated responses are coherent and appropriate:
$r_3 = \frac{1}{N_a} \log p_{seq2seq}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{backward}_{seq2seq}(q_i \mid a)$
The final reward for action $a$ is a weighted sum of the rewards:
$r(a, [p_i, q_i]) = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$
where $\lambda_1 + \lambda_2 + \lambda_3 = 1$. We set $\lambda_1 = 0.25$, $\lambda_2 = 0.25$ and $\lambda_3 = 0.5$. A reward is observed after the agent reaches the end of each sentence.
Simulation: the central idea behind the approach is to simulate the process of two virtual agents taking turns talking with each other.
(Figure labels: turns $p_i$, $q_i$, $p_{i+1}$)
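The three rewards and their weighted sum can be sketched as plain functions. Everything below assumes that the log-probabilities and encoder vectors are computed elsewhere by the SEQ2SEQ models; the function names, the argument layout and the small `eps` clamp (added so the logarithm stays defined when the cosine is not positive) are illustrative choices, not part of the paper.

```python
import math


def ease_of_answering(dull_log_probs, dull_lengths):
    """r1 = -(1/N_S) * sum_s (1/N_s) * log p_seq2seq(s | a)."""
    terms = [lp / n for lp, n in zip(dull_log_probs, dull_lengths)]
    return -sum(terms) / len(terms)


def information_flow(h_prev, h_next, eps=1e-8):
    """r2 = -log cos(h_{p_i}, h_{p_{i+1}}) for two encoder representations."""
    dot = sum(a * b for a, b in zip(h_prev, h_next))
    norm = math.sqrt(sum(a * a for a in h_prev)) * math.sqrt(sum(b * b for b in h_next))
    cosine = dot / norm
    return -math.log(max(cosine, eps))  # clamp is an assumption, see lead-in


def semantic_coherence(forward_log_prob, n_a, backward_log_prob, n_q):
    """r3 = (1/N_a) log p(a | q_i, p_i) + (1/N_{q_i}) log p_backward(q_i | a)."""
    return forward_log_prob / n_a + backward_log_prob / n_q


def total_reward(r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """r = lambda_1 * r1 + lambda_2 * r2 + lambda_3 * r3."""
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3


r1 = ease_of_answering([-4.0, -6.0], [2, 3])     # two dull responses
r2 = information_flow([1.0, 0.0], [0.6, 0.8])    # fairly different turns
r3 = semantic_coherence(-3.0, 3, -8.0, 4)
print(r1, r2, r3, total_reward(r1, r2, r3))
```

A high `r1` means the dull responses are unlikely to follow the action, a high `r2` means the agent's consecutive turns are dissimilar, and `r3` is always non-positive because it averages log-probabilities.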
Adversarial Learning for Neural
Dialogue Generation [Li+ 17]
• [Li+ 16] manually defines three dialogue
properties as rewards
• Adversarial training: the model should generate
utterances indistinguishable from human
dialogues
Model
• Generative model:
– Generate a response y given dialogue history x
• Discriminative model:
– Input: a sequence of dialogue utterances {x, y}
– Output: a label indicating whether the input is
generated by humans or machines
• Policy Gradient Training
$Q_+(\{x, y\})$, the probability assigned by a discriminator that the dialogue is human-generated, is used as a reward for the generator, which is trained to maximize the expected reward of generated utterance(s) using the REINFORCE algorithm (Williams, 1992):
$J(\theta) = \mathbb{E}_{y \sim p(y \mid x)}\big(Q_+(\{x, y\}) \mid \theta\big)$
Given the input dialogue history $x$, the bot generates a dialogue utterance $y$ by sampling from the policy.
Machine Comprehension
Reinforced Mnemonic Reader for Machine Reading
Comprehension [He+ 2017]
• Standard maximum-likelihood training
predicts the exact answer span
(exact-match score)
• In addition, we want to
optimize the F1 measure
→ Reinforcement Learning
Answer Pointer
Standard Maximum Likelihood
The standard maximum-likelihood (ML) training method is to maximize the log probabilities of the ground-truth answer positions [Wang and Jiang, 2017]:
$L_{ML}(\theta) = -\sum_k \big[\log p_1(y^1_k) + \log p_2(y^2_k \mid y^1_k)\big]$
Reinforcement Learning (Self-Critical Sequence Training)
A reward, measured as word overlap between the predicted answer and the ground truth, is introduced to MRC [Xiong et al., 2017a]. A baseline $b$, which is obtained by running greedy inference with the current model, is used to normalize the reward and reduce variance. This approach is known as self-critical sequence training (SCST) [Rennie et al., 2016], which was first used in image captioning. More specifically, let $R(A^s, A^*)$ denote the F1 score between a sampled answer $A^s$ and the ground truth $A^*$. The training objective is to minimize the negative expected reward:
$L_{SCST}(\theta) = -\mathbb{E}_{A^s \sim p_\theta(A)}\big[R(A^s) - R(\hat{A})\big]$
($A^s$: sampling; $\hat{A}$: greedy)
where we abbreviate the model distribution $p(A \mid C, Q; \theta)$ as $p_\theta(A)$, and the reward function $R(A^s, A^*)$ as $R(A^s)$. $\hat{A}$ is obtained by greedily maximizing the model distribution:
$\hat{A} = \arg\max_A p(A \mid C, Q; \theta)$
The expected gradient $\nabla_\theta L_{SCST}(\theta)$ can be computed according to the REINFORCE algorithm [Sutton and Barto, 1998] as
$\nabla_\theta L_{SCST}(\theta) = -\mathbb{E}_{A^s \sim p_\theta(A)}\big[(R(A^s) - b)\, \nabla_\theta \log p_\theta(A^s)\big] \approx -\big(R(A^s) - R(\hat{A})\big)\, \nabla_\theta \log p_\theta(A^s)$
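The single-sample SCST estimate can be illustrated with a token-level F1 reward. The sketch below assumes answers are given as token lists and that the log-probability of the sampled answer comes from the model; `f1_score` and `scst_surrogate_loss` are hypothetical names. Differentiating the returned surrogate with respect to the parameters, with the rewards held constant, gives the approximate gradient in the last equation.

```python
from collections import Counter


def f1_score(pred_tokens, gold_tokens):
    """Token-overlap F1 between a predicted and a ground-truth answer."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def scst_surrogate_loss(sample_log_prob, sampled, greedy, gold):
    """-(R(A^s) - R(A_hat)) * log p(A^s), with the greedy answer as baseline."""
    advantage = f1_score(sampled, gold) - f1_score(greedy, gold)
    return -advantage * sample_log_prob


gold = ["new", "york"]
sampled = ["new", "york"]  # sampled answer happens to be exact
greedy = ["york"]          # greedy answer is only partially right
print(scst_surrogate_loss(-1.2, sampled, greedy, gold))
```

When the sampled answer beats the greedy one, the advantage is positive and minimizing the surrogate raises the sampled answer's probability; when it is worse, its probability is pushed down.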
4. Implementation
OpenAI Gym
Games, classic control, robotics, …
ChainerRL
Policy Gradient
• In NLP, it is difficult to use an off-the-shelf RL library/environment
• An RL loss can be added to a standard ML setting
https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py
$L(\theta) = -\log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T})$
($y^s_{1:T}$: obtained by sampling)
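As a self-contained illustration of this loss without any RL library, the sketch below trains a toy policy with REINFORCE in plain Python. The setup is an assumption made for brevity: each output position has its own independent softmax over a tiny vocabulary, the reward is the fraction of tokens matching a target sequence, and the minibatch mean reward serves as the baseline. With an automatic-differentiation framework, one would write only the loss and let the framework compute the gradient.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reinforce(target, vocab_size, steps=500, batch_size=16, lr=1.0, seed=0):
    """Gradient ascent on expected reward, i.e. descent on -log pi(y^s) (R - b)."""
    rng = random.Random(seed)
    logits = [[0.0] * vocab_size for _ in target]  # one softmax per position
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        samples = [[rng.choices(range(vocab_size), weights=p)[0] for p in probs]
                   for _ in range(batch_size)]
        rewards = [sum(y == t for y, t in zip(s, target)) / len(target)
                   for s in samples]
        baseline = sum(rewards) / batch_size       # minibatch mean reward
        for sample, reward in zip(samples, rewards):
            advantage = reward - baseline
            for t, y in enumerate(sample):
                for k in range(vocab_size):
                    # d log pi(y_t) / d logits[t][k] for a softmax policy
                    grad_log_prob = (1.0 if k == y else 0.0) - probs[t][k]
                    logits[t][k] += lr * advantage * grad_log_prob / batch_size
    return logits


def greedy_decode(logits):
    """Pick the highest-scoring token at every position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


target = [2, 0, 3]
print(greedy_decode(train_reinforce(target, vocab_size=4)), "target:", target)
```

The policy never sees the target directly; it only receives the scalar reward of its own samples, yet the greedy output should move toward the target as training proceeds.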