
Deep Reinforcement Learning for NLP (for beginners)

tomohideshibata
September 29, 2018


Transcript

  1. Deep Reinforcement Learning for NLP
    Tomohide Shibata
    Kyoto University
    18/09/27


  2. Deep Reinforcement Learning
    RL + DL
    DL + RL
    + RL


  3. Deep Reinforcement Learning for Games (AlphaGo)
    • 2015.10: Beats the European champion (a professional)
    • 2016.3: Beats a top player
    • 2017.5: Beats the world number 1
    • 2017.10: AlphaGo Zero learns from scratch
    [Figure: policy network (position → move probabilities p) and value network (position → evaluation)]
    [Figure: Go AI strength over time, from https://www.reddit.com/r/baduk/comments/6ttyyz/better_graph_of_go_ai_strength_over_time/]


  4. Recent NLP Papers
    • Seq2seq (encoder-decoder):
    – NMT [Ranzato+ 15, Johnson+ 16, He+ 16, Bahdanau+ 17, Wu+ 17, Wu+ 18, …]
    – Summarization [Paulus+ 18, Celikyilmaz+ 18, Li+ 18, Chen+ 18, …]
    – Dialogue [Li+ 16, Li+ 17, …]
    • Parsing [Lê+ 17], Coreference Resolution [Clark+ 16], Chinese zero anaphora resolution [Yin+ 18], …
    • Knowledge Base Inference [Xiong+ 18, …]
    • Machine Comprehension [He+ 17, Xiong+ 17]
    • …
    Delayed feedback, directly optimizing the end metric, …


  5. Three Kinds of Machine Learning
    • Supervised learning
    – A teacher provides the desired output for a given input
    • Reinforcement learning (RL)
    – Delayed feedback from the environment, in the form of a reward, when taking action a in state s
    • Unsupervised learning
    – Clustering, …

  6. Supervised vs RL
    [Figure: game move sequences (・・・); labels: "Black wins", "Feedback"]


  7. Table of Contents
    1. An overview of RL
    2. Policy gradient
    3. RL in NLP tasks
    4. Implementation

  8. Reinforcement Learning
    • A general-purpose framework for decision making
    • An agent with the capacity to act
    • Each action influences the agent’s future state
    • Success is measured by a scalar reward signal
    • Goal: select actions to maximize future reward

  9. Value-based, Policy-based
    • Value function: predicts the value of each state or state/action pair (discount factors are ignored)
    – State value: $V(s) = \mathbb{E}[R_{t+1} + R_{t+2} + \cdots \mid S_t = s]$
    – Action value: $Q(s, a) = \mathbb{E}[R_{t+1} + R_{t+2} + \cdots \mid S_t = s, A_t = a]$
    • Policy: maps each state to an action
    – Deterministic policy: $\pi(s) = a$
    – Stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$
    In NLP tasks, policy-based methods are usually used.
    • They are effective in high-dimensional action spaces.
    • A model trained with the conventional maximum likelihood method usually already provides a policy network.
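The difference between a deterministic and a stochastic policy can be sketched in a few lines. This is only a minimal illustration, not the deck's implementation: the action list, the scores, and the helper names (`softmax`, `deterministic_policy`, `stochastic_policy`) are all hypothetical, and the scores stand in for whatever a policy network would output in some state s.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deterministic_policy(probs, actions):
    """pi(s) = a: always return the most probable action."""
    best = max(range(len(actions)), key=lambda i: probs[i])
    return actions[best]

def stochastic_policy(probs, actions, rng):
    """pi(a|s) = P[a|s]: sample an action according to its probability."""
    return rng.choices(actions, weights=probs, k=1)[0]

# Hypothetical scores that a policy network might assign in some state s.
actions = ["I", "He", "am", "EOS"]
probs = softmax([2.0, 1.0, 0.5, -1.0])

rng = random.Random(0)  # fixed seed so that the sampling is repeatable
greedy_action = deterministic_policy(probs, actions)
sampled_actions = [stochastic_policy(probs, actions, rng) for _ in range(5)]
```

In a seq2seq setting the action set would be the whole vocabulary, which is one reason a softmax output layer is a natural fit for a stochastic policy.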


  10. Conventional Seq2seq Model
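    [Figure: encoder-decoder. Encoder input: 私 は 学生 です EOS ("I am a student"); decoder input: I am a student; decoder output: I am a student EOS]
    Training: cross-entropy loss against the gold output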


  11. Cross Entropy Loss
    = negative log likelihood
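As a rough illustration of this identity, the sketch below sums the negative log-probability of each gold word over the decoder steps. The per-step distributions are made-up numbers rather than the output of a real model, and `cross_entropy_loss` is a hypothetical helper name.

```python
import math

def cross_entropy_loss(step_distributions, gold_tokens):
    """Sum of -log p(gold word) over decoder steps (negative log-likelihood)."""
    loss = 0.0
    for dist, gold in zip(step_distributions, gold_tokens):
        loss += -math.log(dist[gold])
    return loss

# Hypothetical per-step output distributions of a decoder (illustrative only).
step_distributions = [
    {"I": 0.7, "He": 0.2, "EOS": 0.1},
    {"am": 0.8, "is": 0.1, "EOS": 0.1},
    {"a": 0.6, "an": 0.3, "EOS": 0.1},
    {"student": 0.5, "actor": 0.4, "EOS": 0.1},
    {"EOS": 0.9, "student": 0.1},
]
gold_tokens = ["I", "am", "a", "student", "EOS"]

loss = cross_entropy_loss(step_distributions, gold_tokens)
# exp(-loss) recovers the probability the model assigns to the whole gold sequence.
sequence_probability = math.exp(-loss)
```

Minimizing this loss is the same as maximizing the likelihood of the gold sequence, which is why the two names are used interchangeably.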


  12. Conventional Seq2seq Model
    [Figure: encoder-decoder unrolled over decoder steps. Encoder input: 私 は 学生 です EOS ("I am a student"); decoder input: I am a student; decoder output: I, am, a, student, EOS; gold: I am a student]
    Training: cross-entropy loss, with one log-likelihood term per gold word:
    $\log p(\text{I} \mid 私, は, \ldots, \text{EOS})$
    $\log p(\text{am} \mid \text{I}, 私, は, \ldots, \text{EOS})$


  13. Conventional Seq2seq Model
    [Figure: encoder-decoder at test time. Encoder input: 私 は 学生 です EOS ("I am a student"); system output: I am an actor EOS, with each generated word fed back as the next decoder input]
    Testing: each word is chosen by argmax:
    $\operatorname{argmax}_y \, p(y \mid 私, は, \ldots, \text{EOS})$
    $\operatorname{argmax}_y \, p(y \mid \text{I}, 私, は, \ldots, \text{EOS})$


  14. Problems
    • Training and testing mismatch
    – Sequence generation:
    • Training: generate the next ground-truth word given the previous ground-truth words (teacher forcing)
    • Testing: generate an entire sequence
    – Objective:
    • Training: word-level loss (e.g., cross entropy loss)
    • Testing: sentence-level evaluation (e.g., BLEU)
    ⇒ Reinforcement Learning
    Data distribution mismatch (exposure bias): $p_{\pi^*}(o_t) \neq p_{\pi_\theta}(o_t)$
    [Figure 1.1 (excerpt): an expert trajectory vs. the learned policy, which has no data on how to recover. Caption: "Mismatch between the distribution of training and test inputs in …" (truncated)]
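The gap between the two decoding regimes can be made concrete with a toy example. The sketch below assumes a hypothetical decoder, `next_word`, that looks only at the previous word and makes a single mistake; with teacher forcing the mistake stays local, while in free-running decoding it changes every later input.

```python
def next_word(prev_word):
    """Hypothetical toy decoder: most likely next word given only the previous word."""
    table = {
        "<s>": "I",
        "I": "am",
        "am": "an",        # the model's one mistake: the gold continuation is "a"
        "a": "student",
        "an": "actor",
        "student": "EOS",
        "actor": "EOS",
    }
    return table[prev_word]

gold = ["I", "am", "a", "student", "EOS"]

def teacher_forced_predictions(gold):
    """Training-style decoding: every step is conditioned on the gold prefix."""
    prev_words = ["<s>"] + gold[:-1]
    return [next_word(w) for w in prev_words]

def free_running_predictions(max_len=10):
    """Test-style decoding: every step is conditioned on the model's own output."""
    output, prev = [], "<s>"
    while len(output) < max_len:
        prev = next_word(prev)
        output.append(prev)
        if prev == "EOS":
            break
    return output

teacher_forced = teacher_forced_predictions(gold)
free_running = free_running_predictions()
```

Under teacher forcing only the third word is wrong, because the fourth step still sees the gold word "a". In free-running mode the model sees its own "an" instead, a context it never met during training, and the rest of the sentence drifts.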


  15. Solution: RL
    • Action: generate a word
    • Reward: sentence-level evaluation (e.g., BLEU)
    – In NLP, a reward is usually calculated using a gold reference
    • The model relies only on its own output when generating
    → avoids exposure bias
    • Directly optimizes the model with the evaluation metric
    → avoids the mismatch between training and testing measures
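A sentence-level reward of this kind can be sketched as follows. To keep the example short it does not compute real BLEU: `unigram_reward` is a hypothetical, much simplified stand-in (clipped unigram precision times a brevity penalty), and the sentences are illustrative.

```python
import math
from collections import Counter

def unigram_reward(hypothesis, reference):
    """Clipped unigram precision times a brevity penalty (a BLEU-like stand-in)."""
    if not hypothesis:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Each hypothesis word is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = matched / len(hypothesis)
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(hypothesis))
    return precision * brevity_penalty

reference = "I am a student".split()
reward_good = unigram_reward("I am a boy".split(), reference)
reward_bad = unigram_reward("He am a boy".split(), reference)
reward_exact = unigram_reward(reference, reference)
```

The reward is only available once the whole sentence has been generated, which is the delayed-feedback setting that RL is designed for.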

  16. RL-based Seq2seq Model
    [Figure: training by sampling with a stochastic policy. For the source 私 は 学生 です EOS ("I am a student"), sampled outputs include "He am a boy EOS" (reward 12.0) and "I am a boy EOS" (reward 38.0), ・・・]
    Reward: BLEU calculated using a gold reference


  17. Table of Contents
    1. An overview of RL
    2. Policy gradient
    3. RL in NLP tasks
    4. Implementation

  18. Policy Gradient
    • Goal: maximize $J(\theta) = \mathbb{E}_{y_1, \cdots, y_T}[R(y_{1:T})]$

    View Slide

  19. Policy Gradient
    • Goal: maximize
    $J(\theta) = \mathbb{E}_{y_1,\dots,y_T \sim \pi_\theta}\left[R(y_{1:T})\right]$
    [Figure: for the source sentence 私 は 学生 です, the expected
    reward is the sum over candidate outputs of reward × probability:
    38.0 × 0.0002 + 10.0 × 0.0008 + 8.0 × 0.0001 + ⋯ = 28.0.
    The first term corresponds to the output "I am a boy EOS".]
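
As a rough illustration of this goal, the expected reward can be computed exactly when the set of candidate outputs is small enough to enumerate. The sketch below assumes a hypothetical three-candidate policy: the candidate names and probabilities are made up for illustration (only the reward values 38.0, 10.0, and 8.0 echo the slide), and a real seq2seq model has far too many possible outputs to enumerate this way.

```python
# Hypothetical candidates: (name, probability under the policy, reward).
candidates = [
    ("y_a", 0.5, 38.0),
    ("y_b", 0.3, 10.0),
    ("y_c", 0.2, 8.0),
]


def expected_reward(cands):
    """J(theta) = sum over outputs y of pi_theta(y) * R(y)."""
    return sum(prob * reward for _, prob, reward in cands)


J = expected_reward(candidates)
print(f"J = {J:.2f}")  # prints J = 23.60
```

Raising the probability of the high-reward candidate raises J, which is what training is meant to achieve.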

  20. Policy Gradient
    • Goal: maximize
    $J(\theta) = \mathbb{E}_{y_1,\dots,y_T \sim \pi_\theta}\left[R(y_{1:T})\right]$
    • Parameter update with gradient ascent
    $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
    • How to compute this gradient when rewards
    are not differentiable
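
The gradient-ascent update can be sketched on a toy objective whose gradient is known in closed form. Everything here is an illustrative assumption: the objective J(θ) = −(θ − 3)², the step size `alpha`, and the helper names are not from the slides. For the RL objective the gradient is not available this directly, which is the question the slide raises.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)**2."""
    return -2.0 * (theta - 3.0)


def gradient_ascent(theta, alpha=0.1, steps=100):
    """Repeat the update theta <- theta + alpha * grad J(theta)."""
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta)
    return theta


theta_star = gradient_ascent(theta=0.0)
print(theta_star)  # approaches the maximizer 3.0
```

Note the plus sign in the update: the parameters move along the gradient because the objective is maximized, not minimized.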

  21. Policy Gradient
    $\nabla_\theta \mathbb{E}_{y_1,\dots,y_T \sim \pi_\theta}\left[R(y_{1:T})\right]$
    $= \sum_{y_{1:T}} \nabla_\theta \pi_\theta(y_{1:T}) \, R(y_{1:T})$
    $= \sum_{y_{1:T}} \pi_\theta(y_{1:T}) \frac{\nabla_\theta \pi_\theta(y_{1:T})}{\pi_\theta(y_{1:T})} R(y_{1:T})$
    $= \sum_{y_{1:T}} \pi_\theta(y_{1:T}) \, \nabla_\theta \log \pi_\theta(y_{1:T}) \, R(y_{1:T})$  (log-derivative trick)
    $= \mathbb{E}_{y_{1:T} \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(y_{1:T}) \, R(y_{1:T})\right]$
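
The identity behind the log-derivative trick can be checked numerically on a small example. The sketch below assumes a softmax policy over three candidate outputs with hypothetical logits, and compares the score-function form (the sum of π(y) ∇ log π(y) R(y)) against a finite-difference gradient of the expected reward. The function names and the numbers are illustrative assumptions, not part of the slides.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score_function_gradient(logits, rewards):
    """sum_y pi(y) * grad log pi(y) * R(y) for a softmax policy.

    For softmax logits, d log pi(y) / d theta_k = 1[y == k] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            dlogp = (1.0 if y == k else 0.0) - probs[k]
            grad[k] += p_y * dlogp * r_y
    return grad


def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central differences on J(theta), used only as a reference."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad


logits = [0.2, -0.5, 1.0]    # hypothetical policy parameters theta
rewards = [38.0, 10.0, 8.0]  # reward of each candidate output

analytic = score_function_gradient(logits, rewards)
numeric = finite_difference_gradient(logits, rewards)
print(analytic)
print(numeric)
```

The two gradients agree up to finite-difference error, and the reward only ever appears as a multiplier, so it never needs to be differentiated.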

  22. REINFORCE [Williams 92]
    • Sample-based (Monte-Carlo) method
    • Algorithm
    – Generate a trajectory according to a current policy
    – Update parameters
    22
    move most in the directions that favor actions
    yielding the highest return
    sampled
    Ey1:T
    [ log (y1:T
    )R(y1:T
    )]
    AAAC0nichVHPa9RAFP4ardZa7Vovgpfg0lIvy6QIak+lIvTYbhtb2Cxhkp3dHTr5QTK7sA05SG/i3YMnBQ8i+Af02ot/gB56sPfisYIXD33JBkWL9YVkvvnmfW++l+fFSqaasaMJ49LlyStXp65NX5+5cXO2dmvuWRoNEl/YfqSiZMfjqVAyFLaWWomdOBE88JTY9nafFOfbQ5GkMgq39CgW7YD3QtmVPtdEubU1J+C630mzp7mbjdzMWt7Kc7PlhNxT3HV0X2huOirqmU4sq/1ilXffbP6CbbdWZw1WhnkeWBWoo4r1qHYABx1E8DFAAIEQmrACR0pPCxYYYuLayIhLCMnyXCDHNGkHlCUogxO7S98e7VoVG9K+qJmWap9uUfQmpDQxz76w9+yUfWIf2An7+c9aWVmj8DKi1RtrRezOvriz+eO/qoBWjf5v1YWeNbp4VHqV5D0umaILf6wf7r063VxuzmcL7C37Rv7fsCN2SB2Ew+/+uw3RfH2Bnw6xXeq++L85jcn6eyjngb3UeNywNh7UV1areU3hLu5hkYbyECtYwzpsuuQjPuMrjg3byIznxv441ZioNLfxRxgvzwD4fK4x
    AAAC0nichVHPa9RAFP4ardZa7Vovgpfg0lIvy6QIak+lIvTYbhtb2Cxhkp3dHTr5QTK7sA05SG/i3YMnBQ8i+Af02ot/gB56sPfisYIXD33JBkWL9YVkvvnmfW++l+fFSqaasaMJ49LlyStXp65NX5+5cXO2dmvuWRoNEl/YfqSiZMfjqVAyFLaWWomdOBE88JTY9nafFOfbQ5GkMgq39CgW7YD3QtmVPtdEubU1J+C630mzp7mbjdzMWt7Kc7PlhNxT3HV0X2huOirqmU4sq/1ilXffbP6CbbdWZw1WhnkeWBWoo4r1qHYABx1E8DFAAIEQmrACR0pPCxYYYuLayIhLCMnyXCDHNGkHlCUogxO7S98e7VoVG9K+qJmWap9uUfQmpDQxz76w9+yUfWIf2An7+c9aWVmj8DKi1RtrRezOvriz+eO/qoBWjf5v1YWeNbp4VHqV5D0umaILf6wf7r063VxuzmcL7C37Rv7fsCN2SB2Ew+/+uw3RfH2Bnw6xXeq++L85jcn6eyjngb3UeNywNh7UV1areU3hLu5hkYbyECtYwzpsuuQjPuMrjg3byIznxv441ZioNLfxRxgvzwD4fK4x
    AAAC0nichVHPa9RAFP4ardZa7Vovgpfg0lIvy6QIak+lIvTYbhtb2Cxhkp3dHTr5QTK7sA05SG/i3YMnBQ8i+Af02ot/gB56sPfisYIXD33JBkWL9YVkvvnmfW++l+fFSqaasaMJ49LlyStXp65NX5+5cXO2dmvuWRoNEl/YfqSiZMfjqVAyFLaWWomdOBE88JTY9nafFOfbQ5GkMgq39CgW7YD3QtmVPtdEubU1J+C630mzp7mbjdzMWt7Kc7PlhNxT3HV0X2huOirqmU4sq/1ilXffbP6CbbdWZw1WhnkeWBWoo4r1qHYABx1E8DFAAIEQmrACR0pPCxYYYuLayIhLCMnyXCDHNGkHlCUogxO7S98e7VoVG9K+qJmWap9uUfQmpDQxz76w9+yUfWIf2An7+c9aWVmj8DKi1RtrRezOvriz+eO/qoBWjf5v1YWeNbp4VHqV5D0umaILf6wf7r063VxuzmcL7C37Rv7fsCN2SB2Ew+/+uw3RfH2Bnw6xXeq++L85jcn6eyjngb3UeNywNh7UV1areU3hLu5hkYbyECtYwzpsuuQjPuMrjg3byIznxv441ZioNLfxRxgvzwD4fK4x
    AAAC0nichVHPa9RAFP4ardZa7Vovgpfg0lIvy6QIak+lIvTYbhtb2Cxhkp3dHTr5QTK7sA05SG/i3YMnBQ8i+Af02ot/gB56sPfisYIXD33JBkWL9YVkvvnmfW++l+fFSqaasaMJ49LlyStXp65NX5+5cXO2dmvuWRoNEl/YfqSiZMfjqVAyFLaWWomdOBE88JTY9nafFOfbQ5GkMgq39CgW7YD3QtmVPtdEubU1J+C630mzp7mbjdzMWt7Kc7PlhNxT3HV0X2huOirqmU4sq/1ilXffbP6CbbdWZw1WhnkeWBWoo4r1qHYABx1E8DFAAIEQmrACR0pPCxYYYuLayIhLCMnyXCDHNGkHlCUogxO7S98e7VoVG9K+qJmWap9uUfQmpDQxz76w9+yUfWIf2An7+c9aWVmj8DKi1RtrRezOvriz+eO/qoBWjf5v1YWeNbp4VHqV5D0umaILf6wf7r063VxuzmcL7C37Rv7fsCN2SB2Ew+/+uw3RfH2Bnw6xXeq++L85jcn6eyjngb3UeNywNh7UV1areU3hLu5hkYbyECtYwzpsuuQjPuMrjg3byIznxv441ZioNLfxRxgvzwD4fK4x
    log (ys
    1:T
    )R(ys
    1:T
    )
    AAACx3ichVE9T9xAEH04hO+PS9JEorFyAkFzWkeRSJAiodBEqeDgAAkTa232jhV79sreO3E5XZGWP0BBRaIUKP8gLQ20iVLwExAlkdKkyNhnKQEEzGq9M2/nzb7x+FrJxDB21mM96H3Y1z8wODQ8Mjo2Xnj0eDWJGnEgKkGkonjd54lQMhQVI40S6zoWvO4rsebvLKT3a00RJzIKV0xLi806r4WyKgNuCPIKr12udRzt2m7IfcU912wLw21XRTXb1TKPp1vvE6/tzK10Zuzyf4FXKLISy8y+6Ti5U0Rui1HhG1xsIUKABuoQCGHIV+BIaG3AAYMmbBNtwmLyZHYv0MEQcRuUJSiDE7pD3xpFGzkaUpzWTDJ2QK8o2jExbUyyn+yIXbIT9pWdsz+31mpnNVItLTr9Lldob3zv6fLve1l1Og22/7Hu1GxQxctMqyTtOkPSLoIuv/lh/3J5rjzZnmKf2AXpP2Rn7Jg6CJu/gi9Lonxwh54tQqvUffp/OzQm5/pQbjqV56VXJWfpRXH+TT6vAUzgGaZpKLOYx1ssokKPfMYpvuOH9c7SVtPa7aZaPTnnCa6Y9fEvYGSpow==
    AAACx3ichVE9T9xAEH04hO+PS9JEorFyAkFzWkeRSJAiodBEqeDgAAkTa232jhV79sreO3E5XZGWP0BBRaIUKP8gLQ20iVLwExAlkdKkyNhnKQEEzGq9M2/nzb7x+FrJxDB21mM96H3Y1z8wODQ8Mjo2Xnj0eDWJGnEgKkGkonjd54lQMhQVI40S6zoWvO4rsebvLKT3a00RJzIKV0xLi806r4WyKgNuCPIKr12udRzt2m7IfcU912wLw21XRTXb1TKPp1vvE6/tzK10Zuzyf4FXKLISy8y+6Ti5U0Rui1HhG1xsIUKABuoQCGHIV+BIaG3AAYMmbBNtwmLyZHYv0MEQcRuUJSiDE7pD3xpFGzkaUpzWTDJ2QK8o2jExbUyyn+yIXbIT9pWdsz+31mpnNVItLTr9Lldob3zv6fLve1l1Og22/7Hu1GxQxctMqyTtOkPSLoIuv/lh/3J5rjzZnmKf2AXpP2Rn7Jg6CJu/gi9Lonxwh54tQqvUffp/OzQm5/pQbjqV56VXJWfpRXH+TT6vAUzgGaZpKLOYx1ssokKPfMYpvuOH9c7SVtPa7aZaPTnnCa6Y9fEvYGSpow==
    AAACx3ichVE9T9xAEH04hO+PS9JEorFyAkFzWkeRSJAiodBEqeDgAAkTa232jhV79sreO3E5XZGWP0BBRaIUKP8gLQ20iVLwExAlkdKkyNhnKQEEzGq9M2/nzb7x+FrJxDB21mM96H3Y1z8wODQ8Mjo2Xnj0eDWJGnEgKkGkonjd54lQMhQVI40S6zoWvO4rsebvLKT3a00RJzIKV0xLi806r4WyKgNuCPIKr12udRzt2m7IfcU912wLw21XRTXb1TKPp1vvE6/tzK10Zuzyf4FXKLISy8y+6Ti5U0Rui1HhG1xsIUKABuoQCGHIV+BIaG3AAYMmbBNtwmLyZHYv0MEQcRuUJSiDE7pD3xpFGzkaUpzWTDJ2QK8o2jExbUyyn+yIXbIT9pWdsz+31mpnNVItLTr9Lldob3zv6fLve1l1Og22/7Hu1GxQxctMqyTtOkPSLoIuv/lh/3J5rjzZnmKf2AXpP2Rn7Jg6CJu/gi9Lonxwh54tQqvUffp/OzQm5/pQbjqV56VXJWfpRXH+TT6vAUzgGaZpKLOYx1ssokKPfMYpvuOH9c7SVtPa7aZaPTnnCa6Y9fEvYGSpow==
    AAACx3ichVE9T9xAEH04hO+PS9JEorFyAkFzWkeRSJAiodBEqeDgAAkTa232jhV79sreO3E5XZGWP0BBRaIUKP8gLQ20iVLwExAlkdKkyNhnKQEEzGq9M2/nzb7x+FrJxDB21mM96H3Y1z8wODQ8Mjo2Xnj0eDWJGnEgKkGkonjd54lQMhQVI40S6zoWvO4rsebvLKT3a00RJzIKV0xLi806r4WyKgNuCPIKr12udRzt2m7IfcU912wLw21XRTXb1TKPp1vvE6/tzK10Zuzyf4FXKLISy8y+6Ti5U0Rui1HhG1xsIUKABuoQCGHIV+BIaG3AAYMmbBNtwmLyZHYv0MEQcRuUJSiDE7pD3xpFGzkaUpzWTDJ2QK8o2jExbUyyn+yIXbIT9pWdsz+31mpnNVItLTr9Lldob3zv6fLve1l1Og22/7Hu1GxQxctMqyTtOkPSLoIuv/lh/3J5rjzZnmKf2AXpP2Rn7Jg6CJu/gi9Lonxwh54tQqvUffp/OzQm5/pQbjqV56VXJWfpRXH+TT6vAUzgGaZpKLOYx1ssokKPfMYpvuOH9c7SVtPa7aZaPTnnCa6Y9fEvYGSpow==
    $+\; \nabla_\theta \log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T})$

  23. Variance Reduction
    • REINFORCE suffers from high variance
    – The gradient is estimated from one (or a few) sampled trajectories
    • Introduce a baseline $R_b$:
    $$ \mathbb{E}_{y_{1:T}}\left[ \nabla_\theta \log \pi_\theta(y_{1:T}) \left( R(y_{1:T}) - R_b \right) \right] $$
    • Why can we subtract a baseline? [exercise]
    • Increase/decrease the probability of a sampled action when it is
    better/worse than expected
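One possible answer to the exercise, sketched under the assumption that the baseline $R_b$ does not depend on the sampled sequence $y_{1:T}$: the baseline term then has zero expectation, so subtracting it leaves the expected gradient unchanged and only affects the variance.

```latex
\begin{aligned}
\mathbb{E}_{y_{1:T} \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(y_{1:T})\, R_b \right]
  &= R_b \sum_{y_{1:T}} \pi_\theta(y_{1:T})\, \nabla_\theta \log \pi_\theta(y_{1:T}) \\
  &= R_b \sum_{y_{1:T}} \nabla_\theta \pi_\theta(y_{1:T}) \\
  &= R_b\, \nabla_\theta \sum_{y_{1:T}} \pi_\theta(y_{1:T})
   = R_b\, \nabla_\theta 1 = 0
\end{aligned}
```

The second line uses the identity $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$.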

  24. How to Reduce Variance
    • Option 1: Mean of several sampled rewards, or of the rewards in a minibatch
    • Option 2: Actor-Critic (several variants exist)
    – Actor: policy network
    – Critic: estimator that computes the baseline reward
    • Option 3: Self-Critic [Rennie+ 16]
    – Baseline: reward obtained by greedy search
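A minimal sketch of Options 1 and 3, assuming the rewards have already been computed as plain floats. The function names and the example numbers are hypothetical, chosen only for illustration. Option 2 is left out because it needs a trained critic.

```python
from statistics import mean


def batch_mean_advantages(rewards):
    """Option 1: use the mean reward of the minibatch as the baseline."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def self_critical_advantages(sampled_rewards, greedy_rewards):
    """Option 3: use the reward of the greedy output for the same input as the baseline."""
    return [rs - rg for rs, rg in zip(sampled_rewards, greedy_rewards)]


# Made-up rewards for four inputs (sampled output vs. greedy output).
sampled = [0.30, 0.50, 0.10, 0.70]
greedy = [0.40, 0.40, 0.20, 0.60]
print(batch_mean_advantages(sampled))
print(self_critical_advantages(sampled, greedy))
```

A positive value means the sampled output did better than the baseline, so its probability is pushed up. A negative value pushes it down.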

  25. Supervised vs RL
    Supervised (correct output $y_{1:T}$):
    $$ L(\theta) = -\log \pi_\theta(y_{1:T}) $$
    Reinforcement Learning (sampled output $y^s_{1:T}$):
    $$ L(\theta) = -\log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T}) $$
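The two losses on this slide differ only in which sequence is scored and in the reward weight. A minimal sketch, assuming the policy's per-token probabilities for the chosen sequence are given as a list of floats (the function names and numbers are illustrative, not from the slides):

```python
import math


def sequence_log_prob(token_probs):
    """log pi(y_{1:T}) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def supervised_loss(gold_token_probs):
    """L = -log pi(y_{1:T}) for the correct (gold) sequence."""
    return -sequence_log_prob(gold_token_probs)


def rl_loss(sampled_token_probs, reward):
    """L = -log pi(y^s_{1:T}) * R(y^s_{1:T}) for a sampled sequence."""
    return -sequence_log_prob(sampled_token_probs) * reward


print(supervised_loss([0.5, 0.25]))
print(rl_loss([0.5, 0.25], reward=0.5))
```

In an autodiff framework the reward would be treated as a constant, so the gradient flows only through the log-probability.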

  26. Table of Contents
    1. An overview of RL
    2. Policy gradient
    3. RL in NLP tasks
    4. Implementation

  27. NLP Tasks using RL
    • Seq2seq
    – Summarization
    – MT
    – Dialogue
    • Machine comprehension

  28. Summarization
    • Action: select the next token
    • Reward: ROUGE
    https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization
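To make the reward concrete, here is a simplified ROUGE-1 F-score. It is a sketch under several assumptions: whitespace tokenization, lowercasing, no stemming, and a single reference. The slide does not say which ROUGE variant serves as the reward, and official ROUGE packages will give somewhat different numbers.

```python
from collections import Counter


def rouge_1_f(candidate, reference):
    """Simplified ROUGE-1 F-score: clipped unigram overlap between two strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection clips each word's count to the smaller of the two.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_1_f("the cat sat", "the cat sat on the mat"))
```

The score is only available once the whole summary has been generated, which is the delayed-feedback setting RL is meant for.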

  29. Model

  30. Pointer-Generator Network [See+ 17]
    Policy Network

  31. Hybrid Learning Objective
    • Supervised learning with teacher forcing
    • Policy learning
    used method to train a decoder RNN for sequence generation, called the
    gorithm (Williams & Zipser, 1989), minimizes a maximum-likelihood lo
    e define $y^* = \{y^*_1, y^*_2, \ldots, y^*_{n'}\}$ as the ground-truth output sequence fo
    The maximum-likelihood training objective is the minimization of the
    $$ L_{ml} = -\sum_{t=1}^{n'} \log p(y^*_t \mid y^*_1, \ldots, y^*_{t-1}, x) $$
    zing $L_{ml}$ does not always produce the best results on discrete evaluatio
    Lin, 2004). This phenomenon has been observed with similar sequence g
    aptioning with CIDEr (Rennie et al., 2016) and machine translation w
    Norouzi et al., 2016). There are two main reasons for this discrepancy.
    re bias (Ranzato et al., 2015), comes from the fact that the network has k
    h sequence up to the next token during training but does not have such su
    ce accumulating errors as it predicts the sequence. The second reason
    of potentially valid summaries, since there are more ways to arrange
    ses or different sentence orders. The ROUGE metrics take some of this
    he maximum-likelihood objective does not.
    CY LEARNING
    o remedy this is to learn a policy that maximizes a specific discrete metric
    the maximum-likelihood loss, which is made possible with reinforcement le
    we use the self-critical policy gradient training algorithm (Rennie et al., 2016
    ning algorithm, we produce two separate output sequences at each training ite
    tained by sampling from the $p(y^s_t \mid y^s_1, \ldots, y^s_{t-1}, x)$ probability distribution at e
    p, and $\hat{y}$, the baseline output, obtained by maximizing the output probability d
    e step, essentially performing a greedy search. We define $r(y)$ as the reward f
    equence $y$, comparing it with the ground truth sequence $y^*$ with the evaluatio
    $$ L_{rl} = (r(\hat{y}) - r(y^s)) \sum_{t=1}^{n'} \log p(y^s_t \mid y^s_1, \ldots, y^s_{t-1}, x) $$
    that minimizing $L_{rl}$ is equivalent to maximizing the conditional likelihood o
    nce $y^s$ if it obtains a higher reward than the baseline $\hat{y}$, thus increasing the rew
    r model.
    ED TRAINING OBJECTIVE FUNCTION
    is reinforcement training objective is that optimizing for a spe
    es not guarantee an increase in quality and readability of th
    h discrete metrics and increase their score without an actua
    Liu et al., 2016). While ROUGE measures the n-gram overlap
    a reference sequence, human-readability is better captured by
    measured by perplexity.
    elihood training objective (Equation 14) is essentially a con
    g the probability of a token $y_t$ based on the previously predict
    nput sequence x, we hypothesize that it can assist our policy le
    natural summaries. This motivates us to define a mixed learni
    quations 14 and 15:
    $$ L_{mixed} = \gamma L_{rl} + (1 - \gamma) L_{ml}, $$
    tor accounting for the difference in magnitude between $L_{rl}$
    Slide annotations: $\gamma = 0.9984$; $\hat{y}$: greedy; $y^s$: sampling; $r(\cdot)$: reward (ROUGE)
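The three objectives on this slide can be sketched numerically as follows. The assumption is that the per-token log-probabilities and the two rewards are already available as plain floats. All names are illustrative, and only the default 0.9984 comes from the slide.

```python
def ml_loss(gold_log_probs):
    """L_ml: negative log-likelihood of the ground-truth tokens (teacher forcing)."""
    return -sum(gold_log_probs)


def self_critical_rl_loss(sampled_log_probs, reward_sampled, reward_greedy):
    """L_rl = (r(y_hat) - r(y^s)) * sum_t log p(y^s_t | ...)."""
    return (reward_greedy - reward_sampled) * sum(sampled_log_probs)


def mixed_loss(l_rl, l_ml, gamma=0.9984):
    """L_mixed = gamma * L_rl + (1 - gamma) * L_ml."""
    return gamma * l_rl + (1.0 - gamma) * l_ml


l_ml = ml_loss([-0.1, -0.2])
l_rl = self_critical_rl_loss([-1.0, -0.5], reward_sampled=0.4, reward_greedy=0.3)
print(l_ml, l_rl, mixed_loss(l_rl, l_ml))
```

When the sampled sequence beats the greedy baseline, the coefficient is negative, so minimizing `l_rl` raises the log-probability of the sample.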

  32. Experimental Results
    Model                                          ROUGE-1  ROUGE-2  ROUGE-L
    Lead-3 (Nallapati et al., 2017)                39.2     15.7     35.5
    SummaRuNNer (Nallapati et al., 2017)           39.6     16.2     35.3
    words-lvt2k-temp-att (Nallapati et al., 2016)  35.46    13.30    32.65
    ML, no intra-attention, no trigram avoidance   35.15    13.28    32.13
    ML, no intra-attention                         37.86    14.69    34.99
    ML, with intra-attention                       38.30    14.81    35.49
    RL, with intra-attention                       41.16    15.75    39.08
    ML+RL, with intra-attention                    39.87    15.82    36.90
    Table 1: Quantitative results for various models on the CNN/Daily Mail test dataset

    Model                                          ROUGE-1  ROUGE-2  ROUGE-L
    ML, no intra-attention, no trigram avoidance   42.85    26.22    39.09
    ML, no intra-attention                         44.26    27.43    40.41
    ML, with intra-attention                       43.86    27.10    40.11
    RL, no intra-attention                         47.22    30.51    43.27
    ML+RL, no intra-attention                      47.03    30.72    43.10
    Table 2: Quantitative results for various models on the New York Times test dataset

    Model                                          R-1      R-2
    First sentences                                28.6     17.3
    First k words                                  35.7     21.6
    Full (Durrett et al., 2016)                    42.2     24.9
    ML+RL, with intra-attn                         42.94    26.02
    Comparison of ROUGE recall scores for lead baselines, the extractive mod
    16) and our model on their NYT dataset splits.

    Model    Readability  Relevance  Perplexity
    ML       6.76         7.14       84.46
    RL       4.18         6.32       16417.68
    ML+RL    7.04         7.45       121.07
    Comparison of human readability scores on a random subset of the CNN/Da
    All models are with intra-decoder attention.

  33. Machine Translation [Johnson+ 16]
    el text containing N input-output sequence pairs, denoted $\mathcal{D} \equiv \{(X^{(i)}, Y^{*(i)})\}_{i=1}^{N}$
    hood training aims at maximizing the sum of log probabilities of the ground-t
    onding inputs,
    $$ O_{ML}(\theta) = \sum_{i=1}^{N} \log P_\theta(Y^{*(i)} \mid X^{(i)}) . $$
    his objective is that it does not reflect the task reward function as measured by
    n. Further, this objective does not explicitly encourage a ranking among incor
    outputs with higher BLEU scores should still obtain higher probabilities under
    tputs are never observed during training. In other words, using maximum-likelih
    will not learn to be robust to errors made during decoding since they are n
    a mismatch between the training and testing procedure.
    ers [33, 38, 31] have considered different ways of incorporating the task reward
    l sequence-to-sequence models. In this work, we also attempt to refine a model
    um likelihood objective to directly optimize for the task reward. We show that, eve
    ment of state-of-the-art maximum-likelihood models using task reward improves
    el refinement using the expected reward objective (also used in [33]), which can
    $$ O_{RL}(\theta) = \sum_{i=1}^{N} \sum_{Y \in \mathcal{Y}} P_\theta(Y \mid X^{(i)})\, r(Y, Y^{*(i)}) . $$
    tes the per-sentence score, and we are computing an expectation over all of
    certain length.
    has some undesirable properties when used for single sentences, as it was
    We therefore use a slightly different score for our RL experiments which
    e GLEU score, we record all sub-sequences of 1, 2, 3 or 4 tokens in output
    We then compute a recall, which is the ratio of the number of matching
    -grams in the target (ground truth) sequence, and a precision, which is
    ng n-grams to the number of total n-grams in the generated output sequ
    the minimum of recall and precision. This GLEU score’s range is alway
    ll match) and it is symmetrical when switching output and target. Accor
    ation 7) and RL (equation 8) objectives as follows:
    $$ O_{Mixed}(\theta) = \alpha * O_{ML}(\theta) + O_{RL}(\theta) $$
    is typically set to be 0.25.
    rain a model using the maximum likelihood objective (equation 7) until converg
    el using a mixed maximum likelihood and expected reward objective (equatio
    evelopment set is no longer improving. The second step is optional.
    Model and Quantized Inference
    es in deploying our Neural Machine Translation model to our interactive produ
    (highlighted on the slide: $\alpha = 0.25$)
    ed in the previous section are optimized for log-likelihood of the next ste
    e well with translation quality, as discussed in section 5. We use RL tr
    cores after normal maximum-likelihood training.
    RL fine-tuning on the best En→Fr and En→De models are presented
    ning the models with RL can improve BLEU scores. On WMT En→Fr
    score by close to 1 point. On En→De, RL-refinement slightly hurts th
    bserve about 0.4 BLEU points improvement on the development set. Th
    he average of 8 independent models. We also note that there is an ov
    refinement and the decoder fine-tuning (i.e., the introduction of length
    ). On a less fine-tuned decoder (e.g., if the decoder does beam search
    m RL would have been bigger (as is evident from comparing results in T
    ingle model test BLEU scores, averaged over 8 runs, on WMT En→Fr
    Dataset  Trained with log-likelihood  Refined with RL
    En→Fr    38.95                        39.92
    En→De    24.67                        24.60
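The GLEU score described in the excerpt above (n-grams of length 1 to 4, then the minimum of recall and precision) can be sketched as follows. One detail is an assumption here: repeated n-grams are matched by multiset intersection, which the visible text does not spell out. The function names are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all sub-sequences of 1..max_n tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu(output_tokens, target_tokens, max_n=4):
    """Sentence-level GLEU: min(recall, precision) over n-grams up to max_n."""
    out = ngram_counts(output_tokens, max_n)
    tgt = ngram_counts(target_tokens, max_n)
    total_out = sum(out.values())
    total_tgt = sum(tgt.values())
    if total_out == 0 or total_tgt == 0:
        return 0.0
    matches = sum((out & tgt).values())  # clipped matching n-grams
    recall = matches / total_tgt
    precision = matches / total_out
    return min(recall, precision)


print(gleu("the cat sat".split(), "the cat sat down".split()))
```

Swapping output and target only swaps recall and precision, so the minimum is unchanged. This is the symmetry the excerpt mentions.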

  34. Deep Reinforcement Learning for
    Dialogue Generation [Li+ 16]
    Baseline mutual information model (Li et al. 2015) | Proposed reinforcement learning model
    A: Where are you going? (1)                        | A: Where are you going? (1)
    B: I’m going to the restroom. (2)                  | B: I’m going to the police station. (2)
    A: See you later. (3)                              | A: I’ll come with you. (3)
    B: See you later. (4)                              | B: No, no, no, no, you’re not going anywhere. (4)
    A: See you later. (5)                              | A: Why? (5)
    B: See you later. (6)                              | B: I need you to stay here. (6)
    ...                                                | A: I don’t know what you are talking about. (7)
    ...                                                | ...

    A: how old are you? (1)                            | A: How old are you? (1)
    B: I’m 16. (2)                                     | B: I’m 16. Why are you asking? (2)
    A: 16? (3)                                         | A: I thought you were 12. (3)
    B: I don’t know what you are talking about. (4)    | B: What made you think so? (4)
    A: You don’t know what you are saying. (5)         | A: I don’t know what you are talking about. (5)
    B: I don’t know what you are talking about. (6)    | B: You don’t know what you are saying. (6)
    A: You don’t know what you are saying. (7)         | ...
    ...                                                | ...
    Table 1: Left Column: Dialogue simulation between two agents using a 4-layer LSTM encoder-decoder
    trained on the OpenSubtitles dataset. The first turn (index 1) is input by the authors. Then the two agents
    take turns conversing, taking as input the other agent’s prior generated turn. The output is generated using

  35. Reward
    1. Ease of answering
    2. Information Flow
    3. Semantic Coherence
    Excerpts from the paper (cropped on the slide; "…" marks cut-off text):
    r1 (ease of answering): …structed a list of dull responses S consisting 8 turns such as “I don’t know what you are talking about”, “I have no idea”, etc., that we and others have found occur very frequently in SEQ2SEQ models of conversations. The reward function is given as follows:
    $$ r_1 = -\frac{1}{N_S} \sum_{s \in S} \frac{1}{N_s} \log p_{seq2seq}(s \mid a) \quad (1) $$
    where $N_S$ denotes the cardinality of $S$ and $N_s$ denotes the number of tokens in the dull response s. Although of course there are more ways to generate dull responses than the list can cover, many of these responses are likely to fall into similar regions in the vector space computed by the model. A system less likely to generate utterances in the list is thus also less likely to generate other dull responses.
    r2 (information flow): …tribute new information at each turn to keep the dialogue moving and avoid repetitive sequences. We therefore propose penalizing semantic similarity between consecutive turns from the same agent. Let $h_{p_i}$ and $h_{p_{i+1}}$ denote representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$. The reward is given by the negative log of the cosine similarity between them:
    $$ r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\|h_{p_i}\| \, \|h_{p_{i+1}}\|} \quad (2) $$
    r3 (semantic coherence): We also need to measure the adequacy of responses to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent. We therefore consider the mutual information between the action a and previous turns in the history to ensure the generated responses are coherent and appropriate:
    $$ r_3 = \frac{1}{N_a} \log p_{seq2seq}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{backward}_{seq2seq}(q_i \mid a) $$
    Final reward:
    $$ r(a, [p_i, q_i]) = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3 $$
    where $\lambda_1 + \lambda_2 + \lambda_3 = 1$. We set $\lambda_1 = 0.25$, $\lambda_2 = 0.25$ and $\lambda_3 = 0.5$. A reward is observed af… agent reaches the end of each sentence.
    Other cropped fragments: …elihood output by … noting that $p_{seq2seq}$ … ic policy function … mer is learned based … Q2SEQ model while … for long-term future …
    4 Simulation: The central idea behind our approach is to si… the process of two virtual agents taking turns … with each other, through which we can explo…
    Slide annotations: $p_i, q_i$; $p_{i+1}$; dull utterance: “I have no idea”, …
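A small sketch of how the rewards on this slide combine, assuming r1 and r3 have already been computed from the seq2seq models and that the encoder representations are plain lists of floats. The clamp `eps` is an assumption added here so that the logarithm stays defined when the cosine is zero or negative. The slide does not say how that case is handled, and zero vectors are not handled at all.

```python
import math


def information_flow_reward(h_prev, h_next, eps=1e-8):
    """r2: negative log of the cosine similarity of two consecutive turns (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(h_prev, h_next))
    norm = math.sqrt(sum(a * a for a in h_prev)) * math.sqrt(sum(b * b for b in h_next))
    cos = dot / norm
    return -math.log(max(cos, eps))  # eps clamp is an assumption, not from the slide


def total_reward(r1, r2, r3, weights=(0.25, 0.25, 0.5)):
    """r = lambda1 * r1 + lambda2 * r2 + lambda3 * r3, with the weights from the slide."""
    l1, l2, l3 = weights
    return l1 * r1 + l2 * r2 + l3 * r3


r2 = information_flow_reward([1.0, 0.0], [1.0, 1.0])
print(r2, total_reward(r1=1.0, r2=r2, r3=-2.0))
```

Identical consecutive turns give r2 = 0, and less similar turns give a larger r2. This is how repetition is penalized.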

  36. Adversarial Learning for Neural
    Dialogue Generation [Li+ 17]
    • [Li+ 16] manually defines three dialogue properties
    • Adversarial training: the model should generate
    utterances indistinguishable from human
    dialogues

  37. Model
    • Generative model:
    – Generate a response y given dialogue history x
    • Discriminative model:
    – Input: a sequence of dialogue utterances {x, y}
    – Output: a label indicating whether the input is
    generated by humans or machines
    • Policy Gradient Training
    … is used as a reward for the generator, which is trained to maximize the expected reward of generated utterance(s) using the REINFORCE algorithm (Williams, 1992):
    $$ J(\theta) = \mathbb{E}_{y \sim p(y \mid x)}\left( Q_+(\{x, y\}) \mid \theta \right) \quad (1) $$
    Given the input dialogue history x, the bot generates a dialogue utterance y by sampling from the …
    Slide annotation on $Q_+(\{x, y\})$: human-generated dialogue probability by a discriminator

  38. Machine Comprehension
    Reinforced Mnemonic Reader for Machine Reading
    Comprehension [He+ 2017]
    • Standard maximum-likelihood training optimizes
    the exact-match score
    • In addition, we want to
    optimize the F1 measure
    → Reinforcement Learning

  39. Answer Pointer
    Standard Maximum Likelihood
    where $\theta$ represents all trainable parameters.
    The standard maximum-likelihood (ML) training method is to maximize the log probabilities of the ground truth answer positions [Wang and Jiang, 2017]
    $$ L_{ML}(\theta) = -\sum_k \left[ \log p_1(y^1_k) + \log p_2(y^2_k \mid y^1_k) \right] \quad (7) $$
    Reinforcement Learning (Self-Critical Sequence Training)
    …ward measured as word overlap between predicted answer and ground truth, is introduced to MRC [Xiong et al., 2017a]. A baseline b, which is obtained by running greedy inference with the current model, is used to normalize the reward and reduce variances. Such approach is known as the self-critical sequence training (SCST) [Rennie et al., 2016], which is first used in image caption. More specifically, let $R(A^s, A^*)$ denote the F1 score between a sampled answer $A^s$ and the ground truth $A^*$. The training objective is to minimize the negative expected reward by
    $$ L_{SCST}(\theta) = -\mathbb{E}_{A^s \sim p_\theta(A)}\left[ R(A^s) - R(\hat{A}) \right] \quad (8) $$
    where we abbreviate the model distribution $p(A \mid C, Q; \theta)$ as $p_\theta(A)$, and the reward function $R(A^s, A^*)$ as $R(A^s)$. $\hat{A}$ is obtained by greedily maximizing the model distribution:
    $$ \hat{A} = \arg\max_A p(A \mid C, Q; \theta) $$
    The expected gradient $\nabla_\theta L_{SCST}(\theta)$ can be computed according to the REINFORCE algorithm [Sutton and Barto, 1998] as
    $$ \nabla_\theta L_{SCST}(\theta) = -\mathbb{E}_{A^s \sim p_\theta(A)}\left[ (R(A^s) - b)\, \nabla_\theta \log p_\theta(A^s) \right] \approx -\left( R(A^s) - R(\hat{A}) \right) \nabla_\theta \log p_\theta(A^s) \quad (9) $$
    Slide annotations: $\hat{A}$: greedy; $A^s$: sampling

  40. Table of Contents
    1. An overview of RL
    2. Policy gradient
    3. RL in NLP tasks
    4. Implementation

  41. OpenAI: Gym
    Game, Classic control, Robotics, …

  42. chainerrl

  43. Policy Gradient
    • In NLP, it is difficult to use existing RL libraries/environments
    • An RL loss can be added to a standard ML training setup
    https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py
    $$ L(\theta) = -\log \pi_\theta(y^s_{1:T})\, R(y^s_{1:T}) $$
    ($y^s_{1:T}$: obtained by sampling)
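A dependency-free toy version of this loss, for illustration only. It is not the linked PyTorch example. The "policy" is assumed to be one independent softmax per output position with no encoder or context, the reward is the fraction of positions that match a fixed target, and a moving-average baseline is added for variance reduction. All names and hyperparameters are made up.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train(target, vocab_size, steps=3000, lr=0.2, seed=0):
    rng = random.Random(seed)
    # One row of logits per output position (context-free toy policy).
    logits = [[0.0] * vocab_size for _ in target]
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample y^s_{1:T} from the current policy.
        sample = [rng.choices(range(vocab_size), weights=p)[0] for p in probs]
        # Sequence-level reward R(y^s_{1:T}): fraction of matching positions.
        reward = sum(s == t for s, t in zip(sample, target)) / len(target)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        # Gradient step on L = -log pi(y^s) * advantage;
        # d log softmax / d logit_k = 1[k == sampled] - p_k.
        for row, p, s in zip(logits, probs, sample):
            for k in range(vocab_size):
                row[k] += lr * advantage * ((1.0 if k == s else 0.0) - p[k])
    # Greedy decode from the trained policy.
    return [max(range(vocab_size), key=row.__getitem__) for row in logits]


print(train(target=[2, 0, 1], vocab_size=3))
```

In a real seq2seq model the same reward-weighted log-probability term would be computed from the decoder outputs and added to, or mixed with, the usual cross-entropy loss.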
