
Reinforcement Learning (Second edition) - Notes on Chapter 2

ver1.0 2019/04/10

Etsuji Nakai

April 10, 2019

1. Action-value Q(a)
   • The state of the environment s is fixed.
     ◦ There is no point in defining the value function v(s).
       ▪ v(s): the expected value (total rewards) you can get when starting from the state s.
     ◦ Instead, you need to evaluate the action-value Q(a).
       ▪ Q(a): the expected value (immediate reward) you can get when choosing the action a.
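For concreteness, a minimal sketch of such an environment: a 10-armed testbed in the style of the chapter, where each arm has a fixed but unknown expected reward. The class and variable names here are my own, not from the deck or the notebook.

```python
import numpy as np

# Hypothetical 10-armed bandit: the environment state is fixed, so each
# action a has a fixed (but unknown) expected immediate reward q*(a) = Q(a).
class Bandit:
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_true = self.rng.normal(0.0, 1.0, k)  # true action values q*(a)

    def step(self, a):
        # Immediate reward drawn around the true value of the chosen arm.
        return self.rng.normal(self.q_true[a], 1.0)
```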
2. Approximation of Action-value Q(a)
   • If you are allowed to take actions indefinitely, you can get a perfect estimate of Q(a).
   • When the number of actions is limited, how can you maximize the average reward?
     ◦ What is a better policy for deciding the next action a from the current estimate of Q(a)?
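The standard estimator here is the sample average of observed rewards, which can be maintained incrementally without storing past rewards; a minimal sketch (the array names `q` and `n` are illustrative):

```python
import numpy as np

k = 10
q = np.zeros(k)   # current estimates Q(a)
n = np.zeros(k)   # how many times each action has been taken

def update(a, reward):
    # Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n,
    # which converges to the true action value as arm a is pulled more often.
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]
```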
3. Greedy policy / ε-greedy policy
   • Greedy policy
     ◦ If Q(a) is accurate, this is the best policy.
     ◦ If not, you'd better try different actions to improve the accuracy of Q(a).
   • ε-greedy policy
     ◦ With probability ε, choose a random action; otherwise choose the greedy action.
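A minimal sketch of ε-greedy selection over the current estimates (the function name and defaults are my own):

```python
import numpy as np

def epsilon_greedy(q, epsilon=0.1, rng=None):
    # With probability ε explore uniformly at random; otherwise exploit
    # the action with the highest current estimate Q(a).
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))
```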
4. Exponential recency-weighted average
   • Gives higher weights to recent rewards.
     ◦ Works better for nonstationary problems.
   • Update the estimate with a constant step size α:
     Q_{n+1} = Q_n + α (R_n − Q_n) = (1 − α)^n Q_1 + Σ_{i=1..n} α (1 − α)^{n−i} R_i
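In code, the constant-weight update is a one-line change from the sample average; a sketch assuming a step-size parameter `alpha`:

```python
def update_constant(q, a, reward, alpha=0.1):
    # Exponential recency-weighted average: a constant step size alpha gives
    # the i-th reward the weight alpha * (1 - alpha)^(n - i), so older rewards
    # fade away and the estimate can track a nonstationary Q(a).
    q[a] += alpha * (reward - q[a])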
5. Initial bias
   • The result depends on the value of Q_1.
   • You can avoid the initial bias with a 'semi-constant' step size (see the sketch below).
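The formula itself does not appear in the transcript. Assuming the 'semi-constant' weight refers to the unbiased constant-step-size trick from Sutton & Barto (Exercise 2.7), a sketch would be:

```python
# Assumed unbiased constant-step-size trick:
#   beta_n = alpha / o_n,  o_n = o_{n-1} + alpha * (1 - o_{n-1}),  o_0 = 0.
# The first update uses beta_1 = 1, so the estimate no longer depends on Q_1,
# while beta_n approaches alpha for large n (recency weighting is preserved).
class UnbiasedConstantStep:
    def __init__(self, alpha=0.1, q_init=0.0):
        self.alpha = alpha
        self.o = 0.0
        self.q = q_init

    def update(self, reward):
        self.o += self.alpha * (1.0 - self.o)
        beta = self.alpha / self.o
        self.q += beta * (reward - self.q)
```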
6. Upper-Confidence-Bound Action Selection
   • Modifying the ε-greedy policy: try less frequently selected actions more often.
   • Notebook: https://gist.github.com/enakai00/dbfb695dd4602f4bc1edc8d0c98a85c5#file-multi-armed-bandits-ipynb
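A sketch of the UCB rule A_t = argmax_a [ Q_t(a) + c · sqrt(ln t / N_t(a)) ]; the function name and the choice to pick untried actions first are my own, not taken from the notebook:

```python
import numpy as np

def ucb_select(q, n, t, c=2.0):
    # q: current estimates Q_t(a); n: selection counts N_t(a); t: time step (>= 1).
    # Actions with N_t(a) == 0 are treated as maximizing and selected first.
    untried = np.where(n == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```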
7. Summary
   • The models (considered in this chapter) consist of:
     ◦ Action-value function Q(a)
     ◦ Policy P(a | Q)
   • Q(a): average, weighted average
   • P(a | Q): greedy, ε-greedy, UCB, gradient bandit
8. Associative Search (Contextual Bandits)
   • In the nonstationary case, the state of the environment changes, but you don't have any "hint" about the current state.
     ◦ What if you have a "hint" about the state?
       ▪ e.g., the color of the slot machines changes according to the reward distribution R(a).
   • The "hint" can be formalized as a state "s" of the environment.
     ◦ The action-value function and policy can then be functions of "s".
       ▪ Q(a | s)
       ▪ P(a | Q(s), s)
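A minimal sketch of this idea: an agent that keeps a separate Q table per observed state "s" (the "hint") and applies ε-greedy selection conditioned on that state. The class design is illustrative, not from the deck or the notebook.

```python
import numpy as np
from collections import defaultdict

class ContextualAgent:
    def __init__(self, k, epsilon=0.1, seed=0):
        self.k = k
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.q = defaultdict(lambda: np.zeros(k))  # Q(a | s)
        self.n = defaultdict(lambda: np.zeros(k))  # selection counts per (s, a)

    def act(self, s):
        # ε-greedy policy P(a | Q(s), s): explore with probability ε,
        # otherwise pick the best estimate for the current state s.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))
        return int(np.argmax(self.q[s]))

    def update(self, s, a, reward):
        # Incremental sample average per (state, action) pair.
        self.n[s][a] += 1
        self.q[s][a] += (reward - self.q[s][a]) / self.n[s][a]
```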