
Reinforcement Learning (Second edition) - Notes on Chapter 2

ver1.0 2019/04/10

Etsuji Nakai

April 10, 2019

1. Action-value Q(a)
   • The state of the environment s is fixed.
     ◦ There is no point in defining the value function v(s).
       ▪ v(s): the expected value (total rewards) you can get when starting from the state s.
     ◦ Instead, you need to evaluate the action-value Q(a).
       ▪ Q(a): the expected value (immediate reward) you can get when choosing the action a.
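For concreteness, a minimal sketch of such an environment: a 10-armed testbed in the style of the chapter, where each arm has a fixed but unknown expected reward. The class and variable names here are my own, not from the deck or the notebook.

```python
import numpy as np

# Hypothetical 10-armed bandit: the environment state is fixed, so each
# action a has a fixed (but unknown) expected immediate reward q*(a) = Q(a).
class Bandit:
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_true = self.rng.normal(0.0, 1.0, k)  # true action values q*(a)

    def step(self, a):
        # Immediate reward drawn around the true value of the chosen arm.
        return self.rng.normal(self.q_true[a], 1.0)
```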
2. Approximation of Action-value Q(a)
   • If you are allowed to take actions indefinitely, you can get a perfect estimate of Q(a).
   • When the number of actions is limited, how can you maximize the average reward?
     ◦ What is a better policy for deciding the next action a from the current estimate of Q(a)?
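The standard estimator here is the sample average of observed rewards, which can be maintained incrementally without storing past rewards; a minimal sketch (the array names `q` and `n` are illustrative):

```python
import numpy as np

k = 10
q = np.zeros(k)   # current estimates Q(a)
n = np.zeros(k)   # how many times each action has been taken

def update(a, reward):
    # Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n,
    # which converges to the true action value as arm a is pulled more often.
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]
```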
3. Greedy policy / ε-greedy policy
   • Greedy policy
     ◦ If Q(a) is accurate, this is the best policy.
     ◦ If not, you'd better try different actions to improve the accuracy of Q(a).
   • ε-greedy policy
     ◦ With probability ε, choose a random action; otherwise choose the greedy action.
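A minimal sketch of ε-greedy selection over the current estimates (the function name and defaults are my own):

```python
import numpy as np

def epsilon_greedy(q, epsilon=0.1, rng=None):
    # With probability ε explore uniformly at random; otherwise exploit
    # the action with the highest current estimate Q(a).
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))
```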
4. Exponential recency-weighted average
   • Gives higher weights to recent rewards.
     ◦ Works better for nonstationary problems.
   • Update the estimate with a constant step size α:
     Q_{n+1} = Q_n + α (R_n − Q_n) = (1 − α)^n Q_1 + Σ_{i=1..n} α (1 − α)^{n−i} R_i
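In code, the constant-weight update is a one-line change from the sample average; a sketch assuming a step-size parameter `alpha`:

```python
def update_constant(q, a, reward, alpha=0.1):
    # Exponential recency-weighted average: a constant step size alpha gives
    # the i-th reward the weight alpha * (1 - alpha)^(n - i), so older rewards
    # fade away and the estimate can track a nonstationary Q(a).
    q[a] += alpha * (reward - q[a])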
5. Initial bias
   • The result depends on the value of Q_1.
   • You can avoid the initial bias with a 'semi-constant' step size (see the sketch below).
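The formula itself does not appear in the transcript. Assuming the 'semi-constant' weight refers to the unbiased constant-step-size trick from Sutton & Barto (Exercise 2.7), a sketch would be:

```python
# Assumed unbiased constant-step-size trick:
#   beta_n = alpha / o_n,  o_n = o_{n-1} + alpha * (1 - o_{n-1}),  o_0 = 0.
# The first update uses beta_1 = 1, so the estimate no longer depends on Q_1,
# while beta_n approaches alpha for large n (recency weighting is preserved).
class UnbiasedConstantStep:
    def __init__(self, alpha=0.1, q_init=0.0):
        self.alpha = alpha
        self.o = 0.0
        self.q = q_init

    def update(self, reward):
        self.o += self.alpha * (1.0 - self.o)
        beta = self.alpha / self.o
        self.q += beta * (reward - self.q)
```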
6. Upper-Confidence-Bound Action Selection
   • Modifying the ε-greedy policy: try less frequently selected actions more often.
   • Notebook: https://gist.github.com/enakai00/dbfb695dd4602f4bc1edc8d0c98a85c5#file-multi-armed-bandits-ipynb
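A sketch of the UCB rule A_t = argmax_a [ Q_t(a) + c · sqrt(ln t / N_t(a)) ]; the function name and the choice to pick untried actions first are my own, not taken from the notebook:

```python
import numpy as np

def ucb_select(q, n, t, c=2.0):
    # q: current estimates Q_t(a); n: selection counts N_t(a); t: time step (>= 1).
    # Actions with N_t(a) == 0 are treated as maximizing and selected first.
    untried = np.where(n == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```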
7. Summary
   • The models (considered in this chapter) consist of:
     ◦ Action-value function Q(a)
     ◦ Policy P(a | Q)
   • Q(a): average, weighted average
   • P(a | Q): greedy, ε-greedy, UCB, gradient bandit
8. Associative Search (Contextual Bandits)
   • In the nonstationary case, the state of the environment changes, but you don't have any "hint" about the current state.
     ◦ What if you have a "hint" about the state?
       ▪ e.g., the color of the slot machines changes according to the reward distribution R(a).
   • The "hint" can be formalized as a state "s" of the environment.
     ◦ The action-value function and policy can then be functions of "s".
       ▪ Q(a | s)
       ▪ P(a | Q(s), s)
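A minimal sketch of this idea: an agent that keeps a separate Q table per observed state "s" (the "hint") and applies ε-greedy selection conditioned on that state. The class design is illustrative, not from the deck or the notebook.

```python
import numpy as np
from collections import defaultdict

class ContextualAgent:
    def __init__(self, k, epsilon=0.1, seed=0):
        self.k = k
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.q = defaultdict(lambda: np.zeros(k))  # Q(a | s)
        self.n = defaultdict(lambda: np.zeros(k))  # selection counts per (s, a)

    def act(self, s):
        # ε-greedy policy P(a | Q(s), s): explore with probability ε,
        # otherwise pick the best estimate for the current state s.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))
        return int(np.argmax(self.q[s]))

    def update(self, s, a, reward):
        # Incremental sample average per (state, action) pair.
        self.n[s][a] += 1
        self.q[s][a] += (reward - self.q[s][a]) / self.n[s][a]
```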