What is learned: Q*(s, a), the sum of rewards obtained by taking action a at state s and then following the best actions afterwards
• Watkins '89 • Mnih+ '13 (deep)
• Action selection: a = argmax Q*(s, _) … or random (exploration)
[Figure: a network maps state s and action a to Q*(s, a) ≈ −5.7]
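As a tiny illustration of the action-selection rule above, here is a NumPy sketch of ε-greedy selection over a toy Q-table; the table, its sizes, and epsilon are purely illustrative and not part of ChainerRL.

import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))  # Q[s, a]: estimate of Q*(s, a)
epsilon = 0.1                        # exploration probability (illustrative)

def select_action(s):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)  # ... or random
    return int(np.argmax(Q[s]))              # a = argmax Q*(s, _)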
What is optimized: π(s), the action to take at state s
• Simultaneously learn e.g. Qπ(s, a), the sum of rewards obtained by following policy π after taking action a at state s
• Williams '92, Sutton+ '99 • Lillicrap+ '15 (deep)
• Action selection: a = π(s) + ε (exploration noise)
[Figure: a network maps state s and action a to Qπ(s, a) ≈ −5.7]
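For the policy-based rule a = π(s) + ε, here is a minimal NumPy sketch with a toy linear policy and critic; both are untrained and the weight shapes and noise scale are assumptions for illustration only.

import numpy as np

dim_obs, dim_action = 4, 2
W_pi = np.zeros((dim_action, dim_obs))   # parameters of the policy pi(s)
w_q = np.zeros(dim_obs + dim_action)     # parameters of the critic Q_pi(s, a)

def pi(s):
    return W_pi.dot(s)                   # deterministic action for state s

def q_pi(s, a):
    return w_q.dot(np.concatenate([s, a]))  # estimated return under pi

s = np.random.randn(dim_obs)
a = pi(s) + 0.1 * np.random.randn(dim_action)  # a = pi(s) + epsilon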
Inputs = states; outputs = Q-values for each action
• Use the models available in ChainerRL:
  model = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
      dim_obs, n_action, n_hidden_channels, n_hidden_layers)
• Or wrap your own Chain for ChainerRL:
  model = chainerrl.q_functions.SingleModelStateQFunctionWithDiscreteAction(
      MyChain(dim_obs, n_action))
[Figure: the model maps state s to Q(s, a1), Q(s, a2), …]
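A minimal sketch of what MyChain could look like, assuming (as in ChainerRL) that SingleModelStateQFunctionWithDiscreteAction only expects the wrapped Chain to map a batch of states to one raw Q-value per action; the hidden size and dimensions are arbitrary.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl

class MyChain(chainer.Chain):
    """Maps a state vector to one Q-value per discrete action."""

    def __init__(self, dim_obs, n_action, n_hidden=64):
        super(MyChain, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(dim_obs, n_hidden)
            self.l2 = L.Linear(n_hidden, n_action)

    def __call__(self, x):
        h = F.relu(self.l1(x))
        return self.l2(h)  # shape: (batch_size, n_action)

dim_obs, n_action = 4, 2
model = chainerrl.q_functions.SingleModelStateQFunctionWithDiscreteAction(
    MyChain(dim_obs, n_action))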
model = chainerrl.q_functions.FCQuadraticStateQFunction(
    dim_obs, dim_action, n_hidden_channels, n_hidden_layers, action_space)
• (BTW, an explorer based on the Ornstein-Uhlenbeck process works better here, as it is the one used in DDPG)
  explorer = chainerrl.explorers.AdditiveOU(...)
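A sketch of constructing explorers for both cases, assuming the ChainerRL constructor signatures of the time (LinearDecayEpsilonGreedy for discrete actions, AdditiveOU for continuous ones); the epsilon schedule and sigma value are illustrative.

import numpy as np
import chainerrl

n_action = 2  # number of discrete actions (illustrative)

# Discrete actions: epsilon-greedy, with epsilon decayed over the first 10^4 steps
explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.1, decay_steps=10**4,
    random_action_func=lambda: np.random.randint(n_action))

# Continuous actions: additive Ornstein-Uhlenbeck noise, as in DDPG
explorer = chainerrl.explorers.AdditiveOU(sigma=0.2)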
Make a minibatch by sampling from (saved) memories
• You can choose the buffer size and the sampling algorithm:
  rbuf = chainerrl.replay_buffer.ReplayBuffer(5 * 10**5)
  rbuf = chainerrl.replay_buffer.EpisodicReplayBuffer(10**4)
  rbuf = chainerrl.replay_buffer.PrioritizedReplayBuffer(5 * 10**5)
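A sketch of what the buffer stores, assuming ChainerRL's append()/sample() interface roughly as shown; normally the agent fills and samples the buffer for you inside act_and_train(), so this manual use is only illustrative.

import numpy as np
import chainerrl

rbuf = chainerrl.replay_buffer.ReplayBuffer(capacity=5 * 10**5)

# One transition: (state, action, reward, next state, terminal flag)
obs = np.zeros(4, dtype=np.float32)
rbuf.append(state=obs, action=0, reward=1.0,
            next_state=obs, is_state_terminal=False)

minibatch = rbuf.sample(1)  # a list of sampled transitions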
To switch algorithms, just change the line
  agent = chainerrl.agents.DQN(...)
to the line
  agent = chainerrl.agents.DoubleDQN(...)
• The replay buffer & explorer can stay unchanged, even for algorithms (e.g. DDPG) that use different types of models
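A sketch of wiring the pieces into an agent; the hyperparameters are illustrative and the keyword names are how I recall ChainerRL's DQN constructor, so treat them as assumptions.

import numpy as np
import chainer
import chainerrl

dim_obs, n_action = 4, 2
q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    dim_obs, n_action, n_hidden_channels=50, n_hidden_layers=2)

opt = chainer.optimizers.Adam(eps=1e-2)
opt.setup(q_func)

rbuf = chainerrl.replay_buffer.ReplayBuffer(capacity=5 * 10**5)
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.1, random_action_func=lambda: np.random.randint(n_action))

agent = chainerrl.agents.DQN(
    q_func, opt, rbuf, gamma=0.99, explorer=explorer,
    replay_start_size=500, target_update_interval=100)

# Switching to Double DQN really is one line:
# agent = chainerrl.agents.DoubleDQN(
#     q_func, opt, rbuf, gamma=0.99, explorer=explorer,
#     replay_start_size=500, target_update_interval=100)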
• Evaluate with a test environment at some interval of iterations (to draw learning curves)
• Save models automatically
• This is not the Trainer in Chainer :(
• TBD
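A sketch of the experiment helper that does the evaluation and saving described above, assuming `agent` was built as in the previous sketch and a Gym environment; the argument names follow the ChainerRL version of the time and may differ in later releases.

import gym
import chainerrl

env = gym.make('CartPole-v0')

chainerrl.experiments.train_agent_with_evaluation(
    agent=agent,            # agent from the previous sketch
    env=env,
    steps=10**5,            # total training steps
    eval_n_runs=10,         # episodes per evaluation
    eval_interval=10**4,    # evaluate every 10^4 training steps
    outdir='result')        # scores and model snapshots are written here

# Models can also be saved and loaded manually:
agent.save('dqn_final')
agent.load('dqn_final')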
Many algorithms (including the newest ones) are implemented
• Many parts of the algorithms are reusable
• Please give me feedback on:
  • Features/algorithms to be implemented
  • Interfaces
github.com/chainer/chainerrl