ChainerRL について / Introduction to ChainerRL (Ja)
tos
June 10, 2017
Chainer Meetup #05
Transcript
Chainer Meetup #5 (2017.06.10)
Preferred Networks, Toshiki Kataoka
About ChainerRL

Self-introduction
• Toshiki Kataoka / toslunar
• 2016.12– Preferred Networks
• Recently worked on Chainer v2 support for ChainerRL

Chainer and its siblings
• ChainerMN
• ChainerRL
• ChainerCV
(from "Evolving Chainer", Unno, PFN, 2017.05.30)
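
ChainerRL
github.com/chainer/chainerrl
• RL = Reinforcement Learning (強化学習 in Japanese)
• An extension package for Chainer
• 2017.03.27 ChainerRL v0.1.0
• 2017.06.08 ChainerRL v0.2.0

What is reinforcement learning?
• Learning the behavior that maximizes the reward obtained from the environment
• Diagram: the agent sends an action to the environment; the environment returns an observation and a reward

Reinforcement learning algorithms (1/2)
• Q-learning
• Function to learn: Q*(s, a) (the sum of rewards obtained by taking action a in state s and acting optimally afterwards)
• Watkins '89
• Mnih+ '13 (deep)
• Diagram: for a state s and an action a, Q*(s, a) ≈ −5.7
• Action selection: a = argmax Q*(s, _) … or random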
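
The Q-learning slide above can be made concrete with a small tabular sketch. It is not ChainerRL code: the corridor environment, the function names and the hyperparameters are all assumptions chosen for illustration. It shows the two ingredients named on the slide, the update toward the best next-state value and the "argmax … or random" action choice.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Pick argmax_a Q(s, a), or a uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_learning_update(q, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """One tabular step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])

def corridor_step(s, a):
    """Hypothetical corridor with states 0..3; action 0 = left, 1 = right.
    Every step costs -1 and reaching state 3 ends the episode."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return s_next, -1.0, s_next == 3

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(4)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(q[s], 0.2, rng)
            s_next, r, done = corridor_step(s, a)
            q_learning_update(q, s, a, r, s_next, done)
            s = s_next
    return q
```

After training, the greedy action in every non-terminal state should be "right", the shortest way out of the corridor.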
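
Reinforcement learning algorithms (2/2)
• Policy gradient methods
• Function to optimize: π(s) (the action to take in state s)
• Qπ(s, a) (the sum of rewards obtained when following π afterwards) and similar functions are learned at the same time
• Williams '92, Sutton+ '99
• Lillicrap+ '15 (deep)
• Diagram: for a state s and an action a, Qπ(s, a) ≈ −5.7
• Action selection: a = π(s) + ε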
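
As a rough illustration of the policy gradient idea, here is a minimal REINFORCE sketch (the Williams '92 flavor) on a stateless bandit. It uses a stochastic softmax policy rather than the deterministic π(s) + ε of the slide, and it learns no Qπ. The bandit, the function names and the learning rate are assumptions made for this example.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards, steps=500, lr=0.1, seed=0):
    """REINFORCE on a hypothetical bandit whose arm i always pays rewards[i]."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(prefs)
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        r = rewards[a]
        # d/d prefs[i] of log pi(a) is (1 if i == a else 0) - probs[i]
        for i in range(len(prefs)):
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * r * grad_log_pi
    return softmax(prefs)
```

Calling `reinforce_bandit([0.0, 1.0])` should shift probability mass toward the second arm, since only that arm produces a nonzero gradient signal.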
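
• Learn from the results of actions
• Reflect what has been learned in the actions
• Implemented carelessly, this turns into spaghetti code
• Acting and learning are inseparable

With ChainerRL

    for _ in range(1000):                      # 1000 training episodes
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            # choose an action and learn from the previous reward
            action = agent.act_and_train(obs, reward)
            obs, reward, done, _ = env.step(action)
        # hand the final transition to the agent and end the episode
        agent.stop_episode_and_train(obs, reward, done)
    agent.save('final_agent')

How to build an environment (1/2)
• When defining it yourself
  • Initialize the environment: env.reset
  • Run the environment: env.step
• Example: a number-guessing game
  • A secret number is chosen at initialization
  • Whether the agent's guess (the "action") is too high or too low is returned as the "observation"

    import numpy as np

    class GuessNumberEnv(object):
        def reset(self):
            # choose the secret number
            self._state = np.random.uniform(-1, 1)
            # index 3 set: no guess has been made yet
            obs = np.array([0, 0, 0, 1], dtype=np.float32)
            return obs

        def step(self, action):
            assert action.shape == (1,)
            diff = action[0] - self._state
            obs = np.array([0, 0, 0, 0], dtype=np.float32)
            # index 0: guess too low, 1: exact, 2: guess too high
            obs[1 + int(np.sign(diff))] = 1
            # noisy reward, higher on average when the guess is closer
            reward = np.random.normal(0, 1) - abs(diff)
            return obs, reward, False, None  # not done, no info

How to build an environment (2/2)
• Environments provided by OpenAI Gym can be used as-is

    import gym
    env = gym.make('CartPole-v1')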
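
To show how the training loop and a hand-written environment fit together without installing anything, the sketch below pairs a plain-Python stand-in for the number-guessing environment with a hand-written agent. Both classes are hypothetical: `PlainGuessNumberEnv` takes scalar actions and drops the reward noise, and `BisectionAgent` only mimics the two training methods the loop calls (`act_and_train`, `stop_episode_and_train`). It is not a ChainerRL agent.

```python
import random

class PlainGuessNumberEnv:
    """Plain-Python stand-in for GuessNumberEnv (scalar actions, no reward noise)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def reset(self):
        self._state = self._rng.uniform(-1, 1)   # the secret number
        return [0, 0, 0, 1]                      # "no guess yet"

    def step(self, action):
        diff = action - self._state
        sign = (diff > 0) - (diff < 0)           # -1, 0 or +1
        obs = [0, 0, 0, 0]
        obs[1 + sign] = 1                        # 0: too low, 1: exact, 2: too high
        return obs, -abs(diff), False, None      # never done, no info

class BisectionAgent:
    """Hypothetical agent exposing the two training methods used by the loop."""

    def __init__(self):
        self._reset_bounds()

    def _reset_bounds(self):
        self.lo, self.hi, self.guess = -1.0, 1.0, 0.0

    def act_and_train(self, obs, reward):
        # "Learning" here is just narrowing the interval from the last observation.
        if obs[0]:
            self.lo = self.guess
        elif obs[2]:
            self.hi = self.guess
        self.guess = (self.lo + self.hi) / 2
        return self.guess

    def stop_episode_and_train(self, obs, reward, done=False):
        self.last_reward = reward
        self._reset_bounds()

def run_episode(env, agent, max_steps=20):
    obs = env.reset()
    reward = 0.0
    for _ in range(max_steps):   # this env never reports done, so cap the length
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        if done:
            break
    agent.stop_episode_and_train(obs, reward, done)
    return reward
```

Because the environment never returns `done=True`, the episode length has to be capped by the caller; 20 bisection steps bring the guess to within about 2e-6 of the secret number.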

How to build an agent (example: Deep Q-Network)

    import chainer
    import chainerrl

    # Q-function: fully connected network with one output per discrete action
    model = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
        env.observation_space.low.size, env.action_space.n,
        n_hidden_channels=64, n_hidden_layers=1)
    opt = chainer.optimizers.Adam()
    opt.setup(model)
    rbuf = chainerrl.replay_buffer.ReplayBuffer(None)
    # epsilon-greedy exploration with a linearly decaying epsilon
    explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
        1.0, 0.1, 10**4, random_action_func=env.action_space.sample)
    agent = chainerrl.agents.DQN(
        model, opt, rbuf, gamma=0.98, explorer=explorer,
        target_update_interval=100, replay_start_size=10**3)
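
The explorer arguments `1.0, 0.1, 10**4` read as a start epsilon, an end epsilon and a number of decay steps. Assuming that reading, and assuming the decay is a straight line followed by a constant, the schedule can be sketched as a stand-alone function. The function name and defaults below are illustrative, not part of ChainerRL.

```python
def linear_decay_epsilon(t, start=1.0, end=0.1, decay_steps=10**4):
    """Epsilon after t steps: linear from start to end, then constant at end."""
    if t >= decay_steps:
        return end
    return start + (end - start) * t / decay_steps
```

Under these assumptions the agent acts fully at random at step 0, with probability 0.55 halfway through the decay, and with probability 0.1 from step 10**4 onward.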

What makes up an agent
• model
• optimizer
• replay buffer
• explorer
Each of these can be changed independently.

Changing the model (1/2)
In DQN, the model takes a state as input and outputs a Q-value for each action.
• Use a model provided by ChainerRL

    model = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
        dim_obs, n_action, n_hidden_channels, n_hidden_layers)

• Convert a chain you wrote yourself for use with ChainerRL

    model = chainerrl.q_functions.SingleModelStateQFunctionWithDiscreteAction(
        MyChain(dim_obs, n_action))

• Diagram: the model maps s to Q(s, a1), Q(s, a2), …

Changing the model (2/2)
• In environments with continuous-valued actions, use the Normalized Advantage Function

    model = chainerrl.q_functions.FCQuadraticStateQFunction(
        dim_obs, dim_action, n_hidden_channels, n_hidden_layers, action_space)

• (Incidentally, in this case it is a good idea for the explorer to use an Ornstein-Uhlenbeck process, as with DDPG)

    explorer = chainerrl.explorers.AdditiveOU(...)
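
For readers unfamiliar with the Ornstein-Uhlenbeck process mentioned above, here is a minimal discretized sketch of the noise it adds to a continuous action. The class name, the parameter values and the unit time step are assumptions for illustration. This is not the implementation behind `chainerrl.explorers.AdditiveOU`.

```python
import random

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process with unit time step (a sketch)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.3, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self._rng = random.Random(seed)
        self.x = mu

    def sample(self):
        # Pull the state back toward mu, then add Gaussian noise.
        self.x += self.theta * (self.mu - self.x) + self.sigma * self._rng.gauss(0.0, 1.0)
        return self.x

def noisy_action(greedy_action, noise):
    """Additive exploration: the policy's action plus temporally correlated noise."""
    return greedy_action + noise.sample()
```

Unlike independent Gaussian noise, consecutive samples are correlated, so the perturbation drifts smoothly instead of jittering around the policy's action.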

Changing the replay buffer
Replay buffer: a technique that trains on experiences sampled from stored ones, so that the data in a minibatch is not biased
• The size can be set and the sampling algorithm can be changed

    rbuf = chainerrl.replay_buffer.ReplayBuffer(5 * 10**5)
    rbuf = chainerrl.replay_buffer.EpisodicReplayBuffer(10**4)
    rbuf = chainerrl.replay_buffer.PrioritizedReplayBuffer(5 * 10**5)
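
The idea behind a uniform replay buffer fits in a few lines. The class below is a hypothetical stand-in written for this note, not ChainerRL's `ReplayBuffer`: it keeps the most recent transitions up to a capacity and samples minibatches uniformly at random.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-capacity store of transitions with uniform sampling (a sketch)."""

    def __init__(self, capacity=None, seed=0):
        self._memory = deque(maxlen=capacity)   # maxlen=None means unbounded
        self._rng = random.Random(seed)

    def append(self, state, action, reward, next_state, done):
        # Once the buffer is full, the oldest transition is dropped automatically.
        self._memory.append((state, action, reward, next_state, done))

    def sample(self, n):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self._rng.sample(list(self._memory), n)

    def __len__(self):
        return len(self._memory)
```

Sampling from stored experience rather than training on the latest transitions is what keeps a minibatch from consisting only of highly correlated, consecutive steps.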
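
Changing the algorithm
• To switch to an improved variant of DQN (for example Double DQN), replace

    agent = chainerrl.agents.DQN(...)

  with

    agent = chainerrl.agents.DoubleDQN(...)

• Even when switching to an algorithm that uses a different model (for example DDPG), the same replay buffer and explorer can be used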
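
The reason this swap needs no other changes is that DQN and Double DQN differ only in how the learning target is computed. The sketch below spells out that difference on plain Python lists. The function names and arguments are illustrative assumptions and do not mirror ChainerRL internals.

```python
def dqn_target(reward, done, q_target_next, gamma=0.98):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.98):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best]
```

For example, with online values `[1.0, 3.0]` and target values `[5.0, 2.0]`, DQN bootstraps from 5.0 while Double DQN bootstraps from 2.0, which is how it reduces the overestimation caused by taking a max over noisy estimates.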

Implemented algorithms
• Q-learning algorithms:
  • Deep Q-Network
  • Double DQN
  • Normalized Advantage Function
  • (Persistent) Advantage Learning
  • Asynchronous Advantage Actor-Critic
  • Asynchronous N-step Q-learning
  • Path Consistency Learning
• Policy gradient methods:
  • Deep Deterministic Policy Gradient
  • SVG(0)
  • Actor-Critic with Experience Replay
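
The ChainerRL training loop
• What chainerrl.experiments.train_agent can do:
  • Run the agent in a test environment every fixed number of iterations and output the learning curve (the data needed to plot it)
  • Save the model automatically
• It is not implemented as a chainer Trainer
  • What to do about this is undecided

Parallelization
• Algorithms such as A3C parallelize agents
• In reinforcement learning, async update is currently the mainstream
• In ChainerRL, calling train_agent_async runs training in multiple processes
• ChainerRL: async / ChainerMN: sync

Conclusion
• Let's do reinforcement learning with ChainerRL
  • Many implementations, including the latest algorithms
  • The parts that appear often in algorithms can be used individually
• Feedback is welcome
  • Features and algorithms you would like to see implemented
  • Interface
github.com/chainer/chainerrl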