Slide 1

Reinforcement learning – A gentle introduction
Uwe Friedrichsen – codecentric AG – 2018-2019

Slide 2

Uwe Friedrichsen
CTO @ codecentric
https://twitter.com/ufried
https://www.speakerdeck.com/ufried
https://medium.com/@ufried

Slide 3

What can I expect from this talk?

Slide 4

Goals of this talk
• Understand the most important concepts and how they are connected to each other
• Know the most important terms
• Get some understanding of how to translate concepts to code
• Lose fear of math … ;)
• Spark interest
• Give you a little head start if you decide to dive deeper

Slide 5

Why: Some success stories and a prognosis

Slide 6

2013

Slide 7

2016

Slide 8

2017

Slide 9

2018

Slide 10

2019

Slide 11

… to be continued in the near future …

Slide 12

Do you really think that an average white collar job has more degrees of freedom than StarCraft II?

Slide 13

(Deep) Reinforcement Learning has the potential to affect white collar workers similarly to how robots affected blue collar workers

Slide 14

What: The basic idea

Slide 15

"Reinforcement learning (RL) is the study of how an agent can interact with its environment to learn a policy which maximizes expected cumulative rewards for a task"
-- Henderson et al., Deep Reinforcement Learning that Matters
Source: https://arxiv.org/abs/1709.06560

Slide 16

Eh, what do you mean?

Slide 17

Core idea: An agent tries to solve a task by interacting with its environment
(Diagram: Agent interacts with the Environment, which contains the Task to solve)

Slide 18

Problem: How can we model the interaction in a way that allows the agent to learn how to solve the given task?
Approach: Apply concepts from learning theory, particularly from operant conditioning *
* learn a desired behavior (which solves the task) based on rewards and punishment

Slide 19

Core idea: An agent tries to maximize the rewards received over time from the environment
(Diagram: Agent observes the Environment (State, Reward) and manipulates it (Action) to maximize rewards over time while solving the Task)

Slide 20

Observations so far
• Approach deliberately limited to reward-based learning
• Results in a very narrow interaction interface
• Good modeling is essential and often challenging
  • Rewards are crucial for the ability to learn
  • States must support deciding on actions
  • Actions need to be effective in the environment

Slide 21

How: 6 questions from concepts to implementation

Slide 22

Goal: Learn the best possible actions (with respect to the cumulative rewards) in response to the observed states
(Diagram: Agent observes the Environment (State, Reward) and manipulates it (Action) to maximize rewards over time while solving the Task)

Slide 23

Question 1: Where do states, rewards and possible actions come from?

Slide 24

(Diagram: Agent – Observe (State, Reward), Manipulate (Action) – known to the agent; Model – Map (State, Action), Model (Reward) – sometimes known to the agent (but usually not); Environment – unknown to the agent)

Slide 25

Model
• Maps environment to narrow interface
  • Makes complexity of environment manageable for agent
  • Responsible for calculating rewards (tricky to get right)
• Usually not known to the agent
  • Usually only interface visible
  • Sometimes model also known (e.g., dynamic programming)
• Creating a good model can be challenging
  • Representational learning approaches in DL can help

Slide 26

Question 2: How can we represent the (unknown) behavior of a model in a general way?

Slide 27

Represent the behavior using a probability distribution function

Slide 28

Transition probability function (based on MDP)
Read: Probability that you will observe s' and r after you sent a as a response to observing s
Reward function (derived from the transition probability function)
Read: Expected reward r after you sent a as a response to observing s
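In the standard MDP notation of [Sut2018] (the formulas themselves are not part of this transcript, so the following is a reconstruction rather than a verbatim copy of the slide):

```latex
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}

r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s,\, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)
```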

Slide 29

Question 3: How can we represent the behavior of the agent? (It's still about learning the best possible actions)

Slide 30

Represent the behavior using a probability distribution function

Slide 31

Policy (stochastic)
Read: Probability that you will choose a after you observed s
Policy (deterministic)
Read: You know how to read that one … ;)
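In standard notation (again a reconstruction, not a verbatim copy of the slide):

```latex
\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\} \quad \text{(stochastic policy)}

a = \pi(s) \quad \text{(deterministic policy)}
```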

Slide 32

Question 4: How can we learn a good (or even optimal) policy?

Slide 33

Well, it depends …

Slide 34

Learning a policy
• Many approaches available
  • Model-based vs. model-free
  • Value-based vs. policy-based
  • On-policy vs. off-policy
  • Shallow backups vs. deep backups
  • Sample backups vs. full backups
  • … plus a lot of variants
• Here we focus on 2 common (basic) approaches

Slide 35

Approach 1: Monte Carlo
model-free – value-based – on-policy – deep backups – sample backups

Slide 36

Return
• Goal is to optimize future rewards
• Return describes discounted future rewards from step t

Slide 37

Return
• Goal is to optimize future rewards
• Return describes discounted future rewards from step t
• Why a discounting factor γ?
  • Future rewards may have higher uncertainty
  • Makes handling of infinite episodes easier
  • Controls how greedy or foresighted an agent acts
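In the notation of [Sut2018], the discounted return from step t (reconstructed, not verbatim from the slide) reads:

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1
```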

Slide 38

Value functions
• State-value function
  • Read: Value of being in state s under policy π

Slide 39

Value functions
• State-value function
  • Read: Value of being in state s under policy π
• Action-value function (also called Q-value function)
  • Read: Value of taking action a in state s under policy π
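In standard notation (reconstructed):

```latex
v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]

q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s,\, A_t = a]
```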

Slide 40

In value-based learning an optimal policy is a policy that results in an optimal value function

Slide 41

Optimal policy and value function
• Optimal state-value function
• Optimal action-value function
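In standard notation (reconstructed):

```latex
v_*(s) \doteq \max_\pi v_\pi(s)

q_*(s, a) \doteq \max_\pi q_\pi(s, a)
```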

Slide 42

How can we learn an optimal policy?

Slide 43

Generalized policy iteration
• Learning an optimal policy is usually not feasible
• Instead an approximation is targeted
• The following algorithm is known to converge
  1. Start with a random policy
  2. Evaluate the policy
  3. Update the policy greedily
  4. Repeat from step 2
(Diagram: policy π and value function alternating between Evaluation and Improvement)

Slide 44

How can we evaluate a policy?

Slide 45

Use a Monte Carlo method

Slide 46

Monte Carlo methods
• Umbrella term for a class of algorithms
  • Use repeated random sampling to obtain numerical results
• Algorithm used for policy evaluation (idea; see the sketch below)
  • Play many episodes, each with random start state and action
  • Calculate returns for all state-action pairs seen
  • Approximate q-value function by averaging returns for each state-action pair seen
  • Stop if change of q-value function becomes small enough
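A minimal Python sketch of this idea, using first-visit averaging with exploring starts; the toy corridor environment, function names, and hyperparameters are illustrative assumptions, not from the talk:

```python
import random
from collections import defaultdict

# Toy environment (illustrative): a corridor of states 0..4, actions -1/+1,
# reward -1 per step, episode ends in the terminal state 4.
STATES, ACTIONS, GAMMA = list(range(5)), [-1, +1], 0.9

def step(state, action):
    nxt = min(max(state + action, 0), 4)
    return nxt, (0.0 if nxt == 4 else -1.0), nxt == 4

def play_episode(policy):
    """One episode with a random start state and action (exploring starts)."""
    s, a = random.choice(STATES[:-1]), random.choice(ACTIONS)
    trajectory = []
    for _ in range(100):                      # cap episode length
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        if done:
            break
        s, a = s_next, policy[s_next]
    return trajectory

def mc_evaluate(policy, episodes=5000):
    """Approximate q_pi by averaging first-visit returns per (state, action) pair."""
    Q, counts = defaultdict(float), defaultdict(int)
    for _ in range(episodes):
        trajectory, G, returns = play_episode(policy), 0.0, []
        for s, a, r in reversed(trajectory):  # discounted return, computed backwards
            G = r + GAMMA * G
            returns.append((s, a, G))
        seen = set()
        for s, a, G in reversed(returns):     # first visit only
            if (s, a) in seen:
                continue
            seen.add((s, a))
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # running average
    return Q

policy = {s: +1 for s in STATES}              # a fixed policy to evaluate
print(mc_evaluate(policy)[(0, +1)])           # ≈ value of moving right in state 0
```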

Slide 47

Source: [Sut2018]

Slide 48

Monte Carlo methods – pitfalls
• Exploring enough state-action pairs to learn properly
  • Known as the explore-exploit dilemma
  • Usually addressed by exploring starts or epsilon-greedy (see the helper sketched below)
• Very slow convergence
  • Lots of episodes needed before policy gets updated
  • Trick: Update policy after each episode
  • Not yet formally proven, but empirically known to work
For a code example, see e.g. https://github.com/lazyprogrammer/machine_learning_examples/blob/master/rl/monte_carlo_es.py
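As a small illustration of the epsilon-greedy idea mentioned above (the function name and the dict-based q-value representation are assumptions):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore a random action with probability epsilon, otherwise exploit greedily.

    Q is assumed to map (state, action) pairs to estimated values,
    e.g., a collections.defaultdict(float).
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```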

Slide 49

Approach 2: Temporal-Difference learning (TD learning)
model-free – value-based – on/off-policy – shallow backups – sample backups

Slide 50

Bellman equations
• Allow describing the value function recursively (see the equations below)
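For example, the Bellman equations for v_π and q_π in the notation of [Sut2018] (reconstructed, not verbatim from the slide):

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma\, v_\pi(s') \bigr]

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
```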

Slide 51

Bellman equations – consequences
• Enables bootstrapping
  • Update value estimate on the basis of another estimate
• Enables updating the policy after each single step
  • Proven to converge for many configurations
• Leads to Temporal-Difference learning (TD learning)
  • Update value function after each step just a bit
  • Use estimate of next step’s value to calculate the return
  • SARSA (on-policy) or Q-learning (off-policy) as control (see the Q-learning sketch below)
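A minimal Python sketch of tabular Q-learning (off-policy TD control) on the same kind of toy corridor environment as above; all names and hyperparameters are illustrative assumptions:

```python
import random
from collections import defaultdict

# Toy corridor environment (illustrative): states 0..4, actions -1/+1,
# reward -1 per step, episode ends in the terminal state 4.
ACTIONS = [-1, +1]

def step(state, action):
    nxt = min(max(state + action, 0), 4)
    return nxt, (0.0 if nxt == 4 else -1.0), nxt == 4

Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1        # step size, discount factor, exploration rate

for episode in range(2000):
    s = 0
    for _ in range(200):                      # cap episode length
        # epsilon-greedy behavior policy
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
        if done:
            break

# greedy policy derived from the learned q-values
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)})
```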

Slide 52

Source: [Sut2018]

Slide 53

Source: [Sut2018]

Slide 54

Question 5: When do we need all the other concepts?

Slide 55

Deep Reinforcement Learning
• Problem: Very large state spaces
  • Default for non-trivial environments
  • Make tabular representation of the value function infeasible
• Solution: Replace value table with a Deep Neural Network
  • Deep Neural Networks are great function approximators
  • Implement q-value function as DNN
  • Use Stochastic Gradient Descent (SGD) to train the DNN (see the sketch below)
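A minimal sketch of this idea using PyTorch (the slides do not prescribe a framework); network size, hyperparameters, and the dummy transition are illustrative assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

# Q-network: maps a state vector to one q-value per action (replaces the value table)
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def td_update(state, action, reward, next_state, done):
    """One stochastic gradient step towards the TD target r + gamma * max_a' Q(s', a')."""
    q_value = q_net(state)[action]
    with torch.no_grad():                      # the target is treated as a constant
        target = reward + (0.0 if done else GAMMA * q_net(next_state).max())
    loss = (q_value - target) ** 2             # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage with dummy data (a real agent would sample transitions from the environment)
s, s_next = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
td_update(s, action=0, reward=1.0, next_state=s_next, done=False)
```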

Slide 56

Policy Gradient learning
• Learning the policy directly (without the value function “detour”)
  • Actual goal is learning an optimal policy, not a value function
  • Sometimes learning a value function is not feasible
• Leads to Policy Gradient learning
  • Parameterize the policy with θ, which stores its “configuration”
  • Learn an optimal policy using gradient ascent
  • Lots of implementations, e.g., REINFORCE, Actor-Critic, … (a REINFORCE-style sketch follows below)
  • Can easily be extended to DNNs
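A minimal sketch of the REINFORCE idea, again using PyTorch as an illustrative choice; the policy network stands in for the parameterization θ, and all names and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

# Policy network: maps a state to action probabilities (its weights play the role of theta)
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One REINFORCE update from a single episode (lists of equal length)."""
    # discounted returns G_t for every step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    loss = 0.0
    for s, a, G in zip(states, actions, returns):
        prob = policy_net(s)[a]
        loss = loss - torch.log(prob) * G      # gradient ascent on return-weighted log-probs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage with a dummy one-step episode (a real agent would collect real trajectories)
reinforce_update([torch.randn(STATE_DIM)], [0], [1.0])
```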

Slide 57

Question 6: How can I best start my own journey into RL?

Slide 58

No one-size-fits-all approach – people are different.
Here is what worked quite well for me …

Slide 59

Your own journey – foundations
• Learn some of the foundations
  • Read some blogs, e.g., [Wen2018]
  • Do an online course at Coursera, Udemy, …
  • Read a book, e.g., [Sut2018]
• Code some basic stuff on your own
  • Online courses often offer nice exercises
  • Try some simple environments on OpenAI Gym [OAIGym]

Slide 60

Your own journey – Deep RL
• Pick up some of the basic Deep RL concepts
  • Read some blogs, e.g., [Sim2018]
  • Do an online course at Coursera, Udemy, …
  • Read a book, e.g., [Goo2016], [Lap2018]
• Revisit OpenAI Gym
  • Retry previous environments using a DNN
  • Try more advanced environments, e.g., Atari environments

Slide 61

Your own journey – moving on
• Repeat and pick up some new stuff on each iteration
  • Complete the books
  • Do advanced online courses
  • Read research papers, e.g., [Mni2013]
  • Try to implement some of the papers (if you have enough computing power at hand)
  • Try more complex environments, e.g., Vizdoom [Vizdoom]

Slide 62

Outlook: Moving on – evolution, challenges and you

Slide 63

Remember my claim from the beginning?

Slide 64

(Deep) Reinforcement Learning has the potential to affect white collar workers similarly to how robots affected blue collar workers

Slide 65

Challenges of (Deep) RL
• Massive training data demands
  • Hard to provide or generate
  • One of the reasons games are used so often as environments
  • Probably the reason white collar workers are still quite unaffected
• Hard to stabilize and get production-ready
  • Research results are often hard to reproduce [Hen2019]
  • Hyperparameter tuning and a priori error prediction are hard
• Massive demand for computing power

Slide 66

Status quo of (Deep) RL
• Most current progress based on brute force and trial & error
• Lack of training data for most real-world problems becomes a huge issue
• Research (and application) limited to a few companies
  • Most other companies have neither the comprehension nor the skills nor the resources to drive RL solutions

Slide 67

Potential futures of (Deep) RL
• Expected breakthrough happens soon
  • Discovery of how to easily apply RL to real-world problems
  • Market probably dominated by a few companies
  • All other companies just use their solutions
• Expected breakthrough does not happen soon
  • Inflated market expectations do not get satisfied
  • Next “Winter of AI”
  • AI will become an invisible part of commodity solutions
  • RL will not see any progress for several years

Slide 68

And what does all that mean for me?

Slide 69

Positioning yourself
• You rather believe in the breakthrough of (Deep) RL
  • Help democratize AI & RL – become part of the community
• You rather do not believe in the breakthrough of (Deep) RL
  • Observe and enjoy your coffee … ;)
• You are undecided
  • It’s a fascinating topic after all
  • So, dive in a bit and decide when things become clearer

Slide 70

Enjoy and have a great time!

Slide 71

References – Books
[Goo2016] I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning”, MIT Press, 2016
[Lap2018] M. Lapan, “Deep Reinforcement Learning Hands-On”, Packt Publishing, 2018
[Sut2018] R. S. Sutton, A. G. Barto, “Reinforcement Learning – An Introduction”, 2nd edition, MIT Press, 2018

Slide 72

References – Papers
[Hen2019] P. Henderson et al., “Deep Reinforcement Learning that Matters”, arXiv:1709.06560
[Mni2013] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning”, arXiv:1312.5602v1

Slide 73

References – Blogs
[Sim2018] T. Simonini, “A free course in Deep Reinforcement Learning from beginner to expert”, https://simoninithomas.github.io/Deep_reinforcement_learning_Course/
[Wen2018] L. Weng, “A (long) peek into Reinforcement Learning”, https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html

Slide 74

References – Environments
[OAIGym] OpenAI Gym, http://gym.openai.com
[Vizdoom] Vizdoom, Doom-based AI Research Platform, http://vizdoom.cs.put.edu.pl

Slide 75

Uwe Friedrichsen
CTO @ codecentric
https://twitter.com/ufried
https://www.speakerdeck.com/ufried
https://medium.com/@ufried