
Reinforcement Learning - A gentle introduction

If you dive into Reinforcement Learning, you get flooded with a plethora of concepts. Unfortunately, most of the time nobody explains the ideas behind these concepts: Why these concepts? What are they good for? What are their limits? How are they related to each other? When should you pick which one? And so on.

As a result, I got quite a few things wrong in the beginning, which resulted in useless code and more. To (hopefully) give you a better start than I had, I created this talk. It starts with the general idea of Reinforcement Learning and then moves step by step to the implementation level, explaining the concepts, the whys, hows and limits along the way. Afterwards, it briefly shows some popular additional concepts and explains why and when they are needed. At the end, it gives some pointers to resources for diving deeper.

While this is just a first peek into the world of Reinforcement Learning and, as always, the voice track is missing, I still hope it will make getting started with this fascinating topic a bit easier for you.

Update: A recording of the talk (leaving out some details due to time restrictions) can be found at https://www.youtube.com/watch?v=RIEVPxywzu8


Uwe Friedrichsen

May 10, 2019


  1. Reinforcement learning A gentle introduction Uwe Friedrichsen –codecentric AG –

  2. Uwe Friedrichsen CTO @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://medium.com/@ufried

  3. What can I expect from this talk?

  4. Goals of this talk
    • Understand the most important concepts and how they are connected to each other
    • Know the most important terms
    • Get some understanding of how to translate concepts to code
    • Lose fear of math … ;)
    • Spark interest
    • Give you a little head start if you decide to dive deeper
  5. Why: Some success stories and a prognosis

  6. 2013

  7. 2016

  8. 2017

  9. 2018

  10. 2019

  11. … to be continued in the near future …

  12. Do you really think that an average white collar job has more degrees of freedom than StarCraft II?
  13. (Deep) Reinforcement Learning has the potential to affect white collar workers similar to how robots affected blue collar workers
  14. What: The basic idea

  15. Reinforcement learning (RL) is the study of how an agent can interact with its environment to learn a policy which maximizes expected cumulative rewards for a task
    -- Henderson et al., Deep Reinforcement Learning that Matters
    Source: https://arxiv.org/abs/1709.06560
  16. Eh, what do you mean?

  17. Core idea: An agent tries to solve a task by interacting with its environment
    [Diagram: the Agent interacts with the Environment to solve a Task]
  18. Problem: How can we model the interaction in a way that allows the agent to learn how to solve the given task?
    Approach: Apply concepts from learning theory, particularly from operant conditioning *
    * Learn a desired behavior (which solves the task) based on rewards and punishment
  19. Core idea: An agent tries to maximize the rewards received over time from the environment
    [Diagram: the Agent observes (State, Reward) from the Environment and manipulates it via (Action); to solve the Task, it maximizes rewards over time]
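
The observe/manipulate loop on this slide can be sketched in a few lines of Python. The `CoinFlipEnv` environment, its `reset`/`step` methods and the random agent are hypothetical illustrations and not part of the talk; the interface only loosely resembles the style of environment libraries such as OpenAI Gym.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of the next coin flip.

    The state is the most recent coin face; the reward is 1.0 for a
    correct guess and 0.0 otherwise. An episode lasts `length` steps.
    """

    def __init__(self, length=10, seed=0):
        self.length = length
        self.rng = random.Random(seed)

    def reset(self):
        self.steps = 0
        self.face = self.rng.choice(["heads", "tails"])
        return self.face  # initial state

    def step(self, action):
        self.face = self.rng.choice(["heads", "tails"])
        reward = 1.0 if action == self.face else 0.0
        self.steps += 1
        done = self.steps >= self.length
        return self.face, reward, done


def run_episode(env, choose_action):
    """Run one episode and return the sum of rewards collected."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = choose_action(state)           # manipulate (action)
        state, reward, done = env.step(action)  # observe (state, reward)
        total_reward += reward
    return total_reward


agent_rng = random.Random(1)
total = run_episode(CoinFlipEnv(), lambda s: agent_rng.choice(["heads", "tails"]))
print(total)
```

The agent here ignores the state and learns nothing; the point is only the narrow interface: actions go in, states and rewards come out.
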
  20. Observations so far
    • Approach deliberately limited to reward-based learning
    • Results in very narrow interaction interface
    • Good modeling is essential and often challenging
    • Rewards are crucial for ability to learn
    • States must support deciding on actions
    • Actions need to be effective in the environment
  21. How: 6 questions from concepts to implementation

  22. Goal: Learn the best possible actions (with respect to the cumulative rewards) in response to the observed states
    [Diagram: the Agent observes (State, Reward) from the Environment and manipulates it via (Action); to solve the Task, it maximizes rewards over time]
  23. Question 1: Where do states, rewards and possible actions come from?

  24. [Diagram: Agent, Model and Environment. The Agent observes (State, Reward) and manipulates via (Action); the Model maps (State, Action) and provides (Reward). Legend: known to the agent; sometimes known to the agent (but usually not); unknown to the agent]
  25. Model
    • Maps environment to narrow interface
    • Makes complexity of environment manageable for agent
    • Responsible for calculating rewards (tricky to get right)
    • Usually not known to the agent
      • Usually only interface visible
      • Sometimes model also known (e.g., dynamic programming)
    • Creating a good model can be challenging
      • Representation learning approaches in DL can help
  26. Question 2: How can we represent the (unknown) behavior of a model in a general way?
  27. Represent the behavior using a probability distribution function

  28. Transition probability function (based on MDP)
    Read: Probability that you will observe s’ and r after you sent a as a response to observing s
    Reward function (derived from transition probability function)
    Read: Expected reward r after you sent a as a response to observing s
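
The formulas themselves did not survive the transcript. As an assumption about what the slide showed, here are the two functions in the notation commonly used in [Sut2018]:

```latex
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}

r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a]
        = \sum_{r} r \sum_{s'} p(s', r \mid s, a)
```

The second line shows why the reward function is "derived": it is the expectation of r under the transition probability function, summed over all possible next states.
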
  29. Question 3: How can we represent the behavior of the agent? (It's still about learning the best possible actions)
  30. Represent the behavior using a probability distribution function

  31. Policy (stochastic)
    Read: Probability that you will choose a after you observed s
    Policy (deterministic)
    Read: You know how to read that one … ;)
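
For small problems, both kinds of policy can be stored as plain lookup tables. The following is a minimal sketch; the battery states and action names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical two-state, two-action example; all names are illustrative.
# Stochastic policy: pi(a | s) as a table of action probabilities per state.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}

# Deterministic policy: pi(s) maps each state to exactly one action.
deterministic_policy = {
    "low_battery": "recharge",
    "high_battery": "search",
}


def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_action(stochastic_policy, "low_battery", rng) for _ in range(1000)]
recharge_share = samples.count("recharge") / len(samples)
print(recharge_share)
print(deterministic_policy["low_battery"])
```

A deterministic policy is the special case of a stochastic one in which a single action per state has probability 1.
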
  32. Question 4: How can we learn a good (or even optimal) policy?
  33. Well, it depends …

  34. Learning a policy
    • Many approaches available
      • Model-based vs. model-free
      • Value-based vs. policy-based
      • On-policy vs. off-policy
      • Shallow backups vs. deep backups
      • Sample backups vs. full backups
      • … plus a lot of variants
    • Here focus on 2 common (basic) approaches
  35. Approach 1: Monte Carlo
    model-free – value-based – on-policy – deep backups – sample backups
  36. Return
    • Goal is to optimize future rewards
    • Return describes discounted future rewards from step t
  37. Return
    • Goal is to optimize future rewards
    • Return describes discounted future rewards from step t
    • Why a discounting factor γ?
      • Future rewards may have higher uncertainty
      • Makes handling of infinite episodes easier
      • Controls how greedy or foresighted an agent acts
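
In the notation of [Sut2018] (an assumption about the formula on the slide), the return is G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … The small sketch below computes it for a finite list of rewards and shows how γ shifts the agent between greedy (γ = 0) and foresighted (γ = 1); the function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a list of rewards."""
    g = 0.0
    # Work backwards so that each step is G = r + gamma * G_next.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # 3.0
print(discounted_return(rewards, gamma=0.5))  # 1.75
print(discounted_return(rewards, gamma=0.0))  # 1.0
```

With γ < 1 and bounded rewards, the sum stays finite even for infinitely long episodes, which is the "makes handling of infinite episodes easier" point above.
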
  38. Value functions
    • State-value function
      • Read: Value of being in state s under policy π
  39. Value functions
    • State-value function
      • Read: Value of being in state s under policy π
    • Action-value function (also called Q-value function)
      • Read: Value of taking action a in state s under policy π
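
The slide's formulas are missing from the transcript. Assuming the standard definitions from [Sut2018], both value functions are expected returns G_t under the policy π:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]

q_\pi(s, a) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
```
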
  40. In value-based learning an optimal policy is a policy that results in an optimal value function
  41. Optimal policy and value function
    • Optimal state-value function
    • Optimal action-value function
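
Again assuming the notation of [Sut2018] rather than the exact slide content, the optimal value functions take the maximum over all policies, and a policy that acts greedily on the optimal action-value function is optimal:

```latex
v_*(s) \doteq \max_\pi v_\pi(s)

q_*(s, a) \doteq \max_\pi q_\pi(s, a)

\pi_*(s) = \arg\max_a q_*(s, a)
```
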
  42. How can we learn an optimal policy?

  43. Generalized policy iteration
    • Learning an optimal policy usually not feasible
    • Instead an approximation is targeted
    • Following algorithm is known to converge
      1. Start with a random policy
      2. Evaluate the policy
      3. Update the policy greedily
      4. Repeat from step 2
    [Diagram: cycle alternating between policy evaluation and policy improvement]
  44. How can we evaluate a policy?

  45. Use a Monte Carlo method

  46. Monte Carlo methods
    • Umbrella term for class of algorithms
      • Use repeated random sampling to obtain numerical results
    • Algorithm used for policy evaluation (idea)
      • Play many episodes, each with random start state and action
      • Calculate returns for all state-action pairs seen
      • Approximate q-value function by averaging returns for each state-action pair seen
      • Stop if change of q-value function becomes small enough
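
The evaluation idea above can be sketched as follows. This is a simplified illustration, not the algorithm from [Sut2018] verbatim: the two-state task in `TRANSITIONS` is hypothetical, and the loop runs for a fixed number of episodes instead of stopping once the q-values change little.

```python
import random
from collections import defaultdict

# Hypothetical two-state episodic task, used only for illustration:
# (state, action) -> (next_state, reward); next_state None ends the episode.
TRANSITIONS = {
    ("A", "go"): ("B", 0.0),
    ("A", "stop"): (None, 1.0),
    ("B", "go"): (None, 3.0),
    ("B", "stop"): (None, 0.0),
}
STATES, ACTIONS, GAMMA = ["A", "B"], ["go", "stop"], 0.9


def play_episode(policy, rng):
    """Play one episode with a random start state and start action."""
    state, action = rng.choice(STATES), rng.choice(ACTIONS)
    episode = []
    while state is not None:
        next_state, reward = TRANSITIONS[(state, action)]
        episode.append((state, action, reward))
        state = next_state
        if state is not None:
            action = policy(state, rng)
    return episode


def mc_evaluate(policy, episodes=20000, seed=0):
    """Approximate q(s, a) by averaging the sampled returns."""
    rng = random.Random(seed)
    return_sum, visits = defaultdict(float), defaultdict(int)
    for _ in range(episodes):
        g = 0.0
        # Walk backwards through the episode to accumulate the return.
        for state, action, reward in reversed(play_episode(policy, rng)):
            g = reward + GAMMA * g
            return_sum[(state, action)] += g
            visits[(state, action)] += 1
    return {sa: return_sum[sa] / visits[sa] for sa in return_sum}


def random_policy(state, rng):
    """The policy being evaluated: pick any action with equal probability."""
    return rng.choice(ACTIONS)


q = mc_evaluate(random_policy)
for state_action in sorted(q):
    print(state_action, round(q[state_action], 2))
```

Under the random policy, the estimate for ("A", "go") should approach 0.9 × 1.5 = 1.35, because state B is worth 1.5 on average when actions there are chosen at random.
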
  47. Source: [Sut2018]

  48. Monte Carlo methods – pitfalls
    • Exploring enough state-action pairs to learn properly
      • Known as explore-exploit dilemma
      • Usually addressed by exploring starts or epsilon-greedy
    • Very slow convergence
      • Lots of episodes needed before policy gets updated
      • Trick: Update policy after each episode
      • Not yet formally proven, but empirically known to work
    For code example, e.g., see https://github.com/lazyprogrammer/machine_learning_examples/blob/master/rl/monte_carlo_es.py
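
Epsilon-greedy action selection, one of the two remedies named above, fits in a few lines. This is a minimal sketch; the function name and the example q-values are illustrative assumptions.

```python
import random


def epsilon_greedy(q_values, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)                  # explore
    return max(actions, key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = {"left": 0.2, "right": 1.0}  # hypothetical estimates for one state
picks = [epsilon_greedy(q_values, ["left", "right"], 0.1, rng) for _ in range(1000)]
print(picks.count("right") / len(picks))
```

With epsilon = 0.1 the agent mostly exploits its current estimates but still tries every action occasionally, so wrong early estimates can be corrected.
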
  49. Approach 2: Temporal-Difference learning (TD learning)
    model-free – value-based – on/off-policy – shallow backups – sample backups
  50. Bellman equations • Allow describing the value function recursively
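
The equations are missing from the transcript. Assuming the standard form from [Sut2018], the Bellman expectation equations express the value of a state (or state-action pair) through the values of its successors:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \left[ r + \gamma \, v_\pi(s') \right]

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)
              \left[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \right]
```
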

  51. Bellman equations – consequences
    • Enable bootstrapping
      • Update value estimate on the basis of another estimate
    • Enable updating policy after each single step
    • Proven to converge for many configurations
    • Lead to Temporal-Difference learning (TD learning)
      • Update value function after each step just a bit
      • Use estimate of next step’s value to calculate the return
      • SARSA (on-policy) or Q-learning (off-policy) as control
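
A minimal sketch of tabular Q-learning shows the "update a bit after each step, using the next step's estimate" idea. The 5-state corridor, the constants and the purely random behavior policy are assumptions made for illustration and are not taken from the talk.

```python
import random

# Hypothetical 5-state corridor: start in state 0, reward 1.0 on reaching state 4.
N_STATES, GOAL, ACTIONS = 5, 4, [-1, +1]
ALPHA, GAMMA = 0.1, 0.9


def step(state, action):
    """Move left or right; walls clamp the position to [0, GOAL]."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: uniformly random (Q-learning is off-policy).
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted estimate of the next state.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            td_target = reward + GAMMA * best_next
            # Move the current estimate a small step towards the target.
            q[(state, action)] += ALPHA * (td_target - q[(state, action)])
            state = next_state
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

SARSA differs only in the target: instead of the maximum over next actions, it uses the q-value of the action the agent actually takes next, which makes it on-policy.
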
  52. Source: [Sut2018]

  53. Source: [Sut2018]

  54. Question 5: When do we need all the other concepts?

  55. Deep Reinforcement Learning
    • Problem: Very large state spaces
      • Default for non-trivial environments
      • Make tabular representation of value function infeasible
    • Solution: Replace value table with Deep Neural Network
      • Deep Neural Networks are great function approximators
      • Implement q-value function as DNN
      • Use Stochastic Gradient Descent (SGD) to train DNN
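
To show the mechanics without a deep learning library, the sketch below swaps the DNN for a linear function approximator, which is a deliberate simplification. The feature vector and the fixed target value are hypothetical. With a DNN, the same SGD step would use gradients obtained via backpropagation.

```python
def q_hat(weights, features):
    """Linear q-value estimate: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_step(weights, features, target, learning_rate):
    """One SGD step on the squared error 0.5 * (target - q_hat)^2."""
    error = target - q_hat(weights, features)
    # For a linear model, the gradient of q_hat w.r.t. the weights
    # is simply the feature vector.
    return [w + learning_rate * error * x for w, x in zip(weights, features)]


# Hypothetical features of one (state, action) pair and a fixed TD target.
features, td_target = [1.0, 0.5, -0.2], 2.0
weights = [0.0, 0.0, 0.0]
for _ in range(200):
    weights = sgd_step(weights, features, td_target, learning_rate=0.1)
print(round(q_hat(weights, features), 3))
```

In real training the target is not fixed: as in TD learning, it is itself computed from the current estimates.
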
  56. Policy Gradient learning
    • Learning the policy directly (without value function “detour”)
      • Actual goal is learning an optimal policy, not a value function
      • Sometimes learning a value function is not feasible
    • Leads to Policy Gradient learning
      • Parameterize policy with θ which stores “configuration”
      • Learn optimal policy using gradient ascent
      • Lots of implementations, e.g., REINFORCE, Actor-Critic, …
      • Can easily be extended to DNNs
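
A REINFORCE-style update can be sketched on the simplest possible task, a two-armed bandit where every episode is a single step. The bandit, its fixed rewards and the softmax parameterization are assumptions for illustration; the talk does not prescribe them.

```python
import math
import random


def softmax(preferences):
    """Turn the parameter vector theta into action probabilities."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, learning_rate=0.1, seed=0):
    """Gradient ascent on theta for a hypothetical bandit (one-step episodes)."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs, k=1)[0]
        ret = rewards[action]  # return of this one-step episode
        # Gradient of log pi(action) w.r.t. theta[i] for a softmax policy:
        # 1 - probs[i] for the chosen action, -probs[i] for all others.
        for i in range(len(theta)):
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += learning_rate * ret * grad_log
    return softmax(theta)


probs = reinforce_bandit(rewards=[0.2, 1.0])
print([round(p, 3) for p in probs])
```

No value function appears anywhere: the parameters θ are pushed directly in the direction that makes high-return actions more probable.
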
  57. Question 6: How can I best start my own journey into DL?
  58. No one-size-fits-all approach – people are different
    Here is what worked quite well for me …
  59. Your own journey – foundations
    • Learn some of the foundations
      • Read some blogs, e.g., [Wen2018]
      • Do an online course at Coursera, Udemy, …
      • Read a book, e.g., [Sut2018]
    • Code some basic stuff on your own
      • Online courses often offer nice exercises
      • Try some simple environments on OpenAI Gym [OAIGym]
  60. Your own journey – Deep RL
    • Pick up some of the basic Deep RL concepts
      • Read some blogs, e.g., [Sim2018]
      • Do an online course at Coursera, Udemy, …
      • Read a book, e.g., [Goo2016], [Lap2018]
    • Revisit OpenAI Gym
      • Retry previous environments using a DNN
      • Try more advanced environments, e.g., Atari environments
  61. Your own journey – moving on
    • Repeat and pick up some new stuff on each iteration
      • Complete the books
      • Do advanced online courses
      • Read research papers, e.g., [Mni2013]
    • Try to implement some of the papers (if you have enough computing power at hand)
    • Try more complex environments, e.g., Vizdoom [Vizdoom]
  62. Outlook: Moving on – evolution, challenges and you

  63. Remember my claim from the beginning?

  64. (Deep) Reinforcement Learning has the potential to affect white collar workers similar to how robots affected blue collar workers
  65. Challenges of (Deep) RL
    • Massive training data demands
      • Hard to provide or generate
      • One of the reasons games are used so often as environments
      • Probably the reason white collar workers are still quite unaffected
    • Hard to stabilize and get production-ready
      • Research results are often hard to reproduce [Hen2019]
      • Hyperparameter tuning and a priori error prediction are hard
    • Massive demand for computing power
  66. Status quo of (Deep) RL
    • Most current progress based on brute force and trial & error
    • Lack of training data for most real-world problems becomes a huge issue
    • Research (and application) limited to few companies
    • Most other companies have neither the comprehension nor the skills nor the resources to drive RL solutions
  67. Potential futures of (Deep) RL
    • Expected breakthrough happens soon
      • Discovery of how to easily apply RL to real-world problems
      • Market probably dominated by few companies
      • All other companies just use their solutions
    • Expected breakthrough does not happen soon
      • Inflated market expectations do not get satisfied
      • Next “Winter of AI”
      • AI will become an invisible part of commodity solutions
      • RL will not see any progress for several years
  68. And what does all that mean for me?

  69. Positioning yourself
    • You rather believe in the breakthrough of (Deep) RL
      • Help democratize AI & RL – become part of the community
    • You rather do not believe in the breakthrough of (Deep) RL
      • Observe and enjoy your coffee … ;)
    • You are undecided
      • It’s a fascinating topic after all
      • So, dive in a bit and decide when things become clearer
  70. Enjoy and have a great time!

  71. References – Books
    [Goo2016] I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning”, MIT Press, 2016
    [Lap2018] M. Lapan, “Deep Reinforcement Learning Hands-On”, Packt Publishing, 2018
    [Sut2018] R. S. Sutton, A. G. Barto, “Reinforcement Learning – An Introduction”, 2nd edition, MIT Press, 2018
  72. References – Papers
    [Hen2019] P. Henderson et al., “Deep Reinforcement Learning that Matters”, arXiv:1709.06560
    [Mni2013] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning”, arXiv:1312.5602v1
  73. References – Blogs
    [Sim2018] T. Simonini, “A free course in Deep Reinforcement Learning from beginner to expert”, https://simoninithomas.github.io/Deep_reinforcement_learning_Course/
    [Wen2018] L. Weng, “A (long) peek into Reinforcement Learning”, https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
  74. References – Environments
    [OAIGym] OpenAI Gym, http://gym.openai.com
    [Vizdoom] Vizdoom, Doom-based AI Research Platform, http://vizdoom.cs.put.edu.pl