
Introduction to reinforcement learning

This talk is a general introduction to reinforcement learning, with some working examples to help people understand the basic concepts. It is most suitable for people who have a solid math and machine learning background but lack basic knowledge of reinforcement learning.

MunichDataGeeks

July 31, 2016

Transcript

  1. A not rigid introduction to reinforcement learning

     Xudong Sun, [email protected], July 28, 2016
  2. Outline

     1 Introduction: Introductory example; Basic concepts
     2 Temporal difference learning: Sarsa; Q-learning
     3 Value function approximation: The mountain car problem
     4 Gaussian Process Temporal Difference learning: Introductory example and GP; Octopus control
     5 Reference
  3. Outline: Introduction (Introductory example; Basic concepts)
  4. Introduction: Introductory example

     Figure legend: grey = mouse, yellow = cheese, orange = cat
  5. Introduction: Basic concepts

     Preliminaries: state, action, transition (system), reward
     Between supervised and unsupervised: learning a reaction policy (look-up table) when labeled training examples are not available
     Agent-environment interaction as a closed-loop system
     Reward function r: adaptive to the environment
     Return: accumulated gain of reward, $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
     Markov decision process (MDP) in RL
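To make the closed-loop agent-environment interaction and the discounted return concrete, here is a minimal Python sketch. The two-state environment, its transition and reward functions, and the random policy are hypothetical placeholders for illustration; the talk's mouse/cheese/cat grid world is not reproduced.

```python
import random

# Hypothetical two-state environment, used only to illustrate the loop.
def step(state, action):
    next_state = (state + action) % 2          # toy transition function T
    reward = 1.0 if next_state == 1 else 0.0   # toy reward function r
    return next_state, reward

def run_episode(policy, gamma=0.9, horizon=50):
    """Closed-loop interaction collecting the discounted return
    R_t = sum_i gamma^i * r_{t+i}, truncated at a finite horizon."""
    state, discounted_return = 0, 0.0
    for i in range(horizon):
        action = policy(state)                  # agent reacts to the state
        state, reward = step(state, action)     # environment responds
        discounted_return += (gamma ** i) * reward
    return discounted_return

random_policy = lambda s: random.choice([0, 1])  # no learned look-up table yet
print(run_episode(random_policy))
```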
  6. Introduction: Basic concepts

     Value function: a long-term consideration.
     State value function (Bellman equation):
     $V_\pi(s) = E_\pi[R_t \mid S_t = s] = E_\pi\big[\textstyle\sum_{i=0}^{\infty} \gamma^i r_{t+i} \mid S_t = s\big] = E_\pi[r_t + \gamma V_\pi(s_{t+1}) \mid S_t = s] = \sum_{s' \in S} T(s \rightarrow s', \pi)\,[r_t + \gamma V_\pi(s')]$
     Optimal policy: $\pi^* = \arg\max_\pi E_\pi[r_t + \gamma V_\pi(s_{t+1}) \mid S_t = s], \ \forall s \in S$
     Optimal state value: $V^*(s) = \max_\pi E_\pi[r_t + \gamma V_\pi(s_{t+1}) \mid S_t = s]$
     Action value function: $Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(s_{t+1}) = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$
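As a concrete illustration of the Bellman backup, the sketch below runs value iteration on a small MDP. The transition tensor and rewards are made up purely for illustration; they are not from the slides.

```python
import numpy as np

# Hypothetical MDP: 3 states, 2 actions; T[a, s, s'] = P(s' | s, a).
T = np.array([[[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
r = np.array([0.0, 0.0, 1.0])    # reward for landing in each state s'
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
# V(s) <- max_a sum_s' T(s -> s') [ r(s') + gamma * V(s') ]
V = np.zeros(3)
for _ in range(200):
    V = np.max(T @ (r + gamma * V), axis=0)
print(V)                          # approximate optimal state values V*(s)

# Greedy action values and policy derived from the converged V*:
Q = T @ (r + gamma * V)           # Q[a, s] = backed-up value of action a in s
print(np.argmax(Q, axis=0))       # greedy policy pi*(s)
```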
  7. Introduction: Basic concepts

     Computational approaches: dynamic programming, Monte Carlo, temporal difference
  8. Outline: Temporal difference learning (Sarsa; Q-learning)
  9. Temporal difference learning: Q-learning

     Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
     Example: Q(1, 5) = R(1, 5) + 0.8 · max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 · 0 = 100
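The Q(1, 5) computation above comes from a room-navigation example; the cited mnemstudio path-finding tutorial uses the same setup. Below is a hedged sketch of tabular Q-learning on that example. The reward matrix is the commonly published version of the tutorial's matrix and may differ in detail from the slide; -1 marks a missing door and room 5 is the goal.

```python
import numpy as np

# R[s, a] = immediate reward for moving from room s to room a
# (-1 = no door, 100 = reaching goal room 5); values assumed from the tutorial.
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])
gamma = 0.8
Q = np.zeros_like(R, dtype=float)
rng = np.random.default_rng(0)

for _ in range(1000):                        # training episodes
    s = rng.integers(0, 6)                   # random start room
    while s != 5:                            # room 5 is the goal
        valid = np.where(R[s] >= 0)[0]       # doors available from room s
        a = rng.choice(valid)                # explore uniformly at random
        # Q-learning update: Q(s, a) = R(s, a) + gamma * max_a' Q(s', a')
        Q[s, a] = R[s, a] + gamma * Q[a].max()
        s = a                                # chosen door leads to room a

print((Q / Q.max() * 100).round())           # normalized Q-table
```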
  10. Outline: Value function approximation (The mountain car problem)
  11. Value function approximation: The mountain car problem

     $x_{t+1} = [\,x_t + \dot{x}_{t+1}\,]$
     $\dot{x}_{t+1} = [\,\dot{x}_t + 0.001\,a_t - 0.0025\cos(3x_t)\,]$
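A minimal sketch of these dynamics in Python. The position and velocity bounds, the action set a ∈ {-1, 0, +1}, and the reading of the [.] brackets as clipping follow the standard mountain car formulation (Sutton and Barto) and are assumptions here, since the slide does not state them.

```python
import math

# Standard mountain car bounds (assumed; not stated on the slide).
X_MIN, X_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07

def mountain_car_step(x, x_dot, a):
    """One step of the slide's dynamics, a in {-1, 0, +1} (reverse, coast, forward).
    The [.] brackets are interpreted as clipping to the allowed ranges."""
    x_dot_next = x_dot + 0.001 * a - 0.0025 * math.cos(3 * x)
    x_dot_next = min(max(x_dot_next, V_MIN), V_MAX)
    x_next = min(max(x + x_dot_next, X_MIN), X_MAX)
    if x_next == X_MIN:              # hitting the left wall zeroes the velocity
        x_dot_next = 0.0
    return x_next, x_dot_next

print(mountain_car_step(-0.5, 0.0, +1))
```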
  12. Value function approximation: The mountain car problem

     TD update with eligibility traces (linear function approximation):
     $\theta_{t+1} = \theta_t + \alpha\,\delta_t\,e_t$
     $e_t = \gamma\lambda\,e_{t-1} + \nabla_{\theta_t} Q_t(s_t, a_t)$
     $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
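These updates can be read as gradient-descent Sarsa(λ) with a linear approximation Q(s, a) = θᵀφ(s, a), so that ∇θ Q = φ(s, a). Below is a hedged sketch of a single update; the feature vectors and the step-size/trace parameters are hypothetical placeholders, not values from the talk.

```python
import numpy as np

def sarsa_lambda_update(theta, e, phi_sa, phi_next_sa, r,
                        alpha=0.1, gamma=1.0, lam=0.9):
    """One update of the slide's equations for Q(s, a) = theta . phi(s, a),
    so grad_theta Q = phi(s, a). phi_sa / phi_next_sa are feature vectors
    for (s_t, a_t) and (s_{t+1}, a_{t+1})."""
    delta = r + gamma * (theta @ phi_next_sa) - (theta @ phi_sa)  # TD error delta_t
    e = gamma * lam * e + phi_sa                                  # eligibility trace e_t
    theta = theta + alpha * delta * e                             # parameter update theta_{t+1}
    return theta, e

# Tiny usage example with made-up 4-dimensional features:
theta, e = np.zeros(4), np.zeros(4)
theta, e = sarsa_lambda_update(theta, e,
                               phi_sa=np.array([1., 0., 0., 0.]),
                               phi_next_sa=np.array([0., 1., 0., 0.]),
                               r=-1.0)
print(theta, e)
```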
  13. Outline: Gaussian Process Temporal Difference learning (Introductory example and GP; Octopus control)
  14. Gaussian Process Temporal Difference learning: Introductory example and GP

     GPs: Gaussian processes (GPs) provide a consistent and principled probabilistic framework for ranking functions according to their plausibility by defining a corresponding probability distribution over functions (Rasmussen and Williams, 2006).
     Observation model: $Y(x) = F(x) + N(x)$
     $Y_t = (Y(x_1), Y(x_2), \ldots, Y(x_t))$, $F_t = (F(x_1), \ldots, F(x_t))$, $N_t = (N(x_1), \ldots, N(x_t))$
     Joint prior:
     $\begin{pmatrix} F(x) \\ Y_t \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} f_0(x) \\ \mathbf{f}_0 \end{pmatrix}, \begin{pmatrix} k(x,x) & k_t(x)^\top \\ k_t(x) & K_t + \sigma^2 I \end{pmatrix} \right)$
     Posterior: $(F(\cdot) \mid Y_t) \sim \mathcal{N}\{\hat{F}(\cdot), P_t(\cdot,\cdot)\}$
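The posterior mean and covariance from this joint Gaussian have a closed form. Below is a small NumPy sketch, assuming a squared-exponential kernel and a zero prior mean f₀; both choices are assumptions, not stated on the slide.

```python
import numpy as np

def sq_exp_kernel(a, b, length=1.0):
    """Squared-exponential kernel k(x, x'); the kernel choice is an assumption."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Posterior (F(.) | Y_t) ~ N(F_hat, P_t) for a zero prior mean f0."""
    K = sq_exp_kernel(x_train, x_train)                   # K_t
    k_star = sq_exp_kernel(x_test, x_train)               # k_t(x) at the test inputs
    k_ss = sq_exp_kernel(x_test, x_test)
    Ky = K + noise ** 2 * np.eye(len(x_train))            # K_t + sigma^2 I
    mean = k_star @ np.linalg.solve(Ky, y_train)          # posterior mean F_hat(.)
    cov = k_ss - k_star @ np.linalg.solve(Ky, k_star.T)   # posterior covariance P_t(., .)
    return mean, cov

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(x)                                             # toy regression targets
mean, cov = gp_posterior(x, y, np.linspace(-3.0, 3.0, 7))
print(mean.round(2))
```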
  15. Gaussian Process Temporal Difference learning: Introductory example and GP

     GPs for regression
  16. Gaussian Process Temporal Difference learning: Introductory example and GP

     GPs for reinforcement learning:
     Prior over values: $v \sim \mathcal{N}(0, k(\cdot,\cdot))$, $E[v(x)\,v(x')] = k(x, x')$
     $v_t = E_\pi[r_t + \gamma v_\pi(s_{t+1}) \mid S_t = s]$
     GP model for RL: $r_t = v_t - \gamma v_{t+1} + N_t$
     Define sample vectors: $R_t = [r(x_1), r(x_2), \ldots, r(x_t)]$, $V_t = [v(x_1), \ldots, v(x_t)]$, $N_t = [n(x_1), \ldots, n(x_t)]$
     $R_{t-1} = H_t V_t + N_{t-1}$, with dimensions $(t-1) \times 1 = [(t-1) \times t] \times [t \times 1]$
     Joint prior: $\begin{pmatrix} R_{t-1} \\ V_t \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} * & * \\ * & * \end{pmatrix} \right)$
     Posterior: $(V(\cdot) \mid R_{t-1} = r) \sim \mathcal{N}\{\hat{v}(\cdot), P_t(\cdot,\cdot)\}$, where the mean and covariance depend on the reward vector, the state vector, and the kernel function.
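To make the GPTD model concrete, the sketch below builds the (t-1)×t matrix H_t with rows (…, 1, -γ, …) and computes the posterior value mean at the visited states. It uses an i.i.d. observation-noise covariance σ²I as a simplifying assumption (the original GPTD papers also treat a correlated-noise version), and the kernel matrix in the usage example is made up.

```python
import numpy as np

def gptd_value_posterior(K, rewards, gamma=0.9, sigma=0.1):
    """Posterior mean of the values V_t at the visited states x_1..x_t.

    K       : t x t kernel matrix K_t over the visited states
    rewards : length t-1 vector R_{t-1} of observed rewards
    Model (from the slide): R_{t-1} = H_t V_t + N_{t-1}, with rows of H_t
    equal to (0, ..., 1, -gamma, ..., 0). Noise taken as N ~ N(0, sigma^2 I),
    a simplification of the original GPTD noise model."""
    t = K.shape[0]
    H = np.zeros((t - 1, t))
    for i in range(t - 1):
        H[i, i] = 1.0
        H[i, i + 1] = -gamma
    S = H @ K @ H.T + sigma ** 2 * np.eye(t - 1)     # covariance of R_{t-1}
    return K @ H.T @ np.linalg.solve(S, rewards)     # posterior value mean

# Toy usage with a made-up kernel matrix over 4 visited states:
K = np.exp(-0.5 * (np.arange(4)[:, None] - np.arange(4)[None, :]) ** 2)
print(gptd_value_posterior(K, rewards=np.array([0.0, 0.0, 1.0])))
```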
  17. Gaussian Process Temporal Difference learning: Octopus control

     Octopus arm control: 22 point masses, each with state (x, y, ẋ, ẏ)
  18. Reference

     Some pictures are adapted from the following sources:
     1. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, The MIT Press, Cambridge, Massachusetts; London, England.
     2. John McCullock. mnemstudio tutorial on path finding.
     3. Travis DeWolf. Tutorial on reinforcement learning.
     4. Engel, Y., Szabo, P., Volkinshtein, D. (2005). Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods. Proc. NIPS.
     5. Chandra Prakash, IIITM Gwalior. Reinforcement Learning.

     Thanks
     Xudong Sun, [email protected]
     LinkedIn: Xudong Sun, LMU
     GitHub: https://github.com/smilesun