Reinforcement Learning

Reinforcement learning is a popular subfield of machine learning, in part because of its success in beating humans at complex games such as Go and Atari titles. Its value lies in using a reward system to train models and find better ways to solve complex, real-world problems. This approach allows software to adapt to its environment without full knowledge of what the results should look like. This talk covers reinforcement learning fundamentals and examples to help you understand how it works.

Melanie Warrick

September 21, 2017

Transcript

  1. Reinforcement Learning Melanie Warrick | @nyghtowl

  2. Who am I?

  3. Definition ...

  4. Goal => Reward

  5. TD-Gammon

  6. AlphaGo

  7. Cooling Energy Reduction

  8. RL Components: Reward (R); Actions (A): Right, Left, Straight; Agent; State (S) = Start; Environment

  9. Agent Behavior = Policies

  10. Environment | MDP: Markov Decision Process = current state is all you need; Transition Function & Reward

  11. @nyghtowl

  12. Transition Function: probability to move from current to future state. States: stand by citadel & fly; flying over citadel; stand by citadel & breathe fire. Table values by row: stand by citadel & fly = 70%, 10%; flying over citadel = 70%, 20%; stand by citadel & breathe fire = 10%, 20%
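
A transition function like the one on slide 12 can be sketched as a lookup table of next-state probabilities. This is a minimal illustration only: the state names are shortened, the probabilities are made up so that each row sums to 1 (they are not the slide's values), and `sample_next_state` is a hypothetical helper.

```python
import random

# Hypothetical transition table: TRANSITIONS[state][next_state] = probability.
# The numbers are invented for illustration; each row sums to 1.
TRANSITIONS = {
    "stand_by_citadel": {
        "stand_by_citadel": 0.2, "flying_over_citadel": 0.7, "breathing_fire": 0.1},
    "flying_over_citadel": {
        "stand_by_citadel": 0.7, "flying_over_citadel": 0.1, "breathing_fire": 0.2},
    "breathing_fire": {
        "stand_by_citadel": 0.7, "flying_over_citadel": 0.2, "breathing_fire": 0.1},
}


def sample_next_state(state, rng=random):
    """Draw the next state according to the row of probabilities for `state`."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample_next_state("stand_by_citadel"))
```

Because the next state depends only on the current state's row, the table also illustrates the Markov property from slide 10.
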
  13. States & Actions | Search Tree: "You’re standing at the base of a mountain citadel"; fly up => "You’re above the citadel & see a party inside"; breathe fire => "The trees are burnt and nearby birds are angry"

  14. State Size | Search Tree: Tic-Tac-Toe = 10^3; Backgammon = 10^20; Chess = 10^47; Go = 10^170 (* Jason Fox & neverstopbuilding.com)

  15. Goal => Optimal Policy => Reward *

  16. Main Functions: 1. model = agent’s representation of environment; 2. value function = reward now and in the future if acting optimally; 3. policy = agent behavior | maps state to action

  17. Main Functions: 1. model = agent’s representation of environment; 2. value function = reward now and in the future if acting optimally; 3. policy = agent behavior | maps state to action

  18. Model Environment: Table Lookup; Gaussian Process; Neural Network

  19. Algorithm Comparison. Model-based pros: task transfer, less supervised data, easier to scale data collection; cons: don’t optimize task performance, need assumptions for complex skills. Model-free pros: fewer assumptions for complex skills, learn complex policies; cons: slow to train | experience

  20. Planning & Learning (diagram labels: value/policy, experience, model, direct acting, model learning, planning)

  21. Main Functions: 1. model = agent’s rep. of environment; 2. value function = reward now and in the future if acting optimally - v (state-value) | q (action-value); 3. policy = agent behavior | maps state to action

  22. Hello World | Gridworld (* Kevin Binz & kevinbinz.com)

  23. Gridworld Environment (* Kevin Binz & kevinbinz.com)

  24. Prediction & Control | policy (π), value (V): prediction = evaluate policy & update v; control = improve & find optimal policy

  25. Bellman Equation: estimates value; assumes MDP; breaks the problem down into smaller chunks; equation terms: current reward, future discounted predicted rewards
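
The equation image from slide 25 did not survive extraction. For reference, the standard Bellman expectation equation for the state-value function, in the usual Sutton & Barto notation, has the two parts the slide labels: current reward plus discounted predicted future value. The symbols (discount factor γ, reward R, state S, policy π) are the textbook's and are assumed here, not taken from the slide.

```latex
v_\pi(s) = \mathbb{E}_\pi\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \;\middle|\; S_t = s \,\right]
```

Here $R_{t+1}$ is the current reward and $\gamma\, v_\pi(S_{t+1})$ is the discounted, predicted future reward.
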
  26. Agent behavior: Exploit; Always be optimizing policy

  27. Optimal Policy | * (* Kevin Binz & kevinbinz.com); values: 0.85 0.57 0.64 0.74 0.57 0.28 0.48 0.43 0.49

  28. Value Methods by Backups (axes: Sample vs Full, Shallow vs Deep): TD Learning; Monte Carlo; Dynamic Programming; Exhaustive Search; labels: Bootstrapping, Environment (* https://arxiv.org/pdf/1708.05866.pdf)

  29. Dynamic Programming | value (V), policy (π); equation terms: immediate reward, future discounted rewards, probability transition, value function
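
The dynamic programming terms on slide 29 (immediate reward, discounted future rewards, transition probability, value function) can be sketched as value iteration over a known model. The three-state chain, its rewards, and the function names below are hypothetical and chosen only to keep the example small. This is not the gridworld from the slides.

```python
GAMMA = 0.9

# Hypothetical model: P[state][action] = [(probability, next_state, reward), ...]
# State 2 is terminal, so it has no actions.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}


def action_value(V, outcomes, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)


def value_iteration(P, gamma=GAMMA, tol=1e-8):
    """Sweep all states, backing up the best action value, until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(action_value(V, o, gamma) for o in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P, V, gamma=GAMMA):
    """Pick, in each non-terminal state, the action with the highest backed-up value."""
    return {
        s: max(actions, key=lambda a: action_value(V, actions[a], gamma))
        for s, actions in P.items() if actions
    }


V = value_iteration(P)
policy = greedy_policy(P, V)
```

This is a full-backup method: every sweep uses the complete transition model, which is what separates it from the sampling methods on the following slides.
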
  30. Q-Values (* Kevin Binz & kevinbinz.com): 0.85 0.68 0.77 0.57 0.74 0.60 0.67 0.67 0.64 0.57 0.59 0.53 - 0.66 0.53 0.57 0.30 0.51 0.51 0.57 0.46 0.29 0.40 0.48 0.41 0.42 0.43 0.40 0.40 0.41 0.45 0.49 0.44 0.13 0.28 - 0.65 0.27

  31. Monte Carlo | Sampling & Simulation: state subset sampling | learn w/ environment interaction; model-free; episodic update; equation terms: # visits to state, estimate value, actual reward, estimate value
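
The Monte Carlo terms on slide 31 (visit count, estimated value, actual reward) match the usual incremental-mean update, which moves each state's estimate toward the observed return by 1 / (number of visits). The sketch below is an every-visit variant. The episode format and the name `mc_update` are assumptions for illustration.

```python
from collections import defaultdict


def mc_update(values, visits, episode, gamma=1.0):
    """Every-visit Monte Carlo update from one finished episode.

    episode: list of (state, reward) pairs, where reward is the reward
    received after leaving that state.
    """
    G = 0.0  # return accumulated backwards from the end of the episode
    for state, reward in reversed(episode):
        G = reward + gamma * G
        visits[state] += 1
        # Move the estimate toward the actual return by 1 / (# visits).
        values[state] += (G - values[state]) / visits[state]
    return values


values = defaultdict(float)
visits = defaultdict(int)
mc_update(values, visits, [("A", 0.0), ("B", 1.0)])  # episode ending with reward 1
mc_update(values, visits, [("A", 0.0), ("B", 0.0)])  # episode ending with reward 0
```

The update runs only once an episode has finished, which is the "episodic update" the slide refers to.
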
  32. Temporal Difference-Learning: DP bootstrapping/estimating & MC sampling; off-policy vs. on-policy; incomplete & continuous environments; equation terms: estimate value, revised estimate value, estimate value
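
A minimal sketch of the TD(0) update, which combines the two ideas slide 32 names: it samples one step of experience (as Monte Carlo does) and bootstraps from the current estimate of the next state (as dynamic programming does). The function name, the step size `alpha`, and the example values are assumptions.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to values[state] and return the TD error."""
    # Revised estimate: sampled reward plus discounted estimate of the next state.
    bootstrap = 0.0 if terminal else gamma * values.get(next_state, 0.0)
    target = reward + bootstrap
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


values = {"A": 0.0, "B": 1.0}
error = td0_update(values, "A", reward=0.5, next_state="B", alpha=0.5)
```

Unlike the Monte Carlo update, this can be applied after every step, so it also works when episodes are incomplete or never end.
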
  33. Deep Q Network (DQN) | TD Learning: Q-Learning with Neural Nets; CNN & Full; Convergence => Target Network & Experience Replay
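
Slide 33 names experience replay as one of the two devices that help DQN converge. A minimal sketch of a replay buffer is shown below. It has no neural network and no target network, and the class name, capacity, and transition layout are hypothetical rather than taken from any particular DQN implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest transitions drop off first
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return `batch_size` distinct stored transitions in random order."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.push(step, "right", 0.0, step + 1, False)
batch = buffer.sample(2)
```

Sampling old transitions at random breaks the correlation between consecutive steps, which is one reason replay tends to stabilize training.
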
  34. DeepMind | Space Invaders

  35. Main Functions: 1. model = agent’s rep. of environment; 2. value function = reward now and in the future if acting optimally; 3. policy = agent behavior | maps state to action

  36. Policy Search: directly model policies & no value function; efficient storage; high-dimensions & continuous environments; baseline & advantage function for variance & convergence; example = uniform random policy

  37. Policy Search Example Algorithms. Gradient: Policy Network; REINFORCE (likelihood ratio); TRPO (Trust Region Policy Optimization). Gradient-free: Evolution; Simulated annealing; Genetic algorithms
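
As a small illustration of the gradient family on slide 37, the sketch below applies the REINFORCE (likelihood ratio) update to a softmax policy on a two-armed bandit. The bandit, its fixed rewards, and the function names are hypothetical. The baseline and advantage terms mentioned on slide 36 are left out to keep it short.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=500, alpha=0.1, seed=0):
    """REINFORCE on a hypothetical bandit where rewards[a] is the fixed payoff of arm a."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # policy parameters (one preference per action)
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs, k=1)[0]
        G = rewards[action]  # return of this one-step episode
        for k in range(len(prefs)):
            # d/d prefs[k] of log pi(action) for a softmax policy
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += alpha * G * grad_log
    return softmax(prefs)


probs = reinforce_bandit([0.0, 1.0])
```

The policy is adjusted directly from sampled returns, with no value function stored, which is the distinction slide 36 draws between policy search and value-based methods.
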
  38. Policy NN | ATARI 2600 Pong Ex: Input image ~100K (210x160x3) byte array; Environment (2 paddles, 1 run by NN); Actions (paddle up and down) (binary); Reward (+1 ball past opponent, -1 if miss ball)

  39. Main Functions: 1. model = agent’s rep. of environment; 2. value function = reward now and in the future if acting optimally; 3. policy = agent behavior | maps state to action

  40. Asynch. Advantage Actor-Critic (A3C) | policy (π), value (V): critic = TD | eval policy & estimate value; actor = PG | update policy | score * critic

  41. DeepMind Labyrinth Example

  42. Main Functions: 1. model; 2. value function - DP (V & Pi Iteration) = full model, bootstrapping - MC = model-free sampling, episodic - TD-learning (DQN) = sampling, bootstrapping & online; 3. policy search - Policy Search = gradients or not; 4. value function & policy search - A3C - TD-learning & PG

  43. Challenges: Convergence; Credit Assignment | Delayed Reward; Exploration vs Exploitation; Generalization

  44. Libraries to Explore • DeepMind = Lab (3D platform for agent-based AI research) & PySC2 (Blizzard Entertainment’s StarCraft II API in an RL Environment) • OpenAI = Gym (develop RL algorithms w/ any library) & Baselines (RL algorithms) & Roboschool (robot simulation in Gym) • Facebook Research = ELF (environments for game research) & ParlAI (framework for dialog AI research)

  45. Last Week Tonight & RoboCup

  46. Last Points: Exploration & Transfer Learning; Break down problem; Optimal policy

  47. Resources:
    - Reinforcement Learning: An Introduction, Sutton & Barto 1998: http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
    - Brief Survey of Deep Reinforcement Learning: https://arxiv.org/pdf/1708.05866.pdf
    - David Silver’s RL Course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
    - Deep Reinforcement Learning: Pong from Pixels: http://karpathy.github.io/2016/05/31/rl/
    - Computational Neuroscience Lab: http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    - Playing Atari with DRL: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
    - OpenAI Baselines: https://blog.openai.com/openai-baselines-dqn/
    - Robotics overviews: https://www.youtube.com/channel/UC4e_-TvgALrwE1dUPvF_UTQ
    - Gridworld example: http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
    - Autonomous HVAC Control, A RL Approach: https://link.springer.com/chapter/10.1007/978-3-319-23461-8_1
    GitHub Repos to Explore:
    - DeepMind Lab: https://github.com/deepmind/lab
    - OpenAI Gym: https://github.com/openai/gym
    - FacebookResearch ELF: https://github.com/facebookresearch/ELF
    - Intro to RL with Python: https://github.com/jzf2101/intro_rl

  48. References: Images
    • iStock.com/4x-image
    • iStock.com/edeart
    • iStock.com/higyou
    • iStock.com/lilu330
    • iStock.com/alxpin
    • iStock.com/patpitchaya
    • iStock.com/Devrimb
    • Reinforcement Learning: An Introduction, Sutton & Barto 1998
    • http://neverstopbuilding.com/minimax
    • https://kevinbinz.com/2016/10/19/mdp/
    • http://karpathy.github.io/2016/05/31/rl/
    • Last Week Tonight with John Oliver (HBO) & RoboCup
    • https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwj_pe26xqXWAhVLjlQKHQZbCcUQjRwIBw&url=http%3A%2F%2Fwww.fanpop.com%2Fclubs%2Fgame-of-thrones%2Fimages%2F38364122%2Ftitle%2Fjon-snow-season-5-photo&psig=AFQjCNHQiVvgBQJ0xl3iM1M4lHU04ETmQQ&ust=1505508473062578
    • DeepMind https://www.youtube.com/watch?v=W2CAghUiofY
    • DeepMind https://youtu.be/nMR5mjCFZCw
    • https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
    • https://pixabay.com/en/backgammon-board-game-cube-strategy-1903940/
    • https://commons.wikimedia.org/wiki/File:Stones_go.jpg
    • http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    • https://www.flickr.com/photos/jurvetson/30374100613 | Steve Jurvetson
    • Copyright and disclaimer notice: https://creativecommons.org/licenses/by/2.0/
    • License notice: https://creativecommons.org/licenses/by/2.0/legalcode

  49. Thank you... Deep RL Bootcamp (Pieter Abbeel et al., Berkeley, OpenAI, Gradescope), Lindsay Cade, Alex Graves, Isabel Markl, Jason Morrison, Kelley Robinson, Adam Shin, Dragons

  50. Artificial Intelligence Melanie Warrick @nyghtowl