Reinforcement Learning

Reinforcement learning is a popular subfield of machine learning because of its success in beating humans at complex games like Go and Atari games. The field’s value lies in using a reward system to develop models and find better ways to solve complex, real-world problems. This approach allows software to adapt to its environment without full knowledge of what the results should look like. This talk covers reinforcement learning fundamentals and examples to help you understand how it works.

Melanie Warrick

September 21, 2017

Transcript

  1. @nyghtowl RL Components
    - Agent
    - Environment
    - State (S) = Start
    - Actions (A): Right, Left, Straight
    - Reward (R)
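The components on this slide interact in a loop: the agent observes a state, picks an action, and the environment answers with a reward and a new state. A minimal sketch of that loop follows. `ToyEnvironment`, its reward rule, and the five-step episode length are hypothetical and chosen only to keep the example small; they do not come from the talk.

```python
ACTIONS = ("left", "straight", "right")  # the three actions named on the slide


class ToyEnvironment:
    """Hypothetical environment: reward 1 for going straight, 0 otherwise; ends after 5 steps."""

    def reset(self):
        self.steps = 0
        return "start"  # initial state

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == "straight" else 0.0
        done = self.steps >= 5
        return f"step-{self.steps}", reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)                  # agent chooses an action for the current state
        state, reward, done = env.step(action)  # environment returns next state and reward
        total += reward
    return total
```

A policy here is just a function from state to action, for example `run_episode(ToyEnvironment(), lambda state: "straight")`.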
  2. Transition Function: probability of moving from the current state to a future state
    - Column labels: stand by citadel & fly | flying over citadel | stand by citadel & breathe fire
    - Row "stand by citadel & fly": 70%, 10%
    - Row "flying over citadel": 70%, 20%
    - Row "stand by citadel & breathe fire": 10%, 20%
  3. States & Actions | Search Tree
    - Start state: You’re standing at the base of a mountain citadel
    - fly up -> You’re above the citadel & see a party inside
    - breathe fire -> The trees are burnt and nearby birds are angry
  4. State Size | Search Tree
    - Tic-Tac-Toe = 10^3
    - Backgammon = 10^20
    - Chess = 10^47
    - Go = 10^170
    - Source: Jason Fox, neverstopbuilding.com
  5. Main Functions
    1. model = agent’s representation of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action
  6. Main Functions
    1. model = agent’s representation of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action
  7. Algorithm Comparison
    - Model-based
      - Pros: task transfer; less supervised data; easier to scale data collection
      - Cons: doesn’t optimize task performance; needs assumptions for complex skills
    - Model-free
      - Pros: fewer assumptions for complex skills; learns complex policies
      - Cons: slow to train | experience
  8. Main Functions
    1. model = agent’s rep. of environment
    2. value function = reward now and in the future if the agent acts optimally
      - v (state-value) | q (action-value)
    3. policy = agent behavior | maps state to action
  9. Prediction & Control
    - Diagram labels: policy (π), value (V)
    - prediction = evaluate policy & update v
    - control = improve & find optimal policy
  10. Bellman Equation
    - Estimates value
    - Assumes MDP
    - Breaks the problem down into smaller chunks
    - Equation terms: current reward + future discounted, predicted rewards
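The equation itself did not survive in the transcript. As a sketch of what the two labels refer to, the Bellman expectation equation for the state-value function is usually written as below. The notation follows Sutton & Barto and David Silver's course (both listed in the resources), not the slide, so the symbols are assumptions: $R_{t+1}$ is the current reward, $\gamma$ the discount factor, and $\gamma\, v_\pi(S_{t+1})$ the discounted, predicted future reward.

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \;\middle|\; S_t = s \,\right]
```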
  11. Optimal Policy | π*
    - State values shown: 0.85, 0.57, 0.64, 0.74, 0.57, 0.28, 0.48, 0.43, 0.49
    - Source: Kevin Binz, kevinbinz.com
  12. Value Methods by Backups
    - Sample, shallow backups: TD Learning
    - Sample, deep backups: Monte Carlo
    - Full, shallow backups: Dynamic Programming
    - Full, deep backups: Exhaustive Search
    - Other labels: Bootstrapping, Environment
    - Source: https://arxiv.org/pdf/1708.05866.pdf
  13. Dynamic Programming
    - Diagram labels: value (V), policy (π)
    - Equation terms: immediate reward, future discounted rewards, transition probability, value function
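The terms labelled on the Dynamic Programming slide (immediate reward, discounted future reward, transition probability, value function) are the pieces of a full Bellman backup. Below is a minimal value-iteration sketch under stated assumptions: the three-state MDP in `TRANSITIONS`, the discount factor, and the function names are all hypothetical and not taken from the talk.

```python
# Hypothetical MDP: TRANSITIONS[state][action] = [(probability, next_state, reward), ...]
TRANSITIONS = {
    "start": {
        "left": [(1.0, "pit", 0.0)],
        "right": [(0.8, "goal", 1.0), (0.2, "pit", 0.0)],
    },
    "pit": {"stay": [(1.0, "pit", 0.0)]},
    "goal": {"stay": [(1.0, "goal", 0.0)]},
}


def backup(outcomes, values, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)


def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Repeat full Bellman optimality backups until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for state, actions in transitions.items():
            best = max(backup(outcomes, values, gamma) for outcomes in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


def greedy_policy(transitions, values, gamma=0.9):
    """For each state, pick the action with the highest one-step backup."""
    return {
        state: max(actions, key=lambda a: backup(actions[a], values, gamma))
        for state, actions in transitions.items()
    }
```

This needs the full model (every transition probability and reward), which is why the deck later groups DP under "full model, bootstrapping".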
  14. Q-Values
    - Action values shown: 0.85, 0.68, 0.77, 0.57, 0.74, 0.60, 0.67, 0.67, 0.64, 0.57, 0.59, 0.53, -, 0.66, 0.53, 0.57, 0.30, 0.51, 0.51, 0.57, 0.46, 0.29, 0.40, 0.48, 0.41, 0.42, 0.43, 0.40, 0.40, 0.41, 0.45, 0.49, 0.44, 0.13, 0.28, -, 0.65, 0.27
    - Source: Kevin Binz, kevinbinz.com
  15. Monte Carlo | Sampling & Simulation
    - State subset sampling | learn w/ environment interaction
    - Model-free
    - Episodic update
    - Equation terms: # visits to state, estimate value, actual reward, estimate value
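The labels on the Monte Carlo slide (visit count, value estimate, actual reward) match the incremental-mean update V(s) <- V(s) + (1 / N(s)) * (G - V(s)), where G is the return actually observed after visiting s. A minimal sketch, assuming every-visit Monte Carlo; the function name and the episode format (lists of `(state, reward)` pairs) are illustrative assumptions, not from the talk.

```python
from collections import defaultdict


def mc_every_visit(episodes, gamma=1.0):
    """Estimate state values from complete episodes.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state.
    """
    values = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so the return G accumulates from the end of the episode.
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            visits[state] += 1
            # V(s) <- V(s) + (1 / N(s)) * (G - V(s))
            values[state] += (ret - values[state]) / visits[state]
    return dict(values)
```

Because the update uses the full observed return, it can only run once an episode has finished, which is the "episodic update" point on the slide.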
  16. Temporal Difference-Learning
    - DP bootstrapping/estimating & MC sampling
    - Off-policy vs. on-policy
    - Incomplete & continuous environments
    - Equation terms: estimate value, revised estimate value, estimate value
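The three labels on the TD slide line up with the one-step TD(0) update V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)): the old estimate is nudged toward a revised estimate built from one sampled reward plus the current estimate of the next state. A minimal sketch; the function name, the dictionary representation of V, and the default step size are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) backup: move V(state) toward the bootstrapped target."""
    # Revised estimate (TD target): sampled reward plus discounted estimate of the next state.
    target = reward + (0.0 if terminal else gamma * values.get(next_state, 0.0))
    old = values.get(state, 0.0)
    values[state] = old + alpha * (target - old)
    return values[state]
```

Unlike the Monte Carlo update, this can be applied after every step, so it also works when episodes are incomplete or never end.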
  17. Deep Q Network (DQN)
    - Q-Learning with Neural Nets (TD Learning)
    - CNN & fully connected layers
    - Convergence => Target Network & Experience Replay
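The two convergence aids named on the DQN slide can be sketched without a neural network. Below, `ReplayBuffer` stores past transitions for uniform resampling, and `td_targets` computes Q-learning targets from a separate, frozen target function. Both names, the transition tuple layout, and the `target_q(state)` callable (assumed to return a list of action values) are hypothetical stand-ins for the real network code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Q-learning targets computed with the frozen target function `target_q`."""
    return [
        reward if done else reward + gamma * max(target_q(next_state))
        for _, _, reward, next_state, done in batch
    ]
```

In a full DQN the online network would be regressed toward these targets, and `target_q` would be refreshed from the online network only occasionally.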
  18. Main Functions
    1. model = agent’s rep. of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action
  19. Policy Search
    - Directly model policies & no value function
    - Efficient storage
    - High-dimensional & continuous environments
    - Baseline & advantage function for variance & convergence
    - Example = uniform random policy
  20. Policy Search Example Algorithms
    - Gradient
      - Policy Network
      - REINFORCE (likelihood ratio)
      - TRPO (Trust Region Policy Optimization)
    - Gradient-free
      - Evolution
      - Simulated annealing
      - Genetic algorithms
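To make the likelihood-ratio idea behind REINFORCE concrete, here is a minimal sketch on a one-step problem (a multi-armed bandit) with a softmax policy. The bandit setup, the deterministic rewards, the learning rate, and the function names are all assumptions chosen to keep the example tiny; there is no baseline, so this is the plain high-variance form of the update.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a one-step problem: theta += lr * reward * grad log pi(action)."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # one preference per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = arm_rewards[action]  # deterministic reward keeps the sketch small
        for a in range(len(theta)):
            # d/d theta_a of log softmax(theta)[action] = 1[a == action] - pi(a)
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * reward * grad_log
    return softmax(theta)
```

Calling `reinforce_bandit([0.0, 1.0])` returns the learned action probabilities; the policy is adjusted directly from sampled rewards, with no value function involved.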
  21. Policy NN | ATARI 2600 Pong Ex
    - Input image ~100K (210x160x3) byte array
    - Environment (2 paddles, 1 run by NN)
    - Actions (paddle up and down) (binary)
    - Reward (+1 ball past opponent, -1 if miss ball)
  22. Main Functions
    1. model = agent’s rep. of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action
  23. Asynch. Advantage Actor-Critic (A3C)
    - Diagram labels: policy (π), value (V)
    - critic = TD | eval policy & estimate value
    - actor = PG | update policy | score * critic
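The actor and critic roles on this slide can be sketched for a single transition. The critic's one-step TD error serves as the advantage estimate, and the actor scales its score (the gradient of log pi) by that signal. This is a single-worker sketch only: it leaves out the asynchronous, multi-worker part of A3C, and the function names, learning rates, and plain-list parameters are assumptions.

```python
def td_error(reward, value, next_value, gamma=0.99, done=False):
    """Critic's one-step TD error, used here as the advantage estimate."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def actor_critic_updates(grad_log_prob, reward, value, next_value,
                         actor_lr=0.01, critic_lr=0.1, gamma=0.99, done=False):
    """Return (policy parameter deltas, value delta) for one transition.

    `grad_log_prob` is the score: the gradient of log pi(action | state)
    with respect to the policy parameters.
    """
    advantage = td_error(reward, value, next_value, gamma, done)
    actor_delta = [actor_lr * advantage * g for g in grad_log_prob]  # score * critic signal
    critic_delta = critic_lr * advantage  # moves V(state) toward the TD target
    return actor_delta, critic_delta
```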
  24. Main Functions
    1. model
    2. value function
      - DP (V & Pi Iteration) = full model, bootstrapping
      - MC = model-free sampling, episodic
      - TD-learning (DQN) = sampling, bootstrapping & online
    3. policy search
      - Policy Search = gradients or not
    4. value function & policy search
      - A3C - TD-learning & PG
  25. Challenges
    - Convergence
    - Credit Assignment | Delayed Reward
    - Exploration vs Exploitation
    - Generalization
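One common way to handle the exploration vs exploitation trade-off is epsilon-greedy action selection, which the slide does not name; it is shown here only as an illustrative example. With probability `epsilon` the agent tries a random action, otherwise it takes the action with the highest current value estimate. The function name and the list-of-values input are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action index
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: greedy action index
```

A typical design choice is to start with a large `epsilon` and decay it as the value estimates become more reliable.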
  26. Libraries to Explore
    • DeepMind = Lab (agent-based AI research 3D platform) & PySC2 (Blizzard Entertainment’s StarCraft II API in an RL environment)
    • OpenAI = Gym (develop RL algorithms w/ any library) & Baselines (RL algorithms) & Roboschool (robot simulation in Gym)
    • Facebook Research = ELF (environments for game research) & ParlAI (framework for dialog AI research)
  27. Resources:
    - Reinforcement Learning: An Introduction, Sutton & Barto 1998: http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
    - Brief Survey of Deep Reinforcement Learning: https://arxiv.org/pdf/1708.05866.pdf
    - David Silver’s RL Course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
    - Deep Reinforcement Learning: Pong from Pixels: http://karpathy.github.io/2016/05/31/rl/
    - Computational Neuroscience Lab: http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    - Playing Atari with DRL: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
    - OpenAI Baselines: https://blog.openai.com/openai-baselines-dqn/
    - Robotics overviews: https://www.youtube.com/channel/UC4e_-TvgALrwE1dUPvF_UTQ
    - Gridworld example: http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
    - Autonomous HVAC Control, A RL Approach: https://link.springer.com/chapter/10.1007/978-3-319-23461-8_1
    Github Repos to Explore:
    - DeepMind Lab: https://github.com/deepmind/lab
    - OpenAI Gym: https://github.com/openai/gym
    - FacebookResearch ELF: https://github.com/facebookresearch/ELF
    - Intro to RL with Python: https://github.com/jzf2101/intro_rl
  28. References: Images
    • iStock.com/4x-image
    • iStock.com/edeart
    • iStock.com/higyou
    • iStock.com/lilu330
    • iStock.com/alxpin
    • iStock.com/patpitchaya
    • iStock.com/Devrimb
    • Reinforcement Learning: An Introduction, Sutton & Barto 1998
    • http://neverstopbuilding.com/minimax
    • https://kevinbinz.com/2016/10/19/mdp/
    • http://karpathy.github.io/2016/05/31/rl/
    • Last Week Tonight with John Oliver (HBO) & RoboCup
    • https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwj_pe26xqXWAhVLjlQKHQZbCcUQjRwIBw&url=http%3A%2F%2Fwww.fanpop.com%2Fclubs%2Fgame-of-thrones%2Fimages%2F38364122%2Ftitle%2Fjon-snow-season-5-photo&psig=AFQjCNHQiVvgBQJ0xl3iM1M4lHU04ETmQQ&ust=1505508473062578
    • DeepMind https://www.youtube.com/watch?v=W2CAghUiofY
    • DeepMind https://youtu.be/nMR5mjCFZCw
    • https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
    • https://pixabay.com/en/backgammon-board-game-cube-strategy-1903940/
    • https://commons.wikimedia.org/wiki/File:Stones_go.jpg
    • http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    • https://www.flickr.com/photos/jurvetson/30374100613 | Steve Jurvetson
    • Copyright and disclaimer notice: https://creativecommons.org/licenses/by/2.0/
    • License notice: https://creativecommons.org/licenses/by/2.0/legalcode
  29. Thank you...
    - Deep RL Bootcamp (Pieter Abbeel et al., Berkeley, OpenAI, Gradescope)
    - Lindsay Cade, Alex Graves, Isabel Markl, Jason Morrison, Kelley Robinson, Adam Shin
    - Dragons