Slide 1

Reinforcement Learning Melanie Warrick | @nyghtowl

Slide 2

Who am I?

Slide 3

@nyghtowl Definition ...

Slide 4

@nyghtowl Goal => Reward

Slide 5

@nyghtowl TD-Gammon

Slide 6

@nyghtowl AlphaGo

Slide 7

@nyghtowl Cooling Energy Reduction

Slide 8

@nyghtowl RL Components
- Agent
- Environment
- State (S) = start
- Actions (A): right, left, straight
- Reward (R)
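
A minimal Python sketch of the loop these components form: the agent picks an action, the environment returns a reward and the next state. The action names come from the slide; the `step` logic and reward numbers are invented for illustration.

```python
import random

ACTIONS = ["right", "left", "straight"]

def step(state, action):
    """Toy environment: 'straight' moves toward the goal, anything else wanders."""
    if action == "straight":
        return state + 1, 1.0   # next state, reward
    return state, -0.1

state, total_reward = 0, 0.0          # State (S) = start
for _ in range(10):
    action = random.choice(ACTIONS)   # a (so far) random policy
    state, reward = step(state, action)
    total_reward += reward
print(total_reward)
```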

Slide 9

Agent Behavior = Policies

Slide 10

@nyghtowl Environment | MDP
Markov Decision Process = the current state is all you need
Transition function & reward

Slide 11

@nyghtowl

Slide 12

@nyghtowl Transition Function
Probability of moving from the current state to a future state:

from \ to                        | stand by citadel & fly | flying over citadel | stand by citadel & breathe fire
stand by citadel & fly           |           —            |         70%         |              10%
flying over citadel              |          70%           |          —          |              20%
stand by citadel & breathe fire  |          10%           |         20%         |               —
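
The same table encoded so an agent could sample from it. The diagonal entries (staying in the same state) aren't shown on the slide, so they're filled in here — an assumption — to make each row a proper probability distribution.

```python
import numpy as np

# P[i][j] = probability of moving from state i to state j.
states = ["stand by citadel & fly", "flying over citadel", "stand by citadel & breathe fire"]
P = np.array([
    [0.20, 0.70, 0.10],   # diagonal values assumed so rows sum to 1
    [0.70, 0.10, 0.20],
    [0.10, 0.20, 0.70],
])
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)
current = 0
next_state = rng.choice(len(states), p=P[current])  # sample a transition
print(states[current], "->", states[next_state])
```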

Slide 13

@nyghtowl States & Actions | Search Tree
State: you're standing at the base of a mountain citadel
- fly up → you're above the citadel & see a party inside
- breathe fire → the trees are burnt and nearby birds are angry

Slide 14

@nyghtowl State Size | Search Tree
- Tic-Tac-Toe = 10^3
- Backgammon = 10^20
- Chess = 10^47
- Go = 10^170
* Jason Fox & neverstopbuilding.com

Slide 15

Goal => Optimal Policy => Reward *

Slide 16

@nyghtowl Main Functions
1. model = agent's representation of the environment
2. value function = reward now plus future reward if the agent acts optimally
3. policy = agent behavior | maps state to action


Slide 18

@nyghtowl Model the Environment
- Table lookup
- Gaussian process
- Neural network

Slide 19

@nyghtowl Algorithm Comparison

Model-based
- Pros: task transfer; less supervised data; easier to scale data collection
- Cons: doesn't optimize task performance; needs assumptions for complex skills

Model-free
- Pros: fewer assumptions for complex skills; learns complex policies
- Cons: slow to train | needs a lot of experience

Slide 20

@nyghtowl Planning & Learning
[Diagram: acting produces experience; experience updates value/policy directly (direct RL) and also feeds model learning; the learned model supports planning, which updates value/policy as well]

Slide 21

@nyghtowl Main Functions
1. model = agent's rep. of the environment
2. value function = reward now plus future reward if the agent acts optimally
   - v (state-value) | q (action-value)
3. policy = agent behavior | maps state to action

Slide 22

@nyghtowl Hello World | Gridworld * Kevin Binz & kevinbinz.com

Slide 23

@nyghtowl Gridworld Environment * Kevin Binz & kevinbinz.com
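
A minimal gridworld sketch in the same spirit. The 3x3 layout, goal and trap positions, step cost, and all names (`GOAL`, `TRAP`, `step`) are illustrative, not the exact grid from the slide.

```python
GOAL, TRAP = (0, 2), (1, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = max(0, min(2, r + dr)), max(0, min(2, c + dc))  # stay on the grid
    next_state = (nr, nc)
    if next_state == GOAL:
        return next_state, +1.0, True   # next state, reward, done
    if next_state == TRAP:
        return next_state, -1.0, True
    return next_state, -0.04, False     # small cost per step

# Walk a fixed path from the bottom-left corner: up to the top row, then right.
state, done = (2, 0), False
while not done:
    action = "up" if state[0] > 0 else "right"
    state, reward, done = step(state, action)
print("episode ended in", state, "with reward", reward)
```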

Slide 24

@nyghtowl Prediction & Control
policy (π) ⇄ value (V)
- prediction = evaluate the policy & update V
- control = improve the policy & find the optimal one

Slide 25

@nyghtowl Bellman Equation
- Estimates value
- Assumes an MDP
- Breaks the problem down into smaller chunks
= current reward + future discounted, predicted rewards
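
Written out, one standard form of the Bellman optimality equation matching the labels above (notation assumed here: R for reward, γ for the discount factor, P for the transition function):

```latex
V^{*}(s) = \max_{a}\Big[\underbrace{R(s,a)}_{\text{current reward}}
  + \underbrace{\gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s')}_{\text{future discounted, predicted rewards}}\Big]
```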

Slide 26

@nyghtowl Agent Behavior | Exploit
Always be optimizing the policy
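
Exploiting has to be balanced against exploring — a challenge this deck returns to later. A common mechanism is ε-greedy action selection; here is a minimal sketch with a hypothetical Q table for a single state.

```python
import random

# Exploit the current value estimates most of the time, but explore a
# random action with probability epsilon. Q values are illustrative.
Q = {"right": 0.85, "left": 0.57, "straight": 0.64}

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(Q))   # explore
    return max(Q, key=Q.get)            # exploit

print(epsilon_greedy(Q))
```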

Slide 27

@nyghtowl Optimal Policy | π*
0.85  0.57  0.64
0.74  0.57  0.28
0.48  0.43  0.49
* Kevin Binz & kevinbinz.com

Slide 28

@nyghtowl Value Methods by Backups
Two axes: sample vs. full (environment) backups, and shallow (bootstrapping) vs. deep backups
- shallow + sample = TD learning
- deep + sample = Monte Carlo
- shallow + full = dynamic programming
- deep + full = exhaustive search
* https://arxiv.org/pdf/1708.05866.pdf

Slide 29

@nyghtowl Dynamic Programming
value (V) ⇄ policy (π)
Backup = immediate reward + future discounted rewards, weighted by the transition probabilities, through the value function
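
A minimal value-iteration sketch applying that backup repeatedly until the values settle. The three-state MDP (`P`, its actions, and rewards) is invented for illustration.

```python
# Tiny illustrative MDP: P[s][a] is a list of (prob, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},  # terminal
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(100):
    for s, actions in P.items():
        if actions:  # skip terminal states
            # Bellman backup: best action's expected reward + discounted future value
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
print(V)  # converges toward V*(1) = 1.0, V*(0) ≈ 0.88
```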

Slide 30

@nyghtowl Q-Values
[Grid of Q-values: one value per state-action pair of the gridworld]
* Kevin Binz & kevinbinz.com

Slide 31

@nyghtowl Monte Carlo | Sampling & Simulation
State-subset sampling | learn from environment interaction
- Model-free
- Episodic update
V(s) ← V(s) + (1/N(s)) · (G − V(s)), where N(s) = # visits to state s and G = the actual return
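
A sketch of that update: run whole episodes, then move each visited state's estimated value toward the actual return, averaged over visit counts. The two-state episode generator is invented for illustration.

```python
import random

gamma = 1.0
V = {0: 0.0, 1: 0.0}   # value estimates
N = {0: 0, 1: 0}       # visit counts

def episode():
    """Toy episode 0 -> 1 -> terminal, with a noisy final reward."""
    return [(0, 0.0), (1, 1.0 + random.uniform(-0.5, 0.5))]

for _ in range(1000):
    traj = episode()
    G = 0.0
    for state, reward in reversed(traj):        # accumulate return backwards
        G = reward + gamma * G
        N[state] += 1
        V[state] += (G - V[state]) / N[state]   # estimate <- actual return
print(V)   # both values approach ~1.0
```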

Slide 32

@nyghtowl Temporal Difference Learning
DP's bootstrapping/estimating & MC's sampling
- Off-policy vs. on-policy
- Handles incomplete & continuous environments
V(s) ← V(s) + α · (r + γV(s′) − V(s)): the estimate moves toward a revised estimate
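
A TD(0) sketch of that update: after every step, nudge V(s) toward the reward plus the discounted estimate of the next state — an estimate built from another estimate (bootstrapping), updated online. The two-state chain and its numbers are illustrative.

```python
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0, "end": 0.0}

for _ in range(2000):
    s = 0
    while s != "end":
        # deterministic toy chain: 0 -> 1 (reward 0), 1 -> end (reward 1)
        s2, r = (1, 0.0) if s == 0 else ("end", 1.0)
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # online TD update
        s = s2
print(V)   # V[1] -> 1.0, V[0] -> gamma * V[1] = 0.9
```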

Slide 33

@nyghtowl Deep Q-Network (DQN)
Q-learning (TD learning) with neural nets
- CNN & fully connected layers
- Convergence => target network & experience replay
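
A sketch of the two stabilizers named on the slide — experience replay and a periodically synced target network. A linear Q-function and randomly generated fake transitions stand in for the CNN and the game; both are assumptions made to keep the example self-contained.

```python
import random
from collections import deque

import numpy as np

n_features, n_actions = 4, 2
W = np.zeros((n_actions, n_features))   # online Q-network weights (linear stand-in)
W_target = W.copy()                     # target network weights
buffer = deque(maxlen=10_000)           # experience replay buffer
alpha, gamma = 0.01, 0.99

def q(weights, state):
    return weights @ state              # Q(s, .) for all actions

for step in range(1000):
    state = np.random.randn(n_features)            # fake transition for the demo
    action = random.randrange(n_actions)
    reward, next_state = random.random(), np.random.randn(n_features)
    buffer.append((state, action, reward, next_state))

    if len(buffer) >= 32:
        batch = random.sample(buffer, 32)          # replay decorrelates samples
        for s, a, r, s2 in batch:
            target = r + gamma * q(W_target, s2).max()   # bootstrap off the target net
            W[a] += alpha * (target - q(W, s)[a]) * s    # semi-gradient TD update
    if step % 100 == 0:
        W_target = W.copy()                        # periodically sync the target network
print(W.round(2))
```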

Slide 34

@nyghtowl DeepMind | Space Invaders

Slide 35

@nyghtowl Main Functions
1. model = agent's rep. of the environment
2. value function = reward now plus future reward if the agent acts optimally
3. policy = agent behavior | maps state to action

Slide 36

@nyghtowl Policy Search
Directly model policies; no value function
- Efficient storage
- Handles high-dimensional & continuous environments
- Baseline & advantage function for variance reduction & convergence
Example = uniform random policy

Slide 37

@nyghtowl Policy Search Example Algorithms
Gradient:
- Policy network
- REINFORCE (likelihood ratio)
- TRPO (Trust Region Policy Optimization)
Gradient-free:
- Evolution
- Simulated annealing
- Genetic algorithms

Slide 38

@nyghtowl Policy NN | ATARI 2600 Pong Example
- Input: image, an ~100K-value (210x160x3) byte array
- Environment: 2 paddles, 1 run by the NN
- Actions: paddle up or down (binary)
- Reward: +1 when the ball gets past the opponent, -1 when the paddle misses the ball
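
A REINFORCE-style sketch in the spirit of that example: a logistic policy outputs P(up), an action is sampled, and the likelihood-ratio gradient is scaled by the reward. Everything here is a toy stand-in, not Karpathy's network — a one-dimensional observation (think "ball above or below the paddle") replaces the ~100K-pixel frame.

```python
import numpy as np

rng = np.random.default_rng(0)
w, lr = np.zeros(1), 0.1

for _ in range(2000):
    x = rng.uniform(-1, 1, size=1)        # toy observation
    p_up = 1 / (1 + np.exp(-w @ x))       # policy: P(up | x)
    action = rng.random() < p_up          # sample "up" (True) or "down"
    reward = 1.0 if action == (x[0] > 0) else -1.0   # "up" is right when x > 0
    grad = (float(action) - p_up) * x     # likelihood ratio: grad log pi(a|x)
    w += lr * reward * grad               # reinforce actions that paid off
print(w)   # the weight grows positive: go up when the ball is above the paddle
```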

Slide 39

@nyghtowl Main Functions
1. model = agent's rep. of the environment
2. value function = reward now plus future reward if the agent acts optimally
3. policy = agent behavior | maps state to action

Slide 40

@nyghtowl Asynchronous Advantage Actor-Critic (A3C)
policy (π) ⇄ value (V)
- critic = TD | evaluates the policy & estimates value
- actor = PG | updates the policy, scored by the critic
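
A single-threaded actor-critic sketch of that split (A3C additionally runs many such workers asynchronously, which this omits): the critic learns V by TD, and the actor takes a policy-gradient step scaled by the TD error as the advantage. The two-state task and all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.zeros(2)        # critic: state-value estimates
theta = np.zeros(2)    # actor: logit of P("go") per state
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.9

for _ in range(3000):
    s = 0
    while s != -1:                         # -1 = terminal
        p_go = 1 / (1 + np.exp(-theta[s]))
        go = rng.random() < p_go
        # "go" advances (reward 1.0 at the end); "stay" wastes a step
        s2, r = ((1, 0.0) if s == 0 else (-1, 1.0)) if go else (s, -0.1)
        v_next = 0.0 if s2 == -1 else V[s2]
        td_error = r + gamma * v_next - V[s]       # critic's TD error = advantage
        V[s] += alpha_v * td_error                 # critic update (TD)
        theta[s] += alpha_pi * td_error * (float(go) - p_go)   # actor update (PG)
        s = s2
print(theta)   # both logits grow positive: the actor learns to "go"
```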

Slide 41

@nyghtowl DeepMind Labyrinth Example

Slide 42

@nyghtowl Main Functions
1. model
2. value function
   - DP (V & π iteration) = full model, bootstrapping
   - MC = model-free sampling, episodic
   - TD learning (DQN) = sampling, bootstrapping & online
3. policy search
   - policy search = with or without gradients
4. value function & policy search
   - A3C = TD learning & PG

Slide 43

@nyghtowl Challenges
- Convergence
- Credit assignment | delayed reward
- Exploration vs. exploitation
- Generalization

Slide 44

@nyghtowl Libraries to Explore
● DeepMind = Lab (agent-based AI research 3D platform) & PySC2 (Blizzard Entertainment's StarCraft II API in an RL environment)
● OpenAI = Gym (develop RL algorithms w/ any library), Baselines (RL algorithm implementations) & Roboschool (robot simulation in Gym)
● Facebook Research = ELF (environments for game research) & ParlAI (framework for dialog AI research)

Slide 45

@nyghtowl Last Week Tonight & RoboCup

Slide 46

@nyghtowl Last Points
- Exploration & transfer learning
- Break the problem down
- Optimal policy

Slide 47

@nyghtowl Resources:
- Reinforcement Learning: An Introduction, Sutton & Barto 1998: http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
- A Brief Survey of Deep Reinforcement Learning: https://arxiv.org/pdf/1708.05866.pdf
- David Silver's RL Course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
- Deep Reinforcement Learning: Pong from Pixels: http://karpathy.github.io/2016/05/31/rl/
- Computational Neuroscience Lab: http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
- Playing Atari with DRL: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
- OpenAI Baselines: https://blog.openai.com/openai-baselines-dqn/
- Robotics overviews: https://www.youtube.com/channel/UC4e_-TvgALrwE1dUPvF_UTQ
- Gridworld example: http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
- Autonomous HVAC Control, A RL Approach: https://link.springer.com/chapter/10.1007/978-3-319-23461-8_1

Github Repos to Explore:
- DeepMind Lab: https://github.com/deepmind/lab
- OpenAI Gym: https://github.com/openai/gym
- Facebook Research ELF: https://github.com/facebookresearch/ELF
- Intro to RL with Python: https://github.com/jzf2101/intro_rl

Slide 48

@nyghtowl References: Images
● iStock.com/4x-image
● iStock.com/edeart
● iStock.com/higyou
● iStock.com/lilu330
● iStock.com/alxpin
● iStock.com/patpitchaya
● iStock.com/Devrimb
● Reinforcement Learning: An Introduction, Sutton & Barto 1998
● http://neverstopbuilding.com/minimax
● https://kevinbinz.com/2016/10/19/mdp/
● http://karpathy.github.io/2016/05/31/rl/
● Last Week Tonight with John Oliver (HBO) & RoboCup
● http://www.fanpop.com/clubs/game-of-thrones/images/38364122/title/jon-snow-season-5-photo
● DeepMind: https://www.youtube.com/watch?v=W2CAghUiofY
● DeepMind: https://youtu.be/nMR5mjCFZCw
● https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
● https://pixabay.com/en/backgammon-board-game-cube-strategy-1903940/
● https://commons.wikimedia.org/wiki/File:Stones_go.jpg
● http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
● https://www.flickr.com/photos/jurvetson/30374100613 | Steve Jurvetson
● Copyright and disclaimer notice: https://creativecommons.org/licenses/by/2.0/
● License notice: https://creativecommons.org/licenses/by/2.0/legalcode

Slide 49

@nyghtowl Thank you...
Deep RL Bootcamp (Pieter Abbeel et al., Berkeley, OpenAI, Gradescope), Lindsay Cade, Alex Graves, Isabel Markl, Jason Morrison, Kelley Robinson, Adam Shin, Dragons

Slide 50

@nyghtowl Artificial Intelligence Melanie Warrick @nyghtowl