Reinforcement Learning

Reinforcement Learning Melanie Warrick | @nyghtowl

Who am I?

@nyghtowl Definition ...

@nyghtowl Goal => Reward

@nyghtowl TD-Gammon

@nyghtowl AlphaGo

@nyghtowl Cooling Energy Reduction

@nyghtowl RL Components
- Agent
- Environment
- State (S) = Start
- Actions (A): Right, Left, Straight
- Reward (R)

Agent Behavior = Policies

@nyghtowl Environment | MDP
Markov Decision Process = the current state is all you need to predict the future
Defined by a Transition Function & Reward

@nyghtowl Transition Function
Probability to move from the current state to a future state

current state \ future state    | stand by citadel & fly | flying over citadel | stand by citadel & breathe fire
stand by citadel & fly          | -                      | 70%                 | 10%
flying over citadel             | 70%                    | -                   | 20%
stand by citadel & breathe fire | 10%                    | 20%                 | -
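A transition function like this one can be sketched as a lookup table that is sampled to pick the next state. The off-diagonal probabilities below are the ones on the slide. The slide leaves the "stay in the same state" cells blank, so this sketch assumes the leftover probability goes there. The names `TRANSITIONS`, `next_state_distribution`, and `sample_next_state` are illustrative only.

```python
import random

# Off-diagonal probabilities as listed on the slide.
TRANSITIONS = {
    "stand by citadel & fly": {
        "flying over citadel": 0.70,
        "stand by citadel & breathe fire": 0.10,
    },
    "flying over citadel": {
        "stand by citadel & fly": 0.70,
        "stand by citadel & breathe fire": 0.20,
    },
    "stand by citadel & breathe fire": {
        "stand by citadel & fly": 0.10,
        "flying over citadel": 0.20,
    },
}


def next_state_distribution(state):
    """Return {next_state: probability} for one current state.

    Assumption (not on the slide): the probability mass that is not
    listed is the chance of staying in the current state.
    """
    dist = dict(TRANSITIONS[state])
    dist[state] = round(1.0 - sum(dist.values()), 10)
    return dist


def sample_next_state(state, rng=random):
    """Draw one next state according to the transition probabilities."""
    dist = next_state_distribution(state)
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]
```

Calling `sample_next_state("flying over citadel")` repeatedly should return "stand by citadel & fly" about 70% of the time under these assumptions.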

@nyghtowl States & Actions | Search Tree
State: You’re standing at the base of a mountain citadel
- Action: fly up => State: You’re above the citadel & see a party inside
- Action: breathe fire => State: The trees are burnt and nearby birds are angry

@nyghtowl State Size | Search Tree
- Tic-Tac-Toe = 10^3
- Backgammon = 10^20
- Chess = 10^47
- Go = 10^170
* Jason Fox & neverstopbuilding.com

Goal => Optimal Policy => Reward *

@nyghtowl Main Functions
1. model = agent’s representation of environment
2. value function = reward now and in the future if the agent acts optimally
3. policy = agent behavior | maps state to action

@nyghtowl Model Environment
- Table Lookup
- Gaussian Process
- Neural Network

@nyghtowl Algorithm Comparison
Model-based
- Pros: Task transfer; Less supervised data; Easier to scale data collection
- Cons: Don’t optimize task performance; Need assumptions for complex skills
Model-free
- Pros: Fewer assumptions for complex skills; Learn complex policies
- Cons: Slow to train | experience

@nyghtowl Planning & Learning
- value/policy => acting => experience
- experience => model learning => model
- model => planning => value/policy
- experience => direct => value/policy

@nyghtowl Main Functions
1. model = agent’s rep. of environment
2. value function = reward now and in the future if the agent acts optimally
   - v (state-value) | q (action-value)
3. policy = agent behavior | maps state to action

@nyghtowl Hello World | Gridworld * Kevin Binz & kevinbinz.com

@nyghtowl Gridworld Environment * Kevin Binz & kevinbinz.com

@nyghtowl Prediction & Control
- prediction = evaluate policy (π) & update value (V)
- control = improve & find optimal policy

@nyghtowl Bellman Equation
- Estimates value
- Assumes MDP
- Breaks the problem down into smaller chunks
- Value = current reward + future discounted, predicted rewards
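The equation itself did not survive in the transcript. One standard way to write this decomposition is the Bellman expectation equation; the notation below follows Sutton & Barto and is an assumption about what the slide showed.

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]
```

Here $R_{t+1}$ is the current reward, $\gamma \in [0, 1]$ is the discount factor, and $\gamma \, v_\pi(S_{t+1})$ is the discounted, predicted future reward.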

@nyghtowl Agent behavior | Exploit: always be optimizing policy

@nyghtowl Optimal Policy | *
State values shown on the grid (cell positions not preserved): 0.85 0.57 0.64 0.74 0.57 0.28 0.48 0.43 0.49
* Kevin Binz & kevinbinz.com

@nyghtowl Value Methods by Backups

                | Sample backups | Full backups
Shallow backups | TD Learning    | Dynamic Programming
Deep backups    | Monte Carlo    | Exhaustive Search

Axis labels: Bootstrapping, Environment
* https://arxiv.org/pdf/1708.05866.pdf

@nyghtowl Dynamic Programming
- value (V) | policy (π)
- Equation terms: immediate reward, future discounted rewards, transition probability, value function
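The terms named on this slide (immediate reward, discounted future rewards, transition probability, value function) all appear in value iteration, one of the dynamic programming methods. The following is a minimal sketch on a hypothetical three-state MDP; the `MODEL` layout, state names, and rewards are made up for illustration and are not from the talk.

```python
GAMMA = 0.9  # discount factor (illustrative choice)

# Hypothetical MDP: MODEL[state][action] = [(probability, next_state, reward), ...]
MODEL = {
    "A": {"right": [(1.0, "B", 0.0)]},
    "B": {
        "right": [(0.8, "goal", 1.0), (0.2, "A", 0.0)],
        "left": [(1.0, "A", 0.0)],
    },
    "goal": {},  # terminal state: no actions, value stays 0
}


def value_iteration(model, gamma=GAMMA, tol=1e-8):
    """Sweep all states until the largest value change is below tol."""
    values = {state: 0.0 for state in model}
    while True:
        delta = 0.0
        for state, actions in model.items():
            if not actions:
                continue  # terminal state
            # Best action: transition probability * (immediate reward
            # + discounted value of the next state), summed over outcomes.
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


optimal_values = value_iteration(MODEL)
```

Because the full model is required up front, this only works when the transition function and rewards are known, which is the "full model" property the talk attributes to DP.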

@nyghtowl Q-Values
Four action values per state (grid positions and action labels not preserved):
0.85 0.68 0.77 0.57
0.74 0.60 0.67 0.67
0.64 0.57 0.59 0.53
-0.66 0.53 0.57 0.30
0.51 0.51 0.57 0.46
0.29 0.40 0.48 0.41
0.42 0.43 0.40 0.40
0.41 0.45 0.49 0.44
0.13 0.28 -0.65 0.27
* Kevin Binz & kevinbinz.com
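The link between action values and state values, v*(s) = max over actions of q*(s, a), can be checked against the numbers on the slides. This sketch assumes each run of four numbers on the Q-Values slide belongs to one state and that the two hyphens are minus signs; both are readings of the extracted text, not something the transcript confirms. `STATE_VALUES` is copied from the Optimal Policy slide.

```python
# Q-values from the slide, grouped four per state (assumed grouping).
Q_VALUES = [
    [0.85, 0.68, 0.77, 0.57],
    [0.74, 0.60, 0.67, 0.67],
    [0.64, 0.57, 0.59, 0.53],
    [-0.66, 0.53, 0.57, 0.30],
    [0.51, 0.51, 0.57, 0.46],
    [0.29, 0.40, 0.48, 0.41],
    [0.42, 0.43, 0.40, 0.40],
    [0.41, 0.45, 0.49, 0.44],
    [0.13, 0.28, -0.65, 0.27],
]

# State values from the "Optimal Policy" slide (grid order unknown).
STATE_VALUES = [0.85, 0.57, 0.64, 0.74, 0.57, 0.28, 0.48, 0.43, 0.49]

# Acting greedily means taking the best action value in each state.
greedy_values = [max(action_values) for action_values in Q_VALUES]

# Compare as sorted lists because the grid positions were not preserved.
values_match = sorted(greedy_values) == sorted(STATE_VALUES)
```

Under that grouping the two sets of numbers agree, which supports reading the optimal policy as "pick the action with the highest Q-value in each state".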

@nyghtowl Monte Carlo | Sampling & Simulation
- State subset sampling | learn w/ environment interaction
- Model-free
- Episodic update
- Equation terms: estimate value, # visits to state, actual reward
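The equation terms on this slide (estimated value, number of visits to a state, actual reward) fit the incremental-mean form of Monte Carlo prediction. The sketch below is an every-visit variant; the function name `mc_update`, the `(state, reward)` episode format, and the sample episodes are assumptions for illustration.

```python
from collections import defaultdict


def mc_update(values, visits, episode, gamma=1.0):
    """Update value estimates from one finished episode.

    episode is a list of (state, reward) pairs, where reward is what the
    agent received after leaving that state.
    """
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g  # actual return observed from this state
        visits[state] += 1      # number of visits to the state
        # Move the estimate toward the observed return by 1 / visits.
        values[state] += (g - values[state]) / visits[state]
    return values


values = defaultdict(float)
visits = defaultdict(int)
mc_update(values, visits, [("s0", 0.0), ("s1", 1.0)])
mc_update(values, visits, [("s0", 0.0), ("s1", 0.0)])
```

The update can only run once an episode has ended, which is why the slide describes it as an episodic update.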

@nyghtowl Temporal-Difference Learning
- DP bootstrapping/estimating & MC sampling
- Off-policy vs. on-policy
- Incomplete & continuous environments
- Equation terms: estimate value, revised estimate value
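A one-step TD update revises the current estimate toward a target built from the next reward and the existing estimate of the next state, so it does not have to wait for the episode to end. This is a minimal tabular TD(0) sketch; the function name and the `alpha` and `gamma` defaults are illustrative assumptions.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update in place and return the TD error."""
    next_value = 0.0 if terminal else values.get(next_state, 0.0)
    target = reward + gamma * next_value       # revised estimate (bootstrapped)
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


values = {"a": 0.0, "b": 1.0}
error = td0_update(values, "a", reward=0.5, next_state="b")
```

The target uses a sampled transition (as in Monte Carlo) and an existing value estimate (as in dynamic programming), which is the combination the slide describes.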

@nyghtowl Deep Q Network (DQN) | TD Learning
Q-Learning with Neural Nets
- CNN & fully connected layers
- Convergence => Target Network & Experience Replay
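The two stabilising pieces named on this slide can be sketched without a neural network. Below, `ReplayBuffer` stores past transitions for random re-sampling, and `q_learning_target` computes the Q-learning target from a frozen copy of the value function. Both names, and the `target_q` callable standing in for the target network, are hypothetical; a real DQN would plug a neural network in at that point.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions (experience replay)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest items drop off first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Random sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_learning_target(reward, next_state, done, target_q, gamma=0.99):
    """TD target using a frozen target network.

    target_q(next_state) is assumed to return a list of action values.
    """
    if done:
        return reward
    return reward + gamma * max(target_q(next_state))
```

The online network would then be trained to move its prediction for the stored state and action toward this target, while `target_q` is only refreshed occasionally.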

@nyghtowl DeepMind | Space Invaders

@nyghtowl Policy Search
Directly model policies & no value function
- Efficient storage
- High-dimensional & continuous environments
- Baseline & advantage function to reduce variance & help convergence
Example = uniform random policy

@nyghtowl Policy Search Example Algorithms
Gradient
- Policy Network
- REINFORCE (likelihood ratio)
- TRPO (Trust Region Policy Optimization)
Gradient-free
- Evolution
- Simulated annealing
- Genetic algorithms
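As a small illustration of the gradient family, here is a REINFORCE-style update on a hypothetical two-armed bandit with a softmax policy. The bandit, the learning rate, and the function names are assumptions for this sketch, and no baseline is used, so the updates are noisier than in practical implementations.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(reward_probs, steps=2000, lr=0.1, seed=0):
    """Learn a softmax policy over arms that pay 1 with the given odds."""
    rng = random.Random(seed)
    prefs = [0.0] * len(reward_probs)  # policy parameters
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs, k=1)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        for a in range(len(prefs)):
            # Likelihood-ratio term: d log pi(action) / d prefs[a]
            score = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * reward * score
    return softmax(prefs)


policy = reinforce_bandit([0.2, 0.8])
```

The policy is adjusted directly from sampled rewards; no value function is stored, which is the defining property of policy search in the talk.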

@nyghtowl Policy NN | ATARI 2600 Pong Ex
- Input image ~100K (210x160x3) byte array
- Environment (2 paddles, 1 run by NN)
- Actions (paddle up and down) (binary)
- Reward (+1 ball past opponent, -1 if miss ball)
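The "~100K" input size follows from the frame dimensions on the slide, assuming one byte per colour channel. The constant names are illustrative.

```python
HEIGHT, WIDTH, CHANNELS = 210, 160, 3  # frame shape from the slide

# One byte per colour channel per pixel (assumption).
frame_bytes = HEIGHT * WIDTH * CHANNELS

# Rewards as described on the slide.
REWARD_SCORE = +1  # ball gets past the opponent
REWARD_MISS = -1   # the agent misses the ball
```

Here `frame_bytes` works out to 100,800 values per frame, which is the input the policy network has to map to a binary up/down decision.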

@nyghtowl Asynchronous Advantage Actor-Critic (A3C)
- critic = TD | evaluate policy (π) & estimate value (V)
- actor = PG | update policy | score * critic
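The actor and critic roles can be sketched in tabular form. This is a single-worker, one-step actor-critic update, not A3C itself: there are no asynchronous workers and no neural networks, and the TD error stands in for the advantage. All names and learning rates are assumptions for illustration.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(prefs, values, state, action, reward, next_state,
                      done, actor_lr=0.1, critic_lr=0.1, gamma=0.9):
    """Update critic values and actor preferences in place."""
    # Critic: one-step TD error, used here as an advantage estimate.
    next_value = 0.0 if done else values[next_state]
    td_error = reward + gamma * next_value - values[state]
    values[state] += critic_lr * td_error

    # Actor: policy-gradient step, score scaled by the critic's signal.
    probs = softmax(prefs[state])
    for a in range(len(probs)):
        score = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += actor_lr * td_error * score
    return td_error


prefs = {"s": [0.0, 0.0]}
values = {"s": 0.0, "t": 0.0}
td_error = actor_critic_step(prefs, values, "s", 1, 1.0, "t", done=True)
```

A positive TD error makes the chosen action more likely and a negative one makes it less likely, which is the "score * critic" product on the slide.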

@nyghtowl DeepMind Labyrinth Example

@nyghtowl Main Functions
1. model
2. value function
   - DP (V & Pi Iteration) = full model, bootstrapping
   - MC = model-free sampling, episodic
   - TD-learning (DQN) = sampling, bootstrapping & online
3. policy search
   - Policy Search = gradients or not
4. value function & policy search
   - A3C = TD-learning & PG

@nyghtowl Challenges
- Convergence
- Credit Assignment | Delayed Reward
- Exploration vs Exploitation
- Generalization
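One common way to trade exploration against exploitation is an epsilon-greedy rule, which the talk does not name explicitly; it is shown here only as an example. The function name and the `epsilon` parameter are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the highest estimated action value.
    return max(range(len(q_values)), key=q_values.__getitem__)


best_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

With `epsilon=0.0` the agent always exploits; raising it makes the agent try other actions more often, at the cost of some immediate reward.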

@nyghtowl Libraries to Explore
• DeepMind = Lab (agent-based AI research 3D platform) & PySC2 (Blizzard Entertainment’s StarCraft II API in an RL environment)
• OpenAI = Gym (develop RL algorithms w/ any library) & Baselines (RL algorithms) & Roboschool (robot simulation in Gym)
• Facebook Research = ELF (environments for game research) & ParlAI (framework for dialog AI research)

@nyghtowl Last Week Tonight & RoboCup

@nyghtowl Last Points
- Exploration & Transfer Learning
- Break down problem
- Optimal policy

@nyghtowl Resources:
- Reinforcement Learning: An Introduction, Sutton & Barto 1998: http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
- Brief Survey of Deep Reinforcement Learning: https://arxiv.org/pdf/1708.05866.pdf
- David Silver’s RL Course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
- Deep Reinforcement Learning: Pong from Pixels: http://karpathy.github.io/2016/05/31/rl/
- Computational Neuroscience Lab: http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
- Playing Atari with DRL: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
- OpenAI Baselines: https://blog.openai.com/openai-baselines-dqn/
- Robotics overviews: https://www.youtube.com/channel/UC4e_-TvgALrwE1dUPvF_UTQ
- Gridworld example: http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
- Autonomous HVAC Control, A RL Approach: https://link.springer.com/chapter/10.1007/978-3-319-23461-8_1
Github Repos to Explore:
- DeepMind Lab: https://github.com/deepmind/lab
- OpenAI Gym: https://github.com/openai/gym
- FacebookResearch ELF: https://github.com/facebookresearch/ELF
- Intro to RL with Python: https://github.com/jzf2101/intro_rl

@nyghtowl References: Images
• iStock.com/4x-image
• iStock.com/edeart
• iStock.com/higyou
• iStock.com/lilu330
• iStock.com/alxpin
• iStock.com/patpitchaya
• iStock.com/Devrimb
• Reinforcement Learning: An Introduction, Sutton & Barto 1998
• http://neverstopbuilding.com/minimax
• https://kevinbinz.com/2016/10/19/mdp/
• http://karpathy.github.io/2016/05/31/rl/
• Last Week Tonight with John Oliver (HBO) & RoboCup
• https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwj_pe26xqXWAhVLjlQKHQZbCcUQjRwIBw&url=http%3A%2F%2Fwww.fanpop.com%2Fclubs%2Fgame-of-thrones%2Fimages%2F38364122%2Ftitle%2Fjon-snow-season-5-photo&psig=AFQjCNHQiVvgBQJ0xl3iM1M4lHU04ETmQQ&ust=1505508473062578
• DeepMind https://www.youtube.com/watch?v=W2CAghUiofY
• DeepMind https://youtu.be/nMR5mjCFZCw
• https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
• https://pixabay.com/en/backgammon-board-game-cube-strategy-1903940/
• https://commons.wikimedia.org/wiki/File:Stones_go.jpg
• http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
• https://www.flickr.com/photos/jurvetson/30374100613 | Steve Jurvetson
• Copyright and disclaimer notice: https://creativecommons.org/licenses/by/2.0/
• License notice: https://creativecommons.org/licenses/by/2.0/legalcode

@nyghtowl Thank you...
- Deep RL Bootcamp (Pieter Abbeel et al., Berkeley, OpenAI, Gradescope)
- Lindsay Cade
- Alex Graves
- Isabel Markl
- Jason Morrison
- Kelley Robinson
- Adam Shin
- Dragons

@nyghtowl Artificial Intelligence Melanie Warrick @nyghtowl
