Melanie Warrick
September 21, 2017

# Reinforcement Learning

Reinforcement learning is a popular subfield of machine learning because of its success in beating humans at complex games such as Go and Atari video games. The field’s value lies in using a reward system to develop models and find better ways to solve complex, real-world problems. This approach allows software to adapt to its environment without full knowledge of what the results should look like. This talk covers reinforcement learning fundamentals and examples to help you understand how it works.

## Transcript

8. ### @nyghtowl RL Components

    - Agent
    - Environment
    - State (S) = Start
    - Actions (A): Right, Left, Straight
    - Reward (R)

10. ### @nyghtowl Environment | MDP

    - Markov Decision Process = current state is all you need
    - Transition Function & Reward

12. ### @nyghtowl Transition Function

    - Probability of moving from the current state to a future state
    - Diagram: states "stand by citadel" and "flying over citadel"; actions "fly" and "breathe fire"; transitions labeled 70%, 20%, and 10%

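
One way to picture a transition function is as a lookup table from a (state, action) pair to a probability distribution over next states. The sketch below is only an illustration: the state and action names and the 70/20/10 split come from the slide, but which outcome each probability belongs to is an assumption, because the diagram layout is not recoverable from the transcript. `TRANSITIONS` and `sample_next_state` are hypothetical names.

```python
import random

# Hypothetical transition table: (state, action) -> {next_state: probability}.
# The pairing of probabilities with outcomes is assumed for illustration.
TRANSITIONS = {
    ("stand by citadel", "fly"): {
        "flying over citadel": 0.7,
        "stand by citadel": 0.2,
        "burnt trees & angry birds": 0.1,
    },
    ("stand by citadel", "breathe fire"): {
        "burnt trees & angry birds": 0.7,
        "stand by citadel": 0.2,
        "flying over citadel": 0.1,
    },
}


def sample_next_state(state, action, rng=random):
    """Draw a next state according to the transition probabilities."""
    outcomes = TRANSITIONS[(state, action)]
    next_states = list(outcomes)
    weights = [outcomes[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]
```

Calling `sample_next_state("stand by citadel", "fly")` repeatedly should land on "flying over citadel" most of the time under these assumed numbers.
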
13. ### @nyghtowl States & Actions | Search Tree

    - State: You’re standing at the base of a mountain citadel
        - Action "fly up": You’re above the citadel & see a party inside
        - Action "breathe fire": The trees are burnt and nearby birds are angry

14. ### @nyghtowl State Size | Search Tree

    - Tic-Tac-Toe = 10^3
    - Backgammon = 10^20
    - Chess = 10^47
    - Go = 10^170
    - Credit: Jason Fox & neverstopbuilding.com

16. ### @nyghtowl Main Functions

    1. model = agent’s representation of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action

18. ### @nyghtowl Model Environment

    - Table Lookup
    - Gaussian Process
    - Neural Network

19. ### @nyghtowl Algorithm Comparison

    - Model-based
        - Pros: task transfer; less supervised data; easier to scale data collection
        - Cons: doesn’t optimize task performance; needs assumptions for complex skills
    - Model-free
        - Pros: fewer assumptions for complex skills; learns complex policies
        - Cons: slow to train | experience

20. ### @nyghtowl Planning & Learning

    - Diagram nodes: value/policy, experience, model
    - Diagram arrows: acting, model learning, planning, direct

21. ### @nyghtowl Main Functions

    1. model = agent’s rep. of environment
    2. value function = reward now and in the future if the agent acts optimally
        - v (state-value) | q (action-value)
    3. policy = agent behavior | maps state to action

24. ### @nyghtowl Prediction & Control

    - Diagram: policy (π) and value (V)
    - prediction = evaluate policy & update v
    - control = improve & find optimal policy

25. ### @nyghtowl Bellman Equation

    - Estimates value
    - Assumes MDP
    - Breaks the problem down into smaller chunks
    - Equation terms: current reward + future discounted, predicted rewards
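
The equation image did not survive in the transcript. As a hedged reconstruction, the Bellman expectation equation in the notation of Sutton & Barto (listed in the resources) has the two parts the slide labels; the symbols below (`v_\pi`, `R_{t+1}`, `\gamma`, `S_t`) follow that book and are an assumption about what the slide showed.

```latex
v_\pi(s) = \mathbb{E}_\pi\Big[\,
  \underbrace{R_{t+1}}_{\text{current reward}}
  + \underbrace{\gamma\, v_\pi(S_{t+1})}_{\text{future discounted, predicted rewards}}
  \;\Big|\; S_t = s \Big]
```

Here the discount factor satisfies 0 ≤ γ ≤ 1, and the expectation is taken over the actions chosen by the policy π and the environment's transitions.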

27. ### @nyghtowl Optimal Policy | π*

    - Figure: gridworld state values (grid positions lost in extraction): 0.85, 0.57, 0.64, 0.74, 0.57, 0.28, 0.48, 0.43, 0.49
    - Credit: Kevin Binz & kevinbinz.com

28. ### @nyghtowl Value Methods by Backups

    - Axes: sample vs. full backups; shallow vs. deep backups
    - Methods: TD Learning, Monte Carlo, Dynamic Programming, Exhaustive Search
    - Other labels: Bootstrapping, Environment
    - Source: https://arxiv.org/pdf/1708.05866.pdf

29. ### @nyghtowl Dynamic Programming

    - Diagram: value (V) and policy (π)
    - Equation terms: immediate reward, future discounted rewards, transition probability, value function

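
The terms labeled on this slide (immediate reward, discounted future rewards, transition probability, value function) are the ingredients of value iteration. The following is a minimal sketch under stated assumptions: the two-state MDP, its rewards, and the names `value_iteration`, `STATES`, `ACTIONS`, `TRANSITIONS`, and `REWARDS` are all hypothetical and chosen only to keep the example small.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-6):
    """Repeat Bellman optimality backups until values change by less than tol.

    transitions[(s, a)] maps next states to probabilities;
    rewards[(s, a)] is the immediate reward for taking a in s.
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Immediate reward plus discounted, probability-weighted future value.
            best = max(
                rewards[(s, a)]
                + gamma * sum(p * values[s2] for s2, p in transitions[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update: later states see the new value
        if delta < tol:
            return values


# Hypothetical two-state MDP with deterministic transitions.
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
TRANSITIONS = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}
REWARDS = {("A", "stay"): 0.0, ("A", "move"): 1.0, ("B", "stay"): 2.0, ("B", "move"): 0.0}
```

In this toy setup, staying in B forever earns 2 per step, so `value_iteration(STATES, ACTIONS, TRANSITIONS, REWARDS)` should approach 2 / (1 - 0.9) = 20 for B and 1 + 0.9 × 20 = 19 for A. Note that this requires the full model, which is the "full" side of the backup diagram.
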
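30. ### @nyghtowl Q-Values

    - Figure: gridworld Q-values per state-action pair (grid positions lost in extraction): 0.85, 0.68, 0.77, 0.57, 0.74, 0.60, 0.67, 0.67, 0.64, 0.57, 0.59, 0.53, -, 0.66, 0.53, 0.57, 0.30, 0.51, 0.51, 0.57, 0.46, 0.29, 0.40, 0.48, 0.41, 0.42, 0.43, 0.40, 0.40, 0.41, 0.45, 0.49, 0.44, 0.13, 0.28, -, 0.65, 0.27
    - Credit: Kevin Binz & kevinbinz.com
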
31. ### @nyghtowl Monte Carlo | Sampling & Simulation

    - State subset sampling | learn w/ environment interaction
    - Model-free
    - Episodic update
    - Equation terms: # visits to state, estimate value, actual reward, estimate value

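
The labels on the Monte Carlo equation (visit count, estimated value, actual reward) match the incremental-mean update V(s) ← V(s) + (G − V(s)) / N(s). Below is a minimal every-visit sketch; the episode format (a list of `(state, reward)` pairs, with the reward received on leaving that state) and the name `mc_update` are assumptions made for illustration.

```python
from collections import defaultdict


def mc_update(values, visits, episode, gamma=1.0):
    """Every-visit Monte Carlo update from one finished episode.

    episode is a list of (state, reward) pairs in the order visited.
    """
    g = 0.0  # return accumulated from the end of the episode backwards
    for state, reward in reversed(episode):
        g = reward + gamma * g
        visits[state] += 1
        # Move the estimate toward the actual return by 1 / (# visits).
        values[state] += (g - values[state]) / visits[state]


values = defaultdict(float)
visits = defaultdict(int)
mc_update(values, visits, [("s0", 0.0), ("s1", 1.0)])
mc_update(values, visits, [("s0", 0.0), ("s1", 0.0)])
```

Because the update needs the full return, it can only run once an episode has ended, which is the "episodic update" point on the slide.
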
32. ### @nyghtowl Temporal Difference-Learning

    - DP bootstrapping/estimating & MC sampling
    - Off-policy vs. on-policy
    - Incomplete & continuous environments
    - Equation terms: estimate value, revised estimate value, estimate value

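
TD learning revises an estimate after every step by bootstrapping from the next state's estimate. The sketch below shows a tabular TD(0) update, plus the two targets that distinguish off-policy Q-learning from on-policy SARSA. The function names and the dictionary-based tables are hypothetical; only the update rules themselves are standard.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V(state) toward reward + gamma * V(next_state)."""
    target = reward + gamma * values[next_state]  # revised estimate
    values[state] += alpha * (target - values[state])
    return values[state]


def q_learning_target(q, reward, next_state, actions, gamma=0.9):
    """Off-policy target: assumes the greedy action is taken next."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def sarsa_target(q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * q[(next_state, next_action)]
```

Unlike the Monte Carlo update, none of these need the episode to finish, which is why TD methods suit incomplete or continuing environments.
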
33. ### @nyghtowl Deep Q Network (DQN)

    - Q-Learning with neural nets
    - CNN & fully connected layers
    - Convergence => Target Network & Experience Replay
    - TD Learning
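
Experience replay, one of the two convergence aids named on the slide, stores past transitions and trains on random mini-batches so that consecutive, correlated steps are not learned from in order. The following is a minimal sketch of the buffer only (no neural network); `ReplayBuffer` and its method names are hypothetical and not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Return batch_size transitions drawn uniformly without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

The target network is the complementary idea: a periodically synced copy of the Q-network supplies the TD targets so they do not shift on every gradient step.
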

35. ### @nyghtowl Main Functions

    1. model = agent’s rep. of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action

36. ### @nyghtowl Policy Search

    - Directly model policies & no value function
    - Efficient storage
    - High-dimensional & continuous environments
    - Baseline & advantage function for variance & convergence
    - Example = uniform random policy

37. ### @nyghtowl Policy Search Example Algorithms

    - Gradient
        - Policy Network
        - REINFORCE (likelihood ratio)
        - TRPO (Trust Region Policy Optimization)
    - Gradient-free
        - Evolution
        - Simulated annealing
        - Genetic algorithms

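
To make the likelihood-ratio idea concrete, here is a small REINFORCE-style sketch on a bandit problem (a single state), using a softmax policy and a running-average baseline to reduce variance. Everything about the setup is an assumption for illustration: the bandit task, the `reward_fn(action, rng)` callback signature, and the hyperparameters are hypothetical, and a real policy network would replace the preference list.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(reward_fn, n_actions=2, steps=2000, lr=0.1, seed=0):
    """Likelihood-ratio policy gradient with a running-mean baseline."""
    rng = random.Random(seed)
    prefs = [0.0] * n_actions
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(n_actions), weights=probs, k=1)[0]
        reward = reward_fn(action, rng)
        baseline += (reward - baseline) / t  # running mean of rewards so far
        advantage = reward - baseline
        for a in range(n_actions):
            # Gradient of log softmax: indicator(a == action) - probs[a].
            score = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * advantage * score
    return softmax(prefs)
```

With a reward function such as `lambda a, rng: 1.0 if a == 1 else 0.0`, the returned probabilities should shift toward action 1, since rewards above the baseline raise the preference of the action that earned them.
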
38. ### @nyghtowl Policy NN | ATARI 2600 Pong Ex

    - Input: image ~100K (210x160x3) byte array
    - Environment: 2 paddles, 1 run by NN
    - Actions: paddle up and down (binary)
    - Reward: +1 ball past opponent, -1 if miss ball

39. ### @nyghtowl Main Functions

    1. model = agent’s rep. of environment
    2. value function = reward now and in the future if the agent acts optimally
    3. policy = agent behavior | maps state to action

40. ### @nyghtowl Asynch. Advantage Actor-Critic (A3C)

    - Diagram: policy (π) and value (V)
    - critic = TD | eval policy & estimate value
    - actor = PG | update policy | score * critic
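
The actor and critic roles on this slide can be sketched as a single tabular update: the critic computes a TD error, and the actor scales its policy-gradient score by that error. This is a simplified, single-worker illustration and deliberately omits the asynchronous, multi-worker part of A3C; `actor_critic_update` and the dictionary-based tables are hypothetical.

```python
import math


def softmax(prefs):
    """Convert action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_update(prefs, values, state, action, reward, next_state, done,
                        actor_lr=0.1, critic_lr=0.1, gamma=0.9):
    """One-step actor-critic update; returns the TD error used as the advantage."""
    # Critic (TD): evaluate the current policy by updating the state value.
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    values[state] += critic_lr * td_error

    # Actor (PG): policy score multiplied by the critic's signal.
    probs = softmax(prefs[state])
    for a in range(len(probs)):
        score = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += actor_lr * td_error * score
    return td_error
```

Here `prefs` maps each state to a list of action preferences and `values` maps each state to its estimated value.
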

42. ### @nyghtowl Main Functions

    1. model
    2. value function
        - DP (V & Pi Iteration) = full model, bootstrapping
        - MC = model-free sampling, episodic
        - TD-learning (DQN) = sampling, bootstrapping & online
    3. policy search
        - Policy Search = gradients or not
    4. value function & policy search
        - A3C = TD-learning & PG

43. ### @nyghtowl Challenges

    - Convergence
    - Credit Assignment | Delayed Reward
    - Exploration vs Exploitation
    - Generalization

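
A common, simple way to trade off exploration against exploitation is epsilon-greedy action selection: act randomly with a small probability and greedily otherwise. The sketch below is one illustration of that trade-off, not something the slides prescribe; `epsilon_greedy` is a hypothetical helper.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one.

    q_values is a list of estimated action values indexed by action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Setting `epsilon` to 0 gives a purely greedy agent, while 1 gives a uniformly random one; in practice the value is often decayed over training.
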
44. ### @nyghtowl Libraries to Explore

    - DeepMind = Lab (agent-based AI research 3D platform) & PySC2 (Blizzard Entertainment’s StarCraft II API in an RL environment)
    - OpenAI = Gym (develop RL algorithms w/ any library) & Baselines (RL algorithms) & Roboschool (robot simulation in Gym)
    - Facebook Research = ELF (environments for game research) & ParlAI (framework for dialog AI research)

46. ### @nyghtowl Last Points

    - Exploration & Transfer Learning
    - Break down problem
    - Optimal policy

47. ### @nyghtowl Resources

    - Reinforcement Learning: An Introduction, Sutton & Barto 1998: http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
    - Brief Survey of Deep Reinforcement Learning: https://arxiv.org/pdf/1708.05866.pdf
    - David Silver’s RL Course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
    - Deep Reinforcement Learning: Pong from Pixels: http://karpathy.github.io/2016/05/31/rl/
    - Computational Neuroscience Lab: http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    - Playing Atari with DRL: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
    - OpenAI Baselines: https://blog.openai.com/openai-baselines-dqn/
    - Robotics overviews: https://www.youtube.com/channel/UC4e_-TvgALrwE1dUPvF_UTQ
    - Gridworld example: http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
    - Autonomous HVAC Control, A RL Approach: https://link.springer.com/chapter/10.1007/978-3-319-23461-8_1

    GitHub repos to explore:

    - DeepMind Lab: https://github.com/deepmind/lab
    - OpenAI Gym: https://github.com/openai/gym
    - FacebookResearch ELF: https://github.com/facebookresearch/ELF
    - Intro to RL with Python: https://github.com/jzf2101/intro_rl

48. ### @nyghtowl References: Images

    - iStock.com/4x-image
    - iStock.com/edeart
    - iStock.com/higyou
    - iStock.com/lilu330
    - iStock.com/alxpin
    - iStock.com/patpitchaya
    - iStock.com/Devrimb
    - Reinforcement Learning: An Introduction, Sutton & Barto 1998
    - http://neverstopbuilding.com/minimax
    - https://kevinbinz.com/2016/10/19/mdp/
    - http://karpathy.github.io/2016/05/31/rl/
    - Last Week Tonight with John Oliver (HBO) & RoboCup
    - https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwj_pe26xqXWAhVLjlQKHQZbCcUQjRwIBw&url=http%3A%2F%2Fwww.fanpop.com%2Fclubs%2Fgame-of-thrones%2Fimages%2F38364122%2Ftitle%2Fjon-snow-season-5-photo&psig=AFQjCNHQiVvgBQJ0xl3iM1M4lHU04ETmQQ&ust=1505508473062578
    - DeepMind: https://www.youtube.com/watch?v=W2CAghUiofY
    - DeepMind: https://youtu.be/nMR5mjCFZCw
    - https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
    - https://pixabay.com/en/backgammon-board-game-cube-strategy-1903940/
    - https://commons.wikimedia.org/wiki/File:Stones_go.jpg
    - http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
    - https://www.flickr.com/photos/jurvetson/30374100613 | Steve Jurvetson
        - Copyright and disclaimer notice: https://creativecommons.org/licenses/by/2.0/
        - License notice: https://creativecommons.org/licenses/by/2.0/legalcode