Slide 1

Slide 1 text

RLXDL Wonseok Jung Introduction (feat. CS)

Slide 2

Slide 2 text


Wonseok Jung
- City University of New York, Baruch College, Data Science major
- ConnexionAI founder
- Deep Learning College, reinforcement learning researcher
- Modulabs CTRL leader
- Interests: Reinforcement Learning, Object Detection, Chatbot
- GitHub: https://github.com/wonseokjung
- Facebook: https://www.facebook.com/wsjung
- Blog: https://wonseokjung.github.io
- YouTube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw


Slide 3

Slide 3 text

What we cover
1. From supervised learning to decision making
2. Model-free algorithms: Q-learning, policy gradients, actor-critic
3. Advanced model learning and prediction
4. Exploration
5. Transfer and multi-task learning, meta-learning
6. Open problems, research talks, invited lectures

Slide 4

Slide 4 text

1. Imitation learning
2. Policy gradients
3. Q-learning and actor-critic algorithms
4. Model-based reinforcement learning
5. Advanced model-free RL algorithms

Slide 5

Slide 5 text

Reference
If you are not yet familiar with TensorFlow, see: https://www.tensorflow.org/guide/low_level_intro
If you want to practice TensorFlow eager execution, see: https://github.com/wonseokjung/hands_on_tf

Slide 6

Slide 6 text

What is RL, and why should we care about it now?

Slide 7

Slide 7 text

How do we build intelligent machines?
Household support, humanoid robots.
If a robot has intelligence, it can be used for many different purposes: home robots, rescue robots, humanoids.

Slide 8

Slide 8 text

Intelligent machines must be able to adapt

Slide 9

Slide 9 text

Oil tanker
A ship's job is to carry containers or oil tanks across the sea.
What if it would be fine for the ship to have no crew on board?

Slide 10

Slide 10 text

Deep learning helps us handle unstructured environments.
Predicting an unstructured environment: "unstructured" means it cannot be predicted exactly.
Example: a fire breaking out because of oil spilled on the ship.

Slide 11

Slide 11 text

Using a neural network we can build models with tens of thousands of parameters, and they can take images as input.
Application areas: computer vision, natural language processing, etc.

Slide 12

Slide 12 text

Reinforcement learning provides a formalism for behavior
1. Deep learning does not provide decision making
2. In order to make a decision we need a mathematical formalism
3. Reinforcement learning is what gives us the mathematical framework for dealing with decision making

Slide 13

Slide 13 text

Mathematical framework for dealing with decision making
1. Models an interaction between an agent and a world
2. The agent makes a decision
3. The world responds to that decision with consequences: observations and rewards
[Diagram: agent-environment loop; the agent takes action A_t, the environment returns reward R_{t+1} and state S_{t+1}]
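To make the interaction loop concrete, here is a minimal sketch in Python. The CoinFlipEnv environment, its reset()/step() interface, and the random_agent policy are hypothetical stand-ins (not code from the lecture); the point is only the cycle of action A_t, observation S_{t+1} and reward R_{t+1}.

```python
import random

class CoinFlipEnv:
    """Toy world: guess a coin flip. The observation is a dummy state,
    the action is a guess in {0, 1}, the reward is 1.0 when correct."""

    def reset(self):
        return 0  # initial, uninformative observation S_0

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # R_{t+1}
        observation = coin                       # S_{t+1}
        return observation, reward

def random_agent(observation):
    """The agent makes a decision (an action) from the current observation."""
    return random.randint(0, 1)

env = CoinFlipEnv()
obs = env.reset()
total_reward = 0.0
for t in range(100):
    action = random_agent(obs)       # agent decides
    obs, reward = env.step(action)   # world responds with consequences
    total_reward += reward
print("average reward:", total_reward / 100)
```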

Slide 14

Slide 14 text

First big successes in reinforcement learning
1. They came from combining reinforcement learning with deep learning
2. Playing games (Atari)
3. AlphaGo
4. Robotic manipulation

Slide 15

Slide 15 text

What is deep RL, and why should we care?

Slide 16

Slide 16 text

What does end-to-end learning mean for sequential decision making?
1. You are walking in the jungle and see a tiger
2. You need to take some action (you may want to run away)
3. Tiger -> perception ("oh yeah, it is a tiger") -> control system -> "Run"

Slide 17

Slide 17 text

Simplified
1. You don't even know that it is a tiger
2. You just know that getting eaten is a bad thing and not getting eaten is a good thing
3. Tiger -> control system -> "Run"

Slide 18

Slide 18 text

Actions, observations and rewards
1. The agent makes decisions: actions
2. The world responds with consequences: observations and rewards

Slide 19

Slide 19 text

How do animals learn?
1. Actions: muscle contractions
2. Observations: sight, smell
3. Rewards: food

Slide 20

Slide 20 text

Environment and reward
[Diagram: agent-environment loop with A_t, S_t, R_{t+1}, S_{t+1}; tapping the ball gives a positive reward]


Slide 21

Slide 21 text

Robotics
1. Actions: motor current or torque
2. Observations: camera images
3. Rewards: task success measure

Slide 22

Slide 22 text

Inventory management
1. Actions: what to purchase
2. Observations: inventory levels
3. Rewards: profit

Slide 23

Slide 23 text

Image classification
1. Actions: label the output
2. Observations: image pixels
3. Rewards: correct or not correct
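As a small illustration of this slide's framing, the sketch below treats image classification as a one-step decision problem: the observation is the pixel vector, the action is the predicted label, and the reward is 1 for a correct label and 0 otherwise. The classify function and the 10-class setup are hypothetical placeholders, not a real model.

```python
import random

NUM_CLASSES = 10  # assumed number of labels, e.g. digits 0-9

def classify(image_pixels):
    """Stand-in policy: choose a label (the action) from the observation (pixels)."""
    return random.randrange(NUM_CLASSES)

def reward(predicted_label, true_label):
    """Reward signal: correct or not correct."""
    return 1.0 if predicted_label == true_label else 0.0

# One "episode" per image: observe pixels, act (pick a label), receive a reward.
image, true_label = [0.0] * 784, 7  # hypothetical flattened image and its label
print("reward:", reward(classify(image), true_label))
```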

Slide 24

Slide 24 text

1. Reinforcement learning is basically solving the task in its most general form, in the most complex setting
2. Deep models are what allow reinforcement learning algorithms to solve complex problems end to end
3. Reinforcement learning provides the formalism (the algorithmic framework)
4. Deep learning provides the representations that allow us to apply that formalism to very complex problems with high-dimensional observations and complicated action spaces

Slide 25

Slide 25 text

https://arxiv.org/pdf/1709.10087.pdf

Slide 26

Slide 26 text

1. Dexterous multi-fingered hands are versatile and provide a generic way to perform a multitude of tasks
2. Controlling them is challenging due to the high dimensionality and the large number of potential contacts
3. The success of DRL in robotics has so far been limited to simpler manipulators and tasks
Summary
1. Model-free DRL can effectively scale up to complex manipulation tasks (in simulated experiments)
2. Using a small number of human demonstrations significantly reduces sample complexity
3. Use of demonstrations leads to natural movements
4. Successful policies: object relocation, in-hand manipulation, tool use, door opening

Slide 27

Slide 27 text

https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

Slide 28

Slide 28 text

Approximate action-value: Super Mario with RL

Slide 29

Slide 29 text

Double DQN: Super Mario with RL
[Diagram: the environment provides state s as input to the Q-network, which outputs action values Q(s, a); the chosen action a is applied to the environment, and transitions (S_t, A_t, R_{t+1}, S_{t+1}) are stored in replay memory]
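The sketch below shows, under stated assumptions, how the pieces in the diagram fit together: transitions (S_t, A_t, R_{t+1}, S_{t+1}) are appended to a replay memory, and the Double DQN target lets the online Q-network choose the next action while a separate target network evaluates it. The q_online / q_target callables, the discount factor, and the buffer size are placeholders, not code from the Super Mario agent shown in the talk.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99          # discount factor (assumed)
BUFFER_SIZE = 10_000  # replay memory capacity (assumed)

replay_memory = deque(maxlen=BUFFER_SIZE)

def store(s, a, r, s_next, done):
    """Append one transition (S_t, A_t, R_{t+1}, S_{t+1}) to the replay memory."""
    replay_memory.append((s, a, r, s_next, done))

def double_dqn_targets(batch, q_online, q_target):
    """Compute Double DQN targets for a batch of transitions.

    q_online(states) and q_target(states) are assumed to return arrays of
    shape (batch_size, num_actions) holding action values Q(s, a).
    """
    s, a, r, s_next, done = map(np.array, zip(*batch))  # s, a would feed the training loss
    best_next = np.argmax(q_online(s_next), axis=1)              # online net selects...
    next_q = q_target(s_next)[np.arange(len(batch)), best_next]  # ...target net evaluates
    return r + GAMMA * (1.0 - done) * next_q

# Usage with toy stand-in "networks" that return random action values:
q_online = lambda states: np.random.rand(len(states), 4)
q_target = lambda states: np.random.rand(len(states), 4)
for _ in range(100):
    store(np.zeros(8), random.randrange(4), 1.0, np.zeros(8), False)
batch = random.sample(list(replay_memory), 32)
print(double_dqn_targets(batch, q_online, q_target).shape)  # -> (32,)
```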

Slide 30

Slide 30 text

https://arxiv.org/pdf/1806.10293.pdf

Slide 31

Slide 31 text

Why should we study this now?
1. Advances in deep learning
2. Advances in reinforcement learning
3. Advances in computational capability
https://pdfs.semanticscholar.org/54c4/cf3a8168c1b70f91cf78a3dc98b671935492.pdf

Slide 32

Slide 32 text

Why should we study this now?
1. In the last five years we have seen a few impressive successes
2. Playing games (left)
3. Robotics (middle)
4. Go (right)

Slide 33

Slide 33 text

What other problems do we need to solve to enable real-world sequential decision making?

Slide 34

Slide 34 text

Learning from reward
1. Basic reinforcement learning: maximizing rewards
2. Learning reward functions from examples (inverse RL)
3. Transferring knowledge between domains (transfer learning, meta-learning)
4. Learning to predict and using prediction to act

Slide 35

Slide 35 text

Where do rewards come from?

Slide 36

Slide 36 text

Are there any forms of supervision?
1. Learning from demonstrations
 - Directly copying observed behavior (imitation)
 - Inferring rewards from observed behavior
2. Learning from observing the world
 - Learning to predict
 - Unsupervised learning
3. Learning from other tasks
 - Transfer learning
 - Meta-learning: learning to learn

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

What has proven challenging so far?
1. Humans can learn incredibly quickly
 - Deep RL methods are usually slow
2. Humans can reuse past knowledge
 - Transfer learning in deep RL is an open problem
3. Not clear what the reward function should be
4. Not clear what the role of prediction should be

Slide 43

Slide 43 text

NEXT EPISODE
Supervised Learning and Imitation