
Deep reinforcement learning -2

Wonseok Jung
November 06, 2018

Transcript

  1. Wonseok Jung
     City University of New York, Baruch College, Data Science Major
     ConnexionAI Founder, Deep Learning College Reinforcement Learning Researcher, Modulabs CTRL Leader
     Interests: Reinforcement Learning, Object Detection, Chatbot
     Github: https://github.com/wonseokjung
     Facebook: https://www.facebook.com/wsjung
     Blog: https://wonseokjung.github.io
     Youtube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw

  2. Today
     1. Definition of sequential decision problems
     2. Imitation learning: supervised learning for decision making
        a. Does direct imitation work?
        b. How can we make it work more often?
     3. Case studies of recent work in (deep) imitation learning
     4. What is missing from imitation learning?

  3. Goals
     * Understand definitions & notation
     * Understand basic imitation learning algorithms
     * Understand their strengths & weaknesses
  4. Terminology and notation
     Example: classifying an image as CAT, DOG, or TIGER.
     - Input: pixels
     - Output: a categorical random variable (the label of the object)
     - Model: what you want to learn
  5. In Reinforcement Learning
     Given the same image, the output is not a label but an action: 1. pet the cat, 2. ignore it, 3. give it food.
  6. Sequential Decision
     The same choices (pet the cat, ignore it, give it food) become actions in a sequential decision problem, with the notation below (a code sketch follows this slide):
     - st : state
     - ot : observation
     - at : action
     - πθ(at ∣ ot) : policy
     - πθ(at ∣ st) : policy (fully observed)
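A minimal code sketch of a policy πθ(at ∣ ot) (the PyTorch model, its sizes, and the discrete three-action setup are illustrative assumptions, not from the slides): it has the same shape as the classifier from slide 4, except that the categorical output is read as a distribution over actions instead of labels.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """pi_theta(a_t | o_t): maps an observation to a distribution over actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

policy = Policy(obs_dim=64, n_actions=3)   # actions: pet the cat / ignore / give food
o_t = torch.randn(1, 64)                   # a (flattened) observation
a_t = policy(o_t).sample()                 # a_t ~ pi_theta(a_t | o_t)
```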
  7. State and observation
     1. State: the underlying state of the world (e.g. the positions and momenta of the cat and the mouse).
     2. Observation: the image pixels. The underlying state of the world is hidden inside the image; you have to process the image to get it out.
  8. State and observation
     1. State: a summary of the world that is sufficient to predict what happens next.
     2. Observation: a consequence of the state, but a lossy one.
  9. Graphical model
     [Diagram: states s1, s2, s3, observations o1, o2, o3, and actions a1, a2, a3; policy πθ(at ∣ ot), or πθ(at ∣ st) when fully observed; transition dynamics p(st+1 ∣ st, at).]
     1. Draw a graphical model relating state, observation, and action.
     2. Looking at previous observations might give you more information (the observations alone need not be Markovian), while the state transition p(st+1 ∣ st, at) depends only on the current state and action. The factorization this diagram encodes is written out below.
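A compact way to write the factorization that the graphical model encodes (a standard chain-rule factorization; it is implied by the diagram but not spelled out on the slide):

```latex
p(s_{1:T}, o_{1:T}, a_{1:T})
  = p(s_1)\prod_{t=1}^{T} p(o_t \mid s_t)\,\pi_\theta(a_t \mid o_t)
    \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t)
```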
  10. Imitation Learning
     1. Observation: image from the dashboard camera of the car
     2. Action: steering commands (turn the wheel left or right)
     3. Collect data and do supervised training
  11. Behavior cloning
     1. A human driver drives the car.
     2. Record images from the dashboard camera.
     3. Put an encoder on the steering wheel and record the steering commands.
     4. This gives a dataset of (observation: recorded image, action: recorded steering command) pairs.
     5. Store the data and train a policy on it with a supervised learning algorithm (see the sketch below).
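A minimal behavior-cloning training sketch under the same assumptions as the policy sketch above (the dataset format, optimizer, and hyperparameters are illustrative): fit πθ(at ∣ ot) to the recorded (image, steering command) pairs by maximizing their likelihood.

```python
import torch

def behavior_cloning(policy, dataset, epochs=10, lr=1e-3):
    """dataset yields (obs, action) pairs recorded from the human driver."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, action in dataset:
            dist = policy(obs)                       # pi_theta(a | o)
            loss = -dist.log_prob(action).mean()     # maximize likelihood of the demos
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```

This is exactly supervised learning: nothing in the loop touches the environment, only the stored demonstrations.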
  12. It does not work
     1. No training algorithm is perfect, even on the training data.
     2. The learned policy incurs a little bit of error at each step.
     3. Over a long enough trajectory those small errors compound, and the policy drifts away from the states it was trained on.
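A rough way to quantify the compounding (this is the standard behavior-cloning bound from the imitation-learning literature, e.g. the analysis around DAgger; it is not on the slide): if the policy errs with probability at most ε in states drawn from the training distribution, and ct counts a mistake at step t, then over a horizon of T steps

```latex
\mathbb{E}\left[\sum_{t=1}^{T} c_t\right] = O(\varepsilon T^2)
```

because one early mistake can put the policy in states it was never trained on, where it may keep erring for the rest of the trajectory.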
  13. Learning from a stabilizing controller
     1. Obtain a whole distribution over trajectories rather than a single demonstration.
     2. The distribution comes from the demonstrator's behavior perturbed by some kind of noise, so the data also shows how to correct small mistakes (a collection sketch follows below).
     τ = (s1, a1, s2, a2, ..., sT, aT), with trajectory distribution p(τ) = p(s1, a1, s2, a2, ..., sT, aT)
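One minimal way such a perturbed-trajectory dataset could be collected (the env and expert interfaces here are illustrative assumptions, not from the slides): inject noise into the stabilizing controller's actions so the recorded trajectories also visit slightly off-course states, paired with the controller's corrective action in each of them.

```python
import numpy as np

def collect_noisy_demos(env, expert, episodes=100, noise_std=0.1):
    """Roll out a stabilizing controller with action noise; env.reset() is assumed to
    return an observation and env.step(action) to return (next_observation, done)."""
    data = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = expert(obs)                  # corrective action at this state
            data.append((obs, action))            # label with the *unperturbed* action
            noisy = action + np.random.normal(0.0, noise_std, size=np.shape(action))
            obs, done = env.step(noisy)           # but execute the perturbed one
    return data
```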
  14. Can we make it work more often?
     1. When we run the policy πθ, we sample an action given the current observation.
     2. The policy πθ was trained on a distribution of observations called pdata(ot).
     3. pdata(ot) is the distribution of observations in the dataset, i.e. the observations seen while the human drove the car.
  15. Can we make it work more often?
     1. The distribution of observations seen when running the policy, pπθ(ot), is not equal to pdata(ot).
     2. The whole problem is that pdata(ot) ≠ pπθ(ot).
     3. Can we make pdata(ot) = pπθ(ot)?
  16. Can we make it work more often?
     1. Can we make pdata(ot) = pπθ(ot)?
     2. Idea: be clever about pdata(ot): collect training data from the distribution of observations the policy will actually see.
  17. DAgger: Dataset Aggregation
     1. Train πθ(at ∣ ot) on the human-generated dataset D = {o1, a1, o2, a2, ..., oN, aN}.
     2. Run πθ(at ∣ ot) to collect a dataset of observations Dπ = {o1, o2, ..., oM}.
     3. A human looks at the runs produced by the policy (e.g. the recorded driving frames) and labels each observation in Dπ with the correct action at.
     4. Aggregate: D ← D ∪ Dπ.
     Keep repeating steps 1-4 (a code sketch follows below).
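A sketch of the DAgger loop in code, reusing the behavior_cloning sketch from slide 11 (run_policy and expert_label are assumed helpers: the first rolls out πθ and returns the observations it visited, the second asks the human for the correct action at one observation).

```python
def dagger(policy, human_data, run_policy, expert_label, iterations=10):
    D = list(human_data)                                # D = {(o1, a1), ..., (oN, aN)}
    for _ in range(iterations):
        policy = behavior_cloning(policy, D)            # 1. train pi_theta(a|o) on D
        D_pi = run_policy(policy)                       # 2. run pi_theta, collect {o1, ..., oM}
        labeled = [(o, expert_label(o)) for o in D_pi]  # 3. human labels D_pi with actions
        D = D + labeled                                 # 4. aggregate: D <- D ∪ D_pi
    return policy
```

Each round, D grows to include observations drawn from the distribution the current policy actually induces, which is how DAgger pushes pdata(ot) toward pπθ(ot), the goal stated on slide 16.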
  18. Imitation learning: what's the problem?
     1. Humans need to provide the data, which is typically finite.
     2. Humans are not good at providing some kinds of actions.
     3. Humans can learn autonomously; can machines do the same?