
Deep reinforcement learning -2

Wonseok Jung
November 06, 2018

Transcript

  1. Wonseok Jung
     City University of New York, Baruch College, Data Science Major
     ConnexionAI Founder, Deep Learning College Reinforcement Learning Researcher, Modulabs CTRL Leader
     Interests: Reinforcement Learning, Object Detection, Chatbot
     Github: https://github.com/wonseokjung
     Facebook: https://www.facebook.com/wsjung
     Blog: https://wonseokjung.github.io
     Youtube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw

  2. Today
     1. Definition of sequential decision problems
     2. Imitation learning: supervised learning for decision making
        a. Does direct imitation work?
        b. How can we make it work more often?
     3. Case studies of recent work in (deep) imitation learning
     4. What is missing from imitation learning?

  3. Goals
     * Understand definitions & notation
     * Understand basic imitation learning algorithms
     * Understand their strengths & weaknesses
  4. Terminology and notation
     Example: classifying an image as CAT, DOG, or TIGER.
     - Input: pixels
     - Output: a categorical random variable (the label of the object)
     - Model: what you want to learn
  5. In Reinforcement Learning
     Given the same image, the output is not a label but an action: 1. pet the cat, 2. ignore it, 3. give it food.
  6. Sequential Decision
     The same choices (pet the cat, ignore it, give it food) become actions in a sequential decision problem, with the notation below (a code sketch follows this slide):
     - st : state
     - ot : observation
     - at : action
     - πθ(at ∣ ot) : policy
     - πθ(at ∣ st) : policy (fully observed)
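A minimal code sketch of a policy πθ(at ∣ ot) (the PyTorch model, its sizes, and the discrete three-action setup are illustrative assumptions, not from the slides): it has the same shape as the classifier from slide 4, except that the categorical output is read as a distribution over actions instead of labels.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """pi_theta(a_t | o_t): maps an observation to a distribution over actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

policy = Policy(obs_dim=64, n_actions=3)   # actions: pet the cat / ignore / give food
o_t = torch.randn(1, 64)                   # a (flattened) observation
a_t = policy(o_t).sample()                 # a_t ~ pi_theta(a_t | o_t)
```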
  7. State and observation
     1. State: the underlying state of the world (e.g. the positions and momenta of the cat and the mouse).
     2. Observation: the image pixels. The underlying state of the world is hidden inside the image; you have to process the image to get it out.
  8. State and observation
     1. State: a summary of the world that is sufficient to predict what happens next.
     2. Observation: a consequence of the state, but a lossy one.
  9. Graphical model
     [Diagram: states s1, s2, s3, observations o1, o2, o3, and actions a1, a2, a3; policy πθ(at ∣ ot), or πθ(at ∣ st) when fully observed; transition dynamics p(st+1 ∣ st, at).]
     1. Draw a graphical model relating state, observation, and action.
     2. Looking at previous observations might give you more information (the observations alone need not be Markovian), while the state transition p(st+1 ∣ st, at) depends only on the current state and action. The factorization this diagram encodes is written out below.
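A compact way to write the factorization that the graphical model encodes (a standard chain-rule factorization; it is implied by the diagram but not spelled out on the slide):

```latex
p(s_{1:T}, o_{1:T}, a_{1:T})
  = p(s_1)\prod_{t=1}^{T} p(o_t \mid s_t)\,\pi_\theta(a_t \mid o_t)
    \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t)
```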
  10. Imitation Learning
     1. Observation: image from the dashboard camera of the car
     2. Action: steering commands (turn the wheel left or right)
     3. Collect data and do supervised training
  11. Behavior cloning
     1. A human driver drives the car.
     2. Record images from the dashboard camera.
     3. Put an encoder on the steering wheel and record the steering commands.
     4. This gives a dataset of (observation: recorded image, action: recorded steering command) pairs.
     5. Store the data and train a policy on it with a supervised learning algorithm (see the sketch below).
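A minimal behavior-cloning training sketch under the same assumptions as the policy sketch above (the dataset format, optimizer, and hyperparameters are illustrative): fit πθ(at ∣ ot) to the recorded (image, steering command) pairs by maximizing their likelihood.

```python
import torch

def behavior_cloning(policy, dataset, epochs=10, lr=1e-3):
    """dataset yields (obs, action) pairs recorded from the human driver."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, action in dataset:
            dist = policy(obs)                       # pi_theta(a | o)
            loss = -dist.log_prob(action).mean()     # maximize likelihood of the demos
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```

This is exactly supervised learning: nothing in the loop touches the environment, only the stored demonstrations.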
  12. It does not work
     1. No training algorithm is perfect, even on the training data.
     2. The learned policy incurs a little bit of error at each step.
     3. Over a long enough trajectory those small errors compound, and the policy drifts away from the states it was trained on.
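A rough way to quantify the compounding (this is the standard behavior-cloning bound from the imitation-learning literature, e.g. the analysis around DAgger; it is not on the slide): if the policy errs with probability at most ε in states drawn from the training distribution, and ct counts a mistake at step t, then over a horizon of T steps

```latex
\mathbb{E}\left[\sum_{t=1}^{T} c_t\right] = O(\varepsilon T^2)
```

because one early mistake can put the policy in states it was never trained on, where it may keep erring for the rest of the trajectory.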
  13. Learning from a stabilizing controller
     1. Obtain a whole distribution over trajectories rather than a single demonstration.
     2. The distribution comes from the demonstrator's behavior perturbed by some kind of noise, so the data also shows how to correct small mistakes (a collection sketch follows below).
     τ = (s1, a1, s2, a2, ..., sT, aT), with trajectory distribution p(τ) = p(s1, a1, s2, a2, ..., sT, aT)
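One minimal way such a perturbed-trajectory dataset could be collected (the env and expert interfaces here are illustrative assumptions, not from the slides): inject noise into the stabilizing controller's actions so the recorded trajectories also visit slightly off-course states, paired with the controller's corrective action in each of them.

```python
import numpy as np

def collect_noisy_demos(env, expert, episodes=100, noise_std=0.1):
    """Roll out a stabilizing controller with action noise; env.reset() is assumed to
    return an observation and env.step(action) to return (next_observation, done)."""
    data = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = expert(obs)                  # corrective action at this state
            data.append((obs, action))            # label with the *unperturbed* action
            noisy = action + np.random.normal(0.0, noise_std, size=np.shape(action))
            obs, done = env.step(noisy)           # but execute the perturbed one
    return data
```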
  14. Can we make it work more often?
     1. When we run the policy πθ, we sample an action given the current observation.
     2. The policy πθ was trained on a distribution of observations called pdata(ot).
     3. pdata(ot) is the distribution of observations in the dataset, i.e. the observations seen while the human drove the car.
  15. Can we make it work more often?
     1. The distribution of observations seen when running the policy, pπθ(ot), is not equal to pdata(ot).
     2. The whole problem is that pdata(ot) ≠ pπθ(ot).
     3. Can we make pdata(ot) = pπθ(ot)?
  16. Can we make it work more often?
     1. Can we make pdata(ot) = pπθ(ot)?
     2. Idea: be clever about pdata(ot): collect training data from the distribution of observations the policy will actually see.
  17. DAgger: Dataset Aggregation
     1. Train πθ(at ∣ ot) on the human-generated dataset D = {o1, a1, o2, a2, ..., oN, aN}.
     2. Run πθ(at ∣ ot) to collect a dataset of observations Dπ = {o1, o2, ..., oM}.
     3. A human looks at the runs produced by the policy (e.g. the recorded driving frames) and labels each observation in Dπ with the correct action at.
     4. Aggregate: D ← D ∪ Dπ.
     Keep repeating steps 1-4 (a code sketch follows below).
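A sketch of the DAgger loop in code, reusing the behavior_cloning sketch from slide 11 (run_policy and expert_label are assumed helpers: the first rolls out πθ and returns the observations it visited, the second asks the human for the correct action at one observation).

```python
def dagger(policy, human_data, run_policy, expert_label, iterations=10):
    D = list(human_data)                                # D = {(o1, a1), ..., (oN, aN)}
    for _ in range(iterations):
        policy = behavior_cloning(policy, D)            # 1. train pi_theta(a|o) on D
        D_pi = run_policy(policy)                       # 2. run pi_theta, collect {o1, ..., oM}
        labeled = [(o, expert_label(o)) for o in D_pi]  # 3. human labels D_pi with actions
        D = D + labeled                                 # 4. aggregate: D <- D ∪ D_pi
    return policy
```

Each round, D grows to include observations drawn from the distribution the current policy actually induces, which is how DAgger pushes pdata(ot) toward pπθ(ot), the goal stated on slide 16.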
  18. Imitation learning: what's the problem?
     1. Humans need to provide the data, which is typically finite.
     2. Humans are not good at providing some kinds of actions.
     3. Humans can learn autonomously; can machines do the same?