an World
2. Agent makes a decision
3. World responds to that decision with consequences: observation, reward
[Figure: agent-environment loop. Labels: Agent, Environment, action A_t, state S_t, reward R_t, next state S_{t+1}, next reward R_{t+1}]
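The loop above can be sketched in a few lines of Python. This is only a minimal illustration: `CoinWorld`, `run_episode`, and `copy_policy` are hypothetical names invented here, and the toy world (a coin whose face the agent tries to match) is an assumption, not something taken from the slides.

```python
import random


class CoinWorld:
    """Hypothetical toy world: the state is a coin face, 0 or 1."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.randint(0, 1)

    def observe(self):
        # In this toy the observation is simply the state itself.
        return self.state

    def step(self, action):
        # The world responds to the decision with consequences:
        # a reward for the action, then a new observation.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 1)
        return self.observe(), reward


def run_episode(policy, steps=5, seed=0):
    world = CoinWorld(seed)
    observation = world.observe()  # assumed starting point: look at the world
    total_reward = 0.0
    for _ in range(steps):
        action = policy(observation)              # agent makes a decision
        observation, reward = world.step(action)  # world responds
        total_reward += reward
    return total_reward


def copy_policy(observation):
    # Guess that the coin shows what was just observed.
    return observation


print(run_episode(copy_policy))  # → 5.0
```

Because the observation here equals the state, `copy_policy` always earns the reward; once the observation hides part of the state, the same loop still applies but the decision becomes harder.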
(e.g. position, momentum, cat, mouse)
2. Observation: image pixels. The state of the world underlies the image, but it is hidden inside it; you have to process the image to get the state out.
) - policy
- policy (fully observed)
s_t - state
o_t - observation
a_t - action
[Figure: graphical model over o_1, s_1, a_1, o_2, s_2, a_2, o_3, s_3, a_3, with transitions p(s_{t+1} ∣ s_t, a_t)]
1. Draw a graphical model to relate state, observation, and action
2. Previous observations might give you more information
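i ∣ s_t = j, a_t = k): probability that the state is i at timestep t+1, given that the state is j and the action is k at timestep t
(probability that the state is i at timestep t)
ξ_{t,k} = p(a_t = k): probability that the action is k at timestep t
r : S × A → ℝ: reward function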
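One way to see how these quantities fit together is to push a state distribution forward one timestep. The sketch below assumes the action is drawn independently of the state, so that the next-state probabilities are the sum over j and k of p(s_{t+1} = i ∣ s_t = j, a_t = k) times the state probability times ξ_{t,k}. The names `T`, `mu`, `xi`, and `next_state_distribution` are illustrative choices, not notation confirmed by the slides.

```python
def next_state_distribution(T, mu, xi):
    """Return mu_next[i] = sum_j sum_k T[i][j][k] * mu[j] * xi[k].

    T[i][j][k] : p(s_{t+1} = i | s_t = j, a_t = k)
    mu[j]      : probability that the state is j at timestep t
    xi[k]      : probability that the action is k at timestep t
    Assumes the action is chosen independently of the state.
    """
    return [
        sum(T[i][j][k] * mu[j] * xi[k]
            for j in range(len(mu))
            for k in range(len(xi)))
        for i in range(len(T))
    ]


# Toy example: two states, two actions (action 0 = stay, action 1 = flip).
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # probability of landing in state 0
    [[0.0, 1.0], [1.0, 0.0]],  # probability of landing in state 1
]
mu = [1.0, 0.0]    # start in state 0 with certainty
xi = [0.75, 0.25]  # stay with probability 0.75, flip with 0.25

print(next_state_distribution(T, mu, xi))  # → [0.75, 0.25]
```

Storing the transition probabilities as a three-index array is what makes "operator" a natural word for T: applying it to the current distributions is a plain weighted sum.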
S: state space
A: action space
T: transition operator

M = {S, A, O, T, E, r}
S: state space
A: action space
O: observation space, with o ∈ O
E: emission probability
r: reward function
[Figure: graphical model over o_1, s_1, a_1, o_2, s_2, a_2, o_3, s_3, a_3, with transitions p(s_{t+1} ∣ s_t, a_t)]
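The tuple M = {S, A, O, T, E, r} maps fairly directly onto a small data structure. The sketch below is a hedged illustration: it reads the emission probability as p(o ∣ s), which is the usual interpretation but is not spelled out on the slide, and the class `POMDP`, the helpers `sample` and `rollout`, and the two-state `toy` model are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class POMDP:
    states: List[str]                           # S
    actions: List[str]                          # A
    observations: List[str]                     # O
    T: Dict[Tuple[str, str], Dict[str, float]]  # T[(s, a)][s_next] = p(s_next | s, a)
    E: Dict[str, Dict[str, float]]              # E[s][o] = p(o | s), assumed reading
    r: Callable[[str, str], float]              # r(s, a)


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(m, policy, s0, steps, rng):
    """Run a policy that only sees observations, never the state."""
    s = s0
    trajectory = []
    for _ in range(steps):
        o = sample(m.E[s], rng)       # emit an observation from the hidden state
        a = policy(o)                 # the policy acts on the observation
        trajectory.append((o, a, m.r(s, a)))
        s = sample(m.T[(s, a)], rng)  # hidden state transition
    return trajectory


# Hypothetical two-state model with a noisy sensor.
toy = POMDP(
    states=["left", "right"],
    actions=["stay", "switch"],
    observations=["see_left", "see_right"],
    T={
        ("left", "stay"): {"left": 1.0},
        ("left", "switch"): {"right": 1.0},
        ("right", "stay"): {"right": 1.0},
        ("right", "switch"): {"left": 1.0},
    },
    E={
        "left": {"see_left": 0.8, "see_right": 0.2},
        "right": {"see_left": 0.2, "see_right": 0.8},
    },
    r=lambda s, a: 1.0 if s == "left" else 0.0,
)

for step in rollout(toy, lambda o: "stay", "left", 3, random.Random(0)):
    print(step)
```

Note that `rollout` never hands the state to the policy; that separation is what distinguishes the partially observed setting from the fully observed one.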
Deep RL methods are usually slow
2. Humans can reuse past knowledge - transfer learning in deep RL is an open problem
3. Not clear what the reward function should be
4. Not clear what the role of prediction should be