
reinforcement_learning_.pdf

Wonseok Jung
December 19, 2018

Transcript

  1. ALL ANIMALS HAVE THE ABILITY TO LEARN - All animals have the ability to learn. - Even C. elegans, a worm with only a few hundred neurons, can learn: its head-withdrawal reflex is a reflexive action taken when a dangerous object may be present, and touching its head makes it back up a certain distance. HOW ANIMALS LEARN
  2. LAW OF EFFECT - Edward Thorndike - Law of effect: if the consequence of a behavior is satisfying, the behavior is repeated; if it is not, the behavior is not repeated. - Reinforcement: a stimulus that makes a previously occurring behavior more likely to be repeated. - Punishment: a stimulus that makes a previously occurring behavior be avoided. HOW ANIMALS LEARN
  3. Mathematical framework for dealing with decision making 1. Models an interaction between an Agent and a World 2. The Agent makes a decision 3. The World responds to that decision with consequences: an observation and a reward [Diagram: agent-environment loop with action At, state St, reward Rt, next state St+1 and reward Rt+1]
  4. What does end-to-end learning mean for sequential decision making? 1. You are walking in the jungle and see a tiger 2. You need to take some action (you may want to run away) 3. Tiger -> perception ("oh yeah, it is a tiger") -> control system -> "Run"
  5. Simplified 1. You don't even know that it is a tiger 2. You just know that getting eaten is a bad thing and not getting eaten is a good thing 3. Tiger -> control system -> "Run"
  6. Action, Observation and Rewards 1. Agent makes decisions : actions

    2. The world responds with consequences : observations and rewards
  7. Robotics 1. Actions : motor current or torque 2. Observations : camera images 3. Rewards : task success measure
  8. Image classification 1. Actions : the output label 2. Observations : image pixels 3. Rewards : correct or not correct
  9. Terminology and notation (CAT, DOG, TIGER) 1. Supervised learning - input : pixels - output : categorical random variable (the label of the object) - Model : what you want to learn
  10. In Reinforcement Learning (1. pet the cat 2. ignore it 3. give it food) 1. the output could be not a label, but an action
  11. Sequential Decision (1. pet the cat 2. ignore it 3. give it food) st - state, ot - observation, at - action, πθ (at ∣ ot ) - policy, πθ (at ∣ st ) - policy (fully observed)
  12. State and observation 1. State : the underlying state of the world (e.g., position, momentum, cat, mouse) 2. Observation : image pixels; the underlying state of the world is hidden inside the image, and you have to process the image to get it out
  13. State and observation 1. State : a summary of the world, used to predict the world 2. Observation : a consequence of the state, but a lossy one (State -> Observation)
  14. Reward functions - High reward : arriving safely at the destination - Low reward : a traffic accident - Policy : a policy that drives safely - The policy is learned through the reward function
  15. Graphical model st - state, ot - observation, at - action, πθ (at ∣ ot ) - policy, πθ (at ∣ st ) - policy (fully observed), p(st+1 ∣ st , at ) - transition 1. Drawing a graphical model that relates state, observation, and action (o1 s1 a1 -> o2 s2 a2 -> o3 s3 a3) 2. Observing previous observations might give you more information
  16. Markov Decision Problem st - state, ot - observation, at - action, πθ (at ∣ ot ) - policy, πθ (at ∣ st ) - policy (fully observed), p(st+1 ∣ st , at ) - transition. A Markov Decision Problem defines the world of reinforcement learning in terms of state, action, reward, and transition.
  17. TRAJECTORY REINFORCEMENT LEARNING (St , At , Rt+1 , St+1

    ) (St+1 , At+1 , Rt+2 , St+2 ) (St+2 , At+2 , Rt+3 , St+3 )
  18. Markov chain, graphically (s1 -> s2 -> s3) μt,i = p(st = i) : the probability that the state is i at timestep t. Ti,j = p(st+1 = i ∣ st = j) : the probability that the next state is i given that the current state is j.
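As a worked example of the notation above, here is a minimal NumPy sketch; the 3-state chain and its transition matrix are made up for illustration. Applying the transition operator T to the state distribution μt gives μt+1:

    import numpy as np

    # Hypothetical 3-state Markov chain: T[i, j] = p(s_{t+1} = i | s_t = j)
    T = np.array([[0.9, 0.2, 0.0],
                  [0.1, 0.7, 0.3],
                  [0.0, 0.1, 0.7]])   # each column sums to 1

    mu = np.array([1.0, 0.0, 0.0])    # mu[i] = p(s_t = i); start in state 0

    for t in range(5):
        mu = T @ mu                   # mu_{t+1} = T mu_t
        print(t + 1, mu)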
  19. Markov decision process (s1 -a1-> s2 -a2-> s3, transitions governed by p(st+1 ∣ st , at )) M = {S, A, T, r} : s ∈ S - state space, a ∈ A - action space, T - transition operator, r - reward function
  20. Markov decision process μt,i = p(st = i) : the probability that the state is i at timestep t. ξt,k = p(at = k) : the probability that the action is k at timestep t. Ti,j,k = p(st+1 = i ∣ st = j, at = k) : the probability that the state at timestep t+1 is i, given that the state is j and the action is k at timestep t. r : S x A -> R : the reward function.
  21. Partially Observed Markov decision process M = {S, A, O, T, E, r} : s ∈ S - state space, a ∈ A - action space, o ∈ O - observation space, T - transition operator, E - emission probability, r - reward function (o1 s1 a1 -> o2 s2 a2 -> o3 s3 a3, transitions governed by p(st+1 ∣ st , at ))
  22. The goal of reinforcement learning pθ (τ) = pθ (s1 , a1 , . . . , sT , aT ) = p(s1 ) ∏t=1..T πθ (at ∣ st ) p(st+1 ∣ st , at ) θ* = argmaxθ Eτ∼pθ (τ) [ ∑t r(st , at ) ]
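The objective above can be read directly as code: a minimal sketch, assuming the 4-tuple Gym-style step API used elsewhere in this deck and a hypothetical policy(obs) function, that approximates Eτ∼pθ(τ) [ ∑t r(st , at ) ] by sampling episodes and averaging their summed rewards:

    import numpy as np

    def estimate_return(env, policy, episodes=10):
        # Monte Carlo estimate of E_{tau ~ p_theta(tau)} [ sum_t r(s_t, a_t) ]
        returns = []
        for _ in range(episodes):
            obs = env.reset()
            done, total = False, 0.0
            while not done:
                action = policy(obs)                        # a_t ~ pi_theta(a_t | o_t)
                obs, reward, done, info = env.step(action)  # world responds with o_{t+1}, r
                total += reward
            returns.append(total)
        return np.mean(returns)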
  23. Policy iteration - Policy Evaluation : compute the state-value by following the policy. Initialize V(s) for every state, then update V(s) for each state using the update rule; stop updating when the change in V(s) becomes very small.
  24. GridWorld Environment - Policy iteration State : the grid coordinates Action : up, down, left, right Reward : received every time the agent moves one cell Transition Probability and Discount factor : fixed constants [Grid figure: per-cell rewards, goal cells, and actions]
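A minimal sketch of the policy-evaluation loop from the previous two slides, assuming the classic 4x4 gridworld with a uniform random policy, a reward of -1 per move, terminal corner cells, and no discounting (these specific numbers are assumptions, since they were lost from the extracted slides):

    import numpy as np

    N = 4                                     # 4x4 grid, states indexed by (row, col)
    terminals = {(0, 0), (N - 1, N - 1)}      # assumed goal cells
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    V = np.zeros((N, N))                      # initialize V(s) = 0 for every state
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in terminals:
                    continue
                new_v = 0.0
                for dr, dc in actions:        # uniform random policy: probability 1/4 each
                    nr = min(max(r + dr, 0), N - 1)
                    nc = min(max(c + dc, 0), N - 1)
                    new_v += 0.25 * (-1 + V[nr, nc])   # reward -1 per move, gamma = 1
                delta = max(delta, abs(new_v - V[r, c]))
                V[r, c] = new_v
        if delta < 1e-4:                      # stop when V(s) barely changes
            break
    print(np.round(V, 1))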
  25. GridWorld Environment - Policy iteration : Vk and the corresponding greedy policy at initialization. [Grid figure; values lost in extraction]
  26. GridWorld Environment - Policy iteration : Vk and the greedy policy after further sweeps. [Grid figure]
  27. GridWorld Environment - Policy iteration : Vk and the greedy policy after further sweeps. [Grid figure]
  28. GridWorld Environment - Policy iteration : Vk and the greedy policy as k -> inf. [Grid figure]
  29. GridWorld Environment - Value iteration : Vk and the greedy policy once k converges. [Grid figure]
  30. Sarsa gridworld : states that have not been experienced carry no information. [Grid figure]
  31. Sarsa gridworld : the action values are updated over many repeated experiences. The policy is on-policy. [Grid figure]
  32. Q-learning gridworld : states that have not been experienced carry no information. [Grid figure]
  33. Q-learning gridworld : the action values are updated over many repeated experiences. The policy is off-policy. [Grid figure]
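To make the on-policy / off-policy contrast above concrete, here is a minimal sketch of the two tabular update rules; the step size, discount factor, and Q-table layout are assumptions for illustration:

    import numpy as np

    alpha, gamma = 0.1, 0.99

    def sarsa_update(Q, s, a, r, s_next, a_next):
        # On-policy: bootstraps from the action a_next actually chosen by the current policy
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next):
        # Off-policy: bootstraps from the greedy action in s_next, whatever is actually executed
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])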
  34. MARKOV DECISION PROCESS SUPERMARIO WITH R.L [Diagram: agent-environment loop with action At, state St, reward Rt, next state St+1 and reward Rt+1] Reward / Penalty
  36. SUPERMARIO WITH R.L - INSTALL AND IMPORT ENVIRONMENT https://github.com/wonseokjung/gym-super-mario-bros
      pip install gym-super-mario-bros
      import gym_super_mario_bros
      env = gym_super_mario_bros.make('SuperMarioBros-v0')
      env.reset()
      env.render()
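As a usage sketch of the environment just installed, a random agent can be run as follows; this assumes the 2018-era Gym API in which env.step returns a 4-tuple, and the random action is only for illustration:

    import gym_super_mario_bros

    env = gym_super_mario_bros.make('SuperMarioBros-v0')
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()             # random action, no learning yet
        state, reward, done, info = env.step(action)   # observation, reward, done flag, info dict
        env.render()
    env.close()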
  37. WORLDS & LEVELS (WORLD 1~4) SUPERMARIO WITH R.L [Screenshots of Worlds 1-4] env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  38. WORLDS & LEVELS (WORLD 5~8) SUPERMARIO WITH R.L [Screenshots of Worlds 5-8] env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  39. REWARD AND PENALTY SUPERMARIO WITH R.L Reward : getting closer to the flag, reaching the goal. Penalty : failing to reach the goal, time passing, moving away from the flag.
  40. STATE, ACTION SUPERMARIO WITH R.L
      env.observation_space.shape   # (240, 256, 3) = [height, width, channels]
      env.action_space.n            # 256
      SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B'], ['A'], ['left']]
      from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
      import gym_super_mario_bros
      env = gym_super_mario_bros.make('SuperMarioBros-v0')
      env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
  41. OBSERVATION SPACE SUPERMARIO WITH R.L
      env.observation_space.shape   # (240, 256, 3) = [height, width, channels]
  42. ACTION SPACE SUPERMARIO WITH R.L
      env.action_space.n            # 256 raw NES button combinations
  43. ACTION AFTER WRAPPER SUPERMARIO WITH R.L
      SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B'], ['A'], ['left']]
      from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
      import gym_super_mario_bros
      env = gym_super_mario_bros.make('SuperMarioBros-v0')
      env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
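A short sketch of what the wrapper above changes: before wrapping, the action space covers all 256 NES button combinations; after wrapping with SIMPLE_MOVEMENT it has one discrete action per listed move. The import of SIMPLE_MOVEMENT from gym_super_mario_bros.actions follows the library's README and is assumed to match the 7-entry list on the slides:

    from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
    from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
    import gym_super_mario_bros

    env = gym_super_mario_bros.make('SuperMarioBros-v0')
    print(env.observation_space.shape)   # (240, 256, 3) = [height, width, channels]
    print(env.action_space.n)            # 256 raw button combinations

    env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
    print(env.action_space.n)            # 7 discrete actions, one per SIMPLE_MOVEMENT entry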
  44. EXPLOITATION AND EXPLORATION SUPERMARIO WITH R.L
      import numpy as np
      def epsilon_greedy(q_value, step):
          # epsilon is the exploration rate (decayed with step, see the following slides)
          if np.random.rand() < epsilon:
              action = np.random.randint(len(q_value))   # Exploration
          else:
              action = np.argmax(q_value)                # Exploitation
          return action
      next_state, reward, done, info = env.step(action)
  45. EXPLORATION SUPERMARIO WITH R.L - with probability epsilon, choose a random action: action = np.random.randint(len(q_value))
  46. EXPLOITATION SUPERMARIO WITH R.L - otherwise, choose the greedy action: action = np.argmax(q_value)
  47. ENV.STEP( ) SUPERMARIO WITH R.L - apply the chosen action to the environment: next_state, reward, done, info = env.step(action)
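A self-contained sketch of the epsilon-greedy idea from the slides above, using a toy Q-value vector; the epsilon value and Q-values here are made up for illustration:

    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))   # exploration: random action
        return int(np.argmax(q_values))               # exploitation: greedy action

    q_values = np.array([0.1, 0.5, 0.2])              # toy action values for 3 actions
    counts = np.zeros(3, dtype=int)
    for _ in range(1000):
        counts[epsilon_greedy(q_values, epsilon=0.1)] += 1
    print(counts)                                     # mostly action 1, with occasional random picks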
  48. EXPLORATION RATE AND REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
      from collections import deque
      eps_max = 1
      eps_min = 0.1
      eps_decay_steps = 200000
      memory = deque([], maxlen=1000000)
      next_state, reward, done, info = env.step(action)
      memory.append((state, action, reward, next_state))   # store (St , At , Rt+1 , St+1 )
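A short sketch tying these pieces together: a bounded replay memory of (St, At, Rt+1, St+1) tuples, an exploration rate decayed from eps_max to eps_min over eps_decay_steps, and uniform minibatch sampling. The linear decay schedule and the batch size are assumptions, not taken from the deck:

    import random
    from collections import deque

    memory = deque([], maxlen=1000000)       # replay memory buffer
    eps_max, eps_min, eps_decay_steps = 1.0, 0.1, 200000

    def epsilon(step):
        # linear decay from eps_max to eps_min over eps_decay_steps (assumed schedule)
        return max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)

    def store(state, action, reward, next_state):
        memory.append((state, action, reward, next_state))   # one transition tuple

    def sample(batch_size=32):
        # uniform random minibatch; assumes len(memory) >= batch_size
        return random.sample(memory, batch_size)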
  55. MINIMIZE LOSS SUPERMARIO WITH R.L - squared TD error (Rt+1 + γt+1 maxa′ qθ (St+1 , a′) − qθ (St , At ))2 for a sampled transition (St , At , Rt+1 , St+1 )
      import tensorflow as tf
      loss = tf.reduce_mean(tf.square(y - Q_action))
      optimizer = tf.train.AdamOptimizer(learning_rate)
      training_op = optimizer.minimize(loss)
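The loss above uses y and Q_action without showing how they are formed. Here is a hedged sketch of one way to compute them from a minibatch of transitions with the TensorFlow 1.x API used in the deck; the tiny dense Q-network, the toy state dimension, and the tensor names are assumptions, not the presenter's actual model:

    import tensorflow as tf

    n_actions, state_dim, gamma = 7, 4, 0.99        # toy sizes, assumed for illustration

    states      = tf.placeholder(tf.float32, [None, state_dim])   # S_t
    next_states = tf.placeholder(tf.float32, [None, state_dim])   # S_{t+1}
    actions     = tf.placeholder(tf.int32,   [None])              # A_t
    rewards     = tf.placeholder(tf.float32, [None])              # R_{t+1}

    def q_net(x, reuse=False):
        # tiny stand-in for the real convolutional Q-network
        return tf.layers.dense(x, n_actions, name='q', reuse=reuse)

    q_values      = q_net(states)                   # q_theta(S_t, .)
    next_q_values = q_net(next_states, reuse=True)  # q_theta(S_{t+1}, .)

    # y = R_{t+1} + gamma * max_a' q_theta(S_{t+1}, a')
    y = rewards + gamma * tf.reduce_max(next_q_values, axis=1)

    # Q_action = q_theta(S_t, A_t): the value of the action actually taken
    Q_action = tf.reduce_sum(q_values * tf.one_hot(actions, n_actions), axis=1)

    loss = tf.reduce_mean(tf.square(tf.stop_gradient(y) - Q_action))
    training_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)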
  60. DOUBLE DQN SUPERMARIO WITH R.L [Diagram: the Env produces transitions (St , At , Rt+1 , St+1 ) that are stored in a Replay memory; sampled (s, a, r, s') tuples are the input to the Q-Network, which outputs action values Q(s, a)]
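A minimal NumPy sketch of how the Double DQN target differs from the plain DQN target: the online network selects the argmax action and the target network evaluates it. The q arrays below are toy values, not Mario network outputs:

    import numpy as np

    gamma = 0.99
    r = 1.0                                       # R_{t+1} from a sampled transition
    q_online_next = np.array([0.2, 0.9, 0.4])     # q_theta(S_{t+1}, .) from the online network
    q_target_next = np.array([0.3, 0.5, 0.6])     # q_theta_bar(S_{t+1}, .) from the target network

    # Plain DQN: max over the target network
    y_dqn = r + gamma * np.max(q_target_next)

    # Double DQN: online network picks the action, target network evaluates it
    a_star = int(np.argmax(q_online_next))
    y_double = r + gamma * q_target_next[a_star]

    print(y_dqn, y_double)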
  61. What has proven challenging so far? 1. Humans can learn incredibly quickly - Deep RL methods are usually slow 2. Humans can reuse past knowledge - Transfer learning in deep RL is an open problem 3. Not clear what the reward function should be 4. Not clear what the role of prediction should be
  62. HOW CAN WE ALLOW OUR A.I. SYSTEMS TO MAKE USE OF PRIOR KNOWLEDGE? REINFORCEMENT LEARNING https://ubisafe.org/explore/demeanure-clipart-prior-knowledge/
  63. QUESTIONS - Is it possible to build an environment for a problem I am curious about, rather than only using the environments that are provided? - Is there a way to reduce training time? Issues: training takes far too long. OpenAI Gym: from a few minutes to more than a week; a single Super Mario level: days; Sonic: limited to the server time provided by OpenAI; the Prosthetics challenge: expected to take months. Reinforcement learning requires an environment, and in practice only the environments that are provided can be used.
  64. ADVERSARIAL LEARNING - Agents sometimes cooperate, sharing information to reach a common goal, but there are also settings with clear winners and losers, such as boxing, soccer, table tennis, and tennis. - Each team consists of a Striker who tries to score and a Goalkeeper who blocks the goal. [Diagram: Striker, Goalkeeper, Object, Environment]
  65. TRAINING USING IMITATION LEARNING (Gravity, Agent, Ball) Initialization - The ball falls under gravity, and each Agent must receive it and send it over to the opposing Agent's area. - The Agent on the other side watches the ball and returns it. Environment / Reinforcement Learning
  66. TRAINING USING IMITATION LEARNING Start Training - The Agent tries out many different actions, searching for the actions that yield more Reward. - Learning this way takes a long time.
  67. TRAINING USING IMITATION LEARNING Imitation Learning (Teacher, Student) - To learn faster and more effectively, use Imitation Learning: watch and learn from an expert's behavior. - The Student learns by watching the Teacher. - The Teacher (a Player) selects appropriate actions so that the Student can learn quickly.
  68. Deep Q network [Diagram: the environment feeds transitions into a replay memory of (s, a, r, s') tuples, and the Q-network outputs Q(s, a)] Extensions: Double DQN, DIST (distributional), Noisy, Prioritized replay (sample the transitions that need more learning), Multi-step learning. Double DQN target: (Rt+1 + γt+1 qθ̄ (St+1 , argmaxa′ qθ (St+1 , a′)) − qθ (St , At ))2 , where qθ̄ is the target network.
  69. Two methods of choosing action 1. Learning the action value - Estimate action values and select actions based on those estimates - Policies would not even exist without the action-value estimates 2. Parameterized policy - Select actions without consulting a value function - A value function may still be used to learn the policy parameters, but it is not used as the criterion for selecting actions J(θ) - performance measure qπ(s, a) = Eπ [Gt ∣ St = s, At = a]
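A short sketch contrasting the two methods above: selecting actions from estimated action values versus sampling from a parameterized softmax policy that never consults a value function at selection time; the feature vector and parameter shapes are made up for illustration:

    import numpy as np

    n_actions = 3

    # 1. Action-value method: the policy is implicit in the estimates
    q_estimates = np.array([0.1, 0.7, 0.2])
    greedy_action = int(np.argmax(q_estimates))

    # 2. Parameterized policy: pi_theta(a | s) = softmax(theta^T phi(s))
    theta = np.random.randn(4, n_actions)         # policy parameters
    phi_s = np.random.randn(4)                    # state features phi(s)
    logits = phi_s @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    sampled_action = int(np.random.choice(n_actions, p=probs))

    print(greedy_action, probs, sampled_action)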
  70. CTRL MEMBERS Wonseok Jung - Modulabs CTRL Researcher, City University of New York Baruch College, Data Science Major, ConnexionAI Founder, Deep Learning College Reinforcement Learning Researcher. Kyunghwan Kim - Modulabs CTRL Researcher, Hansung University, Electronic Information Engineering Major, interested in AI, reinforcement learning, games. HyoJeong Jeon - Modulabs CTRL Researcher, Binghamton University, MS in Mechanical Engineering, math instructor at College Prep Institute, interested in applications of Unity ML-Agents in RL, GANs.
  71. CTRL MEMBERS SeungJae Lee - Modulabs CTRL Researcher, Princeton University, Mathematics Major, Scratchwork LLC co-founder, research: experiments with the Markoff surface with Matthew de Courcy-Ireland. Jwawon Seo - Modulabs CTRL Researcher, Hanyang University, MS in Bioinformatics (BIG Lab), Unity developer (Marvrus), interested in VR/AR, reinforcement learning, AI interaction. Vics Kwon - Modulabs CTRL Researcher, PhD, Quality Systems Lab, POSTECH, search/recommender system engineer, interested in RL applications everywhere.
  72. CTRL MEMBERS Yunkyu Choi - Modulabs CTRL Researcher, VR nerd. Suhyuk Park - Modulabs CTRL Researcher, Master's in Natural Language Processing, Korea University, Data Ingestion team leader at NCSOFT, interested in applying reinforcement learning techniques to data processing. Kurt - Modulabs CTRL Researcher, RL nerd.
  73. KAIR MEMBERS Wonseok Jung - Modulabs KAIR Researcher, City University of New York Baruch College, Data Science Major, ConnexionAI Founder, Deep Learning College Reinforcement Learning Researcher, interested in reinforcement learning, object detection, chatbots. Cheolhui Min - Modulabs KAIR Researcher, BS in Mechanical Engineering, Korea University, MSc candidate in Mechanical Engineering, Korea University, research field: deep reinforcement learning for robotics and robot control interfaces, interested in deep RL and optimal control, robotics, Chinese. Whi Kwon - Modulabs KAIR Researcher, Sogang University Chemical and Biomolecular Engineering, Medipixel AI researcher, interested in RL, manipulators, learning for robotics, robot control interfaces.
  74. KAIR MEMBERS Subin Yang - Modulabs KAIR Researcher, Kyunghee University, Mechanical and Software Engineering (dual major), interested in robotics and simulation, RL, ML-Agents. Seoyeon Yang - Modulabs KAIR Researcher, Seoul National University, Mechanical & Aerospace Engineering Major, space nerd, interested in robotics, navigation and control, Battleground, Macaron. Juntae Kim - Modulabs KAIR Researcher, Daejeon University, Electronic Information Communication Engineering, interested in reinforcement learning, sim-to-real.
  75. KAIR MEMBERS Donghyeon Kim - Modulabs KAIR Researcher, Sungkyunkwan University Mechanical Engineering BS, Seoul National University MS student, interested in humanoids, torque control, RL. Jeonghoon Kim - Modulabs KAIR Researcher, Korea University, Control, Robotics and Systems, interested in quantized neural networks and ASIC implementation, state estimation, RL. Jinwoo Park - Modulabs KAIR Researcher, BS in Computer Science and Engineering, research engineer at Medipixel, interested in convex optimization, computer vision, and reinforcement learning.
  76. KAIR MEMBERS Sihyun Choi - Modulabs KAIR Researcher, BS in Mathematics and Computer Science, Korea University, MS course in Electrical Engineering, Seoul National University, research field: stochastic control processes. SeungJae Ryan Lee - Modulabs KAIR Researcher, Princeton University, Mathematics, end-to-end AI researcher, Deep Learning Zero To All (PyTorch season) content contributor, interests: data-efficient RL, curiosity-driven learning, meta-learning.