Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states
A: set of actions
T: transitions
R: reward
γ: discount factor
Objective: Find a policy π(a | s) that maximizes total discounted reward
(Diagram: trajectory s_t, a_t, r_{t+1}, s_{t+1}, ...)
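The "total discounted reward" in the objective can be made concrete with a small sketch. It assumes a finite sequence of rewards r_{t+1}, r_{t+2}, ... collected along one trajectory and computes the sum of γ^k r_{t+k+1}. The function name discounted_return is hypothetical and does not come from the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards through the rewards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Rewards 1, 0, 2 with gamma = 0.5: 1 + 0.5 * 0 + 0.25 * 2
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

A policy π(a | s) is better, in the sense of the objective, when the expected value of this quantity over the trajectories it generates is larger.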
Algorithm: MaxQInit
(Diagram: Task M_1, Task M_2, ..., Task M_m, ...)
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
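The slide names MaxQInit but does not spell out its rule. The following is a minimal sketch, assuming the rule is to initialize each Q(s, a) for a new task to the maximum of the Q-values learned on the m earlier tasks. The function name max_q_init, the dictionary representation of Q-tables, and the default argument (used when no earlier task has a value for a pair) are all hypothetical.

```python
def max_q_init(prior_q_tables, states, actions, default):
    """Build an initial Q-table from the Q-tables of earlier tasks.

    Assumption: each prior table maps (state, action) -> value, and the
    initial value for a pair is the maximum over the tables that contain it.
    """
    q_init = {}
    for state in states:
        for action in actions:
            key = (state, action)
            seen = [q[key] for q in prior_q_tables if key in q]
            # Fall back to the caller-supplied default when no task has a value.
            q_init[key] = max(seen) if seen else default
    return q_init


# Two earlier tasks over one state and two actions (made-up numbers).
q_task_1 = {("s0", "a0"): 1.0, ("s0", "a1"): 3.0}
q_task_2 = {("s0", "a0"): 2.0}
q_start = max_q_init([q_task_1, q_task_2], ["s0"], ["a0", "a1"], default=10.0)
```

Under this reading, the theorem's requirement that m be sufficiently large means that enough earlier tasks have been seen for the maximum to remain an optimistic estimate for a new task with high probability.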