(1803.07246)
• assume factorized policy: π(a|s) = π1(a1|s) … πd(ad|s)
• “very common” for action space {a = (a1, …, ad)}
• e.g. Gaussian with diagonal covariance
• baseline b(s, a) = b1 + … + bd
• use all the other dims' values a−i = (a1, …, ai−1, ai+1, …, ad) in baseline for i-th dim, because unbiased: