Slide 18
• The proposed policy models the contextual, nonstationary reward with the
following linear Gaussian state space (LGSS) model.
Modeling reward using LGSS model
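$$r_t = Z_t \alpha_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2),$$
$$\alpha_{t+1} = T_t \alpha_t + \eta_t R_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \quad t = 1, \dots, \tau,$$
$$\alpha_1 \sim N_d(\mu_1, P_1).$$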
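To make the generative model concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a scalar reward, treats each Z_t as a length-d context row, holds T and R fixed over time, and takes P_1 to be diagonal so that the initial state can be sampled coordinate by coordinate. The function name `simulate_lgss` and all parameter names are hypothetical.

```python
import random


def simulate_lgss(Z, T, R, sigma_eps, sigma_eta, mu1, p1_diag, seed=0):
    """Draw one reward sequence r_1..r_tau from the LGSS model.

    Z       : list of tau context rows, each of length d (stands in for Z_t)
    T       : d x d transition matrix, held fixed over time (assumption)
    R       : length-d vector, held fixed over time (assumption)
    p1_diag : diagonal of P_1 (a diagonal P_1 is an assumption of this sketch)
    """
    rng = random.Random(seed)
    d = len(mu1)
    # Initial state: alpha_1 ~ N_d(mu_1, P_1) with diagonal P_1
    alpha = [rng.gauss(mu1[i], p1_diag[i] ** 0.5) for i in range(d)]
    rewards = []
    for z_t in Z:
        # Observation equation: r_t = Z_t alpha_t + eps_t
        eps_t = rng.gauss(0.0, sigma_eps)
        rewards.append(sum(z * a for z, a in zip(z_t, alpha)) + eps_t)
        # State equation: alpha_{t+1} = T alpha_t + eta_t R
        eta_t = rng.gauss(0.0, sigma_eta)
        alpha = [
            sum(T[i][j] * alpha[j] for j in range(d)) + eta_t * R[i]
            for i in range(d)
        ]
    return rewards
```

Setting T to the identity and R to a vector of ones gives a random-walk state, which is one simple way to express a slowly drifting reward.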
Advantages
• The model can naturally represent the temporal change of the state.
• The design of the coefficient matrices (Z, T, R) provides flexible representations for the
contextual and nonstationary problem setting.
• The estimator algorithm is lightweight, so it does not affect response time.
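The slide does not name the estimator. As an illustration only, the sketch below assumes it is the Kalman filter, the standard estimator for linear Gaussian state space models. It shows why a per-reward update can be cheap: with a scalar reward the innovation variance F is a scalar, so the measurement update needs no matrix inversion. The function name `kalman_step` and its parameters are hypothetical, T and R are held fixed, and P is assumed symmetric.

```python
def kalman_step(a, P, z, r, T, R, sigma_eps, sigma_eta):
    """One filter step: condition on reward r, then predict the next state.

    a, P : predicted mean (length d) and covariance (d x d, symmetric) of alpha_t
    z    : context row Z_t (length d)
    r    : observed scalar reward r_t
    Returns the predicted mean and covariance of alpha_{t+1}.
    """
    d = len(a)
    # Innovation v = r - Z a and its scalar variance F = Z P Z' + sigma_eps^2
    v = r - sum(z[i] * a[i] for i in range(d))
    Pz = [sum(P[i][j] * z[j] for j in range(d)) for i in range(d)]
    F = sum(z[i] * Pz[i] for i in range(d)) + sigma_eps ** 2
    # Measurement update; F is a scalar, so this is a plain division
    a_f = [a[i] + Pz[i] * v / F for i in range(d)]
    P_f = [[P[i][j] - Pz[i] * Pz[j] / F for j in range(d)] for i in range(d)]
    # Time update: a' = T a_f and P' = T P_f T' + sigma_eta^2 R R'
    a_next = [sum(T[i][k] * a_f[k] for k in range(d)) for i in range(d)]
    TP = [
        [sum(T[i][k] * P_f[k][j] for k in range(d)) for j in range(d)]
        for i in range(d)
    ]
    P_next = [
        [
            sum(TP[i][k] * T[j][k] for k in range(d))
            + sigma_eta ** 2 * R[i] * R[j]
            for j in range(d)
        ]
        for i in range(d)
    ]
    return a_next, P_next
```

Each call touches only d-dimensional vectors and d x d matrices, so the work per observed reward stays small when d is modest. A policy would use the returned mean and covariance to score the next context.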