Slide 1

Slide 1 text

Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System. Yusuke Miyake [1][2] and Tsunenori Mine [1]. [1] Kyushu University. [2] Pepabo R&D Institute, GMO Pepabo, Inc. The 2023 IEEE Conference on Systems, Man, and Cybernetics, October 1-4, 2023.

Slide 2

Slide 2 text

Agenda: 1. Introduction 2. Related works 3. Proposal 4. Empirical evaluation on real data and results 5. Conclusion

Slide 3

Slide 3 text

1. Introduction

Slide 4

Slide 4 text

Background • Selecting an optimal recommendation method from the many candidates is crucial for an EC site. • The effectiveness of each method cannot be known in advance. • Continuous comparative evaluation in the actual environment is therefore essential. • However, such evaluation results in opportunity loss. (Diagram: Users → Best Recommendation Method → Items.)

Slide 5

Slide 5 text

Previous studies • The reduction of opportunity loss can be cast as a multi-armed bandit (MAB) problem. • Many meta-recommender systems (meta-RSs) that use multi-armed bandit policies to select an efficient recommendation method have been proposed. (Diagram: Users → Best Recommendation Method → Items, with a multi-armed bandit selecting the best method.)

Slide 6

Slide 6 text

Issue in previous studies • The MAB policy used by meta-RSs must take the following characteristics of recommendation methods into account: 1. The effectiveness of a recommendation method changes over time. 2. The performance of a recommendation method depends on the context. 3. The performance also depends on the response time. • No existing policy satisfies all three requirements derived from these characteristics.

Slide 7

Slide 7 text

Proposed MAB policy for the meta-RS: Linear Gaussian state space bandits (LGSS bandits) • The policy addresses all three factors causing opportunity loss in the selection of recommendation methods: 1. For temporal change, changes in effectiveness are modeled with an LGSS model. 2. For context, a novel context-aware coefficient matrix for the LGSS model is designed. 3. For response time, a linear Kalman filter is used to estimate the state. • A probability matching method for bandits finds the best recommendation method based on the estimated state.

Slide 8

Slide 8 text

2. Related works

Slide 9

Slide 9 text

Multi-armed bandit (MAB) problem • The player must select one arm from multiple arms to maximize the reward. • The reward is given stochastically based on the chosen arm. • To find the optimal arm, the player should balance exploitation and exploration. • Two types of extended problems have been developed: the contextual MAB problem and the nonstationary MAB problem.

Slide 10

Slide 10 text

Basic problem setting (Diagram: the player selects among Arm0, Arm1, and Arm2, comparing the estimated probability distributions of rewards with the true ones to find the best arm.)

Slide 11

Slide 11 text

Contextual problem setting (Diagram: given context = 0, the player selects the best arm for context = 0 based on the estimated reward distributions.)

Slide 12

Slide 12 text

Contextual problem setting (Diagram: given context = 1, the best arm differs from the one for context = 0.)

Slide 13

Slide 13 text

Nonstationary problem setting (Diagram: the reward distributions change at t = 100; at t = 99 the player selects the arm that is best under the current distributions.)

Slide 14

Slide 14 text

Nonstationary problem setting (Diagram: at t = 199, after the distributions have changed, a different arm is best.)

Slide 15

Slide 15 text

Meta-recommender systems (meta-RSs) Policies of meta-RSs • Previous studies have used meta-RSs to select the best recommendation method based on evaluation in the actual environment. • However, these studies did not employ a contextual and nonstationary policy. Existing policies • Existing contextual and nonstationary policies cannot satisfy one requirement, response time, because their mechanisms for tracking change are time-consuming. • By contrast, a policy that satisfies the response-time requirement with a lightweight mechanism cannot satisfy the other requirement, temporal change.

Slide 16

Slide 16 text

3. Proposal

Slide 17

Slide 17 text

Modeling reward using the LGSS model • The proposed policy assumes the following contextual and nonstationary reward model, a linear Gaussian state space (LGSS) model:

r_t = Z_t α_t + ε_t, ε_t ~ N(0, σ_ε²)
α_{t+1} = T_t α_t + R_t η_t, η_t ~ N(0, σ_η²), t = 1, …, τ
α_1 ~ N_d(μ_1, P_1)

• The model represents that the state α_t follows the previous state α_{t-1} and the reward r_t follows the state α_t. (Diagram: α_t → r_t and α_t → α_{t+1}, with noise terms ε_t and η_t and coefficients Z_t, T_t, R_t.)

Slide 18

Slide 18 text

Modeling reward using the LGSS model • The proposed policy assumes the contextual and nonstationary reward model given by the LGSS model:

r_t = Z_t α_t + ε_t, ε_t ~ N(0, σ_ε²)
α_{t+1} = T_t α_t + R_t η_t, η_t ~ N(0, σ_η²), t = 1, …, τ
α_1 ~ N_d(μ_1, P_1)

Advantages • The model naturally represents the temporal change of the state. • The design of the coefficient matrices (Z, T, R) provides flexible representations for the contextual and nonstationary problem setting. • A lightweight estimation algorithm does not affect the response time.

Slide 19

Slide 19 text

Context-aware coefficient matrices for the LGSS model

Slide 20

Slide 20 text

Context-aware coefficient matrix • The designed matrices handle time-varying contexts in the LGSS model. • These matrices vary according to the context x_t at each time point t:

Z_t = x_tᵀ, R_t = x_t, T_t = I_m, where x_t ∈ {0, 1}^m

Slide 21

Slide 21 text

Context-aware coefficient matrix • The designed matrices handle time-varying contexts in the LGSS model and vary according to the context x_t at each time point t. • When these matrices (Z_t = x_tᵀ, R_t = x_t, T_t = I_m, with x_t ∈ {0, 1}^m) are applied to the model:

r_t = x_tᵀ α_t + ε_t, ε_t ~ N(0, σ_ε²)
α_{t+1} = α_t + η_t x_t, η_t ~ N(0, σ_η²), t = 1, …, τ

• With α_t = θ_t, this equation is consistent with the reward formulation in the contextual MAB. • It enables the estimated state to be used to solve the MAB.
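As a concrete sketch (not the paper's code; the dimension m and the noise scales are illustrative assumptions), the context-aware coefficients and one step of this reward model can be written as:

```python
import numpy as np

m = 4                                  # number of context elements (hypothetical)
rng = np.random.default_rng(0)

# Context-aware coefficients: Z_t = x_t^T, R_t = x_t, T_t = I_m
x_t = np.zeros(m); x_t[2] = 1.0        # binary context at time t, here one-hot
Z_t = x_t.reshape(1, m)                # observation coefficients (1 x m)
R_t = x_t.reshape(m, 1)                # noise-loading vector (m x 1)
T_t = np.eye(m)                        # state transition (identity)

# One step of the reward model under these matrices
alpha_t = rng.normal(size=m)           # current state
sigma_eps, sigma_eta = 0.1, 0.05       # noise scales (assumed)
r_t = float(Z_t @ alpha_t) + rng.normal(0.0, sigma_eps)   # r_t = x_t^T a_t + eps_t
eta_t = rng.normal(0.0, sigma_eta)
alpha_next = T_t @ alpha_t + eta_t * R_t.ravel()          # a_{t+1} = a_t + eta_t x_t
```

Because R_t = x_t is one-hot here, only the state element for the active context changes between steps, which is exactly the behavior the next slide describes.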

Slide 22

Slide 22 text

Context-aware coefficient matrix • When the matrices Z_t = x_tᵀ, R_t = x_t, T_t = I_m (x_t ∈ {0, 1}^m) are applied to the model:

r_t = x_tᵀ α_t + ε_t, ε_t ~ N(0, σ_ε²)
α_{t+1} = α_t + η_t x_t, η_t ~ N(0, σ_η²), t = 1, …, τ

• The error η_t is added only to the elements of the state α_t that correspond to the active elements of the context x_t. • This design models state transitions according to the frequency of occurrence of context elements, which is expected to stabilize the accuracy of predictions.

Slide 23

Slide 23 text

Context-aware coefficient matrix • The design is based on the structural time-series model expressed as an LGSS model. • For example, if a trend can be assumed for the state transition, the matrices can be extended as follows:

Z_t = (x_tᵀ, 0_{1×m}), R_t = [x_t 0_{m×1}; 0_{m×1} x_t], T_t = [I_m diag(x_t); 0_{m×m} I_m]

with x_t ∈ {0, 1}^m and α_t[0:m] = θ_t.

Slide 24

Slide 24 text

MAB policy using the Kalman filter

Slide 25

Slide 25 text

Estimating the state using the linear Kalman filter • The Kalman filter continuously estimates the state by repeating one-step-ahead prediction and filtering. • The state α follows a multivariate normal distribution N_m with mean μ and variance-covariance P.

Cycle of state updates in the linear Kalman filter (t = t + 1):
Prediction of observation: r̂_t = Z_t μ_t
Error of observation: v_t = r_t − r̂_t, F_t = Z_t P_t Z_tᵀ + σ_ε²
Filtering: μ_{t|t} = μ_t + K v_t = μ_t + (P_t Z_tᵀ F_t⁻¹) v_t, P_{t|t} = P_t − K F_t Kᵀ
One-step-ahead prediction: μ_{t+1} = T_t μ_{t|t}, P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ
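A minimal NumPy sketch of this update cycle (the variable names follow the slide's notation; shapes and noise variances are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def kalman_step(mu, P, Z, T, R, r, sigma_eps2, sigma_eta2):
    """One cycle of the linear Kalman filter: predict the observation,
    filter with the new reward r, then do the one-step-ahead prediction.
    mu: (m,), P: (m, m), Z: (1, m), T: (m, m), R: (m, 1), r: (1,)."""
    r_hat = Z @ mu                               # prediction of observation
    v = r - r_hat                                # error of observation
    F = Z @ P @ Z.T + sigma_eps2                 # innovation variance (1 x 1)
    K = P @ Z.T @ np.linalg.inv(F)               # Kalman gain (m x 1)
    mu_f = mu + (K @ v).ravel()                  # filtering: mean
    P_f = P - K @ F @ K.T                        # filtering: covariance
    mu_next = T @ mu_f                           # one-step-ahead prediction
    P_next = T @ P_f @ T.T + sigma_eta2 * (R @ R.T)
    return mu_next, P_next
```

With the context-aware matrices (Z_t = x_tᵀ, T_t = I, R_t = x_t), filtering shrinks the variance of the active context element and the prediction step inflates it again, the behavior highlighted on the later slides.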

Slide 26

Slide 26 text

Arm selection by the probability matching method • The probability matching method uses the state estimates from the Kalman filter: 1. The mean μ_t and variance-covariance P_t of the state α_t (= θ_t) are estimated for each arm. 2. A parameter θ̃_t^(k) for each arm k is sampled from N_m using the estimates. 3. The arm with the largest inner product between the sampled parameter θ̃_t^(k) and the context x_t is selected.
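The three steps above can be sketched as follows (a hypothetical `select_arm` helper; the per-arm estimates `mus` and `Ps` are assumed to come from the Kalman filter):

```python
import numpy as np

def select_arm(mus, Ps, x_t, rng):
    """Probability matching (Thompson-style) arm selection:
    sample theta_tilde^(k) ~ N(mu_k, P_k) for each arm k, then pick the
    arm with the largest inner product <theta_tilde^(k), x_t>."""
    scores = []
    for mu_k, P_k in zip(mus, Ps):
        theta_tilde = rng.multivariate_normal(mu_k, P_k)  # sampled parameter
        scores.append(float(theta_tilde @ x_t))           # predicted reward in context x_t
    return int(np.argmax(scores))
```

Arms with a wide posterior (large P_k) produce more variable samples and so are still explored occasionally, which is what makes the virtual-exploration mechanism on the later slides work.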

Slide 27

Slide 27 text

Virtual exploration using missing observations

Slide 28

Slide 28 text

Estimating the state using the linear Kalman filter (again) • The Kalman filter continuously estimates the state by repeating one-step-ahead prediction and filtering. • The state α follows a multivariate normal distribution N_m with mean μ and variance-covariance P.

Cycle of state updates in the linear Kalman filter (t = t + 1):
Prediction of observation: r̂_t = Z_t μ_t
Error of observation: v_t = r_t − r̂_t, F_t = Z_t P_t Z_tᵀ + σ_ε²
Filtering (reduces the values of P): μ_{t|t} = μ_t + K v_t = μ_t + (P_t Z_tᵀ F_t⁻¹) v_t, P_{t|t} = P_t − K F_t Kᵀ
One-step-ahead prediction (increases the values of P): μ_{t+1} = T_t μ_{t|t}, P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ

Slide 29

Slide 29 text

Virtual exploration using missing observations • The Kalman filter can handle missing observations appropriately. • The proposed policy uses this process as the update operation for the evaluation of unselected arms in the MAB problem.

Cycle of processing missing observations in the linear Kalman filter (t = t + 1):
Filtering (keeps the values of P): μ_{t|t} = μ_t, P_{t|t} = P_t
One-step-ahead prediction (increases the values of P): μ_{t+1} = T_t μ_{t|t}, P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ

Slide 30

Slide 30 text

Virtual exploration using missing observations • The Kalman filter can handle missing observations appropriately. • The proposed policy uses this process as the update operation for the evaluation of unselected arms in the MAB problem.

Cycle of processing missing observations in the linear Kalman filter (t = t + 1):
Filtering (keeps the values of P): μ_{t|t} = μ_t, P_{t|t} = P_t
One-step-ahead prediction (increases the values of P): μ_{t+1} = T_t μ_{t|t}, P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ

• By introducing this behavior into the policy, exploration of unselected arms is encouraged through the mechanism of the probability matching method.
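The missing-observation update can be sketched as follows (assuming T_t = I_m and the context-aware R_t = x_t from the earlier slides; the names are illustrative):

```python
import numpy as np

def missing_obs_step(mu, P, x_t, sigma_eta2):
    """Update for an unselected arm: filtering is skipped (no observation),
    so (mu, P) pass through unchanged, and the one-step-ahead prediction
    inflates P by sigma_eta^2 * x_t x_t^T. The widened covariance enlarges
    the sampling distribution, encouraging exploration of this arm."""
    mu_f, P_f = mu, P                              # filtering keeps mu and P
    mu_next = mu_f                                 # T_t = I_m
    P_next = P_f + sigma_eta2 * np.outer(x_t, x_t)
    return mu_next, P_next
```

Only the variance of the state elements matching the current context grows, so an arm is pushed toward exploration specifically in the contexts where it has not been tried.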

Slide 31

Slide 31 text

4. Empirical evaluation on real data and results

Slide 32

Slide 32 text

Evaluation setting • The data were collected from an actual EC site. • Each product is assigned to one of 18 categories. • Four recommendation methods are used: browsing path, demographic, LLR, and similar image. • The counts of recommendations of each method, and of the clicks on each recommendation, were collected per product category hourly from June 20, 2019, through July 22, 2019. • The total number of recommendations is 1,488,626.

Fig. 2. Change of CTR for each recommendation method (click-through rate over about 600 hours): the method with the highest CTR switches over time, and the typical characteristics of each recommendation method appear.

Slide 33

Slide 33 text

Evaluation setting • We simulated the number of clicks obtained by the selected recommendation method over the period in which the data were acquired. • Assumptions: • User clicks on recommendations follow a Bernoulli distribution with the CTR as its parameter, where the recommendation methods are selected by the MAB policy. • Each recommendation method k has an 18-dimensional parameter θ_t^(k), matching the number of product categories. • The CTR is calculated as the inner product of θ_t^(k) and the context x_t. • The context x_t is a 1-hot vector of the product category that the user is browsing at time t.
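Under these assumptions, the simulation loop can be sketched as follows (the CTR values and the random arm choice are placeholders, not the paper's data or policy):

```python
import numpy as np

rng = np.random.default_rng(1)
n_categories, n_arms = 18, 4
# Per-arm CTR parameter theta^(k), one value per product category (hypothetical).
theta = rng.uniform(0.02, 0.15, size=(n_arms, n_categories))

clicks = 0
for t in range(1000):
    x_t = np.zeros(n_categories)
    x_t[rng.integers(n_categories)] = 1.0   # 1-hot product category at time t
    k = rng.integers(n_arms)                # arm chosen by some MAB policy (stub)
    ctr = float(theta[k] @ x_t)             # CTR = <theta^(k), x_t>
    clicks += rng.binomial(1, ctr)          # Bernoulli click
```

In the actual evaluation the stub arm choice would be replaced by each compared policy (LGSS bandits, LinUCB variants, etc.), and theta would follow the hourly CTRs measured on the EC site.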

Slide 34

Slide 34 text

Effect of context and temporal change • The proposed LGSS bandits achieved the highest cumulative reward, 201,330.4. • LGSS bandits and adaptive Thompson sampling reduce overall opportunity loss, not only during best-arm switching but also during stationary periods.

(Figure: cumulative regret and best arm rate over about 600 hours for adaptive Thompson sampling, decay LinUCB, dynamic LinUCB, LGSS bandits, linear Thompson sampling, LinUCB, neural linear, select-best-arm-at-first, and select-random-arm.)

Slide 35

Slide 35 text

Effect of context and temporal change • Comparison of the ratio of optimal arms between categories where the recommendation method with the highest CTR switches and categories where it does not. • The proposed policy continuously selects the best arm in both situations.

(Figure: change of click-through rate and best arm rate for the Plants and Stationery categories over three periods, comparing adaptive Thompson sampling, dynamic LinUCB, LGSS bandits, and linear Thompson sampling.)

Slide 36

Slide 36 text

Effect of the response time • We measured the elapsed time of arm selection and of evaluation updates for each policy (2,665 trials, 18 dimensions, 4 arms). • For arm selection, the LGSS bandits took 0.24 ms per selection; all policies took less than 0.5 ms, which is considered sufficient performance with no significant impact. • For evaluation updates, the elapsed time was sufficiently short compared with the unit time (1 h). • However, some policies are not scalable to an increase in the number of trials per unit time.

(Figure: elapsed time per arm selection in ms, adaptive Thompson sampling 0.48, decay LinUCB 0.19, dynamic LinUCB 0.04, LGSS bandits 0.24, linear Thompson sampling 0.29, LinUCB 0.04, neural linear 0.27, and elapsed time per 2,665 evaluation updates in seconds on a log scale.)

Slide 37

Slide 37 text

5. Conclusion

Slide 38

Slide 38 text

Conclusion • We proposed an LGSS-based MAB policy for meta-RSs that uses the Kalman filter to optimize the selection of recommendation methods continuously and quickly according to context and temporal change. • The experimental results demonstrated that the proposed policy can effectively reduce opportunity loss during evaluation and increase the cumulative number of clicks compared with baseline policies. • The results also revealed that the proposed policy adds no overhead to the response time of the meta-RS. • In the future, we will develop general-purpose policies that can quickly handle nonlinearity.

Slide 39

Slide 39 text

No content