
Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System


IEEE SMC 2023
https://ieeesmc2023.org/
October 1-4, 2023, Honolulu, Oahu, Hawaii, USA

monochromegane

October 09, 2023


Transcript

  1. Contextual and Nonstationary Multi-armed Bandits Using the Linear Gaussian State Space Model for the Meta-Recommender System

    Yusuke Miyake [1][2] and Tsunenori Mine [1]. [1] Kyushu University; [2] Pepabo R&D Institute, GMO Pepabo, Inc. October 1-4, 2023. The 2023 IEEE Conference on Systems, Man, and Cybernetics.
  2. Agenda

    1. Introduction 2. Related works 3. Proposal 4. Empirical evaluation on real data and results 5. Conclusion
  3. Background

    • Selecting an optimal recommendation method from many candidates is crucial for an EC site. • Their effectiveness cannot be known in advance. • Continuous comparative evaluation in an actual environment is essential. • However, it results in opportunity loss. [Diagram: Users → Best Recommendation Method → Items]
  4. Previous studies

    • The reduction of opportunity loss is formulated as a multi-armed bandit (MAB) problem. • Many meta-recommender systems (meta-RSs) that use multi-armed bandit policies to select the most effective recommendation method have been proposed. [Diagram: a multi-armed bandit selects the best recommendation method between users and items]
  5. Issue in previous studies

    • The MAB policy used by meta-RSs must take into account the following characteristics of recommendation methods: 1. The effectiveness of a recommendation method changes over time. 2. The performance of a recommendation method depends on the context. 3. The performance also depends on response time. • No policy yet exists that satisfies all three requirements derived from these characteristics.
  6. Proposed MAB policy for the meta-RS: Linear Gaussian state space bandits (LGSS bandits)

    • It addresses all three factors causing opportunity loss in the selection of recommendation methods: 1. For temporal change, effectiveness changes are modeled with an LGSS model. 2. For context, a novel context-aware coefficient matrix for the LGSS model is designed. 3. For response time, a linear Kalman filter is used to estimate the state. • A probability matching method for bandits is used to find the best recommendation method based on the estimated state.
  7. Multi-armed bandit (MAB) problem

    • The player must select one arm from multiple arms to maximize the reward. • The reward is given stochastically based on the chosen arm. • To find the optimal arm, the player should balance exploitation and exploration. • Two types of extended problems have been developed: the contextual MAB problem and the nonstationary MAB problem.
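The exploitation-exploration balance above can be illustrated with a minimal Thompson-sampling player for the basic Bernoulli MAB. This is NOT the paper's proposed policy, just a standard baseline sketch; the function name, arm probabilities, and horizon are illustrative assumptions.

```python
import random

# Illustrative Thompson sampling for the basic Bernoulli MAB
# (not the proposed LGSS policy): keep a Beta posterior per arm,
# sample from each posterior, and play the arm with the largest sample.
def thompson_run(true_probs, horizon, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    wins = [1.0] * k    # Beta posterior alpha (successes + prior 1)
    losses = [1.0] * k  # Beta posterior beta (failures + prior 1)
    total = 0
    for _ in range(horizon):
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        total += reward
    return total
```

Sampling from the posterior (rather than always playing the current best estimate) is what lets the player keep exploring arms it is still uncertain about.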
  8. Basic problem setting

    [Diagram: the player estimates each arm's probability distribution of rewards and selects the best arm among Arm0, Arm1, Arm2]
  9. Contextual problem setting

    [Diagram: given context = 0, the player selects the best arm for context = 0 from its estimated probability distributions]
  10. Contextual problem setting

    [Diagram: given context = 1, the best arm differs from the one for context = 0]
  11. Nonstationary problem setting

    [Diagram: the arms' reward distributions differ between t = 0~ and t = 100~; at t = 99 the player selects the best arm for that period]
  12. Nonstationary problem setting

    [Diagram: at t = 199 the reward distributions have changed, so the best arm differs from the one at t = 99]
  13. Meta-recommender systems (meta-RSs)

    Policies of meta-RSs • Studies have used meta-RSs to select the best recommendation method based on evaluation in the actual environment. • These studies did not employ contextual and nonstationary policies. Existing policies • Existing contextual and nonstationary policies cannot satisfy one requirement, response time, because their mechanisms for tracking change are time-consuming. • By contrast, a policy that satisfies that requirement with a lightweight mechanism cannot satisfy the other requirement, temporal change.
  14. Modeling reward using the LGSS model

    • The proposed policy assumes the following contextual and nonstationary reward model, a linear Gaussian state space (LGSS) model:

    r_t = Z_t α_t + ε_t,  ε_t ~ N(0, σ_ε²)
    α_{t+1} = T_t α_t + R_t η_t,  η_t ~ N(0, σ_η²),  t = 1, …, τ
    α_1 ~ N_d(μ_1, P_1)

    • It represents that the state α_t follows the previous state α_{t−1}, and the reward r_t follows the state α_t.
  15. Modeling reward using the LGSS model

    • The proposed policy assumes the contextual and nonstationary reward model given by the LGSS model:

    r_t = Z_t α_t + ε_t,  ε_t ~ N(0, σ_ε²)
    α_{t+1} = T_t α_t + R_t η_t,  η_t ~ N(0, σ_η²),  t = 1, …, τ
    α_1 ~ N_d(μ_1, P_1)

    Advantages • The model can naturally represent the temporal change of the state. • The design of the coefficient matrices (Z, T, R) provides flexible representations for the contextual and nonstationary problem setting. • A lightweight estimation algorithm does not affect response time.
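The reward model on this slide can be simulated in a few lines. The sketch below uses the context-aware design Z_t = x_tᵀ, R_t = x_t, T_t = I_m that the following slides introduce; the noise scales, initial state distribution, and function name are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Minimal simulation of the LGSS reward model:
#   r_t     = Z_t a_t + eps_t,   eps_t ~ N(0, sigma_eps^2)
#   a_{t+1} = a_t + eta_t x_t,   eta_t ~ N(0, sigma_eta^2)
def simulate_lgss_rewards(contexts, sigma_eps=0.05, sigma_eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    m = contexts.shape[1]
    alpha = rng.normal(0.1, 0.05, size=m)   # a_1 ~ N_m(mu_1, P_1), assumed values
    rewards = []
    for x in contexts:                      # x_t in {0,1}^m (one-hot context)
        r = float(x @ alpha) + rng.normal(0.0, sigma_eps)  # observation step
        rewards.append(r)
        alpha = alpha + rng.normal(0.0, sigma_eta) * x     # state transition
    return np.array(rewards)
```

Because the state drifts by a small Gaussian step at each time, the reward for each context element changes gradually, which is the nonstationarity the model is meant to capture.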
  16. Context-aware coefficient matrix

    • The designed matrices handle time-varying contexts in the LGSS model. • These matrices vary according to the context x_t at each time point t:

    Z_t = x_tᵀ,  R_t = x_t,  T_t = I_m,  x_t ∈ {0,1}^m
  17. Context-aware coefficient matrix

    • The designed matrices handle time-varying contexts in the LGSS model and vary according to the context x_t at each time point t: Z_t = x_tᵀ, R_t = x_t, T_t = I_m, x_t ∈ {0,1}^m. • When these matrices are applied to the model:

    r_t = x_tᵀ α_t + ε_t,  ε_t ~ N(0, σ_ε²)
    α_{t+1} = α_t + η_t x_t,  η_t ~ N(0, σ_η²),  t = 1, …, τ

    • With α_t = θ_t, the equation is consistent with the reward formulation in contextual MAB. • It enables the estimated state to be used to solve the MAB.
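The matrix design on this slide can be written out directly; the helper name is ours, and the one-hot context in the usage is an illustrative input.

```python
import numpy as np

# Sketch of the context-aware coefficient matrices for a one-hot
# context x_t in {0,1}^m:
def coefficient_matrices(x):
    m = x.shape[0]
    Z = x.reshape(1, m)  # Z_t = x_t^T: only the active element is observed
    R = x.reshape(m, 1)  # R_t = x_t: state noise enters only that element
    T = np.eye(m)        # T_t = I_m: each element follows a random walk
    return Z, R, T
```

Substituting these into the general LGSS equations yields exactly the reduced model shown above, r_t = x_tᵀ α_t + ε_t and α_{t+1} = α_t + η_t x_t.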
  18. Context-aware coefficient matrix

    • In the resulting model (r_t = x_tᵀ α_t + ε_t, α_{t+1} = α_t + η_t x_t), the error η_t is added only to the elements of the state α_t corresponding to the context x_t. • The design enables the modeling of state transitions according to the frequency of occurrence of context elements, which is expected to stabilize the accuracy of predictions.
  19. Context-aware coefficient matrix

    • The design is based on the structural time-series model using an LGSS model. • For example, if a trend can be assumed for the state transition, it can be extended to the following matrices (with x_t ∈ {0,1}^m and α_t[0:m] = θ_t):

    Z_t = (x_tᵀ, 0_{1×m}),  R_t = [ x_t, 0_{m×1} ; 0_{m×1}, x_t ],  T_t = [ I_m, diag(x_t) ; 0_{m×m}, I_m ]
  20. Estimating the state using the linear Kalman filter

    • The Kalman filter continuously estimates the state by repeating one-step ahead prediction and filtering. • The state α follows a multivariate normal distribution N_m with mean μ and variance-covariance P. Cycle of state updates in the linear Kalman filter (t = t+1):

    Prediction of observation:  r̂_t = Z_t μ_t
    Error of observation:  v_t = r_t − r̂_t,  F_t = Z_t P_t Z_tᵀ + σ_ε²
    Filtering:  μ_{t|t} = μ_t + K v_t = μ_t + (P_t Z_tᵀ F_t⁻¹) v_t,  P_{t|t} = P_t − K F_t Kᵀ
    One-step ahead prediction:  μ_{t+1} = T_t μ_{t|t},  P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ
  21. Arm selection by probability matching method

    Probability matching using the state estimates from the Kalman filter: 1. The mean μ_t and variance-covariance P_t of the state α_t (= θ_t) are estimated for each arm k. 2. The parameter θ̃_t^(k) for each arm is sampled from N_m with these estimates. 3. The arm with the largest inner product between the sampled parameter θ̃_t^(k) and the context x_t is selected.
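The three steps above can be sketched as a Thompson-style selection; `select_arm` and its argument names are ours, illustrative rather than the paper's API.

```python
import numpy as np

# Probability matching: sample theta~_t^(k) ~ N_m(mu_k, P_k) per arm,
# then pick the arm whose sample scores highest on the context x_t.
def select_arm(mus, Ps, x, rng):
    scores = []
    for mu, P in zip(mus, Ps):
        theta = rng.multivariate_normal(mu, P)  # step 2: sample parameters
        scores.append(float(theta @ x))         # step 3: inner product with x_t
    return int(np.argmax(scores))
```

An arm whose posterior covariance P is large produces widely spread samples, so it still gets selected occasionally; this is where exploration comes from.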
  22. Estimating the state using the linear Kalman filter (again)

    • The Kalman filter continuously estimates the state by repeating one-step ahead prediction and filtering; the state α follows N_m with mean μ and variance-covariance P. • Filtering (μ_{t|t} = μ_t + K v_t, P_{t|t} = P_t − K F_t Kᵀ) reduces the values of P. • One-step ahead prediction (μ_{t+1} = T_t μ_{t|t}, P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ) increases the values of P.
  23. Virtual exploration using missing observations

    • The Kalman filter can handle missing observations appropriately: filtering keeps the values of P (μ_{t|t} = μ_t, P_{t|t} = P_t), while one-step ahead prediction (μ_{t+1} = T_t μ_{t|t}, P_{t+1} = T_t P_{t|t} T_tᵀ + σ_η² R_t R_tᵀ) increases them. • The proposed policy incorporates this process as an update operation for the evaluation of the unselected arms in the MAB problem.
  24. Virtual exploration using missing observations

    • For a missing observation, filtering keeps the values of P (μ_{t|t} = μ_t, P_{t|t} = P_t) and one-step ahead prediction increases them. • By introducing this behavior into the policy, exploration of the unselected arms is encouraged through the mechanism of the probability matching method.
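The missing-observation update above amounts to skipping the filtering step; a sketch under the same notation (the function name is ours):

```python
import numpy as np

# Missing-observation update for an unselected arm: no filtering
# (mu_{t|t} = mu_t, P_{t|t} = P_t), only one-step ahead prediction,
# so the posterior covariance P grows over time.
def missing_step(mu, P, T, R, sigma_eta2):
    mu_next = T @ mu                               # mu_{t+1} = T_t mu_t
    P_next = T @ P @ T.T + sigma_eta2 * (R @ R.T)  # P grows by sigma_eta^2 R R^T
    return mu_next, P_next
```

Because P keeps growing for arms that go unplayed, the probability matching step samples their parameters with ever wider spread, which is what realizes the "virtual exploration" of this slide.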
  25. Evaluation setting

    • The data were collected from an actual EC site. • Each product is assigned to one of 18 categories. • Four recommendation methods are used: Browsing path, Demographic, LLR, and Similar image. • Counts of the recommendations of each method and the clicks for each recommendation in each product category were collected hourly from June 20, 2019, through July 22, 2019. • The total number of recommendations is 1,488,626. [Fig. 2. Change of CTR for each recommendation method over ~600 hours: the method with the highest CTR switches, and typical characteristics of each recommendation method appear.]
  26. Evaluation setting

    • We simulated the number of clicks obtained by the selected recommendation method during the time that the data were acquired. • Assumptions: • User clicks on recommendations follow a Bernoulli distribution with the CTR, where recommendation methods are selected by the MAB policy. • Each recommendation method k has an 18-dimensional parameter θ̃_t^(k), equal to the number of product categories. • The CTR is calculated by the inner product of θ̃_t^(k) and the context x_t. • The context x_t is represented as a 1-hot vector of the product category that the user is browsing at time t.
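The click simulation described above can be sketched as follows. The parameter matrix and the greedy arm choice are illustrative stand-ins: the paper uses 18 categories, 4 methods, and the MAB policy under evaluation, none of which are reproduced here.

```python
import numpy as np

# Sketch of one simulated trial: the CTR of method k under a 1-hot
# category context x_t is the inner product theta^(k) . x_t, and a
# click is a Bernoulli draw with that probability.
def simulate_click(thetas, category, rng):
    n_arms, m = thetas.shape
    x = np.zeros(m)
    x[category] = 1.0                       # 1-hot product-category context x_t
    ctrs = thetas @ x                       # CTR per recommendation method
    arm = int(np.argmax(ctrs))              # stand-in for the policy's selection
    click = bool(rng.random() < ctrs[arm])  # click ~ Bernoulli(CTR)
    return arm, click
```

Running this per hour with time-varying thetas would reproduce the nonstationary, contextual structure the evaluation assumes.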
  27. Effect of context and temporal change

    [Figure: cumulative regret and best arm rate over ~600 hours for adaptive thompson sampling, decay linucb, dynamic linucb, LGSS bandits, linear thompson sampling, linucb, neural linear, select best arm at first, and select random arm] • The proposed LGSS bandits achieved the highest cumulative reward, 201,330.4. • LGSS bandits and adaptive Thompson sampling reduce overall opportunity loss, not only during best-arm switching but also during stationary periods.
  28. Effect of context and temporal change

    [Figure: change of CTR and best arm rate for the Plants and Stationery categories over three periods, comparing adaptive thompson sampling, dynamic linucb, LGSS bandits, and linear thompson sampling] • Comparison of the ratio of optimal arms in categories where the recommendation method with the highest CTR switches and categories where it does not. • The proposed policy was shown to continuously select the best arm in both situations.
  29. Effect of the response time

    [Figure: elapsed time over 2,665 trials with 18 dimensions and 4 arms, per arm selection (ms) and per evaluation update (sec, log scale), for each policy] • We measured the elapsed time of arm selection and of evaluation updates for each policy. • For arm selection, the LGSS bandits had an elapsed time of 0.24 ms. • All measures were less than 0.5 ms and are considered to have sufficient performance without significant impact. • For evaluation updates, the elapsed time was sufficiently short compared with the unit time (1 h). • Some policies are not scalable to an increase in the number of trials per unit time.
  30. Conclusion

    • We proposed an LGSS-based MAB policy for meta-RSs that optimizes the selection of recommendation methods continuously and quickly according to context and temporal change, using the Kalman filter. • The experimental results demonstrated that the proposed policy can effectively reduce opportunity loss during evaluation and increase the cumulative number of clicks compared with baseline policies. • The results also revealed that the proposed policy adds no overhead to the response time of the meta-RS. • In the future, we will develop general-purpose policies that can quickly handle nonlinearity.