Slide 1

Slide 1 text

Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process
Yusuke Miyake [1][2], Ryuji Watanabe [2], and Tsunenori Mine [1]
[1] Kyushu University, [2] Pepabo R&D Institute, GMO Pepabo, Inc.
July 2-4, 2024. The 48th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2024)

Slide 2

Slide 2 text

Agenda
1. Introduction
2. Related works
3. Proposal
4. Evaluation
5. Conclusion

Slide 3

Slide 3 text

1. Introduction

Slide 4

Slide 4 text

Background
• Selecting the optimal behavior from many candidates is crucial in practical applications.
• The effectiveness of each behavior cannot be known in advance.
• Continuous comparative evaluation in the actual environment is essential.
• However, such evaluation incurs both short-term and long-term opportunity loss.
• Reducing this opportunity loss can be formulated as a multi-armed bandit (MAB) problem.

Slide 5

Slide 5 text

Multi-Armed Bandit (MAB) Problem
• The player must select one arm from multiple arms to maximize the reward.
• The reward is given stochastically based on the chosen arm.
• The player needs to infer the reward distribution from the trial results.
• To find the optimal arm, the player must balance exploitation and exploration.
• The term 'arm' is derived from the 'arm' of a slot machine.
[Figure: select an arm → receive a reward → infer the reward distribution.]

Slide 6

Slide 6 text

The simplest MAB policy
• The ϵ-greedy policy selects an arm uniformly at random with probability ϵ (exploration) and selects the arm with the highest average reward so far with probability 1 − ϵ (exploitation). A sketch of this rule follows below.
• Selection rule: choose argmax_{l=1,…,L} ŷ^(l) with probability 1 − ϵ, and each arm a ∈ A with probability ϵ/L.
[Figure: arm-selection probabilities of the bandit policy compared with A/B testing.]
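A minimal sketch of ϵ-greedy arm selection, assuming the running average reward of each arm is already tracked; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def select_arm(avg_reward, epsilon, rng=np.random.default_rng()):
    """epsilon-greedy: explore uniformly with probability epsilon, otherwise exploit."""
    L = len(avg_reward)
    if rng.random() < epsilon:
        return int(rng.integers(L))          # exploration: uniform over all L arms
    return int(np.argmax(avg_reward))        # exploitation: highest average reward so far
```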

Slide 7

Slide 7 text

Motivation of our Approach
• Nonlinearity
  • In practical applications such as e-commerce, user behavior and preferences often exhibit complex, nonlinear patterns that cannot be captured by simple linear models.
• Nonstationarity
  • User preferences and environmental conditions in real-world applications change over time.
• Online performance
  • Response delays are detrimental to the user experience in real-world applications.

Slide 8

Slide 8 text

2. Related works

Slide 9

Slide 9 text

Gaussian Process Regression
• A method to infer the distribution of a function as a stochastic process from data.
• Effective in modeling nonlinear functions and prediction uncertainties.
• Use of a kernel function k
  • Efficiently computes the inner product k(xi, xj) in a high-dimensional feature space of basis functions.
• Application to nonlinear MAB problems
  • Widely explored for nonlinear MAB problems.
• Challenge of GP regression
  • Training time grows rapidly (cubically in N) as the dataset grows, because the inverse K⁻¹ of the kernel matrix K ∈ ℝ^{N×N} must be recomputed. A sketch of this computation follows below.
  • Few policies have been developed to handle nonstationarity.
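A minimal sketch of the exact GP predictive distribution, illustrating why the N×N inverse dominates the cost; the RBF kernel, noise level, and names are illustrative assumptions, not from the paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, x_star, noise=1e-2):
    """Exact GP posterior mean/variance; requires inverting an N x N matrix."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))        # K in R^{N x N}
    k_star = rbf_kernel(X, x_star[None, :])              # k_* in R^{N x 1}
    K_inv = np.linalg.inv(K)                              # O(N^3): grows with the dataset
    mean = (k_star.T @ K_inv @ y).item()
    var = (rbf_kernel(x_star[None, :], x_star[None, :]) - k_star.T @ K_inv @ k_star).item()
    return mean, var
```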

Slide 10

Slide 10 text

Weighted GP-UCB [Y. Deng 2022]
• A GP regression model-based nonstationary and nonlinear policy.
• Nonstationarity
  • Focuses on new training data by using two new weight matrices.
• Online performance
  • Using Random Fourier Features (RFF), computes the predictive distribution of GP regression in the form of linear regression in an R-dimensional space, where R ⋘ N: the kernel matrix K ∈ ℝ^{N×N} is approximated as K ≃ ZZ⊤ with Z ∈ ℝ^{N×R}, so only Z⊤Z ∈ ℝ^{R×R} needs to be inverted (a sketch of this linear form follows below).
  • Keeps the size of the computed inverse matrix constant, reducing computational complexity.
• Limitation
  • Absence of a recursive learning algorithm.
  • Only partially mitigates the issue of escalating training time with a growing dataset.
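A minimal sketch of computing the predictive distribution in linear-regression form from R-dimensional features. This is a generic Bayesian-linear-regression formulation under assumed names (feature matrix Z, regularizer lam), not the authors' exact weighting scheme.

```python
import numpy as np

def rff_linear_predict(Z, y, z_star, lam=1.0, noise=1e-2):
    """Predictive mean/variance using only an R x R inverse (Z in R^{N x R})."""
    R = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(R)                 # R x R; size stays constant as N grows
    A_inv = np.linalg.inv(A)
    mean = z_star @ A_inv @ Z.T @ y               # posterior mean in linear-regression form
    var = noise * (1.0 + z_star @ A_inv @ z_star) # predictive variance (prior scale folded into lam)
    return mean, var
```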

Slide 11

Slide 11 text

NysKRLS [T. Zhang 2020]
• A nonstationary kernel Recursive Least Squares (RLS) method.
• Nonstationarity
  • Introduces a forgetting mechanism for old training data.
• Online performance
  • Using the Nyström approximation, computes the predictive distribution of GP regression in the form of linear regression in an R-dimensional space, where R ⋘ N (a sketch of Nyström features follows below).
  • Applies a linear RLS algorithm.
• Limitation
  • The regularization effect decreases as the number of recursive computations increases: the recursion effectively maintains (Z⊤ΓZ + γ^M Λ)⁻¹, so the regularizer decays by γ at every step.
  • The resulting loss of estimation accuracy due to overfitting can be critical.
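A minimal sketch of a standard Nyström feature map, for context: R landmark points define features z(x) whose inner products approximate the kernel. The landmark choice and names are illustrative assumptions, not the NysKRLS algorithm itself.

```python
import numpy as np

def nystrom_features(X_landmarks, x, kernel, jitter=1e-8):
    """R-dim features z(x) such that z(x_i).T @ z(x_j) approximates k(x_i, x_j)."""
    R = len(X_landmarks)
    K_RR = kernel(X_landmarks, X_landmarks) + jitter * np.eye(R)
    eigval, eigvec = np.linalg.eigh(K_RR)
    # K_RR^{-1/2} from its eigendecomposition
    K_RR_inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, jitter))) @ eigvec.T
    k_Rx = kernel(X_landmarks, x[None, :]).ravel()   # kernel values against the R landmarks
    return K_RR_inv_sqrt @ k_Rx                       # z(x) in R^R
```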

Slide 12

Slide 12 text

3. Proposal

Slide 13

Slide 13 text

Our Proposed Policy
• An online nonstationary and nonlinear contextual MAB policy.
• Nonlinearity and nonstationarity
  • Introduces a forgetting mechanism into the nonlinear GP regression model.
• Online performance
  • Using RFF, computes the predictive distribution of GP regression in the form of linear regression in an R-dimensional space, where R ⋘ N.
  • Applies a linear RLS algorithm.
• Key features
  • Fast decision-making with recursive learning.
  • Accurate error correction in the predictive distribution.
[Figure: recursive updates of the statistics P_{N,M} and Q_{N,M} from each observation (x_N, y_N) with forgetting factor γ, yielding the predictive distribution p(y_* ∣ x_*, X, y); error correction resets M to 0.]

Slide 14

Slide 14 text

Random Fourier Features
• A method to approximate the kernel function k(xi, xj) using R′ = R/2 samples from a probability distribution p(ω). A sketch of the feature map follows below.
• Linear method
  • By decomposing the kernel function into inputs with R-dimensional features, k(xi, xj) ≃ z(xi)⊤z(xj), the original problem can be solved as a linear problem in R-dimensional space.
  • The kernel matrix K ∈ ℝ^{N×N} is approximated as K ≃ ZZ⊤ with Z ∈ ℝ^{N×R}, so only Z⊤Z ∈ ℝ^{R×R} (N ⋙ R) needs to be handled.
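A minimal sketch of an RFF feature map for an RBF kernel, matching the R′ = R/2 sampling above: each sampled frequency ωᵢ contributes one cosine and one sine feature. The kernel and length-scale choice are illustrative assumptions.

```python
import numpy as np

def rff_map(X, R, length_scale=1.0, rng=np.random.default_rng(0)):
    """R-dim random Fourier features for the RBF kernel exp(-||x-y||^2 / (2 l^2))."""
    R_prime = R // 2                                   # R' = R/2 sampled frequencies
    d = X.shape[1]
    omega = rng.normal(0.0, 1.0 / length_scale, size=(R_prime, d))  # omega ~ p(omega)
    proj = X @ omega.T                                 # N x R'
    Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(R_prime)
    return Z                                           # Z in R^{N x R}; k(x_i,x_j) ~ z(x_i).z(x_j)
```

With this Z, K ≃ ZZ⊤ and the learning problem reduces to linear regression on R-dimensional inputs.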

Slide 15

Slide 15 text

Comparative Learning Methods
[Figure: GP learning recomputes the inverse of K′ ∈ ℝ^{N×N} at every step as the data x1, x2, …, xN arrive, whereas GP with RFF learning only inverts Z⊤Z + Λ ∈ ℝ^{R×R} at every step.]

Slide 16

Slide 16 text

Forgetting Mechanism
• Incorporating exponential forgetting of past inputs and outputs enhances estimation accuracy in nonstationary environments.
• The proposed method assumes that more distant past data have larger observation errors ϵ_n, and thus lower accuracy, while recent data have smaller observation errors.
• As a result, the prediction μ̂ fits the recent training data, i.e., the model adapts quickly to the changing environment. A sketch of the weighted estimator follows below.
[Figure: predictive distribution (μ̂ and 1σ confidence) for a nonstationary and nonlinear regression problem, with past training data from f_A and recent training data from f_B; the prediction follows f_B.]
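A minimal sketch of the batch form of exponentially weighted (forgetting) regression on RFF features, matching the Z⊤ΓZ + Λ expression on the next slide; the weight definition Γ = diag(γ^{N−n}) and the names are assumptions made for illustration.

```python
import numpy as np

def weighted_ridge(Z, y, gamma=0.99, lam=1.0):
    """Batch solution theta = (Z^T Gamma Z + Lambda)^{-1} Z^T Gamma y with exponential forgetting."""
    N, R = Z.shape
    weights = gamma ** np.arange(N - 1, -1, -1)   # oldest sample weighted gamma^{N-1}, newest 1
    Gamma = np.diag(weights)
    A = Z.T @ Gamma @ Z + lam * np.eye(R)          # R x R matrix to invert
    theta = np.linalg.solve(A, Z.T @ Gamma @ y)
    return theta, A
```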

Slide 17

Slide 17 text

Comparative Learning Methods
[Figure: GP with RFF learning inverts Z⊤Z + Λ ∈ ℝ^{R×R} at every step, whereas nonstationary GP with RFF learning inverts the weighted matrix Z⊤ΓZ + Λ ∈ ℝ^{R×R} at every step.]

Slide 18

Slide 18 text

Recursive Learning Mechanism
• Recursive Least Squares (RLS) updates the regression model parameters using previous calculations and new observations.
• Efficiency
  • The algorithm computes the parameters efficiently from the results up to time N and the observed values at time N + 1.
• Reduced computational cost
  • By avoiding recomputation of the inverse matrix at every step, the algorithm significantly reduces computational cost. A sketch of the update follows below.
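A minimal sketch of a standard exponentially weighted RLS step on RFF features; it maintains P = (Z⊤ΓZ + γ^M Λ)⁻¹ without any explicit matrix inversion, which is exactly the decayed-regularizer form discussed on the following slides. Variable names are illustrative.

```python
import numpy as np

def rls_update(theta, P, z, y, gamma=0.99):
    """One exponentially weighted RLS step (rank-1 Sherman-Morrison update, no inversion)."""
    Pz = P @ z
    k = Pz / (gamma + z @ Pz)             # gain vector
    theta = theta + k * (y - z @ theta)   # parameter update driven by the prediction error
    P = (P - np.outer(k, Pz)) / gamma     # keeps P = (Z^T Gamma Z + gamma^M Lambda)^{-1}
    return theta, P
```

Starting from P = Λ⁻¹ and applying this step M times yields the (Z⊤ΓZ + γ^M Λ)⁻¹ form whose decayed regularizer the error correction on slide 21 restores.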

Slide 19

Slide 19 text

Comparative Learning Methods
[Figure: batch learning recomputes (Z⊤ΓZ + Λ)⁻¹ from scratch at every step, whereas recursive learning reuses the previous result but accumulates the forgetting factor in the regularization term, yielding (Z⊤ΓZ + γΛ)⁻¹, (Z⊤ΓZ + γ²Λ)⁻¹, …, (Z⊤ΓZ + γ^M Λ)⁻¹.]

Slide 20

Slide 20 text

Estimation Error in Recursive Learning
• Estimation error arises in recursive learning because the forgetting effect is recursively applied to the regularization term: the batch solution uses (Z⊤ΓZ + Λ)⁻¹, whereas the recursion yields (Z⊤ΓZ + γ^M Λ)⁻¹.
• For MAB policies that run for long periods, the resulting loss of estimation accuracy due to overfitting can be critical.
• Addressing this error creates a trade-off between accuracy and online performance.
[Figure: estimation error of the predictive distribution parameters (mean μ̂ and covariance Σ̂) for M = 0, 200, 400, 600; the error grows with M.]

Slide 21

Slide 21 text

Error Correction Method
• A novel recursive error correction method balances estimation accuracy and online performance.
• Correction steps: 1. invert the recursively maintained matrix; 2. subtract the accumulated error from the regularization term; 3. invert again, recovering (Z⊤ΓZ + γ⁰Λ)⁻¹ = (Z⊤ΓZ + Λ)⁻¹, so there is no estimation error when M = 0. A sketch follows below.
• High computational cost due to the two inversions of an R × R matrix.
• The method is therefore executed at intervals determined by the acceptable estimation accuracy, rather than at every step.
[Figure: after correction, the estimation error of the predictive distribution parameters vanishes, as in the M = 0 case.]
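A minimal sketch of the correction idea under the assumption (consistent with the preceding slides) that the recursion maintains P = (Z⊤ΓZ + γ^M Λ)⁻¹ with Λ = λI; subtracting the decayed part of the regularizer restores the batch form. This is an illustrative reconstruction, not the authors' exact algorithm.

```python
import numpy as np

def correct_error(P, gamma, M, lam):
    """Restore (Z^T Gamma Z + Lambda)^{-1} from the recursive (Z^T Gamma Z + gamma^M Lambda)^{-1}."""
    R = P.shape[0]
    A = np.linalg.inv(P)                           # 1. invert: Z^T Gamma Z + gamma^M * lam * I
    A -= (gamma ** M - 1.0) * lam * np.eye(R)      # 2. subtract the accumulated regularizer error
    return np.linalg.inv(A)                        # 3. invert again: (Z^T Gamma Z + lam * I)^{-1}
```

Presumably M is then reset to 0 after each correction (as suggested by the (M = 0) annotations on slide 13), so the decay of the regularizer restarts from Λ.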

Slide 22

Slide 22 text

4. Evaluation

Slide 23

Slide 23 text

Simulation Setup
• Nonstationary and nonlinear contextual MAB simulation.
• Each arm's reward follows a normal distribution 𝒩(μ, σ²), with the mean μ determined by the context x_t = (x_{t,d})_{1≤d≤2} as shown below.
• The banded curve moves leftward over time (one full rotation in 4000 trials).
• The arm a_t^{(1)} always has μ = μ1.
• Parameters: μ1 = 0.1, μ2 = 0.0, μ3 = 1.0, σ² = 0.01, δ = 0.8, ρ = 4000.
• The magnitudes of the means satisfy μ2 < μ1 ≪ μ3. To maximize the expected reward, the player must select the corresponding arm inside the banded curve and choose arm a^{(1)} otherwise.

Slide 24

Slide 24 text

Baseline Policies
Policy (Nonlinear / Nonstationary / Recursive learning):
• RW-GPB (Proposal): ✓ / ✓ / ✓ — evaluated with multiple correction intervals τ to compare error-correction effects.
• GP+UCB (Weighted, RFF): ✓ / ✓ / – — state of the art; reduced learning time with RFF.
• GP+UCB (Weighted): ✓ / ✓ / –
• GP+UCB (Sliding Window): ✓ / ✓ / –
• GP+UCB (Restarting): ✓ / ✓ / –
• GP+UCB: ✓ / – / –
• Decay LinUCB: – / ✓ / ✓
A constant exploration scale β = 1 is used for all policies to clarify the effect of each policy's regression model.

Slide 25

Slide 25 text

Simulation Results: Trade-off Analysis
• The proposed RW-GPB policy achieves higher cumulative rewards and shorter execution time than the state-of-the-art GP+UCB (Weighted, RFF) policy.
• Compared to GP+UCB (Weighted, RFF),
  • RW-GPB (τ = 1) reduces the execution time by 71% with equal rewards.
  • RW-GPB (τ = 40) reduces the execution time by 92% with higher rewards.
• The GP+UCB (Weighted) policy without RFF had the highest reward and the longest execution time.
  • Improving the approximation accuracy of the kernel function is therefore also essential.

Slide 26

Slide 26 text

Simulation Results: Trade-off Analysis
• Accumulated errors reduce the cumulative reward, while the frequency of error correction has virtually no effect on execution time.
  • Error correction should therefore be performed aggressively.
• Interestingly, the cumulative reward is higher for τ = 4 than for the most accurate setting τ = 1.
  • This result indicates that a slight increase in exploration frequency may be helpful in nonstationary environments.
[Figure: cumulative rewards vs. trials per second trade-off for RW-GPB (τ = 1, 4, 40, 100, 400, 800, 1600), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).]

Slide 27

Slide 27 text

Simulation Results: Computation Time
• The proposed RW-GPB policy keeps the execution time per trial constant.
• The policy without recursive learning increases the execution time linearly.
• In addition, the policy without RFF increases it even more steeply.
[Figure: cumulative execution time over 4000 trials for RW-GPB (τ = 4), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).]

Slide 28

Slide 28 text

5. Conclusion

Slide 29

Slide 29 text

Conclusion
• RW-GPB Policy
  • We introduced RW-GPB, a new online policy for nonstationary and nonlinear contextual MAB problems, which balances accuracy and online performance.
• RW-GPR Model
  • Our novel RW-GPR model, equipped with a practical error correction method, effectively implements the proposed policy.
• Experimental Results
  • Experimental results demonstrate that RW-GPB significantly reduces computational time while maintaining cumulative reward in simulations.

Slide 30

Slide 30 text

Future Work
• Meta-Recommender System
  • We aim to implement and evaluate a meta-recommender system that autonomously optimizes the selection of recommendation algorithms using the proposed policy.
• Client-Side Agents
  • Future research will explore applying this lightweight policy to client-side agents for solving complex autonomous tasks on resource-constrained devices.
• Real-World Effectiveness
  • We expect the proposed policy to enhance the effectiveness of autonomous systems across various real-world scenarios.

Slide 31

Slide 31 text

No content