Slide 1

Slide 1 text

Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process
Yusuke Miyake [1][2], Ryuji Watanabe [2], and Tsunenori Mine [1]
[1] Kyushu University, [2] Pepabo R&D Institute, GMO Pepabo, Inc.
July 2-4, 2024. The 48th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2024)

Slide 2

Slide 2 text

Agenda
1. Introduction
2. Related works
3. Proposal
4. Evaluation
5. Conclusion

Slide 3

Slide 3 text

1. Introduction

Slide 4

Slide 4 text

Background
• Selecting the optimal behavior from many candidates is crucial in practical applications.
• The effectiveness of each behavior cannot be known in advance.
• Continuous comparative evaluation in the actual environment is essential.
• However, such evaluation incurs both short-term and long-term opportunity loss.
• Reducing this opportunity loss can be formulated as a multi-armed bandit (MAB) problem.

Slide 5

Slide 5 text

Multi-Armed Bandit (MAB) Problem
• The player must select one arm from multiple arms to maximize the reward.
• The reward is given stochastically based on the chosen arm.
• The player needs to infer the reward distribution from the trial results.
• To find the optimal arm, the player must balance exploitation and exploration.
• The term 'arm' is derived from the 'arm' of a slot machine.
[Figure: select an arm → receive a reward → infer the reward distribution.]

Slide 6

Slide 6 text

The simplest MAB policy
• The ϵ-greedy policy selects an arm uniformly at random with probability ϵ (exploration) and selects the arm with the highest average reward so far with probability 1 − ϵ (exploitation). A sketch of this rule follows below.
• Selection rule: choose argmax_{l=1,…,L} ŷ^(l) with probability 1 − ϵ, and each arm a ∈ A with probability ϵ/L.
[Figure: arm-selection probabilities of the bandit policy compared with A/B testing.]
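A minimal sketch of ϵ-greedy arm selection, assuming the running average reward of each arm is already tracked; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def select_arm(avg_reward, epsilon, rng=np.random.default_rng()):
    """epsilon-greedy: explore uniformly with probability epsilon, otherwise exploit."""
    L = len(avg_reward)
    if rng.random() < epsilon:
        return int(rng.integers(L))          # exploration: uniform over all L arms
    return int(np.argmax(avg_reward))        # exploitation: highest average reward so far
```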

Slide 7

Slide 7 text

Motivation of our Approach
• Nonlinearity
  • In practical applications such as e-commerce, user behavior and preferences often exhibit complex, nonlinear patterns that cannot be captured by simple linear models.
• Nonstationarity
  • User preferences and environmental conditions in real-world applications change over time.
• Online performance
  • Response delays are detrimental to the user experience in real-world applications.

Slide 8

Slide 8 text

2. Related works

Slide 9

Slide 9 text

Gaussian Process Regression
• A method to infer the distribution of a function as a stochastic process from data.
• Effective in modeling nonlinear functions and prediction uncertainties.
• Use of a kernel function k
  • Efficiently computes the inner product k(xi, xj) in a high-dimensional feature space of basis functions.
• Application to nonlinear MAB problems
  • Widely explored for nonlinear MAB problems.
• Challenge of GP regression
  • Training time grows rapidly (cubically in N) as the dataset grows, because the inverse K⁻¹ of the kernel matrix K ∈ ℝ^{N×N} must be recomputed. A sketch of this computation follows below.
  • Few policies have been developed to handle nonstationarity.
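A minimal sketch of the exact GP predictive distribution, illustrating why the N×N inverse dominates the cost; the RBF kernel, noise level, and names are illustrative assumptions, not from the paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, x_star, noise=1e-2):
    """Exact GP posterior mean/variance; requires inverting an N x N matrix."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))        # K in R^{N x N}
    k_star = rbf_kernel(X, x_star[None, :])              # k_* in R^{N x 1}
    K_inv = np.linalg.inv(K)                              # O(N^3): grows with the dataset
    mean = (k_star.T @ K_inv @ y).item()
    var = (rbf_kernel(x_star[None, :], x_star[None, :]) - k_star.T @ K_inv @ k_star).item()
    return mean, var
```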

Slide 10

Slide 10 text

Weighted GP-UCB [Y. Deng 2022]
• A GP regression model-based nonstationary and nonlinear policy.
• Nonstationarity
  • Focuses on new training data by using two new weight matrices.
• Online performance
  • Using Random Fourier Features (RFF), computes the predictive distribution of GP regression in the form of linear regression in an R-dimensional space, where R ⋘ N: the kernel matrix K ∈ ℝ^{N×N} is approximated as K ≃ ZZ⊤ with Z ∈ ℝ^{N×R}, so only Z⊤Z ∈ ℝ^{R×R} needs to be inverted (a sketch of this linear form follows below).
  • Keeps the size of the computed inverse matrix constant, reducing computational complexity.
• Limitation
  • Absence of a recursive learning algorithm.
  • Only partially mitigates the issue of escalating training time with a growing dataset.
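A minimal sketch of computing the predictive distribution in linear-regression form from R-dimensional features. This is a generic Bayesian-linear-regression formulation under assumed names (feature matrix Z, regularizer lam), not the authors' exact weighting scheme.

```python
import numpy as np

def rff_linear_predict(Z, y, z_star, lam=1.0, noise=1e-2):
    """Predictive mean/variance using only an R x R inverse (Z in R^{N x R})."""
    R = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(R)                 # R x R; size stays constant as N grows
    A_inv = np.linalg.inv(A)
    mean = z_star @ A_inv @ Z.T @ y               # posterior mean in linear-regression form
    var = noise * (1.0 + z_star @ A_inv @ z_star) # predictive variance (prior scale folded into lam)
    return mean, var
```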

Slide 11

Slide 11 text

NysKRLS [T. Zhang 2020]
• A nonstationary kernel Recursive Least Squares (RLS) method.
• Nonstationarity
  • Introduces a forgetting mechanism for old training data.
• Online performance
  • Using the Nyström approximation, computes the predictive distribution of GP regression in the form of linear regression in an R-dimensional space, where R ⋘ N (a sketch of Nyström features follows below).
  • Applies a linear RLS algorithm.
• Limitation
  • The regularization effect decreases as the number of recursive computations increases: the recursion effectively maintains (Z⊤ΓZ + γ^M Λ)⁻¹, so the regularizer decays by γ at every step.
  • The resulting loss of estimation accuracy due to overfitting can be critical.
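A minimal sketch of a standard Nyström feature map, for context: R landmark points define features z(x) whose inner products approximate the kernel. The landmark choice and names are illustrative assumptions, not the NysKRLS algorithm itself.

```python
import numpy as np

def nystrom_features(X_landmarks, x, kernel, jitter=1e-8):
    """R-dim features z(x) such that z(x_i).T @ z(x_j) approximates k(x_i, x_j)."""
    R = len(X_landmarks)
    K_RR = kernel(X_landmarks, X_landmarks) + jitter * np.eye(R)
    eigval, eigvec = np.linalg.eigh(K_RR)
    # K_RR^{-1/2} from its eigendecomposition
    K_RR_inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, jitter))) @ eigvec.T
    k_Rx = kernel(X_landmarks, x[None, :]).ravel()   # kernel values against the R landmarks
    return K_RR_inv_sqrt @ k_Rx                       # z(x) in R^R
```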

Slide 12

Slide 12 text

3. Proposal

Slide 13

Slide 13 text

Our Proposed Policy
• An online nonstationary and nonlinear contextual MAB policy.
• Nonlinearity and nonstationarity
  • Introduces a forgetting mechanism into the nonlinear GP regression model.
• Online performance
  • Using RFF, computes the predictive distribution of GP regression in the form of linear regression in an R-dimensional space, where R ⋘ N.
  • Applies a linear RLS algorithm.
• Key features
  • Fast decision-making with recursive learning.
  • Accurate error correction in the predictive distribution.
[Figure: recursive updates of the statistics P_{N,M} and Q_{N,M} from each observation (x_N, y_N) with forgetting factor γ, yielding the predictive distribution p(y_* ∣ x_*, X, y); error correction resets M to 0.]

Slide 14

Slide 14 text

Random Fourier Features
• A method to approximate the kernel function k(xi, xj) using R′ = R/2 samples from a probability distribution p(ω). A sketch of the feature map follows below.
• Linear method
  • By decomposing the kernel function into inputs with R-dimensional features, k(xi, xj) ≃ z(xi)⊤z(xj), the original problem can be solved as a linear problem in R-dimensional space.
  • The kernel matrix K ∈ ℝ^{N×N} is approximated as K ≃ ZZ⊤ with Z ∈ ℝ^{N×R}, so only Z⊤Z ∈ ℝ^{R×R} (N ⋙ R) needs to be handled.
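A minimal sketch of an RFF feature map for an RBF kernel, matching the R′ = R/2 sampling above: each sampled frequency ωᵢ contributes one cosine and one sine feature. The kernel and length-scale choice are illustrative assumptions.

```python
import numpy as np

def rff_map(X, R, length_scale=1.0, rng=np.random.default_rng(0)):
    """R-dim random Fourier features for the RBF kernel exp(-||x-y||^2 / (2 l^2))."""
    R_prime = R // 2                                   # R' = R/2 sampled frequencies
    d = X.shape[1]
    omega = rng.normal(0.0, 1.0 / length_scale, size=(R_prime, d))  # omega ~ p(omega)
    proj = X @ omega.T                                 # N x R'
    Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(R_prime)
    return Z                                           # Z in R^{N x R}; k(x_i,x_j) ~ z(x_i).z(x_j)
```

With this Z, K ≃ ZZ⊤ and the learning problem reduces to linear regression on R-dimensional inputs.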

Slide 15

Slide 15 text

Comparative Learning Methods
[Figure: GP learning recomputes the inverse of K′ ∈ ℝ^{N×N} at every step as the data x1, x2, …, xN arrive, whereas GP with RFF learning only inverts Z⊤Z + Λ ∈ ℝ^{R×R} at every step.]

Slide 16

Slide 16 text

Forgetting Mechanism
• Incorporating exponential forgetting of past inputs and outputs enhances estimation accuracy in nonstationary environments.
• The proposed method assumes that more distant past data have larger observation errors ϵ_n, and thus lower accuracy, while recent data have smaller observation errors.
• As a result, the prediction μ̂ fits the recent training data, i.e., the model adapts quickly to the changing environment. A sketch of the weighted estimator follows below.
[Figure: predictive distribution (μ̂ and 1σ confidence) for a nonstationary and nonlinear regression problem, with past training data from f_A and recent training data from f_B; the prediction follows f_B.]
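A minimal sketch of the batch form of exponentially weighted (forgetting) regression on RFF features, matching the Z⊤ΓZ + Λ expression on the next slide; the weight definition Γ = diag(γ^{N−n}) and the names are assumptions made for illustration.

```python
import numpy as np

def weighted_ridge(Z, y, gamma=0.99, lam=1.0):
    """Batch solution theta = (Z^T Gamma Z + Lambda)^{-1} Z^T Gamma y with exponential forgetting."""
    N, R = Z.shape
    weights = gamma ** np.arange(N - 1, -1, -1)   # oldest sample weighted gamma^{N-1}, newest 1
    Gamma = np.diag(weights)
    A = Z.T @ Gamma @ Z + lam * np.eye(R)          # R x R matrix to invert
    theta = np.linalg.solve(A, Z.T @ Gamma @ y)
    return theta, A
```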

Slide 17

Slide 17 text

Comparative Learning Methods
[Figure: GP with RFF learning inverts Z⊤Z + Λ ∈ ℝ^{R×R} at every step, whereas nonstationary GP with RFF learning inverts the weighted matrix Z⊤ΓZ + Λ ∈ ℝ^{R×R} at every step.]

Slide 18

Slide 18 text

Recursive Learning Mechanism
• Recursive Least Squares (RLS) updates the regression model parameters using previous calculations and new observations.
• Efficiency
  • The algorithm computes the parameters efficiently from the results up to time N and the observed values at time N + 1.
• Reduced computational cost
  • By avoiding recomputation of the inverse matrix at every step, the algorithm significantly reduces computational cost. A sketch of the update follows below.
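A minimal sketch of a standard exponentially weighted RLS step on RFF features; it maintains P = (Z⊤ΓZ + γ^M Λ)⁻¹ without any explicit matrix inversion, which is exactly the decayed-regularizer form discussed on the following slides. Variable names are illustrative.

```python
import numpy as np

def rls_update(theta, P, z, y, gamma=0.99):
    """One exponentially weighted RLS step (rank-1 Sherman-Morrison update, no inversion)."""
    Pz = P @ z
    k = Pz / (gamma + z @ Pz)             # gain vector
    theta = theta + k * (y - z @ theta)   # parameter update driven by the prediction error
    P = (P - np.outer(k, Pz)) / gamma     # keeps P = (Z^T Gamma Z + gamma^M Lambda)^{-1}
    return theta, P
```

Starting from P = Λ⁻¹ and applying this step M times yields the (Z⊤ΓZ + γ^M Λ)⁻¹ form whose decayed regularizer the error correction on slide 21 restores.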

Slide 19

Slide 19 text

Comparative Learning Methods
[Figure: batch learning recomputes (Z⊤ΓZ + Λ)⁻¹ from scratch at every step, whereas recursive learning reuses the previous result but accumulates the forgetting factor in the regularization term, yielding (Z⊤ΓZ + γΛ)⁻¹, (Z⊤ΓZ + γ²Λ)⁻¹, …, (Z⊤ΓZ + γ^M Λ)⁻¹.]

Slide 20

Slide 20 text

Estimation Error in Recursive Learning
• Estimation error arises in recursive learning because the forgetting effect is recursively applied to the regularization term: the batch solution uses (Z⊤ΓZ + Λ)⁻¹, whereas the recursion yields (Z⊤ΓZ + γ^M Λ)⁻¹.
• For MAB policies that run for long periods, the resulting loss of estimation accuracy due to overfitting can be critical.
• Addressing this error creates a trade-off between accuracy and online performance.
[Figure: estimation error of the predictive distribution parameters (mean μ̂ and covariance Σ̂) for M = 0, 200, 400, 600; the error grows with M.]

Slide 21

Slide 21 text

Error Correction Method
• A novel recursive error correction method balances estimation accuracy and online performance.
• Correction steps: 1. invert the recursively maintained matrix; 2. subtract the accumulated error from the regularization term; 3. invert again, recovering (Z⊤ΓZ + γ⁰Λ)⁻¹ = (Z⊤ΓZ + Λ)⁻¹, so there is no estimation error when M = 0. A sketch follows below.
• High computational cost due to the two inversions of an R × R matrix.
• The method is therefore executed at intervals determined by the acceptable estimation accuracy, rather than at every step.
[Figure: after correction, the estimation error of the predictive distribution parameters vanishes, as in the M = 0 case.]
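A minimal sketch of the correction idea under the assumption (consistent with the preceding slides) that the recursion maintains P = (Z⊤ΓZ + γ^M Λ)⁻¹ with Λ = λI; subtracting the decayed part of the regularizer restores the batch form. This is an illustrative reconstruction, not the authors' exact algorithm.

```python
import numpy as np

def correct_error(P, gamma, M, lam):
    """Restore (Z^T Gamma Z + Lambda)^{-1} from the recursive (Z^T Gamma Z + gamma^M Lambda)^{-1}."""
    R = P.shape[0]
    A = np.linalg.inv(P)                           # 1. invert: Z^T Gamma Z + gamma^M * lam * I
    A -= (gamma ** M - 1.0) * lam * np.eye(R)      # 2. subtract the accumulated regularizer error
    return np.linalg.inv(A)                        # 3. invert again: (Z^T Gamma Z + lam * I)^{-1}
```

Presumably M is then reset to 0 after each correction (as suggested by the (M = 0) annotations on slide 13), so the decay of the regularizer restarts from Λ.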

Slide 22

Slide 22 text

4. Evaluation

Slide 23

Slide 23 text

Simulation Setup
• Nonstationary and nonlinear contextual MAB simulation.
• Each arm's reward follows a normal distribution 𝒩(μ, σ²), with the mean μ determined by the context x_t = (x_{t,d})_{1≤d≤2} as shown below.
• The banded curve moves leftward over time (one full rotation in 4000 trials).
• The arm a_t^{(1)} always has μ = μ1.
• Parameters: μ1 = 0.1, μ2 = 0.0, μ3 = 1.0, σ² = 0.01, δ = 0.8, ρ = 4000.
• The magnitudes of the means satisfy μ2 < μ1 ≪ μ3. To maximize the expected reward, the player must select the corresponding arm inside the banded curve and choose arm a^{(1)} otherwise.

Slide 24

Slide 24 text

Baseline Policies
Policy (Nonlinear / Nonstationary / Recursive learning):
• RW-GPB (Proposal): ✓ / ✓ / ✓ — evaluated with multiple correction intervals τ to compare error-correction effects.
• GP+UCB (Weighted, RFF): ✓ / ✓ / – — state of the art; reduced learning time with RFF.
• GP+UCB (Weighted): ✓ / ✓ / –
• GP+UCB (Sliding Window): ✓ / ✓ / –
• GP+UCB (Restarting): ✓ / ✓ / –
• GP+UCB: ✓ / – / –
• Decay LinUCB: – / ✓ / ✓
A constant exploration scale β = 1 is used for all policies to clarify the effect of each policy's regression model.

Slide 25

Slide 25 text

Simulation Results: Trade-off Analysis
• The proposed RW-GPB policy achieves higher cumulative rewards and shorter execution time than the state-of-the-art GP+UCB (Weighted, RFF) policy.
• Compared to GP+UCB (Weighted, RFF),
  • RW-GPB (τ = 1) reduces the execution time by 71% with equal rewards.
  • RW-GPB (τ = 40) reduces the execution time by 92% with higher rewards.
• The GP+UCB (Weighted) policy without RFF had the highest reward and the longest execution time.
  • Improving the approximation accuracy of the kernel function is therefore also essential.

Slide 26

Slide 26 text

Simulation Results: Trade-off Analysis
• Accumulated errors reduce the cumulative reward, while the frequency of error correction has virtually no effect on execution time.
  • Error correction should therefore be performed aggressively.
• Interestingly, the cumulative reward is higher for τ = 4 than for the most accurate setting τ = 1.
  • This result indicates that a slight increase in exploration frequency may be helpful in nonstationary environments.
[Figure: cumulative rewards vs. trials per second trade-off for RW-GPB (τ = 1, 4, 40, 100, 400, 800, 1600), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).]

Slide 27

Slide 27 text

Simulation Results: Computation Time
• The proposed RW-GPB policy keeps the execution time per trial constant.
• The policy without recursive learning increases the execution time linearly.
• In addition, the policy without RFF increases it even more steeply.
[Figure: cumulative execution time over 4000 trials for RW-GPB (τ = 4), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).]

Slide 28

Slide 28 text

5. Conclusion

Slide 29

Slide 29 text

Conclusion
• RW-GPB Policy
  • We introduced RW-GPB, a new online policy for nonstationary and nonlinear contextual MAB problems, which balances accuracy and online performance.
• RW-GPR Model
  • Our novel RW-GPR model, equipped with a practical error correction method, effectively implements the proposed policy.
• Experimental Results
  • Experimental results demonstrate that RW-GPB significantly reduces computational time while maintaining cumulative reward in simulations.

Slide 30

Slide 30 text

Future Work
• Meta-Recommender System
  • We aim to implement and evaluate a meta-recommender system that autonomously optimizes the selection of recommendation algorithms using the proposed policy.
• Client-Side Agents
  • Future research will explore applying this lightweight policy to client-side agents for solving complex autonomous tasks on resource-constrained devices.
• Real-World Effectiveness
  • We expect the proposed policy to enhance the effectiveness of autonomous systems across various real-world scenarios.

Slide 31

Slide 31 text

No content