monochromegane
July 04, 2024

# Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

IEEE COMPSAC 2024
https://ieeecompsac.computer.org/2024/
Osaka, Japan, July 2-4, 2024


## Transcript

1. ### Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

Yusuke Miyake [1][2], Ryuji Watanabe [2], and Tsunenori Mine [1] / Kyushu University [1], Pepabo R&D Institute, GMO Pepabo, Inc. [2]. July 2-4, 2024. The 48th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2024).
2. ### Agenda

1. Introduction 2. Related works 3. Proposal 4. Evaluation 5. Conclusion

4. ### Background

• Selecting the optimal behavior from many candidates is crucial for practical applications. • The effectiveness of the candidates cannot be known in advance. • Continuous comparative evaluation in an actual environment is essential. • However, such evaluation incurs opportunity loss, both short-term and long-term. • The reduction of opportunity loss can be formulated as a multi-armed bandit (MAB) problem.
5. ### Multi-Armed Bandit (MAB) Problem

• The player must select one arm from multiple arms to maximize the reward. • The reward is stochastically given based on the chosen arm. • The player needs to infer the reward distribution from the trial results. • To find the optimal arm, the player should balance exploitation and exploration. (The term 'arm' is derived from the arm of a slot machine.)
6. ### The simplest MAB policy

• The $\epsilon$-greedy policy selects an arm uniformly at random with probability $\epsilon$ (exploration), i.e., each of the $L$ arms with probability $\epsilon / L$, and selects the arm with the highest average reward so far, $\operatorname{argmax}_{l=1,\dots,L} \hat{y}^{(l)}$, with probability $1 - \epsilon$ (exploitation). (Figure: the selection probabilities of the bandit policy contrasted with A/B testing.)
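Below is a minimal sketch of the $\epsilon$-greedy rule; it is not code from the talk, and the three arm reward means in the toy loop are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(avg_rewards, epsilon=0.1):
    """Explore uniformly with probability epsilon (each arm gets
    epsilon / L), otherwise exploit the best empirical arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(avg_rewards)))
    return int(np.argmax(avg_rewards))

# toy loop with three arms; the reward means are hypothetical
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1000):
    means = np.divide(sums, counts, out=np.zeros(3), where=counts > 0)
    a = epsilon_greedy(means)
    r = rng.normal([0.1, 0.0, 1.0][a], 0.1)
    counts[a] += 1
    sums[a] += r
```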
7. ### Motivation of our Approach

• Nonlinearity: in practical applications such as e-commerce, user behavior and preferences often exhibit complex, nonlinear patterns that cannot be captured by simple linear models. • Nonstationarity: user preferences and environmental conditions in real-world applications change over time. • Online performance: response delays are detrimental to the user experience in real-world applications.

9. ### Gaussian Process Regression

• A method to infer the distribution of a function as a stochastic process from data. • Effective in modeling nonlinear functions and prediction uncertainties. • Use of a kernel function $k(x_i, x_j)$: efficiently computes the inner product in a high-dimensional feature space of basis functions. • Application to nonlinear MAB problems: widely explored. • Challenge of GP regression: training requires computing $K^{-1}$ for $K \in \mathbb{R}^{N \times N}$, so training time grows cubically as the dataset grows, and few policies have been developed to handle nonstationarity. A sketch of the exact computation follows.
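To make the bottleneck concrete, here is a minimal sketch of exact GP prediction with an RBF kernel (an assumed kernel choice, not necessarily the authors'); the linear solves against the $N \times N$ matrix $K$ are the $\mathcal{O}(N^3)$ cost that grows with the dataset.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, x_star, noise=0.1):
    """Exact GP predictive mean and variance at test points x_star.
    The solves against K (an N x N matrix) cost O(N^3)."""
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    k_star = rbf(X, x_star)                       # (N, M)
    mean = k_star.T @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, k_star)
    var = rbf(x_star, x_star).diagonal() - (k_star * v).sum(axis=0)
    return mean, var

# toy usage on a 1-D nonlinear function
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X).ravel()
mu, var = gp_predict(X, y, np.array([[0.5], [2.0]]))
```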
10. ### Weighted GP-UCB [Y. Deng 2022]

• A GP regression model-based nonstationary and nonlinear policy. • Nonstationarity: focuses on new training data by using two new weight matrices. • Online performance: using Random Fourier Features (RFF), computes the predictive distribution of GP regression in the form of linear regression in an $R$-dimensional space, where $R \lll N$; the kernel matrix is approximated as $K \simeq ZZ^\top$ with $Z \in \mathbb{R}^{N \times R}$, so only $Z^\top Z \in \mathbb{R}^{R \times R}$ (far smaller than $K \in \mathbb{R}^{N \times N}$) needs to be inverted, keeping the size of the computed inverse matrix constant and reducing computational complexity. • Limitation: the absence of a recursive learning algorithm; it only partially mitigates the issue of escalating training time with a growing dataset.
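For context, the selection rule shared by the GP+UCB family can be sketched as below; this is a generic illustration, with $\beta = 1$ matching the constant exploration scale used later in the evaluation, not the paper's exact algorithm.

```python
import numpy as np

def ucb_select(mu, sigma, beta=1.0):
    """Optimism in the face of uncertainty: score each arm by its
    predictive mean plus beta times its predictive standard deviation."""
    return int(np.argmax(mu + beta * sigma))
```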
11. ### NysKRLS [T. Zhang 2020]

• A nonstationary kernel Recursive Least Squares (RLS) method. • Nonstationarity: introduces a forgetting mechanism for old training data. • Online performance: using the Nyström approximation, computes the predictive distribution of GP regression in the form of linear regression in an $R$-dimensional space, where $R \lll N$, and applies a linear RLS algorithm. • Limitation: the regularization effect decreases as the number of recursive computations increases, with the maintained inverse drifting toward $(Z^\top \Gamma Z + \gamma^M \Lambda)^{-1}$; the resulting loss of estimation accuracy due to overfitting can be critical.
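A minimal sketch of the Nyström feature map that NysKRLS builds on, assuming an RBF kernel and an arbitrary set of landmark points (both are illustrative choices):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def nystrom_features(X, landmarks):
    """Nystrom feature map: k(x_i, x_j) ~= z(x_i)^T z(x_j), where the
    R landmark points play the role of the low-rank basis."""
    K_rr = rbf(landmarks, landmarks)          # (R, R)
    K_nr = rbf(X, landmarks)                  # (N, R)
    w, V = np.linalg.eigh(K_rr)               # K_rr^{-1/2} via eigh
    w = np.maximum(w, 1e-12)
    return K_nr @ (V * w**-0.5) @ V.T         # Z in R^{N x R}
```

With $Z$ in hand, the GP posterior reduces to $R$-dimensional linear regression, to which a linear RLS algorithm can be applied.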

13. ### Our Proposed Policy

• An online nonstationary and nonlinear contextual MAB policy. • Nonlinearity and nonstationarity: introduces a forgetting mechanism into a nonlinear GP regression model. • Online performance: using RFF, computes the predictive distribution of GP regression in the form of linear regression in an $R$-dimensional space, where $R \lll N$, and applies a linear RLS algorithm. • Key features: fast decision-making with recursive learning, and accurate error correction in the predictive distribution. (Figure: recursive update flow in which each step carries the parameters $P_{N,M}$, $Q_{N,M}$ forward with forgetting factor $\gamma$ and produces the predictive distribution $p(y_* \mid x_*, X, \mathbf{y})$; error correction resets $M = 0$.)
14. ### Random Fourier Features

• A method to approximate the kernel function $k(x_i, x_j)$ using $R' = R/2$ samples from a probability distribution $p(\omega)$, so that $k(x_i, x_j) \simeq z(x_i)^\top z(x_j)$. • Linear method: by decomposing the kernel function into inputs with $R$-dimensional features, the original problem can be solved as a linear problem in $R$-dimensional space; the kernel matrix $K \in \mathbb{R}^{N \times N}$ is approximated as $K \simeq ZZ^\top$ with $Z \in \mathbb{R}^{N \times R}$, and only $Z^\top Z \in \mathbb{R}^{R \times R}$ is needed.
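A minimal sketch of RFF for the RBF kernel (the slides do not pin down the kernel; the RBF spectral density $p(\omega) = \mathcal{N}(0, I/\ell^2)$ is assumed here): $R' = R/2$ sampled frequencies give cos/sin pairs, so $z(x) \in \mathbb{R}^R$.

```python
import numpy as np

def rff_map(X, R=100, ell=1.0, seed=0):
    """z(x) with k(x_i, x_j) ~= z(x_i)^T z(x_j): draw R' = R/2 frequencies
    omega ~ N(0, I / ell^2) (the spectral density of the RBF kernel) and
    form cos/sin pairs, giving an R-dimensional feature vector."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, 1.0 / ell, size=(X.shape[1], R // 2))
    proj = X @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(R // 2)

# K (N x N) is approximated by Z Z^T with Z in R^{N x R}:
# Z = rff_map(X); K_approx = Z @ Z.T
```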
15. ### Comparative Learning Methods

(Figure: plain GP learning inverts the growing kernel matrix $K' \in \mathbb{R}^{N \times N}$ at every step, whereas GP with RFF learning inverts the fixed-size matrix $Z^\top Z + \Lambda \in \mathbb{R}^{R \times R}$ at every step.)
16. ### Forgetting Mechanism

• Incorporating exponential forgetting for past inputs and outputs enhances estimation accuracy in nonstationary environments. • The proposed method assumes that more distant past data has larger observation errors $\epsilon_n$, and thus lower accuracy; recent data is assumed to have smaller observation errors. (Figure: the transition of the observation error distribution $\epsilon_n$, and the predictive distribution $\hat{\mu}''$ with $1\sigma$ confidence based on $\hat{\Sigma}''$ for a nonstationary and nonlinear regression problem; past training data come from $f_A$, recent training data from $f_B$, and the prediction fits the recent training data, i.e., quick adaptation to changing environments.) In batch form, this can be sketched as below.
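A sketch of the batch form under the assumption that $\Gamma = \mathrm{diag}(\gamma^{N-1}, \dots, \gamma, 1)$; the paper's exact weighting may differ.

```python
import numpy as np

def weighted_batch_fit(Z, y, gamma=0.99, lam=1.0):
    """Batch weighted least squares with exponential forgetting:
    Gamma = diag(gamma^{N-1}, ..., gamma, 1) down-weights old rows,
    equivalent to assuming larger observation error eps_n on older data."""
    N, R = Z.shape
    g = gamma ** np.arange(N - 1, -1, -1.0)
    A = Z.T @ (g[:, None] * Z) + lam * np.eye(R)   # Z^T Gamma Z + Lambda
    P = np.linalg.inv(A)
    w = P @ (Z.T @ (g * y))
    return w, P  # predictive mean z(x)^T w; variance scales with z(x)^T P z(x)
```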
17. ### Comparative Learning Methods

(Figure: GP with RFF learning inverts $(Z^\top Z + \Lambda)^{-1}$ at every step; nonstationary GP with RFF learning instead inverts $(Z^\top \Gamma Z + \Lambda)^{-1}$, where $\Gamma$ applies the forgetting weights; both matrices remain in $\mathbb{R}^{R \times R}$.)
18. ### Recursive Learning Mechanism

• Recursive Least Squares (RLS) updates the regression model parameters using previous calculations and new observations. • Efficiency: the algorithm computes the parameters at time $N + 1$ efficiently from the results up to time $N$ and the newly observed values. • Reduced computational cost: by avoiding recomputation of the inverse matrix, the algorithm significantly reduces computational cost. A sketch of the update follows.
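A sketch of the standard exponentially weighted RLS step via the Sherman-Morrison identity; the paper's RW-GPR update also maintains quantities for the predictive variance, which are omitted here.

```python
import numpy as np

def rls_update(w, P, z, y, gamma=0.99):
    """One exponentially weighted RLS step: A_n = gamma * A_{n-1} + z z^T,
    maintained as P = A^{-1} via Sherman-Morrison in O(R^2), so no R x R
    inverse is recomputed. Note the side effect discussed on the next
    slides: the initial regularizer Lambda inside A decays by gamma each
    step, reaching gamma^M * Lambda after M steps."""
    Pz = P @ z
    k = Pz / (gamma + z @ Pz)          # gain vector
    w = w + k * (y - z @ w)            # correct by the prediction error
    P = (P - np.outer(k, Pz)) / gamma  # rank-1 downdate, then rescale
    return w, P
```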
19. ### Comparative Learning Methods

(Figure: batch learning recomputes $(Z^\top \Gamma Z + \Lambda)^{-1}$ from scratch at every step, whereas recursive learning carries the previous result forward with forgetting factor $\gamma$; after $M$ recursive steps the maintained matrix becomes $(Z^\top \Gamma Z + \gamma^M \Lambda)^{-1}$.)
20. ### Estimation Error in Recursive Learning

• Estimation error arises in recursive learning because the forgetting effect is recursively applied to the regularization term: the recursion yields $(Z^\top \Gamma Z + \gamma^M \Lambda)^{-1}$ instead of the batch result $(Z^\top \Gamma Z + \Lambda)^{-1}$. • For MAB policies that run for long periods, the loss of estimation accuracy due to overfitting can be fatal. • Addressing this error causes a trade-off between accuracy and online performance. (Figure: the estimation error of the predictive distribution parameters $\hat{\mu}''$ and $\hat{\Sigma}''$ for each $M \in \{0, 200, 400, 600\}$; the error grows with $M$.)
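A small numeric check of this drift, with random features used purely for illustration: after $M$ recursive steps the maintained matrix differs from the batch one by exactly $(\gamma^M - 1)\Lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
R, M, gamma, lam = 8, 400, 0.99, 1.0
Z = rng.normal(size=(M, R))                      # illustrative features

# recursive path: A_m = gamma * A_{m-1} + z_m z_m^T, starting from Lambda
A_rec = lam * np.eye(R)
for z in Z:
    A_rec = gamma * A_rec + np.outer(z, z)

# batch path: A = Z^T Gamma Z + Lambda, Gamma = diag(gamma^{M-1}, ..., 1)
g = gamma ** np.arange(M - 1, -1, -1.0)
A_batch = Z.T @ (g[:, None] * Z) + lam * np.eye(R)

# the gap is exactly the decayed regularizer: (gamma^M - 1) * Lambda
print(np.allclose(A_rec - A_batch, (gamma**M - 1) * lam * np.eye(R)))  # True
```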
21. ### Error Correction Method

• A novel recursive error correction method balances estimation accuracy and online performance. • The correction restores the batch result $(Z^\top \Gamma Z + \gamma^0 \Lambda)^{-1} = (Z^\top \Gamma Z + \Lambda)^{-1}$ in three steps: 1. invert the recursively maintained matrix; 2. subtract the error from the regularization term; 3. invert again. • There is no estimation error when $M = 0$. • The computational cost is high because the $R \times R$ matrix is inverted twice, so the method is executed at intervals chosen from the acceptable estimation accuracy rather than at every step.
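A sketch of the correction implied by the formulas on this slide; the function name and arguments are mine, not the paper's.

```python
import numpy as np

def correct_regularizer(P_rec, Lam, gamma, M):
    """Restore (Z^T Gamma Z + Lambda)^{-1} from the drifted
    P_rec = (Z^T Gamma Z + gamma^M Lambda)^{-1}:
    1) invert P_rec, 2) subtract the error (gamma^M - 1) * Lambda,
    3) invert again. Two R x R inversions, hence run at intervals tau."""
    A = np.linalg.inv(P_rec)
    A = A - (gamma**M - 1.0) * Lam     # Lam is the R x R matrix Lambda
    return np.linalg.inv(A)
```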

23. ### Simulation Setup

• A nonstationary and nonlinear contextual MAB simulation. • The arm's reward follows a normal distribution $\mathcal{N}(\mu, \sigma^2)$, with the mean $\mu$ corresponding to the context $x_t = (x_{t,d})_{1 \le d \le 2}$. • The banded curve moves leftward over time (one full rotation in 4000 trials). • The arm $a^{(1)}_t$ always has $\mu = \mu_1$. • Parameters: $\mu_1 = 0.1$, $\mu_2 = 0.0$, $\mu_3 = 1.0$, $\sigma^2 = 0.01$, $\delta = 0.8$, $\rho = 4000$. • The magnitudes of the means satisfy $\mu_2 < \mu_1 \ll \mu_3$; to maximize the expected reward, the player must select the arm corresponding to the band curve when the context falls within it and choose arm $a^{(1)}$ otherwise.
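A heavily hedged sketch of the reward model only: the slides leave the band geometry to a figure, so `in_band` below is a hypothetical stand-in for the membership test, and the arm indexing is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, mu3, sigma2, rho = 0.1, 0.0, 1.0, 0.01, 4000

def sample_reward(arm, t, in_band):
    """Reward model only: arm a^(1) (index 0) always has mean mu1; the
    arm currently inside the moving band has mean mu3; any other arm
    has mean mu2 (so mu2 < mu1 << mu3). `in_band(arm, t)` is a
    hypothetical stand-in for the band test; the band drifts leftward
    with one full rotation every rho = 4000 trials."""
    if arm == 0:
        mu = mu1
    elif in_band(arm, t):
        mu = mu3
    else:
        mu = mu2
    return rng.normal(mu, np.sqrt(sigma2))
```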
24. ### Baseline Policies

| Nonlinear | Nonstationary | Recursive learning | Policy | Note |
|:---:|:---:|:---:|---|---|
| ✓ | ✓ | ✓ | RW-GPB (Proposal) | Evaluated with multiple correction intervals $\tau$ to compare error correction effects |
| ✓ | ✓ | | GP+UCB (Weighted, RFF) | State-of-the-art; reduced learning time with RFF |
| ✓ | ✓ | | GP+UCB (Weighted) | |
| ✓ | ✓ | | GP+UCB (Sliding Window) | |
| ✓ | ✓ | | GP+UCB (Restarting) | |
| ✓ | | | GP+UCB | |
| | ✓ | ✓ | Decay LinUCB | |

A constant exploration scale $\beta = 1$ is used for all policies to clarify the effect of each policy's regression model.
25. ### Simulation Results: Trade-off Analysis

• The proposed RW-GPB policy achieves higher cumulative rewards and shorter execution time than the state-of-the-art GP+UCB (Weighted, RFF) policy. • Compared to GP+UCB (Weighted, RFF), RW-GPB ($\tau = 1$) reduces the execution time by 71% with equal rewards, and RW-GPB ($\tau = 40$) reduces the execution time by 92% with higher rewards. • The GP+UCB (Weighted) policy without RFF had the highest reward and the longest execution time, so improving the approximation accuracy of the kernel function is also essential.
26. ### Simulation Results: Trade-off Analysis

• Accumulated errors reduce the cumulative reward, while the frequency of error correction has virtually no effect on execution time; error correction should therefore be performed aggressively. • Interestingly, the cumulative reward is higher for $\tau = 4$ than for the most accurate setting $\tau = 1$. • This result indicates that a slight increase in exploration frequency may be helpful in nonstationary environments. (Figure: the cumulative rewards versus trials-per-second trade-off for RW-GPB with $\tau \in \{1, 4, 40, 100, 400, 800, 1600\}$, GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).)
27. ### Simulation Results: Computation Time

• The proposed RW-GPB policy keeps the execution time constant. • The policy without recursive learning increases the execution time linearly. • In addition, the policy without RFF increases it super-linearly. (Figure: cumulative execution time over 4000 trials for RW-GPB ($\tau = 4$), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).)

29. ### Conclusion

• RW-GPB policy: we introduced RW-GPB, a new online policy for nonstationary and nonlinear contextual MAB problems, which balances accuracy and online performance. • RW-GPR model: our novel RW-GPR model, equipped with a practical error correction method, effectively implements the proposed policy. • Experimental results: the experiments demonstrate that RW-GPB significantly reduces computational time while maintaining cumulative reward in simulations.
30. ### Future Work

• Meta-recommender system: we aim to implement and evaluate a meta-recommender system that autonomously optimizes the selection of recommendation algorithms using the proposed policy. • Client-side agents: future research will explore applying this lightweight policy to client-side agents for solving complex autonomous tasks on resource-constrained devices. • Real-world effectiveness: we expect the proposed policy to enhance the effectiveness of autonomous systems across various real-world scenarios.