Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

IEEE COMPSAC 2024
https://ieeecompsac.computer.org/2024/
Osaka, Japan, July 2–4, 2024

monochromegane

July 04, 2024

Transcript

  1. Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

    Yusuke Miyake [1][2], Ryuji Watanabe [2], and Tsunenori Mine [1]. [1] Kyushu University, [2] Pepabo R&D Institute, GMO Pepabo, Inc. July 2-4, 2024. The 48th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2024)
  2. Background
     • Selecting the optimal behavior from many candidates is crucial for practical applications.
     • Their effectiveness cannot be known in advance.
     • Continuous comparative evaluation in an actual environment is essential.
     • However, such evaluation results in opportunity loss, both short-term and long-term.
     • Reducing this opportunity loss can be framed as a multi-armed bandit (MAB) problem.
  3. Multi-Armed Bandit (MAB) Problem
     • The player must select one arm from multiple arms to maximize the reward.
     • The reward is given stochastically based on the chosen arm.
     • The player needs to infer the reward distribution from the trial results.
     • To find the optimal arm, the player should balance exploitation and exploration.
     • The term 'arm' is derived from the 'arm' of a slot machine.
     (Figure: the select-reward-infer loop between the player and the arms.)
  4. The Simplest MAB Policy
     • The ϵ-Greedy policy selects an arm uniformly at random with probability ϵ (exploration), and selects the arm with the highest average reward so far, argmax_{l=1,…,L} ŷ^(l), with probability 1 − ϵ (exploitation).
     • Under exploration, each arm a ∈ A is chosen with probability ϵ/L.
     (Figure: selection probabilities of the bandit policy contrasted with A/B testing, which assigns a fixed probability to every arm a ∈ A; a code sketch follows below.)
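     The following is a minimal, illustrative ϵ-greedy sketch in Python/NumPy (not from the paper); the arm count L, the exploration rate, and the simulated reward means are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 3            # number of arms (hypothetical)
epsilon = 0.1    # exploration probability (hypothetical)
counts = np.zeros(L)        # times each arm was pulled
mean_rewards = np.zeros(L)  # running average reward per arm

def select_arm():
    # Explore: pick any arm uniformly with probability epsilon.
    if rng.random() < epsilon:
        return int(rng.integers(L))
    # Exploit: pick the arm with the highest average reward so far.
    return int(np.argmax(mean_rewards))

def update(arm, reward):
    # Incremental update of the running average for the chosen arm.
    counts[arm] += 1
    mean_rewards[arm] += (reward - mean_rewards[arm]) / counts[arm]

# Toy loop with stand-in true means; rewards are Gaussian around them.
true_means = np.array([0.1, 0.0, 1.0])
for t in range(1000):
    a = select_arm()
    r = rng.normal(true_means[a], 0.1)
    update(a, r)
```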
  5. Motivation of Our Approach
     • Nonlinearity: in practical applications such as e-commerce, user behavior and preferences often exhibit complex, nonlinear patterns that cannot be captured by simple linear models.
     • Nonstationarity: user preferences and environmental conditions in real-world applications change over time.
     • Online performance: response delays are detrimental to the user experience in real-world applications.
  6. Gaussian Process Regression
     • A method to infer the distribution of a function as a stochastic process from data.
     • Effective in modeling nonlinear functions and prediction uncertainties.
     • Use of a kernel function k: k(x_i, x_j) efficiently computes the inner product in a high-dimensional feature space defined by basis functions.
     • Application to nonlinear MAB problems: GP regression has been widely explored for nonlinear MAB problems.
     • Challenge of GP regression: training requires computing K⁻¹ for the kernel matrix K ∈ ℝ^{N×N}, so training time grows rapidly (cubically in N) as the dataset grows (see the sketch below).
     • Few policies have been developed to handle nonstationarity.
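     To make the bottleneck concrete, here is a minimal sketch (not the paper's code) of exact GP regression with an RBF kernel; the length scale, noise variance, and toy data are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel matrix between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, y, X_star, noise_var=0.01):
    # Exact GP regression: inverts the N x N kernel matrix,
    # which is the growing cost the slide refers to.
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_inv = np.linalg.inv(K)
    K_star = rbf_kernel(X_star, X)
    mean = K_star @ K_inv @ y
    cov = rbf_kernel(X_star, X_star) - K_star @ K_inv @ K_star.T
    return mean, np.diag(cov)

# Toy usage with stand-in data.
rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)
mu, var = gp_predict(X, y, np.linspace(-4, 4, 9)[:, None])
```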
  7. Weighted GP-UCB [Y. Deng 2022]
     • A GP regression model-based nonstationary and nonlinear policy.
     • Nonstationarity: focuses on new training data by using two new weight matrices.
     • Online performance: using Random Fourier Features (RFF), the predictive distribution of GP regression is computed in the form of linear regression in R-dimensional space, where R ⋘ N. The kernel matrix K ∈ ℝ^{N×N} is approximated as K ≃ ZZ⊤ with Z ∈ ℝ^{N×R}, so only Z⊤Z ∈ ℝ^{R×R} needs to be inverted, keeping the size of the computed inverse matrix constant and reducing computational complexity.
     • Limitation: the absence of a recursive learning algorithm means it only partially mitigates the issue of escalating training time with a growing dataset.
  8. NysKRLS [T. Zhang 2020]
     • A nonstationary kernel Recursive Least Squares (RLS) method.
     • Nonstationarity: introduces a forgetting mechanism for old training data.
     • Online performance: using the Nyström approximation, the predictive distribution of GP regression is computed in the form of linear regression in R-dimensional space, where R ⋘ N, and a linear RLS algorithm is applied.
     • Limitation: the regularization effect decreases as the number of recursive computations increases; the computed inverse becomes (Z⊤ΓZ + γ^M Λ)⁻¹, so the forgetting factor γ also shrinks the regularization term. The resulting loss of estimation accuracy due to overfitting can be critical.
  9. Our Proposed Policy
     • An online nonstationary and nonlinear contextual MAB policy.
     • Nonlinearity and nonstationarity: introduces a forgetting mechanism into the nonlinear GP regression model.
     • Online performance: using RFF, the predictive distribution of GP regression is computed in the form of linear regression in R-dimensional space, where R ⋘ N, and a linear RLS algorithm is applied.
     • Key features: fast decision-making with recursive learning, and accurate error correction in the predictive distribution (see the sketch below).
     (Diagram: the model parameters P_{N,M} and Q_{N,M} are updated recursively from each new observation (x_N, y_N) with the forgetting factor γ, producing the predictive distribution p(y_* | x_*, X, y) at every step; the (M = 0) markers indicate steps where the error correction has just been applied.)
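     As an illustration of the decision loop only (not the paper's implementation), a UCB-style arm selection from a predictive mean and variance might look as follows; the `model.predict` interface and the exploration scale `beta` are hypothetical placeholders for the components described above.

```python
import numpy as np

def ucb_select(model, contexts, beta=1.0):
    """Pick the arm whose upper confidence bound is largest.

    `model.predict(x)` is assumed to return the predictive mean and
    variance of the reward for context x (hypothetical interface).
    """
    scores = []
    for x in contexts:
        mean, var = model.predict(x)
        scores.append(mean + beta * np.sqrt(var))  # UCB score
    return int(np.argmax(scores))
```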
  10. Random Fourier Features
     • A method to approximate the kernel function k(x_i, x_j) using R′ = R/2 samples from a probability distribution p(ω), so that k(x_i, x_j) ≃ z(x_i)⊤z(x_j).
     • Linear method: by decomposing the kernel function into inputs with R-dimensional features, the original problem can be solved as a linear problem in R-dimensional space. The kernel matrix K ∈ ℝ^{N×N} is approximated as K ≃ ZZ⊤ with Z ∈ ℝ^{N×R}, and only Z⊤Z ∈ ℝ^{R×R} needs to be handled, where N ⋙ R (see the sketch below).
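     For illustration (a sketch under standard RFF assumptions, not the paper's code): for an RBF kernel, R′ = R/2 frequencies ω are drawn from a Gaussian p(ω) and cosine and sine features are stacked so that z(x_i)⊤z(x_j) ≈ k(x_i, x_j); the length scale and dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rff(dim, R, length_scale=1.0):
    # For the RBF kernel, p(omega) is Gaussian with scale 1/length_scale.
    R_half = R // 2                      # R' = R/2 sampled frequencies
    omega = rng.normal(0.0, 1.0 / length_scale, size=(R_half, dim))

    def z(x):
        # Feature map with cos and sin parts so that z(xi)^T z(xj) ~ k(xi, xj).
        proj = omega @ x
        return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(R_half)

    return z

# Quick check that the approximation tracks the exact RBF kernel.
z = make_rff(dim=2, R=200)
xi, xj = np.array([0.3, -0.5]), np.array([0.1, 0.2])
approx = z(xi) @ z(xj)
exact = np.exp(-0.5 * np.sum((xi - xj) ** 2))  # length_scale = 1
```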
  11. Comparative Learning Methods
     (Diagram: standard GP learning must invert the growing kernel matrix K′ ∈ ℝ^{N×N} at every step, whereas GP with RFF learning only inverts (Z⊤Z + Λ)⁻¹ with Z⊤Z ∈ ℝ^{R×R}, whose size stays constant.)
  12. Forgetting Mechanism
     • Incorporating exponential forgetting for past inputs and outputs enhances estimation accuracy in nonstationary environments.
     • The proposed method assumes that more distant past data has larger observation errors ϵ_n, and thus lower accuracy: past data has larger observation errors, recent data has smaller observation errors (see the sketch below).
     (Figure: predictive distribution (predictive mean and 1σ confidence) for a nonstationary and nonlinear regression problem; with past training data from f_A and recent training data from f_B, the predicted mean fits the recent training data, enabling quick adaptation to changing environments. A second panel shows the transition of the observation error distribution ϵ_n.)
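     A minimal sketch of the idea, assuming an RFF feature matrix Z, a forgetting factor γ, and regularization Λ = λI (not the paper's exact formulation): exponential forgetting amounts to a weighted linear regression in RFF space in which older samples receive smaller weights.

```python
import numpy as np

def weighted_rff_posterior(Z, y, gamma=0.99, lam=1.0, noise_var=0.01):
    """Batch form of exponentially forgetting linear regression in RFF space.

    Z : (N, R) matrix of RFF features z(x_1), ..., z(x_N)
    y : (N,) observed rewards
    gamma : forgetting factor; the weight of sample n is gamma**(N - n),
            so distant past data is down-weighted (treated as noisier).
    """
    N, R = Z.shape
    weights = gamma ** np.arange(N - 1, -1, -1)   # oldest -> smallest weight
    Gamma = np.diag(weights)
    A = Z.T @ Gamma @ Z + lam * np.eye(R)         # Z^T Gamma Z + Lambda
    A_inv = np.linalg.inv(A)
    mean_coef = A_inv @ Z.T @ Gamma @ y           # posterior mean weights
    return mean_coef, noise_var * A_inv           # mean weights, covariance

def predict(z_star, mean_coef, cov):
    # Predictive mean and variance at a new feature vector z(x_*).
    mu = z_star @ mean_coef
    var = z_star @ cov @ z_star
    return mu, var
```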
  13. Comparative Learning Methods
     (Diagram: plain GP with RFF learning inverts (Z⊤Z + Λ)⁻¹ at each step, while nonstationary GP with RFF learning inverts (Z⊤ΓZ + Λ)⁻¹, where Γ is the forgetting weight matrix; both matrices stay in ℝ^{R×R}.)
  14. Recursive Learning Mechanism
     • Recursive Least Squares (RLS) updates the regression model parameters using previous calculations and new observations.
     • Efficiency: the algorithm computes the parameters at time N + 1 efficiently from the results up to time N and the observation at time N + 1.
     • Reduced computational cost: by avoiding the need to recompute the inverse matrix from scratch, the algorithm significantly reduces computational cost (see the sketch below).
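     As an illustrative sketch (standard forgetting-factor RLS, not the paper's exact recursion), the inverse matrix can be maintained with a rank-one Sherman-Morrison style update per observation, avoiding any explicit matrix inversion; note how the initial regularization decays at every step, which is the error source discussed on the following slides.

```python
import numpy as np

class ForgettingRLS:
    """Standard forgetting-factor RLS in R-dimensional feature space.

    P approximates the inverse of the (weighted) Gram matrix and theta the
    regression weights. The initial regularization decays by gamma at every
    step, which is the source of the estimation error discussed later.
    """

    def __init__(self, R, gamma=0.99, lam=1.0):
        self.gamma = gamma
        self.P = np.eye(R) / lam      # initial P = Lambda^{-1}
        self.theta = np.zeros(R)      # regression weights

    def update(self, z, y):
        # Rank-one (Sherman-Morrison style) update: no matrix inversion needed.
        Pz = self.P @ z
        k = Pz / (self.gamma + z @ Pz)            # gain vector
        self.theta = self.theta + k * (y - z @ self.theta)
        self.P = (self.P - np.outer(k, Pz)) / self.gamma

    def predict(self, z):
        return z @ self.theta
```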
  15. Comparative Learning Methods
     (Diagram: batch learning recomputes (Z⊤ΓZ + Λ)⁻¹ from scratch at every step, whereas recursive learning carries the previous result forward, but the computed inverse drifts to (Z⊤ΓZ + γΛ)⁻¹, (Z⊤ΓZ + γ²Λ)⁻¹, …, (Z⊤ΓZ + γ^M Λ)⁻¹ as the forgetting factor γ is applied M times to the regularization term.)
  16. Estimation Error in Recursive Learning
     • Estimation error arises in recursive learning because the forgetting effect is recursively applied to the regularization term: the recursive inverse (Z⊤ΓZ + γ^M Λ)⁻¹ deviates from the batch inverse (Z⊤ΓZ + Λ)⁻¹.
     • For MAB policies that run for long periods, the loss of estimation accuracy due to overfitting can be fatal.
     • Addressing this error causes a trade-off between accuracy and online performance.
     (Figure: estimation error of the predictive distribution parameters (mean and covariance) plotted for M = 0, 200, 400, 600; the error grows with the number of recursive steps M.)
  17. Error Correction Method
     • A novel recursive error correction method balances estimation accuracy and online performance.
     • The correction restores the recursively computed inverse (Z⊤ΓZ + γ^M Λ)⁻¹ to the exact form (Z⊤ΓZ + γ⁰Λ)⁻¹ = (Z⊤ΓZ + Λ)⁻¹: 1. invert the current matrix, 2. subtract the accumulated error from the regularization term, 3. invert again.
     • The correction has a high computational cost because it inverts an R × R matrix twice, so it is executed at intervals chosen from the acceptable estimation accuracy rather than at every step (see the sketch below).
     (Figure: estimation error of the predictive distribution parameters for each M; there is no estimation error when M = 0.)
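     The following sketch reflects one reading of the correction steps above (assumptions: the recursive estimate tracks (Z⊤ΓZ + γ^M Λ)⁻¹ with Λ = λI, and the correction restores the regularization to Λ); it is not the paper's exact algorithm.

```python
import numpy as np

def correct_inverse(P_rec, lam, gamma, M):
    """Restore the exact regularization in a recursively computed inverse.

    P_rec : R x R matrix approximating (Z^T Gamma Z + gamma**M * Lambda)^{-1}
    lam   : regularization scale, Lambda = lam * I
    Returns an estimate of (Z^T Gamma Z + Lambda)^{-1}.
    Two R x R inversions, hence run only every tau steps.
    """
    R = P_rec.shape[0]
    A = np.linalg.inv(P_rec)                    # 1. invert: Z^T Gamma Z + gamma^M Lambda
    A = A - (gamma**M - 1.0) * lam * np.eye(R)  # 2. remove the decayed regularization error
    return np.linalg.inv(A)                     # 3. invert back to the corrected inverse

# Hypothetical usage inside the recursive loop:
# if t % tau == 0:
#     rls.P = correct_inverse(rls.P, lam, gamma, M)
#     M = 0   # no estimation error right after correction
```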
  18. Simulation Setup
     • Nonstationary and nonlinear contextual MAB simulation.
     • Each arm's reward follows a normal distribution 𝒩(μ, σ²), with the mean μ determined by the context x_t = (x_{t,d})_{1≤d≤2}.
     • The banded curve moves leftward over time (one full rotation in ρ = 4000 trials).
     • The arm a^(1) always has μ = μ₁; the parameters are μ₁ = 0.1, μ₂ = 0.0, μ₃ = 1.0, σ² = 0.01, δ = 0.8, ρ = 4000.
     • Since μ₂ < μ₁ ≪ μ₃, maximizing the expected reward requires selecting the corresponding arm inside the band curve and choosing arm a^(1) otherwise.
  19. Baseline Policies
     Policy (Nonlinear / Nonstationary / Recursive learning) and notes:
     • RW-GPB (Proposal): ✓ / ✓ / ✓. Evaluated with multiple correction intervals τ to compare error correction effects.
     • GP+UCB (Weighted, RFF): ✓ / ✓ / –. State-of-the-art; reduced learning time with RFF.
     • GP+UCB (Weighted): ✓ / ✓ / –.
     • GP+UCB (Sliding Window): ✓ / ✓ / –.
     • GP+UCB (Restarting): ✓ / ✓ / –.
     • GP+UCB: ✓ / – / –.
     • Decay LinUCB: – / ✓ / ✓.
     A constant exploration scale β = 1 is used for all policies to clarify the effect of each policy's regression model.
  20. Simulation Results: Trade-off Analysis
     • The proposed RW-GPB policy achieves higher cumulative rewards and shorter execution time than the state-of-the-art GP+UCB (Weighted, RFF) policy.
     • Compared to GP+UCB (Weighted, RFF), RW-GPB (τ = 1) reduces the execution time by 71% with equal rewards, and RW-GPB (τ = 40) reduces the execution time by 92% with higher rewards.
     • The GP+UCB (Weighted) policy without RFF had the highest reward and the longest execution time, so improving the approximation accuracy of the kernel function is also essential.
  21. Simulation Results: Trade-off Analysis
     • Accumulated errors reduce the cumulative reward, while the frequency of error correction has virtually no effect on execution time. Error correction should therefore be performed aggressively.
     • Interestingly, the cumulative reward is higher for τ = 4 than for the most accurate setting τ = 1. This result indicates that a slight increase in exploration frequency may be helpful in nonstationary environments.
     (Figure: cumulative rewards vs. trials per second trade-off for RW-GPB with τ ∈ {1, 4, 40, 100, 400, 800, 1600}, GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).)
  22. Simulation Results: Computation Time
     • The proposed RW-GPB policy keeps the per-trial execution time constant.
     • The policy without recursive learning increases the execution time linearly, and the policy without RFF increases it even more steeply, since each update must invert a growing N × N kernel matrix.
     (Figure: cumulative execution time (sec) over 4000 trials for RW-GPB (τ = 4), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).)
  23. Conclusion
     • RW-GPB Policy: we introduced RW-GPB, a new online policy for nonstationary and nonlinear contextual MAB problems, which balances accuracy and online performance.
     • RW-GPR Model: our novel RW-GPR model, equipped with a practical error correction method, effectively implements the proposed policy.
     • Experimental Results: experiments demonstrate that RW-GPB significantly reduces computation time while maintaining cumulative reward in simulations.
  24. Future Work
     • Meta-Recommender System: we aim to implement and evaluate a meta-recommender system that autonomously optimizes the selection of recommendation algorithms using the proposed policy.
     • Client-Side Agents: future research will explore applying this lightweight policy to client-side agents for solving complex autonomous tasks on resource-constrained devices.
     • Real-World Effectiveness: we expect the proposed policy to enhance the effectiveness of autonomous systems across various real-world scenarios.