

IEEE WCNC: « Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access »

Abstract: Multi-armed bandit algorithms have recently been studied and evaluated for Cognitive Radio (CR), especially in the context of Opportunistic Spectrum Access (OSA). Several solutions have been explored based on various models, but it is hard to predict exactly which one will be the best for real-world conditions at every instant. Hence, expert aggregation algorithms can be useful to select, on the run, the best algorithm for a specific situation. Aggregation algorithms, such as Exp4 dating back to 2002, have never been used for OSA learning, and we show that Exp4 appears empirically sub-efficient when applied to simple stochastic problems. In this article, we present an improved variant, called Aggregator. For synthetic OSA problems modeled as Multi-Armed Bandit (MAB) problems, simulation results are presented to demonstrate its empirical efficiency. We combine classical algorithms, such as Thompson sampling, Upper-Confidence Bounds algorithms (UCB and variants), and Bayesian or Kullback-Leibler UCB. Our algorithm offers good performance compared to state-of-the-art algorithms (Exp4, CORRAL or LearnExp), and appears as a robust approach to select, on the run, the best algorithm for any stochastic MAB problem, while being more realistic with respect to real-world radio settings than any tuning-based approach.

See: https://hal.inria.fr/hal-01705292

PDF: https://perso.crans.org/besson/publis/slides/2018_04__Presentation_IEEE_WCNC/slides.pdf

Lilian Besson

April 16, 2018

Transcript

  1. Aggregation of MAB Learning Algorithms for OSA. Lilian Besson, PhD student, advised by Christophe Moy and Émilie Kaufmann. Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille. IEEE WCNC, 16th April 2018.
  2. 0. Introduction and motivation, 0.2. Objective. Cognitive Radio (CR) is known as one of the possible solutions to tackle the spectrum scarcity issue. Opportunistic Spectrum Access (OSA) is a good model for CR problems in licensed bands. Online learning strategies, mainly multi-armed bandit (MAB) algorithms, were recently proved to be efficient [Jouini 2010]. But there are many different MAB algorithms… which one should you choose in practice? =⇒ We propose to use an online learning algorithm to also decide which algorithm to use, to be more robust and adaptive to unknown environments.
  3. 0. Introduction and motivation, 0.3. Outline. 1. Opportunistic Spectrum Access; 2. Multi-Armed Bandits; 3. MAB algorithms; 4. Aggregation of MAB algorithms; 5. Illustration. Please ask questions at the end if you want! See our paper: HAL.Inria.fr/hal-01705292
  4. 1. Opportunistic Spectrum Access, 1.1. OSA. Spectrum scarcity is a well-known problem, with a range of different solutions… Cognitive Radio is one of them, and Opportunistic Spectrum Access is a kind of cognitive radio.
  5. 1. Opportunistic Spectrum Access, 1.2. Model. Communication & interaction model. [Diagram: channel selection and access policy, looping between spectrum sensing, the RF environment, channel selection, channel access, and the reward (free/busy) observation.] Primary users occupy K radio channels. Secondary users can sense and exploit free channels: they want to explore the channels, and learn to exploit the best one. Discrete time for everything: t ≥ 1, t ∈ N.
  6. 2. Multi-Armed Bandits: Model. Again K ≥ 2 resources (e.g., channels), called arms. Each time slot t = 1, …, T, you must choose one arm, denoted A(t) ∈ {1, …, K}. You receive some reward r(t) ∼ ν_k when playing k = A(t). Goal: maximize your sum reward ∑_{t=1}^{T} r(t). Hypothesis: rewards are stochastic, of mean µ_k (e.g., Bernoulli). Why is it famous? It is a simple but good model for the exploration/exploitation dilemma.
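To make this model concrete, here is a minimal sketch in Python (the three arm means and the horizon are arbitrary illustrative values, and the uniformly random policy is only a placeholder for the learning algorithms of the next slides):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = [0.1, 0.5, 0.9]      # unknown Bernoulli means, one per arm (illustrative values)
K, T = len(mu), 10000     # K arms, horizon T

total_reward = 0.0
for t in range(1, T + 1):
    A_t = rng.integers(K)                    # placeholder policy: uniformly random arm
    r_t = float(rng.random() < mu[A_t])      # Bernoulli reward of mean mu[A_t]
    total_reward += r_t                      # goal: maximize this sum over t = 1..T

print(total_reward, "vs. best achievable on average:", T * max(mu))
```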
  7. 3. MAB algorithms. Main idea: an index I_k(t) to approximate the quality of arm k. First example: the UCB algorithm. Second example: Thompson sampling.
  8. 3. MAB algorithms, 3.1. Index-based algorithms. Multi-Armed Bandit algorithms are often index based: keep an index I_k(t) ∈ R for each arm k = 1, …, K, and always play A(t) = arg max_k I_k(t). I_k(t) should represent the belief in the quality of arm k at time t. Example: “Follow the Leader”. X_k(t) := ∑_{s<t} r(s) 1(A(s) = k) is the sum of rewards from arm k, N_k(t) := ∑_{s<t} 1(A(s) = k) is the number of samples of arm k, and use I_k(t) = µ̂_k(t) := X_k(t) / N_k(t).
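As a rough sketch of this bookkeeping (the array names X, N follow the slide; the rest is mine), the “Follow the Leader” index can be maintained incrementally like this:

```python
import numpy as np

K = 9                 # number of arms
X = np.zeros(K)       # X[k] = X_k(t), sum of rewards obtained from arm k
N = np.zeros(K)       # N[k] = N_k(t), number of times arm k was played

def index_follow_the_leader(k):
    """Empirical mean of arm k; +inf forces each arm to be tried once."""
    return X[k] / N[k] if N[k] > 0 else float("inf")

def update(k, r):
    """After playing arm k = A(t) and observing reward r = r(t)."""
    X[k] += r
    N[k] += 1

# At each time t: play A_t = argmax_k index_follow_the_leader(k), then update(A_t, r_t).
```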
  9. 3. MAB algorithms, 3.2. UCB algorithm. Upper Confidence Bounds algorithm (UCB): instead of using I_k(t) = X_k(t) / N_k(t), add an exploration term: I_k(t) = X_k(t) / N_k(t) + √(α log(t) / (2 N_k(t))). Parameter α: tradeoff between exploration and exploitation. Small α: focus more on exploitation; large α: focus more on exploration. Problem: how to choose “the good α” for a certain problem?
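A minimal sketch of this UCB index (reusing the X, N bookkeeping from the previous sketch; α = 1 is the value used in the experiments shown later):

```python
import numpy as np

def index_UCB(k, t, X, N, alpha=1.0):
    """UCB index of arm k at time t: empirical mean plus exploration bonus."""
    if N[k] == 0:
        return float("inf")        # force one initial pull of each arm
    return X[k] / N[k] + np.sqrt(alpha * np.log(t) / (2 * N[k]))
```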
  10. 3. MAB algorithms, 3.3. Thompson sampling algorithm. Thompson sampling (TS): choose an initial belief on µ_k (uniform) and a prior p_t (e.g., a Beta prior on [0, 1]). At each time, update the prior p_{t+1} from p_t using Bayes' theorem, and use I_k(t) ∼ p_t as a random index. Example with a Beta prior, for binary rewards: p_t = Beta(1 + nb of successes, 1 + nb of failures). Mean of p_t = (1 + X_k(t)) / (2 + N_k(t)) ≃ µ̂_k(t). Problem: how to choose “the good prior” for a certain problem?
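A minimal sketch of Thompson sampling with this Beta prior for binary rewards (the array names a, b are mine):

```python
import numpy as np

rng = np.random.default_rng()
K = 9
a = np.ones(K)        # 1 + number of successes of each arm
b = np.ones(K)        # 1 + number of failures of each arm

def choose_arm():
    """Draw one random index I_k(t) ~ Beta(a[k], b[k]) per arm, play the argmax."""
    return int(np.argmax(rng.beta(a, b)))

def update(k, r):
    """Bayes update of the Beta posterior of arm k after a binary reward r in {0, 1}."""
    a[k] += r
    b[k] += 1 - r
```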
  11. 4. Aggregation of MAB algorithms. Problem: how to choose which algorithm to use? But also… why commit to only one algorithm? Solutions: offline benchmarks? Or online selection from a pool of algorithms → aggregation? Not a new idea, studied since the 90s in the ML community: also use online learning to select the best algorithm!
  12. 4. Aggregation of MAB algorithms, 4.1. Basic idea for online aggregation. If you have N different algorithms A_1, …, A_N: at time t = 0, start with a uniform distribution π^0 on {1, …, N} (to represent the trust in each algorithm). At time t, choose a_t ∼ π^t, then play with A_{a_t}. Compute the next distribution π^{t+1} from π^t: increase π^{t+1}_{a_t} if choosing A_{a_t} gave a good reward, or decrease it otherwise. Problems: 1. How to increase π^{t+1}_{a_t}? 2. What information should we give to which algorithms?
  13. 4. Aggregation of MAB algorithms, 4.2. The Exp4 algorithm. Overview of the Exp4 aggregation algorithm, for rewards r(t) ∈ [−1, 1]. Use π^t to randomly choose the algorithm to trust, a_t ∼ π^t. Play its decision, A_aggr(t) = A_{a_t}(t), and receive the reward r(t). Give feedback of the observed reward r(t) only to this one algorithm. Increase or decrease π^t_{a_t} using an exponential weight: π^{t+1}_{a_t} := π^t_{a_t} × exp(η_t × r(t) / π^t_{a_t}). Renormalize π^{t+1} to keep a distribution on {1, …, N}. Use a sequence of decreasing learning rates η_t = √(log(N) / (t × K)) (cooling scheme, η_t → 0 as t → ∞).
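A minimal sketch of one step of this Exp4-style update (the choose()/give_reward() interface of the underlying algorithms and the exact learning-rate schedule are my reading of the slide, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng()

def exp4_step(pi, algorithms, play_arm, t, K):
    """One step of Exp4-style aggregation of N bandit algorithms."""
    N = len(pi)
    a_t = rng.choice(N, p=pi)                 # trust one algorithm at random, a_t ~ pi^t
    arm = algorithms[a_t].choose()            # play its decision A_{a_t}(t)
    r = play_arm(arm)                         # observed reward, assumed in [-1, 1]
    algorithms[a_t].give_reward(arm, r)       # feedback only to the trusted algorithm
    eta = np.sqrt(np.log(N) / (t * K))        # decreasing learning rate (cooling scheme)
    new_pi = pi.copy()
    new_pi[a_t] *= np.exp(eta * r / pi[a_t])  # exponential weight on r(t) / pi^t_{a_t}
    return new_pi / new_pi.sum()              # renormalize to keep a distribution
```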
  14. 4. Aggregation of MAB algorithms: unbiased estimates? Use an unbiased estimate of the rewards: using directly r(t) to update the trust probability yields a biased estimator, so we use instead r̂(t) = r(t) / π^t_a if we trusted algorithm A_a. This way, E[r̂(t)] = ∑_{a=1}^{N} P(a_t = a) E[r(t) / π^t_a] = E[r(t)] ∑_{a=1}^{N} P(a_t = a) / π^t_a = E[r(t)].
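A quick numerical sanity check of this importance-weighting trick (arbitrary illustrative numbers; the estimate credited to algorithm a is r(t)/π^t_a when a is the trusted one and 0 otherwise, so its average matches the observed reward no matter how rarely a is trusted):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.3, 0.5])     # arbitrary trust distribution over N = 3 algorithms
r = 0.7                            # reward observed at time t (kept fixed for the check)

trusted = rng.choice(3, p=pi, size=200_000)              # which algorithm is trusted, repeated
for a in range(3):
    estimates = np.where(trusted == a, r / pi[a], 0.0)   # r/pi[a] if trusted, else 0
    print(a, estimates.mean())     # each prints ~ 0.7 = r, whatever pi[a] is
```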
  15. 4. Aggregation of MAB algorithms, 4.3. Our Aggregator algorithm. It improves on Exp4 with the following ideas: First, let each algorithm vote for its decision A^t_1, …, A^t_N. Choose the arm A_aggr(t) ∼ p^{t+1}, with p^{t+1}_j := ∑_{a=1}^{N} π^t_a 1(A^t_a = j). Update the trust of each trusting algorithm, not only one (i.e., every a such that A^t_a = A_aggr(t)) → faster convergence. Give feedback of the reward r(t) to every algorithm (and not only the one trusted at time t) → each algorithm has more data to learn from.
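A minimal sketch of one step combining these three ideas (the object interface and the exact form of the unbiased estimate here are my assumptions, not the paper's exact pseudocode):

```python
import numpy as np

rng = np.random.default_rng()

def aggregator_step(pi, algorithms, play_arm, eta, K):
    """One step of the Aggregator idea: all algorithms vote, all receive feedback."""
    N = len(pi)
    votes = np.array([alg.choose() for alg in algorithms])    # 1) each algorithm votes
    p = np.array([pi[votes == j].sum() for j in range(K)])    #    p_j = sum of trusts voting j
    arm = rng.choice(K, p=p)                                   #    A_aggr(t) ~ p
    r = play_arm(arm)
    new_pi = pi.copy()
    for a in range(N):
        algorithms[a].give_reward(arm, r)                      # 3) feedback to every algorithm
        if votes[a] == arm:                                    # 2) boost every trusting algorithm
            new_pi[a] *= np.exp(eta * r / p[arm])
    return new_pi / new_pi.sum()
```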
  16. 5. Some illustrations. Artificial simulations of stochastic bandit problems: Bernoulli bandits, but not only. A pool of different algorithms (UCB, Thompson sampling, etc.) is aggregated, and compared with other state-of-the-art algorithms for expert aggregation (Exp4, CORRAL, LearnExp). What is plotted is the regret for a problem of means µ_1, …, µ_K: R^µ_T(A) = max_k (T µ_k) − ∑_{t=1}^{T} E[r(t)]. The regret is known to be lower-bounded by C(µ) log(T), and upper-bounded by C′(µ) log(T) for efficient algorithms.
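As a sketch, the regret values plotted on the next slides can be estimated from repeated simulations like this (the function and the array-shape convention are hypothetical, only to illustrate the formula):

```python
import numpy as np

def estimated_regret(mu, reward_runs):
    """Monte-Carlo estimate of R_T = max_k(T * mu_k) - sum_{t=1}^T E[r(t)].

    mu: list of the K arm means; reward_runs: array of shape (n_runs, T),
    the rewards collected by the algorithm over repeated independent runs.
    """
    reward_runs = np.asarray(reward_runs)
    T = reward_runs.shape[1]
    return T * max(mu) - reward_runs.sum(axis=1).mean()
```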
  17. 5. Some illustrations, 5.1. On a simple Bernoulli problem. [Plot: cumulated regret vs. time steps t = 1..T, horizon T = 20000, averaged 1000 times, for 9 arms B(0.1), B(0.2), …, B(0.9)*. Algorithms: Aggregator(N=6), Exp4(N=6), CORRAL(N=6, broadcast to all), LearnExp(N=6, η=0.9), UCB(α=1), Thompson, KL-UCB(Bern), KL-UCB(Exp), KL-UCB(Gauss), BayesUCB. Lai & Robbins lower bound = 7.52 log(T).]
  18. 5. Some illustrations, 5.2. On a “hard” Bernoulli problem. [Plot: cumulated regret vs. time steps t = 1..T, horizon T = 20000, averaged 1000 times, for 9 arms B(0.01), B(0.02), B(0.3), B(0.4), B(0.5), B(0.6), B(0.795), B(0.8), B(0.805)*. Same algorithms as before. Lai & Robbins lower bound = 101 log(T).]
  19. 5. Some illustrations, 5.3. On a mixed problem. [Plot: cumulated regret vs. time steps t = 1..T, horizon T = 20000, averaged 1000 times, for 9 arms mixing Bernoulli, Gaussian and Exponential distributions: B(0.1), G(0.1, 0.05), Exp(10, 1), B(0.5), G(0.5, 0.05), Exp(1.59, 1), B(0.9)*, G(0.9, 0.05)*, Exp(0.215, 1)*. Same algorithms as before. Lai & Robbins lower bound = 7.39e+07 log(T).]
  20. 6. Conclusion, 6.1. Summary. Conclusion (1/2): Online learning can be a powerful tool for Cognitive Radio, and for many other real-world applications. Many formulations exist; a simple one is the Multi-Armed Bandit. Many algorithms exist, tackling different situations, and it is hard to know beforehand which algorithm is efficient for a certain problem… Online learning can also be used to select, on the run, which algorithm to prefer for a specific situation!
  21. 6. Conclusion, 6.2. Summary & Thanks. Conclusion (2/2): Our algorithm Aggregator is efficient and easy to implement. For N algorithms A_1, …, A_N, it costs O(N) memory and O(N) extra computation time at each time step. For stochastic bandit problems, it empirically outperforms the other state-of-the-art aggregation algorithms (Exp4, CORRAL, LearnExp). See our paper: HAL.Inria.fr/hal-01705292. See our code for experimenting with bandit algorithms: a Python library, open source at SMPyBandits.GitHub.io. Thanks for listening!