Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings

Lilian Besson
September 20, 2017

Abstract: Setting up the future Internet of Things (IoT) networks will require supporting more and more communicating devices. We prove that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, UCB1 and Thompson Sampling, to handle the decentralized decision-making of Spectrum Access, applied to IoT networks, as well as learning performance with a growing number of intelligent end-devices. We show that using learning algorithms does help to fit more devices in such networks, even when all end-devices are intelligent and dynamically change channel. In the studied scenario, stochastic MAB learning provides up to a 16% gain in terms of successful transmission probability, and has near-optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.

See: https://hal.inria.fr/hal-01575419
Format: 16:9 (wide screen)

PDF: https://perso.crans.org/besson/publis/slides/2017_09__Presentation_article_CrownCom_Conference/slides_169.pdf


Transcript

  1. MAB Learning in IoT Networks Learning helps even in non-stationary

    settings! Lilian Besson Rémi Bonnefoi Émilie Kaufmann Christophe Moy Jacques Palicot PhD Student in France Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille 20-21 Sept - CROWNCOM 2017
  2. 1. Introduction and motivation 1.a. Objective We want A lot

    of IoT devices want to access a gateway or base station. Insert them into a crowded wireless network. With a protocol slotted in time and frequency. Each device has a low duty cycle (a few messages per day). Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 2 / 18
  3. 1. Introduction and motivation 1.a. Objective We want A lot

    of IoT devices want to access a gateway or base station. Insert them into a crowded wireless network. With a protocol slotted in time and frequency. Each device has a low duty cycle (a few messages per day). Goal Maintain a good Quality of Service. Without centralized supervision! Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 2 / 18
  4. 1. Introduction and motivation 1.a. Objective We want A lot

    of IoT devices want to access a gateway or base station. Insert them into a crowded wireless network. With a protocol slotted in time and frequency. Each device has a low duty cycle (a few messages per day). Goal Maintain a good Quality of Service. Without centralized supervision! How? Use learning algorithms: devices will learn on which frequency they should talk! Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 2 / 18
  5. 1. Introduction and motivation 1.b. Outline Outline 1 Introduction and

    motivation 2 Model and hypotheses 3 Baseline algorithms: naive and efficient centralized approaches to compare against 4 Multi-Armed Bandit algorithms: UCB 5 Experimental results 6 Perspectives and future work 7 Conclusion Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 3 / 18
  6. 2. Model and hypotheses 2.a. Model Model Discrete time t

    ≥ 1 and N_c radio channels (e.g., 10) (known). [Figure 1: protocol slotted in time and frequency, with an Acknowledgement.] D dynamic devices try to access the network independently. S = S_1 + · · · + S_{N_c} static devices occupy the network: S_1, . . . , S_{N_c} in each channel (unknown). Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 4 / 18
  7. 2. Model and hypotheses 2.b. Hypotheses Hypotheses I Emission model

    Each device has the same low emission probability: each time step, each device sends a packet with probability p (this gives a duty cycle proportional to p). Background traffic Each static device uses only one channel. Their distribution across channels is fixed in time. =⇒ Background traffic, bothering the dynamic devices! Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 5 / 18
  8. 2. Model and hypotheses 2.b. Hypotheses Hypotheses II Dynamic radio

    reconfiguration Each dynamic device decides which channel it uses to send every packet. It has the memory and computational capacity to implement a basic decision algorithm. Problem Goal: maximize the number of received Acks, i.e., minimize the packet loss ratio, in a finite-space discrete-time Decision Making Problem. Solution? Multi-Armed Bandit algorithms, decentralized and used independently by each device. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 6 / 18
  9. 3. Baseline algorithms 3.a. A naive strategy : uniformly random

    access A naive strategy: uniformly random access Uniformly random access: dynamic devices choose their channel uniformly at random in the pool of N_c channels. Natural strategy, dead simple to implement. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 7 / 18
  10. 3. Baseline algorithms 3.a. A naive strategy : uniformly random

    access A naive strategy: uniformly random access Uniformly random access: dynamic devices choose their channel uniformly at random in the pool of N_c channels. Natural strategy, dead simple to implement. Simple analysis, in terms of successful transmission probability (for every message from dynamic devices): $P(\text{success} \mid \text{sent}) = \sum_{i=1}^{N_c} \underbrace{(1 - p/N_c)^{D-1}}_{\text{no other dynamic device}} \times \underbrace{(1 - p)^{S_i}}_{\text{no static device}} \times \frac{1}{N_c}$. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 7 / 18
  11. 3. Baseline algorithms 3.a. A naive strategy : uniformly random

    access A naive strategy: uniformly random access Uniformly random access: dynamic devices choose their channel uniformly at random in the pool of N_c channels. Natural strategy, dead simple to implement. Simple analysis, in terms of successful transmission probability (for every message from dynamic devices): $P(\text{success} \mid \text{sent}) = \sum_{i=1}^{N_c} \underbrace{(1 - p/N_c)^{D-1}}_{\text{no other dynamic device}} \times \underbrace{(1 - p)^{S_i}}_{\text{no static device}} \times \frac{1}{N_c}$. Works fine only if all channels are similarly occupied, but it cannot learn to exploit the best (least occupied) channels. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 7 / 18
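As a sanity check, here is a minimal Python sketch (not from the paper; function and variable names are ours) that evaluates this closed-form success probability of the uniformly random access strategy:

```python
import numpy as np

def p_success_random(p, D, S, Nc):
    """Success probability of uniformly random channel access.

    p  : per-slot emission probability of each device
    D  : number of dynamic devices
    S  : list of static-device counts per channel, length Nc
    Nc : number of channels
    """
    S = np.asarray(S)
    # Sum over channels of: P(pick channel i) * P(no other dynamic device
    # collides) * P(no static device of channel i collides)
    return np.sum((1 - p / Nc) ** (D - 1) * (1 - p) ** S / Nc)

# Illustrative numbers only (same orders of magnitude as the experiments):
print(p_success_random(p=1e-3, D=1000, S=[900] * 10, Nc=10))
```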
  12. 3. Baseline algorithms 3.b. Optimal centralized strategy Optimal centralized strategy

    I If an oracle can assign D_i dynamic devices to channel i, the successful transmission probability is: $P(\text{success} \mid \text{sent}) = \sum_{i=1}^{N_c} \underbrace{(1 - p)^{D_i - 1}}_{D_i - 1 \text{ others}} \times \underbrace{(1 - p)^{S_i}}_{\text{no static device}} \times \underbrace{D_i / D}_{\text{sent in channel } i}$. The oracle has to solve this optimization problem: $\arg\max_{D_1, \dots, D_{N_c}} \sum_{i=1}^{N_c} D_i (1 - p)^{S_i + D_i - 1}$ such that $\sum_{i=1}^{N_c} D_i = D$ and $D_i \geq 0$ for all $1 \leq i \leq N_c$. We solved this quasi-convex optimization problem with Lagrange multipliers, only numerically. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 8 / 18
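The slide solves this numerically with Lagrange multipliers; as an illustrative alternative (our sketch, not the authors' code), a greedy marginal allocation also solves the integer version here, since each channel's objective D_i (1 − p)^{S_i + D_i − 1} has decreasing increments in D_i:

```python
import heapq

def oracle_allocation(p, D, S):
    """Greedily assign D dynamic devices to channels to maximize
    sum_i D_i * (1-p)**(S[i] + D_i - 1); the marginal gains are
    decreasing, so greedy is optimal for the integer problem."""
    q = 1.0 - p
    Nc = len(S)
    alloc = [0] * Nc

    def gain(i, d):
        # Marginal gain of adding the (d+1)-th dynamic device to channel i
        return (d + 1) * q ** (S[i] + d) - d * q ** (S[i] + d - 1)

    # Max-heap of marginal gains (negated, since heapq is a min-heap)
    heap = [(-gain(i, 0), i) for i in range(Nc)]
    heapq.heapify(heap)
    for _ in range(D):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-gain(i, alloc[i]), i))
    return alloc

# Illustrative static-device repartition (made-up numbers):
print(oracle_allocation(p=1e-3, D=1000,
                        S=[100, 300, 500, 900, 1200, 450, 750, 600, 800, 1100]))
```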
  13. 3. Baseline algorithms 3.b. Optimal centralized strategy Optimal centralized strategy

    II =⇒ Very good performance, maximizing the transmission rate of all the D dynamic devices. But unrealistic: not achievable in practice, as there is no centralized oracle! Let's see realistic decentralized approaches: ↪ Machine Learning? ↪ Reinforcement Learning? ↪ Multi-Armed Bandits! Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 9 / 18
  14. 4. Multi-Armed Bandit algorithm : UCB 4.1. Multi-Armed Bandit formulation

    Multi-Armed Bandit formulation A dynamic device tries to collect rewards when transmitting: it transmits following a Bernoulli process (probability p of transmitting at each time step τ), chooses a channel A(τ) ∈ {1, . . . , N_c}, if Ack (no collision) =⇒ reward r_{A(τ)} = 1, if collision (no Ack) =⇒ reward r_{A(τ)} = 0. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 10 / 18
  15. 4. Multi-Armed Bandit algorithm : UCB 4.1. Multi-Armed Bandit formulation

    Multi-Armed Bandit formulation A dynamic device tries to collect rewards when transmitting: it transmits following a Bernoulli process (probability p of transmitting at each time step τ), chooses a channel A(τ) ∈ {1, . . . , N_c}, if Ack (no collision) =⇒ reward r_{A(τ)} = 1, if collision (no Ack) =⇒ reward r_{A(τ)} = 0. Reinforcement Learning interpretation Maximize transmission rate ≡ maximize cumulated rewards: $\max_{\text{algorithm}} \sum_{\tau=1}^{\text{horizon}} r_{A(\tau)}$. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 10 / 18
  16. 4. Multi-Armed Bandit algorithm : UCB 4.2. Upper Confidence Bound

    algorithm: UCB Upper Confidence Bound algorithm (UCB1) A dynamic device keeps track of τ, the number of sent packets; N_k(τ), the number of selections of channel k; and X_k(τ), the number of successful transmissions in channel k. 1 For the first N_c steps (τ = 1, . . . , N_c), try each channel once. 2 Then for the next steps τ > N_c: Compute the index $g_k(\tau) := \underbrace{\frac{X_k(\tau)}{N_k(\tau)}}_{\text{mean } \hat{\mu}_k(\tau)} + \underbrace{\sqrt{\frac{\log(\tau)}{2 N_k(\tau)}}}_{\text{upper confidence bound}}$, Choose channel A(τ) = arg max_k g_k(τ), Update N_k(τ + 1) and X_k(τ + 1). References: [Lai & Robbins, 1985], [Auer et al, 2002], [Bubeck & Cesa-Bianchi, 2012] Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 11 / 18
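A minimal Python sketch of this UCB1 index policy for a single dynamic device (our illustrative implementation of the standard algorithm, with the α = 1/2 exploration constant from the slide; class and method names are ours):

```python
import math

class UCB1:
    """UCB1 index policy (alpha = 1/2) for one dynamic device."""

    def __init__(self, n_channels):
        self.n_channels = n_channels
        self.t = 0                           # tau: number of sent packets
        self.selections = [0] * n_channels   # N_k(tau)
        self.successes = [0] * n_channels    # X_k(tau)

    def choose(self):
        self.t += 1
        # First try each channel once
        for k in range(self.n_channels):
            if self.selections[k] == 0:
                return k
        # Then pick the channel maximizing the index g_k(tau)
        def index(k):
            mean = self.successes[k] / self.selections[k]
            bonus = math.sqrt(math.log(self.t) / (2 * self.selections[k]))
            return mean + bonus
        return max(range(self.n_channels), key=index)

    def update(self, channel, ack):
        # ack = True if the transmission got an Ack (no collision)
        self.selections[channel] += 1
        self.successes[channel] += int(ack)
```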
  17. 5. Experimental results 5.1. Experiment setting Experimental setting Simulation parameters

    N_c = 10 channels, S + D = 10000 devices in total, p = 10^{-3} probability of emission, horizon = 10^5 time slots (≃ 100 messages / device), The proportion of dynamic devices D/(S + D) varies, Various settings for the static devices repartition (S_1, . . . , S_{N_c}). What we show After a short learning time, MAB algorithms are almost as efficient as the oracle solution. Never worse than the naive solution. Thompson sampling is even more efficient than UCB. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 12 / 18
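To make the setting concrete, here is a toy simulation sketch of the slotted model (our simplification, not the paper's simulator; it reuses the UCB1 class sketched above and models static traffic as independent Bernoulli(p) emissions per device):

```python
import random

def simulate(n_slots, Nc, S, dynamic_devices, p=1e-3):
    """Toy slotted simulation: a dynamic device's packet succeeds
    iff no other device (static or dynamic) used its channel that slot."""
    sent = ok = 0
    for _ in range(n_slots):
        # Dynamic devices emitting this slot pick a channel via their policy
        emitting = [(dev, dev.choose()) for dev in dynamic_devices
                    if random.random() < p]
        load = [0] * Nc                      # transmitters per channel
        for _, ch in emitting:
            load[ch] += 1
        # Static background traffic: each of the S[i] devices emits w.p. p
        for i in range(Nc):
            load[i] += sum(random.random() < p for _ in range(S[i]))
        for dev, ch in emitting:
            ack = (load[ch] == 1)            # alone on the channel => Ack
            dev.update(ch, ack)
            sent += 1
            ok += ack
    return ok / max(sent, 1)

# Small illustrative run (far smaller than the paper's experiments):
devices = [UCB1(10) for _ in range(100)]
print(simulate(n_slots=10_000, Nc=10, S=[90] * 10, dynamic_devices=devices))
```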
  18. 5. Experimental results 5.2. First result: 10% 10% of dynamic

    devices [Figure 2: successful transmission rate vs. number of slots (×10^5), for UCB, Thompson sampling, Optimal, Good sub-optimal, and Random.] 10% of dynamic devices: 7% of gain. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 13 / 18
  19. 5. Experimental results 5.2. First result: 30% 30% of dynamic

    devices [Figure 3: successful transmission rate vs. number of slots (×10^5), same five strategies.] 30% of dynamic devices: 3% of gain, but not much more is possible. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 14 / 18
  20. 5. Experimental results 5.3. Growing proportion of devices dynamic devices

    Dependence on D/(S + D) [Figure 4: gain compared to random channel selection vs. proportion of dynamic devices, for the optimal strategy, UCB1 (α = 0.5), and Thompson sampling.] Almost optimal, for any proportion of dynamic devices, after a short learning time. Up to 16% gain over the naive approach! Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 15 / 18
  21. 6. Perspectives and future work 6.1. Perspectives Perspectives Theoretical results

    MAB algorithms have performance guarantees for stochastic settings, But here the collisions break the i.i.d. hypothesis, Not easy to obtain guarantees in this mixed setting (i.i.d. emission process, game-theoretic collisions). Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 16 / 18
  22. 6. Perspectives and future work 6.1. Perspectives Perspectives Theoretical results

    MAB algorithms have performance guarantees for stochastic settings, But here the collisions break the i.i.d. hypothesis, Not easy to obtain guarantees in this mixed setting (i.i.d. emission process, game-theoretic collisions). Real-world experimental validation? Real-world radio experiments will help to validate this. In progress... Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 16 / 18
  23. 6. Perspectives and future work 6.2. Future work Other direction

    of future work More realistic emission model: maybe driven by number of packets in a whole day, instead of emission probability. Validate this on a larger experimental scale. Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 17 / 18
  24. 7. Conclusion Thanks! Conclusion We showed numerically... After a learning

    period, MAB algorithms are as efficient as we could expect. Never worse than the naive solution. Thompson sampling is even more efficient than UCB. Simple algorithms are up to 16% more efficient than the naive approach, and straightforward to apply. But more work is still needed... Theoretical guarantees are still missing. Maybe study other emission models. And also implement this on real-world radio devices. Thanks! Questions? Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 18 / 18
  25. Appendix A.1. Thompson Sampling : Bayesian index policy Thompson Sampling

    : Bayesian approach A dynamic device assumes a stochastic hypothesis on the background traffic, modeled as Bernoulli distributions. Rewards r_k(τ) are assumed to be i.i.d. samples from a Bernoulli distribution Bern(µ_k). A Beta Bayesian posterior is kept on the mean availability µ_k: Beta(1 + X_k(τ), 1 + N_k(τ) − X_k(τ)). Starts with a uniform prior: Beta(1, 1) ∼ U([0, 1]). 1 Each step τ ≥ 1, a sample is drawn from each posterior: i_k(τ) ∼ Beta(a_k(τ), b_k(τ)), 2 Choose channel A(τ) = arg max_k i_k(τ), 3 Update the posterior after receiving an Ack or a collision. References: [Thompson, 1933], [Kaufmann et al, 2012] Lilian Besson (CentraleSupélec & Inria) MAB Learning in IoT Networks CROWNCOM 2017 18 / 18
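A matching Python sketch of this Thompson sampling policy (again our illustrative code, not the authors'; it exposes the same choose/update interface as the UCB1 sketch, so it plugs into the same toy simulation loop above):

```python
import random

class ThompsonSampling:
    """Thompson sampling with a Beta(1 + X_k, 1 + N_k - X_k) posterior
    per channel, starting from the uniform Beta(1, 1) prior."""

    def __init__(self, n_channels):
        self.a = [1] * n_channels   # 1 + successes X_k
        self.b = [1] * n_channels   # 1 + failures  N_k - X_k

    def choose(self):
        # Draw one sample per channel from its posterior, play the argmax
        samples = [random.betavariate(self.a[k], self.b[k])
                   for k in range(len(self.a))]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, channel, ack):
        # Ack => success count grows; collision => failure count grows
        if ack:
            self.a[channel] += 1
        else:
            self.b[channel] += 1
```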