Introduction to Multi-Armed Bandits and Reinforcement Learning

Lilian Besson
September 23, 2019

- Speakers: Christophe Moy and Lilian Besson
- Title of the talk: Reinforcement learning for on-line dynamic spectrum access: theory and experimental validation

- Abstract:

This tutorial covers both the theoretical and the implementation aspects of on-line machine learning for dynamic spectrum access, as a way to address the spectrum scarcity issue. We target efficient, ready-to-use solutions that work in real radio operating conditions at an affordable hardware cost, even in embedded devices.

We focus on two wireless applications in this presentation: Opportunistic Spectrum Access (OSA) and Internet of Things (IoT) networks. OSA was the first scenario to be targeted, in the early 2010s, and it remains a futuristic scenario that has not been regulated yet. The Internet of Things has attracted interest more recently, and it has also turned out to be a candidate for applying learning solutions from the Reinforcement Learning family right now.

First part (Lilian BESSON): Introduction to Multi-Armed Bandits and Reinforcement Learning

The first part of the tutorial introduces the general framework of machine learning, with a focus on reinforcement learning. We explain the multi-armed bandit (MAB) model and give an overview of successful applications of MAB since the 1950s.

We first focus on the simplest model: a single player interacting with a stationary and stochastic (i.i.d.) bandit game with a finite number of resources (or arms). For this model, we explain the most famous algorithms, which are based either on a frequentist point of view, with Upper-Confidence Bound (UCB) index policies (UCB1 and kl-UCB), or on a Bayesian point of view, with Thompson Sampling. We also give details on the theoretical analysis of this model. We introduce the notion of regret, which measures the performance of a MAB algorithm, and we present famous results from the literature covering both what no algorithm can achieve (i.e., lower bounds on the regret of any algorithm) and what a good algorithm can indeed achieve (i.e., upper bounds on the regret of some efficient algorithms).
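To make the two families of algorithms concrete, here is a minimal sketch of UCB1 and Thompson Sampling on a stationary Bernoulli bandit, using only the Python standard library. It is an illustration under stated assumptions, not the code of the talk or the SMPyBandits API: the function names (`ucb1_choose`, `thompson_choose`, `run_bandit`), the arm means and the horizon are all hypothetical choices. The UCB1 index is the empirical mean plus the exploration bonus sqrt(2 ln t / n), and Thompson Sampling assumes a uniform Beta(1, 1) prior on each arm's mean. The returned regret is the pseudo-regret, i.e., the sum over rounds of the gap between the best mean and the mean of the arm that was played.

```python
import math
import random


def ucb1_choose(counts, sums, t):
    """Return the arm with the largest UCB1 index at round t (1-based)."""
    # Play every arm once before the index is well defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    indexes = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=indexes.__getitem__)


def thompson_choose(counts, sums, rng):
    """Sample each arm's mean from its Beta posterior and pick the largest sample."""
    # Assumption: uniform Beta(1, 1) prior, Bernoulli rewards in {0, 1}.
    samples = [
        rng.betavariate(1.0 + sums[a], 1.0 + counts[a] - sums[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=samples.__getitem__)


def run_bandit(policy, means, horizon, seed=0):
    """Simulate one Bernoulli bandit game; return (pseudo-regret, pull counts)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # number of pulls of each arm
    sums = [0.0] * k    # sum of rewards collected from each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if policy == "ucb1":
            arm = ucb1_choose(counts, sums, t)
        elif policy == "thompson":
            arm = thompson_choose(counts, sums, rng)
        else:
            raise ValueError("unknown policy: %r" % policy)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # gap of the arm played at round t
    return regret, counts


if __name__ == "__main__":
    means = [0.1, 0.5, 0.9]   # hypothetical arm means
    for policy in ("ucb1", "thompson"):
        regret, counts = run_bandit(policy, means, horizon=2000)
        print(policy, "regret:", round(regret, 1), "pulls:", counts)
```

On a run like this, both policies should concentrate most of their pulls on the arm with the largest mean, so their regret stays well below that of uniformly random play, which grows linearly with the horizon. Exact numbers depend on the seed.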

We also introduce some generalizations of this first MAB model: non-stationary stochastic environments, Markov models (either rested or restless), and multi-player models. Each variant is illustrated with numerical experiments that showcase the best-known and most efficient algorithms, using our state-of-the-art open-source library for numerical simulations of MAB problems, SMPyBandits (see


