Slide 1

Optimal Best Arm Identification with a Fixed Budget under a Small Gap

Masahiro Kato (The University of Tokyo / CyberAgent, Inc.)
Joint work with Masaaki Imaizumi (The University of Tokyo), Kaito Ariu (CyberAgent, Inc. / KTH), Masahiro Nomura (CyberAgent, Inc.), and Chao Qin (Columbia University)

2022 Asian Meeting of the Econometric Society

Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap" (https://arxiv.org/abs/2201.04469)

Slide 2

Adaptive Experiments for Policy Choice

- Consider a setting with multiple treatment arms, each with potential outcomes.
  • Treatment arm (also called treatment or arm): slots, medicines, online advertisements, etc.
➢ Motivation: conduct an experiment for efficiently choosing the best treatment arm.
  • Best treatment arm = treatment arm with the highest expected outcome.

[Diagram: a decision maker repeatedly chooses a treatment arm a from Treatment arm 1, 2, …, K during the adaptive experiment, observes the reward of the chosen arm A_t, and finally recommends the best treatment arm.]

Slide 3

Adaptive Experiments for Policy Choice

- How can we gain efficiency in choosing the best treatment arm?
  → Optimize the sample allocation to each treatment arm during the experiment.
- To optimize the sample allocation, we need to know some parameters, such as the variances.
- In our experiment, we sequentially estimate the parameters and optimize the sample allocation.
  • This setting is called best arm identification (BAI) in machine learning, and adaptive experiments for policy choice in economics (Kasy and Sautmann (2021)).
  • BAI is an instance of the multi-armed bandit (MAB) problem.
  • In the MAB problem, we regard the treatments as the arms of a slot machine.

Slide 4

Problem Setting

- There are K treatment arms [K] = {1, 2, …, K} and a fixed time horizon T.
- After choosing a treatment arm a ∈ [K], the treatment arm returns an outcome Y^a ∈ ℝ.
➢ Goal: through the following trial, find the best treatment arm a* = argmax_{a∈[K]} E[Y^a].
- In each period t (assume that the distribution of Y^a is invariant across periods):
  1. Choose a treatment arm A_t ∈ [K].
  2. Observe the chosen reward Y_t = Σ_{a∈[K]} 1[A_t = a] Y_t^a.
  • Stop the trial at round t = T.
  • Recommend an estimated best treatment arm â_T ∈ [K].

[Diagram: over rounds t = 1, …, T, Arm 1, Arm 2, …, Arm K carry potential outcomes Y_t^1, Y_t^2, …, Y_t^K; the observed reward is Y_t = Σ_{a∈[K]} 1[A_t = a] Y_t^a.]
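
The interaction protocol above can be summarized in a short simulation sketch. This is not the paper's code; the Gaussian reward model, the uniform-sampling baseline, and all function names are illustrative assumptions.

```python
import numpy as np

def run_fixed_budget_bai(means, stds, T, choose_arm, seed=0):
    """Simulate the fixed-budget protocol: choose A_t, observe Y_t, recommend at t = T."""
    rng = np.random.default_rng(seed)
    K = len(means)
    outcomes = [[] for _ in range(K)]
    for t in range(T):
        a = choose_arm(t, outcomes)                # choose a treatment arm A_t in [K]
        y = rng.normal(means[a], stds[a])          # observe Y_t = Y_t^{A_t}
        outcomes[a].append(y)
    # Recommend an estimated best arm (here: the largest sample mean, as a placeholder rule).
    sample_means = [np.mean(o) if o else -np.inf for o in outcomes]
    return int(np.argmax(sample_means))

# Uniform sampling (round-robin) as a simple baseline strategy.
uniform = lambda t, outcomes: t % len(outcomes)
print(run_fixed_budget_bai(means=[0.05, 0.01], stds=[1.0, 0.45], T=1000, choose_arm=uniform))
```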

Slide 5

Performance Evaluation

- We directly evaluate the performance of the decision-making.
- Our decision-making: recommend an estimated best arm â_T as the best arm a*.
- We evaluate the decision-making by the probability of misidentification, P(â_T ≠ a*).
  • When there is a unique best arm, P(â_T ≠ a*) converges to zero at an exponential rate.
  • We evaluate the decision-making through the convergence rate of P(â_T ≠ a*).
  • A faster convergence rate of P(â_T ≠ a*) implies better decision-making.
- We aim to recommend the best treatment arm with a smaller probability of misidentification.

Slide 6

RS-AIPW Strategy

- Algorithms for BAI are often referred to as BAI strategies.
- Our BAI strategy consists of the following two components:
  1. Random sampling (RS): following a certain probability, we choose treatment arms in t = 1, 2, …, T.
  2. Recommendation based on the augmented inverse probability weighting (AIPW) estimator: at t = T, we recommend the best arm using the AIPW estimator.
➢ We call our strategy the RS-AIPW strategy.

Slide 7

RS-AIPW Strategy

- Our BAI strategy is designed so that the empirical sample allocation ratio (1/T) Σ_{t=1}^T 1[A_t = a] of each arm a ∈ [K] converges to the optimal allocation ratio.
  • The optimal allocation ratio is derived from the lower bound on the theoretical performance.
- For example, when K = 2, the optimal allocation ratio of each arm a ∈ [K] is

    w*(1) = √Var(Y^1) / (√Var(Y^1) + √Var(Y^2))   and   w*(2) = √Var(Y^2) / (√Var(Y^1) + √Var(Y^2)).

  • That is, allocating samples in proportion to the standard deviations is optimal (a small computational sketch follows below).
- When K ≥ 3, we need to solve a linear program to determine the optimal allocation ratio.
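
As a quick illustration (my own helper, not from the slides), the K = 2 allocation can be computed directly from the two variances; the K ≥ 3 case, which the slide handles via an optimization problem, is not covered here.

```python
import numpy as np

def optimal_allocation_two_arms(var1, var2):
    """Optimal K = 2 allocation: proportional to the standard deviations."""
    s1, s2 = np.sqrt(var1), np.sqrt(var2)
    w1 = s1 / (s1 + s2)
    return w1, 1.0 - w1

# Example with the variances used later in the simulation studies (1 and 0.2).
print(optimal_allocation_two_arms(1.0, 0.2))   # roughly (0.69, 0.31)
```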

Slide 8

RS-AIPW Strategy

- For simplicity, consider the case with two treatment arms, K = 2.
- In each period t = 1, 2, …, T:
  1. Estimate the variance of each arm, Var(Y^a), using past observations up to period t − 1.
  2. Choose A_t = 1 with probability ŵ_t(1) = σ̂_t(1) / (σ̂_t(1) + σ̂_t(2)), where σ̂_t(a) is the standard deviation of Y^a estimated from observations up to period t − 1, and choose A_t = 2 with probability ŵ_t(2) = 1 − ŵ_t(1).
  • After period t = T, we recommend the treatment arm â_T = argmax_{a∈{1,2}} μ̂_T^{AIPW,a}, where

    μ̂_T^{AIPW,a} = (1/T) Σ_{t=1}^T [ 1[A_t = a](Y_t^a − μ̂_t^a) / ŵ_t(a) + μ̂_t^a ],

  and μ̂_t^a is an estimator of E[Y^a] using past observations up to period t − 1. (A minimal code sketch of this loop is given below.)
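
A minimal sketch of this two-armed RS-AIPW loop, assuming Gaussian rewards. The forced clipping of the sampling probability (w_min), the initial variance value, and all variable names are my own illustrative choices rather than the paper's exact specification.

```python
import numpy as np

def rs_aipw_two_arms(means, stds, T, seed=0, w_min=0.05):
    rng = np.random.default_rng(seed)
    obs = [[], []]                                  # past outcomes per arm (up to t-1)
    aipw_sum = np.zeros(2)                          # running sums of the AIPW scores
    for t in range(T):
        # 1. Estimate means and standard deviations from observations up to t-1.
        mu_hat = np.array([np.mean(o) if o else 0.0 for o in obs])
        sd_hat = np.array([np.std(o) if len(o) > 1 else 1.0 for o in obs])
        # 2. Draw A_t = 1 with probability w_hat_t(1) = sd_1 / (sd_1 + sd_2),
        #    clipped away from 0 and 1 so the AIPW weights stay bounded.
        w1 = float(np.clip(sd_hat[0] / (sd_hat[0] + sd_hat[1]), w_min, 1 - w_min))
        w_hat = np.array([w1, 1.0 - w1])
        a = 0 if rng.random() < w1 else 1
        y = rng.normal(means[a], stds[a])           # observed reward Y_t = Y_t^{A_t}
        # AIPW score for each arm: 1[A_t = a](Y_t^a - mu_hat_t^a)/w_hat_t(a) + mu_hat_t^a.
        score = mu_hat.copy()
        score[a] += (y - mu_hat[a]) / w_hat[a]
        aipw_sum += score
        obs[a].append(y)
    mu_aipw = aipw_sum / T                          # AIPW estimates of E[Y^1], E[Y^2]
    return int(np.argmax(mu_aipw)), mu_aipw

print(rs_aipw_two_arms(means=[0.05, 0.01], stds=[1.0, 0.45], T=2000))
```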

Slide 9

RS-AIPW Strategy

- μ̂_T^{AIPW,a} is the AIPW estimator of E[Y^a], also known as the doubly robust estimator.
  • We can use the properties of martingales.
  • This property is helpful because the observations in adaptive experiments are dependent.
  • To obtain the martingale property, we construct the nuisance estimators only from past observations.
  • This is similar to the cross-fitting of double machine learning in Chernozhukov et al. (2016).
  • If we instead use IPW-type estimators, the variance becomes larger.
- Under the RS-AIPW strategy, P(â_T ≠ a*) = P(argmax_{a∈{1,2}} μ̂_T^{AIPW,a} ≠ a*).
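
A one-line conditional-expectation check (my own addition, not on the slide) shows why the martingale machinery applies: since μ̂_t^a and ŵ_t(a) are built only from observations up to period t − 1, they are F_{t−1}-measurable, so

    E[ 1[A_t = a](Y_t^a − μ̂_t^a) / ŵ_t(a) + μ̂_t^a | F_{t−1} ] = ŵ_t(a)(E[Y^a] − μ̂_t^a) / ŵ_t(a) + μ̂_t^a = E[Y^a],

and the centered AIPW scores form a martingale difference sequence even though the data are adaptively collected.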

Slide 10

Asymptotic Optimality

- Assumption: the best arm is unique and a* = 1; that is, E[Y^1] > E[Y^2].
- Consistent strategy: when there exists a unique best arm, a consistent strategy returns the best arm with probability one in large samples: P(â_T = a*) → 1 as T → ∞.
- Small-gap setting: consider a small gap Δ = E[Y^1] − E[Y^2] → 0.

Main result (not mathematically rigorous):
  • [Upper bound] If the empirical allocation (1/T) Σ_{t=1}^T 1[A_t = a] converges almost surely to w*(a), then under some regularity conditions, for large samples,

    P(â_T ≠ a*) = P(μ̂_T^{AIPW,1} < μ̂_T^{AIPW,2}) ≤ exp( −T ( Δ² / (2(√Var(Y^1) + √Var(Y^2))²) + o(Δ²) ) ).

  • [Lower bound] No consistent strategy exceeds this convergence rate.
- This result implies that our proposed RS-AIPW strategy is asymptotically optimal.
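
To make the rate concrete, here is a small numerical evaluation of the leading term exp(−TΔ²/(2(σ₁ + σ₂)²)) of the upper bound, ignoring the o(Δ²) correction; the parameter values are my own illustrative choices (matching the simulation setting on the next slide).

```python
import math

delta = 0.05 - 0.01                      # gap Delta = E[Y^1] - E[Y^2]
sd1, sd2 = 1.0, math.sqrt(0.2)           # standard deviations of the two arms
rate = delta**2 / (2 * (sd1 + sd2)**2)   # leading exponent Delta^2 / (2 (sd1 + sd2)^2)
for T in (10_000, 50_000, 100_000):
    print(T, math.exp(-T * rate))        # leading-order bound on P(hat a_T != a*)
```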

Slide 11

Simulation Studies

- BAI strategies compared:
  • Alpha: strategy using the true variances.
  • Uniform: uniform sampling (RCT).
  • RS-AIPW: proposed strategy.
- y-axis: P(â_T = a*); x-axis: T.
  • E[Y^1] = 0.05, E[Y^2] = 0.01.
  • Top figure: Var(Y^1) = 1, Var(Y^2) = 0.2.
  • Bottom figure: Var(Y^1) = 1, Var(Y^2) = 0.1.
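
A rough Monte-Carlo driver (my own sketch, reusing the rs_aipw_two_arms function from the Slide 8 sketch; not the paper's experimental code) for estimating P(â_T = a*) in the top-figure setting:

```python
import numpy as np

def prob_correct(strategy, means, stds, T, n_rep=200):
    """Estimate P(hat a_T = a*) by repeated runs of a BAI strategy."""
    best = int(np.argmax(means))
    hits = sum(strategy(means, stds, T, seed=r)[0] == best for r in range(n_rep))
    return hits / n_rep

means, stds = [0.05, 0.01], [1.0, np.sqrt(0.2)]       # top-figure setting
for T in (500, 2000, 5000):
    print(T, prob_correct(rs_aipw_two_arms, means, stds, T))
```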

Slide 12

Extensions

Slide 13

Related Work and Details of the Asymptotic Optimality

Slide 14

Best Arm Identification

- Best arm identification (BAI) is an instance of the MAB problem.
➢ Goal: identify (= choose) the best treatment arm via adaptive experiments.
- Two problem settings:
  • Fixed-confidence setting: a sequential-testing-flavored formulation. The sample size is not fixed; the adaptive experiment continues until a predefined criterion is satisfied.
  • Fixed-budget setting: the sample size is fixed, and the decision is made at the last period.
    - In the final period T, we return an estimate of the best treatment arm, â_T.
    - Evaluation: minimizing the probability of misidentification, P(â_T ≠ a*).

Slide 15

Best Arm Identification with a Fixed Budget

- How do we evaluate P(â_T ≠ a*), the performance of a BAI strategy?
- When the best arm is unique, P(â_T ≠ a*) converges to 0 at an exponential speed; that is,

    P(â_T ≠ a*) = exp(−T·(⋆)),

  where (⋆) is a constant term.
  ↔ Local asymptotics (van der Vaart (1998), Hirano and Porter (2009)): P(â_T ≠ a*) is constant.
- We evaluate the term (⋆) by limsup_{T→∞} −(1/T) log P(â_T ≠ a*).
- A lower (upper) bound on the performance P(â_T ≠ a*) is an upper (lower) bound on limsup_{T→∞} −(1/T) log P(â_T ≠ a*).
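
To read the last two bullets concretely (my own paraphrase): if P(â_T ≠ a*) ≈ exp(−T·C) for some constant C > 0, then −(1/T) log P(â_T ≠ a*) ≈ C, so a larger C means a faster decay of the misidentification probability. Consequently, lower-bounding P(â_T ≠ a*) amounts to upper-bounding the achievable exponent C, and upper-bounding P(â_T ≠ a*) amounts to lower-bounding C.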

Slide 16

Distribution-dependent Lower Bound

- Distribution-dependent lower bound, a.k.a. information-theoretic lower bound.
  • A lower bound based on information about the distribution.
  • Such lower bounds are often based on quantities such as the Fisher information and the KL divergence.
- We employ a derivation technique called change-of-measure.
  • This technique has been used for the asymptotic optimality of hypothesis testing (see van der Vaart (1998)) and for lower bounds in the MAB problem (Lai and Robbins (1985)).
  • In BAI, Kaufmann et al. (2016) conjecture a distribution-dependent lower bound.

Slide 17

Lower Bound for Two-armed Bandits

- Introduce an appropriate alternative hypothesis and restrict the strategy class.
- Denote the true distribution (bandit model) by P.
- Denote the set of alternative hypotheses by Alt(P).
- Consistent strategy: returns the true best arm with probability 1 as T → ∞.

Transportation lemma (Lemma 1 of Kaufmann et al. (2016)): for any Q ∈ Alt(P), a consistent strategy satisfies

    limsup_{T→∞} −(1/T) log P_P(â_T ≠ a*) ≤ limsup_{T→∞} (1/T) E_Q[ Σ_{a=1}^K Σ_{t=1}^T 1[A_t = a] log( f_a^Q(Y_t) / f_a(Y_t) ) ],

  where the right-hand side is the expected log-likelihood ratio, f_a and f_a^Q are the pdfs of arm a's reward under P and Q, and P_P and E_Q denote the probability law under the bandit model P and the expectation under Q.

Slide 18

Open Question 1: Upper Bound

- There is no optimal strategy whose upper bound achieves the lower bound.
- The performance of any BAI strategy cannot exceed the lower bound.
- If a performance guarantee (upper bound) of a BAI strategy matches the lower bound, the strategy is optimal.
- In BAI with a fixed budget, there is no strategy whose upper bound achieves the distribution-dependent lower bound conjectured by Kaufmann et al. (2016).
  • Optimal strategies exist in other settings, e.g., BAI with fixed confidence.
- One of the main obstacles is the estimation error of the optimal allocation ratio.
  • Glynn and Juneja (2004): if the optimal allocation ratio is known, the lower bound is achievable.

Slide 19

Open Question 2: Lower Bound

- Since there is no corresponding upper bound, the lower bound conjectured by Kaufmann et al. (2016) may not be correct.
- Carpentier and Locatelli (2016) suggest that under a large-gap setting, the lower bound of Kaufmann et al. (2016) is not achievable.
- Large-gap setting: for the best arm a* ∈ [K], Σ_{a≠a*} 1/(E[Y^{a*}] − E[Y^a])² is bounded by a constant.
  • This implies that E[Y^{a*}] − E[Y^a] is not close to 0.
  • Therefore, we also reconsider the lower bound of Kaufmann et al. (2016).

Slide 20

Lower Bound for Two-armed Bandits

- For the case K = 2, we consider an optimal algorithm under the small-gap setting.
- Consider a small-gap situation: Δ = E[Y^1] − E[Y^2] → 0.

Lower bound for two-armed bandits: under appropriate conditions,

    limsup_{T→∞} −(1/T) log P(â_T ≠ a*) ≤ Δ² / (2(√Var(Y^1) + √Var(Y^2))²) + o(Δ²).

- This lower bound suggests the optimal sample allocation ratio:

    w*(1) = √Var(Y^1) / (√Var(Y^1) + √Var(Y^2))   and   w*(2) = √Var(Y^2) / (√Var(Y^1) + √Var(Y^2)).
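
A heuristic Gaussian calculation (my own sketch, not the slides' proof) indicates where the (√Var(Y^1) + √Var(Y^2))² factor and the allocation w* come from. Write σ_a = √Var(Y^a) and suppose a strategy samples arm a with a fixed fraction w_a. For Gaussian rewards, moving both means to a common value x is the cheapest alternative under which arm 2 becomes the best, and minimizing the expected log-likelihood ratio over x gives

    min_x [ w_1 (E[Y^1] − x)² / (2σ_1²) + w_2 (E[Y^2] − x)² / (2σ_2²) ] = Δ² / (2(σ_1²/w_1 + σ_2²/w_2)).

Maximizing this over w_1 + w_2 = 1 yields w_1 = σ_1/(σ_1 + σ_2), w_2 = σ_2/(σ_1 + σ_2), and the value Δ²/(2(σ_1 + σ_2)²), matching the bound above.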

Slide 21

Upper Bound and Large Deviations

- Next, we consider the upper bound.
- We are interested in evaluating the tail probability:

    P(â_T ≠ a*) ≤ Σ_{a≠a*} P(μ̂_{a,T} ≥ μ̂_{a*,T}) = Σ_{a≠a*} P(μ̂_{a*,T} − μ̂_{a,T} − Δ_a ≤ −Δ_a).

→ Large deviation principle (LDP): evaluation of P(μ̂_{a*,T} − μ̂_{a,T} − Δ_a ≤ C) for a constant C.
⇔ Central limit theorem (CLT): evaluation of P(√T(μ̂_{a*,T} − μ̂_{a,T} − Δ_a) ≤ C).
- There are well-known existing results on LDPs, e.g., the Cramér theorem and the Gärtner–Ellis theorem.
  • These cannot be applied to BAI owing to the non-stationarity of the stochastic process.

Slide 22

Large Deviation Principles for Martingales

- To solve this problem, we derive a novel LDP for martingales by using the change-of-measure technique.
- Idea: transform an upper bound under another distribution into an upper bound under the distribution of interest.
  • Let P be the probability measure of interest and {ξ_t} a martingale difference sequence under P.
  • We want a large deviation bound for the average (1/T) Σ_{t=1}^T ξ_t.
  • For the martingale difference sequence {ξ_t} and a constant λ, define U_T(λ) = Π_{t=1}^T exp(λξ_t) / E[exp(λξ_t) | F_{t−1}].
  • Define the conjugate probability measure P_λ by dP_λ = U_T(λ) dP.
  • Derive the large deviation bound under P_λ, not under the measure P of interest.
  • Then transform the large deviation bound under P_λ into a bound under P via the density ratio dP/dP_λ.

[Diagram: the measure is changed from P to P_λ via dP_λ = U_T(λ) dP; an upper bound derived under P_λ is transferred back to an upper bound under P.]
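
A compact way to see the transfer step (my own rendering of the recipe above): for any event E determined by time T,

    P(E) = E_{P_λ}[ (dP/dP_λ) 1_E ] = E_{P_λ}[ U_T(λ)^{−1} 1_E ],

so an upper bound on the event under P_λ, combined with control of U_T(λ)^{−1} on that event, yields an upper bound under P; the constant λ is then tuned to optimize the resulting exponent.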

Slide 23

Upper Bound

- Under an appropriately designed BAI algorithm, we show the following upper bound.

Upper bound for two-armed bandits: if the empirical allocation (1/T) Σ_{t=1}^T 1[A_t = a] converges almost surely to w*(a), then under some regularity conditions,

    limsup_{T→∞} −(1/T) log P(â_T ≠ a*) ≥ Δ² / (2(√Var(Y^1) + √Var(Y^2))²) + o(Δ²).

  • This result implies a Gaussian approximation in the large deviation regime as Δ → 0.
  • This result is a generalization of the martingale central limit theorem.
- This upper bound matches the lower bound under the small-gap setting.

Slide 24

Conclusion

➢ Main contribution:
  • An optimal algorithm for BAI with a fixed budget (adaptive experiments for policy choice).
➢ Technical contributions:
  • An evaluation framework under a small gap.
  • A novel large deviation bound for martingales.

Slide 25

References

➢ Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap," arXiv:2201.04469.
• Carpentier, A. and Locatelli, A. (2016), "Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem," COLT.
• Fan, X., Grama, I., and Liu, Q. (2013), "Cramér Large Deviation Expansions for Martingales under Bernstein's Condition," Stochastic Processes and their Applications.
• Glynn, P. and Juneja, S. (2004), "A Large Deviations Perspective on Ordinal Optimization," Proceedings of the 2004 Winter Simulation Conference, volume 1, IEEE.
• Hirano, K. and Porter, J. R. (2009), "Asymptotics for Statistical Treatment Rules," Econometrica.
• Kasy, M. and Sautmann, A. (2021), "Adaptive Treatment Assignment in Experiments for Policy Choice," Econometrica.
• Kaufmann, E., Cappé, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," Journal of Machine Learning Research.
• Lai, T. and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics.
• van der Vaart, A. (1998), Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.