Optimal Best Arm Identification with a Fixed Budget under a Small Gap

Presentation slide in AMES 2022.

MasaKat0

August 10, 2022

Transcript

  1. Optimal Best Arm Identification with a Fixed Budget under a Small Gap
     Masahiro Kato (The University of Tokyo / CyberAgent, Inc.)
     Joint work with Masaaki Imaizumi (The University of Tokyo), Kaito Ariu (CyberAgent, Inc. / KTH), Masahiro Nomura (CyberAgent, Inc.), and Chao Qin (Columbia University)
     2022 Asian Meeting of the Econometric Society
     Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap" (https://arxiv.org/abs/2201.04469)

  2. Adaptive Experiments for Policy Choice
     ■ Consider multiple treatment arms with potential outcomes.
       • Treatment arm (also called a treatment or an arm): slots, medicines, online advertisements, etc.
     ➢ Motivation: conduct an experiment to efficiently choose the best treatment arm.
       • Best treatment arm = the treatment arm with the highest expected outcome.
     [Diagram: a decision maker repeatedly chooses a treatment arm a among treatment arms 1, 2, …, K, observes the reward of the chosen arm A_t, and, after the adaptive experiment, recommends the best treatment arm.]

  3. Adaptive Experiments for Policy Choice
     ■ How can we choose the best treatment arm more efficiently?
       → Optimize the sample allocation to each treatment arm during the experiment.
     ■ To optimize the sample allocation, we need to know some parameters, such as the variances.
     ■ In our experiment, we sequentially estimate these parameters and optimize the sample allocation.
       • This setting is called best arm identification (BAI) in machine learning, and adaptive experiments for policy choice in economics (Kasy and Sautmann (2021)).
       • BAI is an instance of the multi-armed bandit (MAB) problem.
       • In the MAB problem, we regard the treatments as the arms of a slot machine.

  4. Problem Setting
     ■ There are K treatment arms [K] = {1, 2, …, K} and a fixed time horizon T.
     ■ After choosing a treatment arm a ∈ [K], the treatment arm returns an outcome Y_a ∈ ℝ.
     ➢ Goal: through the following trial, find the best treatment arm a* = arg max_{a ∈ [K]} 𝔼[Y_a].
     ■ In each period t (the distribution of Y_a is assumed invariant across periods),
       1. choose a treatment arm A_t ∈ [K];
       2. observe the reward of the chosen arm, Y_t = ∑_{a ∈ [K]} 1[A_t = a] Y_t^a.
       • Stop the trial at round t = T.
       • Recommend an estimated best treatment arm â_T ∈ [K].

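As a concrete illustration, the fixed-budget protocol above can be written as a short simulation loop. Below is a minimal sketch in Python, assuming Gaussian outcomes and using a placeholder uniform `choose_arm` rule; both assumptions are illustrative and not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 1000                        # number of treatment arms and fixed budget
means = np.array([0.05, 0.01, 0.0])   # E[Y_a]; arm 1 (index 0) is the best arm
sds = np.array([1.0, 0.5, 0.5])       # outcome standard deviations (assumed Gaussian)

counts = np.zeros(K)                  # number of draws of each arm
sums = np.zeros(K)                    # running sum of observed outcomes per arm

def choose_arm(t):
    # Placeholder sampling rule: pick an arm uniformly at random.
    # A BAI strategy would instead adapt this choice to past observations.
    return rng.integers(K)

for t in range(1, T + 1):
    a = choose_arm(t)                    # 1. choose a treatment arm A_t
    y = rng.normal(means[a], sds[a])     # 2. observe the outcome of the chosen arm
    counts[a] += 1
    sums[a] += y

# After round T, recommend the arm with the highest sample mean.
sample_means = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
print("recommended arm:", int(np.argmax(sample_means)) + 1)   # report arms as 1, ..., K
```

A BAI strategy replaces the placeholder rule with a data-dependent sampling probability; the RS-AIPW strategy described later is one such rule.
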
  5. Performance Evaluation
     ■ Consider directly evaluating the performance of the decision-making.
     ■ Our decision-making: recommend an estimated best arm â_T as the best arm a*.
     ■ We evaluate the decision-making by the probability of misidentification, ℙ(â_T ≠ a*).
       • When there is a unique best arm, ℙ(â_T ≠ a*) converges to zero at an exponential rate.
       • We evaluate the decision-making by the convergence rate of ℙ(â_T ≠ a*).
       • A faster convergence rate of ℙ(â_T ≠ a*) implies better decision-making.
     ■ We aim to recommend the best treatment arm with a smaller probability of misidentification.

  6. RS-AIPW Strategy
     ■ Algorithms in BAI are often referred to as BAI strategies.
     ■ Our BAI strategy consists of the following two components:
       1. Random sampling (RS): following a certain sampling probability, we choose treatment arms in t = 1, 2, …, T.
       2. Recommendation based on the augmented inverse probability weighting (AIPW) estimator: at t = T, we recommend the best arm using the AIPW estimator.
     ➢ We call our strategy the RS-AIPW strategy.

  7. RS-AIPW Strategy
     ■ Our BAI strategy is designed so that the empirical sample allocation ratio (1/T) ∑_{t=1}^{T} 1[A_t = a] of each arm a ∈ [K] converges to the optimal allocation ratio.
       • The optimal allocation ratio is derived from the lower bound on the theoretical performance.
     ■ For example, when K = 2, the optimal allocation ratio of each arm a ∈ [K] is
         w*(1) = √Var(Y_1) / (√Var(Y_1) + √Var(Y_2)) and w*(2) = √Var(Y_2) / (√Var(Y_1) + √Var(Y_2)).
       • That is, allocating samples in proportion to the standard deviations is optimal.
     ■ When K ≥ 3, we need to solve a linear program to decide the optimal allocation ratio.

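For K = 2 the optimal ratio only requires the two variances. A minimal sketch follows; the function name and the example variances are illustrative.

```python
import numpy as np

def optimal_allocation(var1, var2):
    """Optimal two-arm sampling ratio: proportional to the standard deviations."""
    s1, s2 = np.sqrt(var1), np.sqrt(var2)
    return s1 / (s1 + s2), s2 / (s1 + s2)

w1, w2 = optimal_allocation(var1=1.0, var2=0.25)
print(w1, w2)  # 0.666..., 0.333...: two thirds of the budget go to the noisier arm
```
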
  8. RS-AIPW Strategy
     ■ For simplicity, consider the case with two treatment arms, K = 2.
     ■ In each period t = 1, 2, …, T,
       1. estimate the variance of each arm, Var(Y_a), using the past observations up to period t − 1;
       2. choose A_t = 1 with probability ŵ_t(1) = √(Var_t(Y_1)) / (√(Var_t(Y_1)) + √(Var_t(Y_2))) and A_t = 2 with probability ŵ_t(2) = 1 − ŵ_t(1), where Var_t(Y_a) denotes the variance estimate of arm a from the past observations.
     • After period t = T, we recommend the treatment arm â_T = arg max_{a ∈ {1,2}} μ̂_T^{AIPW,a}, where
         μ̂_T^{AIPW,a} = (1/T) ∑_{t=1}^{T} [ 1[A_t = a](Y_t^a − μ̂_t^a) / ŵ_t(a) + μ̂_t^a ],
       and μ̂_t^a is an estimator of 𝔼[Y_a] using the past observations up to period t − 1.

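The two steps above can be put together as a runnable loop. The following is a minimal sketch for K = 2 in Python, assuming Gaussian outcomes; the variance floor, the random seed, and the use of running sample means for the nuisance estimates μ̂_t^a are illustrative choices rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
means = np.array([0.05, 0.01])           # E[Y_1], E[Y_2]
sds = np.array([1.0, np.sqrt(0.2)])      # true standard deviations (unknown to the strategy)

n = np.zeros(2)          # draws per arm
s = np.zeros(2)          # sum of outcomes per arm
ss = np.zeros(2)         # sum of squared outcomes per arm
aipw_sum = np.zeros(2)   # running sum of the per-round AIPW scores for each arm

for t in range(T):
    # 1. Nuisance estimates from observations up to t-1 (a small floor avoids division by zero).
    mu_hat = np.where(n > 0, s / np.maximum(n, 1), 0.0)
    var_hat = np.where(n > 1, ss / np.maximum(n, 1) - mu_hat**2, 1.0)
    sd_hat = np.sqrt(np.maximum(var_hat, 1e-6))

    # 2. Sampling probabilities proportional to the estimated standard deviations.
    w = sd_hat / sd_hat.sum()
    a = rng.choice(2, p=w)                # draw A_t
    y = rng.normal(means[a], sds[a])      # observe Y_t

    # AIPW score: mu_hat for both arms, plus the inverse-probability-weighted residual for the drawn arm.
    score = mu_hat.copy()
    score[a] += (y - mu_hat[a]) / w[a]
    aipw_sum += score

    n[a] += 1
    s[a] += y
    ss[a] += y**2

mu_aipw = aipw_sum / T
print("AIPW estimates:", mu_aipw, "-> recommend arm", int(np.argmax(mu_aipw)) + 1)
```

In practice one may also mix a small amount of uniform exploration into w in early rounds so that ŵ_t(a) stays bounded away from zero; that detail is omitted in this sketch.
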
  9. RS-AIPW Strategy
     ■ μ̂_T^{AIPW,a} is the AIPW estimator of 𝔼[Y_a], a.k.a. the doubly robust estimator.
       • We can use the properties of martingales.
       • This property is helpful because the observations in adaptive experiments are dependent.
       • To obtain the martingale property, we construct the nuisance estimators from past observations only.
       • This is similar to the cross-fitting of double machine learning in Chernozhukov et al. (2016).
       • If we used IPW-type estimators instead, the variance would become larger.
     ■ In the RS-AIPW strategy, ℙ(â_T ≠ a*) = ℙ(arg max_{a ∈ {1,2}} μ̂_T^{AIPW,a} ≠ a*).

  10. Asymptotic Optimality
     ■ Assumption: the best arm is unique and a* = 1; that is, 𝔼[Y_1] > 𝔼[Y_2].
     ■ Consistent strategy: when there exists a unique best arm, a consistent strategy returns the best arm with probability one in large samples: ℙ(â_T = a*) → 1 as T → ∞.
     ■ Small-gap setting: consider a small gap, Δ = 𝔼[Y_1] − 𝔼[Y_2] → 0.
       • [Upper bound] If (1/T) ∑_{t=1}^{T} 1[A_t = a] → w*(a), then under some regularity conditions, for large samples,
           ℙ(â_T ≠ a*) = ℙ(μ̂_T^{AIPW,1} < μ̂_T^{AIPW,2}) ≤ exp( −T [ Δ² / (2(√Var(Y_1) + √Var(Y_2))²) + o(Δ²) ] ).
       • [Lower bound] No consistent strategy exceeds this convergence rate.
     ■ This result implies that our proposed RS-AIPW strategy is asymptotically optimal.
     (Main result; not mathematically rigorous.)

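As a quick numerical illustration of the exponential guarantee, the leading-order bound can be evaluated directly; the gap, variances, and budget below are illustrative numbers, and the o(Δ²) term is ignored.

```python
import numpy as np

delta, var1, var2, T = 0.04, 1.0, 0.2, 10_000   # illustrative gap, variances, and budget
rate = delta**2 / (2 * (np.sqrt(var1) + np.sqrt(var2))**2)
print(np.exp(-T * rate))   # leading-order bound on P(misidentification)
```
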
  11. Simulation Studies
     ■ BAI strategies:
       • Alpha: strategy using the true variances.
       • Uniform: uniform sampling (RCT).
       • RS-AIPW: proposed strategy.
     ■ y-axis: ℙ(â_T = a*); x-axis: T.
       • 𝔼[Y_1] = 0.05, 𝔼[Y_2] = 0.01.
       • Top figure: Var(Y_1) = 1, Var(Y_2) = 0.2.
       • Bottom figure: Var(Y_1) = 1, Var(Y_2) = 0.1.

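A small Monte Carlo sketch of this kind of comparison under the top-figure parameters is given below. The strategies here are simplified stand-ins: "Alpha" uses the true-variance allocation and "Uniform" splits the budget evenly; neither is the exact implementation behind the figures.

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.05, 0.01])
sds = np.array([1.0, np.sqrt(0.2)])     # top-figure setting: Var(Y_1) = 1, Var(Y_2) = 0.2

def prob_correct(w1, T, n_rep=2000):
    """Monte Carlo estimate of P(recommended arm = best arm) under a fixed allocation ratio w1."""
    n1 = int(round(w1 * T))
    m1 = rng.normal(means[0], sds[0], size=(n_rep, n1)).mean(axis=1)
    m2 = rng.normal(means[1], sds[1], size=(n_rep, T - n1)).mean(axis=1)
    return float(np.mean(m1 > m2))

w_alpha = sds[0] / sds.sum()            # "Alpha": allocation from the true standard deviations
for T in (500, 2000, 5000):
    print(T, prob_correct(0.5, T), prob_correct(w_alpha, T))   # Uniform vs Alpha
```
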
  12. Best Arm Identification
     ■ Best arm identification is an instance of the MAB problem.
     ➢ Goal: identify (= choose) the best treatment arm via adaptive experiments.
     ■ Two problem settings:
       • Fixed-confidence setting: a sequential-testing-flavored formulation. The sample size is not fixed; the adaptive experiment continues until a predefined criterion is satisfied.
       • Fixed-budget setting: the sample size is fixed, and the decision is made at the last period.
       • In the final period T, we return an estimate of the best treatment arm, â_T.
       • Evaluation: minimize the probability of misidentification, ℙ(â_T ≠ a*).

  13. Best Arm Identification with a Fixed Budget
     ■ How do we evaluate ℙ(â_T ≠ a*), the performance of BAI strategies?
     ■ When the best arm is unique, ℙ(â_T ≠ a*) converges to 0 at an exponential speed; that is,
         ℙ(â_T ≠ a*) = exp(−T · (⋆)),
       where (⋆) is a constant term.
       ↔ Under local asymptotics (van der Vaart (1998), Hirano and Porter (2009)), ℙ(â_T ≠ a*) instead converges to a constant.
     ■ Consider evaluating the term (⋆) by limsup_{T→∞} −(1/T) log ℙ(â_T ≠ a*).
     ■ A lower (upper) bound on the performance ℙ(â_T ≠ a*) is an upper (lower) bound on limsup_{T→∞} −(1/T) log ℙ(â_T ≠ a*).

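The exponent (⋆) can also be estimated from simulated runs by plugging a Monte Carlo estimate of the misidentification probability into −(1/T) log ℙ(â_T ≠ a*). A rough self-contained sketch under a uniform allocation with Gaussian outcomes follows; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
means = np.array([0.05, 0.01])
sds = np.array([1.0, np.sqrt(0.2)])
n_rep = 20_000

for T in (200, 500, 1000):
    # Uniform allocation: half the budget per arm; recommend the larger sample mean.
    m1 = rng.normal(means[0], sds[0], size=(n_rep, T // 2)).mean(axis=1)
    m2 = rng.normal(means[1], sds[1], size=(n_rep, T // 2)).mean(axis=1)
    p_miss = np.mean(m1 <= m2)
    print(T, p_miss, -np.log(p_miss) / T)   # Monte Carlo estimate of the exponent (*)
```
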
  14. Distribution-dependent Lower Bound
     ■ Distribution-dependent lower bound, a.k.a. information-theoretic lower bound.
       • A lower bound based on information about the distribution.
       • Such lower bounds are often based on quantities such as the Fisher information and the KL divergence.
     ■ We employ a derivation technique called change-of-measure.
       • This technique has been used for the asymptotic optimality of hypothesis testing (see van der Vaart (1998)) and for lower bounds in the MAB problem (Lai and Robbins (1985)).
       • In BAI, Kaufmann et al. (2016) conjecture a distribution-dependent lower bound.

  15. Lower Bound for Two-armed Bandits
     ■ Introduce an appropriate alternative hypothesis and restrict the strategy class.
     ■ Denote the true distribution (bandit model) by P.
     ■ Denote the set of alternative hypotheses by Alt(P).
     ■ Consistent strategy: returns the true best arm with probability 1 as T → ∞.
     ■ Transportation lemma (Lemma 1 of Kaufmann et al. (2016)): for any Q ∈ Alt(P), a consistent strategy satisfies
         limsup_{T→∞} −(1/T) log ℙ_P(â_T ≠ a*) ≤ limsup_{T→∞} (1/T) 𝔼_Q[ ∑_{a=1}^{K} ∑_{t=1}^{T} 1[A_t = a] log( f_a^Q(Y_t) / f_a(Y_t) ) ],
       where the term inside the expectation is the log-likelihood ratio, f_a and f_a^Q are the pdfs of arm a's reward under P and Q, ℙ_P is the probability law under the bandit model P, and 𝔼_Q is the expectation under Q.

  16. Open Question 1: Upper Bound
     ■ There is no strategy whose upper bound achieves the lower bound.
     ■ The performance of any BAI strategy cannot exceed the lower bound.
     ■ If the performance guarantee of a BAI strategy matches the lower bound, the strategy is optimal.
     ■ In BAI with a fixed budget, there is no strategy whose upper bound achieves the distribution-dependent lower bound conjectured by Kaufmann et al. (2016).
       • Optimal strategies exist in other settings, e.g., BAI with fixed confidence.
     ■ One of the main obstacles is the estimation error of the optimal allocation ratio.
       • Glynn and Juneja (2004): if the optimal allocation ratio is known, the lower bound is achievable.

  17. Open Question 2: Lower Bound
     ■ Since there is no corresponding upper bound, the lower bound conjectured by Kaufmann et al. (2016) may not be correct.
     ■ Carpentier and Locatelli (2016) suggest that, under a large-gap setting, the lower bound of Kaufmann et al. (2016) is not achievable.
     ■ Large-gap setting: for the best arm a* ∈ [K], ∑_{a ≠ a*} 1 / (𝔼[Y_{a*}] − 𝔼[Y_a])² is bounded by a constant.
       • This implies that 𝔼[Y_{a*}] − 𝔼[Y_a] is not close to 0.
       • Therefore, we also reconsider the lower bound of Kaufmann et al. (2016).

  18. Lower Bound for Two-armed Bandits
     ■ For the case with K = 2, we consider an optimal algorithm under a small-gap setting.
     ■ Consider a small-gap situation: Δ = 𝔼[Y_1] − 𝔼[Y_2] → 0.
       • Lower bound for two-armed bandits: under appropriate conditions,
           limsup_{T→∞} −(1/T) log ℙ(â_T ≠ a*) ≤ Δ² / (2(√Var(Y_1) + √Var(Y_2))²) + o(Δ²).
     ■ This lower bound suggests the optimal sample allocation ratio:
         w*(1) = √Var(Y_1) / (√Var(Y_1) + √Var(Y_2)) and w*(2) = √Var(Y_2) / (√Var(Y_1) + √Var(Y_2)).

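A worked sketch of where this constant and allocation come from in the Gaussian case, using the change-of-measure bound from slide 15: it assumes Gaussian rewards with known variances and an alternative model Q that moves both means to a common value m (the binding case), so it is a simplified calculation rather than the paper's exact argument.

```latex
% Two Gaussian arms P_a = N(mu_a, sigma_a^2) with mu_1 > mu_2 and gap Delta = mu_1 - mu_2.
% The alternative Q_a = N(m, sigma_a^2) gives both arms the same mean m, so that
\[
  \sum_{a=1,2} w(a)\,\mathrm{KL}(Q_a \,\|\, P_a)
  = \frac{w(1)\,(m-\mu_1)^2}{2\sigma_1^2} + \frac{w(2)\,(m-\mu_2)^2}{2\sigma_2^2}.
\]
% Minimizing over the common mean m yields
\[
  \inf_{m}\ \sum_{a=1,2} w(a)\,\mathrm{KL}(Q_a \,\|\, P_a)
  = \frac{\Delta^2}{2\left(\sigma_1^2 / w(1) + \sigma_2^2 / w(2)\right)},
\]
% and maximizing over w(1) + w(2) = 1 gives the allocation and the small-gap constant:
\[
  w^*(a) = \frac{\sigma_a}{\sigma_1 + \sigma_2}, \qquad
  \sup_{w}\,\inf_{m}\ \sum_{a=1,2} w(a)\,\mathrm{KL}(Q_a \,\|\, P_a)
  = \frac{\Delta^2}{2(\sigma_1 + \sigma_2)^2}.
\]
```

Here σ_a = √Var(Y_a) and w(a) plays the role of the limiting sample allocation, so the maximizing w*(a) is exactly the standard-deviation-proportional allocation above.
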
  19. Upper Bound and Large Deviations
     ■ Next, we consider the upper bound.
     ■ We are interested in evaluating the tail probability:
         ℙ(â_T ≠ a*) ≤ ∑_{a ≠ a*} ℙ(μ̂_{a,T} ≥ μ̂_{a*,T}) = ∑_{a ≠ a*} ℙ(μ̂_{a*,T} − μ̂_{a,T} − Δ_a ≤ −Δ_a).
       → Large deviation principle (LDP): evaluation of ℙ(μ̂_{a*,T} − μ̂_{a,T} − Δ_a ≤ C), where C is a constant.
       ⇔ Central limit theorem (CLT): evaluation of ℙ(√T (μ̂_{a*,T} − μ̂_{a,T} − Δ_a) ≤ C).
     ■ There are well-known existing results on LDPs, e.g., the Cramér theorem and the Gärtner-Ellis theorem.
       • These cannot be applied to BAI owing to the non-stationarity of the stochastic process.

  20. Large Deviation Principles for Martingales
     ■ To solve this problem, we derive a novel LDP for martingales by using the change-of-measure technique.
     ■ Transform an upper bound under another distribution into one under the distribution of interest.
       • Let ℙ be the probability measure of interest and {ξ_t} be a martingale difference sequence under ℙ.
       • Consider obtaining the large deviation bound for the average (1/T) ∑_{t=1}^{T} ξ_t.
       • For the martingale difference sequence {ξ_t} and a constant λ, define
           U_T = ∏_{t=1}^{T} exp(λ ξ_t) / 𝔼[exp(λ ξ_t) | F_{t−1}].
       • Define the conjugate probability measure ℙ_λ by dℙ_λ = U_T dℙ.
       • Derive the large deviation bound under ℙ_λ, not under the ℙ of interest.
       • Then, transform the large deviation bound under ℙ_λ into a bound under ℙ via the density ratio dℙ/dℙ_λ.
     [Diagram: the change of measures between ℙ and ℙ_λ with dℙ_λ/dℙ = U_T; an upper bound derived under ℙ_λ is transferred to an upper bound under ℙ.]

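The conjugate measure ℙ_λ is well defined because each factor of U_T has conditional mean one, so {U_t} is a mean-one martingale and U_T integrates to one under ℙ. Below is a small simulation sketch of this fact, assuming i.i.d. standard Gaussian martingale differences and an arbitrary λ; both are illustrative choices, not the setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rep, T, lam = 100_000, 50, 0.3

# Illustrative martingale differences: i.i.d. N(0, 1) increments, for which
# E[exp(lam * xi_t) | F_{t-1}] = exp(lam**2 / 2) in closed form.
xi = rng.normal(size=(n_rep, T))
U_T = np.prod(np.exp(lam * xi) / np.exp(lam**2 / 2), axis=1)

print(U_T.mean())   # close to 1: U_T is a valid Radon-Nikodym derivative dP_lambda / dP
```
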
  21. Upper Bound
     ■ Under an appropriately designed BAI algorithm, we show the following upper bound.
       • Upper bound for two-armed bandits: if (1/T) ∑_{t=1}^{T} 1[A_t = a] → w*(a), then under some regularity conditions,
           limsup_{T→∞} −(1/T) log ℙ(â_T ≠ a*) ≥ Δ² / (2(√Var(Y_1) + √Var(Y_2))²) + o(Δ²).
       • This result implies a Gaussian approximation of the large deviation probability as Δ → 0.
       • This result is a generalization of the martingale central limit theorem.
     ■ This upper bound matches the lower bound under the small-gap setting.

  22. Conclusion
     ➢ Main contribution:
       • An optimal algorithm for BAI with a fixed budget (adaptive experiments for policy choice).
     ➢ Technical contributions:
       • An evaluation framework with a small gap.
       • A novel large deviation bound for martingales.

  23. References
     ➢ Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap."
     • Carpentier, A. and Locatelli, A. (2016), "Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem," COLT.
     • Glynn, P. and Juneja, S. (2004), "A Large Deviations Perspective on Ordinal Optimization," Proceedings of the 2004 Winter Simulation Conference, IEEE.
     • Kasy, M. and Sautmann, A. (2021), "Adaptive Treatment Assignment in Experiments for Policy Choice," Econometrica.
     • Kaufmann, E., Cappé, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," Journal of Machine Learning Research.
     • Fan, X., Grama, I., and Liu, Q. (2013), "Cramér Large Deviation Expansions for Martingales under Bernstein's Condition," Stochastic Processes and their Applications.
     • Lai, T. and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics.
     • Hirano, K. and Porter, J. R. (2009), "Asymptotics for Statistical Treatment Rules," Econometrica.
     • van der Vaart, A. (1998), Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.