
# Optimal Best Arm Identification with a Fixed Budget under a Small Gap

Presentation slides at AMES 2022 (2022 Asian Meeting of the Econometric Society), August 10, 2022.

## Transcript

1. Optimal Best Arm Identification
with a Fixed Budget under a Small Gap
Masahiro Kato (The University of Tokyo / CyberAgent, Inc.)
Joint work with
Masaaki Imaizumi (The University of Tokyo), Kaito Ariu (CyberAgent, Inc. / KTH),
Masahiro Nomura (CyberAgent, Inc.), Chao Qin (Columbia University)
2022 Asian Meeting of the Econometric Society
Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), “Best Arm Identification with a Fixed Budget under a Small Gap” (https://arxiv.org/abs/2201.04469)

2. Adaptive Experiments for Policy Choice
■ Consider multiple treatment arms, each with potential outcomes.
• Treatment arm (also called a treatment or an arm): slot machines, medicines, online advertisements, etc.
➢ Motivation: conduct an experiment to efficiently choose the best treatment arm.
• Best treatment arm = the treatment arm with the highest expected outcome.
[Figure: the decision maker chooses a treatment arm 𝑎 from treatment arms 1, …, 𝐾 and observes the reward of the chosen arm 𝐴_𝑡; after the adaptive experiment, the best treatment arm is recommended.]

3. Adaptive Experiments for Policy Choice
■ How can we choose the best treatment arm more efficiently?
→ Optimize the sample allocation to each treatment arm during the experiment.
■ To optimize the sample allocation, we need to know some parameters, such as the variances.
■ In our experiment, we sequentially estimate the parameters and optimize the sample allocation.
• This setting is called best arm identification (BAI) in machine learning.
It is also called adaptive experiments for policy choice in economics (Kasy and Sautmann (2021)).
• BAI is an instance of the multi-armed bandit (MAB) problem.
• In the MAB problem, we regard the treatments as the arms of a slot machine.

4. Problem Setting
■ There are 𝐾 treatment arms [𝐾] = {1, 2, …, 𝐾} and a fixed time horizon 𝑇.
■ After choosing a treatment arm 𝑎 ∈ [𝐾], the arm returns an outcome 𝑌_𝑎 ∈ ℝ.
➢ Goal: after the following trial, find the best treatment arm 𝑎* = arg max_{𝑎∈[𝐾]} 𝔼[𝑌_𝑎].
■ In each period 𝑡 (assume that the distribution of 𝑌_𝑎 is invariant across periods):
1. Choose a treatment arm 𝐴_𝑡 ∈ [𝐾].
2. Observe the reward of the chosen arm, 𝑌_𝑡 = Σ_{𝑎∈[𝐾]} 1[𝐴_𝑡 = 𝑎] 𝑌_𝑡^𝑎.
• Stop the trial at round 𝑡 = 𝑇.
• Recommend an estimated best treatment arm â_𝑇 ∈ [𝐾].
[Figure: over rounds 𝑡 = 1, …, 𝑇, one of arms 1, …, 𝐾 is chosen each round and the reward 𝑌_𝑡 = Σ_{𝑎∈[𝐾]} 1[𝐴_𝑡 = 𝑎] 𝑌_𝑡^𝑎 is observed.]

5. Performance Evaluation
■ We consider directly evaluating the performance of the decision making.
■ Our decision making: recommend an estimated best arm â_𝑇 as the best arm 𝑎*.
■ We evaluate the decision making by the probability of misidentification ℙ(â_𝑇 ≠ 𝑎*).
• When there is a unique best arm, ℙ(â_𝑇 ≠ 𝑎*) converges to zero at an exponential rate.
• We evaluate the decision making via the convergence rate of ℙ(â_𝑇 ≠ 𝑎*).
• A faster convergence rate of ℙ(â_𝑇 ≠ 𝑎*) implies better decision making.
■ We aim to recommend the best treatment arm with a smaller probability of misidentification.
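To illustrate the exponential decay, here is a small Monte Carlo sketch (purely illustrative; the Gaussian arms, their means, and the uniform-sampling design are assumptions, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(2)

def misid_prob(T: int, n_runs: int = 2000) -> float:
    """Monte Carlo estimate of P(a_hat_T != a*) for two Gaussian arms sampled
    uniformly (T/2 draws each), recommending the larger sample mean."""
    # Arm 1 is best: E[Y_1] = 0.3 > E[Y_2] = 0.0; both variances are 1.
    m1 = rng.normal(0.3, 1.0, size=(n_runs, T // 2)).mean(axis=1)
    m2 = rng.normal(0.0, 1.0, size=(n_runs, T // 2)).mean(axis=1)
    return float(np.mean(m1 <= m2))

# Misidentification becomes rapidly rarer as the budget T grows.
p_small, p_large = misid_prob(50), misid_prob(400)
```

Even this naive uniform-sampling design exhibits the exponential decay; the question in the following slides is how fast that rate can be made.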

6. RS-AIPW Strategy
■ Algorithms in BAI are often referred to as BAI strategies.
■ Our BAI strategy consists of the following two components:
1. Random sampling (RS):
In 𝑡 = 1, 2, …, 𝑇, we choose treatment arms according to a sampling probability.
2. Recommendation based on the augmented inverse probability weighting (AIPW) estimator:
At 𝑡 = 𝑇, we recommend the best arm using the AIPW estimator.
➢ We call our strategy the RS-AIPW strategy.

7. RS-AIPW Strategy
■ Our BAI strategy is designed so that the empirical sample allocation ratio (1/𝑇) Σ_{𝑡=1}^𝑇 1[𝐴_𝑡 = 𝑎] of each arm 𝑎 ∈ [𝐾] converges to the optimal allocation ratio.
• The optimal allocation ratio is derived from the lower bound on the theoretical performance.
■ For example, when 𝐾 = 2, the optimal allocation ratio of each arm 𝑎 ∈ [𝐾] is
𝑤*(1) = √Var(𝑌_1) / (√Var(𝑌_1) + √Var(𝑌_2)) and 𝑤*(2) = √Var(𝑌_2) / (√Var(𝑌_1) + √Var(𝑌_2)).
• This implies that allocating samples in proportion to the standard deviations is optimal.
■ When 𝐾 ≥ 3, we need to solve a linear program to decide the optimal allocation ratio.
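A minimal sketch of this 𝐾 = 2 rule (the function name and the example variances are illustrative, not from the paper):

```python
import math

def optimal_allocation(var1: float, var2: float) -> tuple[float, float]:
    """Optimal allocation ratio for K = 2 arms: sample each arm in
    proportion to its standard deviation (Neyman allocation)."""
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    w1 = s1 / (s1 + s2)
    return w1, 1.0 - w1

# Example: Var(Y_1) = 1, Var(Y_2) = 0.25, so the standard deviations are
# 1 and 0.5, and arm 1 receives 2/3 of the samples.
w1, w2 = optimal_allocation(1.0, 0.25)
```

Note the square roots: an arm with four times the variance receives only twice the samples, not four times.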

8. RS-AIPW Strategy
■ For simplicity, consider the case with two treatment arms, 𝐾 = 2.
■ In each period 𝑡 = 1, 2, …, 𝑇:
1. Estimate the variance Var(𝑌_𝑎) of each arm using the observations up to period 𝑡 − 1.
2. Choose 𝐴_𝑡 = 1 with probability ŵ_𝑡(1) = √V̂ar_𝑡(𝑌_1) / (√V̂ar_𝑡(𝑌_1) + √V̂ar_𝑡(𝑌_2)), and 𝐴_𝑡 = 2 with probability ŵ_𝑡(2) = 1 − ŵ_𝑡(1).
• After period 𝑡 = 𝑇, we recommend the treatment arm â_𝑇 = arg max_{𝑎∈{1,2}} μ̂_𝑇^{AIPW,𝑎}, where
μ̂_𝑇^{AIPW,𝑎} = (1/𝑇) Σ_{𝑡=1}^𝑇 { 1[𝐴_𝑡 = 𝑎](𝑌_𝑡^𝑎 − μ̂_𝑡^𝑎) / ŵ_𝑡(𝑎) + μ̂_𝑡^𝑎 },
and μ̂_𝑡^𝑎 is an estimator of 𝔼[𝑌_𝑎] using the observations up to period 𝑡 − 1.
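The two steps above can be sketched in Python as follows (a minimal illustration, not the paper's implementation: Gaussian rewards, naive plug-in mean/variance estimates, and the default parameter values are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def rs_aipw(T: int, means=(0.5, 0.3), sds=(1.0, 0.5)) -> int:
    """Sketch of the RS-AIPW strategy for K = 2 Gaussian arms.
    Returns the index (0 or 1) of the recommended arm."""
    obs = [[], []]           # past outcomes per arm
    aipw = np.zeros((T, 2))  # per-round AIPW score for each arm
    for t in range(T):
        # 1. Plug-in estimates from observations up to round t-1
        #    (crude defaults before an arm has >= 2 observations).
        mu = [np.mean(o) if o else 0.0 for o in obs]
        sd = [np.std(o) if len(o) > 1 else 1.0 for o in obs]
        # 2. Estimated optimal allocation: sample in proportion to sd.
        w1 = sd[0] / (sd[0] + sd[1])
        w = (w1, 1.0 - w1)
        a = 0 if rng.random() < w1 else 1
        y = rng.normal(means[a], sds[a])
        # AIPW score 1[A_t = a](Y_t - mu_t^a)/w_t(a) + mu_t^a per arm.
        for arm in range(2):
            aipw[t, arm] = mu[arm] + (arm == a) * (y - mu[arm]) / w[arm]
        obs[a].append(y)
    # Recommend the arm with the largest AIPW estimate of E[Y_a].
    return int(np.argmax(aipw.mean(axis=0)))

# Arm 0 has the higher mean, so it should usually be recommended.
picks = [rs_aipw(300) for _ in range(20)]
```

The key design point, mirrored in the code, is that the nuisance estimates at round 𝑡 use only data up to round 𝑡 − 1, which is what yields the martingale structure discussed on the next slide.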

9. RS-AIPW Strategy
■ μ̂_𝑇^{AIPW,𝑎} is the AIPW estimator of 𝔼[𝑌_𝑎], a.k.a. the doubly robust estimator.
• We can use the properties of martingales.
• This property is helpful because the observations in adaptive experiments are dependent.
• To obtain the martingale property, we construct the nuisance estimators from past observations.
• This is similar to the cross-fitting of double machine learning in Chernozhukov et al. (2016).
• If we instead use IPW-type estimators, the variance becomes larger.
■ Under the RS-AIPW strategy, ℙ(â_𝑇 ≠ 𝑎*) = ℙ(arg max_{𝑎∈{1,2}} μ̂_𝑇^{AIPW,𝑎} ≠ 𝑎*).

10. Asymptotic Optimality
■ Assumption: the best arm is unique and 𝑎* = 1; that is, 𝔼[𝑌_1] > 𝔼[𝑌_2].
■ Consistent strategy: when there exists a unique best arm, consistent strategies return the best arm with probability one in large samples: ℙ(â_𝑇 = 𝑎*) → 1 as 𝑇 → ∞.
■ Small-gap setting: consider a small gap Δ = 𝔼[𝑌_1] − 𝔼[𝑌_2] → 0.

Main result (not mathematically rigorous):
• [Upper bound] If (1/𝑇) Σ_{𝑡=1}^𝑇 1[𝐴_𝑡 = 𝑎] → 𝑤*(𝑎) almost surely, then under some regularity conditions, for large samples,
ℙ(â_𝑇 ≠ 𝑎*) = ℙ(μ̂_𝑇^{AIPW,1} < μ̂_𝑇^{AIPW,2}) ≤ exp(−𝑇 {Δ² / (2(√Var(𝑌_1) + √Var(𝑌_2))²) + 𝑜(Δ²)}).
• [Lower bound] No consistent strategy exceeds this convergence rate.
■ This result implies that our proposed RS-AIPW strategy is asymptotically optimal.
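As a quick numerical illustration (a sketch: it keeps only the leading term of the exponent, drops the 𝑜(Δ²) correction, and the parameter values are arbitrary), the leading term of this upper bound can be evaluated as:

```python
import math

def misid_upper_bound(T: int, delta: float, var1: float, var2: float) -> float:
    """Leading term of the upper bound on P(a_hat_T != a*):
    exp(-T * delta^2 / (2 (sqrt(Var(Y_1)) + sqrt(Var(Y_2)))^2)),
    ignoring the o(delta^2) correction."""
    rate = delta**2 / (2.0 * (math.sqrt(var1) + math.sqrt(var2)) ** 2)
    return math.exp(-T * rate)

# Halving the gap quarters the exponent, so a smaller gap needs a much
# larger budget T for the same guarantee.
b_small = misid_upper_bound(T=10_000, delta=0.04, var1=1.0, var2=0.2)
b_large = misid_upper_bound(T=10_000, delta=0.08, var1=1.0, var2=0.2)
```

This makes the small-gap regime concrete: the exponent scales with Δ², which is exactly why the bound is stated up to 𝑜(Δ²) terms.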

11. Simulation Studies
■ BAI strategies:
• Alpha: strategy using the true variances.
• Uniform: uniform sampling (RCT).
• RS-AIPW: proposed strategy.
■ y-axis: ℙ(â_𝑇 = 𝑎*); x-axis: 𝑇.
• 𝔼[𝑌_1] = 0.05, 𝔼[𝑌_2] = 0.01.
• Top figure: Var(𝑌_1) = 1, Var(𝑌_2) = 0.2.
• Bottom figure: Var(𝑌_1) = 1, Var(𝑌_2) = 0.1.

12. Extensions

13. Related work and Details of the Asymptotic Optimality

14. Best Arm Identification
■ Best arm identification is an instance of the MAB problem.
➢ Goal: identify (= choose) the best treatment arm via adaptive experiments.
■ Two problem settings:
• Fixed-confidence setting: a sequential-testing-flavored formulation; the sample size is not fixed.
The adaptive experiment continues until a predefined criterion is satisfied.
• Fixed-budget setting: the sample size is fixed, and the decision is made at the last period.
• In the final period 𝑇, we return an estimate of the best treatment arm, â_𝑇.
• Evaluation: minimize the probability of misidentification ℙ(â_𝑇 ≠ 𝑎*).

15. Best Arm Identification with a Fixed Budget
■ How can we evaluate ℙ(â_𝑇 ≠ 𝑎*), the performance of a BAI strategy?
■ When the best arm is unique, ℙ(â_𝑇 ≠ 𝑎*) converges to 0 with an exponential speed; that is,
ℙ(â_𝑇 ≠ 𝑎*) = exp(−𝑇(⋆)),
where (⋆) is a constant term.
↔ Local asymptotics (van der Vaart (1998), Hirano and Porter (2009)): ℙ(â_𝑇 ≠ 𝑎*) is constant.
■ We consider evaluating the term (⋆) by lim sup_{𝑇→∞} −(1/𝑇) log ℙ(â_𝑇 ≠ 𝑎*).
■ A performance lower (upper) bound on ℙ(â_𝑇 ≠ 𝑎*) is
an upper (lower) bound on lim sup_{𝑇→∞} −(1/𝑇) log ℙ(â_𝑇 ≠ 𝑎*).

16. Distribution-dependent Lower Bound
■ Distribution-dependent lower bound, a.k.a. information-theoretic lower bound.
• A lower bound based on information about the distribution.
• Such lower bounds are often based on quantities such as the Fisher information and the KL divergence.
■ We employ a derivation technique called change-of-measure.
• This technique has been used for the asymptotic optimality of hypothesis testing (see van der Vaart (1998)) and for lower bounds in the MAB problem (Lai and Robbins (1985)).
• In BAI, Kaufmann et al. (2016) conjecture a distribution-dependent lower bound.

17. Lower Bound for Two-armed Bandits
■ We introduce appropriate alternative hypotheses and restrict the strategy class.
■ Denote the true distribution (bandit model) by 𝑃.
■ Denote the set of alternative hypotheses by Alt(𝑃).
■ Consistent strategy: returns the true best arm with probability 1 as 𝑇 → ∞.

Transportation lemma (Lemma 1 of Kaufmann et al. (2016)):
For any 𝑄 ∈ Alt(𝑃), a consistent strategy satisfies
lim sup_{𝑇→∞} −(1/𝑇) log ℙ_𝑃(â_𝑇 ≠ 𝑎*) ≤ lim sup_{𝑇→∞} (1/𝑇) 𝔼_𝑄[ Σ_{𝑎∈[𝐾]} Σ_{𝑡=1}^𝑇 1[𝐴_𝑡 = 𝑎] log( 𝑓_𝑎^𝑄(𝑌_𝑎) / 𝑓_𝑎(𝑌_𝑎) ) ],
where the log term is the log-likelihood ratio, 𝑓_𝑎 and 𝑓_𝑎^𝑄 are the pdfs of arm 𝑎's reward under 𝑃 and 𝑄, and ℙ_𝑃 and 𝔼_𝑄 denote the probability law under 𝑃 and the expectation under 𝑄.

18. Open Question 1: Upper Bound
■ The performance of any BAI strategy cannot exceed the lower bound.
■ If a performance guarantee of a BAI strategy matches the lower bound, the strategy is optimal.
■ However, in BAI with a fixed budget, there is no optimal strategy whose upper bound achieves the distribution-dependent lower bound conjectured by Kaufmann et al. (2016).
• Optimal strategies exist in other settings, e.g., BAI with fixed confidence.
■ One of the main difficulties is the estimation error of the optimal allocation ratio.
• Glynn and Juneja (2004): if the optimal allocation ratio is known, the lower bound is achievable.

19. Open Question 2: Lower Bound
■ Since there is no corresponding upper bound, the lower bound conjectured by Kaufmann et al. (2016) may not be correct.
■ Carpentier and Locatelli (2016) suggest that under a large-gap setting, the lower bound by Kaufmann et al. (2016) is not achievable.
■ Large-gap setting: for the best arm 𝑎* ∈ [𝐾], Σ_{𝑎≠𝑎*} 1 / (𝔼[𝑌_{𝑎*}] − 𝔼[𝑌_𝑎])² is bounded by a constant.
• This implies that 𝔼[𝑌_{𝑎*}] − 𝔼[𝑌_𝑎] is not close to 0.
• Therefore, we also reconsider the lower bound of Kaufmann et al. (2016).

20. Lower Bound for Two-armed Bandits
■ For the case with 𝐾 = 2, we consider an optimal algorithm under a small-gap setting.
■ Consider a small-gap situation: Δ = 𝔼[𝑌_1] − 𝔼[𝑌_2] → 0.

Lower bound for the two-armed bandits:
• Under appropriate conditions,
lim sup_{𝑇→∞} −(1/𝑇) log ℙ(â_𝑇 ≠ 𝑎*) ≤ Δ² / (2(√Var(𝑌_1) + √Var(𝑌_2))²) + 𝑜(Δ²).
■ This lower bound suggests the optimal sample allocation ratio:
𝑤*(1) = √Var(𝑌_1) / (√Var(𝑌_1) + √Var(𝑌_2)) and 𝑤*(2) = √Var(𝑌_2) / (√Var(𝑌_1) + √Var(𝑌_2)).

21. Upper Bound and Large Deviations
■ Next, we consider the upper bound.
■ We are interested in evaluating the tail probability:
ℙ(â_𝑇 ≠ 𝑎*) = Σ_{𝑎≠𝑎*} ℙ(μ̂_{𝑎,𝑇} ≥ μ̂_{𝑎*,𝑇}) = Σ_{𝑎≠𝑎*} ℙ(μ̂_{𝑎*,𝑇} − μ̂_{𝑎,𝑇} − Δ_𝑎 ≤ −Δ_𝑎).
→ Large deviation principle (LDP): evaluation of ℙ(μ̂_{𝑎*,𝑇} − μ̂_{𝑎,𝑇} − Δ_𝑎 ≤ 𝐶) (𝐶 is a constant).
⇔ Central limit theorem (CLT): evaluation of ℙ(√𝑇(μ̂_{𝑎*,𝑇} − μ̂_{𝑎,𝑇} − Δ_𝑎) ≤ 𝐶).
■ There are well-known existing results on LDPs, e.g., the Cramér theorem and the Gärtner–Ellis theorem.
• These cannot be applied to BAI owing to the non-stationarity of the stochastic process.

22. Large Deviation Principles for Martingales
■ To solve this problem, we derive a novel LDP for martingales by using the change-of-measure technique.
■ We transform an upper bound under another distribution into one under the distribution of interest.
• Let ℙ be the probability measure of interest and {ξ_𝑡} be a martingale difference sequence under ℙ.
• Consider obtaining a large deviation bound for the average (1/𝑇) Σ_{𝑡=1}^𝑇 ξ_𝑡.
• For the martingale difference sequence {ξ_𝑡} and a constant λ, define 𝑈_𝑇 = ∏_{𝑡=1}^𝑇 exp(λξ_𝑡) / 𝔼[exp(λξ_𝑡) | ℱ_{𝑡−1}].
• Define the conjugate probability measure ℙ^λ by dℙ^λ = 𝑈_𝑇 dℙ.
• Derive the large deviation bound under ℙ^λ, not under the ℙ of interest.
• Then, transform the large deviation bound under ℙ^λ into one under ℙ via the density ratio dℙ / dℙ^λ.

[Figure: change of measures between ℙ and ℙ^λ via dℙ^λ = 𝑈_𝑇 dℙ; an upper bound under ℙ^λ is transformed into an upper bound under ℙ.]
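To see why 𝑈_𝑇 defines a probability measure, here is a minimal numerical sketch (an illustration with i.i.d. Gaussian differences, not the general martingale case treated in the paper): since dℙ^λ = 𝑈_𝑇 dℙ, we must have 𝔼_ℙ[𝑈_𝑇] = 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# For i.i.d. N(0, sigma^2) martingale differences, the conditional MGF is
# E[exp(lam * xi_t) | F_{t-1}] = exp(lam^2 sigma^2 / 2), so
# U_T = prod_t exp(lam * xi_t) / E[exp(lam * xi_t) | F_{t-1}]
#     = exp(lam * sum_t xi_t - T * lam^2 sigma^2 / 2).
sigma, lam, T = 1.0, 0.2, 10
xi = rng.normal(0.0, sigma, size=(100_000, T))  # 100k sample paths under P
log_u = lam * xi.sum(axis=1) - T * (lam**2 * sigma**2 / 2.0)
u = np.exp(log_u)

# dP^lam = U_T dP is a probability measure, so E_P[U_T] should be about 1.
mean_u = float(u.mean())
```

Each factor of 𝑈_𝑇 is positive with conditional mean 1, so 𝑈_𝑇 is itself a mean-one martingale; that is what makes the change of measure valid.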

23. Upper Bound
■ Under an appropriately designed BAI algorithm, we show the following upper bound.

Upper bound for the two-armed bandits:
• If (1/𝑇) Σ_{𝑡=1}^𝑇 1[𝐴_𝑡 = 𝑎] → 𝑤*(𝑎) almost surely, then under some regularity conditions,
lim sup_{𝑇→∞} −(1/𝑇) log ℙ(â_𝑇 ≠ 𝑎*) ≥ Δ² / (2(√Var(𝑌_1) + √Var(𝑌_2))²) + 𝑜(Δ²).
• This result implies a Gaussian approximation of the large deviations as Δ → 0.
• This result is a generalization of the martingale central limit theorem.
■ This upper bound matches the lower bound under the small-gap setting.

24. Conclusion
➢ Main contribution:
• An optimal algorithm for BAI with a fixed budget (adaptive experiments for policy choice).
➢ Technical contributions:
• An evaluation framework with a small gap.
• A novel large deviation bound for martingales.

25. Reference
➢ Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), “Best Arm Identification with a Fixed Budget under a Small Gap.”
• Carpentier, A. and Locatelli, A. (2016), “Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem,” COLT.
• Glynn, P. and Juneja, S. (2004), “A Large Deviations Perspective on Ordinal Optimization,” Proceedings of the 2004 Winter Simulation Conference, IEEE.
• Kasy, M. and Sautmann, A. (2021), “Adaptive Treatment Assignment in Experiments for Policy Choice,” Econometrica.
• Kaufmann, E., Cappé, O., and Garivier, A. (2016), “On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models,” Journal of Machine Learning Research.
• Fan, X., Grama, I., and Liu, Q. (2013), “Cramér Large Deviation Expansions for Martingales under Bernstein’s Condition,” Stochastic Processes and their Applications.
• Lai, T. and Robbins, H. (1985), “Asymptotically Efficient Adaptive Allocation Rules,” Advances in Applied Mathematics.
• Hirano, K. and Porter, J. R. (2009), “Asymptotics for Statistical Treatment Rules,” Econometrica.
• van der Vaart, A. (1998), Asymptotic Statistics, Cambridge University Press.