
Optimal Best Arm Identification with a Fixed Budget under a Small Gap

MasaKat0
August 10, 2022


Presentation slides for AMES 2022.

Transcript

  1. Optimal Best Arm Identification
    with a Fixed Budget under a Small Gap
    Masahiro Kato (The University of Tokyo / CyberAgent, Inc.)
    Joint work with
    Masaaki Imaizumi (The University of Tokyo), Kaito Ariu (CyberAgent, Inc. / KTH)
    Masahiro Nomura (CyberAgent, Inc.), Chao Qin (Columbia University)
    2022 Asian Meeting of the Econometric Society
    Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), “Best Arm Identification with a Fixed Budget under a Small Gap” (https://arxiv.org/abs/2201.04469)


  2. Adaptive Experiments for Policy Choice
    ■ Consider a setting with multiple treatment arms, each with potential outcomes.
    • Treatment arm (also called treatment or arm): slot machines, medicines, online advertisements, etc.
    ➢ Motivation: conduct an experiment that efficiently chooses the best treatment arm.
    • Best treatment arm = the treatment arm with the highest expected outcome.
    [Figure: the decision maker chooses a treatment arm a from treatment arms 1, 2, …, K, observes the reward of the chosen treatment arm A_t, and the adaptive experiment ends by recommending the best treatment arm.]


  3. Adaptive Experiments for Policy Choice
    ■ How can we choose the best treatment arm more efficiently?
    → Optimize the sample allocation to each treatment arm during the experiment.
    ■ To optimize the sample allocation, we need to know some parameters, such as the variances.
    ■ In our experiment, we sequentially estimate these parameters and optimize the sample allocation.
    • This setting is called best arm identification (BAI) in machine learning.
      It is also called adaptive experiments for policy choice in economics (Kasy and Sautmann (2021)).
    • BAI is an instance of the multi-armed bandit (MAB) problem.
    • In the MAB problem, we regard the treatments as the arms of a slot machine.


  4. Problem Setting
    ■ There are K treatment arms [K] = {1, 2, …, K} and a fixed time horizon T.
    ■ When a treatment arm a ∈ [K] is chosen, it returns an outcome Y^a ∈ ℝ.
    ➢ Goal: through the following trial, find the best treatment arm a* = arg max_{a∈[K]} 𝔼[Y^a].
    ■ In each period t (assume that the distribution of Y^a is invariant across periods),
    1. Choose a treatment arm A_t ∈ [K].
    2. Observe the reward of the chosen arm, Y_t = Σ_{a∈[K]} 1[A_t = a] Y_t^a.
    • Stop the trial at round t = T.
    • Recommend an estimated best treatment arm â_T ∈ [K].
    [Figure: at each round t = 1, …, T, one of Arm 1, Arm 2, …, Arm K is chosen and the observed reward is Y_t = Σ_{a∈[K]} 1[A_t = a] Y_t^a.]
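    The protocol above is a short loop. Below is a minimal Python sketch of it (not from the paper): the two-armed Gaussian reward model, the particular means and variances, and the uniform-sampling baseline with a sample-mean recommendation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-armed Gaussian bandit (means and variances are assumptions).
means = np.array([0.05, 0.01])
stds = np.array([1.0, np.sqrt(0.2)])
K, T = len(means), 1000

rewards = [[] for _ in range(K)]
for t in range(T):
    a = rng.integers(K)                  # uniform sampling (RCT baseline)
    y = rng.normal(means[a], stds[a])    # only the chosen arm's reward is observed
    rewards[a].append(y)

# At t = T, recommend the arm with the highest sample mean.
a_hat = int(np.argmax([np.mean(r) if r else -np.inf for r in rewards]))
print("recommended arm:", a_hat + 1)
```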


  5. Performance Evaluation
    ■ Consider directly evaluating the performance of the decision making.
    ■ Our decision making: recommend an estimated best arm â_T as the best arm a*.
    ■ We evaluate the decision making by the probability of misidentification, ℙ(â_T ≠ a*).
    • When there is a unique best arm, ℙ(â_T ≠ a*) converges to zero at an exponential rate.
    • We evaluate the decision making through the convergence rate of ℙ(â_T ≠ a*).
    • A faster convergence rate of ℙ(â_T ≠ a*) implies better decision making.
    ■ We aim to recommend the best treatment arm with a smaller probability of misidentification.


  6. RS-AIPW Strategy
    ■ Algorithms for BAI are often referred to as BAI strategies.
    ■ Our BAI strategy consists of the following two components:
    1. Random sampling (RS):
       Following a sampling probability, we choose treatment arms in t = 1, 2, …, T.
    2. Recommendation based on the augmented inverse probability weighting (AIPW) estimator:
       At t = T, we recommend the best arm using the AIPW estimator.
    ➢ We call our strategy the RS-AIPW strategy.


  7. RS-AIPW Strategy
    ■ Our BAI strategy is designed so that the empirical sample allocation ratio
      (1/T) Σ_{t=1}^T 1[A_t = a] of each arm a ∈ [K] converges to the optimal allocation ratio.
    • The optimal allocation ratio is derived from the lower bound on the theoretical performance.
    ■ For example, when K = 2, the optimal allocation ratio of each arm a ∈ [K] is
      w*(1) = √Var(Y^1) / (√Var(Y^1) + √Var(Y^2)) and w*(2) = √Var(Y^2) / (√Var(Y^1) + √Var(Y^2)).
    • This implies that allocating samples in the ratio of the standard deviations is optimal.
    ■ When K ≥ 3, we need to solve a linear program to obtain the optimal allocation ratio.
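    This two-armed allocation rule can be computed directly from the variances. A minimal Python sketch (the function name and the example variance values are illustrative, not from the paper):

```python
import math

def optimal_allocation(var1: float, var2: float) -> tuple[float, float]:
    """Two-armed optimal allocation: sample in proportion to the standard deviations."""
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    w1 = s1 / (s1 + s2)
    return w1, 1.0 - w1

# Example with the variances used later in the simulation study.
print(optimal_allocation(1.0, 0.2))  # arm 1 gets the larger share because it is noisier
```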


  8. RS-AIPW Strategy
    ■ For simplicity, consider the case with two treatment arms, K = 2.
    ■ In each period t = 1, 2, …, T,
    1. Estimate the variance Var(Y^a) of each arm using the past observations up to period t − 1.
    2. Choose A_t = 1 with probability ŵ_t(1) = σ̂_{t,1} / (σ̂_{t,1} + σ̂_{t,2}), where σ̂_{t,a} is the estimated standard deviation of arm a from step 1, and choose A_t = 2 with probability ŵ_t(2) = 1 − ŵ_t(1).
    • After period t = T, we recommend the treatment arm â_T = arg max_{a∈{1,2}} μ̂_T^{AIPW,a}, where
      μ̂_T^{AIPW,a} = (1/T) Σ_{t=1}^T [ 1[A_t = a](Y_t^a − μ̂_t^a) / ŵ_t(a) + μ̂_t^a ],
      and μ̂_t^a is an estimator of 𝔼[Y^a] using the past observations up to period t − 1.
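    The loop above can be sketched in a few lines of Python. This is a simplified illustration of the RS-AIPW idea for K = 2, not the authors' implementation: the Gaussian environment, the default nuisance values used before any data are observed, and the plain sample-mean and sample-standard-deviation estimators are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-armed Gaussian environment (means and variances are assumptions).
means, stds = np.array([0.05, 0.01]), np.array([1.0, np.sqrt(0.2)])
T = 2000

obs = [[], []]                      # past observations per arm
aipw_sum = np.zeros(2)              # running sums of the AIPW scores
for t in range(T):
    # Nuisance estimates from observations up to period t-1 (defaults are assumptions).
    mu_hat = np.array([np.mean(o) if o else 0.0 for o in obs])
    sd_hat = np.array([np.std(o) if len(o) > 1 else 1.0 for o in obs])
    w = sd_hat / sd_hat.sum()       # target allocation: ratio of estimated std. deviations

    a = rng.choice(2, p=w)          # random sampling (RS) step
    y = rng.normal(means[a], stds[a])
    obs[a].append(y)

    # AIPW score for every arm; only the chosen arm gets the IPW correction term.
    for k in range(2):
        aipw_sum[k] += ((a == k) * (y - mu_hat[k]) / w[k]) + mu_hat[k]

mu_aipw = aipw_sum / T              # AIPW estimates of E[Y^1], E[Y^2]
print("recommended arm:", int(np.argmax(mu_aipw)) + 1)
```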


  9. RS-AIPW Strategy
    ■ μ̂_T^{AIPW,a} is the AIPW estimator of 𝔼[Y^a], a.k.a. the doubly robust estimator.
    • With this estimator, we can use the properties of martingales.
    • This property is helpful because the observations in adaptive experiments are dependent.
    • To obtain the martingale property, we construct the nuisance estimators only from past observations.
    • This is similar to the cross-fitting of double machine learning in Chernozhukov et al. (2016).
    • If we instead use IPW-type estimators, the variance becomes larger.
    ■ Under the RS-AIPW strategy, ℙ(â_T ≠ a*) = ℙ(arg max_{a∈{1,2}} μ̂_T^{AIPW,a} ≠ a*).
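    Why the martingale machinery applies: assuming ŵ_t(a) and μ̂_t^a are built only from observations up to period t − 1 (so they are ℱ_{t−1}-measurable) and A_t is drawn with ℙ(A_t = a | ℱ_{t−1}) = ŵ_t(a), independently of Y_t^a, each AIPW score is conditionally unbiased:

```latex
\[
\mathbb{E}\!\left[\frac{\mathbf{1}[A_t = a]\,(Y_t^a - \hat{\mu}_t^a)}{\hat{w}_t(a)} + \hat{\mu}_t^a \,\middle|\, \mathcal{F}_{t-1}\right]
= \frac{\hat{w}_t(a)\left(\mathbb{E}[Y^a] - \hat{\mu}_t^a\right)}{\hat{w}_t(a)} + \hat{\mu}_t^a
= \mathbb{E}[Y^a].
\]
```

    Hence the centered scores form a martingale difference sequence with respect to {ℱ_{t−1}}, which is what the large deviation analysis later exploits.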


  10. Asymptotic Optimality
    ■ Assumption: the best arm is unique and a* = 1; that is, 𝔼[Y^1] > 𝔼[Y^2].
    ■ Consistent strategy: when there exists a unique best arm, a consistent strategy returns the
      best arm with probability one in large samples: ℙ(â_T = a*) → 1 as T → ∞.
    ■ Small-gap setting: consider a small gap Δ = 𝔼[Y^1] − 𝔼[Y^2] → 0.

    Main result (not mathematically rigorous)
    • [Upper bound] If the empirical allocation ratio (1/T) Σ_{t=1}^T 1[A_t = a] converges in probability to w*(a), then under some regularity conditions, for large samples,
      ℙ(â_T ≠ a*) = ℙ(μ̂_T^{AIPW,1} < μ̂_T^{AIPW,2}) ≤ exp( −T Δ² / ( 2 (√Var(Y^1) + √Var(Y^2))² ) ) + o(Δ²).
    • [Lower bound] No consistent strategy can exceed this convergence rate.
    ■ This result implies that our proposed RS-AIPW strategy is asymptotically optimal.
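    A small numeric illustration of the leading term of this bound (ignoring the o(Δ²) remainder); the parameter values are the ones used in the simulation study on the next slide, and the helper function below is only for illustration:

```python
import math

def misid_upper_bound(T: int, mu1: float, mu2: float, var1: float, var2: float) -> float:
    """Leading-order bound exp(-T * Delta^2 / (2 (sd1 + sd2)^2)), dropping the o(Delta^2) term."""
    delta = mu1 - mu2
    sd_sum = math.sqrt(var1) + math.sqrt(var2)
    return math.exp(-T * delta**2 / (2 * sd_sum**2))

# With a gap of 0.04 and variances 1 and 0.2, the bound only becomes small for large T.
for T in (1000, 10000, 100000):
    print(T, round(misid_upper_bound(T, 0.05, 0.01, 1.0, 0.2), 4))
```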


  11. Simulation Studies
    ■ BAI strategies compared:
    • Alpha: strategy using the true variances.
    • Uniform: uniform sampling (RCT).
    • RS-AIPW: the proposed strategy.
    ■ y-axis: ℙ(â_T = a*); x-axis: T.
    • 𝔼[Y^1] = 0.05, 𝔼[Y^2] = 0.01.
    • Top figure: Var(Y^1) = 1, Var(Y^2) = 0.2.
    • Bottom figure: Var(Y^1) = 1, Var(Y^2) = 0.1.
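    A minimal Monte Carlo sketch of this kind of comparison for the non-adaptive baselines (Uniform and an Alpha-style fixed allocation computed from the true standard deviations). It exploits the Gaussian reward assumption so that the arm sample means can be drawn directly; the reward model, the recommendation by sample means, and all numerical choices are assumptions, not the paper's simulation code, and RS-AIPW itself is omitted because it is adaptive.

```python
import numpy as np

def prob_correct(w1: float, T: int, means, stds, reps: int = 100_000, seed: int = 0) -> float:
    """P(recommended arm = arm 1) when arm 1 is drawn with fixed probability w1 each round
    and the recommendation is the arm with the larger sample mean."""
    rng = np.random.default_rng(seed)
    n1 = rng.binomial(T, w1, size=reps).clip(1, T - 1)   # number of draws of arm 1
    n2 = T - n1
    m1 = rng.normal(means[0], stds[0] / np.sqrt(n1))     # sample mean of arm 1
    m2 = rng.normal(means[1], stds[1] / np.sqrt(n2))     # sample mean of arm 2
    return float(np.mean(m1 > m2))

means, stds = (0.05, 0.01), (1.0, np.sqrt(0.2))          # top-figure setting
w_alpha = stds[0] / (stds[0] + stds[1])                  # allocation from the true std. deviations
for T in (1000, 5000, 10000):
    print(T, "Uniform:", round(prob_correct(0.5, T, means, stds), 3),
          "Alpha:", round(prob_correct(w_alpha, T, means, stds), 3))
```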


  12. Extensions


  13. Related work and Details of the Asymptotic Optimality


  14. Best Arm Identification
    ■ Best arm identification is an instance of the MAB problem.
    ➢ Goal: identify (= choose) the best treatment arm via adaptive experiments.
    ■ Two problem settings:
    • Fixed-confidence setting: a sequential-testing-flavored formulation; the sample size is not fixed.
      The adaptive experiment continues until a predefined criterion is satisfied.
    • Fixed-budget setting: the sample size is fixed, and the decision is made at the last period.
    • In the final period T, we return an estimate of the best treatment arm, â_T.
    • Evaluation: minimize the probability of misidentification ℙ(â_T ≠ a*).


  15. Best Arm Identification with a Fixed Budget
    ■ How do we evaluate ℙ(â_T ≠ a*), the performance of a BAI strategy?
    ■ When the best arm is unique, ℙ(â_T ≠ a*) converges to 0 at an exponential speed; that is,
      ℙ(â_T ≠ a*) = exp(−T(⋆)),
      where (⋆) is a constant term.
    ↔ Local asymptotics (van der Vaart (1998), Hirano and Porter (2009)): ℙ(â_T ≠ a*) remains constant.
    ■ Consider evaluating the term (⋆) by lim sup_{T→∞} −(1/T) log ℙ(â_T ≠ a*).
    ■ A performance lower (upper) bound on ℙ(â_T ≠ a*) is
      an upper (lower) bound on lim sup_{T→∞} −(1/T) log ℙ(â_T ≠ a*).
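    To make the last bullet explicit, the correspondence between bounds on the probability and bounds on the rate is a direct rearrangement:

```latex
\[
\mathbb{P}(\hat{a}_T \neq a^*) \le \exp(-T c)
\;\Longleftrightarrow\;
-\frac{1}{T}\log \mathbb{P}(\hat{a}_T \neq a^*) \ge c,
\]
```

    so an upper bound on the probability is a lower bound on the rate, and vice versa; a larger exponent c means better decision making.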


  16. Distribution-dependent Lower Bound
    ■ Distribution-dependent lower bound, a.k.a. information-theoretic lower bound.
    • A lower bound based on information about the distribution.
    • Such lower bounds are often based on quantities such as the Fisher information and the KL divergence.
    ■ We employ a derivation technique called change-of-measure.
    • This technique has been used for the asymptotic optimality of hypothesis testing (see van der Vaart
      (1998)) and for lower bounds in the MAB problem (Lai and Robbins (1985)).
    • In BAI, Kaufmann et al. (2016) conjectured a distribution-dependent lower bound.


  17. Lower Bound for Two-armed Bandits
    ■ Introduce an appropriate alternative hypothesis and restrict the strategy class.
    ■ Denote the true distribution (bandit model) by P.
    ■ Denote the set of alternative hypotheses by Alt(P).
    ■ Consistent strategy: returns the true best arm with probability 1 as T → ∞.

    Transportation Lemma (Lemma 1 of Kaufmann et al. (2016))
    ■ For any Q ∈ Alt(P), a consistent strategy satisfies
      lim sup_{T→∞} −(1/T) log ℙ_P(â_T ≠ a*) ≤ lim sup_{T→∞} (1/T) 𝔼_Q[ Σ_{a∈[K]} Σ_{t=1}^T 1[A_t = a] log( f_a^Q(Y_t^a) / f_a(Y_t^a) ) ],
      where the inner term is the log-likelihood ratio, f_a and f_a^Q are the pdfs of arm a's reward under P and Q, and ℙ_P and 𝔼_Q denote the probability law under the bandit model P and the expectation under Q.
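    Taking the expectation under Q term by term (assuming arm a's rewards have densities f_a under P and f_a^Q under Q), the right-hand side reduces to the familiar weighted sum of KL divergences, with N_a(T) = Σ_{t=1}^T 1[A_t = a] the number of draws of arm a:

```latex
\[
\mathbb{E}_Q\!\left[\sum_{a\in[K]}\sum_{t=1}^{T} \mathbf{1}[A_t = a]\,\log\frac{f_a^Q(Y_t^a)}{f_a(Y_t^a)}\right]
\;=\; \sum_{a\in[K]} \mathbb{E}_Q\!\left[N_a(T)\right]\,\mathrm{KL}\!\left(f_a^Q, f_a\right),
\]
```

    so the bound says that a consistent strategy must, in expectation under every alternative Q, accumulate enough information (arm draws weighted by KL divergence) to separate P from Q.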


  18. Open Question 1: Upper Bound
    ■ There is no known strategy whose upper bound achieves the lower bound.
    ■ The performance of any BAI strategy cannot exceed the lower bound.
    ■ If the performance guarantee of a BAI strategy matches the lower bound, the strategy is optimal.
    ■ In BAI with a fixed budget, there is no strategy whose upper bound achieves the
      distribution-dependent lower bound conjectured by Kaufmann et al. (2016).
    • Optimal strategies exist in other settings, e.g., BAI with fixed confidence.
    ■ One of the main obstacles is the estimation error of the optimal allocation ratio.
    • Glynn and Juneja (2004): if the optimal allocation ratio is known, the lower bound is achievable.


  19. Open Question 2: Lower Bound
    ■ Since there is no corresponding upper bound, the lower bound conjectured by Kaufmann et al.
      (2016) may not be correct.
    ■ Carpentier and Locatelli (2016) suggest that under a large-gap setting, the lower bound of
      Kaufmann et al. (2016) is not achievable.
    ■ Large-gap setting: for the best arm a* ∈ [K], Σ_{a ≠ a*} 1 / (𝔼[Y^{a*}] − 𝔼[Y^a])² is bounded by a constant.
    • This implies that 𝔼[Y^{a*}] − 𝔼[Y^a] is not close to 0.
    • Therefore, we also reconsider the lower bound of Kaufmann et al. (2016).


  20. Lower Bound for Two-armed Bandits
    ■ For the case K = 2, we consider an optimal algorithm under a small-gap setting.
    ■ Consider a small-gap situation: Δ = 𝔼[Y^1] − 𝔼[Y^2] → 0.

    Lower bound for the two-armed bandits
    • Under appropriate conditions,
      lim sup_{T→∞} −(1/T) log ℙ(â_T ≠ a*) ≤ Δ² / ( 2 (√Var(Y^1) + √Var(Y^2))² ) + o(Δ²).
    ■ This lower bound suggests the optimal sample allocation ratio:
      w*(1) = √Var(Y^1) / (√Var(Y^1) + √Var(Y^2)) and w*(2) = √Var(Y^2) / (√Var(Y^1) + √Var(Y^2)).
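    A one-line Gaussian heuristic for where this exponent and allocation come from (a sketch, not the paper's proof): if a fraction w(a) of the budget goes to arm a, the difference of the two mean estimates has variance roughly (Var(Y^1)/w(1) + Var(Y^2)/w(2))/T, and a Gaussian tail bound then gives the exponent Δ² / (2(Var(Y^1)/w(1) + Var(Y^2)/w(2))). Maximizing this exponent over the allocation yields

```latex
\[
\min_{w(1)+w(2)=1}\left(\frac{\mathrm{Var}(Y^1)}{w(1)} + \frac{\mathrm{Var}(Y^2)}{w(2)}\right)
= \left(\sqrt{\mathrm{Var}(Y^1)} + \sqrt{\mathrm{Var}(Y^2)}\right)^2
\quad\text{at}\quad
w^*(a) = \frac{\sqrt{\mathrm{Var}(Y^a)}}{\sqrt{\mathrm{Var}(Y^1)} + \sqrt{\mathrm{Var}(Y^2)}},
\]
```

    which recovers both the exponent Δ² / (2(√Var(Y^1) + √Var(Y^2))²) and the optimal allocation above.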


  21. Upper Bound and Large Deviations
    ■ Next, we consider the upper bound.
    ■ We are interested in evaluating the tail probability:
      ℙ(â_T ≠ a*) = Σ_{a ≠ a*} ℙ(μ̂_{a,T} ≥ μ̂_{a*,T}) = Σ_{a ≠ a*} ℙ(μ̂_{a*,T} − μ̂_{a,T} − Δ_a ≤ −Δ_a).
    → Large deviation principle (LDP): evaluation of ℙ(μ̂_{a*,T} − μ̂_{a,T} − Δ_a ≤ C) (C is a constant).
    ⇔ Central limit theorem (CLT): evaluation of ℙ(√T (μ̂_{a*,T} − μ̂_{a,T} − Δ_a) ≤ C).
    ■ There are well-known results on LDPs, e.g., the Cramér theorem and the Gärtner-Ellis theorem.
    • These cannot be applied to BAI owing to the non-stationarity of the stochastic process.
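    Schematically, for a centered average Z̄_T of (martingale difference) terms with variance parameter σ², the CLT and an LDP describe deviations at different scales; this informal contrast and the generic rate function I(·) below are illustrative, not notation from the paper.

```latex
\[
\text{CLT:}\;\; \mathbb{P}\!\left(\sqrt{T}\,\bar{Z}_T \le C\right) \longrightarrow \Phi(C/\sigma),
\qquad
\text{LDP:}\;\; -\frac{1}{T}\log \mathbb{P}\!\left(\bar{Z}_T \le -\Delta\right) \longrightarrow I(\Delta), \quad \Delta > 0.
\]
```

    The misidentification event is a deviation of constant order Δ_a rather than order 1/√T, so a large deviation analysis is needed; the small-gap regime Δ → 0 is what lets a Gaussian-type approximation reappear inside the large deviation bound, which is the sense in which the upper bound later generalizes the martingale CLT.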


  22. Large Deviation Principles for Martingales
    ■ To solve this problem, we derive a novel LDP for martingales by using a change of measure.
    ■ Idea: transfer an upper bound under another distribution to the distribution of interest.
    • Let ℙ be the probability measure of interest and {ξ_t} a martingale difference sequence under ℙ.
    • Consider obtaining a large deviation bound for the average (1/T) Σ_{t=1}^T ξ_t.
    • For the martingale difference sequence {ξ_t} and a constant λ, define U_T = Π_{t=1}^T exp(λξ_t) / 𝔼[exp(λξ_t) | ℱ_{t−1}].
    • Define the conjugate probability measure ℙ_λ by dℙ_λ = U_T dℙ.
    • Derive the large deviation bound under ℙ_λ, not under the ℙ of interest.
    • Then, transfer the large deviation bound under ℙ_λ to a large deviation bound under ℙ via the density ratio dℙ/dℙ_λ.
    [Figure: the upper bound derived under ℙ_λ is transferred to an upper bound under ℙ by changing measures via the density ratio U_T = dℙ_λ/dℙ.]
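    Concretely, since U_T > 0 and dℙ_λ = U_T dℙ, for any event B measurable with respect to the observations up to T (a sketch of the change-of-measure step):

```latex
\[
\mathbb{P}(B) \;=\; \int_B \mathrm{d}\mathbb{P}
\;=\; \int_B U_T^{-1}\,\mathrm{d}\mathbb{P}_\lambda
\;=\; \mathbb{E}_{\mathbb{P}_\lambda}\!\left[U_T^{-1}\,\mathbf{1}_B\right],
\]
```

    so a bound on the ℙ_λ-probability of B, combined with control of the density ratio U_T^{-1} on B, gives a bound on ℙ(B); optimizing over λ yields the Chernoff-type exponent.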


  23. Upper Bound
    ■ Under an appropriately designed BAI algorithm, we show the following upper bound.

    Upper bound for the two-armed bandits
    • If the empirical allocation ratio (1/T) Σ_{t=1}^T 1[A_t = a] converges in probability to w*(a), then under some regularity conditions,
      lim sup_{T→∞} −(1/T) log ℙ(â_T ≠ a*) ≥ Δ² / ( 2 (√Var(Y^1) + √Var(Y^2))² ) + o(Δ²).
    • This result implies a Gaussian approximation in the large deviation regime as Δ → 0.
    • This result is a generalization of the martingale central limit theorem.
    ■ This upper bound matches the lower bound under the small-gap setting.


  24. Conclusion
    ➢ Main contribution:
    • An optimal algorithm for BAI with a fixed budget (adaptive experiments for policy choice).
    ➢ Technical contributions:
    • An evaluation framework based on a small gap.
    • A novel large deviation bound for martingales.


  25. Reference
    ➢ Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), “Best Arm Identification with a Fixed Budget under a Small Gap.”
    • Carpentier, A. and Locatelli, A. (2016), “Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem,” in COLT.
    • Glynn, P. and Juneja, S. (2004), “A Large Deviations Perspective on Ordinal Optimization,” in Proceedings of the 2004 Winter Simulation Conference, volume 1, IEEE.
    • Kasy, M. and Sautmann, A. (2021), “Adaptive Treatment Assignment in Experiments for Policy Choice,” Econometrica.
    • Kaufmann, E., Cappé, O., and Garivier, A. (2016), “On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models,” Journal of Machine Learning Research.
    • Fan, X., Grama, I., and Liu, Q. (2013), “Cramér Large Deviation Expansions for Martingales under Bernstein’s Condition,” Stochastic Processes and their Applications.
    • Lai, T. and Robbins, H. (1985), “Asymptotically Efficient Adaptive Allocation Rules,” Advances in Applied Mathematics.
    • Hirano, K. and Porter, J. R. (2009), “Asymptotics for Statistical Treatment Rules,” Econometrica.
    • van der Vaart, A. (1998), Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
