Slide 1

Slide 1 text

Best Arm Identification with a Fixed Budget under a Small Gap
Masahiro Kato (CyberAgent, Inc. / University of Tokyo)
Session: Adaptive Experimental Design for Policy Choice and Policy Learning
Kato, Ariu, Imaizumi, Nomura, and Qin (2022), https://arxiv.org/abs/2201.04469.

Slide 2

Slide 2 text

Experimental Design for Better Decision-Making
▶ Keywords: Causal inference, decision-making, and experimental design.
■ Treatment arm (arm / treatment / policy). ex. drugs, advertisements, and economic policies.
  Each treatment arm has a potential outcome. By drawing an arm, we can observe the outcome.
▶ Economics: From treatment effect estimation to treatment choice (decision-making).
• Treatment (policy) choice: Choose the best treatment arm (policy) using observations.
  cf. Manski (2000, 2002, 2004), Stoye (2009, 2012), Manski and Tetenov (2016).
▶ Multi-armed bandit problem: Optimize decision-making via adaptive experiments.
• Regret minimization: Choose the treatment arms to maximize the cumulative reward during the experiment.
  cf. Gittins (1979), and Lai and Robbins (1985).
• Best arm identification (BAI): Choose the best treatment arm after the experiment.
  cf. Bubeck et al. (2011), Kaufmann et al. (2016), and Kasy and Sautmann (2021).
cf. Average treatment effect estimation via adaptive experimental design: van der Laan (2008), Hahn, Hirano, and Karlan (2011).

Slide 3

Slide 3 text

BAI with a Fixed Budget
▶ Consider an adaptive experiment where we can draw a treatment arm in each round.
  Drawing a treatment arm = allocating a treatment arm to an experimental unit and observing the realized outcome.
▶ BAI with a fixed budget.
• The number of rounds of the adaptive experiment (budget / sample size) is predetermined.
• Recommend the best arm from multiple candidates after the experiment.
↔ BAI with fixed confidence: continue the adaptive experiment until a certain criterion is satisfied. cf. sequential experiments.
▶ Evaluation metrics:
• Probability of misidentifying the best treatment arm.
• Expected simple regret (difference between the expected outcomes of the best and suboptimal arms).
■ Goal: recommend the best arm with a smaller probability of misidentification or expected simple regret.

Slide 4

Slide 4 text

Contents
■ Discuss asymptotically optimal algorithms in BAI with a fixed budget.
• Consider a case with two treatment arms. ex. treatment and control groups.
• Potential outcomes follow Gaussian distributions.
• Minimization of the probability of misidentification.
■ Neyman allocation rule:
• Draw the treatment arms in the ratio of the standard deviations of the potential outcomes.
• When the standard deviations are known, the Neyman allocation is known to be optimal.
  cf. Chen et al. (2000), Glynn and Juneja (2004), and Kaufmann et al. (2016).

Slide 5

Slide 5 text

Contents
▶ Today, I introduce our paper: Kato, Ariu, Imaizumi, Nomura, and Qin (2022), “Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap.”*
• The standard deviations are unknown and estimated in an adaptive experiment.
• The worst-case asymptotic optimality of the Neyman allocation rule.
■ In addition to the above paper, I introduce several new findings in our project.
• Beyond the Neyman allocation rule.
• Minimization of the expected simple regret.
* https://arxiv.org/abs/2201.04469.

Slide 6

Slide 6 text

Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap
Kato, Ariu, Imaizumi, Nomura, and Qin (2022)

Slide 7

Slide 7 text

Problem Setting
■ Adaptive experiment with T rounds: {1, 2, …, T}.
■ Binary treatment arms: {1, 0}.
• Each treatment arm a ∈ {1, 0} has a potential outcome Y_a ∈ ℝ.
  The distributions of (Y_1, Y_0) do not change across rounds, and Y_1 and Y_0 are independent.
• At round t, by drawing a treatment arm a ∈ {1, 0}, we observe Y_{a,t}, which is an i.i.d. copy of Y_a.
▶ Definition: Two-armed Gaussian bandit models.
• A class ℳ of joint distributions ν (bandit models) of (Y_1, Y_0).
• Under ν ∈ ℳ, Y_1 and Y_0 follow Gaussian distributions N(μ_1, σ_1²) and N(μ_0, σ_0²).
• σ_a² is the variance of the potential outcome Y_a, which is fixed across bandit models ν ∈ ℳ.

Slide 8

Slide 8 text

Problem Setting
[Diagram: in each round t = 1, …, T, the experimenter draws an arm A_t and observes the outcome Y_{A_t,t}.]
■ Best treatment arm: the arm with the highest expected outcome, a* = arg max_{a∈{1,0}} μ_a.
  For simplicity, we assume that the best arm is unique.
■ Bandit process: In each round t ∈ {1, 2, …, T}, under a bandit model ν ∈ ℳ,
• Draw a treatment arm A_t ∈ {1, 0}.
• Observe an outcome Y_{A_t,t} of the chosen arm A_t.
• Stop the trial at round t = T.
■ After the final round T, an algorithm recommends an estimated best treatment arm â_T ∈ {1, 0}. (A simulation sketch of this protocol follows below.)
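As a concrete illustration of this fixed-budget protocol, here is a minimal Python simulation sketch; the class and function names (TwoArmedGaussianBandit, run_experiment), the parameter values, and the uniform sampling baseline are my own illustrative assumptions, not details from the paper.

```python
import numpy as np

class TwoArmedGaussianBandit:
    """Two-armed Gaussian bandit model: arm a in {1, 0} yields Y_a ~ N(mu_a, sigma_a^2)."""
    def __init__(self, mu, sigma, rng):
        self.mu, self.sigma, self.rng = mu, sigma, rng  # dicts keyed by arm label 1 / 0

    def pull(self, a):
        # Drawing arm a reveals an i.i.d. copy of the potential outcome Y_a.
        return self.rng.normal(self.mu[a], self.sigma[a])

def run_experiment(bandit, T, sampling_rule, recommendation_rule):
    """Generic fixed-budget BAI protocol: T adaptive draws, then recommend one arm."""
    history = []                          # list of (A_t, Y_{A_t,t}) pairs
    for _ in range(T):
        a_t = sampling_rule(history)      # choose an arm using past observations
        y_t = bandit.pull(a_t)            # observe the realized outcome
        history.append((a_t, y_t))
    return recommendation_rule(history)   # estimated best arm \hat{a}_T

# Example: uniform sampling + sample-mean recommendation (an RCT-style baseline).
rng = np.random.default_rng(0)
bandit = TwoArmedGaussianBandit(mu={1: 0.05, 0: 0.01}, sigma={1: 1.0, 0: 0.45}, rng=rng)
uniform = lambda history: int(rng.integers(2))
def best_sample_mean(history):
    means = {a: np.mean([y for (b, y) in history if b == a]) for a in (1, 0)}
    return max(means, key=means.get)
print(run_experiment(bandit, 1000, uniform, best_sample_mean))
```

The sampling rule here ignores the history; the adaptive strategies introduced later replace it with a rule that estimates the standard deviations from the history.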

Slide 9

Slide 9 text

Best Arm Identification (BAI) Strategy
■ Goal: Find the best treatment arm a* efficiently with respect to some performance metric.
■ Our actions: Using past observations, we can optimize A_t during the bandit process.
■ These actions are the components of an algorithm for BAI, called a BAI strategy.
• Sampling rule (A_1, A_2, …): how we draw a treatment arm in each round t.
• Recommendation rule â_T ∈ {1, 0}: which treatment arm we recommend as the best arm.
In other settings, the stopping time (budget) of an adaptive experiment can also be a random variable, as in sequential testing.

Slide 10

Slide 10 text

Contributions
■ Let ℙ_ν and 𝔼_ν be the probability law and expectation under a bandit model ν ∈ ℳ.
■ Probability of misidentification ℙ_ν(â_T ≠ a*):
  the probability of the event that we recommend a suboptimal arm instead of the best arm.
▶ Main result of Kato, Ariu, Imaizumi, Nomura, and Qin (2022).
■ An optimal strategy for minimization of the probability of misidentification under a small gap.
• Propose a strategy using the Neyman allocation rule and the AIPW estimator.
• Consider a lower bound on ℙ_ν(â_T ≠ a*) that no strategy can exceed.
• The probability of misidentification matches the lower bound when μ_1 − μ_0 → 0.

Slide 11

Slide 11 text

Probability of Misidentification
■ Assume that the best arm a* is unique.
• ℙ_ν(â_T ≠ a*) converges to 0 at an exponential speed: ℙ_ν(â_T ≠ a*) = exp(−T(⋆)) for a constant (⋆).
▶ Consider evaluating the term (⋆) by lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*).
• A performance lower (upper) bound on ℙ_ν(â_T ≠ a*) is an upper (lower) bound on lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*).
  cf. Kaufmann et al. (2016).
■ Large deviation analysis: a tight evaluation of ℙ_ν(â_T ≠ a*).

Slide 12

Slide 12 text

Lower Bound
■ Kaufmann et al. (2016) gives a lower bound for two-armed Gaussian bandit models.
• To derive a lower bound, we restrict the class of strategies.
▶ Definition: consistent strategy.
• A strategy is called consistent for a class ℳ if for each ν ∈ ℳ, ℙ_ν(â_T = a*) → 1.
Lower bound (Theorem 12 in Kaufmann et al., 2016):
• For any bandit model ν ∈ ℳ, any consistent strategy satisfies
  lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*) ≤ Δ² / (2(σ_1 + σ_0)²), where Δ = μ_1 − μ_0.
■ No strategy can exceed this convergence rate of the probability of misidentification.
  A lower bound on the probability of misidentification ℙ_ν(â_T ≠ a*) is an upper bound on −(1/T) log ℙ_ν(â_T ≠ a*).
▶ Optimal strategy: a strategy under which ℙ_ν(â_T ≠ a*) matches the lower bound.
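For a quick sense of scale (illustrative numbers, not from the paper): with Δ = 0.1, σ_1 = 1, and σ_0 = 0.5, the bound gives Δ² / (2(σ_1 + σ_0)²) = 0.01 / 4.5 ≈ 0.0022, so even the best consistent strategy cannot drive ℙ_ν(â_T ≠ a*) below roughly exp(−0.0022·T); at T = 1,000 this is about exp(−2.2) ≈ 0.11.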

Slide 13

Slide 13 text

Neyman Allocation Rule
■ Target allocation ratios.
• The ratio of the expected number of arm draws (1/T) 𝔼_ν[∑_{t=1}^T 1[A_t = a]] under a sampling rule:
  (1/T) 𝔼_ν[∑_{t=1}^T 1[A_t = a]] / ∑_{b∈{1,0}} (1/T) 𝔼_ν[∑_{t=1}^T 1[A_t = b]].
▶ Neyman allocation rule.
• The target allocation ratio is the ratio of the standard deviations.
  = Draw the treatment arms so that (1/T) 𝔼_ν[∑_{t=1}^T 1[A_t = 1]] : (1/T) 𝔼_ν[∑_{t=1}^T 1[A_t = 0]] = σ_1 : σ_0.
■ When the standard deviations σ_1 and σ_0 are known, the Neyman allocation is optimal.
  cf. Glynn and Juneja (2004), and Kaufmann et al. (2016).
▶ An optimal strategy is unknown when the standard deviations are unknown.
■ In our strategy, we estimate (σ_1, σ_0) and draw an arm a with probability σ̂_a / (σ̂_1 + σ̂_0).
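As an illustration with hypothetical numbers: if σ_1 = 2 and σ_0 = 1, the Neyman allocation draws arm 1 with probability σ_1 / (σ_1 + σ_0) = 2/3 and arm 0 with probability 1/3, so out of T = 900 rounds roughly 600 draws go to the noisier arm 1 and 300 to arm 0.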

Slide 14

Slide 14 text

NA-AIPW Strategy
■ Proposed strategy: NA-AIPW strategy.
• NA: sampling rule following the Neyman Allocation rule.
• AIPW: recommendation rule using an Augmented Inverse Probability Weighting (AIPW) estimator.
▶ Procedure of the NA-AIPW strategy (a code sketch follows below):
1. In each round t ∈ [T], estimate σ_a² using the observations obtained until round t.
2. Draw a treatment arm a ∈ {1, 0} with probability ŵ_t(a) = σ̂_{a,t} / (σ̂_{1,t} + σ̂_{0,t}) (Neyman allocation rule).
3. In round T, estimate μ_a using the AIPW estimator:
   μ̂_{a,T}^{AIPW} = (1/T) ∑_{t=1}^T [ 1[A_t = a] (Y_{a,t} − μ̂_{a,t}) / ŵ_t(a) + μ̂_{a,t} ],
   where μ̂_{a,t} = (∑_{s=1}^t 1[A_s = a])^{−1} ∑_{s=1}^t 1[A_s = a] Y_{a,s} is an estimator of μ_a using the observations until round t.
4. Recommend â_T^{AIPW} = arg max_{a∈{1,0}} μ̂_{a,T}^{AIPW} as the estimated best treatment arm.
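A minimal Python sketch of this procedure follows. The clipping of the estimated allocation probability (w_min), the fallback value for the early variance estimate, and the parameter values are my own practical assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def na_aipw(T, mu, sigma, seed=0, w_min=0.05):
    """Sketch of the NA-AIPW strategy for a two-armed Gaussian bandit.

    mu, sigma: dicts {1: ..., 0: ...} of true means / std devs (used only to simulate outcomes).
    Returns the recommended arm \\hat{a}_T^{AIPW}.
    """
    rng = np.random.default_rng(seed)
    obs = {1: [], 0: []}                 # outcomes observed so far for each arm
    aipw_sum = {1: 0.0, 0: 0.0}          # running sums of the AIPW score

    for t in range(T):
        # Step 1: estimate sigma_a from past observations (fall back to 1.0 early on).
        sig_hat = {a: (np.std(obs[a]) if len(obs[a]) >= 2 else 1.0) for a in (1, 0)}
        # Step 2: Neyman allocation probability, clipped away from 0 and 1.
        w1 = np.clip(sig_hat[1] / (sig_hat[1] + sig_hat[0]), w_min, 1 - w_min)
        w = {1: w1, 0: 1 - w1}
        # Plug-in mean estimates mu_hat_{a,t} from past observations.
        mu_hat = {a: (np.mean(obs[a]) if obs[a] else 0.0) for a in (1, 0)}

        # Draw an arm and observe the outcome.
        a_t = 1 if rng.random() < w[1] else 0
        y_t = rng.normal(mu[a_t], sigma[a_t])

        # Step 3 (accumulated online): AIPW score for each arm.
        for a in (1, 0):
            aipw_sum[a] += (a_t == a) * (y_t - mu_hat[a]) / w[a] + mu_hat[a]
        obs[a_t].append(y_t)

    mu_aipw = {a: aipw_sum[a] / T for a in (1, 0)}
    # Step 4: recommend the arm with the largest AIPW estimate.
    return max(mu_aipw, key=mu_aipw.get)

# Illustrative run with Setting-1-like parameters.
print(na_aipw(T=2000, mu={1: 0.05, 0: 0.01}, sigma={1: 1.0, 0: np.sqrt(0.2)}))
```

In the sketch, the plug-in estimates used in the AIPW score are built only from observations before round t, which is what makes the score a sum of martingale differences; the paper's analysis imposes its own regularity conditions on ŵ_t.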

Slide 15

Slide 15 text

Upper Bound and Asymptotic Optimality
Theorem (Upper bound)
• Assume some regularity conditions.
• Suppose that the estimator ŵ_t converges to w* almost surely (with a certain rate).
• Then, for any ν ∈ ℳ such that 0 < μ_1 − μ_0 ≤ C for some constant C > 0, the upper bound is
  lim sup_{T→∞} −(1/T) log ℙ_ν(â_T^{AIPW} ≠ a*) ≥ Δ² / (2(σ_1 + σ_0)²) − C̄(Δ³ + Δ⁴),
  where C̄ is some constant.
• This result implies that lim_{Δ→0} lim sup_{T→∞} −(1/(Δ²T)) log ℙ_ν(â_T^{AIPW} ≠ a*) ≥ 1 / (2(σ_1 + σ_0)²) − o(1).
• Under a small-gap regime (Δ = μ_1 − μ_0 → 0), the upper and lower bounds match
  = the NA-AIPW strategy is asymptotically optimal under the small gap.
Note: when potential outcomes follow Bernoulli distributions, an RCT (drawing each arm with probability 1/2) is approximately optimal (Kaufmann et al., 2016).

Slide 16

Slide 16 text

On the Optimality under the Small Gap
▶ Asymptotically optimal strategy under a small gap.
• This result implies the worst-case optimality of the proposed algorithm.
■ A technical reason for the small gap.
• There is no optimal strategy when the gap is fixed and the standard deviations are unknown.
↔ When the standard deviations are known, the Neyman allocation is known to be optimal.
  cf. Chen et al. (2000), Glynn and Juneja (2004), Kaufmann et al. (2016), and Carpentier and Locatelli (2016).
■ Conjecture.
• When the gap is small, we can ignore the estimation error of the standard deviations.
↑ The estimation error is negligible compared with the difficulty of identifying the best arm under the small gap.
✓ Optimality under a large gap (constant μ_1 − μ_0) is an open issue.

Slide 17

Slide 17 text

Simulation Studies
▶ Empirical performance of the NA-AIPW (NA) strategy.
■ Compare the NA strategy with α-elimination (Alpha) and uniform sampling (Uniform).
  α-elimination is a strategy using the Neyman allocation when the standard deviations are known (Kaufmann et al., 2016).
  Uniform sampling draws each treatment arm with equal probability: a randomized controlled trial without adaptation.
• Setting 1: μ_1 = 0.05, μ_0 = 0.01, σ_1² = 1, σ_0² = 0.2.
• Setting 2: μ_1 = 0.05, μ_0 = 0.01, σ_1² = 1, σ_0² = 0.1.
• Strategies using the Neyman allocation outperform the RCT.
[Figures: results for Setting 1 and Setting 2.]
We draw treatment arm 1 in Setting 2 more often than in Setting 1. (A Monte Carlo sketch of this kind of comparison follows below.)
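A rough way to reproduce this kind of comparison is sketched below; it is not the paper's exact experiment. The simplifications are my own assumptions: known standard deviations for the Neyman-allocated arm (as in α-elimination), a sample-mean recommendation for both strategies, and arbitrary run counts.

```python
import numpy as np

def misid_prob(T, mu, sigma, w1, n_sims=2000, seed=0):
    """Monte Carlo estimate of P(recommended arm != best arm) when arm 1 is drawn
    with fixed probability w1 and the arm with the larger sample mean is recommended."""
    rng = np.random.default_rng(seed)
    best = max(mu, key=mu.get)
    errors = 0
    for _ in range(n_sims):
        n1 = rng.binomial(T, w1)                                  # draws allocated to arm 1
        n0 = T - n1
        m1 = rng.normal(mu[1], sigma[1] / np.sqrt(max(n1, 1)))    # sample mean of arm 1
        m0 = rng.normal(mu[0], sigma[0] / np.sqrt(max(n0, 1)))    # sample mean of arm 0
        rec = 1 if m1 > m0 else 0
        errors += (rec != best)
    return errors / n_sims

mu = {1: 0.05, 0: 0.01}
sigma = {1: 1.0, 0: np.sqrt(0.2)}                 # Setting 1
neyman_w1 = sigma[1] / (sigma[1] + sigma[0])      # oracle Neyman allocation
print("Uniform:", misid_prob(5000, mu, sigma, w1=0.5))
print("Neyman :", misid_prob(5000, mu, sigma, w1=neyman_w1))
```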

Slide 18

Slide 18 text

Beyond the Neyman Allocation Rule

Slide 19

Slide 19 text

Limitations of the Neyman Allocation Rule
▶ I briefly introduce my ongoing studies.
• Several results are still conjectures and have not been published.
■ Limitations of the Neyman allocation rule.
• It only considers the case with two treatment arms.
• It does not use covariates (contextual information).
■ Consider extensions of the NA-AIPW strategy.

Slide 20

Slide 20 text

Problem Setting with Multiple Treatment Arms and Contextual Information
■ K treatment arms: [K] = {1, 2, …, K}.
■ Covariate (context): a d-dimensional random variable X ∈ 𝒳 ⊂ ℝ^d.
  Side information such as a feature of arms.
• (Y_1, …, Y_K) are conditionally independent given X.
■ Let ν be a joint distribution of (Y_1, …, Y_K, X), called a bandit model.
• Denote the mean outcome of treatment arm a by μ_a(ν) = 𝔼_ν[Y_{a,t}] and the conditional mean outcome by μ_a(ν)(x) = 𝔼_ν[Y_{a,t} | X_t = x].
■ Best treatment arm: the arm with the highest expected outcome, a*(ν) = arg max_{a∈[K]} μ_a(ν).
  When there are multiple best treatment arms, we denote the set by 𝒜*(ν).
■ The best arm is defined by the unconditional mean, not conditioned on a covariate (context) X_t = x.
  ex. Average treatment effect estimation via adaptive experiments, such as Hahn, Hirano, and Karlan (2011).

Slide 21

Slide 21 text

Problem Setting
■ Bandit process: In each round t ∈ {1, 2, …, T}, under a bandit model ν,
• Observe a context (covariate) X_t ∈ 𝒳.
• Draw a treatment arm A_t ∈ [K].
• Observe an outcome Y_{A_t,t} of the chosen arm A_t.
• Stop the trial at round t = T.
■ After the final round T, an algorithm recommends an estimated best treatment arm â_T ∈ [K].
[Diagram: in each round t = 1, …, T, the experimenter observes X_t, draws an arm A_t ∈ {1, …, K}, and observes Y_{A_t,t}.]

Slide 22

Slide 22 text

Location-Shift Bandit Class
▶ Definition: Location-shift bandit class.
■ We call a class 𝒫 of bandit models a location-shift bandit class if
• For all ν ∈ 𝒫 and x ∈ 𝒳, the conditional variance of Y_{a,t} is constant.
  = For all a ∈ [K] and any x ∈ 𝒳, there exists a constant σ_a²(x) such that Var_ν(Y_{a,t} | X_t = x) = σ_a²(x) for all ν ∈ 𝒫.
• For all ν ∈ 𝒫, X has the same distribution, whose density we denote by ζ(x).
ex. Gaussian distributions with fixed variances.

Slide 23

Slide 23 text

Class of Strategies
■ To derive lower bounds in the case with multiple treatment arms, we further restrict the class of BAI strategies beyond consistent strategies.
▶ Definition: Asymptotically invariant strategy.
• A strategy is asymptotically invariant for 𝒫 if for any ν, υ ∈ 𝒫, all a ∈ [K], and all x ∈ 𝒳,
  𝔼_ν[∑_{t=1}^T 1[A_t = a] | X_t = x] = 𝔼_υ[∑_{t=1}^T 1[A_t = a] | X_t = x].
✓ I conjecture that if potential outcomes follow a particular distribution, such as Bernoulli distributions, such a restriction may not be necessary, and an RCT (uniform sampling) is optimal.

Slide 24

Slide 24 text

Lower Bound under the Small Gap
Theorem (Lower bound, small gap)
• Consider a location-shift bandit class 𝒫 and ν ∈ 𝒫.
• Assume that there is a unique best treatment arm a*(ν).
• Assume that for all a ∈ [K], there exists Δ > 0 such that μ_{a*(ν)}(ν) − μ_a(ν) < Δ.
• Then, for any ν in the location-shift class, any consistent and asymptotically invariant strategy satisfies
  if K = 2:
    lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*(ν)) ≤ Δ² / (2 ∫ (σ_1(x) + σ_2(x))² ζ(x) dx) + C_1 Δ³;
  if K ≥ 3:
    lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*(ν)) ≤ Δ² / (2 ∑_{a∈[K]} ∫ σ_a²(x) ζ(x) dx) + C_2 Δ³,
  where C_1, C_2 > 0 are some constants.

Slide 25

Slide 25 text

Target Allocation Ratio
▶ Target allocation ratio w*: [K] × 𝒳 → (0, 1) such that ∑_{a∈[K]} w*(a | x) = 1 for any x ∈ 𝒳.
  (Target allocation ratio: the ratio of the expected number of arm draws (1/T) 𝔼_ν[∑_{t=1}^T 1[A_t = a] | X_t = x].)
■ The lower bound suggests drawing an arm a with the following probability w*(a | X_t):
• If K = 2, w*(1 | X_t) = σ_1(X_t) / (σ_1(X_t) + σ_2(X_t)), and w*(2 | X_t) = σ_2(X_t) / (σ_1(X_t) + σ_2(X_t)).
• If K ≥ 3, w*(a | X_t) = σ_a²(X_t) / ∑_{b∈[K]} σ_b²(X_t).
▶ Beyond the Neyman allocation rule: when K ≥ 3, draw arms in the ratio of the variances (see the sketch below).
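A small sketch of these target allocation ratios as a Python function; the function name and the input format are illustrative assumptions.

```python
import numpy as np

def target_allocation(sigmas):
    """Target allocation ratio w*(a | x) given conditional std devs sigma_a(x) for all arms.

    K = 2: Neyman allocation, proportional to the standard deviations.
    K >= 3: proportional to the variances.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    if sigmas.size == 2:
        return sigmas / sigmas.sum()
    return sigmas**2 / (sigmas**2).sum()

print(target_allocation([2.0, 1.0]))        # two arms: [2/3, 1/3]
print(target_allocation([2.0, 1.0, 1.0]))   # three arms: [4/6, 1/6, 1/6]
```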

Slide 26

Slide 26 text

RS-AIPW Strategy
■ Algorithm (strategy): RS-AIPW strategy.
  RS: Random Sampling of each treatment arm.
▶ Procedure of the contextual RS-AIPW strategy (a sketch of step 4 follows below):
1. In each round t ∈ [T], estimate σ_a(X_t).
2. Using the estimators of σ_a(X_t), estimate w*.
3. Draw a treatment arm with the estimated probability ŵ_t.
4. In round T, estimate μ_a(ν) using the AIPW estimator:
   μ̂_{a,T}^{AIPW} = (1/T) ∑_{t=1}^T [ 1[A_t = a] (Y_{a,t} − μ̂_{a,t}(X_t)) / ŵ_t(a | X_t) + μ̂_{a,t}(X_t) ],
   where μ̂_{a,t}(X_t) is an estimator of μ_a(ν)(X_t) using the samples until round t.
   The AIPW estimator is constructed from a martingale difference sequence.
5. Recommend â_T^{AIPW} = arg max_{a∈[K]} μ̂_{a,T}^{AIPW} as the estimated best treatment arm.
■ The RS-AIPW strategy is asymptotically optimal under the small gap, as is the NA-AIPW strategy.
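To make step 4 concrete, here is a minimal sketch of the AIPW computation on logged data from an adaptive experiment. The array names and shapes are my own illustrative assumptions; the plug-in estimates mu_hat and the propensities w_hat are assumed to have been produced sequentially during the experiment (steps 1–3).

```python
import numpy as np

def aipw_recommend(A, Y, w_hat, mu_hat):
    """Contextual AIPW estimates and recommended arm from logged data.

    A:      (T,) arm drawn in each round, values in {0, ..., K-1}
    Y:      (T,) observed outcome Y_{A_t, t}
    w_hat:  (T, K) estimated allocation probability w_t(a | X_t) used when drawing
    mu_hat: (T, K) plug-in estimate of mu_a(X_t) built from data before round t
    """
    T, K = w_hat.shape
    indicator = np.zeros((T, K))
    indicator[np.arange(T), A] = 1.0                 # 1[A_t = a]
    scores = (Y[:, None] - mu_hat) * indicator / w_hat + mu_hat   # AIPW score per round and arm
    mu_aipw = scores.mean(axis=0)                    # \hat{mu}_{a,T}^{AIPW}
    return int(np.argmax(mu_aipw)), mu_aipw

# Toy logged data with K = 3 arms and T = 6 rounds (numbers are arbitrary).
A = np.array([0, 1, 2, 1, 0, 2])
Y = np.array([0.2, 0.5, 0.1, 0.7, 0.0, 0.3])
w_hat = np.full((6, 3), 1 / 3)
mu_hat = np.zeros((6, 3))
print(aipw_recommend(A, Y, w_hat, mu_hat))
```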

Slide 27

Slide 27 text

Expected Simple Regret Minimization

Slide 28

Slide 28 text

Expected Simple Regret
■ Simple regret: r_T(ν) = μ_{a*(ν)}(ν) − μ_{â_T}(ν) under a bandit model ν (there is randomness in â_T(ν)).
■ Expected simple regret: 𝔼_ν[r_T(ν)] = 𝔼_ν[μ_{a*(ν)}(ν) − μ_{â_T}(ν)].
• 𝔼_ν is the expectation over â_T(ν).
• The expected simple regret represents an expected relative welfare loss.
■ Let the gap between the expected outcomes of arms a, b ∈ [K] be Δ_{a,b}(ν) = μ_a(ν) − μ_b(ν).
■ Using the gap Δ_{a,b}(ν) = μ_a(ν) − μ_b(ν), the expected simple regret can be decomposed as
  𝔼_ν[r_T(ν)] = 𝔼_ν[μ_{a*(ν)}(ν) − μ_{â_T}(ν)] = ∑_{b∉𝒜*(ν)} Δ_{a*(ν),b}(ν) ℙ_ν(â_T = b),
  where 𝒜*(ν) is the set of the best treatment arms.
▶ Consider minimization of the expected simple regret. (A small numerical illustration of the decomposition follows below.)
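As a numerical illustration of this decomposition (all numbers hypothetical):

```python
import numpy as np

mu = np.array([1.0, 0.8, 0.5])        # expected outcomes; arm 0 is the unique best arm
p_rec = np.array([0.90, 0.07, 0.03])  # probability that the strategy recommends each arm
gaps = mu.max() - mu                  # Delta_{a*, b} for each arm b (zero for the best arm)
expected_simple_regret = np.sum(gaps * p_rec)  # only suboptimal recommendations contribute
print(expected_simple_regret)         # 0.2*0.07 + 0.5*0.03 = 0.029
```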

Slide 29

Slide 29 text

Expected Simple Regret
■ The rate of the expected simple regret: 𝔼_ν[r_T(ν)] = ∑_{b∉𝒜*(ν)} Δ_{a*(ν),b}(ν) ℙ_ν(â_T = b).
■ For some constant C > 0, the rate is given as 𝔼_ν[r_T(ν)] = ∑_{b∉𝒜*(ν)} Δ_{a*(ν),b}(ν) exp(−C T Δ_{a*(ν),b}(ν)²).
  The speed at which Δ_{a*(ν),b}(ν) converges to zero determines the order of 𝔼_ν[r_T(ν)] in T:
1. Δ_{a*(ν),b}(ν) converges more slowly than 1/√T → for some increasing function g(T), 𝔼_ν[r_T(ν)] ≈ exp(−g(T)).
2. Δ_{a*(ν),b}(ν) = C_1/√T for some constant C_1 → for some constant C_2 > 0, 𝔼_ν[r_T(ν)] ≈ C_2/√T.
3. Δ_{a*(ν),b}(ν) converges faster than 1/√T → 𝔼_ν[r_T(ν)] ≈ o(1/√T).
→ In the worst case, Δ_{a*(ν),b} converges to zero at the rate C_1/√T (Bubeck et al., 2011).
  cf. Hirano and Porter (2008) and the limit-of-experiments framework.
✓ When Δ_{a,b}(ν) is independent of T, evaluation of the expected simple regret is equivalent to that of the probability of misidentification.
• ℙ_ν(â_T = b) converges to zero at an exponential speed if Δ_{a,b}(ν) is independent of T, so the factor Δ_{a*(ν),b} is negligible.
→ For some constant (⋆), if ℙ_ν(â_T = b) ≈ exp(−T(⋆)) for b ∉ 𝒜*(ν), then 𝔼_ν[r_T(ν)] ≈ exp(−T(⋆)).
  cf. Kasy and Sautmann (2021).

Slide 30

Slide 30 text

Optimal Strategy when the Support is Bounded
▶ The result of Bubeck et al. (2011).
■ Definition: the class of bounded bandit models 𝒫_{[α,β]}: for any a ∈ [K], Y_a ∈ [α, β].
■ Lower bound: For T ≥ K ≥ 2, any strategy satisfies sup_{ν∈𝒫_{[0,1]}} 𝔼_ν[r_T(ν)] ≥ (1/20)√(K/T).
■ Upper bound: For uniform sampling (RCT) + recommending the arm with the highest sample average,
  sup_{ν∈𝒫_{[0,1]}} 𝔼_ν[r_T(ν)] ≤ √(K log K / T), up to a constant factor.
▶ Consider the worst-case expected simple regret for bandit models with unbounded support (location-shift).

Slide 31

Slide 31 text

Lower Bound
■ Definition: Null bounded strategy.
• A strategy is null bounded if, under a bandit model ν such that μ_1(ν) = ⋯ = μ_K(ν), the strategy draws each treatment arm as 𝔼_ν[(1/T) ∑_{t=1}^T 1[A_t = a] | X_t = x] ≥ ε for some small constant ε > 0.
Theorem (Lower bound)
• Consider a location-shift bandit class 𝒫.
• For any null bounded strategy, the expected simple regret is lower bounded as follows:
  if K = 2: sup_{ν∈𝒫} √T 𝔼_ν[r_T(ν)] ≥ C_1 (∫ (σ_1(x) + σ_2(x))² ζ(x) dx)^{1/2} + o(1);
  if K ≥ 3: sup_{ν∈𝒫} √T 𝔼_ν[r_T(ν)] ≥ C_2 (∑_{a∈[K]} ∫ σ_a²(x) ζ(x) dx)^{1/2} + o(1),
  where C_1 and C_2 are some constants.

Slide 32

Slide 32 text

Target Allocation Ratio
■ The target allocation ratios for minimization of the expected simple regret are the same as those for minimization of the probability of misidentification.
• If K = 2, w*(1 | X_t) = σ_1(X_t) / (σ_1(X_t) + σ_2(X_t)), and w*(2 | X_t) = σ_2(X_t) / (σ_1(X_t) + σ_2(X_t)).
• If K ≥ 3, w*(a | X_t) = σ_a²(X_t) / ∑_{b∈[K]} σ_b²(X_t).
■ Why? → Both settings consider worst-case scenarios.
• In the worst case, only the variance affects the target sample allocation.
▶ The RS-AIPW strategy proposed for minimization of the probability of misidentification also applies here.
  This strategy is minimax optimal in the sense that its leading factor matches the lower bound.

Slide 33

Slide 33 text

Summary

Slide 34

Slide 34 text

Summary
▶ An asymptotically optimal strategy in two-armed Gaussian BAI with a fixed budget.
■ We evaluate the performance of BAI strategies by the probability of misidentification.
• The Neyman allocation rule is globally optimal when the standard deviations are known.
  = The Neyman allocation is known to be asymptotically optimal when the potential outcomes of the two treatment arms follow Gaussian distributions with any mean parameters and fixed variances.
■ Result of Kato, Ariu, Imaizumi, Nomura, and Qin (2022).
• The standard deviations are unknown and estimated during the experiment.
• Under the NA-AIPW strategy, the probability of misidentification matches the lower bound when the gap between the expected outcomes goes to zero.
→ The strategy based on the Neyman allocation is worst-case optimal (small-gap optimal).

Slide 35

Slide 35 text

References
• Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), “Best Arm Identification with a Fixed Budget under a Small Gap,” https://arxiv.org/abs/2201.04469.
• Audibert, J.-Y., Bubeck, S., and Munos, R. (2010), “Best Arm Identification in Multi-Armed Bandits,” in COLT.
• Bubeck, S., Munos, R., and Stoltz, G. (2011), “Pure Exploration in Finitely-Armed and Continuous-Armed Bandits,” Theoretical Computer Science.
• Carpentier, A. and Locatelli, A. (2016), “Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem,” in COLT.
• Chen, C.-H., Lin, J., Yücesan, E., and Chick, S. E. (2000), “Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization,” Discrete Event Dynamic Systems.
• Garivier, A. and Kaufmann, E. (2016), “Optimal Best Arm Identification with Fixed Confidence,” in COLT.
• Glynn, P. and Juneja, S. (2004), “A Large Deviations Perspective on Ordinal Optimization,” in Proceedings of the 2004 Winter Simulation Conference, IEEE.
• Kaufmann, E., Cappé, O., and Garivier, A. (2016), “On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models,” JMLR.
• Lai, T. and Robbins, H. (1985), “Asymptotically Efficient Adaptive Allocation Rules,” Advances in Applied Mathematics.
• Manski, C. F. (2000), “Identification Problems and Decisions under Ambiguity: Empirical Analysis of Treatment Response and Normative Analysis of Treatment Choice,” Journal of Econometrics.
• — (2002), “Treatment Choice under Ambiguity Induced by Inferential Problems,” Journal of Statistical Planning and Inference.
• — (2004), “Statistical Treatment Rules for Heterogeneous Populations,” Econometrica.
• Manski, C. F. and Tetenov, A. (2016), “Sufficient Trial Size to Inform Clinical Practice,” Proceedings of the National Academy of Sciences.
• Stoye, J. (2009), “Minimax Regret Treatment Choice with Finite Samples,” Journal of Econometrics.