Adaptive Experimental Design for Efficient Average Treatment Effect Estimation and Treatment Choice

2025 CUHK-Shenzhen Econometrics Workshop

MasaKat0

July 08, 2025
Transcript

  1. 1 Adaptive Experimental Design for Efficient Average Treatment Effect Estimation

    and Treatment Choice 2025 CUHK-Shenzhen Econometrics Workshop Masahiro Kato Osaka Metropolitan University Mizuho-DL Financial Technology Co., Ltd.
  2. 2 Experimental approach for causal inference ◼ The gold standard

    for causal inference is randomized controlled trials (RCTs). • We randomly allocate treatments to experimental units. ◼ RCTs are the gold standard but often costly and inefficient. → We want to design more efficient experiments (in some sense). ➢ Adaptive experimental design. • We can update the treatment-allocation probability during an experiment to gain efficiency. • How do we define an ideal treatment-allocation probability (propensity score)?
  3. 3 Table of contents ◼ In this talk, we explain

    how we design adaptive experiments for average treatment effect estimation and treatment choice. 1. General approach for adaptive experimental design. 2. Adaptive experimental design for ATE estimation. 3. Adaptive experimental design for treatment choice. 4. Adaptive experimental design for policy learning. 5. Technical issues in experimental design for treatment choice.
  4. 4 Papers this talk is based on • Kato, M.,

    Ishihara, T., Honda, J., and Narita, Y. (2020). Efficient adaptive experimental design for average treatment effect estimation. Revise and resubmit at JASA. • Kato, M. Minimax and Bayes optimal best-arm identification: adaptive experimental design for treatment choice. Preprint. • Komiyama, J., Ariu, K., Kato, M., and Qin, C. (2023). Rate-optimal Bayesian simple regret in best arm identification. Mathematics of Operations Research. • Kato, M., Okumura, K., Ishihara, T., and Kitagawa, T. (2024). Adaptive experimental design for policy learning. Preprint. • Ariu, K., Kato, M., Komiyama, J., McAlinn, K., and Qin, C. (2025). A comment on “adaptive treatment assignment in experiments for policy choice.”
  5. 6 How to Design Adaptive Experiments? ◼ In many studies,

    adaptive experiments are designed with the following steps: Step 1. Define the goal of causal inference and the performance measure. Step 2. Compute a lower bound and an ideal treatment-allocation probability. Step 3. Run an adaptive experiment: allocate treatment arms while estimating the ideal treatment-allocation probability, and return the target of interest using the observations from the adaptive experiment. Step 4. Investigate the performance of the designed experiment and confirm its optimality.
  6. 7 Step 1: goals and performance measures ◼ Goal 1:

    Average treatment effect (ATE) estimation. * The number of treatment arms is usually two (binary treatments). • Goal. Estimate the ATE. • Performance measure. The (asymptotic) variance; a smaller asymptotic variance is better. ◼ Goal 2: Treatment choice, also known as best-arm identification. * The number of treatments is two or more (multiple treatments). • Goal. Choose the best treatment, i.e., the treatment whose expected outcome is the highest. • Performance measure. The probability of misidentifying the best treatment, or the regret.
  7. 8 Step 2: lower bounds and ideal allocation probability ◼

    After deciding the goal and performance measure, we derive a lower bound. ◼ ATE estimation. • Semiparametric efficiency bound (Hahn, 1998). ◼ Best-arm identification. • Various lower bounds have been proposed; there is no consensus on which one to use. ✓ Lower bounds are often functions of the treatment-allocation probability (propensity score). → We can minimize the lower bound with respect to the treatment-allocation probability. We refer to a probability that minimizes the lower bound as an ideal treatment-allocation probability.
  8. 9 Step 3: adaptive experiment ◼ Ideal allocation probabilities usually

    depend on unknown parameters of a distribution. • We estimate the probabilities (unknown parameters) during an experiment. ◼ We run an adaptive experiment, which consists of the following two phases. 1. Treatment-allocation phase: in each round 𝑡 = 1,2, … , 𝑇: • Estimate the ideal treatment-allocation probability based on past observations. • Allocate treatment following the estimated ideal treatment-allocation probability. 2. Decision-making phase: at the end of the experiment: • Return an estimate of a target of interest.
  9. 10 Step 4: upper bound and optimality ◼ For the

    designed experiment, we investigate its theoretical performance. ➢ ATE estimation. • We prove the asymptotic normality and check the asymptotic variance. • If the asymptotic variance matches the efficiency bound minimized for the treatment- allocation probability, the design is (asymptotically) optimal. ➢ Best-arm identification. • We investigate the probability of misidentifying the best arm or the regret. • Distribution-dependent, minimax and Bayes optimality.
  10. 11 More details (Section 2) I explain the details of

    adaptive experimental design for ATE estimation. (Section 3) Then, I discuss how to design adaptive experiments for treatment choice. (Section 4) I introduce adaptive experimental design for policy learning. (Section 5) I discuss technical issues in adaptive experimental design for treatment choice.
  11. 13 Setup ◼ Binary treatments, 1 and 0. ◼ Sample

    size, T. Each unit is indexed by t = 1, 2, …, T. ◼ Potential outcomes, Y_{1,t}, Y_{0,t} ∈ ℝ. • μ_a(X) and σ_a^2(X): conditional mean and variance of Y_a given X. ◼ d-dimensional covariates, X_t ∈ 𝒳 ⊂ ℝ^d, e.g., age, occupation, etc. ◼ Average treatment effect, τ = E[Y_1 − Y_0].
  12. 14 Adaptive experiment 1. Treatment-allocation phase: in each round 𝑡

    = 1, 2, …, T: • Observe covariates X_t. • Allocate treatment A_t ∈ {1, 0} based on $\{(X_s, A_s, Y_s)\}_{s=1}^{t-1}$ and X_t. • Observe the outcome $Y_t = 1[A_t = 1] Y_{1,t} + 1[A_t = 0] Y_{0,t}$. 2. Decision-making phase: at the end of the experiment (after observing $\{(X_t, A_t, Y_t)\}_{t=1}^{T}$): • Estimate the ATE τ = E[Y_1] − E[Y_0]. (Slide diagram: in round t, a unit with covariates X_t receives treatment A_t and the outcome Y_t is observed; after round T, the ATE is estimated.)
  13. 15 Performance measure and lower bound ◼ We aim to

    construct an asymptotically normal estimator $\hat{\tau}_T$ with a small asymptotic variance: $\sqrt{T}(\hat{\tau}_T - \tau) \to_d \mathcal{N}(0, V)$ as $T \to \infty$. ◼ Efficiency bound for ATE estimation (Hahn, 1998): • Assume that observations are i.i.d. with a fixed treatment-allocation probability w_a(x). • Then the efficiency bound is given by $V(w) := \mathbb{E}\left[\frac{\sigma_1^2(X)}{w_1(X)} + \frac{\sigma_0^2(X)}{w_0(X)} + (\tau(X) - \tau)^2\right]$. • τ(x) is the conditional ATE, defined as $\tau(X) := \mathbb{E}[Y_1 \mid X] - \mathbb{E}[Y_0 \mid X] = \mu_1(X) - \mu_0(X)$.
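To make the efficiency bound concrete, here is a minimal Monte Carlo sketch that evaluates V(w) for a given allocation rule. The functional forms for σ_1^2(x), σ_0^2(x), and τ(x) are illustrative assumptions (loosely mirroring the example used later for the sample-size comparison), not results from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative model pieces (not from the slides).
sigma1_sq = lambda x: x ** 2                      # conditional variance of Y1 given X = x
sigma0_sq = lambda x: 0.1 * np.ones_like(x)       # conditional variance of Y0 given X = x
tau_x = lambda x: 0.1 * np.ones_like(x)           # conditional ATE (constant here)

def efficiency_bound(w1, n_mc=1_000_000):
    """Monte Carlo approximation of V(w) = E[s1^2/w1 + s0^2/w0 + (tau(X) - tau)^2]."""
    x = rng.uniform(0.0, 1.0, size=n_mc)          # X ~ Uniform[0, 1]
    w1x = w1(x)
    tau = tau_x(x).mean()                         # tau = E[tau(X)]
    v = sigma1_sq(x) / w1x + sigma0_sq(x) / (1.0 - w1x) + (tau_x(x) - tau) ** 2
    return v.mean()

# Compare a balanced RCT with the Neyman allocation defined on the next slide.
v_rct = efficiency_bound(lambda x: np.full_like(x, 0.5))
v_ney = efficiency_bound(
    lambda x: np.sqrt(sigma1_sq(x)) / (np.sqrt(sigma1_sq(x)) + np.sqrt(sigma0_sq(x)))
)
print(f"V(RCT) ~ {v_rct:.3f},  V(Neyman) ~ {v_ney:.3f}")
```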
  14. 16 Ideal treatment-allocation probability ◼ The efficiency bound is a

    functional of the treatment-allocation probability. • The bound can be further minimized over the probability (Hahn, Hirano, and Karlan, 2011): $w^* := \arg\min_{w} V(w) = \arg\min_{w} \mathbb{E}\left[\frac{\sigma_1^2(X)}{w_1(X)} + \frac{\sigma_0^2(X)}{w_0(X)} + (\tau(X) - \tau)^2\right]$. ◼ Neyman allocation (Neyman, 1934; Hahn, Hirano, and Karlan, 2011). • w^* has the following closed-form solution, called the Neyman allocation: $w_1^*(x) = \frac{\sigma_1(x)}{\sigma_1(x) + \sigma_0(x)}, \qquad w_0^*(x) = \frac{\sigma_0(x)}{\sigma_1(x) + \sigma_0(x)}$. • Allocate treatments in the ratio of the conditional standard deviations. • Estimate the variances and w^* during the experiment.
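Spelling out the step behind the closed form (it is implicit on the slide): since the term (τ(X) − τ)^2 does not depend on w, the minimization is pointwise in x,

$$\min_{w_1(x) \in (0,1)} \left\{ \frac{\sigma_1^2(x)}{w_1(x)} + \frac{\sigma_0^2(x)}{1 - w_1(x)} \right\}
\;\Longrightarrow\;
-\frac{\sigma_1^2(x)}{w_1(x)^2} + \frac{\sigma_0^2(x)}{(1 - w_1(x))^2} = 0
\;\Longrightarrow\;
w_1^*(x) = \frac{\sigma_1(x)}{\sigma_1(x) + \sigma_0(x)},$$

with minimized value $(\sigma_1(x) + \sigma_0(x))^2$, which is the conditional variance attained under the Neyman allocation.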
  15. 17 Adaptive experimental design for ATE estimation Kato, Ishihara, Honda,

    and Narita. (2020). “Efficient Average Treatment Effect Estimation via Adaptive Experiments.” Preprint. 1. Treatment-allocation phase: in each round t = 1, 2, …, T: I. Obtain an estimator $\hat{\sigma}_{a,t}^2(X_t)$ of $\sigma_a^2(X_t)$ using past observations $\{(Y_s, A_s, X_s)\}_{s=1}^{t-1}$. II. Allocate $A_t \sim w_{a,t}(X_t) = \hat{\sigma}_{a,t}(X_t) / \bigl(\hat{\sigma}_{1,t}(X_t) + \hat{\sigma}_{0,t}(X_t)\bigr)$. 2. Decision-making phase: at the end of the experiment: • Adaptive Augmented Inverse Probability Weighting (A2IPW) estimator: $\hat{\tau}_T^{\mathrm{A2IPW}} = \frac{1}{T}\sum_{t=1}^{T}\left[\frac{1[A_t = 1](Y_t - \hat{\mu}_{1,t}(X_t))}{w_{1,t}(X_t)} - \frac{1[A_t = 0](Y_t - \hat{\mu}_{0,t}(X_t))}{w_{0,t}(X_t)} + \hat{\mu}_{1,t}(X_t) - \hat{\mu}_{0,t}(X_t)\right]$. • $\hat{\mu}_{a,t}(X_t)$ is an estimator of $\mu_a(X_t)$ using past observations $\{(Y_s, A_s, X_s)\}_{s=1}^{t-1}$.
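A minimal sketch of this sequential design, under simplifying assumptions: the data-generating process is invented for illustration, the nuisance estimates ignore covariates (constant fits instead of regressions on x), and the clipping of the allocation probability is a practical stabilization choice not specified on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_unit():
    # Hypothetical DGP, for illustration only.
    x = rng.uniform()
    y1 = 0.6 * x + rng.normal(scale=x + 0.1)   # sigma_1(x) grows with x
    y0 = 0.5 * x + rng.normal(scale=0.3)       # sigma_0(x) constant
    return x, y1, y0

def a2ipw_experiment(T=2000):
    sums = {1: 0.0, 0: 0.0}; sq = {1: 0.0, 0: 0.0}; n = {1: 0, 0: 0}
    psi = []
    for _ in range(T):
        x, y1, y0 = draw_unit()
        # Crude plug-in estimates of mu_a and sigma_a^2 from past observations only.
        mu = {a: sums[a] / n[a] if n[a] > 0 else 0.0 for a in (1, 0)}
        var = {a: max(sq[a] / n[a] - mu[a] ** 2, 1e-3) if n[a] > 1 else 1.0
               for a in (1, 0)}
        s1, s0 = np.sqrt(var[1]), np.sqrt(var[0])
        w1 = float(np.clip(s1 / (s1 + s0), 0.1, 0.9))  # estimated Neyman allocation
        a = 1 if rng.uniform() < w1 else 0
        y = y1 if a == 1 else y0
        w_a = w1 if a == 1 else 1.0 - w1
        # A2IPW score for this round.
        psi.append((1 if a == 1 else -1) * (y - mu[a]) / w_a + mu[1] - mu[0])
        sums[a] += y; sq[a] += y ** 2; n[a] += 1
    tau_hat = float(np.mean(psi))
    se = float(np.std(psi, ddof=1) / np.sqrt(T))       # s.e. via the martingale CLT
    return tau_hat, se

print(a2ipw_experiment())
```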
  16. 18 Batch and sequential design ◼ Batch design (Hahn, Hirano,

    and Karlan (2011)): • Update the treatment-allocation probability only at certain rounds. • E.g., a two-stage experiment: split 1000 rounds into the first 300 rounds and the remaining 700 rounds; the first stage is a pilot phase for estimating the variances, and in the second stage we allocate treatments following the estimated probability. ◼ Sequential design (ours): • Update the treatment-allocation probability at every round. Our paper is a sequential version of Hahn, Hirano, and Karlan (2011). (Slide figure: timeline from t = 1 to T contrasting the sequential and batch designs.)
  17. 19 Asymptotic Normality and efficiency ◼ Rewrite the A2IPW estimator

    as $\hat{\tau}_T^{\mathrm{A2IPW}} = \frac{1}{T}\sum_{t=1}^{T}\psi_t$, where $\psi_t := \frac{1[A_t=1](Y_t - \hat{\mu}_{1,t}(X_t))}{w_{1,t}(X_t)} - \frac{1[A_t=0](Y_t - \hat{\mu}_{0,t}(X_t))}{w_{0,t}(X_t)} + \hat{\mu}_{1,t}(X_t) - \hat{\mu}_{0,t}(X_t)$. ◼ $\{\psi_t\}_{t=1}^{T}$ is a martingale difference sequence. • Under suitable conditions, we can apply the martingale central limit theorem. • This addresses the sample-dependency problem caused by adaptive sampling. Theorem (Asymptotic normality of the A2IPW estimator). Suppose that $w_{a,t}(X) - w_a^*(X) \to 0$ and $\hat{\mu}_{a,t}(X) \to \mu_a(X)$ as $t \to \infty$ almost surely. Then $\sqrt{T}(\hat{\tau}_T^{\mathrm{A2IPW}} - \tau) \to_d \mathcal{N}(0, V(w^*))$ as $T \to \infty$.
  18. 20 Efficiency gain  How does adaptive sampling improve the

    efficiency? ◼ Sample size computation in hypothesis testing with H_0: τ = 0 vs. H_1: τ = Δ. • Treatment-allocation probability w; effect size Δ ≠ 0. • To achieve power 1 − β while controlling the Type I error at α, we need a sample size of at least $T^*(w) = \frac{V(w)}{\Delta^2}\bigl(z_{1-\alpha/2} - z_{\beta}\bigr)^2$, where z_q is the q-quantile of the standard normal distribution. ◼ Sample size comparison. • Setting: X ∼ Uniform[0, 1], σ_1^2(X) = X^2, σ_0^2(X) = 0.1, Δ = 0.1, α = β = 0.05, τ(X) = τ = Δ. • RCT (w_1(X) = w_0(X) = 1/2): T^*(w) ≈ 1300. • Neyman allocation: T^*(w) ≈ 970, i.e., about 330 fewer samples. • Another case: σ_1^2(X) = (2X)^2, σ_0^2(X) = 0.1. RCT: T^*(w) ≈ 3700; Neyman allocation: T^*(w) ≈ 2600.
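A direct transcription of the displayed sample-size formula, as a small helper; the value of V(w) would come from an estimate such as the Monte Carlo sketch above, and the numbers passed below are hypothetical placeholders, not the slide's scenario.

```python
from scipy.stats import norm

def required_sample_size(V_w, delta, alpha=0.05, beta=0.05):
    """T*(w) = V(w) / delta^2 * (z_{1 - alpha/2} - z_beta)^2 from the slide."""
    z = norm.ppf(1 - alpha / 2) - norm.ppf(beta)
    return V_w / delta ** 2 * z ** 2

# Hypothetical usage: V(w) = 1.0 and Delta = 0.1 are placeholder values.
print(required_sample_size(V_w=1.0, delta=0.1))
```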
  19. 21 Double machine learning Kato, Yasui, and McAlinn (2021). “The

    Adaptive Doubly Robust Estimator for Policy Evaluation in Adaptive Experiments and a Paradox Concerning Logging Policy.” In NeurIPS. ◼ The asymptotic normality is related to double machine learning (Chernozhukov et al., 2018). ◼ We can replace the true probability 𝑤𝑡 with its estimator even if we know 𝑤𝑡 . • Since 𝑤𝑡 can be volatile, replacing it with its estimator can stabilize the performance. ◼ The asymptotic property does not change. • The use of past observations is a variant of sample splitting, recently called cross fitting. We refer to the sample splitting as adaptive fitting (Figure 1).
  20. 22 Active adaptive experimental design Kato, Oga, Komatsubara, and Inokuchi

    (2024). Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate Choices. In ICML. ◼ We can gain more efficiency by optimizing the covariate distribution. • Fix the target population covariate density as q(x). We aim to estimate the ATE over q(x). • But we can sample from a density p(x) different from q(x). = Covariate shift problem (Shimodaira, 2000): training data and test data follow different distributions. (Slide diagram: the decision-maker samples experimental units from the pool with density p(x), allocates treatment with probability w_a(x), and observes the outcome; the target population has density q(x).)
  21. 23 Active adaptive experimental design ◼ Efficiency bound under covariate

    shift is investigated in Uehara, Kato, and Yasui (2021). ◼ Using the efficiency bound, the ideal covariate density and allocation probability are given by $p^*(x) = \frac{(\sigma_1(x) + \sigma_0(x))\, q(x)}{\int (\sigma_1(x) + \sigma_0(x))\, q(x)\, \mathrm{d}x}, \qquad w_a^*(x) = \frac{\sigma_a(x)}{\sigma_1(x) + \sigma_0(x)}$. • Covariate density: we sample more experimental units where the variances are higher. • Treatment-allocation probability: the Neyman allocation. ◼ During the experiment, we estimate p^*(x) and w^* using past observations. ◼ This kind of source-data optimization is called active learning.
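A grid-based sketch of these two ideal quantities under assumed functional forms (the choices of σ_1(x), σ_0(x), and the uniform target q(x) below are illustrative, not from the paper):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 201)
dx = x[1] - x[0]
q = np.ones_like(x)                 # target covariate density: Uniform[0, 1]
s1 = x + 0.1                        # assumed sigma_1(x), illustrative only
s0 = 0.3 * np.ones_like(x)          # assumed sigma_0(x), illustrative only

unnorm = (s1 + s0) * q
p_star = unnorm / (unnorm.sum() * dx)   # ideal sampling density p*(x) on the grid
w1_star = s1 / (s1 + s0)                # Neyman allocation given X = x

print(round(float(p_star.sum() * dx), 3))   # ~1.0: p* integrates to one
```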
  22. 24 Related literature • Hahn, Hirano, and Karlan (2011) develops

    adaptive experimental design for ATE estimation with batch updates and discrete contexts. • Tabord-Meehan (2023) proposes a stratification tree to relax the discreteness of the covariates in Hahn, Hirano, and Karlan (2011). • Kallus, Saito, and Uehara (2020), Rafi (2023), and Li and Owen (2024) refine arguments about the efficiency bound for experimental design. • Shimodaira (2000) develops the framework of covariate-shift adaptation, and Sugiyama (2008) proposes active learning using techniques of covariate-shift adaptation. • Uehara, Kato, and Yasui (2021) investigate ATE estimation and policy learning under covariate shift; they develop the efficiency bound and an efficient ATE estimator using DML.
  23. 26 From ATE estimation to treatment choice ◼ From ATE

    estimation to decision-making (treatment choice; Manski, 2004). ◼ Treatment choice via adaptive experiments is called best-arm identification (BAI). • BAI has been investigated in various areas, including operations research and economics. (Slide diagram: gathering data → parameter estimation → decision-making, contrasting ATE estimation with treatment choice.)
  24. 27 Setup ◼ Treatment. 𝐾 treatments indexed by 1,2, …

    K, also referred to as treatment arms or arms. ◼ Potential outcome: Y_a ∈ ℝ. ◼ No covariates. ◼ Distribution: Y_a follows a parametric distribution $P_{\boldsymbol{\mu}}$, where $\boldsymbol{\mu} = (\mu_a)_{a \in [K]} \in \mathbb{R}^K$. • The mean of Y_a is μ_a (and only the mean parameters can differ across distributions). • For simplicity, let $P_{\boldsymbol{\mu}}$ be a Gaussian distribution under which the variance of Y_a is fixed at σ_a^2. ◼ Goal: efficiently find the best treatment arm $a^*_{\boldsymbol{\mu}} := \arg\max_{a \in \{1,2,\dots,K\}} \mu_a$.
  25. 28 Setup ◼ Adaptive experimental design with 𝑻 rounds: •

    Treatment-allocation phase: in each round t = 1, 2, …, T: • Allocate treatment $A_t \in [K] := \{1, 2, \dots, K\}$ using $\{(A_s, Y_s)\}_{s=1}^{t-1}$. • Observe the outcome $Y_t = \sum_{a \in [K]} 1[A_t = a] Y_{a,t}$. • Decision-making phase: at the end of the experiment: • Obtain an estimator $\hat{a}_T$ of the best treatment arm using $\{(A_t, Y_t)\}_{t=1}^{T}$. (Slide diagram: in round t, unit t receives treatment A_t and the outcome Y_t is observed; after round T, the best treatment is chosen.)
  26. 29 Performance measures • Let ℙ𝝁 and 𝔼𝝁 be the

    probability law and expectation under $P_{\boldsymbol{\mu}}$. ◼ Probability of misidentification: $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})$. ➢ The probability that the estimated best treatment $\hat{a}_T$ is not the true best one $a^*_{\boldsymbol{\mu}}$. ◼ Expected simple regret: $\mathrm{Regret}_{\boldsymbol{\mu}} := \mathbb{E}_{\boldsymbol{\mu}}\bigl[Y_{a^*_{\boldsymbol{\mu}}} - Y_{\hat{a}_T}\bigr]$. ➢ The welfare loss when we deploy the estimated best treatment $\hat{a}_T$ in the population. • In this talk, we refer to this quantity simply as the regret. • Also called the out-of-sample regret or the policy regret (Kasy and Sautmann, 2021), in contrast to the in-sample regret in regret minimization.
  27. 30 Evaluation framework ◼ Distribution-dependent analysis. • Evaluate the performance

    measures for a fixed $\boldsymbol{\mu}$ that does not vary with T. • For each $\boldsymbol{\mu} \in \mathbb{R}^K$, we evaluate $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})$ or $\mathrm{Regret}_{\boldsymbol{\mu}}$. ◼ Minimax analysis. • Evaluate the performance measures under the worst-case $\boldsymbol{\mu}$: $\sup_{\boldsymbol{\mu} \in \mathbb{R}^K} \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})$ or $\sup_{\boldsymbol{\mu} \in \mathbb{R}^K} \mathrm{Regret}_{\boldsymbol{\mu}}$. ◼ Bayes analysis. • A prior Π is given. • Evaluate the performance measures averaged under the prior: $\int_{\mathbb{R}^K} \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})\, \mathrm{d}\Pi(\boldsymbol{\mu})$ or $\int_{\mathbb{R}^K} \mathrm{Regret}_{\boldsymbol{\mu}}\, \mathrm{d}\Pi(\boldsymbol{\mu})$.
  28. 31 Regret decomposition ◼ The simple regret can be written

    as $\mathrm{Regret}_{\boldsymbol{\mu}} = \mathbb{E}_{\boldsymbol{\mu}}[Y_{a^*_{\mu}}] - \mathbb{E}_{\boldsymbol{\mu}}[Y_{\hat{a}_T}] = \sum_{b \neq a^*_{\mu}} (\mu_{a^*_{\mu}} - \mu_b)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b)$. • The regret equals the sum of the probabilities $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b)$ weighted by the ATEs $\mu_{a^*_{\mu}} - \mu_b$ between the best and suboptimal arms. • It also holds that $\mathrm{Regret}_{\boldsymbol{\mu}} \le \sum_{b \neq a^*_{\mu}} (\mu_{a^*_{\mu}} - \mu_b)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\mu})$. ◼ $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\mu})$ is roughly upper bounded by $\sum_{b \neq a^*_{\mu}} \exp\bigl(-C^* T (\mu_{a^*_{\mu}} - \mu_b)^2\bigr)$ for some constant $C^* > 0$. • Cf. Chernoff bounds and large-deviation principles.
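Spelling out the one-line derivation behind the decomposition (assuming the deployed outcome draw is independent of the experiment):

$$\mathbb{E}_{\boldsymbol{\mu}}[Y_{\hat{a}_T}] = \sum_{b \in [K]} \mu_b\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b), \qquad
\mathrm{Regret}_{\boldsymbol{\mu}} = \mu_{a^*_{\mu}} \sum_{b \in [K]} \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b) - \sum_{b \in [K]} \mu_b\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b) = \sum_{b \neq a^*_{\mu}} (\mu_{a^*_{\mu}} - \mu_b)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b),$$

where the term with b = a^*_{μ} vanishes.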
  29. 32 Distribution-dependent, minimax, and Bayes analysis ◼ To obtain an

    intuition, we consider a binary case (K = 2) with $a^*_{\mu} = 1$: $\mathrm{Regret}_{\boldsymbol{\mu}} = (\mu_1 - \mu_2)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = 2) \le (\mu_1 - \mu_2) \exp\bigl(-C^* T (\mu_1 - \mu_2)^2\bigr)$. ◼ Distribution-dependent analysis. → μ_1 − μ_2 can be asymptotically ignored since it is a constant. → It is enough to evaluate $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\mu})$ in order to evaluate $\mathrm{Regret}_{\boldsymbol{\mu}}$. → We evaluate the constant $C^*$ via $-\frac{1}{T}\log \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}}) \approx -\frac{1}{T}\log \mathrm{Regret}_{\boldsymbol{\mu}} \approx C^*$.
  30. 33 Distribution-dependent, minimax, and Bayes analysis ◼ To obtain an

    intuition, we consider a binary case (K = 2) with $a^*_{\mu} = 1$: $\mathrm{Regret}_{\boldsymbol{\mu}} = (\mu_1 - \mu_2)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = 2) \le (\mu_1 - \mu_2) \exp\bigl(-C^* T (\mu_1 - \mu_2)^2\bigr)$. ◼ Minimax and Bayes analysis. • Probability of misidentification $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})$: the rate of convergence to zero is still exponential (if the worst case is well defined). • Regret $\mathrm{Regret}_{\boldsymbol{\mu}}$: distributions whose gaps satisfy $\mu_1 - \mu_2 = O(1/\sqrt{T})$ dominate the regret, so the rate of convergence to zero is $O(1/\sqrt{T})$. → The analysis differs between the probability of misidentification and the regret.
  31. 34 Distribution-dependent, minimax, and Bayes analysis ◼ Define 𝛥𝜇 =

    $\Delta_{\mu} = \mu_1 - \mu_2$. ◼ For some constant C > 0, the regret $\mathrm{Regret}_{\boldsymbol{\mu}} = \Delta_{\mu}\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\mu})$ behaves as $\mathrm{Regret}_{\boldsymbol{\mu}} \approx \Delta_{\mu} \exp(-C T \Delta_{\mu}^2)$. → How $\Delta_{\mu}$ depends on T affects the evaluation of the regret. 1. If $\Delta_{\mu}$ converges to zero more slowly than $1/\sqrt{T}$: for some increasing function g, $\mathrm{Regret}_{\boldsymbol{\mu}} \approx \exp(-g(T))$. 2. If $\Delta_{\mu}$ converges to zero at the rate $1/\sqrt{T}$: for some C > 0, $\mathrm{Regret}_{\boldsymbol{\mu}} \approx C/\sqrt{T}$. 3. If $\Delta_{\mu}$ converges to zero faster than $1/\sqrt{T}$: $\mathrm{Regret}_{\boldsymbol{\mu}} = o(1/\sqrt{T})$. → In the worst case, $\mathrm{Regret}_{\boldsymbol{\mu}} \approx C/\sqrt{T}$, attained when $\Delta_{\mu} = O(1/\sqrt{T})$.
  32. 35 Minimax and Bayes optimal experiment for the regret Kato,

    M. Minimax and Bayes optimal best-arm identification: adaptive experimental design for treatment choice. (Unpublished; I will upload it soon.) • We design an experiment for treatment choice (best-arm identification). • The designed experiment is minimax and Bayes optimal for the regret. ◼ Our designed experiment consists of the following two elements: • Treatment-allocation phase: two-stage sampling. • In the first stage, we remove clearly suboptimal treatments and estimate the variances. • In the second stage, we allocate treatment arms with a variant of the Neyman allocation. • Decision-making phase: choice of the empirical best arm (Bubeck et al., 2011; Manski, 2004).
  33. 36 Minimax and Bayes optimal experiment for the regret ◼

    Treatment-allocation phase: we introduce the following two-stage sampling. • Split T into a first stage with rT rounds and a second stage with (1 − r)T rounds. 1. First stage: • Allocate each treatment equally, rT/K rounds per arm. • Using a concentration inequality, select a candidate set of best treatments $\hat{\mathcal{S}} \subseteq [K]$. • Obtain an estimator $\hat{\sigma}^2_{a,rT}$ of the variance σ_a^2 (the sample variance is fine). 2. Second stage (treatments $a \notin \hat{\mathcal{S}}$ are no longer allocated): • If $|\hat{\mathcal{S}}| = 1$, return the arm $a \in \hat{\mathcal{S}}$ as the estimate of the best arm. • If $|\hat{\mathcal{S}}| = 2$, allocate treatment $a \in \hat{\mathcal{S}}$ with probability $w_{a,t} := \hat{\sigma}_{a,rT} / \sum_{b \in \hat{\mathcal{S}}} \hat{\sigma}_{b,rT}$. • If $|\hat{\mathcal{S}}| \ge 3$, allocate treatment $a \in \hat{\mathcal{S}}$ with probability $w_{a,t} := \hat{\sigma}^2_{a,rT} / \sum_{b \in \hat{\mathcal{S}}} \hat{\sigma}^2_{b,rT}$.
  34. 37 Minimax and Bayes optimal experiment for the regret ◼

    Decision-making phase: • Return $\hat{a}_T = \arg\max_{a \in [K]} \hat{\mu}_{a,T}$, where $\hat{\mu}_{a,T} := \frac{1}{\sum_{t=1}^{T} 1[A_t = a]} \sum_{t=1}^{T} 1[A_t = a] Y_t$ is the sample mean. ✓ Remark: how do we select the candidate set $\hat{\mathcal{S}}$? • Let $v_{a,rT} := \sqrt{\frac{\log T}{T}}\, \hat{\sigma}_{a,rT}$ and $\hat{a}_{rT} = \arg\max_{a \in [K]} \hat{\mu}_{a,rT}$. • Construct lower and upper confidence bounds $\hat{\mu}_{a,rT} - v_{a,rT}$ and $\hat{\mu}_{a,rT} + v_{a,rT}$. • Define $\hat{\mathcal{S}} := \{a \in [K] : \hat{\mu}_{a,rT} + v_{a,rT} \ge \hat{\mu}_{\hat{a}_{rT},rT} - v_{\hat{a}_{rT},rT}\}$.
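A compact sketch of the full two-stage design described on the last two slides. The split ratio r = 0.3, the Gaussian arms in the usage example, and the small variance floor are illustrative assumptions; the confidence radius follows the slide's $\sqrt{\log T / T}\,\hat{\sigma}_a$ as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(2)

def two_stage_bai(sample_arm, K, T, r=0.3):
    """Two-stage design: uniform pilot, candidate elimination, then a variant of
    the Neyman allocation, ending with the empirical best arm.
    `sample_arm(a)` draws one outcome from arm a (user-supplied)."""
    rT = int(r * T)
    obs = {a: [] for a in range(K)}
    # First stage: allocate each arm equally and estimate means and variances.
    for t in range(rT):
        a = t % K
        obs[a].append(sample_arm(a))
    mu = {a: np.mean(obs[a]) for a in range(K)}
    sd = {a: max(np.std(obs[a], ddof=1), 1e-6) for a in range(K)}
    v = {a: np.sqrt(np.log(T) / T) * sd[a] for a in range(K)}   # confidence radius
    leader = max(range(K), key=lambda a: mu[a])
    S = [a for a in range(K) if mu[a] + v[a] >= mu[leader] - v[leader]]
    # Second stage: allocate only within the candidate set S.
    if len(S) > 1:
        if len(S) == 2:
            w = np.array([sd[a] for a in S])        # standard-deviation ratio
        else:
            w = np.array([sd[a] ** 2 for a in S])   # variance ratio
        w = w / w.sum()
        for _ in range(T - rT):
            a = int(rng.choice(S, p=w))
            obs[a].append(sample_arm(a))
    # Decision: empirical best arm over all observations.
    mu = {a: np.mean(obs[a]) for a in range(K)}
    return max(range(K), key=lambda a: mu[a])

# Usage with assumed Gaussian arms (means and standard deviations are illustrative).
means, sds = [0.0, 0.1, 0.3], [1.0, 1.0, 0.5]
print("estimated best arm:", two_stage_bai(lambda a: rng.normal(means[a], sds[a]), K=3, T=3000))
```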
  35. 38 Minimax and Bayes optimal experiment for the regret ◼

    For the designed experiment, we can prove minimax and Bayes optimality. • Let ℰ be the set of consistent experiments: for any δ ∈ ℰ, $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}^{\delta}_T \neq a^*_{\boldsymbol{\mu}}) \to 0$. ➢ Minimax optimality: $\lim_{T\to\infty} \sup_{\boldsymbol{\mu} \in \mathbb{R}^K} \sqrt{T}\, \mathrm{Regret}_{\boldsymbol{\mu}}(\delta^{\mathrm{TS\text{-}EBA}}) \le \max\Bigl\{ \tfrac{1}{\sqrt{e}} \max_{a \neq b} (\sigma_a + \sigma_b),\ \sqrt{\tfrac{K-1}{K} \sum_{a \in [K]} \sigma_a^2 \log K} \Bigr\} \le \inf_{\delta \in \mathcal{E}} \lim_{T\to\infty} \sup_{\boldsymbol{\mu} \in \mathbb{R}^K} \sqrt{T}\, \mathrm{Regret}_{\boldsymbol{\mu}}(\delta)$. ➢ Bayes optimality: $\lim_{T\to\infty} \sqrt{T} \int_{\mathbb{R}^K} \mathrm{Regret}_{\boldsymbol{\mu}}(\delta^{\mathrm{TS\text{-}EBA}})\, \mathrm{d}\Pi(\boldsymbol{\mu}) \le 4 \sum_{a \in [K]} \int_{\mathbb{R}^{K-1}} \sigma^{2*}_{\setminus a}\, h_a(\mu^{*}_{\setminus a} \mid \boldsymbol{\mu}_{\setminus a})\, \mathrm{d}H_{\setminus a}(\boldsymbol{\mu}_{\setminus a}) \le \inf_{\delta \in \mathcal{E}} \lim_{T\to\infty} \sqrt{T} \int_{\mathbb{R}^K} \mathrm{Regret}_{\boldsymbol{\mu}}(\delta)\, \mathrm{d}\Pi(\boldsymbol{\mu})$, where $\sigma^{2*}_{\setminus a} = \sigma^2_{b^*}$ and $\mu^{*}_{\setminus a} = \mu_{b^*}$ with $b^* = \arg\max_{c \in [K] \setminus \{a\}} \mu_c$, $\boldsymbol{\mu}_{\setminus a}$ is the parameter vector with μ_a removed, and $H_{\setminus a}(\boldsymbol{\mu}_{\setminus a})$ is the prior of $\boldsymbol{\mu}_{\setminus a}$.
  36. 39 Intuition ◼ The regret satisfies

    $\mathrm{Regret}_{\boldsymbol{\mu}} = \mathbb{E}_{\boldsymbol{\mu}}[Y_{a^*_{\mu}}] - \mathbb{E}_{\boldsymbol{\mu}}[Y_{\hat{a}_T}] = \sum_{b \neq a^*_{\mu}} (\mu_{a^*_{\mu}} - \mu_b)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b)$, with $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b) \approx \exp\bigl(-C_1 T (\mu_{a^*_{\mu}} - \mu_b)^2\bigr)$. • If $\mu_{a^*_{\mu}} - \mu_b$ converges to zero more slowly than $1/\sqrt{T}$, then $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b) \to 0$ (such arms leave $\hat{\mathcal{S}}_{rT}$). • If $\mu_{a^*_{\mu}} - \mu_b = C_2/\sqrt{T}$, then $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = b)$ remains of constant order (such arms stay in $\hat{\mathcal{S}}_{rT}$). (Slide figure: five treatments with means μ_1, …, μ_5; only arms within a $C_2/\sqrt{T}$ band of the best mean remain in $\hat{\mathcal{S}}_{rT}$.)
  37. 40 Intuition ◼ When two treatments are given, we allocate

    them so as to maximize the discrepancy between their confidence intervals. ◼ When multiple treatments are given, there is no unique way to compare their confidence intervals. → It depends on the performance measure and on how uncertainty is evaluated. → This task is closely related to making the variance of ATE estimators smaller. → Use the Neyman allocation. (Slide figure: confidence intervals for two treatments, where the discrepancy is maximized, versus three treatments.)
  38. 41 Proof of the lower bound ◼ For simplicity, consider only

    binary treatments, where μ_1 > μ_2. • The regret is given by $\mathrm{Regret}_{\boldsymbol{\mu}} = (\mu_1 - \mu_2)\, \mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = 2)$. ◼ As I explain later, using bandit techniques (Kaufmann et al., 2016), we can derive the following lower bound for the probability of misidentification: $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T = 2) \ge \exp\Bigl(-\frac{T (\mu_1 - \mu_2)^2}{2(\sigma_1^2/w_1 + \sigma_2^2/w_2)}\Bigr) + o(1)$. ◼ Therefore, $\mathrm{Regret}_{\boldsymbol{\mu}} \ge (\mu_1 - \mu_2) \exp\Bigl(-\frac{T(\mu_1 - \mu_2)^2}{2(\sigma_1^2/w_1 + \sigma_2^2/w_2)}\Bigr) + o(1)$. ◼ Letting $\mu_1 - \mu_2 = \sqrt{(\sigma_1^2/w_1 + \sigma_2^2/w_2)/T}$ and optimizing over $w_a$, we obtain $\mathrm{Regret}_{\boldsymbol{\mu}} \ge (\sigma_1 + \sigma_2)/\sqrt{eT}$.
  39. 42 Bayes optimal experiment for the simple regret ◼ Our

    designed experiment and theory are applicable to other distributions, e.g., Bernoulli distributions. Komiyama, J., Ariu, K., Kato, M., and Qin, C. (2023). Rate-optimal Bayesian simple regret in best arm identification. Mathematics of Operations Research. • A Bayes optimal experiment for treatment choice with Bernoulli distributions. • Komiyama et al. (2023) corresponds to a simplified case of our new results: in that study we only considered Bernoulli bandits, and the lower bound is not tight (there is a constant gap between the upper and lower bounds). • In the Bernoulli case, the Neyman allocation reduces to uniform sampling, so there is no need to estimate the variances.
  40. 43 Related literature ◼ Ordinal optimization: The best-arm identification has

    been considered as ordinal optimization. In that setting, the ideal allocation probability is assumed to be known. • Chen (2000) considers Gaussian outcomes; Glynn and Juneja (2004) consider more general distributions. ◼ Best-arm identification: • Audibert, Bubeck, and Munos (2010) establish the best-arm identification framework, in which the ideal allocation probability is unknown and must be estimated. • Bubeck, Munos, and Stoltz (2011) propose a minimax-rate-optimal algorithm.
  41. 44 Related literature ◼ Limit-of-experiment (or local asymptotic normality) framework

    for experimental design: tools for developing experiments with matching lower and upper bounds by assuming locality of the underlying distributions. • Hirano and Porter (2008) introduce this framework for treatment choice. • Armstrong (2021) and Hirano and Porter (2023) apply the framework to adaptive experiments. • Adusumilli (2025) proposes adaptive experimental designs based on the limit-of-experiment framework and diffusion-process theory. ➢ I guess that we can derive tight lower and upper bounds without using the limit-of-experiment framework: • 1/√T regimes (locality) directly appear as a result of the worst case for the regret.
  42. 46 Policy learning ◼ Treatment choice with covariates 𝑋𝑡 .

    ◼ Two goals: • Choosing the marginalized best arm $a_P^* = \arg\max_{a \in [K]} \mu_a$, where $\mu_a = \mathbb{E}_P[Y_a]$. • Choosing the conditional best arm $a_P^*(X) = \arg\max_{a \in [K]} \mu_a(X)$, where $\mu_a(X) = \mathbb{E}_P[Y_a \mid X]$. ◼ The second task is called policy learning, which has two main approaches: • Plug-in policy: estimate μ_a(X) and choose $\hat{a}_T(X) = \arg\max_{a \in [K]} \hat{\mu}_{a,T}(X)$. • Empirical welfare maximization (EWM): train a policy by minimizing a counterfactual risk. ◼ This talk focuses on adaptive experimental design for policy learning via the EWM approach.
  43. 47 Setup ◼ In the EWM approach, we train a

    policy function π: [K] × 𝒳 → [0,1]. • For all x ∈ 𝒳, $\sum_{a=1}^{K} \pi_a(x) = 1$ holds. • A policy decides how we recommend a treatment to the future population. • It corresponds to a classifier in machine learning. ◼ Π is a set of policy functions π: [K] × 𝒳 → [0,1], e.g., linear models, neural networks, and random forests. ◼ Welfare of a policy π ∈ Π: $L(\pi) := \mathbb{E}_P\Bigl[\sum_{a \in [K]} \pi_a(X) Y_a\Bigr]$.
  44. 48 Setup ◼ Goal. Obtain an optimal policy defined as

    $\pi^* := \arg\max_{\pi \in \Pi} L(\pi) = \arg\max_{\pi \in \Pi} \mathbb{E}_P\Bigl[\sum_{a \in [K]} \pi_a(X) Y_a\Bigr]$. ◼ EWM approach. • Estimate the welfare L(π) by $\hat{L}(\pi)$. • Train a policy as $\hat{\pi} := \arg\max_{\pi \in \Pi} \hat{L}(\pi)$. ◼ Regret. The regret for this problem is defined as $\mathrm{Regret}_P := \mathbb{E}_P\Bigl[\sum_{a \in [K]} \pi_a^*(X) Y_a\Bigr] - \mathbb{E}_P\Bigl[\sum_{a \in [K]} \hat{\pi}_a(X) Y_a\Bigr]$.
  45. 49 Adaptive experimental design for policy learning Kato, M., Okumura,

    K., Ishihara, T., and Kitagawa, T. (2024). Adaptive experimental design for policy learning. See the proceedings of the ICML 2024 Workshop RLControlTheory (I will update arXiv next Tuesday). ◼ We propose a minimax rate-optimal experiment for policy learning. ◼ Let ℰ be a class of experiments δ, and let M be the VC or Natarajan dimension of the policy class Π. ◼ Lower bound: given the class of experiments ℰ, we have $\inf_{\delta \in \mathcal{E}} \lim_{T\to\infty} \sqrt{T}\, \mathrm{Regret}_P \ge \frac{1}{8}\mathbb{E}\Bigl[\sqrt{M \sum_{a \in [K]} \sigma_a^2(X)}\Bigr]$ if K ≥ 3, and $\ge \frac{1}{8}\mathbb{E}\Bigl[\sqrt{M (\sigma_1(X) + \sigma_2(X))^2}\Bigr]$ if K = 2.
  46. 50 Adaptive experimental design for policy learning ◼ Ideal treatment-allocation

    probability: • If K = 2, $w_a^*(X) = \frac{\sigma_a(X)}{\sigma_1(X) + \sigma_2(X)}$ (the ratio of the standard deviations). • If K ≥ 3, $w_a^*(X) = \frac{\sigma_a^2(X)}{\sum_{b \in [K]} \sigma_b^2(X)}$ (the ratio of the variances). ◼ Adaptive experiment for policy learning. • In each round t, we estimate w^* by $\hat{w}_t$ and allocate treatment with probability $\hat{w}_t$. • At the end, we estimate the welfare as $\hat{L}(\pi) = \frac{1}{T}\sum_{t=1}^{T} \sum_{a=1}^{K} \pi_a(X_t)\Bigl(\frac{1[A_t = a](Y_t - \hat{\mu}_{a,t}(X_t))}{\hat{w}_t(a \mid X_t)} + \hat{\mu}_{a,t}(X_t)\Bigr)$. • Then we train a policy as $\hat{\pi} = \arg\max_{\pi \in \Pi} \hat{L}(\pi)$.
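A minimal sketch of the EWM step on logged adaptive data. The array conventions (W[t, a] for the allocation probability used in round t, Mu[t, a] for the outcome-model estimate), the 1/T averaging, and the binary threshold policy class are illustrative assumptions for a scalar covariate and K = 2.

```python
import numpy as np

def welfare_hat(policy, X, A, Y, W, Mu):
    """AIPW-style estimate of the welfare L(pi), following the displayed estimator.
    policy(a, x) -> probability of recommending arm a at covariate x."""
    T, K = W.shape
    total = 0.0
    for t in range(T):
        for a in range(K):
            aipw = (A[t] == a) * (Y[t] - Mu[t, a]) / W[t, a] + Mu[t, a]
            total += policy(a, X[t]) * aipw
    return total / T

def threshold_policy(c):
    # Treat (arm 1) iff x >= c; otherwise arm 0. Assumes K = 2.
    return lambda a, x: float((x >= c) == (a == 1))

def ewm(X, A, Y, W, Mu, thresholds):
    """Empirical welfare maximization over a small, assumed class of threshold policies."""
    scores = {c: welfare_hat(threshold_policy(c), X, A, Y, W, Mu) for c in thresholds}
    return max(scores, key=scores.get)

# Usage: ewm(X, A, Y, W, Mu, thresholds=np.linspace(0, 1, 21)) returns the best threshold.
```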
  47. 51 Adaptive experimental design for policy learning ◼ Upper bound:

    For our designed experiment δ, there exist constants C_1, C_2 > 0 such that $\lim_{T\to\infty} \sqrt{T}\, \mathrm{Regret}_P(\delta) \le C_1\sqrt{M} + C_2\, \mathbb{E}\Bigl[\sqrt{M \sum_{a \in [K]} \sigma_a^2(X)}\Bigr]$ if K ≥ 3, and $\le C_1\sqrt{M} + C_2\, \mathbb{E}\Bigl[\sqrt{M (\sigma_1(X) + \sigma_2(X))^2}\Bigr]$ if K = 2. ◼ Rate optimality. • For example, if K = 2, the lower bound is $\frac{1}{8}\mathbb{E}\bigl[\sqrt{M}(\sigma_1(X) + \sigma_2(X))\bigr]$, while the upper bound is $C_1\sqrt{M} + C_2\,\mathbb{E}\bigl[\sqrt{M}(\sigma_1(X) + \sigma_2(X))\bigr]$. • Only the rates in T and σ_a(X) are optimal.
  48. 52 Comparison with the uniform allocation ◼ Upper bound under

    some w: • Suppose that K = 2 and consider allocating treatments with a fixed w. • There exist constants C_1, C_2 > 0 such that $\lim_{T\to\infty} \sqrt{T}\, \mathrm{Regret}_P(\delta) \le C_1\sqrt{M} + C_2\, \mathbb{E}\Bigl[\sqrt{M\Bigl(\frac{\sigma_1^2(X)}{w_1(X)} + \frac{\sigma_2^2(X)}{w_2(X)}\Bigr)}\Bigr]$. ◼ Uniform allocation (w_1(X) = w_2(X) = 1/2): $\lim_{T\to\infty} \sqrt{T}\, \mathrm{Regret}_P(\delta) \le C_1\sqrt{M} + C_2\, \mathbb{E}\Bigl[\sqrt{2M\bigl(\sigma_1^2(X) + \sigma_2^2(X)\bigr)}\Bigr]$. • Here, $\mathbb{E}\bigl[\sqrt{M(\sigma_1(X) + \sigma_2(X))^2}\bigr]$ (Neyman) $\le \mathbb{E}\bigl[\sqrt{2M(\sigma_1^2(X) + \sigma_2^2(X))}\bigr]$ (uniform).
  49. 53 Open questions ◼ We only showed rate optimality

    in the sample size T and the variances σ_a^2. • Between the upper and lower bounds, there exists a large constant gap independent of the variance σ_a^2(X) and T, so the result is not so tight. In contrast, the previous experiments for ATE estimation and treatment choice have matching lower and upper bounds, including the constant terms, leaving no room for improvement. • I guess that this is due to the looseness of the empirical-process arguments behind EWM. ◼ Can we implement an arm-elimination pilot phase, as in our previously designed minimax and Bayes optimal best-arm identification? (Chernozhukov, Lee, Rosen, and Sun, 2025?) • Finite-sample confidence intervals are required for estimators of μ_a(x).
  50. 54 Related literature ◼ EWM approach. • Swaminathan and Joachims

    (2015), Kitagawa and Tetenov (2018), Athey and Wager (2021) develop a basic framework for policy learning. • Zhou, Athey, and Wager (2018) shows a regret upper bound when multiple treatments are given. • Zhan, Ren, Athey, and Zhou (2022) investigates policy learning with adaptively collected data.
  51. 56 Difficulty in best-arm identification ◼ Why do we consider minimax

    or Bayes optimality for treatment choice? → One could instead consider optimality under each fixed distribution. • Kasy and Sautmann (2021) claim to have developed such a result. • But their proof had several issues. → Through the investigation of those issues, various impossibility theorems have been shown. ◼ Impossibility theorems: there is no algorithm that is optimal for every distribution. • Kaufmann (2020). • Ariu, Kato, Komiyama, McAlinn, and Qin (2021). • Degenne (2023). • Wang, Ariu, and Proutiere (2024). • Imbens, Qin, and Wager (2025).
  52. 57 Bandit lower bound ◼ Lower bounds in the bandit

    problem are derived via information theory. ◼ The following lemma is one of the most general and tight results for lower bounds. Transportation lemma (Lai and Robbins, 1985; Kaufmann et al., 2016). • Let P and Q be two bandit models with K arms such that, for all a, the distributions P_a and Q_a of Y_a are mutually absolutely continuous. • Then $\sum_{a \in [K]} \mathbb{E}_P\Bigl[\sum_{t=1}^{T} 1[A_t = a]\Bigr]\, \mathrm{KL}(P_a, Q_a) \ge \sup_{\mathcal{E} \in \mathcal{F}_T} d\bigl(\mathbb{P}_P(\mathcal{E}), \mathbb{P}_Q(\mathcal{E})\bigr)$, where $d(x, y) := x \log\frac{x}{y} + (1 - x)\log\frac{1-x}{1-y}$ is the binary relative entropy, with the convention d(0,0) = d(1,1) = 0.
  53. 58 Example: regret lower bound in regret minimization ◼ Lai

    and Robbins (1985) develops a lower bound for the regret minimization problem. • Regret minimization problem. • Goal. Maximize the cumulative outcome $\sum_{t=1}^{T} Y_{A_t,t}$. • The (in-sample) regret is defined as $\mathrm{Regret}_P := \mathbb{E}\bigl[\sum_{t=1}^{T} Y_{a_P^*, t} - \sum_{t=1}^{T} Y_{A_t,t}\bigr]$. • Under each distribution $P_0$ and for large T, consistent algorithms satisfy $\sum_{a \in [K]} \mathbb{E}_{P_0}\Bigl[\sum_{t=1}^{T} 1[A_t = a]\Bigr]\, \mathrm{KL}(P_{a,0}, Q_a) \gtrsim \log T$ for every $Q \in \mathrm{Alt}(P_0) := \{Q \in \mathcal{P} : a_Q^* \neq a_{P_0}^*\}$, which translates into the familiar log T regret lower bound. ◼ Kaufmann's lower bound is a generalization of Lai and Robbins' lower bound. • It is known that Kaufmann's lower bound can yield lower bounds for various bandit problems. • "Can we develop a lower bound for best-arm identification using Kaufmann's lemma?"
  54. 59 Lower bound in the fixed-budget setting ◼ A direct

    application of Kaufmann's lemma to best-arm identification yields the following bound. ◼ Lower bound (conjecture): • Under each distribution $P_0$ of the data-generating process, $\mathbb{P}_{P_0}(\hat{a}_T \neq a^*_{P_0}) \ge \inf_{Q \in \mathrm{Alt}(P_0)} \exp\Bigl(-\sum_{a \in [K]} \mathbb{E}_Q\Bigl[\sum_{t=1}^{T} 1[A_t = a]\Bigr] \mathrm{KL}(Q_a, P_{a,0})\Bigr)$. • $\frac{1}{T}\mathbb{E}_Q\bigl[\sum_{t=1}^{T} 1[A_t = a]\bigr]$ corresponds to a treatment-allocation probability (ratio) under Q. • Denoting it by $w_a$, we can rewrite the conjecture as $\mathbb{P}_{P_0}(\hat{a}_T \neq a^*_{P_0}) \ge \inf_{Q \in \mathrm{Alt}(P_0)} \exp\Bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{0,a})\Bigr)$. • Note that $w_a = \frac{1}{T}\mathbb{E}_Q\bigl[\sum_{t=1}^{T} 1[A_t = a]\bigr]$ is the treatment-allocation probability under Q, not under $P_0$.
  55. 60 Optimal design in the fixed-budget setting ◼ Question: Does

    there exist an algorithm whose probability of misidentification $\mathbb{P}_{P_0}(\hat{a}_T \neq a^*_{P_0})$ exactly matches the following lower bound? $\inf_{Q \in \mathrm{Alt}(P_0)} \exp\Bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{0,a})\Bigr)$. → Answer: No (without additional assumptions). • When there are two treatments and the outcomes follow Gaussian distributions with known variances, such an optimal algorithm exists: the Neyman allocation. • In more general cases, no such algorithm exists.
  56. 61 Neyman allocation is optimal in two-armed bandits ◼ Consider

    two treatments (K = 2) and assume that the variances are known. ◼ The lower bound can be computed as $\inf_{Q \in \mathrm{Alt}(P_0)} \exp\Bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{0,a})\Bigr) \ge \exp\Bigl(-T\, \frac{(\mathbb{E}_{P_0}[Y_1] - \mathbb{E}_{P_0}[Y_2])^2}{(\sigma_1 + \sigma_2)^2}\Bigr)$. ◼ An asymptotically optimal algorithm is the Neyman allocation (Kaufmann et al., 2016): • Allocate treatments in the ratio of the standard deviations. • At the end of the experiment, recommend the treatment with the highest sample mean as the best treatment. ◼ $(\sigma_1 + \sigma_2)^2$ is the asymptotic variance of the ATE estimator under the Neyman allocation.
  57. 62 Optimal algorithm in the multi-armed bandits. Kasy and Sautmann

    (KS) (2021) ◼ Propose the Exploration Sampling (ES) algorithm, a variant of Top-Two Thompson sampling (Russo, 2016). ◼ They show that under the ES algorithm, for each Bernoulli distribution $P_0$ independent of T, it holds that $\mathrm{Regret}_{P_0} \le \inf_{w} \inf_{Q \in \mathrm{Alt}(P_0)} \exp\Bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\Bigr)$. → They claim that their algorithm is optimal for Kaufmann's lower bound $\mathbb{P}_{P_0}(\hat{a}_T \neq a^*_{P_0}) \ge \inf_{Q \in \mathrm{Alt}(P_0)} \exp\Bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{0,a})\Bigr)$.
  58. 63 Impossibility theorem in the multi-armed bandits ◼ Issues in

    KS (Ariu et al., 2025). • The KL divergence is flipped: • KS (incorrect): $\mathrm{Regret}_{P_0} \le \inf_{w} \inf_{Q \in \mathrm{Alt}(P_0)} \exp\bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\bigr)$. • Ours (correct): $\mathrm{Regret}_{P_0} \le \inf_{w} \inf_{Q \in \mathrm{Alt}(P_0)} \exp\bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(P_{a,0}, Q_a)\bigr)$. • There exists a distribution under which no algorithm can attain the conjectured bound Γ*. Impossibility theorem: there exists a distribution $P_0 \in \mathcal{P}$ under which $\mathrm{Regret}_{P_0} \ge \inf_{w} \inf_{Q \in \mathrm{Alt}(P_0)} \exp\bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\bigr)$.
  59. 64 Why? ◼ Why does the conjectured lower bound not

    work? ◼ There are several technical issues. 1. We cannot compute an ideal treatment-allocation probability from the conjectured lower bound. • Recall that the lower bound is $\mathbb{P}_{P_0}(\hat{a}_T \neq a^*_{P_0}) \ge V^* = \inf_{Q \in \mathrm{Alt}(P_0)} \exp\bigl(-T \sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\bigr)$, where $w_a = \frac{1}{T}\mathbb{E}_Q\bigl[\sum_{t=1}^{T} 1[A_t = a]\bigr]$ depends on Q. → $\min_{Q}\min_{w} \exp\bigl(-T\sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\bigr) \neq \inf_{w}\inf_{Q \in \mathrm{Alt}(P_0)} \exp\bigl(-T\sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\bigr)$. → We cannot obtain an ideal allocation probability from this lower bound. 2. There exists $P_0$ such that $\mathbb{P}_{P_0}(\hat{a}_T \neq a^*_{P_0}) \ge \inf_{w}\inf_{Q \in \mathrm{Alt}(P_0)} \exp\bigl(-T\sum_{a \in [K]} w_a\, \mathrm{KL}(Q_a, P_{a,0})\bigr)$.
  60. 65 Uncertainty evaluations ◼ Impossibility theorems for distribution-dependent analysis: •

    There exists a distribution under which we can derive a lower bound larger than the one conjectured from Kaufmann's lemma. • If we aim to develop an experiment that is optimal for every distribution, a counterexample can be found. ◼ Two solutions: • Restrict the distribution class. • Use minimax or Bayes optimality. (Slide diagram: for a set of distributions 𝒫, the performance measure can be evaluated in a distribution-dependent way, in the worst case, or in the Bayes sense, i.e., averaged and weighted by a prior.)
  61. 66 Locally minimax optimal experiment for $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})$

    Kato, M. (2024). Generalized Neyman allocation for locally minimax optimal best-arm identification. Preprint. ◼ A locally minimax, asymptotically optimal experiment for $\mathbb{P}_{\boldsymbol{\mu}}(\hat{a}_T \neq a^*_{\boldsymbol{\mu}})$. ◼ Generalized Neyman allocation (GNA): set the ideal treatment-allocation probability as $w^*_{a^*_{\mu}} = \frac{\sigma_{a^*_{\mu}}}{\sigma_{a^*_{\mu}} + \sqrt{\sum_{c \in [K] \setminus \{a^*_{\mu}\}} \sigma_c^2}}, \qquad w^*_a = \bigl(1 - w^*_{a^*_{\mu}}\bigr)\frac{\sigma_a^2}{\sum_{c \in [K] \setminus \{a^*_{\mu}\}} \sigma_c^2} \quad \text{for all } a \in [K] \setminus \{a^*_{\mu}\}$. ◼ GNA-A2IPW experiment: estimate the GNA during the experiment and return the best arm using the A2IPW estimator. ◼ Local minimax optimality: $\sup_{\mu \in \Theta_{\Delta}} \limsup_{T\to\infty} \mathbb{P}_{\mu}\bigl(\hat{a}_T^{\delta^{\mathrm{GNA\text{-}A2IPW}}} \neq a^*_{\mu}\bigr) \le \min_{a} \exp\Bigl(-\frac{\Delta^2 T}{2\bigl(\sigma_a + \sqrt{\sum_{b \in [K] \setminus \{a\}} \sigma_b^2}\bigr)^2}\Bigr) \le \inf_{\delta \in \mathcal{E}} \sup_{\mu \in \Theta_{\Delta}} \liminf_{T\to\infty} \mathbb{P}_{\mu}\bigl(\hat{a}_T^{\delta} \neq a^*_{\mu}\bigr)$.
  62. 67 Related literature • Glynn and Juneja (2004) develops the

    large-deviation optimal experiments for treatment choice. • Kasy and Sautmann (2021) show that Russo (2016)'s Top-Two Thompson sampling is optimal in the sense that its upper bound matches the bound in Glynn and Juneja (2004). • Kaufmann, Cappé, and Garivier (2016) develop a general lower bound for the bandit problem. ◼ Impossibility theorems. • Ariu, Kato, Komiyama, McAlinn, and Qin (2025) find a counterexample to Kasy and Sautmann (2021). • Degenne (2023) shows that the bounds in Glynn and Juneja (2004) are a special case of Kaufmann's bounds under strong assumptions. • Wang, Ariu, and Proutiere (2024) and Imbens, Qin, and Wager (2025) also derive counterexamples.
  63. 69 Concluding remarks ◼ Adaptive experimental design for efficient ATE

    estimation and treatment choice. ◼ ATE estimation. • Goal. Estimation of the ATE with a smaller asymptotic variance. • Lower bound. The efficiency bound (Hahn, 1998). • Ideal treatment-allocation probability: the Neyman allocation.
  64. 70 Concluding remarks ◼ Treatment choice (BAI). • Goal. Choosing

    the best treatment arm. • A lower bound and an ideal treatment-allocation probability depend on the uncertainty. • Distribution-dependent analysis: • No globally optimal algorithm exists for Kaufmann’s lower bound. • Impossibility theorems: there exists a distribution under which a lower bound is larger than Kaufmann’s one. • Locally optimal algorithm. • Minimax and Bayes analysis
  65. 71 Concluding remarks ◼ Policy learning. • We can develop

    a rate-optimal experiment for policy learning. • The convergence rates in the variances and the sample size T match the lower bound, but other constant terms remain. • Tighter lower and upper bounds? • We may need to reconsider the EWM approach.
  66. 72 Reference • Masahiro Kato, Takuya Ishihara, Junya Honda, and

    Yusuke Narita. Efficient adaptive experimental design for average treatment effect estimation, 2020. arXiv:2002.05308. • Masahiro Kato, Kenichiro McAlinn, and Shota Yasui. The adaptive doubly robust estimator and a paradox concerning logging policy. In International Conference on Neural Information Processing Systems (NeurIPS), 2021. • Kaito Ariu, Masahiro Kato, Junpei Komiyama, Kenichiro McAlinn, and Chao Qin. A comment on “adaptive treatment assignment in experiments for policy choice”, 2021. • Masahiro Kato. Generalized Neyman allocation for locally minimax optimal best-arm identification, 2024a. arXiv:2405.19317. • Masahiro Kato. Locally optimal fixed-budget best arm identification in two-armed Gaussian bandits with unknown variances, 2024b. arXiv:2312.12741. • Masahiro Kato and Kaito Ariu. The role of contextual information in best arm identification, 2021. Accepted to the Journal of Machine Learning Research conditional on minor revisions. • Masahiro Kato, Akihiro Oga, Wataru Komatsubara, and Ryo Inokuchi. Active adaptive experimental design for treatment effect estimation with covariate choice. In International Conference on Machine Learning (ICML), 2024a. • Masahiro Kato, Kyohei Okumura, Takuya Ishihara, and Toru Kitagawa. Adaptive experimental design for policy learning, 2024b. arXiv:2401.03756. • Junpei Komiyama, Kaito Ariu, Masahiro Kato, and Chao Qin. Rate-optimal Bayesian simple regret in best arm identification. Mathematics of Operations Research, 2023.
  67. 73 Reference • van der Vaart, A. (1998), Asymptotic Statistics,

    Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. • Tabord-Meehan, M. (2022), “Stratification Trees for Adaptive Randomization in Randomized Controlled Trials,” The Review of Economic Studies. • van der Laan, M. J. (2008), “The Construction and Analysis of Adaptive Group Sequential Designs,” https://biostats.bepress.com/ucbbiostat/paper232. • Neyman, J. (1923), “Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes,” Statistical Science, 5, 463–472. • Neyman, J. (1934), “On the Two Different Aspects of the Representative Method: the Method of Stratified Sampling and the Method of Purposive Selection,” Journal of the Royal Statistical Society, 97, 123–150. • Manski, C. F. (2002), “Treatment choice under ambiguity induced by inferential problems,” Journal of Statistical Planning and Inference, 105, 67–82. • Manski (2004), “Statistical Treatment Rules for Heterogeneous Populations,” Econometrica, 72, 1221–1246.
  68. 74 Reference • Kitagawa, T. and Tetenov, A. (2018), “Who

    Should Be Treated? Empirical Welfare Maximization Methods for Treatment Choice,” Econometrica, 86, 591–616. • Garivier, A. and Kaufmann, E. (2016), “Optimal Best Arm Identification with Fixed Confidence,” in Conference on Learning Theory. • Glynn, P. and Juneja, S. (2004), “A large deviations perspective on ordinal optimization,” in Proceedings of the 2004 Winter Simulation Conference, IEEE, vol. 1. • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal. • Degenne, R. (2023), “On the Existence of a Complexity in Fixed Budget Bandit Identification,” Conference on Learning Theory (COLT). • Kasy, M. and Sautmann, A. (2021), “Adaptive Treatment Assignment in Experiments for Policy Choice,” Econometrica, 89, 113– 132. • Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology.