Masahiro Kato, Mizuho-DL Financial Technology Co., Ltd.
Kyohei Okumura, Northwestern University
Takuya Ishihara, Tohoku University
Toru Kitagawa, Brown University

1. Contribution
✓ New framework: contextual best arm identification (BAI).
✓ Algorithm: adaptive experimental design for policy learning.
✓ Results: minimax optimality of the proposed algorithm.

2. Problem Setting
◆ Contextual BAI with a fixed budget.
■ $K$ treatment arms $[K] = \{1, 2, \dots, K\}$.
  • Treatment arms: alternatives such as medicines, policies, or advertisements.
  • Each arm $a \in [K]$ has a potential outcome $Y_a \in \mathbb{R}$.
■ Context $X \in \mathcal{X} \subseteq \mathbb{R}^d$.
■ Policy $\pi \colon [K] \times \mathcal{X} \to [0, 1]$ such that $\sum_{a \in [K]} \pi(a \mid x) = 1$ for all $x \in \mathcal{X}$.
  • A policy is a function that returns an estimated best arm given a context $x$.
  • Policy class $\Pi$: a set of policies $\pi$ (e.g., neural networks).
■ Distribution $P$ of $(Y_1, \dots, Y_K, X)$.
  • $\mathcal{P}$: a set of distributions $P$.
  • $\mathbb{E}_P$ and $\mathbb{P}_P$: the expectation and probability law under $P$.
  • $\mu_a(P) = \mathbb{E}_P[Y_a]$ and $\mu_a(P)(x) = \mathbb{E}_P[Y_a \mid X = x]$.
■ Policy value $Q(\pi)(P) := \mathbb{E}_P\big[\sum_{a \in [K]} \pi(a \mid X)\, \mu_a(P)(X)\big]$.
  • The expected outcome under a policy $\pi$ and a distribution $P$.
■ Optimal policy $\pi^*(P) \in \arg\max_{\pi \in \Pi} Q(\pi)(P)$.
■ Experimenter's interest: training $\pi^*(P)$ using experimental data.
■ Adaptive experiment with $T$ rounds (a minimal simulation sketch follows this section):
  • In each round $t \in [T] = \{1, 2, \dots, T\}$, an experimenter
    1. observes a $d$-dimensional context $X_t \in \mathcal{X} \subseteq \mathbb{R}^d$,
    2. assigns a treatment $A_t \in [K]$, and
    3. observes an outcome $Y_t = \sum_{a \in [K]} \mathbb{1}[A_t = a]\, Y_{a,t}$.
  • After the final round $T$, the experimenter trains a policy $\pi$.
  • Denote the trained policy by $\widehat{\pi} \in \Pi$.
■ AS: adaptive sampling during the experiment.
■ PL: policy learning at the end of the experiment.
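To make the interaction protocol concrete, here is a minimal Python sketch of an adaptive experiment on a synthetic distribution $P$. The Gaussian outcome model, the particular mean and standard-deviation functions, and the uniform placeholder assignment rule are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, T = 3, 2, 1000  # arms, context dimension, budget (illustrative values)

def sample_context():
    """Draw a context X_t in R^d (assumed standard normal here)."""
    return rng.normal(size=d)

def conditional_mean(a, x):
    """mu_a(P)(x): assumed linear in the context for this toy example."""
    beta = np.linspace(-1.0, 1.0, K)
    return beta[a] * x.sum()

def conditional_std(a, x):
    """sigma_a(x): assumed heteroscedastic, arm- and context-dependent."""
    return 0.5 + 0.5 * (a + 1) / K * (1.0 + np.tanh(x[0]) ** 2)

def sample_outcome(a, x):
    """Potential outcome Y_{a,t} ~ N(mu_a(x), sigma_a(x)^2)."""
    return rng.normal(conditional_mean(a, x), conditional_std(a, x))

# Adaptive experiment: in each round, observe context, assign arm, observe outcome.
history = []
for t in range(T):
    x_t = sample_context()
    # Placeholder assignment rule (uniform); PLAS replaces this with an
    # estimate of the target ratio w*(a | x_t) -- see Section 4.
    a_t = rng.integers(K)
    y_t = sample_outcome(a_t, x_t)
    history.append((x_t, a_t, y_t))
```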
3. Performance Measure
■ Performance measure of $\widehat{\pi}$: the expected simple regret.
  • Simple regret: $r_P(\widehat{\pi})(x) = \mu_{\pi^*(P)(x)}(P)(x) - \mu_{\widehat{\pi}(x)}(P)(x)$ under $P$.
  • Expected simple regret: $R_P(\widehat{\pi}) := \mathbb{E}_P\big[r_P(\widehat{\pi})(X)\big] = Q(\pi^*(P))(P) - Q(\widehat{\pi})(P)$.
■ Our goal:
  • Design an algorithm (strategy) for the experimenter.
  • Under the strategy, the experimenter trains $\widehat{\pi}$ while minimizing the expected simple regret $\mathbb{E}_P\big[r_P(\widehat{\pi})(X)\big]$.
⇒ We design the Adaptive-Sampling (AS) and Policy-Learning (PL) strategy, PLAS, and show its minimax optimality for $R_P(\widehat{\pi})$.
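As a small worked example of the policy value and the simple regret (the conditional means below are made up purely to illustrate the definitions):

```python
import numpy as np

# Two contexts (equally likely), two arms; mu[x, a] = mu_a(P)(x).
p_x = np.array([0.5, 0.5])
mu = np.array([[1.0, 0.0],    # context x = 0
               [0.2, 0.8]])   # context x = 1

def policy_value(pi):
    """Q(pi)(P) = E_P[ sum_a pi(a|X) mu_a(P)(X) ] for pi[x, a] = pi(a|x)."""
    return float(np.sum(p_x[:, None] * pi * mu))

pi_star = np.eye(2)[mu.argmax(axis=1)]       # best arm for each context
pi_hat = np.array([[1.0, 0.0], [1.0, 0.0]])  # a policy that always plays arm 0

regret = policy_value(pi_star) - policy_value(pi_hat)   # R_P(pi_hat)
print(policy_value(pi_star), policy_value(pi_hat), regret)  # 0.9, 0.6, 0.3
```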
4. PLAS Strategy
◆ Procedure of the PLAS strategy.
  • Define the target treatment-assignment ratio as follows:
    - $K = 2$: $w^*(a \mid x) = \dfrac{\sigma_a(x)}{\sigma_1(x) + \sigma_2(x)}$ for all $a \in [2]$ and all $x \in \mathcal{X}$.
    - $K \ge 3$: $w^*(a \mid x) = \dfrac{(\sigma_a(x))^2}{\sum_{b \in [K]} (\sigma_b(x))^2}$ for all $a \in [K]$ and all $x \in \mathcal{X}$,
    where $(\sigma_a(x))^2$ is the conditional variance of $Y_a$ given $x$.
  ⇒ Our strategy adaptively estimates this ratio and assigns arms following the estimated probabilities (a sampling-rule sketch follows the (AS) steps below).
  • (AS) In each round $t \in [T]$, the experimenter
    1. observes $X_t$,
    2. estimates $w^*(a \mid X_t)$ by replacing $(\sigma_a(X_t))^2$ with its estimator $(\widehat{\sigma}_a(X_t))^2$, and
    3. assigns an arm $A_t$ following the probability $\widehat{w}_t(a \mid X_t)$, an estimator of $w^*(a \mid X_t)$.
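A minimal sketch of the (AS) step. How $(\sigma_a(X_t))^2$ is estimated is not specified here, so the context-free per-arm sample statistics below are purely an assumption of this sketch; the strategy itself conditions the estimate on $X_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

# Running sufficient statistics per arm for crude standard-deviation estimates.
counts = np.ones(K)      # start from one pseudo-pull to avoid division by zero
sums = np.zeros(K)
sq_sums = np.ones(K)     # prior guess: unit second moment

def estimated_std():
    mean = sums / counts
    var = np.maximum(sq_sums / counts - mean**2, 1e-6)
    return np.sqrt(var)

def target_ratio(sigma):
    """Target assignment ratio built from estimated standard deviations."""
    if K == 2:
        return sigma / sigma.sum()        # Neyman-type ratio for two arms
    return sigma**2 / np.sum(sigma**2)    # variance-proportional for K >= 3

def assign_arm(x_t):
    """(AS) step: estimate w*(. | x_t) and draw A_t from it."""
    w_hat = target_ratio(estimated_std())
    return rng.choice(K, p=w_hat), w_hat

def update(a_t, y_t):
    """Update the per-arm statistics after observing (A_t, Y_t)."""
    counts[a_t] += 1
    sums[a_t] += y_t
    sq_sums[a_t] += y_t**2
```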
  • (PL) At the end of the experiment, the experimenter trains $\widehat{\pi}$ as
$$\widehat{\pi}_{\mathrm{PLAS}} \in \arg\max_{\pi \in \Pi} \sum_{a \in [K]} \frac{1}{T} \sum_{t \in [T]} \pi(a \mid X_t) \left\{ \frac{\mathbb{1}[A_t = a]\,\big(Y_t - \widehat{\mu}_t(a \mid X_t)\big)}{\widehat{w}_t(a \mid X_t)} + \widehat{\mu}_t(a \mid X_t) \right\}.$$
  • $\widehat{w}_t(a \mid X_t)$ and $\widehat{\mu}_t(a \mid X_t)$ are estimators of $w^*(a \mid X_t)$ and $\mu_a(P)(X_t)$ constructed from the samples collected up to round $t$ (a policy-learning sketch follows).
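A minimal sketch of the (PL) step: build the per-round augmented inverse-propensity-weighted (AIPW) scores and maximize the empirical objective over a toy policy class. The threshold policy class, the two-arm restriction, and the input arrays are assumptions of this sketch, not the paper's setup.

```python
import numpy as np

def aipw_scores(X, A, Y, w_hat, mu_hat):
    """
    Per-round score matrix Gamma[t, a] such that
    Q_hat(pi) = mean_t sum_a pi(a | X_t) * Gamma[t, a].
    w_hat[t, a]  : estimated assignment probability used at round t.
    mu_hat[t, a] : estimated conditional mean of arm a at X_t from rounds < t.
    """
    T, K = w_hat.shape
    ind = np.zeros((T, K))
    ind[np.arange(T), A] = 1.0          # indicator 1[A_t = a]
    return ind * (Y[:, None] - mu_hat) / w_hat + mu_hat

def policy_value_estimate(pi_probs, scores):
    """pi_probs[t, a] = pi(a | X_t); returns the estimated policy value."""
    return float(np.mean(np.sum(pi_probs * scores, axis=1)))

# Toy policy class (an assumption of this sketch): threshold rules on X[:, 0]
# that play arm 1 when the first context coordinate exceeds a threshold.
def threshold_policy(X, thr):
    pi = np.zeros((len(X), 2))
    pi[:, 1] = (X[:, 0] > thr).astype(float)
    pi[:, 0] = 1.0 - pi[:, 1]
    return pi

def train_plas_policy(X, A, Y, w_hat, mu_hat, thresholds):
    """(PL) step: pick the policy in the class maximizing the AIPW objective."""
    scores = aipw_scores(X, A, Y, w_hat, mu_hat)
    values = [policy_value_estimate(threshold_policy(X, thr), scores)
              for thr in thresholds]
    return thresholds[int(np.argmax(values))]
```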
5. Minimax Optimality
■ Evaluate the strategy using a worst-case analysis.
■ The lower and upper bounds depend on the complexity of the policy class.
  • Use the Vapnik-Chervonenkis (VC) dimension when $K = 2$ and the Natarajan dimension when $K \ge 3$ as the complexity measure.
  • Denote the complexity of $\Pi$ by $D$.
■ We impose several restrictions on the policy class and the distribution.
◆ Worst-case regret lower bounds.
  • If $K = 2$, any strategy with a trained policy $\widehat{\pi}$ satisfies
$$\sup_{P \in \mathcal{P}} \sqrt{T}\,\mathbb{E}_P\!\left[R_P(\widehat{\pi})\right] \ge \frac{1}{8}\sqrt{D\,\mathbb{E}_X\!\left[\left(\sigma_1(X) + \sigma_2(X)\right)^2\right]} + o(1) \quad \text{as } T \to \infty.$$
  • If $K \ge 3$, any strategy with a trained policy $\widehat{\pi}$ satisfies
$$\sup_{P \in \mathcal{P}} \sqrt{T}\,\mathbb{E}_P\!\left[R_P(\widehat{\pi})\right] \ge \frac{1}{8}\sqrt{D\,\mathbb{E}_X\!\left[\Big(\sum_{a \in [K]} \sigma_a(X)\Big)^{2}\right]} + o(1) \quad \text{as } T \to \infty.$$
  • Note that $\mathbb{E}_X\!\left[\big(\sum_{a \in [K]} \sigma_a(X)\big)^2\right] \ge \mathbb{E}_X\!\left[\left(\sigma_1(X) + \sigma_2(X)\right)^2\right]$, since standard deviations are nonnegative.
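A quick numerical check of the note above, using arbitrary made-up standard deviations (the factor forms follow the reconstruction in this section):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up conditional standard deviations sigma_a(X) for K = 4 arms,
# evaluated at 10_000 simulated contexts (rows = contexts, cols = arms).
sigma = rng.uniform(0.1, 2.0, size=(10_000, 4))

factor_two_armed = np.mean((sigma[:, 0] + sigma[:, 1]) ** 2)   # E[(sigma_1 + sigma_2)^2]
factor_multi_armed = np.mean(sigma.sum(axis=1) ** 2)           # E[(sum_a sigma_a)^2]

print(factor_multi_armed >= factor_two_armed)  # True: the K >= 3 factor is larger
```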
◆ Worst-case regret upper bounds of the PLAS strategy.
  • If $K = 2$, the PLAS strategy satisfies
$$\sqrt{T}\,\mathbb{E}_P\!\left[R_P(\widehat{\pi}_{\mathrm{PLAS}})\right] \le \big(272\sqrt{D} + 870.4\big)\sqrt{\mathbb{E}_X\!\left[\left(\sigma_1(X) + \sigma_2(X)\right)^2\right]} + o(1) \quad \text{as } T \to \infty.$$
  • If $K \ge 3$, the PLAS strategy satisfies
$$\sqrt{T}\,\mathbb{E}_P\!\left[R_P(\widehat{\pi}_{\mathrm{PLAS}})\right] \le \big(C\sqrt{D \log K} + 870.4\big)\sqrt{\mathbb{E}_X\!\left[\Big(\sum_{a \in [K]} \sigma_a(X)\Big)^{2}\right]} + o(1) \quad \text{as } T \to \infty,$$
where $C > 0$ is a universal constant.
◆ The leading factor of the upper bounds, which depends on $D$, $T$, and $(\sigma_a(x))^2$, aligns with that of the lower bounds:
  • $\sqrt{D\,\mathbb{E}_X\!\left[(\sigma_1(X) + \sigma_2(X))^2\right]/T}$ for $K = 2$ and $\sqrt{D\,\mathbb{E}_X\!\left[\big(\sum_{a \in [K]} \sigma_a(X)\big)^2\right]/T}$ for $K \ge 3$.
⇒ The PLAS strategy is minimax (rate-)optimal for the expected simple regret.