Slide 1

Slide 1 text

Adaptive Experimental Design for Policy Learning: Contextual Best Arm Identification Ø Contextual BAI with a fixed budget. n 𝑲 treatment arms 𝐾 = 1,2, … , 𝐾 : Treatment arms: alternatives of medicine, policy, and advertisements. β€’ Each arm π‘Ž ∈ [𝐾] has a potential outcome π‘Œ! ∈ ℝ. n Context 𝑋 ∈ 𝒳 βŠ† ℝ". n Policy πœ‹: 𝐾 ×𝒳 β†’ 0,1 such that βˆ‘!∈[%] πœ‹(π‘Ž ∣ π‘₯) = 1 βˆ€π‘₯ ∈ 𝒳. β€’ A policy is a function that returns a estimated best arm given a context π‘₯. β€’ Policy class Ξ : a set of policies πœ‹. Ex. Neural networks. n Distribution 𝑃 of (π‘Œ*, … , π‘Œ%, 𝑋). β€’ 𝒫: a set of 𝑃. β€’ 𝔼! and β„™! : expectation and probability laws under 𝑃. β€’ πœ‡"(𝑃) = 𝔼! π‘Œ# " and πœ‡"(𝑃)(π‘₯) n Policy value 𝑄 𝑃 πœ‹ ≔ 𝔼4 βˆ‘!∈ % πœ‹ π‘Ž 𝑋 πœ‡!(𝑃)(𝑋) . β€’ Expected outcome under a policy πœ‹ and a distribution 𝑃. n Optimal policy πœ‹βˆ—(𝑃) ∈ arg max6∈7 𝑄 𝑃 πœ‹ . n Experimenter’s interest: training πœ‹βˆ— using experimental data. n Adaptive experiment with 𝑻 rounds: β€’ In each round 𝑑 ∈ 𝑇 ≔ 1,2, … , 𝑇 , an experimenter 1. observes 𝑑-dimensional context 𝑋8 ∈ 𝒳 βŠ† ℝ". 2. assigns treatment 𝐴8 ∈ [𝐾], and 3. observes outcome π‘Œ8 ≔ βˆ‘!∈ % 1 𝐴8 = π‘Ž π‘Œ8 ! . β€’ After the final round 𝑇, an experimenter trains a policy πœ‹. β€’ Denote the trained policy by L πœ‹ ∈ Ξ . n AS: adaptive sampling during an experiment. n PL: policy learning at the end of the experiment. Ø Procedure of the PLAS strategy. β€’ Define the target treatment-assignment ratio as follows: - 𝐾 = 2: π‘€βˆ—(π‘Ž ∣ π‘₯) = 9!(:) 9"(:);9#(:) # βˆ€π‘Ž ∈ [2] and βˆ€π‘₯ ∈ 𝒳. - 𝐾 β‰₯ 3: π‘€βˆ— π‘Ž π‘₯ = 9! : # βˆ‘ $∈ & 9$ : # βˆ€π‘Ž ∈ 𝐾 and βˆ€π‘₯ ∈ 𝒳. 𝜎"(π‘₯) $: Conditional variance of π‘Œ# " given π‘₯. β†’ Our designed policy adaptively estimates the ratio and assigns arms following the probability. β€’ (AS) In each round 𝒕 ∈ 𝑻 , an experimenter 1. observes 𝑋8 , 2. estimates π‘€βˆ—(π‘Ž ∣ 𝑋8 ) by replacing 𝜎!(𝑋8 ) > with its estimator L 𝜎!(𝑋8 ) >, and 3. assigns an arm π‘Ž following the probability R 𝑀8 (π‘Ž ∣ 𝑋8 ), an estimator of π‘€βˆ—(π‘Ž ∣ 𝑋8 ). β€’ (PL) At the end, the experimenter trains πœ‹ as ! πœ‹!"#$ ≔ arg max %∈' ) (∈[*] 1 𝑇 ) ,∈[-] πœ‹(π‘Ž ∣ 𝑋, ) 1[𝐴, = π‘Ž](π‘Œ, βˆ’ ! πœ‡, (π‘Ž ∣ 𝑋, )) 8 𝑀, (π‘Ž ∣ 𝑋, ) + ! πœ‡, (π‘Ž ∣ 𝑋, ) . β€’ ? 𝑀#(π‘Ž ∣ 𝑋#) and D πœ‡#(π‘Ž ∣ 𝑋#) are estimators of π‘€βˆ— and πœ‡(𝑃)(π‘₯) using samples until 𝑑. v n Evaluate the strategy by using the worst-case analysis. n Lower and upper bounds depend on the policy class complexity. β€’ Use the Vapnik-Chervonenkis dimension when 𝐾 = 2 and the Natarajan dimension when 𝐾 β‰₯ 3 for the complexity. β€’ Denote the complexity of Ξ  by 𝑀. n Make several restrictions on the policy class and distribution. Ø Worst-case regret lower bounds. β€’ If 𝐾 = 2, any strategy with a trained policy L πœ‹ satisfies sup !βˆˆπ’« 𝑇𝔼! 𝑅$ 𝑃 ( πœ‹ β‰₯ 1 8 𝔼% 𝑀 𝜎&(𝑋) + 𝜎'(𝑋) ' + π‘œ 1 π‘Žπ‘  𝑇 β†’ ∞. β€’ If 𝐾 β‰₯ 3, any strategy with a trained policy L πœ‹ satisfies sup !βˆˆπ’« 𝑇 𝔼! 𝑅$ 𝑃 , πœ‹ β‰₯ 1 8 𝔼% 𝑀 2 &∈ ' 𝜎&(𝑋) ( + π‘œ 1 π‘Žπ‘  𝑇 β†’ ∞. β€’ Note that 𝔼% 𝑀 βˆ‘ (∈ ) 𝜎((𝑋) ' β‰₯ 𝔼% 𝑀 𝜎&(𝑋) + 𝜎'(𝑋) ' . Ø Worst-case regret upper bounds of the PLAS strategy. β€’ If 𝐾 = 2, the PLAS strategy satisfies 𝑇 𝔼4 𝑅F 𝑃 L πœ‹GHIJ ≀ 272 𝑀 + 870.4 𝔼K 𝜎*(𝑋) + 𝜎>(𝑋) > + π‘œ 1 π‘Žπ‘  𝑇 β†’ ∞. β€’ If 𝐾 β‰₯ 3, the PLAS strategy satisfies 𝑇 𝔼! 𝑅$ 𝑃 , πœ‹)*+, ≀ 𝐢 log 𝑑 𝑀 + 870.4 𝔼% 𝑀 2 &∈ ' 𝜎&(𝑋) ( + π‘œ 1 π‘Žπ‘  𝑇 β†’ ∞, where 𝐢 > 0 is universal constant. Ø The leading factor depending on 𝑀, 𝑇, and 𝜎! π‘₯ > of the upper bounds aligns with that of the lower bounds. β€’ 𝔼% 𝑀 𝜎&(𝑋) + 𝜎'(𝑋) ' and 𝔼% 𝑀 βˆ‘ (∈ ) 𝜎((𝑋) ' . β†’ Minimax (rate-)optimal for the expected simple regret. Masahiro Kato, Mizuho-DL Financial Technology Co., Ltd. Kyohei Okumura, Northwestern University Takuya Ishihara, Tohoku University Toru Kitagawa, Brown University ΓΌ New framework: Contextual best arm identification (BAI). ΓΌ Algorithm: Adaptive experimental design for policy learning. ΓΌ Results: minimax optimality of the proposed algorithm. 2. Problem Setting n Performance measure of L πœ‹ : Expected simple regret. β€’ Simple regret: π‘ŸF 𝑃 (L πœ‹)(π‘₯) = πœ‡!βˆ— 4 𝑃 (π‘₯) βˆ’ πœ‡ D !( under 𝑃 β€’ Expected simple regret: 𝑅F 𝑃 L πœ‹ ≔ 𝔼4 π‘Ÿ8 𝑃 L πœ‹ 𝑋 = 𝑄 𝑃 πœ‹βˆ— 𝑃 βˆ’ 𝑄 𝑃 L πœ‹ . n Our goal: β€’ Design of an algorithm (strategy) of an experimenter. β€’ Under the strategy, the experimenter trains πœ‹ while minimizing the expected simple regret 𝔼4 π‘Ÿ8 𝑃 . β†’ We design the Adaptive-Sampling (AS) and Policy-Learning (PL) strategy and show its minimax optimality for 𝑅F 𝑃 L πœ‹ . 3. Performance Measure 1. Contribution 4. PLAS Strategy v v v v 5. Minimax Optimality v v v