Adaptive Policy Learning

MasaKat0
October 27, 2024

Poster for ICML 2024 workshop.

Transcript

  1. Adaptive Experimental Design for Policy Learning: Contextual Best Arm Identification

Masahiro Kato, Mizuho-DL Financial Technology Co., Ltd.
Kyohei Okumura, Northwestern University
Takuya Ishihara, Tohoku University
Toru Kitagawa, Brown University

1. Contribution
✓ New framework: Contextual best arm identification (BAI).
✓ Algorithm: Adaptive experimental design for policy learning.
✓ Results: Minimax optimality of the proposed algorithm.

2. Problem Setting
- Contextual BAI with a fixed budget.
- K treatment arms [K] = {1, 2, …, K}. Treatment arms are alternatives such as medicines, policies, or advertisements.
  - Each arm a ∈ [K] has a potential outcome Y^a ∈ ℝ.
- Context X ∈ 𝒳 ⊆ ℝ^d.
- Policy π: [K] × 𝒳 → [0, 1] such that ∑_{a∈[K]} π(a ∣ x) = 1 for all x ∈ 𝒳.
  - A policy is a function that returns an estimated best arm given a context x.
  - Policy class Π: a set of policies π (e.g., neural networks).
- Distribution P of (Y^1, …, Y^K, X).
  - 𝒫: a set of distributions P.
  - 𝔼_P and ℙ_P: the expectation and probability law under P.
  - μ^a(P) := 𝔼_P[Y^a] and μ^a(P)(x) := 𝔼_P[Y^a ∣ X = x].
- Policy value Q(P)(π) := 𝔼_P[∑_{a∈[K]} π(a ∣ X) μ^a(P)(X)], the expected outcome under a policy π and a distribution P.
- Optimal policy π*(P) ∈ arg max_{π∈Π} Q(P)(π).
- Experimenter's interest: training π* using experimental data.
- Adaptive experiment with T rounds:
  - In each round t ∈ [T] := {1, 2, …, T}, the experimenter
    1. observes a d-dimensional context X_t ∈ 𝒳 ⊆ ℝ^d,
    2. assigns a treatment A_t ∈ [K], and
    3. observes the outcome Y_t := ∑_{a∈[K]} 1[A_t = a] Y_t^a.
  - After the final round T, the experimenter trains a policy π; denote the trained policy by π̂ ∈ Π.
- AS: adaptive sampling during the experiment.
- PL: policy learning at the end of the experiment.

3. Performance Measure
- Performance measure of π̂: the expected simple regret.
  - Simple regret: r_T(P)(π̂)(x) := μ^{π*(P)}(P)(x) − μ^{π̂}(P)(x) under P, where μ^{π}(P)(x) := ∑_{a∈[K]} π(a ∣ x) μ^a(P)(x).
  - Expected simple regret: R_T(P)(π̂) := 𝔼_P[r_T(P)(π̂)(X)] = Q(P)(π*(P)) − Q(P)(π̂).
- Our goal:
  - Design an algorithm (strategy) for the experimenter.
  - Under the strategy, the experimenter trains π while minimizing the expected simple regret.
  → We design the Adaptive-Sampling (AS) and Policy-Learning (PL) strategy (the PLAS strategy) and show its minimax optimality for R_T(P)(π̂).

4. PLAS Strategy
Procedure of the PLAS strategy (see the code sketch after this section):
- Define the target treatment-assignment ratio as follows:
  - K = 2: w*(a ∣ x) = σ_a(x) / (σ_1(x) + σ_2(x)) for all a ∈ [2] and all x ∈ 𝒳.
  - K ≥ 3: w*(a ∣ x) = σ_a(x)² / ∑_{b∈[K]} σ_b(x)² for all a ∈ [K] and all x ∈ 𝒳.
  - σ_a(x)²: the conditional variance of Y^a given x.
  → Our strategy adaptively estimates this ratio and assigns arms following the estimated probability.
- (AS) In each round t ∈ [T], the experimenter
  1. observes X_t,
  2. estimates w*(a ∣ X_t) by replacing σ_a(X_t)² with its estimator σ̂_a(X_t)², and
  3. assigns an arm a following the probability ŵ_t(a ∣ X_t), an estimator of w*(a ∣ X_t).
- (PL) At the end of the experiment, the experimenter trains π as
  π̂_PLAS := arg max_{π∈Π} ∑_{a∈[K]} (1/T) ∑_{t∈[T]} π(a ∣ X_t) [ 1[A_t = a] (Y_t − μ̂_t(a ∣ X_t)) / ŵ_t(a ∣ X_t) + μ̂_t(a ∣ X_t) ].
  - ŵ_t(a ∣ X_t) and μ̂_t(a ∣ X_t) are estimators of w*(a ∣ X_t) and μ^a(P)(X_t) constructed from the samples observed up to round t.
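
To make the two steps above concrete, here is a minimal, self-contained Python sketch. It is not the authors' implementation: the one-dimensional uniform context, the Gaussian outcome model, the finite class of threshold policies, and the pooled (context-free) variance and mean estimators are illustrative assumptions, and names such as `aipw_value` are invented for this sketch.

```python
# Minimal sketch of the PLAS strategy: adaptive sampling (AS) followed by policy
# learning (PL) with the AIPW-style objective from the poster. All modelling choices
# below (context law, outcome model, policy class, pooled estimators) are assumptions
# made only for illustration.
import numpy as np

rng = np.random.default_rng(0)
K, T = 2, 2000

def mu(a, x):
    """Toy conditional means mu^a(P)(x); arm 1 is the better arm when x > 0."""
    return -0.5 * x if a == 0 else 0.5 * x

SIGMA = np.array([1.0, 2.0])   # true outcome standard deviations (context-free here)

# Policy class Pi: deterministic threshold policies pi_c(x) = 1[x > c] (assign arm 1 iff x > c).
thresholds = np.linspace(-1.0, 1.0, 41)

# Running per-arm statistics for the pooled estimators of sigma_a^2 and mu_a.
counts, sums, sqsums = np.zeros(K), np.zeros(K), np.zeros(K)

X, A, Y, W = [], [], [], []
for t in range(T):
    x = rng.uniform(-1.0, 1.0)                    # (AS) 1. observe the context X_t
    var_hat = np.ones(K)                          # default before an arm has enough data
    seen = counts > 1
    var_hat[seen] = (sqsums[seen] - sums[seen] ** 2 / counts[seen]) / (counts[seen] - 1)
    sd_hat = np.sqrt(np.clip(var_hat, 1e-6, None))
    # (AS) 2. plug-in estimate of the target ratio w*(a | x); Neyman-type rule for K = 2
    w = sd_hat / sd_hat.sum() if K == 2 else var_hat / var_hat.sum()
    a = rng.choice(K, p=w)                        # (AS) 3. draw A_t following w_hat_t(. | X_t)
    y = mu(a, x) + SIGMA[a] * rng.normal()        # observe the outcome Y_t
    counts[a] += 1; sums[a] += y; sqsums[a] += y ** 2
    X.append(x); A.append(a); Y.append(y); W.append(w)

X, A, Y, W = np.array(X), np.array(A), np.array(Y), np.array(W)
mu_hat = sums / np.maximum(counts, 1)             # crude pooled mean estimate (not mu_hat_t(a | x))

def aipw_value(pi_arm):
    """Estimated value of the deterministic policy that picks arm pi_arm[t] at X_t,
    using the AIPW score from the PL step."""
    idx = np.arange(T)
    score = (A == pi_arm) * (Y - mu_hat[pi_arm]) / W[idx, pi_arm] + mu_hat[pi_arm]
    return score.mean()

# (PL) maximize the AIPW objective over the finite policy class.
values = [aipw_value((X > c).astype(int)) for c in thresholds]
best_c = thresholds[int(np.argmax(values))]
print(f"learned policy: assign arm 1 when x > {best_c:.2f}")
```

For a richer policy class such as neural networks, the same AIPW objective would be maximized by gradient-based training over π(a ∣ x) rather than by enumerating candidate policies.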

5. Minimax Optimality
- Evaluate the strategy by a worst-case analysis.
- Lower and upper bounds depend on the complexity of the policy class.
  - Use the Vapnik-Chervonenkis (VC) dimension when K = 2 and the Natarajan dimension when K ≥ 3 as the complexity measure.
  - Denote the complexity of Π by M.
- Several restrictions are imposed on the policy class and the distribution.
- Worst-case regret lower bounds:
  - If K = 2, any strategy with a trained policy π̂ satisfies
    sup_{P∈𝒫} √T 𝔼_P[R_T(P)(π̂)] ≥ (1/8) 𝔼_X[√(M (σ_1(X) + σ_2(X))²)] + o(1) as T → ∞.
  - If K ≥ 3, any strategy with a trained policy π̂ satisfies
    sup_{P∈𝒫} √T 𝔼_P[R_T(P)(π̂)] ≥ (1/8) 𝔼_X[√(M ∑_{a∈[K]} σ_a(X)²)] + o(1) as T → ∞.
  - Note that 𝔼_X[√(M ∑_{a∈[K]} σ_a(X)²)] ≥ 𝔼_X[√(M (σ_1(X) + σ_2(X))²)].
- Worst-case regret upper bounds of the PLAS strategy:
  - If K = 2, the PLAS strategy satisfies
    √T 𝔼_P[R_T(P)(π̂_PLAS)] ≤ (272 √M + 870.4) √(𝔼_X[(σ_1(X) + σ_2(X))²]) + o(1) as T → ∞.
  - If K ≥ 3, the PLAS strategy satisfies
    √T 𝔼_P[R_T(P)(π̂_PLAS)] ≤ (C √(log d) + 870.4) 𝔼_X[√(M ∑_{a∈[K]} σ_a(X)²)] + o(1) as T → ∞,
    where C > 0 is a universal constant.
- The leading factor of the upper bounds, which depends on M, T, and σ_a(x)², aligns with that of the lower bounds:
  - 𝔼_X[√(M (σ_1(X) + σ_2(X))²)] for K = 2 and 𝔼_X[√(M ∑_{a∈[K]} σ_a(X)²)] for K ≥ 3.
  → The PLAS strategy is minimax (rate-)optimal for the expected simple regret. A toy computation of this regret is sketched below.
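
The bounds above control the expected simple regret R_T(P)(π̂) = Q(P)(π*(P)) − Q(P)(π̂). The following sketch evaluates that quantity by Monte Carlo for the same toy means used in the PLAS sketch; the uniform context law, the outcome means, and the threshold 0.1 (standing in for a threshold returned by a run of the sketch) are illustrative assumptions, not results from the poster.

```python
# Toy Monte Carlo evaluation of the expected simple regret
# R_T(P)(pi_hat) = Q(P)(pi*(P)) - Q(P)(pi_hat) for threshold policies,
# under the same illustrative outcome means as the PLAS sketch above.
import numpy as np

rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 1.0, size=200_000)   # Monte Carlo draws of the context X

def mu(a, x):
    """Toy conditional means mu^a(P)(x); arm 1 is the better arm when x > 0."""
    return -0.5 * x if a == 0 else 0.5 * x

def policy_value(c):
    """Q(P)(pi_c) for the deterministic policy pi_c(x) = 1[x > c] (assign arm 1 iff x > c)."""
    arm1 = xs > c
    return np.mean(np.where(arm1, mu(1, xs), mu(0, xs)))

q_opt = policy_value(0.0)   # pi*(x) assigns arm 1 exactly when x > 0, which is optimal here
q_hat = policy_value(0.1)   # a hypothetical learned threshold
print(f"Q(pi*) = {q_opt:.4f}, Q(pi_hat) = {q_hat:.4f}, regret = {q_opt - q_hat:.4f}")
```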