Adaptive Experimental Design for Policy Learning:
Contextual Best Arm Identification
Masahiro Kato, Mizuho-DL Financial Technology Co., Ltd.
Kyohei Okumura, Northwestern University
Takuya Ishihara, Tohoku University
Toru Kitagawa, Brown University
2. Problem Setting
◆ Contextual BAI with a fixed budget.
■ K treatment arms [K] := {1, 2, …, K}:
  • Treatment arms: alternatives such as medicines, policies, or advertisements.
  • Each arm a ∈ [K] has a potential outcome Y_a ∈ ℝ.
■ Context X ∈ 𝒳 ⊆ ℝ^D.
■ Policy π: [K] × 𝒳 → {0, 1} such that Σ_{a∈[K]} π(a | x) = 1 for all x ∈ 𝒳.
  • A policy is a function that returns an estimated best arm given a context x.
  • Policy class Π: a set of policies π (e.g., neural networks).
■ Distribution P of (Y_1, …, Y_K, X).
  • 𝒫: a set of distributions P.
  • E_P and ℙ_P: the expectation and the probability law under P.
  • μ_a(P) := E_P[Y_a] and μ_a(P)(x) := E_P[Y_a | X = x]: the mean outcome of arm a, marginally and conditionally on x.
■ Policy value Q_P(π) := E_P[ Σ_{a∈[K]} π(a | X) μ_a(P)(X) ].
  • The expected outcome under a policy π and a distribution P.
■ Optimal policy π*(P) ∈ arg max_{π∈Π} Q_P(π).
■ Experimenter's interest: training π* using experimental data.
■ Adaptive experiment with T rounds (sketched in code at the end of this section):
  • In each round t ∈ [T] := {1, 2, …, T}, the experimenter
    1. observes a D-dimensional context X_t ∈ 𝒳 ⊆ ℝ^D,
    2. assigns a treatment A_t ∈ [K], and
    3. observes the outcome Y_t = Σ_{a∈[K]} 1[A_t = a] Y_{a,t}, where Y_{a,t} is the potential outcome of arm a in round t.
  • After the final round T, the experimenter trains a policy π.
  • Denote the trained policy by π̂ ∈ Π.
■ AS: adaptive sampling during an experiment.
■ PL: policy learning at the end of the experiment.
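The setup can be made concrete with a small simulation. Below is a minimal sketch in Python/NumPy; the Gaussian outcome model, the one-dimensional context, the threshold policy class, and all numerical values are illustrative assumptions, not part of the framework above.

import numpy as np

rng = np.random.default_rng(0)
K, T = 2, 1000  # number of arms and experiment budget (illustrative values)

def sample_context():
    # Context X drawn from a one-dimensional context space (illustrative law).
    return float(rng.uniform(-1.0, 1.0))

def mu(a, x):
    # Conditional mean mu_a(P)(x) = E_P[Y_a | X = x] (illustrative choice).
    return x if a == 1 else -x

def sample_outcome(a, x):
    # Potential outcome Y_a with conditional standard deviation sigma_a(x).
    return mu(a, x) + (0.5 + 0.5 * a) * rng.normal()

def threshold_policy(theta):
    # A deterministic policy pi(a | x) in {0, 1}: assign arm 1 iff x >= theta.
    return lambda a, x: float((x >= theta) == (a == 1))

def policy_value(pi, n_mc=10_000):
    # Policy value Q_P(pi) = E_P[ sum_a pi(a | X) mu_a(P)(X) ], by Monte Carlo over X.
    xs = [sample_context() for _ in range(n_mc)]
    return float(np.mean([sum(pi(a, x) * mu(a, x) for a in range(1, K + 1)) for x in xs]))

def run_experiment(assign_rule):
    # Adaptive experiment: for t = 1, ..., T observe X_t, assign A_t, observe Y_t.
    data = []
    for t in range(T):
        x = sample_context()
        a = assign_rule(t, x, data)  # e.g., the PLAS adaptive-sampling rule of Section 4
        y = sample_outcome(a, x)
        data.append((x, a, y))
    return data

# Usage: uniform assignment, then compare two candidate policies by their true value.
data = run_experiment(lambda t, x, past: int(rng.integers(1, K + 1)))
print(policy_value(threshold_policy(0.0)), policy_value(threshold_policy(0.5)))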
4. PLAS Strategy
◆ Procedure of the PLAS strategy.
  • Define the target treatment-assignment ratio as follows:
    - K = 2: w*(a | x) = σ_a(x) / (σ_1(x) + σ_2(x)) for all a ∈ [2] and all x ∈ 𝒳.
    - K ≥ 3: w*(a | x) = σ_a(x)² / Σ_{b∈[K]} σ_b(x)² for all a ∈ [K] and all x ∈ 𝒳.
    Here σ_a(x)² is the conditional variance of Y_a given X = x.
  ⇒ The designed strategy adaptively estimates this ratio and assigns arms following the estimated probability.
  • (AS) In each round t ∈ [T], the experimenter
    1. observes X_t,
    2. estimates w*(a | X_t) by replacing σ_a(X_t)² with its estimator σ̂_a(X_t)², and
    3. assigns an arm A_t following the probability ŵ_t(a | X_t), an estimator of w*(a | X_t).
  • (PL) At the end, the experimenter trains π̂ as
    π̂_PLAS ∈ arg max_{π∈Π} Σ_{a∈[K]} (1/T) Σ_{t∈[T]} π(a | X_t) [ 1[A_t = a] (Y_t − μ̂_t(a | X_t)) / ŵ_t(a | X_t) + μ̂_t(a | X_t) ].
  • ŵ_t(a | X_t) and μ̂_t(a | X_t) are estimators of w*(a | X_t) and μ_a(P)(X_t) constructed from the samples observed up to round t (see the sketch below).
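A minimal end-to-end sketch of the two steps above for K = 2 (Python/NumPy). The context-free running-moment estimators, the toy data-generating process inside run_plas, the finite threshold policy class, and all numerical values are illustrative assumptions; PLAS itself uses conditional (x-dependent) estimators of μ_a(P)(x) and σ_a(x)² built from past samples.

import numpy as np

rng = np.random.default_rng(1)
K = 2

def target_ratio(sigma_fn, x):
    # Target ratio for K = 2: w*(a | x) = sigma_a(x) / (sigma_1(x) + sigma_2(x)).
    s = np.array([sigma_fn(a, x) for a in range(1, K + 1)])
    return s / s.sum()

class RunningMoments:
    # Crude plug-in estimators of mu_a and sigma_a: per-arm running moments that
    # ignore the context, for illustration only; PLAS uses conditional (x-dependent)
    # estimators built from the samples observed before the current round.
    def __init__(self):
        self.n, self.s1, self.s2 = np.zeros(K + 1), np.zeros(K + 1), np.zeros(K + 1)

    def update(self, a, y):
        self.n[a] += 1.0
        self.s1[a] += y
        self.s2[a] += y * y

    def mu_hat(self, a, x):
        return self.s1[a] / max(self.n[a], 1.0)

    def sigma_hat(self, a, x):
        m = self.mu_hat(a, x)
        return float(np.sqrt(max(self.s2[a] / max(self.n[a], 1.0) - m * m, 1e-6)))

def assign(moments, x):
    # (AS) Plug sigma_hat into the target ratio and draw A_t from w_hat_t(. | X_t).
    w = target_ratio(moments.sigma_hat, x)
    a_t = int(rng.choice(np.arange(1, K + 1), p=w))
    return a_t, {a: float(w[a - 1]) for a in range(1, K + 1)}

def pl_objective(pi, logged):
    # (PL) AIPW objective: sum_a (1/T) sum_t pi(a | X_t) *
    #   [ 1[A_t = a](Y_t - mu_hat_t(a | X_t)) / w_hat_t(a | X_t) + mu_hat_t(a | X_t) ].
    total = 0.0
    for (x, a_t, y, w_hat, mu_hat) in logged:  # one record per round t
        for a in range(1, K + 1):
            total += pi(a, x) * (float(a_t == a) * (y - mu_hat[a]) / w_hat[a] + mu_hat[a])
    return total / len(logged)

def run_plas(T=1000):
    moments, logged = RunningMoments(), []
    for t in range(T):
        x = float(rng.uniform(-1.0, 1.0))                       # observe X_t
        a_t, w_hat = assign(moments, x)                         # (AS) assign A_t
        mu_hat = {a: moments.mu_hat(a, x) for a in range(1, K + 1)}
        y = (x if a_t == 1 else -x) + (0.5 + 0.5 * a_t) * rng.normal()  # observe Y_t (toy DGP)
        moments.update(a_t, y)
        logged.append((x, a_t, y, w_hat, mu_hat))
    return logged

# (PL) Pick the maximizer of the AIPW objective over a finite threshold policy class.
candidates = [lambda a, x, th=th: float((x >= th) == (a == 1)) for th in np.linspace(-0.5, 0.5, 11)]
logged = run_plas()
pi_hat_plas = max(candidates, key=lambda pi: pl_objective(pi, logged))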
5. Minimax Optimality
■ Evaluate the strategy via a worst-case analysis.
■ The lower and upper bounds depend on the complexity of the policy class.
  • Use the Vapnik-Chervonenkis (VC) dimension when K = 2 and the Natarajan dimension when K ≥ 3 as the complexity measure.
  • Denote the complexity of Π by d.
■ We impose several restrictions on the policy class and the distribution.
◆ Worst-case regret lower bounds.
  • If K = 2, any strategy with a trained policy π̂ satisfies
    sup_{P∈𝒫} √T E_P[R_P(π̂)] ≥ (1/8) √(E_X[(σ_1(X) + σ_2(X))²]) + o(1) as T → ∞.
  • If K ≥ 3, any strategy with a trained policy π̂ satisfies
    sup_{P∈𝒫} √T E_P[R_P(π̂)] ≥ (1/8) √(E_X[Σ_{a∈[K]} σ_a(X)²]) + o(1) as T → ∞.
  • Note that E_X[Σ_{a∈[K]} σ_a(X)²] ≥ E_X[(σ_1(X) + σ_2(X))²].
◆ Worst-case regret upper bounds of the PLAS strategy.
  • If K = 2, the PLAS strategy satisfies
    √T E_P[R_P(π̂_PLAS)] ≤ (272 √d + 870.4) √(E_X[(σ_1(X) + σ_2(X))²]) + o(1) as T → ∞.
  • If K ≥ 3, the PLAS strategy satisfies
    √T E_P[R_P(π̂_PLAS)] ≤ (C √(d log K) + 870.4) √(E_X[Σ_{a∈[K]} σ_a(X)²]) + o(1) as T → ∞,
    where C > 0 is a universal constant.
◆ The leading factor of the upper bounds, which depends on d, T, and σ_a(x)², aligns with that of the lower bounds:
  • √(E_X[(σ_1(X) + σ_2(X))²]) for K = 2 and √(E_X[Σ_{a∈[K]} σ_a(X)²]) for K ≥ 3 (see the sketch below).
⇒ The PLAS strategy is minimax (rate-)optimal for the expected simple regret.
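The leading factors are plain functionals of the distribution and can be evaluated by Monte Carlo. A small sketch, with illustrative (assumed) conditional standard deviations and context law:

import numpy as np

rng = np.random.default_rng(2)

def sigma(a, x):
    # Illustrative conditional standard deviations sigma_a(x); not from the poster.
    return 1.0 + 0.3 * a + 0.2 * np.abs(x)

def binary_factor(n_mc=100_000):
    # sqrt( E_X[ (sigma_1(X) + sigma_2(X))^2 ] ): the K = 2 leading factor.
    x = rng.uniform(-1.0, 1.0, size=n_mc)
    return float(np.sqrt(np.mean((sigma(1, x) + sigma(2, x)) ** 2)))

def multi_factor(K, n_mc=100_000):
    # sqrt( E_X[ sum_{a in [K]} sigma_a(X)^2 ] ): the K >= 3 leading factor.
    x = rng.uniform(-1.0, 1.0, size=n_mc)
    return float(np.sqrt(np.mean(sum(sigma(a, x) ** 2 for a in range(1, K + 1)))))

print(binary_factor(), multi_factor(K=3))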
1. Contribution
✓ New framework: contextual best arm identification (BAI).
✓ Algorithm: adaptive experimental design for policy learning.
✓ Results: minimax optimality of the proposed algorithm.
3. Performance Measure
■ Performance measure of π̂: the expected simple regret (see the sketch at the end of this section).
  • Simple regret under P: r_P(π̂)(x) = Σ_{a∈[K]} (π*(P)(a | x) − π̂(a | x)) μ_a(P)(x).
  • Expected simple regret:
    R_P(π̂) := E_P[r_P(π̂)(X)] = Q_P(π*(P)) − Q_P(π̂).
■ Our goal:
  • Design an algorithm (strategy) for the experimenter.
  • Under the strategy, the experimenter trains π̂ while minimizing the expected simple regret R_P(π̂).
⇒ We design the Adaptive-Sampling (AS) and Policy-Learning (PL) strategy, PLAS, and show its minimax optimality for R_P(π̂).
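When the distribution is known, the definitions above can be evaluated by Monte Carlo. A minimal sketch; the conditional means, the context law, and the two policies are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(3)

def mu(a, x):
    # Illustrative conditional means mu_a(P)(x).
    return x if a == 1 else -x

def regret_at(pi_star, pi_hat, x):
    # Pointwise simple regret r_P(pi_hat)(x) = sum_a (pi*(a|x) - pi_hat(a|x)) mu_a(P)(x).
    return sum((pi_star(a, x) - pi_hat(a, x)) * mu(a, x) for a in (1, 2))

def expected_simple_regret(pi_star, pi_hat, n_mc=100_000):
    # R_P(pi_hat) = E_P[ r_P(pi_hat)(X) ] = Q_P(pi*(P)) - Q_P(pi_hat), by Monte Carlo over X.
    xs = rng.uniform(-1.0, 1.0, size=n_mc)
    return float(np.mean([regret_at(pi_star, pi_hat, x) for x in xs]))

pi_star = lambda a, x: float((x >= 0.0) == (a == 1))   # optimal policy for mu above
pi_hat = lambda a, x: float((x >= 0.2) == (a == 1))    # a slightly misspecified trained policy
print(expected_simple_regret(pi_star, pi_hat))         # small but positive regret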