Definitions
• Data
  D = ((x_1, y_1, δ_1, p_1), …, (x_n, y_n, δ_n, p_n))
  x_i : Context
  y_i : Labels (multi-label settings)
  δ_i : Reward
  p_i : Propensity Score (described later)
• Policy
  π : Policy (Context → Action)
  y_i = π(x_i)
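To make the data format concrete, here is a minimal Python sketch of one logged record; the class and field names are illustrative and not from the slides:

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence

@dataclass
class LoggedExample:
    x: Sequence[float]   # context x_i (feature vector)
    y: FrozenSet[str]    # labels y_i chosen by the logging policy (multi-label)
    delta: float         # observed reward/feedback δ_i for showing y in context x
    p: float             # propensity p_i: probability the logging policy chose y given x

# The logged dataset D is then simply a list of such records:
# D = [LoggedExample(x_1, y_1, δ_1, p_1), ..., LoggedExample(x_n, y_n, δ_n, p_n)]
```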
Slide 12
Counterfactual Risk Minimization
• Unbiased Estimation (importance sampling)
  R̂(π) = (1/n) Σ_{i=1}^{n} δ_i · π(y_i | x_i) / π_0(y_i | x_i)
       = (1/n) Σ_{i=1}^{n} δ_i · π(y_i | x_i) / p_i
  δ_i : loss
  π_0 : logging policy (→ Propensity Score)
• Introducing a clipping constant M gives the IPS (Inverse Propensity Score) Estimator [4]:
  R̂^M(π) = (1/n) Σ_{i=1}^{n} min{ M, δ_i · π(y_i | x_i) / p_i }
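As a rough illustration, a minimal NumPy sketch of the clipped IPS estimator on this slide (the function name and arguments are my own, not from the talk):

```python
import numpy as np

def ips_estimate(delta, pi_new, p_log, M=None):
    """Clipped IPS estimate of R(π) from logged bandit data.

    delta  : observed losses δ_i
    pi_new : π(y_i | x_i) under the policy being evaluated
    p_log  : logged propensities p_i = π_0(y_i | x_i)
    M      : clipping constant; None disables clipping
    """
    w = np.asarray(delta) * np.asarray(pi_new) / np.asarray(p_log)
    if M is not None:
        w = np.minimum(M, w)   # clip the importance-weighted terms at M
    return w.mean()
```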
Slide 13
Counterfactual Risk Minimization
• CRM (Counterfactual Risk Minimization)
  arg min_h  R̂(h) + λ √( Var_w(u) / n )
  Minimizes the upper bound of the generalization error bound
  (※ see the paper for details)
  The λ term is a data-dependent regularizer.
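As a sketch of this objective (assuming the regularizer takes the λ·√(Var(u)/n) form over per-example importance-weighted losses u_i):

```python
import numpy as np

def crm_objective(u, lam):
    """CRM-style objective: empirical risk plus the data-dependent variance regularizer.

    u   : per-example importance-weighted losses u_i
    lam : regularization strength λ
    """
    u = np.asarray(u)
    n = len(u)
    var = u.var(ddof=1)   # sample variance 1/(n-1) · Σ (u_i - ū)²
    return u.mean() + lam * np.sqrt(var / n)
```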
Slide 14
POEM [5]
• A policy of the same form as in classification (linear + softmax)
• Learned according to the objective below
  π(y | x) = exp(w·φ(x, y)) / Σ_{y'∈Y} exp(w·φ(x, y'))
  w* = arg min_{w∈R^d}  ū_w + λ √( Var_w(u) / n )
  u_i^w ≡ δ_i · min{ M, exp(w·φ(x_i, y_i)) / ( p_i Σ_{y'∈Y} exp(w·φ(x_i, y')) ) }
  ū_w ≡ (1/n) Σ_{i=1}^{n} u_i^w
  Var_w(u) ≡ 1/(n−1) Σ_{i=1}^{n} (u_i^w − ū_w)²
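For concreteness, a rough NumPy sketch of this objective for a small set of K candidate labels (the function, argument names, and feature layout are my own assumptions):

```python
import numpy as np

def poem_objective(w, phi, y_idx, delta, p_log, lam, M):
    """Variance-regularized POEM objective for a linear softmax policy.

    w     : parameter vector, shape (d,)
    phi   : joint features φ(x_i, y') for every candidate label, shape (n, K, d)
    y_idx : index of the logged label y_i among the K candidates, shape (n,)
    delta : observed losses δ_i, shape (n,)
    p_log : logged propensities p_i, shape (n,)
    """
    scores = phi @ w                                 # (n, K) scores w·φ(x_i, y')
    scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)        # softmax: π_w(y' | x_i)
    pi_logged = probs[np.arange(len(y_idx)), y_idx]  # π_w(y_i | x_i)
    u = delta * np.minimum(M, pi_logged / p_log)     # u_i^w with clipping
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / len(u))
```

In practice this would be minimized over w with a gradient-based optimizer (the experiments later report both an AdaGrad and an L-BFGS variant).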
Slide 15
Experiments
• Dataset (multi-label experiments)
• Supervised to Bandit Conversion [6]
[Figure: a 5% / 95% split — a CRF logging policy π_0 is trained on the 5% split and then labels x from the remaining 95%, producing y alongside the true y*]
① Train the logging policy (a CRF) on 5% of the full data
② Use the learned logging policy to label the remaining 95% of the data
③ Compute the feedback δ from y and y* (Hamming loss)
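A rough sketch of this conversion, using per-label logistic regression in place of the CRF (the function, names, and split handling are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def supervised_to_bandit(X, Y_star, rng, log_frac=0.05):
    """Supervised→Bandit conversion: returns bandit-feedback data (x, y, δ, p).

    X      : features, shape (n, d)
    Y_star : true binary label matrix y*, shape (n, L)
    """
    n, L = Y_star.shape
    idx = rng.permutation(n)
    n_log = int(log_frac * n)
    train, bandit = idx[:n_log], idx[n_log:]

    # ① train the logging policy π_0 on 5% of the data
    #    (one classifier per label; assumes each label has both classes in the split)
    models = [LogisticRegression(max_iter=1000).fit(X[train], Y_star[train, l])
              for l in range(L)]

    # ② sample labels y for the remaining 95% from π_0, recording propensities p_i
    P = np.column_stack([m.predict_proba(X[bandit])[:, 1] for m in models])
    Y = (rng.random(P.shape) < P).astype(int)
    p = np.prod(np.where(Y == 1, P, 1.0 - P), axis=1)

    # ③ feedback δ = Hamming loss between the sampled y and the true y*
    delta = (Y != Y_star[bandit]).mean(axis=1)
    return X[bandit], Y, delta, p
```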
Slide 16
Experimental Results [5]
• Test set Hamming Loss
• Computational time (seconds)
S: AdaGrad, B: L-BFGS
Slide 17
Note
• Accurate predictions cannot be made for labels that never appear in the log
  ex) when a new ad is added
[Figure: the log contains actions A, B, C; counterfactual ML can evaluate an action that appears in the log (OK) but not one that does not (NG)]
※ The above is an extreme example; methods exist to make this case possible.
Slide 18
More
• The research team behind [5] has continued to publish follow-up work:
  • "The Self-Normalized Estimator for Counterfactual Learning"
  • "Recommendations as Treatments: Debiasing Learning and Evaluation"
  • "Unbiased Learning-to-Rank with Biased Feedback"
  • "Deep Learning with Logged Bandit Feedback"
• Many of these researchers are now at Microsoft Research
If you are interested, please check this line of work out!