Kazuki Taniguchi
September 29, 2018

# Counterfactual Machine Learning 入門 / Introduction to Counterfactual ML

This material is from my presentation at the 28th Machine Learning 15minutes! (https://machine-learning15minutes.connpass.com/event/97195/).


## Transcript

1. Introduction to Counterfactual Machine Learning
CyberAgent, Inc. — Ad Tech Division, AI Lab
Kazuki Taniguchi

2. Introduction
• Role: Research Scientist
• Research areas
• Basics of Machine Learning
• Response Prediction
• Counterfactual ML
• Past work (other than research)
• Development of an MLaaS
• Algorithm development for a DSP

3. What is Counterfactual ML?
※Counterfactual ML (Machine Learning) [1]

4. Counterfactual ML
• Defined here as "algorithms that evaluate, or learn models from, data in which counterfactuals arise"
(no guarantee that this is a rigorous definition)
• Algorithms that handle data in which counterfactuals arise:
• Interactive Learning ← (this talk uses this example)
• (Contextual) Bandit Algorithm
• Reinforcement Learning
• Covariate Shift

5. Supervised Learning
[Figure: features (context) xᵢ are mapped to predictions ŷᵢ = f(xᵢ), which are compared against the labels yᵢ; each prediction is marked "correct" or "miss"]

6. Problem Setting
• An ad image is shown to the user
• The user and the placement have a context (time of day, gender, topic, etc…)
• Only one ad image out of the candidates is displayed
• We want to serve ads so that the displayed ad gets clicked
[Figure: the policy π(x) selects which ad appears on the placement shown to the user]

7. Interactive Learning [2]
[Figure: given the feature (context) xᵢ, the policy selects the action aᵢ = π(xᵢ) — the ad shown on the placement — and the user's click-or-not response yields the reward rᵢ]
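The interaction loop on this slide can be sketched in a few lines of Python. The ad names, the toy click model, and the uniform logging policy below are all made up for illustration:

```python
import random

# Sketch of the interactive-learning protocol: per round, the policy pi maps
# a context x_i to an action a_i (which ad to show), and we observe a reward
# r_i (click or not) ONLY for the action that was actually shown.

ACTIONS = ["ad_A", "ad_B", "ad_C"]  # candidate ad images (made-up names)

def pi(x):
    """A toy logging policy: pick an ad uniformly at random."""
    return random.choice(ACTIONS)

def environment(x, a):
    """Stand-in for the user: returns 1 (click) or 0 (no click)."""
    return 1 if (a == "ad_A" and x["topic"] == "sports") else 0

log = []
contexts = [{"topic": "sports"}, {"topic": "news"}]
for x in contexts:
    a = pi(x)              # only ONE action is shown per impression
    r = environment(x, a)  # reward observed only for that action
    log.append((x, a, r))  # rewards of the other actions stay counterfactual
```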

8. Comparison with Supervised Learning
[Figure: in supervised learning every prediction can be checked against its label; in interactive learning only the click feedback for the action that was actually chosen is observed]
Counterfactual
• Evaluating the actions that were not selected is counterfactual
• When evaluating a new policy, the counterfactual actions cannot be evaluated

9. Comparison with Contextual Bandit
• The problem setting is the same
• Counterfactual ML mainly deals with offline (batch) learning
• Compared to the online setting, its advantage is that evaluation is easier
• Contextual Bandit updates the policy online
• The idea behind Counterfactual ML is the same as evaluating a Contextual Bandit policy (offline evaluation) [3]
Evaluation is covered in detail in an "AI Lab Research Blog" post [3], so it is omitted from this talk

10. Algorithms

11. Definitions
• Data: D = ((x₁, y₁, δ₁, p₁), …, (xₙ, yₙ, δₙ, pₙ))
• xᵢ: Context
• yᵢ: Labels (multi-label settings)
• δᵢ: Reward
• pᵢ: Propensity Score (described later)
• Policy: yᵢ = π(xᵢ), where π: Context → Action
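One logged record (xᵢ, yᵢ, δᵢ, pᵢ) from the definitions above can be written as a small data class — a minimal sketch, with field names of my own choosing:

```python
from dataclasses import dataclass

@dataclass
class LoggedExample:
    x: dict        # context x_i
    y: str         # action/label chosen by the logging policy
    delta: float   # observed reward for that action
    p: float       # propensity score: probability the logging policy gave y

# A toy dataset D of two logged impressions (values are illustrative).
D = [
    LoggedExample(x={"topic": "sports"}, y="ad_A", delta=1.0, p=0.5),
    LoggedExample(x={"topic": "news"}, y="ad_B", delta=0.0, p=0.25),
]
```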

12. Counterfactual Risk Minimization
• Unbiased estimation via importance sampling:
R̂(π) = (1/n) Σᵢ₌₁ⁿ δᵢ · π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ) = (1/n) Σᵢ₌₁ⁿ δᵢ · π(yᵢ|xᵢ) / pᵢ
δᵢ: loss, π₀: logging policy (→ propensity score)
• Introducing clipping with a constant M gives the IPS (Inverse Propensity Score) estimator [4]:
R̂ᴹ(π) = (1/n) Σᵢ₌₁ⁿ δᵢ · min{M, π(yᵢ|xᵢ) / pᵢ}
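The clipped IPS estimator above can be sketched as a short Python function (the function name and the toy log are my own):

```python
def ips_estimate(log, pi, M=10.0):
    """Clipped IPS estimate of R(pi) from logged data.

    log: list of (x, y, delta, p) tuples collected under pi_0,
         where p = pi_0(y|x) is the propensity score.
    pi:  function (y, x) -> probability the new policy assigns to y given x.
    M:   clipping constant bounding the importance weight.
    """
    total = 0.0
    for x, y, delta, p in log:
        w = min(M, pi(y, x) / p)  # clipped importance weight
        total += delta * w
    return total / len(log)
```

For example, evaluating a uniform policy over two actions on a two-record log with propensities 0.5 and 0.25 gives (1.0·1.0 + 0.0·2.0)/2 = 0.5.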

13. Counterfactual Risk Minimization
• CRM (Counterfactual Risk Minimization): minimize an upper bound on the generalization error:
argmin_h R̂(h) + λ · √(Var(u) / n)
where the second term is a data-dependent regularizer
※see the paper for details

14. POEM [5]
• The policy has the same form as in classification (linear + softmax):
π_w(y|x) = exp(w·φ(x, y)) / Σ_{y′∈Y} exp(w·φ(x, y′))
• Trained according to the following:
w* = argmin_{w∈ℝᵈ} ū_w + λ · √(Var_w(u) / n)
uᵢʷ ≡ δᵢ · min{M, π_w(yᵢ|xᵢ) / pᵢ}
ū_w = (1/n) Σᵢ₌₁ⁿ uᵢʷ
Var_w(u) ≡ (1/(n−1)) Σᵢ₌₁ⁿ (uᵢʷ − ū_w)²
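The POEM objective ū_w + λ√(Var_w(u)/n) with the linear-softmax policy can be sketched in NumPy as below. Names and the toy setup are my own, and a real implementation would then minimize this over w, e.g. by gradient descent:

```python
import numpy as np

def poem_objective(w, phi, data, actions, lam=0.1, M=10.0):
    """POEM objective: u_bar + lam * sqrt(Var(u)/n).

    data: list of (x, y, delta, p) logged records;
    phi(x, a) -> feature vector for a context/action pair.
    """
    n = len(data)
    u = np.empty(n)
    for i, (x, y, delta, p) in enumerate(data):
        # linear-softmax policy pi_w(.|x) from the slide
        scores = np.array([w @ phi(x, a) for a in actions])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        pi_y = probs[actions.index(y)]
        u[i] = delta * min(M, pi_y / p)  # clipped, propensity-weighted term u_i
    u_bar = u.mean()
    var_u = u.var(ddof=1) if n > 1 else 0.0  # sample variance, 1/(n-1)
    return u_bar + lam * np.sqrt(var_u / n)
```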

15. Experiments
• Dataset (multi-label experiments)
• Supervised to Bandit Conversion [6]
① Train the logging policy π₀ (a CRF) on 5% of the data
② Use the learned logging policy to assign labels y to the remaining 95% of the data
③ Compute the feedback δ from y and the true labels y* (Hamming loss)
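The three conversion steps above can be sketched as follows, assuming a toy multiclass dataset and a stochastic logging policy standing in for the CRF (0/1 reward stands in for the Hamming loss of the multi-label setting):

```python
import random

def to_bandit_log(dataset, logging_policy, labels):
    """Convert supervised (x, y_star) pairs into logged bandit feedback.

    For each example the logging policy samples a label y; the feedback
    delta is computed by comparing y with the true y_star, and p stores
    the propensity score of the sampled label."""
    log = []
    for x, y_star in dataset:
        probs = logging_policy(x)                # distribution over labels
        y = random.choices(labels, weights=probs)[0]
        p = probs[labels.index(y)]               # propensity score p_i
        delta = 1.0 if y == y_star else 0.0      # feedback from y vs. y*
        log.append((x, y, delta, p))
    return log
```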

16. Experimental Results [5]
• Test set Hamming Loss
• Computational time (seconds)

17. Note
• Accurate predictions cannot be made for labels that do not appear in the log
e.g., when a new ad is added
[Figure: the log contains ads A, B, C; a policy choosing among B and A is OK, but choosing an ad absent from the log is NG]
※The above is an extreme example; methods that make this possible also exist

18. More
• The research team behind [5] has continued to publish on this topic:
• "The Self-Normalized Estimator for Counterfactual Learning"
• "Recommendations as Treatments: Debiasing Learning and Evaluation"
• "Unbiased Learning-to-Rank with Biased Feedback"
• "Deep Learning with Logged Bandit Feedback"
• Many researchers at Microsoft Research also work in this area
If you are interested, please look into it!

19. Summary

20. Summary
• Counterfactual ML
• Evaluates and learns from counterfactuals
• The ad banner display problem is a typical use case
• Algorithms
• IPS Estimator
• POEM
• Experiments

21. Research is also being strengthened at AI Lab
https://arxiv.org/abs/1809.03084
Yusuke Narita, Shota Yasui, Kohei Yata,
"Efficient Counterfactual Learning from Bandit Feedback", arXiv, 2018
A paper related to the content of this talk
See our website for details

22. ﬁn.

23. References
1. SIGIR 2016 Tutorial on Counterfactual Evaluation and Learning