Slide 1

Introduction to Counterfactual Machine Learning
CyberAgent, Inc., Ad Tech Division, AI Lab
Kazuki Taniguchi

Slide 2

Introduction
• Role: Research Scientist
• Research areas
  • Basics of Machine Learning
  • Response Prediction
  • Counterfactual ML
• Past work (outside research)
  • Development of an MLaaS platform
  • DSP algorithm development

Slide 3

What is Counterfactual ML?
※ Counterfactual ML = Counterfactual Machine Learning [1]

Slide 4

Counterfactual ML
• Defined here as "algorithms for evaluating, or for learning models from, data in which counterfactuals arise"
  (no guarantee that this is a rigorous definition)
• Algorithm families that handle data with counterfactuals:
  • Interactive Learning ← (this talk uses this as the running example)
  • (Contextual) Bandit Algorithms
  • Reinforcement Learning
  • Covariate Shift

Slide 5

Supervised Learning
(Figure: each feature/context x_i is mapped to a prediction ŷ_i = f(x_i), and every prediction can be checked against its label y_i and marked "correct" or "miss".)

Slide 6

Problem Setting: Ad Selection
• Show an ad image to the user
• The user and the placement carry a context (time of day, gender, topic, etc.)
• Exactly one ad image is displayed, chosen from the candidates by the policy π(x)
• We want to serve ads so that the displayed ad gets clicked
(Figure: the policy π(x) selects the ad shown on the placement to the user.)

Slide 7

Interactive Learning [2]
(Figure: the system observes a feature/context x_i, the policy selects an action a_i = π(x_i), i.e. the ad shown on the placement, and the user returns a reward r_i: click or not. A rough code sketch of this loop follows.)
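A minimal Python sketch of this interaction loop, assuming a toy uniform policy, a synthetic click model, and hypothetical ad/context names (none of these come from the slides):

```python
import random

ADS = ["A", "B", "C"]  # hypothetical ad candidates

def policy(x, rng):
    """Toy stochastic policy: choose each ad uniformly at random."""
    probs = {a: 1.0 / len(ADS) for a in ADS}
    action = rng.choices(ADS, weights=list(probs.values()))[0]
    return action, probs[action]  # chosen action and its propensity

def serve(num_rounds=5, seed=0):
    rng = random.Random(seed)
    log = []
    for _ in range(num_rounds):
        # Context of the user / placement (synthetic stand-in).
        x = {"hour": rng.randrange(24), "topic": rng.choice(["news", "sports"])}
        a, p = policy(x, rng)         # a_i = pi(x_i), with propensity p_i
        r = int(rng.random() < 0.1)   # reward r_i: click (1) or not (0)
        log.append((x, a, r, p))      # only the shown ad's outcome is observed
    return log

if __name__ == "__main__":
    for entry in serve():
        print(entry)
```

Note that only the chosen action's reward ends up in the log, which is exactly why evaluating the unchosen actions later becomes counterfactual.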

Slide 8

Comparison with Supervised Learning
(Figure: in supervised learning the label of every example is observed; in interactive learning only the feedback for the action that was actually shown, e.g. a click, is observed.)
• The evaluation of actions that were not selected is counterfactual
• When evaluating a new policy, we cannot directly evaluate its counterfactual actions

Slide 9

Comparison with Contextual Bandits
• The problem setting is the same
• Counterfactual ML mainly deals with offline (batch) learning
  • Unlike the online setting, its advantage is that evaluation is easier to carry out
• Contextual bandits update the policy online
• The idea behind counterfactual ML is the same as evaluating a contextual bandit policy offline (Offline Evaluation) [3]
Evaluation is covered in detail in an AI Lab Research Blog article [3], so it is omitted from this talk.

Slide 10

Algorithms

Slide 11

Definitions
• Data: D = ((x_1, y_1, δ_1, p_1), ..., (x_n, y_n, δ_n, p_n))
  • x_i: context
  • y_i: labels (multi-label setting)
  • δ_i: reward
  • p_i: propensity score (described later)
• Policy: y_i = π(x_i), where π maps a context to an action
(One possible in-code representation of this data is sketched below.)
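As a sketch, the logged data could be represented as follows; the class and field names are illustrative and do not come from any referenced implementation:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class LoggedExample:
    """One tuple (x_i, y_i, delta_i, p_i) of the dataset D."""
    x: Any        # context (e.g., time, gender, topic)
    y: Any        # action/labels chosen by the logging policy
    delta: float  # observed reward/loss for that action
    p: float      # propensity: pi_0(y | x) under the logging policy

Dataset = List[LoggedExample]
```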

Slide 12

Counterfactual Risk Minimization
• Unbiased estimation via importance sampling:

  \hat{R}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} = \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{\pi(y_i \mid x_i)}{p_i}

  • δ_i: loss
  • π_0: logging policy (its action probability p_i is the propensity score)
• Introducing a clipping constant M gives the IPS (Inverse Propensity Score) estimator below [4]:

  \hat{R}^M(\pi) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \min\left\{ M, \frac{\pi(y_i \mid x_i)}{p_i} \right\}
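A small numpy sketch of the clipped IPS estimator above; the function name and the toy numbers are invented for illustration:

```python
import numpy as np

def clipped_ips(deltas, new_probs, props, M=10.0):
    """Clipped IPS estimate of R(pi): mean of delta_i * min(M, pi(y_i|x_i)/p_i).

    new_probs are pi(y_i|x_i) under the policy being evaluated;
    props are the logged propensities p_i = pi_0(y_i|x_i).
    """
    deltas = np.asarray(deltas, dtype=float)
    weights = np.minimum(M, np.asarray(new_probs) / np.asarray(props))
    return float(np.mean(deltas * weights))

# Toy usage: three logged impressions with 0/1 loss.
print(clipped_ips(deltas=[1, 0, 1],
                  new_probs=[0.5, 0.2, 0.9],
                  props=[0.33, 0.33, 0.33]))
```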

Slide 13

Counterfactual Risk Minimization
• CRM (Counterfactual Risk Minimization) minimizes an upper bound on the generalization error (see the paper for details):

  h^{\mathrm{CRM}} = \operatorname*{arg\,min}_{h} \left\{ \hat{R}(h) + \lambda \sqrt{\frac{\mathrm{Var}_w(u)}{n}} \right\}

  • the second term is a data-dependent (variance) regularizer
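As a sketch, the CRM objective can be computed from the per-example terms u_i = δ_i · min{M, π(y_i|x_i)/p_i}; the function name and sample values below are hypothetical:

```python
import numpy as np

def crm_objective(u, lam=0.1):
    """Clipped-IPS risk plus the data-dependent variance regularizer."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    risk = u.mean()      # empirical risk estimate
    var = u.var(ddof=1)  # sample variance of the u_i
    return risk + lam * np.sqrt(var / n)

print(crm_objective([0.9, 0.0, 1.4, 0.2]))
```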

Slide 14

POEM [5]
• Uses the same form of policy as classification: linear model + softmax

  \pi_w(y \mid x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))}

• Trained according to the following objective:

  w^* = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \left\{ \bar{u}^w + \lambda \sqrt{\frac{\mathrm{Var}_w(u)}{n}} \right\}

  where

  u_i^w \equiv \delta_i \min\left\{ M, \frac{\exp(w \cdot \phi(x_i, y_i))}{p_i \sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_i, y'))} \right\}, \quad
  \bar{u}^w \equiv \frac{1}{n} \sum_{i=1}^{n} u_i^w, \quad
  \mathrm{Var}_w(u) \equiv \frac{1}{n-1} \sum_{i=1}^{n} (u_i^w - \bar{u}^w)^2
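A rough numpy sketch of the POEM objective under this linear-softmax policy, assuming a trivially simple feature map φ(x, y) (one weight row per action) and randomly generated logged data; it only evaluates the objective and omits the actual optimization (the paper [5] uses AdaGrad or L-BFGS):

```python
import numpy as np

def softmax_probs(W, x):
    """pi_w(y | x) for every action; W holds one weight row per action."""
    scores = W @ x
    scores -= scores.max()  # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def poem_objective(W, data, M=10.0, lam=0.1):
    """u_bar^w + lam * sqrt(Var_w(u)/n), with u_i = delta_i * min(M, pi_w/p_i)."""
    u = []
    for x, y, delta, p in data:
        ratio = softmax_probs(W, x)[y] / p
        u.append(delta * min(M, ratio))
    u = np.asarray(u)
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / len(u))

# Toy usage: 2 actions, 3-dimensional contexts, uniform logging policy.
rng = np.random.default_rng(0)
data = [(rng.normal(size=3), int(rng.integers(2)), int(rng.integers(2)), 0.5)
        for _ in range(100)]
W = np.zeros((2, 3))
print(poem_objective(W, data))
```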

Slide 15

Experiments
• Dataset (multi-label experiments)
• Supervised-to-Bandit Conversion [6] (see the sketch below):
  ① Train the logging policy π_0 (a CRF) on 5% of the full data
  ② Use the obtained logging policy to assign labels y to the remaining 95% of the data
  ③ Compute the feedback δ from y and the true labels y* (Hamming loss)
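A rough sketch of this conversion; independent per-label logistic models stand in for the CRF logging policy (an assumption made for brevity), and the Hamming loss between the sampled labels y and the true labels y* provides the bandit feedback δ:

```python
import numpy as np

def to_bandit_feedback(X, Y_true, W, rng):
    """Convert a multi-label supervised set (X, Y_true) into logged tuples."""
    log = []
    for x, y_star in zip(X, Y_true):
        probs = 1.0 / (1.0 + np.exp(-(W @ x)))            # per-label probabilities
        y = (rng.random(len(probs)) < probs).astype(int)  # sample y ~ pi_0(.|x)
        p = float(np.prod(np.where(y == 1, probs, 1 - probs)))  # propensity of y
        delta = float(np.mean(y != y_star))               # Hamming loss vs y*
        log.append((x, y, delta, p))
    return log

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))            # 10 contexts, 4 features
Y_true = rng.integers(2, size=(10, 3))  # 3 binary labels per context
W = rng.normal(size=(3, 4))             # toy "logging policy" weights
print(to_bandit_feedback(X, Y_true, W, rng)[0])
```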

Slide 16

Experimental Results [5]
(Tables from [5]: test-set Hamming loss per dataset, and computational time in seconds; S: AdaGrad, B: L-BFGS.)

Slide 17

Note
• Accurate predictions cannot be made for labels (actions) that do not appear in the log
  e.g., when a new ad is added
(Figure: ads that appear in the log can be evaluated (OK); a newly added ad absent from the log cannot (NG).)
※ The above is an extreme example, and methods that make this possible do exist.

Slide 18

More
• The research team behind [5] has continued to publish follow-up work:
  • "The Self-Normalized Estimator for Counterfactual Learning"
  • "Recommendations as Treatments: Debiasing Learning and Evaluation"
  • "Unbiased Learning-to-Rank with Biased Feedback"
  • "Deep Learning with Logged Bandit Feedback"
• Many researchers at Microsoft Research also work in this area
If you are interested, please look into these!

Slide 19

Summary

Slide 20

Summary
• Counterfactual ML
  • Evaluates and learns from counterfactuals
  • The ad banner selection problem is a typical use case
• Algorithms
  • IPS Estimator
  • POEM
• Experiments

Slide 21

AI Lab is also strengthening its research in this area.
Paper related to today's content:
Yusuke Narita, Shota Yasui, Kohei Yata, "Efficient Counterfactual Learning from Bandit Feedback", arXiv, 2018
(https://arxiv.org/abs/1809.03084)
For details, see our website: https://adtech.cyberagent.io/ailab/

Slide 22

fin.

Slide 23

References
1. SIGIR 2016 Tutorial on Counterfactual Evaluation and Learning
   (http://www.cs.cornell.edu/~adith/CfactSIGIR2016/)
2. ICML 2017 Tutorial on Real World Interactive Learning
   (http://hunch.net/~rwil/)
3. Evaluation of Bandit Algorithms and Causal Inference (AI Lab Research Blog, in Japanese)
   (https://adtech.cyberagent.io/research/archives/199)
4. Counterfactual Reasoning and Learning Systems, 2013
   (https://arxiv.org/abs/1209.2355)
5. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
   (https://arxiv.org/abs/1502.02362)