
Counterfactual Machine Learning 入門 / Introduction to Counterfactual ML

This material was presented at "Machine Learning 15minutes! #28" (https://machine-learning15minutes.connpass.com/event/97195/).

Kazuki Taniguchi

September 29, 2018

Transcript

  1. Counterfactual Machine Learning 入門 /
     Introduction to Counterfactual ML
     CyberAgent, Inc.
     Ad Tech Division, AI Lab
     Kazuki Taniguchi


  2. Introduction
    • Role: Research Scientist
    • Research areas
      • Basics of Machine Learning
      • Response Prediction
      • Counterfactual ML
    • Past work (other than research)
      • Development of an MLaaS
      • DSP algorithm development


  3. What is Counterfactual ML?
    ※Counterfactual ML (Machine Learning) [1]


  4. Counterfactual ML
    • Defined here as "algorithms that evaluate, or learn models from,
      data in which counterfactuals arise"
      (no guarantee that this is a rigorous definition)
    • Algorithms that handle data in which counterfactuals arise:
      • Interactive Learning ← (this talk uses this example)
      • (Contextual) Bandit Algorithms
      • Reinforcement Learning
      • Covariate Shift


  5. Supervised Learning
    [Figure: digit-recognition example. Features (context) x_i are mapped to
    predictions ŷ_i = f(x_i) and compared against labels y_i; each prediction
    is marked "correct" or "miss" (e.g. predicting 8 where the label is 5)]


  6. Problem Setting: Ad Selection
    • An ad image is shown to the user
    • The user and the placement have context (date/time, gender, topic, etc.)
    • Only one ad image is displayed, chosen from the candidates by a policy π(x)
    • We want to deliver ads so that the displayed ad gets clicked


  7. Interactive Learning [2]
    • Feature (Context): x_i, describing the user and the placement
    • Action: a_i = π(x_i), the ad the policy chooses to display
    • Reward: r_i, whether the user clicks or not
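
    To make this loop concrete, here is a minimal Python sketch; the number of
    candidate ads, the context dimension, the linear scoring policy, and the
    click probabilities are all made-up assumptions for illustration, not part
    of the original slides.

        import numpy as np

        rng = np.random.default_rng(0)
        n_ads, dim = 5, 3  # hypothetical: 5 candidate ads, 3 context features

        def policy(x, W):
            """Deterministic policy π: show the ad with the highest linear score."""
            return int(np.argmax(W @ x))

        W = rng.normal(size=(n_ads, dim))                # logging-policy parameters
        true_ctr = rng.uniform(0.01, 0.10, size=n_ads)   # unknown click rates

        for i in range(3):
            x_i = rng.normal(size=dim)          # context of user and placement
            a_i = policy(x_i, W)                # action: the single ad shown
            r_i = rng.random() < true_ctr[a_i]  # reward: click or not
            print(f"showed ad {a_i}, clicked={r_i}")
            # rewards of the ads NOT shown are never observed → counterfactual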


  8. Comparison with Supervised Learning
    [Figure: in supervised learning the label of every action is observed; in
    interactive learning only the reward of the action actually taken (e.g. a
    click) is observed]
    • Evaluating the actions that were not selected is counterfactual
    • When evaluating a new policy, its counterfactual actions cannot be evaluated


  9. Comparison with Contextual Bandit
    • The problem setting is the same
    • Counterfactual ML mainly deals with offline (batch) learning
      • Unlike the online setting, evaluation is easy to carry out, which is its advantage
    • Contextual bandits update the policy online
    • The idea behind counterfactual ML is the same as evaluating a contextual
      bandit's policy offline (offline evaluation) [3]

    Evaluation is covered in detail in an AI Lab Research Blog article [3],
    so it is omitted from this talk


  10. Algorithms


  11. Definitions
    • Data: D = ((x_1, y_1, δ_1, p_1), ..., (x_n, y_n, δ_n, p_n))
      • x_i : context
      • y_i : label (multi-label setting), chosen by the policy: y_i = π(x_i)
      • δ_i : reward
      • p_i : propensity score (described later)
    • Policy: π (context → action)
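
    As a concrete picture of one record of D, here is a minimal Python sketch;
    the class name and field names are hypothetical, chosen only for this
    illustration.

        from typing import NamedTuple
        import numpy as np

        class LoggedExample(NamedTuple):
            """One record (x_i, y_i, δ_i, p_i) of the logged dataset D."""
            x: np.ndarray   # context
            y: int          # label/action chosen by the logging policy
            delta: float    # observed reward (or loss)
            p: float        # propensity score: probability π_0(y_i | x_i)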


  12. Counterfactual Risk Minimization
    • Unbiased estimation via importance sampling:

      R̂(π) = (1/n) Σ_{i=1}^n δ_i · π(y_i|x_i) / π_0(y_i|x_i)
            = (1/n) Σ_{i=1}^n δ_i · π(y_i|x_i) / p_i

      • δ_i : loss
      • π_0 : logging policy (→ propensity score)
    • Introducing a clipping constant M gives the IPS (Inverse Propensity
      Score) estimator [4]:

      R̂_M(π) = (1/n) Σ_{i=1}^n min{ M, δ_i · π(y_i|x_i) / p_i }
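
    A minimal sketch of this estimator in Python, assuming the per-example
    losses, new-policy probabilities, and propensity scores are given as
    arrays (the function name and arguments are hypothetical):

        import numpy as np

        def ips_estimate(delta, pi_new, p, M=np.inf):
            """Clipped IPS estimate of R(π) from logged bandit feedback.

            delta  : losses δ_i observed under the logging policy
            pi_new : probabilities π(y_i|x_i) the NEW policy assigns
                     to the logged actions
            p      : propensity scores p_i = π_0(y_i|x_i)
            M      : clipping constant; M = inf gives the unclipped,
                     unbiased estimator
            """
            terms = np.asarray(delta) * np.asarray(pi_new) / np.asarray(p)
            return float(np.mean(np.minimum(M, terms)))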


  13. Counterfactual Risk Minimization
    • CRM (Counterfactual Risk Minimization):

      argmin_h  R̂(h) + λ √( Var(u) / n )

      where the second term is a data-dependent regularizer
    • This minimizes an upper bound on the generalization error
      ※ see the paper for details
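
    A minimal sketch of the CRM objective, assuming the per-example clipped
    importance-weighted losses u_i have already been computed (as in the IPS
    sketch above):

        import numpy as np

        def crm_objective(u, lam):
            """CRM objective: empirical risk plus the variance regularizer."""
            u = np.asarray(u)
            n = len(u)
            # mean of u_i, plus λ·sqrt(Var(u)/n) with the unbiased variance
            return float(u.mean() + lam * np.sqrt(u.var(ddof=1) / n))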


  14. POEM [5]
    • Uses the same form of policy as classification (linear + softmax):

      π_w(y|x) = exp(w·φ(x, y)) / Σ_{y'∈Y} exp(w·φ(x, y'))

    • Trained according to the objective below:

      w* = argmin_{w∈R^d}  ū_w + λ √( Var_w(u) / n )

      u_i^w ≡ δ_i · min{ M, π_w(y_i|x_i) / p_i }
      ū_w ≡ (1/n) Σ_{i=1}^n u_i^w
      Var_w(u) ≡ (1/(n-1)) Σ_{i=1}^n (u_i^w − ū_w)²
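
    A minimal sketch of this objective for a linear softmax policy, optimized
    with L-BFGS as in the experiments below; the synthetic logged data (uniform
    logging policy, random losses) and all variable names are stand-in
    assumptions for illustration:

        import numpy as np
        from scipy.optimize import minimize

        def poem_objective(w_flat, X, y, delta, p, n_actions, M=100.0, lam=0.1):
            """Clipped IPS risk + variance regularizer for π_w(y|x) = softmax."""
            n, dim = X.shape
            W = w_flat.reshape(n_actions, dim)
            scores = X @ W.T                             # (n, n_actions)
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            probs = np.exp(scores)
            probs /= probs.sum(axis=1, keepdims=True)
            pi_w = probs[np.arange(n), y]                # π_w(y_i | x_i)
            u = delta * np.minimum(M, pi_w / p)          # u_i^w
            return u.mean() + lam * np.sqrt(u.var(ddof=1) / n)

        # synthetic logged bandit feedback (stand-in data)
        rng = np.random.default_rng(0)
        n, dim, n_actions = 200, 5, 4
        X = rng.normal(size=(n, dim))
        y = rng.integers(n_actions, size=n)     # actions from a uniform π_0
        p = np.full(n, 1.0 / n_actions)         # propensity scores
        delta = rng.random(n)                   # observed losses
        res = minimize(poem_objective, np.zeros(n_actions * dim),
                       args=(X, y, delta, p, n_actions), method="L-BFGS-B")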


  15. Experiments
    • Dataset (multi-label experiments)
    • Supervised to Bandit Conversion [6]
      [Figure: a CRF logging policy π_0 is trained on 5% of the data (x, y*)
      and assigns labels y to the remaining 95%]
      ① Train the logging policy on 5% of the full data
      ② Use the learned logging policy to label the remaining 95% of the data
      ③ Compute the feedback δ from y and y* (Hamming loss)
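
    A minimal sketch of this conversion, with a logistic-regression stand-in
    for the CRF in the slide and a synthetic single-label dataset instead of
    the multi-label ones (all names and data are assumptions for illustration):

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 10))
        y_star = (X[:, 0] + X[:, 1] > 0).astype(int)  # ground-truth labels y*

        # (1) train the logging policy π_0 on 5% of the data
        X_log, X_bandit, y_log, y_star_bandit = train_test_split(
            X, y_star, train_size=0.05, random_state=0)
        pi_0 = LogisticRegression().fit(X_log, y_log)

        # (2) π_0 samples labels y for the remaining 95%, logging propensities
        probs = pi_0.predict_proba(X_bandit)
        y = np.array([rng.choice(len(pr), p=pr) for pr in probs])
        p = probs[np.arange(len(y)), y]               # propensity scores p_i

        # (3) feedback δ: loss between logged y and true y*
        #     (Hamming loss in the multi-label case; 0/1 loss here)
        delta = (y != y_star_bandit).astype(float)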


  16. Experimental Results [5]
    • Test set Hamming Loss
    • Computational time (seconds)
    S: AdaGrad, B: L-BFGS


  17. Note
    • Accurate predictions cannot be made for labels that do not appear in the log
      e.g. when a new ad is added
      [Figure: the log contains ads A, B, C; a policy that chooses B or A can
      be evaluated offline (OK), while an ad absent from the log cannot (NG)]
    ※ the above is an extreme example, and methods that make this possible also exist


  18. More
    • The research team behind [5] continues to publish follow-up work:
      • "The Self-Normalized Estimator for Counterfactual Learning"
      • "Recommendations as Treatments: Debiasing Learning and Evaluation"
      • "Unbiased Learning-to-Rank with Biased Feedback"
      • "Deep Learning with Logged Bandit Feedback"
    • Many researchers at Microsoft Research also work in this area
    If you are interested, please look into it!


  19. Summary


  20. Summary
    • Counterfactual ML
      • Evaluates and learns from counterfactuals
      • The ad banner display problem is a typical example
    • Algorithms
      • IPS Estimator
      • POEM
    • Experiments


  21. Research is also being strengthened at AI Lab
    A paper related to today's content:
    Yusuke Narita, Shota Yasui, Kohei Yata,
    "Efficient Counterfactual Learning from Bandit Feedback", arXiv, 2018
    (https://arxiv.org/abs/1809.03084)
    For details, see our website: https://adtech.cyberagent.io/ailab/


  22. fin.


  23. References
    1. SIGIR 2016 Tutorial on Counterfactual Evaluation and Learning
       (http://www.cs.cornell.edu/~adith/CfactSIGIR2016/)
    2. ICML 2017 Tutorial on Real World Interactive Learning
       (http://hunch.net/~rwil/)
    3. Evaluation of Bandit Algorithms and Causal Inference (in Japanese)
       (https://adtech.cyberagent.io/research/archives/199)
    4. Counterfactual Reasoning and Learning Systems, 2017
       (https://arxiv.org/abs/1209.2355)
    5. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
       (https://arxiv.org/abs/1502.02362)
