Why Is Target Encoding Effective? (Target Encoding はなぜ有効なのか)

Shuhei Goda
November 30, 2019

Transcript

  1. ©2019 Wantedly, Inc.
    Why Is Target Encoding Effective?
    Data Analysis Competition LT Meetup (分析コンペLT会)
    Nov 30, 2019 - Shuhei Goda - @jy_msc

  2. Self-Introduction
    • Shuhei Goda (合田 周平)
    • Wantedly, Inc. (since Sep 2019)
    • Recommendation Team
    https://www.wantedly.com/projects/375150
    Kaggle Master
    On Twitter under the name hakubishin: @jy_msc
    We are hiring!

  3. About Talk
    • Why is Target Encoding effective?
    • It is one of the standard techniques on Kaggle
    • In some cases Target Encoding works better than Label Encoding
    • There is little material that explains why Target Encoding gives good results
    • This talk presents my own interpretation of why Target Encoding is effective

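  4. What is Target Encoding?
    • A technique that uses the target variable to convert a categorical variable into numeric values
    • Each level of the categorical variable is replaced by the expected value of the target within that level
    • In general, the more levels the variable has, the larger the expected benefit
    For caveats and implementation details of Target Encoding, please check the Kaggle book (Kaggle本)!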
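
As a minimal sketch of the replacement described above, the following pure-Python snippet maps each level to the mean target observed for it. The function name `target_encode` and the toy values are made up for illustration. The sketch also computes the means on the same rows it encodes, which leaks the target; the caveats mentioned on the slide (for example out-of-fold encoding) are deliberately left out.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category by the mean target observed for that category.

    Naive version: the means are computed on the same rows that get encoded.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means


# Toy data (illustrative values only).
x = ["A", "B", "A", "C", "B", "C"]
y = [62.0, 18.0, 58.0, 50.0, 22.0, 50.0]

encoded, level_means = target_encode(x, y)
print(level_means)
print(encoded)
```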

  5. Why is it effective?
    • It has the effect of simplifying the model
      • From here on, GBDT is used as the running example

  6. Sample data
    • The explanation uses the following data
      • The target variable y is continuous
      • The explanatory variable x is a categorical variable with 4 levels, x ∈ {A, B, C, D}
        • E[y|x=A]=60, E[y|x=B]=20, E[y|x=C]=50, E[y|x=D]=10
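
Data of this shape could be generated as in the sketch below. The slides give only the four conditional means; the sample size per level (20), the Gaussian noise, its standard deviation (5) and the seed are assumptions chosen for illustration, so the resulting numbers will not match the loss values shown on later slides.

```python
import random

# Conditional means E[y | x = level] taken from the slide.
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}


def make_sample_data(n_per_level=20, noise_sd=5.0, seed=0):
    """Draw y ~ Normal(mean of the level, noise_sd) for each level.

    n_per_level, noise_sd and seed are assumptions, not values from the talk.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for level, mean in LEVEL_MEANS.items():
        for _ in range(n_per_level):
            xs.append(level)
            ys.append(rng.gauss(mean, noise_sd))
    return xs, ys


xs, ys = make_sample_data()
sample_means = {}
for level in LEVEL_MEANS:
    vals = [y for x, y in zip(xs, ys) if x == level]
    sample_means[level] = sum(vals) / len(vals)
    print(level, round(sample_means[level], 1))
```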

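  7. GBDT review
    Dataset:
    $$D = \{(x_i, y_i)\}_{i=1}^{n} \quad (x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$
    Additive model:
    $$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i) = \sum_{m=1}^{M} w_m(x_i)$$
    Loss function:
    $$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
    Here $w_m(x)$ is the leaf weight of the m-th tree, $T$ is the number of leaves in a tree, and $M$ is the number of trees.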

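  8. GBDT review
    Loss function when adding the m-th tree:
    $$L^{(m)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(m-1)} + f_m(x_i)\big) + \Omega(f_m)$$
    $$\simeq \sum_{i=1}^{n} \Big[ g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \Big] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
    $$= \sum_{j=1}^{T} \Big[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \Big] + \gamma T$$
    The second line is a second-order Taylor expansion in which the constant term $l(y_i, \hat{y}_i^{(m-1)})$ is dropped.
    $I_j$ is the set of data points assigned to the j-th leaf.
    $g_i, h_i$ are the first and second derivatives of the loss at the prediction after m-1 trees:
    gradient: $$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}}$$
    hessian: $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2}$$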
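
The per-leaf form of the loss is a separate quadratic in each $w_j$, so it can be minimized leaf by leaf. The short derivation below spells that step out; the shorthand $G_j$ and $H_j$ is introduced here for readability and does not appear on the slides.

```latex
\begin{aligned}
G_j &= \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \\
\frac{\partial}{\partial w_j} \Big[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \Big]
  &= G_j + (H_j + \lambda) w_j = 0
  \;\Longrightarrow\; w_j^* = -\frac{G_j}{H_j + \lambda} \\
G_j w_j^* + \tfrac{1}{2} (H_j + \lambda) (w_j^*)^2
  &= -\frac{G_j^2}{H_j + \lambda} + \frac{1}{2} \frac{G_j^2}{H_j + \lambda}
  = -\frac{1}{2} \frac{G_j^2}{H_j + \lambda}
\end{aligned}
```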

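  9. GBDT review
    Once the samples have been assigned to leaves, the optimal weight of leaf j is
    $$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
    and the loss at that point is
    $$L^{(m)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
    At each node, the best split is found by checking how much the loss decreases when the samples are split.
    gain:
    $$L_{bef} - (L_{af,left} + L_{af,right})$$
    [Figure: a node holding A, B, C, D with loss L_bef is split into two children, {A, B} and {C, D}, with losses L_af,left and L_af,right]
    The larger the gain (the difference in loss before and after the split), the better the split.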
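
The three quantities on this slide can be written as small functions. This is a sketch, not the implementation of any particular library: the function names are made up, the $\lambda$ term is included in the denominator, and the $\gamma T$ term is left out for brevity.

```python
def leaf_weight(g, h, lam=0.0):
    """Optimal leaf weight: -(sum of gradients) / (sum of hessians + lambda)."""
    return -sum(g) / (sum(h) + lam)


def leaf_loss(g, h, lam=0.0):
    """Loss contribution of one leaf at its optimal weight (gamma term omitted)."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam)


def split_gain(g, h, left_idx, right_idx, lam=0.0):
    """Loss before the split minus the summed loss of the two children."""
    def pick(values, idx):
        return [values[i] for i in idx]

    before = leaf_loss(g, h, lam)
    after = (leaf_loss(pick(g, left_idx), pick(h, left_idx), lam)
             + leaf_loss(pick(g, right_idx), pick(h, right_idx), lam))
    return before - after


# Four samples with made-up gradients and unit hessians.
g = [-60.0, -60.0, -10.0, -10.0]
h = [1.0, 1.0, 1.0, 1.0]

gain = split_gain(g, h, left_idx=[0, 1], right_idx=[2, 3])
w_left = leaf_weight(g[:2], h[:2])
print(gain, w_left)
```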

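  10. GBDT review
    When the loss function is MSE
    Loss function:
    $$l(y_i, \hat{y}_i) = \frac{1}{2} (y_i - \hat{y}_i)^2$$
    gradient: $$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}} = \hat{y}_i^{(m-1)} - y_i$$
    hessian: $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2} = 1$$
    Hence, with $\lambda = 0$,
    $$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i} = -\frac{\sum_{i \in I_j} (\hat{y}_i^{(m-1)} - y_i)}{\sum_{i \in I_j} 1}$$
    The numerator is the sum of residuals (true value minus the prediction after m-1 trees) and the denominator is the number of samples.
    The weight of leaf j is therefore the mean residual of the samples assigned to leaf j.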
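
A quick numerical check of this statement, as a sketch with made-up values: for the squared-error loss the gradient is the negative residual and the hessian is 1, so the optimal weight (with no regularization) equals the mean residual.

```python
def mse_grad_hess(y_true, y_pred):
    """Gradient and hessian of l = 0.5 * (y - y_hat)^2 with respect to y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0] * len(y_true)
    return g, h


# Made-up targets and predictions after m-1 trees for one leaf.
y_true = [58.0, 62.0, 60.0]
y_pred = [50.0, 50.0, 50.0]

g, h = mse_grad_hess(y_true, y_pred)
w_star = -sum(g) / sum(h)  # lambda = 0
mean_residual = sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)
print(w_star, mean_residual)
```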

  11. GBDT settings
    • Consider a simple model.
      • loss_func = 'MSE'
      • eta = 1 → step size
      • iteration = 1 → consider only the first tree
      • tree_method = 'exact' → brute-force exhaustive search
      • base_score = 0 → the initial prediction starts at 0
      • lambda = 0
      • gamma = 0
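
Several of these names (eta, tree_method, base_score, lambda, gamma) resemble XGBoost parameters, although the slides do not say which library was used. Under that assumption, and assuming the squared-error loss from the preceding derivation, the settings might be written as the following parameter dictionary. This is a configuration sketch only and is not taken from the talk.

```python
# Hypothetical mapping of the slide's settings onto XGBoost-style parameter names.
params = {
    "objective": "reg:squarederror",  # squared-error loss (assumed from the derivation)
    "eta": 1.0,                       # step size: use the full leaf weight
    "tree_method": "exact",           # exhaustive search over split points
    "base_score": 0.0,                # initial prediction starts at 0
    "lambda": 0.0,                    # no L2 regularization on leaf weights
    "gamma": 0.0,                     # no penalty per leaf
}
num_boost_round = 1                   # "iteration = 1": only the first tree
```

With XGBoost installed, such a dictionary would typically be passed as `xgboost.train(params, dtrain, num_boost_round=num_boost_round)`, where `dtrain` is a `DMatrix` built from the data.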

  12. Using Label Encoding
    • Label-encode the categorical variable in alphabetical order
    • Sort the samples by feature value
    [Figure: samples sorted by the label-encoded feature, annotated "a sad-looking graph..."]
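
The two steps on this slide could look like the sketch below. The helper name `label_encode` and the toy input are made up; the codes are assigned in alphabetical order as the slide describes.

```python
def label_encode(categories):
    """Assign integer codes to the levels in alphabetical order."""
    levels = sorted(set(categories))
    mapping = {level: code for code, level in enumerate(levels)}
    return [mapping[c] for c in categories], mapping


xs = ["C", "A", "D", "B", "A"]
codes, mapping = label_encode(xs)

# Sort the sample indices by the encoded feature value (stable sort).
order = sorted(range(len(xs)), key=lambda i: codes[i])
sorted_levels = [xs[i] for i in order]
print(mapping)
print(sorted_levels)
```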

  13. Using Label Encoding (depth=1)
    Root node {A, B, C, D}: L1 = −48797
    Candidate splits (each child takes its optimal weight w*_left / w*_right):
    • A | B C D: L2,left = −35522, L2,right = −21391, L2 = −56913
    • A B | C D: L2,left = −31832, L2,right = −17951, L2 = −49783
    • A B C | D: L2,left = −56097, L2,right = −996, L2 = −57093

  14. Using Label Encoding (depth=1)
    Same candidates as on the previous slide.
    Splitting at A B C | D looks best (L2 = −57093, the lowest loss of the three candidates).
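
The candidate evaluation can be reproduced in spirit with the sketch below. It assumes MSE with base_score = 0 and lambda = 0, so that g_i = −y_i and h_i = 1, and it uses noise-free data with 20 samples per level (an assumption; the actual sample data is not given). The totals are therefore close to, but not the same as, the values on the slides. In this idealized data the splits A | B C D and A B C | D tie exactly, because the four means are symmetric around 35; on the slides the noise in the data breaks the tie.

```python
def leaf_loss(ys):
    """Leaf loss at the optimal weight for MSE, base_score = 0, lambda = 0."""
    return -0.5 * sum(ys) ** 2 / len(ys)


def candidate_splits(order, data):
    """Evaluate every split of the ordered levels into a left and a right group."""
    results = []
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        l_left = leaf_loss([y for lv in left for y in data[lv]])
        l_right = leaf_loss([y for lv in right for y in data[lv]])
        results.append((left, right, l_left, l_right, l_left + l_right))
    return results


# Idealized data: every sample equals its level mean (20 per level is assumed).
data = {"A": [60.0] * 20, "B": [20.0] * 20, "C": [50.0] * 20, "D": [10.0] * 20}

splits = candidate_splits(["A", "B", "C", "D"], data)
for left, right, l_left, l_right, total in splits:
    print(left, "|", right, round(l_left), round(l_right), round(total))
```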

  15. Using Label Encoding (depth=2)
    Tree so far: {A, B, C, D} (L1 = −48797) → left {A, B, C} (L2,left = −56097), right {D} (L2,right = −996)
    Candidate splits of the node {A, B, C} (L2 = −56097):
    • A | B C: L3,left = −35522, L3,right = −24589, L3 = −60111
    • A B | C: L3,left = −31832, L3,right = −24937, L3 = −56769

  16. Using Label Encoding (depth=2)
    Same candidates as on the previous slide.
    Splitting at A | B C looks best (L3 = −60111, the lower of the two candidates).

  17. Using Label Encoding (depth=3)
    Tree so far: {A, B, C, D} (L1 = −48797) → {A, B, C} (L2,left = −56097) and {D} (L2,right = −996); {A, B, C} → {A} (L3,left = −35522) and {B, C} (L3,right = −24589)
    Candidate split of the node {B, C} (L3 = −24589):
    • B | C: L4,left = −4076, L4,right = −24937, L4 = −29013

  18. Using Label Encoding (depth=3)
    Same candidate as on the previous slide.
    Splitting here, at B | C, looks good (L4 = −29013).

  19. Using Label Encoding (depth=3)
    Final tree:
    {A, B, C, D} (L1 = −48797)
    ├─ {A, B, C} (L2,left = −56097)
    │  ├─ {A} (L3,left = −35522)
    │  └─ {B, C} (L3,right = −24589)
    │     ├─ {B} (L4,left = −4076)
    │     └─ {C} (L4,right = −24937)
    └─ {D} (L2,right = −996)
    Splitting done.

  20. Using Target Encoding
    • Target-encode the categorical variable
    • Sort the samples by feature value
    [Figure: samples sorted by the target-encoded feature]
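
To see what the sort changes, the sketch below orders the four levels once by label code and once by their target-encoded value, using the conditional means from the sample-data slide as idealized encodings. The `adjacent_gaps` helper is made up for illustration; it shows that neighbouring levels are closer in mean target under the target-encoded order.

```python
# Idealized target encodings: the conditional means of the sample data.
level_means = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}

label_order = sorted(level_means)                        # alphabetical label codes
target_order = sorted(level_means, key=level_means.get)  # ascending encoded value


def adjacent_gaps(order):
    """Absolute difference in mean target between neighbouring levels."""
    return [abs(level_means[b] - level_means[a]) for a, b in zip(order, order[1:])]


print(label_order, adjacent_gaps(label_order))
print(target_order, adjacent_gaps(target_order))
```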

  21. Using Target Encoding (depth=1)
    Root node {A, B, C, D}: L1 = −48797
    Candidate splits, with the levels sorted by encoded value (D, B, C, A):
    • D | B C A: L2,left = −996, L2,right = −56097, L2 = −57093
    • D B | C A: L2,left = −4551, L2,right = −59992, L2 = −64543
    • D B C | A: L2,left = −21391, L2,right = −35522, L2 = −56913

  22. Using Target Encoding (depth=1)
    Same candidates as on the previous slide.
    Splitting at D B | C A looks best (L2 = −64543, the lowest loss of the three candidates).

  23. Using Target Encoding (depth=2)
    Tree so far: {A, B, C, D} (L1 = −48797) → left {B, D} (L2,left = −4551), right {A, C} (L2,right = −59992)
    Candidate split of the left node {B, D} (L2,left = −4551):
    • D | B: L3,left = −996, L3,right = −4076, L3 = −5072
    Candidate split of the right node {A, C} (L2,right = −59992):
    • C | A: L′3,left = −24937, L′3,right = −35522, L3 = −60459

  24. Using Target Encoding (depth=2)
    Same candidates as on the previous slide.
    Splitting looks good at both nodes: D | B on the left and C | A on the right.

  25. Using Target Encoding (depth=2)
    Final tree:
    {A, B, C, D} (L1 = −48797)
    ├─ {B, D} (L2,left = −4551)
    │  ├─ {D} (L3,left = −996)
    │  └─ {B} (L3,right = −4076)
    └─ {A, C} (L2,right = −59992)
       ├─ {C} (L′3,left = −24937)
       └─ {A} (L′3,right = −35522)
    Splitting done.

  26. Comparison of Label Encoding and Target Encoding
    Tree built with Label Encoding (depth 3):
    {A, B, C, D} (L1 = −48797)
    ├─ {A, B, C} (L2,left = −56097)
    │  ├─ {A} (L3,left = −35522)
    │  └─ {B, C} (L3,right = −24589)
    │     ├─ {B} (L4,left = −4076)
    │     └─ {C} (L4,right = −24937)
    └─ {D} (L2,right = −996)
    Tree built with Target Encoding (depth 2):
    {A, B, C, D} (L1 = −48797)
    ├─ {B, D} (L2,left = −4551)
    │  ├─ {D} (L3,left = −996)
    │  └─ {B} (L3,right = −4076)
    └─ {A, C} (L2,right = −59992)
       ├─ {C} (L′3,left = −24937)
       └─ {A} (L′3,right = −35522)
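
The depth difference can be checked with a small greedy splitter. This is a sketch under the same assumptions as the walk-through (single tree, MSE, base_score = 0, lambda = 0, gamma = 0) plus two of its own: the data is noise-free with 20 samples per level, and ties between candidate splits go to the first candidate. The function name `grow` is made up.

```python
def leaf_loss(ys):
    """Leaf loss at the optimal weight for MSE, base_score = 0, lambda = 0."""
    return -0.5 * sum(ys) ** 2 / len(ys)


def grow(order, data, depth=0):
    """Greedily split the ordered levels until every leaf holds a single level.

    Returns the depth of the deepest leaf.
    """
    if len(order) == 1:
        return depth

    def loss_after(k):
        left = [y for lv in order[:k] for y in data[lv]]
        right = [y for lv in order[k:] for y in data[lv]]
        return leaf_loss(left) + leaf_loss(right)

    best_k = min(range(1, len(order)), key=loss_after)  # lowest loss after the split
    return max(grow(order[:best_k], data, depth + 1),
               grow(order[best_k:], data, depth + 1))


# Idealized data: every sample equals its level mean (20 per level is assumed).
data = {"A": [60.0] * 20, "B": [20.0] * 20, "C": [50.0] * 20, "D": [10.0] * 20}

depth_label = grow(["A", "B", "C", "D"], data)   # alphabetical label order
depth_target = grow(["D", "B", "C", "A"], data)  # order of the target-encoded values
print(depth_label, depth_target)
```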

  27. (Admittedly a rather contrived example, but)
    doesn't Target Encoding look a little more efficient?

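  28. What is Target Encoding doing for us?
    • It makes the tree structure simpler
    • When the loss function is MSE, in the early iterations it has the effect of placing levels with similar residuals (gradients) closer to each other.
      → Each group of samples produced by a split has relatively similar residuals, so learning is efficient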

  29. As the number of levels grows
    • The effect of Target Encoding becomes easier to notice as the number of levels increases
    • Splitting is more efficient when the categories are sorted by residual size beforehand.

  30. As the number of levels grows
    Same points as on the previous slide.
    [Figure: two split diagrams, each annotated with leaf weights w*_left and w*_right]

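  31. Depth needed to fully separate all levels
    • Target Encoding needs a shallower depth, so the tree structure is simpler
    • Below are the tree structures obtained when splitting a categorical variable with 100 levels
    [Figure: two tree diagrams, panels "Label Encoding" and "Target Encoding"]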

  32. Loss reduction at each depth
    • Target Encoding reduces the loss more efficiently
    • The more levels there are, the larger the gap from Label Encoding becomes.
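
The simplest case of this comparison, the first split only, can be sketched as follows. The setup is an assumption for illustration: 100 levels with one noise-free sample each, level means drawn uniformly from [0, 100] with a fixed seed, and the label order taken to be the arbitrary generation order. For squared error, the best two-way grouping of levels is known to be contiguous once the levels are sorted by mean target, so the target-encoded order can never do worse than the label order on this first split.

```python
import random


def best_first_split_loss(values):
    """Lowest summed leaf loss over all contiguous splits of `values`.

    Assumes MSE with base_score = 0 and lambda = 0, one sample per value.
    """
    total = sum(values)
    n = len(values)
    best = float("inf")
    running = 0.0
    for k in range(1, n):
        running += values[k - 1]
        loss = -0.5 * running ** 2 / k - 0.5 * (total - running) ** 2 / (n - k)
        best = min(best, loss)
    return best


rng = random.Random(0)
level_means = [rng.uniform(0.0, 100.0) for _ in range(100)]  # label order is arbitrary

root_loss = -0.5 * sum(level_means) ** 2 / len(level_means)
loss_label = best_first_split_loss(level_means)           # levels in label order
loss_target = best_first_split_loss(sorted(level_means))  # levels sorted by mean target

print("gain with label order :", root_loss - loss_label)
print("gain with target order:", root_loss - loss_target)
```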

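  33. Won't the model just sort it out if we increase the depth / iterations?
    • Information that is known to be clearly useful is better handed to the model explicitly
    • Label Encoding may still work out somehow, but the model tends to become complex. The more levels there are, the less realistic that is.
    • From experience, anything known to clearly help is better handled before training.
    • The same reasoning applies to interactions between numeric features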

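  34. Summary
    • Target Encoding makes the model simpler
    • When the loss function is MSE, in the early iterations, sorting the levels by residual size enables efficient splits.
    • The effect of Target Encoding grows with the number of levels
    • Doing the equivalent of Target Encoding with Label Encoding requires a certain depth, which becomes less realistic as the number of levels increases.
    • The model may be able to do this implicitly without Target Encoding, but anything known to be clearly useful is better handled before it goes into the model.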