Why Is Target Encoding Effective? (Target Encoding はなぜ有効なのか)

Shuhei Goda
November 30, 2019


Transcript

  1. ©2019 Wantedly, Inc. Why Is Target Encoding Effective?

    分析コンペLT会 (data analysis competition LT meetup), Nov 30, 2019. Shuhei Goda (@jy_msc)
  2. Self-Introduction

    • Shuhei Goda (合田 周平), @jy_msc
    • Wantedly, Inc. (since Sep 2019)
    • Recommendation Team https://www.wantedly.com/projects/375150
    • Kaggle Master
    • On Twitter under the name hakubishin
    • We are hiring!
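  3. About Talk: why is Target Encoding effective?

    • Target Encoding is one of the standard techniques on Kaggle
    • In some cases Target Encoding works better than Label Encoding
    • There are few materials that explain why Target Encoding produces good results
    • This talk presents my own interpretation of why Target Encoding is effective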
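  4. What is Target Encoding?

    • A technique that converts a categorical variable into numbers using the target variable
    • Each level of the categorical variable is replaced by the expected value of the target variable at that level
    • In general, the more levels there are, the larger the expected benefit
    • For the caveats and implementation of Target Encoding, please refer to the Kaggle book!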
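As a minimal sketch of the replacement described on this slide, the following pure-Python function maps each level to the mean of the target over its rows. The function name `target_encode` and the toy data are made up for illustration, and the sketch deliberately omits the out-of-fold handling that is needed in practice to avoid leaking the target into the feature.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target observed for that category.

    Naive in-sample version for illustration only (no out-of-fold scheme, no smoothing).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means


# Hypothetical toy data: three levels with per-level means 60, 20 and 50.
x = ["A", "B", "A", "C", "B", "C"]
y = [58.0, 22.0, 62.0, 50.0, 18.0, 50.0]
encoded, means = target_encode(x, y)
print(means)
print(encoded)
```

Each row now carries a number that is ordered by the target itself, which is the property the rest of the talk relies on.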
  5. Why is it effective?

    • It has the effect of simplifying the model
    • From here on, GBDT is used as the example

  6. Sample data used

    • The explanation uses the following data
    • The target variable y is continuous
    • The explanatory variable x is a categorical variable with 4 levels, x ∈ {A, B, C, D}
    • E[y|x=A]=60, E[y|x=B]=20, E[y|x=C]=50, E[y|x=D]=10
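  7. GBDT review

    Dataset:
    $$D = \{(x_i, y_i)\}_{i=1}^{n} \quad (x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$
    Additive model:
    $$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i) = \sum_{m=1}^{M} w_m(x_i)$$
    Loss function:
    $$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
    Here $w_m(x)$ is the weight of the leaf of the m-th tree, $T$ is the number of leaves of a tree, and $M$ is the number of trees.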
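  8. GBDT review

    Loss function when adding the m-th tree (second-order approximation, dropping the term that does not depend on $f_m$):
    $$L^{(m)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(m-1)} + f_m(x_i)\bigr) + \Omega(f_m)$$
    $$\simeq \sum_{i=1}^{n} \Bigl[ g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \Bigr] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
    $$= \sum_{j=1}^{T} \Bigl[ \Bigl(\sum_{i \in I_j} g_i\Bigr) w_j + \frac{1}{2} \Bigl(\sum_{i \in I_j} h_i + \lambda\Bigr) w_j^2 \Bigr] + \gamma T$$
    $I_j$ is the set of data points assigned to the j-th leaf. $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the prediction made by the first m-1 trees.
    gradient:
    $$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}}$$
    hessian:
    $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2}$$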
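  9. GBDT review

    Once the samples have been assigned to leaves, the optimal weight of leaf j is
    $$w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
    and the loss at that point is
    $$L^{(m)} = - \frac{1}{2} \sum_{j=1}^{T} \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
    The best split is searched for at each node by looking at how much the loss decreases when its samples are split.
    gain:
    $$L_{\mathrm{bef}} - (L_{\mathrm{af,left}} + L_{\mathrm{af,right}})$$
    The larger the gain (the difference in loss before and after the split), the better the split.
    [Figure: a parent node with loss L_bef split into left and right child nodes with losses L_af,left and L_af,right]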
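The leaf weight, leaf loss and gain from this slide can be sketched in a few lines of Python. The function names and the four-sample toy gradients below are assumptions made for illustration; `gamma` is charged once per leaf so that the gain equals L_bef − (L_af,left + L_af,right).

```python
def leaf_weight(g, h, lam=0.0):
    """Optimal weight of a leaf: -G / (H + lambda)."""
    return -sum(g) / (sum(h) + lam)


def leaf_loss(g, h, lam=0.0, gamma=0.0):
    """Loss of one leaf at its optimal weight: -G^2 / (2 (H + lambda)) + gamma."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam) + gamma


def split_gain(g, h, left, right, lam=0.0, gamma=0.0):
    """L_bef - (L_af,left + L_af,right) for the index lists `left` and `right`."""
    def take(values, idx):
        return [values[i] for i in idx]

    both = left + right
    before = leaf_loss(take(g, both), take(h, both), lam, gamma)
    after = (leaf_loss(take(g, left), take(h, left), lam, gamma)
             + leaf_loss(take(g, right), take(h, right), lam, gamma))
    return before - after


# Hypothetical gradients: two samples with large residuals, two with small ones.
g = [-60.0, -60.0, -10.0, -10.0]
h = [1.0, 1.0, 1.0, 1.0]
good = split_gain(g, h, left=[0, 1], right=[2, 3])  # groups similar gradients together
bad = split_gain(g, h, left=[0, 2], right=[1, 3])   # mixes large and small gradients
print(good, bad)
```

A split that groups samples with similar gradients yields a larger gain than one that mixes them, which is the intuition used later in the talk.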
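  10. GBDT review: when the loss function is MSE

    Loss function:
    $$l(y_i, \hat{y}_i) = \frac{1}{2} (y_i - \hat{y}_i)^2$$
    gradient:
    $$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}} = \hat{y}_i^{(m-1)} - y_i$$
    hessian:
    $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2} = 1$$
    Therefore (with λ = 0) the weight of leaf j is the mean residual of the samples assigned to leaf j:
    $$w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i} = - \frac{\sum_{i \in I_j} (\hat{y}_i^{(m-1)} - y_i)}{\sum_{i \in I_j} 1}$$
    The numerator is the sum of the residuals (true value minus the prediction as of tree m-1) and the denominator is the number of samples.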
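The claim that the MSE leaf weight equals the mean residual can be checked numerically. This is a small sketch with made-up values, assuming λ = 0 and a previous prediction of 0 for every sample; `mse_grad_hess` is a hypothetical helper name.

```python
def mse_grad_hess(y_true, y_pred):
    """Gradient and hessian of l = 0.5 * (y - y_hat)^2 with respect to y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0] * len(y_true)
    return g, h


# Hypothetical samples that all fall into the same leaf.
y_true = [60.0, 58.0, 62.0]
y_pred = [0.0, 0.0, 0.0]  # prediction after m-1 trees

g, h = mse_grad_hess(y_true, y_pred)
w_star = -sum(g) / sum(h)  # optimal leaf weight with lambda = 0
mean_residual = sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)
print(w_star, mean_residual)
```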
  11. GBDT settings

    • Consider a simple model.
    • loss_func = 'MSE'
    • eta = 1 → step size
    • iteration = 1 → consider only the first tree
    • tree_method = 'exact' → plain exhaustive search
    • base_score = 0 → predictions start from 0
    • lambda = 0
    • gamma = 0
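The parameter names on this slide (eta, tree_method, base_score, lambda, gamma) match XGBoost's, so the setup could be written as an XGBoost-style parameter dictionary. This is an assumption: the talk does not name the library, and `objective` and `num_boost_round` are the XGBoost spellings chosen here for the slide's loss_func and iteration.

```python
# Hypothetical XGBoost-style rendering of the slide's settings.
# The library is assumed; the talk only lists the settings themselves.
params = {
    "objective": "reg:squarederror",  # squared-error loss, as in the MSE derivation on slide 10
    "eta": 1.0,                       # step size
    "tree_method": "exact",           # exhaustive split search
    "base_score": 0.0,                # predictions start from 0
    "lambda": 0.0,                    # no L2 penalty on leaf weights
    "gamma": 0.0,                     # no per-leaf penalty
}
num_boost_round = 1  # only the first tree is considered
```

With these values the leaf weight reduces to the plain mean of y in each leaf, which keeps the loss values in the following slides easy to follow.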
  12. When Label Encoding is used

    • Label-encode the categorical variable in alphabetical order
    • Sort the samples by the value of the feature
    [Figure: samples before and after sorting; caption: "A sad graph…"]

  13. When Label Encoding is used (depth=1)

    Root node {A, B, C, D}: L1 = −48797. Candidate splits, each evaluated with the optimal leaf weights w*_left and w*_right:
    • {A} | {B, C, D}: L2,left = −35522, L2,right = −21391, L2 = −56913
    • {A, B} | {C, D}: L2,left = −31832, L2,right = −17951, L2 = −49783
    • {A, B, C} | {D}: L2,left = −56097, L2,right = −996, L2 = −57093
  14. When Label Encoding is used (depth=1)

    Of the three candidates, {A, B, C} | {D} has the lowest loss (L2 = −57093, from L2,left = −56097 and L2,right = −996), so splitting there looks best.
  15. When Label Encoding is used (depth=2)

    Tree so far: {A, B, C, D} (L1 = −48797) → {A, B, C} (L2,left = −56097) and {D} (L2,right = −996).
    Candidate splits of node {A, B, C} (L2 = −56097), each with the optimal leaf weights w*_left and w*_right:
    • {A} | {B, C}: L3,left = −35522, L3,right = −24589, L3 = −60111
    • {A, B} | {C}: L3,left = −31832, L3,right = −24937, L3 = −56769
  16. When Label Encoding is used (depth=2)

    Of the two candidates for node {A, B, C}, {A} | {B, C} has the lower loss (L3 = −60111, from L3,left = −35522 and L3,right = −24589), so splitting there looks best.
  17. When Label Encoding is used (depth=3)

    Tree so far: {A, B, C, D} (L1 = −48797) → {A, B, C} (L2,left = −56097) and {D} (L2,right = −996); {A, B, C} → {A} (L3,left = −35522) and {B, C} (L3,right = −24589).
    Only one candidate split remains for node {B, C} (L3 = −24589):
    • {B} | {C}: L4,left = −4076, L4,right = −24937, L4 = −29013
  18. When Label Encoding is used (depth=3)

    Splitting node {B, C} into {B} | {C} (L4 = −29013, from L4,left = −4076 and L4,right = −24937) looks best.
  19. When Label Encoding is used (depth=3): splitting finished

    Final tree:
    • {A, B, C, D}: L1 = −48797
      • {A, B, C}: L2,left = −56097
        • {A}: L3,left = −35522
        • {B, C}: L3,right = −24589
          • {B}: L4,left = −4076
          • {C}: L4,right = −24937
      • {D}: L2,right = −996
  20. When Target Encoding is used

    • Target-encode the categorical variable
    • Sort the samples by the value of the feature
    [Figure: samples before and after sorting]

  21. When Target Encoding is used (depth=1)

    Root node {D, B, C, A}, sorted by encoded value: L1 = −48797. Candidate splits, each evaluated with the optimal leaf weights w*_left and w*_right:
    • {D} | {B, C, A}: L2,left = −996, L2,right = −56097, L2 = −57093
    • {D, B} | {C, A}: L2,left = −4551, L2,right = −59992, L2 = −64543
    • {D, B, C} | {A}: L2,left = −21391, L2,right = −35522, L2 = −56913
  22. When Target Encoding is used (depth=1)

    Of the three candidates, {D, B} | {C, A} has the lowest loss (L2 = −64543, from L2,left = −4551 and L2,right = −59992), so splitting there looks best.
  23. When Target Encoding is used (depth=2)

    Tree so far: {D, B, C, A} (L1 = −48797) → {D, B} (L2,left = −4551) and {C, A} (L2,right = −59992).
    Each child has a single candidate split, evaluated with the optimal leaf weights w*_left and w*_right:
    • Node {D, B}: {D} | {B}: L3,left = −996, L3,right = −4076, L3 = −5072
    • Node {C, A}: {C} | {A}: L′3,left = −24937, L′3,right = −35522, L3 = −60459
  24. When Target Encoding is used (depth=2)

    Both splits look good: {D} | {B} for node {D, B} (L3 = −5072) and {C} | {A} for node {C, A} (L3 = −60459).
  25. When Target Encoding is used (depth=2): splitting finished

    Final tree:
    • {A, B, C, D}: L1 = −48797
      • {B, D}: L2,left = −4551
        • {D}: L3,left = −996
        • {B}: L3,right = −4076
      • {A, C}: L2,right = −59992
        • {C}: L′3,left = −24937
        • {A}: L′3,right = −35522
  26. Comparison of Label Encoding and Target Encoding

    Tree built with Label Encoding (depth 3):
    • {A, B, C, D} (L1 = −48797) → {A, B, C} (L2,left = −56097) and {D} (L2,right = −996)
    • {A, B, C} → {A} (L3,left = −35522) and {B, C} (L3,right = −24589)
    • {B, C} → {B} (L4,left = −4076) and {C} (L4,right = −24937)
    Tree built with Target Encoding (depth 2):
    • {A, B, C, D} (L1 = −48797) → {B, D} (L2,left = −4551) and {A, C} (L2,right = −59992)
    • {B, D} → {D} (L3,left = −996) and {B} (L3,right = −4076)
    • {A, C} → {C} (L′3,left = −24937) and {A} (L′3,right = −35522)
  27. (Admittedly a rather contrived example, but) doesn't Target Encoding look a little more efficient?
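The comparison above can be reproduced with a small exhaustive split search. This is a sketch under simplifying assumptions that are not in the talk: 20 noise-free samples per level with y exactly equal to the level means from slide 6, squared-error loss, base_score = 0, lambda = gamma = 0, and every node split until it holds a single level. Because the data is noise-free, the loss values differ slightly from the ones on the slides. The helper names `leaf_loss` and `tree_depth` are made up.

```python
def leaf_loss(ys):
    """Optimal leaf loss for squared error with g = -y, h = 1: -(sum y)^2 / (2 n)."""
    return -0.5 * sum(ys) ** 2 / len(ys)


def tree_depth(rows):
    """Greedy exact split search on one numeric feature.

    rows: list of (feature_value, y) pairs.
    Returns the depth needed until every leaf holds a single feature value.
    """
    values = sorted({v for v, _ in rows})
    if len(values) == 1:
        return 0
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for r in rows if r[0] < thr]
        right = [r for r in rows if r[0] >= thr]
        loss = leaf_loss([y for _, y in left]) + leaf_loss([y for _, y in right])
        if best is None or loss < best[0]:
            best = (loss, left, right)
    _, left, right = best
    return 1 + max(tree_depth(left), tree_depth(right))


means = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # E[y|x] from slide 6
label = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}      # alphabetical label encoding
n_per_level = 20                                       # assumed sample count per level

rows_label = [(label[c], means[c]) for c in means for _ in range(n_per_level)]
# With noise-free data the target encoding of a level is simply its mean.
rows_target = [(means[c], means[c]) for c in means for _ in range(n_per_level)]

depth_label = tree_depth(rows_label)
depth_target = tree_depth(rows_target)
print("label encoding depth:", depth_label)
print("target encoding depth:", depth_target)
```

Under these assumptions the label-encoded feature needs three levels of splits and the target-encoded one needs two, matching the tree shapes on slide 26.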

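  28. What is Target Encoding doing for us?

    • It makes the tree structure simpler
    • With MSE loss, in the early iterations it places levels whose residuals (gradients) are similar in size close to each other.
      → Each group of samples produced by a split has relatively similar residuals, so learning is efficient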
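  29. As the number of levels grows

    • The effect of Target Encoding is easier to notice the more levels there are
    • Sorting the categories by the size of their residuals in advance makes the splits more efficient.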

  30. As the number of levels grows (same bullets as slide 29)

    [Figure: sorted levels split into left and right groups with leaf weights w*_left and w*_right]
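  31. Depth required to fully separate all levels

    • With Target Encoding the tree is shallower and its structure simpler
    • Below are the tree structures obtained when splitting a categorical variable with 100 levels
    [Figure: tree structure with Label Encoding vs. tree structure with Target Encoding]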
  32. Loss reduction at each depth

    • Target Encoding reduces the loss more efficiently
    • The more levels there are, the larger the gap from Label Encoding becomes.
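The loss-per-depth comparison can be explored with a level-wise version of the exhaustive split search. The sketch below is not the experiment from the talk: it assumes 100 levels with one noise-free sample each, level means that are a fixed permutation of 1..100, squared-error loss with base_score = 0 and lambda = gamma = 0, and it grows every impure leaf at each depth. All function names are hypothetical.

```python
def leaf_loss(ys):
    """Optimal leaf loss for squared error with g = -y, h = 1: -(sum y)^2 / (2 n)."""
    return -0.5 * sum(ys) ** 2 / len(ys)


def best_split(rows):
    """Return (left, right) for the lowest-loss threshold split, or None if the node is pure."""
    values = sorted({v for v, _ in rows})
    if len(values) == 1:
        return None
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for r in rows if r[0] < thr]
        right = [r for r in rows if r[0] >= thr]
        loss = leaf_loss([y for _, y in left]) + leaf_loss([y for _, y in right])
        if best is None or loss < best[0]:
            best = (loss, left, right)
    return best[1], best[2]


def loss_by_depth(rows):
    """Grow the tree level by level; return the total loss after each depth until all leaves are pure."""
    leaves = [rows]
    history = [leaf_loss([y for _, y in rows])]
    while True:
        next_leaves, grew = [], False
        for leaf in leaves:
            split = best_split(leaf)
            if split is None:
                next_leaves.append(leaf)
            else:
                next_leaves.extend(split)
                grew = True
        if not grew:
            return history
        leaves = next_leaves
        history.append(sum(leaf_loss([y for _, y in leaf]) for leaf in leaves))


# Assumed data: level k has mean (37 * k) % 100 + 1, a permutation of 1..100.
means = [float((k * 37) % 100 + 1) for k in range(100)]
label_rows = [(float(k), m) for k, m in enumerate(means)]  # label encoding: level index
target_rows = [(m, m) for m in means]                      # target encoding: level mean

label_hist = loss_by_depth(label_rows)
target_hist = loss_by_depth(target_rows)
for depth in range(max(len(label_hist), len(target_hist))):
    le = label_hist[min(depth, len(label_hist) - 1)]
    te = target_hist[min(depth, len(target_hist) - 1)]
    print(depth, round(le), round(te))
```

Both encodings end at the same loss once every level is isolated; the difference lies in how many depths it takes to get there. At depth 1 the target-encoded split can be no worse, because thresholds on mean-sorted levels include the best two-way grouping for squared error.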

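  33. Won't the model sort things out by itself if the depth / number of iterations is increased?

    • Information that is clearly known to be useful is better passed to the model explicitly
    • The model may cope with Label Encoding as well, but it tends to become complex. The more levels there are, the less realistic that is.
    • From experience, things that are clearly known to work are better handled before training.
    • The same story as interactions between numerical features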
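  34. Summary

    • Target Encoding makes the model simpler
    • With MSE loss, in the early iterations, sorting the levels in order of residual size enables efficient splits.
    • The more levels there are, the larger the effect of Target Encoding
    • Doing the equivalent of Target Encoding with Label Encoding requires a certain tree depth, which becomes less realistic as the number of levels grows.
    • The model may be able to do this implicitly without Target Encoding, but things that are clearly known to be good are better handled before the data is fed to the model.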