Shuhei Goda
November 30, 2019

# Why Is Target Encoding Effective?

## Transcript

Why Is Target Encoding Effective?
Data Analysis Competition LT (Lightning Talk) Meetup
Nov 30, 2019 - Shuhei Goda - @jy_msc

Self-Introduction
• Shuhei Goda (合田 周平)
• Wantedly, Inc. (since Sep 2019)
• Recommendation Team
• Kaggle Master, active under the name hakubishin
• We are hiring! https://www.wantedly.com/projects/375150

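Why is Target Encoding effective?
• It is one of the standard techniques on Kaggle
• There are cases where Target Encoding works better than Label Encoding
• There are few materials that explain why Target Encoding gives good results
• This talk presents my own interpretation of why Target Encoding is effective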

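What is Target Encoding?
• A technique that converts a categorical variable into numbers using the target variable
• Each level of the categorical variable is replaced by the expected value of the target variable at that level
• In general, the more levels there are, the larger the expected benefit
For caveats and implementation details when using Target Encoding, please check the Kaggle book!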
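
As a minimal sketch of the replacement described above, the following computes the per-level mean of the target and substitutes it for each level. The function name `target_encode` and the toy data are hypothetical, and this naive in-sample version leaks the target into the feature; in practice an out-of-fold scheme is the usual remedy, which is the kind of caveat the slide defers to the Kaggle book.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category level with the mean target observed for that level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means

# Toy data (hypothetical values, chosen only for illustration)
x = ["A", "B", "A", "C", "B", "C"]
y = [58.0, 21.0, 62.0, 50.0, 19.0, 50.0]

encoded, means = target_encode(x, y)
print(means)    # per-level mean of y
print(encoded)  # x with every level replaced by its mean
```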

Why is it effective?
• It has the effect of simplifying the model
  • From here on, GBDT is used as the example

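Sample data used
• The explanation uses data of the following form
  • The target variable y is continuous
  • The explanatory variable x is a categorical variable with 4 levels, x = {A, B, C, D}
    • E[y|x=A]=60, E[y|x=B]=20, E[y|x=C]=50, E[y|x=D]=10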
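
A dataset of this shape could be generated roughly as follows. Only the four conditional means come from the slide; the number of samples per level (20), the Gaussian noise, its standard deviation, and the seed are assumptions made for this sketch.

```python
import random
from statistics import mean

# Conditional means E[y | x] taken from the slide
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}

def make_sample_data(n_per_level=20, noise_sd=5.0, seed=0):
    """Draw y ~ Normal(E[y|x], noise_sd) for each level (sizes and noise are assumed)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for level, mu in LEVEL_MEANS.items():
        for _ in range(n_per_level):
            xs.append(level)
            ys.append(rng.gauss(mu, noise_sd))
    return xs, ys

xs, ys = make_sample_data()
for level in LEVEL_MEANS:
    level_mean = mean(y for x, y in zip(xs, ys) if x == level)
    print(level, round(level_mean, 1))  # close to the conditional mean of that level
```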

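Review of GBDT
Dataset:
$$D = \{(x_i, y_i)\}_{i=1}^{n} \quad (x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$
Additive model:
$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i) = \sum_{m=1}^{M} w_m(x_i)$$
Loss function:
$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m) \quad \left(\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2\right)$$
Here $w_m(x)$ is the leaf weight of the $m$-th tree, $T$ is the number of leaves in a tree, and $M$ is the number of trees.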

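Review of GBDT
Loss function when the $m$-th tree is added (second-order approximation, dropping the constant term):
$$L^{(m)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(m-1)} + f_m(x_i)\right) + \Omega(f_m)$$
$$\simeq \sum_{i=1}^{n} \left[ g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
$$= \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$$
$I_j$ is the set of samples assigned to the $j$-th leaf.
$g_i, h_i$ are the first and second derivatives of the loss with respect to the prediction made by the first $m-1$ trees:
$$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}}, \qquad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2}$$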

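Review of GBDT
Given the samples assigned to leaf $j$, the optimal leaf weight is
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
and the loss at that point is
$$L^{(m)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
The best split is searched for node by node, by looking at how much the loss decreases when the samples are split.
gain:
$$L_{\mathrm{bef}} - (L_{\mathrm{af,left}} + L_{\mathrm{af,right}})$$
[Figure: a node containing A, B, C, D with loss $L_{\mathrm{bef}}$ is split into a left child with loss $L_{\mathrm{af,left}}$ and a right child with loss $L_{\mathrm{af,right}}$.]
The larger the gain (the difference in loss before and after the split), the better the split.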
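
The leaf weight, leaf loss, and gain formulas above can be sketched directly in code. The function names are hypothetical, the regularization term $\lambda$ is included in the denominator as in the loss expression, and the split is assumed to cut an already ordered list of samples at index `k`.

```python
def leaf_weight(g, h, lam=0.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def leaf_loss(g, h, lam=0.0, gamma=0.0):
    """Loss contribution of one leaf at its optimal weight (gamma is paid per leaf)."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam) + gamma

def split_gain(g, h, k, lam=0.0, gamma=0.0):
    """Gain of splitting ordered samples into [:k] and [k:]: L_bef - (L_left + L_right)."""
    before = leaf_loss(g, h, lam, gamma)
    after = leaf_loss(g[:k], h[:k], lam, gamma) + leaf_loss(g[k:], h[k:], lam, gamma)
    return before - after

# Hypothetical gradients for four samples (squared-error loss, prediction 0)
g = [-60.0, -60.0, -10.0, -10.0]
h = [1.0, 1.0, 1.0, 1.0]

print(leaf_weight(g, h))    # 35.0
print(split_gain(g, h, 2))  # 1250.0
```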

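Review of GBDT
When the loss function is MSE
Loss function:
$$l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$$
$$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}} = \hat{y}_i^{(m-1)} - y_i, \qquad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2} = 1$$
Therefore
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} = -\frac{\sum_{i \in I_j} \left(\hat{y}_i^{(m-1)} - y_i\right)}{\sum_{i \in I_j} 1 + \lambda}$$
The numerator is the sum of the residuals (true value minus the prediction after $m-1$ trees) and, for $\lambda = 0$, the denominator is the number of samples.
So with $\lambda = 0$, the weight of leaf $j$ is the mean residual of the samples assigned to leaf $j$.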
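
A small numeric check of this statement, assuming $\lambda = 0$ and hypothetical values for the targets and the previous prediction:

```python
def mse_grad_hess(y_true, y_pred):
    """Gradient and Hessian of l = 0.5 * (y - y_hat)^2 with respect to y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0] * len(y_true)
    return g, h

# Hypothetical samples that all fall into the same leaf
y_true = [3.0, 5.0, 10.0]
y_prev = [1.0, 1.0, 1.0]  # prediction after m-1 trees

g, h = mse_grad_hess(y_true, y_prev)
w_star = -sum(g) / sum(h)  # optimal leaf weight with lambda = 0
mean_residual = sum(t - p for t, p in zip(y_true, y_prev)) / len(y_true)

print(w_star, mean_residual)  # 5.0 5.0
```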

GBDT settings
• Consider a simple model.
  • loss_func = 'MSE'
  • eta = 1 → step size
  • iteration = 1 → consider only the first tree
  • tree_method = 'exact' → brute-force exhaustive search
  • base_score = 0 → the initial prediction starts at 0
  • lambda = 0
  • gamma = 0
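
The slide does not name a library, but names such as `tree_method` and `base_score` resemble XGBoost parameters. Under that assumption, and assuming the squared-error loss used in the preceding derivation, the settings might map onto a parameter dictionary like the one below. The mapping is a guess made for illustration, not something the slides confirm.

```python
# Hypothetical mapping of the slide's settings onto XGBoost-style parameter names.
params = {
    "objective": "reg:squarederror",  # squared-error loss (assumed from the MSE derivation)
    "eta": 1.0,                       # step size
    "tree_method": "exact",           # exhaustive search over split points
    "base_score": 0.0,                # initial prediction starts at 0
    "lambda": 0.0,                    # no L2 regularization on leaf weights
    "gamma": 0.0,                     # no penalty per leaf
}
num_boost_round = 1  # "iteration = 1": only the first tree is considered

# With xgboost installed, these would typically be passed as
# xgboost.train(params, dtrain, num_boost_round=num_boost_round),
# where dtrain is an xgboost.DMatrix built from the encoded feature.
print(params, num_boost_round)
```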

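When using Label Encoding
• Label-encode the categorical variable in alphabetical order
• Sort the samples by the value of the feature
[Figure: samples sorted by the label-encoded feature.] A sad-looking graph…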

When using Label Encoding (depth=1)
Loss before splitting: $L_1 = -48797$
Candidate splits along the sorted order A, B, C, D, each with leaf weights $w^*_{\mathrm{left}}$ and $w^*_{\mathrm{right}}$:

| Split | $L_{2,\mathrm{left}}$ | $L_{2,\mathrm{right}}$ | $L_2$ |
|---|---|---|---|
| A / B, C, D | $-35522$ | $-21391$ | $-56913$ |
| A, B / C, D | $-31832$ | $-17951$ | $-49783$ |
| A, B, C / D | $-56097$ | $-996$ | $-57093$ |

Splitting at A, B, C / D looks best, since it gives the lowest loss.
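
The depth=1 candidates above can be reproduced approximately with a noise-free version of the data. This sketch assumes 20 samples per level, each exactly equal to its conditional mean, together with MSE loss, base_score = 0, and lambda = gamma = 0, so that $g_i = -y_i$ and $h_i = 1$. The resulting numbers are close to, but not the same as, the slide's values, which presumably come from noisy samples. Without noise, the first and third candidates come out essentially tied, while the middle split is clearly the worst.

```python
MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # from the slide
N_PER_LEVEL = 20  # assumption: samples per level

def leaf_loss(ys):
    """Leaf loss for MSE with base_score = 0, lambda = gamma = 0 (g = -y, h = 1)."""
    return -0.5 * sum(ys) ** 2 / len(ys)

def candidate_splits(order):
    """Losses of every split position along the given ordering of levels."""
    rows = []
    for k in range(1, len(order)):
        left = [MEANS[c] for c in order[:k] for _ in range(N_PER_LEVEL)]
        right = [MEANS[c] for c in order[k:] for _ in range(N_PER_LEVEL)]
        l_left, l_right = leaf_loss(left), leaf_loss(right)
        rows.append((order[:k], order[k:], l_left, l_right, l_left + l_right))
    return rows

rows = candidate_splits("ABCD")  # label-encoded (alphabetical) order
for left, right, l_left, l_right, total in rows:
    print(f"{left} / {right}: left={l_left:.0f} right={l_right:.0f} total={total:.0f}")
```

Passing a different ordering, such as `"DBCA"`, evaluates the candidates that a target-encoded feature would expose.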

When using Label Encoding (depth=2)
Tree so far: {A, B, C, D} ($L_1 = -48797$) is split into {A, B, C} ($L_{2,\mathrm{left}} = -56097$) and {D} ($L_{2,\mathrm{right}} = -996$).
Candidate splits of the node {A, B, C} ($L_2 = -56097$), each with leaf weights $w^*_{\mathrm{left}}$ and $w^*_{\mathrm{right}}$:

| Split | $L_{3,\mathrm{left}}$ | $L_{3,\mathrm{right}}$ | $L_3$ |
|---|---|---|---|
| A / B, C | $-35522$ | $-24589$ | $-60111$ |
| A, B / C | $-31832$ | $-24937$ | $-56769$ |

Splitting at A / B, C looks best.

When using Label Encoding (depth=3)
The remaining node {B, C} ($L_3 = -24589$) is split into B / C with leaf weights $w^*_{\mathrm{left}}$ and $w^*_{\mathrm{right}}$:
$L_{4,\mathrm{left}} = -4076$, $L_{4,\mathrm{right}} = -24937$, $L_4 = -29013$
Splitting here looks good.
Resulting tree:
• {A, B, C, D}: $L_1 = -48797$
  • {A, B, C}: $L_{2,\mathrm{left}} = -56097$
    • {A}: $L_{3,\mathrm{left}} = -35522$
    • {B, C}: $L_{3,\mathrm{right}} = -24589$
      • {B}: $L_{4,\mathrm{left}} = -4076$
      • {C}: $L_{4,\mathrm{right}} = -24937$
  • {D}: $L_{2,\mathrm{right}} = -996$
Splitting complete.

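When using Target Encoding
• Target-encode the categorical variable
• Sort the samples by the value of the feature
[Figure: samples sorted by the target-encoded feature.]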

When using Target Encoding (depth=1)
Loss before splitting: $L_1 = -48797$
Candidate splits along the sorted order D, B, C, A, each with leaf weights $w^*_{\mathrm{left}}$ and $w^*_{\mathrm{right}}$:

| Split | $L_{2,\mathrm{left}}$ | $L_{2,\mathrm{right}}$ | $L_2$ |
|---|---|---|---|
| D / B, C, A | $-996$ | $-56097$ | $-57093$ |
| D, B / C, A | $-4551$ | $-59992$ | $-64543$ |
| D, B, C / A | $-21391$ | $-35522$ | $-56913$ |

Splitting at D, B / C, A looks best, since it gives the lowest loss.

When using Target Encoding (depth=2)
Tree so far: {A, B, C, D} ($L_1 = -48797$) is split into {D, B} ($L_{2,\mathrm{left}} = -4551$) and {C, A} ($L_{2,\mathrm{right}} = -59992$).
Each child is split once more, with leaf weights $w^*_{\mathrm{left}}$ and $w^*_{\mathrm{right}}$:

| Node | Split | Left loss | Right loss | Total |
|---|---|---|---|---|
| {D, B} | D / B | $L_{3,\mathrm{left}} = -996$ | $L_{3,\mathrm{right}} = -4076$ | $L_3 = -5072$ |
| {C, A} | C / A | $L'_{3,\mathrm{left}} = -24937$ | $L'_{3,\mathrm{right}} = -35522$ | $L'_3 = -60459$ |

Splitting looks good at both nodes.
Splitting complete.

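Comparison of Label Encoding and Target Encoding
Tree structure built with Label Encoding (depth 3):
• {A, B, C, D}: $L_1 = -48797$
  • {A, B, C}: $L_{2,\mathrm{left}} = -56097$
    • {A}: $L_{3,\mathrm{left}} = -35522$
    • {B, C}: $L_{3,\mathrm{right}} = -24589$
      • {B}: $L_{4,\mathrm{left}} = -4076$
      • {C}: $L_{4,\mathrm{right}} = -24937$
  • {D}: $L_{2,\mathrm{right}} = -996$
Tree structure built with Target Encoding (depth 2):
• {A, B, C, D}: $L_1 = -48797$
  • {D, B}: $L_{2,\mathrm{left}} = -4551$
    • {D}: $L_{3,\mathrm{left}} = -996$
    • {B}: $L_{3,\mathrm{right}} = -4076$
  • {C, A}: $L_{2,\mathrm{right}} = -59992$
    • {C}: $L'_{3,\mathrm{left}} = -24937$
    • {A}: $L'_{3,\mathrm{right}} = -35522$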

(Admittedly, this was a rather contrived example.)
Doesn't Target Encoding look a little more efficient?
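
The difference in depth can be checked with a small greedy tree builder. This is a sketch under the same idealized assumptions as the worked example (noise-free samples, 20 per level, MSE loss, base_score = 0, lambda = gamma = 0); the function names are hypothetical, and each node is split at the position with the lowest total loss until every level is isolated.

```python
MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # from the slide
N_PER_LEVEL = 20  # assumption: samples per level

def leaf_loss(levels):
    """Leaf loss -0.5 * (sum y)^2 / n for a node holding the given levels."""
    total = sum(MEANS[c] for c in levels) * N_PER_LEVEL
    count = len(levels) * N_PER_LEVEL
    return -0.5 * total ** 2 / count

def grow(levels):
    """Greedily split an ordered run of levels; return the depth of the resulting tree."""
    if len(levels) == 1:
        return 0
    # Choose the split position with the lowest total loss (largest gain).
    best_k = min(range(1, len(levels)),
                 key=lambda k: leaf_loss(levels[:k]) + leaf_loss(levels[k:]))
    return 1 + max(grow(levels[:best_k]), grow(levels[best_k:]))

label_order = sorted(MEANS)                  # A, B, C, D (alphabetical labels)
target_order = sorted(MEANS, key=MEANS.get)  # D, B, C, A (sorted by target mean)

print("label encoding depth:", grow(label_order))    # 3
print("target encoding depth:", grow(target_order))  # 2
```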

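What is Target Encoding doing for us?
• It makes the tree structure simpler
It has the effect of placing levels that are similar to each other at nearby positions.
→ Each group of samples produced by a split has relatively similar residuals, so learning is efficient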

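As the number of levels grows
• The effect of Target Encoding becomes easier to notice as the number of levels increases
• Splitting is more efficient when the categories have been sorted by the size of their residuals beforehand.
[Figure: splits annotated with the leaf weights $w^*_{\mathrm{left}}$ and $w^*_{\mathrm{right}}$.]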

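Depth required to fully separate all levels
• Target Encoding needs a shallower tree, so the tree structure is simpler
• Below are the tree structures obtained when splitting a categorical variable with 100 levels
[Figure: tree structures for Label Encoding and Target Encoding.]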

Loss reduction at each depth
• Target Encoding reduces the loss more efficiently
• The more levels there are, the larger the gap from Label Encoding becomes.
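
A rough way to try this with 100 levels is sketched below. The setup is an assumption made for illustration: each level contributes one noise-free sample whose target is drawn uniformly from [0, 100] with a fixed seed, the label-encoded order is the arbitrary order in which the levels were generated, and trees are grown greedily with MSE loss, base_score = 0, and lambda = gamma = 0. It prints the total leaf loss at each depth limit for both encodings (lower is better); it does not reproduce the slide's actual plots.

```python
import random

def leaf_loss(ys):
    """Leaf loss -0.5 * (sum y)^2 / n (MSE, base_score = 0, lambda = gamma = 0)."""
    return -0.5 * sum(ys) ** 2 / len(ys)

def best_split(ys):
    """Index k minimizing loss(ys[:k]) + loss(ys[k:]), computed with a running prefix sum."""
    total, n = sum(ys), len(ys)
    best_k, best_loss, prefix = None, float("inf"), 0.0
    for k in range(1, n):
        prefix += ys[k - 1]
        loss = -0.5 * prefix ** 2 / k - 0.5 * (total - prefix) ** 2 / (n - k)
        if loss < best_loss:
            best_k, best_loss = k, loss
    return best_k

def loss_at_depth(ys, max_depth):
    """Total leaf loss of a greedily grown tree limited to max_depth."""
    if max_depth == 0 or len(ys) == 1:
        return leaf_loss(ys)
    k = best_split(ys)
    return loss_at_depth(ys[:k], max_depth - 1) + loss_at_depth(ys[k:], max_depth - 1)

rng = random.Random(0)
level_means = [rng.uniform(0, 100) for _ in range(100)]  # hypothetical targets, one per level
label_encoded = level_means           # feature order unrelated to the target
target_encoded = sorted(level_means)  # feature order given by the target mean

for depth in range(1, 8):
    le = loss_at_depth(label_encoded, depth)
    te = loss_at_depth(target_encoded, depth)
    print(f"depth={depth}: label={le:.0f} target={te:.0f}")
```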

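Won't the model sort things out if we just increase the depth / iterations?
• Information that is clearly known to be useful is better passed to the model explicitly
• Label Encoding may also get there eventually, but the model tends to become complex. The more levels there are, the less realistic this becomes.
• From experience, things that are clearly known to work are better handled before training.
• This is the same story as interactions between numerical features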