Why Is Target Encoding Effective? (Target Encoding はなぜ有効なのか)
Shuhei Goda
November 30, 2019
Data Analysis Competition Lightning Talk Meetup (分析コンペLT会)
https://kaggle-friends.connpass.com/event/154881/
Transcript
©2019 Wantedly, Inc.
Why Is Target Encoding Effective? (Target Encoding はなぜ有効なのか)
Data Analysis Competition Lightning Talk Meetup (分析コンペLT会), Nov 30, 2019
Shuhei Goda - @jy_msc

Self-Introduction
•Shuhei Goda (合田 周平)
•Wantedly, Inc. (since Sep 2019)
•Recommendation Team
•Kaggle Master
•On Twitter under the name hakubishin: @jy_msc
We are hiring! https://www.wantedly.com/projects/375150

About Talk
・Why is Target Encoding effective?
・It is one of the standard techniques on Kaggle.
・In some cases Target Encoding works better than Label Encoding.
・Few resources explain why Target Encoding produces good results.
・This talk presents my own interpretation of why Target Encoding is effective.

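What is Target Encoding?
・A technique that converts a categorical variable into numeric values using the target variable.
・Each level of the categorical variable is replaced with the expected value of the target at that level.
・In general, the more levels the variable has, the larger the expected benefit.
For caveats and implementation details of Target Encoding, please check the Kaggle book (Kaggle本)!
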
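To make the definition concrete, here is a minimal sketch of the replacement step. The function name `target_encode` and the toy values are illustrative assumptions, not taken from the slides. This is the naive in-sample version; in practice the level means are usually computed out-of-fold to avoid leaking the target, which is the kind of caveat the slide defers to the Kaggle book.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category level with the mean target observed for that level.

    Naive in-sample version (assumption for illustration): no smoothing and
    no out-of-fold scheme.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    level_means = {c: sums[c] / counts[c] for c in sums}
    return [level_means[c] for c in categories], level_means


# Toy data (hypothetical values chosen so the means come out round).
x = ["A", "B", "A", "C", "B", "C"]
y = [62.0, 18.0, 58.0, 50.0, 22.0, 50.0]
encoded, level_means = target_encode(x, y)
print(level_means)  # {'A': 60.0, 'B': 20.0, 'C': 50.0}
print(encoded)      # [60.0, 20.0, 60.0, 50.0, 20.0, 50.0]
```
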
Why is it effective?
・It has the effect of simplifying the model.
・From here on, we use GBDT as the example.

Sample data
・The explanation uses the following data:
・The target variable y is continuous.
・The explanatory variable x is a categorical variable with 4 levels, x = {A, B, C, D}.
・E[y|x=A]=60, E[y|x=B]=20, E[y|x=C]=50, E[y|x=D]=10

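A dataset of this shape could be generated as follows. The per-level means come from the slide; the sample count per level (20), the Gaussian noise, its standard deviation (5), and the seed are assumptions made only for illustration, since the slides do not state how the data were produced.

```python
import random

# Conditional means E[y | x] as given on the slide.
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}


def make_sample_data(n_per_level=20, noise_sd=5.0, seed=0):
    """Draw y ~ Normal(E[y|x], noise_sd) for each level (assumed noise model)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for level, mean in LEVEL_MEANS.items():
        for _ in range(n_per_level):
            xs.append(level)
            ys.append(rng.gauss(mean, noise_sd))
    return xs, ys


xs, ys = make_sample_data()
for level in LEVEL_MEANS:
    vals = [y for x, y in zip(xs, ys) if x == level]
    print(level, round(sum(vals) / len(vals), 1))  # empirical mean per level
```
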
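GBDT review
Dataset:
$$D = \{(x_i, y_i)\}_{i=1}^{n} \quad (x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$
Additive model:
$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i) = \sum_{m=1}^{M} w_m(x_i)$$
Loss function:
$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
Here $w_m(x)$ is the leaf weight of the m-th tree, $T$ is the number of leaves in a tree, and $M$ is the number of trees.
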
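The additive model and the regularizer can be sketched in a few lines. Representing each tree as a plain Python callable, and the names `predict` and `omega`, are assumptions for illustration only.

```python
def predict(trees, x):
    """Additive model: the prediction is the sum of each tree's leaf weight for x."""
    return sum(tree(x) for tree in trees)


def omega(leaf_weights, gamma, lam):
    """Regularizer Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


# Two hypothetical trees: a stump on a numeric feature and a constant tree.
trees = [lambda x: 1.0 if x < 3 else 2.0, lambda x: 0.5]
print(predict(trees, 2))                      # 1.5
print(omega([1.0, 2.0], gamma=0.1, lam=2.0))  # 0.1*2 + 0.5*2*(1+4), about 5.2
```
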
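GBDT review
Loss function when adding the m-th tree (dropping terms that are constant in $f_m$):
$$L^{(m)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(m-1)} + f_m(x_i)\right) + \Omega(f_m)$$
$$\simeq \sum_{i=1}^{n} \left[ g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
$$= \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T$$
$I_j$ is the set of data assigned to the j-th leaf. $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the prediction of the first m-1 trees:
gradient: $$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}}$$
hessian: $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2}$$
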
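GBDT review
Once samples are assigned to leaves, the optimal weight of leaf j is
$$w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
and the loss at that point is
$$L^{(m)} = - \frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
The optimal split is searched for at each node by looking at how much the loss decreases when the samples are split.
gain: $$L_{bef} - (L_{af,left} + L_{af,right})$$
The larger the gain (the difference in loss before and after the split), the better the split.
[Figure: a node containing A, B, C, D with loss L_bef is split into a left node with loss L_af,left and a right node with loss L_af,right]
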
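The optimal leaf weight, the resulting per-leaf loss, and the gain can be written as three small functions. This is a sketch under the formulas above; the function names are illustrative, and the gain includes the γT term by counting one leaf before the split and two after it.

```python
def leaf_weight(g_sum, h_sum, lam=0.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -g_sum / (h_sum + lam)


def leaf_score(g_sum, h_sum, lam=0.0):
    """Loss contribution of one leaf at its optimal weight (without gamma)."""
    return -0.5 * g_sum ** 2 / (h_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=0.0, gamma=0.0):
    """L_bef - (L_af,left + L_af,right): one leaf before the split, two after."""
    before = leaf_score(g_left + g_right, h_left + h_right, lam) + gamma
    after = (leaf_score(g_left, h_left, lam)
             + leaf_score(g_right, h_right, lam)
             + 2 * gamma)
    return before - after


# Hypothetical gradient/hessian sums for the two children of a node.
print(leaf_weight(-4.0, 2.0))             # 2.0
print(split_gain(-4.0, 2.0, 6.0, 3.0))    # about 9.6
```
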
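GBDT review
When the loss function is MSE:
Loss function: $$l(y_i, \hat{y}_i) = \frac{1}{2} (y_i - \hat{y}_i)^2$$
gradient: $$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}} = \hat{y}_i^{(m-1)} - y_i$$
hessian: $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2} = 1$$
Therefore, with λ = 0, the weight of leaf j is the mean residual of the samples assigned to leaf j:
$$w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i} = - \frac{\sum_{i \in I_j} (\hat{y}_i^{(m-1)} - y_i)}{\sum_{i \in I_j} 1}$$
The numerator (with the leading minus sign) is the sum of residuals (true value minus the prediction after m-1 trees); the denominator is the number of samples.
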
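A quick numerical check of the MSE case: with g = ŷ − y and h = 1, the optimal leaf weight reduces to the mean residual when λ = 0. The function names and the three sample values are assumptions for illustration.

```python
def mse_grad_hess(y_true, y_pred):
    """Gradient and hessian of l = 0.5 * (y - y_hat)^2 w.r.t. y_hat."""
    grads = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0] * len(y_true)
    return grads, hess


def mse_leaf_weight(y_true, y_pred, lam=0.0):
    """Optimal weight of a leaf holding these samples."""
    g, h = mse_grad_hess(y_true, y_pred)
    return -sum(g) / (sum(h) + lam)


# Hypothetical leaf with three samples; previous prediction is 0 everywhere.
y_true = [58.0, 62.0, 60.0]
y_pred = [0.0, 0.0, 0.0]
w = mse_leaf_weight(y_true, y_pred)
mean_residual = sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)
print(w, mean_residual)  # 60.0 60.0
```

With λ > 0 the weight is shrunk toward zero, for example `mse_leaf_weight(y_true, y_pred, lam=1.0)` gives 45.0.
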
GBDT settings
・Consider a simple model.
・loss_func = 'MSE'
・eta = 1 → step size
・iteration = 1 → consider only the first tree
・tree_method = 'exact' → brute-force search over all splits
・base_score = 0 → the initial prediction starts at 0
・lambda = 0
・gamma = 0

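The slide does not name a library, but most of these setting names match XGBoost parameters, so one possible mapping is sketched below. The mapping itself is an assumption: `loss_func` and `iteration` are not XGBoost names, and squared error is used for the objective because the derivation on the preceding slides is for MSE. The dictionary is only constructed here, not passed to a training call.

```python
# Assumed mapping of the slide's settings onto XGBoost-style parameter names.
params = {
    "objective": "reg:squarederror",  # loss_func: squared error (assumption, see lead-in)
    "eta": 1.0,                       # step size: use the full leaf weight
    "tree_method": "exact",           # enumerate every candidate split
    "base_score": 0.0,                # initial prediction starts at 0
    "lambda": 0.0,                    # no L2 penalty on leaf weights
    "gamma": 0.0,                     # no per-leaf penalty
}
num_boost_round = 1  # iteration = 1: only the first tree is considered

for name, value in params.items():
    print(f"{name} = {value}")
```

With these values the first tree's leaf weights equal the per-leaf mean of y, which is what the following slides rely on.
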
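Label Encoding case
・Label-encode the categorical variable in alphabetical order.
・Sort the samples by the value of the feature.
[Figure: samples sorted by the label-encoded feature, annotated "a sad-looking graph…"]
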
Label Encoding case (depth=1)
[Figure: samples in label order A, B, C, D; each candidate threshold shows the optimal leaf weights w*_left and w*_right]
Loss before splitting: L1 = -48797
Candidate splits:
A | B, C, D: L2,left = -35522, L2,right = -21391, L2 = -56913
A, B | C, D: L2,left = -31832, L2,right = -17951, L2 = -49783
A, B, C | D: L2,left = -56097, L2,right = -996, L2 = -57093
Splitting at A, B, C | D looks best.

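The candidate losses at depth 1 can be reproduced approximately with a noise-free version of the data. The sketch below assumes 20 samples per level, each equal to its level mean, and a starting prediction of 0; both the sample count and the absence of noise are assumptions, so the values are close to, not identical with, the numbers on the slide. In this idealized setting the splits A | B, C, D and A, B, C | D tie exactly, whereas the slide's noisy data favor the latter.

```python
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # from the slides
N_PER_LEVEL = 20  # assumption: equal sample count per level, no noise


def node_score(levels):
    """-0.5 * (sum of residuals)^2 / n for a node holding these levels (MSE, lambda=0)."""
    n = N_PER_LEVEL * len(levels)
    total = sum(N_PER_LEVEL * LEVEL_MEANS[l] for l in levels)
    return -0.5 * total ** 2 / n


def candidate_splits(order):
    """All threshold splits of the levels laid out in the given order."""
    out = []
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        out.append((left, right, node_score(left) + node_score(right)))
    return out


label_order = ["A", "B", "C", "D"]  # alphabetical Label Encoding
print("root", round(node_score(label_order)))
for left, right, loss in candidate_splits(label_order):
    print("".join(left), "|", "".join(right), round(loss))
```

Passing the levels in a different order, for example sorted by their mean, evaluates the candidate thresholds under another encoding.
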
Label Encoding case (depth=2)
Tree so far: L1 = -48797, split into L2,left = -56097 (A, B, C) and L2,right = -996 (D).
Candidate splits of the left node (L2 = -56097):
A | B, C: L3,left = -35522, L3,right = -24589, L3 = -60111
A, B | C: L3,left = -31832, L3,right = -24937, L3 = -56769
Splitting at A | B, C looks best.

Label Encoding case (depth=3)
Tree so far: L1 = -48797; L2,left = -56097, L2,right = -996; L3,left = -35522, L3,right = -24589.
Candidate split of the node holding B and C (L3 = -24589):
B | C: L4,left = -4076, L4,right = -24937, L4 = -29013
Splitting here looks best.
Splitting is finished. Final tree: L1 = -48797; L2,left = -56097, L2,right = -996; L3,left = -35522, L3,right = -24589; L4,left = -4076, L4,right = -24937.

Target Encoding case
・Target-encode the categorical variable.
・Sort the samples by the value of the feature.
[Figure: samples sorted by the target-encoded feature]

Target Encoding case (depth=1)
[Figure: samples in target-encoded order D, B, C, A; each candidate threshold shows the optimal leaf weights w*_left and w*_right]
Loss before splitting: L1 = -48797
Candidate splits:
D | B, C, A: L2,left = -996, L2,right = -56097, L2 = -57093
D, B | C, A: L2,left = -4551, L2,right = -59992, L2 = -64543
D, B, C | A: L2,left = -21391, L2,right = -35522, L2 = -56913
Splitting at D, B | C, A looks best.

Target Encoding case (depth=2)
Tree so far: L1 = -48797, split into L2,left = -4551 (D, B) and L2,right = -59992 (C, A).
Split of the left node (L2,left = -4551):
D | B: L3,left = -996, L3,right = -4076, L3 = -5072
Split of the right node (L2,right = -59992):
C | A: L'3,left = -24937, L'3,right = -35522, L3 = -60459
Splitting at both of these points looks best.
Splitting is finished. Final tree: L1 = -48797; L2,left = -4551, L2,right = -59992; L3,left = -996, L3,right = -4076; L'3,left = -24937, L'3,right = -35522.

Comparison of Label Encoding and Target Encoding
Tree built with Label Encoding (depth 3): L1 = -48797; L2,left = -56097, L2,right = -996; L3,left = -35522, L3,right = -24589; L4,left = -4076, L4,right = -24937
Tree built with Target Encoding (depth 2): L1 = -48797; L2,left = -4551, L2,right = -59992; L3,left = -996, L3,right = -4076; L'3,left = -24937, L'3,right = -35522

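The depth difference between the two trees can be checked with a small greedy tree builder. As before, the data are an idealized assumption (20 noise-free samples per level, MSE loss, λ = 0, starting prediction 0), and the function names are illustrative.

```python
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # from the slides
N_PER_LEVEL = 20  # assumption: equal sample count per level, no noise


def node_score(levels):
    """-0.5 * (sum of residuals)^2 / n for a node holding these levels."""
    n = N_PER_LEVEL * len(levels)
    total = sum(N_PER_LEVEL * LEVEL_MEANS[l] for l in levels)
    return -0.5 * total ** 2 / n


def tree_depth(order):
    """Depth of the greedy tree grown until every leaf holds a single level."""
    if len(order) == 1:
        return 0
    # Pick the threshold with the lowest total loss (equivalently, highest gain).
    best_k = min(range(1, len(order)),
                 key=lambda k: node_score(order[:k]) + node_score(order[k:]))
    return 1 + max(tree_depth(order[:best_k]), tree_depth(order[best_k:]))


label_order = sorted(LEVEL_MEANS)                        # A, B, C, D
target_order = sorted(LEVEL_MEANS, key=LEVEL_MEANS.get)  # D, B, C, A
print(tree_depth(label_order), tree_depth(target_order))  # 3 2
```
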
(Admittedly this was a rather contrived example, but) doesn't Target Encoding look a little more efficient?

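What is Target Encoding doing for us?
・It makes the tree structure simpler.
・When the loss function is MSE, in the early iterations it has the effect of placing levels with similar residual (gradient) magnitudes closer to each other.
→ Each group of samples produced by a split has relatively similar residuals, so learning is efficient.
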
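One rough way to see the "similar residuals end up next to each other" effect is to sum the differences in mean residual between neighbouring levels along the encoded axis. With a starting prediction of 0, the mean residual of a level in the first iteration equals its target mean. The `adjacent_gap` measure is an ad hoc illustration introduced here, not something defined in the slides.

```python
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # from the slides


def adjacent_gap(order):
    """Sum of |mean difference| between neighbouring levels along the encoded axis."""
    return sum(abs(LEVEL_MEANS[a] - LEVEL_MEANS[b])
               for a, b in zip(order, order[1:]))


label_order = sorted(LEVEL_MEANS)                        # A, B, C, D
target_order = sorted(LEVEL_MEANS, key=LEVEL_MEANS.get)  # D, B, C, A
print(adjacent_gap(label_order))   # 110.0
print(adjacent_gap(target_order))  # 50.0
```

A smaller value means that a threshold on the encoded feature separates levels whose residuals already differ, instead of cutting between levels that behave alike.
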
As the number of levels grows
・The effect of Target Encoding becomes easier to notice as the number of levels increases.
・Sorting the categories by residual magnitude beforehand makes the splits more efficient.
[Figure: two example splits, each annotated with the optimal leaf weights w*_left and w*_right]

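Depth required to fully separate all levels
・With Target Encoding the tree is shallower and its structure is simpler.
・Below are the tree structures obtained when splitting a categorical variable with 100 levels.
[Figure: tree structures for Label Encoding and Target Encoding]
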
Loss reduction at each depth
・Target Encoding reduces the loss more efficiently.
・The more levels there are, the larger the gap from Label Encoding becomes.

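The loss at each depth can be explored with a level-wise greedy simulation. Everything about the data here is an assumption for illustration: 100 levels whose means are drawn uniformly from [0, 100] with a fixed seed, 20 noise-free samples per level, MSE loss with λ = 0, and a starting prediction of 0. The slides do not say how their 100-level experiment was set up.

```python
import random


def node_score(sums, counts, levels):
    """-0.5 * (sum of residuals)^2 / n for a node holding these levels."""
    n = sum(counts[l] for l in levels)
    s = sum(sums[l] for l in levels)
    return -0.5 * s * s / n


def best_split(sums, counts, levels):
    """Threshold split with the lowest total loss over the ordered levels."""
    k = min(range(1, len(levels)),
            key=lambda k: node_score(sums, counts, levels[:k])
            + node_score(sums, counts, levels[k:]))
    return levels[:k], levels[k:]


def loss_by_depth(sums, counts, order, max_depth):
    """Total loss after growing every splittable node once per depth."""
    nodes = [list(order)]
    losses = [sum(node_score(sums, counts, n) for n in nodes)]
    for _ in range(max_depth):
        nxt = []
        for node in nodes:
            if len(node) == 1:
                nxt.append(node)  # a single level cannot be split further
            else:
                nxt.extend(best_split(sums, counts, node))
        nodes = nxt
        losses.append(sum(node_score(sums, counts, n) for n in nodes))
    return losses


rng = random.Random(0)
n_levels, n_per_level = 100, 20  # assumed sizes
means = [rng.uniform(0.0, 100.0) for _ in range(n_levels)]
levels = list(range(n_levels))  # Label Encoding: arbitrary level ids
sums = {l: n_per_level * means[l] for l in levels}
counts = {l: n_per_level for l in levels}
target_order = sorted(levels, key=lambda l: means[l])  # Target Encoding order

le = loss_by_depth(sums, counts, levels, max_depth=6)
te = loss_by_depth(sums, counts, target_order, max_depth=6)
for depth, (a, b) in enumerate(zip(le, te)):
    print(depth, round(a), round(b))  # depth, Label Encoding loss, Target Encoding loss
```

At depth 1 the target-sorted order cannot do worse, because the best two-group partition under squared error is contiguous in sorted-mean order. For deeper levels the comparison is empirical, so the printed columns are meant to be inspected, not taken as a guarantee.
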
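Won't the model sort this out by itself if we increase depth / iterations?
・Information that is clearly known to be useful is better passed to the model explicitly.
・Label Encoding may eventually manage it, but the model tends to become complex. The more levels there are, the less realistic that becomes.
・From experience, things that are clearly known to work are better handled before training.
・The same goes for feature interactions.
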
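Summary
・Target Encoding makes the model simpler.
・When the loss function is MSE, in the early iterations, sorting the levels in order of residual size makes efficient splits possible.
・The more levels there are, the larger the effect of Target Encoding.
・Achieving the equivalent of Target Encoding with Label Encoding requires a certain tree depth, which becomes less realistic as the number of levels grows.
・The model may be able to do this implicitly without Target Encoding, but things that are clearly known to be useful are better handled before they are fed to the model.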