Why Is Target Encoding Effective? (Target Encoding はなぜ有効なのか)
Shuhei Goda
November 30, 2019
Data Analysis Competition LT Meetup (分析コンペLT会)
https://kaggle-friends.connpass.com/event/154881/
Transcript
©2019 Wantedly, Inc.
Why Is Target Encoding Effective?
Data Analysis Competition LT Meetup (分析コンペLT会), Nov 30, 2019 - Shuhei Goda - @jy_msc

Self-Introduction
• Shuhei Goda (合田 周平)
• Wantedly, Inc. (since Sep 2019)
• Recommendation Team
https://www.wantedly.com/projects/375150
Kaggle Master
I am on Twitter under the name hakubishin: @jy_msc
We are hiring!

About Talk
• Why is Target Encoding effective?
  • It is one of the standard techniques on Kaggle
  • In some cases Target Encoding works better than Label Encoding
• Few resources explain why Target Encoding produces good results
• This talk presents my own interpretation of why Target Encoding is effective
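
What Is Target Encoding?
• A technique that converts a categorical variable into numeric values using the target variable
• Each level of the categorical variable is replaced with the expected value of the target variable at that level
• In general, the more levels there are, the larger the expected benefit
For the caveats and implementation details of Target Encoding, please see the Kaggle book (Kaggle本)!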
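
The replacement of each level by the expected target value can be sketched in a few lines. This is a minimal illustration using only the standard library, not the implementation the talk refers to. The function names, the interleaved fold assignment, and the toy data are assumptions made for the example. The out-of-fold variant reflects one common precaution: a row's own target is not used to encode that row.

```python
from collections import defaultdict


def level_means(xs, ys):
    """Mean of the target for each level of the categorical variable."""
    total, count = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        total[x] += y
        count[x] += 1
    return {level: total[level] / count[level] for level in total}


def target_encode_oof(xs, ys, n_folds=2):
    """Out-of-fold target encoding.

    Each row is encoded with level means computed from the other folds only.
    Row i belongs to fold (i % n_folds); this interleaved split is an
    assumption chosen to keep the sketch short.
    """
    global_mean = sum(ys) / len(ys)
    encoded = [0.0] * len(xs)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        means = level_means([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])
        for i in range(fold, len(xs), n_folds):
            # Levels unseen in the training folds fall back to the global mean.
            encoded[i] = means.get(xs[i], global_mean)
    return encoded


# Toy data (made up for illustration).
xs = ["A", "A", "B", "B", "A", "B"]
ys = [60.0, 62.0, 20.0, 18.0, 58.0, 22.0]
print(level_means(xs, ys))
print(target_encode_oof(xs, ys))
```

In practice the fold assignment would usually be shuffled or stratified, and smoothing toward the global mean is often added for rare levels.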

Why Is It Effective?
• It has the effect of simplifying the model
  • From here on, GBDT is used as the example

Sample Data Used
• The explanation uses data like the following
  • The target variable y is continuous
  • The explanatory variable x is a categorical variable with 4 levels, x = {A, B, C, D}
    • E[y|x=A]=60, E[y|x=B]=20, E[y|x=C]=50, E[y|x=D]=10
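
Data of this shape could be generated as follows. The per-level means come from the slide. The number of samples per level, the Gaussian noise, and the seed are assumptions, since the transcript does not state how the data was produced.

```python
import random

# Conditional means E[y | x = level], taken from the slide.
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}


def make_sample_data(n_per_level=20, noise_sd=5.0, seed=0):
    """Return (xs, ys): a 4-level categorical feature and a continuous target.

    n_per_level, noise_sd and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for level, mean in LEVEL_MEANS.items():
        for _ in range(n_per_level):
            xs.append(level)
            ys.append(rng.gauss(mean, noise_sd))
    return xs, ys


xs, ys = make_sample_data()
print(len(xs), sorted(set(xs)))
```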
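
GBDT Review
Dataset:
$$D = \{(x_i, y_i)\}_{i=1}^{n} \quad (x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$
Additive model:
$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i) = \sum_{m=1}^{M} w_m(x_i)$$
Loss function:
$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
Here $w_m(x)$ is the leaf weight of the $m$-th tree, $T$ is the number of leaves in a tree, and $M$ is the number of trees.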
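
GBDT Review
Loss function when the $m$-th tree is added (the second-order approximation drops the terms that do not depend on $f_m$):
$$L^{(m)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(m-1)} + f_m(x_i)\bigr) + \Omega(f_m) \simeq \sum_{i=1}^{n} \Bigl[ g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \Bigr] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \Bigl[ \Bigl(\sum_{i \in I_j} g_i\Bigr) w_j + \frac{1}{2}\Bigl(\sum_{i \in I_j} h_i + \lambda\Bigr) w_j^2 \Bigr] + \gamma T$$
$I_j$ is the set of data assigned to the $j$-th leaf, and $g_i$, $h_i$ are the first and second derivatives of the loss with respect to the prediction of the first $m-1$ trees.
gradient:
$$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}}$$
hessian:
$$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2}$$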
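
GBDT Review
Once the samples have been assigned to leaves, the optimal weight of leaf $j$ is
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
and the loss at that point is
$$L^{(m)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
The best split is searched for at each node by looking at how much the loss decreases when the samples are split.
gain:
$$L_{\mathrm{bef}} - (L_{\mathrm{af,left}} + L_{\mathrm{af,right}})$$
The larger the gain (the difference in loss before and after the split), the better the split.
[Figure: a node containing A, B, C, D with loss $L_{\mathrm{bef}}$ is split into two child nodes with losses $L_{\mathrm{af,left}}$ and $L_{\mathrm{af,right}}$]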
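
The leaf weight, leaf loss, and gain formulas can be written directly as small functions. This is a sketch under the definitions above. The function names and the toy gradients are assumptions. The `gamma` handling charges one extra leaf for the split, which follows from the $\gamma T$ term and reduces to the gain expression above when `gamma = 0`.

```python
def leaf_weight(grad, hess, lam=0.0):
    """Optimal leaf weight: -sum(g) / (sum(h) + lambda)."""
    return -sum(grad) / (sum(hess) + lam)


def leaf_loss(grad, hess, lam=0.0):
    """Loss contribution of one leaf at its optimal weight (gamma excluded)."""
    return -0.5 * sum(grad) ** 2 / (sum(hess) + lam)


def split_gain(grad, hess, split_at, lam=0.0, gamma=0.0):
    """Loss before the split minus loss after splitting at index split_at.

    Samples [0, split_at) go left and the rest go right.
    """
    before = leaf_loss(grad, hess, lam) + gamma  # one leaf
    after = (leaf_loss(grad[:split_at], hess[:split_at], lam)
             + leaf_loss(grad[split_at:], hess[split_at:], lam)
             + 2 * gamma)  # two leaves
    return before - after


# Toy gradients for four samples (made up), with hessian 1 as in the MSE case.
grad = [-60, -60, -10, -10]
hess = [1, 1, 1, 1]
print(leaf_weight(grad[:2], hess[:2]))  # weight of the left leaf
print(split_gain(grad, hess, 2))        # gain of splitting in the middle
```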
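
GBDT Review
When the loss function is MSE
Loss function:
$$l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$$
gradient:
$$g_i = \frac{\partial l(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}} = \hat{y}_i^{(m-1)} - y_i$$
hessian:
$$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(m-1)})}{(\partial \hat{y}_i^{(m-1)})^2} = 1$$
Hence, with $\lambda = 0$,
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i} = -\frac{\sum_{i \in I_j} (\hat{y}_i^{(m-1)} - y_i)}{\sum_{i \in I_j} 1} = \frac{\text{sum of residuals (true value} - \text{prediction after } m-1 \text{ trees)}}{\text{number of samples}}$$
so the weight of leaf $j$ is the mean residual of the samples assigned to leaf $j$.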
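
A quick numerical check of the MSE case: with gradient $\hat{y} - y$ and hessian 1, the optimal leaf weight for $\lambda = 0$ should equal the mean residual. The targets below are made up for illustration, and the helper name is an assumption.

```python
def mse_grad_hess(y_true, y_pred):
    """Gradient and hessian of l = 1/2 (y - y_hat)^2 with respect to y_hat."""
    grad = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0] * len(y_true)
    return grad, hess


y_true = [58.0, 61.0, 63.0]  # made-up targets of the samples in one leaf
y_pred = [0.0, 0.0, 0.0]     # prediction before the first tree

grad, hess = mse_grad_hess(y_true, y_pred)
weight = -sum(grad) / sum(hess)  # optimal leaf weight with lambda = 0
mean_residual = sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)
print(weight, mean_residual)
```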

GBDT Settings
• Consider a simple model.
  • loss_func = 'MSE'
  • eta = 1 → step size
  • iteration = 1 → consider only the first tree
  • tree_method = 'exact' → brute-force exhaustive search
  • base_score = 0 → the initial prediction starts at 0
  • lambda = 0
  • gamma = 0
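
If these settings are meant for XGBoost (an assumption: `loss_func` and `iteration` are not literal XGBoost parameter names), they might map onto a parameter dictionary like the one below. The squared-error objective matches the MSE derivation on the preceding slides.

```python
# Hypothetical mapping of the slide's settings onto XGBoost parameter names.
xgb_params = {
    "objective": "reg:squarederror",  # squared-error (MSE) loss
    "eta": 1.0,                       # step size
    "tree_method": "exact",           # exhaustive split search
    "base_score": 0.0,                # initial prediction starts at 0
    "lambda": 0.0,                    # no L2 penalty on leaf weights
    "gamma": 0.0,                     # no per-leaf penalty
}
num_boost_round = 1  # only the first tree is considered

print(xgb_params, num_boost_round)
```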
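
Using Label Encoding
• Label-encode the categorical variable in alphabetical order
• Sort the samples by feature value
[Figure: samples sorted by the label-encoded feature, annotated "a sad-looking graph…"]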

Using Label Encoding (depth=1)
Sorted order: A, B, C, D. Loss before the split: $L_1 = -48797$. Each candidate split gives its two sides the weights $w^*_{left}$ and $w^*_{right}$.
• {A} | {B, C, D}: $L_{2,left} = -35522$, $L_{2,right} = -21391$, $L_2 = -56913$
• {A, B} | {C, D}: $L_{2,left} = -31832$, $L_{2,right} = -17951$, $L_2 = -49783$
• {A, B, C} | {D}: $L_{2,left} = -56097$, $L_{2,right} = -996$, $L_2 = -57093$ → splitting here looks best

Using Label Encoding (depth=2)
Tree so far: {A, B, C, D} ($L_1 = -48797$) is split into {A, B, C} ($L_{2,left} = -56097$) and {D} ($L_{2,right} = -996$). The node {A, B, C} with $L_2 = -56097$ is split next.
• {A} | {B, C}: $L_{3,left} = -35522$, $L_{3,right} = -24589$, $L_3 = -60111$ → splitting here looks best
• {A, B} | {C}: $L_{3,left} = -31832$, $L_{3,right} = -24937$, $L_3 = -56769$

Using Label Encoding (depth=3)
Tree so far: {A, B, C, D} ($L_1 = -48797$) is split into {A, B, C} ($L_{2,left} = -56097$) and {D} ($L_{2,right} = -996$), and {A, B, C} is split into {A} ($L_{3,left} = -35522$) and {B, C} ($L_{3,right} = -24589$). The node {B, C} with $L_3 = -24589$ is split next.
• {B} | {C}: $L_{4,left} = -4076$, $L_{4,right} = -24937$, $L_4 = -29013$ → splitting here looks best
Splitting is finished.

Using Target Encoding
• Target-encode the categorical variable
• Sort the samples by feature value
[Figure: samples sorted by the target-encoded feature]

Using Target Encoding (depth=1)
Sorted order: D, B, C, A. Loss before the split: $L_1 = -48797$. Each candidate split gives its two sides the weights $w^*_{left}$ and $w^*_{right}$.
• {D} | {B, C, A}: $L_{2,left} = -996$, $L_{2,right} = -56097$, $L_2 = -57093$
• {D, B} | {C, A}: $L_{2,left} = -4551$, $L_{2,right} = -59992$, $L_2 = -64543$ → splitting here looks best
• {D, B, C} | {A}: $L_{2,left} = -21391$, $L_{2,right} = -35522$, $L_2 = -56913$
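
The depth-1 search for both encodings can be reproduced approximately with noise-free data. This is a sketch assuming exactly 20 samples per level with $y$ equal to the level mean, so the losses come out close to the slide's values but not identical to them. With `base_score = 0`, `lambda = 0` and MSE, the loss of a node reduces to $-\frac{1}{2}(\sum y)^2 / n$.

```python
LEVEL_MEANS = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}  # from the slide
N_PER_LEVEL = 20  # assumption: equal, noise-free groups


def node_loss(levels):
    """Loss of a node holding the given levels: -1/2 * (sum y)^2 / n."""
    n = N_PER_LEVEL * len(levels)
    sum_y = sum(N_PER_LEVEL * LEVEL_MEANS[level] for level in levels)
    return -0.5 * sum_y ** 2 / n


def candidate_splits(order):
    """All threshold splits of levels sorted by their encoded value."""
    return [(order[:k], order[k:],
             node_loss(order[:k]) + node_loss(order[k:]))
            for k in range(1, len(order))]


label_order = sorted(LEVEL_MEANS)                        # A, B, C, D
target_order = sorted(LEVEL_MEANS, key=LEVEL_MEANS.get)  # D, B, C, A

for name, order in [("label", label_order), ("target", target_order)]:
    print(name, "encoding, loss before split:", node_loss(order))
    for left, right, loss in candidate_splits(order):
        print("  ", left, "|", right, "->", round(loss, 1))

best_label = min(loss for _, _, loss in candidate_splits(label_order))
best_target = min(loss for _, _, loss in candidate_splits(target_order))
```

Under these assumptions the best first split in target-encoded order reaches a lower loss than any first split in label-encoded order, which mirrors the comparison on the slides.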

Using Target Encoding (depth=2)
Tree so far: {A, B, C, D} ($L_1 = -48797$) is split into {D, B} ($L_{2,left} = -4551$) and {C, A} ($L_{2,right} = -59992$). Each child node has a single candidate split.
• {D} | {B}: $L_{3,left} = -996$, $L_{3,right} = -4076$, $L_3 = -5072$ → splitting here looks best
• {C} | {A}: $L'_{3,left} = -24937$, $L'_{3,right} = -35522$, $L'_3 = -60459$ → splitting here looks best
Splitting is finished.
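
Comparison of Label Encoding and Target Encoding
Tree built with Label Encoding (depth 3):
• {A, B, C, D} ($L_1 = -48797$) → {A, B, C} ($L_{2,left} = -56097$) and {D} ($L_{2,right} = -996$)
• {A, B, C} → {A} ($L_{3,left} = -35522$) and {B, C} ($L_{3,right} = -24589$)
• {B, C} → {B} ($L_{4,left} = -4076$) and {C} ($L_{4,right} = -24937$)
Tree built with Target Encoding (depth 2):
• {A, B, C, D} ($L_1 = -48797$) → {D, B} ($L_{2,left} = -4551$) and {C, A} ($L_{2,right} = -59992$)
• {D, B} → {D} ($L_{3,left} = -996$) and {B} ($L_{3,right} = -4076$)
• {C, A} → {C} ($L'_{3,left} = -24937$) and {A} ($L'_{3,right} = -35522$)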

(Admittedly this was a rather contrived example, but) doesn't Target Encoding look a little more efficient?
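
What Is Target Encoding Doing for Us?
• It makes the tree structure simpler
• With an MSE loss, in the early iterations it has the effect of placing levels with similar residuals (gradients) closer together.
→ Each group of samples produced by a split has relatively similar residuals, so learning is efficient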

As the Number of Levels Grows
• The effect of Target Encoding becomes easier to notice as the number of levels increases
• Splitting is more efficient when the categories are sorted by residual size beforehand.
[Figure: candidate splits with leaf weights $w^*_{left}$ and $w^*_{right}$]
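
Depth Required to Fully Separate All Levels
• Target Encoding gives a shallower tree, so the tree structure is simpler
• Below are the tree structures obtained when splitting a categorical variable with 100 levels
[Figures: tree structure with Label Encoding; tree structure with Target Encoding]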
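
The depth comparison can be explored with a small simulation. This is a sketch under several assumptions that are not in the talk: equal-sized, noise-free groups per level, level means drawn uniformly at random, label encoding modeled as an arbitrary order unrelated to the target, and greedy loss-minimizing threshold splits that continue until every level sits in its own leaf.

```python
import random


def tree_depth(means, n_per_level=20):
    """Depth needed to isolate every level with greedy threshold splits.

    `means` lists E[y | level] in the order induced by the encoded feature.
    Node loss is -1/2 * (sum y)^2 / n (MSE, base_score = 0, lambda = 0).
    """
    prefix = [0.0]
    for m in means:
        prefix.append(prefix[-1] + n_per_level * m)

    def loss(lo, hi):
        # Loss of a node holding levels lo .. hi-1.
        return -0.5 * (prefix[hi] - prefix[lo]) ** 2 / (n_per_level * (hi - lo))

    def depth(lo, hi):
        if hi - lo == 1:
            return 0  # a single level: nothing left to split
        best = min(range(lo + 1, hi), key=lambda k: loss(lo, k) + loss(k, hi))
        return 1 + max(depth(lo, best), depth(best, hi))

    return depth(0, len(means))


# The 4-level example from the slides: alphabetical order vs. sorted by mean.
four = {"A": 60.0, "B": 20.0, "C": 50.0, "D": 10.0}
label_4 = [four[k] for k in sorted(four)]
target_4 = sorted(label_4)
print("4 levels:", tree_depth(label_4), tree_depth(target_4))

# 100 levels with made-up means; the list order stands in for label encoding.
rng = random.Random(0)
label_100 = [rng.uniform(0.0, 100.0) for _ in range(100)]
target_100 = sorted(label_100)
print("100 levels:", tree_depth(label_100), tree_depth(target_100))
```

For the 4-level example this gives depth 3 for the label-encoded order and depth 2 for the target-encoded order, matching the trees on the slides. The 100-level depths depend on the random means, so they are printed rather than stated here.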

Loss Reduction at Each Depth
• Target Encoding reduces the loss more efficiently
• The more levels there are, the larger the gap from Label Encoding becomes.
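
Won't the Model Sort It Out If We Just Increase the Depth / Iterations?
• Information that is clearly known to be useful is better passed to the model explicitly
• Label Encoding may still manage somehow, but the model tends to become complex. The more levels there are, the less realistic that is.
• From experience, things that are clearly known to work are better handled before training.
• The same goes for feature interactions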
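
Summary
• Target Encoding makes the model simpler
• With an MSE loss, in the early iterations, sorting by residual size makes efficient splits possible.
• The effect of Target Encoding grows as the number of levels increases
• Doing the equivalent of Target Encoding with Label Encoding requires a certain tree depth, which becomes less realistic as the number of levels increases.
• The model may be able to do this implicitly without Target Encoding, but things that are clearly known to be good are better handled before the data goes into the model.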