Slide 1

Slide 1 text

A Stacking Ensemble Model for Prediction of Multi-type Tweet Engagements
RecSys Challenge 2020 3rd place solution
RecSys2020 paper reading group, 17 Oct. 2020
Shuhei Goda, Naomichi Agata, Yuya Matsumura
©2020 Wantedly, Inc.

Slide 2

Slide 2 text

Team Wantedly
• Shuhei Goda (@jy_msc)
• Naomichi Agata (@agatan_)
• Yuya Matsumura (@yu__ya4)

Slide 3

Slide 3 text

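CHALLENGE TASK
The task is to predict user engagement on Twitter.
• For each (tweet ID, engaging user ID) pair, predict whether each type of engagement occurs: multi-label binary classification.
• There are four target labels: Like, Reply, RT, and RT with comment.
Two evaluation metrics are used:
1. PR-AUC
2. RCE (Relative Cross Entropy): how much better the cross entropy (CE) is relative to a naive prediction of the mean value. Higher is better.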
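The RCE definition above can be sketched as follows. This is a minimal illustration, not the official scoring script: the function names are hypothetical, and the scaling by 100 is an assumption about how the relative improvement is reported.

```python
import math

def cross_entropy(labels, preds, eps=1e-15):
    """Mean binary cross entropy (log loss)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def relative_cross_entropy(labels, preds):
    """Improvement in CE over always predicting the mean positive rate.

    The factor of 100 (percentage scale) is an assumption.
    """
    base_rate = sum(labels) / len(labels)
    naive_ce = cross_entropy(labels, [base_rate] * len(labels))
    model_ce = cross_entropy(labels, preds)
    return (1.0 - model_ce / naive_ce) * 100.0

labels = [1, 0, 0, 1, 0]
rce_good = relative_cross_entropy(labels, [0.9, 0.1, 0.2, 0.8, 0.1])
rce_naive = relative_cross_entropy(labels, [0.4] * 5)
rce_bad = relative_cross_entropy(labels, [0.1, 0.9, 0.8, 0.2, 0.9])
```

A model that only predicts the mean positive rate scores 0, better models score above 0, and models worse than the naive baseline score below 0.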

Slide 4

Slide 4 text

DATASET DESCRIPTION
Dataset information used in the Challenge
• Tweet information: tweet ID, timestamp, text token, etc.
• Information on the engaging user: engaging user ID, follower count, etc.
• Information on the engaged user: engaged user ID, follower count, etc.
• Engagement information (targets): timestamps of the engagements
How the evaluation data is split
[Figure: timeline of the data split, with the labels "Training Data (~120 million samples)", "Validation Data", "Testing Data", and two periods of 1 week each.]

Slide 5

Slide 5 text

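Privacy-Preserving
Negative examples are created with Pseudo Negative Labels
• This prevents disclosing the negative information that a user saw a tweet but did not engage with it.
• Negatives are randomly sampled from the follow graph, which mixes the following two kinds of samples:
1. Samples where the user saw the tweet but did not engage
2. Samples where the user did not engage because they never saw the tweet
Data that can no longer be used grows day by day because of GDPR
• Because the dataset complies with GDPR, certain data becomes unusable for both training and evaluation.
• Over the roughly 3 months from the start to the end of the challenge, the training dataset shrank from 160 million to 120 million records.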

Slide 6

Slide 6 text

DATASET CHARACTERISTICS (1)
Dataset size
• Training dataset: about 160 million records (about 120 million in the end)
• Evaluation dataset: about 15 million records (about 12 million in the end)
• The float32 submission files total around 4 GB.
Handled with the following computing resources
• Google BigQuery
• Google Dataflow
• Google Compute Engine (vCPUs: 64, Memory: 600GB)

Slide 7

Slide 7 text

DATASET CHARACTERISTICS (2)
Label imbalance
• The share of positives decreases in the order Like > RT > Reply > RT with Comment.
• 43% of samples are positive for Like, but only 0.7% are positive for RT with Comment.

Slide 8

Slide 8 text

DATASET CHARACTERISTICS (3)
High co-occurrence between engagements
• A user may perform several types of engagement on a single tweet.
• Certain pairs of engagements co-occur frequently.
  • e.g. RT and Like, RT and RT with comment

Slide 9

Slide 9 text

OVERVIEW OF OUR SOLUTION
Model Architecture
• Stacking LightGBMs
Features
• Categorical Features
• Network Features
• Text Features
• Meta Features
• etc.
Training Process
• Bagging with negative under sampling
• Stratified K-Folds over Retweet with Comment

Slide 10

Slide 10 text

MODEL ARCHITECTURE
[Figure: two-stage architecture diagram. Inputs: Target Independent Features, Target Dependent Features. The First Stage Models: Like Models, Reply Models, RT Models, RT with Comment Models. Outputs: Like Predictions, Reply Predictions, RT Predictions, RT with Comment Predictions, used as Meta Features. The Second Stage Models: Like Models, Reply Models, RT Models, RT with Comment Models.]

Slide 11

Slide 11 text

MODEL ARCHITECTURE
1st Stage
Build one model per engagement type, each predicting a single engagement (1st stage models).
[Figure: the architecture diagram from Slide 10, labeled "1st Stage".]

Slide 12

Slide 12 text

MODEL ARCHITECTURE
2nd Stage
Build one model per engagement type, each taking the predictions of the 1st stage models as additional input (2nd stage models).
[Figure: the architecture diagram from Slide 10, labeled "2nd Stage".]
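The data flow between the two stages can be sketched as below. This is an illustration under stated assumptions, not the team's implementation: `fit_rate_model` is a toy stand-in for LightGBM, the `pred_` column names are hypothetical, and the use of out-of-fold predictions is the usual stacking practice rather than something these slides spell out.

```python
from collections import defaultdict

TARGETS = ["like", "reply", "rt", "rt_with_comment"]

def fit_rate_model(rows, labels, key):
    """Toy stand-in for a 1st stage LightGBM: positive rate per feature value."""
    pos, cnt = defaultdict(float), defaultdict(int)
    for row, y in zip(rows, labels):
        pos[row[key]] += y
        cnt[row[key]] += 1
    prior = sum(labels) / len(labels)
    # Unseen feature values fall back to the overall positive rate.
    return lambda row: pos[row[key]] / cnt[row[key]] if cnt[row[key]] else prior

def out_of_fold_predictions(rows, labels, folds, key, n_folds):
    """Predict every row with a model that was not trained on that row's fold."""
    oof = [0.0] * len(rows)
    for k in range(n_folds):
        train_idx = [i for i, f in enumerate(folds) if f != k]
        model = fit_rate_model([rows[i] for i in train_idx],
                               [labels[i] for i in train_idx], key)
        for i, f in enumerate(folds):
            if f == k:
                oof[i] = model(rows[i])
    return oof

def add_first_stage_predictions(rows, labels_by_target, folds, key, n_folds):
    """Copy the rows and append one 1st stage prediction per engagement type."""
    stage2_rows = [dict(r) for r in rows]
    for target in TARGETS:
        oof = out_of_fold_predictions(rows, labels_by_target[target],
                                      folds, key, n_folds)
        for row, p in zip(stage2_rows, oof):
            row["pred_" + target] = p
    return stage2_rows

rows = [{"lang": "en"}, {"lang": "ja"}, {"lang": "en"},
        {"lang": "ja"}, {"lang": "en"}, {"lang": "ja"}]
folds = [0, 1, 2, 0, 1, 2]
labels_by_target = {
    "like": [1, 0, 1, 0, 1, 0],
    "reply": [0, 0, 1, 0, 0, 0],
    "rt": [1, 0, 0, 0, 1, 0],
    "rt_with_comment": [0, 0, 0, 0, 1, 0],
}
stage2 = add_first_stage_predictions(rows, labels_by_target, folds, "lang", 3)
```

Each 2nd stage model would then be trained on `stage2`, so the Like model, for example, also sees the predicted Reply, RT, and RT with comment probabilities.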

Slide 13

Slide 13 text

FEATURES
Categorical Features
Features built from categorical variables with several encoding methods
• Label Encoding for low-cardinality categorical variables
  • e.g. language, tweet type
• Frequency Encoding & Target Encoding for high-cardinality categorical variables
  • e.g. tweet ID, user ID
Features that treat a combination of categorical variables as a new category
• These can capture complex relationships between categorical variables.
  • e.g. Hashtag × engaging user ID
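A minimal sketch of frequency encoding, target encoding, and a combined category follows. The additive smoothing toward the global mean and all function and variable names are assumptions for illustration; the slides do not state how the encodings were parameterized.

```python
from collections import Counter, defaultdict

def frequency_encode(values):
    """Replace each category by how often it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]

def target_encode(train_values, train_labels, values, smoothing=10.0):
    """Smoothed mean label per category; unseen categories get the prior.

    The smoothing scheme is an assumption, not taken from the slides.
    """
    prior = sum(train_labels) / len(train_labels)
    sums, counts = defaultdict(float), Counter()
    for v, y in zip(train_values, train_labels):
        sums[v] += y
        counts[v] += 1
    return [(sums[v] + smoothing * prior) / (counts[v] + smoothing)
            for v in values]

user_ids = ["u1", "u2", "u1", "u3"]
hashtags = ["#a", "#a", "#a", "#b"]

freq = frequency_encode(user_ids)

# A combination of two categorical variables treated as one new category.
combined = list(zip(hashtags, user_ids))
combined_freq = frequency_encode(combined)

enc = target_encode(["a", "a", "b"], [1, 1, 0], ["a", "b", "c"], smoothing=1.0)
```

Target encoding computed on the same rows it is later applied to leaks the label, which is why such encodings are normally computed per fold.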

Slide 14

Slide 14 text

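FEATURES
Graph Features
Social graph reconstructed from follow relationships
• Represents the relationships between users and their popularity.
• Features generated from whether first-order and second-order connections exist between users
• Features generated from PageRank
Graph with Likes as edges
• Represents the similarity of user preferences.
• Number of visits when running Random Walk with Restarts on the Like graph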
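The visit-count feature from Random Walk with Restarts can be sketched as below. The bipartite user/tweet layout of the Like graph, the restart probability, and the step count are assumptions chosen for illustration.

```python
import random
from collections import Counter

def random_walk_with_restarts(graph, start, n_steps=10_000,
                              restart_prob=0.15, seed=0):
    """Count node visits of a walk that jumps back to `start` at random.

    `graph` maps each node to a list of neighbours. Dead ends also
    trigger a restart.
    """
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(n_steps):
        neighbors = graph.get(node, [])
        if not neighbors or rng.random() < restart_prob:
            node = start
        else:
            node = rng.choice(neighbors)
        visits[node] += 1
    return visits

# Hypothetical Like graph: an edge joins a user and a tweet they liked.
like_graph = {
    "u1": ["t1", "t2"], "u2": ["t1"], "u3": ["t3"],
    "t1": ["u1", "u2"], "t2": ["u1"], "t3": ["u3"],
}
visits = random_walk_with_restarts(like_graph, "u1")
```

Nodes reachable through shared Likes accumulate visits, while nodes in a disconnected part of the graph (here `u3` and `t3`) are never visited, so the counts act as a preference-similarity score relative to the start user.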

Slide 15

Slide 15 text

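FEATURES
Text Features
Estimate the engaging user's interest in the content from text
• Two kinds of interest of the engaging user are used as features.
  • Interest in the content of the tweet
  • Interest in the engaged user
• Interest is expressed as the inner product of each pair of vectors.
  • Tweet vector: intermediate-layer output of pre-trained multi-lingual BERT
  • Engaging user vector: mean of the vectors of the tweets the user interacted with
  • Engaged user vector: mean of the vectors of the tweets the user posted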
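The two interest scores reduce to vector means and inner products, as in this sketch. The 3-dimensional vectors are made-up placeholders for BERT outputs, and the variable names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

# Placeholder embeddings; real ones would come from multilingual BERT.
tweet_vec = [1.0, 0.0, 0.0]
tweets_engaging_user_interacted_with = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tweets_posted_by_engaged_user = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

engaging_user_vec = mean_vector(tweets_engaging_user_interacted_with)
engaged_user_vec = mean_vector(tweets_posted_by_engaged_user)

interest_in_content = dot(engaging_user_vec, tweet_vec)
interest_in_author = dot(engaging_user_vec, engaged_user_vec)
```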

Slide 16

Slide 16 text

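FEATURES
Meta Features
Use the predictions of the 1st stage models as features
• Because engagements co-occur frequently, information about the engagements other than the prediction target is important when learning each target.
• The predicted engagements are aggregated by categories such as user ID and tweet ID.
  • This gives an expressive representation of how likely a user ID is to engage and how likely a tweet ID is to be engaged with.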
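Aggregating 1st stage predictions by a category might look like the sketch below. Using the mean as the aggregate is an assumption; the slides only say that predictions are aggregated by user ID, tweet ID, and similar keys.

```python
from collections import defaultdict

def aggregate_predictions(keys, preds):
    """Mean 1st stage prediction per key (e.g. per user ID or tweet ID)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, p in zip(keys, preds):
        sums[k] += p
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

engaging_user_ids = ["u1", "u1", "u2"]
like_preds = [0.2, 0.6, 0.9]  # hypothetical 1st stage Like predictions

user_like_tendency = aggregate_predictions(engaging_user_ids, like_preds)

# Attach the aggregate back to each row as a meta feature.
meta_feature = [user_like_tendency[u] for u in engaging_user_ids]
```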

Slide 17

Slide 17 text

TRAINING PROCESS
Sampling Process
Bagging with Negative Under-Sampling
• To train models efficiently, we adopt bagging: several small datasets are created and a model is trained on each.
• The sampling procedure is as follows.
1. Apply negative under-sampling to reduce the data size.
2. For engagements such as Like and Retweet the data is still large, so random sampling reduces it further to a specified size.
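The two sampling steps can be sketched as follows. Keeping every positive, the 1:1 negative ratio, and the parameter names are assumptions; the slides give neither the ratio nor the number of bags.

```python
import random

def negative_under_sample(labels, neg_ratio=1.0, max_size=None, seed=0):
    """Return sorted row indices for one bag.

    Step 1: keep all positives and a random subset of negatives.
    Step 2: if the bag is still larger than `max_size`, sample it down.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = min(len(neg), int(len(pos) * neg_ratio))
    idx = pos + rng.sample(neg, n_neg)
    if max_size is not None and len(idx) > max_size:
        idx = rng.sample(idx, max_size)
    return sorted(idx)

def make_bags(labels, n_bags=3, **kwargs):
    """One differently seeded sample per bag; one model is trained per bag."""
    return [negative_under_sample(labels, seed=b, **kwargs)
            for b in range(n_bags)]

labels = [1] * 5 + [0] * 95
bags = make_bags(labels, n_bags=3, neg_ratio=1.0)
capped = negative_under_sample(labels, neg_ratio=1.0, max_size=6)
```

The predictions of the models trained on the individual bags would then be averaged.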

Slide 18

Slide 18 text

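TRAINING PROCESS
Validation Strategy
Adopt Stratified K-Folds so that each fold has the same number of RT with comment positives
• The same split is shared across all target types.
  • If a different split were used for each target, the effect of leakage through the Meta Features and Target Encoding used in the 2nd stage models would grow.
  • This is because the targets are not completely independent of each other.
• Considering computation time, the number of folds is set to 3.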
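A shared stratified split can be sketched as below: fold IDs are derived once from the RT with comment labels and reused for every target. The helper is a hand-written illustration rather than the team's code.

```python
import random

def stratified_folds(labels, n_folds=3, seed=0):
    """Assign a fold ID to each row, balancing positives across folds."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal rows of this class round-robin over the folds.
        for rank, i in enumerate(idx):
            folds[i] = rank % n_folds
    return folds

# Hypothetical RT with comment labels: 6 positives among 30 rows.
rt_with_comment = [1 if i % 5 == 0 else 0 for i in range(30)]
folds = stratified_folds(rt_with_comment, n_folds=3)

positives_per_fold = [
    sum(y for y, f in zip(rt_with_comment, folds) if f == k) for k in range(3)
]
```

The same `folds` list would be passed to the Like, Reply, and RT models as well, so that out-of-fold meta features and target encodings never mix rows across different splits.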

Slide 19

Slide 19 text

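EXPERIMENTS
Final Results
Comparison of the 1st stage models and the 2nd stage models
• On both metrics, the 2nd stage models score better than the 1st stage models.
• The difference represents the effect of the stacking adopted in the 2nd stage models.
• The gap between training and validation scores has widened, but because both metrics improve over the 1st stage, we judged this configuration to be acceptable.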

Slide 20

Slide 20 text

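EXPERIMENTS
Training Data Size
In the 2nd stage models, the validation score gets worse as the amount of training data increases
• We suspect leakage through the Meta Features and Target Encoding is the cause.
• The training data size was treated as a hyperparameter and optimized on the validation score.
• It was set to 100,000 for Like and 1,000,000 for the other targets.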

Slide 21

Slide 21 text

CONCLUSION
• RecSys Challenge 2020 was about predicting user engagement on Twitter.
• Team Wantedly adopted a two-stage stacking model structure designed to capture the co-occurrence between different engagements efficiently.
• https://github.com/wantedly/recsys2020-challenge

Slide 22

Slide 22 text

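Characteristics that arise from Pseudo Negative Labels
Follow relationships can be identified
• Because negatives are randomly sampled from the follow graph, for every negative the relation "engaging user" -> "engaged user" holds.
• The only relation provided in the data is "engaged user" -> "engaging user".
• With a complete follow graph, any record that is not a first-order connection would be confirmed as a positive.
The frequency distributions of positives and negatives over time diverge
• Because negatives are randomly sampled from the follow graph, the distribution over time of a user's negatives depends not on the user's impressions but on the tweets of the users they follow, so the distributions of positives and negatives may diverge.
• As a result, adding features such as time differences improves the score considerably.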