A Stacking Ensemble Model for Prediction of Multi-type Tweet Engagements (RecSys Challenge 2020 3rd place solution)

Shuhei Goda
October 17, 2020

RecSys2020 Paper Reading Group (online)
https://connpass.com/event/189192/

Transcript

  1. ©2020 Wantedly, Inc. A Stacking Ensemble Model for Prediction of Multi-type Tweet Engagements: RecSys Challenge 2020 3rd place solution

    RecSys2020 Paper Reading Group, 17.Oct.2020 - Shuhei Goda, Naomichi Agata, Yuya Matsumura
  2. Team Wantedly

    Shuhei Goda (@jy_msc), Naomichi Agata (@agatan_), Yuya Matsumura (@yu__ya4)
  3. CHALLENGE TASK

    A task of predicting user engagement on Twitter
    • For each (tweet ID, engaging user ID) pair, predict whether each type of engagement occurs: multi-label binary classification.
    • There are four target labels: Like, Reply, RT, and RT with comment.
    Two evaluation metrics
    1. PR-AUC
    2. RCE (Relative Cross Entropy): measures how much better the cross entropy (CE) is relative to a naive prediction of the mean value. Higher is better.
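The RCE description above can be made concrete with a small sketch. It assumes the usual formulation in which the naive baseline predicts the positive rate of the evaluated labels for every row and the score is reported as a percentage; the function names are illustrative, not taken from the challenge code.

```python
import math

def cross_entropy(labels, preds, eps=1e-15):
    """Mean binary cross entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def relative_cross_entropy(labels, preds):
    """RCE in percent: improvement of CE over predicting the mean label everywhere."""
    positive_rate = sum(labels) / len(labels)
    naive_ce = cross_entropy(labels, [positive_rate] * len(labels))
    return (1.0 - cross_entropy(labels, preds) / naive_ce) * 100.0

labels = [1, 0, 0, 1, 0]
rce_naive = relative_cross_entropy(labels, [0.4] * 5)              # the baseline itself
rce_good = relative_cross_entropy(labels, [0.9, 0.1, 0.2, 0.8, 0.1])
rce_bad = relative_cross_entropy(labels, [0.1, 0.9, 0.9, 0.1, 0.9])
```

Under this definition the naive baseline scores 0, predictions better than the baseline score above 0, and predictions worse than it go negative, which matches "higher is better".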
  4. DATASET DESCRIPTION

    Dataset information used in the challenge
    • Information about the tweet: tweet ID, timestamp, text token, etc.
    • Information about the engaging user: engaging user ID, follower count, etc.
    • Information about the engaged user: engaged user ID, follower count, etc.
    • Engagement information (targets): timestamps of the engagements
    How the evaluation data is split
    [Figure: time-based split into Training Data (~120 million samples), Validation Data, and Testing Data; two spans are labeled "1 week".]
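  5. Privacy-Preserving

    Negative examples are created with Pseudo Negative Labels
    • This prevents disclosing the negative information that a user "saw a tweet but did not engage".
    • Negatives are randomly sampled from the follow graph, so the following two kinds of samples are mixed together:
    1. Samples where the user saw the tweet but did not engage
    2. Samples where the user did not engage because they never saw the tweet
    The amount of data made unusable by GDPR grows day by day
    • Because the dataset complies with GDPR, certain data becomes unusable for both training and evaluation.
    • Over the roughly 3 months from start to finish, the training dataset shrank from 160 million to 120 million records.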
  6. DATASET CHARACTERISTICS (1)

    Dataset size
    • Training dataset: about 160 million records (about 120 million in the end)
    • Evaluation dataset: about 15 million records (about 12 million in the end)
    • The float32 submission files total about 4 GB
    Handled with the following compute resources
    • Google BigQuery
    • Google Dataflow
    • Google Compute Engine (vCPUs: 64, Memory: 600GB)
  7. DATASET CHARACTERISTICS (2)

    Label imbalance
    • The share of positive examples decreases in the order Like > RT > Reply > RT with Comment.
    • 43% of Like labels are positive, whereas only 0.7% of RT with Comment labels are positive.
  8. DATASET CHARACTERISTICS (3)

    High co-occurrence between engagements
    • A user may perform several types of engagement on a single tweet.
    • Certain pairs of engagements co-occur frequently.
    • e.g. RT and Like, RT and RT with comment
  9. OVERVIEW OF OUR SOLUTION

    Model Architecture
    • Stacking LightGBMs
    Features
    • Categorical Features
    • Network Features
    • Text Features
    • Meta Features
    • etc.
    Training Process
    • Bagging with negative under sampling
    • Stratified K-Folds over Retweet with Comment
  10. MODEL ARCHITECTURE

    [Figure: two-stage architecture. The First Stage Models (Like, Reply, RT, and RT with Comment Models) take Target Independent Features and Target Dependent Features and output Like, Reply, RT, and RT with Comment Predictions. The Second Stage Models (Like, Reply, RT, and RT with Comment Models) additionally take Meta Features.]
  11. MODEL ARCHITECTURE: 1st Stage

    [Figure: same architecture diagram as slide 10.]
    A model that predicts a single engagement is built for each type of engagement (1st stage models).
  12. MODEL ARCHITECTURE: 2nd Stage

    [Figure: same architecture diagram as slide 10.]
    A model that adds the predictions of the 1st stage models to its input is built for each type of engagement (2nd stage models).
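The two-stage structure described on slides 10 to 12 can be sketched as follows. This is a minimal illustration, not the team's implementation: `fit_toy` is a hypothetical stand-in for a LightGBM model, the target names and example data are made up, and the use of out-of-fold predictions as second-stage inputs is an assumption based on the validation strategy described later in the deck.

```python
from collections import defaultdict

TARGETS = ["like", "reply", "retweet", "retweet_with_comment"]

def fit_toy(rows, labels):
    """Toy stand-in for a LightGBM model: mean label per value of the first feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row, y in zip(rows, labels):
        sums[row[0]] += y
        counts[row[0]] += 1
    fallback = sum(labels) / len(labels)

    def predict(row):
        key = row[0]
        return sums[key] / counts[key] if key in counts else fallback

    return predict

def out_of_fold(rows, labels, fold_ids, fit):
    """Predict every row with a model that never saw that row's fold."""
    preds = [0.0] * len(rows)
    for fold in sorted(set(fold_ids)):
        train = [i for i, f in enumerate(fold_ids) if f != fold]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i, f in enumerate(fold_ids):
            if f == fold:
                preds[i] = model(rows[i])
    return preds

def second_stage_inputs(rows, labels_by_target, fold_ids, fit=fit_toy):
    """Append the first-stage prediction for every target to each row's features."""
    first_stage = {t: out_of_fold(rows, labels_by_target[t], fold_ids, fit) for t in TARGETS}
    return [list(row) + [first_stage[t][i] for t in TARGETS] for i, row in enumerate(rows)]

rows = [["en"], ["en"], ["ja"], ["ja"], ["en"], ["ja"]]
fold_ids = [0, 1, 2, 0, 1, 2]
labels_by_target = {
    "like":                 [1, 1, 0, 0, 1, 0],
    "reply":                [0, 0, 1, 0, 0, 0],
    "retweet":              [1, 0, 0, 0, 1, 0],
    "retweet_with_comment": [0, 0, 0, 0, 1, 0],
}
stacked = second_stage_inputs(rows, labels_by_target, fold_ids)
```

Each second-stage model (one per target) would then be trained on `stacked`, so that, for example, the Like model can see the first-stage Retweet prediction for the same row.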
  13. FEATURES: Categorical Features

    Features built from categorical variables with various encoding methods
    • Label Encoding for low-cardinality categorical variables
    • e.g. language, tweet type
    • Frequency Encoding & Target Encoding for high-cardinality categorical variables
    • e.g. tweet ID, user ID
    Features that treat a combination of categorical variables as a new category
    • These can capture complex relationships between categorical variables.
    • e.g. Hashtag × engaging user ID
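The encodings listed above can be sketched in a few lines. This is an illustrative sketch only: the smoothing term `prior_weight`, the out-of-fold scheme, and the example columns are assumptions, and for brevity the global mean is computed over all rows rather than per training fold.

```python
from collections import Counter, defaultdict

def frequency_encode(values):
    """Replace each category with how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

def target_encode_oof(values, targets, fold_ids, prior_weight=1.0):
    """Out-of-fold target encoding: a row's category mean uses the other folds only."""
    global_mean = sum(targets) / len(targets)  # simplification: uses all rows
    encoded = [0.0] * len(values)
    for fold in set(fold_ids):
        sums, counts = defaultdict(float), defaultdict(int)
        for v, t, f in zip(values, targets, fold_ids):
            if f != fold:
                sums[v] += t
                counts[v] += 1
        for i, (v, f) in enumerate(zip(values, fold_ids)):
            if f == fold:
                # Smooth toward the global mean so rare categories are not overfit.
                encoded[i] = (sums[v] + prior_weight * global_mean) / (counts[v] + prior_weight)
    return encoded

def combine(col_a, col_b):
    """Treat a pair of categorical values as one new category."""
    return [f"{a}|{b}" for a, b in zip(col_a, col_b)]

user_ids = ["u1", "u1", "u2", "u1", "u3", "u2"]
hashtags = ["#a", "#b", "#a", "#a", "#b", "#a"]
liked = [1, 0, 1, 1, 0, 0]
fold_ids = [0, 1, 2, 0, 1, 2]

freq = frequency_encode(user_ids)
te = target_encode_oof(user_ids, liked, fold_ids)
pair_freq = frequency_encode(combine(hashtags, user_ids))
```

Computing the target encoding out of fold matters here because the deck later attributes validation problems to leakage from Target Encoding.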
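  14. FEATURES: Graph Features

    Social graph reconstructed from follow relationships
    • Represents the relationships between users and their popularity.
    • Features generated from whether users have a first-order or second-order connection
    • Features generated from PageRank
    Graph with Likes as edges
    • Represents the similarity of user preferences.
    • Visit counts from running Random Walk with Restarts on the Like Graph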
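The Random Walk with Restarts feature can be sketched as a simple simulation. The graph layout (an undirected user-tweet graph stored as an adjacency dict), the restart probability, and the number of steps are all assumptions for illustration; the deck does not state these details.

```python
import random
from collections import Counter

def rwr_visit_counts(adjacency, start, steps=10_000, restart_prob=0.15, seed=0):
    """Simulate a random walk that jumps back to `start` with probability restart_prob."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        neighbors = adjacency.get(node, [])
        if not neighbors or rng.random() < restart_prob:
            node = start
        else:
            node = rng.choice(neighbors)
        visits[node] += 1
    return visits

# Hypothetical Like graph: an edge connects a user and a tweet they liked.
likes = {
    "user_a": ["t1", "t2"], "user_b": ["t1", "t2"], "user_c": ["t2", "t3"],
    "t1": ["user_a", "user_b"], "t2": ["user_a", "user_b", "user_c"], "t3": ["user_c"],
}
visits = rwr_visit_counts(likes, start="user_a")
```

In this toy graph `user_b` shares two liked tweets with `user_a` while `user_c` shares only one, so the walk started at `user_a` tends to visit `user_b` more often; the visit count acts as a preference-similarity score.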
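  15. FEATURES: Text Features

    Estimate the Engaging User's interest in content from text
    • Two kinds of interest of the Engaging User are used as features.
    • Interest in the content of the Tweet
    • Interest in the Engaged User
    • Interest is expressed as the inner product of each combination of vectors.
    • Tweet vector: intermediate-layer output of pre-trained multi-lingual Bert
    • Engaging User vector: mean of the vectors of the tweets the user interacted with
    • Engaged User vector: mean of the vectors of the Tweets the user posted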
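The interest features above reduce to averaging tweet vectors and taking inner products. The sketch below uses made-up 2-dimensional vectors in place of BERT outputs; the variable names are illustrative.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical tweet vectors standing in for BERT intermediate-layer outputs.
tweet_vecs = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [1.0, 1.0]}

# Engaging user: mean over tweets they interacted with (here t1 and t3).
engaging_user_vec = mean_vector([tweet_vecs["t1"], tweet_vecs["t3"]])
# Engaged user: mean over tweets they posted (here t2 and t3).
engaged_user_vec = mean_vector([tweet_vecs["t2"], tweet_vecs["t3"]])

interest_in_tweet = dot(engaging_user_vec, tweet_vecs["t2"])
interest_in_author = dot(engaging_user_vec, engaged_user_vec)
```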
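  16. FEATURES: Meta Features

    Predictions of the 1st stage models are used as features
    • Because co-occurrence between engagements is high, information about the other engagements is important when learning to predict a given engagement.
    • The engagement predictions are aggregated by categories such as user ID and tweet ID.
    • This captures, with high expressive power, how likely a user ID or tweet ID is to engage or to be engaged with.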
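Aggregating first-stage predictions by a category can be sketched as a group mean. The mean is only one possible aggregate (the deck does not say which statistics were used), and the prediction values below are made up.

```python
from collections import defaultdict

def group_mean(keys, values):
    """For each row, the mean of `values` over all rows sharing that row's key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
        counts[k] += 1
    return [sums[k] / counts[k] for k in keys]

engaging_user = ["u1", "u1", "u2", "u2"]
like_pred = [0.8, 0.4, 0.1, 0.3]  # hypothetical 1st stage Like predictions

# Meta feature: how likely each engaging user is to Like, averaged over their rows.
user_like_mean = group_mean(engaging_user, like_pred)
```

The same helper applied with tweet ID as the key would give a per-tweet "likelihood of being engaged with" feature.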
  17. TRAINING PROCESS: Sampling Process

    Bagging with Negative Under-Sampling
    • To train models efficiently, Bagging is adopted: several small datasets are created and a model is trained on each.
    • The following sampling procedure was used:
    1. Apply Negative Under-Sampling to reduce the data size.
    2. For engagements such as Like and Retweet the data is still large, so Random Sampling is applied to reduce it further to a specified size.
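The two sampling steps above can be sketched as follows. The 1:1 negative-to-positive ratio and the parameter names are assumptions; the deck does not state the ratio that was used.

```python
import random

def make_bags(labels, n_bags=3, max_rows=None, seed=0):
    """Return row-index lists, one per bag, for training one model per bag."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    bags = []
    for _ in range(n_bags):
        # Step 1: negative under-sampling (assumed ratio: as many negatives as positives).
        sampled_neg = rng.sample(negatives, min(len(positives), len(negatives)))
        bag = positives + sampled_neg
        # Step 2: if the bag is still too large, randomly sample down to max_rows.
        if max_rows is not None and len(bag) > max_rows:
            bag = rng.sample(bag, max_rows)
        rng.shuffle(bag)
        bags.append(bag)
    return bags

labels = [1] * 5 + [0] * 95
balanced_bags = make_bags(labels, n_bags=3)
capped_bags = make_bags(labels, n_bags=3, max_rows=8)
```

Predictions from the models trained on each bag would then be averaged; note that under-sampling changes the positive rate, so predicted probabilities may need recalibration before computing a cross-entropy-based metric.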
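  18. TRAINING PROCESS: Validation Strategy

    Stratified K-Folds is used so that each fold has the same number of RT with comment positives
    • The same split is shared across all target types.
    • Using a different split per target would increase the impact of Leakage through the Meta Features and Target Encoding used in the 2nd stage models.
    • This is because the targets are not completely independent of each other.
    • Considering computation time, the number of folds is set to 3.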
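A shared stratified split can be sketched as below: fold IDs are computed once from the RT with comment label and then reused for every target. The round-robin assignment and the function name are illustrative choices.

```python
import random

def stratified_fold_ids(strat_labels, n_folds=3, seed=0):
    """Assign fold IDs so that each label value is spread evenly across folds."""
    rng = random.Random(seed)
    fold_ids = [0] * len(strat_labels)
    for value in sorted(set(strat_labels)):
        idx = [i for i, y in enumerate(strat_labels) if y == value]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_ids[i] = pos % n_folds  # deal rows round-robin across folds
    return fold_ids

rt_with_comment = [1] * 6 + [0] * 94
fold_ids = stratified_fold_ids(rt_with_comment, n_folds=3)  # reused for all four targets
positives_per_fold = [
    sum(1 for f, y in zip(fold_ids, rt_with_comment) if f == k and y == 1)
    for k in range(3)
]
```

Stratifying on the rarest label keeps its few positives evenly distributed, and reusing one `fold_ids` list for Like, Reply, RT, and RT with comment means a row's out-of-fold features never come from a model that saw that row under another target's split.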
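  19. EXPERIMENTS: Final Results

    Comparison of the 1st stage models and the 2nd stage models
    • On both metrics, the 2nd stage models outperform the 1st stage models.
    • The difference represents the effect of the stacking adopted in the 2nd stage models.
    • The gap between the training score and the validation score is larger, but since both metrics improve over the 1st stage, this configuration was judged acceptable.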
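  20. EXPERIMENTS: Training Data Size

    In the 2nd stage models, the validation score gets worse as the amount of training data increases
    • Leakage through Meta Features and Target Encoding is the presumed cause.
    • The amount of training data was treated as a hyperparameter and optimized on the validation score.
    • It was set to 100,000 for Like and 1,000,000 for the other targets.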
  21. CONCLUSION

    • RecSys Challenge 2020 was about predicting user engagement on Twitter.
    • Team Wantedly adopted a two-stage stacking model structure so that co-occurrence between different engagements could be captured efficiently.
    • https://github.com/wantedly/recsys2020-challenge
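  22. Characteristics introduced by the Pseudo Negative Labels

    Follow relationships can be identified
    • Because negatives are "randomly sampled from the follow graph", the relationship "engaging user" -> "engaged user" holds for every negative.
    • The only relationship provided in the data is "engaged user" -> "engaging user".
    • With a complete follow graph, any record that is not a first-order connection would be known to be a positive.
    The frequency distributions of positives and negatives over time diverge
    • Because negatives are "randomly sampled from the follow graph", the distribution over time of a user's negatives depends not on "the user's impressions" but on "the tweets of the users they follow", so the distributions of positives and negatives may be shifted relative to each other.
    • As a result, adding features such as time differences improves the score considerably.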