A Stacking Ensemble Model for Prediction of Multi-type Tweet Engagements (RecSys Challenge 2020 3rd place solution)

Shuhei Goda
October 17, 2020

RecSys2020 Paper Reading Meetup (Online)
https://connpass.com/event/189192/

Transcript

  1. ©2020 Wantedly, Inc.
    A Stacking Ensemble Model for Prediction of Multi-type Tweet Engagements
    RecSys Challenge 2020 3rd place solution
    RecSys2020 Paper Reading Meetup
    17.Oct.2020 - Shuhei Goda, Naomichi Agata, Yuya Matsumura

  2. Team Wantedly
    Shuhei Goda (@jy_msc)
    Naomichi Agata (@agatan_)
    Yuya Matsumura (@yu__ya4)

  3. CHALLENGE TASK
    A task of predicting user engagement on Twitter
    • For each (tweet ID, engaging user ID) pair, predict whether each type of engagement occurs
    Multi-label binary classification
    • There are 4 target labels: Like, Reply, RT and RT with comment
    There are 2 evaluation metrics
    1. PR-AUC
    2. RCE (Relative Cross Entropy)
    RCE expresses how much better the CE is relative to a naive prediction of the mean value. Higher is better.
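The RCE metric on this slide can be illustrated with a minimal sketch. It assumes the commonly cited form RCE = (1 - CE_model / CE_naive) × 100, where the naive predictor always outputs the positive rate of the labels; the function names are hypothetical and not taken from the deck.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-15):
    """Mean binary cross entropy (log loss) with clipping for numerical safety."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def relative_cross_entropy(y_true, y_pred):
    """RCE in percent: CE improvement over always predicting the positive rate."""
    y_true = np.asarray(y_true, dtype=float)
    naive_pred = np.full_like(y_true, y_true.mean())
    ce_model = cross_entropy(y_true, y_pred)
    ce_naive = cross_entropy(y_true, naive_pred)
    return (1.0 - ce_model / ce_naive) * 100.0

labels = np.array([1, 0, 0, 1, 0])
good_pred = np.array([0.9, 0.1, 0.2, 0.8, 0.1])
naive_pred = np.full(5, labels.mean())

rce_good = relative_cross_entropy(labels, good_pred)
rce_naive = relative_cross_entropy(labels, naive_pred)
rce_bad = relative_cross_entropy(labels, 1 - good_pred)
```

Under this definition the naive predictor scores 0, better-than-naive predictions score above 0, and confidently wrong predictions score below 0.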

  4. DATASET DESCRIPTION
    Dataset information used in the Challenge
    • Tweet information: tweet ID, timestamp, text token, etc.
    • Engaging user information: engaging user ID, follower count, etc.
    • Engaged user information: engaged user ID, follower count, etc.
    • Engagement information (targets): timestamps of the engagements
    How the evaluation data is split
    [Figure: timeline of the data split. Labels: Training Data (~120 million samples), Validation Data, Testing Data; spans: 1 week, 1 week.]

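  5. Privacy-Preserving
    Negative examples created with Pseudo Negative Labels
    • Prevents disclosure of the negative information that a user saw a tweet but did not engage with it.
    • Negatives are randomly sampled from the follow graph, so the following two kinds of samples are mixed together.
    1. Samples where the user saw the tweet but did not engage
    2. Samples where the user did not engage because they never saw the tweet
    Data made unusable by GDPR grows day by day
    • Because the dataset complies with GDPR, certain data becomes unusable for both training and evaluation.
    • The training dataset shrank from about 160 million to about 120 million records over the roughly 3 months from the start to the end of the challenge.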

  6. DATASET CHARACTERISTICS (1)
    Dataset size
    • Training dataset: about 160 million records (about 120 million in the end)
    • Evaluation dataset: about 15 million records (about 12 million in the end)
    • The float32 submission files total about 4 GB
    Handled with the following compute resources
    • Google BigQuery
    • Google Dataflow
    • Google Compute Engine (vCPUs: 64, Memory: 600GB)

  7. DATASET CHARACTERISTICS (2)
    Label imbalance
    • The share of positive examples decreases in the order Like > RT > Reply > RT with Comment.
    • 43% of Like labels are positive, but only 0.7% of RT with Comment labels are.

  8. DATASET CHARACTERISTICS (3)
    High co-occurrence between engagements
    • A user may perform several types of engagement on a single Tweet.
    • Certain pairs of engagements co-occur strongly.
    • e.g. RT and Like, RT and RT with comment

  9. OVERVIEW OF OUR SOLUTION
    Model Architecture
    • Stacking LightGBMs
    Features
    • Categorical Features
    • Network Features
    • Text Features
    • Meta Features
    • etc.
    Training Process
    • Bagging with negative under sampling
    • Stratified K-Folds over Retweet with Comment

  10. MODEL ARCHITECTURE
    [Figure: two-stage stacking diagram. The First Stage Models (Like, Reply, RT, RT with Comment) use Target Independent Features and Target Dependent Features and output Like, Reply, RT and RT with Comment Predictions. These predictions feed Meta Features for The Second Stage Models (Like, Reply, RT, RT with Comment).]

  11. MODEL ARCHITECTURE
    1st Stage
    [Figure: the two-stage stacking diagram (The First Stage Models, The Second Stage Models, Meta Features), annotated for the 1st stage.]
    Build one model that predicts a single engagement for each engagement type (1st stage models).

  12. MODEL ARCHITECTURE
    2nd Stage
    [Figure: the two-stage stacking diagram (The First Stage Models, The Second Stage Models, Meta Features), annotated for the 2nd stage.]
    Build one model for each engagement type that takes the predictions of the 1st stage models as additional input (2nd stage models).
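The two-stage structure described on these slides can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the deck uses LightGBM, while the sketch swaps in a least-squares stand-in learner so it runs with NumPy alone. The function names, random data and fold assignment are assumptions made for the example.

```python
import numpy as np

TARGETS = ["like", "reply", "rt", "rt_with_comment"]

def fit_predict_linear(X_train, y_train, X_valid):
    """Stand-in learner (least squares with bias); the deck uses LightGBM here."""
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.hstack([X_valid, np.ones((len(X_valid), 1))])
    return np.clip(B @ w, 0.0, 1.0)

def out_of_fold(X, y, fold_ids, n_folds):
    """Predict every row with a model that was not trained on that row."""
    oof = np.zeros(len(X))
    for k in range(n_folds):
        valid = fold_ids == k
        oof[valid] = fit_predict_linear(X[~valid], y[~valid], X[valid])
    return oof

def two_stage_stacking(X, Y, fold_ids, n_folds=3):
    # 1st stage: one model per engagement type on the base features.
    first = np.column_stack(
        [out_of_fold(X, Y[:, j], fold_ids, n_folds) for j in range(Y.shape[1])]
    )
    # 2nd stage: base features plus all 1st-stage predictions as input.
    X2 = np.hstack([X, first])
    second = np.column_stack(
        [out_of_fold(X2, Y[:, j], fold_ids, n_folds) for j in range(Y.shape[1])]
    )
    return first, second

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                                   # toy base features
Y = (rng.random((300, len(TARGETS))) < 0.3).astype(float)      # toy labels
fold_ids = np.arange(300) % 3                                   # shared fold assignment
first, second = two_stage_stacking(X, Y, fold_ids)
```

Both stages reuse the same fold assignment, so each 2nd-stage model sees 1st-stage predictions for all four targets that were produced out of fold.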

  13. FEATURES
    Categorical Features
    Features built from categorical variables with several encoding methods
    • Label Encoding for low-cardinality categorical variables
    • e.g. language, tweet type
    • Frequency Encoding & Target Encoding for high-cardinality categorical variables
    • e.g. tweet ID, user ID
    Features that treat a combination of categorical variables as a new category
    • Can capture complex relationships between categorical variables
    • e.g. Hashtag × engaging user ID
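The encodings named on this slide can be sketched with pandas. The column names, the toy data and the out-of-fold scheme for Target Encoding are assumptions for illustration; the deck does not show how the team computed them.

```python
import pandas as pd

def frequency_encode(series):
    """Replace each category by the number of times it occurs."""
    return series.map(series.value_counts()).astype(float)

def target_encode_oof(df, col, target, fold_col):
    """Out-of-fold target mean per category; unseen categories fall back
    to the mean of the training folds."""
    encoded = pd.Series(index=df.index, dtype=float)
    for fold in df[fold_col].unique():
        train = df[df[fold_col] != fold]
        valid_idx = df.index[df[fold_col] == fold]
        means = train.groupby(col)[target].mean()
        encoded.loc[valid_idx] = (
            df.loc[valid_idx, col].map(means).fillna(train[target].mean()).to_numpy()
        )
    return encoded

df = pd.DataFrame({
    "hashtag": ["a", "a", "b", "b", "a", "c"],
    "engaging_user_id": ["u1", "u2", "u1", "u1", "u1", "u2"],
    "like": [1, 0, 1, 1, 0, 0],
    "fold": [0, 1, 2, 0, 1, 2],
})

# Label encoding for a low-cardinality column.
df["hashtag_label"] = df["hashtag"].astype("category").cat.codes
# Combination of two categorical variables treated as a new category.
df["hashtag_x_user"] = df["hashtag"] + "_" + df["engaging_user_id"]
# Frequency and target encoding for a high-cardinality column.
df["user_freq"] = frequency_encode(df["engaging_user_id"])
df["user_te_like"] = target_encode_oof(df, "engaging_user_id", "like", "fold")
```

Computing the target mean only from the other folds is one common way to limit the leakage that the later slides attribute to Target Encoding.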

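  14. FEATURES
    Graph Features
    Social graph reconstructed from Follow relationships
    • Represents relationships between users and their popularity.
    • Features generated from the presence or absence of first-order and second-order connections between users
    • Features generated from PageRank
    Graph with Likes as edges
    • Represents the similarity of users' preferences.
    • Visit counts obtained by running Random Walk with Restarts on the Like Graph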
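The Random Walk with Restarts visit counts mentioned above can be sketched in plain Python. The graph layout (users as nodes, an undirected edge per Like), the restart probability and the step count are assumptions chosen for the example, not values from the deck.

```python
import random
from collections import Counter, defaultdict

def build_adjacency(edges):
    """Undirected adjacency list built from (node, node) Like edges."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def rwr_visit_counts(adj, start, n_steps=10_000, restart_prob=0.15, seed=0):
    """Count node visits of a random walk that jumps back to `start`
    with probability `restart_prob` at every step."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(n_steps):
        if rng.random() < restart_prob or not adj[node]:
            node = start
        else:
            node = rng.choice(adj[node])
        visits[node] += 1
    return visits

like_edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u5", "u6")]
adj = build_adjacency(like_edges)
visits = rwr_visit_counts(adj, start="u1")
```

Nodes close to the start user accumulate more visits than distant ones, and nodes in a disconnected component are never visited, so the counts can serve as a proximity feature for a (user, user) pair.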

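  15. FEATURES
    Text Features
    Estimate the Engaging User's interest in content from text
    • Two kinds of Engaging User interest are used as features.
    • Interest in the content of the Tweet
    • Interest in the Engaged User
    • Interest is expressed as the inner product of each combination of vectors.
    • Tweet vector: intermediate-layer output of pre-trained multi-lingual Bert
    • Engaging User vector: mean of the vectors of the tweets the user interacted with
    • Engaged User vector: mean of the vectors of the Tweets the user posted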
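The inner-product interest features above can be sketched with NumPy. The two-dimensional vectors stand in for the BERT intermediate-layer outputs, and all identifiers are hypothetical.

```python
import numpy as np

def mean_vector(tweet_vectors, tweet_ids):
    """Average the vectors of the given tweets."""
    return np.mean([tweet_vectors[t] for t in tweet_ids], axis=0)

# Toy tweet vectors; in the deck these come from a pre-trained multilingual BERT.
tweet_vectors = {
    "t1": np.array([1.0, 0.0]),
    "t2": np.array([0.0, 1.0]),
    "t3": np.array([1.0, 1.0]),
}

# Engaging user: mean over tweets the user interacted with.
engaging_user_vec = mean_vector(tweet_vectors, ["t1", "t3"])
# Engaged user: mean over tweets the user posted.
engaged_user_vec = mean_vector(tweet_vectors, ["t2"])

# Interest in the tweet content and interest in the engaged user.
interest_in_tweet = float(engaging_user_vec @ tweet_vectors["t3"])
interest_in_engaged_user = float(engaging_user_vec @ engaged_user_vec)
```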

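  16. FEATURES
    Meta Features
    Use the predictions of the 1st stage models as features
    • Because engagements co-occur strongly, information about engagements other than the prediction target is important for learning.
    • Aggregate the predicted engagements by categories such as user ID and tweet ID.
    • This expresses, with high representational power, how readily a user ID or tweet ID engages or is engaged with.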
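Aggregating 1st-stage predictions by category can be sketched with pandas. The column names, the toy predictions and the choice of the mean as the aggregate are assumptions; the deck does not list which statistics were used.

```python
import pandas as pd

# Toy 1st-stage predictions (assumed to be out-of-fold in practice).
preds = pd.DataFrame({
    "engaging_user_id": ["u1", "u1", "u2", "u2"],
    "tweet_id": ["t1", "t2", "t1", "t3"],
    "pred_like": [0.8, 0.6, 0.2, 0.4],
    "pred_reply": [0.1, 0.3, 0.0, 0.2],
})
pred_cols = ["pred_like", "pred_reply"]

def add_group_means(df, key, cols):
    """Append the per-group mean of each prediction column as a new feature."""
    means = df.groupby(key)[cols].transform("mean")
    means.columns = [f"{c}_mean_by_{key}" for c in cols]
    return pd.concat([df, means], axis=1)

meta = add_group_means(preds, "engaging_user_id", pred_cols)
meta = add_group_means(meta, "tweet_id", pred_cols)
```

Each row now carries, for example, the average predicted Like probability of its engaging user and of its tweet, alongside the row-level predictions.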

  17. TRAINING PROCESS Sampling Process
    Bagging with Negative Under-Sampling
    • To train models efficiently, adopt Bagging: create several small datasets and train a model on each.
    • The following sampling procedure was used.
    1. Apply Negative Under-Sampling to reduce the data size.
    2. For engagements such as Like and Retweet the data is still large, so reduce it further with Random Sampling down to a specified size.
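The two sampling steps can be sketched with NumPy. The 1:1 positive-to-negative ratio, the number of bags and the size cap are assumptions for illustration; the deck does not state the ratio it used.

```python
import numpy as np

def undersample_indices(y, rng, max_size=None):
    """Step 1: keep all positives and draw an equal number of negatives.
    Step 2: if the result is still larger than max_size, subsample at random."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    idx = np.concatenate([pos, neg])
    if max_size is not None and len(idx) > max_size:
        idx = rng.choice(idx, size=max_size, replace=False)
    return np.sort(idx)

def make_bags(y, n_bags=3, max_size=None, seed=0):
    """Create several small training sets, one per bagged model."""
    rng = np.random.default_rng(seed)
    return [undersample_indices(y, rng, max_size) for _ in range(n_bags)]

y = np.array([1] * 10 + [0] * 90)
bags_capped = make_bags(y, n_bags=3, max_size=16)
bags_full = make_bags(y, n_bags=3)
```

One model would then be trained per bag and the predictions averaged. Note that under-sampling negatives shifts the positive rate, so predicted probabilities may need recalibration before computing a CE-based metric.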

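  18. TRAINING PROCESS Validation Strategy
    Adopt Stratified K-Folds so that each fold contains the same number of RT with comment positives
    • Use one common split for all target types.
    • If a different split were used for each target, the impact of Leakage from the Meta Features and Target Encoding used in the 2nd stage models would grow.
    • This is because the targets are not completely independent of each other.
    • The number of folds is set to 3 in consideration of computation time.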
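A shared stratified split of this kind can be sketched with NumPy alone. The helper name and the toy label vector are assumptions; scikit-learn's StratifiedKFold would be a common alternative.

```python
import numpy as np

def stratified_fold_ids(y, n_folds=3, seed=0):
    """Assign fold ids so each fold gets (almost) the same number of
    positives and negatives of the stratification label."""
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for label in (0, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_ids[idx] = np.arange(len(idx)) % n_folds
    return fold_ids

# Stratify on the rarest target and reuse the same fold ids for every target.
rt_with_comment = np.array([1] * 6 + [0] * 94)
fold_ids = stratified_fold_ids(rt_with_comment, n_folds=3)
positives_per_fold = [int(rt_with_comment[fold_ids == k].sum()) for k in range(3)]
```

Because `fold_ids` is computed once and shared, a row held out for one target is held out for all targets, which is the property the slide relies on to limit leakage.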

  19. EXPERIMENTS Final Results
    Comparison of 1st stage models and 2nd stage models
    • On both metrics, the 2nd stage models score better than the 1st stage models.
    • The difference represents the effect of the stacking adopted in the 2nd stage models.
    • The gap between training and validation scores widens, but since both metrics improve over the 1st stage, this configuration was judged acceptable.

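  20. EXPERIMENTS Training Data Size
    In the 2nd stage models, the validation score gets worse as the amount of training data increases
    • The suspected cause is Leakage from Meta Features and Target Encoding.
    • The amount of training data was treated as a hyperparameter and optimized on the validation score.
    • It was set to 100,000 for Like and 1,000,000 for the other targets.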

  21. CONCLUSION
    • RecSys Challenge 2020 was about predicting user engagement on Twitter.
    • Team Wantedly adopted a 2nd stage stacking model structure and worked on efficiently capturing the co-occurrence between different engagements.
    • https://github.com/wantedly/recsys2020-challenge

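  22. Characteristics caused by Pseudo Negative Labels
    Follow relationships can be identified
    • Because negatives are randomly sampled from the follow graph, a follow relationship "engaging user" -> "engaged user" holds for every negative.
    • The only relationship provided in the data is "engaged user" -> "engaging user".
    • With a complete follow graph, any record without a first-order connection would be confirmed as a positive.
    The temporal frequency distributions of positives and negatives diverge
    • Because negatives are randomly sampled from the follow graph, the temporal frequency distribution of a user's negatives depends on the tweets of the users they follow rather than on the user's own impressions, so the distributions of positives and negatives may diverge.
    • As a result, adding features such as time differences improves the score considerably.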
