
Kaggle Google Quest Q&A Labeling - 23rd place solution

Shuhei Goda
February 28, 2020

Transcript

  1. ©2020 Wantedly, Inc.
    23rd place solution
    Kaggle Google Quest Q&A Labeling Retrospective Meeting
    Feb 28, 2020 - Shuhei Goda - @jy_msc

  2. Team - The Hand
    Shuhei Goda @jy_msc
    Visit Engineering Team at Wantedly
    Naomichi Agata @agatan_
    People Engineering Team at Wantedly

  3. Model Pipeline
    Flow: Text data -> PreProcess -> Bert-base uncased (two models) -> LightGBM
    Text data
    ・question_title
    ・question_body
    ・answer
    PreProcess (Q and A) -> Bert-base uncased
    ・question_title
    ・question_body
    ・answer
    PreProcess (only Q) -> Bert-base uncased
    ・question_title
    ・question_body
    PreProcess settings
    ・HTML unescape
    ・head+tail truncation
    Bert-base uncased settings
    ・3-fold GroupKFold
    ・BCE + margin ranking loss
    ・3 epochs
    LightGBM settings
    ・max_depth=1
    ・lr=0.1
    LightGBM meta features
    ・text length
    ・stackexchange

  4. Pre-Process
    ・Unescape HTML strings
    https://www.kaggle.com/c/google-quest-challenge/discussion/…
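As a minimal sketch of the unescaping step, the Python standard library's `html.unescape` turns entities such as `&lt;` and `&amp;` back into plain characters. The helper name `unescape_text` and the sample string are illustrative assumptions; the slide does not say which function the team actually used.

```python
import html


def unescape_text(text: str) -> str:
    """Convert HTML entities (e.g. &amp;, &lt;) back to plain characters."""
    return html.unescape(text)


# Hypothetical sample resembling code pasted into a StackExchange post.
sample = "if (a &lt; b &amp;&amp; b &gt; c) { return &quot;ok&quot;; }"
cleaned = unescape_text(sample)
print(cleaned)
```

Applying this to question_title, question_body, and answer before tokenization keeps the tokenizer from splitting entities into meaningless pieces.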

  5. Pre-Process
    ・Concatenate and trim the text data
      ・[CLS] + question_title + [SEP] + question_body + [SEP] + answer
    ・When question_body and answer exceed the specified length, keep equal-sized portions from both ends (head+tail truncation)
    https://arxiv.org/abs/…
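The concatenation and head+tail truncation could look roughly like the sketch below. It works on already-tokenized lists. The function names, the per-field length limits, and the choice to give the head the extra token when the limit is odd are assumptions for illustration, not details taken from the slides.

```python
from typing import List


def head_tail_truncate(tokens: List[str], max_len: int) -> List[str]:
    """Keep the first and last parts of a token list that is too long."""
    if len(tokens) <= max_len:
        return list(tokens)
    head_len = (max_len + 1) // 2  # head gets the extra token if max_len is odd
    tail_len = max_len - head_len
    tail = tokens[len(tokens) - tail_len:] if tail_len > 0 else []
    return tokens[:head_len] + tail


def build_input(title: List[str], body: List[str], answer: List[str],
                max_body_len: int, max_answer_len: int) -> List[str]:
    """[CLS] + question_title + [SEP] + question_body + [SEP] + answer."""
    body = head_tail_truncate(body, max_body_len)
    answer = head_tail_truncate(answer, max_answer_len)
    return ["[CLS]"] + title + ["[SEP]"] + body + ["[SEP]"] + answer


# Hypothetical toy tokens: the body has 10 tokens but only 4 are allowed.
title = ["how", "to"]
body = [f"b{i}" for i in range(10)]
answer = ["a0", "a1", "a2"]
tokens = build_input(title, body, answer, max_body_len=4, max_answer_len=4)
print(tokens)
```

The body keeps its first two and last two tokens, while the short answer passes through unchanged.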

  6. Model Architecture
    ・Bert-base (uncased)
      ・Use the outputs of the last 4 hidden layers https://arxiv.org/abs/1905.05583
      ・Use the output of the [SEP] token between Q and A

  7. Loss function
    ・Label weight
      ・Smaller weights for tasks that look easy, larger weights for imbalanced tasks that look hard
      ・We tried searching the weights with gpyopt, but the simple approach shown on the slide worked best
    With label weight
    Public: 0.45979, Private: 0.41440
    Without label weight
    Public: 0.43455, Private: 0.40602
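To illustrate what a per-label weight does to the loss, here is a small pure-Python sketch of binary cross-entropy where each label's term is scaled by its own weight. The function name and the weight values are hypothetical; the actual weights used by the team are not given in the transcript.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], targets: Sequence[float],
                 label_weights: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over labels, each scaled by its label weight."""
    total = 0.0
    for p, t, w in zip(probs, targets, label_weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


# One sample with two labels; the second label is treated as the "hard" one.
probs = [0.9, 0.2]
targets = [1.0, 0.0]
uniform = weighted_bce(probs, targets, [1.0, 1.0])
reweighted = weighted_bce(probs, targets, [0.5, 2.0])  # hypothetical weights
print(uniform, reweighted)
```

With the larger weight on the second label, errors on that label contribute more to the gradient than errors on the first.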

  8. Loss function
    ・BCE + margin ranking loss (1 : 1)
      ・Split the mini-batch into two halves and compute the margin ranking loss
    BCE + margin ranking loss
    Public: 0.45979, Private: 0.41440
    BCE
    Public: 0.44006, Private: 0.40668
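One way to read "split the mini-batch into two halves" is sketched below for a single target column: the i-th prediction of the first half is paired with the i-th prediction of the second half, and a hinge-style ranking penalty applies when the pair is ordered differently from its targets. The pairing scheme, the skipping of tied targets, and the margin value are assumptions; the slides only state the 1 : 1 mix with BCE.

```python
import math
from typing import Sequence


def bce(p: float, t: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def ranking_term(preds: Sequence[float], targets: Sequence[float],
                 margin: float = 0.0) -> float:
    """Margin ranking loss between the two halves of a mini-batch."""
    half = len(preds) // 2
    losses = []
    for i in range(half):
        j = half + i
        if targets[i] == targets[j]:
            continue  # assumption: tied pairs carry no ranking signal
        y = 1.0 if targets[i] > targets[j] else -1.0
        losses.append(max(0.0, -y * (preds[i] - preds[j]) + margin))
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """BCE + margin ranking loss, mixed 1 : 1."""
    bce_term = sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)
    return bce_term + ranking_term(preds, targets)


targets = [1.0, 0.0, 0.0, 1.0]
well_ordered = [0.8, 0.3, 0.2, 0.6]   # pairs ranked like their targets
mis_ordered = [0.2, 0.6, 0.8, 0.3]    # pairs ranked the wrong way round
print(ranking_term(well_ordered, targets), ranking_term(mis_ordered, targets))
```

The ranking term is zero when every pair is ordered like its targets, so it only adds to BCE when the relative order is wrong, which is what a rank-based metric such as Spearman's correlation cares about.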

  9. Training
    ・Question Model
      ・Solve the question-related tasks using only the question text
      ・Because the input is Q only, less of Q is truncated (more of Q's information is kept)
    Q model + Q and A model
    Public: 0.45979, Private: 0.41440
    Q and A model × 2 (seed average)
    Public: 0.44298, Private: 0.40613

  10. Post-Process
    ・LightGBM
      ・max_depth=1, lr=0.1
      ・meta features
        ・text length (question, answer)
        ・meta data from stackexchange (Score, View, FavoriteCount, …)
    LightGBM
    Public: 0.45979, Private: 0.41440
    Simple binning without meta features
    Public: 0.45282, Private: 0.41387

  11. Why did we use LightGBM?
    1. Simple binning method
      ・We noticed that discretizing the predictions improves Spearman's correlation
      ・Set a bin size for each target in advance and bin the predictions
      ・With the bin sizes fixed, take a weighted average of the BERT outputs from each epoch (the weights are optimized)
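The effect of discretization on Spearman's correlation can be reproduced on a toy example. The sketch below implements Spearman's correlation with average ranks for ties and a simple equal-width binning. The data, the `n_bins` parameter, and the equal-width rule are assumptions for illustration; the slides only say that a bin size was set per target.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # correlation is undefined for constant input
    return cov / math.sqrt(vx * vy)


def bin_predictions(preds: Sequence[float], n_bins: int) -> List[float]:
    """Map predictions in [0, 1] onto n_bins equal-width levels."""
    return [min(int(p * n_bins), n_bins - 1) / n_bins for p in preds]


# Toy data: targets take only two values, predictions are continuous.
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
raw = [0.10, 0.30, 0.20, 0.70, 0.90, 0.80]
binned = bin_predictions(raw, n_bins=2)
print(spearman(raw, targets), spearman(binned, targets))
```

Because the targets have many ties, the continuous predictions are penalized for ordering samples within a tie group. Binning turns those into ties as well, which raises the correlation in this example.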

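  12. Why did we use LightGBM?
    2. Optimize bin-size and weights
      ・We wanted to use optimal values for the bin sizes as well
      ・We tried optimizing the bin sizes and the weights jointly, but it did not work well
      ・The optimal bin size is determined by the shape of the prediction distribution. The average of the per-fold optimal bin sizes and the prediction distribution after weighted averaging both diverge from the optimum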

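  13. Why did we use LightGBM?
    3. LightGBM
      ・We want to optimize the bin sizes and the weights jointly
      ・We also want to use meta features
      ・GBDT partitions the data and assigns an optimal value to each resulting region
        → With shallow trees, it might discretize the predictions in the same way binning does
    [Figure: panels for max_depth=2 and max_depth=8]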
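The intuition that shallow boosted trees behave like learned binning can be checked with a tiny hand-written booster. The sketch below is not LightGBM: it is a plain squared-error gradient booster over depth-1 stumps on a single input feature, written only to show that the output takes at most `n_rounds + 1` distinct values. All names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def fit_stump(x: Sequence[float], residual: Sequence[float]) -> Stump:
    """Find the single split on x that minimizes squared error."""
    best = None
    candidates = sorted(set(x))
    for lo, hi in zip(candidates, candidates[1:]):
        thr = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residual) if xi <= thr]
        right = [r for xi, r in zip(x, residual) if xi > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    if best is None:  # all x identical: no split is possible
        mean = sum(residual) / len(residual)
        return (candidates[0], mean, mean)
    return best[1:]


def boost_stumps(x: Sequence[float], y: Sequence[float],
                 n_rounds: int, lr: float) -> Tuple[float, List[Stump]]:
    """Fit n_rounds depth-1 trees on the residuals, one after another."""
    base = sum(y) / len(y)
    pred = [base] * len(x)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        thr, lv, rv = fit_stump(x, residual)
        stumps.append((thr, lr * lv, lr * rv))
        pred = [p + (lr * lv if xi <= thr else lr * rv) for xi, p in zip(x, pred)]
    return base, stumps


def predict(base: float, stumps: List[Stump], x: Sequence[float]) -> List[float]:
    """Sum the base value and every stump's contribution."""
    return [base + sum(lv if xi <= thr else rv for thr, lv, rv in stumps)
            for xi in x]


# Hypothetical raw model scores and discrete targets.
x = [0.05, 0.12, 0.20, 0.31, 0.45, 0.52, 0.60, 0.77, 0.85, 0.93]
y = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
n_rounds = 3
base, stumps = boost_stumps(x, y, n_rounds=n_rounds, lr=1.0)
preds = predict(base, stumps, x)
print(sorted(set(round(p, 6) for p in preds)))
```

Each stump adds one threshold, so the thresholds play the role of bin edges and the leaf values play the role of bin levels, both chosen by the fit rather than set by hand. With several input features (for example meta features), the same model can also use them when deciding the levels.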

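  14. Why did we use LightGBM?
    4. LightGBM (parameter tuning)
      ・The score improves the more the predictions are discretized, so we wanted the tree structure to be as simple as possible
      ・Split the train data to find the best parameters
      ・The score was better with the smallest max_depth and the largest possible lr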

  15. Didn’t work for us
    ・Setting sample weights
    ・Placing the host word at the beginning of the input
    ・Adding new tokens
    ・Using Bert-base cased
    ・Removing TeX code blocks by brute force

  16. Links
    Discussion:
    https://www.kaggle.com/c/google-quest-challenge/discussion/129904#742302
    Kernel:
    https://www.kaggle.com/shuheigoda/23th-place-solusion

  17. We are hiring!
    https://www.wantedly.com/projects/375150
