Kaggle Google Quest Q&A Labeling - 23rd place solution

Shuhei Goda
February 28, 2020


Transcript

  1. ©2020 Wantedly, Inc. 23rd place solution: Kaggle Google Quest Q&A Labeling retrospective
    Feb 28, 2020 - Shuhei Goda - @jy_msc
  2. Team - The Hand
    ・Shuhei Goda (@jy_msc), Visit Engineering Team at Wantedly
    ・Naomichi Agata (@agatan_), People Engineering Team at Wantedly
  3. Model Pipeline
    ・Text data: question_title, question_body, answer
    ・PreProcess (Q and A): question_title, question_body, answer
    ・PreProcess (only Q): question_title, question_body
      ・Settings: HTML unescape, head+tail truncation
    ・Bert-base uncased (one model per pre-processed input)
      ・Settings: 3 folds with GroupKFold, BCE + margin ranking loss, 3 epochs
    ・LightGBM
      ・Settings: max_depth=1, lr=0.1
      ・Meta features: text length, stackexchange
  4. Pre-Process
    ・Unescape HTML strings
    https://www.kaggle.com/c/google-quest-challenge/discussion/…
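The unescaping step on this slide can be sketched with Python's standard `html.unescape`. This is a minimal illustration and not necessarily the team's code; the function name `unescape_text` and the sample string are made up for the example.

```python
import html


def unescape_text(text: str) -> str:
    """Turn HTML entities such as &lt; or &amp; back into plain characters."""
    return html.unescape(text)


# Hypothetical sample text, not taken from the competition data.
sample = "Why is 1 &lt; 2 &amp;&amp; 2 &gt; 1 true?"
print(unescape_text(sample))
```

Running this before tokenization keeps entity fragments like `&lt;` from being split into several meaningless word pieces.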

  5. Pre-Process
    ・Concatenate and trim the text data
      ・[CLS] + question_title + [SEP] + question_body + [SEP] + answer
    ・If question_body or answer exceeds the specified length, keep equal-sized portions from both ends (head+tail truncation)
    https://arxiv.org/abs/…
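A minimal sketch of the concatenation and head+tail truncation described on this slide, working on already-tokenized lists. The function names and the per-field length limits (`body_len`, `answer_len`) are assumptions for illustration; the slide does not give the actual limits, and the odd-length handling (the tail gets the extra token) is an arbitrary choice.

```python
def head_tail_truncate(tokens, max_len):
    """Keep the first and last parts of a token list that is too long."""
    if len(tokens) <= max_len:
        return list(tokens)
    head_len = max_len // 2
    tail_len = max_len - head_len  # tail gets the extra token if max_len is odd
    return tokens[:head_len] + tokens[len(tokens) - tail_len:]


def build_input(title, body, answer, body_len, answer_len):
    """[CLS] + question_title + [SEP] + question_body + [SEP] + answer."""
    body = head_tail_truncate(body, body_len)
    answer = head_tail_truncate(answer, answer_len)
    return ["[CLS]"] + title + ["[SEP]"] + body + ["[SEP]"] + answer


# Toy tokens: the body has 6 tokens but only 4 are allowed.
print(build_input(["t"], list("abcdef"), list("xyz"), body_len=4, answer_len=4))
```

The middle of a long text is dropped, on the assumption that the opening and closing sentences carry most of the signal.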
  6. Model Architecture
    ・Bert-base (uncased)
      ・Use the outputs of the last 4 hidden layers https://arxiv.org/abs/1905.05583
      ・Use the output of the SEP token between Q and A
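The pooling idea on this slide can be sketched without any deep learning library, using nested lists as a toy stand-in for BERT hidden states. Two points are assumptions, since the slide does not specify them: the four layer outputs are concatenated here (they could equally be averaged), and the SEP token "between Q and A" is taken to be the second [SEP] in the input. All names are hypothetical.

```python
def sep_between_q_and_a(tokens):
    """Index of the second [SEP], which separates question_body from answer."""
    sep_positions = [i for i, tok in enumerate(tokens) if tok == "[SEP]"]
    if len(sep_positions) < 2:
        raise ValueError("expected at least two [SEP] tokens")
    return sep_positions[1]


def last4_sep_features(hidden_states, tokens):
    """Concatenate the [SEP] vectors taken from the last four layers.

    hidden_states[layer][position] is a list of floats (toy stand-in for BERT).
    """
    pos = sep_between_q_and_a(tokens)
    features = []
    for layer in hidden_states[-4:]:
        features.extend(layer[pos])  # assumption: concatenate, not average
    return features


tokens = ["[CLS]", "t", "[SEP]", "b", "[SEP]", "a"]
# Toy hidden states: 6 layers, hidden size 2, each vector is [layer, position].
hidden_states = [[[float(l), float(p)] for p in range(len(tokens))] for l in range(6)]
print(last4_sep_features(hidden_states, tokens))
```

In a real model the resulting vector would feed a linear head that predicts the target columns.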
  7. Loss function
    ・Label weight
      ・Use a smaller weight for tasks that look easy and a larger weight for tasks that look imbalanced and hard
      ・We tried searching for the weights with gpyopt, but the simple scheme shown on this slide worked best
    ・With label weight: Public: 0.45979, Private: 0.41440
    ・Without label weight: Public: 0.43455, Private: 0.40602
  8. Loss function
    ・BCE + margin ranking loss (1 : 1)
      ・Split each mini-batch into two halves and compute the margin ranking loss between them
    ・BCE + margin ranking loss: Public: 0.45979, Private: 0.41440
    ・BCE: Public: 0.44006, Private: 0.40668
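A minimal pure-Python sketch of the combined loss for a single target column. It pairs the i-th sample of the first half of the mini-batch with the i-th sample of the second half and applies the usual margin ranking formula, max(0, -y * (p1 - p2) + margin), where y is +1 or -1 depending on which label is larger. The margin value, the choice to skip pairs with equal labels, and computing the ranking term on probabilities rather than logits are all assumptions, not details from the slides.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy for probabilities in [0, 1]."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def margin_ranking(probs, targets, margin=0.0):
    """Pair the first half of the batch with the second half."""
    half = len(probs) // 2  # with an odd batch size the last sample is unused
    losses = []
    for i in range(half):
        p1, p2 = probs[i], probs[half + i]
        t1, t2 = targets[i], targets[half + i]
        if t1 == t2:
            continue  # assumption: pairs with tied labels are skipped
        y = 1.0 if t1 > t2 else -1.0
        losses.append(max(0.0, -y * (p1 - p2) + margin))
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(probs, targets, margin=0.0):
    """BCE and margin ranking loss added with equal (1 : 1) weight."""
    return bce(probs, targets) + margin_ranking(probs, targets, margin)


print(combined_loss([0.1, 0.9], [1.0, 0.0]))
```

The ranking term only penalizes pairs whose predicted order disagrees with the label order, which is closer to what a rank-based metric such as Spearman's correlation rewards than BCE alone.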
  9. Training
    ・Question Model
      ・Solve the question-related tasks using only the question text
      ・Since the input only needs the question, less of the question is truncated (more question information is retained)
    ・Q model + Q and A model: Public: 0.45979, Private: 0.41440
    ・Q and A model × 2 (seed average): Public: 0.44298, Private: 0.40613
  10. Post-Process
    ・LightGBM
      ・max_depth=1, lr=0.1
      ・meta features
        ・text length (question, answer)
        ・meta data from stackexchange (Score, View, FavoriteCount, …)
    ・LightGBM: Public: 0.45979, Private: 0.41440
    ・Simple binning without meta features: Public: 0.45282, Private: 0.41387
  11. Why we used LightGBM? 1. Simple binning method
    ・We noticed that discretizing the predictions improves Spearman’s correlation
    ・Bin the predictions with a bin size set in advance for each target
    ・With the bin size fixed, take a weighted average of the Bert outputs from each epoch (the weights are optimized)
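A small sketch of why binning can help, using a hand-written Spearman correlation with average ranks for ties. The helper names, the rounding-based binning rule, and the toy numbers are assumptions for illustration. The effect relies on the labels themselves taking only a few distinct values, so that binned predictions tie in the same places as the labels; it is not guaranteed for every choice of bins.

```python
import math


def bin_predictions(preds, n_bins):
    """Snap predictions in [0, 1] to the nearest of n_bins + 1 grid values."""
    return [round(p * n_bins) / n_bins for p in preds]


def weighted_average(epoch_preds, weights):
    """Weighted average of per-epoch prediction lists."""
    total = sum(weights)
    n = len(epoch_preds[0])
    return [sum(w * p[i] for w, p in zip(weights, epoch_preds)) / total for i in range(n)]


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(a, b):
    """Pearson correlation of the ranks; returns 0.0 if either side is constant."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


# Toy data: labels take two values, raw predictions are noisy within each group.
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
preds = [0.10, 0.20, 0.15, 0.80, 0.95, 0.85]
binned = bin_predictions(preds, n_bins=2)
print(spearman(targets, preds), spearman(targets, binned))
```

On this toy data the binned predictions score higher because the within-group noise, which the labels cannot explain, is removed from the ranking.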
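  12. Why we used LightGBM? 2. Optimize bin-size and weights
    ・We also wanted to use the optimal bin size
    ・We tried optimizing the bin size and the weights jointly, but it did not work well
    ・The optimal bin size depends on the shape of the prediction distribution. The average of each fold's optimal bin size, and the prediction distribution after weighted averaging, both drift away from the optimum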
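  13. Why we used LightGBM? 3. LightGBM
    ・We wanted to optimize the bin size and the weights jointly
    ・We also wanted to use meta features
    ・GBDT splits the data and assigns an optimal value to each resulting region
      → With shallow trees, it might discretize the predictions in the same way as binning
    (Figure panels: max_depth=2, max_depth=8)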
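The intuition on this slide can be illustrated with a tiny hand-written booster of depth-1 trees (stumps) on a single feature. This is a toy stand-in written for this example, not LightGBM and not the team's code: each stump adds one threshold, so after `n_rounds` rounds the output is piecewise constant with at most `n_rounds + 1` distinct values, which is a learned binning. The input here is a single raw score and needs at least two distinct values; the real pipeline also used meta features.

```python
def fit_stump(xs, residuals):
    """Best single split on a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lv, rv)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=3, lr=0.1):
    """Gradient boosting with squared error and depth-1 trees."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, lv, rv = fit_stump(xs, residuals)
        stumps.append((threshold, lr * lv, lr * rv))
        preds = [p + (lr * lv if x <= threshold else lr * rv) for x, p in zip(xs, preds)]
    return base, stumps


def predict(model, x):
    base, stumps = model
    return base + sum(left if x <= t else right for t, left, right in stumps)


# Toy data: 21 evenly spaced raw scores used as both feature and target.
xs = [i / 20 for i in range(21)]
model = fit_boosted_stumps(xs, xs, n_rounds=3, lr=0.1)
distinct = {round(predict(model, x), 12) for x in xs}
print(len(distinct), "distinct output values for", len(xs), "inputs")
```

Unlike fixed bins, the thresholds are chosen from the data, and additional features can be added as extra split candidates.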
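  14. Why we used LightGBM? 4. LightGBM (parameter tuning)
    ・The more the predictions are discretized, the better the score, so we wanted the tree structure to be as simple as possible
    ・Split the train data to find the best parameters
    ・Setting max_depth to its smallest value and lr as large as possible gave a better score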
  15. Didn’t work for us
    ・Setting sample weights
    ・Placing the host word at the beginning of the input sequence
    ・Adding new tokens
    ・Using Bert-base cased
    ・Removing TeX code blocks by brute force
  16. Links
    Discussion: https://www.kaggle.com/c/google-quest-challenge/discussion/129904#742302
    Kernel: https://www.kaggle.com/shuheigoda/23th-place-solusion

  17. We are hiring! https://www.wantedly.com/projects/375150