Slide 1

Slide 1 text

©2020 Wantedly, Inc.
23rd place solution
Kaggle Google Quest Q&A Labeling retrospective
Feb 28, 2020 - Shuhei Goda - @jy_msc

Slide 2

Slide 2 text

Team - The Hand
Shuhei Goda (@jy_msc): Visit Engineering Team at Wantedly
Naomichi Agata (@agatan_): People Engineering Team at Wantedly

Slide 3

Slide 3 text

Model Pipeline

Text data
- question_title
- question_body
- answer

Pre-Process (Q and A): question_title, question_body, answer
Pre-Process (only Q): question_title, question_body
Settings
- html escape
- head+tail truncation

Bert-base uncased (x2: Q and A model, Q-only model)
Settings
- 3fold with GroupKFold
- BCE + margin ranking loss
- 3epoch

Meta features
- text length
- stackexchange

LightGBM
Settings
- max_depth=1
- lr=0.1
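The "3fold with GroupKFold" setting can be illustrated with a small pure-Python sketch. It is a greedy stand-in for a grouped split, not scikit-learn's implementation, and the grouping key (here a hypothetical question identifier) is an assumption, since the slides do not say what the folds were grouped on.

```python
from collections import defaultdict

def group_k_fold(groups, n_splits):
    """Yield (train_idx, valid_idx) so that no group is split across folds."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    folds = [[] for _ in range(n_splits)]
    # Largest groups first, each placed into the currently smallest fold.
    for g in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[g])
    for k in range(n_splits):
        valid = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, valid

# Hypothetical group labels: rows sharing a label belong to the same question.
groups = ["q1", "q1", "q2", "q3", "q3", "q4"]
for train_idx, valid_idx in group_k_fold(groups, n_splits=3):
    print(train_idx, valid_idx)
```

Keeping a group inside a single fold means the validation rows never share a group with the training rows.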

Slide 4

Slide 4 text

Pre-Process
- Unescape HTML strings
https://www.kaggle.com/c/google-quest-challenge/discussion/…
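A minimal sketch of the unescaping step, using Python's standard `html.unescape`. The helper name `unescape_text` and the sample string are illustrative only; the slides do not show the actual code.

```python
import html

def unescape_text(text: str) -> str:
    """Turn HTML entities such as &lt; or &amp; back into plain characters."""
    return html.unescape(text)

# Illustrative input, not taken from the competition data.
sample = "if a &lt; b &amp;&amp; b &gt; c: print(&quot;ok&quot;)"
print(unescape_text(sample))
```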

Slide 5

Slide 5 text

Pre-Process
- Concatenate and trim the text data
  - [CLS] + question_title + [SEP] + question_body + [SEP] + answer
- When question_body and answer exceed the specified length, keep equal-sized portions from both ends (head+tail truncation)
https://arxiv.org/abs/…
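The concatenation and head+tail truncation could look roughly like the sketch below. It works on plain token lists instead of a real WordPiece tokenizer, and the function names and per-field length budgets are assumptions; the slides do not state the actual lengths.

```python
def head_tail(tokens, max_len):
    """Keep the first and last parts of the budget when tokens are too long."""
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[len(tokens) - tail:]

def build_input(title, body, answer, body_len, answer_len):
    """[CLS] + title + [SEP] + body + [SEP] + answer, body and answer truncated."""
    return (["[CLS]"] + title + ["[SEP]"]
            + head_tail(body, body_len) + ["[SEP]"]
            + head_tail(answer, answer_len))

# Hypothetical budgets: 4 tokens for the body, 2 for the answer.
tokens = build_input(["t"], list("abcdef"), ["x", "y", "z"], body_len=4, answer_len=2)
print(tokens)
```

The middle of an over-long field is dropped, so both the opening and the closing of the text survive.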

Slide 6

Slide 6 text

Model Architecture
- Bert-base (uncased)
  - Uses the outputs of the last 4 hidden layers https://arxiv.org/abs/1905.05583
  - Uses the output of the SEP token between Q and A
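One way to read these two points is sketched below with plain nested lists in place of real BERT tensors. Concatenating the four layers (rather than averaging them) and reading them at the second [SEP] position are assumptions; the slides only say that the last four hidden layers and the SEP token between Q and A are used.

```python
def qa_sep_index(tokens):
    """Index of the second [SEP], which separates question_body from answer."""
    seps = [i for i, t in enumerate(tokens) if t == "[SEP]"]
    return seps[1]

def pool_last_four(hidden_states, position):
    """Concatenate the vectors at `position` from the last four layers."""
    pooled = []
    for layer in hidden_states[-4:]:
        pooled.extend(layer[position])
    return pooled

tokens = ["[CLS]", "title", "[SEP]", "body", "[SEP]", "answer"]
# Dummy activations: 13 layers (embeddings + 12 blocks), 2 dims per token,
# every value set to the layer index so the result is easy to inspect.
hidden_states = [[[float(layer)] * 2 for _ in tokens] for layer in range(13)]
features = pool_last_four(hidden_states, qa_sep_index(tokens))
print(features)
```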

Slide 7

Slide 7 text

Loss function
- Label weight
  - Smaller weights for tasks that look easy, larger weights for tasks that look imbalanced and hard
  - We tried searching for the weights with gpyopt, but the simple scheme shown on the slide worked best

With label weight: Public: 0.45979, Private: 0.41440
Without label weight: Public: 0.43455, Private: 0.40602
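A minimal sketch of a label-weighted BCE, assuming the weights simply scale each target's loss before averaging. The weight values used here are made up for illustration; the team's actual weights are not in the transcript.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def weighted_bce(preds, targets, label_weights):
    """Mean over samples of the weight-normalised per-label BCE."""
    total = 0.0
    for p_row, y_row in zip(preds, targets):
        row = sum(w * bce(p, y) for p, y, w in zip(p_row, y_row, label_weights))
        total += row / sum(label_weights)
    return total / len(preds)

# Two labels: the first is predicted well, the second poorly.
preds, targets = [[0.9, 0.5]], [[1.0, 1.0]]
print(weighted_bce(preds, targets, [1.0, 1.0]))  # equal weights
print(weighted_bce(preds, targets, [1.0, 3.0]))  # hypothetical: harder label up-weighted
```

Up-weighting the second label makes its error dominate the total, which is the effect the label weights are meant to have on the harder tasks.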

Slide 8

Slide 8 text

Loss function
- BCE + margin ranking loss (1 : 1)
  - Split the mini-batch in two and compute the margin ranking loss

BCE + margin ranking loss: Public: 0.45979, Private: 0.41440
BCE: Public: 0.44006, Private: 0.40668
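The combined loss might be sketched as follows for a single target column. Several details are assumptions not stated on the slide: the two halves are paired by position, the pair label is the sign of the target difference, the margin is 0, and both terms are applied to the same scores with mean reduction. The ranking term uses the usual form max(0, -y * (x1 - x2) + margin).

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def margin_ranking(x1, x2, y, margin=0.0):
    """max(0, -y * (x1 - x2) + margin), the usual margin ranking loss."""
    return max(0.0, -y * (x1 - x2) + margin)

def sign(v):
    return (v > 0) - (v < 0)

def combined_loss(preds, targets, margin=0.0):
    """BCE plus ranking loss (1 : 1) for one label over a mini-batch."""
    n = len(preds)
    bce_term = sum(bce(p, y) for p, y in zip(preds, targets)) / n
    half = n // 2
    # Pair sample i of the first half with sample i of the second half.
    pairs = zip(preds[:half], preds[half:2 * half],
                targets[:half], targets[half:2 * half])
    rank_term = sum(margin_ranking(p1, p2, sign(t1 - t2), margin)
                    for p1, p2, t1, t2 in pairs) / max(half, 1)
    return bce_term + rank_term

print(combined_loss([0.9, 0.1], [1.0, 0.0]))  # correctly ordered pair
print(combined_loss([0.1, 0.9], [1.0, 0.0]))  # wrongly ordered pair is penalised
```

The ranking term only cares about the order of predictions within a pair, which lines up with a rank-based metric such as Spearman's correlation.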

Slide 9

Slide 9 text

Training
- Question Model
  - Solves the question-related tasks using only the question text
  - Since the input only needs Q, less of Q is truncated (more of Q's information is kept)

Q model + Q and A model: Public: 0.45979, Private: 0.41440
Q and A model × 2 (seed average): Public: 0.44298, Private: 0.40613

Slide 10

Slide 10 text

Post-Process
- LightGBM
  - max_depth=1, lr=0.1
  - meta features
    - text length (question, answer)
    - meta data from stackexchange (Score, View, FavoriteCount, …)

LightGBM: Public: 0.45979, Private: 0.41440
Simple binning without meta features: Public: 0.45282, Private: 0.41387

Slide 11

Slide 11 text

Why we used LightGBM?
1. Simple binning method
- We noticed that discretizing the predictions improves Spearman's correlation
- Set a bin size for each target in advance and bin the predictions
- With the bin sizes fixed, take a weighted average of the BERT outputs from each epoch (the weights are optimized)
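The effect of binning on Spearman's correlation can be reproduced on toy data. The sketch below implements Spearman with average ranks for ties and snaps predictions to an evenly spaced grid; the grid-rounding rule and the toy numbers are assumptions, since the slides do not show how the bins were defined.

```python
def ranks(values):
    """Average ranks (1-based), giving tied values the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(a, b):
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

def bin_predictions(preds, n_bins):
    """Snap each prediction in [0, 1] to the nearest of n_bins + 1 grid points."""
    return [round(p * n_bins) / n_bins for p in preds]

# Toy data: targets take only two values, predictions are noisy but well separated.
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
preds = [0.10, 0.12, 0.08, 0.80, 0.90, 0.85]
binned = bin_predictions(preds, n_bins=1)
print(spearman(preds, targets), spearman(binned, targets))
```

When the targets are heavily tied, small differences between predictions only add rank noise; collapsing them into the same bin recreates the ties and raises the correlation on this example.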

Slide 12

Slide 12 text

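Why we used LightGBM?
2. Optimize bin-size and weights
- We wanted to use optimal values for the bin sizes as well
- We tried optimizing the bin sizes and the weights jointly, but it did not work well
- The optimal bin size depends on the shape of the prediction distribution. The average of the per-fold optimal bin sizes, and the prediction distribution after weighted averaging, deviate from the optimum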

Slide 13

Slide 13 text

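Why we used LightGBM?
3. LightGBM
- We want to optimize the bin sizes and the weights jointly
- We also want to use meta features
- GBDT splits the data and assigns an optimal value to each resulting region
  → With shallow trees, it might discretize the predictions in the same way as binning
Figure panels: max_depth=2, max_depth=8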
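The intuition that shallow boosted trees act like binning can be checked with a tiny pure-Python gradient booster built from depth-1 trees (stumps). This is a stand-in written for illustration, not LightGBM itself, and it uses a single input feature (a hypothetical BERT prediction) with squared error; the toy data is made up.

```python
def fit_stump(x, residual):
    """Best single split on x minimising squared error of the residual.

    Needs at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    return best[1:]

def boost(x, y, n_rounds, lr):
    """Gradient boosting with squared error and depth-1 trees."""
    base = sum(y) / len(y)
    pred = [base] * len(x)
    stumps = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        t, lv, rv = fit_stump(x, residual)
        stumps.append((t, lr * lv, lr * rv))
        pred = [p + (lr * lv if xi <= t else lr * rv) for xi, p in zip(x, pred)]
    return base, stumps

def predict(model, xi):
    base, stumps = model
    return base + sum(lv if xi <= t else rv for t, lv, rv in stumps)

# Toy data: 21 evenly spaced "BERT predictions" and three-level targets.
x = [i / 20 for i in range(21)]
y = [0.0] * 7 + [0.5] * 7 + [1.0] * 7
model = boost(x, y, n_rounds=5, lr=0.1)
levels = sorted({round(predict(model, xi), 9) for xi in x})
print(len(levels), levels)
```

With five stumps on one feature there are at most five thresholds, so the output is a step function with at most six distinct levels: the continuous input has been discretized, with the cut points learned from the data.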

Slide 14

Slide 14 text

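Why we used LightGBM?
4. LightGBM (parameter tuning)
- The more discretized the predictions, the better the score, so we wanted the tree structure to be as simple as possible
- Split the train data to find the best parameters
- The smallest max_depth and the largest possible lr gave the best score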

Slide 15

Slide 15 text

Didn't work for us
- Setting sample weights
- Putting the host word at the start of the input sequence
- Adding new tokens
- Using Bert-base cased
- Brute-force removal of TeX code blocks

Slide 16

Slide 16 text

Links
Discussion: https://www.kaggle.com/c/google-quest-challenge/discussion/129904#742302
Kernel: https://www.kaggle.com/shuheigoda/23th-place-solusion

Slide 17

Slide 17 text

We are hiring!
https://www.wantedly.com/projects/375150