Kaggle Google Quest Q&A Labeling - 23th place solution
Shuhei Goda
February 28, 2020
Transcript
23th place solution
Kaggle Google Quest Q&A Labeling: Retrospective
Feb 28, 2020 - Shuhei Goda - @jy_msc
©2020 Wantedly, Inc.

Team - The Hand
・Shuhei Goda (@jy_msc): Visit Engineering Team at Wantedly
・Naomichi Agata (@agatan_): People Engineering Team at Wantedly

Model Pipeline
・Text data
  ・question_title
  ・question_body
  ・answer
・PreProcess (Q and A): question_title, question_body, answer
・PreProcess (only Q): question_title, question_body
  ・Settings: html unescape, head+tail truncation
・Bert-base uncased (one model per PreProcess branch)
  ・Settings: 3fold with GroupKFold, BCE + margin ranking loss, 3epoch
・LightGBM
  ・Settings: max_depth=1, lr=0.1
  ・Meta features: text length, stackexchange

Pre-Process
・Unescape HTML strings
https://www.kaggle.com/c/google-quest-challenge/discussion/…

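The unescaping step can be sketched as follows. This is a minimal illustration that assumes Python's standard `html.unescape` is an acceptable stand-in for whatever the team actually used; the helper name `unescape_text` and the sample string are made up.

```python
import html


def unescape_text(text: str) -> str:
    """Turn HTML entities such as &lt; and &amp; back into plain characters."""
    return html.unescape(text)


# Hypothetical sample resembling a scraped StackExchange snippet.
sample = "if x &lt; 3 &amp;&amp; y &gt; 1: print(&quot;ok&quot;)"
cleaned = unescape_text(sample)
```

Unescaping before tokenization keeps entities like `&lt;` from being split into several meaningless word pieces.
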
Pre-Process
・Concatenate and trim the text data
  ・[CLS] + question_title + [SEP] + question_body + [SEP] + answer
・When question_body and answer exceed the specified length, trim the same amount from both
https://arxiv.org/abs/…

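One way to read the two slides on truncation is sketched below. It is an assumption-heavy illustration, not the team's code: inputs are already tokenized lists, `max_len=512` and the 50/50 `head_ratio` are hypothetical, the title is never trimmed, and any odd leftover token is taken from the answer.

```python
def head_tail(tokens, keep, head_ratio=0.5):
    """Keep `keep` tokens: part from the head, the rest from the tail."""
    if len(tokens) <= keep:
        return list(tokens)
    n_head = int(keep * head_ratio)
    n_tail = keep - n_head
    return tokens[:n_head] + (tokens[-n_tail:] if n_tail > 0 else [])


def build_input(title, body, answer, max_len=512):
    """Build [CLS] title [SEP] body [SEP] answer, trimming body and answer
    to fit max_len. The title is left untouched in this sketch."""
    n_special = 3  # [CLS] and two [SEP]
    overflow = len(title) + len(body) + len(answer) + n_special - max_len
    if overflow > 0:
        # Cut the same number of tokens from body and answer; if one side
        # is too short to give up its share, the other side absorbs it.
        cut_body = min(overflow // 2, len(body))
        cut_answer = min(overflow - cut_body, len(answer))
        cut_body = min(overflow - cut_answer, len(body))
        body = head_tail(body, len(body) - cut_body)
        answer = head_tail(answer, len(answer) - cut_answer)
    return ["[CLS]"] + title + ["[SEP]"] + body + ["[SEP]"] + answer
```

Keeping both the head and the tail drops the middle of a long text, on the assumption that openings and conclusions carry more signal than the middle.
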
Model Architecture
・Bert-base (uncased)
  ・Use the outputs of the last 4 hidden layers https://arxiv.org/abs/1905.05583
  ・Use the output of the SEP token between Q and A

Loss function
・Label weight
  ・Smaller weights for tasks that look easy, larger weights for tasks that look imbalanced and hard
  ・We tried searching for the weights with gpyopt, but the simple scheme shown on the slide worked best
・With label weight: Public 0.45979, Private 0.41440
・Without label weight: Public 0.43455, Private 0.40602

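A label-weighted BCE can be sketched in plain Python as below. The transcript does not preserve the actual weights, so `label_weights` holds hypothetical values and the function name is illustrative.

```python
import math


def weighted_bce(probs, targets, label_weights, eps=1e-7):
    """Mean binary cross-entropy over labels, each scaled by its weight.

    probs, targets: per-label values in [0, 1] for one example.
    label_weights: one non-negative weight per label (hypothetical values).
    """
    total = 0.0
    for p, t, w in zip(probs, targets, label_weights):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += w * -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


# Two labels: the second is treated as "hard" and gets a larger weight.
loss = weighted_bce([0.8, 0.3], [1.0, 1.0], [0.5, 2.0])
```

With all weights set to 1 this reduces to ordinary BCE, so the weights only change how much each label's error contributes to the gradient.
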
Loss function
・BCE + margin ranking loss (1 : 1)
  ・Split each mini-batch into two halves and compute the margin ranking loss
・BCE + margin ranking loss: Public 0.45979, Private 0.41440
・BCE: Public 0.44006, Private 0.40668

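The combined loss might look like the sketch below for a single label column. Several details are assumptions: how the two halves are paired (here, element i of the first half with element i of the second), the `margin` of 0, and the choice to skip pairs with tied targets. The ranking term uses the standard form max(0, -y(x1 - x2) + margin).

```python
import math


def bce(p, t, eps=1e-7):
    """Binary cross-entropy for one prediction/target pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def margin_ranking(x1, x2, y, margin=0.0):
    """max(0, -y * (x1 - x2) + margin); y = +1 means x1 should rank higher."""
    return max(0.0, -y * (x1 - x2) + margin)


def combined_loss(preds, targets, margin=0.0):
    """BCE plus a pairwise ranking term, mixed 1 : 1."""
    bce_term = sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)
    half = len(preds) // 2
    rank_terms = []
    for i in range(half):
        t1, t2 = targets[i], targets[half + i]
        if t1 == t2:
            continue  # tied targets carry no ordering information (assumption)
        y = 1.0 if t1 > t2 else -1.0
        rank_terms.append(margin_ranking(preds[i], preds[half + i], y, margin))
    rank_term = sum(rank_terms) / len(rank_terms) if rank_terms else 0.0
    return bce_term + rank_term
```

The ranking term only penalizes pairs whose predicted order disagrees with the target order, which is closer to what a rank-based metric such as Spearman's correlation rewards than BCE alone.
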
Training
・Question Model
  ・Solve the question-related tasks using only the question text
  ・Since the input is only Q, less of Q gets truncated (more information from Q is kept)
・Q model + Q and A model: Public 0.45979, Private 0.41440
・Q and A model × 2 (seed average): Public 0.44298, Private 0.40613

Post-Process
・LightGBM
  ・max_depth=1, lr=0.1
  ・meta features
    ・text length (question, answer)
    ・meta data from stackexchange (Score, View, FavoriteCount, …)
・LightGBM: Public 0.45979, Private 0.41440
・Simple binning without meta features: Public 0.45282, Private 0.41387

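Why we used LightGBM?
1. Simple binning method
  ・We noticed that discretizing the predictions improves Spearman's correlation
  ・Set a bin size for each target in advance and bin the predictions
  ・With the bin sizes fixed, take a weighted average of the Bert outputs from each epoch (the weights are optimized)
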
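Why discretization can help a rank metric is easiest to see on a toy case. The sketch below implements Spearman's correlation with tie-averaged ranks and a simple grid-snapping binning; the data is invented to show the mechanism, and `n_bins` is a hypothetical parameterization of the slide's "bin size".

```python
def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Pearson correlation of the tie-averaged ranks.
    Assumes neither input is constant."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def bin_predictions(preds, n_bins):
    """Snap each prediction in [0, 1] to the nearest of n_bins + 1 grid points."""
    return [round(p * n_bins) / n_bins for p in preds]


# Invented data: targets take few distinct values, predictions are continuous.
targets = [0.0, 0.0, 0.0, 0.5, 0.5, 1.0]
preds = [0.10, 0.05, 0.12, 0.45, 0.55, 0.90]
raw_score = spearman(preds, targets)
binned_score = spearman(bin_predictions(preds, n_bins=2), targets)
```

When the targets are heavily tied, continuous predictions impose an arbitrary order inside each tied group; binning restores the ties, so `binned_score` comes out higher than `raw_score` on this example.
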
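Why we used LightGBM?
2. Optimize bin-size and weights
  ・We wanted to use optimal values for the bin sizes as well
  ・Optimizing the bin sizes and the weights simultaneously did not work well
  ・The optimal bin size depends on the shape of the prediction distribution, so the average of each fold's optimal bin size drifts away from what is optimal for the predictions after the weighted average
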
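Why we used LightGBM?
3. LightGBM
  ・We wanted to optimize the bin sizes and the weights simultaneously
  ・We wanted to use meta features
  ・GBDT splits the data and assigns an optimal value to each resulting region
    → With shallow trees, it might discretize the predictions much like binning does
[Figure panels: max_depth=2, max_depth=8]
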
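Why we used LightGBM?
4. LightGBM (parameter tuning)
  ・The more discretized the predictions, the better the score, so we wanted the model structure to be as simple as possible
  ・Split the train data to find the optimal parameters
  ・The score improved with the smallest max_depth and the largest feasible lr
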
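The intuition that shallow trees act like binning can be checked with a toy booster. The sketch below is not LightGBM: it is a plain squared-error gradient boosting loop over depth-1 trees (stumps) on a single feature, written only to show that the output is a step function. All function names and the data are hypothetical, and `fit_stump` needs at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Best single split on one feature: (threshold, left_mean, right_mean)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=30, lr=0.1):
    """Squared-error gradient boosting with depth-1 trees on one feature."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, residuals)
        stumps.append((thr, lr * lv, lr * rv))  # store shrunken leaf values
        preds = [p + (lr * lv if x <= thr else lr * rv) for x, p in zip(xs, preds)]
    return base, stumps


def predict(model, x):
    """Sum the base value and the leaf value chosen by every stump."""
    base, stumps = model
    return base + sum(lv if x <= thr else rv for thr, lv, rv in stumps)


# Invented data: a continuous model output and a coarse-grained target.
xs = [i / 50 for i in range(50)]
ys = [round(x * 3) / 3 for x in xs]
model = boost(xs, ys)
outputs = {predict(model, x) for x in xs}
thresholds = {thr for thr, _, _ in model[1]}
```

Every stump contributes one threshold, so the boosted model can emit at most `len(thresholds) + 1` distinct values on this feature. The split points are learned from the data instead of being fixed in advance, which is the property the slides attribute to the LightGBM post-processing.
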
Didn't work for us
・Setting sample weights
・Placing the host word at the start of the input sequence
・Adding new tokens
・Using Bert-base cased
・Removing TeX code blocks by brute force

Links
・Discussion: https://www.kaggle.com/c/google-quest-challenge/discussion/129904#742302
・Kernel: https://www.kaggle.com/shuheigoda/23th-place-solusion

We are hiring!
https://www.wantedly.com/projects/375150