Watson
March 15, 2022

# Building a Formula Grounding Dataset with MioGatto / nlp2022


## Transcript

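2. ### Formula grounding [Asakura+ 2020]

1. Find the groups of tokens that refer to a mathematical concept. Example groups: x, α, cos, ∑, =, ×
2. Link each group to the mathematical concept it denotes.

Contribution of this work: a dataset built as a step toward automation
▶ Manual annotation of 12,352 identifier occurrences in total across 15 papers
▶ First insights into, among other things, how identifier scopes behave within a document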
3. ### Formula grounding [Asakura+ 2020] ≈ alignment + ○○○○○

Description alignment
▶ The task of assigning a description to each token
▶ Several prior studies exist [Aizawa+ 2013, Alexeeva+ 2020, etc.]
→ Most of them assume that the meaning of a token stays fixed throughout a document
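4. ### Formula grounding [Asakura+ 2020] ≈ alignment + coreference resolution

Coreference in natural language
Momotaro was born from a peach. He set out to slay the ogres. (Coreference: "Momotaro" and "He")

Coreference in formulae
What a machine learning algorithm yields is a function y(x). When a new digit image x is fed to this function, an output vector y, encoded in the same way as the target vectors, is produced. The detailed form of the function y(x) is determined from the training data. (PRML, p. 2)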
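5. ### Why formula grounding is needed and why it is hard

▶ Formulae carry ambiguity similar to that of natural language [Kohlhase+, 2014]
  ▶ Collisions between symbols (tokens)
  ▶ Syntactic ambiguity of formulae, e.g. f (a + b)
▶ They cannot be interpreted without the surrounding text
▶ Common sense and domain knowledge are required, e.g. π: the ratio of a circle's circumference to its diameter

Polysemy of the token y in Chapter 1 of PRML

| Text fragment | Meaning of y |
|---|---|
| ...what is obtained is a function y(x)... | A function that takes an image as input |
| ...an output vector y is produced... | The output vector of the function y(x) |
| ...two vectors of random variables x and y... | A vector of random variables |
| ...consider the joint distribution p(x,y). | The value corresponding to x |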
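6. ### Sources of grounding

The evidence, inside or outside the document, on which the grounding of a formula rests.
Inside the document: surrounding text and formulae, e.g. appositive nouns, ≝
Outside the document: common sense and domain knowledge, e.g. Wikidata

Annotated information: what is needed for automation
▶ Math concept: the result of grounding, i.e. the gold label
▶ Source of grounding: extracting these automatically is the first step toward automation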
7. ### The annotation tool MioGatto [Asakura+ 2021]

Math Identifier-Oriented Grounding Annotation Tool
▶ A purpose-built tool for constructing formula grounding data
  ◦ Web-based GUI (implemented in Python + TypeScript)
▶ Under active development as open source (MIT License)! https://github.com/wtsnjp/MioGatto
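8. ### Annotation method

Student annotators: 10 people in total, recruited via Twitter and other channels and paid an honorarium for their work
▶ A variety of fields: NLP × 4, mathematical logic × 2, mathematics × 1, physics × 1, astronomy × 1
▶ A variety of levels: high school × 1, undergraduate × 1, master's × 5, doctoral × 3
https://wtsnjp.com/annotator.html

Method
▶ The annotation targets are math identifiers, e.g. x, θ, sin
▶ Papers were chosen at the annotators' discretion (some were assigned)
▶ Annotation guidelines were provided: https://github.com/wtsnjp/MioGatto/wiki/Annotator’s-Guide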
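9. ### Annotation results

The formula grounding dataset

| Paper | Field | Words | Identifier types | Occurrences | Dictionary entries | Avg. candidates | Sources |
|---|---|---|---|---|---|---|---|
| 1 | ML | 10976 | 40 | 937 | 104 | 6.4 | 232 |
| 2 | NLP | 4267 | 42 | 266 | 73 | 2.6 | 30 |
| 3 | NLP | 3563 | 38 | 433 | 79 | 2.5 | 34 |
| 4 | Logic | 3567 | 46 | 1648 | 64 | 1.9 | 30 |
| 5 | Algebra | 13154 | 141 | 4629 | 424 | 5.2 | 180 |
| 6 | NLP | 2881 | 25 | 162 | 30 | 2.7 | 12 |
| 7 | NLP | 5543 | 31 | 203 | 47 | 2.6 | 36 |
| 8 | NLP | 4613 | 23 | 217 | 27 | 1.1 | 28 |
| 9 | NLP | 6255 | 34 | 510 | 74 | 2.7 | 27 |
| 10 | NLP | 5415 | 73 | 1175 | 167 | 3.3 | 60 |
| 11 | NLP | 4451 | 33 | 237 | 61 | 2.9 | 34 |
| 12 | NLP | 4261 | 31 | 186 | 39 | 1.7 | 25 |
| 13 | NLP | 2257 | 23 | 124 | 27 | 1.2 | 18 |
| 14 | Astronomy | 10032 | 59 | 1064 | 129 | 4.2 | 97 |
| 15 | Astronomy | 4863 | 41 | 561 | 73 | 2.3 | 95 |
| Total | - | 86098 | 680 | 12352 | 1418 | - | 938 |

https://sigmathling.kwarc.info/resources/grounding-dataset/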
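10. ### Data analysis ①: inter-annotator agreement

Number of sources and inter-annotator agreement (measured against annotator A)

| Annotator | A | B | C | D | E |
|---|---|---|---|---|---|
| Agreement (%) | - | 96.5 | 87.4 | 92.1 | 84.2 |
| κ (*) | - | 0.94 | 0.80 | 0.87 | 0.75 |
| Number of sources | 232 | - | - | 249 | 257 |
| ◦ Overlap (%) | - | - | - | 80.3 | 93.4 |

(*) Weighted average of the κ values computed per identifier (for reference only)

▶ Five annotators independently annotated paper 1
  ▶ Math concepts: all five annotators
  ▶ Sources of grounding: annotators A, D, and E only
▶ Agreement and κ for math concepts are sufficiently high
▶ The span positions recognized as sources also agree well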
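As a rough illustration of how the agreement figures on this slide can be computed, the sketch below implements the raw agreement rate, Cohen's κ for two annotators, and a per-identifier weighted average of κ. The function names and the choice to weight each identifier by its number of occurrences are assumptions made for this example; the slides do not state the exact weighting.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of occurrences on which two annotators chose the same concept."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = agreement_rate(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


def weighted_mean_kappa(per_identifier):
    """per_identifier maps an identifier to (labels_a, labels_b).

    Assumption: each identifier is weighted by its number of occurrences.
    """
    total = sum(len(a) for a, _ in per_identifier.values())
    return sum(len(a) * cohen_kappa(a, b) for a, b in per_identifier.values()) / total
```

For example, `cohen_kappa(["f", "f", "v", "v"], ["f", "f", "v", "f"])` has an observed agreement of 0.75 and a chance agreement of 0.5, which gives κ = 0.5.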
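11. ### Data analysis ②: scope switching

[Figure: per-identifier scope timelines across sections. Paper 1 (§1 to §7): D, E, L, N, T, maximize, p, q, t, w, x, z, θ, φ. Paper 15 (§1 to §5): E, HS, IS, LS, N, R, S, T, V, W, j, k, l, H, r.]

Scope switching: the meaning of an identifier changes within a document
▶ 89.5% of scope switches occur within a single section
▶ After switching once, an identifier may later return to its earlier scope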
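The following sketch shows one way to count scope switches and the share that stay within a single section. It assumes each occurrence is recorded as an `(identifier, section, concept)` tuple in reading order, and it treats a switch as "within a section" when the occurrences just before and just after the switch share a section. Both the record layout and that criterion are assumptions for illustration, not the procedure behind the 89.5% figure.

```python
def scope_switches(occurrences):
    """Return (identifier, section_before, section_after) for every switch.

    occurrences: list of (identifier, section, concept) in document order.
    A switch is a pair of consecutive occurrences of the same identifier
    whose annotated concepts differ.
    """
    last_seen = {}
    switches = []
    for identifier, section, concept in occurrences:
        if identifier in last_seen:
            prev_section, prev_concept = last_seen[identifier]
            if prev_concept != concept:
                switches.append((identifier, prev_section, section))
        last_seen[identifier] = (section, concept)
    return switches


def same_section_share(switches):
    """Fraction of switches whose two occurrences lie in the same section."""
    if not switches:
        return 0.0
    return sum(before == after for _, before, after in switches) / len(switches)


occurrences = [
    ("y", 1, "function"),
    ("y", 1, "output vector"),
    ("y", 1, "function"),        # returns to the earlier scope
    ("x", 1, "image"),
    ("y", 2, "random variable"),
]
switches = scope_switches(occurrences)
```

In this toy input, `y` switches three times, and two of the three switches happen inside section 1.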
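12. ### Data analysis ③: sources of grounding

Example of a source
When a new digit image x is fed to this function, an output vector y, encoded in the same way as the target vectors, is produced. (PRML, p. 2)

Analysis of the 938 collected sources
▶ 76.5% precede the identifier
▶ The distance between an identifier and its source averages 14.7 words
  ◦ The median is 0 to 4 words
Typically, the source is the appositive noun immediately before the identifier.

[Figure: position of the source relative to the identifier: before 718, after 220.]
[Figure: histogram of the distance between identifier and source, in words, with bins 0, 1, 2, <10, <100, >=100.]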
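A small sketch of the summary statistics on this slide. The record layout (a `"before"`/`"after"` position plus a distance in words) and the toy data are assumptions; only the counts 718 and 220 come from the slide. The toy data also illustrates why a mean can sit far above the median: a few distant sources pull the mean up while most sources sit right next to the identifier.

```python
from statistics import mean, median


def summarize_sources(records):
    """records: list of (position, distance_in_words), position in {"before", "after"}."""
    if not records:
        raise ValueError("records must not be empty")
    distances = [distance for _, distance in records]
    before = sum(1 for position, _ in records if position == "before")
    return {
        "share_before": before / len(records),
        "mean_distance": mean(distances),
        "median_distance": median(distances),
    }


# Share of sources that precede the identifier, from the counts on the slide.
share_before_percent = round(100 * 718 / (718 + 220), 1)

# Toy data (hypothetical): three nearby sources and one distant one.
toy = summarize_sources([("before", 0), ("before", 1), ("before", 2), ("after", 57)])
```

With the slide's counts, `share_before_percent` reproduces the reported 76.5%.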
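13. ### Future work

Reducing the annotation cost
▶ Having several people annotate the same paper is laborious
  → Agreement has not been computed for all of the target papers
▶ There are too few papers to compare across fields
▶ Mathematics and physics papers contain too many formulae
  → Annotating everything by hand is unrealistic; completing the dictionary takes priority
▶ Logic papers in particular use unusual notation
  → Numerals and operators are ambiguous as well; identifiers alone are not enough

Open research questions
▶ Comparing annotation by authors with annotation by readers
▶ Can people from outside the field ground formulae correctly?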
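14. ### Plan for automating grounding

Automation in three steps
1. Identify and extract the in-document sources of grounding
  ◦ Pattern matching plus part-of-speech analysis (appositive nouns)
2. Generate a "dictionary" by clustering the in-document sources
  ◦ Apply a short text clustering method [Jiaming+, 2017]
3. Associate each formula token in the document with a "dictionary" entry
  ◦ Pattern matching plus part-of-speech analysis plus a classification model

[Figure: pipeline of source extraction → "dictionary" generation → association, iterated and improved; the proposed dataset is used for evaluation and is to be extended.]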
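To make step 1 more concrete, here is a minimal sketch of pattern-based extraction of candidate sources. It assumes English text with inline math delimited by `$...$`, and it takes the run of up to three words directly before each formula as the candidate appositive noun phrase. It substitutes a small stopword list for real part-of-speech analysis, so it is a simplification and not the pipeline proposed in the slides; `STOPWORDS`, `candidate_sources`, and `max_words` are names invented for this example.

```python
import re

# Assumption: a tiny stopword list stands in for part-of-speech analysis.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "and",
             "with", "as", "by", "for", "that", "this"}
MATH = re.compile(r"\$([^$]+)\$")       # inline math such as $y(x)$
WORD = re.compile(r"[A-Za-z][A-Za-z-]*")


def candidate_sources(text, max_words=3):
    """Pair each inline formula with the noun-like words right before it."""
    results = []
    previous_end = 0
    for match in MATH.finditer(text):
        # Only look at the text between the previous formula and this one.
        segment = text[previous_end:match.start()]
        previous_end = match.end()
        phrase = []
        for token in reversed(segment.split()):
            # Stop at punctuation-bearing tokens and at stopwords.
            if not WORD.fullmatch(token) or token.lower() in STOPWORDS:
                break
            phrase.append(token)
            if len(phrase) == max_words:
                break
        if phrase:
            results.append((match.group(1), " ".join(reversed(phrase))))
    return results


text = "Feeding a new image $x$ into the function yields the output vector $y$."
pairs = candidate_sources(text)
```

On this sentence the sketch pairs `x` with "new image" and `y` with "output vector". A real system would need part-of-speech information to avoid picking up verbs or adjectives that happen to precede a formula.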
15. ### References

▶ Akiko Aizawa, et al. “NTCIR-10 Math Pilot Task Overview.” In Proceedings of NTCIR-10 (2013).
▶ Maria Alexeeva, et al. “MathAlign: Linking Formula Identifiers to their Contextual Natural Language Descriptions.” In Proceedings of LREC 2020.
▶ Takuto Asakura, et al. “Towards Grounding of Formulae.” In Proceedings of SDP 2020.
▶ Takuto Asakura, et al. “MioGatto: A Math Identifier-oriented Grounding Annotation Tool.” In 13th MathUI Workshop at 14th Conference on Intelligent Computer Mathematics (MathUI 2021).
▶ Christopher M. Bishop. Pattern Recognition and Machine Learning (2006).
▶ Jiaming Xu, et al. “Self-taught convolutional neural networks for short text clustering.” Neural Networks 88 (2017).
▶ Michael Kohlhase and Mihnea Iancu. “Co-representing structure and meaning of mathematical documents” (2014).