
Resolving Within-Document Ambiguity of Math Identifiers / nlp2024

Watson
March 12, 2024


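Resolving the within-document ambiguity of math identifiers is important for understanding the mathematical expressions that appear in natural-language text. Some progress has been made on disambiguating math identifiers across documents, but within-document ambiguity has remained insufficiently studied. This work clarifies what kinds of information are needed for within-document disambiguation. We conclude that position data within the document and the local structure of the formula around the identifier are particularly effective. The multilayer perceptron model we built resolves within-document ambiguity with accuracy close to that of human annotators (agreement 85%, kappa 0.73). We also confirmed that the important types of information do not depend on the scientific field of the target document.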


Transcript

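  1. Resolving Within-Document Ambiguity of Math Identifiers

    Math grounding: the three components of math grounding
    ▶ 1. Enumerating the math concepts that appear in the target document (cf. definition extraction)
    ▶ 2. Assigning a math concept to each math token occurrence (this work)
    ▶ 3. Linking math concepts to external knowledge (cf. MathIR)
    Example text: "The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase." (p. 2, PRML)
    Math concepts: • function y(·) • output vector y
    Diagram labels: enumeration of math concepts; assignment of concepts; linking to external knowledge (External knowledge: Concept 1, Concept 2)
    2 / 20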
  2. Target of this work: within-document ambiguity of math identifiers

    Example text: "The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase." (p. 2, PRML)
    Math concepts: • function y(·) • output vector y
    Diagram labels: enumeration of math concepts; assignment of concepts; linking to external knowledge (External knowledge: Concept 1, Concept 2)
    ▶ Recognized as a barrier in applied tasks such as P2C conversion and MathIR
    ▶ Not covered by existing work, because data resources are lacking → this work starts by building the data resource
    3 / 20
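  3. Task overview

    Input
    ▶ Structured document data (XHTML)
    ▶ The position of the first occurrence tied to each math concept → corresponds to about 10% of the gold labels
    Output
    ▶ The math concept tied to each occurrence → the goal is to predict the remaining 90% of the labels
    Task difficulty
    ▶ Cascade baseline: assumes that the scope switches only at first-occurrence positions → Kappa 0.6431
    ▶ Human annotators → Kappa 0.7939
    Scores (bar chart, score axis from 0.0 to 1.0)
    ▶ random baseline: Accuracy 0.5894, Kappa 0.0099
    ▶ mode baseline: Accuracy 0.6329, Kappa 0.0000
    ▶ cascade baseline: Accuracy 0.8312, Kappa 0.6431
    ▶ our model: Accuracy 0.8534, Kappa 0.7330
    ▶ human: Accuracy 0.9515, Kappa 0.7939
    5 / 20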
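The two scores reported on this slide can be sketched as follows. This is a minimal illustration that assumes "Kappa" means Cohen's kappa computed over per-occurrence concept labels; the slides do not state the exact variant, and the function names `accuracy` and `cohen_kappa` are chosen here for illustration only.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of occurrences whose predicted concept equals the gold concept."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (here: accuracy) and p_e is the agreement
    expected by chance from the two label distributions.
    """
    n = len(gold)
    p_o = accuracy(gold, pred)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    p_e = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    if p_e == 1.0:
        # Both sides use one identical label everywhere; kappa is undefined.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (not from the dataset): a predictor that always outputs "A".
gold = ["A", "A", "B", "B"]
always_a = ["A", "A", "A", "A"]
print(accuracy(gold, always_a), cohen_kappa(gold, always_a))
```

Under this definition, a predictor that always outputs a single label has chance agreement equal to its accuracy, so its kappa is zero even when its accuracy is well above zero. That matches the pattern of the mode baseline in the list above.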
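  4. Dataset

    Publication: LREC 2022 + α
    ▶ The dataset is split by the field of the papers
    ▶ NLP subset: used for model development and evaluation
    ▶ Other subset: used for evaluation only (astronomy ×8, CS ×5, economics ×3, mathematics ×2, biology ×1, physics ×1)
    ▶ Training / validation / evaluation data are split at the paper level
    ▶ The evaluation data was set aside first (4 of the NLP papers and all of the other papers)
    ▶ The development data is used efficiently through LOOCV
    Dataset size
    ▶ NLP: 20 papers, 97,045 body-text words, 789 identifier types, 9,278 occurrences, 1,518 dictionary entries
    ▶ Other: 20 papers, 140,017 body-text words, 953 identifier types, 18,377 occurrences, 2,085 dictionary entries
    ▶ Total: 40 papers, 237,062 body-text words, 1,742 identifier types, 27,655 occurrences, 3,603 dictionary entries
    7 / 20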
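A paper-level leave-one-out split could look like the sketch below. It assumes that LOOCV is run over whole papers (the slide only says that the data is split at the paper level and that LOOCV is used for the development data); the paper IDs and the function name `leave_one_paper_out` are hypothetical.

```python
def leave_one_paper_out(paper_ids):
    """Yield (training_papers, held_out_paper) pairs, one fold per paper.

    Splitting by paper keeps every occurrence from a given paper on the same
    side of the split.
    """
    papers = sorted(set(paper_ids))
    for held_out in papers:
        training = [p for p in papers if p != held_out]
        yield training, held_out


# Hypothetical IDs: 20 NLP papers minus the 4 set aside for evaluation.
dev_papers = [f"nlp-{i:02d}" for i in range(1, 17)]
folds = list(leave_one_paper_out(dev_papers))
print(len(folds), len(folds[0][0]))
```

With 16 development papers this gives 16 folds, each training on 15 papers and validating on the remaining one.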
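  5. Feature engineering

    c: context
    ▶ The text before and after the occurrence (more precisely, the formula that contains it)
    Example: feature vector $v’ {x}$ extracted from
    ▶ Vectorized with Sentence Transformer [Reimers+, 2019]; MiniLM was the most effective, while window size and the formula representation had little effect
    a: affix type
    ▶ The local formula structure around the occurrence (example: whether a subscript is present)
    ▶ Estimated with rules → accuracy 90.56%
    p: position data
    ▶ Whether the Cascade effect applies, and the relative distance from the first-occurrence position → strong even on its own; consistent with the Cascade baseline
    9 / 20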
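One possible reading of the Cascade idea is sketched below: each occurrence of an identifier inherits the concept introduced by the most recent first-occurrence position of that identifier. This is an interpretation of the slides, not the authors' implementation; the function name, the data layout, and the toy input (loosely modeled on the PRML excerpt) are assumptions.

```python
def cascade_baseline(occurrences, first_occurrence):
    """Assign concepts under the assumption that scope switches only at
    first-occurrence positions.

    occurrences: list of (position, identifier) pairs in document order.
    first_occurrence: dict mapping (position, identifier) to the concept
        that is introduced at that position.
    Returns one concept per occurrence (None if no concept was introduced yet).
    """
    current = {}  # identifier -> concept currently in scope
    labels = []
    for position, identifier in occurrences:
        key = (position, identifier)
        if key in first_occurrence:
            current[identifier] = first_occurrence[key]
        labels.append(current.get(identifier))
    return labels


# Toy input: y is introduced as a function, later as an output vector,
# and the final y refers back to the function.
occurrences = [(0, "y"), (1, "x"), (2, "y"), (3, "y")]
first_occurrence = {
    (0, "y"): "function y",
    (1, "x"): "input image x",
    (2, "y"): "output vector y",
}
print(cascade_baseline(occurrences, first_occurrence))
```

In the toy input the last `y` is labeled "output vector y" even though it refers back to the function. A rule of this kind cannot recover from such a switch back to an earlier concept.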
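  6. Model comparison ①: overview

    ▶ MLPs are trained and evaluated on combinations of the three features
    ▶ The number of model variants is $2^3 - 1 = 7$
    ▶ Each variant is written by marking whether c (context), a (affix type), and p (position data) are used
    Example: c+ / a+ / p−: the model that uses context and affix type
    10 / 20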
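The count of seven variants comes from taking every subset of the three features except the empty one. A small sketch that enumerates them in the slide's notation follows; the function name `model_variants` is hypothetical, and the code prints an ASCII `-` where the slide uses `−`.

```python
from itertools import product

FEATURES = ("c", "a", "p")  # context, affix type, position data


def model_variants(features=FEATURES):
    """List every non-empty feature combination, e.g. 'c+ / a+ / p-'."""
    variants = []
    for mask in product((True, False), repeat=len(features)):
        if not any(mask):
            continue  # a model with no input features is excluded
        parts = [f"{name}{'+' if used else '-'}" for name, used in zip(features, mask)]
        variants.append(" / ".join(parts))
    return variants


for variant in model_variants():
    print(variant)
```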
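  7. Automating concept assignment: summary

    Key points
    ▶ Aimed at resolving the within-document ambiguity of math identifiers
    ▶ Trained multilayer perceptrons and identified the important types of information
    ▶ The developed model performs above the baselines and below humans
    Findings
    ▶ Which types of information matter for resolving within-document ambiguity? → Position data and affix type are important
    ▶ Do the effective types of information depend on the field of the target paper? → The tendency above does not depend on the field
    19 / 20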
  8. References

    ▶ Takuto Asakura, et al. “Towards Grounding of Formulae.” In Proceedings of SDP 2020.
    ▶ Takuto Asakura, et al. “MioGatto: A Math Identifier-oriented Grounding Annotation Tool.” In Proceedings of MathUI 2021.
    ▶ Takuto Asakura, et al. “Building Dataset for Grounding of Formulae — Annotating Coreference Relations Among Math Identifiers.” In Proceedings of LREC 2022.
    ▶ Ron Ausbrooks, et al. “Mathematical Markup Language (MathML) 3.0 Specification.” World Wide Web Consortium (W3C) Recommendation (2014).
    ▶ Christopher M. Bishop. Pattern Recognition and Machine Learning (2006).
    ▶ Viet Lai, et al. “SemEval 2022 Task 12: Symlink—Linking Mathematical Symbols to their Descriptions.” In Proceedings of SemEval-2022.
    ▶ Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” In Proceedings of EMNLP 2019.
    20 / 20