Slide 1

Slide 1 text

MioGatto: A Math Identifier-oriented Grounding Annotation Tool
Takuto Asakura (UTokyo), Yusuke Miyao (UTokyo), Akiko Aizawa (NII), and Michael Kohlhase (FAU)
MathUI Workshop 2021, 2021-07-30

Slide 2

Slide 2 text

The Annotation Tool: MioGatto
MioGatto: Math Identifier-oriented Grounding Annotation Tool
- A novel annotation tool for grounding math formulae
- Its aim is to build large corpora for Math Language Processing (MLP)
- It can perform two types of annotation:
  1. The math concept for each math identifier occurrence
  2. The sources of grounding, i.e., definitions and declarations
- It is open source (MIT License): https://github.com/wtsnjp/MioGatto

Slide 3

Slide 3 text

Background: Ambiguities in Math Formulae
Formulae contain various ambiguities, similar to natural languages [Kohlhase+, 2014]
- A symbol (token) can be used with several meanings
- Syntactic ambiguity, e.g. f( + b)
Formulae cannot be understood without reading the surrounding text
- Common sense and domain knowledge may be required, e.g. π is Archimedes' constant
Usage of the character y in the first chapter of PRML (excluding exercises)
Text fragment from PRML Chap. 1                        | Meaning of y
... can be expressed as a function y(x) ...            | a function which takes an image as input
... an output vector y, encoded in ...                 | an output vector of function y(x)
... two vectors of random variables x and y ...        | a vector of random variables
Suppose we have a joint distribution p(x, y) ...       | a part of pairs of values, corresponding to x

Slide 4

Slide 4 text

Grounding of Formulae [Asakura+, 2020]
1. Finding groups of tokens which refer to mathematical concepts
   E.g. , α, cos, , =, ×, etc.
2. Associating a corresponding mathematical concept with each group

Slide 5

Slide 5 text

Grounding of Formulae [Asakura+, 2020] ≈ Description Alignment + Coreference Resolution
- A task of associating a description with each math identifier occurrence
- There is some existing work [Aizawa+ 2013, Alexeeva+ 2020, etc.]

Slide 6

Slide 6 text

Grounding of Formulae [Asakura+, 2020] ≈ Description Alignment + Coreference Resolution
Coreference in Natural Languages
  Bob told Alice that he wants to study NLP.
Coreference in Formulae
  The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase (PRML, p. 2)

Slide 7

Slide 7 text

Prior Results (1): Pilot Annotation [Asakura+, 2020]
Annotating all occurrences of math identifiers in an academic paper
- Identifiers: variables, functions, and constants, e.g. , y, θ, sin
A suitable paper was taken from arXiv:
  A Very Brief Introduction to Machine Learning With Applications to Communication Systems [Simeone, 2018]
Basic statistics of the target paper
#words in texts    10,616 | # tags           937
#sections               7 | #inline math     331
#pages (in PDF)        20 | #display math     23
All existing data are distributed to SIGMathLing members (please join!)
https://sigmathling.kwarc.info/resources/grounding-dataset/

Slide 8

Slide 8 text

Prior Results (2): Occurrences and Concepts [Asakura+, 2020]

Slide 9

Slide 9 text

Source of Grounding
The bases for grounding formulae can lie inside or outside the document:
- Inner-document: surrounding text and formulae, e.g. an appositive noun, the "def =" sign
- Outer-document: common sense and domain knowledge, e.g. Wikidata

Slide 10

Slide 10 text

Definition, Declaration, and Registration
Sources of grounding are normally one of the following.
Definition and Declaration
  [Figure labels: definition, declaration, definiendum]
Registration (Others)
  "Throughout, we use Roman font to denote random variables and the corresponding letter in regular font for realizations." [Simeone, 2018]

Slide 11

Slide 11 text

How MioGatto Works
Input and Output
- Input: XHTML, i.e. LaTeX documents converted by LaTeXML [Miller, 2018]
  → Files in the arXMLiv dataset [Ginev, 2020] satisfy this requirement
- Output: annotation data in JSON format*
Brief Procedure for Annotators
1. Create items in the math concept dictionary
2. Associate one of the items with each identifier occurrence
3. Register text spans as the grounding sources
Let me show you a demonstration!
* Please refer to https://github.com/wtsnjp/MioGatto/wiki for the detailed spec.
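
The three annotator steps above can be sketched as a small data model. This is only an illustration: the class names, field names, and occurrence IDs such as "m1.mi1" are hypothetical and do not reproduce MioGatto's actual JSON schema, which is documented in the project wiki. The concept descriptions are taken from the PRML example earlier in the talk.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Concept:
    """One entry of the math concept dictionary (hypothetical layout)."""
    description: str
    arity: int = 0  # 0 for variables and constants, 1+ for functions


@dataclass
class Annotation:
    # identifier surface form -> list of candidate concepts
    concepts: dict = field(default_factory=dict)
    # occurrence id -> (identifier, index into that identifier's concept list)
    occurrences: dict = field(default_factory=dict)
    # occurrence id -> list of (start, end) character spans in the text
    sources: dict = field(default_factory=dict)

    def add_concept(self, identifier, description, arity=0):
        """Step 1: create an item in the concept dictionary."""
        items = self.concepts.setdefault(identifier, [])
        items.append(Concept(description, arity))
        return len(items) - 1

    def assign(self, occurrence_id, identifier, concept_index):
        """Step 2: associate one dictionary item with an occurrence."""
        candidates = self.concepts.get(identifier, [])
        if not 0 <= concept_index < len(candidates):
            raise ValueError("unknown concept for identifier " + identifier)
        self.occurrences[occurrence_id] = (identifier, concept_index)

    def add_source(self, occurrence_id, start, end):
        """Step 3: register a text span as a grounding source."""
        self.sources.setdefault(occurrence_id, []).append((start, end))


ann = Annotation()
f_idx = ann.add_concept("y", "a function which takes an image as input", arity=1)
v_idx = ann.add_concept("y", "an output vector of function y(x)")
ann.assign("m1.mi1", "y", f_idx)
ann.assign("m2.mi1", "y", v_idx)
ann.add_source("m1.mi1", 57, 67)  # made-up span offsets
serialized = json.dumps(asdict(ann), indent=2)
print(serialized)
```

Keeping the dictionary separate from the occurrences mirrors the procedure on the slide: the same identifier y can carry several concepts, and each occurrence simply points at one of them.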

Slide 12

Slide 12 text

Variety of Annotation Tools in NLP
General Tools (not specific to NLP)
Some commercial tools provide basic annotation functionality:
- Adobe Acrobat can add free-text notes and highlight text in PDFs
- hypothes.is can do something similar for web pages
Annotation Tools for NLP
A number of efforts have been made. Examples:
- brat [Stenetorp+, 2012] is a highly functional tool that is well known in NLP
- WebAnno [Yimam+, 2013] has extensive features for collaboration
- PDFAnno [Shindo+, 2018] can annotate PDFs directly
- SACR [Oberle, 2018] is specialized for coreference relations

Slide 13

Slide 13 text

Comparison with Other Tools for MLP
KAT: KWARC Annotation Tool [Ginev+, 2015]
- A web-based annotation tool for STEM documents
- Annotates attributes for the OMDoc format [Kohlhase 2006]
- Input: HTML5; output: annotations expressed in RDF
- Currently not actively maintained
AnnoMathTeX [Scharpf+, 2019 & 2021]
- An annotation recommender system for math identifiers
- Annotations are document-global unless a ‘local’ option is specified
- Input: Wikitext or LaTeX; output: annotations expressed in JSON
MioGatto makes some additions, including:
- extra information on math concepts, e.g. math type and arity
- annotation of grounding sources

Slide 14

Slide 14 text

Current Status: Working on the Actual Annotation
Our Team
- We are currently working with 8 part-time annotators
- They are mostly graduate students:
  four in Natural Language Processing, two in Logic, one in Mathematics, one in Physics
Annotation
- Math concepts
- Sources of grounding

Slide 15

Slide 15 text

Future Plan (1): Automating the Grounding
Three Steps of Automation
1. Detecting/retrieving inner-document sources of grounding
   → Pattern matching + POS tagging
2. ‘Dictionary’ generation by clustering the sources
   → Short text clustering [Xu+, 2017] may be applicable
3. Associating each occurrence with an entry in the ‘dictionary’
   → Pattern matching + POS tagging + text classification
[Figure labels: Source Detection, Dictionary Generation, Associating, Repeat & Improvement, Proposing Dataset, Enhancement, Evaluation]
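
A toy version of the three steps, run on the PRML passage quoted earlier, might look like the following. It is a sketch under strong assumptions rather than the planned system: the regular expression is an invented pattern, exact string matching stands in for short text clustering, a nearest-preceding-source rule stands in for text classification, and no POS tagging is used.

```python
import re
from collections import defaultdict

# Assumed pattern: "<article> <one to three lowercase words> <one-letter identifier>",
# e.g. "a function y" or "an output vector y".
PATTERN = re.compile(r"\b(?:a|an|the)\s+((?:[a-z]+\s+){0,2}[a-z]+)\s+([a-z])\b")


def detect_sources(text):
    """Step 1: return (position, identifier, description) triples."""
    return [(m.start(2), m.group(2), m.group(1)) for m in PATTERN.finditer(text)]


def build_dictionary(sources):
    """Step 2: collect distinct descriptions per identifier.

    Exact string equality replaces real clustering here.
    """
    dictionary = defaultdict(list)
    for _, ident, desc in sources:
        if desc not in dictionary[ident]:
            dictionary[ident].append(desc)
    return dict(dictionary)


def associate(text, sources, dictionary):
    """Step 3: link each occurrence to the nearest preceding source."""
    links = []
    for ident in dictionary:
        for m in re.finditer(r"\b" + re.escape(ident) + r"\b", text):
            preceding = [s for s in sources if s[1] == ident and s[0] <= m.start()]
            if preceding:  # occurrences before any source stay unlinked
                desc = preceding[-1][2]
                links.append((m.start(), ident, dictionary[ident].index(desc)))
    return sorted(links)


text = (
    "The result of running the machine learning algorithm can be expressed "
    "as a function y(x) which takes a new digit image x as input and that "
    "generates an output vector y, encoded in the same way as the target "
    "vectors. The precise form of the function y(x) is determined during "
    "the training phase"
)
sources = detect_sources(text)
dictionary = build_dictionary(sources)
links = associate(text, sources, dictionary)
y_concepts = [concept for _, ident, concept in links if ident == "y"]
print(dictionary)
print(y_concepts)
```

On this passage the three occurrences of y alternate between the "function" and "output vector" entries, which is the coreference behaviour that the annotation is meant to capture.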

Slide 16

Slide 16 text

Future Plan (2): Enhancement for MioGatto
Review Mode
- Clearly show discrepancies between annotators
- Enable commenting on annotations for discussion
Other enhancements
- Output format standardization
- More improvements to the UI for efficient annotation
MioGatto is open source! You are welcome to use it, request new features, and send patches for improvements.
https://github.com/wtsnjp/MioGatto
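
As a rough idea of what a review mode would need to compute, the sketch below compares two annotators' occurrence-to-concept assignments. The function names and the flat dictionary layout are assumptions made for illustration and are not part of MioGatto.

```python
def find_discrepancies(ann_a, ann_b):
    """Return the occurrences that the two annotators labelled differently.

    Each argument maps an occurrence id to a concept index. An occurrence
    missing from one annotator is reported with None on that side.
    """
    diffs = {}
    for occ in sorted(set(ann_a) | set(ann_b)):
        a, b = ann_a.get(occ), ann_b.get(occ)
        if a != b:
            diffs[occ] = (a, b)
    return diffs


def agreement_rate(ann_a, ann_b):
    """Fraction of occurrences on which both annotators agree."""
    total = len(set(ann_a) | set(ann_b))
    if total == 0:
        return 1.0
    return 1 - len(find_discrepancies(ann_a, ann_b)) / total


# Hypothetical occurrence ids and concept indices.
annotator_a = {"occ1": 0, "occ2": 1, "occ3": 0}
annotator_b = {"occ1": 0, "occ2": 0}
diffs = find_discrepancies(annotator_a, annotator_b)
rate = agreement_rate(annotator_a, annotator_b)
print(diffs)
print(round(rate, 3))
```

Raw agreement is the simplest possible measure; a chance-corrected statistic would be a natural next step, but that choice is outside what the slides specify.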

Slide 17

Slide 17 text

References (1)
- Akiko Aizawa, Michael Kohlhase, and Iadh Ounis. “NTCIR-10 Math Pilot Task Overview.” In Proceedings of NTCIR-10 (2013).
- Maria Alexeeva et al. “MathAlign: Linking Formula Identifiers to their Contextual Natural Language Descriptions.” In Proceedings of LREC 2020.
- Takuto Asakura et al. “Towards Grounding of Formulae.” In Proceedings of SDP 2020.
- Christopher M. Bishop. Pattern Recognition and Machine Learning (2006).
- Deyan Ginev et al. “KAT: an annotation tool for STEM documents.” In Proceedings of MathUI Workshop 2015.
- Deyan Ginev. arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org. SIGMathLing (2020).
- Jiaming Xu et al. “Self-taught convolutional neural networks for short text clustering.” Neural Networks 88 (2017).
- Michael Kohlhase and Mihnea Iancu. “Co-representing structure and meaning of mathematical documents” (2014).
- Bruce Miller. LaTeXML: The Manual. A LaTeX to XML/HTML/MathML Converter (2018).
- Bruno Oberle. “SACR: A Drag-and-Drop Based Tool for Coreference Annotation.” In Proceedings of LREC 2018.

Slide 18

Slide 18 text

References (2)
- Hiroyuki Shindo, Yohei Munesada, and Yuji Matsumoto. “PDFAnno: a web-based linguistic annotation tool for pdf documents.” In Proceedings of LREC 2018.
- Philipp Scharpf et al. “AnnoMathTeX: a Formula Identifier Annotation Recommender System for STEM Documents.” In Proceedings of RecSys 2019.
- Philipp Scharpf et al. “Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles Using Annotation Recommendation.” In Proceedings of WWW 2021.
- Osvaldo Simeone. “A very brief introduction to machine learning with applications to communication systems.” IEEE Transactions on Cognitive Communications and Networking (2018).
- Pontus Stenetorp et al. “brat: a Web-based Tool for NLP-Assisted Text Annotation.” In Proceedings of EACL 2012.
- Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann. “WebAnno: A flexible, web-based and visually supported system for distributed annotations.” In Proceedings of ACL 2013.