
MioGatto: A Math Identifier-oriented Grounding Annotation Tool / mathui2021

Watson
July 30, 2021

We present MioGatto, a new annotation tool for efficiently building large corpora for grounding math formulae. In documents in science, technology, engineering, and mathematics, a math identifier can carry multiple meanings within a single document, so corpora with annotated coreference relations between identifiers are crucial for the grounding task. Using MioGatto, annotators can produce a list of math concepts for each document, associate one of those concepts with each occurrence of a math identifier, and annotate the text span that is the source of the grounding. Manual annotation of coreference relations is generally laborious, but because this tool is specialized for building grounding corpora, it supports this annotation more efficiently than existing general-purpose annotation tools. The tool is available at https://github.com/wtsnjp/MioGatto.

Transcript

  1. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Takuto Asakura (UTokyo), Yusuke Miyao (UTokyo), Akiko Aizawa (NII),
    and Michael Kohlhase (FAU)
    MathUI Workshop 2021
    2021-07-30

  2. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    The Annotation Tool — MioGatto
    Math Identifier-oriented Grounding Annotation Tool
    A novel annotation tool for math formulae grounding
    Its aim is to build large corpora for Math Language Processing (MLP)
    It can perform two types of annotations:
    1. Math concept for each math identifier occurrence
    2. Sources of grounding, i.e., definitions and declarations
    It is open source! (MIT License)
    https://github.com/wtsnjp/MioGatto

  3. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Background: Ambiguities in Math Formulae
    Various ambiguities similar to natural languages [Kohlhase+, 2014]
    A symbol (token) can be used in several meanings
    Syntactic ambiguity E.g. f(a + b)
    Formulae cannot be understood without reading surrounding texts
    Common sense and domain knowledge may be required
    E.g. π is Archimedes’ constant
    Usage of character y in the first chapter of PRML (except exercises)
    Text fragment from PRML Chap. 1 | Meaning of y
    . . . can be expressed as a function y(x) . . . | a function which takes an image as input
    . . . an output vector y, encoded in . . . | an output vector of function y(x)
    . . . two vectors of random variables x and y . . . | a vector of random variables
    Suppose we have a joint distribution p(x, y) . . . | a part of pairs of values, corresponding to x

  4. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Grounding of Formulae [Asakura+, 2020]
    1. Finding groups of tokens which refer to mathematical concepts
    E.g. , α, cos, , =, ×, etc.
    2. Associating a corresponding mathematical concept to each group

  5. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Grounding of Formulae [Asakura+, 2020]
    ≈ Description Alignment + Coreference Resolution
    A task to associate description for each math identifier occurrence
    There is some existing work [Aizawa+ 2013, Alexeeva+ 2020, etc.]

  6. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Grounding of Formulae [Asakura+, 2020]
    ≈ Description Alignment + Coreference Resolution
    Coreference in Natural Languages
    Bob told Alice that he wants to study NLP. [Coreference: "Bob" and "he"]
    Coreference in Formulae
    The result of running the machine learning algorithm can be expressed as a function
    y(x) which takes a new digit image x as input and that generates an output vector y,
    encoded in the same way as the target vectors. The precise form of the function y(x)
    is determined during the training phase (PRML, p. 2)
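The coreference view of the PRML passage above can be sketched in a few lines of Python. This is only an illustration: the occurrence list and the concept labels ("function", "output vector") are assumptions read off the passage, not data or code from MioGatto.

```python
from collections import defaultdict

# Occurrences of the identifier y in the passage, in reading order.
# The concept labels are illustrative assumptions, not MioGatto output.
occurrences = [
    ("y(x)", "function"),
    ("y", "output vector"),
    ("y(x)", "function"),
]


def coreference_chains(occs):
    """Group occurrence indices by the concept they are grounded to."""
    chains = defaultdict(list)
    for index, (_surface, concept) in enumerate(occs):
        chains[concept].append(index)
    return dict(chains)


chains = coreference_chains(occurrences)
print(chains)
```

Occurrences that share a concept form one coreference chain, so the first and third occurrences of y end up together while the output vector y stands alone.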

  7. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Prior Results (1) Pilot Annotation [Asakura+, 2020]
    Annotating all occurrences of math identifiers in an academic paper
    * Identifiers: variables, functions, and constants E.g. y, θ, sin
    A suitable paper was taken from arXiv
    A Very Brief Introduction to Machine Learning With Applications
    to Communication Systems [Simeone, 2018]
    Basic statistics of the target paper
    #words in texts | 10,616 | # tags | 937
    #sections | 7 | #inline math | 331
    #pages (in PDF) | 20 | #display math | 23
    All existing data are distributed to SIGMathLing members (please join!)
    https://sigmathling.kwarc.info/resources/grounding-dataset/

  8. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Prior Results (2) Occurrences and Concepts [Asakura+, 2020]

  9. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Source of Grounding
    Bases of grounding of formulae inside or outside documents:
    inner-document: Surrounding texts, formulae E.g. apposition noun, $\overset{\mathrm{def}}{=}$
    outer-document: Common sense, domain knowledge E.g. Wikidata

  10. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Definition, Declaration, and Registration
    Sources of grounding are normally one of the following.
    Definition and Declaration
    [Figure: annotated example with the labels definition, declaration, and definiendum]
    Registration (Others)
    Throughout, we use Roman font to denote random variables and the corresponding
    letter in regular font for realizations. [Simeone, 2018]

  11. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    How MioGatto Works
    Input and Output
    Input: XHTML, i.e. LaTeX documents converted by LaTeXML [Miller, 2018]
    → Files in the arXMLiv dataset [Ginev, 2020] satisfy this requirement
    Output: Annotation data in JSON format∗
    Brief Procedure for Annotators
    1. Create items in the math concept dictionary
    2. Associate one of the items for each identifier occurrence
    3. Register text spans for the grounding sources
    Let me show you a demonstration!
    ∗Please refer to https://github.com/wtsnjp/MioGatto/wiki for the detailed spec.
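The three-step procedure for annotators can be pictured with a small data model. The sketch below is a minimal illustration under stated assumptions: the class names, field names, occurrence ids, and character offsets are all hypothetical and do not reproduce MioGatto's actual JSON format, which is specified on the wiki referenced above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Concept:
    description: str
    arity: int  # 0 for a plain variable, 1 for a unary function, ...


@dataclass
class Annotation:
    # identifier -> list of candidate concepts (the math concept dictionary)
    concepts: dict = field(default_factory=dict)
    # occurrence id -> (identifier, index into that identifier's concept list)
    occurrences: dict = field(default_factory=dict)
    # occurrence id -> list of (start, end) character spans of grounding sources
    sources: dict = field(default_factory=dict)

    def add_concept(self, identifier, description, arity=0):
        """Step 1: create an item in the concept dictionary."""
        entries = self.concepts.setdefault(identifier, [])
        entries.append(Concept(description, arity))
        return len(entries) - 1

    def assign(self, occurrence_id, identifier, concept_index):
        """Step 2: associate one dictionary item with an occurrence."""
        entries = self.concepts.get(identifier, [])
        if not 0 <= concept_index < len(entries):
            raise ValueError("unknown concept for identifier " + identifier)
        self.occurrences[occurrence_id] = (identifier, concept_index)

    def add_source(self, occurrence_id, start, end):
        """Step 3: register a text span as a grounding source."""
        self.sources.setdefault(occurrence_id, []).append((start, end))

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


ann = Annotation()
func = ann.add_concept("y", "function which takes an image as input", arity=1)
vec = ann.add_concept("y", "output vector of function y(x)")
ann.assign("occ-1", "y", func)
ann.assign("occ-2", "y", vec)
ann.add_source("occ-1", 70, 82)  # hypothetical character offsets
print(ann.to_json())
```

Keeping the dictionary per identifier means the same surface form y can point to different concepts at different occurrences, which is the situation the PRML example illustrates.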

  12. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Variety of Annotation Tools in NLP
    General Tools (not specific for NLP)
    Some commercial tools provide basic annotation functionality:
    Adobe Acrobat can add free-text notes and highlight text in PDFs
    hypothes.is can do something similar for web pages
    Annotation Tools for NLP
    A number of efforts have been made.
    Examples
    brat [Stenetorp+, 2012] is a highly functional tool that is well known in NLP
    WebAnno [Yimam+, 2013] has extensive features for collaboration
    PDFAnno [Shindo+, 2018] can annotate PDFs directly
    SACR [Oberle, 2018] is specialized for coreference relations

  13. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Comparison with Other Tools for MLP
    KAT: KWARC Annotation Tool [Ginev+, 2015]
    A web-based annotation tool for STEM documents
    Annotating attributes for the OMDoc format [Kohlhase 2006]
    Input: HTML5, Output: Annotation expressed in RDF
    Currently, not actively maintained
    AnnoMathTeX [Scharpf+, 2019 & 2021]
    An annotation recommender system for math identifiers
    Document-global annotations unless a ‘local’ option is specified
    Input: Wikitext or LaTeX, Output: Annotation expressed in JSON
    MioGatto made some additions including:
    extra information on math concepts E.g. math type and arity
    annotation of grounding sources

  14. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Current Status: Working on the Actual Annotation
    Our Team
    We are currently working with 8 part-time annotators
    They are mostly graduate students:
    Four in Natural Language Processing
    Two in Logic
    One in Mathematics
    One in Physics
    Annotation
    Math concepts
    Sources of grounding

  15. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Future Plan (1) Automating the Grounding
    Three Steps of Automation
    1. Detecting/Retrieving inner-document sources of grounding
    → Pattern matching + POS tagging
    2. ‘Dictionary’ generation by clustering the sources
    → Short text clustering [Xu+, 2017] may be applicable
    3. Associating each occurrence with the entry in the ‘dictionary’
    → Pattern matching + POS tagging + text classification
    [Figure: cycle of Source Detection, Dictionary Generation, and Associating (Repeat & Improvement); other labels: Proposing Dataset, Enhancement, Evaluation]
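As a rough picture of what the pattern-matching part of step 1 might look like, the sketch below finds apposition-style sources such as "a function y(x)" with a single regular expression. It is an assumption-laden toy: the head-noun list and the pattern are hypothetical, POS tagging is not covered, and nothing here is taken from the authors' planned implementation.

```python
import re

# Hypothetical head nouns that often introduce an identifier.
# This list is an assumption for the sketch, not taken from the talk.
HEAD_NOUNS = ("function", "vector", "image", "constant")

# An optional modifier word, a head noun, then a one-letter identifier
# that may carry a single argument, e.g. "a function y(x)".
PATTERN = re.compile(
    r"\b((?:\w+ )?(?:%s)) ([A-Za-z](?:\([A-Za-z]\))?)(?![A-Za-z])"
    % "|".join(HEAD_NOUNS)
)


def detect_sources(text):
    """Return (identifier, description, start, end) for each pattern match."""
    return [
        (m.group(2), m.group(1), m.start(1), m.end(1))
        for m in PATTERN.finditer(text)
    ]


TEXT = (
    "The result of running the machine learning algorithm can be expressed "
    "as a function y(x) which takes a new digit image x as input and that "
    "generates an output vector y, encoded in the same way as the target vectors."
)

results = detect_sources(TEXT)
for identifier, description, start, end in results:
    print(identifier, "<-", description, (start, end))
```

A pattern this simple will misfire on ordinary prose (for example, an article directly after a head noun), which is one reason the slide pairs pattern matching with POS tagging.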

  16. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    Future Plan (2) Enhancement for MioGatto
    Review Mode
    Clearly show discrepancies between annotators
    Enable comments on annotations for discussion
    Other enhancements
    Output format standardization
    More improvements on the UI for efficient annotation
    MioGatto is open source!
    You are welcome to use it, request new features,
    and send patches for improvements.
    https://github.com/wtsnjp/MioGatto
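The review-mode idea on this slide, showing discrepancies between annotators, could start from a comparison like the one below. It is a minimal sketch assuming each annotator's work is reduced to a mapping from occurrence id to concept label; the ids, labels, and the simple agreement ratio are illustrative assumptions, not a MioGatto feature.

```python
def discrepancies(ann_a, ann_b):
    """Return occurrences whose concept differs, or is missing, between annotators."""
    diffs = {}
    for occ_id in sorted(ann_a.keys() | ann_b.keys()):
        a, b = ann_a.get(occ_id), ann_b.get(occ_id)
        if a != b:
            diffs[occ_id] = (a, b)
    return diffs


# Hypothetical annotations: occurrence id -> chosen concept label.
annotator_1 = {"occ-1": "function", "occ-2": "output vector", "occ-3": "function"}
annotator_2 = {"occ-1": "function", "occ-2": "random variable"}

diffs = discrepancies(annotator_1, annotator_2)
total = len(annotator_1.keys() | annotator_2.keys())
agreement = 1 - len(diffs) / total  # raw agreement, not chance-corrected

for occ_id, (a, b) in diffs.items():
    print(occ_id, ":", a, "vs", b)
print("raw agreement:", round(agreement, 3))
```

An occurrence annotated by only one person is reported with `None` on the other side, so missing annotations surface alongside genuine disagreements.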

  17. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    References (1)
    Akiko Aizawa, Michael Kohlhase, and Iadh Ounis. “NTCIR-10 Math Pilot Task Overview.” In
    Proceedings of NTCIR-10 (2013).
    Maria Alexeeva et al. “MathAlign: Linking Formula Identifiers to their Contextual Natural
    Language Descriptions.” In Proceedings of LREC 2020.
    Takuto Asakura et al. “Towards Grounding of Formulae.” In Proceedings of SDP2020.
    Christopher M Bishop. Pattern Recognition and Machine Learning (2006).
    Deyan Ginev et al. “KAT: an annotation tool for STEM documents”. In Proceedings of MathUI
    Workshop 2015.
    Deyan Ginev. arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org. SIGMathLing (2020).
    Jiaming Xu et al. “Self-taught convolutional neural networks for short text clustering.”
    Neural Networks 88 (2017).
    Michael Kohlhase and Mihnea Iancu. “Co-representing structure and meaning of
    mathematical documents” (2014).
    Bruce Miller. LaTeXML The Manual: A LaTeX to XML/HTML/MathML Converter (2018).
    Bruno Oberle. “SACR: A Drag-and-Drop Based Tool for Coreference Annotation.” In
    Proceedings of LREC 2018.

  18. MioGatto: A Math Identifier-oriented Grounding Annotation Tool
    References (2)
    Hiroyuki Shindo, Yohei Munesada, and Yuji Matsumoto. “PDFAnno: a web-based linguistic
    annotation tool for pdf documents.” In Proceedings of LREC 2018.
    Philipp Scharpf et al. “AnnoMathTeX — a Formula Identifier Annotation Recommender System
    for STEM Documents”. In Proceedings RecSys 2019.
    Philipp Scharpf et al. “Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles
    Using Annotation Recommendation”. In Proceedings WWW 2021.
    Osvaldo Simeone. “A very brief introduction to machine learning with applications to
    communication systems.” IEEE Transactions on Cognitive Communications and Networking
    (2018).
    Pontus Stenetorp et al. “brat: a Web-based Tool for NLP-Assisted Text Annotation.” In
    Proceedings of EACL 2012.
    Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann.
    “WebAnno: A flexible, web-based and visually supported system for distributed annotations.”
    In Proceedings of ACL 2013.