Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Dataset Creation for Grounding of Formulae / SCIDOCA2020

Watson
November 15, 2020

Dataset Creation for Grounding of Formulae / SCIDOCA2020

Watson

November 15, 2020
Tweet

More Decks by Watson

Other Decks in Research

Transcript

  1. Dataset Creation for Grounding of Formulae
    Dataset Creation for Grounding of Formulae
    Takuto Asakura, André Greiner-Petter, Akiko Aizawa, Yusuke Miyao
    SCIDOCA2020
    2020-11-15
    1 / 17

    View Slide

  2. Dataset Creation for Grounding of Formulae
    Targets: STEM Documents
    The targets of our work are Science, Technology, Engineering, and
    Mathematics (STEM) documents.
    Example
    Papers,
    Textbooks, and
    Manuals, etc.
    STEM documents are:
    essence of human knowledge
    well organized (semi-structured)
    texts with mathematical formulae
    2 / 17

    View Slide

  3. Dataset Creation for Grounding of Formulae
    Long-term Goal: Documents → Computational Conversion
    STEM Documents (Natural Language + Formulae)
    Papers, textbooks, manuals, etc.
    Conversion
    Computational Form (Formal Language)
    Executable code, first-order logic, etc.
    The conversion enables us to:
    construct databases of mathematical knowledge
    search for formulae
    3 / 17

    View Slide

  4. Dataset Creation for Grounding of Formulae
    Necessity of Synthetic Analysis
    Importance of formulae in STEM documents
    Mathematical expressions are commonly used in scientific
    communication in numerous fields.
    E.g. Mathematics, Physics, Informatics, etc.
    They often express key ideas in STEM documents.
    [Kohlhase+, 2014]
    Interaction among texts and formulae
    Texts and formulae are complimentary to each other:
    Texts explains formulae (and vice versa)
    Texts in formulae E.g. { ∈ N |  is prime}
    Notations and verbalizations E.g. 1 + 2 and “one plus two”
    Integration of NLP and formulae analysis is required!
    4 / 17

    View Slide

  5. Dataset Creation for Grounding of Formulae
    Short-term Goal: Token-level Analysis
    Tokens
    Layer Existing tasks/work
    Application MathIR [Koprucki+, 2016], Formulae search [Davila+, 2017],
    Conversion to formal representations [Kohlhase+, 2014]
    Formulae Semantic analysis
    Subexpression Syntactic analysis, MOI analysis [Greiner-Petter+, 2020]
    Token This project, Part-of-Math tagging [Youssef+, 2017]
    5 / 17

    View Slide

  6. Dataset Creation for Grounding of Formulae
    Grounding of Formulae
    1. Finding groups of tokens (math words) which refer to mathematical
    concepts
    2. Associating a corresponding mathematical concept to each group
    cf. Entity linking + Co-reference analysis in NLP
    6 / 17

    View Slide

  7. Dataset Creation for Grounding of Formulae
    Difficulty of the Grounding
    Various ambiguities similar to natural languages [Kohlhase+, 2014]
    A symbol (token) can be used in several meanings
    Syntactic ambiguity E.g. ƒ( + b)
    Formulae cannot be understood without reading surrounding texts
    Common sense and domain knowledge may be required
    E.g. π is Archimedes’ constant
    Usage of character y in the first chapter of PRML (except exercises)
    Text fragment from PRML Chap. 1 Meaning of y
    . . . can be expressed as a function y(x) . . . a function which takes an image as input
    . . . an output vector y, encoded in . . . an output vector of function y(x)
    . . . two vectors of random variables x and y . . . a vector of random variables
    Suppose we have a joint distribution p(x, y) . . . a part of pairs of values, corresponding to x
    7 / 17

    View Slide

  8. Dataset Creation for Grounding of Formulae
    Related Work
    [Aizawa+, 2013]
    NTCIR-10 Math Pilot Task
    Annotating a description for each token in formulae
    Hard to directly use this dataset for the conversion
    [Stathopoulos+, 2018]
    Variable Typing
    Assigning a mathematical type for each token E.g. set, monoid, etc.
    A sort of subtask for our grounding
    Only targeting simple formulae that consist of single tokens
    [Youssef+, 2017]
    Part-of-Math Tagging
    Tagging akin to Part-of-Speech tagging for NLP
    E.g. indexes, functions, left-delimiter, etc.
    A sort of subtask for our grounding
    8 / 17

    View Slide

  9. Dataset Creation for Grounding of Formulae
    Dataset arXMLiv
    papers from arXiv in XML format [Ginev+, 2009]
    converted from L
    ATEX via L
    ATEXML
    formulae are in MathML markups
    L
    A
    TEXML
    XHTML/XML
    arXiv.org
    9 / 17

    View Slide

  10. Dataset Creation for Grounding of Formulae
    A Little Note for MathML
    a W3C Recommendation [Ausbrooks+, 2014]
    includes two markups: presentation and content
    Presentation Markup
    This shows syntax:


    a
    +
    b

    2

    ( + b)2
    Content Markup
    This shows semantics:




    a
    b

    2

    10 / 17

    View Slide

  11. Dataset Creation for Grounding of Formulae
    Pilot Annotation
    We annotated all identifiers in an academic paper.
    Identifiers
    a type of token; in Presentation MathML
    a letter or a string E.g. , y, θ, sin, etc.
    Target Paper
    We chose from the arXMLiv dataset:
    A Very Brief Introduction to Machine Learning With Applications to Communi-
    cation Systems [Simeone, 2018]
    Basic statistics of the target paper
    #words in texts 10,616 # tags 937
    #sections 7 #inline math 331
    #pages (in PDF) 20 #display math 23
    Let me show you a demonstration!
    11 / 17

    View Slide

  12. Dataset Creation for Grounding of Formulae
    Inter-annotator Agreements
    3 persons worked for the annotation:
    Annotator 1 created the dictionary and did the annotation
    Annotator 2 & 3 only performed the annotation part with the
    dictionary created by the Annotator 1
    The agreements
    Agreements Affixes mismatches
    Annotator 2 904/937 (96.48%) 2/33 (6.06%)
    Annotator 3 824/937 (87.94%) 60/113 (53.10%)
    Example (Affixes mismatch)
    Annotator 1 says p is a part of p(· | ·, ·) [a parameterized true
    distribution] but another says it is a part of p(· | ·) [a parameterized true
    distribution].
    12 / 17

    View Slide

  13. Dataset Creation for Grounding of Formulae
    Disagreement Analyses
    Disagreements have cascading effects
    ∵) meanings of identifiers depend on that of others
    disagreements between Annotator 1 & 2 are mostly due to a single
    disagreement for D
    some declarations are not clear enough
    113 disagreements between Annotator 1 & 3 can be categorized
    into 40 patterns
    we are given a training set D of N training points (n, tn), with n = 1, . . . , N, where the variables
    n are the inputs [Simeone, 2018]
    Under this assumption, the data set D is not necessary, since the mapping between input and
    output is fully described by the distribution p(, t). [Simeone, 2018]
    13 / 17

    View Slide

  14. Dataset Creation for Grounding of Formulae
    The Number of Mathematical Concepts
    x
    D
    E
    maximize
    t
    z
    KL
    argmax
    argmin
    exp
    ln
    max
    min
    L

    N
    0
    2
    4
    6
    8
    10
    12
    Entries (identifiers)
    #items (mathematical objects)
    max of #items 13
    median of #items 1
    mean of #items 2.6
    standard deviation of #items 2.7
    14 / 17

    View Slide

  15. Dataset Creation for Grounding of Formulae
    Token Occurrence and Mathematical Concepts
    15 / 17

    View Slide

  16. Dataset Creation for Grounding of Formulae
    Notable Phenomena
    Linguistic Phenomena
    Ambiguity a type of identifier is used in several meanings (referring to
    mathematical objects)
    Scopes there are loose scopes
    Meta-declaration for notation usages
    There are sentences like the following:
    Throughout, we use Roman font to denote random variables and the corresponding letter in
    regular font for realizations. [Simeone, 2018]
    16 / 17

    View Slide

  17. Dataset Creation for Grounding of Formulae
    Summary and Future Direction
    Summary
    Intergration of NLP and Formulae analysis is crucial
    We propose a new task grounding of formulae
    Associating each math token to mathematical concept
    It is like a combination of entity linking and co-reference analysis
    We start creating a dataset with our special annotation tool
    Future Direction
    Automating the grounding process
    Enlarge the dataset with semi-automatic methods
    Thanks for your time! Questions?
    17 / 17

    View Slide