
Dataset Creation for Grounding of Formulae / SCIDOCA2020

Watson
November 15, 2020

Transcript

  1. Dataset Creation for Grounding of Formulae
     Takuto Asakura, André Greiner-Petter, Akiko Aizawa, Yusuke Miyao
     SCIDOCA2020, 2020-11-15
  2. Targets: STEM Documents
     The targets of our work are Science, Technology, Engineering, and Mathematics (STEM) documents.
     Examples: papers, textbooks, manuals, etc.
     STEM documents are:
       the essence of human knowledge
       well-organized (semi-structured) texts with mathematical formulae
  3. Long-term Goal: Documents → Computational Conversion
     STEM Documents (Natural Language + Formulae): papers, textbooks, manuals, etc.
       ↓ Conversion
     Computational Form (Formal Language): executable code, first-order logic, etc.
     The conversion enables us to:
       construct databases of mathematical knowledge
       search for formulae
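     As a toy illustration of the kind of "computational form" meant here (this example is not from the deck; the formula and function name are chosen only for illustration), a sentence such as "the kinetic energy of a mass m moving at speed v is mv²/2" could end up as executable code:

     ```python
     # Toy illustration (not from the deck): a natural-language statement
     # converted into an executable, formal counterpart.
     def kinetic_energy(m: float, v: float) -> float:
         """Kinetic energy E = m * v**2 / 2."""
         return 0.5 * m * v ** 2

     print(kinetic_energy(m=2.0, v=3.0))   # -> 9.0
     ```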
  4. Necessity of Synthetic Analysis
     Importance of formulae in STEM documents:
       Mathematical expressions are commonly used in scientific communication in numerous fields, e.g. mathematics, physics, informatics, etc.
       They often express key ideas in STEM documents. [Kohlhase+, 2014]
     Interaction among texts and formulae — texts and formulae are complementary to each other:
       Texts explain formulae (and vice versa)
       Texts in formulae, e.g. {n ∈ N | n is prime}
       Notations and verbalizations, e.g. 1 + 2 and “one plus two”
     Integration of NLP and formulae analysis is required!
  5. Short-term Goal: Token-level Analysis
     Layer         | Existing tasks/work
     Application   | MathIR [Koprucki+, 2016], formulae search [Davila+, 2017], conversion to formal representations [Kohlhase+, 2014]
     Formulae      | Semantic analysis
     Subexpression | Syntactic analysis, MOI analysis [Greiner-Petter+, 2020]
     Token         | This project, Part-of-Math tagging [Youssef+, 2017]
  6. Grounding of Formulae
     1. Finding groups of tokens (math words) which refer to mathematical concepts
     2. Associating a corresponding mathematical concept to each group
     cf. entity linking + co-reference analysis in NLP
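     A minimal sketch (not the authors' schema) of how the two steps above could be represented as data — token groups ("math words") linked to entries of a concept dictionary; all identifiers and concept IDs below are hypothetical:

     ```python
     from dataclasses import dataclass

     @dataclass
     class MathWord:
         token_ids: list[int]   # indices of the formula tokens forming one group (step 1)
         concept_id: str        # key into a dictionary of mathematical concepts (step 2)

     # Hypothetical concept dictionary and two grounded groups.
     concepts = {
         "rv:y": "a vector of random variables",
         "func:y(x)": "a function which takes an image as input",
     }
     annotations = [
         MathWord(token_ids=[12], concept_id="func:y(x)"),
         MathWord(token_ids=[57], concept_id="rv:y"),
     ]
     for word in annotations:
         print(word.token_ids, "->", concepts[word.concept_id])
     ```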
  7. Difficulty of the Grounding
     Various ambiguities similar to natural languages [Kohlhase+, 2014]:
       A symbol (token) can be used with several meanings
       Syntactic ambiguity, e.g. f(a + b)
       Formulae cannot be understood without reading the surrounding texts
       Common sense and domain knowledge may be required, e.g. π is Archimedes’ constant
     Usage of the character y in the first chapter of PRML (except exercises):
       Text fragment from PRML Chap. 1                        | Meaning of y
       “. . . can be expressed as a function y(x) . . .”      | a function which takes an image as input
       “. . . an output vector y, encoded in . . .”           | an output vector of function y(x)
       “. . . two vectors of random variables x and y . . .”  | a vector of random variables
       “Suppose we have a joint distribution p(x, y) . . .”   | a part of pairs of values, corresponding to x
  8. Related Work
     [Aizawa+, 2013] NTCIR-10 Math Pilot Task
       Annotating a description for each token in formulae
       Hard to use this dataset directly for the conversion
     [Stathopoulos+, 2018] Variable Typing
       Assigning a mathematical type to each token, e.g. set, monoid, etc.
       A sort of subtask of our grounding
       Only targets simple formulae that consist of single tokens
     [Youssef+, 2017] Part-of-Math Tagging
       Tagging akin to part-of-speech tagging in NLP, e.g. indexes, functions, left-delimiter, etc.
       A sort of subtask of our grounding
  9. Dataset
     arXMLiv: papers from arXiv in XML format [Ginev+, 2009]
       converted from LaTeX via LaTeXML
       formulae are in MathML markup
     (Conversion pipeline: arXiv.org → LaTeXML → XHTML/XML)
  10. A Little Note for MathML
      MathML is a W3C Recommendation [Ausbrooks+, 2014] and includes two markups: presentation and content.
      Presentation Markup — this shows syntax, here for (a + b)²:
        <msup>
          <mfenced>
            <mi>a</mi> <mo>+</mo> <mi>b</mi>
          </mfenced>
          <mn>2</mn>
        </msup>
      Content Markup — this shows semantics:
        <apply>
          <power/>
          <apply>
            <plus/>
            <ci>a</ci> <ci>b</ci>
          </apply>
          <cn>2</cn>
        </apply>
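      A minimal sketch (not part of the deck) of why content markup is the "semantic" one: the fragment above can be evaluated directly with Python's standard library. It assumes un-namespaced MathML as printed on the slide; real arXMLiv files use the MathML namespace.

      ```python
      import xml.etree.ElementTree as ET

      CONTENT_MATHML = """
      <apply>
        <power/>
        <apply>
          <plus/>
          <ci>a</ci>
          <ci>b</ci>
        </apply>
        <cn>2</cn>
      </apply>
      """

      OPERATORS = {
          "plus": lambda *args: sum(args),
          "power": lambda base, exp: base ** exp,
      }

      def evaluate(node, env):
          """Recursively evaluate a tiny subset of content MathML."""
          if node.tag == "apply":
              op = node[0].tag                              # first child names the operator
              args = [evaluate(child, env) for child in node[1:]]
              return OPERATORS[op](*args)
          if node.tag == "ci":                              # identifier: look up its value
              return env[node.text.strip()]
          if node.tag == "cn":                              # numeric literal
              return float(node.text.strip())
          raise ValueError(f"unsupported element: {node.tag}")

      tree = ET.fromstring(CONTENT_MATHML)
      print(evaluate(tree, {"a": 1.0, "b": 2.0}))           # (1 + 2) ** 2 = 9.0
      ```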
  11. Pilot Annotation
      We annotated all identifiers in an academic paper.
      Identifiers:
        a type of token; <mi> in Presentation MathML
        a letter or a string, e.g. x, y, θ, sin, etc.
      Target paper, chosen from the arXMLiv dataset:
        “A Very Brief Introduction to Machine Learning With Applications to Communication Systems” [Simeone, 2018]
      Basic statistics of the target paper:
        #words in texts: 10,616   #<mi> tags:    937
        #sections:       7        #inline math:  331
        #pages (in PDF): 20       #display math: 23
      Let me show you a demonstration!
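      A sketch of how such counts can be gathered from an arXMLiv document (the file name is hypothetical; it assumes the LaTeXML output uses the standard MathML namespace and marks display equations with display="block"):

      ```python
      import xml.etree.ElementTree as ET

      MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

      root = ET.parse("simeone2018.xhtml").getroot()   # hypothetical arXMLiv file

      mi_count = sum(1 for _ in root.iter(MATHML_NS + "mi"))
      math_elements = list(root.iter(MATHML_NS + "math"))
      display = [m for m in math_elements if m.get("display") == "block"]

      print("#<mi> tags:    ", mi_count)
      print("#inline math:  ", len(math_elements) - len(display))
      print("#display math: ", len(display))
      ```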
  12. Inter-annotator Agreements
      Three persons worked on the annotation:
        Annotator 1 created the dictionary and did the annotation
        Annotators 2 & 3 only performed the annotation part, with the dictionary created by Annotator 1
      The agreements:
                      Agreements         | Affix mismatches
        Annotator 2   904/937 (96.48%)   | 2/33 (6.06%)
        Annotator 3   824/937 (87.94%)   | 60/113 (53.10%)
      Example (affix mismatch): Annotator 1 says p is a part of p(· | ·, ·) [a parameterized true distribution], but another says it is a part of p(· | ·) [a parameterized true distribution].
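      A minimal sketch (not the authors' evaluation script) of the exact-match agreement used above: two annotators' concept assignments over the same identifier occurrences are compared position by position; the labels below are hypothetical concept IDs.

      ```python
      def agreement(gold: list[str], other: list[str]) -> float:
          """Fraction of identifier occurrences annotated with the same concept."""
          assert len(gold) == len(other)
          matches = sum(g == o for g, o in zip(gold, other))
          return matches / len(gold)

      # Toy example with three occurrences of the same symbol.
      print(agreement(["func:y(x)", "rv:y", "rv:y"],
                      ["func:y(x)", "rv:y", "vec:y"]))   # -> 0.666...
      ```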
  13. Disagreement Analyses
      Disagreements have cascading effects, because the meanings of identifiers depend on those of other identifiers:
        disagreements between Annotators 1 & 2 are mostly due to a single disagreement for D
        some declarations are not clear enough
        the 113 disagreements between Annotators 1 & 3 can be categorized into 40 patterns
      “we are given a training set D of N training points (x_n, t_n), with n = 1, . . . , N, where the variables x_n are the inputs” [Simeone, 2018]
      “Under this assumption, the data set D is not necessary, since the mapping between input and output is fully described by the distribution p(x, t).” [Simeone, 2018]
  14. The Number of Mathematical Concepts
      [Bar chart: entries (identifiers) — x, D, E, maximize, t, z, KL, argmax, argmin, exp, ln, max, min, L, ℓ, N — vs. #items (mathematical objects), 0–12]
        max of #items:                13
        median of #items:             1
        mean of #items:               2.6
        standard deviation of #items: 2.7
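      A sketch of how such summary statistics can be derived from the annotation dictionary (the per-identifier counts below are toy values, not the real ones; only the four summary statistics are reported in the deck):

      ```python
      import statistics

      # identifier -> number of distinct mathematical objects it refers to (toy values)
      items_per_identifier = {"x": 13, "D": 2, "t": 4, "KL": 1, "exp": 1}

      counts = list(items_per_identifier.values())
      print("max:   ", max(counts))
      print("median:", statistics.median(counts))
      print("mean:  ", round(statistics.mean(counts), 1))
      print("stdev: ", round(statistics.stdev(counts), 1))
      ```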
  15. Notable Phenomena
      Linguistic phenomena:
        Ambiguity: a type of identifier is used with several meanings (referring to different mathematical objects)
        Scopes: there are loose scopes
        Meta-declarations for notation usages; there are sentences like the following:
          “Throughout, we use Roman font to denote random variables and the corresponding letter in regular font for realizations.” [Simeone, 2018]
  16. Summary and Future Direction
      Summary:
        Integration of NLP and formulae analysis is crucial
        We propose a new task, grounding of formulae:
          associating each math token with a mathematical concept
          it is like a combination of entity linking and co-reference analysis
        We started creating a dataset with our special annotation tool
      Future Direction:
        Automating the grounding process
        Enlarging the dataset with semi-automatic methods
      Thanks for your time! Questions?