
Dataset Creation for Grounding of Formulae / SCIDOCA2020

Watson
November 15, 2020

Transcript

  1. Dataset Creation for Grounding of Formulae
     Takuto Asakura, André Greiner-Petter, Akiko Aizawa, Yusuke Miyao
     SCIDOCA2020, 2020-11-15
  2. Targets: STEM Documents
     The targets of our work are Science, Technology, Engineering, and Mathematics (STEM) documents.
     Examples: papers, textbooks, manuals, etc.
     STEM documents are:
       the essence of human knowledge
       well-organized (semi-structured) texts with mathematical formulae
  3. Long-term Goal: Documents → Computational Conversion
     STEM Documents (Natural Language + Formulae): papers, textbooks, manuals, etc.
       ↓ Conversion
     Computational Form (Formal Language): executable code, first-order logic, etc.
     The conversion enables us to:
       construct databases of mathematical knowledge
       search for formulae
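     As a toy illustration of the kind of "computational form" meant here (this example is not from the deck; the formula and function name are chosen only for illustration), a sentence such as "the kinetic energy of a mass m moving at speed v is mv²/2" could end up as executable code:

     ```python
     # Toy illustration (not from the deck): a natural-language statement
     # converted into an executable, formal counterpart.
     def kinetic_energy(m: float, v: float) -> float:
         """Kinetic energy E = m * v**2 / 2."""
         return 0.5 * m * v ** 2

     print(kinetic_energy(m=2.0, v=3.0))   # -> 9.0
     ```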
  4. Necessity of Synthetic Analysis
     Importance of formulae in STEM documents:
       Mathematical expressions are commonly used in scientific communication in numerous fields, e.g. mathematics, physics, informatics, etc.
       They often express key ideas in STEM documents. [Kohlhase+, 2014]
     Interaction among texts and formulae — texts and formulae are complementary to each other:
       Texts explain formulae (and vice versa)
       Texts in formulae, e.g. {n ∈ N | n is prime}
       Notations and verbalizations, e.g. 1 + 2 and “one plus two”
     Integration of NLP and formulae analysis is required!
  5. Short-term Goal: Token-level Analysis
     Layer         | Existing tasks/work
     Application   | MathIR [Koprucki+, 2016], formulae search [Davila+, 2017], conversion to formal representations [Kohlhase+, 2014]
     Formulae      | Semantic analysis
     Subexpression | Syntactic analysis, MOI analysis [Greiner-Petter+, 2020]
     Token         | This project, Part-of-Math tagging [Youssef+, 2017]
  6. Grounding of Formulae
     1. Finding groups of tokens (math words) which refer to mathematical concepts
     2. Associating a corresponding mathematical concept to each group
     cf. entity linking + co-reference analysis in NLP
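     A minimal sketch (not the authors' schema) of how the two steps above could be represented as data — token groups ("math words") linked to entries of a concept dictionary; all identifiers and concept IDs below are hypothetical:

     ```python
     from dataclasses import dataclass

     @dataclass
     class MathWord:
         token_ids: list[int]   # indices of the formula tokens forming one group (step 1)
         concept_id: str        # key into a dictionary of mathematical concepts (step 2)

     # Hypothetical concept dictionary and two grounded groups.
     concepts = {
         "rv:y": "a vector of random variables",
         "func:y(x)": "a function which takes an image as input",
     }
     annotations = [
         MathWord(token_ids=[12], concept_id="func:y(x)"),
         MathWord(token_ids=[57], concept_id="rv:y"),
     ]
     for word in annotations:
         print(word.token_ids, "->", concepts[word.concept_id])
     ```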
  7. Difficulty of the Grounding
     Various ambiguities similar to natural languages [Kohlhase+, 2014]:
       A symbol (token) can be used with several meanings
       Syntactic ambiguity, e.g. f(a + b)
       Formulae cannot be understood without reading the surrounding texts
       Common sense and domain knowledge may be required, e.g. π is Archimedes’ constant
     Usage of the character y in the first chapter of PRML (except exercises):
       Text fragment from PRML Chap. 1                        | Meaning of y
       “. . . can be expressed as a function y(x) . . .”      | a function which takes an image as input
       “. . . an output vector y, encoded in . . .”           | an output vector of function y(x)
       “. . . two vectors of random variables x and y . . .”  | a vector of random variables
       “Suppose we have a joint distribution p(x, y) . . .”   | a part of pairs of values, corresponding to x
  8. Related Work
     [Aizawa+, 2013] NTCIR-10 Math Pilot Task
       Annotating a description for each token in formulae
       Hard to use this dataset directly for the conversion
     [Stathopoulos+, 2018] Variable Typing
       Assigning a mathematical type to each token, e.g. set, monoid, etc.
       A sort of subtask of our grounding
       Only targets simple formulae that consist of single tokens
     [Youssef+, 2017] Part-of-Math Tagging
       Tagging akin to part-of-speech tagging in NLP, e.g. indexes, functions, left-delimiter, etc.
       A sort of subtask of our grounding
  9. Dataset
     arXMLiv: papers from arXiv in XML format [Ginev+, 2009]
       converted from LaTeX via LaTeXML
       formulae are in MathML markup
     (Conversion pipeline: arXiv.org → LaTeXML → XHTML/XML)
  10. A Little Note for MathML
      MathML is a W3C Recommendation [Ausbrooks+, 2014] and includes two markups: presentation and content.
      Presentation Markup — this shows syntax, here for (a + b)²:
        <msup>
          <mfenced>
            <mi>a</mi> <mo>+</mo> <mi>b</mi>
          </mfenced>
          <mn>2</mn>
        </msup>
      Content Markup — this shows semantics:
        <apply>
          <power/>
          <apply>
            <plus/>
            <ci>a</ci> <ci>b</ci>
          </apply>
          <cn>2</cn>
        </apply>
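      A minimal sketch (not part of the deck) of why content markup is the "semantic" one: the fragment above can be evaluated directly with Python's standard library. It assumes un-namespaced MathML as printed on the slide; real arXMLiv files use the MathML namespace.

      ```python
      import xml.etree.ElementTree as ET

      CONTENT_MATHML = """
      <apply>
        <power/>
        <apply>
          <plus/>
          <ci>a</ci>
          <ci>b</ci>
        </apply>
        <cn>2</cn>
      </apply>
      """

      OPERATORS = {
          "plus": lambda *args: sum(args),
          "power": lambda base, exp: base ** exp,
      }

      def evaluate(node, env):
          """Recursively evaluate a tiny subset of content MathML."""
          if node.tag == "apply":
              op = node[0].tag                              # first child names the operator
              args = [evaluate(child, env) for child in node[1:]]
              return OPERATORS[op](*args)
          if node.tag == "ci":                              # identifier: look up its value
              return env[node.text.strip()]
          if node.tag == "cn":                              # numeric literal
              return float(node.text.strip())
          raise ValueError(f"unsupported element: {node.tag}")

      tree = ET.fromstring(CONTENT_MATHML)
      print(evaluate(tree, {"a": 1.0, "b": 2.0}))           # (1 + 2) ** 2 = 9.0
      ```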
  11. Pilot Annotation
      We annotated all identifiers in an academic paper.
      Identifiers:
        a type of token; <mi> in Presentation MathML
        a letter or a string, e.g. x, y, θ, sin, etc.
      Target paper, chosen from the arXMLiv dataset:
        “A Very Brief Introduction to Machine Learning With Applications to Communication Systems” [Simeone, 2018]
      Basic statistics of the target paper:
        #words in texts: 10,616   #<mi> tags:    937
        #sections:       7        #inline math:  331
        #pages (in PDF): 20       #display math: 23
      Let me show you a demonstration!
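      A sketch of how such counts can be gathered from an arXMLiv document (the file name is hypothetical; it assumes the LaTeXML output uses the standard MathML namespace and marks display equations with display="block"):

      ```python
      import xml.etree.ElementTree as ET

      MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

      root = ET.parse("simeone2018.xhtml").getroot()   # hypothetical arXMLiv file

      mi_count = sum(1 for _ in root.iter(MATHML_NS + "mi"))
      math_elements = list(root.iter(MATHML_NS + "math"))
      display = [m for m in math_elements if m.get("display") == "block"]

      print("#<mi> tags:    ", mi_count)
      print("#inline math:  ", len(math_elements) - len(display))
      print("#display math: ", len(display))
      ```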
  12. Inter-annotator Agreements
      Three persons worked on the annotation:
        Annotator 1 created the dictionary and did the annotation
        Annotators 2 & 3 only performed the annotation part, with the dictionary created by Annotator 1
      The agreements:
                      Agreements         | Affix mismatches
        Annotator 2   904/937 (96.48%)   | 2/33 (6.06%)
        Annotator 3   824/937 (87.94%)   | 60/113 (53.10%)
      Example (affix mismatch): Annotator 1 says p is a part of p(· | ·, ·) [a parameterized true distribution], but another says it is a part of p(· | ·) [a parameterized true distribution].
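      A minimal sketch (not the authors' evaluation script) of the exact-match agreement used above: two annotators' concept assignments over the same identifier occurrences are compared position by position; the labels below are hypothetical concept IDs.

      ```python
      def agreement(gold: list[str], other: list[str]) -> float:
          """Fraction of identifier occurrences annotated with the same concept."""
          assert len(gold) == len(other)
          matches = sum(g == o for g, o in zip(gold, other))
          return matches / len(gold)

      # Toy example with three occurrences of the same symbol.
      print(agreement(["func:y(x)", "rv:y", "rv:y"],
                      ["func:y(x)", "rv:y", "vec:y"]))   # -> 0.666...
      ```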
  13. Disagreement Analyses
      Disagreements have cascading effects, because the meanings of identifiers depend on those of other identifiers:
        disagreements between Annotators 1 & 2 are mostly due to a single disagreement for D
        some declarations are not clear enough
        the 113 disagreements between Annotators 1 & 3 can be categorized into 40 patterns
      “we are given a training set D of N training points (x_n, t_n), with n = 1, . . . , N, where the variables x_n are the inputs” [Simeone, 2018]
      “Under this assumption, the data set D is not necessary, since the mapping between input and output is fully described by the distribution p(x, t).” [Simeone, 2018]
  14. The Number of Mathematical Concepts
      [Bar chart: entries (identifiers) — x, D, E, maximize, t, z, KL, argmax, argmin, exp, ln, max, min, L, ℓ, N — vs. #items (mathematical objects), 0–12]
        max of #items:                13
        median of #items:             1
        mean of #items:               2.6
        standard deviation of #items: 2.7
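      A sketch of how such summary statistics can be derived from the annotation dictionary (the per-identifier counts below are toy values, not the real ones; only the four summary statistics are reported in the deck):

      ```python
      import statistics

      # identifier -> number of distinct mathematical objects it refers to (toy values)
      items_per_identifier = {"x": 13, "D": 2, "t": 4, "KL": 1, "exp": 1}

      counts = list(items_per_identifier.values())
      print("max:   ", max(counts))
      print("median:", statistics.median(counts))
      print("mean:  ", round(statistics.mean(counts), 1))
      print("stdev: ", round(statistics.stdev(counts), 1))
      ```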
  15. Notable Phenomena
      Linguistic phenomena:
        Ambiguity: a type of identifier is used with several meanings (referring to different mathematical objects)
        Scopes: there are loose scopes
        Meta-declarations for notation usages; there are sentences like the following:
          “Throughout, we use Roman font to denote random variables and the corresponding letter in regular font for realizations.” [Simeone, 2018]
  16. Summary and Future Direction
      Summary:
        Integration of NLP and formulae analysis is crucial
        We propose a new task, grounding of formulae:
          associating each math token with a mathematical concept
          it is like a combination of entity linking and co-reference analysis
        We started creating a dataset with our special annotation tool
      Future Direction:
        Automating the grounding process
        Enlarging the dataset with semi-automatic methods
      Thanks for your time! Questions?