Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language / cicm2019

Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language (1 / 18)
Takuto ASAKURA
SOKENDAI / Miyao Group at UTokyo
2019-07-08

About Me (2 / 18)
Takuto ASAKURA (a.k.a. wtsnjp)
- A graduate student at SOKENDAI
- A member of Miyao Group, UTokyo
- Supervisors: Prof. Yusuke Miyao and Prof. Akiko Aizawa
- I studied bioinformatics at UTokyo for my bachelor's degree
- I'm also a heavy TeX user
  - A member of the TeX Live Team
    - maintaining Texdoc, a documentation search tool
    - support for Japanese
  - A contributor to the LaTeX3 Project

Targets: STEM Documents (3 / 18)
The targets of our work are Science, Technology, Engineering, and Mathematics (STEM) documents.
Example: papers, textbooks, manuals, etc.
STEM documents are:
- the essence of human knowledge
- well organized (semi-structured)
- texts with mathematical expressions

Long-term Goal: Converting STEM Documents to Formal Expressions (4 / 18)
STEM documents (natural language + formulae): papers, textbooks, manuals, etc.
  → conversion →
Computational form (formal language): executable code, first-order logic, etc.
The conversion enables us to:
- construct databases of mathematical knowledge
- search for formulae

Necessity of Synthetic Analysis (5 / 18)
Importance of formulae in STEM documents
- Mathematical expressions are commonly used in scientific communication in numerous fields, e.g. mathematics, physics, informatics.
- They often express key ideas in STEM documents.
Interaction between text and formulae
Text and formulae are complementary to each other [Kohlhase and Iancu, 2015]:
- Text explains formulae (and vice versa)
- Text in formulae, e.g. {x ∈ N | x is prime}
- Notations and verbalizations, e.g. 1 + 2 and “one plus two”
Deep synthetic analyses of natural language and mathematical expressions are necessary.

Short-term Goals (6 / 18)
At first, we focus on token-level analysis in formulae:
- an initial step of the conversion
- almost untouched so far
Examples (tokens): ε, ×, log, etc.
We will work on both algorithm and theory:
1. Automatically associating formula tokens with mathematical objects
2. Discussing morphology and lexical semantics for mathematical expressions
[Figure: levels of analysis. Mathematical expressions: token-level (grounding), fragment-level (parsing), formula-level (SP). Natural language: word-level (morphology, lexical semantics), phrase-level (syntax, PSG), sentence-level (semantics). Both lead to applications (conversion, IR, searching, etc.).]

Grounding Tokens to Mathematical Objects (7 / 18)
- Tokens in formulae and their combinations can refer to mathematical objects
- Detecting these references is fundamental for understanding STEM documents
Example
  For example, x might describe the outcome of flipping a coin, with x = 1 representing ‘heads’, and x = 0 representing ‘tails’. We can imagine that this is a damaged coin so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the parameter μ. The probability distribution over x can therefore be written in the form
    Bern(x | μ) = μ^x (1 − μ)^{1−x}
  which is known as the Bernoulli distribution. (PRML, pp. 86–87)
Annotations:
- x: the result of coin flipping, int, x ∈ {0, 1}
- μ: the probability of ‘heads’ on top, float, 0 ≤ μ ≤ 1
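As a hedged illustration of where grounding could lead, the two annotations on this slide (x is an int in {0, 1}, μ is a float in [0, 1]) are already enough to write the Bernoulli formula as executable code. The function name `bernoulli_pmf` and the input checks are assumptions made for this sketch, not something shown in the talk.

```python
def bernoulli_pmf(x: int, mu: float) -> float:
    """Bern(x | mu) = mu^x * (1 - mu)^(1 - x)."""
    # Grounded attribute of x: the result of a coin flip, x in {0, 1}.
    if x not in (0, 1):
        raise ValueError("x must be 0 (tails) or 1 (heads)")
    # Grounded attribute of mu: a probability, 0 <= mu <= 1.
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu ** x * (1.0 - mu) ** (1 - x)


p_heads = bernoulli_pmf(1, 0.3)  # probability of 'heads'
p_tails = bernoulli_pmf(0, 0.3)  # probability of 'tails'
```

Without the grounded type and condition of each token, neither the signature nor the range checks could be written down.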

Difficulty of the Grounding (8 / 18)
Factors that make the detection highly challenging:
- ambiguity of tokens (see below)
- syntactic ambiguity of formulae, e.g. f(a + b)
- necessity for common sense and domain knowledge
- severe abbreviation
Usage of the character y in the first chapter of PRML (except exercises)
Text fragment from PRML Chap. 1 | Meaning of y
... can be expressed as a function y(x) ... | a function which takes an image as input
... an output vector y, encoded in ... | an output vector of function y(x)
... two vectors of random variables x and y ... | a vector of random variables
Suppose we have a joint distribution p(x, y) ... | a part of pairs of values, corresponding to x

Related Work (9 / 18)
[Aizawa+, 2013] NTCIR-10 Math Pilot Task
- annotating a description for each token in formulae
- an object can be described in several ways → difficult to make the annotation coherent
[Stathopoulos+, 2018] Variable Typing
- assigning a mathematical type to each token, e.g. set, monoid, etc.
- a sort of subtask of our grounding
Usage of the character y in the first chapter of PRML (except exercises)
Text fragment from PRML Chap. 1 | Meaning of y
... can be expressed as a function y(x) ... | a function which takes an image as input
... an output vector y, encoded in ... | an output vector of function y(x)
... two vectors of random variables x and y ... | a vector of random variables
Suppose we have a joint distribution p(x, y) ... | a part of pairs of values, corresponding to x

A Simplified Task (10 / 18)
What are mathematical objects?
Each mathematical object should have a description and some attributes, including:
- mathematical type
- condition, e.g. larger than 0
The necessary and sufficient attributes are still unclear → we will see after some experiments...
Clustering of tokens
Giving the same label to tokens that refer to the same mathematical object is easier (cf. co-reference in NLP).
  The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data. (PRML, p. 2)
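The simplified task can be pictured as a small data structure: each mathematical object carries a description plus optional attributes, and each token occurrence only needs a cluster label. This is a minimal sketch under assumptions; the class names, fields, and hand-made labels for the PRML passage are hypothetical and are not the annotation scheme used in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MathObject:
    """A mathematical object with the attributes named on the slide."""
    description: str
    math_type: str = ""  # e.g. "function"; may be unknown
    condition: str = ""  # e.g. "larger than 0"; may be unknown


@dataclass(frozen=True)
class Occurrence:
    """One appearance of a token in the text (fields are hypothetical)."""
    token: str     # surface form, e.g. "y"
    position: int  # running index of the appearance in the passage
    label: int     # cluster id shared by co-referring tokens


def cluster(occurrences):
    """Group occurrences by label, much like co-reference chains."""
    groups = defaultdict(list)
    for occ in occurrences:
        groups[occ.label].append(occ)
    return dict(groups)


# Hand-made labels for the PRML passage (illustrative only).
objects = {
    0: MathObject("function produced by the learning algorithm", "function"),
    1: MathObject("new digit image given as input"),
    2: MathObject("output vector", "vector"),
}
occurrences = [
    Occurrence("y", 0, 0),  # "a function y(x)"
    Occurrence("x", 1, 1),  # argument in "y(x)"
    Occurrence("x", 2, 1),  # "digit image x"
    Occurrence("y", 3, 2),  # "an output vector y"
    Occurrence("y", 4, 0),  # "the function y(x)"
    Occurrence("x", 5, 1),  # argument in "y(x)"
]
groups = cluster(occurrences)
```

Note that the same surface token `y` ends up in two different clusters, which is exactly the ambiguity the labels are meant to capture.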

Morphology for Mathematical Expressions (11 / 18)
Morphemes and words (terms of morphology)
- morpheme: the shortest meaningful unit in a language
- word: a morpheme or a combination of a few morphemes that can refer to an object
Example: the word “un-break-able” comprises three morphemes.
Words in mathematical expressions
As a matter of fact, words also exist in formulae.
Example: M is a word in “Matrix M”, but M is not a word in “An entry M_{i,j}” (M_{i,j} is a word).

Semantics Over Natural Language and Mathematical Expressions (12 / 18)
Some ambiguities arise only when context exists. For instance, the equals sign (=) in formulae has at least three meanings: definition, identity, and equation.
Example
  Let a = 4, b = 3. Suppose we have to solve a x^4 + b x^2 + 1 = 0. To reach the answer, the “difference of two squares” is helpful: p^2 − q^2 = (p + q)(p − q).
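The slide does not spell out the steps. One possible reading, taking the equation as a x^4 + b x^2 + 1 = 0 with a = 4 and b = 3 and choosing p = 2x^2 + 1 and q = x in the identity, is the following worked derivation:

```latex
\begin{align*}
4x^4 + 3x^2 + 1 &= (4x^4 + 4x^2 + 1) - x^2 \\
                &= (2x^2 + 1)^2 - x^2 \\
                &= (2x^2 + x + 1)(2x^2 - x + 1)
\end{align*}
% Each quadratic factor has discriminant 1 - 8 = -7, so
% x = \frac{-1 \pm i\sqrt{7}}{4} \quad\text{or}\quad x = \frac{1 \pm i\sqrt{7}}{4}.
```

Under this reading, the three uses of “=” play different roles: “a = 4” is a definition, “= 0” states an equation to be solved, and the difference-of-two-squares line is an identity that holds for all p and q.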

Dataset (13 / 18)
arXMLiv
- papers from arXiv in XML format [Ginev+, 2009]
- converted from LaTeX via LaTeXML
- formulae are in MathML markup
[Figure: arXiv.org → LaTeXML → XHTML/XML]

A Little Note on MathML (14 / 18)
- a W3C Recommendation [Ausbrooks+, 2014]
- includes two markups: presentation and content
Both examples encode (a + b)^2.
Presentation Markup (this shows syntax):
    <msup>
      <mfenced>
        <mrow>
          <mi>a</mi>
          <mo>+</mo>
          <mi>b</mi>
        </mrow>
      </mfenced>
      <mn>2</mn>
    </msup>
Content Markup (this shows semantics):
    <apply>
      <power/>
      <apply>
        <plus/>
        <ci>a</ci>
        <ci>b</ci>
      </apply>
      <cn>2</cn>
    </apply>
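To make the "content markup shows semantics" point concrete, here is a minimal sketch that evaluates the content MathML for (a + b)^2 using only Python's standard library. It handles just the three element kinds in this example (`ci`, `cn`, and `apply` with `plus` or `power`); the `evaluate` function and the namespace-free input are simplifying assumptions, since real arXMLiv documents place MathML in an XML namespace.

```python
import xml.etree.ElementTree as ET

CONTENT_MATHML = """
<apply>
  <power/>
  <apply>
    <plus/>
    <ci>a</ci>
    <ci>b</ci>
  </apply>
  <cn>2</cn>
</apply>
"""


def evaluate(node, env):
    """Recursively evaluate a tiny subset of content MathML."""
    if node.tag == "cn":  # numeric literal
        return float(node.text.strip())
    if node.tag == "ci":  # identifier, looked up in env
        return env[node.text.strip()]
    if node.tag == "apply":  # first child is the operator
        operator, *operands = list(node)
        values = [evaluate(child, env) for child in operands]
        if operator.tag == "plus":
            return sum(values)
        if operator.tag == "power":
            base, exponent = values
            return base ** exponent
        raise ValueError(f"unsupported operator: {operator.tag}")
    raise ValueError(f"unsupported element: {node.tag}")


tree = ET.fromstring(CONTENT_MATHML)
result = evaluate(tree, {"a": 1.0, "b": 2.0})  # (1 + 2)^2
```

The same walk over presentation markup would not work directly, because `<msup>` and `<mfenced>` describe layout rather than which operation is applied to which operands.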

Pilot Annotation (15 / 18)
As a first attempt, we are now doing a pilot annotation of 3 papers from arXMLiv in the following manner:
1. Detecting minimal groups of tokens (i.e., words), each of which refers to a mathematical object.
2. Categorizing words by the mathematical object they refer to.
Example
  The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data. (PRML, p. 2)
Let me show you a demonstration!

What’s Next? (16 / 18)
Creating a dataset
- completing the annotation for ≥ 10 papers from arXiv
- I would also like to do it for some textbooks
- checking the reproducibility of the annotation
Automating the detection
A combination of rule-based and machine-learning approaches, with features such as:
- apposition nouns, e.g. “a function f”
- syntactic information in formulae, e.g. does the token appear inside an argument or not?
- distance from the previous appearance
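The apposition-noun feature can be sketched with a plain regular expression: look for a type-like noun immediately followed by a single-letter token. This is only an illustration of the rule-based idea; the noun list `TYPE_NOUNS`, the pattern, and the function name are assumptions, not the method planned in the talk.

```python
import re

# Hypothetical list of nouns that tend to introduce a symbol.
TYPE_NOUNS = ("function", "vector", "matrix", "parameter", "variable")

# A type noun, whitespace, then a single Latin or Greek letter.
APPOSITION = re.compile(
    r"\b(?P<noun>" + "|".join(TYPE_NOUNS) + r")\s+(?P<token>[A-Za-zα-ω])\b"
)


def find_appositions(text):
    """Return (token, noun) pairs such as ("y", "function")."""
    return [(m.group("token"), m.group("noun")) for m in APPOSITION.finditer(text)]


sentence = (
    "The result can be expressed as a function y(x) which takes a new "
    "digit image x as input and that generates an output vector y, "
    "encoded in the same way as the target vectors."
)
pairs = find_appositions(sentence)
```

On this sentence the sketch pairs `y` once with "function" and once with "vector", which shows both why the feature is informative and why it cannot resolve the ambiguity on its own.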

Possible Applications (17 / 18)
- Mathematical Information Retrieval (MIR) → enables us to create scientific knowledge bases
- Automatic code generation, e.g. Python, Coq, etc.
- Searching for mathematical expressions
Example
  Let us think about searching for x^n + y^n = z^n (n ≥ 3). It is easy to search if you know the keyword “Fermat’s Last Theorem”, but otherwise...

Today’s Conclusions (18 / 18)
- Converting STEM documents to computational form is beneficial and challenging
- For the conversion, synthetic analysis of natural language and mathematical expressions is required
- At first, we focus on token-level analyses:
  - grounding tokens to mathematical objects
  - discussing morphology for formulae
- Currently, we are working on the pilot annotation
- Possible applications: MIR, code generation, searching for formulae
Thanks for your time! Questions?
