Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language / cicm2019

Watson
July 08, 2019

Converting science, technology, engineering and mathematics documents to formal expressions is beneficial. Achieving that conversion requires analyzing formulae and text jointly, including how they interact. We have begun to tackle the conversion from two foundational parts of this synthetic analysis. In this abstract, we briefly introduce our aim, planned approaches, and the current status of the work.

Transcript

  1. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Understanding Scientific Documents
    with Synthetic Analysis on Mathematical
    Expressions and Natural Language
    Takuto ASAKURA
    SOKENDAI / Miyao Group at UTokyo
    2019-07-08

  2. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    About Me
    Takuto ASAKURA (a.k.a. wtsnjp)
    A graduate student at SOKENDAI
    A member of Miyao Group, UTokyo
    Supervisors:
    Prof. Yusuke Miyao
    Prof. Akiko Aizawa
    I studied bioinformatics at UTokyo for my bachelor's degree
    I’m also a heavy TeX user
    A member of the TeX Live Team
    maintaining Texdoc, a documentation search tool
    support for Japanese
    A contributor to the LaTeX3 Project

  3. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Targets: STEM Documents
    The targets of our work are Science, Technology,
    Engineering, and Mathematics (STEM) documents.
    Example
    Papers,
    Textbooks, and
    Manuals, etc.
    STEM documents are:
    essence of human knowledge
    well organized (semi-structured)
    texts with mathematical expressions

  4. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Long-term Goal: Converting STEM Documents
    to Formal Expressions
    STEM Documents (Natural Language + Formulae)
    Papers, textbooks, manuals, etc.
    Conversion
    Computational Form (Formal Language)
    Executable code, first-order logic, etc.
    The conversion enables us to:
    construct databases of mathematical knowledge
    search for formulae

  5. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Necessity of Synthetic Analysis
    Importance of formulae in STEM documents
    Mathematical expressions are commonly used in
    scientific communication in numerous fields.
    E.g. Mathematics, Physics, Informatics, etc.
    They often express key ideas in STEM documents.
    Interaction among texts and formulae
    Texts and formulae are complementary to each other
    [Kohlhase and Iancu, 2015]:
    Texts explain formulae (and vice versa)
    Texts in formulae, e.g. {x ∈ N | x is prime}
    Notations and verbalizations,
    e.g. 1 + 2 and “one plus two”
    Deep synthetic analyses on natural language and
    mathematical expressions are necessary.


  6. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Short-term Goals
    At first, we focus on the token-level analysis in formulae:
    an initial step of the conversion
    almost untouched
    Examples (tokens)
    , ε, ×, log, , etc.
    We will work on both algorithm and theory:
    1. Automatically associating formula tokens to
    mathematical objects
    2. Discussing morphology and lexical semantics for
    mathematical expressions
    [Figure: levels of analysis for mathematical expressions and natural language: token-level (grounding); word-level (morphology, lexical semantics); fragment-level (parsing); phrase-level (syntax, PSG); formulae-level (SP); sentence-level (semantics); applications (conversion, IR, searching, etc.)]

  7. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Grounding Tokens to Mathematical Objects
    Tokens in formulae and their combination can refer
    to mathematical objects
    The detection is fundamental for understanding
    STEM documents
    Example
    For example, x might describe the outcome of flipping a coin, with x = 1
    representing ‘heads’, and x = 0 representing ‘tails’. We can imagine that
    this is a damaged coin so that the probability of landing heads is not
    necessarily the same as that of landing tails. The probability of x = 1 will
    be denoted by the parameter μ. The probability distribution over x can
    therefore be written in the form
    $\mathrm{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}$
    which is known as the Bernoulli distribution. (PRML, pp. 86–87)
    Annotations:
    x: the result of coin flipping, int, x ∈ {0, 1}
    μ: the probability of ‘heads’ on top, float, 0 ≤ μ ≤ 1
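The grounding in this example can be sketched in Python, one of the target languages the deck mentions for code generation. This is only an illustration: the `MathObject` record, its field names, and the key `mu` are assumptions made here, not part of the presented work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MathObject:
    """A hypothetical record for one grounded mathematical object."""
    description: str
    type_name: str                      # e.g. "int" or "float"
    condition: Callable[[float], bool]  # constraint the value must satisfy


# Grounding for the two tokens annotated in the coin-flip example.
grounding = {
    "x": MathObject("the result of coin flipping", "int",
                    lambda v: v in (0, 1)),
    "mu": MathObject("the probability of 'heads' on top", "float",
                     lambda v: 0.0 <= v <= 1.0),
}


def bern(x: int, mu: float) -> float:
    """Bern(x | mu) = mu**x * (1 - mu)**(1 - x), after checking both conditions."""
    if not grounding["x"].condition(x):
        raise ValueError("x must be 0 or 1")
    if not grounding["mu"].condition(mu):
        raise ValueError("mu must lie in [0, 1]")
    return mu ** x * (1 - mu) ** (1 - x)


print(bern(1, 0.3), bern(0, 0.3))
```

Once each token has a type and a condition, the formula itself becomes checkable code, which is the kind of conversion the long-term goal describes.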

  8. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Difficulty of the Grounding
    Factors which make the detection highly challenging:
    ambiguity of tokens (see below)
    syntactic ambiguity of formulae, e.g. f(a + b)
    necessity for common sense & domain knowledge
    severe abbreviation
    Usage of character y in the first chapter of PRML (except exercises)
    Text fragment from PRML Chap. 1 | Meaning of y
    . . . can be expressed as a function y(x) . . . | a function which takes an image as input
    . . . an output vector y, encoded in . . . | an output vector of function y(x)
    . . . two vectors of random variables x and y . . . | a vector of random variables
    Suppose we have a joint distribution p(x, y) . . . | a part of pairs of values, corresponding to x

  9. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Related Work
    [Aizawa+, 2013]
    NTCIR-10 Math Pilot Task
    annotating a description for each token in formulae
    an object can be described in several ways
    → difficult to make the annotation coherent
    [Stathopoulos+, 2018]
    Variable Typing
    assigning a mathematical type for each token
    E.g. set, monoid, etc.
    a kind of subtask of our grounding

  10. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    A Simplified Task
    What are mathematical objects?
    Each mathematical object should have a
    description and some attributes including:
    mathematical type
    condition E.g. larger than 0
    Necessary and sufficient attributes are still unclear
    → We will see after some experiments. . .
    Clustering for tokens
    Giving the same label to tokens which refer to the same
    mathematical object is easier (cf. coreference in NLP).
    The result of running the machine learning algorithm can be expressed
    as a function y(x) which takes a new digit image x as input and that
    generates an output vector y, encoded in the same way as the target
    vectors. The precise form of the function y(x) is determined during the
    training phase, also known as the learning phase, on the basis of the
    training data. (PRML, p. 2)
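A minimal sketch of this simplified task on the passage above: each token occurrence gets a label, and occurrences sharing a label form one cluster. The occurrence list, the label names, and the helper `cluster_by_label` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical annotation of the passage: (surface form, occurrence index, label).
# Note that "y" alone and "y(x)" receive different labels.
occurrences = [
    ("y(x)", 0, "function"),
    ("x", 1, "input_image"),
    ("y", 2, "output_vector"),
    ("y(x)", 3, "function"),
]


def cluster_by_label(items):
    """Group token occurrences that refer to the same mathematical object."""
    clusters = defaultdict(list)
    for surface, index, label in items:
        clusters[label].append((surface, index))
    return dict(clusters)


clusters = cluster_by_label(occurrences)
for label, members in clusters.items():
    print(label, members)
```

The labels carry no description or type yet; they only say which occurrences belong together, which is what makes this easier than full grounding.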

  11. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Morphology for Mathematical Expressions
    Morphemes and words (terms of morphology)
    morpheme: the smallest meaningful unit in a
    language
    word: a morpheme or a combination of a few
    morphemes which can refer to an object
    Example
    The word “un-break-able” comprises three morphemes.
    Words in mathematical expressions
    As a matter of fact, words also exist in formulae.
    Example
    M is a word in “Matrix M”, but M is not a word in
    “An entry $M_{i,j}$” ($M_{i,j}$ is a word).

  12. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Semantics Over Natural Language and
    Mathematical Expressions
    Some ambiguities arise only when context exists. For
    instance, equals signs (=) in formulae have at least
    three meanings: definition, identity, and equation.
    Example
    Let a = 4, b = 3. Suppose we have to solve
    $a x^4 + b x^2 + 1 = 0$.
    To reach the answer, the “difference of two squares” is helpful:
    $p^2 - q^2 = (p + q)(p - q)$.
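The three readings of the equals sign can be made explicit by working the example through. This sketch assumes the equation on the slide is $a x^4 + b x^2 + 1 = 0$ with $a = 4$ and $b = 3$; the `:=` notation for definitions is added here for clarity and is not used on the slide.

```latex
\begin{align*}
a := 4, \quad b := 3
  && \text{(definition)} \\
4x^4 + 3x^2 + 1 = (2x^2 + 1)^2 - x^2
  && \text{(identity: holds for every } x\text{)} \\
(2x^2 + 1)^2 - x^2 = (2x^2 + x + 1)(2x^2 - x + 1)
  && \text{(identity: } p^2 - q^2 = (p + q)(p - q)\text{)} \\
(2x^2 + x + 1)(2x^2 - x + 1) = 0
  && \text{(equation: holds only for the solutions } x\text{)}
\end{align*}
```

The same symbol therefore introduces values, rewrites an expression, and states a condition to be solved, and only the surrounding text tells the reader which is meant.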

  13. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Dataset arXMLiv
    papers from arXiv in XML format [Ginev+, 2009]
    converted from LaTeX via LaTeXML
    formulae are in MathML markup
    [Figure: arXiv.org → LaTeXML → XHTML/XML]

  14. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    A Little Note for MathML
    a W3C Recommendation [Ausbrooks+, 2014]
    includes two markups: presentation and content
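    Presentation Markup
    This shows syntax:
    <msup>
      <mfenced>
        <mi>a</mi>
        <mo>+</mo>
        <mi>b</mi>
      </mfenced>
      <mn>2</mn>
    </msup>
    Content Markup
    This shows semantics:
    <apply>
      <power/>
      <apply>
        <plus/>
        <ci>a</ci>
        <ci>b</ci>
      </apply>
      <cn>2</cn>
    </apply>
    Rendered: $(a + b)^2$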
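Because content markup encodes the operator tree directly, it can be read and evaluated with a few lines of standard-library Python. The sketch below assumes the content markup for $(a + b)^2$ is the `apply`/`power`/`plus` tree shown in the string; it omits the MathML namespace that real documents carry, and `to_infix` and `evaluate` are hypothetical helpers that support only these two operators.

```python
import xml.etree.ElementTree as ET

# Assumed content markup for (a + b)^2, without the MathML namespace.
CONTENT_MATHML = """
<apply>
  <power/>
  <apply>
    <plus/>
    <ci>a</ci>
    <ci>b</ci>
  </apply>
  <cn>2</cn>
</apply>
"""

OPERATORS = {"plus": "+", "power": "^"}  # only the two operators used here


def to_infix(node):
    """Render a content MathML element as a fully parenthesized infix string."""
    if node.tag in ("ci", "cn"):
        return node.text.strip()
    if node.tag == "apply":
        operator, *operands = list(node)
        symbol = OPERATORS[operator.tag]
        return "(" + f" {symbol} ".join(to_infix(o) for o in operands) + ")"
    raise ValueError(f"unsupported element: {node.tag}")


def evaluate(node, env):
    """Evaluate the expression tree with variable values taken from env."""
    if node.tag == "ci":
        return env[node.text.strip()]
    if node.tag == "cn":
        return int(node.text.strip())
    operator, *operands = list(node)
    values = [evaluate(o, env) for o in operands]
    if operator.tag == "plus":
        return sum(values)
    if operator.tag == "power":
        return values[0] ** values[1]
    raise ValueError(f"unsupported operator: {operator.tag}")


root = ET.fromstring(CONTENT_MATHML.strip())
print(to_infix(root))
print(evaluate(root, {"a": 1, "b": 2}))
```

Presentation markup cannot be evaluated this way without first deciding what each symbol means, which is the gap the grounding task is meant to close.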

  15. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Pilot Annotation
    As a first attempt, we are now doing a pilot annotation
    of 3 papers from arXMLiv in the following manner:
    1. Detecting minimal groups of tokens (i.e., words),
    each of which refers to a mathematical object.
    2. Categorizing words by the mathematical object
    they refer to.
    Example
    The result of running the machine learning algorithm can be expressed
    as a function y(x) which takes a new digit image x as input and that
    generates an output vector y, encoded in the same way as the target
    vectors. The precise form of the function y(x) is determined during the
    training phase, also known as the learning phase, on the basis of the
    training data. (PRML, p. 2)
    Let me show you a demonstration!

  16. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    What’s Next?
    Creating a dataset
    completing the annotation for ≥ 10 papers from arXiv
    I would also like to do it for some textbooks
    check for the reproducibility of the annotation
    Automating the detection
    Combination of rule-based and machine learning methods with
    features such as:
    apposition nouns, e.g. “a function f”
    syntactic information in formulae,
    e.g. does it appear inside an argument or not?
    distance from the previous appearance
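Two of the features listed above can be sketched with simple rules. This is a toy illustration under stated assumptions: the noun list `TYPE_NOUNS`, the regular expression, and both helper functions are hypothetical, and a real system would work on parsed text and MathML rather than whitespace-split strings.

```python
import re

# Hypothetical cue nouns; a real system would need a much larger lexicon.
TYPE_NOUNS = ("function", "vector", "matrix", "parameter", "variable")
APPOSITION = re.compile(r"\b(" + "|".join(TYPE_NOUNS) + r")\s+([A-Za-z])\b")


def apposition_nouns(sentence):
    """Map each single-letter token to the cue noun that directly precedes it."""
    return {m.group(2): m.group(1) for m in APPOSITION.finditer(sentence)}


def distance_from_previous(tokens, position):
    """Tokens back to the previous occurrence of tokens[position], or None."""
    for back in range(position - 1, -1, -1):
        if tokens[back] == tokens[position]:
            return position - back
    return None


sentence = "a function f maps an input vector x to an output vector y"
tokens = sentence.split()
second_vector = tokens.index("vector", 7)  # position of the second "vector"

print(apposition_nouns(sentence))
print(distance_from_previous(tokens, second_vector))
```

Features like these could feed either hand-written rules or a learned classifier, matching the combined approach the slide proposes.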

  17. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Possible Applications
    Mathematical Information Retrieval (MIR)
    → enables us to create scientific knowledge bases
    Automatic code generation E.g. Python, Coq, etc.
    Searching for mathematical expressions
    Example
    Let us think about searching for:
    $x^n + y^n = z^n$ $(n \ge 3)$.
    It is easy to search if you know the keyword “Fermat’s Last
    Theorem”, but otherwise. . .
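One simple way to search by formula rather than by keyword is to rename variables in order of first appearance, so that formulae differing only in variable names compare equal. The sketch below is a toy under stated assumptions: formulae are plain strings, every run of letters is treated as a variable (so a name like `log` would be renamed too), and `canonical_form` and the small `corpus` are hypothetical.

```python
import re


def canonical_form(formula):
    """Rename variables to v0, v1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    return re.sub(r"[A-Za-z]+", rename, formula.replace(" ", ""))


# Hypothetical mini-corpus keyed by an illustrative label.
corpus = {
    "fermat": "a^n + b^n = c^n",
    "pythagoras": "a^2 + b^2 = c^2",
}

query = "x^n + y^n = z^n"
hits = [label for label, formula in corpus.items()
        if canonical_form(formula) == canonical_form(query)]
print(hits)
```

String-level renaming ignores operator structure and side conditions such as n ≥ 3; matching on grounded mathematical objects would be needed to handle those.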

  18. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
    Today’s Conclusions
    converting STEM documents to computational form
    is beneficial and challenging
    for the conversion, synthetic analysis on natural
    language and mathematical expressions is required
    At first, we focus on token-level analyses:
    grounding tokens to mathematical objects
    discussing morphology for formulae
    Currently, we are working on the pilot annotation
    Possible applications: MIR, code generation,
    searching for formulae
    Thanks for your time! Questions?