
Applying Conditional Random Fields to Japanese Morphological Analysis




Transcript

  1. 1
    Literature Review (2016.07.12)
    Nagaoka University of Technology, Natural Language Processing
    Nguyen Van Hai
    Applying Conditional Random Fields to Japanese
    Morphological Analysis
    Taku Kudo Kaoru Yamamoto Yuji Matsumoto
    Nara Institute of Science and Technology
    8916-5, Takayama-Cho Ikoma, Nara, 630-0192 Japan
    CREST JST, Tokyo Institute of Technology
    4259, Nagatuta Midori-Ku Yokohama, 226-8503 Japan
    The 2004 Conference on Empirical Methods in Natural Language Processing


  2. 2
    Abstract

    Japanese morphological analysis based on
    conditional random fields (CRFs).
    – Applied to settings with word boundary ambiguity.
    – Solves long-standing problems in corpus-based or
    statistical Japanese morphological analysis.

    Experiments use the same datasets as earlier HMM
    and MEMM studies.


  3. 3
    Problems

    HMMs (Asahara and Matsumoto, 2000)
    – Hard to employ overlapping features stemming from
    hierarchical tagsets, and non-independent features
    – Unknown word guessing

    MEMMs (Uchimoto et al., 2001)
    – Cannot evade either label bias or length bias
    – Tend to select easy sequences with low entropy


  4. 4
    Word Boundary Ambiguity

    A simple approach lets each character be a token
    (character-based Begin/Inside tagging)
    – Cannot directly reflect lexicons, which contain prior
    knowledge
    – A lexicon cannot be ignored, since over 90% accuracy
    can be achieved even with longest-prefix matching
    against the lexicon
    A lattice represents all candidate sequences of
    tokens
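
The lattice idea can be sketched in a few lines of Python. This is a minimal illustration, not the analyzer used in the paper: the lexicon, its POS labels, and the function names `build_lattice` and `enumerate_paths` are all hypothetical, and unknown-word handling is left out.

```python
from collections import defaultdict

# Toy lexicon (hypothetical entries and POS labels, for illustration only).
LEXICON = {
    "東": ["noun"],
    "東京": ["noun"],
    "京都": ["noun"],
    "都": ["suffix"],
    "に": ["particle"],
    "住む": ["verb"],
}


def build_lattice(sentence, lexicon):
    """Map each start offset to the lexicon entries that begin there."""
    lattice = defaultdict(list)
    max_len = max(len(word) for word in lexicon)
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            surface = sentence[start:end]
            for pos in lexicon.get(surface, []):
                lattice[start].append((end, surface, pos))
    return lattice


def enumerate_paths(lattice, length, start=0):
    """Yield every token sequence that covers the sentence exactly."""
    if start == length:
        yield []
        return
    for end, surface, pos in lattice.get(start, []):
        for rest in enumerate_paths(lattice, length, end):
            yield [(surface, pos)] + rest


sentence = "東京都に住む"
lattice = build_lattice(sentence, LEXICON)
candidates = list(enumerate_paths(lattice, len(sentence)))
for path in candidates:
    print(" / ".join(surface for surface, _ in path))
```

Each path through the lattice is one candidate segmentation with POS tags. A real analyzer would score the paths with a dynamic program instead of enumerating them.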


  5. 5
    Word Boundary Ambiguity


  6. 6
    Long-standing Problems

    Hierarchical tagset
    – Japanese POS tagsets used in the two major
    analyzers, ChaSen and JUMAN, are hierarchical
    – The top level has 15 different categories; the bottom
    level seems to be at the word level
    – Using the bottom level: data sparseness problem
    – Using only the top level: lacks the granularity to capture
    fine differences, e.g. between the suffixes san and kun
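
One way to picture how a model could use several levels of the hierarchy at once is to emit features for every prefix of a tag. The sketch below is an assumption-laden illustration: the tag tuples, the feature-name format, and the functions `tag_prefixes` and `hierarchical_features` are hypothetical and are not the feature templates from the paper.

```python
def tag_prefixes(tag):
    """Return the tag truncated at every depth, coarse to fine."""
    return ["-".join(tag[:depth]) for depth in range(1, len(tag) + 1)]


def hierarchical_features(prev_tag, cur_tag, cur_word):
    """Emit overlapping features at every level of the POS hierarchy."""
    feats = []
    # Unigram features: the current tag at each level of detail.
    for cur in tag_prefixes(cur_tag):
        feats.append(f"uni:{cur}")
    # Bigram features: every pairing of previous and current levels.
    for prev in tag_prefixes(prev_tag):
        for cur in tag_prefixes(cur_tag):
            feats.append(f"bi:{prev}/{cur}")
    # Lexicalized feature: separates words that share the same full tag.
    feats.append(f"lex:{cur_word}/{'-'.join(cur_tag)}")
    return feats


# Hypothetical tags: a proper noun followed by a name suffix such as "san".
feats = hierarchical_features(
    ("noun", "proper"), ("noun", "suffix", "person"), "san"
)
print(feats)
```

Coarse features such as `uni:noun` fire often and help with sparseness, while the fine-grained and lexicalized ones can separate words like san and kun.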


  7. 7
    Long-standing Problems

    Label bias and Length bias


  8. 8
    Conditional Random Fields

    Can use correlated features of the inputs

    Allows flexible feature designs for hierarchical
    tagsets

    Minimize the influences of the label and length
    bias.
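
For reference, the model behind these properties is usually written as below. The slide does not show the formula, so the notation is an assumption: x is the input sentence, y is a path of word/tag pairs ⟨w_i, t_i⟩ through the lattice, #y is the number of tokens in y, f_k are feature functions with weights λ_k, and Y(x) is the set of all paths in the lattice for x.

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}}
  \exp\Big( \sum_{i=1}^{\#\mathbf{y}} \sum_{k} \lambda_k
  f_k(\langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle) \Big),
\qquad
Z_{\mathbf{x}} = \sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})}
  \exp\Big( \sum_{i=1}^{\#\mathbf{y}'} \sum_{k} \lambda_k
  f_k(\langle w'_{i-1}, t'_{i-1} \rangle, \langle w'_i, t'_i \rangle) \Big)
```

Because Z_x sums over whole paths rather than over the successors of a single state, the normalization is global, which is what limits the influence of label bias and length bias.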


  9. 9
    Experiment

    We use two widely-used Japanese annotated
    corpora
    – Kyoto University Corpus ver 2.0 (KC)
    – RWCP Text Corpus (RWCP)


  10. 10
    Results

    Tables 3 and 4 show experimental results using KC
    and RWCP respectively, reporting three F-scores
    (seg/top/all) for our CRFs and for baseline bi-gram
    HMMs.

    Table 3 (KC data set) also shows the results of a variant of
    MEMMs (Uchimoto et al., 2001) and a rule-based
    analyzer (JUMAN).

    Table 4 (RWCP data set) also shows the result of
    extended HMMs (E-HMMs).
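
As a rough sketch of how the three F-scores could be computed, the code below treats a token as correct when its character span matches the gold standard and, depending on the criterion, its top-level POS or full POS matches too. The reading of seg/top/all as segmentation only, segmentation plus top-level POS, and all information is an assumption, as are the names `to_spans` and `f_score` and the toy tags.

```python
def to_spans(tokens, level):
    """Convert (surface, tag) tokens to a set of labeled character spans."""
    spans = set()
    offset = 0
    for surface, tag in tokens:
        if level == "seg":
            label = None      # boundaries only
        elif level == "top":
            label = tag[0]    # top-level POS only
        else:
            label = tag       # full POS tag
        spans.add((offset, offset + len(surface), label))
        offset += len(surface)
    return spans


def f_score(gold, system, level="seg"):
    """F = 2 * recall * precision / (recall + precision)."""
    gold_spans = to_spans(gold, level)
    system_spans = to_spans(system, level)
    correct = len(gold_spans & system_spans)
    if correct == 0:
        return 0.0
    recall = correct / len(gold_spans)
    precision = correct / len(system_spans)
    return 2 * recall * precision / (recall + precision)


# Toy data with hypothetical tags.
gold = [("東京", ("noun", "proper")), ("都", ("noun", "suffix")),
        ("に", ("particle",)), ("住む", ("verb",))]
system = [("東京", ("noun", "common")), ("都", ("noun", "suffix")),
          ("に", ("particle",)), ("住む", ("verb",))]

for level in ("seg", "top", "all"):
    print(level, f_score(gold, system, level))
```

Here the system output has the right boundaries and top-level POS but one wrong sub-category, so only the "all" score drops below 1.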


  11. 12
    CRFs and MEMMs

    MEMMs trained with a number of features fail to
    segment some sentences that are correctly
    segmented by HMMs or rule-based analyzers.
    ● “ロマンは” (romanticist) and “ない心” (one’s
    heart) are unusual spellings; they are normally
    written as “ロマン派” and “内心” respectively

    Because of length bias, short paths are preferred over
    long paths
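
A toy calculation can make the length bias concrete. The sketch below compares per-state normalization (MEMM-style) with global normalization (CRF-style) on a three-path lattice. The node names and transition scores are invented for illustration and do not come from the paper's experiments.

```python
from math import prod

# Hypothetical unnormalized transition scores on a tiny lattice.
# "A" is one long token; "B" followed by "C" or "D" is a two-token reading.
SCORES = {
    "BOS": {"A": 2.0, "B": 3.0},
    "A": {"EOS": 1.0},
    "B": {"C": 3.0, "D": 2.0},
    "C": {"EOS": 1.0},
    "D": {"EOS": 1.0},
}


def paths(node="BOS"):
    """Yield every path from node to EOS as a list of node names."""
    if node == "EOS":
        yield ["EOS"]
        return
    for nxt in SCORES[node]:
        for rest in paths(nxt):
            yield [node] + rest


def memm_prob(path):
    """Per-state normalization: each transition competes only locally."""
    return prod(
        SCORES[a][b] / sum(SCORES[a].values())
        for a, b in zip(path, path[1:])
    )


def crf_probs(all_paths):
    """Global normalization: whole paths compete with each other."""
    raw = [prod(SCORES[a][b] for a, b in zip(p, p[1:])) for p in all_paths]
    z = sum(raw)
    return [r / z for r in raw]


all_paths = list(paths())
memm = [memm_prob(p) for p in all_paths]
crf = crf_probs(all_paths)
for p, m, c in zip(all_paths, memm, crf):
    print("-".join(p), round(m, 3), round(c, 3))
```

With per-state normalization the one-token path wins with probability 0.4, even though the two-token path B-C is locally preferred at every step (0.6 × 0.6 = 0.36). With global normalization on these toy scores, B-C receives the largest share.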


  12. 13
    CRFs and MEMMs


  13. 14
    CRFs and Extended-HMMs

    Asahara et al. extended the original HMMs by
    – 1) position-wise grouping of POS tags
    – 2) word-level statistics
    – 3) smoothing of word and POS level statistics

    CRFs can realize such extensions naturally and
    straightforwardly
