Applying Conditional Random Fields to Japanese Morphological Analysis

1 Literature review (2016.07.12)
Nagaoka University of Technology, Natural Language Processing, Nguyen Van Hai
Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto
Nara Institute of Science and Technology, 8916-5, Takayama-Cho Ikoma, Nara, 630-0192 Japan
CREST JST, Tokyo Institute of Technology, 4259, Nagatuta Midori-Ku Yokohama, 226-8503 Japan
The 2004 Conference on Empirical Methods in Natural Language Processing

2 Abstract
• Japanese morphological analysis based on conditional random fields (CRFs).
  – CRFs are applied to situations where word boundary ambiguity exists.
  – They address long-standing problems in corpus-based or statistical Japanese morphological analysis.
• Experiments use the same datasets as the HMMs and MEMMs.

3 Problems
• HMMs (Asahara and Matsumoto, 2000)
  – Hard to employ overlapping features stemming from the hierarchical tagset, and non-independent features
  – Unknown word guessing
• MEMMs (Uchimoto et al., 2001)
  – Evade neither label bias nor length bias
  – Easy sequences with low entropy are likely to be selected

4 Word Boundary Ambiguity
• A simple approach lets each character be a token (character-based Begin/Inside tagging)
  – Cannot directly reflect lexicons, which contain prior knowledge
  – A lexicon cannot be ignored, since using one already gives over 90% accuracy
• A lattice represents all candidate sequences of tokens
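The lattice idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the lexicon entries and POS labels below are a hypothetical toy, and `build_lattice` and `enumerate_paths` are names chosen for this sketch.

```python
# Hypothetical toy lexicon: surface form -> candidate POS tags (illustrative only).
LEXICON = {
    "東": ["noun"],
    "京": ["noun"],
    "都": ["suffix"],
    "東京": ["noun"],
    "京都": ["noun"],
    "に": ["particle"],
    "住む": ["verb"],
}


def build_lattice(sentence, lexicon):
    """Return one edge (start, end, surface, pos) per lexicon match in the sentence."""
    edges = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            surface = sentence[start:end]
            for pos in lexicon.get(surface, []):
                edges.append((start, end, surface, pos))
    return edges


def enumerate_paths(sentence, edges, start=0):
    """Yield every token sequence whose edges cover the sentence exactly."""
    if start == len(sentence):
        yield []
        return
    for s, e, surface, pos in edges:
        if s == start:
            for rest in enumerate_paths(sentence, edges, e):
                yield [(surface, pos)] + rest


SENTENCE = "東京都に住む"
EDGES = build_lattice(SENTENCE, LEXICON)
PATHS = list(enumerate_paths(SENTENCE, EDGES))

if __name__ == "__main__":
    for path in PATHS:
        print(" / ".join(f"{w}({p})" for w, p in path))
```

Each path is one candidate segmentation with POS tags; a morphological analyzer has to pick one path out of the lattice, and the paths may differ in length.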

5 Word Boundary Ambiguity

6 Long-standing Problems
• Hierarchical tagset
  – Japanese POS tagsets used in the two major analyzers, ChaSen and JUMAN, are hierarchical
  – The top level has 15 different categories; the bottom level seems to be at the word level
  – Using the bottom level: data sparseness problem
  – Using the top level: lacks the POS granularity to capture fine differences, e.g. the suffixes san and kun
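One way to sidestep the choice between top and bottom level is to emit features at every level of the hierarchy and let training weight them. The sketch below assumes a POS tag is given as a list from the top level down; the tag names and the function `hierarchical_pos_features` are hypothetical, chosen only for illustration.

```python
def hierarchical_pos_features(word, pos_path):
    """Emit one feature per level of the POS hierarchy, plus a lexicalized one.

    pos_path is a list from the top level down, e.g. ["noun", "suffix", "name"].
    The tag names here are hypothetical, not the actual ChaSen or JUMAN tagset.
    """
    features = []
    for depth in range(1, len(pos_path) + 1):
        # Coarse-to-fine: each prefix of the hierarchy becomes its own feature.
        features.append("pos=" + "/".join(pos_path[:depth]))
    # Finest level: the full tag combined with the word itself.
    features.append("pos=" + "/".join(pos_path) + "|word=" + word)
    return features


FEATURES = hierarchical_pos_features("san", ["noun", "suffix", "name"])

if __name__ == "__main__":
    print(FEATURES)
```

The coarse features are shared by many words and so are less sparse, while the lexicalized feature can still separate individual suffixes such as san and kun.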

7 Long-standing Problems • Label bias and Length bias

8 Conditional Random Fields
• Can use correlated features of the inputs
• Allow flexible feature designs for hierarchical tagsets
• Minimize the influence of label and length bias
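For reference, a lattice CRF of this kind is usually written as a single exponential model over whole paths. The notation below is an assumption of this sketch, since the slide gives no formula: x is the input sentence, y a path of (word, tag) pairs ⟨w_i, t_i⟩, Y(x) the set of all paths in the lattice, f_k the feature functions, and λ_k their weights.

```latex
P(y \mid x) = \frac{1}{Z_x}
  \exp\!\Bigl( \sum_{i=1}^{\#y} \sum_{k} \lambda_k \,
  f_k\bigl(\langle w_{i-1}, t_{i-1}\rangle, \langle w_i, t_i\rangle\bigr) \Bigr),
\qquad
Z_x = \sum_{y' \in Y(x)}
  \exp\!\Bigl( \sum_{i=1}^{\#y'} \sum_{k} \lambda_k \,
  f_k\bigl(\langle w'_{i-1}, t'_{i-1}\rangle, \langle w'_i, t'_i\rangle\bigr) \Bigr)
```

Because Z_x sums over every path in the lattice, normalization is global rather than per state, which is the property the bullet on label and length bias relies on.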

9 Experiment
• We use two widely used Japanese annotated corpora
  – Kyoto University Corpus ver 2.0 (KC)
  – RWCP Text Corpus (RWCP)

10 Results
• Tables 3 and 4 show experimental results using KC and RWCP, respectively: the three F-scores (seg/top/all) for our CRFs and baseline bi-gram HMMs.
• Table 3 (KC data set) also shows the results of a variant of MEMMs (Uchimoto et al., 2001) and a rule-based analyzer (JUMAN).
• Table 4 (RWCP data set) also shows the result of extended HMMs (E-HMMs).
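The three F-scores can be computed from token spans. The sketch below assumes that seg compares only word boundaries, top additionally requires the top-level POS to match, and all requires the full tag to match; the token tuples and tag names are hypothetical examples, not corpus data.

```python
def to_spans(tokens, level):
    """Convert (surface, top_pos, full_pos) tokens into a set of comparable spans."""
    spans = set()
    offset = 0
    for surface, top_pos, full_pos in tokens:
        end = offset + len(surface)
        if level == "seg":      # segmentation only
            spans.add((offset, end))
        elif level == "top":    # segmentation + top-level POS
            spans.add((offset, end, top_pos))
        else:                   # "all": segmentation + full tag
            spans.add((offset, end, top_pos, full_pos))
        offset = end
    return spans


def f_score(gold, system, level):
    """F = 2PR / (P + R), with P and R counted over matching tokens."""
    gold_spans = to_spans(gold, level)
    system_spans = to_spans(system, level)
    correct = len(gold_spans & system_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(system_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold analysis and two hypothetical system outputs.
GOLD = [("東京", "noun", "noun-proper"), ("都", "suffix", "suffix-place"),
        ("に", "particle", "particle-case")]
WRONG_SEG = [("東", "noun", "noun-common"), ("京都", "noun", "noun-proper"),
             ("に", "particle", "particle-case")]
WRONG_TOP = [("東京", "noun", "noun-proper"), ("都", "noun", "noun-common"),
             ("に", "particle", "particle-case")]

if __name__ == "__main__":
    for level in ("seg", "top", "all"):
        print(level, f_score(GOLD, WRONG_SEG, level), f_score(GOLD, WRONG_TOP, level))
```

A segmentation error costs a token at all three levels, whereas a tagging error leaves the seg score untouched and lowers only top and all.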

11 Results

12 CRFs and MEMMs
• MEMMs trained with a number of features fail to segment some sentences that are correctly segmented by HMMs or rule-based analyzers.
• “ロマンは” (romanticist) and “ない心” (one’s heart) are unusual spellings; they are normally written as “ロマン派” and “内心”, respectively.
• Because of the length bias, short paths are preferred to long paths.
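A toy lattice makes the effect concrete. The sketch below compares per-state normalization (as in MEMMs) with normalization over whole paths (as in CRFs). The lattice, node names, and scores are hypothetical. In this example the short path is also the one with the fewest branching decisions, so it illustrates the combined effect of both biases and does not isolate length bias.

```python
import math

# Hypothetical toy lattice: node -> list of (next_node, raw_score).
# The edges into A's successors score higher than anything on the B path.
LATTICE = {
    "BOS": [("A", 1.0), ("B", 1.0)],
    "A": [("C", 2.0), ("D", 2.0)],
    "B": [("EOS", 0.0)],
    "C": [("EOS", 0.0)],
    "D": [("EOS", 0.0)],
}


def enumerate_paths(node="BOS"):
    """Yield every path from node to EOS as a list of (src, dst, score) edges."""
    if node == "EOS":
        yield []
        return
    for dst, score in LATTICE[node]:
        for rest in enumerate_paths(dst):
            yield [(node, dst, score)] + rest


def memm_prob(path):
    """Product of locally normalized transition probabilities (per-state softmax)."""
    prob = 1.0
    for src, _dst, score in path:
        z_local = sum(math.exp(s) for _, s in LATTICE[src])
        prob *= math.exp(score) / z_local
    return prob


def crf_probs(all_paths):
    """Softmax over total path scores: one normalizer for the whole lattice."""
    totals = [sum(score for _, _, score in path) for path in all_paths]
    z = sum(math.exp(t) for t in totals)
    return [math.exp(t) / z for t in totals]


def name(path):
    return "-".join([path[0][0]] + [dst for _, dst, _ in path])


ALL_PATHS = list(enumerate_paths())
MEMM = {name(p): memm_prob(p) for p in ALL_PATHS}
CRF = dict(zip(map(name, ALL_PATHS), crf_probs(ALL_PATHS)))

if __name__ == "__main__":
    for key in MEMM:
        print(f"{key}: MEMM={MEMM[key]:.3f} CRF={CRF[key]:.3f}")
```

Under per-state normalization the single-successor node B passes on all of its probability mass no matter what the scores say, so the short path wins. With one global normalizer the higher-scoring long paths come out on top.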

13 CRFs and MEMMs

14 CRFs and Extended-HMMs
• Asahara et al. extended the original HMMs by
  – 1) position-wise grouping of POS tags
  – 2) word-level statistics
  – 3) smoothing of word- and POS-level statistics
• CRFs can realize such extensions naturally and straightforwardly
