Slide 1

Slide 1 text

Literature review (2016.07.12)
Nagaoka University of Technology, Natural Language Processing
Nguyen Van Hai
Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto
Nara Institute of Science and Technology, 8916-5, Takayama-Cho Ikoma, Nara, 630-0192 Japan
CREST JST, Tokyo Institute of Technology, 4259, Nagatuta Midori-Ku Yokohama, 226-8503 Japan
The 2004 Conference on Empirical Methods in Natural Language Processing

Slide 2

Slide 2 text

Abstract
● Japanese morphological analysis based on conditional random fields (CRFs).
  – CRFs are applied to settings with word boundary ambiguity.
  – They address long-standing problems in corpus-based or statistical Japanese morphological analysis.
● Experiments use the same datasets as earlier HMM and MEMM work.

Slide 3

Slide 3 text

Problems
● HMMs (Asahara and Matsumoto, 2000)
  – Hard to employ overlapping features stemming from hierarchical tagsets, or non-independent features of the input
  – Unknown word guessing suffers, since it relies on such features
● MEMMs (Uchimoto et al., 2001)
  – Avoid neither label bias nor length bias
  – Low-entropy ("easy") sequences tend to be selected

Slide 4

Slide 4 text

Word Boundary Ambiguity
● A simple approach treats each character as a token (character-based Begin/Inside tagging)
  – It cannot directly reflect lexicons, which contain prior knowledge
  – A lexicon cannot be ignored, since over 90% accuracy can be achieved with a lexicon alone
● Instead, a lattice represents all candidate sequences of tokens
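The lattice idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the lexicon `TOY_LEXICON`, its POS labels, and the function names `build_lattice` and `enumerate_paths` are all hypothetical, and real analyzers use a trie and dynamic programming rather than exhaustive enumeration.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> list of possible POS tags.
TOY_LEXICON = {
    "東": ["noun"],
    "東京": ["noun"],
    "京都": ["noun"],
    "都": ["suffix"],
    "に": ["particle"],
    "住む": ["verb"],
}

def build_lattice(sentence, lexicon):
    """Return {start: [(end, surface, pos), ...]} for every lexicon match."""
    lattice = defaultdict(list)
    max_len = max(len(word) for word in lexicon)
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            surface = sentence[start:end]
            for pos in lexicon.get(surface, []):
                lattice[start].append((end, surface, pos))
    return lattice

def enumerate_paths(lattice, length, start=0):
    """Yield every token sequence that covers the sentence exactly."""
    if start == length:
        yield []
        return
    for end, surface, pos in lattice.get(start, []):
        for rest in enumerate_paths(lattice, length, end):
            yield [(surface, pos)] + rest

sentence = "東京都に住む"
lattice = build_lattice(sentence, TOY_LEXICON)
paths = list(enumerate_paths(lattice, len(sentence)))
for path in paths:
    print(" / ".join(surface for surface, _ in path))
```

With this toy lexicon the sentence has two competing segmentations (東 / 京都 / に / 住む and 東京 / 都 / に / 住む), which is exactly the kind of ambiguity the model has to resolve.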

Slide 5

Slide 5 text

Word Boundary Ambiguity

Slide 6

Slide 6 text

Long-standing Problems
● Hierarchical tagset
  – The Japanese POS tagsets used by the two major analyzers, ChaSen and JUMAN, are hierarchical
  – The top level has 15 different categories; the bottom level is close to the word level
  – Using only the bottom level causes a data sparseness problem
  – Using only the top level lacks the POS detail needed to capture fine differences, e.g. between the suffixes san and kun
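One way to picture the trade-off above is to emit a feature for every level of the hierarchy at once, so that coarse and fine evidence can both be used. The sketch below is an assumption for illustration only: the function `hierarchical_features`, the feature string format, and the three-level tag `noun/suffix/person-name` are hypothetical and are not taken from the paper's feature templates.

```python
def hierarchical_features(surface, pos_levels):
    """Emit one feature per prefix of the POS hierarchy, plus a lexicalized one."""
    features = []
    for depth in range(1, len(pos_levels) + 1):
        features.append("pos:" + "/".join(pos_levels[:depth]))
    # Bottom (word) level: the full tag combined with the surface form.
    features.append("pos+word:" + "/".join(pos_levels) + "/" + surface)
    return features

# Hypothetical tags for the two suffixes mentioned on the slide.
san_feats = hierarchical_features("san", ["noun", "suffix", "person-name"])
kun_feats = hierarchical_features("kun", ["noun", "suffix", "person-name"])

# Features shared by both words generalize across them (less sparseness);
# the lexicalized feature keeps them apart where the data supports it.
shared = [f for f in san_feats if f in kun_feats]
print(shared)
print(san_feats[-1], kun_feats[-1])
```

Here san and kun share the three POS-level features and differ only in the lexicalized one, so a model that accepts overlapping features does not have to choose a single level of the tagset.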

Slide 7

Slide 7 text

Long-standing Problems
● Label bias and length bias

Slide 8

Slide 8 text

Conditional Random Fields
● Can use correlated (non-independent) features of the inputs
● Allow flexible feature designs for hierarchical tagsets
● Minimize the influence of label bias and length bias
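For reference, the three properties above follow from the form of the model: a CRF assigns one globally normalized probability to each complete path through the lattice. The notation below is an assumption chosen for illustration and does not appear on the slide: y is a path of word/tag pairs ⟨w_i, t_i⟩, Y(x) is the set of all lattice paths for input x, f_k are feature functions and λ_k their weights.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z_{\mathbf{x}}}
    \exp\Bigl( \sum_{i=1}^{\#\mathbf{y}} \sum_{k}
      \lambda_k \, f_k\bigl(\langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle\bigr) \Bigr),
\qquad
Z_{\mathbf{x}}
  = \sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})}
    \exp\Bigl( \sum_{i=1}^{\#\mathbf{y}'} \sum_{k}
      \lambda_k \, f_k\bigl(\langle w'_{i-1}, t'_{i-1} \rangle, \langle w'_i, t'_i \rangle\bigr) \Bigr)
```

Because Z_x sums over whole paths rather than over the outgoing transitions of each state, paths of different lengths and states with few successors compete on the same scale.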

Slide 9

Slide 9 text

Experiment
● The experiments use two widely used annotated Japanese corpora
  – Kyoto University Corpus ver 2.0 (KC)
  – RWCP Text Corpus (RWCP)

Slide 10

Slide 10 text

Results
● Tables 3 and 4 show the experimental results on KC and RWCP, respectively, with three F-scores (seg/top/all) for the CRFs and a baseline bi-gram HMM.
● Table 3 (KC data set) also includes the results of a variant of MEMMs (Uchimoto et al., 2001) and a rule-based analyzer (JUMAN).
● Table 4 (RWCP data set) also includes the result of an extended HMM (E-HMM).
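The three F-scores can be read as the same token-level measure under increasingly strict matching: seg counts a token as correct if its boundaries match, top additionally requires the top-level POS, and all requires the full tag. The sketch below assumes that reading; the helper names `to_spans`, `f_score`, and `CRITERIA`, and the example tags, are hypothetical.

```python
def to_spans(tokens, key):
    """Convert (surface, pos_levels) tokens to a set of (start, end, label) spans."""
    spans, start = set(), 0
    for surface, pos_levels in tokens:
        end = start + len(surface)
        spans.add((start, end, key(pos_levels)))
        start = end
    return spans

def f_score(gold, system, key):
    """Token-level F-score: harmonic mean of precision and recall."""
    gold_spans, sys_spans = to_spans(gold, key), to_spans(system, key)
    correct = len(gold_spans & sys_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(sys_spans)   # correct / tokens in system output
    recall = correct / len(gold_spans)     # correct / tokens in test corpus
    return 2 * precision * recall / (precision + recall)

CRITERIA = {
    "seg": lambda pos: None,        # boundaries only
    "top": lambda pos: pos[0],      # boundaries + top-level POS
    "all": lambda pos: tuple(pos),  # boundaries + full POS
}

# Hypothetical example: segmentation is right, one fine-grained tag is wrong.
gold = [("東京", ("noun", "proper")), ("都", ("noun", "suffix")),
        ("に", ("particle",)), ("住む", ("verb",))]
system = [("東京", ("noun", "common")), ("都", ("noun", "suffix")),
          ("に", ("particle",)), ("住む", ("verb",))]

scores = {name: f_score(gold, system, key) for name, key in CRITERIA.items()}
print(scores)
```

In this example the seg and top scores are perfect while the all score drops, because the only error is in the second level of one tag.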

Slide 11

Slide 11 text

Results

Slide 12

Slide 12 text

CRFs and MEMMs
● MEMMs, although trained with a number of features, fail to segment some sentences that HMMs or rule-based analyzers segment correctly.
● “ロマンは” (romanticist) and “ない心” (one’s heart) are unusual spellings; they are normally written as “ロマン派” and “内心”, respectively.
● Because of length bias, short paths are preferred to long paths.
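The length bias mentioned above can be reproduced on a toy lattice. The sketch below is an illustration under stated assumptions, not the models from the paper: the lattice `EDGES` and its scores are invented, the "MEMM" is reduced to a per-state softmax over outgoing edges, and the "CRF" to a softmax over whole-path scores.

```python
import math

# Hypothetical toy lattice: node -> [(next_node, raw_score), ...].
# "AB" is one long token; "A" followed by "B" or "B2" covers the same text.
EDGES = {
    "BOS": [("AB", 1.0), ("A", 1.4)],
    "A": [("B", 1.0), ("B2", 0.6)],
    "AB": [("EOS", 0.0)],
    "B": [("EOS", 0.0)],
    "B2": [("EOS", 0.0)],
}

def all_paths(edges, node="BOS"):
    """Yield (path, edge_scores) for every BOS-to-EOS path."""
    if node == "EOS":
        yield ["EOS"], []
        return
    for nxt, score in edges[node]:
        for path, scores in all_paths(edges, nxt):
            yield [node] + path, [score] + scores

def memm_probability(edges, path):
    """Product of per-state softmax probabilities (local normalization)."""
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        z = sum(math.exp(s) for _, s in edges[cur])
        prob *= math.exp(dict(edges[cur])[nxt]) / z
    return prob

def crf_probabilities(edges):
    """Softmax over whole-path scores (global normalization)."""
    scored = [(tuple(p), sum(s)) for p, s in all_paths(edges)]
    z = sum(math.exp(total) for _, total in scored)
    return {p: math.exp(total) / z for p, total in scored}

memm = {tuple(p): memm_probability(EDGES, p) for p, _ in all_paths(EDGES)}
crf = crf_probabilities(EDGES)
memm_best = max(memm, key=memm.get)
crf_best = max(crf, key=crf.get)
print("MEMM-style best path:", memm_best)
print("CRF-style best path: ", crf_best)
```

Under local normalization every local decision favors the two-token path (A over AB, then B over B2), yet the product of the two probabilities falls below the single probability of the one-token path, so the short path wins. Under global normalization the same scores select the two-token path.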

Slide 13

Slide 13 text

CRFs and MEMMs

Slide 14

Slide 14 text

CRFs and Extended-HMMs
● Asahara et al. extended the original HMMs with
  – 1) position-wise grouping of POS tags
  – 2) word-level statistics
  – 3) smoothing of word-level and POS-level statistics
● CRFs can realize such extensions naturally and straightforwardly