Slide 1

Slide 1 text

Paper introduction (2016.10.25)
Nagaoka University of Technology, Natural Language Processing
Nguyen Van Hai

Vietnamese Word Segmentation
Dinh Dien, Hoang Kiem, Nguyen Van Toan
Faculty of Information Technology, National University of HCM City
227 Nguyen Van Cu, Dist. 5, HCM City, VIETNAM
[email protected]
Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium

Slide 2

Slide 2 text

Abstract
● Vietnamese is an Asian language in which white space separates syllables, not words.
● This paper presents a model that combines a WFST approach with a neural network.
● The algorithm achieves 97% accuracy.

Slide 3

Slide 3 text

Problems
● In Vietnamese, white space is not used to identify word boundaries.
● Problems in word segmentation:
  – Local ambiguity in compound words
  – No comprehensive dictionaries
  – Recognition of proper nouns and names
  – Morphemes and reduplicatives

Slide 4

Slide 4 text

Vietnamese Linguistics
● The linguistic unit is called “tiếng”. It is constructed from phonemes according to a fixed structure.
● For example, the syllable “tuần” (week) has a tone mark (grave accent), an initial consonant (t), a secondary vowel (u), a main vowel (â) and a final consonant (n).
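The decomposition of a “tiếng” can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the function name `decompose`, the tone labels, and the rule "leading consonants, vowel cluster, trailing consonants" are assumptions, and the sketch does not separate the secondary vowel from the main vowel or handle special onsets such as "qu" and "gi".

```python
import unicodedata

# Combining marks that carry Vietnamese tones (labels are illustrative).
TONE_MARKS = {
    "\u0300": "grave",
    "\u0301": "acute",
    "\u0303": "tilde",
    "\u0309": "hook above",
    "\u0323": "dot below",
}
VOWEL_BASES = set("aeiouy")


def is_vowel(letter):
    """A letter is a vowel if its base character (diacritics removed) is a vowel."""
    return unicodedata.normalize("NFD", letter)[0] in VOWEL_BASES


def decompose(syllable):
    """Split one syllable into (initial consonant, vowels, final consonant, tone)."""
    tone = "none"
    kept = []
    # NFD separates base letters from their diacritics, so the tone mark
    # can be pulled out while the vowel-quality marks (e.g. the circumflex) stay.
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    letters = list(unicodedata.normalize("NFC", "".join(kept)))

    start = 0
    while start < len(letters) and not is_vowel(letters[start]):
        start += 1
    end = len(letters)
    while end > start and not is_vowel(letters[end - 1]):
        end -= 1
    return (
        "".join(letters[:start]),
        "".join(letters[start:end]),
        "".join(letters[end:]),
        tone,
    )


print(decompose("tuần"))
```

For “tuần” the sketch separates the initial consonant t, the vowel cluster uâ, the final consonant n, and the grave tone mark.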

Slide 5

Slide 5 text

Vietnamese Linguistics
● A “tiếng” may be:
  – A word: “tôi” (I)
  – A morpheme: “hoa” (flower) and “hồng” (pink) in the word “hoa hồng” (rose)
  – A sub-morpheme: “bù” and “nhìn” in the word “bù nhìn” (puppet)

Slide 6

Slide 6 text

Previous Works
● Rule-based approach:
  – Longest Matching and Greedy Matching models (Yuen Poowarawan, 1986; Sampan Rarunrom, 1991).
  – Maximum matching models:
    ● Thai: Sornlertlamvanich (1993)
    ● Chinese: Chih-Hao Tsai (1996), MMSeg 2000; 98% accuracy on a corpus of 1,300 simple sentences, with no solution for proper nouns and unknown words.
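The rule-based idea above can be illustrated with a minimal longest-matching sketch over Vietnamese syllables. It is not taken from any of the cited systems: the function name `longest_match`, the toy lexicon, and the `max_len` limit are assumptions made for the example.

```python
def longest_match(syllables, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon entry.

    A single syllable is accepted even when it is not in the lexicon,
    so the loop always advances.
    """
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon (hypothetical), with multi-syllable words written space-separated.
DEMO_LEXICON = {"học", "sinh", "học sinh", "sinh học"}
DEMO_SENTENCE = "học sinh học sinh học"

print(longest_match(DEMO_SENTENCE.split(), DEMO_LEXICON))
```

On this input the greedy matcher commits to “học sinh” twice and leaves a lone “học”, without ever considering “sinh học”. This is the kind of local ambiguity that matching alone cannot resolve.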

Slide 7

Slide 7 text

Previous Works
● Statistics-based approach:
  – HMM, based on the Viterbi algorithm (Asanee Kawtraku, 1995; Surapant, 1995).
  – Expectation-Maximization (EM). This method resolves the “chicken and egg” question through iteration (Xianping, 1996).

Slide 8

Slide 8 text

Proposed model

Slide 9

Slide 9 text

WFST Model
● The WFST model for Chinese word segmentation (Richard Sproat, 1996) is applied to this task as follows:
● The dictionary D is represented as a Weighted Finite State Transducer. Let:
  – H: the set of “tiếng” (syllables).
  – p: not used, owing to the characteristics of “tiếng”.
  – P: the set of grammatical part-of-speech (POS) labels.

Slide 10

Slide 10 text

Dictionaries
● Each word is annotated with additional details such as POS, word frequency, and syntactic features.
● The weight of each word is derived from the logarithm of its probability.
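How log-probability weights can drive segmentation is sketched below. The slide does not show the exact formula, so the sketch assumes the common choice cost = -log(f / N), where f is the word frequency and N the corpus size. The counts are toy values, and `word_cost` and `best_segmentation` are hypothetical names. The cheapest path is found by dynamic programming over syllable positions, which stands in for the best path through a weighted transducer.

```python
import math


def word_cost(freq, corpus_size):
    """Assumed weight: negative log of the relative frequency."""
    return -math.log(freq / corpus_size)


def best_segmentation(syllables, costs, max_len=4):
    """Return (words, total_cost) for the cheapest segmentation."""
    n = len(syllables)
    best = [math.inf] * (n + 1)   # best[i]: cheapest cost of the first i syllables
    back = [0] * (n + 1)          # back[i]: start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = " ".join(syllables[start:end])
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if math.isinf(best[n]):
        raise ValueError("no segmentation covers the input with this lexicon")
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(" ".join(syllables[start:end]))
        end = start
    return words[::-1], best[n]


# Toy counts (hypothetical, not from the paper's corpus).
DEMO_CORPUS_SIZE = 1000
DEMO_COUNTS = {"hoa": 50, "hồng": 20, "hoa hồng": 30, "đẹp": 40}
DEMO_COSTS = {w: word_cost(f, DEMO_CORPUS_SIZE) for w, f in DEMO_COUNTS.items()}
DEMO_INPUT = "hoa hồng đẹp"

print(best_segmentation(DEMO_INPUT.split(), DEMO_COSTS))
```

With these toy counts the two-word path “hoa hồng” + “đẹp” is cheaper than the three single-syllable words, because each extra word adds another positive cost.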

Slide 11

Slide 11 text

Dictionaries

Slide 12

Slide 12 text

Dictionaries
● Word probabilities are calculated from a corpus of 2,000,000 words:
  – 1.6 MB from the complete works of Ho Chi Minh.
  – 0.6 MB from Vietnam PC-WORLD magazines.
  – 0.9 MB from newspapers in Science and Technology.
  – 0.5 MB from famous works of Vietnamese poets.
  – 3.7 MB from Vietnamese literary works.
● The dictionary contains 34,000 words and is based on that of the Center of Lexicography (under the National Center of Social Sciences and Humanities).

Slide 13

Slide 13 text

Identification of proper names
● Some rules specific to proper names were identified.

Slide 14

Slide 14 text

Identification of proper names
● The ambiguity here is that the initial letter of a sentence is also capitalized, and the capitalization of proper names varies. Ex: Bộ Chính trị, Bộ chính trị, or Bộ Chính Trị (politburo).
● The authors use heuristics to assign appropriate weights to these words and then treat them as ordinary words in the WFST, with very satisfactory results.

Slide 15

Slide 15 text

Method of selecting the best sentences

Slide 16

Slide 16 text

Neural network
● Example sentence after WFST processing: “Học sinh học sinh học”.
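To see why this sentence still needs disambiguation, the sketch below enumerates every segmentation that a small lexicon allows. The lexicon and the function name `segmentations` are assumptions for illustration only. The sketch says nothing about how the neural network chooses among the candidates.

```python
def segmentations(syllables, lexicon, max_len=2):
    """Enumerate all ways to split the syllables into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon (hypothetical): both two-syllable readings are valid words.
AMBIG_LEXICON = {"học", "sinh", "học sinh", "sinh học"}
AMBIG_SENTENCE = "học sinh học sinh học"

candidates = segmentations(AMBIG_SENTENCE.split(), AMBIG_LEXICON)
for words in candidates:
    print(" | ".join(words))
print(len(candidates), "candidate segmentations")
```

Every adjacent pair of syllables in this sentence forms a dictionary word, so even this five-syllable input has several competing candidates. A later stage has to pick one.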

Slide 17

Slide 17 text

Neural network

Slide 18

Slide 18 text

Parameters of the neural network

Slide 19

Slide 19 text

Results
● The following table shows the results of applying the above model.