Vietnamese Word Segmentation

1 Literature Review (2016.10.25)
Nagaoka University of Technology, Natural Language Processing, Nguyen Van Hai
Vietnamese Word Segmentation
Dinh Dien, Hoang Kiem, Nguyen Van Toan
Faculty of Information Technology, National University of HCM City
227 Nguyen Van Cu, Dist. 5, HCM City, VIETNAM
[email protected]
Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium

2 Abstract
• Vietnamese is an Asian language in which whitespace is used to delimit syllables.
• This paper presents a model combining the WFST approach and a neural network.
• The algorithm achieves 97% accuracy.

3 Problems
• In Vietnamese, whitespace is not used to identify word boundaries.
• Problems in word segmentation:
– Local ambiguity in compound words
– No comprehensive dictionaries
– Recognition of proper nouns and names
– Morphemes and reduplicatives

4 Vietnamese Linguistics
• The linguistic unit is called “tiếng”. It is constructed from phonemes under the following structure:
• The syllable “tuần” (week) has a tone mark (grave accent), a first consonant (t), a secondary vowel (u), a main vowel (â) and a last consonant (n).
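The decomposition of “tuần” described above can be illustrated with Unicode normalization. This is only a minimal sketch: the function name `split_tone` and the tone labels are illustrative choices, not taken from the paper, and it separates only the tone mark, not the consonants and vowels.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones.
TONE_MARKS = {
    "\u0300": "grave",
    "\u0301": "acute",
    "\u0303": "tilde",
    "\u0309": "hook above",
    "\u0323": "dot below",
}

def split_tone(syllable):
    """Return (syllable without its tone mark, tone name or None)."""
    # NFD splits each letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # vowel-quality marks such as the circumflex stay
    return unicodedata.normalize("NFC", "".join(kept)), tone

print(split_tone("tuần"))
```

For “tuần” the grave accent is reported as the tone, while the circumflex stays on the main vowel (â) because it marks vowel quality rather than tone.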

5 Vietnamese Linguistics
• A “tiếng” may be:
– A word: “tôi”
– A morpheme: “hoa” (flower) and “hồng” (pink) in the word “hoa hồng” (rose)
– A sub-morpheme: “bù” and “nhìn” in the word “bù nhìn” (puppet)

6 Previous Works
• Rule-based approach:
– Longest Matching, Greedy Matching models (Yuen Poowarawan, 1986; Sampan Rarunrom, 1991).
– Maximum matching models:
  • Thai: Sornlertlamvanich (1993)
  • Chinese: Chih-Hao Tsai (1996), MMSeg 2000; 98% accuracy on a corpus of 1,300 simple sentences, with no solution for proper nouns and unknown words.
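As a rough illustration of the longest-matching idea, the sketch below greedily takes the longest dictionary entry at each position. The toy dictionary, the `max_len` parameter, and the fallback to single syllables are assumptions made for the example; this is not the implementation of any of the cited systems.

```python
def longest_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest match."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest span first, shrinking until a dictionary word matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is accepted even if unknown, so the loop advances.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words

# Hypothetical toy dictionary: "học sinh" (pupil), "sinh học" (biology), "học" (study).
toy_dictionary = {"học sinh", "sinh học", "học"}
print(longest_matching("học sinh học sinh học".split(), toy_dictionary))
```

On this input the greedy pass yields “học sinh | học sinh | học”. It commits to the first long match and never considers “học sinh | học | sinh học”, which is the kind of local ambiguity a purely rule-based method cannot resolve.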

7 Previous Works
• Statistics-based approach:
– HMM, based on the Viterbi algorithm (Asanee Kawtraku, 1995; Surapant, 1995).
– Expectation-Maximization (EM). This method resolves the “chicken and egg” question through iteration (Xianping, 1996).

8 Proposed model

9 WFST Model
• The WFST model for Chinese word segmentation (Richard Sproat, 1996) is applied to our task as follows:
• Represent the dictionary D as a Weighted Finite State Transducer. Suppose:
– H: the set of “tiếng” (syllables).
– p: not used, owing to the characteristics of “tiếng”.
– P: the set of grammatical part-of-speech (POS) labels.
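To make the role of the dictionary concrete, the sketch below enumerates every way a syllable sequence can be covered by dictionary words. It is a plain recursive stand-in for the paths a dictionary transducer would accept, not a real WFST; the function name, the toy dictionary, and `max_len` are assumptions for illustration.

```python
def candidate_segmentations(syllables, dictionary, max_len=4):
    """List every segmentation whose words all appear in the dictionary."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in dictionary:
            # Extend this word with every segmentation of the remaining syllables.
            for rest in candidate_segmentations(syllables[n:], dictionary, max_len):
                results.append([word] + rest)
    return results

# Hypothetical toy dictionary over the syllable set H = {"học", "sinh"}.
toy_dictionary = {"học", "sinh", "học sinh", "sinh học"}
candidates = candidate_segmentations("học sinh học sinh học".split(), toy_dictionary)
for words in candidates:
    print(" | ".join(words))
```

Even this five-syllable input has several competing candidates, which is why each path needs a weight before one can be selected.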

10 Dictionaries
• Each word is annotated with additional details such as POS, word frequency, and syntactic features.
• The weight of a word is assigned through the logarithm of its probability.
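A minimal sketch of such a weight, assuming the common convention of a negative log probability estimated as word count divided by corpus size; the exact formula on the slide is not shown here, so the sign convention, the function names, and the counts below are all assumptions.

```python
import math

def word_cost(count, total):
    """Cost of a word as the negative log of its estimated probability."""
    return -math.log(count / total)

def segmentation_cost(words, counts, total):
    """Cost of a segmentation as the sum of its word costs (lower is better)."""
    return sum(word_cost(counts[w], total) for w in words)

# Hypothetical counts, for illustration only.
counts = {"học sinh": 50, "sinh học": 20, "học": 100}
total = 2_000_000  # corpus size in words

for words in (["học sinh", "học", "sinh học"], ["học sinh", "học sinh", "học"]):
    print(" | ".join(words), round(segmentation_cost(words, counts, total), 3))
```

Under this convention frequent words receive small costs, and summing costs along a path corresponds to multiplying word probabilities.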

11 Dictionaries

12 Dictionaries
• Word probabilities are calculated from a corpus of 2,000,000 words:
– 1.6 MB from the Complete Works of Ho Chi Minh.
– 0.6 MB from Vietnam PC-WORLD magazines.
– 0.9 MB from science and technology newspapers.
– 0.5 MB from famous works of Vietnamese poets.
– 3.7 MB from Vietnamese literary works.
• A dictionary of 34,000 words, based on that of the Center of Lexicography (under the National Center of Social Sciences and Humanities), is also used.

13 Identification of proper names
• Some specific rules were found.

14 Identification of proper names
• The ambiguity here is that the initial letter of a sentence is also capitalized, and capitalization is not applied consistently. Ex: Bộ Chính trị, Bộ chính trị, or Bộ Chính Trị (politburo).
• The authors use heuristics to assign appropriate weights to these words and then treat them as conventional words to be processed by the WFST, with very satisfactory results.
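The slides do not list the actual heuristics, so the sketch below is purely hypothetical: it only finds runs of capitalized syllables and marks a run as ambiguous when it starts the sentence. The function name and the returned tuple layout are invented for illustration.

```python
def capitalized_runs(syllables):
    """Return (start, end, ambiguous) for each maximal run of capitalized syllables."""
    runs = []
    i = 0
    while i < len(syllables):
        if syllables[i][:1].isupper():
            j = i
            while j < len(syllables) and syllables[j][:1].isupper():
                j += 1
            # A run that starts the sentence may be capitalized only by position.
            runs.append((i, j, i == 0))
            i = j
        else:
            i += 1
    return runs

for text in ("Bộ Chính trị", "Bộ chính trị", "Bộ Chính Trị"):
    print(text, capitalized_runs(text.split()))
```

The three spellings of the same name produce runs of different lengths, which is why capitalization alone cannot fix the boundary and the candidates still need weights.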

15 Method of selecting the best sentences

16 Neural network
• The sentence after WFST processing: “Học sinh học sinh học”.

17 Neural network

18 Parameter in the Neural Network

19 Results
• The following table shows the results of applying the above model.
