A Pointwise Approach for Vietnamese Diacritics Restoration

Literature review (2016/09/17), Natural Language Processing Laboratory, Nagaoka University of Technology, B4 LY NAM PHONG
2012 International Conference on Asian Language Processing, DOI: 10.1109/IALP.2012.18, A Pointwise Approach for Vietnamese Diacritics Restoration

Abstract
• Automatic insertion of diacritics into electronic texts is needed.
• The paper is the first to study automatic diacritic restoration in Vietnamese.
• Proposes a pointwise approach using three kinds of features:
  • N-grams of syllables.
  • N-grams of syllable types.
  • Dictionary word features.
• Achieves a 94.7% accuracy rate.

Introduction
• Two basic approaches to diacritic restoration:
  • Word-based: language dependent; requires large corpora to build a useful model.
  • Character-based: uses language-independent algorithms.
• Word-based approaches to Vietnamese face two major challenges:
  • Not enough dictionaries and corpora.
  • Word segmentation.
• The character-based approach is also difficult because of the abundance of diacritics in Vietnamese.

Vietnamese Orthography
• Consists of 29 letters:
  • The 26 letters of the English alphabet except f, j, w, z (22 letters).
  • 7 letters modified with diacritics: đ, ă, â, ê, ô, ơ, ư.
• Tone marking: a, à, á, ả, ã, ạ.
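The restoration task starts from text in which these marks have been removed. As a minimal sketch of that inverse mapping (not taken from the paper), the helper below strips both the letter modifications and the tone marks using Python's standard `unicodedata` module. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with Vietnamese letter and tone diacritics removed."""
    # đ/Đ have no canonical decomposition, so they are mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each remaining letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# All six tone variants of "a" collapse to the same non-diacritical string.
print({strip_diacritics(ch) for ch in "a à á ả ã ạ".split()})
print(strip_diacritics("đ ă â ê ô ơ ư"))
```

Because many diacritized syllables collapse to one plain string, the restoration step has to choose among several candidates for each input syllable.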

Pointwise Approach
• Assumes that every decision about a syllable’s diacritics is independent of the decisions about neighboring syllables.
• Features for machine learning:
  • N-grams of syllables: 1-grams and 2-grams.
  • Window: W words around the target syllable.
  • N-grams of syllable types: U for uppercase, L for lowercase, N for number, and O for others, such as symbols.
  • Dictionary word features.
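The independence assumption can be sketched as a decoding loop in which each position is decided from the plain input alone, never from previously restored neighbors. This is an illustrative sketch, not the authors' implementation: `restore_pointwise`, the `Classifier` type, and the hand-written rule `toy_cho` (a stand-in for a trained model) are all hypothetical, and the input sentence is only an example.

```python
from typing import Callable, Dict, List, Sequence

# A classifier maps (plain syllables, target position) to a restored syllable.
Classifier = Callable[[Sequence[str], int], str]


def restore_pointwise(
    syllables: Sequence[str], classifiers: Dict[str, Classifier]
) -> List[str]:
    """Restore each syllable independently of the other decisions."""
    restored = []
    for i, syllable in enumerate(syllables):
        clf = classifiers.get(syllable)
        # Only the plain input is consulted, never the `restored` list,
        # so the decisions could be made in any order.
        restored.append(clf(syllables, i) if clf else syllable)
    return restored


def toy_cho(syllables: Sequence[str], i: int) -> str:
    """Hypothetical hand-written rule standing in for a trained classifier."""
    return "chó" if i > 0 and syllables[i - 1] == "con" else "chợ"


plain = "1 con cho ngoi ngoai cong cho Dong Xuan".split()
print(restore_pointwise(plain, {"cho": toy_cho}))
```

Syllables without a classifier are passed through unchanged here, which is a simplification made for the sketch.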

Pointwise Approach
• Feature vector for the first occurrence of “cho”: (“1”, “con”, “cho”, “ngoi”, “ngoai”, “1 con”, “con cho”, “cho ngoi”, “ngoi ngoai”, “N”, “L”, “NL”, “LL”, “con cho(dictionary)”).
• Feature vector for the second occurrence of “cho”: (“ngoai”, “cong”, “cho”, “Dong”, “Xuan”, “ngoai cong”, “cong cho”, “cho Dong”, “Dong Xuan”, “L”, “U”, “LL”, “LU”, “UU”, “cong cho(dictionary)”).
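The two vectors can be reproduced with a short sketch. Several details are assumptions inferred from the vectors rather than stated on the slides: the input sentence is reconstructed from the two overlapping windows, the window is taken as 2 syllables on each side, a syllable is typed U when its first character is uppercase, type n-grams are deduplicated in order of first appearance, and the two-entry dictionary is a hypothetical stand-in for a real word list.

```python
from typing import Iterable, List, Sequence, Set


def syllable_type(syllable: str) -> str:
    """Map a syllable to U, L, N, or O (typing rule is an assumption)."""
    if syllable.isdigit():
        return "N"
    if syllable[:1].isupper():
        return "U"
    if syllable.isalpha() and syllable.islower():
        return "L"
    return "O"


def unique(items: Iterable[str]) -> List[str]:
    """Deduplicate while keeping first-appearance order."""
    return list(dict.fromkeys(items))


def extract_features(
    syllables: Sequence[str], i: int, dictionary: Set[str], w: int = 2
) -> List[str]:
    lo, hi = max(0, i - w), min(len(syllables), i + w + 1)
    window = list(syllables[lo:hi])
    # Syllable 1-grams and 2-grams inside the window.
    feats = window + [" ".join(pair) for pair in zip(window, window[1:])]
    # Distinct syllable-type 1-grams and 2-grams.
    types = [syllable_type(s) for s in window]
    feats += unique(types)
    feats += unique(a + b for a, b in zip(types, types[1:]))
    # Dictionary feature: two-syllable words that contain the target.
    target = i - lo
    for start in (target - 1, target):
        if start >= 0 and start + 1 < len(window):
            word = " ".join(window[start:start + 2])
            if word in dictionary:
                feats.append(word + "(dictionary)")
    return feats


sentence = "1 con cho ngoi ngoai cong cho Dong Xuan".split()
toy_dictionary = {"con cho", "cong cho"}  # hypothetical word list
print(extract_features(sentence, 2, toy_dictionary))
print(extract_features(sentence, 6, toy_dictionary))
```

A real dictionary lookup would likely consider words longer than two syllables; the sketch limits itself to two-syllable words because those are the only ones that appear in the example.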

Experimental setting
• Only lowercase syllables were provided with diacritical marks.
• Used the SVM implementation in the LIBLINEAR package.
• The text corpus was crawled from journalistic sources and divided into 2 parts:
  • 320 MB for training and the other 15 MB for testing.
• A classifier was built for each non-diacritical string (1525 strings).
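One way to organize training data for one-classifier-per-string is to strip the diacritics from a diacritized corpus and group the examples by the resulting plain syllable. The sketch below is an assumption about how such grouping could look, not the authors' pipeline: the function names and the one-sentence corpus are hypothetical, and the actual SVM training with LIBLINEAR is left out.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One training example: (plain syllables, target position, gold syllable).
Example = Tuple[List[str], int, str]


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics (đ/Đ are mapped by hand)."""
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_training_examples(sentences: Iterable[str]) -> Dict[str, List[Example]]:
    """Group examples by non-diacritical string, one group per classifier."""
    groups: Dict[str, List[Example]] = defaultdict(list)
    for sentence in sentences:
        gold = sentence.split()
        plain = [strip_diacritics(s) for s in gold]
        for i, (plain_syllable, gold_syllable) in enumerate(zip(plain, gold)):
            # Only lowercase syllables are used, following the setting above.
            if gold_syllable.islower():
                groups[plain_syllable].append((plain, i, gold_syllable))
    return groups


corpus = ["1 con chó ngồi ngoài cổng chợ Đồng Xuân"]  # toy one-sentence corpus
groups = group_training_examples(corpus)
print(sorted(groups))
print([label for _, _, label in groups["cho"]])
```

Each group would then be turned into feature vectors and passed to a linear SVM trainer, so the number of groups corresponds to the number of classifiers.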

Result

Conclusion
• Presented an automatic system for diacritic restoration in Vietnamese texts using a pointwise approach.
• Drawback: the files generated for the model were very large, up to 16 GB.
• With proper feature selection, the model files are expected to become smaller.
