A Pointwise Approach for Vietnamese Diacritics Restoration

September 27, 2016


  1. Literature review (2016/09/17)
     • Nagaoka University of Technology, Natural Language Processing Laboratory, B4 LY NAM PHONG
     • Paper: A Pointwise Approach for Vietnamese Diacritics Restoration
     • 2012 International Conference on Asian Language Processing, DOI: 10.1109/IALP.2012.18
  2. Abstract
     • Automatic insertion of diacritics into electronic texts is needed.
     • The first study of automatic diacritic restoration in Vietnamese.
     • Proposes a pointwise approach using three kinds of features:
       • N-grams of syllables.
       • N-grams of syllable types.
       • Dictionary word features.
     • Accuracy: 94.7%.
  3. Introduction
     • Two basic approaches to diacritic restoration:
       • Word-based: language dependent; requires large corpora to build a useful model.
       • Character-based: uses language-independent algorithms.
     • Word-based approaches to Vietnamese face two major challenges:
       • Not enough dictionaries and corpora.
       • Word segmentation.
     • The character-based approach is also difficult because of the abundance of diacritics in Vietnamese.
  4. Vietnamese Orthography
     • Consists of 29 letters:
       • 22 letters shared with English (the 26 letters of the English alphabet except f, j, w, z).
       • 7 letters that are modified with diacritics: đ, ă, â, ê, ô, ơ, ư.
     • Tone marking, shown on "a": a, à, á, ả, ã, ạ.
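
The letter count and the tone series on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the names `EXCLUDED`, `MODIFIED`, `TONE_MARKS`, and `tone_forms` are made up here, and the use of Unicode combining marks (grave, acute, hook above, tilde, dot below) is an assumption about how to model tone marking, not something taken from the paper.

```python
import string
import unicodedata

# Letters the slide lists as absent from Vietnamese, and the seven modified letters.
EXCLUDED = set("fjwz")
MODIFIED = set(unicodedata.normalize("NFC", "đăâêôơư"))

# 26 English letters minus 4, plus the 7 modified letters.
alphabet = (set(string.ascii_lowercase) - EXCLUDED) | MODIFIED

# Assumed modelling of tones as combining marks, in the slide's order:
# unmarked, grave, acute, hook above, tilde, dot below.
TONE_MARKS = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]


def tone_forms(vowel: str) -> list:
    """Return the six tone variants of a vowel as precomposed (NFC) strings."""
    return [unicodedata.normalize("NFC", vowel + mark) for mark in TONE_MARKS]


print(len(alphabet))     # 29
print(tone_forms("a"))   # the series shown on the slide
print(tone_forms("ơ"))   # a modified letter can carry a tone mark as well
```

Because modified vowels can also carry tone marks, a single plain vowel can correspond to many diacritized forms, which is the "abundance of diacritics" mentioned in the introduction.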
  5. Pointwise Approach
     • Assumes that every decision about a syllable’s diacritics is independent of the decisions about neighboring syllables.
     • Features for machine learning:
       • N-grams of syllables: 1-grams and 2-grams.
       • Window: W words around the target syllable.
       • N-grams of syllable types: U for uppercase, L for lowercase, N for number, and O for others, such as symbols.
       • Dictionary word features.
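
The independence assumption can be sketched as a loop in which each syllable is restored from the undiacritized input alone, never from earlier predictions. Everything named here is hypothetical: `restore_pointwise`, the hand-written rule in `TOY_CLASSIFIERS` (a stand-in for a trained model), and the diacritized outputs "chó" and "chợ". The sentence is reconstructed from the two context windows shown on the next slide.

```python
from typing import Callable, Dict, List

# A classifier maps the context window of one syllable to its diacritized form.
Classifier = Callable[[List[str]], str]


def restore_pointwise(
    syllables: List[str],
    classifiers: Dict[str, Classifier],
    window: int = 2,
) -> List[str]:
    """Restore each syllable independently of its neighbours' decisions."""
    restored = []
    for i, syllable in enumerate(syllables):
        classifier = classifiers.get(syllable)
        if classifier is None:
            # No model for this string: leave it unchanged.
            restored.append(syllable)
            continue
        # The context comes from the raw input, not from `restored`,
        # so no decision depends on another decision.
        context = syllables[max(0, i - window): i + window + 1]
        restored.append(classifier(context))
    return restored


# Toy rule standing in for a trained classifier (assumption for illustration).
TOY_CLASSIFIERS = {
    "cho": lambda context: "chó" if "con" in context else "chợ",
}

sentence = "1 con cho ngoi ngoai cong cho Dong Xuan".split()
out = restore_pointwise(sentence, TOY_CLASSIFIERS)
print(" ".join(out))
```

Since no prediction feeds into another, the positions could be processed in any order or in parallel.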
  6. Pointwise Approach
     • Feature vector for the first occurrence of “cho”: (“1”, “con”, “cho”, “ngoi”, “ngoai”, “1 con”, “con cho”, “cho ngoi”, “ngoi ngoai”, “N”, “L”, “NL”, “LL”, “con cho(dictionary)”).
     • Feature vector for the second occurrence of “cho”: (“ngoai”, “cong”, “cho”, “Dong”, “Xuan”, “ngoai cong”, “cong cho”, “cho Dong”, “Dong Xuan”, “L”, “U”, “LL”, “LU”, “UU”, “cong cho(dictionary)”).
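
The two vectors above can be reproduced with a short sketch. Several details are inferred from the slide rather than stated by it, and should be read as assumptions: a window of two syllables on each side, syllable types decided by the first character, type features listed once each, and a dictionary feature for any 2-gram that contains the target and appears in a word list. The function names and the two-entry `DICTIONARY` are hypothetical.

```python
from typing import FrozenSet, List


def syllable_type(syllable: str) -> str:
    """Map a syllable to U, L, N, or O (assumed rule: look at the first character)."""
    first = syllable[0]
    if first.isdigit():
        return "N"
    if first.isupper():
        return "U"
    if first.islower():
        return "L"
    return "O"


def unique(items: List[str]) -> List[str]:
    """Drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(items))


def feature_vector(
    syllables: List[str],
    i: int,
    window: int = 2,
    dictionary: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Build syllable n-grams, type n-grams, and dictionary features for position i."""
    lo = max(0, i - window)
    context = syllables[lo: i + window + 1]
    bigrams = [f"{a} {b}" for a, b in zip(context, context[1:])]
    types = [syllable_type(s) for s in context]
    type_bigrams = [a + b for a, b in zip(types, types[1:])]
    target = i - lo  # position of the target syllable inside the window
    dict_feats = [
        f"{bigrams[k]}(dictionary)"
        for k in (target - 1, target)  # the two 2-grams that contain the target
        if 0 <= k < len(bigrams) and bigrams[k] in dictionary
    ]
    return context + bigrams + unique(types) + unique(type_bigrams) + dict_feats


# Hypothetical two-entry word list, just enough for the slide's example.
DICTIONARY = frozenset({"con cho", "cong cho"})
sentence = "1 con cho ngoi ngoai cong cho Dong Xuan".split()

first = feature_vector(sentence, 2, dictionary=DICTIONARY)
second = feature_vector(sentence, 6, dictionary=DICTIONARY)
print(first)
print(second)
```

Both occurrences of “cho” share the same surface string, so only the surrounding features can tell them apart.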
  7. Experimental setting
     • Diacritical marks were restored only for lowercase syllables.
     • Used the SVM implementation in the LIBLINEAR package.
     • The text corpus was crawled from news sources and divided into two parts:
       • 320 MB for training and the remaining 15 MB for testing.
     • One classifier was built for each non-diacritical string (1,525 strings).
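
One way to picture "a classifier for each non-diacritical string" is to group the training examples by the stripped form of each syllable, so that every group becomes the training set of one classifier. The sketch below covers only this grouping step and leaves out the SVM training itself. The helper names, the plain context window used in place of the full feature set, and the diacritized example sentence are assumptions for illustration, not details from the paper.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Tuple


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics; đ/Đ have no Unicode decomposition, so map them by hand."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return no_marks.replace("\u0111", "d").replace("\u0110", "D")


def build_training_sets(
    sentences: List[str], window: int = 2
) -> Dict[str, List[Tuple[List[str], str]]]:
    """Group (context, diacritized label) pairs by non-diacritical string."""
    sets = defaultdict(list)
    for sentence in sentences:
        gold = unicodedata.normalize("NFC", sentence).split()
        plain = [strip_diacritics(s) for s in gold]
        for i, (stripped, label) in enumerate(zip(plain, gold)):
            if not stripped.islower():
                continue  # the slide restricts restoration to lowercase syllables
            context = plain[max(0, i - window): i + window + 1]
            sets[stripped].append((context, label))
    return dict(sets)


# Assumed diacritized form of the example sentence, used as a one-line corpus.
corpus = ["1 con chó ngồi ngoài cổng chợ Đồng Xuân"]
training_sets = build_training_sets(corpus)

for context, label in training_sets["cho"]:
    print(label, context)
```

Here the string "cho" collects two examples with different labels, which is exactly the ambiguity its classifier would have to resolve. Each such group would then be passed to a linear SVM trainer such as LIBLINEAR.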
  8. Result

  9. Conclusion
     • Presented an automatic system for diacritic restoration in Vietnamese texts using a pointwise approach.
     • Drawback: the files generated for the model were very large, up to 16 GB.
     • With proper feature selection, the model files are expected to become smaller.