Yemane
November 05, 2015

Sub-lexical Translations for Low-Resource Language

Khan Md. Anwarus Salam (1), Setsuo Yamada (2), Tetsuro Nishino (1)
(1) The University of Electro-Communications, Tokyo, Japan
(2) NTT Corporation, Tokyo, Japan

Proceedings of the Workshop on Machine Translation and Parsing
in Indian Languages (MTPIL-2012), pages 39–52,
COLING 2012, Mumbai, December 2012.

Transcript

  1. Sub-lexical Translations for Low-Resource Language
    Khan Md. Anwarus Salam (1), Setsuo Yamada (2), Tetsuro Nishino (1)
    (1) The University of Electro-Communications, Tokyo, Japan. (2) NTT Corporation, Tokyo, Japan.
    Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), pages 39–52, COLING 2012, Mumbai, December 2012.
  2. Introduction
    • Motivation: improve information access
        • Most web content is in English
        • Translate information into native languages for monolingual speakers
    • Source and target languages: English to Bangla (230 million speakers)
    • The problem: low coverage caused by Out-Of-Vocabulary (OOV) words
    • Proposed method: sub-lexical translation to achieve wide coverage in Example-Based Machine Translation (EBMT)
  3. Previous work
    • Rule-Based MT
        • Akkhor Bangla Software (free)
        • Apertium-based Anubadok online MT
    • Statistical MT
        • Google Translate
    • These systems either did not handle OOV words or performed poorly on them
  4. Sub-lexical translation
    • A sub-lexical unit is a part of a word that has an independent meaning
    • For example, “bluebird” has two sub-lexical units: “blue” and “bird”
    • Effective for finding translation candidates in EBMT for English to Bangla
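The “bluebird” example can be illustrated with a small sketch. It assumes a sub-lexical unit is any half of a two-way split in which both halves are known words; the vocabulary `VOCAB`, the function name `split_sublexical` and the `min_len` cut-off are hypothetical and are not taken from the slides.

```python
def split_sublexical(word, vocabulary, min_len=3):
    """Return every (left, right) split of `word` whose halves are both known words."""
    word = word.lower()
    splits = []
    # Try each split point that leaves at least `min_len` characters on both sides.
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in vocabulary and right in vocabulary:
            splits.append((left, right))
    return splits


# Toy vocabulary, for illustration only.
VOCAB = {"blue", "bird", "black", "rain", "bow"}

print(split_sublexical("bluebird", VOCAB))  # [('blue', 'bird')]
```

A word with no valid split, such as “dog”, yields an empty list and would be handled by the other OOV strategies.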
  5. The Architecture
    • The input sentence is parsed into chunks [OpenNLP chunker]
    • Chunks are matched against the example base [matching algorithm]
    • OOV words are marked [WordNet]
  6. Word alignment and CST (Chunk-String Templates)
    • Generalization considers only nouns, proper nouns and cardinal numbers (NN, NNP, CD in the OpenNLP tagset)
  7. Handling Out-of-Vocabulary (OOV) words
    • Find semantically related English words in WordNet for the OOV words
    • Rank the translation candidates using a word sense disambiguation (WSD) technique and an English-Bangla dictionary
  8. Finding Sub-lexical Translations
    (1) Find possible sub-lexical units, e.g. “bluebird” ==> “blue” and “bird”
    (2) Extract sub-lexical translations
    (3) Remove less probable sub-lexical translations
    (4) Output translation candidates along with their POS tags
    Selection criteria, applied over the whole set of CSTs:
    Level 4: Exact match.
    Level 3: Sub-lexical unit match; <lexical filename> of WordNet and POS tags match.
    Level 2: Sub-lexical unit match; <lexical filename> of WordNet matches.
    Level 1: Only POS tags match.
    Level 0: No match found; all OOV words.
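The five selection levels can be read as a cascade of checks. The sketch below is one possible reading, not the authors' implementation: it assumes that a “sub-lexical unit match” means the two words share at least one unit, and the `WordInfo` record and `match_level` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordInfo:
    surface: str        # the word as written
    units: frozenset    # its sub-lexical units
    lexfile: str        # WordNet lexical filename, e.g. "noun.animal"
    pos: str            # POS tag, e.g. "NN"


def match_level(query, example):
    """Return the selection level (4 = best, 0 = no match) for two words."""
    if query.surface == example.surface:
        return 4
    # Assumption: a unit match means at least one shared sub-lexical unit.
    shares_unit = bool(query.units & example.units)
    same_lexfile = query.lexfile == example.lexfile
    same_pos = query.pos == example.pos
    if shares_unit and same_lexfile and same_pos:
        return 3
    if shares_unit and same_lexfile:
        return 2
    if same_pos:
        return 1
    return 0


bluebird = WordInfo("bluebird", frozenset({"blue", "bird"}), "noun.animal", "NN")
blackbird = WordInfo("blackbird", frozenset({"black", "bird"}), "noun.animal", "NN")
table = WordInfo("table", frozenset({"table"}), "noun.artifact", "NN")

print(match_level(bluebird, blackbird))  # 3
print(match_level(bluebird, table))      # 1
```

Candidates would then be preferred in descending order of level, with level 0 words passed on to the other OOV strategies.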
  9. Finding candidates from WordNet
    (1) Synonymy: synonyms are searched in synsets (nouns, proper nouns, verbs, adjectives and adverbs)
    (2) Antonymy (opposing-name): a candidate is obtained by negating the antonym, e.g. “unfriendly” = “friendly” negated
    (3) Hyponymy and hypernymy (nouns, verbs): WordNet is searched until a match is found; lower levels are more suitable
    WordNet hypernym chain for “dog”:
        dog, domestic dog, Canis familiaris => canine, canid => carnivore => placental, placental mammal => mammal => vertebrate, craniate => chordate => animal => ...
    English-Bangla dictionary match for “dog”: animal
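The hypernym search in step (3) can be sketched as a walk up the chain that stops at the first word the bilingual dictionary knows. This is a minimal sketch under stated assumptions: the chain is the “dog” example from the slide reduced to one word per level, `DICTIONARY_HEADWORDS` is a hypothetical stand-in for the English-Bangla dictionary, and no real WordNet API is called.

```python
# Simplified hypernym links from the "dog" chain on the slide (first word of each synset).
HYPERNYM_OF = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}

# Hypothetical set of English headwords present in the bilingual dictionary.
DICTIONARY_HEADWORDS = {"animal", "house", "water"}


def nearest_known_hypernym(word, hypernym_of, headwords):
    """Walk up the hypernym chain and return the first word found in the dictionary."""
    seen = set()
    current = hypernym_of.get(word)
    while current is not None and current not in seen:
        if current in headwords:
            return current  # the lowest (most specific) match is preferred
        seen.add(current)   # guard against accidental cycles in the toy data
        current = hypernym_of.get(current)
    return None


print(nearest_known_hypernym("dog", HYPERNYM_OF, DICTIONARY_HEADWORDS))  # animal
```

Because the walk starts at the nearest hypernym, a dictionary that also contained “mammal” would return “mammal” instead, in line with the note that lower levels are more suitable.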
  10. Ranking Candidates Using Google Search Hit Counts
    Search for each candidate word in the context of the sentence, e.g. for the OOV word “dog”:
        This dog is really cool                37,300
        This animal is really cool              1,560
        This domestic animal is really cool      < 10
        This canine is really cool               < 10
    Final candidate: the entry with the highest Google hit count, “animal”
    If the result is < 10, search with the neighboring chunk only (phrase search):
        "This mammal is"              527,000
        "This canid is"               503,000
        "This canine is"              110,000
        "This carnivore is"            58,600
        "This vertebrate is"            2,460
        "This placental is"                46
        "This craniate is"                 27
        "This chordate is"                 27
        "This placental mammal is"          6
    Remaining OOV words are transliterated into the Bengali alphabet.
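The ranking step can be sketched as follows. The hit counts are copied from the slide, and no search API is called. Several points are assumptions of this sketch: entries reported only as “< 10” are treated as 0, the fallback to the shorter phrase is applied when no candidate reaches 10 hits, and the function names are hypothetical.

```python
# Hit counts from the slide; "< 10" entries are omitted and treated as 0 (assumption).
SLIDE_HITS = {
    "This dog is really cool": 37300,
    "This animal is really cool": 1560,
    "This mammal is": 527000,
    "This canid is": 503000,
    "This canine is": 110000,
    "This carnivore is": 58600,
    "This vertebrate is": 2460,
    "This placental is": 46,
    "This craniate is": 27,
    "This chordate is": 27,
    "This placental mammal is": 6,
}


def lookup_hits(phrase):
    """Stand-in for a search engine query; returns 0 for unknown phrases."""
    return SLIDE_HITS.get(phrase, 0)


def rank(candidates, template, hits, min_hits=10):
    """Order candidates by hit count, dropping those below `min_hits`."""
    scored = [(hits(template.format(c)), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= min_hits]
    return [c for _, c in sorted(kept, key=lambda pair: pair[0], reverse=True)]


def best_candidate(candidates, hits=lookup_hits,
                   sentence="This {} is really cool", phrase="This {} is"):
    """Try the full sentence first, then the shorter neighboring-chunk phrase."""
    for template in (sentence, phrase):
        ranked = rank(candidates, template, hits)
        if ranked:
            return ranked[0]
    return None  # the caller would fall back to transliteration


print(best_candidate(["animal", "domestic animal", "canine"]))  # animal
print(best_candidate(["canine", "canid"]))                      # canid
```

Returning `None` when every count stays below the threshold mirrors the last line of the slide, where the remaining OOV words are transliterated.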
  11. Experiment and Result
    • Translation quality: human evaluation
    • 2000 aligned sentences
    • Generated 15356 initial CSTs, 543 generalized CSTs and 12458 combined CSTs
    • The result shows improved coverage of OOV words
    • Evaluation categories: exact match; understandable for humans; wrong word choice and wrong word order; non-translated words