Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus

Yemane
February 25, 2016

Xiaodong Liu, Kevin Duh and Yuji Matsumoto
Graduate School of Information Science
Nara Institute of Science and Technology, Japan

Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 212–221, Sofia, Bulgaria
© 2013 Association for Computational Linguistics

Transcript

  1. Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus
     February 25, 2016
     Xiaodong Liu, Kevin Duh and Yuji Matsumoto
     Graduate School of Information Science, Nara Institute of Science and Technology, Japan
     Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 212–221, Sofia, Bulgaria
     © 2013 Association for Computational Linguistics
  2. Introduction
     • Proposed system: a language-independent framework for extracting a bilingual dictionary from comparable corpora
     • Approach: a combination of topic modeling and word alignment techniques
     • Corpora type: comparable corpora; the document-aligned corpus is converted into a parallel topic-aligned corpus, and word alignment is then performed using co-occurrence statistics
  3. Introduction
     • Significance of machine-readable dictionaries
       • In SMT, domain adaptation
       • In CLIR, query translation
       • Multilingual applications
     • Bilingual extraction algorithms should:
       • Handle low-resource languages and be language independent
       • Handle polysemy
       • Be scalable
     • Result: extraction of high-precision translation pairs under various conditions
  4. Example
     • Example: free
       = 自由 / じゆう / (as in ‘free’ speech), prob = 0.5
       = 無料 / むりょう / (as in ‘free’ beer), prob = 0.5
     • Traditional extraction models: maximize p(we | wf) or p(wf | we), where we = English word, wf = French word
     • Proposed method: p(we | wf, t) or p(wf | we, t), where t = topic (e.g. politics or shopping)
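The effect of conditioning on the topic can be illustrated with a small sketch. The 0.5/0.5 split comes from the slide; the topic-conditioned values and the helper name `best_translation` are made up for illustration and are not figures from the paper.

```python
# Translation candidates for the English word "free".
# Without a topic, both senses are equally likely (values from the slide).
p_no_topic = {"自由": 0.5, "無料": 0.5}

# ASSUMPTION: illustrative topic-conditioned probabilities, not real results.
p_by_topic = {
    "politics": {"自由": 0.9, "無料": 0.1},
    "shopping": {"自由": 0.1, "無料": 0.9},
}


def best_translation(dist):
    """Return the candidate with the highest probability (ties: first seen)."""
    return max(dist, key=dist.get)
```

Without the topic there is no basis for preferring either sense; once the topic is known, `best_translation(p_by_topic["politics"])` and `best_translation(p_by_topic["shopping"])` pick different words.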
  5. Proposed Framework for Bilingual Dictionary Extraction
     The innovation is in how the topic-aligned corpus is defined and constructed.
  6. Topic alignment
     1. Train a multilingual topic model.
     2. Infer a topic assignment for each token, giving word collections Ck,i.
     3. Topic alignment: re-arrange the word collections so that the Ck,i belonging to the same topic are grouped together.
     4. For each topic tk, run IBM Model-1 on Ck,1 … Ck,i … Ck,D.
     5. To extract a bilingual dictionary, find pairs (we, wf) with high probability under the model.
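Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the topic assignments are assumed to be given already (in the framework they would come from the multilingual topic model), the IBM Model 1 trainer omits the NULL word for brevity, and the function names `topic_align`, `ibm_model1` and `per_topic_tables` are hypothetical.

```python
from collections import defaultdict


def topic_align(docs):
    """Group tokens by topic: returns {topic: [(e_words, f_words), ...]}.

    `docs` is a list of document pairs; each side is a list of
    (word, topic) tuples, i.e. topic assignments are assumed given.
    """
    aligned = defaultdict(list)
    for e_side, f_side in docs:
        e_by_topic, f_by_topic = defaultdict(list), defaultdict(list)
        for word, topic in e_side:
            e_by_topic[topic].append(word)
        for word, topic in f_side:
            f_by_topic[topic].append(word)
        # Keep only topics seen on both sides of this document pair.
        for topic in e_by_topic.keys() & f_by_topic.keys():
            aligned[topic].append((e_by_topic[topic], f_by_topic[topic]))
    return dict(aligned)


def ibm_model1(pairs, iterations=10):
    """Estimate t(e | f) by EM, keyed as t[(e, f)] (no NULL word)."""
    e_vocab = {e for es, _ in pairs for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each e token over the f words it co-occurs with.
        for es, fs in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalise the expected counts per f word.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


def per_topic_tables(docs, iterations=10):
    """Run IBM Model 1 separately on each topic's word collections."""
    return {k: ibm_model1(pairs, iterations) for k, pairs in topic_align(docs).items()}
```

Each topic gets its own translation table, so the same source word can receive different translations under different topics. How the per-topic tables are combined into a final dictionary score (step 5) is not shown here.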
  7. Experimental setup
     • Corpus: Kyoto Wiki Corpus, Japanese–English translations
     • Comparability level
       • Comp100%: 100% comparable corpora, keeping all Japanese-English content, but with the sentence alignments deleted
       • Comp50%: keeping 50% of randomly selected data per document
       • Comp20%: keeping 20% of randomly selected data per document
     • Datasets: the number of document pairs (#doc), sentences (#sent) and vocabulary size (#voc) in English (e) and Japanese (j).
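One way such reduced-comparability datasets could be built is sketched below. The slide does not say what the unit of "data" is or whether the two languages are sampled jointly, so this sketch assumes sentence-level sampling done independently on each side; the function names and the `seed` parameter are hypothetical.

```python
import random


def subsample_document(sentences, keep_ratio, rng):
    """Keep a random `keep_ratio` fraction of the sentences, in original order."""
    n_keep = round(len(sentences) * keep_ratio)
    kept = sorted(rng.sample(range(len(sentences)), n_keep))
    return [sentences[i] for i in kept]


def make_comparable(doc_pairs, keep_ratio, seed=0):
    """Subsample each side of every document pair.

    ASSUMPTION: the English and Japanese sides are sampled independently,
    so the surviving sentences are generally no longer translations of
    each other; only the document-level pairing is preserved.
    """
    rng = random.Random(seed)
    return [
        (subsample_document(e, keep_ratio, rng), subsample_document(j, keep_ratio, rng))
        for e, j in doc_pairs
    ]
```

Calling `make_comparable(doc_pairs, 0.5)` or `make_comparable(doc_pairs, 0.2)` would give datasets in the spirit of Comp50% and Comp20%.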
  8. Comparison with previous work
     • Baseline: IBM-1 (no topic models; treats each document pair as a sentence pair)
     • The Cue Method (Vulić et al., 2011)
     • The JS Method (Vulić et al., 2011)
     • Computes topic-word distributions to calculate translation probabilities
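As a rough illustration of how topic-word distributions can be turned into translation scores, the sketch below shows a Cue-style score, sum over k of p(we | tk) · p(tk | wf), and a Jensen-Shannon divergence between two distributions. The slides do not give the formulas, so this is a reading of the general idea under a uniform topic prior, not a confirmed reproduction of Vulić et al. (2011); all function names are hypothetical.

```python
import math


def topic_posterior(word, word_given_topic):
    """p(t | word) from a list of p(word | t) dicts, assuming a uniform topic prior."""
    scores = [dist.get(word, 0.0) for dist in word_given_topic]
    z = sum(scores)
    if z == 0:
        # Word unseen in every topic: fall back to a uniform posterior.
        return [1.0 / len(scores)] * len(scores)
    return [s / z for s in scores]


def cue_score(we, wf, e_given_topic, f_given_topic):
    """Cue-style score: sum_k p(we | t_k) * p(t_k | wf)."""
    post = topic_posterior(wf, f_given_topic)
    return sum(e_given_topic[k].get(we, 0.0) * post[k] for k in range(len(post)))


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Under this reading, a JS-style method would rank candidate pairs by how small the divergence is between the two words' topic posteriors, for example `js_divergence(topic_posterior(we, e_given_topic), topic_posterior(wf, f_given_topic))`.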
  9. Error analysis
     • Incorrect lexicon extractions
     • Word segmentation errors
     • Incorrect topic error
     • Correct topic, incorrect alignment error
  10. Conclusions
     • Proposed an effective method of extracting bilingual dictionaries by a novel combination of topic modeling and word alignment techniques
     • The key innovation is the conversion of a comparable document-aligned corpus into a parallel topic-aligned corpus
     • The proposed framework outperforms existing baselines under both automatic metrics and manual evaluation
     • Future work:
       • Exploring other topic models
       • Extracting lexicons from massive multilingual collections