Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus

February 25, 2016
Xiaodong Liu, Kevin Duh and Yuji Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 212–221, Sofia, Bulgaria
© 2013 Association for Computational Linguistics

Introduction
• Proposed system: a language-independent framework for extracting a bilingual dictionary from comparable corpora
• Approach: a combination of topic modeling and word alignment techniques
• Corpora type: comparable corpora. The document-aligned corpus is converted into a parallel topic-aligned corpus, and word alignment is then performed using co-occurrence statistics

Introduction
• Significance of machine-readable dictionaries
  • In SMT: domain adaptation
  • In CLIR: query translation
  • Multilingual applications
• Bilingual extraction algorithms should:
  • Handle low-resource languages and be language independent
  • Handle polysemy
  • Be scalable
• Result: extraction of high-precision translation pairs under various conditions

Example
• free = 自由 / じゆう (as in ‘free’ speech), prob = 0.5
• free = 無料 / むりょう (as in ‘free’ beer), prob = 0.5
• Traditional extraction models: maximize p(w_e | w_f) or p(w_f | w_e), where w_e = English word and w_f = French word
• Proposed method: p(w_e | w_f, t) or p(w_f | w_e, t), where t = topic (e.g. politics, shopping)
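The effect of conditioning on a topic can be illustrated with a minimal sketch. The flat 0.5/0.5 table comes from the slide; the topic-conditioned values (0.9/0.1) and the names `P_FLAT`, `P_TOPIC` and `best_translation` are hypothetical and chosen only for illustration.

```python
# Topic-agnostic table from the slide: both senses of "free" are equally likely.
P_FLAT = {"free": {"自由": 0.5, "無料": 0.5}}

# Hypothetical topic-conditioned table; the 0.9 / 0.1 values are made up.
P_TOPIC = {
    ("free", "politics"): {"自由": 0.9, "無料": 0.1},
    ("free", "shopping"): {"自由": 0.1, "無料": 0.9},
}

def best_translation(word, topic=None):
    """Return the most probable translation, optionally conditioned on a topic."""
    dist = P_FLAT[word] if topic is None else P_TOPIC[(word, topic)]
    return max(dist, key=dist.get)

print(best_translation("free", "politics"))
print(best_translation("free", "shopping"))
```

Without a topic the two candidates are tied, so the flat model has no basis for choosing between them; with a topic the ambiguity disappears.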

Proposed Framework for Bilingual Dictionary Extraction
The innovation is in how this topic-aligned corpus is defined and constructed.

Topic-aligned Corpora

Topic alignment
1. Train a multilingual topic model.
2. Infer a topic assignment for each token, giving a word collection C_{k,i} for topic k in document pair i.
3. Topic alignment: re-arrange the word collections so that the C_{k,i} belonging to the same topic are grouped together.
4. For each topic t_k, run IBM Model 1 on C_{k,1}, …, C_{k,i}, …, C_{k,D}.
5. To extract a bilingual dictionary, find pairs (w_e, w_f) with high probability under the resulting model.
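Steps 2 to 5 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: topic assignments are given by hand rather than inferred by a multilingual topic model, IBM Model 1 is run without a NULL word, and the final combination p(w_e | w_f) = Σ_k p(w_e | w_f, t_k) p(t_k | w_f) is an assumed marginalisation over topics, since the formula itself is not shown on the slide. All function names are hypothetical.

```python
from collections import defaultdict

def topic_align(doc_pairs):
    """Group the words of each document pair by topic (step 3).
    doc_pairs: list of (e_tokens, f_tokens); each token is (word, topic_id)."""
    aligned = defaultdict(list)
    for e_tokens, f_tokens in doc_pairs:
        e_by_topic, f_by_topic = defaultdict(list), defaultdict(list)
        for w, k in e_tokens:
            e_by_topic[k].append(w)
        for w, k in f_tokens:
            f_by_topic[k].append(w)
        # Keep a topic only if both languages have words assigned to it.
        for k in e_by_topic.keys() & f_by_topic.keys():
            aligned[k].append((e_by_topic[k], f_by_topic[k]))
    return aligned

def ibm_model1(pairs, iterations=10):
    """EM for IBM Model 1 (no NULL word); returns t[(e, f)] = p(e | f) (step 4)."""
    e_vocab = {e for es, _ in pairs for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialisation
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for es, fs in pairs:
            for e in es:
                z = sum(t[(e, f)] for f in fs)
                for f in fs:
                    c = t[(e, f)] / z  # expected alignment count
                    count[(e, f)] += c
                    total[f] += c
        t = defaultdict(float, {ef: c / total[ef[1]] for ef, c in count.items()})
    return t

def topic_posterior(doc_pairs):
    """Estimate p(t_k | w_f) from token-level topic assignment counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for _, f_tokens in doc_pairs:
        for w, k in f_tokens:
            counts[w][k] += 1.0
    return {w: {k: c / sum(ks.values()) for k, c in ks.items()}
            for w, ks in counts.items()}

def translation_prob(e, f, models, posterior):
    """Assumed combination (step 5): sum_k p(e | f, t_k) * p(t_k | f)."""
    return sum(p_k * models[k].get((e, f), 0.0)
               for k, p_k in posterior.get(f, {}).items() if k in models)

# Toy data: topic 0 = politics, topic 1 = shopping (hand-assigned).
doc_pairs = [
    ([("free", 0), ("speech", 0), ("free", 1), ("beer", 1)],
     [("自由", 0), ("言論", 0), ("無料", 1), ("ビール", 1)]),
    ([("free", 0), ("press", 0), ("free", 1), ("sample", 1)],
     [("自由", 0), ("報道", 0), ("無料", 1), ("見本", 1)]),
]
models = {k: ibm_model1(pairs) for k, pairs in topic_align(doc_pairs).items()}
posterior = topic_posterior(doc_pairs)
print(translation_prob("free", "自由", models, posterior))
print(translation_prob("free", "無料", models, posterior))
```

Because each topic gets its own alignment model, "free" can be paired with 自由 in the politics topic and with 無料 in the shopping topic without the two senses competing for the same probability mass.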

Experimental setup
• Corpus: Kyoto Wiki Corpus, Japanese–English translations
• Comparability level
  • Comp100% = 100% comparable corpus, keeping all Japanese-English content but with the sentence alignments removed
  • Comp50% = keeping 50% of randomly selected data per document
  • Comp20% = keeping 20% of randomly selected data per document
Datasets: the number of document pairs (#doc), sentences (#sent) and vocabulary size (#voc) in English (e) and Japanese (j).
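The reduced-comparability datasets could be built roughly as follows. This is a sketch under assumptions the slide does not confirm: the sampled unit is taken to be a sentence, and each language side is sampled independently so that no sentence alignment survives. The function name `subsample` and its parameters are hypothetical.

```python
import random

def subsample(doc_pairs, keep=0.5, seed=0):
    """Keep a random fraction of sentences on each side of every document pair.
    doc_pairs: list of (e_sentences, f_sentences)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    out = []
    for e_sents, f_sents in doc_pairs:
        sides = []
        for sents in (e_sents, f_sents):
            # Keep at least one sentence when the document is non-empty.
            k = min(len(sents), max(1, round(keep * len(sents))))
            sides.append(rng.sample(sents, k))
        out.append(tuple(sides))
    return out

# Toy document pair with ten sentences per side.
docs = [([f"e{i}" for i in range(10)], [f"j{i}" for i in range(10)])]
comp50 = subsample(docs, keep=0.5)
comp20 = subsample(docs, keep=0.2)
```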

Comparison with previous work
• Baseline: IBM Model 1 (no topic models; treats each document pair as a sentence pair)
• The Cue method (Vulić et al., 2011)
• The JS method (Vulić et al., 2011)
  • Both compute topic-word distributions to calculate translation probabilities
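For orientation, the two topic-based baselines can be sketched as below. The definitions are assumptions based on how these methods are usually described, not details taken from the slides: Cue scores a pair by Σ_k p(w_e | t_k) p(t_k | w_f), and JS compares the two words' topic distributions with the Jensen-Shannon divergence (lower means more similar). The function names are hypothetical.

```python
import math

def cue_score(p_e_given_t, p_t_given_f):
    """Assumed Cue score: sum_k p(w_e | t_k) * p(t_k | w_f)."""
    return sum(a * b for a, b in zip(p_e_given_t, p_t_given_f))

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two words with identical topic profiles versus completely disjoint ones.
same = js_divergence([0.7, 0.3], [0.7, 0.3])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Neither score uses word alignment inside a topic, which is the step the proposed framework adds.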

Comparison with previous work (2)
ROC = Receiver Operating Characteristic curve

Comparison with previous work (3)

Degree of comparability

Capturing polysemy

Error analysis
• Incorrect lexicon extractions
  • Word segmentation errors
  • Incorrect topic
  • Correct topic, incorrect alignment

Conclusions
• Proposed an effective method of extracting bilingual dictionaries through a novel combination of topic modeling and word alignment techniques
• The key innovation is the conversion of a comparable document-aligned corpus into a parallel topic-aligned corpus
• The proposed framework outperforms existing baselines under both automatic metrics and manual evaluation
• Future work:
  • Exploring other topic models
  • Extracting lexicons from massive multilingual collections


Yemane
