Adrien Lardilleux and Yves Lepage
GREYC, University of Caen Basse-Normandie, France
International Conference RANLP, 2009
Borovets, Bulgaria, pages 214–218
terms to extract high-quality multi-word alignments
• Method – random sampling of perfect alignments from numerous sub-corpora
• Results – competitive with GIZA++ in quality, with lower processing time and higher coverage
address the issues of quality
• Some other issues worth exploring:
• Simultaneous multilingual alignment
• Scaling up: coverage and processing time
• Integration with real applications
high-frequency terms have higher significance
• Avoiding indefinite data increment – safe alignment of low-frequency words
• Convert high-frequency terms into low-frequency ones by decreasing the data (as opposed to turning low-frequency terms into high-frequency ones by increasing the data)
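The "decrease the data" idea can be sketched in a few lines of Python (the function name and toy usage are illustrative, not from the paper): counting a word in a random subcorpus rather than in the full corpus lowers its absolute frequency, possibly down to a hapax.

```python
import random

def freq_in_subcorpus(sents, word, k, seed=0):
    """Count `word` in a random subcorpus of k sentences:
    removing data turns a high-frequency word into a low-frequency one."""
    subcorpus = random.Random(seed).sample(sents, k)
    return sum(s.split().count(word) for s in subcorpus)
```

A frequent word such as "the" in a four-sentence corpus occurs at most as often in a two-sentence subcorpus, and may not occur at all.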
found only once in their corpus
• Hapaxes represent ~50% of the vocabularies
• Hapaxes are aligned under an assumption of lexical equivalence
• By filtering input sentences:
• High-frequency terms are converted into low-frequency terms
• The terms reduced to hapaxes form translation pairs
• Perfect alignments can be extracted
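The hapax-pairing step can be sketched as follows. This is a simplified illustration under the lexical-equivalence assumption, not the paper's full algorithm: it only pairs sentence pairs that contain exactly one hapax on each side.

```python
from collections import Counter

def hapax_pairs(src_sents, tgt_sents):
    """Pair words that occur exactly once on each side of a
    sentence-aligned corpus (lexical-equivalence assumption)."""
    src_counts = Counter(w for s in src_sents for w in s.split())
    tgt_counts = Counter(w for s in tgt_sents for w in s.split())
    pairs = set()
    for s, t in zip(src_sents, tgt_sents):
        s_hap = [w for w in s.split() if src_counts[w] == 1]
        t_hap = [w for w in t.split() if tgt_counts[w] == 1]
        # Only keep unambiguous cases: one hapax on each side.
        if len(s_hap) == 1 and len(t_hap) == 1:
            pairs.add((s_hap[0], t_hap[0]))
    return pairs
```

On a toy corpus such as ["the cat", "the dog"] / ["le chat", "le chien"], "the" and "le" occur twice and are ignored, while the hapaxes yield the pairs (cat, chat) and (dog, chien).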
particular sentence is chosen is k/n;
• the probability that this sentence is not chosen is 1 − k/n;
• the probability that none of the k sentences is chosen is (1 − k/n)^k;
• the probability that none of these k sentences is ever chosen is (1 − k/n)^(kx).
• Hence, requiring (1 − k/n)^(kx) ≤ t, the number of random sub-corpora of size k to create by sampling is x ≥ ln t / (k · ln(1 − k/n))
• (1 ≤ k ≤ n), n = size of the input corpus, k = size of the sub-corpora, x = number of sub-corpora, t = threshold on the probability that a sentence is never chosen
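Solving (1 − k/n)^(kx) ≤ t for x, as in the derivation above, gives a one-line helper (the function name is mine, not the paper's):

```python
import math

def num_subcorpora(n, k, t=0.01):
    """Smallest x such that (1 - k/n)**(k*x) <= t, i.e. the number of
    random subcorpora of size k so that the probability that a given
    sentence is never drawn falls below the threshold t."""
    assert 1 <= k < n
    return math.ceil(math.log(t) / (k * math.log(1 - k / n)))
```

For example, with n = 10000 sentences, subcorpora of size k = 100, and t = 0.01, five subcorpora suffice.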
created by removing input data
• Sampling is random; the number of sub-corpora of size k to process is derived from the sampling probabilities
• To ensure coverage, alignments are extracted from numerous sub-corpora
• at least x random sub-corpora of size k
• Processing a sub-corpus is fast, and parallel processing is possible
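The sampling loop itself can be sketched as follows (a minimal illustration; the function name is mine). Each subcorpus is drawn independently, which is why they can be processed in parallel.

```python
import random

def random_subcorpora(sents, k, x, seed=0):
    """Draw x random subcorpora of k sentences each. The draws are
    independent, so the subcorpora can be processed in parallel."""
    rng = random.Random(seed)
    return [rng.sample(sents, k) for _ in range(x)]
```

Each subcorpus would then be handed to the alignment-extraction step (e.g. hapax pairing) on its own.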
language and all the other remaining languages.
• Si – sequence of words of language i
• S1, …, SL (1 ≤ i ≤ L) – sequences of words of the target languages
• evaluates how each source word translates into the target words it links to.
• Each language becomes the source in turn, and the rest of the alignment is taken as the target.
• As many lexical weights are generated as there are languages; the final weight is their product.
• The sampling-based approach links not only words but also sequences of words
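The weight combination described above can be sketched as follows. This is a simplified illustration, not the paper's exact formula: `trans_prob` is a hypothetical placeholder scoring how a source word translates into a set of target words.

```python
from math import prod

def multilingual_weight(alignment, trans_prob):
    """Each of the L languages of the alignment becomes the source in
    turn; the rest of the alignment is the target. One lexical weight
    is computed per source language, and the final weight is their
    product. `trans_prob(src_word, tgt_words)` is a hypothetical
    scoring function, not defined in the paper."""
    weights = []
    for i, src_words in enumerate(alignment):
        # All words of the other L-1 languages form the target side.
        tgt_words = [w for j, seq in enumerate(alignment) if j != i
                     for w in seq]
        weights.append(prod(trans_prob(s, tgt_words) for s in src_words))
    return prod(weights)
```

With L = 3 languages and a constant placeholder probability of 0.5 per word, the combined weight is 0.5³ = 0.125.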
be aligned simultaneously from parallel corpora.
• The algorithm relies on the use of low-frequency terms
• The sampling-based approach produces high-quality translations.
• The method can match the accuracy of GIZA++ while having a much higher coverage of the input data and being far simpler