Sampling-based Multilingual
Alignment
March 24, 2016
Adrien Lardilleux and Yves Lepage
GREYC, University of Caen Basse-Normandie,
France
International Conference RANLP, 2009
Borovets, Bulgaria, pages 214–218
Introduction
• Purpose – word alignment that exploits low frequency
terms to extract high quality multi-word alignments
• Method – random sampling of perfect alignments from
numerous sub-corpora
• Results – competitive with GIZA++ in quality, with lower
processing time and higher coverage
Motivation
• The IBM models implemented in GIZA++, and other
methods, mainly address alignment quality
• Other issues are worth exploring:
• Simultaneous multilingual alignment
• Scaling up, coverage, and processing time
• Integration with real applications
Rationale –
From high to low frequency
• Intuition – high frequency terms have higher significance
for alignment
• Avoiding indefinite data increment – low frequency
words can be aligned safely
• Convert high frequency terms into low frequency terms by
decreasing the data (as opposed to increasing it)
[Diagram: increasing the data turns low frequency terms into high
frequency terms; decreasing the data turns high frequency terms back
into low frequency terms]
Rationale –
hapax legomena
• Hapax legomena (hapaxes) – words that occur only once in
their corpus
• They represent ~50% of vocabularies
• Hapaxes are aligned under an assumption of lexical
equivalence
• By filtering input sentences:
• high frequency terms are converted into low frequency terms
• terms reduced to hapaxes form translation pairs
• perfect alignments can be extracted
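As a minimal illustration of the idea (the function name and toy corpus are invented for this sketch), hapaxes can be found with a simple frequency count:

```python
from collections import Counter

def hapaxes(sentences):
    """Return the words occurring exactly once in the corpus
    (the hapax legomena)."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, c in counts.items() if c == 1}

corpus = [["a", "rose", "is", "a", "rose"], ["time", "is", "money"]]
print(sorted(hapaxes(corpus)))  # ['money', 'time']
```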
Methodology –
Perfect alignment
“sequences of words that strictly appear in the same
sentences.”
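The definition can be turned into a small sketch: in a corpus treated as one multilingual whole, words (tagged with their language) that occur in exactly the same set of sentences form a perfect alignment. The function and toy corpus below are illustrative, not the paper's implementation:

```python
from collections import defaultdict

def perfect_alignments(corpus):
    """Group (language, word) pairs by the exact set of sentences
    they appear in; pairs sharing a profile across more than one
    language appear strictly in the same sentences."""
    profile = defaultdict(set)  # (lang, word) -> set of sentence ids
    for i, sentences in enumerate(corpus):
        for lang, tokens in enumerate(sentences):
            for token in tokens:
                profile[(lang, token)].add(i)
    groups = defaultdict(list)  # occurrence profile -> words sharing it
    for word, sents in profile.items():
        groups[frozenset(sents)].append(word)
    # keep only groups spanning more than one language
    return [g for g in groups.values()
            if len({lang for lang, _ in g}) > 1]

corpus = [
    (["the", "cat", "sleeps"], ["le", "chat", "dort"]),
    (["the", "dog", "barks"], ["le", "chien", "aboie"]),
]
for group in perfect_alignments(corpus):
    print(sorted(group))
```

Here "the"/"le" occur in both sentences, while "cat sleeps"/"chat dort" occur only in the first, so each group is a perfect alignment.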
Methodology –
alingual corpus
a corpus mixing multiple languages, treated as a single monolingual
corpus; it does not involve any language-dependent concept.
Methodology –
Biasing the sampling
• the probability that a particular sentence is chosen in a sub-corpus of size k is k/n;
• the probability that this sentence is not chosen is 1 − k/n;
• the probability that this sentence is not chosen in any of x sub-corpora of size k is (1 − k/n)^x.
• Hence, requiring this probability to fall below a threshold t yields the number of random sub-corpora of size k to create by sampling:
• x ≥ log t / log(1 − k/n)
• (1 <= k <= n); n = size of the alingual corpus, k = size of a sub-corpus, x =
number of sub-corpora, t = threshold on the probability that a sentence is never chosen
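The derivation above can be sketched numerically; the function names and the threshold t = 0.01 are illustrative choices for this example, not the paper's:

```python
import math
import random

def num_subcorpora(n, k, t=0.01):
    """Smallest x such that the probability that a given sentence is
    never chosen in x random sub-corpora of size k, (1 - k/n)^x,
    falls below the threshold t."""
    return math.ceil(math.log(t) / math.log(1 - k / n))

def sample_subcorpus(corpus, k):
    """Draw one random sub-corpus of k sentences (without replacement)."""
    return random.sample(corpus, k)

n = 10000
for k in (10, 100, 1000):
    print(k, num_subcorpora(n, k))  # smaller sub-corpora need more draws
```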
Methodology –
Sampling input data
• New perfect alignments are created by removing input
data
• Sampling is random; the number of sub-corpora of each size
to process is given by the formula above
• To ensure coverage, alignments are extracted from
numerous sub-corpora
• at least x random sub-corpora of size k
• Processing a sub-corpus is fast, and sub-corpora can be
processed in parallel
Extracting alignments
Scoring alignments
• Translation probabilities are computed between one
language and all the other remaining
languages.
• S_i – the sequence of words of language i
• S_1, …, S_L (1 <= i <= L) – the sequences of words of the
target languages
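The slide's formula is not reproduced here; as a rough, hedged sketch, translation probabilities can be estimated as relative frequencies over the alignments collected from the sub-corpora (an illustrative estimator, not necessarily the paper's exact one):

```python
from collections import Counter

def translation_probs(alignments):
    """Estimate p(target | source) by relative frequency over a
    multiset of (source, target) alignment pairs."""
    pair_counts = Counter(alignments)
    source_counts = Counter(src for src, _ in alignments)
    return {(src, tgt): c / source_counts[src]
            for (src, tgt), c in pair_counts.items()}

alignments = [("chat", "cat"), ("chat", "cat"), ("chat", "chat")]
probs = translation_probs(alignments)
print(probs[("chat", "cat")])  # 2 of the 3 'chat' alignments
```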
Translation probabilities
Lexical weights
• Lexical weights determine the quality of alignments
• they evaluate how each source word translates into the target words it links to
• Each language becomes the source in turn, and the rest of the
alignment is assimilated to the target
• As many lexical weights are generated as there are languages; the
overall weight is their product
• The sampling-based approach links not only words but also
sequences of words
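A sketch in the spirit of standard phrase-based lexical weighting (à la Koehn): average each source word's translation probabilities over the target words, then take the product over source words. The simplification that every source word links to all target words, and the probability table, are assumptions of this example, not the paper's formula:

```python
def lexical_weight(src_words, tgt_words, p):
    """For each source word, average the translation probabilities of
    the target words it links to (here: all of them), then take the
    product over source words. Pairs absent from `p` count as 0."""
    weight = 1.0
    for s in src_words:
        probs = [p.get((s, t), 0.0) for t in tgt_words]
        weight *= sum(probs) / len(probs)
    return weight

# Hypothetical probabilities for a French -> English alignment.
p = {("chat", "cat"): 1.0, ("noir", "black"): 0.5}
w = lexical_weight(["chat", "noir"], ["cat", "black"], p)
print(w)  # (1.0/2) * (0.5/2) = 0.125
```

Swapping the roles so that each language is the source in turn, and multiplying the resulting weights, gives the overall score described above.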
Evaluation
• Comparison with GIZA++ on two corpora:
(1) the IWSLT07 Japanese-to-English classical task
Evaluation
(2) a Spanish-to-French task on the Europarl corpus
• Lexical weights boost the performance
• Coverage is higher than GIZA++'s
Conclusion
• The sampling-based alignment method allows multiple
languages to be aligned simultaneously from parallel
corpora.
• The algorithm relies on the use of low frequency terms.
• The sampling-based approach produces high quality
translations.
• The method can match the accuracy of GIZA++ while
having much higher coverage of the input data and being
far simpler.