Nguyen Van Hai
Finding Synonyms Using Automatic Word Alignment
and Measures of Distributional Similarity
Lonneke van der Plas & Jörg Tiedemann
Alfa-Informatica, University of Groningen
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions,
pages 866-873, Sydney, July 2006.
Distributional similarity has been used to extract
semantically related words.
– It cannot distinguish between synonyms and other
types of semantically related words.
This paper presents a method based on automatic
word alignment of parallel corpora.
Results show higher precision and recall scores
than the monolingual syntax-based approach.
For single words sharing the same meaning we speak of synonyms.
They define context in a multilingual setting.
– Translate a word into other languages
– Assume that words sharing translation contexts are semantically related
– Measure this using distributional similarity
They use both monolingual syntax-based and multilingual alignment-based approaches.
Measuring Distributional Similarity
Distributional similarity is used to acquire semantically similar words.
Similar words are used in similar contexts. The contexts of a given word
are used as the features in a vector called the context vector.
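A minimal sketch of how such context vectors could be built, assuming the input is a list of (word, feature) co-occurrences. The function name and the toy features (written as "relation:head") are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict

def build_context_vectors(pairs):
    """Map each word to a Counter of the context features it occurs with."""
    vectors = defaultdict(Counter)
    for word, feature in pairs:
        vectors[word][feature] += 1
    return dict(vectors)

# Invented toy co-occurrences, not data from the paper.
pairs = [
    ("coffee", "obj:drink"), ("coffee", "obj:drink"), ("coffee", "adj:hot"),
    ("tea", "obj:drink"), ("tea", "adj:hot"),
    ("car", "obj:drive"),
]
vectors = build_context_vectors(pairs)
```

In this toy data, "coffee" and "tea" share both of their features, while "car" shares none with either, which is the intuition behind comparing words by their context vectors.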
Van der Plas and Bouma (2002) present a similar experiment for Dutch
with Pointwise Mutual Information and Dice.
The weight is an indication of the amount of
information carried by a particular combination of a
noun and its feature.
For example, the verb have and the verb squeeze
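The weighting and similarity steps can be sketched as follows, assuming the standard definition of pointwise mutual information, log P(w,f) / (P(w) P(f)), and a weighted Dice coefficient computed as twice the sum of the minimum shared weights over the sum of all weights. The exact Dice variant used in the paper may differ. The counts are invented for illustration; under this definition a feature that occurs with many nouns (have) gets a lower weight than a feature that occurs with few (squeeze).

```python
import math
from collections import Counter

def pmi_weights(counts):
    """counts: {(word, feature): frequency} -> {(word, feature): PMI}."""
    total = sum(counts.values())
    word_totals, feat_totals = Counter(), Counter()
    for (w, f), c in counts.items():
        word_totals[w] += c
        feat_totals[f] += c
    return {
        (w, f): math.log((c * total) / (word_totals[w] * feat_totals[f]))
        for (w, f), c in counts.items()
    }

def dice(vec_a, vec_b):
    """Weighted Dice over two {feature: weight} dicts.

    Assumes non-negative weights, so negative PMI values would need to be
    clipped or dropped before calling this (an assumption of this sketch).
    """
    shared = set(vec_a) & set(vec_b)
    num = 2 * sum(min(vec_a[f], vec_b[f]) for f in shared)
    den = sum(vec_a.values()) + sum(vec_b.values())
    return num / den if den else 0.0

# Invented toy counts: "have" occurs with every noun, "squeeze" only with "lemon".
counts = {
    ("lemon", "obj:have"): 5, ("lemon", "obj:squeeze"): 5,
    ("car", "obj:have"): 10, ("house", "obj:have"): 10,
}
weights = pmi_weights(counts)
```

Here `weights[("lemon", "obj:squeeze")]` comes out higher than `weights[("lemon", "obj:have")]`, even though both combinations have the same raw frequency.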
– Reduce data sparseness
– Facilitate evaluation based on comparing their results to existing synonym databases
They applied GIZA++ and intersection heuristics
From the word-aligned corpora they extracted word type links: pairs of source
and target words with their alignment frequency.
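A rough sketch of this extraction step, assuming the aligner output has already been read into sets of index pairs for both alignment directions. The function names and the data layout are hypothetical, and no GIZA++ file format or API is modelled here.

```python
from collections import Counter

def intersect(src2tgt, tgt2src):
    """Keep only the links that appear in both directional alignments."""
    return set(src2tgt) & {(i, j) for (j, i) in tgt2src}

def word_type_links(aligned_sentences):
    """Count (source word, target word) pairs over all aligned sentences."""
    links = Counter()
    for src, tgt, alignment in aligned_sentences:
        for i, j in alignment:
            links[(src[i], tgt[j])] += 1
    return links

# Invented toy sentence pair with its two directional alignments.
src = ["de", "kat", "slaapt"]
tgt = ["the", "cat", "sleeps"]
s2t = {(0, 0), (1, 1), (2, 2), (2, 1)}  # (source index, target index)
t2s = {(0, 0), (1, 1), (2, 2)}          # (target index, source index)

# The same sentence pair is used twice to show frequencies accumulating.
links = word_type_links([(src, tgt, intersect(s2t, t2s))] * 2)
```

The link (2, 1) appears in only one direction, so the intersection drops it and ("slaapt", "cat") is never counted.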
– Hand-crafted synonym database, Dutch EuroWordnet (EWN,
– Extract all synsets in EWN for 1000 words with a frequency
Precision is the percentage of candidate synonyms that are synonyms according to EWN.
Recall is the percentage of the synonyms according to EWN that are found by the system.
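Precision and recall against a gold-standard synonym database can be sketched as follows. This assumes both the system output and the gold standard are given as a mapping from each test word to a set of synonyms, and that scores are pooled over all test words. The function name and the toy Dutch entries are illustrative only.

```python
def precision_recall(candidates, gold):
    """candidates, gold: {word: set of synonyms}; returns (precision, recall)."""
    proposed = correct = expected = 0
    for word, gold_syns in gold.items():
        cand = candidates.get(word, set())
        proposed += len(cand)              # candidates the system proposed
        correct += len(cand & gold_syns)   # candidates confirmed by the gold standard
        expected += len(gold_syns)         # synonyms the gold standard lists
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall

# Invented toy data, not taken from EWN.
gold = {"auto": {"wagen", "automobiel"}, "fiets": {"rijwiel"}}
candidates = {"auto": {"wagen", "bus"}, "fiets": {"rijwiel"}}
p, r = precision_recall(candidates, gold)
```

In this toy case two of the three proposed candidates are correct and two of the three gold synonyms are found, so precision and recall are both 2/3.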
Distributional similarity based on syntactic relations
– Feature vectors are constructed from syntactically
parsed monolingual corpora.
– Used data: Dutch CLEF QA corpus, which consists
of 78 million words of Dutch
– Use several grammatical relations: subject, object,
adjective, coordination, apposition, prepositional
Distributional similarity based on word alignment
– Context vectors are built from the alignments found
in a parallel corpus.
– Used data: Europarl corpus (Koehn, 2003), which
includes 11 languages in parallel. The Dutch part includes 29
million tokens in about 1.2 million sentences
Results and discussion
The first 10 rows show the results for all language pairs
The 11th row corresponds to the setting in which all languages are combined