Literature Review: Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity

1 Literature Review (2016.05.13)
Nagaoka University of Technology, Natural Language Processing Laboratory
Nguyen Van Hai
Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
Lonneke van der Plas & Jörg Tiedemann, Alfa-Informatica, University of Groningen
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 866-873, Sydney, July 2006.

2 Abstract
• Distributional similarity has been used to extract semantically related words.
  – It cannot distinguish synonyms from other types of semantically related words.
• This paper presents a method based on automatic word alignment of parallel corpora.
• The results show higher precision and recall scores than the monolingual syntax-based approach.

3 Introduction
• Single words that share the same meaning are called synonyms.
• The authors define context in a multilingual setting.
  – Translate a word into other languages.
  – Assume that words sharing translation contexts are semantically related.
  – Measure this relatedness using distributional similarity.
• They use both a monolingual syntax-based approach and a multilingual alignment-based approach.

4 Measuring Distributional Similarity
• Distributional similarity is used to acquire semantically similar words.
• Similar words are used in similar contexts. The contexts of a given word are used as the features of a vector, called a context vector.
• Van der Plas and Bouma (2005) present a similar experiment for Dutch using Pointwise Mutual Information and Dice.
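As a rough illustration of comparing two context vectors, here is a minimal sketch of a weighted Dice coefficient. The exact Dice variant used in the paper is not given on the slide, so the formula below (twice the sum of the feature-wise minima, divided by the sum of all weights) is an assumption, and the words, features, and counts are hypothetical.

```python
from collections import Counter

def dice_similarity(vec_a, vec_b):
    """Weighted Dice: 2 * sum of feature-wise minima / sum of all weights."""
    shared = set(vec_a) & set(vec_b)
    overlap = sum(min(vec_a[f], vec_b[f]) for f in shared)
    total = sum(vec_a.values()) + sum(vec_b.values())
    return 2.0 * overlap / total if total else 0.0

# Hypothetical context vectors: (relation, word) feature -> weight.
car = Counter({("obj", "drive"): 4, ("adj", "fast"): 2})
auto = Counter({("obj", "drive"): 3, ("adj", "fast"): 1, ("obj", "park"): 2})
banana = Counter({("obj", "eat"): 5, ("adj", "ripe"): 1})

print(dice_similarity(car, auto))    # many shared features -> high score
print(dice_similarity(car, banana))  # no shared features -> zero
```

Words whose vectors share many heavily weighted features score close to 1, and words with disjoint contexts score 0.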

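5 Weighting
• A weight indicates the amount of information carried by a particular combination of a noun and its feature.
• For example, compare the verbs "have" and "squeeze": many nouns can combine with "have", so it tells us little about the noun, whereas few nouns combine with "squeeze", so it is more informative.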
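The weighting idea can be sketched with Pointwise Mutual Information, one of the measures named earlier in the deck. This is a minimal sketch, assuming plain co-occurrence counts as input; the function name `pmi_weights` and all counts are made up for illustration.

```python
import math
from collections import Counter

def pmi_weights(cooc):
    """cooc maps (noun, feature) -> count; returns (noun, feature) -> PMI in bits."""
    total = sum(cooc.values())
    noun_totals, feat_totals = Counter(), Counter()
    for (noun, feat), n in cooc.items():
        noun_totals[noun] += n
        feat_totals[feat] += n
    return {
        (noun, feat): math.log2(n * total / (noun_totals[noun] * feat_totals[feat]))
        for (noun, feat), n in cooc.items()
    }

# Made-up counts: "have" takes many different objects, "squeeze" only a few.
cooc = {
    ("orange", "obj:have"): 10,
    ("orange", "obj:squeeze"): 5,
    ("car", "obj:have"): 40,
    ("idea", "obj:have"): 45,
}
weights = pmi_weights(cooc)
print(weights[("orange", "obj:squeeze")])  # positive: informative combination
print(weights[("orange", "obj:have")])     # negative: uninformative combination
```

With these toy counts, the rare but selective feature "obj:squeeze" receives a higher weight for "orange" than the frequent feature "obj:have", even though "have" co-occurs with "orange" more often.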

6 Word Alignment
• Alignment process
  – Reduce data sparseness.
  – Facilitate evaluation by comparing the results to existing synonym databases.
• They applied GIZA++ and the intersection heuristic.
• From the word-aligned corpora they extracted word type links: pairs of source and target words with their alignment frequency.
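The step from aligned sentences to word type links can be sketched as follows. This is only an illustration: it assumes the two directional alignments have already been parsed into sets of index pairs (it does not read GIZA++ output files), and the helper names `intersect` and `word_type_links` as well as the toy sentences are hypothetical.

```python
from collections import Counter

def intersect(src2tgt, tgt2src):
    """Keep only links proposed by both directional alignments.

    src2tgt holds (src_idx, tgt_idx) pairs; tgt2src holds (tgt_idx, src_idx) pairs.
    """
    return set(src2tgt) & {(s, t) for (t, s) in tgt2src}

def word_type_links(corpus):
    """corpus: iterable of (src_tokens, tgt_tokens, links); returns pair frequencies."""
    freq = Counter()
    for src, tgt, links in corpus:
        for s, t in links:
            freq[(src[s], tgt[t])] += 1
    return freq

# Toy Dutch-English sentence pairs with hypothetical alignment points.
links1 = intersect({(0, 0), (1, 1), (2, 2), (2, 1)}, {(0, 0), (1, 1), (2, 2)})
corpus = [
    (["de", "kat", "slaapt"], ["the", "cat", "sleeps"], links1),
    (["de", "poes"], ["the", "cat"], {(0, 0), (1, 1)}),
]
links = word_type_links(corpus)
print(links[("kat", "cat")], links[("poes", "cat")], links[("de", "the")])
```

Intersection discards the one-directional link (2, 1), which trades coverage for more reliable links.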

7 Evaluation Framework
• Data used
  – Hand-crafted synonym database: Dutch EuroWordNet (EWN, Vossen (1998)).
  – Extract all synsets in EWN for 1000 words with a frequency above 4.
• Precision is the percentage of candidate synonyms that are truly synonyms.
• Recall is the percentage of the synonyms according to EWN that the system finds.
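The two scores can be sketched as below. This is a minimal sketch that pools counts over all test words (micro-averaging), which is an assumption, since the slide does not say how the paper aggregates over words; the gold and candidate sets are invented examples, not EWN data.

```python
def precision_recall(candidates, gold):
    """candidates, gold: dict mapping a word to a set of synonyms."""
    proposed = correct = expected = 0
    for word, gold_syns in gold.items():
        cands = candidates.get(word, set())
        proposed += len(cands)              # candidates the system returned
        correct += len(cands & gold_syns)   # candidates confirmed by the gold standard
        expected += len(gold_syns)          # synonyms the gold standard lists
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall

# Invented example data.
gold = {"auto": {"wagen", "automobiel"}, "fiets": {"rijwiel"}}
candidates = {"auto": {"wagen", "voertuig"}, "fiets": {"rijwiel", "brommer"}}

precision, recall = precision_recall(candidates, gold)
print(precision, recall)
```

Here two of the four proposed candidates are correct, and two of the three gold synonyms are found.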

8 Experimental setup
• Distributional similarity based on syntactic relations
  – Feature vectors are constructed from syntactically parsed monolingual corpora.
  – Data used: the Dutch CLEF QA corpus, which consists of 78 million words of Dutch.
  – Several grammatical relations are used: subject, object, adjective, coordination, apposition, prepositional complement.
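Building syntax-based feature vectors can be sketched as follows, assuming the parser output has already been reduced to (noun, relation, other word) triples. The triple format, the function name `build_feature_vectors`, and the example triples are assumptions for illustration, not the parser's actual output.

```python
from collections import Counter, defaultdict

def build_feature_vectors(triples):
    """triples: iterable of (noun, relation, other_word) from parsed text."""
    vectors = defaultdict(Counter)
    for noun, relation, other in triples:
        # Each (relation, other_word) pair is one dimension of the noun's vector.
        vectors[noun][(relation, other)] += 1
    return vectors

# Hypothetical dependency triples.
triples = [
    ("kat", "subj", "slapen"),
    ("kat", "adj", "zwart"),
    ("kat", "subj", "slapen"),
    ("hond", "subj", "slapen"),
    ("hond", "obj", "aaien"),
]
vectors = build_feature_vectors(triples)
print(vectors["kat"])
print(vectors["hond"])
```

In practice the raw counts would be weighted before vectors are compared.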

10 Experimental setup
• Distributional similarity based on word alignment
  – Context vectors are built from the alignments found in a parallel corpus.
  – Data used: the Europarl corpus (Koehn, 2003), which covers 11 languages in parallel. The Dutch part contains 29 million tokens in about 1.2 million sentences.
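Alignment-based context vectors can be sketched as below, assuming word type links with frequencies are available per language pair. Each (language, translation) pair becomes one feature of the Dutch word. The data layout, the function name `alignment_context_vectors`, and the frequencies are hypothetical.

```python
from collections import Counter, defaultdict

def alignment_context_vectors(links_by_language):
    """links_by_language: language -> {(dutch_word, translation): frequency}."""
    vectors = defaultdict(Counter)
    for lang, links in links_by_language.items():
        for (nl_word, translation), freq in links.items():
            # The translation, tagged with its language, is one feature.
            vectors[nl_word][(lang, translation)] += freq
    return vectors

# Made-up alignment frequencies for two of the language pairs.
links_by_language = {
    "en": {("kat", "cat"): 12, ("poes", "cat"): 7, ("hond", "dog"): 9},
    "de": {("kat", "Katze"): 10, ("poes", "Katze"): 5, ("hond", "Hund"): 8},
}
ctx = alignment_context_vectors(links_by_language)
shared = set(ctx["kat"]) & set(ctx["poes"])
print(sorted(shared))
```

Words that share translations in several languages, such as "kat" and "poes" in this toy data, end up with overlapping vectors and become synonym candidates.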

12 Results and discussion
• The first 10 rows show the results for each language pair individually.
• The 11th row corresponds to all languages combined.
