Slide 1

Slide 1 text

1 Literature review (2016.05.13)
Nagaoka University of Technology, Natural Language Processing Laboratory
Nguyen Van Hai
Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
Lonneke van der Plas & Jörg Tiedemann, Alfa-Informatica, University of Groningen
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 866-873, Sydney, July 2006.

Slide 2

Slide 2 text

2 Abstract
● Distributional similarity has been used to extract semantically related words.
– It cannot distinguish between synonyms and other types of semantically related words.
● This paper presents a method based on automatic word alignment of parallel corpora.
● Results show higher precision and recall scores than the monolingual syntax-based approach.

Slide 3

Slide 3 text

3 Introduction
● Single words that share the same meaning are called synonyms.
● The authors define context in a multilingual setting:
– Translate a word into other languages.
– Assume that words sharing translational contexts are semantically related.
– Measure relatedness using distributional similarity.
● They use both a monolingual syntax-based approach and a multilingual alignment-based approach.

Slide 4

Slide 4 text

4 Measuring Distributional Similarity
● Distributional similarity is used to acquire semantically similar words.
● Similar words are used in similar contexts. The contexts of a given word are used as the features of a vector called the context vector.
● Van der Plas and Bouma (2005) present a similar experiment for Dutch using Pointwise Mutual Information and Dice.
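The idea of comparing words through their context vectors can be sketched in a few lines of Python. This is only an illustration: the toy features are invented, and the Dice-style overlap below is one common variant, which is an assumption and not necessarily the exact formula used in the paper.

```python
from collections import Counter

def context_vector(observations):
    """Count how often each context feature co-occurs with a word."""
    return Counter(observations)

def dice(v1, v2):
    """Dice-style overlap: 2 * sum of shared minimum weights / sum of all weights."""
    shared = set(v1) & set(v2)
    overlap = sum(min(v1[f], v2[f]) for f in shared)
    total = sum(v1.values()) + sum(v2.values())
    return 2.0 * overlap / total if total else 0.0

# Toy (relation, word) features; all data here is invented for illustration.
car = context_vector([("obj", "drive"), ("obj", "drive"), ("adj", "red"), ("obj", "park")])
automobile = context_vector([("obj", "drive"), ("adj", "red"), ("obj", "park")])
banana = context_vector([("obj", "eat"), ("adj", "yellow")])

print(dice(car, automobile))  # high: the two words share most contexts
print(dice(car, banana))      # 0.0: no shared contexts
```

Words that occur in the same contexts score close to 1, while words with disjoint contexts score 0.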

Slide 5

Slide 5 text

5 Weighting
● The weight indicates the amount of information carried by a particular combination of a noun and its feature.
● For example, a frequent verb such as "have" carries less information about a noun than a more specific verb such as "squeeze".
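A minimal sketch of such a weight, assuming Pointwise Mutual Information as the weighting function, is shown below. The counts and the feature names (`obj:have`, `obj:squeeze`) are invented for illustration only.

```python
import math
from collections import Counter

# Toy (noun, feature) co-occurrence counts; invented for illustration.
pairs = Counter({
    ("car", "obj:have"): 10,
    ("lemon", "obj:have"): 10,
    ("house", "obj:have"): 10,
    ("lemon", "obj:squeeze"): 5,
})

total = sum(pairs.values())
noun_freq = Counter()
feat_freq = Counter()
for (noun, feat), n in pairs.items():
    noun_freq[noun] += n
    feat_freq[feat] += n

def pmi(noun, feat):
    """PMI = log2( P(noun, feat) / (P(noun) * P(feat)) ); call only for observed pairs."""
    p_joint = pairs[(noun, feat)] / total
    p_noun = noun_freq[noun] / total
    p_feat = feat_freq[feat] / total
    return math.log2(p_joint / (p_noun * p_feat))

print(pmi("lemon", "obj:have"))     # negative: "have" occurs with every noun
print(pmi("lemon", "obj:squeeze"))  # positive: "squeeze" is specific to "lemon"
```

In this toy data the frequent feature receives a lower weight than the specific one, which is the behaviour the slide describes.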

Slide 6

Slide 6 text

6 Word Alignment
● The alignment process aims to:
– Reduce data sparseness.
– Facilitate evaluation based on comparing the results to existing synonym databases.
● They applied GIZA++ and the intersection heuristic.
● From the word-aligned corpora they extracted word type links: pairs of source and target words with their alignment frequency.
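The intersection heuristic and the extraction of word type links can be sketched as follows. The sketch assumes the two directional alignments are already available as lists of index pairs; it does not call GIZA++, and the function names and toy sentences are hypothetical.

```python
from collections import Counter

def intersect_links(src_to_tgt, tgt_to_src):
    """Keep only links (src_idx, tgt_idx) proposed in both alignment directions."""
    return set(src_to_tgt) & {(i, j) for (j, i) in tgt_to_src}

def count_type_links(sentences):
    """sentences: iterable of (src_tokens, tgt_tokens, s2t_links, t2s_links).

    s2t_links hold (src_idx, tgt_idx) pairs; t2s_links hold (tgt_idx, src_idx) pairs.
    Returns a Counter mapping (source word, target word) to alignment frequency.
    """
    counts = Counter()
    for src, tgt, s2t, t2s in sentences:
        for i, j in intersect_links(s2t, t2s):
            counts[(src[i], tgt[j])] += 1
    return counts

# Toy aligned sentence pairs; invented for illustration.
corpus = [
    (["de", "kat", "slaapt"], ["the", "cat", "sleeps"],
     [(0, 0), (1, 1), (2, 2), (2, 1)],   # source-to-target direction
     [(0, 0), (1, 1), (2, 2)]),          # target-to-source direction
    (["de", "kat"], ["the", "cat"],
     [(0, 0), (1, 1)],
     [(0, 0), (1, 1)]),
]

links = count_type_links(corpus)
print(links[("kat", "cat")])
```

The link `(2, 1)` appears in only one direction, so the intersection drops it. This trades coverage for precision, which suits frequency counts that later serve as features.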

Slide 7

Slide 7 text

7 Evaluation Framework
● Data used:
– A hand-crafted synonym database, Dutch EuroWordNet (EWN, Vossen (1998)).
– A test set of 1000 words with a frequency above 4, extracted from the synsets in EWN.
● Precision is the percentage of candidate synonyms that are truly synonyms.
● Recall is the percentage of the synonyms according to EWN that are found by the system.
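The two measures can be computed as in the sketch below. Pooling the counts over all test words (micro-averaging) is an assumption made here for simplicity, and the toy gold standard is invented rather than taken from EWN.

```python
def precision_recall(candidates, gold):
    """candidates, gold: dicts mapping each test word to a set of synonyms."""
    proposed = correct = expected = 0
    for word, gold_syns in gold.items():
        cands = candidates.get(word, set())
        proposed += len(cands)             # candidates the system returned
        correct += len(cands & gold_syns)  # candidates confirmed by the gold standard
        expected += len(gold_syns)         # synonyms listed in the gold standard
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall

# Toy data; invented for illustration.
gold = {"auto": {"wagen", "automobiel"}, "fiets": {"rijwiel"}}
candidates = {"auto": {"wagen", "vrachtwagen"}, "fiets": {"rijwiel"}}

precision, recall = precision_recall(candidates, gold)
print(precision, recall)
```

Here two of the three proposed candidates are correct, and two of the three gold synonyms are found.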

Slide 8

Slide 8 text

8 Experiment setup
● Distributional similarity based on syntactic relations
– Feature vectors are constructed from syntactically parsed monolingual corpora.
– Data used: the Dutch CLEF QA corpus, which consists of 78 million words of Dutch text.
– Several grammatical relations are used: subject, object, adjective, coordination, apposition, and prepositional complement.

Slide 9

Slide 9 text

9

Slide 10

Slide 10 text

10 Experiment setup
● Distributional similarity based on word alignment
– Context vectors are built from the alignments found in a parallel corpus.
– Data used: the Europarl corpus (Koehn, 2003), which contains parallel text in 11 languages. The Dutch part includes 29 million tokens in about 1.2 million sentences.
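A rough sketch of alignment-based context vectors follows: each feature is a translation in a given language, and its value is the alignment frequency. The type links are invented, and cosine is used here only as a simple stand-in similarity measure, not as the measure reported in the paper.

```python
import math
from collections import defaultdict

# Toy word type links: (dutch_word, language, translation, frequency). Invented.
type_links = [
    ("auto", "en", "car", 8), ("auto", "de", "Auto", 6),
    ("wagen", "en", "car", 5), ("wagen", "de", "Wagen", 4), ("wagen", "de", "Auto", 2),
    ("fiets", "en", "bicycle", 7),
]

# One vector per Dutch word; features are (language, translation) pairs.
vectors = defaultdict(dict)
for word, lang, translation, freq in type_links:
    vectors[word][(lang, translation)] = freq

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v1[f] * v2[f] for f in v1.keys() & v2.keys())
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

print(cosine(vectors["auto"], vectors["wagen"]))  # shares translations
print(cosine(vectors["auto"], vectors["fiets"]))  # no shared translations
```

Keeping the language in the feature key lets vectors from several language pairs be combined into a single vector per word.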

Slide 11

Slide 11 text

11

Slide 12

Slide 12 text

12 Results and discussion
● The first 10 rows show the results for each language pair individually.
● The 11th row shows the results when all languages are combined.

Slide 13

Slide 13 text

13