Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity

Transcript

  1. Literature review (2016.05.13), Nagaoka University of Technology, Natural Language Processing Laboratory, Nguyen Van Hai

    Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
    Lonneke van der Plas & Jorg Tiedemann, Alfa-Informatica, University of Groningen
    Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 866-873, Sydney, July 2006.
  2. Abstract

    • Distributional similarity has been used to extract semantically related words.
      – It cannot distinguish synonyms from other types of semantically related words.
    • This paper presents a method based on automatic word alignment of parallel corpora.
    • The results show higher precision and recall scores than the monolingual syntax-based approach.
  3. Introduction

    • When single words share the same meaning, we speak of synonyms.
    • The authors define context in a multilingual setting.
      – Translate a word into other languages.
      – Assume that words sharing translation contexts are semantically related.
      – Measure this relatedness using distributional similarity.
    • They use both monolingual syntax-based and multilingual alignment-based approaches.
  4. Measuring Distributional Similarity

    • Distributional similarity is used to acquire semantically similar words.
    • Similar words are used in similar contexts. The contexts of a given word are used as the features of a vector, called a context vector.
    • Van der Plas and Bouma (2002) present a similar experiment for Dutch, using Pointwise Mutual Information and Dice.
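The idea of comparing words through their context vectors can be sketched as follows. This is a minimal illustration, assuming vectors are stored as dictionaries from feature to weight; the function names, the feature labels, and the toy counts are hypothetical, and the min-based weighted Dice shown here is one common variant, not necessarily the exact formula used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: feature -> weight)."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def weighted_dice(u, v):
    """Min-based weighted Dice: 2 * sum of shared minima / sum of all weights."""
    shared = sum(min(u[f], v[f]) for f in u if f in v)
    total = sum(u.values()) + sum(v.values())
    return 2 * shared / total if total else 0.0

# Toy context vectors (hypothetical counts, not from the paper).
car = {"obj_of:drive": 4, "adj:red": 2, "obj_of:park": 1}
automobile = {"obj_of:drive": 3, "adj:red": 1}
banana = {"obj_of:eat": 5, "adj:yellow": 2}

print(weighted_dice(car, automobile), weighted_dice(car, banana))
print(cosine(car, automobile), cosine(car, banana))
```

Words that share many contexts ("car" and "automobile" in the toy data) score higher under both measures than words with disjoint contexts.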
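  5. Weighting

    • A weight indicates the amount of information carried by a particular combination of a noun and its feature.
    • For example, compare the verb "have" with the verb "squeeze": "have" combines with many nouns and so says little about any one of them, whereas "squeeze" is far more selective.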
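A weighting of this kind can be sketched with pointwise mutual information, PMI(w, f) = log( P(w, f) / (P(w) P(f)) ). The sketch below assumes co-occurrence counts are given as a dictionary keyed by (word, feature); the function name `pmi_weights` and the toy counts are hypothetical.

```python
import math
from collections import Counter

def pmi_weights(cooc):
    """Map each (word, feature) pair with a positive count to its PMI."""
    total = sum(cooc.values())
    word_totals = Counter()
    feat_totals = Counter()
    for (word, feat), count in cooc.items():
        word_totals[word] += count
        feat_totals[feat] += count
    return {
        (word, feat): math.log((count * total) / (word_totals[word] * feat_totals[feat]))
        for (word, feat), count in cooc.items()
        if count > 0
    }

# Toy counts (hypothetical): "have" takes many objects, "squeeze" only a few.
counts = {
    ("lemon", "obj_of:have"): 10,
    ("lemon", "obj_of:squeeze"): 5,
    ("car", "obj_of:have"): 40,
    ("idea", "obj_of:have"): 45,
}
weights = pmi_weights(counts)
print(weights[("lemon", "obj_of:squeeze")], weights[("lemon", "obj_of:have")])
```

In the toy data the rarer, more selective feature "obj_of:squeeze" receives a higher weight for "lemon" than the very frequent "obj_of:have", even though its raw count is lower.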
  6. Word Alignment

    • Alignment process:
      – Reduces data sparseness.
      – Facilitates evaluation, which is based on comparing the results with existing synonym databases.
    • They applied GIZA++ and the intersection heuristic.
    • From the word-aligned corpora they extracted word type links: pairs of source and target words with their alignment frequency.
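The intersection heuristic and the extraction of word type links can be sketched as follows. This assumes the two directional alignments have already been produced by an aligner and are available as sets of index pairs; it does not call GIZA++, and the data layout, function names, and example sentences are hypothetical.

```python
from collections import Counter

def intersect_alignments(src_to_tgt, tgt_to_src):
    """Keep only the links found in both alignment directions.

    src_to_tgt: set of (src_index, tgt_index)
    tgt_to_src: set of (tgt_index, src_index)
    """
    flipped = {(i, j) for (j, i) in tgt_to_src}
    return src_to_tgt & flipped

def count_type_links(sentence_pairs):
    """Count how often each (source word, target word) pair is aligned."""
    links = Counter()
    for src_tokens, tgt_tokens, forward, backward in sentence_pairs:
        for i, j in intersect_alignments(forward, backward):
            links[(src_tokens[i], tgt_tokens[j])] += 1
    return links

# Toy Dutch-English corpus; the second direction disagrees on "slaapt".
corpus = [
    (["de", "kat", "slaapt"], ["the", "cat", "sleeps"],
     {(0, 0), (1, 1), (2, 2)}, {(0, 0), (1, 1), (2, 1)}),
    (["een", "kat"], ["a", "cat"],
     {(0, 0), (1, 1)}, {(0, 0), (1, 1)}),
]
links = count_type_links(corpus)
print(links)
```

Intersection keeps only links both directions agree on, which favours precision over coverage: the disputed "slaapt" link is dropped, while "kat"/"cat" is counted twice.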
  7. Evaluation Framework

    • Data used:
      – A hand-crafted synonym database, Dutch EuroWordnet (EWN, Vossen (1998)).
      – All synsets in EWN were extracted for 1000 words with a frequency above 4.
    • Precision is the percentage of candidate synonyms that are truly synonyms according to EWN.
    • Recall is the percentage of the synonyms according to EWN that the system finds.
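Precision and recall against a gold-standard synonym list can be computed as in the sketch below. It assumes the gold standard is a dictionary from word to a set of synonyms and the system output a dictionary from word to a list of candidates; the function name and the toy Dutch entries are hypothetical and not taken from EWN.

```python
def precision_recall(candidates, gold):
    """Micro-averaged precision and recall over all test words in `gold`."""
    proposed = correct = gold_total = 0
    for word, gold_synonyms in gold.items():
        found = set(candidates.get(word, []))
        proposed += len(found)
        correct += len(found & gold_synonyms)
        gold_total += len(gold_synonyms)
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall

# Toy gold standard and system output (hypothetical).
gold = {"auto": {"wagen", "automobiel"}, "fiets": {"rijwiel"}}
system = {"auto": ["wagen", "bus"], "fiets": ["rijwiel"]}

p, r = precision_recall(system, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three proposed candidates are correct and two of the three gold synonyms are found, so both scores are 2/3.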
  8. Experiment setup

    • Distributional similarity based on syntactic relations
      – Feature vectors are constructed from syntactically parsed monolingual corpora.
      – Data used: the Dutch CLEF QA corpus, which consists of 78 million words of Dutch.
      – Several grammatical relations are used: subject, object, adjective, coordination, apposition, and prepositional complement.
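Building feature vectors from parsed text can be sketched as follows, assuming the parser output has already been reduced to (noun, relation, other word) triples. The triple format, the relation labels, and the toy Dutch data are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def build_feature_vectors(triples):
    """Count, for each noun, how often it occurs with each relation:word feature."""
    vectors = defaultdict(Counter)
    for noun, relation, other in triples:
        vectors[noun][f"{relation}:{other}"] += 1
    return vectors

# Toy dependency triples (hypothetical).
triples = [
    ("koffie", "obj", "drinken"),
    ("koffie", "adj", "zwart"),
    ("thee", "obj", "drinken"),
    ("koffie", "obj", "drinken"),
]
vectors = build_feature_vectors(triples)
print(dict(vectors["koffie"]), dict(vectors["thee"]))
```

Each feature combines the grammatical relation with the related word, so "koffie" and "thee" share the feature "obj:drinken" in the toy data.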
  10. Experiment setup

    • Distributional similarity based on word alignment
      – Context vectors are built from the alignments found in a parallel corpus.
      – Data used: the Europarl corpus (Koehn, 2003), which includes 11 languages in parallel. The Dutch part contains 29 million tokens in about 1.2 million sentences.
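An alignment-based context vector can be sketched as follows, assuming word type links with their alignment frequencies are already available for each language pair. Prefixing each translation with its language code, so that several language pairs can share one vector, is an illustrative design choice here, and the link counts are hypothetical.

```python
from collections import defaultdict

def build_alignment_vectors(links_by_language):
    """Turn per-language word type links into one context vector per Dutch word.

    links_by_language: dict lang -> dict (dutch_word, translation) -> frequency
    """
    vectors = defaultdict(dict)
    for lang, links in links_by_language.items():
        for (dutch_word, translation), freq in links.items():
            vectors[dutch_word][f"{lang}:{translation}"] = freq
    return vectors

# Toy link counts for two language pairs (hypothetical).
links_by_language = {
    "en": {("auto", "car"): 12, ("wagen", "car"): 7},
    "de": {("auto", "Auto"): 9, ("wagen", "Wagen"): 5, ("wagen", "Auto"): 3},
}
vectors = build_alignment_vectors(links_by_language)
print(vectors["auto"], vectors["wagen"])
```

Two Dutch words that are often aligned to the same translations end up with overlapping features, which is what the similarity measures then pick up.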
  12. Results and discussion

    • The first 10 rows show the results for each language pair individually.
    • The 11th row shows the results when all languages are combined.