Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity

Transcript

  1. 1 Literature Review (2016.05.13), Nagaoka University of Technology, Natural Language Processing Laboratory, Nguyen Van Hai
    Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
    Lonneke van der Plas & Jörg Tiedemann, Alfa-Informatica, University of Groningen
    Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 866-873, Sydney, July 2006.
  2. 2 Abstract
    • Distributional similarity has been used to extract semantically related words.
      – It cannot distinguish between synonyms and other types of semantically related words.
    • This paper presents a method based on automatic word alignment of parallel corpora.
    • Results show higher precision and recall scores than the monolingual syntax-based approach.
  3. 3 Introduction
    • When single words share the same meaning, we speak of synonyms.
    • The authors define context in a multilingual setting.
      – Translate a word into other languages
      – Assume that words sharing translational contexts are semantically related
      – Measure this relatedness using distributional similarity
    • They use both a monolingual syntax-based and a multilingual alignment-based approach.
  4. 4 Measuring Distributional Similarity
    • Distributional similarity is used to acquire semantically similar words.
    • Similar words are used in similar contexts. The contexts in which a given word is used serve as the features of a vector, called a context vector.
    • Van der Plas and Bouma (2002) present a similar experiment for Dutch using Pointwise Mutual Information and Dice.
  5. 5 Weighting
    • The weight indicates the amount of information carried by a particular combination of a noun and its feature.
    • For example, the verb "have" combines with many nouns and so carries less information than the verb "squeeze".
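
To make the "have" versus "squeeze" contrast concrete, here is a small sketch of Pointwise Mutual Information weighting, PMI(n, f) = log2(P(n, f) / (P(n) P(f))), computed from co-occurrence counts. The counts and the feature names (`have_obj`, `squeeze_obj`) are invented for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_weights(cooc):
    """Return PMI for every (noun, feature) pair in a count table.

    cooc maps (noun, feature) -> co-occurrence count.
    """
    total = sum(cooc.values())
    noun_total = Counter()
    feat_total = Counter()
    for (noun, feat), count in cooc.items():
        noun_total[noun] += count
        feat_total[feat] += count
    return {
        (noun, feat): math.log2(
            count * total / (noun_total[noun] * feat_total[feat])
        )
        for (noun, feat), count in cooc.items()
    }


# Hypothetical counts: "have" takes many objects, "squeeze" only one.
counts = {
    ("lemon", "have_obj"): 2,
    ("car", "have_obj"): 4,
    ("house", "have_obj"): 4,
    ("lemon", "squeeze_obj"): 2,
}
weights = pmi_weights(counts)
```

In this toy table the pair ("lemon", "squeeze_obj") gets a higher weight than ("lemon", "have_obj"), although both were observed equally often, because "squeeze" is far more selective about its objects.
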
  6. 6 Word Alignment
    • Alignment processing
      – Reduce data sparseness
      – Facilitate evaluation based on comparing the results to existing synonym databases
    • They applied GIZA++ and the intersection heuristics.
    • From the word-aligned corpora they extracted word type links: pairs of source and target words with their alignment frequency.
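
The two steps on this slide can be sketched as follows. The code does not call GIZA++; it assumes the two directional alignments are already available as sets of index pairs, and the function names and toy sentences are hypothetical. The intersection heuristic keeps only the links found in both directions, and word type links are then counted over the corpus.

```python
from collections import Counter


def intersect(src2tgt, tgt2src):
    """Keep only links present in both directional alignments.

    src2tgt holds (source_index, target_index) pairs;
    tgt2src holds (target_index, source_index) pairs.
    """
    return src2tgt & {(i, j) for (j, i) in tgt2src}


def word_type_links(sentence_pairs):
    """Count (source word, target word) links over all sentence pairs."""
    links = Counter()
    for src_tokens, tgt_tokens, s2t, t2s in sentence_pairs:
        for i, j in intersect(s2t, t2s):
            links[(src_tokens[i], tgt_tokens[j])] += 1
    return links


# Hand-made stand-ins for aligner output on two Dutch-English pairs.
corpus = [
    (["de", "kat", "slaapt"], ["the", "cat", "sleeps"],
     {(0, 0), (1, 1), (2, 2), (2, 1)},   # source-to-target links
     {(0, 0), (1, 1), (2, 2)}),          # target-to-source links
    (["de", "poes", "slaapt"], ["the", "cat", "sleeps"],
     {(0, 0), (1, 1), (2, 2)},
     {(0, 0), (1, 1), (2, 2)}),
]
links = word_type_links(corpus)
```

The spurious one-directional link between "slaapt" and "cat" is dropped by the intersection, which trades coverage for precision.
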
  7. 7 Evaluation Framework
    • Data used
      – Hand-crafted synonym database: Dutch EuroWordNet (EWN; Vossen, 1998)
      – Test set: 1000 words taken from the synsets in EWN, each with a frequency above 4
    • Precision is the percentage of candidate synonyms that are truly synonyms.
    • Recall is the percentage of the synonyms according to EWN that are found by the system.
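
The two definitions above can be written down directly. This is a minimal sketch for a single test word; the candidate and gold sets are invented for illustration and are not taken from EWN.

```python
def precision_recall(candidates, gold):
    """Precision and recall of candidate synonyms against a gold set."""
    candidates, gold = set(candidates), set(gold)
    correct = candidates & gold
    precision = len(correct) / len(candidates) if candidates else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical system output and gold synonyms for the word "auto".
system_output = {"wagen", "voertuig", "automobiel"}
gold_synonyms = {"wagen", "automobiel", "kar", "bolide"}

precision, recall = precision_recall(system_output, gold_synonyms)
```

Here two of the three candidates are correct, and two of the four gold synonyms are found.
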
  8. 8 Experiment setup
    • Distributional similarity based on syntactic relations
      – Feature vectors are constructed from syntactically parsed monolingual corpora.
      – Data used: the Dutch CLEF QA corpus, which consists of 78 million words of Dutch
      – Several grammatical relations are used: subject, object, adjective, coordination, apposition, prepositional complement
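
As a rough sketch of how such feature vectors might be assembled, assume the parser output has already been reduced to (noun, relation, related word) triples. The triples and the `relation:word` feature format below are illustrative assumptions, not the paper's actual data structures.

```python
from collections import Counter, defaultdict


def build_feature_vectors(triples):
    """Map each noun to a Counter over 'relation:word' features."""
    vectors = defaultdict(Counter)
    for noun, relation, other in triples:
        vectors[noun][f"{relation}:{other}"] += 1
    return vectors


# Hand-written stand-ins for triples extracted from parsed Dutch text.
triples = [
    ("koffie", "object", "drinken"),
    ("koffie", "object", "drinken"),
    ("koffie", "adjective", "zwart"),
    ("thee", "object", "drinken"),
    ("thee", "coordination", "koffie"),
]
vectors = build_feature_vectors(triples)
```

Each noun ends up with one sparse vector whose entries count how often it occurred in a given grammatical relation with a given word.
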
  9. 9

  10. 10 Experiment setup
    • Distributional similarity based on word alignment
      – Context vectors are built from the alignments found in a parallel corpus.
      – Data used: the Europarl corpus (Koehn, 2003), which includes parallel text in 11 languages. The Dutch part contains 29 million tokens in about 1.2 million sentences.
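
One way to picture an alignment-based context vector is sketched below. It assumes word type links with frequencies are already available per target language, and it prefixes each feature with a language code so that features from several languages can share one vector. That prefixing scheme and the toy frequencies are assumptions made for illustration, not details given on the slide.

```python
from collections import Counter, defaultdict


def context_vectors(links_by_language):
    """Build one context vector per Dutch word from alignment links.

    links_by_language maps a language code to a table of
    (dutch_word, target_word) -> alignment frequency.
    """
    vectors = defaultdict(Counter)
    for lang, links in links_by_language.items():
        for (dutch, target), freq in links.items():
            vectors[dutch][f"{lang}:{target}"] += freq
    return vectors


# Hypothetical alignment frequencies for two target languages.
toy_links = {
    "en": {("kat", "cat"): 5, ("poes", "cat"): 3},
    "de": {("kat", "Katze"): 4, ("poes", "Katze"): 2},
}
aligned_vectors = context_vectors(toy_links)
```

In this toy example "kat" and "poes" end up with the same translation features, which is the kind of overlap the method relies on to propose them as synonyms.
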
  11. 11

  12. 12 Result and discussion
    • The first 10 rows show the results for all language pairs individually.
    • The 11th row corresponds to the setting in which all languages are combined.
  13. 13