
Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity



  1. 1
    Literature review (2016.05.13)
    Nagaoka University of Technology, Natural Language Processing Laboratory
       Nguyen Van Hai
    Finding Synonyms Using Automatic Word Alignment
    and Measures of Distributional Similarity
    Lonneke van der Plas & Jörg Tiedemann
    Alfa-Informatica, University of Groningen
    Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions,
    pages 866-873, Sydney, July 2006.


  2. 2

    Distributional similarity has been used to extract
    semantically related words.
    – It is not able to distinguish between synonyms and other
    types of semantically related words.

    This paper presents a method based on automatic
    word alignment of parallel corpora.

    The results show higher precision and recall scores
    than the monolingual syntax-based approach.


  3. 3

    When single words share the same meaning, we speak of synonyms.

    They define context in a multilingual setting.
    – Translate a word into other languages
    – Assume that words sharing translation contexts are semantically related
    – Measure this using distributional similarity

    They use both monolingual syntax-based and multilingual


  4. 4
    Measuring Distributional Similarity

    Distributional similarity is used to acquire semantically similar words.

    Similar words are used in similar contexts. The contexts of a given word
    are used as the features in a vector called the context vector.

    Van der Plas and Bouma (2002) present a similar experiment for Dutch
    using Pointwise Mutual Information and Dice.
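As a rough illustration of comparing two context vectors, the sketch below computes a weighted Dice coefficient. This is one common formulation (twice the summed minimum weight of shared features, divided by the sum of all weights); the slides do not state which exact variant the authors use, and the feature names and weights are hypothetical toy values.

```python
from typing import Dict

# A context vector maps a feature name to its weight (for example a PMI score).
Vector = Dict[str, float]


def dice(v1: Vector, v2: Vector) -> float:
    """Weighted Dice: 2 * sum(min weight over shared features) / sum(all weights)."""
    shared = set(v1) & set(v2)
    numerator = 2.0 * sum(min(v1[f], v2[f]) for f in shared)
    denominator = sum(v1.values()) + sum(v2.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical toy vectors, not taken from the paper.
car = {"obj_of:drive": 2.0, "adj:fast": 1.0}
auto = {"obj_of:drive": 1.5, "adj:red": 0.5}

print(dice(car, auto))
```

Two words with identical vectors score 1.0, and words with no shared features score 0.0, so candidate synonyms can be ranked by this value.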


  5. 5

    The weight is an indication of the amount of
    information carried by a particular combination of a
    noun and its feature.

    For example, the verb "have" versus the verb "squeeze".
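A minimal sketch of Pointwise Mutual Information as a feature weight, assuming the usual definition log2(P(w, f) / (P(w) P(f))). The co-occurrence counts are invented for illustration; they are only meant to show that a feature occurring with many nouns (being the object of "have") receives a lower weight than a more selective one (being the object of "squeeze").

```python
import math
from collections import Counter

# Hypothetical co-occurrence counts of (noun, feature); not data from the paper.
counts = Counter({
    ("lemon", "obj_of:have"): 4,
    ("lemon", "obj_of:squeeze"): 4,
    ("car", "obj_of:have"): 40,
    ("idea", "obj_of:have"): 52,
})


def pmi(word: str, feature: str, counts: Counter) -> float:
    """Pointwise mutual information of a word and a feature, in bits."""
    total = sum(counts.values())
    joint = counts[(word, feature)]
    if joint == 0:
        return float("-inf")
    word_total = sum(c for (w, _), c in counts.items() if w == word)
    feature_total = sum(c for (_, f), c in counts.items() if f == feature)
    return math.log2(joint * total / (word_total * feature_total))


print(pmi("lemon", "obj_of:have", counts))
print(pmi("lemon", "obj_of:squeeze", counts))
```

With these toy counts the "squeeze" feature gets a positive weight for "lemon" while the "have" feature gets a negative one.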


  6. 6
    Word Alignment

    Process alignment
    – Reduce data sparseness
    – Facilitate evaluation based on comparing their results to existing synonym databases

    They applied GIZA++ and intersection heuristics.

    From the word-aligned corpora they extracted word type links: pairs of source
    and target words with their alignment frequency.
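The extraction of word type links can be sketched as follows. The code does not call GIZA++; it assumes the two directional alignments are already available as sets of (source index, target index) pairs, keeps only the links present in both directions (the intersection), and counts how often each source/target word pair is linked. The function name, data layout and sentences are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Both alignments use the same orientation: (source index, target index).
Alignment = Set[Tuple[int, int]]
SentencePair = Tuple[List[str], List[str], Alignment, Alignment]


def collect_type_links(corpus: Iterable[SentencePair]) -> Counter:
    """Count (source word, target word) links kept by the intersection heuristic."""
    links: Counter = Counter()
    for src_tokens, tgt_tokens, src2tgt, tgt2src in corpus:
        for i, j in src2tgt & tgt2src:
            links[(src_tokens[i], tgt_tokens[j])] += 1
    return links


# Hypothetical toy corpus with hand-written alignments.
corpus = [
    (["de", "kat", "slaapt"], ["the", "cat", "sleeps"],
     {(0, 0), (1, 1), (2, 2), (2, 1)}, {(0, 0), (1, 1), (2, 2)}),
    (["de", "kat"], ["the", "cat"],
     {(0, 0), (1, 1)}, {(0, 0), (1, 1)}),
]

links = collect_type_links(corpus)
print(links[("kat", "cat")])
```

The link between "slaapt" and "cat" appears in only one direction, so the intersection discards it.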


  7. 7
    Evaluation Framework

    Data used
    – Hand-crafted synonym database: Dutch EuroWordNet (EWN)
    – Extract all synsets in EWN; the test set is 1000 words with a frequency
    above 4

    Precision is the percentage of candidate synonyms that are
    truly synonyms.

    Recall is the percentage of the synonyms according to
    EWN that are found by the system.
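The two measures can be sketched as below, with the gold standard standing in for the EWN synsets. The totals are pooled over all test words; whether the authors pool or average per word is not stated on the slides, so that choice is an assumption, and the word lists are toy examples.

```python
from typing import Dict, Set, Tuple


def precision_recall(
    candidates: Dict[str, Set[str]], gold: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Pooled precision and recall of candidate synonyms against a gold standard."""
    correct = sum(len(cands & gold.get(word, set())) for word, cands in candidates.items())
    proposed = sum(len(cands) for cands in candidates.values())
    expected = sum(len(syns) for syns in gold.values())
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Hypothetical system output and gold synonyms for two test words.
candidates = {"auto": {"wagen", "fiets"}, "huis": {"woning"}}
gold = {"auto": {"wagen", "automobiel"}, "huis": {"woning", "pand"}}

precision, recall = precision_recall(candidates, gold)
print(precision, recall)
```

Here two of the three proposed candidates are correct, and two of the four gold synonyms are found.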


  8. 8
    Experiment setup

    Distributional similarity based on syntactic relations
    – Feature vectors are constructed from syntactically
    parsed monolingual corpora.
    – Data used: the Dutch CLEF QA corpus, which consists
    of 78 million words of Dutch
    – Use several grammatical relations: subject, object,
    adjective, coordination, apposition, prepositional


  9. 10
    Experiment setup

    Distributional similarity based on word alignment
    – Context vectors are built from the alignments found
    in a parallel corpus.
    – Data used: the Europarl corpus (Koehn, 2003), which
    includes 11 languages in parallel. The Dutch part contains 29
    million tokens in about 1.2 million sentences
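One way to picture the alignment-based context vectors: each Dutch word gets a vector whose features are its aligned translations in the other languages, weighted by alignment frequency. The sketch assumes the word type links per language pair have already been counted; the function name, the "language:translation" feature naming and all frequencies are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple

# For each target language: (Dutch word, translation) -> alignment frequency.
LinksByLanguage = Dict[str, Dict[Tuple[str, str], int]]


def build_context_vectors(links_by_language: LinksByLanguage) -> Dict[str, Counter]:
    """Turn word type links into one context vector per Dutch word."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for lang, links in links_by_language.items():
        for (dutch, translation), freq in links.items():
            # Prefix with the language so identical strings in two languages stay distinct.
            vectors[dutch][f"{lang}:{translation}"] += freq
    return dict(vectors)


# Hypothetical alignment frequencies for two language pairs.
links_by_language = {
    "en": {("auto", "car"): 10, ("wagen", "car"): 4, ("wagen", "wagon"): 3},
    "de": {("auto", "Auto"): 8, ("wagen", "Wagen"): 5, ("wagen", "Auto"): 2},
}

vectors = build_context_vectors(links_by_language)
print(sorted(set(vectors["auto"]) & set(vectors["wagen"])))
```

Words that share many translation features, such as "auto" and "wagen" in this toy data, are the ones a similarity measure would rank as synonym candidates.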


  10. 12
    Result and discussion

    The first 10 rows show the results for all language pairs.

    The 11th rows correspond for all languages are
