
Even Unassociated Features Can Improve Lexical Distributional Similarity


Kazuhide Yamamoto and Takeshi Asakura. Even Unassociated Features Can Improve Lexical Distributional Similarity. Proceedings of the Second International Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010), pp. 32-39, August 2010.

Natural Language Processing Laboratory

August 31, 2010

Transcript

  1. 1 / 30 Even Unassociated Features Can Improve Lexical Distributional Similarity
     Kazuhide Yamamoto, Takeshi Asakura
     Nagaoka University of Technology, Japan
  2. 3 / 30 Lexical Similarity
     - An essential task in natural language processing.
     - Look for similar words for (corpus-driven) summarization, machine translation, textual entailment recognition, ...
     - Generalize or cluster words for language modeling, word sense disambiguation, ...
  3. 4 / 30 Similarity Computation
     - Based on a thesaurus / ontology, such as WordNet.
     - Based on a corpus = distributional similarity.
       Harris (1968): "semantically similar words tend to appear in similar contexts."
       This is the target of our work.
  4. 5 / 30 Our Motivation
     Two Japanese words: たばこ / タバコ
  5. 6 / 30 Our Motivation
     Two Japanese words: たばこ (tobacco) and タバコ (tobacco)
     - Same pronunciation.
     - Same meaning.
     - Yet the computed similarity is far from 1.0 (0.428).
  6. 7 / 30 Our Interest: Context
     - There must be a lot of noise in the context that causes an inaccurate similarity measure.
     - If that is the case, we should clean the context before computing similarity.
     - State-of-the-art approaches are used for the other modules.
  7. 9 / 30 Similarity Computation: Framework
     Distributional similarity is basically computed in the same framework (see the sketch below):
     1. A context is extracted for each of the two words,
     2. A vector is made in which each element is a value or a weight,
     3. The two vectors are compared to measure similarity.
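As a rough illustration of these three steps, here is a minimal sketch in Python. The helper names (extract_context, build_vector, compare) are hypothetical, and the count-based vectors and set-overlap comparison are stand-ins for the actual weighting and similarity measure described on the following slides.

```python
# A minimal sketch of the three-step framework; all names are hypothetical.

def extract_context(word, corpus):
    """Step 1: collect the context items observed with `word` (here, precomputed lists)."""
    return corpus.get(word, [])

def build_vector(context):
    """Step 2: turn the context into a feature -> value mapping (raw counts as a stand-in)."""
    vector = {}
    for feature in context:
        vector[feature] = vector.get(feature, 0) + 1
    return vector

def compare(v1, v2):
    """Step 3: compare the two vectors (set overlap as a stand-in for the actual measure)."""
    if not v1 and not v2:
        return 0.0
    return len(set(v1) & set(v2)) / len(set(v1) | set(v2))

# Toy "corpus": each word maps to the context items it was observed with.
corpus = {"boy": ["friend:MOD", "cry:SUBJ"], "girl": ["friend:MOD", "smile:SUBJ"]}
print(compare(build_vector(extract_context("boy", corpus)),
              build_vector(extract_context("girl", corpus))))  # 1/3
```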
  8. 10 / 30 Feature Vector
     [Figure: corpus sentences "A boy's friend ..." and "The boy cried ..." mapped to the feature vector for "boy", with features friend:MOD and cry:SUBJ]
     - Features are a collection of syntactically dependent words together with their syntactic roles (see the sketch below).
     - Compound words are identified.
     - Pointwise mutual information is used as the feature value.
     - Features whose value falls below a threshold α are filtered out.
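A minimal sketch of building such a PMI-weighted feature vector from (word, feature) dependency pairs. The toy pairs and the cut-off ALPHA are illustrative, not the authors' actual preprocessing.

```python
import math
from collections import Counter

# Toy dependency pairs (word, feature); a feature is a governing word plus its syntactic role.
pairs = [
    ("boy", "friend:MOD"), ("boy", "cry:SUBJ"),
    ("girl", "friend:MOD"), ("girl", "smile:SUBJ"),
]

ALPHA = 0.0  # illustrative PMI cut-off; features below it are dropped

word_counts = Counter(w for w, _ in pairs)
feat_counts = Counter(f for _, f in pairs)
pair_counts = Counter(pairs)
total = len(pairs)

def pmi(word, feature):
    """Pointwise mutual information of a (word, feature) pair."""
    p_wf = pair_counts[(word, feature)] / total
    p_w = word_counts[word] / total
    p_f = feat_counts[feature] / total
    return math.log2(p_wf / (p_w * p_f))

def feature_vector(word):
    """PMI-weighted features of `word`, keeping only those at or above ALPHA."""
    return {f: pmi(word, f)
            for (w, f) in pair_counts if w == word and pmi(word, f) >= ALPHA}

print(feature_vector("boy"))  # {'friend:MOD': 0.0, 'cry:SUBJ': 1.0}
```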
  9. 11 / 30 Similarity Function
     - Shibata and Kurohashi (2009) report that Jaccard-Simpson performs better (for Japanese) than Simpson, Cosine, Lin98, and Lin02.
     - We follow their finding and use Jaccard-Simpson (a small sketch follows):
       sim_Jaccard(V1, V2) = |V1 ∩ V2| / |V1 ∪ V2|
       sim_Simpson(V1, V2) = |V1 ∩ V2| / min(|V1|, |V2|)
       sim_JaccardSimpson(V1, V2) = (sim_Jaccard + sim_Simpson) / 2
     Shibata and Kurohashi. Distributional similarity calculation using very large scale Web corpus. ANLP Annual Meeting, pp. 705-708, 2009.
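A small sketch of these overlap measures applied to two feature sets (plain Python sets here; in the paper they are the filtered feature vectors of the two words):

```python
def jaccard(v1, v2):
    """|V1 ∩ V2| / |V1 ∪ V2|"""
    return len(v1 & v2) / len(v1 | v2) if (v1 | v2) else 0.0

def simpson(v1, v2):
    """|V1 ∩ V2| / min(|V1|, |V2|)"""
    return len(v1 & v2) / min(len(v1), len(v2)) if (v1 and v2) else 0.0

def jaccard_simpson(v1, v2):
    """Average of the Jaccard and Simpson coefficients."""
    return (jaccard(v1, v2) + simpson(v1, v2)) / 2

features_1 = {"friend:MOD", "cry:SUBJ", "run:SUBJ"}
features_2 = {"friend:MOD", "smile:SUBJ"}
print(jaccard_simpson(features_1, features_2))  # (0.25 + 0.5) / 2 = 0.375
```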
  10. 12 / 30 Feature Weighting
     - A feature (e.g. friend:MOD for "boy") is reinforced according to how many synonyms of "boy" also have that feature (see the sketch below).
     - All features of all words are weighted, and the values are normalized to [0, 1] for each word.
     - A thesaurus is used to obtain the synonyms.
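A sketch of this reinforcement step. The slide gives only the idea, so the exact reinforcement formula below (multiplying each value by 1 plus the number of synonyms sharing the feature, then min-max normalizing per word) is an assumption:

```python
def reinforce(vectors, synonyms):
    """Weight each feature of a word by how many of the word's thesaurus synonyms
    also have it, then min-max normalize the weights to [0, 1] per word.
    (The reinforcement formula is an assumption; the slide only states the idea.)"""
    weighted = {}
    for word, vec in vectors.items():
        syns = synonyms.get(word, [])
        scores = {}
        for feature, value in vec.items():
            support = sum(1 for s in syns if feature in vectors.get(s, {}))
            scores[feature] = value * (1 + support)
        lo, hi = min(scores.values()), max(scores.values())
        weighted[word] = {f: (v - lo) / (hi - lo) if hi > lo else 1.0
                          for f, v in scores.items()}
    return weighted

vectors = {
    "boy":  {"friend:MOD": 0.4, "cry:SUBJ": 0.9},
    "kid":  {"friend:MOD": 0.5, "play:SUBJ": 0.7},
    "girl": {"friend:MOD": 0.3, "smile:SUBJ": 0.8},
}
synonyms = {"boy": ["kid", "girl"]}
print(reinforce(vectors, synonyms)["boy"])  # friend:MOD -> 1.0, cry:SUBJ -> 0.0
```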
  11. 13 / 30 Feature Reduction: Problem
     - Zhitomirsky-Geffet and Dagan (2009) keep only the "associated" features and discard the rest.
     [Figure: feature vectors of word 1 and word 2 before and after reduction; color indicates the degree of the feature value (high / low)]
     Zhitomirsky-Geffet and Dagan. Bootstrapping Feature Vector Quality. Computational Linguistics, Vol. 35, No. 3, pp. 435-461, 2009.
  12. 14 / 30 Feature Reduction: Problem (continued)
     - However, this measures well only for very similar word pairs with many associated features.
     - When the two words are of middle or low similarity (right figure), little information remains.
     [Figure: feature vectors of word 1 and word 2 before and after reduction; color indicates the degree of the feature value (high / low)]
  13. 15 / 30 Feature Reduction: Our Idea
     - We propose to use the features whose value difference between the two words is less than β (as sketched below).
     - The final similarity is computed by Jaccard-Simpson over the reduced features.
     [Figure: feature vectors of word 1 and word 2 before and after the proposed reduction; color indicates the degree of the feature value (high / low)]
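A sketch of the proposed reduction for one word pair. The slide does not say how features present in only one of the two words are handled; the sketch assumes a missing feature has weight 0, so low-weight unassociated features survive while high-weight unshared features are dropped:

```python
BETA = 0.2  # illustrative threshold on the weight difference

def reduce_pair(v1, v2, beta=BETA):
    """For a word pair, keep the features whose normalized weights differ by less
    than beta. A feature missing from one word is treated as weight 0 there
    (an assumption; the slide does not state this)."""
    kept1, kept2 = set(), set()
    for f in set(v1) | set(v2):
        if abs(v1.get(f, 0.0) - v2.get(f, 0.0)) < beta:
            if f in v1:
                kept1.add(f)
            if f in v2:
                kept2.add(f)
    return kept1, kept2

w1 = {"friend:MOD": 0.9, "cry:SUBJ": 0.10, "run:SUBJ": 0.8}
w2 = {"friend:MOD": 0.8, "smile:SUBJ": 0.15}
k1, k2 = reduce_pair(w1, w2)
print(k1, k2)  # run:SUBJ (high weight, unshared) is dropped; the low-weight features survive
```

The retained sets k1 and k2 would then be passed to the Jaccard-Simpson function shown earlier to obtain the final similarity.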
  14. 17 / 30 Evaluation Metrics: Idea
     - How clearly does the similarity measure distinguish similar word pairs from non-similar ones?
     [Figure: score distributions of similar and non-similar word pairs along the similarity axis, separated by a threshold]
  15. 18 / 30 Evaluation Metrics (continued)
     - However, distinguishing similar from non-similar pairs is an easy task, which makes it hard to tell which method wins.
     - Therefore, we define a more difficult task: distinguishing between different similarity levels.
     - The similarity level is defined by a thesaurus.
       Example for the target "Asia": Level 3: Europe, Level 2: Brazil, Level 1: my country, Level 0: system.
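The slides do not spell out the statistic behind the result tables that follow. One plausible reading, and only an assumption here, is the best accuracy achievable when a single similarity threshold separates word pairs of the higher level from those of the adjacent lower level:

```python
def separation_accuracy(high_level_sims, low_level_sims):
    """Best accuracy of a single threshold that labels pairs scoring at or above it
    as the higher similarity level. This is an assumed reading of the evaluation,
    not the paper's stated definition."""
    best = 0.0
    total = len(high_level_sims) + len(low_level_sims)
    for t in sorted(set(high_level_sims + low_level_sims)):
        correct = sum(s >= t for s in high_level_sims) + sum(s < t for s in low_level_sims)
        best = max(best, correct / total)
    return best

print(separation_accuracy([0.8, 0.7, 0.6], [0.5, 0.4, 0.65]))  # ≈ 0.83
```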
  16. 19 / 30 Experimental Setting
     Compared with two benchmarks:
     - Shibata and Kurohashi (2009): Jaccard-Simpson without feature reduction.
     - Zhitomirsky-Geffet and Dagan (2009): reinforce associated features.
     Setting:
     - Corpus: the Nikkei newspaper corpus, 14 years.
     - Thesaurus: Bunrui Goi Hyo.
     - Number of target words: 75,530.
     - Evaluation set: 800 pairs at each level.
  17. 21 / 30 Result
     Similarity level   Shibata & Kurohashi   Zhitomirsky-Geffet & Dagan   Our method
     Level 3+2          0.702                 0.791                        0.797
     Level 2+1          0.747                 0.771                        0.773
     Level 1+0          0.838                 0.789                        0.840
     Our method (slightly) outperforms both benchmarks at every level.
  18. 22 / 30 Result (continued)
     (Same results table as above.)
     Shibata and Kurohashi (2009) always keeps many features, which degrades performance particularly at the higher levels.
  19. 23 / 30 Result (continued)
     (Same results table as above.)
     Zhitomirsky-Geffet and Dagan (2009) reduces such noise, which gives better results at the higher levels, but drops at the lower level due to the lack of features.
  20. 24 / 30 Result (continued)
     (Same results table as above.)
     Our proposed method maintains performance at the higher levels, while improving performance at the lower level to be close to Shibata & Kurohashi.
  21. 25 / 30 Discussion: Error Analysis
     - Major errors are NOT due to a lack of features (table below).
     - Hence, key features are being removed and/or noisy features remain after the reduction.
                          #errors   errors with < 20 features
     Level 3+2 (high)     125       32 (26%)
               (low)      220       60 (27%)
     Level 2+1 (high)     137       32 (23%)
               (low)      253       52 (21%)
     Level 1+0 (high)     149       4 (3%)
               (low)      100       3 (3%)
  22. 26 / 30 Discussion: Feature Reduction
     - We can reduce 81% of the features at level 3+2, 87% at level 2+1, and 52% at level 1+0.
     - These proportions are obtained by observing the performance changes.
     - This is not surprising, since Hagiwara et al. (2006) report similar statistics (90%).
     - There is still a lot that could be reduced further.
     Hagiwara et al. Selection of Contextual Information for Automatic Synonym Acquisition. Proc. of Coling-ACL, pp. 353-360, 2006.
  23. 28 / 30 Conclusions
     - A new method for lexical distributional similarity is proposed.
     - Not only associated features but even unassociated features can improve lexical distributional similarity.
     - Experimental results show (slightly) better performance at all levels of similarity.
  24. 29 / 30 Future Work
     Again, the two identical words:
     たばこ (tobacco)
     タバコ (tobacco)
     The similarity is still far from 1.0.
  25. 30 / 30 Thank you!
     - Questions are welcome.
     - Contact: [email protected]