What Substitutes Tell Us – Analysis of an “All-Words” Lexical Substitution Corpus

Gerhard Kremer, Katrin Erk, Sebastian Padó, Stefan Thater
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 540–549, Gothenburg, Sweden, April 26-30 2014.
Natural Language Processing Laboratory, B4 勝田哲弘, 2017/9/20
Figures and tables are quoted from the paper.

Overview
• Construction of a large-scale English “all-words lexical substitution” corpus
  ▫ Synonym dictionary (thesaurus)
• Comparison with WordNet and the SEMEVAL lexical substitution data

Introduction
• Methods for resolving word sense ambiguity
  ▫ Supervised word sense disambiguation
  ▫ WSD (McCarthy, 2008; Navigli, 2009)
• WordNet
  ▫ Its coverage and granularity have been criticized
• Lexical Substitution
  ▫ (McCarthy and Navigli, 2009)
  ▫ Lists substitution candidates for a word in its context

Introduction
• Lexical Substitution
  ▫ Only small-scale data exist
• Build a large-scale dataset
  ▫ MASC (more than 30,000 words)
• Properties of lexical substitution
• Investigate the properties of word senses in context

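Amazon Mechanical Turk (AMT)
• HITs
  ▫ Three sentences are shown, and one word in them is rewritten, as a rule with a single word
  ▫ Six annotators were asked to rewrite each word
• Dataset
  ▫ 15,629 target words in 2,474 sentences (7,117 nouns, 4,617 verbs, 2,470 adjectives, 1,425 adverbs)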
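As a rough illustration of how the six responses collected for one target word could be pooled, the sketch below merges them into a frequency-weighted set of substitutes (a paraset). The record layout, the helper name `build_paraset`, and the example words are assumptions made for illustration, not the format used by the authors. Only the part-of-speech counts come from the slide.

```python
from collections import Counter

# Part-of-speech counts as reported on the slide.
POS_COUNTS = {"noun": 7117, "verb": 4617, "adjective": 2470, "adverb": 1425}
TOTAL_TARGETS = sum(POS_COUNTS.values())  # should match the reported 15,629


def build_paraset(responses):
    """Merge per-annotator substitute lists into a Counter of substitutes.

    `responses` is a hypothetical structure: one list of strings per annotator.
    Each substitute is counted at most once per annotator.
    """
    counts = Counter()
    for substitutes in responses:
        normalized = {s.strip().lower() for s in substitutes if s.strip()}
        counts.update(normalized)
    return counts


# Invented example: six annotators rewriting the same target word.
responses = [
    ["smart"], ["clever"], ["smart", "intelligent"],
    ["brilliant"], ["Smart"], ["clever"],
]
paraset = build_paraset(responses)
```

The counts in `paraset` can then serve as gold weights when a ranking of candidates is evaluated.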

Inter-Annotator Agreement

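Characterising Lexical Substitutions
• Investigate the following using the corpus
  ▫ What relations hold between target words and their substitution candidates
  ▫ Whether parasets resemble word senses
• Comparison with WordNet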

Characterising Lexical Substitutions
• Synonyms (syn), direct/transitive (direct/trans) hypernyms (hyper), and hyponyms (hypo)
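The relation labels above can be sketched as a lookup over a hypernym graph. This is a minimal sketch assuming a tiny hand-made lexicon: the entries in `SYNONYMS` and `HYPERNYM` are invented for illustration and are not WordNet data, and the label strings are hypothetical names for the categories on the slide.

```python
# Toy lexicon, invented for illustration (not WordNet data).
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}
HYPERNYM = {  # word -> its direct hypernyms
    "car": {"motor_vehicle"},
    "automobile": {"motor_vehicle"},
    "motor_vehicle": {"vehicle"},
    "taxi": {"car"},
}


def ancestors(word):
    """Return {hypernym: distance} for every word reachable via hypernym links."""
    seen, frontier, depth = {}, {word}, 0
    while frontier:
        depth += 1
        next_frontier = set()
        for w in frontier:
            for h in HYPERNYM.get(w, ()):
                if h not in seen:
                    seen[h] = depth
                    next_frontier.add(h)
        frontier = next_frontier
    return seen


def relation(target, substitute):
    """Label the relation of `substitute` to `target` in the toy lexicon."""
    if substitute in SYNONYMS.get(target, ()):
        return "syn"
    up = ancestors(target)
    if substitute in up:
        return "direct-hyper" if up[substitute] == 1 else "trans-hyper"
    down = ancestors(substitute)
    if target in down:
        return "direct-hypo" if down[target] == 1 else "trans-hypo"
    return "other"
```

For example, `relation("car", "vehicle")` is a transitive hypernym here because `vehicle` is two links above `car`.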

Characterising Lexical Substitutions

Ranking Paraphrases
• Comparison with McCarthy and Navigli's SEMEVAL 2007 dataset using three computational models
  ▫ Erk and Padó (2008, EP08)
  ▫ Thater et al. (2010, TFP10)
  ▫ Thater et al. (2011, TFP11)
• GAP (Kishida, 2005) is used, with ranked lists built so that frequency is given weight
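GAP (generalized average precision) can be sketched as follows, assuming the commonly used formulation in which each gold substitute carries a weight (here, how many annotators proposed it) and the score of a ranking is normalized by the score of the ideal ranking. The function name `gap` and the example weights are illustrative assumptions, not taken from the slides.

```python
def gap(ranked, gold):
    """Generalized average precision of `ranked` against gold weights.

    ranked: list of candidate substitutes, best first.
    gold:   dict mapping gold substitutes to positive weights (e.g. frequency).
    """
    def score(weights):
        total, running = 0.0, 0.0
        for i, x in enumerate(weights, start=1):
            running += x
            if x > 0:                  # only positions holding a gold item count
                total += running / i   # mean gold weight of the top-i items
        return total

    model = score([gold.get(c, 0) for c in ranked])
    ideal = score(sorted(gold.values(), reverse=True))
    return model / ideal if ideal else 0.0


# Invented gold weights for one target word.
gold = {"smart": 3, "clever": 2, "bright": 1}
perfect = gap(["smart", "clever", "bright"], gold)
reversed_order = gap(["bright", "clever", "smart"], gold)
```

A ranking that puts frequent substitutes first scores higher than one that puts them last, which is how frequency enters the evaluation.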

Ranking Paraphrases
• Target words are represented as vectors, and candidates are ranked by cosine similarity
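The ranking step can be illustrated with a minimal sketch, assuming the vectors are already given. How each model (EP08, TFP10, TFP11) builds its vectors is not described on the slides, so the three-dimensional vectors and the candidate words below are invented toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(target_vec, candidate_vecs):
    """Return candidate words sorted by cosine similarity to the target, best first."""
    return sorted(
        candidate_vecs,
        key=lambda word: cosine(target_vec, candidate_vecs[word]),
        reverse=True,
    )


# Toy vectors, invented for illustration.
target = [1.0, 0.2, 0.0]
candidates = {
    "heavy": [0.0, 0.1, 1.0],
    "smart": [0.9, 0.3, 0.1],
    "shiny": [0.1, 0.9, 0.2],
}
ranking = rank_candidates(target, candidates)
```

The resulting list is the kind of ranked output that a measure such as GAP would then compare against the annotators' substitutes.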

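Ranking Paraphrases
• Factors that lower the COINCO results in the Context setting
  ▫ Annotation task setup
  ▫ Sense distribution
  ▫ Frequency and part-of-speech distribution
• The frequency distribution was matched to that of SEMEVAL
  ▫ COINCO subset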

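Summary
• Advantages
  ▫ Covers continuous documents
  ▫ Its large size allows a more detailed analysis of lexical substitution
• A single target word resembles a WordNet synset
• Context carries meaning components that WordNet cannot distinguish.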
