Simple task-specific bilingual word embeddings

Literature review: Simple task-specific bilingual word embeddings. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1386–1390, Denver, Colorado, May 31 – June 5, 2015. Presenter: 勝田哲弘, Natural Language Processing Laboratory, Nagaoka University of Technology

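Abstract
• Learn task-specific bilingual word embeddings using existing word embedding algorithms
• Use a task-specific dictionary of equivalent words to adapt the model's training
• Advantages
  • Independent of the embedding model
  • Requires no parallel corpus
  • Can be adapted to another task by redefining the dictionary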

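Introduction
• Word embeddings
  • Learn word representations for a single language
  • Place syntactically similar words close together in the embedding space
  • Applied to many tasks, such as named entity recognition and dependency parsing
• This work uses a dictionary as weak (distant) supervision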

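Introduction
• Bilingual word embeddings
  • Combine the embeddings of two different languages so that similar words lie close together in the embedding space
  • Learn task-specific bilingual embeddings for tasks such as POS tagging, named entity recognition, or sentiment analysis
• Presents a simple wrapper method around existing monolingual word embedding algorithms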

Introduction - contributions
• Proposes a new approach to learning bilingual word embeddings
• Proposed model
  • Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives (BARISTA)
  • Input: two (non-parallel) corpora and a small dictionary
• Task-dependent dictionary
  • EN car, FR maison (‘house’): equivalent as parts of speech
  • EN house, FR maison: equivalent as translations

Approach
• Extract word equivalences from WordNet or similar resources and incorporate them into training
  • word alignment bases (e.g., house ∼ maison)
  • knowledge bases (e.g., car ∼ maison)
• Use these to generate mixed context-target pairs
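As a rough illustration of how such a task-specific dictionary might be assembled, the sketch below groups the words of two languages by a shared class label (a POS tag, a supersense, or a translation ID) and maps each word to its equivalents in the other language. The function name `build_equivalences`, the word-to-label input format, and the toy word lists are all assumptions made for illustration; the slides do not specify a data format.

```python
from collections import defaultdict


def build_equivalences(source_classes, target_classes):
    """Map each word to the words of the other language sharing its class.

    Both arguments are dicts of word -> class label (hypothetical format).
    """
    by_class_src = defaultdict(list)
    by_class_tgt = defaultdict(list)
    for word, label in source_classes.items():
        by_class_src[label].append(word)
    for word, label in target_classes.items():
        by_class_tgt[label].append(word)

    equivalences = {}
    # Source words point to target words with the same label, and vice versa.
    for word, label in source_classes.items():
        if by_class_tgt[label]:
            equivalences[word] = sorted(by_class_tgt[label])
    for word, label in target_classes.items():
        if by_class_src[label]:
            equivalences[word] = sorted(by_class_src[label])
    return equivalences


# Toy POS-based dictionary: words are equivalent if they share a tag.
en_pos = {"car": "NOUN", "house": "NOUN", "build": "VERB"}
fr_pos = {"maison": "NOUN", "construire": "VERB"}
pos_dict = build_equivalences(en_pos, fr_pos)
```

With POS tags as labels, `car` and `maison` become equivalent; replacing the labels with translation IDs would instead pair only `house` with `maison`, which is how redefining the dictionary changes the task.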

Approach
1. Concatenate and shuffle Ct and Cs (Ct: target corpus, Cs: source corpus)
2. For each word, with probability 1/2, replace it with a randomly chosen equivalent from the dictionary R
3. For example, the English sentence “build the house” may become: construire the house, build la maison, build the maison, etc.
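The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes shuffling happens at the sentence level, that corpora are lists of tokenized sentences, and that R maps each word to a list of equivalents. The name `barista_mix` and the toy data are hypothetical.

```python
import random


def barista_mix(target_corpus, source_corpus, dictionary, rng=None):
    """Concatenate, shuffle and randomly substitute words across two corpora.

    target_corpus, source_corpus: lists of tokenized sentences (lists of str).
    dictionary: word -> list of equivalent words (R on the slide).
    """
    rng = rng or random.Random()
    # Copy the sentences so the input corpora are left untouched.
    mixed = [list(s) for s in target_corpus + source_corpus]
    rng.shuffle(mixed)  # step 1: concatenate and shuffle (sentence level, assumed)
    for sentence in mixed:
        for i, word in enumerate(sentence):
            candidates = dictionary.get(word)
            # step 2: with probability 1/2, swap in a random equivalent
            if candidates and rng.random() < 0.5:
                sentence[i] = rng.choice(candidates)
    return mixed


# Toy translation dictionary and one-sentence corpora.
R = {"build": ["construire"], "the": ["la"], "house": ["maison"],
     "construire": ["build"], "la": ["the"], "maison": ["house"]}
english = [["build", "the", "house"]]
french = [["construire", "la", "maison"]]
mixed_corpus = barista_mix(english, french, R, rng=random.Random(0))
```

The resulting mixed corpus can then be passed unchanged to any monolingual embedding trainer, which is what makes the method independent of the embedding algorithm.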

Experiments
• Word2vec CBOW
  • Learning rate 0.1, window 4
• POS tagging dataset
  • Google’s universal tagset
• SuS (supersense) tagging dataset
  • Princeton WordNet and DanNet
• Translation: Google Translate

Qualitative Evaluation
• POS classes
  • Trained on English-German Europarl using POS classes
  • Both languages share the same POS classes in the embedding space
  • Fine-grained relations are not preserved

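Qualitative Evaluation
• Translation classes
  • Built a dictionary of the 20,000 most frequent English words
    • Translated with Google Translate
  • Finer-grained relations can be observed
  • POS information is partially visible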

Cross-language part-of-speech tagging
• Train on English data with the bilingual embeddings, then apply the model to another language
• Uses data for Danish, German, Spanish, Italian, Dutch, Portuguese, and Swedish
• The tagsets are unified to 12 POS tags
• Baseline: type-constrained structured perceptron
• Comparison: random embeddings, Klmtv provided by Klementiev et al. (2012)

Cross-language part-of-speech tagging

Cross-language supersense tagging
• Tested BARISTA embeddings for English-Danish
• English SemCor (1,000 sentences), Danish (320 sentences)
• Baseline: most frequent sense (MFS), structured perceptron model trained only with orthographic and POS features
• Metric: weighted average over F1-scores
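The slides do not say how the weights in the metric are defined. The sketch below assumes each label's F1 is weighted by its frequency in the gold data, which is the convention used by scikit-learn's `average="weighted"` option. The function name `weighted_f1` and the toy supersense labels are illustrative only.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Average per-label F1, weighting each label by its gold frequency."""
    support = Counter(gold)
    total = 0.0
    for label, n_gold in support.items():
        # True positives: positions where gold and prediction both equal label.
        tp = sum(1 for g, p in zip(gold, predicted) if g == p == label)
        n_pred = sum(1 for p in predicted if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_gold
    return total / len(gold)


# Toy example with made-up tags for four tokens.
gold = ["noun.person", "noun.person", "verb.motion", "O"]
pred = ["noun.person", "verb.motion", "verb.motion", "O"]
score = weighted_f1(gold, pred)
```

Weighting by gold frequency keeps rare supersenses from dominating the score, at the cost of hiding poor performance on them.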

Cross-language supersense tagging

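Conclusions
• Introduced BARISTA, a simple approach to learning bilingual embeddings
• BARISTA (a) does not depend on the choice of embedding algorithm, (b) requires no parallel data, and (c) can be adapted to a specific task by using an appropriate dictionary
• It is useful for cross-language POS / SuS tagging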
