CHARAGRAM: Embedding Words and Sentences via Character n-grams

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu
Toyota Technological Institute at Chicago
EMNLP 2016, pages 1504-1515

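Research Objective
Embed words and sentences into a better feature space.
What does "better" mean? An embedding method is better if it scores higher on evaluation tasks such as word similarity and sentence similarity.
※ Image source : http://www.samyzaf.com/ML/nlp/nlp.html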

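Contributions
Proposes a method that embeds words and sentences using character n-grams.
Evaluated on three tasks, outperforming character-based CNN and LSTM models:
1. word similarity
2. sentence similarity
3. part-of-speech tagging (omitted in these slides)
Analyzes properties such as fast training convergence and robustness to unknown words.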

Agenda
Related Work
Models – Proposed Method, Comparative Methods
Experiments – Word Similarity, Sentence Similarity
Analysis – Quantitative, Qualitative
Summary

Related Work
Most work using word embeddings treats the word as the smallest unit of meaning.
Example : Skip-gram
[Figure: Skip-gram over the token sequence "<SOS> I have a pen . <EOS>"]

Related Work
Incorporating subwords into the model is expected to improve accuracy.
Example : literal, literature, literary, literate
Recently, character-based models have reached state-of-the-art results on several tasks:
– POS tagging with char Bi-LSTMs (Ling et al., 2015)
– Language modeling with char CNNs (Kim et al., 2015)
Subword information × character-based model = character n-grams (CHARAGRAM)

Models

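Models : Proposed Method

$$g_{char}(x) = h\Big(b + \sum_{i=1}^{m} \sum_{j=i}^{\min(m,\, i+k-1)} I\big[x_i^j \in V\big]\, W^{x_i^j}\Big)$$

$x$ : character sequence
$g_{char}(x)$ : function that maps $x$ to an embedding vector ($d$ dimensions)
$x_i^j$ : subsequence of $x$ from the $i$-th to the $j$-th character (a char n-gram)
$V$ : vocabulary of char n-grams
$I[\cdot]$ : indicator function; 1 if true ($x_i^j \in V$), 0 if false ($x_i^j \notin V$)
$W^{x_i^j}$ : embedding vector of $x_i^j$ ($d$ dimensions)
$m$ : number of characters in $x$
$k$ : maximum length among all char n-grams
$b$ : bias term ($d$ dimensions)
$h(\cdot)$ : nonlinear (activation) function, e.g. tanh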

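Models : Proposed Method
Example : for a three-character sequence $x = c_1 c_2 c_3$,

$$g_{char}(x) = h\big(b + I[c_1 \in V]\, W^{c_1} + I[c_1c_2 \in V]\, W^{c_1c_2} + I[c_1c_2c_3 \in V]\, W^{c_1c_2c_3} + I[c_2 \in V]\, W^{c_2} + I[c_2c_3 \in V]\, W^{c_2c_3} + I[c_3 \in V]\, W^{c_3}\big)$$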
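The definition above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names `char_ngrams` and `charagram_embed`, the toy dimension `d = 4`, the random initialization, and the one-word vocabulary are all assumptions made for the example. The n-gram sizes default to the n ∈ {2, 3, 4} setting listed in the hyperparameter slides.

```python
import math
import random


def char_ngrams(text, n_values=(2, 3, 4)):
    """Return every character n-gram of `text` for the given n values."""
    return [text[i:i + n]
            for n in n_values
            for i in range(len(text) - n + 1)]


def charagram_embed(text, W, b, n_values=(2, 3, 4), h=math.tanh):
    """Compute h(b + sum of W[g]) over the n-grams g of `text` that are in W."""
    total = list(b)                       # start from the bias vector
    for gram in char_ngrams(text, n_values):
        vec = W.get(gram)                 # indicator: n-grams outside V add nothing
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return [h(t) for t in total]


# Toy setup (hypothetical): d = 4, vocabulary V built from a single word.
random.seed(0)
d = 4
W = {g: [random.uniform(-0.1, 0.1) for _ in range(d)]
     for g in char_ngrams("vehicles")}
b = [0.0] * d

emb_known = charagram_embed("vehicles", W, b)
emb_typo = charagram_embed("vehicals", W, b)   # unseen spelling, shares n-grams
```

Because a misspelling such as "vehicals" still shares n-grams like "veh" and "ehi" with "vehicles", it receives a non-trivial embedding even though the full word was never seen, which is the intuition behind the robustness discussed later in the deck.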

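Models : Comparative Method
charLSTM
$x$ : character sequence
$h_m$ : hidden state at the final time step → used as the embedding vector of $x$
[Figure: characters $x_1, x_2, x_3, \dots, x_m$ each feed an LSTM block, producing hidden states $h_1, h_2, \dots, h_m$; $h_m$ goes to the loss function]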

Models : Comparative Method
charLSTM (detail of one LSTM block)
[Figure: an LSTM block consisting of an input gate, forget gate, memory cell, output gate, and hidden layer]

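Models : Comparative Method
charCNN
$x$ : character sequence
Output computed from the character embeddings → used as the embedding vector of $x$
[Figure: characters $x_1, x_2, x_3, x_4, \dots, x_m$ → Character Embedding → Convolution (number of filters) → Max-over-time Pooling → Fully connected layer → loss function]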
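The convolution and max-over-time pooling stages of the charCNN pipeline can be illustrated with a minimal pure-Python sketch. It is a simplified stand-in, not the configuration used in the paper: the function name `conv_max_over_time`, the one-dimensional toy character vectors, and the two hand-written filters are hypothetical, and the fully connected layer is left out. It also assumes the input has at least `width` characters.

```python
def conv_max_over_time(char_vecs, filters, width):
    """Slide each filter over windows of `width` characters and keep the max score.

    char_vecs: list of per-character embedding vectors (all of the same size)
    filters:   list of flat weight lists, each of length width * embedding size
    """
    pooled = []
    for filt in filters:
        scores = []
        for start in range(len(char_vecs) - width + 1):
            # Concatenate the embeddings of the characters in this window.
            window = [v for vec in char_vecs[start:start + width] for v in vec]
            scores.append(sum(w * v for w, v in zip(filt, window)))
        pooled.append(max(scores))        # max-over-time pooling
    return pooled


# Toy input (hypothetical): three characters with 1-dimensional embeddings.
char_vecs = [[1.0], [2.0], [3.0]]
filters = [[1.0, 1.0], [1.0, -1.0]]       # two filters of width 2
features = conv_max_over_time(char_vecs, filters, width=2)
```

The pooled vector has one entry per filter regardless of the input length, which is why a fixed-size layer can follow it.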

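Models : Loss Function
Margin-based loss

$$\min_{\theta}\ \frac{1}{|X|} \sum_{\langle x_1, x_2 \rangle \in X} \Big[ \max\big(0,\ \delta - \cos(g(x_1), g(x_2)) + \cos(g(x_1), g(t_1))\big) + \max\big(0,\ \delta - \cos(g(x_1), g(x_2)) + \cos(g(x_2), g(t_2))\big) \Big] + \lambda \lVert \theta \rVert^2$$

$X$ : set of phrase pairs; $\langle x_1, x_2 \rangle$ : a phrase pair
$g(\cdot)$ : embedding function; $\theta$ : parameters
$t_1, t_2$ : negative examples drawn from the mini-batch
$\delta$ : margin; $\lambda$ : regularization parameter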
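As a rough illustration of the margin-based loss, the sketch below computes the two hinge terms for a single phrase pair from already-computed embedding vectors. The function names `cosine` and `pair_hinge_loss` are hypothetical, the regularization term is omitted, and the vectors are assumed to be non-zero. The default margin of 0.4 is the value listed in the hyperparameter slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_hinge_loss(g_x1, g_x2, g_t1, g_t2, margin=0.4):
    """Loss for one phrase pair <x1, x2> with negative examples t1 and t2."""
    pos = cosine(g_x1, g_x2)              # similarity of the paraphrase pair
    loss_1 = max(0.0, margin - pos + cosine(g_x1, g_t1))
    loss_2 = max(0.0, margin - pos + cosine(g_x2, g_t2))
    return loss_1 + loss_2


# Toy vectors (hypothetical): identical positives, orthogonal negatives.
easy = pair_hinge_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
# Orthogonal positives, negatives identical to the anchors.
hard = pair_hinge_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

The loss is zero once the positive pair is more similar than each negative by at least the margin, so well-separated pairs stop contributing to the gradient.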

Models : Loss Function
Negative examples
– MAX : from the mini-batch, choose the phrase $t$ that is most similar to $x_1$ ($x_2$), i.e. the most similar phrase in some set of phrases.
– MIX : select negative examples using MAX with probability 0.5 and select them randomly from the mini-batch otherwise.
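The two negative-sampling strategies can be sketched as follows. This is a hedged illustration rather than the authors' code: `pick_negative` is a hypothetical helper, the mini-batch is represented as a flat list of embedding vectors, and excluding the phrase itself and its paraphrase partner from the candidates is an assumption of this sketch.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pick_negative(i, partner, embeddings, mode="MAX", rng=random):
    """Return the index of a negative example for phrase i within one mini-batch."""
    # Assumption: the phrase itself and its paraphrase partner are excluded.
    candidates = [j for j in range(len(embeddings)) if j not in (i, partner)]
    if mode == "MIX" and rng.random() >= 0.5:
        return rng.choice(candidates)     # random half of MIX
    # MAX (and the other half of MIX): most similar phrase in the batch.
    return max(candidates, key=lambda j: cosine(embeddings[i], embeddings[j]))


# Toy mini-batch (hypothetical): indices 0 and 1 form a paraphrase pair.
batch = [[1.0, 0.0], [0.9, 0.3], [0.8, 0.1], [-1.0, 0.2]]
hardest = pick_negative(0, 1, batch, mode="MAX")
mixed = pick_negative(0, 1, batch, mode="MIX", rng=random.Random(0))
```

MAX always returns the hardest candidate, while MIX trades some of that difficulty for variety by sampling uniformly half of the time.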

Models : Loss Function
Each hinge term is bounded below by 0. Training increases the similarity of the positive pair, $\cos(g(x_1), g(x_2))$.


Experiments

Experiments : Word Similarity
Datasets
– Training : Paraphrase Database (lexical), 770,007 word pairs
  • Pairs of words with the same meaning, e.g. ( "strengthens" | "toughens" )
– Tuning & Evaluation : WordSim-353 (353 word pairs + scores), SimLex-999 (999 word pairs + scores)
  • Pairs of words with similar meaning + score [0-10], e.g. ( "football" | "soccer" ) : 9.03

Experiments : Word Similarity
Training & Tuning
– Training : PPDB (770,007 word pairs) → 1 epoch
– Tuning : WordSim-353, SimLex-999 → 50 epochs
– Hyperparameters
  • Common : mini-batch size (25 or 50), margin (0.4), dimensionality (300), regularization parameter (10^-4, 10^-5, or 10^-6), optimizer (Adam, lr = 0.001)
  • CHARAGRAM : char n-grams (n ∈ {2, 3, 4}), activation function (tanh or …)
  • charLSTM : output gate (on or off)
  • charCNN : number of filters (25 or 125), dropout (on or off), activation function (tanh or …)

Experiments : Word Similarity
Results
– Metric : Spearman's rank correlation coefficient
– The proposed method outperforms charCNN and charLSTM
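For readers unfamiliar with the metric, Spearman's rank correlation is the Pearson correlation of the ranks of two score lists; here those would be the human similarity scores and the model's cosine similarities. The sketch below is a small stdlib-only version with average ranks for ties. The function names and the toy score lists are hypothetical and are not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy scores (hypothetical): human judgments vs. model cosine similarities.
human = [9.0, 1.5, 6.0, 3.2]
model = [0.8, 0.1, 0.3, 0.4]
rho = spearman(human, model)
```

Only the ordering matters, so the metric is unaffected by the different scales of the human scores and the cosine similarities.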

Experiments : Sentence Similarity
Datasets
– Training : Paraphrase Database (phrasal), 9,123,575 phrase pairs
  • Pairs of phrases with the same meaning, e.g. ( "fast , easy and" | "quick , easy and" )
– Evaluation : SemEval semantic textual similarity (STS) tasks ('12-'15), SemEval 2014 SICK Semantic Relatedness task, SemEval 2015 Twitter task
  • Pairs of sentences with similar meaning + score [0-5], e.g. ( "Fourth arrest in body-in-bin probe" | "Three held after body found in bin" ) : 3.1259

Experiments : Sentence Similarity
Training & Tuning
– Training : PPDB (3,033,753 phrase pairs) → 1 epoch
– Tuning : PPDB (9,123,575 phrase pairs) → 10 epochs
– Hyperparameters
  • Common : mini-batch size (100), margin (0.4), dimensionality (300), regularization parameter (10^-4, 10^-5, or 10^-6), optimizer (Adam, lr = 0.001)
  • CHARAGRAM : char n-grams (n ∈ {2, 3, 4}), activation function (tanh or …)
  • charLSTM : output gate (on or off)
  • charCNN : number of filters (25 or 125), dropout (on or off), activation function (tanh or …)

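Experiments : Sentence Similarity
PARAGRAM-PHRASE
– Average of the per-word char n-gram embedding vectors
CHARAGRAM-PHRASE
– Average of embedding vectors that also take char n-grams spanning word boundaries into account
– Can incorporate word order and co-occurrence into the model
Example sentence : "I have a pen."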
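The difference between per-word n-grams and n-grams that span word boundaries can be made concrete with the example sentence from the slide. This is only an illustrative sketch: `char_ngrams` is a hypothetical helper, whitespace splitting stands in for tokenization, and n ∈ {2, 3, 4} follows the hyperparameter slides.

```python
def char_ngrams(text, n_values=(2, 3, 4)):
    """Return every character n-gram of `text` for the given n values."""
    return [text[i:i + n]
            for n in n_values
            for i in range(len(text) - n + 1)]


sentence = "I have a pen."

# N-grams extracted word by word never contain a space.
per_word = {g for word in sentence.split() for g in char_ngrams(word)}

# N-grams extracted from the whole sentence also cover word boundaries.
whole_sentence = set(char_ngrams(sentence))

# These exist only when the sentence is treated as one character sequence.
cross_word = whole_sentence - per_word
```

N-grams such as "e a" record that "have" is followed by "a", which is how a bag of character n-grams can still carry some word-order and co-occurrence information.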

Experiments : Sentence Similarity
Results
– Metric : Spearman's rank correlation coefficient
– The proposed method outperforms charCNN and charLSTM
– On individual subtasks (e.g. STS '12 news), it reaches state of the art on 5 of 22

Analysis

Analysis : Convergence
Accuracy plotted per epoch.
CHARAGRAM converges far faster during training.
– It has fewer parameters than the other char-based models.
  CHARAGRAM : 90,000
  charCNN : 881,025
  charLSTM : 763,200

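Analysis : Quantitative
Evaluation on unknown words
– Evaluated on sentence pairs that contain unknown words (N : number of sentence pairs)
– Metric : Spearman's rank correlation coefficient
– CHARAGRAM-PHRASE is robust to unknown words
  • because it can embed arbitrary character sequences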

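Analysis : Quantitative
Evaluation by sentence length (in characters)
– Evaluated on sentence pairs grouped by length (N : number of sentence pairs)
– Metric : Spearman's rank correlation coefficient
– The proposed method is also robust to sentence length
  • accuracy tends to improve as sentences get longer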

Analysis : Qualitative
Visualization of the nearest neighbors of word bigrams of the form "not ~"
– CHARAGRAM-PHRASE embeds negated expressions correctly
  • "not able" → "unable", "incapable", etc.
– PARAGRAM-PHRASE fails to capture expressions that span multiple words

Analysis : Qualitative
Visualization of the nearest neighbors of unknown words under CHARAGRAM-PHRASE
– Robust to misspellings ("vehicals", "vehicels") and informal spellings ("babyyyyyy")
– Words sharing a stem ("journeying" and "journey") and synonyms ("vehicles", "cars", "automobiles") are also embedded close together

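Summary
CHARAGRAM is proposed
– An embedding method based on character n-grams
– Properties include fast training convergence and robustness to unknown words
Effectiveness is assessed on evaluation tasks
– Tested on word similarity, sentence similarity, and POS tagging
– Outperforms charCNN and charLSTM
Quantitative and qualitative analysis
– Confirms that the learned features are robust to a moderate amount of noise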

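Feelings
The approach is simple, but the tuning needed to reach the reported accuracy is laborious.
– Is it essentially the same as fastText (FAIR @ TACL '16), differing only in the task?
Good : The authors devoted substantial effort to analysis (quantitative and qualitative).
– Cases where the method works are visualized, which makes them easy to understand.
Bad : The description of parameter tuning is disorganized.
– It gives the impression that the experiments favor the proposed method.
The implementation was in Theano + Lasagne, which was painful...
– The data is large and computation takes a long time...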
