word2vec + α

mt_caret, kml輪講 (kml reading group), 2018-05-25

Overview
Starting from no prior knowledge of natural language processing, this talk follows how word2vec works and how the work developed afterwards.
- Linguistic Regularities in Continuous Space Word Representations
- Efficient Estimation of Word Representations in Vector Space
- Distributed Representations of Words and Phrases and their Compositionality
- (word2vec Parameter Learning Explained)
- (word2vec Explained: Deriving Mikolov et al.'s Negative Sampling Word-Embedding Method)

Representing words
Figure 1: One-hot vectors (from https://blog.acolyer.org)
Consider a model of the form $y = f(Wx)$. If $x$ is the one-hot vector for a word, then $Wx$ simply picks out one column of $W$. Each column of $W$ can therefore be read as corresponding to one word.
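The column-selection argument can be checked with a small sketch. The vocabulary, the matrix values, and the helper names `one_hot` and `matvec` are made up for illustration; the only assumption carried over from the slide is that each column of W belongs to one word.

```python
# Hypothetical 3-dimensional weight matrix W for a 4-word vocabulary.
# Each column of W belongs to one word; the numbers are made up.
vocab = ["king", "queen", "man", "woman"]
W = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a one-hot vector with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


queen_id = vocab.index("queen")
x = one_hot(queen_id, len(vocab))

# Read the "queen" column directly out of W for comparison.
column = [row[queen_id] for row in W]

# W x yields exactly the column of W that belongs to "queen".
selected = matvec(W, x)
print(selected, column)
```

In practice the multiplication is never carried out; the column is simply looked up by index, which is why the first layer behaves like an embedding table.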

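Representing words
It follows that any neural-network-based model that takes one-hot word vectors as input yields, from the columns of its first-layer weights, a continuous vector for each word: a distributed representation.
Figure 2: Distributed representations (from https://blog.acolyer.org)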

Linguistic Regularities in Continuous Space Word Representations

Linguistic Regularities in Continuous Space Word Representations
These distributed representations capture the syntactic and semantic structure of language well.
Syntactic structure: apple − apples ≃ car − cars
Semantic structure: woman − man ≃ queen − king
Figure 3: Distributed representations (from https://blog.acolyer.org)

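Linguistic Regularities in Continuous Space Word Representations
A test set was prepared to check for syntactic and semantic structure. Given a relation $a : b$, $c : d$ in which $d$ is the word to be found, the answer is taken to be the word whose vector is closest in cosine distance to $x_b - x_a + x_c$, and accuracy is measured over the test set.
Figure 4: The syntactic test set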
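The analogy test can be sketched as follows. The vectors are hand-made toy values rather than trained embeddings, and excluding the three query words from the candidates is an assumption of this sketch (a common convention), not something stated on the slide.

```python
import math

# Hypothetical toy vectors (axes: royalty, femaleness, constant); not trained.
vectors = {
    "man":   [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.2, 0.1, 0.1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Answer "a : b, c : ?" with the word nearest to x_b - x_a + x_c."""
    target = [
        xb - xa + xc
        for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Assumption: the three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


print(analogy("man", "woman", "king"))
```

Accuracy on a test set would then be the fraction of questions for which `analogy` returns the expected word.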


Efficient Estimation of Word Representations in Vector Space

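Efficient Estimation of Word Representations in Vector Space
$O = E \times T \times Q$
O: computational cost of training
E: number of training epochs
T: size of the dataset (number of words)
Q: depends on the model
NN-based model (NNLM): $Q = N \times D + N \times D \times H + H \times V$
N: number of input words
D: dimension of the projection
H: hidden layer size
V: size of the vocabulary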

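Efficient Estimation of Word Representations in Vector Space
$O = E \times T \times Q$ (O: computational cost of training; E: number of training epochs; T: size of the dataset in words; Q: depends on the model)
RNN-based model (RNNLM): $Q = H \times H + H \times V$
H: hidden layer size, which is also the dimension of the distributed representation
V: size of the vocabulary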

Efficient Estimation of Word Representations in Vector Space
Figure 5: The Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models

word2vec Parameter Learning Explained
$h = Wx$
Figure 6: CBOW model with a single input word

word2vec Parameter Learning Explained
$h = \frac{1}{C} W (x_1 + x_2 + \cdots + x_C)$
Figure 7: CBOW model with multiple input words
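A minimal sketch of the multi-word CBOW hidden layer, assuming the word vectors are the columns of W and the hidden layer h is W applied to the average of the context words' one-hot vectors. The matrix values and the function names are hypothetical.

```python
# Hypothetical weight matrix: 3 hidden units, 4-word vocabulary.
W = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a one-hot vector with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cbow_hidden(W, context_ids):
    """h = (1/C) * W (x_1 + ... + x_C) for C one-hot context vectors."""
    vocab_size = len(W[0])
    summed = [0.0] * vocab_size
    for idx in context_ids:
        for i, value in enumerate(one_hot(idx, vocab_size)):
            summed[i] += value
    return [value / len(context_ids) for value in matvec(W, summed)]


def average_columns(W, context_ids):
    """The same quantity, computed by averaging the selected columns."""
    return [sum(row[idx] for idx in context_ids) / len(context_ids) for row in W]


print(cbow_hidden(W, [0, 2]))
print(average_columns(W, [0, 2]))
```

Both functions give the same vector, which shows that the hidden layer is simply the mean of the context words' vectors.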

word2vec Parameter Learning Explained
The model has the same structure as the single-input CBOW model, but its output is compared against every word in the context to compute the cross-entropy loss.
Figure 8: Skip-gram model
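The skip-gram loss described above can be sketched with a full softmax. The function names, the argument layout (one output vector per vocabulary word), and the example vectors are assumptions made for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_loss(center_vec, output_vecs, context_ids):
    """Sum of cross-entropy losses of one prediction against each context word.

    center_vec:  hidden vector h of the single input (center) word
    output_vecs: one output vector per vocabulary word (hypothetical layout)
    context_ids: vocabulary indices of the words in the context window
    """
    scores = [sum(o * h for o, h in zip(out, center_vec)) for out in output_vecs]
    probs = softmax(scores)
    # The same output distribution is compared with every context word.
    return sum(-math.log(probs[c]) for c in context_ids)


# Made-up 2-dimensional vectors for a 4-word vocabulary.
output_vecs = [[0.3, 0.1], [-0.2, 0.4], [0.0, 0.5], [0.6, -0.1]]
print(skipgram_loss([1.0, 0.5], output_vecs, context_ids=[1, 3]))
```

The forward pass is computed once per center word; only the loss is accumulated over the context positions.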

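word2vec Parameter Learning Explained
Hierarchical Softmax
With the ordinary softmax, the denominator requires computing every entry of the output vector, which costs $H \times V$ operations (hidden-layer size times vocabulary size). Instead, the words are placed at the leaves of a binary tree, the probability of branching left or right at each inner node is modelled with a sigmoid, and the probability of a word is the probability of reaching its leaf. Each branch then costs only $H \times 1$, and a word is reached after $\log_2(V)$ transitions, so $H \times V$ becomes $H \times \log_2(V)$.
$P(\text{"time"} \mid C) = P_{n_0}(\text{right} \mid C)\, P_{n_1}(\text{left} \mid C)\, P_{n_2}(\text{right} \mid C)$
Figure 9: Hierarchical softmax (from http://building-babylon.net/)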
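A minimal sketch of hierarchical softmax on a tiny complete binary tree with four words. The node vectors, the tree layout, and the convention that sigmoid(score) is the probability of going left are all assumptions of this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical vectors for the inner nodes n0 (root), n1 and n2.
node_vectors = {
    "n0": [0.5, -0.3],
    "n1": [-0.2, 0.8],
    "n2": [0.7, 0.1],
}

# Path from the root to each word: (inner node, direction taken there).
paths = {
    "a": [("n0", "left"), ("n1", "left")],
    "b": [("n0", "left"), ("n1", "right")],
    "c": [("n0", "right"), ("n2", "left")],
    "d": [("n0", "right"), ("n2", "right")],
}


def word_probability(word, h):
    """Multiply the branch probabilities along the path to `word`."""
    prob = 1.0
    for node, direction in paths[word]:
        score = sum(n * x for n, x in zip(node_vectors[node], h))
        # Assumed convention: sigmoid(score) is P(left),
        # so P(right) = 1 - sigmoid(score) = sigmoid(-score).
        prob *= sigmoid(score) if direction == "left" else sigmoid(-score)
    return prob


h = [1.0, 2.0]  # made-up hidden vector for the context
total = sum(word_probability(w, h) for w in paths)
print(total)
```

Each word needs only as many dot products as its path is long, and the leaf probabilities still sum to 1 without any normalization over the vocabulary.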

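Efficient Estimation of Word Representations in Vector Space
$O = E \times T \times Q$ (O: computational cost of training; E: number of training epochs; T: size of the dataset in words; Q: depends on the model)
Continuous Bag-of-Words model (CBOW): $Q = C \times D + D \times \log_2(V)$
C: number of input words
D: dimension of the projection, which is also the dimension of the distributed representation
V: size of the vocabulary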

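Efficient Estimation of Word Representations in Vector Space
$O = E \times T \times Q$ (O: computational cost of training; E: number of training epochs; T: size of the dataset in words; Q: depends on the model)
Continuous Skip-gram model (Skip-gram): $Q = C \times (D + D \times \log_2(V))$
C: number of words to predict
D: dimension of the projection, which is also the dimension of the distributed representation
V: size of the vocabulary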
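To get a feel for the per-example cost Q of the four models, the formulas can be evaluated for some sample sizes. The parameter values below are illustrative assumptions, not figures from the talk, and the function names are made up.

```python
import math


def q_nnlm(N, D, H, V):
    """NNLM: Q = N*D + N*D*H + H*V."""
    return N * D + N * D * H + H * V


def q_rnnlm(H, V):
    """RNNLM: Q = H*H + H*V."""
    return H * H + H * V


def q_cbow(C, D, V):
    """CBOW with hierarchical softmax: Q = C*D + D*log2(V)."""
    return C * D + D * math.log2(V)


def q_skipgram(C, D, V):
    """Skip-gram with hierarchical softmax: Q = C*(D + D*log2(V))."""
    return C * (D + D * math.log2(V))


# Illustrative sizes only: 10 context words, 500 dimensions, 2**20 words.
N = C = 10
D = H = 500
V = 2 ** 20

for name, q in [
    ("NNLM", q_nnlm(N, D, H, V)),
    ("RNNLM", q_rnnlm(H, V)),
    ("CBOW", q_cbow(C, D, V)),
    ("Skip-gram", q_skipgram(C, D, V)),
]:
    print(f"{name:10s} Q = {q:,.0f}")
```

With these sizes the $H \times V$ term dominates the NNLM and RNNLM costs, while the two word2vec models avoid it entirely.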

Efficient Estimation of Word Representations in Vector Space
"We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities."
Figure 10: Results for CBOW and Skip-gram

Efficient Estimation of Word Representations in Vector Space
Figure 11: Computation time

Efficient Estimation of Word Representations in Vector Space
"The training speed is significantly higher than reported earlier in this paper, i.e. it is in the order of billions of words per hour for typical hyperparameter choices. We also published more than 1.4 million vectors that represent named entities, trained on more than 100 billion words. Some of our follow-up work will be published in an upcoming NIPS 2013 paper."

Distributed Representations of Words and Phrases and their Compositionality
- Negative Sampling
- Subsampling
- Learning Phrases

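word2vec Parameter Learning Explained
Negative Sampling
Rather than using the softmax at all, training uses Negative Sampling (NEG), an approximation of Noise Contrastive Estimation (NCE). Concretely, the objective is to maximize the score of the correct word while minimizing the scores of $k$ words drawn from the dataset according to a noise distribution $P_n(w)$.
$\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]$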
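A minimal sketch of the negative-sampling objective for one training pair. It assumes the k negative words have already been sampled, so the expectation is replaced by a sum over those samples; the function names and vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_objective(v_in, v_out_pos, v_out_negs):
    """Negative-sampling objective for one (input, output) pair.

    v_in:       input vector of the input word
    v_out_pos:  output vector of the correct word
    v_out_negs: output vectors of k already-sampled negative words
    """
    # Push the score of the correct word up ...
    objective = math.log(sigmoid(dot(v_out_pos, v_in)))
    # ... and the scores of the sampled negative words down.
    objective += sum(math.log(sigmoid(-dot(v_neg, v_in))) for v_neg in v_out_negs)
    return objective


# Made-up vectors: one positive word and k = 2 negatives.
print(neg_objective([1.0, 0.5], [0.8, 0.2], [[-0.3, 0.1], [0.0, -0.6]]))
```

Each update touches only k + 1 output vectors instead of all V, which is where the speed-up over the full softmax comes from.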

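Distributed Representations of Words and Phrases and their Compositionality
Subsampling
Frequent words (stop words such as "in", "the", and "a") carry little information, so before word2vec is trained, each occurrence of a word $w_i$ in the corpus is discarded with probability $P(w_i) = 1 - \sqrt{t / f(w_i)}$, where $f(w_i)$ is the frequency of the word. The threshold $t$ is chosen by hand (around $10^{-5}$ is typical).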
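The subsampling step can be sketched as a preprocessing pass over the corpus. Clamping the discard probability at 0 for words rarer than t, the fixed random seed, and the function names are assumptions of this sketch.

```python
import math
import random
from collections import Counter


def discard_probability(frequency, t=1e-5):
    """P(discard) = 1 - sqrt(t / f) for a word with relative frequency f."""
    # Clamped at 0 (an assumption): words rarer than t are never dropped.
    return max(0.0, 1.0 - math.sqrt(t / frequency))


def subsample(tokens, t=1e-5, rng=None):
    """Return the corpus with frequent words randomly thinned out."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for token in tokens:
        p_discard = discard_probability(counts[token] / total, t)
        if rng.random() >= p_discard:
            kept.append(token)
    return kept


corpus = ("the cat sat on the mat in the hat " * 50).split()
# A large t is used here only because the toy corpus is tiny.
thinned = subsample(corpus, t=0.05)
print(len(corpus), len(thinned))
```

A word whose frequency is four times t is dropped about half the time, while words at or below t always survive.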

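Distributed Representations of Words and Phrases and their Compositionality
Learning Phrases
Using how often each word occurs on its own (unigram) and how often two words occur consecutively (bigram), the following score is computed, and pairs whose score exceeds a threshold are added to the vocabulary as new words. This is repeated for several passes while lowering the threshold.
$\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$
Here $\delta$ is a discounting coefficient that keeps very infrequent word pairs from forming phrases.
Figure 12: Results of learning phrases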
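One pass of the phrase-scoring step can be sketched as below. The toy corpus, the values of the discount and the threshold, and the function names are all made up for illustration; a real run would use far larger counts.

```python
from collections import Counter


def phrase_scores(tokens, delta):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): (n - delta) / (unigrams[w1] * unigrams[w2])
        for (w1, w2), n in bigrams.items()
    }


def find_phrases(tokens, delta, threshold):
    """Return the word pairs whose score exceeds the threshold."""
    scores = phrase_scores(tokens, delta)
    return {pair for pair, score in scores.items() if score > threshold}


tokens = (
    "new york is big . new york is old . "
    "paris is big . rome is old ."
).split()

print(find_phrases(tokens, delta=1, threshold=0.2))
```

Pairs such as ("york", "is") occur just as often as ("new", "york"), but "is" is frequent on its own, so dividing by the unigram counts keeps them below the threshold. Repeating the pass on a corpus where accepted pairs are merged into single tokens allows longer phrases to form.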

Distributed Representations of Words and Phrases and their Compositionality
The results of this paper were released as open source at https://code.google.com/p/word2vec, and that project's name is word2vec.¹
¹ Which is where the title of this talk comes from.

+ α
- How to build the tree for hierarchical softmax (A Scalable Hierarchical Distributed Language Model)
- Poincaré Embeddings (Poincaré Embeddings for Learning Hierarchical Representations)
- doc2vec (Distributed Representations of Sentences and Documents)

References consulted
- The amazing power of word vectors | the morning paper
- Hierarchical Softmax – Building Babylon
- How does sub-sampling of frequent words work in the context of Word2Vec? - Quora
- Approximating the Softmax for Learning Word Embeddings
- A gentle introduction to Doc2Vec – ScaleAbout – Medium
- 異空間への埋め込み！Poincare Embeddingsが拓く表現学習の新展開 - ABEJA Arts Blog
- Neural Network Methods for Natural Language Processing
