Word2vec implementation in gensim

Slide 1

Slide 1 text

Word2vec Implementation @masa_kazama

Slide 2

Slide 2 text

Agenda
● Overview of Word2vec
● Gensim's Word2vec implementation (Python, Cython)
  ○ Negative sampling
  ○ Hierarchical softmax
● Modifying Gensim's Cython code
● References

Slide 4

Slide 4 text

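Word2vec
● A method for representing words as vectors, proposed by Mikolov et al. in 2013
● Builds the vectors on the distributional hypothesis: words that appear in the same contexts are similar
● Supports analogy arithmetic such as king - man + woman
● In recent years it has also been used in recommender systems, in the form of item2vec

These slides do not cover the basics of word2vec in detail. For those, see the following resources:
● word2vec Parameter Learning Explained
● 数式からみるWord2Vec (Word2Vec through its equations)
● Word2Vec のニューラルネットワーク学習過程を理解する (Understanding how the Word2Vec neural network learns)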

Slide 5

Slide 5 text

Word2vec
● Models
  ○ Skip-gram
  ○ Continuous Bag of Words (CBOW)
● Parameter optimization methods
  ○ Negative sampling
  ○ Hierarchical softmax

Slide 6

Slide 6 text

Skip-gram
● A model that predicts the words surrounding the input word

Slide 7

Slide 7 text

CBOW
● A model that predicts the center word from its surrounding words

Slide 8

Slide 8 text

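Objective function and optimization methods
● Differentiating the objective function gives the update rule
  [Equation: objective function, not preserved in this transcript]
● However, that update rule is very expensive to compute, because it sums over every word in the vocabulary
● Methods have therefore been proposed that change or approximate the objective function to reduce the cost
  ○ Negative sampling
  ○ Hierarchical softmax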
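The equation on this slide did not survive in the transcript. As a hedged stand-in, the standard skip-gram objective from Mikolov et al. (2013) is sketched here. The symbols (T for corpus length, c for window size, v and v' for input and output vectors, W for vocabulary size) follow that paper and are assumptions with respect to the slide.

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) =
  \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The denominator sums over all W vocabulary words for every training pair, which is the cost the slide refers to.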

Slide 9

Slide 9 text

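Negative sampling
● Objective function and update rule [equations not preserved in this transcript]
● Speeds up computation by sampling only a small number of negative examples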
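Since the slide's equations are missing, here is a sketch of the negative-sampling objective as given in Mikolov et al. (2013), together with the gradient-ascent updates it implies. The notation is an assumption: k is the number of negatives, P_n(w) the noise distribution, S the set made of the true output word w_O plus the k sampled words, and t_w is 1 for w_O and 0 otherwise.

```latex
J = \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]

v'_{w} \leftarrow v'_{w}
  + \alpha \left( t_w - \sigma\left({v'_{w}}^{\top} v_{w_I}\right) \right) v_{w_I}
  \quad (w \in S),
\qquad
v_{w_I} \leftarrow v_{w_I}
  + \alpha \sum_{w \in S} \left( t_w - \sigma\left({v'_{w}}^{\top} v_{w_I}\right) \right) v'_{w}
```

Only the k + 1 words in S are touched per training pair, instead of the whole vocabulary.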

Slide 10

Slide 10 text

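Hierarchical softmax
● Objective function and update rule [equations not preserved in this transcript]
● Reduces the computational cost by approximating the softmax with a Huffman tree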
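The slide's equation is missing here as well. The following is a sketch of the hierarchical-softmax probability in the spirit of Mikolov et al. (2013), with notation that is an assumption relative to the slide: n(w, j) is the j-th inner node on the path from the root to word w, L(w) is the number of nodes on that path including the leaf, and s(w, j) is +1 or -1 depending on which child the path takes at n(w, j).

```latex
p(w \mid w_I) = \prod_{j=1}^{L(w)-1}
  \sigma\left( s(w, j) \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right),
\qquad s(w, j) \in \{+1, -1\}
```

Because \sigma(x) + \sigma(-x) = 1 at every inner node, the probabilities of all leaves sum to 1, and evaluating one word costs only the length of its path.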

Slide 11

Slide 11 text

How the training data is built
Example: he is a very good man (window size = 2)

Input   Output
he      is
he      a
is      he
is      a
is      very
a       he
a       is
...     ...
man     good

Pairs of input and output words are created from the text.
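The pair construction in the table can be sketched in a few lines of plain Python. This is a minimal illustration with a fixed window, not gensim's code; the function name skipgram_pairs is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (input, output) pairs for each token and every neighbor within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = "he is a very good man".split()
pairs = skipgram_pairs(tokens, window=2)
for pair in pairs[:7]:
    print(pair)
```

The first seven pairs match the rows of the table, and the final pair is (man, good).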

Slide 12

Slide 12 text

Parameters
Each word has two vectors: an input vector and an output vector. In Gensim they are stored in model.wv.syn0 and model.syn1neg. After training, only the input vectors are used for similarity and analogy computations.

Index  Word  Input vector         Output vector
0      he    [0.4, 0.9, …, 0.1]   [0.1, 0.2, …, 0.1]
1      is    [0.2, 0.7, …, 0.2]   [0.8, 0.5, …, 0.4]
2      a     [0.6, 0.6, …, 0.7]   [0.2, 0.1, …, 0.7]
3      very  [0.2, 0.5, …, 0.9]   [0.6, 0.7, …, 0.3]
4      kind  [0.1, 0.4, …, 0.1]   [0.5, 0.8, …, 0.8]
5      man   [0.5, 0.3, …, 0.5]   [0.4, 0.3, …, 0.9]

(For some tasks and datasets, using the sum of the input and output vectors as the word vector after training has been reported to improve performance [Levy 2015].)
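As a rough illustration of how the two matrices are used after training, the sketch below computes a cosine similarity from input vectors alone and from the summed vectors. The 3-dimensional values are made up for the example (the slide elides the middle components), and cosine is a hypothetical helper, not a gensim function.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical 3-dimensional vectors for "he", "is", "a".
input_vectors = np.array([[0.4, 0.9, 0.1],
                          [0.2, 0.7, 0.2],
                          [0.6, 0.6, 0.7]])
output_vectors = np.array([[0.1, 0.2, 0.1],
                           [0.8, 0.5, 0.4],
                           [0.2, 0.1, 0.7]])

# Usual choice: similarity from the input vectors only.
sim_input = cosine(input_vectors[0], input_vectors[1])

# Variant mentioned on the slide: add input and output vectors first.
summed = input_vectors + output_vectors
sim_summed = cosine(summed[0], summed[1])

print(sim_input, sim_summed)
```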

Slide 13

Slide 13 text

Agenda
● Overview of Word2vec
● Gensim's Word2vec implementation (Python, Cython)
  ○ Negative sampling
  ○ Hierarchical softmax
● Modifying Gensim's Cython code
● References

Slide 14

Slide 14 text

How parameters are updated with negative sampling

    l1 = context_vectors[context_index]
    word_indices = [predict_word.index]
    while len(word_indices) < model.negative + 1:
        w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
        if w != predict_word.index:
            word_indices.append(w)
    l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
    prod_term = dot(l1, l2b.T)
    fb = expit(prod_term)  # propagate hidden -> output
    gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
    model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
    neu1e += dot(gb, l2b)  # save error
    l1 += neu1e

gensim/models/word2vec.py train_sg_pair
The code has been partially modified for explanation. The following slides walk through it line by line, from the top.
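To make the excerpt runnable in isolation, here is a self-contained NumPy sketch of one such update step. It is an illustration under assumptions, not gensim's function: the names negative_sampling_step and log_likelihood are hypothetical, the negatives are fixed by hand instead of sampled, and both matrices are randomly initialized so that the example is not degenerate.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_step(syn0, syn1neg, input_index, word_indices, alpha):
    """Update syn0 and syn1neg in place for one (input, output) pair.

    word_indices[0] is the true output word; the rest are negative samples.
    """
    labels = np.zeros(len(word_indices))
    labels[0] = 1.0
    l1 = syn0[input_index].copy()   # input vector before the update
    l2b = syn1neg[word_indices]     # (k+1) x size copy (fancy indexing)
    fb = sigmoid(l2b @ l1)          # predicted probabilities
    gb = (labels - fb) * alpha      # error times learning rate
    syn1neg[word_indices] += np.outer(gb, l1)  # output-vector update
    syn0[input_index] += gb @ l2b              # input-vector update


def log_likelihood(syn0, syn1neg, input_index, word_indices):
    """Negative-sampling objective for one pair (higher is better)."""
    scores = syn1neg[word_indices] @ syn0[input_index]
    signs = -np.ones(len(word_indices))
    signs[0] = 1.0
    return float(np.sum(np.log(sigmoid(signs * scores))))


rng = np.random.default_rng(0)
syn0 = rng.uniform(-0.5, 0.5, size=(6, 8))     # hypothetical vocabulary of 6 words
syn1neg = rng.uniform(-0.5, 0.5, size=(6, 8))
word_indices = [1, 3, 4]                       # "is" plus negatives "very", "kind"
untouched_row = syn1neg[2].copy()

before = log_likelihood(syn0, syn1neg, 0, word_indices)
negative_sampling_step(syn0, syn1neg, 0, word_indices, alpha=0.1)
after = log_likelihood(syn0, syn1neg, 0, word_indices)
print(before, after)
```

Only the input row and the k + 1 selected output rows change; every other row of syn1neg is left as it was.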

Slide 15

Slide 15 text

Example: (input, output) = (he, is)

Input word                          he
Input index (context_index)         0
Input vector (l1)                   [0.4, 0.9, …, 0.1]
Output word                         is
Output index (predict_word.index)   1
Output vector                       [0.8, 0.5, …, 0.4]

    l1 = context_vectors[context_index]
    word_indices = [predict_word.index]

Slide 16

Slide 16 text

Negative sampling

    while len(word_indices) < model.negative + 1:
        w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
        if w != predict_word.index:
            word_indices.append(w)

Using the precomputed cum_table, model.negative words are drawn as negative samples. (As an example, the following slides assume that "very" and "kind" were drawn.)

Reference: how cum_table is built

    def make_cum_table(self, wv, domain=2**31 - 1):
        vocab_size = len(wv.index2word)
        self.cum_table = zeros(vocab_size, dtype=uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            train_words_pow += wv.vocab[wv.index2word[word_index]].count**self.ns_exponent
        cumulative = 0.0
        for word_index in range(vocab_size):
            cumulative += wv.vocab[wv.index2word[word_index]].count**self.ns_exponent
            self.cum_table[word_index] = round(cumulative / train_words_pow * domain)
        if len(self.cum_table) > 0:
            assert self.cum_table[-1] == domain

Word distribution used for negative sampling: each word is drawn with probability proportional to count**α, where α is ns_exponent in the code above.
α = 3/4 is considered a good choice for natural language processing, while a negative α is reported to work well for recommender systems [Caselles-Dupré 2018].
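The cumulative-table trick can be tried outside gensim with a short NumPy sketch. The word counts are hypothetical, and build_cum_table and draw_negatives are illustrative names rather than gensim functions; the sampling line mirrors the searchsorted call shown on the slide.

```python
import numpy as np


def build_cum_table(counts, ns_exponent=0.75, domain=2**31 - 1):
    """Cumulative distribution of count**ns_exponent, scaled to integers in [0, domain]."""
    powered = np.asarray(counts, dtype=np.float64) ** ns_exponent
    cumulative = np.cumsum(powered) / powered.sum()
    table = np.round(cumulative * domain).astype(np.uint32)
    table[-1] = domain  # guard against floating-point round-off in the last entry
    return table


def draw_negatives(cum_table, positive_index, k, rng):
    """Draw k word indices from the table, skipping the positive word."""
    negatives = []
    while len(negatives) < k:
        draw = rng.integers(int(cum_table[-1]))  # uniform integer in [0, domain)
        w = int(cum_table.searchsorted(draw))    # first index whose entry is >= draw
        if w != positive_index:
            negatives.append(w)
    return negatives


counts = [10, 8, 6, 4, 2, 1]  # hypothetical counts for a 6-word vocabulary
cum_table = build_cum_table(counts)
rng = np.random.default_rng(0)
negatives = draw_negatives(cum_table, positive_index=1, k=2, rng=rng)
print(cum_table, negatives)
```

A word with a larger powered count owns a wider slice of the integer range, so a uniform draw followed by a binary search lands on it more often.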

Slide 17

Slide 17 text

Negative sampling

    l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size

l2b:
Word  Output vector        Role
is    [0.8, 0.5, …, 0.4]   predict_word
very  [0.6, 0.7, …, 0.3]   negative sample
kind  [0.5, 0.8, …, 0.8]   negative sample

Slide 18

Slide 18 text

Negative sampling

    prod_term = dot(l1, l2b.T)
    fb = expit(prod_term)  # propagate hidden -> output

l2b:
Word  Output vector
is    [0.8, 0.5, …, 0.4]
very  [0.6, 0.7, …, 0.3]
kind  [0.5, 0.8, …, 0.8]

l1:
Word  Input vector
he    [0.4, 0.9, …, 0.1]

prod_term = [0.9, 0.4, 0.2]
fb = expit(prod_term) = 1 / (1 + exp(-prod_term)) = 1 / (1 + exp(-[0.9, 0.4, 0.2])) = [0.71, 0.59, 0.54]

Slide 19

Slide 19 text

Negative sampling

    gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate

alpha is the learning rate. neg_labels is 1 for predict_word and 0 for every other word.

Word  Output vector        neg_labels
is    [0.8, 0.5, …, 0.4]   1
very  [0.6, 0.7, …, 0.3]   0
kind  [0.5, 0.8, …, 0.8]   0

Reference: how neg_labels is built

    self.neg_labels = []
    if self.negative > 0:
        # precompute negative labels optimization for pure-python training
        self.neg_labels = zeros(self.negative + 1)
        self.neg_labels[0] = 1.

gb = ([1, 0, 0] - [0.71, 0.59, 0.54]) * 0.1 ≈ [0.028, -0.059, -0.054] (with alpha = 0.1; values truncated to the digits shown)
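The arithmetic on these slides can be checked directly. The sketch below starts from the slide's prod_term and alpha = 0.1 and truncates (rather than rounds) the results, which is an assumption about how the slide's digits were produced.

```python
import numpy as np

prod_term = np.array([0.9, 0.4, 0.2])   # dot products from the slide
neg_labels = np.array([1.0, 0.0, 0.0])  # 1 for predict_word, 0 for negatives
alpha = 0.1                             # learning rate used in the example

fb = 1.0 / (1.0 + np.exp(-prod_term))   # sigmoid, same as scipy's expit
gb = (neg_labels - fb) * alpha

fb_truncated = np.floor(fb * 100) / 100    # keep two decimals
gb_truncated = np.trunc(gb * 1000) / 1000  # keep three decimals, toward zero
print(fb_truncated)
print(gb_truncated)
```

The sign of each entry of gb shows the direction of the update: positive for the true word, negative for the sampled ones.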

Slide 20

Slide 20 text

Negative sampling: updating the output vectors
The output vectors of predict_word and of the negatively sampled words are updated.

    model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output

Word  Output vector             outer(gb, l1)
is    [0.8, 0.5, …, 0.4]   +=    0.028 * [0.4, 0.9, …, 0.1]
very  [0.6, 0.7, …, 0.3]   +=   -0.059 * [0.4, 0.9, …, 0.1]
kind  [0.5, 0.8, …, 0.8]   +=   -0.054 * [0.4, 0.9, …, 0.1]

Slide 21

Slide 21 text

Negative sampling: updating the input vector

    neu1e += dot(gb, l2b)  # save error
    l1 += neu1e

gb      Word  Output vector (l2b)
 0.028  is    [0.8, 0.5, …, 0.4]
-0.059  very  [0.6, 0.7, …, 0.3]
-0.054  kind  [0.5, 0.8, …, 0.8]

Input vector (l1) [0.4, 0.9, …, 0.1] += dot(gb, l2b), that is, each output vector weighted by its entry of gb and summed.

Slide 22

Slide 22 text

Negative sampling (Cython)

    for d in range(negative+1):
        if d == 0:
            target_index = word_index
            label = ONEF
        else:
            target_index = bisect_left(cum_table, (next_random >> 16) % cum_table[cum_table_len-1], 0, cum_table_len)
            next_random = (next_random * 25214903917ULL + 11) & modulo
            if target_index == word_index:
                continue
            label = 0.0

        row2 = target_index * size
        # dot product of the input vector and the output vector
        f_dot = our_dot(&size, &syn0[row1], &ONE, &syn1neg[row2], &ONE)
        if f_dot <= -MAX_EXP or f_dot >= MAX_EXP:
            continue
        # sigmoid, read from a precomputed table
        f = EXP_TABLE[((f_dot + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]
        g = (label - f) * alpha
        # temporary accumulator for the input-vector update: work += g * output_vector
        our_saxpy(&size, &g, &syn1neg[row2], &ONE, work, &ONE)
        # output-vector update: output_vector += g * input_vector
        our_saxpy(&size, &g, &syn0[row1], &ONE, &syn1neg[row2], &ONE)

    # input-vector update: input_vector += word_locks[word2_index] * work
    our_saxpy(&size, &word_locks[word2_index], work, &ONE, &syn0[row1], &ONE)

    return next_random

gensim/models/word2vec_inner.pyx
The code has been partially modified for explanation.

Slide 23

Slide 23 text

Negative sampling (Cython)
● The Python version handles all negatively sampled words at once with matrix operations
● The Cython version updates the vectors one negatively sampled word at a time
● Several tricks are used for speed
  ○ The sigmoid function is computed in advance and stored in an array
  ○ Vector operations such as dot products are delegated to fast BLAS routines

    cdef scopy_ptr scopy=PyCObject_AsVoidPtr(fblas.scopy._cpointer)  # y = x
    cdef saxpy_ptr saxpy=PyCObject_AsVoidPtr(fblas.saxpy._cpointer)  # y += alpha * x
    cdef sdot_ptr sdot=PyCObject_AsVoidPtr(fblas.sdot._cpointer)  # float = dot(x, y)
    cdef dsdot_ptr dsdot=PyCObject_AsVoidPtr(fblas.sdot._cpointer)  # double = dot(x, y)
    cdef snrm2_ptr snrm2=PyCObject_AsVoidPtr(fblas.snrm2._cpointer)  # sqrt(x^2)
    cdef sscal_ptr sscal=PyCObject_AsVoidPtr(fblas.sscal._cpointer)  # x = alpha * x
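The precomputed sigmoid can be sketched in plain Python. The constants EXP_TABLE_SIZE = 1000 and MAX_EXP = 6 are assumptions taken from the original word2vec C code and are not stated on the slides; table_sigmoid is a hypothetical helper, and the index clamp is an extra safety measure that the Cython excerpt does not have.

```python
import math

EXP_TABLE_SIZE = 1000  # assumed table resolution
MAX_EXP = 6            # assumed cut-off: only inputs in (-6, 6) are looked up

# Precompute sigmoid on an evenly spaced grid over [-MAX_EXP, MAX_EXP).
EXP_TABLE = []
for i in range(EXP_TABLE_SIZE):
    e = math.exp((i / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)
    EXP_TABLE.append(e / (e + 1))


def table_sigmoid(x):
    """Approximate sigmoid(x) by table lookup.

    Returns None outside (-MAX_EXP, MAX_EXP), where the Cython loop skips the update.
    """
    if x <= -MAX_EXP or x >= MAX_EXP:
        return None
    index = int((x + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))
    return EXP_TABLE[min(index, EXP_TABLE_SIZE - 1)]  # clamp against round-off


for x in (-3.0, 0.0, 0.9, 3.0):
    print(x, table_sigmoid(x), 1.0 / (1.0 + math.exp(-x)))
```

The lookup replaces a call to exp inside the innermost loop with one multiplication and one array read, at the price of a small approximation error.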

Slide 24

Slide 24 text

Agenda
● Overview of Word2vec
● Gensim's Word2vec implementation (Python, Cython)
  ○ Negative sampling
  ○ Hierarchical softmax
● Modifying Gensim's Cython code
● References

Slide 25

Slide 25 text

Hierarchical softmax

    from gensim.models import Word2Vec
    sentences = [["he", "is", "a", "very", "kind", "man"]]
    model = Word2Vec(sentences, min_count=1, seed=1, hs=1)
    for word in model.vocab.keys():
        print("word:", word)
        print("index", model.vocab[word].index)
        print("code", model.vocab[word].code)
        print("point", model.vocab[word].point)
        print("-------------")

Output:

    ('word:', 'a')
    ('index', 0)
    ('code', array([1, 0, 0], dtype=uint8))
    ('point', array([4, 3, 1], dtype=uint32))
    -------------
    ('word:', 'kind')
    ('index', 1)
    ('code', array([1, 0, 1], dtype=uint8))
    ('point', array([4, 3, 1], dtype=uint32))
    -------------
    ('word:', 'very')
    ('index', 2)
    ('code', array([1, 1, 1], dtype=uint8))
    ('point', array([4, 3, 0], dtype=uint32))
    -------------
    ('word:', 'is')
    ('index', 3)
    ('code', array([0, 1], dtype=uint8))
    ('point', array([4, 2], dtype=uint32))
    -------------
    ('word:', 'he')
    ('index', 4)
    ('code', array([0, 0], dtype=uint8))
    ('point', array([4, 2], dtype=uint32))
    -------------
    ('word:', 'man')
    ('index', 5)
    ('code', array([1, 1, 0], dtype=uint8))
    ('point', array([4, 3, 0], dtype=uint32))
    -------------

Gensim stores the hierarchical softmax data structure as index, code, and point.

Slide 26

Slide 26 text

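Hierarchical softmax

[Figure: Huffman tree with root node 4, inner nodes 3, 2, 1, 0, edges labeled 0 or 1, and leaves very, man, kind, a, he, is]

word  index  code     point
a     0      [1,0,0]  [4,3,1]
kind  1      [1,0,1]  [4,3,1]
very  2      [1,1,1]  [4,3,0]
is    3      [0,1]    [4,2]
he    4      [0,0]    [4,2]
man   5      [1,1,0]  [4,3,0]

Point lists the inner nodes passed through on the way from the root to the word.
Code records whether the path went left or right at each of those nodes.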

Slide 27

Slide 27 text

Building the Huffman tree

    def create_binary_tree(self, wv):
        # build the huffman tree
        heap = list(itervalues(wv.vocab))
        heapq.heapify(heap)
        for i in range(len(wv.vocab) - 1):
            min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
            heapq.heappush(
                heap, Vocab(count=min1.count + min2.count, index=i + len(wv.vocab), left=min1, right=min2)
            )

        # recurse over the tree, assigning a binary code to each vocabulary word
        if heap:
            max_depth, stack = 0, [(heap[0], [], [])]
            while stack:
                node, codes, points = stack.pop()
                if node.index < len(wv.vocab):
                    # leaf node => store its path from the root
                    node.code, node.point = codes, points
                    max_depth = max(len(codes), max_depth)
                else:
                    # inner node => continue recursion
                    points = array(list(points) + [node.index - len(wv.vocab)], dtype=uint32)
                    stack.append((node.left, array(list(codes) + [0], dtype=uint8), points))
                    stack.append((node.right, array(list(codes) + [1], dtype=uint8), points))

The first half builds the Huffman tree with a heap, repeatedly merging the two nodes with the smallest counts.
The second half walks the tree and assigns a point and a code to each node.

gensim/models/word2vec.py
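The same construction can be sketched without gensim's Vocab objects. This is an illustrative version with plain tuples and dictionaries; build_huffman is a hypothetical name, and because ties between equal counts are broken by index here, the exact codes may differ from the gensim output shown earlier even though the code lengths agree.

```python
import heapq


def build_huffman(counts):
    """Return {word: (code, point)} for a dict mapping word -> count."""
    words = sorted(counts, key=counts.get, reverse=True)  # index 0 = most frequent
    n = len(words)
    if n == 0:
        return {}
    children = {}  # inner node index -> (left index, right index)
    heap = [(counts[w], i) for i, w in enumerate(words)]
    heapq.heapify(heap)
    for i in range(n - 1):
        count1, left = heapq.heappop(heap)   # two smallest counts
        count2, right = heapq.heappop(heap)
        children[n + i] = (left, right)      # inner nodes are numbered from n upward
        heapq.heappush(heap, (count1 + count2, n + i))

    result = {}
    stack = [(heap[0][1], [], [])]           # start at the root
    while stack:
        node, code, point = stack.pop()
        if node < n:                         # leaf: store its path from the root
            result[words[node]] = (code, point)
        else:                                # inner node: descend into both children
            left, right = children[node]
            path = point + [node - n]
            stack.append((left, code + [0], path))
            stack.append((right, code + [1], path))
    return result


tree = build_huffman({w: 1 for w in ["he", "is", "a", "very", "kind", "man"]})
for word, (code, point) in tree.items():
    print(word, code, point)
```

With six equally frequent words, two words end up with 2-bit codes and four with 3-bit codes, and every path starts at inner node 4, the root.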

Slide 28

Slide 28 text

Parameters of hierarchical softmax

word  vector
a     [0.4, 0.9, …, 0.1]
kind  [0.2, 0.7, …, 0.2]
very  [0.6, 0.6, …, 0.7]
is    [0.2, 0.5, …, 0.9]
he    [0.1, 0.4, …, 0.1]
man   [0.5, 0.3, …, 0.5]

node   vector
node0  [0.8, 0.5, …, 0.4]
node1  [0.2, 0.1, …, 0.7]
node2  [0.6, 0.7, …, 0.3]
node3  [0.5, 0.8, …, 0.8]
node4  [0.4, 0.3, …, 0.9]

The parameters are the input vectors of the words and the vectors of the Huffman tree nodes. (Per-word output vectors do not appear.)

Slide 29

Slide 29 text

How parameters are updated with hierarchical softmax

    l1 = context_vectors[context_index]
    l2a = deepcopy(model.syn1[predict_word.point])  # 2d matrix, codelen x layer1_size
    prod_term = dot(l1, l2a.T)
    fa = expit(prod_term)  # propagate hidden -> output
    ga = (1 - predict_word.code - fa) * alpha  # vector of error gradients multiplied by the learning rate
    model.syn1[predict_word.point] += outer(ga, l1)  # learn hidden -> output
    neu1e += dot(ga, l2a)  # save error
    l1 += neu1e

If l2b in negative sampling is taken to correspond to l2a here, and neg_labels to 1 - predict_word.code, the update is almost identical to the negative-sampling update.
(Note: the average length of predict_word.code equals the average code length of the Huffman tree, which is of the same order as the number of negative samples, so negative sampling and hierarchical softmax have roughly the same computational cost.)
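A self-contained NumPy sketch of this update, using the code and point values from the table on slide 26, is given below. The functions hs_probability and hs_step are hypothetical names, the vectors are random, and the convention that code 0 means "label 1" is read off the (1 - code - fa) term in the excerpt.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# word -> (index, code, point), copied from the table on slide 26
PATHS = {
    "a":    (0, [1, 0, 0], [4, 3, 1]),
    "kind": (1, [1, 0, 1], [4, 3, 1]),
    "very": (2, [1, 1, 1], [4, 3, 0]),
    "is":   (3, [0, 1],    [4, 2]),
    "he":   (4, [0, 0],    [4, 2]),
    "man":  (5, [1, 1, 0], [4, 3, 0]),
}


def hs_probability(l1, syn1, code, point):
    """P(word | input) as a product of binary decisions along the Huffman path."""
    f = sigmoid(syn1[point] @ l1)
    return float(np.prod(np.where(np.asarray(code) == 0, f, 1.0 - f)))


def hs_step(syn0, syn1, input_index, code, point, alpha):
    """One in-place update for a single (input, output) pair."""
    l1 = syn0[input_index].copy()
    l2a = syn1[point]                          # codelen x size copy (fancy indexing)
    fa = sigmoid(l2a @ l1)
    ga = (1 - np.asarray(code) - fa) * alpha   # error times learning rate
    syn1[point] += np.outer(ga, l1)            # node-vector update
    syn0[input_index] += ga @ l2a              # input-vector update


rng = np.random.default_rng(0)
syn0 = rng.uniform(-0.5, 0.5, size=(6, 8))  # one input vector per word
syn1 = rng.uniform(-0.5, 0.5, size=(5, 8))  # one vector per inner node

he_index = PATHS["he"][0]
total = sum(hs_probability(syn0[he_index], syn1, c, p) for _, c, p in PATHS.values())

_, code, point = PATHS["is"]
before = hs_probability(syn0[he_index], syn1, code, point)
hs_step(syn0, syn1, he_index, code, point, alpha=0.1)
after = hs_probability(syn0[he_index], syn1, code, point)
print(total, before, after)
```

The leaf probabilities sum to 1 without any normalization over the vocabulary, and one step on the pair (he, is) raises the probability of "is".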

Slide 30

Slide 30 text

Hierarchical softmax (Cython)

    for b in range(codelen):
        row2 = word_point[b] * size
        # dot product of the input vector and the node vector
        f_dot = our_dot(&size, &syn0[row1], &ONE, &syn1[row2], &ONE)
        if f_dot <= -MAX_EXP or f_dot >= MAX_EXP:
            continue
        # sigmoid, read from a precomputed table
        f = EXP_TABLE[((f_dot + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]
        g = (1 - word_code[b] - f) * alpha
        # temporary accumulator for the input-vector update: work += g * syn1[row2]
        our_saxpy(&size, &g, &syn1[row2], &ONE, work, &ONE)
        # node-vector update: syn1[row2] += g * syn0[row1]
        our_saxpy(&size, &g, &syn0[row1], &ONE, &syn1[row2], &ONE)

    # input-vector update: input_vector += word_locks[word2_index] * work
    our_saxpy(&size, &word_locks[word2_index], work, &ONE, &syn0[row1], &ONE)

Slide 31

Slide 31 text

Agenda
● Overview of Word2vec
● Gensim's Word2vec implementation (Python, Cython)
  ○ Negative sampling
  ○ Hierarchical softmax
● Modifying Gensim's Cython code
● References

Slide 32

Slide 32 text

Modifying Gensim's Cython code
● In gensim's word2vec, specifying a window size of 3 does not mean that 3 words on each side are always used. Instead, an integer from 1 to 3 is drawn at random and that many words on each side are used (so that nearby words are sampled more heavily).
● As an example of a code change, the window is modified to always use the full window size instead of a random one.

Before:

    # precompute "reduced window" offsets in a single randint() call
    for i, item in enumerate(model.random.randint(0, c.window, effective_words)):
        c.reduced_windows[i] = item

After:

    print("fix windowsize")
    # precompute "reduced window" offsets in a single randint() call
    for i, item in enumerate(model.random.randint(0, c.window, effective_words)):
        c.reduced_windows[i] = 0

gensim/models/word2vec_inner.pyx train_batch_sg
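The effect of the "reduced window" can be illustrated in plain Python. This sketch assumes that the effective window at each position is window minus the reduced offset, with the offset drawn from 0 to window - 1 as in the randint call above; context_span and contexts are hypothetical helpers, not gensim functions.

```python
import random


def context_span(position, length, window, reduced):
    """Half-open index range [start, end) of the context around `position`."""
    start = max(0, position - window + reduced)
    end = min(length, position + window + 1 - reduced)
    return start, end


def contexts(tokens, window, fixed, rng):
    """Return (word, context words) for every position in `tokens`."""
    result = []
    for i, word in enumerate(tokens):
        # fixed: always use the full window; otherwise shrink it by 0 .. window-1
        reduced = 0 if fixed else rng.randrange(window)
        start, end = context_span(i, len(tokens), window, reduced)
        result.append((word, [tokens[j] for j in range(start, end) if j != i]))
    return result


tokens = "he is a very kind man".split()
fixed = contexts(tokens, window=3, fixed=True, rng=random.Random(0))
sampled = contexts(tokens, window=3, fixed=False, rng=random.Random(0))
print(fixed[2])
print(sampled[2])
```

With the random offset, a word at distance 1 is always included while a word at distance 3 is included only when the offset is 0, which is how nearby words get more weight.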

Slide 33

Slide 33 text

Modifying Gensim's Cython code
● Steps to change the code, compile it, and install it

    git clone https://github.com/RaRe-Technologies/gensim.git
    cd gensim
    virtualenv gensim_env  # create an environment for gensim
    source gensim_env/bin/activate
    vim gensim/models/word2vec_inner.pyx  # change the code
    cython -2 gensim/models/word2vec_inner.pyx  # compile with Cython
    pip install -e .[test]  # install the modified gensim

Running the following prints "fix windowsize", which confirms that the modified code is in effect.

    from gensim.models import Word2Vec
    sentences = [["he", "is", "a", "very", "kind", "man"]]
    model = Word2Vec(sentences, min_count=1, seed=1, negative=1, sg=1)

Slide 34

Slide 34 text

Agenda
● Overview of Word2vec
● Gensim's Word2vec implementation (Python, Cython)
  ○ Negative sampling
  ○ Hierarchical softmax
● Modifying Gensim's Cython code
● References

Slide 35

Slide 35 text

References
● 数式からみるWord2Vec (Word2Vec through its equations)
● Word2Vec のニューラルネットワーク学習過程を理解する (Understanding how the Word2Vec neural network learns)
● Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013.
● Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. 2013.
● Yoav Goldberg, Omer Levy. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. 2014.
● Hugo Caselles-Dupré, Florian Lesaint, Jimena Royo-Letelier. Word2Vec applied to Recommendation: Hyperparameters Matter. 2018.
● Omer Levy, Yoav Goldberg, Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. 2015.
● Gensim contribution guide
● Gensim developer page