where $x_{ij}$ is the number of times $w_i$ and $w_j$ co-occur in the given corpus, $b_i$ and $\tilde{b}_j$ are bias terms, and $f(x_{ij})$ is a weighting function aimed at reducing the impact of sparse co-occurrence counts. It is easy to see that this objective is equivalent to maximizing the following likelihood function:
\[
P(\mathcal{D}\mid\Omega) \;\propto\; \prod_{\substack{i,j \\ x_{ij}\neq 0}} \mathcal{N}\!\left(\log x_{ij};\, \mu_{ij},\, \sigma^2\right)^{f(x_{ij})}
\]
where $\sigma^2 > 0$ can be chosen arbitrarily, $\mathcal{N}$ denotes the Normal distribution and $\mu_{ij} = w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j$. Furthermore, $\mathcal{D}$ denotes the given corpus and $\Omega$ refers to the set of parameters learned by the word embedding model, i.e. the word vectors $w_i$ and $\tilde{w}_j$ and the bias terms.

The advantage of this probabilistic formulation is that it allows us to introduce priors on the parameters of the model. This strategy was recently used in the WeMAP model (Jameel et al., 2019) to replace the constant variance $\sigma^2$ by a variance $\sigma^2_j$ that depends on the context word. In this paper, however, we will use priors on the parameters of the word embedding model itself. Specifically, we will impose a prior on the context word vectors $\tilde{w}_i$, i.e. we will maximize:
\[
\prod_{\substack{i,j \\ x_{ij}\neq 0}} \mathcal{N}\!\left(\log x_{ij};\, \mu_{ij},\, \sigma^2\right)^{f(x_{ij})} \cdot \prod_i P(\tilde{w}_i)
\]
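The equivalence between the GloVe least-squares objective and the weighted Gaussian likelihood $P(\mathcal{D}\mid\Omega)$ above can be made concrete by evaluating the weighted negative log-likelihood directly. The following is a minimal sketch rather than the authors' implementation; the array names (`X`, `W`, `W_tilde`, `b`, `b_tilde`) and the weighting-function parameters (`x_max`, `alpha`) are illustrative assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x_ij) that damps the influence of sparse co-occurrence counts
    (the standard GloVe form is assumed here for illustration)."""
    return np.minimum(x / x_max, 1.0) ** alpha

def weighted_gaussian_nll(X, W, W_tilde, b, b_tilde, sigma2=1.0):
    """Negative log of  prod_{x_ij != 0} N(log x_ij; mu_ij, sigma2)^f(x_ij)
    with mu_ij = w_i . w~_j + b_i + b~_j."""
    i, j = np.nonzero(X)                                   # only pairs with x_ij != 0
    mu = np.einsum('nd,nd->n', W[i], W_tilde[j]) + b[i] + b_tilde[j]
    log_x = np.log(X[i, j])
    f = glove_weight(X[i, j])
    return np.sum(f * ((log_x - mu) ** 2 / (2.0 * sigma2)
                       + 0.5 * np.log(2.0 * np.pi * sigma2)))

# Toy usage with random data
rng = np.random.default_rng(0)
V, d = 50, 10
X = rng.poisson(0.3, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(weighted_gaussian_nll(X, W, W_tilde, b, b_tilde))
```

For any fixed $\sigma^2$, the $\tfrac{1}{2}\log(2\pi\sigma^2)$ term contributes a constant per nonzero pair and $\tfrac{1}{2\sigma^2}$ is a positive scaling factor, so minimizing this quantity over $\Omega$ yields the same solution as the GloVe weighted least-squares objective.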
Essentially, we want the prior $P(\tilde{w}_i)$ to model the assumption that context word vectors are clustered. To this end, we use a mixture of von Mises-Fisher distributions. To describe this distribution, we begin with the von Mises-Fisher (vMF) distribution (Mardia and Jupp, 2009; Hornik and Grün, 2014), which is a distribution over unit vectors in $\mathbb{R}^d$ that depends on a parameter $\theta \in \mathbb{R}^d$, where $d$ denotes the dimensionality of the word vectors. The vMF density for $x \in \mathbb{S}^d$ (with $\mathbb{S}^d$ the $d$-dimensional unit hypersphere) is given by:
\[
\mathrm{vmf}(x\mid\theta) = \frac{e^{\theta^{\top} x}}{{}_0F_1\!\left(;\, d/2;\, \frac{\|\theta\|^2}{4}\right)}
\]
where the denominator is given by:
\[
{}_0F_1(;\, p;\, q) = \sum_{n=0}^{\infty} \frac{\Gamma(p)}{\Gamma(p+n)}\,\frac{q^n}{n!}
\]
which is known as the confluent hypergeometric function. For $\theta \neq 0$, the mean direction of the distribution is $\theta/\|\theta\|$, while $\|\theta\|$ acts as a concentration parameter. To estimate these parameters from samples, we can rely on the method of Hornik and Grün (2014). A finite mixture of vMF distributions, movMF, is a distribution of the following form:
\[
h(x\mid\Theta) = \sum_{k=1}^{K} \alpha_k\, \mathrm{vmf}(x\mid\theta_k)
\]
where $K$ is the number of mixture components, $\alpha_k \geq 0$ with $\sum_k \alpha_k = 1$, and $\Theta = (\theta_1, \dots, \theta_K, \alpha_1, \dots, \alpha_K)$. The parameters of this distribution can be computed using the Expectation-Maximization (EM) algorithm (Banerjee et al., 2005; Hornik and Grün, 2014).

Note that movMF is a distribution on unit vectors, whereas context word vectors should not be normalized. We therefore define the prior on context word vectors as follows:
\[
P(\tilde{w}) \;\propto\; h\!\left(\frac{\tilde{w}}{\|\tilde{w}\|} \,\Big|\, \Theta\right)
\]
Furthermore, we use L2 regularization to constrain the norm $\|\tilde{w}\|$. We will refer to our model as CvMF.
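The vMF density, the movMF mixture, and the resulting prior on (unnormalized) context word vectors can be written down directly from the formulas above. The sketch below is illustrative only and is not the authors' code: the ${}_0F_1$ series is simply truncated after a fixed number of terms, and all function and variable names are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_0F1(p, q, n_terms=200):
    """log 0F1(; p; q), computed by truncating the series
    sum_n Gamma(p)/Gamma(p+n) * q^n / n!  and using log-sum-exp for stability."""
    n = np.arange(n_terms)
    log_terms = gammaln(p) - gammaln(p + n) + n * np.log(q + 1e-300) - gammaln(n + 1)
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))

def log_vmf(x, theta):
    """log vmf(x | theta) for a unit vector x and theta in R^d."""
    d = theta.shape[0]
    return theta @ x - log_0F1(d / 2.0, (theta @ theta) / 4.0)

def log_movmf(x, thetas, alphas):
    """log h(x | Theta) for a mixture of K vMF components with weights alpha_k."""
    comps = np.array([np.log(a) + log_vmf(x, t) for a, t in zip(alphas, thetas)])
    m = comps.max()
    return m + np.log(np.sum(np.exp(comps - m)))

def log_prior_context_vector(w_tilde, thetas, alphas):
    """Unnormalised log prior log P(w~): the movMF density evaluated at the
    direction w~ / ||w~|| (the norm itself is handled by L2 regularization)."""
    return log_movmf(w_tilde / np.linalg.norm(w_tilde), thetas, alphas)

# Toy usage: two mixture components in 10 dimensions
rng = np.random.default_rng(1)
d, K = 10, 2
thetas = [rng.normal(size=d) * 5.0 for _ in range(K)]   # ||theta_k|| sets the concentration
alphas = [0.5, 0.5]
w_tilde = rng.normal(size=d)
print(log_prior_context_vector(w_tilde, thetas, alphas))
```

In the model itself the movMF parameters are fitted with EM (Banerjee et al., 2005; Hornik and Grün, 2014); the sketch above only evaluates the density.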
In the experiments, following (Jameel et al., 2019), we will also consider a variant of our model in which we use a context-word specific variance $\sigma^2_j$. In that case, we maximize the following:
\[
\prod_{\substack{i,j \\ x_{ij}\neq 0}} \mathcal{N}\!\left(\log x_{ij};\, \mu_{ij},\, \sigma^2_j\right) \cdot \prod_i P(\tilde{w}_i) \cdot \prod_j P(\sigma^2_j)
\]
where $P(\sigma^2_j)$ is modelled as an inverse-gamma distribution (NIG). Note that in this variant we do not use the weighting function $f(x_{ij})$, as this was found to be unnecessary when using a context-word specific variance $\sigma^2_j$ in (Jameel et al., 2019). We will refer to this variant as CvMF(NIG).

Document embedding. The model described above can also be used to learn document embeddings. To this end, the target word vectors are simply replaced by document vectors, and the counts $x_{ij}$ then correspond to the number of times word $j$ occurs in document $i$.
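Returning to the CvMF(NIG) variant described above: it drops the weighting function $f(x_{ij})$ and gives each context word $j$ its own variance $\sigma^2_j$ with an inverse-gamma prior. The sketch below evaluates the corresponding negative log-posterior; it reuses `log_prior_context_vector` from the previous sketch, and the hyperparameters `a0`, `b0` are placeholder assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import gammaln

def inv_gamma_log_pdf(s2, a0=1.0, b0=1.0):
    """log density of an inverse-gamma prior on a variance s2
    (hyperparameters a0, b0 are placeholder choices)."""
    return a0 * np.log(b0) - gammaln(a0) - (a0 + 1.0) * np.log(s2) - b0 / s2

def cvmf_nig_neg_log_posterior(X, W, W_tilde, b, b_tilde, sigma2_j, thetas, alphas):
    """Negative log of
       prod_{x_ij != 0} N(log x_ij; mu_ij, sigma2_j) * prod_i P(w~_i) * prod_j P(sigma2_j):
    no weighting f(x_ij), and the variance depends on the context word j."""
    i, j = np.nonzero(X)
    mu = np.einsum('nd,nd->n', W[i], W_tilde[j]) + b[i] + b_tilde[j]
    log_x = np.log(X[i, j])
    s2 = sigma2_j[j]                                   # variance of each pair's context word
    nll = np.sum((log_x - mu) ** 2 / (2.0 * s2) + 0.5 * np.log(2.0 * np.pi * s2))
    # movMF prior on each context word vector (helper from the previous sketch)
    log_prior_w = sum(log_prior_context_vector(w, thetas, alphas) for w in W_tilde)
    log_prior_s2 = np.sum(inv_gamma_log_pdf(sigma2_j))
    return nll - log_prior_w - log_prior_s2
```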