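Nearest neighbors (target-word header row cropped in the slide):

| GloVe | Our | GloVe | Our | GloVe | Our | GloVe | Our | GloVe |
|---|---|---|---|---|---|---|---|---|
| indian | blue | blue | assailants | assailants | ceding | ceding | winter | winter |
| mumbai | yellow | white | attacker | besiegers | annexation | ceded | autumn | olympics |
| pakistan | white | yellow | townspeople | pursuers | annexing | reaffirmation | spring | autumn |
| pradesh | black | which | insurgents | fortunately | cede | abrogation | year | spring |
| subcontinent | green | called | policemen | looters | expropriation | stipulating | fall | in |
| karnataka | pink | bright | retaliation | attacker | continuance | californios | months | beginning |
| bengal | gray | pink | rioters | accomplices | ceded | renegotiation | in | next |
| bangalore | well | green | terrorists | captors | incorporation | expropriation | also | months |
| asia | the | purple | perpetrators | strongpoints | ironically | zapatistas | time | during |
| delhi | with | black | whereupon | whereupon | dismantling | annexation | beginning | year |

Table 3: Nearest neighbors for selected words.

Prior work that uses external knowledge [Xu+14, Guo+15, Hu+15, Li+16]: embeddings are learned so that words in the same category (e.g. the season category: winter, summer, autumn, spring, …) end up close together.

The reason for the performance improvement is unclear:
- Is the use of external knowledge the reason for the improvement?
- Or is building cluster structure into the model the reason?

SNLP2019 5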

Essentially, we want the prior $P(\tilde{w}_i)$ to model the assumption that context word vectors are clustered. To this end, we use a mixture of von Mises-Fisher distributions. To describe this distribution, we begin with a von Mises-Fisher (vMF) distribution (Mardia and Jupp, 2009; Hornik and Grün, 2014), which is a distribution over unit vectors in $\mathbb{R}^d$ that depends on a parameter $\theta \in \mathbb{R}^d$, where $d$ will denote the dimensionality of the word vectors. The vMF density for $x \in S^d$ (with $S^d$ the $d$-dimensional unit hypersphere) is given by:

$$\mathrm{vmf}(x \mid \theta) = \frac{e^{\theta^{\top} x}}{{}_0F_1\!\left(;\, d/2;\, \frac{\|\theta\|^2}{4}\right)}$$

where the denominator is given by

$${}_0F_1(;\, p;\, q) = \sum_{n=0}^{\infty} \frac{\Gamma(p)}{\Gamma(p+n)} \frac{q^n}{n!}$$

Notes:
- The denominator is only a scaling factor that makes the density integrate to 1 (it can be ignored).
- $\theta / \|\theta\|$ is the mean direction vector: the closer a context vector points to the mean direction, the more likely it is to be generated.
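
To make the two formulas on this slide concrete, here is a minimal sketch in plain Python that evaluates the ${}_0F_1$ series and the vMF log-density exactly as written above. The function names and the fixed number of series terms are assumptions made for illustration; they do not come from the paper or from any library.

```python
import math

def hyp0f1(p, q, terms=200):
    """Truncated series 0F1(; p; q) = sum_n Gamma(p)/Gamma(p+n) * q^n / n!.

    The number of terms is an illustrative choice; it is adequate for
    moderate q but not tuned for very large concentrations.
    """
    total, term = 0.0, 1.0  # the n = 0 term equals 1
    for n in range(terms):
        total += term
        # ratio of consecutive terms: q / ((p + n) * (n + 1))
        term *= q / ((p + n) * (n + 1))
    return total

def vmf_log_numerator(x, theta):
    """theta^T x, the log of the numerator of the vMF density."""
    return sum(t * xi for t, xi in zip(theta, x))

def vmf_log_density(x, theta):
    """log vmf(x | theta) using the normaliser stated on the slide."""
    d = len(theta)
    sq_norm = sum(t * t for t in theta)
    return vmf_log_numerator(x, theta) - math.log(hyp0f1(d / 2, sq_norm / 4))

theta = [2.0, 0.0, 0.0]          # mean direction (1, 0, 0), concentration 2
aligned = [1.0, 0.0, 0.0]        # unit vector along the mean direction
orthogonal = [0.0, 1.0, 0.0]     # unit vector orthogonal to it
gap = vmf_log_density(aligned, theta) - vmf_log_density(orthogonal, theta)
```

For a fixed θ the normaliser cancels when two points are compared, so `gap` equals the difference of the two inner products. This is the sense in which the scaling factor can be ignored for a single vMF component.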

Mixture of von Mises-Fisher: a model that mixes K von Mises-Fisher distributions.

[Figure: K vMF components (k=1, k=2, k=3, …); axes: X, Y, probability]

… Note, however, that we will not need to evaluate this denominator, as it simply acts as a scaling factor. The normalized vector $\theta / \|\theta\|$, for $\theta \neq 0$, is the mean direction of the distribution, while $\|\theta\|$ is known as the concentration parameter. To estimate the parameter $\theta$ from a given set of samples, we can use maximum likelihood (Hornik and Grün, 2014).

A finite mixture of vMFs, which we denote as movMF, is a distribution on the unit hypersphere of the following form ($x \in S^d$):

$$h(x \mid \Theta) = \sum_{k=1}^{K} \lambda_k\, \mathrm{vmf}(x \mid \theta_k)$$

where $K$ is the number of mixture components, $\lambda_k \geq 0$ for each $k$, $\sum_k \lambda_k = 1$, and $\Theta = (\theta_1, \ldots, \theta_K)$. The parameters of this movMF distribution can be computed using the Expectation-Maximization (EM) algorithm (Banerjee et al., 2005; Hornik and Grün, 2014).

Note that movMF is a distribution on unit vectors, whereas context word vectors should not be normalized. We therefore define the prior on context word vectors as follows:

$$P(\tilde{w}) \propto h\!\left(\frac{\tilde{w}}{\|\tilde{w}\|} \,\Big|\, \Theta\right)$$

Furthermore, we use L2 regularization to constrain the norm $\|\tilde{w}\|$. We will refer to our model as CvMF.

Notes:
- $\lambda_k$ is the mixing ratio.
- To actually compute the generation probability of a context vector $\tilde{w}$, we marginalize over the K components.
- Personal impression: might it have been better to generate the mixing ratios $\lambda$ separately for each document?
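
The marginalisation over the K components and the normalisation of the context vector can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the names `movmf_log_density` and `context_log_prior` are hypothetical, the mixture parameters are made-up numbers rather than EM estimates, and each component keeps its own normaliser because it depends on that component's concentration.

```python
import math

def hyp0f1(p, q, terms=200):
    """Truncated series 0F1(; p; q); the term count is an illustrative choice."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= q / ((p + n) * (n + 1))
    return total

def vmf_log_density(x, theta):
    """log vmf(x | theta) with the normaliser given in the vMF definition."""
    d = len(theta)
    sq_norm = sum(t * t for t in theta)
    dot = sum(t * xi for t, xi in zip(theta, x))
    return dot - math.log(hyp0f1(d / 2, sq_norm / 4))

def movmf_log_density(x, weights, thetas):
    """log h(x | Theta): marginalise over components with log-sum-exp."""
    logs = [math.log(w) + vmf_log_density(x, t)
            for w, t in zip(weights, thetas) if w > 0]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

def context_log_prior(w_tilde, weights, thetas):
    """Unnormalised log prior of a context vector: normalise it, then apply h."""
    norm = math.sqrt(sum(v * v for v in w_tilde))
    unit = [v / norm for v in w_tilde]
    return movmf_log_density(unit, weights, thetas)

# Hypothetical two-component mixture in 2 dimensions.
weights = [0.5, 0.5]
thetas = [[3.0, 0.0], [0.0, 3.0]]
near_cluster = context_log_prior([1.0, 0.0], weights, thetas)
far_from_clusters = context_log_prior([-1.0, -1.0], weights, thetas)
```

Because the vector is normalised before the mixture is evaluated, this prior only constrains direction; the length of the context vector is left to the separate L2 regularisation term.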

GloVe loss function (weighted least squares): minimize the squared error between the inner product and the (log) co-occurrence frequency.

The GloVe model (Pennington et al., 2014) learns for each word $w$ a target word vector $w$ and a context word vector $\tilde{w}$ by minimizing the following objective:

$$\sum_{\substack{i,j \\ x_{ij} \neq 0}} f(x_{ij})\,\big(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log x_{ij}\big)^2$$

where $x_{ij}$ is the number of times $w_i$ and $w_j$ co-occur in the given corpus, $b_i$ and $\tilde{b}_j$ are bias terms and $f(x_{ij})$ is a weighting function aimed at reducing the impact of sparse co-occurrence counts.

This is equivalent to a Gaussian distribution plus maximum likelihood estimation: maximize the probability of generating the co-occurrence frequency from the mean (inner product) and the variance.

It is easy to see that this objective is equivalent to maximizing the following likelihood function

$$P(D \mid \Omega) \propto \prod_{\substack{i,j \\ x_{ij} \neq 0}} \mathcal{N}(\log x_{ij};\, \mu_{ij}, \sigma^2)^{f(x_{ij})}$$

where $\sigma^2 > 0$ can be chosen arbitrarily, $\mathcal{N}$ means the Normal distribution and

$$\mu_{ij} = w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j$$

Furthermore, $D$ denotes the given corpus and $\Omega$ refers to the set of parameters learned by the word embedding model, i.e. the word vectors $w_i$ and $\tilde{w}_j$ and the bias terms.
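
The claimed equivalence can be checked numerically: the negative log of the weighted Gaussian likelihood equals the GloVe loss divided by $2\sigma^2$ plus a term that does not depend on the parameters. The sketch below is plain Python on made-up toy data. The weighting function uses the usual GloVe form with `x_max=100` and `alpha=0.75`, which is an assumption here since the slides do not specify $f$; all names and numbers are illustrative.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Assumed GloVe-style weighting f(x); not specified in the slides."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def mean(i, j, W, C, b, c):
    """mu_ij = w_i . w~_j + b_i + b~_j."""
    return sum(u * v for u, v in zip(W[i], C[j])) + b[i] + c[j]

def glove_loss(counts, W, C, b, c):
    """Weighted least-squares objective over pairs with non-zero counts."""
    return sum(weight(x) * (mean(i, j, W, C, b, c) - math.log(x)) ** 2
               for (i, j), x in counts.items() if x != 0)

def neg_log_likelihood(counts, W, C, b, c, sigma2=1.0):
    """-log of prod N(log x_ij; mu_ij, sigma2)^f(x_ij) over non-zero counts."""
    total = 0.0
    for (i, j), x in counts.items():
        if x == 0:
            continue
        resid = math.log(x) - mean(i, j, W, C, b, c)
        log_pdf = -0.5 * math.log(2 * math.pi * sigma2) - resid ** 2 / (2 * sigma2)
        total -= weight(x) * log_pdf
    return total

# Toy co-occurrence counts and two arbitrary parameter settings A and B.
counts = {(0, 0): 5, (0, 1): 120, (1, 0): 0, (1, 1): 2}
A = dict(W=[[0.1, 0.2], [0.3, -0.1]], C=[[0.5, 0.0], [0.2, 0.4]], b=[0.0, 0.1], c=[0.2, 0.0])
B = dict(W=[[1.0, 0.5], [-0.3, 0.7]], C=[[0.1, 0.9], [0.6, 0.2]], b=[0.3, 0.0], c=[0.0, 0.5])
sigma2 = 0.7
delta_loss = glove_loss(counts, **A) - glove_loss(counts, **B)
delta_nll = (neg_log_likelihood(counts, sigma2=sigma2, **A)
             - neg_log_likelihood(counts, sigma2=sigma2, **B))
```

Since `delta_nll` equals `delta_loss / (2 * sigma2)` for any two parameter settings, both objectives rank parameters identically, which is why $\sigma^2$ can be chosen arbitrarily.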

GloVe before the extension: maximize the probability of generating the co-occurrence frequency from the mean (inner product) and the variance (maximum likelihood estimation).

$$P(D \mid \Omega) \propto \prod_{\substack{i,j \\ x_{ij} \neq 0}} \mathcal{N}(\log x_{ij};\, \mu_{ij}, \sigma^2)^{f(x_{ij})}$$

Q: Why rewrite the objective in this form?

GloVe after the extension: maximize (the probability of generating the co-occurrence frequency from the mean (inner product) and the variance) × (the prior probability of generating the context word under the mixture-of-vMF model).

A: Because the likelihood form makes it possible to introduce a prior.

The advantage of this probabilistic formulation is that it allows us to introduce priors on the parameters of the model. This strategy was recently used in the WeMAP model (Jameel et al., 2019) to replace the constant variance $\sigma^2$ by a variance $\sigma_j^2$ that depends on the context word. In this paper, however, we will use priors on the parameters of the word embedding model itself. Specifically, we will impose a prior on the context word vectors $\tilde{w}$, i.e. we will maximize:

$$\prod_{\substack{i,j \\ x_{ij} \neq 0}} \mathcal{N}(\log x_{ij};\, \mu_{ij}, \sigma^2)^{f(x_{ij})} \cdot \prod_{i} P(\tilde{w}_i)$$

In the experiments, following (Jameel et al., 2019), we will also consider a variant of our model in which we use a context-word specific variance $\sigma_j^2$. In that case, we maximize the following:

$$\prod_{\substack{i,j \\ x_{ij} \neq 0}} \mathcal{N}(\log x_{ij};\, \mu_{ij}, \sigma_j^2) \cdot \prod_{i} P(\tilde{w}_i) \cdot \prod_{i} P(\sigma_j^2)$$

where $P(\sigma_j^2)$ is modelled as an inverse-gamma distribution (NIG). Note that in this variant we do not use the weighting function $f(x_{ij})$, as this was found to be unnecessary when using a context-word specific variance $\sigma_j^2$ in (Jameel et al., 2019). We will refer to this variant as CvMF(NIG).

Note: I suspect the formula in the paper contains a typo…

Document embedding. The model described above can also be used to learn document embeddings. To this end, the target word vectors are simply replaced by document vectors and the counts …
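
For the CvMF(NIG) variant, the terms that involve a single context-word variance $\sigma_j^2$ can be written down directly. The sketch below, in plain Python, combines the unweighted Gaussian log-likelihood of the residuals $\log x_{ij} - \mu_{ij}$ with an inverse-gamma log-prior. The shape and scale values `a` and `b`, the function names, and the closed-form maximiser are assumptions for illustration; the slides do not say how the paper sets or optimises these quantities.

```python
import math

def inverse_gamma_log_pdf(s2, a, b):
    """log IG(s2; a, b) = a*log(b) - lgamma(a) - (a + 1)*log(s2) - b/s2."""
    return a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(s2) - b / s2

def gaussian_log_pdf(y, mu, s2):
    """log N(y; mu, s2)."""
    return -0.5 * math.log(2 * math.pi * s2) - (y - mu) ** 2 / (2 * s2)

def context_log_posterior(residuals, s2, a=2.0, b=1.0):
    """Terms of the log objective that involve one context-word variance.

    residuals holds log x_ij - mu_ij for all i paired with a fixed j;
    no weighting f(x_ij) is applied, as in the CvMF(NIG) variant.
    """
    likelihood = sum(gaussian_log_pdf(r, 0.0, s2) for r in residuals)
    return likelihood + inverse_gamma_log_pdf(s2, a, b)

def map_variance(residuals, a=2.0, b=1.0):
    """Value of s2 at which the derivative of context_log_posterior is zero."""
    n = len(residuals)
    return (b + 0.5 * sum(r * r for r in residuals)) / (a + 1 + 0.5 * n)

residuals = [0.5, -1.0, 0.2]      # hypothetical residuals for one context word
best = map_variance(residuals)
```

The prior acts like pseudo-observations: with few residuals the estimate stays near the prior mode `b / (a + 1)`, while with many residuals it approaches their mean square.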

| Models | … | … | … | … | … | … | … |
|---|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … | 0.903 |
| CvMF | 63.22 | 67.41 | 63.21 | 65.94 | 17.46 | 9.380 | 1.100 |
| CvMF(NIG) | 64.14 | 67.55 | 63.55 | 65.95 | 17.49 | 9.410 | 1.210 |

Table 1: Word analogy accuracy results on different datasets.

| Models | MC30 | TR3k | Tr287 | Tr771 | RG65 | Stanf | LEX | Verb143 | WS353 | YP130 | Verb | RW | CA-660 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GloVe | 0.739 | 0.746 | 0.648 | 0.651 | 0.752 | 0.473 | 0.347 | 0.308 | 0.675 | 0.582 | 0.184 | 0.422 | 0.301 |
| SG | 0.741 | 0.742 | 0.651 | 0.653 | 0.757 | 0.470 | 0.356 | 0.289 | 0.662 | 0.565 | 0.195 | 0.470 | 0.206 |
| CBOW | 0.727 | 0.615 | 0.637 | 0.555 | 0.639 | 0.419 | 0.279 | 0.307 | 0.618 | 0.227 | 0.168 | 0.419 | 0.219 |
| WeMAP | 0.769 | 0.752 | 0.657 | 0.659 | 0.779 | 0.472 | 0.361 | 0.303 | 0.684 | 0.593 | 0.196 | 0.480 | 0.301 |
| CvMF | 0.707 | 0.703 | 0.642 | 0.652 | 0.746 | 0.419 | 0.353 | 0.250 | 0.601 | 0.465 | 0.226 | 0.519 | 0.394 |
| CvMF(NIG) | 0.708 | 0.703 | 0.642 | 0.652 | 0.747 | 0.419 | 0.354 | 0.250 | 0.604 | 0.467 | 0.226 | 0.519 | 0.395 |

Table 2: Word similarity results on some benchmark datasets (Spearman's Rho).

… this dataset to 484 records. In most of these datasets, our model does not outperform the baselines, which is to be expected given the conclusion from the analogy task that our model seems spe…

… some high-frequency terms. In these cases we can see that the GloVe model obtains the best results, as e.g. moreover is found as a neighbor of neural for our model, and indeed is found as a neighbor …

Notes:
- The proposed model does not win on most of the datasets.
- (As in the analogy task) is this because the proposed model specializes in syntactic / morphological features?
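
Table 2 reports Spearman's rho between model similarities and human ratings. As a reminder of what that number measures, here is a minimal plain-Python sketch that ranks both lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The toy scores are made up, and the paper's actual evaluation script is not shown in the slides.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical human ratings and model cosine similarities for five word pairs.
human = [9.1, 7.4, 5.0, 3.2, 1.1]
model = [0.81, 0.77, 0.40, 0.45, 0.05]
rho = spearman_rho(human, model)
```

Only the ordering of the scores matters, so a model can score well on this metric even if its cosine values are on a very different scale from the human ratings.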

Notes on Table 2 (word similarity, Spearman's Rho):
- On the datasets that focus on rare words, the proposed model performs well.
- Perhaps clustering the context vectors is playing the role of smoothing?

| … | … | … | … |
|---|---|---|---|
| moreover | circuitry | furthermore | fog |
| cellular | spiking | fog | swirling |
| circuitry | mechanisms | lastly | halos |

Table 5: Nearest neighbors for high-frequency words.

| amazon: Our | amazon: GloVe | apple: Our | apple: GloVe |
|---|---|---|---|
| amazonian | itunes | cherry | iigs |
| forest | kindle | apples | iphone |
| brazil | emusic | peach | macintosh |
| rain | nightlifepartner | pear | itunes |
| green | astore | red | ipad |
| trees | cdbaby | sweet | ipod |
| wildlife | guianas | healthy | ios |
| preserve | likewise | doctor | microsoft |
| water | aforementioned | fruit | garbageband |
| rains | ebay | edible | phone |

Table 6: Nearest neighbors for ambiguous words.

| Models | … | … | … | … |
|---|---|---|---|---|
| GloVe | 0.852 | 0.629 | 0.301 | 0.315 |
| WeMAP | 0.855 | 0.630 | 0.306 | 0.345 |
| SG | 0.853 | 0.631 | 0.304 | 0.341 |
| CBOW | 0.823 | 0.629 | 0.297 | 0.339 |
| CvMF | 0.871 | 0.633 | 0.305 | 0.362 |
| CvMF(NIG) | 0.871 | 0.633 | 0.305 | 0.363 |

Table 7: Document classification results (F1).

… (sHDP), 7) GloVe (Pennington et al., 2014), 8) WeMAP (Jameel et al., 2019), 9) Skipgram (SG) and Continuous Bag-of-Words (Mikolov et al., 2013b) models. In the case of the word embedding models, we create document vectors in the same way as we do for our model, by simply replacing the role of target word vectors with document word vectors. In all the datasets, we removed punctuation a…

sHDP code: https://github.com/Ardavans/sHDP

Notes:
- The proposed model behaves differently from ordinary word vectors. Why?
- Personal view: the result probably depends heavily on how the mixture of vMFs clusters the context words.
- Could it be a coincidence? (The vMF mixture seems prone to getting stuck in local optima.)
- I would like to see the clustering results.