where $x_{ij}$ is the number of times $w_i$ and $w_j$ co-occur in the given corpus, $b_i$ and $\tilde{b}_j$ are bias terms, and $f(x_{ij})$ is a weighting function aimed at reducing the impact of sparse co-occurrence counts. It is easy to see that this objective is equivalent to maximizing the following likelihood function:
\[
P(\mathcal{D}\mid\Omega) \;\propto\; \prod_{\substack{i,j \\ x_{ij}\neq 0}} \mathcal{N}\!\left(\log x_{ij};\, \mu_{ij},\, \sigma^2\right)^{f(x_{ij})}
\]
where $\sigma^2 > 0$ can be chosen arbitrarily, $\mathcal{N}$ denotes the Normal distribution and $\mu_{ij} = w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j$. Furthermore, $\mathcal{D}$ denotes the given corpus and $\Omega$ refers to the set of parameters learned by the word embedding model, i.e. the word vectors $w_i$ and $\tilde{w}_j$ and the bias terms.

The advantage of this probabilistic formulation is that it allows us to introduce priors on the parameters of the model. This strategy was recently used in the WeMAP model (Jameel et al., 2019) to replace the constant variance $\sigma^2$ by a variance $\sigma^2_j$ that depends on the context word. In this paper, however, we will use priors on the parameters of the word embedding model itself. Specifically, we will impose a prior on the context word vectors $\tilde{w}_i$, i.e. we will maximize:
\[
\prod_{\substack{i,j \\ x_{ij}\neq 0}} \mathcal{N}\!\left(\log x_{ij};\, \mu_{ij},\, \sigma^2\right)^{f(x_{ij})} \cdot \prod_i P(\tilde{w}_i)
\]
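The equivalence between the GloVe least-squares objective and the weighted Gaussian likelihood $P(\mathcal{D}\mid\Omega)$ above can be made concrete by evaluating the weighted negative log-likelihood directly. The following is a minimal sketch rather than the authors' implementation; the array names (`X`, `W`, `W_tilde`, `b`, `b_tilde`) and the weighting-function parameters (`x_max`, `alpha`) are illustrative assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x_ij) that damps the influence of sparse co-occurrence counts
    (the standard GloVe form is assumed here for illustration)."""
    return np.minimum(x / x_max, 1.0) ** alpha

def weighted_gaussian_nll(X, W, W_tilde, b, b_tilde, sigma2=1.0):
    """Negative log of  prod_{x_ij != 0} N(log x_ij; mu_ij, sigma2)^f(x_ij)
    with mu_ij = w_i . w~_j + b_i + b~_j."""
    i, j = np.nonzero(X)                                   # only pairs with x_ij != 0
    mu = np.einsum('nd,nd->n', W[i], W_tilde[j]) + b[i] + b_tilde[j]
    log_x = np.log(X[i, j])
    f = glove_weight(X[i, j])
    return np.sum(f * ((log_x - mu) ** 2 / (2.0 * sigma2)
                       + 0.5 * np.log(2.0 * np.pi * sigma2)))

# Toy usage with random data
rng = np.random.default_rng(0)
V, d = 50, 10
X = rng.poisson(0.3, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(weighted_gaussian_nll(X, W, W_tilde, b, b_tilde))
```

For any fixed $\sigma^2$, the $\tfrac{1}{2}\log(2\pi\sigma^2)$ term contributes a constant per nonzero pair and $\tfrac{1}{2\sigma^2}$ is a positive scaling factor, so minimizing this quantity over $\Omega$ yields the same solution as the GloVe weighted least-squares objective.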
Essentially, we want the prior $P(\tilde{w}_i)$ to model the assumption that context word vectors are clustered. To this end, we use a mixture of von Mises-Fisher distributions. To describe this distribution, we begin with the von Mises-Fisher (vMF) distribution (Mardia and Jupp, 2009; Hornik and Grün, 2014), which is a distribution over unit vectors in $\mathbb{R}^d$ that depends on a parameter $\theta \in \mathbb{R}^d$, where $d$ denotes the dimensionality of the word vectors. The vMF density for $x \in \mathbb{S}^d$ (with $\mathbb{S}^d$ the $d$-dimensional unit hypersphere) is given by:
\[
\mathrm{vmf}(x\mid\theta) = \frac{e^{\theta^{\top} x}}{{}_0F_1\!\left(;\, d/2;\, \frac{\|\theta\|^2}{4}\right)}
\]
where the denominator is given by:
\[
{}_0F_1(;\, p;\, q) = \sum_{n=0}^{\infty} \frac{\Gamma(p)}{\Gamma(p+n)}\,\frac{q^n}{n!}
\]
which is known as the confluent hypergeometric function. For $\theta \neq 0$, the mean direction of the distribution is $\theta/\|\theta\|$, while $\|\theta\|$ acts as a concentration parameter. To estimate these parameters from samples, we can rely on the method of Hornik and Grün (2014). A finite mixture of vMF distributions, movMF, is a distribution of the following form:
\[
h(x\mid\Theta) = \sum_{k=1}^{K} \alpha_k\, \mathrm{vmf}(x\mid\theta_k)
\]
where $K$ is the number of mixture components, $\alpha_k \geq 0$ with $\sum_k \alpha_k = 1$, and $\Theta = (\theta_1, \dots, \theta_K, \alpha_1, \dots, \alpha_K)$. The parameters of this distribution can be computed using the Expectation-Maximization (EM) algorithm (Banerjee et al., 2005; Hornik and Grün, 2014).

Note that movMF is a distribution on unit vectors, whereas context word vectors should not be normalized. We therefore define the prior on context word vectors as follows:
\[
P(\tilde{w}) \;\propto\; h\!\left(\frac{\tilde{w}}{\|\tilde{w}\|} \,\Big|\, \Theta\right)
\]
Furthermore, we use L2 regularization to constrain the norm $\|\tilde{w}\|$. We will refer to our model as CvMF.
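The vMF density, the movMF mixture, and the resulting prior on (unnormalized) context word vectors can be written down directly from the formulas above. The sketch below is illustrative only and is not the authors' code: the ${}_0F_1$ series is simply truncated after a fixed number of terms, and all function and variable names are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_0F1(p, q, n_terms=200):
    """log 0F1(; p; q), computed by truncating the series
    sum_n Gamma(p)/Gamma(p+n) * q^n / n!  and using log-sum-exp for stability."""
    n = np.arange(n_terms)
    log_terms = gammaln(p) - gammaln(p + n) + n * np.log(q + 1e-300) - gammaln(n + 1)
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))

def log_vmf(x, theta):
    """log vmf(x | theta) for a unit vector x and theta in R^d."""
    d = theta.shape[0]
    return theta @ x - log_0F1(d / 2.0, (theta @ theta) / 4.0)

def log_movmf(x, thetas, alphas):
    """log h(x | Theta) for a mixture of K vMF components with weights alpha_k."""
    comps = np.array([np.log(a) + log_vmf(x, t) for a, t in zip(alphas, thetas)])
    m = comps.max()
    return m + np.log(np.sum(np.exp(comps - m)))

def log_prior_context_vector(w_tilde, thetas, alphas):
    """Unnormalised log prior log P(w~): the movMF density evaluated at the
    direction w~ / ||w~|| (the norm itself is handled by L2 regularization)."""
    return log_movmf(w_tilde / np.linalg.norm(w_tilde), thetas, alphas)

# Toy usage: two mixture components in 10 dimensions
rng = np.random.default_rng(1)
d, K = 10, 2
thetas = [rng.normal(size=d) * 5.0 for _ in range(K)]   # ||theta_k|| sets the concentration
alphas = [0.5, 0.5]
w_tilde = rng.normal(size=d)
print(log_prior_context_vector(w_tilde, thetas, alphas))
```

In the model itself the movMF parameters are fitted with EM (Banerjee et al., 2005; Hornik and Grün, 2014); the sketch above only evaluates the density.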
In the experiments, following (Jameel et al., 2019), we will also consider a variant of our model in which we use a context-word specific variance $\sigma^2_j$. In that case, we maximize the following:
\[
\prod_{\substack{i,j \\ x_{ij}\neq 0}} \mathcal{N}\!\left(\log x_{ij};\, \mu_{ij},\, \sigma^2_j\right) \cdot \prod_i P(\tilde{w}_i) \cdot \prod_j P(\sigma^2_j)
\]
where $P(\sigma^2_j)$ is modelled as an inverse-gamma distribution (NIG). Note that in this variant we do not use the weighting function $f(x_{ij})$, as this was found to be unnecessary when using a context-word specific variance $\sigma^2_j$ in (Jameel et al., 2019). We will refer to this variant as CvMF(NIG).

Document embedding. The model described above can also be used to learn document embeddings. To this end, the target word vectors are simply replaced by document vectors, and the counts $x_{ij}$ then correspond to the number of times word $j$ occurs in document $i$.
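Returning to the CvMF(NIG) variant described above: it drops the weighting function $f(x_{ij})$ and gives each context word $j$ its own variance $\sigma^2_j$ with an inverse-gamma prior. The sketch below evaluates the corresponding negative log-posterior; it reuses `log_prior_context_vector` from the previous sketch, and the hyperparameters `a0`, `b0` are placeholder assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import gammaln

def inv_gamma_log_pdf(s2, a0=1.0, b0=1.0):
    """log density of an inverse-gamma prior on a variance s2
    (hyperparameters a0, b0 are placeholder choices)."""
    return a0 * np.log(b0) - gammaln(a0) - (a0 + 1.0) * np.log(s2) - b0 / s2

def cvmf_nig_neg_log_posterior(X, W, W_tilde, b, b_tilde, sigma2_j, thetas, alphas):
    """Negative log of
       prod_{x_ij != 0} N(log x_ij; mu_ij, sigma2_j) * prod_i P(w~_i) * prod_j P(sigma2_j):
    no weighting f(x_ij), and the variance depends on the context word j."""
    i, j = np.nonzero(X)
    mu = np.einsum('nd,nd->n', W[i], W_tilde[j]) + b[i] + b_tilde[j]
    log_x = np.log(X[i, j])
    s2 = sigma2_j[j]                                   # variance of each pair's context word
    nll = np.sum((log_x - mu) ** 2 / (2.0 * s2) + 0.5 * np.log(2.0 * np.pi * s2))
    # movMF prior on each context word vector (helper from the previous sketch)
    log_prior_w = sum(log_prior_context_vector(w, thetas, alphas) for w in W_tilde)
    log_prior_s2 = np.sum(inv_gamma_log_pdf(sigma2_j))
    return nll - log_prior_w - log_prior_s2
```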