Major tasks:
• Text classification; classify a news article into categories
• Machine translation; English → French
• Dialogue systems; chatbots, Google Assistant, Siri, Alexa
• Analysis of text data to understand language phenomena
• Describing the meaning of a word with words, like humans do
• Ex. Dictionary:
• Hard for machines
• A definition consists of words themselves
Screenshot from https://dictionary.cambridge.org/dictionary/english/probability
• One element of the vector is 1 and the other elements are 0.
• Ex. probability = [1,0,0,…,0]⊤, Gaussian = [0,1,0,…,0]⊤
• It is equivalent to a discrete representation.
• Ex: probability is 0, Gaussian is 1, …
• Dimensionality will be |𝒱|, over millions.
• The similarity between two different words is always 0.
• Usually, the similarity is cosine similarity: cos(probability, Gaussian) = cos(probability, cat) = 0.
• But we expect vectors such that cos(probability, Gaussian) > cos(probability, cat).
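The orthogonality of one-hot vectors can be checked in a few lines. A minimal sketch; the toy vocabulary and the word indices are made up for illustration:

```python
import numpy as np

# Hypothetical 5-word vocabulary; the indices are arbitrary.
vocab = {"probability": 0, "gaussian": 1, "cat": 2, "dog": 3, "the": 4}

def one_hot(word: str) -> np.ndarray:
    """Return the one-hot vector of `word` over the vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, so their cosine is 0:
print(cosine(one_hot("probability"), one_hot("gaussian")))  # 0.0
print(cosine(one_hot("probability"), one_hot("cat")))       # 0.0
```

No matter which pair of distinct words we pick, the similarity is identically zero, which is exactly the problem the slide points out.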
• Arithmetic of word vectors [Mikolov et al., 2013a]
• Ex. king − man + woman ≈ queen
• Find similar words [Mikolov et al., 2013b]
• Ex. bayesian ≈ bayes, probabilistic, frequentist, …
• Initialisation values of neural nets [Collobert & Weston, 2008]
• Feature vectors for downstream tasks [Wieting et al., 2016]
• Ex. sentence classification, sentence similarity, etc.
• Text analysis
• Ex. trace the meaning of words over years: cell (1990s) vs. cell (1850s) [Hamilton et al., 2016]
1. Count-based models; decompose a matrix based on word counts into a lower-dimensional space.
2. Predictive models; solve a task such as a supervised/unsupervised task, then use the weights as word embeddings.
1. Build a count matrix over word pairs. Ex: how many times two words appear in the same documents.
2. Decompose the matrix into a lower-dimensional space by a matrix factorisation method such as SVD.
• Ex. top-100 columns of the left-singular vectors
• Simple
• Not scalable on a large corpus
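A toy sketch of this count-then-decompose recipe in numpy; the word-document count matrix and the tiny vocabulary are invented for illustration:

```python
import numpy as np

# Toy count matrix: rows = words, columns = documents.
words = ["icml", "neurips", "apple", "banana"]
X = np.array([
    [3, 2, 0, 0],   # icml
    [2, 3, 0, 0],   # neurips
    [0, 0, 4, 2],   # apple
    [0, 0, 2, 4],   # banana
], dtype=float)

# SVD: X = U S V^T. Keep the top-k left-singular vectors (scaled by the
# singular values) as k-dimensional word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * S[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that share documents end up close; unrelated words do not.
assert cosine(word_vecs[0], word_vecs[1]) > cosine(word_vecs[0], word_vecs[2])
```

On a real corpus the matrix has millions of rows and columns, which is why this approach does not scale.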
1. Define a task. Ex: predict a word from the previous two words.
For example, given the sentence ICML is the leading conference, the task is to predict p(the ∣ ICML, is) when t = 3.
Formally, the task is minimising the negative log likelihood

L = − ∑_{t=3}^{T} log p(w_t ∣ w_{t−2}, w_{t−1}).
1. Define a task. Ex: predict a word from the previous two words.
2. Define a model (usually neural networks). Ex: RNN, CNN, or MLP.
Ex: an MLP with a softmax function:

p(x_t ∣ x_{t−1}, x_{t−2}) = softmax[V(u_{x_{t−1}} + u_{x_{t−2}})],

where u ∈ ℝ^d and V ∈ ℝ^{|𝒱|×d}.
1. Define a task. Ex: predict a word from the previous two words.
2. Define a model (usually neural networks). Ex: RNN, CNN, or MLP.
3. Solve the task by optimising the parameters of the model.
We obtain {u_w ∣ w ∈ 𝒱} as word vectors!
1. Define a task. Ex: predict a word from the previous two words.
2. Define a model (usually neural networks). Ex: RNN, CNN, or MLP.
3. Solve the task by optimising the parameters of the model.
• Scalable thanks to gradient-based optimisation
• Recent models are based on these predictive tasks.
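The three steps can be sketched with an untrained MLP in numpy. The vocabulary, dimensions, and parameter scales below are made up; a real model would be trained with gradient-based optimisation rather than left at random initialisation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["icml", "is", "the", "leading", "conference"]
V, d = len(vocab), 8
idx = {w: i for i, w in enumerate(vocab)}

# Learnable parameters: word vectors U (the embedding layer / lookup table)
# and output weights W.
U = rng.normal(scale=0.1, size=(V, d))
W = rng.normal(scale=0.1, size=(V, d))

def predict(prev2: str, prev1: str) -> np.ndarray:
    """p(w_t | w_{t-2}, w_{t-1}) = softmax(W (u_{t-2} + u_{t-1}))."""
    h = U[idx[prev2]] + U[idx[prev1]]
    logits = W @ h
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = predict("icml", "is")
print(p.shape)  # (5,) — a distribution over the vocabulary
```

After training, the rows of U are exactly the word vectors {u_w | w ∈ 𝒱} from step 3.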
• Training data: a sequence of T words, where the vocabulary, a set of words, is 𝒱.
• Goal: finding a map from a word w to ℝ^d (d ≪ |𝒱|, typically 50 ≤ d ≤ 1 000).
• Learnable parameters: U, V ∈ ℝ^{d×|𝒱|}.
• Word vector u_w = U x_w, where x_w is the one-hot vector of word w.
• Each matrix is called an embedding layer or lookup table in neural networks.
• "You shall know a word by the company it keeps" by John R. Firth [J. R. Firth, 1957]
• If words have a similar meaning, they appear in similar contexts.
• Ex. ICML & NeurIPS
• This idea is behind almost all word vector models.
Continuous Bag-of-Words (CBoW) [Mikolov et al., 2013b]
Ex. ICML is the leading conference …: w_t = is, 𝒞_t = {ICML, the, leading}.
CBoW's loss is a negative log likelihood defined by

L = − ∑_{t=1}^{T} log p(w_t ∣ 𝒞_t),

where p(w_t ∣ 𝒞_t) = exp(v_{w_t}^⊤ u_t) / ∑_{i∈𝒱} exp(v_i^⊤ u_t) and u_t = (1/|𝒞_t|) ∑_{w∈𝒞_t} u_w.
Continuous Skip-gram (Skip-gram) [Mikolov et al., 2013b]
Ex. ICML is the leading conference …: w_t = is, 𝒞_t = {ICML, the, leading}.
Skip-gram's loss is a negative log likelihood defined by

L = − ∑_{t=1}^{T} ∑_{w_c∈𝒞_t} log p(w_c ∣ w_t),

where p(w_c ∣ w_t) = exp(v_{w_c}^⊤ u_{w_t}) / ∑_{i∈𝒱} exp(v_i^⊤ u_{w_t}).
• Both losses consist of a softmax over the vocabulary 𝒱:

CBoW: p(w_t ∣ 𝒞_t) = exp(v_{w_t}^⊤ u_t) / ∑_{i∈𝒱} exp(v_i^⊤ u_t)
Skip-gram: p(w_c ∣ w_t) = exp(v_{w_c}^⊤ u_{w_t}) / ∑_{i∈𝒱} exp(v_i^⊤ u_{w_t})

• Ex. On preprocessed Wikipedia, |𝒱| is over a million and the number of words is over 4.5 billion.
• Surrogate loss for learning word vectors only.
• Binary classification: is a word a true word or noise?
Ex. the skip-gram loss

L = − ∑_{t=1}^{T} ∑_{w_c∈𝒞_t} log p(w_c ∣ w_t)

becomes the skip-gram with negative sampling loss

L = − ∑_{t=1}^{T} ∑_{w_c∈𝒞_t} [ log σ(v_{w_c}^⊤ u_{w_t}) (positive) + ∑_{w_n∈𝒩} log σ(−v_{w_n}^⊤ u_{w_t}) (negative) ],

where 𝒩 is a set of negative words sampled from a noise distribution.
• Usually, |𝒩| is 2–20.
• Too many negative samples hurt word vector quality.
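The negative sampling loss for a single (centre, context) pair can be sketched in numpy. The parameter matrices and the indices below are random placeholders, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 100
U = rng.normal(scale=0.1, size=(V, d))  # input ("u") vectors
W = rng.normal(scale=0.1, size=(V, d))  # output ("v") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center: int, context: int, negatives: list) -> float:
    """Negative-sampling loss for one (center, context) pair:
    -[log sigma(v_c^T u_w) + sum_n log sigma(-v_n^T u_w)]."""
    pos = np.log(sigmoid(W[context] @ U[center]))
    neg = sum(np.log(sigmoid(-W[n] @ U[center])) for n in negatives)
    return float(-(pos + neg))

# The negative indices would normally be drawn from the noise distribution.
loss = sgns_loss(center=3, context=7, negatives=[11, 42, 99])
print(loss > 0)  # True: each log sigma term is negative, so the loss is positive
```

Note how the sum over the whole vocabulary has disappeared: the cost per pair is now proportional to |𝒩| instead of |𝒱|.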
NCE aims to estimate a probabilistic model's parameters while avoiding computation of the partition function.
• NCE's objective is similar to the negative sampling loss.
Probabilistic word vectors
• Go beyond a single point in vector space to treat multiple meanings of a word.
• Word as a Gaussian [Vilnis & McCallum, 2015]
• High variance means multiple word meanings: rock (music) vs. rock (stone)
Image from [Athiwaratkun & Wilson, 2017].
Probabilistic word vectors
• Go beyond a single point in vector space to treat multiple meanings of a word.
• Word as a Gaussian mixture [Athiwaratkun & Wilson, 2017]
• Each Gaussian component is related to one meaning of the word.
Image from [Athiwaratkun & Wilson, 2017].
• So far: a deterministic map from a word to a vector/distribution.
• But the meaning of a word is determined by its context (the distributional hypothesis!)
• Ex: Apple reveals new iPhones vs. I eat an apple
• Solution: learn a function that outputs a word vector given its context.
• These methods have become popular in the past few years, especially attention-based models.
1. Train a bidirectional language model on a large corpus.
2. Use a combination of the model's hidden states given a sequence as a word vector.
• Link to the overview of ELMo
Note: similar models' names are borrowed from Sesame Street, such as BERT [Devlin et al., 2019].
Image: the icon of https://twitter.com/elmo (11 June 2019).
• Direct evaluation (intrinsic evaluation)
• Similarity tasks
• Analogical tasks
• Application-oriented evaluation (extrinsic evaluation)
• Measure performance on a downstream task when word vectors are used as feature vectors, e.g., (Super)GLUE [Wang et al., 2019a,b].
• Initialisation values of other NLP models, e.g., neural models for text classification.
• Under some impractical assumptions, skip-gram with negative sampling's word vectors satisfy

v_c^⊤ u_w = log [ p(w, c) / (p(w) p(c)) ] − log |𝒩|,

where the first term is the PMI of w and c.
• Actually, PMI variants are used in the NLP community.
• But the PMI matrix + SVD did not perform like skip-gram with negative sampling, especially on analogical tasks.
Skip-gram with negative sampling is a weighted logistic PCA [Landgraf & Bellay, 2017]
The skip-gram with negative sampling loss can be written as

ℓ = ∑_{w,c} (P(w, c) + |𝒩| P(w) P(c)) (x_{w,c} (u_w^⊤ v_c) − log[1 + exp(u_w^⊤ v_c)]),

where x_{w,c} = P(w, c) / (P(w, c) + |𝒩| P(w) P(c)).

Note: the context word distribution P(c) = n_c / |D| is not the same as in the actual skip-gram with negative sampling.
• B. Athiwaratkun and A. G. Wilson. Multimodal Word Distributions. In ACL, pages 1645–1656, 2017.
• P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching Word Vectors with Subword Information. TACL, 5(1):135–146, 2017.
• R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In ICML, pages 160–167, 2008.
• J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, pages 4171–4186, 2019.
• M. Gutmann and A. Hyvärinen. Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models. In AISTATS, pages 297–304, 2010.
• W. L. Hamilton, J. Leskovec, and D. Jurafsky. Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. In EMNLP, pages 2116–2121, 2016.
• A. J. Landgraf and J. Bellay. word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA. arXiv, 2017.
• O. Levy and Y. Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS, 2014.
• O. Levy, Y. Goldberg, and I. Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL, 3:211–225, 2015.
• T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop, 2013a.
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In NeurIPS, 2013b.
• T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. Advances in Pre-Training Distributed Word Representations. In LREC, pages 52–55, 2018.
• J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In EMNLP, pages 1532–1543, 2014.
• M. E. Peters, M. Neumann, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep Contextualized Word Representations. In NAACL-HLT, pages 2227–2237, 2018.
• T. Schnabel, I. Labutov, D. Mimno, and T. Joachims. Evaluation Methods for Unsupervised Word Embeddings. In EMNLP, pages 298–307, 2015.
• L. Vilnis and A. McCallum. Word Representations via Gaussian Embedding. In ICLR, 2015.
• A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv, 2019a.
• A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR, 2019b.
• J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards Universal Paraphrastic Sentence Embeddings. In ICLR, 2016.
• Solve a binary classification task: classify whether a data point is an observed data point or a noise data point.
• The noise distribution needs to
• have an analytical expression of its PDF/PMF,
• generate samples easily,
• be similar to the observed data distribution in some aspect.
• Ex. covariance structure for image data.
• Ex. unigram/uniform distribution for NLP data.
Let's use the standard normal distribution as our noise distribution!
Word subsampling [Mikolov et al., 2013a]
• Ex. frequencies: the (1,061,396), French (4,813), Paris (1,699)
• For French, Paris is more informative than the as a context word.
• To reduce this imbalance, subsample frequent words from the training data at each iteration:

p_discard(w) = max[0, 1 − (√(t/f(w)) + t/f(w))], where t ∈ (0,1].

• Expected frequencies when t = 10^{-4}: the (43,797), French (4,813), Paris (1,699)
• This technique makes training faster and vector quality better.
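The discard probability can be computed directly from the formula above; a small sketch with made-up relative frequencies (f(w) is the word's frequency as a fraction of the corpus):

```python
import math

def p_discard(freq: float, t: float = 1e-4) -> float:
    """Probability of discarding a word with relative frequency `freq`:
    max(0, 1 - (sqrt(t/f) + t/f))."""
    return max(0.0, 1.0 - (math.sqrt(t / freq) + t / freq))

# Very frequent words (like "the") are discarded most of the time;
# rare words are always kept.
print(p_discard(0.05))   # ≈0.95
print(p_discard(1e-5))   # 0.0
```

Because discarded tokens disappear from the training stream, frequent words also stop dominating other words' context windows, which is where the quality gain comes from.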
• How many surrounding words do the models consider as neighbours? Ex. window size is 2 for skip-gram: ICML is the leading conference …
• Dynamic context window size: each window size is sampled from a uniform distribution where each probability is 1/(window size).
• Intuition: closer words are more important.
The noise distribution is a smoothed unigram distribution

p(w) = freq(w)^α / ∑_{v∈𝒱} freq(v)^α,

with α = 0.75 in word2vec and α = 0.5 in fastText.
• The code caches the noise distribution as a very long integer array.
• So the ratio of elements corresponding to each word is the same as p(w).
• Note: the alias method is a more memory-efficient approach to sampling noise data.
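The cached-array trick can be sketched as follows; `build_noise_table` is a hypothetical helper, and the word frequencies and table size are made up (the real word2vec table has ~10^8 entries):

```python
import numpy as np

def build_noise_table(freqs: dict, alpha: float = 0.75,
                      table_size: int = 1000) -> list:
    """Cache the noise distribution p(w) ∝ freq(w)^alpha as a long array;
    drawing a uniform index then samples words with probability p(w)."""
    words = list(freqs)
    p = np.array([freqs[w] ** alpha for w in words], dtype=float)
    p /= p.sum()
    counts = np.round(p * table_size).astype(int)
    # Each word fills a share of the table proportional to p(w).
    return [w for w, c in zip(words, counts) for _ in range(c)]

table = build_noise_table({"the": 1000, "paris": 10, "french": 50})
print(table.count("the") / len(table))  # ≈0.88: "the" fills most of the table
```

Sampling is then just `table[rng.integers(len(table))]`, i.e. O(1) per negative sample at the cost of memory, which is exactly the trade-off the alias method improves on.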
• Task: predict a word given the preceding words.
• Input sentence: icml is the leading conference
• Ex. p(the ∣ icml, is)
• Formally, p(w_1, …, w_T) = ∏_{t=1}^{T} p(w_t ∣ w_1, …, w_{t−1}).
• Objective: negative log likelihood.
• If p is modelled by neural nets, we can call them neural language models (NLMs).
• Task: distinguish between a true sequence and a noise sequence.
• x_pos: n words
• x_neg: the same as x_pos except that the central word is replaced by a random word.
• Loss: hinge loss, max(0, 1 − f(x_pos) + f(x_neg)), instead of the negative log likelihood.
Figure from [Collobert & Weston, 2008].
• Embedding models learn a map from each word to a single vector.
• These models cannot obtain vectors of
1. rare words; these vectors are rarely updated because their contexts are limited;
2. out-of-vocabulary words; they don't appear in the training sequence because they are new words or were removed by pre-processing.
We can guess the meaning of a word from its subwords. Ex.
1. `***tion` is a postfix for nouns
• Ex: information, estimation, …
2. `post***` is a prefix related to 'after'
• Ex: postfix, posterior, …
fastText [Bojanowski et al., 2017] assigns vectors not only to words but also to subwords.
Logo from https://fasttext.cc/
• All words and subwords in the training data are assigned to a d-dimensional vector.
• A word vector is calculated by averaging over the word vector w and the subword vectors s.
Ex. Calculation of the word vector of where:
1. Convert the word into subwords: where → <wh + whe + her + ere + re>
2. Calculate the word vector: u_where = (w_where + s_<wh + s_whe + s_her + s_ere + s_re>) / 6
Note: "<" and ">" are special characters to distinguish between a word and a subword, e.g. "<her>" vs. "her".
• Bojanowski et al. [2017] recommend subword lengths of 3–6.
• The other parts are the same as word2vec.
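The subword decomposition in step 1 is just character n-gram extraction over the boundary-marked word; a minimal sketch:

```python
def subwords(word: str, nmin: int = 3, nmax: int = 6) -> list:
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

# The 3-gram decomposition of "where" used in the slide's example:
print(subwords("where", nmin=3, nmax=3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because any unseen word still decomposes into known n-grams, this is what lets fastText build vectors for rare and out-of-vocabulary words.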
• Recent count-based models are inspired by predictive models.
• A popular example is GloVe [Pennington et al., 2014], whose loss is defined by

L = ∑_{i,j=1}^{|𝒱|} f(X_ij) (u_i^⊤ v_j + b_i + b̃_j − log X_ij)²,

where
• X_ij: co-occurrence frequency
• f(X_ij): weight function
• b, b̃: bias terms
• GloVe tends to be worse than word2vec/fastText [Levy et al., 2015; Mikolov et al., 2018]
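The weight function f(X_ij) is what makes the loss well-behaved: it zeroes out unseen pairs (where log X_ij would be undefined) and caps the influence of very frequent ones. A sketch using the default constants from the GloVe paper (x_max = 100, α = 3/4):

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe's weighting f(X_ij): (x / x_max)^alpha below the cap, 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(0))     # 0.0 — unseen pairs contribute nothing to the loss
print(glove_weight(10))    # ≈0.178 — rare co-occurrences are down-weighted
print(glove_weight(1000))  # 1.0 — very frequent pairs are capped
```

With these weights, L is a weighted least-squares fit of u_i^⊤ v_j (plus biases) to log X_ij.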