Slide 1

Slide 1 text

Shuntaro Yada PhD Student Methodology Seminar Tech Interest Group 21 Jun 2018 Literature Review: Word Embeddings

Slide 2

Slide 2 text

Agenda 1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171 ‣ Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31 2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921) 3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237) 2 • I will introduce the papers listed on the right • The first is a theoretical overview and review • The last two are examples of cutting-edge methods

Slide 3

Slide 3 text

Distributional models of word meaning Alessandro Lenci Annual Review of Linguistics, 4(1), 151– 171, 2018

Slide 4

Slide 4 text

Overview • Explain what the distributional hypothesis is • Briefly introduce major ways to generate distributional representations of words ‣ I focus on the two most popular ways (count/prediction) and add more material to explain the latter • Summarise the common challenges with distributional representations of words 4 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models • Challenges

Slide 5

Slide 5 text

Distributional Hypothesis • “Lexemes with similar linguistic contexts have similar meanings” (Lenci, 2018: p. 152) • One way to define word meaning 5 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models • Challenges “I found a wonderful restaurant yesterday!” “I found a fantastic restaurant yesterday!” Looks like they have a similar meaning Target word Context words (latter) Context words (former)
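A minimal sketch of the idea behind the hypothesis (the sentences, window size, and function name below are illustrative, not from Lenci's paper): words used in near-identical contexts, such as wonderful and fantastic above, end up with near-identical context counts.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count the words appearing within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

s1 = "i found a wonderful restaurant yesterday".split()
s2 = "i found a fantastic restaurant yesterday".split()
print(context_counts(s1, "wonderful"))  # Counter({'found': 1, 'a': 1, 'restaurant': 1, 'yesterday': 1})
print(context_counts(s2, "fantastic"))  # near-identical counts -> similar meaning under the hypothesis
```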

Slide 6

Slide 6 text

Distributional Hypothesis • Distributional hypothesis (DH) forms the theoretical foundation of distributional semantics (aka vector space semantics) • Lenci (2008) pointed out two levels of DH: ‣ Weak DH: only assumes correlations between semantics and word distributions ‣ Strong DH: also assumes DH is a cognitive hypothesis 6 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models • Challenges

Slide 7

Slide 7 text

Distributed vs Distributional • According to Ferrone and Zanzotto (2017), distributional representations are a subset of distributed representations • Distributed: ways to represent each word by a multi-dimensional vector instead of a symbolic representation (e.g., a one-hot vector) • Distributional: ways to represent each word by a multi-dimensional vector based on the distributional hypothesis 7 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models • Challenges

Slide 8

Slide 8 text

• The two major ways to generate distributional representations: ‣ Count models ‣ Prediction models • The paper gives an introduction to count models (Sec. 3.2) 8 Taxonomy of Methods: The Method of Learning • Overview • Distributional Hypothesis • Taxonomy of Methods ‣ The Method of Learning ‣ The Type of Context • Models • Challenges

Slide 9

Slide 9 text

9 Taxonomy of Methods: The Type of Context • Overview • Distributional Hypothesis • Taxonomy of Methods ‣ The Method of Learning ‣ The Type of Context • Models • Challenges [Annotations on the context-type table:] e.g., my wonderful/fantastic example (the most popular type); taking into account syntactically dependent words only (e.g., predicate-argument structure?); imagine TF-IDF

Slide 10

Slide 10 text

10 Count models • Overview • Distributional Hypothesis • Taxonomy of Methods • Models ‣ Count models ‣ Prediction models • Challenges [Annotations on the figure:] co-occurrence matrix; weight the counts to reflect the importance of the contexts, for example, taking:
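A sketch of a count model's first two steps under illustrative choices: build a word-by-context co-occurrence matrix from a toy corpus and reweight it with PPMI. PPMI is just one common weighting scheme; the slide's elided example ("for example, taking:") may refer to a different one.

```python
import numpy as np
from collections import Counter

corpus = [
    "i found a wonderful restaurant yesterday".split(),
    "i found a fantastic restaurant yesterday".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Raw co-occurrence counts within a +/-2 word window.
counts = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information: max(log(P(w,c) / (P(w)P(c))), 0).
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
print(ppmi.shape)  # one weighted "explicit" vector per vocabulary word
```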

Slide 11

Slide 11 text

Count models • Apply methods that extract latent features from the columns of explicit vectors → implicit vectors • One easy example is to apply dimensionality-reduction techniques like singular value decomposition or principal components analysis • See Table 2 of Lenci (2018) for famous tools ‣ GloVe (Pennington+, 2014), which is based on weighted least-squares regression, is the most popular word representation among count models 11 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models ‣ Count models ‣ Prediction models • Challenges
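To illustrate the explicit-to-implicit step, here is a minimal sketch using truncated SVD; the toy matrix stands in for a (PPMI-weighted) co-occurrence matrix, and the target dimensionality k is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
ppmi = rng.random((500, 400))            # |vocabulary| x |contexts|, toy values
k = 50                                   # target dimensionality of the implicit vectors

U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
word_vectors = U[:, :k] * S[:k]          # each row is a dense, implicit word vector

# Similarity between two words is now cheap to compute on the dense vectors.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vectors[0], word_vectors[1]))
```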

Slide 12

Slide 12 text

Prediction models • Also known as word vectors, word embeddings, and distributed embeddings • Learn word representations with a neural network while the network is learning a language model • The most famous tool of this category is word2vec ‣ The rest of the papers I will explain today are also based on prediction models 12 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models ‣ Count models ‣ Prediction models • Challenges

Slide 13

Slide 13 text

Prediction models • Language model ‣ A probability distribution over sequences of words ‣ It assigns a probability to a given sequence • Neural language model ‣ A language model implemented with a neural network ‣ Given a sequence, it predicts the next word 13 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models ‣ Count models ‣ Prediction models • Challenges
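As a toy illustration of what "a probability distribution over sequences of words" means, the sketch below scores a sequence by multiplying next-word probabilities (the chain rule); the bigram counts are only a stand-in for whatever model, neural or otherwise, supplies those probabilities.

```python
import math
from collections import Counter, defaultdict

corpus = "have a nice day </s> have a nice one </s>".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_prob(prev, nxt):
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

def sequence_log_prob(tokens):
    # Chain rule: log P(t_2..t_N | t_1) = sum of log P(t_k | t_{k-1}) for a bigram model.
    return sum(math.log(next_word_prob(p, n)) for p, n in zip(tokens, tokens[1:]))

print(next_word_prob("nice", "one"))                 # P(one | nice) = 0.5
print(sequence_log_prob("have a nice one".split()))  # log-probability of the sequence
```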

Slide 14

Slide 14 text

14 Prediction models • Overview • Distributional Hypothesis • Taxonomy of Methods • Models ‣ Count models ‣ Prediction models • Challenges [Figure: an example of a feed-forward neural network language model, modified from Tsuboi et al. (2017), Deep Learning for Natural Language Processing (in Japanese). The input “have a nice” passes through the embedding layer (E / Win), the hidden layers (Whid), and the output layer (Wout) to predict the next word “one”; the output layer has the same number of units as there are word types and represents a probability distribution over the next word. The embedding parts become the word vectors during training.]
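A minimal forward pass matching the feed-forward LM sketched above (the names E, W_hid, W_out mirror the figure's E, Whid, Wout; sizes and weights are made up, only the shapes and data flow are the point).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10_000, 100, 200, 3        # vocab size, embedding dim, hidden dim, context length

E = rng.normal(size=(V, d))             # embedding layer: its rows become the word vectors
W_hid = rng.normal(size=(n * d, h))
W_out = rng.normal(size=(h, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

context_ids = np.array([17, 42, 7])     # e.g. ids of "have", "a", "nice" (illustrative)
x = E[context_ids].reshape(-1)          # concatenate the context embeddings
hidden = np.tanh(x @ W_hid)
p_next = softmax(hidden @ W_out)        # probability distribution over all V word types
print(p_next.shape, p_next.sum())       # (10000,) 1.0
```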

Slide 15

Slide 15 text

15 Prediction models • Overview • Distributional Hypothesis • Taxonomy of Methods • Models ‣ Count models ‣ Prediction models • Challenges [Figure: the word2vec architecture, modified from Tsuboi et al. (2017), Deep Learning for Natural Language Processing (in Japanese). The input “have a nice” passes through the embedding layer (E) straight to the output layer to predict “one”: no hidden layers, just taking inner products, but with a bunch of optimisation techniques.]
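A stripped-down view of the CBOW scoring step described above: average the context vectors and take inner products with the output vectors, with no hidden layer. Real word2vec replaces the full softmax with hierarchical softmax or negative sampling, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100
E_in = rng.normal(size=(V, d))          # input (context) embeddings -> the word vectors
E_out = rng.normal(size=(V, d))         # output embeddings

context_ids = np.array([17, 42, 7])     # "have", "a", "nice" (illustrative ids)
h = E_in[context_ids].mean(axis=0)      # no hidden layer, just an average of context vectors
scores = E_out @ h                      # inner product with every output vector
p_next = np.exp(scores - scores.max())
p_next /= p_next.sum()                  # probability of each candidate centre word
print(p_next.argmax())
```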

Slide 16

Slide 16 text

Challenges • Distributional semantic models tend to mix up various types of semantic similarity/relatedness ‣ No distinction among hypernymy, antonymy, meronymy, locative relations and topical relations • How to represent linguistic units larger than words 16 • Overview • Distributional Hypothesis • Taxonomy of Methods • Models • Challenges

Slide 17

Slide 17 text

Socialized Word Embeddings Ziqian Zeng, Yichun Yin, Yangqiu Song, Ming Zhang International Joint Conference on Artificial Intelligence 2017

Slide 18

Slide 18 text

Overview • Add the following two aspects to word embeddings: ‣ Personalisation (user information; not new) ‣ Socialisation (inter-user relationship; new) • Three-fold evaluation: ‣ Perplexity comparison with word2vec ‣ Application to document-level sentiment classification ✦ As the features for SVM (inc. user segmentation) ✦ As the attention source for neural models 18 • Overview • Proposed method • Evaluation • Comments

Slide 19

Slide 19 text

Proposed method: Personalisation • Starting from the continuous bag-of-words (CBOW) model of word2vec (Mikolov et al., 2013) • Consider the context words for a word as user-dependent ‣ Maximise: J_1 = Σ_{i}^{N} Σ_{w_j ∈ W_i} log P(w_j | C(w_j), u_i) ‣ ‘for each user, s/he will think about a predicted word given the global words meanings and customize them to his/her own preference’ 19 • Overview • Proposed method ‣ Personalisation ‣ Socialisation • Evaluation • Comments

Slide 20

Slide 20 text

• Word vectors and user vectors have the same dimensionality • The user-dependent word vector is represented as the sum of the word vector and the user vector: w_j^(i) = w_j + u_i 20 Proposed method: Personalisation • Overview • Proposed method ‣ Personalisation ‣ Socialisation • Evaluation • Comments
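A sketch of the personalisation idea under these two slides' definitions: the user-dependent word vector is w_j^(i) = w_j + u_i, and prediction otherwise follows CBOW. Training details (hierarchical softmax, the exact form of J1) are simplified away, so treat this as an illustration rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, U, d = 10_000, 500, 100
W = rng.normal(size=(V, d))             # shared word vectors w_j
user_vecs = rng.normal(size=(U, d))     # user vectors u_i (same dimensionality as words)

def user_word_vector(word_id, user_id):
    """w_j^(i) = w_j + u_i : the user's personalised view of word j."""
    return W[word_id] + user_vecs[user_id]

def cbow_score(context_ids, target_id, user_id):
    """Unnormalised score of the target word given user-dependent context vectors."""
    h = np.mean([user_word_vector(c, user_id) for c in context_ids], axis=0)
    return h @ W[target_id]

print(cbow_score([17, 42, 7], target_id=99, user_id=3))
```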

Slide 21

Slide 21 text

Proposed method: Personalisation • To deal with learning over a large vocabulary, hierarchical softmax with a Huffman tree built from word frequencies is used • The parameters of the objective (J1), the word vectors, and the user vectors are updated by stochastic gradient descent (SGD) 21 • Overview • Proposed method ‣ Personalisation ‣ Socialisation • Evaluation • Comments

Slide 22

Slide 22 text

Proposed method: Socialisation • Homophily in social networks (Lazarsfeld and Merton, 1954; McPherson et al., 2001) ‣ someone’s friends tend to share similar opinions or topics • The user vectors of friends (neighbours) should be similar ‣ Minimise: ‣ SGD is also applied here to update user vectors 22 • Overview • Proposed method ‣ Personalisation ‣ Socialisation • Evaluation • Comments
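A sketch of the socialisation intuition: pull each user vector towards the vectors of his/her neighbours by minimising squared distances with SGD. The exact regulariser minimised in the paper (and its weighting against J1) may differ from this simplified form.

```python
import numpy as np

rng = np.random.default_rng(0)
U, d = 500, 100
user_vecs = rng.normal(size=(U, d))
friends = {0: [3, 7], 3: [0], 7: [0]}   # toy friendship (neighbour) lists

def social_loss(user_vecs, friends):
    """Sum of squared distances between each user and his/her neighbours."""
    return sum(np.sum((user_vecs[i] - user_vecs[k]) ** 2)
               for i, ks in friends.items() for k in ks)

def sgd_step(user_vecs, friends, lam=0.1, lr=0.01):
    # Gradient of lam * ||u_i - u_k||^2 with respect to u_i is 2 * lam * (u_i - u_k).
    for i, ks in friends.items():
        for k in ks:
            user_vecs[i] -= lr * 2 * lam * (user_vecs[i] - user_vecs[k])
    return user_vecs

print(social_loss(user_vecs, friends))
user_vecs = sgd_step(user_vecs, friends)
print(social_loss(user_vecs, friends))  # smaller after the update
```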

Slide 23

Slide 23 text

• Incorporating socialisation causes user vectors to be updated more frequently than word vectors • Introduce a constraint on the L2 norm of user vectors 23 Proposed method: Socialisation • Overview • Proposed method ‣ Personalisation ‣ Socialisation • Evaluation • Comments
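One simple way to realise such a constraint is to project any user vector whose L2 norm exceeds a radius r back onto the ball of radius r after each update; whether the paper uses exactly this projection is an assumption here.

```python
import numpy as np

def clip_norm(u, r):
    """Project u onto the L2 ball of radius r (no-op if it is already inside)."""
    norm = np.linalg.norm(u)
    return u if norm <= r else u * (r / norm)

u = np.array([3.0, 4.0])          # norm 5
print(clip_norm(u, r=1.0))        # rescaled to norm 1: [0.6, 0.8]
```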

Slide 24

Slide 24 text

24 Dataset • Overview • Proposed method • Evaluation ‣ Dataset ‣ Perplexity ‣ Sentiment Classification • Comments

Slide 25

Slide 25 text

25 Perplexity • 6-gram perplexity • Changing the importance of social regularisation (λ) and the strength of personalisation (r) ‣ From the shape of the curves, they can be tuned for the given dataset
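For reference, perplexity is the exponential of the average negative log-likelihood per token; the probabilities below are made up.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood over the tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.2, 0.1, 0.25, 0.05]))   # lower is better
```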

Slide 26

Slide 26 text

Sentiment Classification • Apply socialised word embeddings to a downstream task ‣ They chose document-level sentiment classification (to predict ratings of Yelp reviews) • Check two aspects: ‣ User segmentation (active users or not) ‣ Applicability as attention vectors in neural models 26 • Overview • Proposed method • Evaluation ‣ Dataset ‣ Perplexity ‣ Sentiment Classification • Comments

Slide 27

Slide 27 text

27 Sentiment Classification: SVM and user segmentation • Overview • Proposed method • Evaluation ‣ Dataset ‣ Perplexity ‣ Sentiment Classification • Comments • Split users by the number of published reviews ‣ The total number of reviews is the same in both segments • Use the average of the word vectors in a document as features for SVM
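A sketch of this SVM setup with random stand-ins for the socialised embeddings and the Yelp ratings: each document becomes the average of its word vectors and a linear SVM is trained on the labels.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
V, d = 10_000, 100
word_vectors = rng.normal(size=(V, d))   # placeholder for the socialised embeddings

def doc_features(word_ids):
    """Represent a document as the average of its word vectors."""
    return word_vectors[word_ids].mean(axis=0)

docs = [rng.integers(0, V, size=rng.integers(5, 50)) for _ in range(200)]
X = np.stack([doc_features(ids) for ids in docs])
y = rng.integers(1, 6, size=len(docs))   # ratings 1-5 (random placeholders)

clf = LinearSVC().fit(X, y)
print(clf.predict(X[:3]))
```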

Slide 28

Slide 28 text

Sentiment Classification: NN models and user attention In the Yelp review rating prediction task, some papers have proposed neural models that apply an (extra) attention mechanism over users 28 • Overview • Proposed method • Evaluation ‣ Dataset ‣ Perplexity ‣ Sentiment Classification • Comments (e.g.) Chen et al., 2016

Slide 29

Slide 29 text

Sentiment Classification: NN models and user attention • How about using socialised word embeddings as “fixed” attention vectors in those models? • Better than models without attention, but slightly worse than the original models (whose attention vectors are trained) 29 • Overview • Proposed method • Evaluation ‣ Dataset ‣ Perplexity ‣ Sentiment Classification • Comments

Slide 30

Slide 30 text

Comments • Socialisation looks like a very interesting/promising idea ‣ Not significant performance gains, though ‣ How to regularise sociality has room for improvement ✦ Assuming that neighbouring users should be similar seems too naive and strong 30 • Overview • Proposed method • Evaluation • Comments

Slide 31

Slide 31 text

Comments • Their source code is available: ‣ https://github.com/HKUST-KnowComp/SocializedWordEmbeddings • They have just published an improved version of socialised word embeddings this year ‣ https://github.com/HKUST-KnowComp/SRBRW 31 • Overview • Proposed method • Evaluation • Comments @inproceedings{zeng2018biased, title={Biased Random Walk based Social Regularization for Word Embeddings}, author={Zeng, Ziqian and Liu, Xin and Song, Yangqiu}, booktitle={IJCAI}, pages={XX-YY}, year={2018}, }

Slide 32

Slide 32 text

Deep contextualized word representations Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer NAACL-HLT 2018

Slide 33

Slide 33 text

Overview • Propose a new type of deep contextualised word representations (ELMo) that model: ‣ Complex characteristics of word use (e.g., syntax and semantics) ‣ How these uses vary across linguistic contexts (i.e., to model polysemy) • Show that ELMo can improve existing neural models in various NLP tasks • Argue that ELMo can capture more abstract linguistic characteristics in the higher layers 33 • Overview • Method • Evaluation • Analysis • Comments

Slide 34

Slide 34 text

Example 34 GloVe mostly learns the sport-related context, while ELMo can distinguish the word sense based on the context

Slide 35

Slide 35 text

Method • Embeddings from Language Models: ELMo • Learn word embeddings through building bidirectional language models (biLMs) ‣ biLMs consist of forward and backward LMs ✦ Forward: p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_1, …, t_{k−1}) ✦ Backward: p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_{k+1}, …, t_N) 35 • Overview • Method • Evaluation • Analysis • Comments
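For completeness, the two LMs are trained jointly; as described in the ELMo paper, the objective sums the log-likelihoods of both directions, sharing the token embedding parameters Θ_x and the softmax parameters Θ_s between them:

```latex
% Joint biLM training objective (forward and backward log-likelihoods summed)
\sum_{k=1}^{N} \Big(
  \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
  + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
\Big)
```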

Slide 36

Slide 36 text

Method With a long short-term memory (LSTM) network, the next word is predicted in both directions to build biLMs 36 [Figure: the forward LM architecture, expanded in the forward direction of k. The input “have a nice” predicts “one”: each token t_k enters the embedding layer as x_k, passes through the hidden LSTM layers h^LM_{k,1} and h^LM_{k,2}, and the output layer o_k gives the next-word prediction.] • Overview • Method • Evaluation • Analysis • Comments

Slide 37

Slide 37 text

Method ELMo represents a word t_k as a linear combination of the corresponding hidden layers (inc. its embedding) 37 [Figure: the forward and backward LMs of the biLM produce hidden states for each token; they are concatenated layer by layer into h^LM_{k,j}, with the embedding layer h^LM_{k,0} = [x_k; x_k], and collapsed into ELMo^task_k = γ^task Σ_j s^task_j h^LM_{k,j}.] Unlike usual word embeddings, ELMo is assigned to every token instead of a type. ELMo is a task-specific representation: a downstream task learns the weighting parameters s^task_j and γ^task
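A small sketch of that collapsing step: softmax-normalise the per-layer weights s_j^task, take the weighted sum of the layers, and scale by γ^task. The layer values here are random; in practice they come from the pretrained biLM.

```python
import numpy as np

rng = np.random.default_rng(0)
L, dim = 2, 1024                        # number of LSTM layers, 2 x LSTM hidden size
layers = rng.normal(size=(L + 1, dim))  # h_{k,0} (embedding) .. h_{k,L} for one token

def elmo_vector(layers, s_logits, gamma):
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                        # softmax-normalised layer weights s_j^task
    return gamma * (s[:, None] * layers).sum(axis=0)

s_logits = np.zeros(L + 1)              # learned by the downstream task; uniform at init
gamma = 1.0                             # learned task-specific scale
print(elmo_vector(layers, s_logits, gamma).shape)   # (1024,)
```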

Slide 38

Slide 38 text

ELMo can be integrated into almost all neural NLP tasks by simply concatenating it to the embedding layer Method 38 • Overview • Method • Evaluation • Analysis • Comments [Figure: biLMs are trained on a corpus and produce an ELMo vector for each token of the usual inputs (“have a nice …”), which enhance those inputs by concatenation.]
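A shape-level sketch of the integration: concatenate the ELMo vector for each token with the model's usual context-independent embedding (e.g., GloVe) before the task-specific layers. The dimensions follow common settings and the vectors are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, glove_dim, elmo_dim = 4, 300, 1024     # tokens of "have a nice ..." plus typical dims

glove = rng.normal(size=(seq_len, glove_dim))   # usual context-independent inputs
elmo = rng.normal(size=(seq_len, elmo_dim))     # contextualised vectors from the biLM

enhanced = np.concatenate([glove, elmo], axis=-1)
print(enhanced.shape)                           # (4, 1324) -> fed to the task model
```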

Slide 39

Slide 39 text

Many linguistic tasks are improved by using ELMo 39 Evaluation • Overview • Method • Evaluation • Analysis • Comments Q&A Textual entailment Sentiment analysis Semantic role labelling Coreference resolution Named entity recognition

Slide 40

Slide 40 text

Analysis The higher layer seemed to learn semantics while the lower layer probably captured syntactic features 40 • Overview • Method • Evaluation • Analysis • Comments Word sense disambiguation PoS tagging

Slide 41

Slide 41 text

Analysis The higher layer seemed to learn semantics while the lower layer probably captured syntactic features??? 41 • Overview • Method • Evaluation • Analysis • Comments Most models preferred the “syntactic (probably)” features, even in sentiment analysis

Slide 42

Slide 42 text

Analysis ELMo-enhanced models can make use of small datasets more efficiently 42 • Overview • Method • Evaluation • Analysis • Comments Textual entailment Semantic role labelling

Slide 43

Slide 43 text

Comments • Pre-trained ELMo models are available at https://allennlp.org/elmo ‣ AllenNLP is a deep NLP library on top of PyTorch ‣ AllenNLP is a product of AI2 (Allen Institute for Artificial Intelligence), which works on other interesting projects like Semantic Scholar • ELMo can process character-level inputs ‣ Japanese (Chinese, Korean, …) ELMo models are thus likely to be possible 43 • Overview • Method • Evaluation • Analysis • Comments
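For reference, a usage sketch with AllenNLP's ELMo module as I recall its 0.x API; the pretrained option/weight file locations are placeholders (see the ELMo page above for the actual URLs), and the argument names should be checked against the installed version.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "<path or URL to the pretrained options .json>"   # placeholder
weight_file = "<path or URL to the pretrained weights .hdf5>"    # placeholder

# num_output_representations controls how many independently weighted ELMo layers you get.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

character_ids = batch_to_ids([["I", "have", "a", "nice", "dog"]])
output = elmo(character_ids)
vectors = output["elmo_representations"][0]     # shape: (batch, tokens, 1024)
```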