
Literature review: word embeddings

1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171
- Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31
2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921)
3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237)


Shuntaro Yada

June 21, 2018

Transcript

  1. Literature Review: Word Embeddings. Shuntaro Yada, PhD Student. Methodology Seminar, Tech Interest Group, 21 Jun 2018
  2. Agenda
     1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171
        ‣ Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31
     2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921)
     3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237)
     • I will introduce the papers listed above
     • The first is a theoretical overview and review
     • The last two are examples of cutting-edge methods
  3. Distributional models of word meaning. Alessandro Lenci. Annual Review of Linguistics, 4(1), 151–171, 2018
  4. Overview
     • Explain what the distributional hypothesis is
     • Briefly introduce the major ways to generate distributional representations of words
       ‣ I focus on the two most popular ways (count and prediction models) and add more material to explain the latter
     • Summarise the common challenges with distributional representations of words
  5. Distributional Hypothesis
     • “Lexemes with similar linguistic contexts have similar meanings” (Lenci, 2018: p. 152)
     • One way to define word meaning
     • Example: in “I found a wonderful restaurant yesterday!” and “I found a fantastic restaurant yesterday!”, the target words wonderful and fantastic share their context words, so they look like they have a similar meaning
  6. Distributional Hypothesis
     • The distributional hypothesis (DH) forms the theoretical foundation of distributional semantics (aka vector space semantics)
     • Lenci (2008) pointed out two levels of the DH:
       ‣ Weak DH: only assumes a correlation between semantics and word distributions
       ‣ Strong DH: additionally assumes the DH is a cognitive hypothesis
  7. Distributed vs Distributional
     • According to Ferrone and Zanzotto (2017), distributed representations subsume distributional representations
     • Distributed: ways to represent each word by a vector with several dimensions instead of a symbolic representation (e.g., a one-hot vector)
     • Distributional: ways to represent each word by such a vector based on the distributional hypothesis
  8. Taxonomy of Methods: The Method of Learning
     • The two major ways to generate distributional representations:
       ‣ Count models
       ‣ Prediction models
     • The paper gives an introduction to count models (Sec. 3.2)
  9. Taxonomy of Methods: The Type of Context
     (Annotations on the slide's table of context types:)
     • Surrounding words within a window: e.g., my wonderful/fantastic example; the most popular type of context
     • Syntactic context: taking into account syntactically dependent words only (e.g., predicate–argument structure?)
     • Document as context: imagine TF-IDF
  10. Count models
     • Build a co-occurrence matrix between target words and their contexts
     • Re-weight the raw counts to reflect the importance of the contexts (the slide walks through an example; a minimal code sketch follows below)
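     A minimal sketch of the count-model starting point: building a word-by-context co-occurrence matrix from a toy corpus. The corpus, window size, and variable names are mine for illustration, not from the paper.

         # Build a word-by-context co-occurrence matrix (count model).
         from collections import Counter, defaultdict

         corpus = [
             "i found a wonderful restaurant yesterday".split(),
             "i found a fantastic restaurant yesterday".split(),
         ]
         window = 2

         cooc = defaultdict(Counter)
         for sent in corpus:
             for i, target in enumerate(sent):
                 lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                 for j in range(lo, hi):
                     if j != i:
                         cooc[target][sent[j]] += 1  # raw co-occurrence count

         vocab = sorted({w for s in corpus for w in s})
         matrix = [[cooc[t][c] for c in vocab] for t in vocab]  # explicit count vectors
         print(vocab)
         print(matrix[vocab.index("wonderful")])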
  11. Count models
     • Use some method to obtain latent features among the columns of the explicit vectors → implicit vectors
     • One easy example is to apply dimensionality-reduction techniques such as singular value decomposition or principal component analysis (see the sketch below)
     • See Table 2 of Lenci (2018) for well-known tools
       ‣ GloVe (Pennington+, 2014), which is based on weighted least-squares regression, is the most popular word representation among count models
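     A minimal sketch of turning the explicit count vectors into implicit (dense) vectors with truncated SVD; it assumes the `matrix` and `vocab` variables from the previous sketch and is illustrative only.

         # Reduce explicit count vectors to low-dimensional implicit vectors via SVD.
         import numpy as np

         X = np.array(matrix, dtype=float)
         U, S, Vt = np.linalg.svd(X, full_matrices=False)

         k = 2                                # number of latent dimensions (illustrative)
         embeddings = U[:, :k] * S[:k]        # low-rank word vectors

         def cosine(a, b):
             return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

         w, f = vocab.index("wonderful"), vocab.index("fantastic")
         print(cosine(embeddings[w], embeddings[f]))  # similar contexts -> high similarity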
  12. Prediction models
     • Also known as word vectors, word embeddings, and distributed embeddings
     • Learn word representations with a neural network while it is trained as a language model
     • The most famous tool in this category is word2vec
       ‣ The remaining papers I will explain today are also based on prediction models
  13. Prediction models
     • Language model
       ‣ A probability distribution over sequences of words
       ‣ It assigns a probability to a given sequence
     • Neural language model
       ‣ A language model implemented with a neural network
       ‣ Given a sequence, it predicts the next word
  14. Prediction models
     • An example of a feed-forward neural network language model (figure: embedding layer E, hidden layers, output layer, with weights labelled W_in, W_hid, W_out, predicting the next word in “have a nice ___”; a toy forward pass is sketched below)
       ‣ The output layer has the same number of units as there are word types and represents a probability distribution over the next word
       ‣ The embedding-layer parts become word vectors during training
     • Figure modified from Tsuboi+ (2017), Deep Learning for Natural Language Processing (in Japanese)
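     A toy forward pass of a feed-forward neural LM, mirroring the slide's E / W_hid / W_out labels. Weights are random and sizes are illustrative; this only shows the shapes and the final softmax over word types.

         # Toy forward pass of a feed-forward neural LM (shapes only, weights random).
         import numpy as np

         rng = np.random.default_rng(0)
         V, d, h, ctx = 10_000, 100, 200, 3       # vocab size, emb dim, hidden dim, context length

         E     = rng.normal(size=(V, d))           # embedding layer -> rows become word vectors
         W_hid = rng.normal(size=(ctx * d, h))
         W_out = rng.normal(size=(h, V))           # output layer: one unit per word type

         def next_word_distribution(context_ids):
             x = E[context_ids].reshape(-1)        # concatenate context embeddings
             hidden = np.tanh(x @ W_hid)
             logits = hidden @ W_out
             p = np.exp(logits - logits.max())
             return p / p.sum()                    # probability distribution over the next word

         p = next_word_distribution([12, 345, 678])   # e.g. "have a nice" -> P(next word)
         print(p.shape, p.sum())                      # (10000,) 1.0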
  15. Prediction models
     • word2vec (figure: embedding layer and output layer only, predicting “one” from “have a nice ___”)
       ‣ No hidden layers: it just takes inner products, but with a bunch of optimisation techniques on top (see the scoring sketch below)
     • Figure modified from Tsuboi+ (2017), Deep Learning for Natural Language Processing (in Japanese)
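     A sketch of the word2vec-style CBOW scoring step, showing the “no hidden layer, just inner products” point. Names and sizes are mine; real word2vec adds negative sampling or hierarchical softmax on top of this score.

         # CBOW scoring: inner product between averaged context vectors and an output vector.
         import numpy as np

         rng = np.random.default_rng(0)
         V, d = 10_000, 100
         W_in, W_out = rng.normal(size=(V, d)), rng.normal(size=(V, d))

         def cbow_score(context_ids, candidate_id):
             h = W_in[context_ids].mean(axis=0)       # average of context word vectors
             return float(W_out[candidate_id] @ h)    # inner product = unnormalised score

         print(cbow_score([12, 345, 678, 901], 42))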
  16. Challenges
     • Distributional semantic models tend to mix up various types of semantic similarity/relatedness
       ‣ No distinction among hypernymy, antonymy, meronymy, locative relations, and topical relations
     • How to represent linguistic units larger than words remains open
  17. Socialized Word Embeddings. Ziqian Zeng, Yichun Yin, Yangqiu Song, Ming Zhang. International Joint Conference on Artificial Intelligence 2017
  18. Overview
     • Add the following two aspects to word embeddings:
       ‣ Personalisation (user information; not new)
       ‣ Socialisation (inter-user relationships; new)
     • Three-fold evaluation:
       ‣ Perplexity comparison with word2vec
       ‣ Application to document-level sentiment classification
         ✦ As features for an SVM (including user segmentation)
         ✦ As the attention source for neural models
  19. Proposed method: Personalisation
     • Start from the continuous bag-of-words (CBOW) model of word2vec (Mikolov et al., 2013)
     • Consider the context words for a word as user-dependent
       ‣ Maximise: J_1 = \sum_{i}^{N} \sum_{w_j \in W_i} \log P(w_j \mid C(w_j), u_i)
       ‣ ‘for each user, s/he will think about a predicted word given the global words meanings and customize them to his/her own preference’
  20. Proposed method: Personalisation
     • Word vectors and user vectors have the same dimensionality
     • The user-dependent word vector is represented as the sum of the word vector and the user vector: w_j^{(i)} = w_j + u_i (see the sketch below)
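     A minimal sketch of the personalised word vector described on the slide: the user-dependent vector is simply the global word vector plus the user's vector. Sizes and values are illustrative.

         # Personalised (user-dependent) word vector: w_j^(i) = w_j + u_i
         import numpy as np

         rng = np.random.default_rng(0)
         V, U, d = 10_000, 500, 100                 # word types, users, shared dimensionality
         word_vecs = rng.normal(size=(V, d))
         user_vecs = rng.normal(size=(U, d))

         def personalised_vector(word_id, user_id):
             return word_vecs[word_id] + user_vecs[user_id]

         print(personalised_vector(42, 7).shape)    # (100,)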
  21. Proposed method: Personalisation
     • To cope with the large vocabulary, hierarchical softmax with a Huffman tree built from word frequencies is used (see the sketch below)
     • The parameters of the objective (J_1), the word vectors, and the user vectors are updated by stochastic gradient descent (SGD)
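     A sketch of the hierarchical-softmax idea: the probability of a word is a product of binary (sigmoid) decisions along its Huffman path. The path nodes and codes are hardcoded here purely for illustration; real implementations build the tree from word frequencies.

         # Hierarchical softmax: P(word | h) as a product of sigmoid decisions.
         import numpy as np

         rng = np.random.default_rng(0)
         d = 100
         inner_node_vecs = rng.normal(size=(50, d))      # one vector per inner tree node

         def sigmoid(x):
             return 1.0 / (1.0 + np.exp(-x))

         def hs_probability(h, path_nodes, codes):
             """h: context (+ user) vector; codes[i] is the branch taken at path_nodes[i]."""
             p = 1.0
             for node, code in zip(path_nodes, codes):
                 score = sigmoid(inner_node_vecs[node] @ h)
                 p *= score if code == 1 else (1.0 - score)
             return p

         h = rng.normal(size=d)
         print(hs_probability(h, path_nodes=[0, 3, 11], codes=[1, 0, 1]))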
  22. Proposed method: Socialisation
     • Homophily in social networks (Lazarsfeld and Merton, 1954; McPherson et al., 2001)
       ‣ Someone's friends tend to share similar opinions or topics
     • User vectors of friends (neighbours) should therefore be similar
       ‣ Minimise a regularisation term that penalises the distance between a user's vector and those of his/her friends
       ‣ SGD is also applied here to update the user vectors
  23. Proposed method: Socialisation
     • Incorporating socialisation makes the user vectors update more frequently than the word vectors
     • A constraint on the user vectors' L2-norm is therefore introduced (a sketch of the combined update follows below)
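     A hedged sketch of the socialisation idea as I read it: an SGD-style step that pulls a user's vector towards his/her friends' vectors, followed by clipping the L2-norm. The exact regulariser, update order, and constant values are my assumptions for illustration, not the authors' code.

         # Social regularisation step (assumed form) plus L2-norm constraint.
         import numpy as np

         rng = np.random.default_rng(0)
         U, d = 500, 100
         user_vecs = rng.normal(size=(U, d))
         friends = {7: [3, 19, 42]}                     # toy friendship (adjacency) list

         lam, lr, r = 0.1, 0.05, 1.0                    # reg. weight, learning rate, norm bound

         def social_update(u_id):
             grad = np.zeros(d)
             for f_id in friends.get(u_id, []):
                 grad += lam * (user_vecs[u_id] - user_vecs[f_id])   # pull towards friends
             user_vecs[u_id] -= lr * grad
             norm = np.linalg.norm(user_vecs[u_id])
             if norm > r:                                # project back onto the L2 ball
                 user_vecs[u_id] *= r / norm

         social_update(7)
         print(np.linalg.norm(user_vecs[7]) <= r + 1e-9)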
  24. Dataset
  25. Perplexity
     • 6-gram perplexity
     • Varying the weight of the social regularisation (λ) and the strength of personalisation (r)
       ‣ From the shape of the curves, both can be tuned for a given dataset
  26. Sentiment Classification
     • Apply the socialised word embeddings to a downstream task
       ‣ They chose document-level sentiment classification (predicting the ratings of Yelp reviews)
     • Check two aspects:
       ‣ User segmentation (active users or not)
       ‣ Applicability as attention vectors in neural models
  27. Sentiment Classification: SVM and user segmentation
     • Split users by the number of published reviews
       ‣ The total number of reviews is the same in both segments
     • Use the average of the word vectors in a document as features for an SVM (see the sketch below)
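     A sketch of the SVM setup described above: each document is represented by the average of its word vectors and fed to a linear SVM. The data here is random; in the paper the features come from the learned socialised embeddings and the labels are Yelp ratings.

         # Averaged word vectors as document features for a linear SVM.
         import numpy as np
         from sklearn.svm import LinearSVC

         rng = np.random.default_rng(0)
         V, d = 10_000, 100
         word_vecs = rng.normal(size=(V, d))

         def doc_features(token_ids):
             return word_vecs[token_ids].mean(axis=0)        # average word vector

         docs = [rng.integers(0, V, size=30) for _ in range(200)]   # fake documents
         X = np.stack([doc_features(doc) for doc in docs])
         y = rng.integers(1, 6, size=200)                            # fake 1-5 star ratings

         clf = LinearSVC().fit(X, y)
         print(clf.predict(X[:3]))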
  28. Sentiment Classification: NN models and user attention
     • For the Yelp review rating prediction task, some papers proposed neural models that apply an (extra) attention mechanism over users (e.g., Chen et al., 2016)
  29. Sentiment Classification: NN models and user attention
     • How about using the socialised word embeddings as “fixed” attention vectors in those models?
     • Better than no attention, but slightly worse than the original models (with trained attention vectors)
  30. Comments
     • Socialisation looks like a very interesting/promising idea
       ‣ No significant performance gains so far, though
       ‣ How to regularise sociality has room for improvement
         ✦ Requiring neighbouring users to be similar seems too naive and too strong
  31. Comments
     • Their source code is available:
       ‣ https://github.com/HKUST-KnowComp/SocializedWordEmbeddings
     • They have just published an improved version of socialised word embeddings this year:
       ‣ https://github.com/HKUST-KnowComp/SRBRW

       @inproceedings{zeng2018biased,
         title={Biased Random Walk based Social Regularization for Word Embeddings},
         author={Zeng, Ziqian and Liu, Xin and Song, Yangqiu},
         booktitle={IJCAI},
         pages={XX-YY},
         year={2018},
       }
  32. Deep contextualized word representations. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. NAACL-HLT 2018
  33. Overview
     • Propose a new type of deep contextualised word representation (ELMo) that models:
       ‣ Complex characteristics of word use (e.g., syntax and semantics)
       ‣ How these uses vary across linguistic contexts (i.e., polysemy)
     • Show that ELMo can improve existing neural models on various NLP tasks
     • Argue that ELMo captures more abstract linguistic characteristics in the higher layers
  34. Example
     • GloVe mostly learns a sport-related context, while ELMo can distinguish the word sense based on the context
  35. Method
     • Embeddings from Language Models: ELMo
     • Learn word embeddings by building bidirectional language models (biLMs); the joint training objective is reconstructed below
       ‣ biLMs consist of a forward and a backward LM
         ✦ Forward: p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})
         ✦ Backward: p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)
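     As I recall from the paper, the biLM is trained by jointly maximising the log-likelihood of both directions, sharing the token embedding parameters Θ_x and the softmax parameters Θ_s across the two LSTMs (reconstruction; check the paper for the exact statement):

         % Joint biLM training objective (forward + backward log-likelihoods)
         \sum_{k=1}^{N} \Big(
             \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
           + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
         \Big)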
  36. Method
     • With a long short-term memory (LSTM) network, predict the next word in both directions to build the biLMs
     • (Figure: the forward LM architecture, expanded in the forward direction of k; embedding layer x_k, hidden LSTM layers h^{LM}_{k,1} and h^{LM}_{k,2}, output o_k predicting t_k for the input “have a nice ___”)
  37. Method
     • ELMo represents a word t_k as a linear combination of the corresponding hidden layers (including its embedding); a small sketch of the combination follows below
       ‣ The forward and backward hidden states are concatenated per layer, [h^{LM}_{kj}; h^{LM}_{kj}], with the lowest layer h^{LM}_{k0} = [x_k; x_k]
       ‣ The layers are weighted by task-specific parameters s^{task}_0, s^{task}_1, s^{task}_2, summed, and scaled by γ^{task} to give ELMo^{task}_k
     • Unlike usual word embeddings, ELMo is assigned to every token instead of every type
     • ELMo is a task-specific representation: the downstream task learns the weighting parameters
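     A sketch of the ELMo combination: softmax-normalised, task-specific weights over the biLM layers, scaled by a task-specific gamma. The layer outputs here are random placeholders; in practice they come from the pretrained biLM.

         # Task-specific weighted combination of biLM layers (ELMo).
         import numpy as np

         rng = np.random.default_rng(0)
         L, dim = 3, 1024                       # e.g. embedding layer + 2 biLSTM layers
         h_layers = rng.normal(size=(L, dim))   # h_{k,j}^{LM} for one token k

         s_task = rng.normal(size=L)            # learned scalar weights (one per layer)
         gamma_task = 1.0                       # learned task-specific scale

         def elmo_vector(h_layers, s_task, gamma_task):
             w = np.exp(s_task - s_task.max())
             w /= w.sum()                       # softmax-normalised layer weights
             return gamma_task * (w[:, None] * h_layers).sum(axis=0)

         print(elmo_vector(h_layers, s_task, gamma_task).shape)   # (1024,)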
  38. Method
     • ELMo can be integrated into almost any neural NLP model by simply concatenating it to the embedding layer (see the sketch below)
     • (Figure: a corpus is used to train the biLMs; the resulting ELMo vectors enhance the usual inputs “have a nice …” of the task model)
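     A tiny sketch of that “simple concatenation”: the enhanced input for a token is its usual embedding concatenated with its ELMo vector (shapes illustrative).

         # Enhanced input = [usual token embedding ; ELMo vector]
         import numpy as np

         token_embedding = np.zeros(300)        # e.g. a pretrained vector for the token
         elmo = np.zeros(1024)                  # ELMo vector for the same token in context
         enhanced_input = np.concatenate([token_embedding, elmo])   # fed to the task model
         print(enhanced_input.shape)            # (1324,)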
  39. Evaluation
     • Many NLP tasks are improved by using ELMo: question answering, textual entailment, sentiment analysis, semantic role labelling, coreference resolution, and named entity recognition
  40. Analysis
     • The higher layer seems to learn semantics, while the lower layer probably captures syntactic features
     • (Figure: per-layer results on word sense disambiguation and PoS tagging)
  41. Analysis
     • The higher layer seems to learn semantics while the lower layer probably captures syntactic features???
       ‣ Most models preferred the “syntactic (probably)” features, even in sentiment analysis
  42. Analysis
     • ELMo-enhanced models can make use of small datasets more efficiently
     • (Figure: learning curves for textual entailment and semantic role labelling)
  43. Comments
     • Pre-trained ELMo models are available at https://allennlp.org/elmo (see the usage sketch below)
       ‣ AllenNLP is a deep-learning NLP library built on top of PyTorch
       ‣ AllenNLP is a product of AI2 (the Allen Institute for Artificial Intelligence), which works on other interesting projects such as Semantic Scholar
     • ELMo can process character-level inputs
       ‣ Japanese (Chinese, Korean, …) ELMo models are therefore likely to be possible
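     A hedged usage sketch, assuming an older AllenNLP release that ships the ElmoEmbedder helper (the exact class, default weight URLs, and output shapes may differ by version):

         # Embed a tokenised sentence with pretrained ELMo via AllenNLP (older releases).
         from allennlp.commands.elmo import ElmoEmbedder

         elmo = ElmoEmbedder()                          # downloads default pretrained weights
         vectors = elmo.embed_sentence(["have", "a", "nice", "day"])
         print(vectors.shape)                           # roughly (3 layers, 4 tokens, 1024)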