
Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

Slides introducing the paper "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors" by Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014)

Presented by Mamoru Komachi
The 6th summer camp of NLP


Transcript

  1. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
    Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014) (tables are taken from the paper)
    Presented by Mamoru Komachi <[email protected]>
    The 6th summer camp of NLP, September 5th, 2014
  2. The well-known distributional hypothesis; any problems so far?
    - “A word is characterized by the company it keeps.” (Firth, 1957)
    - Characterize a word by its context (vector)
    - Widely accepted in the NLP community
    (Photo source: http://www.ircs.upenn.edu/zellig/) Zellig Harris (1909-1992)
  3. Count-vector-based distributional semantic approaches faced a new challenge (deep learning)
    - “Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.”
    - “[T]he literature is still lacking a systematic comparison of the predictive models with classic, count-vector-based distributional semantic approaches.”
    - “The results, …, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.”
  4. Count models are the traditional, standard way to model distributional semantics
    - Collect context vectors for each word type
    - Context vectors = n words on the left and right (symmetric, n = 2 and 5, position independent)
    - Context scores are weighted by positive pointwise mutual information or local mutual information (log-likelihood ratio)
    - Reduce dimensionality to k (k = 200 … 500) by singular value decomposition or non-negative matrix factorization (see the sketch below)
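    The count pipeline can be summarized in a few lines. Below is a minimal sketch on a toy corpus with a window of 2; the function names are illustrative, not the paper's DISSECT toolkit.

```python
# Minimal count-model sketch: co-occurrence counts -> PPMI weighting -> truncated SVD.
from collections import Counter
import numpy as np

def build_cooccurrence(tokens, window=2):
    """Count symmetric word/context co-occurrences within +-window tokens."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[idx[w], idx[tokens[j]]] += 1
    return vocab, counts

def ppmi(counts):
    """Positive pointwise mutual information weighting of a co-occurrence matrix."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    return np.maximum(pmi, 0.0)          # clip negative and undefined values to 0

def reduce_svd(matrix, k):
    """Truncated SVD: keep the k strongest latent dimensions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, counts = build_cooccurrence(tokens, window=2)
vectors = reduce_svd(ppmi(counts), k=4)   # one k-dimensional vector per word type
```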
  5. Predict models are new, training-based ways to model distributional semantics
    - Optimize context vectors for each word type
    - Context vectors = n words on the left and right (symmetric, n = 2 and 5, position independent) (Collobert et al., 2011)
    - Learn a model to predict a word given its context vectors (see the sketch below)
    - Can directly optimize the weights of a word's context vector with supervised learning (but with no manual annotation, i.e. predict models use the same unannotated data as count models)
    - Mikolov et al. (2013): word types are mapped to k dimensions (k = 200 … 500)
    - Collobert & Weston (2008) model: 100-dimensional vectors, trained on Wikipedia for two months (!)
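    To make the "predict" idea concrete, here is a toy sketch of one negative-sampling update in the spirit of word2vec; the array names, sizes and the update loop are illustrative assumptions, not the actual word2vec code.

```python
# One SGD step: push the vectors of an observed (word, context) pair together,
# push the word away from randomly sampled "negative" context words.
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 50                              # vocabulary size, vector dimensionality
W_word = rng.normal(scale=0.1, size=(V, k))  # vectors for target words
W_ctx = rng.normal(scale=0.1, size=(V, k))   # vectors for context words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(word, context, negatives, lr=0.025):
    """Raise the score of the observed pair, lower the scores of the negatives."""
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(W_word[word] @ W_ctx[c])
        grad = score - label                 # gradient of the logistic loss
        g_word = grad * W_ctx[c]
        g_ctx = grad * W_word[word]
        W_word[word] -= lr * g_word
        W_ctx[c] -= lr * g_ctx

# e.g. word id 3 observed near context word id 7; ids 101 and 542 are sampled negatives
train_pair(3, 7, negatives=[101, 542])
```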
  6. Training data and toolkits are freely available: easy to re-implement
    - Training data
      - ukWaC + English Wikipedia + British National Corpus
      - 2.8 billion tokens (retain the top 300K most frequent words for target and context modeling)
    - Toolkits
      - Count model: DISSECT toolkit (the authors' software)
      - Predict model: word2vec, Collobert & Weston model (see the training sketch below)
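    For the predict side, a training run with the gensim wrapper around word2vec looks roughly like the following; gensim and the exact parameter names are an assumption for illustration (the paper used the original word2vec tool), with settings mirroring the best predict configuration summarized on slide 17.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]   # replace with a real corpus

model = Word2Vec(
    sentences,
    vector_size=400,   # k = 400 dimensions ("size" in older gensim versions)
    window=5,          # 5 context words on each side
    hs=0,              # no hierarchical softmax ...
    negative=10,       # ... use negative sampling (the exact count is an assumption)
    sample=1e-5,       # subsample very frequent words
    min_count=1,
)
vector = model.wv["cat"]   # the learned 400-dimensional vector for "cat"
```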
  7. Benchmarks: 5 standard tasks in distributional semantic modeling
    - Semantic relatedness
    - Synonym detection
    - Concept categorization
    - Selectional preferences
    - Analogy
  8. Semantic relatedness: rate the degree of semantic similarity between two words on a numerical scale
    - Evaluation
      - Compare the correlation between the average scores that human subjects assigned to the pairs and the cosines between the corresponding vectors under the count/predict models (see the sketch below)
    - Datasets
      - Rubenstein and Goodenough (1965): 65 noun pairs
      - WordSim353 (Finkelstein et al., 2002): 353 pairs
      - Agirre et al. (2009): split WordSim353 into similarity and relatedness subsets
      - MEN (Bruni et al., 2013): 1,000 word pairs
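    Sketched below is the relatedness evaluation, assuming `vec` maps a word to its vector (from either kind of model) and `pairs` holds (word1, word2, human_score) triples from a dataset such as WordSim353; the function names are illustrative, and Spearman correlation is used here as one common choice.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_relatedness(pairs, vec):
    """Correlation between human ratings and model cosines over word pairs."""
    human = [score for _, _, score in pairs]
    model = [cosine(vec[w1], vec[w2]) for w1, w2, _ in pairs]
    corr, _ = spearmanr(human, model)
    return corr
```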
  9. Synonym detection: given a target term, choose a word from 4 synonym candidates
    - Example
      - target: levied; candidates: imposed (correct), believed, requested, correlated
    - Method
      - Compute the cosine of each candidate vector with the target, and pick the candidate with the largest cosine as the answer (see the sketch below; an extensively tuned count model achieves 100% accuracy)
    - Dataset
      - TOEFL set (Landauer and Dumais, 1997): 80 multiple-choice questions that pair a target word with 4 synonym candidates
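    A minimal sketch of the TOEFL-style choice, with the same illustrative `vec` lookup as above.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def choose_synonym(target, candidates, vec):
    """Return the candidate whose vector has the largest cosine with the target."""
    return max(candidates, key=lambda c: cosine(vec[target], vec[c]))

# e.g. choose_synonym("levied", ["imposed", "believed", "requested", "correlated"], vec)
# should return "imposed"
```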
  10. Concept categorization: group a set of nominal concepts into natural categories
    - Example
      - helicopters and motorcycles -> vehicle class
      - dogs and elephants -> mammal class
    - Method
      - Unsupervised clustering into n clusters (n is given by the gold data; see the sketch below)
    - Datasets
      - Almuhareb-Poesio benchmark (2006): 402 concepts organized into 21 categories
      - ESSLLI 2008 Distributional Semantics Workshop shared-task set (Baroni et al., 2008): 44 concepts into 6 categories
      - Battig set (Baroni et al., 2010): 83 concepts into 10 categories
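    A rough sketch of the clustering evaluation: the paper clusters with CLUTO, so scikit-learn's k-means below is only a stand-in, and the purity scorer is an illustrative simplification.

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def categorize(concepts, gold_labels, vec, n_clusters):
    """Cluster concept vectors into the gold number of classes and score purity."""
    X = np.vstack([vec[c] for c in concepts])
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    correct = 0
    for k in range(n_clusters):
        members = [gold_labels[i] for i in range(len(concepts)) if pred[i] == k]
        if members:
            # each cluster is credited with its most frequent gold category
            correct += Counter(members).most_common(1)[0][1]
    return correct / len(concepts)
```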
  11. Selectional preferences: given a verb-noun pair, rate the typicality of the noun as a subject or object of the verb
    - Example
      - (eat, people) -> assign a high score for the subject relation, a low score for the object relation
    - Method
      - Take the 20 nouns most strongly associated with the verb, average their vectors to get a prototype vector, then compute the cosine similarity of the test noun to that prototype (see the sketch below)
    - Datasets
      - Padó (2007): 211 pairs
      - McRae et al. (1998): 100 pairs
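    A rough sketch of the prototype method, assuming a helper `associated_nouns(verb, relation)` that returns the 20 nouns most strongly associated with the verb in that slot; the helper and all names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def preference_score(verb, noun, relation, vec, associated_nouns):
    """Typicality of `noun` as e.g. subject or object of `verb`."""
    prototype = np.mean([vec[n] for n in associated_nouns(verb, relation)], axis=0)
    return cosine(vec[noun], prototype)
```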
  12. Analogy: given a pair of words and a test word, find another word that instantiates the same relation
    - Example
      - (brother : sister, grandson : X) -> X = granddaughter
      - (work : works, speak : X) -> X = speaks
    - Method
      - Add the difference between the two example term vectors to the test term vector and find the nearest neighbor of the result (Mikolov et al., 2013), e.g. X ≈ grandson + (sister − brother); see the sketch below
    - Dataset
      - Mikolov et al. (2013): 9K semantic and 10.5K syntactic analogy questions
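    The vector-offset method can be sketched as follows, again with an illustrative `vec` lookup; `vocab` is the list of candidate words.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vec, vocab):
    """Solve a : b :: c : ?, e.g. brother : sister :: grandson : granddaughter."""
    target = vec[b] - vec[a] + vec[c]
    candidates = [w for w in vocab if w not in (a, b, c)]   # exclude the input words
    return max(candidates, key=lambda w: cosine(vec[w], target))
```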
  13. Observations
    - Count models
      - PMI is better than LLR
      - SVD outperforms NMF, but no compression at all gives the best results
    - Predict models
      - Negative sampling outperforms the costly hierarchical softmax
      - Subsampling frequent words seems to have an effect similar to PMI weighting in count models
    - Off-the-shelf C&W model
      - Poor performance (under investigation)
  14. Discussion
    - Predict models obtained excellent results by trying only a few variations of the default settings, whereas count models need a thorough search over a large number of parameters to reach maximum performance
    - Predict models scale to large datasets and use only a few hundred dimensions, without intensive tuning
    - Count models and predict models are complementary in the errors they make
    - State-of-the-art count models incorporate lexico-syntactic relations
    - The two could possibly be combined into a better unified model
  15. Open questions
    - “Do the dimensions of predict models also encode latent semantic domains?”
    - “Do these models afford the same flexibility of count vectors in capturing linguistically rich contexts?”
    - “Does the structure of predict vectors mimic meaningful semantic relations?”
  16. Not feature engineering but context engineering
    - How to encode syntactic, topical and functional information into context features is still an open problem
    - Whether certain properties of the vectors reflect semantic relations in the expected way: e.g. whether the vectors of hypernyms “distributionally include” the vectors of hyponyms (see the sketch below)
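    As an aside, "distributional inclusion" is often operationalized with a precision-style overlap measure over sparse (e.g. PPMI-weighted) vectors; the sketch below shows one such measure purely as an illustration, not an evaluation from the paper, and `ppmi_vec` is a hypothetical lookup.

```python
import numpy as np

def inclusion_score(hyponym_vec, hypernym_vec):
    """Fraction of the hyponym's feature mass on contexts the hypernym also has."""
    shared = hyponym_vec * (hypernym_vec > 0)
    return float(shared.sum() / hyponym_vec.sum())

# if inclusion holds, inclusion_score(ppmi_vec["dog"], ppmi_vec["animal"]) should be
# higher than inclusion_score(ppmi_vec["animal"], ppmi_vec["dog"])
```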
  17. Summary
    - Context-predicting models perform as well as the highly tuned classic count-vector models on a wide range of lexical semantics tasks
    - Best models:
      - Count model: window size = 2; scoring = PMI; no dimensionality reduction; 300K dimensions
      - Predict model: window size = 5; no hierarchical softmax; negative sampling; 400 dimensions
    - This suggests a promising new direction in computational semantics
  18. Q: Is it true that count models and predict models look at the same information? (cont.)
    - (I heard that word2vec uses a sampling-based method to determine how far the context window reaches.)
    - A: Possibly not. Predict models weight nearby words more heavily than count models do; however, it is not clear that this accounts for the difference in performance.
  19. Q: Is there any black magic in tuning the parameters, especially the step variable in dimensionality reduction?
    - A: No. It is possible that the reduced dimensionality n and the context-vector size k behave similarly within a given range, but the comparison is acceptable for two reasons:
      - In count models, dimensionality reduction does not really matter, since no compression performs best.
      - From a development point of view, the size of the final model has a large impact on deployment, so comparing these two variables makes sense at least in practice.
  20. Q: Why do predict models outperform count models? Is there any theoretical analysis?
    - A: The authors do not discuss the reason in the paper.
    - It may be that predict models abstract over semantic relations, giving more direct stepping stones for inferring semantic relatedness.
    - Predict models also tune a large number of parameters, so it is not surprising that they achieve better performance than count models.
  21. Q: Is there any comparison on a PP-attachment task? (cont.)
    - (I read a paper saying that word2vec features do not improve PP-attachment, unlike an SVO modeling task.)
    - A: No. It is possible that PP-attachment fails because the setting in this paper uses only local context.