Slide 1

Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014)
(Tables are taken from the above-mentioned paper)
Presented by Mamoru Komachi
The 6th summer camp of NLP, September 5th, 2014

Slide 2

Well-known Distributional Hypothesis; any problems so far?
v “A word is characterized by the company it keeps.” (Firth, 1957)
v Characterize a word by its context (vector)
v Widely accepted in the NLP community
Zellig Harris (1909-1992) (Source: http://www.ircs.upenn.edu/zellig/)

Slide 3

Count-vector-based distributional semantic approaches faced a new challenge (deep learning)
v “Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block.”
v “[T]he literature is still lacking a systematic comparison of the predictive models with classic, count-vector-based distributional semantic approaches.”
v “The results, …, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.”

Slide 4

Background: Count models and predict models

Slide 5

Count models are the traditional, standard way to model distributional semantics
v Collect context vectors for each word type
v Context vectors = n words to the left and right (symmetric, n = 2 and 5, position-independent)
v Context scores are weighted by positive pointwise mutual information (PPMI) or local mutual information (log-likelihood ratio)
v Reduce dimensionality to k (k = 200 … 500) by singular value decomposition or non-negative matrix factorization
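A minimal sketch of this pipeline on a toy corpus, assuming a plain numpy implementation rather than the DISSECT toolkit used in the paper; the corpus, window size and k below are illustrative placeholders.

```python
# Count model sketch: co-occurrence counts -> PPMI weighting -> truncated SVD.
import numpy as np

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["a", "dog", "bites"]]
window = 2  # symmetric context window, as described above

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Positive PMI: max(0, log p(w,c) / (p(w) p(c)))
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Reduce to k dimensions with SVD (k = 200..500 in the paper; 2 on this toy data)
k = 2
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :k] * S[:k]
print(dict(zip(vocab, np.round(word_vectors, 2))))
```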

Slide 6

Predict models are a newer, training-based way to model distributional semantics
v Optimize context vectors for each word type
v Context vectors = n words to the left and right (symmetric, n = 2 and 5, position-independent) (Collobert et al., 2011)
v Learn a model to predict a word given context vectors
v Can directly optimize the weights of a word’s context vector with supervised learning (but with no manual annotation, i.e. predict models use the same unannotated data as count models)
v Mikolov et al. (2013)
  v Word types are mapped to k dimensions (k = 200 … 500)
v Collobert & Weston (2008) model
  v 100-dimensional vectors, trained on Wikipedia for two months (!)
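For concreteness, a hedged sketch of training a predict model with gensim's reimplementation of word2vec (the paper used Mikolov's original tool); the toy corpus and the number of negative samples are placeholders, and the parameter names follow gensim 4.x, which may differ in other versions.

```python
from gensim.models import Word2Vec

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"]]  # toy corpus

model = Word2Vec(
    sentences,
    vector_size=400,    # dimensionality k (the best predict setting reported later)
    window=5,           # symmetric context window
    sg=0,               # CBOW architecture; sg=1 would select skip-gram
    hs=0, negative=10,  # negative sampling instead of hierarchical softmax
    min_count=1,
)
print(model.wv.most_similar("dog", topn=3))
```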

Slide 7

Tasks: Lexical semantics

Slide 8

Training data and toolkits are freely available: easy to re-implement
v Training data
  v ukWaC + English Wikipedia + British National Corpus
  v 2.8 billion tokens (retain the top 300K most frequent words as targets and contexts)
v Toolkits
  v Count models: DISSECT toolkit (the authors’ software)
  v Predict models: word2vec, Collobert & Weston model

Slide 9

Benchmarks: 5 standard tasks in distributional semantic modeling
v Semantic relatedness
v Synonym detection
v Concept categorization
v Selectional preferences
v Analogy

Slide 10

Semantic relatedness: rate the degree of semantic similarity between two words on a numerical scale
v Evaluation
  v Compute the correlation between the average scores that human subjects assigned to the pairs and the cosines between the corresponding vectors in the count/predict models
v Datasets
  v Rubenstein and Goodenough (1965): 65 noun pairs
  v WordSim353 (Finkelstein et al., 2002): 353 pairs
  v Agirre et al. (2009): split WordSim353 into similarity and relatedness subsets
  v MEN (Bruni et al., 2013): 1,000 word pairs
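A minimal sketch of this evaluation, assuming Spearman correlation (the slide only says "correlation") and made-up vectors and ratings in place of a real dataset such as WordSim353.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder vectors and human ratings (a real run would load e.g. WordSim353)
vectors = {
    "car": np.array([0.9, 0.1]),
    "automobile": np.array([0.8, 0.2]),
    "banana": np.array([0.1, 0.9]),
}
rated_pairs = [("car", "automobile", 9.2), ("car", "banana", 1.3)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [score for _, _, score in rated_pairs]
rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```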

Slide 11

Synonym detection: given a target term, choose a word from 4 synonym candidates
v Example
  v Target: levied; candidates: imposed (correct), believed, requested, correlated
v Method
  v Compute the cosine of each candidate vector with the target, and pick the candidate with the largest cosine as the answer (an extensively tuned count model has been reported to reach 100% accuracy)
v Dataset
  v TOEFL set (Landauer and Dumais, 1997): 80 multiple-choice questions that pair a target word with 4 synonym candidates
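A minimal sketch of this method, with made-up vectors standing in for the learned representations.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for the TOEFL example above
vectors = {
    "levied": np.array([0.7, 0.3]),
    "imposed": np.array([0.6, 0.4]),
    "believed": np.array([0.1, 0.9]),
    "requested": np.array([0.2, 0.8]),
    "correlated": np.array([0.3, 0.7]),
}

def answer_synonym_question(target, candidates):
    # Pick the candidate whose vector has the largest cosine with the target
    return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))

print(answer_synonym_question("levied", ["imposed", "believed", "requested", "correlated"]))
# -> "imposed" with these toy vectors
```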

Slide 12

Concept categorization: group a set of nominal concepts into natural categories
v Examples
  v helicopters and motorcycles -> vehicle class
  v dogs and elephants -> mammal class
v Method
  v Unsupervised clustering into n clusters (n is given by the gold data)
v Datasets
  v Almuhareb-Poesio benchmark (2006): 402 concepts organized into 21 categories
  v ESSLLI 2008 Distributional Semantic Workshop shared-task set (Baroni et al., 2008): 44 concepts in 6 categories
  v Battig set (Baroni et al., 2010): 83 concepts in 10 categories
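A toy sketch of the categorization setup; k-means and the purity score below are stand-ins I am assuming, not necessarily the exact clusterer and scorer used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy concept vectors and gold categories
concepts = ["helicopter", "motorcycle", "dog", "elephant"]
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
gold = {"helicopter": "vehicle", "motorcycle": "vehicle",
        "dog": "mammal", "elephant": "mammal"}

n_clusters = len(set(gold.values()))  # n is given by the gold data
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)

# Purity: each cluster is credited with its majority gold class
correct = 0
for c in range(n_clusters):
    members = [gold[concepts[i]] for i in range(len(concepts)) if labels[i] == c]
    correct += max(members.count(cls) for cls in set(members))
print("purity:", correct / len(concepts))
```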

Slide 13

Selectional preferences: given a verb-noun pair, rate the typicality of the noun as a subject or object of the verb
v Example
  v (eat, people) -> assign a high score for the subject relation, a low score for the object relation
v Method
  v Take the 20 nouns most strongly associated with the verb, average their vectors to get a prototype vector, and then compute the cosine similarity of the test noun to that prototype
v Datasets
  v Pado (2007): 211 pairs
  v McRae et al. (1998): 100 pairs
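A minimal sketch of the prototype-based scoring, with toy vectors and only two associated nouns per role instead of the 20 used in the paper.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors; in the real setup the nouns most strongly associated with the
# verb in each role are taken from corpus counts.
vectors = {
    "people": np.array([0.7, 0.3]),
    "child": np.array([0.8, 0.2]),
    "man": np.array([0.75, 0.25]),
    "apple": np.array([0.1, 0.9]),
    "bread": np.array([0.2, 0.8]),
}

def preference_score(test_noun, associated_nouns):
    # Average the associated nouns into a prototype, then compare by cosine
    prototype = np.mean([vectors[n] for n in associated_nouns], axis=0)
    return cosine(vectors[test_noun], prototype)

# Typicality of "people" as a subject vs. as an object of "eat"
print(preference_score("people", ["child", "man"]))    # high: typical subject
print(preference_score("people", ["apple", "bread"]))  # low: atypical object
```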

Slide 14

Analogy: given a pair of words and a test word, find another word that instantiates the same relation
v Examples
  v (brother : sister, grandson : X) -> X = granddaughter
  v (work : works, speak : X) -> X = speaks
v Method
  v Subtract the first example term’s vector from the second, add the test term’s vector, and take the nearest neighbor of the result (Mikolov et al., 2013), e.g. sister − brother + grandson ≈ granddaughter
v Dataset
  v Mikolov et al. (2013): 9K semantic and 10.5K syntactic analogy questions
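A minimal sketch of this vector-offset method on toy vectors (the full experiments search the whole 300K-word vocabulary).

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for the example above
vectors = {
    "brother": np.array([0.9, 0.1, 0.4]),
    "sister": np.array([0.9, 0.1, 0.8]),
    "grandson": np.array([0.2, 0.9, 0.4]),
    "granddaughter": np.array([0.2, 0.9, 0.8]),
}

def solve_analogy(a, b, c):
    # b - a + c, then take the nearest neighbor excluding the question words
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(solve_analogy("brother", "sister", "grandson"))  # -> granddaughter
```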

Slide 15

Experiments: 5 tasks of lexical semantics

Slide 16

Results and discussion: Lexical semantics

Slide 17

Results: Predict models outperform count models

Slide 18

Predict models are not very sensitive to parameter settings

Slide 19

Observations
v Count models
  v PMI is better than LLR
  v SVD outperforms NMF, but no compression at all gives the best results
v Predict models
  v Negative sampling outperforms the costlier hierarchical softmax method
  v Subsampling frequent words seems to have an effect similar to PMI weighting in count models
v Off-the-shelf C&W model
  v Poor performance (under investigation)
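As an illustration of the subsampling point, a sketch of the frequent-word subsampling rule described in Mikolov et al. (2013); the threshold t and the toy counts are assumptions, and the released word2vec code uses a slightly different variant of the formula.

```python
import math

def keep_probability(relative_freq, t=1e-5):
    # Discard word w with probability 1 - sqrt(t / f(w)), i.e. keep it with
    # probability sqrt(t / f(w)), capped at 1 for rare words.
    return min(1.0, math.sqrt(t / relative_freq))

corpus_size = 1_000_000
for word, count in [("the", 60_000), ("dog", 120), ("aardvark", 5)]:
    print(word, round(keep_probability(count / corpus_size), 4))
# Very frequent words are mostly dropped, rare words are always kept,
# which downweights frequent contexts much as PMI weighting does in count models.
```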

Slide 20

Discussion
v Predict models obtained excellent results by trying only a few variations of the default settings, whereas count models need a large number of parameters to be tuned thoroughly to reach maximum performance
v Predict models scale to large datasets and use only hundreds of dimensions, without intensive tuning
v Count models and predict models are complementary in the errors they make
v State-of-the-art count models incorporate lexico-syntactic relations
v The two could possibly be combined into a better unified model

Slide 21

Open questions
v “Do the dimensions of predict models also encode latent semantic domains?”
v “Do these models afford the same flexibility of count vectors in capturing linguistically rich contexts?”
v “Does the structure of predict vectors mimic meaningful semantic relations?”

Slide 22

Not feature engineering but context engineering
v How to encode syntactic, topical and functional information into context features is still under development
v Whether certain properties of vectors reflect semantic relations in the expected way is also an open issue: e.g. whether the vectors of hypernyms “distributionally include” the vectors of hyponyms
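As a toy illustration of what "distributional inclusion" could mean, a sketch of one simple inclusion measure; this particular measure and the toy vectors are my own assumptions, not taken from the paper.

```python
import numpy as np

# Toy sparse context vectors over the features (contexts) listed below
features = ["barks", "purrs", "feeds", "runs", "breathes"]
vectors = {
    "dog":    np.array([5, 0, 2, 3, 4]),  # hyponym
    "animal": np.array([1, 1, 4, 4, 6]),  # hypernym
}

def inclusion(hyponym, hypernym):
    # Fraction of the hyponym's active contexts also active for the hypernym:
    # a crude test of whether the hypernym "distributionally includes" the hyponym
    hypo, hyper = vectors[hyponym] > 0, vectors[hypernym] > 0
    return (hypo & hyper).sum() / hypo.sum()

print(inclusion("dog", "animal"))   # 1.0: every dog context is also an animal context
print(inclusion("animal", "dog"))   # 0.8: not every animal context fits dogs
```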

Slide 23

Summary
v Context-predicting models perform as well as the highly tuned classic count-vector models on a wide range of lexical semantics tasks
v Best models
  v Count model: window size = 2; scoring = PMI; no dimension reduction; 300K dimensions
  v Predict model: window size = 5; no hierarchical softmax; negative sampling; 400 dimensions
v Suggests a promising new direction in computational semantics

Slide 24

Is it true that count models and predict models look at the same information? (cont.) I heard that word2vec uses a sampling-based method to determine how far the context window extends.
v Possibly not. Predict models weight near neighbors more heavily than count models do. However, it is not clear whether this accounts for the difference in performance.

Slide 25

Is there any black magic in tuning parameters, especially the step variable in dimension reduction?
v No. It is possible that the reduced dimensionality n and the size of the context vectors k behave similarly in a given range, but the comparison may still be fine for the following two reasons:
  v In count models, dimensionality reduction does not really matter, since no compression performs best.
  v From a development point of view, the size of the final model has a large impact on the deployment of the model, so comparing these two variables makes sense at least in practice.

Slide 26

Why do predict models outperform count models? Is there any theoretical analysis?
v As far as the paper is concerned, the authors do not discuss the reason.
v It may be that predict models abstract semantic relations, providing stepping stones for inferring semantic relatedness more concisely.
v Predict models tune a large number of parameters, so it is not surprising that they achieve better performance than count models.

Slide 27

Is there any comparison on a PP-attachment task? (cont.) I read a paper saying that word2vec features do not improve PP-attachment, unlike an SVO modeling task.
v No. It is possible that PP-attachment fails because the setting of this paper uses only local context.