
Towards Learning Word Representation

Magdalena Wiercioch

February 13, 2017

Transcript

  1. Towards Learning Word Representation. Magdalena Wiercioch, Faculty of Mathematics and Computer Science, Jagiellonian University, Poland. E-mail: [email protected]. TFML, Feb. 13, 2017.
  2. Outline: 1. Research motivation, 2. Background, 3. Model, 4. Experiments, 5. Conclusions.
  3. Research motivation. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov: Enriching Word Vectors with Subword Information. arXiv, 2016. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: Efficient Estimation of Word Representations in Vector Space. arXiv, 2013.
  4. Background. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990. Neural networks: Hinrich Schütze: Dimensions of Meaning. Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, 1992. N. Sakamoto, K. Yamamoto, and S. Nakagawa: Combination of Syllable-Based N-gram Search and Word Search for Spoken Term Detection Through Spoken Queries and IV/OOV Classification. Automatic Speech Recognition and Understanding (ASRU), 2015.
  5. Model. We extended the method introduced by Bojanowski et al. Their model is derived from the continuous Skip-gram (SG) model proposed by Mikolov et al.
  6. Skip-gram model. The goal of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a corpus. Let us denote the vocabulary of training words by $W = \{w_1, w_2, \ldots, w_S\}$, where $S$ is the size of the vocabulary. The Skip-gram model maximizes the average log probability
     $$l(W) = \sum_{t=1}^{S} \sum_{c \in C_t} \log p(w_c \mid w_t),$$
     where $C_t$ is the context of word $w_t$.
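
A minimal sketch of the (target, context) pair enumeration behind this objective, treating the training sequence as a plain Python list of tokens; the toy corpus and window size are illustrative assumptions, not taken from the slides.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs: for every position t, each word inside
    the context window C_t contributes one log p(w_c | w_t) term to l(W)."""
    for t, target in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                yield target, tokens[c]

# Toy corpus purely for illustration.
corpus = "the cat sat on the mat".split()
for pair in skipgram_pairs(corpus, window=2):
    print(pair)
```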
  7. Skip-gram model (figure).
  8. Skip-gram model. The probability of observing a context word $w_c$ given $w_t$ is parametrized using the word vectors. Given a scoring function $s$, which maps (word, context) pairs to values in $\mathbb{R}$, a possible choice for defining the probability of a context word is the softmax
     $$p(\mathrm{Context} \mid \mathrm{Word}) = y_c = \frac{e^{w_c^\top w_t}}{\sum_{j=1}^{S} e^{w_j^\top w_t}},$$
     where $w_c$, $w_t$, $w_j$ are vector representations of words and $y_c$ is the output of the $c$-th neuron of the output layer. The scoring function is parametrized by the scalar product between word and context embeddings: $s(\mathrm{Word}, \mathrm{Context}) = w_t^\top w_c$.
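
A short NumPy sketch of this softmax parametrization, using separate word and context embedding matrices; the vocabulary size and embedding dimension below are arbitrary assumptions.

```python
import numpy as np

def context_probability(word_vecs, context_vecs, t, c):
    """p(w_c | w_t): softmax over the scores s(w_t, w_j) = w_t . v_j for all j."""
    scores = context_vecs @ word_vecs[t]           # one score per candidate context word
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax
    return probs[c]

rng = np.random.default_rng(0)
S, dim = 1000, 100                                 # assumed vocabulary size and dimension
word_vecs = rng.normal(scale=0.01, size=(S, dim))
context_vecs = rng.normal(scale=0.01, size=(S, dim))
print(context_probability(word_vecs, context_vecs, t=3, c=7))
```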
  9. Subword model by Bojanowski et al. The Skip-gram model ignores the internal structure of words. They introduced a different scoring function
     $$s(w, c) = \sum_{g \in G_w} z_g^\top v_c,$$
     where $G_w \subset \{1, \ldots, G\}$ is the set of letter n-grams which appear in $w$.
  10. Subword model by Bojanowski et al. (continued). Limitation: only n-grams of length greater than or equal to 3 and smaller than or equal to 6 were considered. We claim this may be insufficient for short and rare words.
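
A minimal sketch of letter n-gram extraction under the 3-to-6 length restriction discussed above; the boundary markers "<" and ">" follow the common fastText convention and are an assumption here, not something stated on the slide.

```python
def letter_ngrams(word, n_min=3, n_max=6):
    """Return the set of letter n-grams of length n_min..n_max for a word,
    with assumed boundary markers added at the start and end of the word."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# Short words yield very few n-grams, which illustrates the claimed limitation.
print(len(letter_ngrams("cat")))     # only a handful of n-grams
print(len(letter_ngrams("where")))   # noticeably more
```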
  11. Fragmentation model. Let us denote by $G_w \subset \{1, \ldots, G\}$ the set of letter n-grams which appear in $w$ and by $H_w \subset \{1, \ldots, H\}$ the set of syllable n-grams which appear in $w$.
  12. Fragmentation model (continued). We associate a vector representation $z_g$ with each letter n-gram $g$ and a vector representation $z_h$ with each syllable n-gram $h$.
  13. Fragmentation model (continued). The new word representation is the direct concatenation of the two vector representations of its n-grams (letter and syllable): $z_{\mathrm{new}} = [z_g, z_h]$.
  14. Fragmentation model (continued). The scoring function is
     $$s(w, c) = \sum_{\mathrm{new} \in G_w \cup H_w} z_{\mathrm{new}}^\top v_c.$$
     The upgraded model makes use of n-grams of varied length $n$.
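
A rough sketch of this scoring function under stated assumptions: the syllable splitter below is a naive hypothetical stand-in (the slides do not specify one), individual syllables stand in for syllable n-grams, the letter n-gram length range 1..6 is an illustrative choice, and the embedding tables are random placeholders rather than learned vectors.

```python
import re
import numpy as np

rng = np.random.default_rng(0)
DIM = 100  # assumed embedding dimension

# Placeholder embedding tables for letter n-grams and syllable fragments.
letter_vecs, syllable_vecs = {}, {}

def vec(table, key):
    """Look up (or lazily create) an n-gram vector; a stand-in for learned embeddings."""
    if key not in table:
        table[key] = rng.normal(scale=0.01, size=DIM)
    return table[key]

def naive_syllables(word):
    """Very rough split on vowel groups; purely illustrative, not a real syllabifier."""
    parts = re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]+$)?", word)
    return parts if parts else [word]

def score(word, context_vec, n_min=1, n_max=6):
    """s(w, c) = sum over letter and syllable fragments of z_new . v_c."""
    letters = {word[i:i + n] for n in range(n_min, n_max + 1)
               for i in range(len(word) - n + 1)}
    total = sum(vec(letter_vecs, g) @ context_vec for g in letters)
    total += sum(vec(syllable_vecs, h) @ context_vec for h in naive_syllables(word))
    return total

print(score("representation", rng.normal(scale=0.01, size=DIM)))
```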
  15. Datasets. We used benchmarks for three languages: English, German, and Romanian. The data contains word pairs along with human-assigned similarity judgements.
  16. Settings. We compared our approach with 5 baseline representations: a model based on a recurrent neural network (RNNLM) from 2010; a method trained using Noise Contrastive Estimation (NCE); two log-bilinear methods by Mikolov et al., i.e. Continuous Bag of Words (CBoW) and Skip-gram (SG); and the model proposed by Bojanowski et al. (Ft). A context window of 6 words (both left and right) was used.
  17. Spearman's correlation coefficient for the word similarity task.
     dataset             RNNLM   NCE    CBoW   SG     Ft     our
     WS353 (en)          0.42    0.45   0.48   0.47   0.50   0.50
     SimVerb-3500 (en)   0.44    0.46   0.44   0.47   0.47   0.47
     Sim999 (en)         0.44    0.45   0.45   0.46   0.45   0.45
     RG65 (en)           0.39    0.40   0.43   0.46   0.46   0.47
     SGS130 (en)         0.45    0.48   0.50   0.49   0.50   0.50
     YP130 (en)          0.43    0.45   0.44   0.47   0.48   0.48
     Gur30 (ge)          0.45    0.46   0.49   0.51   0.51   0.51
     Gur65 (ge)          0.45    0.47   0.52   0.54   0.54   0.55
     ZG222 (ge)          0.50    0.53   0.53   0.55   0.56   0.56
     RO353 (ro)          0.51    0.55   0.57   0.59   0.59   0.61
     This task assesses how well the given representations capture word similarity. Our method slightly outperformed the baseline models in 3 cases.
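
A brief sketch of how the Spearman correlation for such a word similarity task is typically computed, using scipy; the word pairs, human ratings, and vectors below are made-up placeholders, not data from the benchmarks.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
words = ["car", "automobile", "cat", "dog", "banana", "stone", "love", "coffee", "tea"]
vecs = {w: rng.normal(size=50) for w in words}   # placeholder word representations

# Hypothetical word pairs with made-up human similarity ratings (e.g. a 0-10 scale).
pairs = [("car", "automobile"), ("cat", "dog"), ("banana", "stone"),
         ("love", "coffee"), ("coffee", "tea")]
human_scores = [9.0, 7.5, 1.1, 2.3, 6.8]

model_scores = [cosine(vecs[a], vecs[b]) for a, b in pairs]
rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman's rho: {rho:.2f}")
```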
  18. Semantic analogies task results. Accuracy in %.
     dataset             RNNLM   NCE    CBoW   SG     Ft     our
     WS353 (en)          15.3    24.2   23.8   28.0   27.5   27.5
     SimVerb-3500 (en)   20.1    26.7   30.6   34.5   34.5   34.0
     Sim999 (en)         18.3    21.2   29.8   24.3   24.8   24.8
     RG65 (en)           29.7    35.2   39.1   42.0   42.0   42.0
     SGS130 (en)         35.2    41.3   47.0   56.1   56.1   56.1
     YP130 (en)          46.4    42.6   43.6   56.3   56.3   56.3
     Gur30 (ge)          37.2    61.2   38.7   46.7   46.7   46.7
     Gur65 (ge)          39.8    34.2   44.7   46.9   46.0   46.0
     ZG222 (ge)          41.7    36.2   55.3   52.6   52.6   52.2
     RO353 (ro)          43.9    50.0   46.6   60.4   60.1   60.1
  19. Syntactic analogies task results. Accuracy in %.
     dataset             RNNLM   NCE    CBoW   SG     Ft     our
     WS353 (en)          24.7    30.2   33.5   40.9   40.2   40.2
     SimVerb-3500 (en)   31.6    33.9   37.2   52.0   52.0   52.0
     Sim999 (en)         26.0    32.0   55.0   49.8   49.3   49.5
     RG65 (en)           35.6    40.2   40.7   48.9   48.9   48.9
     SGS130 (en)         38.4    59.0   43.2   49.6   49.6   49.6
     YP130 (en)          32.3    37.8   45.8   50.3   50.3   50.3
     Gur30 (ge)          30.1    35.2   40.9   49.3   49.3   49.3
     Gur65 (ge)          24.0    35.7   47.3   62.5   62.5   62.5
     ZG222 (ge)          38.7    45.3   56.9   67.2   67.2   67.1
     RO353 (ro)          30.6    41.7   59.2   53.1   53.1   53.1
  20. Semantic and syntactic analogies task results. Our method did not outperform any competing model. It gave results similar to the other Skip-gram-based approaches. It may be worth exploring the method's performance on denser languages.
  21. Plots of performance versus training epoch for the word similarity task. Dataset: SimVerb-3500. All three models converge quickly to a satisfactory level of performance. Our approach yields more reliable results.
  22. Two-dimensional projections of word representations learned by our method (left) and the Bojanowski-based model (right). We projected the learned word representations into two dimensions using the t-SNE tool. All words were assigned to their groups correctly.
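
A minimal sketch of the t-SNE projection step using scikit-learn and matplotlib; the embedding matrix and word labels here are random placeholders, not the learned representations from the slides.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder embeddings standing in for learned word representations.
rng = np.random.default_rng(0)
words = ["king", "queen", "cat", "dog", "paris", "berlin"]
embeddings = rng.normal(size=(len(words), 100))

# Project to two dimensions with t-SNE; perplexity must be smaller than the sample count.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.title("t-SNE projection of word representations")
plt.show()
```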
  23. Conclusions. We have shown that our method outperforms state-of-the-art approaches on dense languages on tasks such as word similarity ranking and syntactic and semantic analogies. This research indicates that other methods of retrieving subword information should be investigated in depth.
  24. Thank you for your attention.