Slide 1

Journal Club
Keita Watanabe¹,²
¹Graduate School of Frontier Sciences, The University of Tokyo
²RIKEN Brain Science Institute
October 11, 2017

Slide 2

First of all... Today I would like to talk about a novel neural embedding method, StarSpace [Wu et al., 2017]. The paper assumes that readers understand Word2vec [Mikolov et al., 2013], and I guess most of you are unfamiliar with that algorithm, so I will also introduce Word2vec. To do that, I should briefly explain the "Distributional Hypothesis", so I will start from there [Harris, 1954, Firth, 1957].

Slide 3

Distributional Hypothesis

"The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to purport similar meanings [Harris, 1954]. The underlying idea that 'a word is characterized by the company it keeps' was popularized by Firth [Firth, 1957]." (From Wikipedia)

Slide 4

Word2vec [Mikolov et al., 2013]

"Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space." (From Wikipedia)
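For concreteness, here is a minimal training sketch using the gensim library (assuming gensim 4.x); the toy corpus and hyperparameter values are my own illustration, not something from the slides.

```python
# Minimal Word2vec training sketch with gensim (assumes gensim >= 4.0).
# The tiny corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the embedding space
    window=5,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram (the model on the next slides)
    negative=5,       # negative sampling with k = 5
)

print(model.wv["cat"].shape)         # (100,) vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours in the vector space
```

Nearest-neighbour queries like most_similar are the usual sanity check that the learned space reflects distributional similarity.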

Slide 5

The skip-gram model

∗ w: a word
∗ c: a context word (a word that appears around w)

J = -\sum_{w \in D} \sum_{c \in C_w} \log P(c \mid w)

where

P(c \mid w) = \frac{\exp(v_w \cdot \tilde{v}_c)}{\sum_{c' \in V} \exp(v_w \cdot \tilde{v}_{c'})}
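The toy snippet below (my own illustration, not from the slides) evaluates this objective directly with a full softmax over a small random vocabulary; the vocabulary size, dimension, and (word, context) pairs are made up.

```python
# Toy evaluation of the skip-gram objective J = -sum log P(c|w)
# with a full softmax over the vocabulary; all values are random.
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10, 4                       # vocabulary size, embedding dimension
v = rng.normal(size=(V, dim))        # "input" vectors v_w
v_tilde = rng.normal(size=(V, dim))  # "context" vectors v~_c

def log_p(c, w):
    """log P(c | w) under the softmax on the slide."""
    scores = v_tilde @ v[w]                      # v_w . v~_{c'} for every c' in V
    return scores[c] - np.log(np.exp(scores).sum())

# (word, context) pairs that would be observed in a corpus; indices are made up.
pairs = [(0, 1), (0, 2), (3, 4)]
J = -sum(log_p(c, w) for w, c in pairs)
print(J)
```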

Slide 6

Negative Sampling

P(c \mid w) can be approximated by

P(c \mid w) \approx P(+1 \mid w, c) \prod_{c' \in \mathrm{Unigram}_k(D)} P(-1 \mid w, c')

Here
∗ P(+1 \mid w, c): the probability that word c is one of the contexts of w
∗ P(-1 \mid w, c'): the probability that word c' is not a context of w
∗ \mathrm{Unigram}_k(D): k contexts sampled from the unigram distribution P(w); these are the (pseudo) negative samples
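As a small illustration of the sampling step (my own sketch, not the slides' code), the snippet below draws k pseudo-negative contexts from the empirical unigram distribution of a toy corpus. Note that the original Word2vec implementation samples from the unigram distribution raised to the 3/4 power; that detail is omitted here.

```python
# Draw k pseudo-negative contexts from the empirical unigram distribution.
# Corpus and k are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"]
vocab, counts = np.unique(corpus, return_counts=True)
unigram = counts / counts.sum()   # empirical P(w)

k = 5
negatives = rng.choice(vocab, size=k, p=unigram)
print(negatives)  # k contexts treated as negatives for the current (w, c) pair
```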

Slide 7

Thus,

P(c \mid w) \approx \sigma(v_w \cdot \tilde{v}_c) \prod_{c' \in \mathrm{Unigram}_k(D)} \sigma(-v_w \cdot \tilde{v}_{c'})

After all,

J = -\sum_{w \in D} \sum_{c \in C_w} \Big( \log \sigma(v_w \cdot \tilde{v}_c) + \sum_{c' \in \mathrm{Unigram}_k(D)} \log \sigma(-v_w \cdot \tilde{v}_{c'}) \Big)

Word2vec minimizes this objective with SGD.
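Below is a minimal numpy sketch of one SGD step on this negative-sampling objective for a single (w, c) pair plus k sampled negatives; it is my own illustration rather than the original word2vec C code, and the sizes and learning rate are arbitrary.

```python
# One SGD step on the negative-sampling objective for a single (w, c) pair.
# Sketch only; sizes, learning rate, and negatives are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
V, dim, k, lr = 10, 4, 5, 0.025
v = rng.normal(scale=0.1, size=(V, dim))        # input vectors v_w
v_tilde = rng.normal(scale=0.1, size=(V, dim))  # context vectors v~_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w, c, negatives):
    grad_w = np.zeros(dim)
    # Positive pair: gradient of -log sigmoid(v_w . v~_c)
    g = sigmoid(v[w] @ v_tilde[c]) - 1.0
    grad_w += g * v_tilde[c]
    v_tilde[c] -= lr * g * v[w]
    # Negative pairs: gradient of -log sigmoid(-v_w . v~_{c'})
    for c_neg in negatives:
        g = sigmoid(v[w] @ v_tilde[c_neg])
        grad_w += g * v_tilde[c_neg]
        v_tilde[c_neg] -= lr * g * v[w]
    v[w] -= lr * grad_w  # apply the accumulated gradient to v_w last

sgd_step(w=0, c=1, negatives=rng.integers(0, V, size=k))
```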

Slide 8

Omake (extras)

∗ [Levy and Goldberg, 2014] discusses the relationship between Word2vec and PMI (pointwise mutual information).
∗ [Arora et al., 2015] discusses mathematical operations in the vector space. A good summary of the article is available.
∗ The original paper was not crystal clear to me; the review papers [Rong, 2014, Goldberg and Levy, 2014] were quite helpful.

Slide 9

StarSpace [Wu et al., 2017]

Slide 10

StarSpace minimizes

\sum_{(a, b) \in E^+,\; b^- \in E^-} L^{batch}\big(\mathrm{sim}(a, b), \mathrm{sim}(a, b_1^-), \dots, \mathrm{sim}(a, b_k^-)\big)

∗ The generator of positive entity pairs (a, b) comes from the set E^+. This is task dependent and will be described subsequently.
∗ The generator of negative entities b_i^- comes from the set E^-; a k-negative sampling strategy [Mikolov et al., 2013] is used.
∗ The similarity function \mathrm{sim}(\cdot, \cdot): cosine similarity or inner product.
∗ The loss function L^{batch} compares the positive pair (a, b) with the negative pairs (a, b_i^-): ranking loss or the negative log loss of softmax (the latter as in Word2vec).
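To make the loss concrete, here is a small sketch of the ranking-loss variant for one positive pair and k negatives. It is my own illustration rather than the authors' implementation; the bag-of-features embedding, cosine similarity, and margin value are choices made for the example.

```python
# StarSpace-style margin ranking loss for one positive pair (a, b) and k
# negatives; sketch only, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
F, dim, margin = 100, 20, 0.2             # dictionary size, embedding dim, margin
E = rng.normal(scale=0.1, size=(F, dim))  # one embedding per feature in the dictionary

def embed(feature_ids):
    """Entities are bags of features; embed by summing their feature vectors."""
    return E[feature_ids].sum(axis=0)

def cos(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8)

def ranking_loss(a_ids, b_ids, neg_ids_list):
    a, b = embed(a_ids), embed(b_ids)
    pos = cos(a, b)
    # Hinge on each negative: penalize negatives scored within `margin` of b.
    return sum(max(0.0, margin - pos + cos(a, embed(n))) for n in neg_ids_list)

# Made-up feature indices: a document `a`, its label `b`, and k = 3 negatives.
print(ranking_loss(a_ids=[1, 5, 7], b_ids=[42], neg_ids_list=[[13], [66], [90]]))
```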

Slide 11

Example positive and negative samples

Multiclass classification (e.g. text classification): The positive pair generator comes directly from a training set of labeled data specifying (a, b) pairs, where a are documents (bags-of-words) and b are labels (singleton features). Negative entities b^- are sampled from the set of possible labels.

Learning sentence embeddings: Learning word embeddings (e.g. as above) and using them to embed sentences does not seem optimal when one can learn sentence embeddings directly. Given a training set of unlabeled documents, each consisting of sentences, we select a and b as a pair of sentences both coming from the same document; b^- are sentences coming from other documents.
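The two generators could look roughly like the sketch below; the data layout (lists of labeled documents, lists of sentences) and the sampling details are my assumptions for illustration, not the paper's code.

```python
# Illustrative pair generators for the two tasks above; data layout is made up.
import random

def classification_pairs(labeled_docs, all_labels, k):
    """labeled_docs: list of (bag_of_words, label). Yields (a, b, negatives)."""
    for words, label in labeled_docs:
        negatives = random.sample([l for l in all_labels if l != label],
                                  k=min(k, len(all_labels) - 1))
        yield words, [label], [[n] for n in negatives]

def sentence_pairs(documents, k):
    """documents: list of lists of sentences. Positive pair = two sentences from
    the same document; negatives = sentences drawn from other documents."""
    for i, doc in enumerate(documents):
        if len(doc) < 2:
            continue
        a, b = random.sample(doc, k=2)
        others = [s for j, d in enumerate(documents) if j != i for s in d]
        yield a, b, random.sample(others, k=min(k, len(others)))

docs = [["a b c", "d e f"], ["g h i", "j k l"]]
print(next(sentence_pairs(docs, k=2)))
```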

Slide 12

Multi-Task Learning: Any of these tasks can be combined and trained at the same time if they share some features in the base dictionary F. For example, one could combine supervised classification with unsupervised word or sentence embedding to give semi-supervised learning.
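A rough sketch of how such multi-task training could be wired up, assuming a shared embedding table updated by a single step function; E_update_fn, clf_gen, and sent_gen are hypothetical helpers, not names from the paper.

```python
# Multi-task training sketch: two tasks draw pairs from their own generators
# but every step updates the same shared embedding table.
import random

def train_multitask(E_update_fn, clf_gen, sent_gen, steps):
    """E_update_fn(a, b, negatives): one SGD step on the shared embeddings
    (hypothetical helper); clf_gen / sent_gen yield (a, b, negatives) triples."""
    for _ in range(steps):
        gen = random.choice([clf_gen, sent_gen])  # pick a task at random
        a, b, negatives = next(gen)
        E_update_fn(a, b, negatives)              # shared embeddings for both tasks
```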

Slide 13

References I

[Arora et al., 2015] Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. (2015). RAND-WALK: A Latent Variable Model Approach to Word Embeddings.
[Firth, 1957] Firth, J. R. (1957). A synopsis of linguistic theory.
[Goldberg and Levy, 2014] Goldberg, Y. and Levy, O. (2014). word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method.
[Harris, 1954] Harris, Z. S. (1954). Distributional Structure. WORD, 10(2-3):146–162.
[Levy and Goldberg, 2014] Levy, O. and Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Pages 2177–2185.
[Mikolov et al., 2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality.
[Rong, 2014] Rong, X. (2014). word2vec Parameter Learning Explained. arXiv.org.

Slide 14

References II

[Wu et al., 2017] Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., and Weston, J. (2017). StarSpace: Embed All The Things!