Journal Club: Star Space

Keita Watanabe

October 30, 2021


  1. Journal Club

    Keita Watanabe (1, 2)
    (1) Graduate School of Frontier Science, The University of Tokyo
    (2) RIKEN Brain Science Institute
    Keita Watanabe, October 11, 2017
  2. First of all...

    Today I would like to talk about StarSpace [Wu et al., 2017], a novel neural embedding method. The paper assumes that readers understand Word2vec [Mikolov et al., 2013]. Since most of you are probably unfamiliar with that algorithm, I will introduce Word2vec as well. To do that, I first need to briefly explain the distributional hypothesis [Harris, 1954, Firth, 1957], so I will start there.
  3. Distributional Hypothesis

    “The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to purport similar meanings [Harris, 1954]. The underlying idea that ‘a word is characterized by the company it keeps’ was popularized by Firth [Firth, 1957].” (From Wikipedia)
  4. Word2vec [Mikolov et al., 2013]

    “Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.” (From Wikipedia)
  5. The skip-gram model

    ∗ w: a word
    ∗ c: a context (a word that appeared around w)

    $$J = -\sum_{w \in D} \sum_{c \in C_w} \log P(c \mid w)$$

    where

    $$P(c \mid w) = \frac{\exp(v_w \cdot \tilde{v}_c)}{\sum_{c' \in V} \exp(v_w \cdot \tilde{v}_{c'})}$$

    Here $v_w$ is the vector of word w and $\tilde{v}_c$ is the vector of context c.
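The softmax above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the names `skipgram_softmax`, `word_vecs` and `context_vecs`, the toy vocabulary size, and the random vectors are all hypothetical and do not come from the slides or from the word2vec code.

```python
import numpy as np

def skipgram_softmax(word_vecs, context_vecs, w):
    """Return P(c | w) for every context c in the vocabulary."""
    # Dot product v_w . v~_c for all contexts c at once.
    scores = context_vecs @ word_vecs[w]
    # Subtract the maximum for numerical stability; the ratio is unchanged.
    scores = scores - scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy setup (assumed): 5 words, 3-dimensional random vectors.
rng = np.random.default_rng(0)
V, dim = 5, 3
word_vecs = rng.normal(size=(V, dim))      # v_w
context_vecs = rng.normal(size=(V, dim))   # v~_c

probs = skipgram_softmax(word_vecs, context_vecs, w=2)
# One term of J: the negative log-probability of an observed context (index 4).
loss_term = -np.log(probs[4])
```

The denominator sums over the whole vocabulary V, which is what makes the exact softmax expensive for large corpora.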
  6. Negative Sampling

    P(c|w) can be approximated by

    $$P(c \mid w) \approx P(+1 \mid w, c) \prod_{c' \in \mathrm{Unigram}_k(D)} P(-1 \mid w, c')$$

    Here
    ∗ $P(+1 \mid w, c)$: the probability that word c is one of the contexts of w
    ∗ $P(-1 \mid w, c')$: the probability that word c′ is not a context of w
    ∗ $\mathrm{Unigram}_k(D)$: k contexts sampled from the unigram distribution P(w), used as (pseudo) negative samples
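Drawing the k pseudo-negative contexts might look like the following minimal sketch. It samples from the plain unigram distribution P(w), as written on the slide; the helper names `build_unigram` and `sample_negatives` and the toy corpus are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def build_unigram(corpus_tokens):
    """Return the vocabulary and the unigram probability of each word."""
    counts = Counter(corpus_tokens)
    vocab = sorted(counts)
    freqs = np.array([counts[t] for t in vocab], dtype=float)
    return vocab, freqs / freqs.sum()

def sample_negatives(vocab, probs, k, rng):
    """Draw k (pseudo) negative contexts from the unigram distribution."""
    idx = rng.choice(len(vocab), size=k, p=probs)
    return [vocab[i] for i in idx]

# Toy corpus (assumed).
tokens = "the cat sat on the mat the end".split()
vocab, probs = build_unigram(tokens)
rng = np.random.default_rng(0)
negatives = sample_negatives(vocab, probs, k=3, rng=rng)
```

The samples are "pseudo" negatives because a drawn word may occasionally be a true context of w. Mikolov et al. (2013) report that raising the unigram counts to the 3/4 power before normalizing worked better in their experiments; that variant is not shown here.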
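  7. Thus, modeling each probability with the sigmoid function σ,

    $$P(c \mid w) \approx \sigma(v_w \cdot \tilde{v}_c) \prod_{c' \in \mathrm{Unigram}_k(D)} \sigma(-v_w \cdot \tilde{v}_{c'})$$

    Putting it together,

    $$J = -\sum_{w \in D} \sum_{c \in C_w} \left( \log \sigma(v_w \cdot \tilde{v}_c) + \sum_{c' \in \mathrm{Unigram}_k(D)} \log \sigma(-v_w \cdot \tilde{v}_{c'}) \right)$$

    Word2vec minimizes this objective with stochastic gradient descent (SGD).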
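For a single (w, c) pair, the term inside the double sum can be computed as below. This is a sketch, not the word2vec implementation: the function names and the hand-picked 2-dimensional vectors are assumptions.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigma(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def neg_sampling_loss(v_w, v_c, v_negs):
    """Negative-sampling loss for one word, one true context, k negatives.

    v_w:    vector of the word w, shape (dim,)
    v_c:    vector of the observed context c, shape (dim,)
    v_negs: vectors of the k sampled negative contexts, shape (k, dim)
    """
    pos = log_sigmoid(v_w @ v_c)                # log sigma(v_w . v~_c)
    neg = log_sigmoid(-(v_negs @ v_w)).sum()    # sum of log sigma(-v_w . v~_c')
    return -(pos + neg)

# Toy vectors (assumed).
v_w = np.array([1.0, 0.0])
v_c = np.array([1.0, 0.0])
v_negs = np.array([[-1.0, 0.0],
                   [0.0, 1.0]])
loss = neg_sampling_loss(v_w, v_c, v_negs)
```

Unlike the full softmax, each term touches only k + 1 context vectors, so the cost per update does not grow with the vocabulary size.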
  8. Omake (extras)

    ∗ [Levy and Goldberg, 2014]: discusses the relationship between Word2vec and PMI (pointwise mutual information)
    ∗ [Arora et al., 2015]: discusses mathematical operations in the vector space. Good summary of the article.
    ∗ The original paper was not crystal clear to me. The review papers [Rong, 2014, Goldberg and Levy, 2014] were quite helpful.
  9. Objective function

    $$\sum_{(a, b) \in E^+,\; b^- \in E^-} L^{\mathrm{batch}}\left(\mathrm{sim}(a, b), \mathrm{sim}(a, b_1^-), \mathrm{sim}(a, b_2^-), \ldots, \mathrm{sim}(a, b_k^-)\right)$$

    ∗ The generator of positive entity pairs (a, b) coming from the set $E^+$. This is task dependent and will be described subsequently.
    ∗ The generator of negative entities $b_i^-$ coming from the set $E^-$. We utilize a k-negative sampling strategy [Mikolov et al., 2013].
    ∗ The similarity function sim(·, ·): cosine similarity or inner product.
    ∗ The loss function that compares the positive pair (a, b) with the negative pairs (a, $b_i^-$): ranking loss or negative log loss of softmax (same as Word2vec).
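One possible instance of the batch loss for a single positive pair is sketched below, using cosine similarity and a hinge-style margin ranking loss. The exact hinge form, the margin value, and the function names are assumptions made for illustration; the slide only says "ranking loss" and does not spell out the formula.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def margin_ranking_loss(a, b_pos, b_negs, margin=0.05, sim=cosine):
    """Hinge ranking loss over one positive pair and k negatives (assumed form).

    Each negative contributes max(0, margin - sim(a, b) + sim(a, b_i^-)),
    which is zero once the positive pair outscores it by at least `margin`.
    """
    pos = sim(a, b_pos)
    return sum(max(0.0, margin - pos + sim(a, b_neg)) for b_neg in b_negs)

# Toy entity vectors (assumed).
a = np.array([1.0, 0.0])
b_pos = np.array([1.0, 0.0])
b_negs = [np.array([0.0, 1.0]),   # orthogonal to a: already ranked below b_pos
          np.array([1.0, 0.0])]   # as similar as b_pos: violates the margin
loss = margin_ranking_loss(a, b_pos, b_negs, margin=0.1)
```

Passing `sim=lambda u, v: float(u @ v)` switches to the inner product, the other similarity listed above.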
  10. Examples of positive and negative samples

    Multiclass Classification (e.g. Text Classification): The positive pair generator comes directly from a training set of labeled data specifying (a, b) pairs where a are documents (bags-of-words) and b are labels (singleton features). Negative entities $b^-$ are sampled from the set of possible labels.

    Learning Sentence Embeddings: Learning word embeddings (e.g. as above) and using them to embed sentences does not seem optimal when you can learn sentence embeddings directly. Given a training set of unlabeled documents, each consisting of sentences, we select a and b as a pair of sentences both coming from the same document; $b^-$ are sentences coming from other documents.
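The sentence-embedding generator described above could be sketched as follows. The function name `sample_sentence_triplet`, the data layout (a list of documents, each a list of sentence strings), and the toy documents are assumptions made for illustration.

```python
import random

def sample_sentence_triplet(documents, k, rng):
    """Return (a, b, negatives) for the sentence-embedding task.

    a and b are two different sentences from one document (positive pair);
    negatives are k sentences drawn from the other documents.
    """
    # Only documents with at least two sentences can supply a positive pair.
    eligible = [i for i, doc in enumerate(documents) if len(doc) >= 2]
    doc_idx = rng.choice(eligible)
    a, b = rng.sample(documents[doc_idx], 2)
    # Pool of candidate negatives: every sentence outside the chosen document.
    others = [s for i, doc in enumerate(documents) if i != doc_idx for s in doc]
    negatives = [rng.choice(others) for _ in range(k)]
    return a, b, negatives

# Toy documents (assumed).
docs = [
    ["Cats purr.", "Cats sleep a lot.", "Kittens play."],
    ["Stocks fell today.", "Bonds rallied."],
    ["It rained in Tokyo."],
]
a, b, negatives = sample_sentence_triplet(docs, k=4, rng=random.Random(0))
```

For the classification task, the same shape applies with a as a document, b as its label, and negatives drawn from the set of possible labels.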
  11. Multi-Task Learning

    Any of these tasks can be combined and trained at the same time if they share some features in the base dictionary F. For example, one could combine supervised classification with unsupervised word or sentence embedding to give semi-supervised learning.
  12. References I

    [Arora et al., 2015] Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. (2015). RAND-WALK: A Latent Variable Model Approach to Word Embeddings.
    [Firth, 1957] Firth, J. R. (1957). A synopsis of linguistic theory.
    [Goldberg and Levy, 2014] Goldberg, Y. and Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.
    [Harris, 1954] Harris, Z. S. (1954). Distributional Structure. WORD, 10(2-3):146–162.
    [Levy and Goldberg, 2014] Levy, O. and Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. pages 2177–2185.
    [Mikolov et al., 2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality.
    [Rong, 2014] Rong, X. (2014). word2vec Parameter Learning Explained. arXiv.org.
  13. References II

    [Wu et al., 2017] Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., and Weston, J. (2017). StarSpace: Embed All The Things!