
Word2Vec: From intuition to practice using gensim

matiskay
September 09, 2016

Transcript

  1. WORD2VEC FROM INTUITION TO PRACTICE USING GENSIM. Edgar Marca, [email protected].
     Python Peru Meetup, September 1st, 2016, Lima - Perú.
  2. About Edgar Marca. Software Engineer at Love Mondays. One of the organizers of the
     Data Science Lima Meetup. Machine Learning and Data Science enthusiast. I speak a
     little Portuguese.
  3. Data Science Lima Meetup. Stats: 5 meetups held and the 6th around the corner;
     410 datanauts in the Meetup group; 329 people in the Facebook group. Organizers:
     Manuel Solorzano, Dennis Barreda, Freddy Cahuas, Edgar Marca.
  4. Data Never Sleeps. Figure: How much data is generated every minute? (Data Never
     Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)
  5. Introduction. Text is the core business of internet companies today. Machine learning
     and natural language processing techniques are applied to big datasets to improve
     search, ranking and many other tasks (spam detection, ad recommendations, email
     categorization, machine translation, speech recognition, etc.).
  6. One-hot Representation. One-hot encoding: represent every word as an R^|V| vector
     that is all 0s except for a 1 at the index of that word.
  7. One-hot Representation, Example. Let V = {the, hotel, nice, motel}. Then
     w_the = [1, 0, 0, 0]^T, w_hotel = [0, 1, 0, 0]^T, w_nice = [0, 0, 1, 0]^T,
     w_motel = [0, 0, 0, 1]^T. We represent each word as a completely independent entity,
     so this word representation does not directly give us any notion of similarity.
  8. One-hot Representation. For instance, ⟨w_hotel, w_motel⟩ = 0 and ⟨w_hotel, w_cat⟩ = 0.
     We can try to reduce the size of this space from R^4 to something smaller and find a
     subspace that encodes the relationships between words.
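
     A minimal sketch of the one-hot example from slides 7 and 8, assuming NumPy; the
     one_hot helper is made up for illustration, and the point is only that distinct
     one-hot vectors have zero dot product:

        import numpy as np

        vocabulary = ["the", "hotel", "nice", "motel"]

        def one_hot(word):
            """Return the R^|V| vector that is all 0s except a 1 at the word's index."""
            vec = np.zeros(len(vocabulary))
            vec[vocabulary.index(word)] = 1.0
            return vec

        w_hotel = one_hot("hotel")
        w_motel = one_hot("motel")

        print(np.dot(w_hotel, w_motel))  # 0.0 -> one-hot codes carry no similarity
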
  9. One-hot Representation, Problems. The dimension depends on the vocabulary size.
     Leads to data sparsity, so we need more data. Provides no useful information to the
     system. Encodings are arbitrary.
  10. Bag-of-words Representation. Sum of one-hot codes; ignores the order of words.
      Examples with vocabulary = (monday, tuesday, is, a, today):
      "Monday Monday" = [2, 0, 0, 0, 0], "today is monday" = [1, 0, 1, 0, 1],
      "today is tuesday" = [0, 1, 1, 0, 1], "is a monday today" = [1, 0, 1, 1, 1].
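
     A minimal sketch of the bag-of-words examples above in plain Python; the
     bag_of_words helper is made up for illustration:

        vocabulary = ["monday", "tuesday", "is", "a", "today"]

        def bag_of_words(sentence):
            """Count how often each vocabulary word occurs; word order is ignored."""
            tokens = sentence.lower().split()
            return [tokens.count(word) for word in vocabulary]

        print(bag_of_words("Monday Monday"))      # [2, 0, 0, 0, 0]
        print(bag_of_words("today is monday"))    # [1, 0, 1, 0, 1]
        print(bag_of_words("is a monday today"))  # [1, 0, 1, 1, 1]
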
  11. Language Modeling (Unigrams, Bigrams, etc.). A language model is a probabilistic
      model that assigns a probability to any sequence of n words, P(w_1, w_2, ..., w_n).
      Unigrams: assuming that the word occurrences are completely independent,
      P(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(w_i).
  12. Language Modeling (Unigrams, Bigrams, etc.). Bigrams: the probability of the
      sequence depends on the pairwise probability of a word in the sequence and the word
      next to it, P(w_1, w_2, ..., w_n) = ∏_{i=2}^{n} P(w_i | w_{i-1}).
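
     A minimal sketch of the unigram and bigram models above, estimating the
     probabilities by counting over a made-up toy corpus (sentence boundaries are
     ignored to keep it short):

        from collections import Counter

        corpus = ["today is monday", "today is tuesday", "is a monday today"]
        tokens = [w for sentence in corpus for w in sentence.split()]

        unigram_counts = Counter(tokens)
        bigram_counts = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)

        def p_unigram(w):
            return unigram_counts[w] / total

        def p_bigram(w, prev):
            # P(w | prev) = count(prev, w) / count(prev)
            return bigram_counts[(prev, w)] / unigram_counts[prev]

        # Unigram model: P(today, is, monday) = P(today) * P(is) * P(monday)
        print(p_unigram("today") * p_unigram("is") * p_unigram("monday"))

        # Bigram model: P(today, is, monday) = P(is | today) * P(monday | is)
        print(p_bigram("is", "today") * p_bigram("monday", "is"))
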
  13. Word Embeddings. A set of language modeling and feature learning techniques in NLP
      where words or phrases from the vocabulary are mapped to vectors of real numbers in
      a space of low dimension relative to the vocabulary size (a "continuous space").
      Vector space models (VSMs) represent (embed) words in a continuous vector space in
      which semantically similar words are mapped to nearby points. The basic idea is the
      Distributional Hypothesis: words that appear in the same contexts share semantic
      meaning.
  14. Word2Vec. Figure: the two original papers published in association with word2vec by
      Mikolov et al. (2013): "Efficient Estimation of Word Representations in Vector
      Space" (https://arxiv.org/abs/1301.3781) and "Distributed Representations of Words
      and Phrases and their Compositionality" (https://arxiv.org/abs/1310.4546).
  15. Word2Vec. v_king − v_man + v_woman ≈ v_queen; v_paris − v_france + v_italy ≈ v_rome.
      Learns from raw text. Made a huge splash in the NLP world. Comes pretrained (useful
      if you don't have any specialized vocabulary). Word2vec is a computationally
      efficient model for learning word embeddings. Word2Vec is a successful example of
      "shallow" learning: a very simple feedforward neural network with a single hidden
      layer, backpropagation, and no non-linearities.
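
     A minimal gensim sketch of training word2vec and asking the king/queen analogy
     question. Parameter names assume gensim >= 4 (older releases used size instead of
     vector_size), the toy sentences are made up, and the analogy only comes out right
     when training on a large corpus or loading pre-trained vectors:

        from gensim.models import Word2Vec

        # gensim expects an iterable of tokenized sentences.
        sentences = [
            ["the", "king", "rules", "the", "kingdom"],
            ["the", "queen", "rules", "the", "kingdom"],
            ["a", "man", "and", "a", "woman", "walk", "in", "paris"],
        ]

        model = Word2Vec(
            sentences,
            vector_size=100,  # dimensionality of the embedding space
            window=5,         # context window around each target word
            min_count=1,      # keep every word in this toy corpus
            sg=1,             # 1 = skip-gram, 0 = CBOW
        )

        # v_king - v_man + v_woman ~= v_queen
        print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
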
  16. What the Fuck Are Trump Supporters Thinking? They gathered four million tweets
      belonging to more than two thousand hard-core Trump supporters and trained word
      vectors on them. Distances between those vectors encoded the semantic distance
      between their associated words (e.g., the vector representation of the word
      "morons" was near "idiots" but far away from "funny"). Link:
      https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
  17. Takeaways. If you don't have enough data, you can use pre-trained models. Remember:
      garbage in, garbage out. Every dataset will give different results. Use Word2vec as
      a feature extractor.
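
     A minimal sketch of the "pre-trained model as feature extractor" takeaway: load
     pre-trained vectors with gensim's KeyedVectors and average them into one fixed-length
     feature vector per text. The file name is a placeholder for whichever pre-trained
     vectors you download (the Google News binary is one common choice):

        import numpy as np
        from gensim.models import KeyedVectors

        vectors = KeyedVectors.load_word2vec_format(
            "GoogleNews-vectors-negative300.bin", binary=True
        )

        def document_vector(text):
            """Average the vectors of in-vocabulary words: a simple document feature."""
            words = [w for w in text.lower().split() if w in vectors]
            if not words:
                return np.zeros(vectors.vector_size)
            return np.mean([vectors[w] for w in words], axis=0)

        features = document_vector("the hotel was nice")
        print(features.shape)  # (300,) -> ready for any downstream classifier
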