Word2Vec: From intuition to practice using gensim

matiskay
September 09, 2016

Transcript

  1. WORD2VEC FROM INTUITION TO PRACTICE USING GENSIM Edgar Marca matiskay@gmail.com

    Python Peru Meetup September 1st, 2016 Lima - Perú
  2. About Edgar Marca Software Engineer at Love Mondays. One of

    the organizers of the Data Science Lima Meetup. Machine Learning and Data Science enthusiast. I speak a little Portuguese.
  3. DATA SCIENCE LIMA MEETUP

  4. Data Science Lima Meetup Facts: 5 meetups held and the 6th

    just around the corner. 410 Datanauts in the Meetup group. 329 people in the Facebook group. Organizers: Manuel Solorzano, Dennis Barreda, Freddy Cahuas, Edgar Marca.
  5. Data Science Lima Meetup Figure: Photo from the fifth Data Science

    Lima Meetup.
  6. DATA

  7. Data Never Sleeps Figure: How much data is generated every

    minute? (Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)
  8. NATURAL LANGUAGE PROCESSING

  9. Introduction Text is the core business of internet companies today.

    Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking and many other tasks (spam detection, ad recommendations, email categorization, machine translation, speech recognition, etc.).
  10. Natural Language Processing Problems with text: Messy. Irregularities of the

    language. Hierarchical structure. Sparse nature.
  11. REPRESENTATIONS FOR TEXTS

  12. Contextual Representation

  13. How to learn good representations?

  14. One-hot Representation One-hot encoding: Represent every word as an R^|V|

    vector with all 0s and a single 1 at the index of that word.
  15. One-hot Representation Example: Let V = {the, hotel, nice,

    motel}. Then w_the = [1, 0, 0, 0], w_hotel = [0, 1, 0, 0], w_nice = [0, 0, 1, 0], w_motel = [0, 0, 0, 1]. We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.
  16. One-hot Representation For instance, ⟨w_hotel, w_motel⟩ = 0 (1) and ⟨w_hotel,

    w_cat⟩ = 0 (2). We can try to reduce the size of this space from R^4 to something smaller and find a subspace that encodes the relationships between words.
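
    The zero inner product above can be checked with a quick NumPy sketch (the toy vocabulary follows the example; variable names are illustrative):

    ```python
    import numpy as np

    # Toy vocabulary from the example above.
    vocabulary = ["the", "hotel", "nice", "motel"]
    index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        """Return an R^|V| vector with a single 1 at the word's index."""
        vector = np.zeros(len(vocabulary))
        vector[index[word]] = 1.0
        return vector

    w_hotel, w_motel = one_hot("hotel"), one_hot("motel")

    # The inner product of two distinct one-hot vectors is always 0, so this
    # representation carries no notion of similarity between words.
    print(np.dot(w_hotel, w_motel))  # 0.0
    ```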
  17. One-hot Representation Problems: The dimension depends on the vocabulary size.

    Leads to data sparsity, so we need more data. Provides no useful information to the system. Encodings are arbitrary.
  18. Bag-of-words representation Sum of one-hot codes. Ignores the order of words.

    Examples: vocabulary = (monday, tuesday, is, a, today) Monday Monday = [2, 0, 0, 0, 0] today is monday = [1, 0, 1, 0, 1] today is tuesday = [0, 1, 1, 0, 1] is a monday today = [1, 0, 1, 1, 1]
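
    A minimal hand-rolled sketch of the same counts (gensim's Dictionary.doc2bow gives the equivalent information as sparse (index, count) pairs):

    ```python
    # Toy vocabulary from the slide; note that word order is ignored.
    vocabulary = ["monday", "tuesday", "is", "a", "today"]

    def bag_of_words(sentence):
        """Sum of one-hot codes: count how many times each vocabulary word occurs."""
        tokens = sentence.lower().split()
        return [tokens.count(word) for word in vocabulary]

    print(bag_of_words("Monday Monday"))      # [2, 0, 0, 0, 0]
    print(bag_of_words("today is monday"))    # [1, 0, 1, 0, 1]
    print(bag_of_words("is a monday today"))  # [1, 0, 1, 1, 1]
    ```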
  19. Distributional hypothesis You shall know a word by the company

    it keeps! Firth (1957)
  20. Language Modeling (Unigrams, Bigrams, etc.) A language model is a

    probabilistic model that assigns a probability to any sequence of n words, P(w_1, w_2, . . . , w_n). Unigrams: Assuming that the word occurrences are completely independent, P(w_1, w_2, . . . , w_n) = Π_{i=1}^{n} P(w_i) (3)
  21. Language Modeling (Unigrams, Bigrams, etc.) Bigrams: The probability of the

    sequence depends on the pairwise probability of a word in the sequence and the word next to it. P(w_1, w_2, . . . , w_n) = Π_{i=2}^{n} P(w_i | w_{i−1}) (4)
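
    A rough maximum-likelihood illustration of equations (3) and (4) on a made-up toy corpus (the corpus and helper names are only for the example):

    ```python
    from collections import Counter

    # Tiny toy corpus; probabilities are plain maximum-likelihood counts.
    corpus = "today is monday . today is tuesday . monday is a day".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    total = len(corpus)

    def p_unigram(words):
        """Equation (3): P(w_1..w_n) = product of P(w_i), words assumed independent."""
        p = 1.0
        for w in words:
            p *= unigram_counts[w] / total
        return p

    def p_bigram(words):
        """Equation (4): each word is conditioned only on the previous word."""
        p = 1.0
        for prev, w in zip(words, words[1:]):
            p *= bigram_counts[(prev, w)] / unigram_counts[prev]
        return p

    print(p_unigram(["today", "is", "monday"]))
    print(p_bigram(["today", "is", "monday"]))
    ```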
  22. Word Embeddings Word embeddings: A set of language modeling and

    feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size ("continuous space"). Vector space models (VSMs) represent (embed) words in a continuous vector space. Semantically similar words are mapped to nearby points. The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
  23. WORD2VEC

  24. Distributional hypothesis You shall know a word by the company

    it keeps! Firth (1957)
  25. Word2Vec Figure: The two original papers published in association with word2vec

    by Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781. Distributed Representations of Words and Phrases and their Compositionality, https://arxiv.org/abs/1310.4546.
  26. Continuous Bag of Words and Skip-gram
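
    In gensim the two architectures are selected with a single flag; a minimal sketch, assuming a tokenized corpus and a pre-4.0 gensim (newer releases rename `size` to `vector_size`):

    ```python
    from gensim.models import Word2Vec

    # `sentences` is any iterable of tokenized sentences (lists of strings).
    sentences = [["today", "is", "monday"], ["today", "is", "tuesday"]]

    # sg=0 trains CBOW (predict a word from its surrounding context);
    # sg=1 trains Skip-gram (predict the context from the word).
    cbow = Word2Vec(sentences, size=50, window=2, min_count=1, sg=0)
    skipgram = Word2Vec(sentences, size=50, window=2, min_count=1, sg=1)
    ```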

  27. Contextual Representation A word is represented by its context in use.

  28. Contextual Representation

  29. Word Vectors

  30. Word Vectors

  31. Word Vectors

  32. Word Vectors

  33. Word2Vec v_king − v_man + v_woman ≈ v_queen, v_paris −

    v_france + v_italy ≈ v_rome. Learns from raw text. Made a huge splash in the NLP world. Comes pretrained (if you don't have any specialized vocabulary). Word2vec is a computationally efficient model for learning word embeddings. Word2Vec is a successful example of "shallow" learning: a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
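
    With gensim, the analogies above are a one-line query; a sketch assuming the publicly distributed pretrained Google News vectors have already been downloaded:

    ```python
    from gensim.models import KeyedVectors

    # Adjust the path to wherever the pretrained file lives on your machine.
    model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # v_king - v_man + v_woman ≈ v_queen: most_similar does the vector arithmetic
    # and returns the nearest vocabulary words by cosine similarity.
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # e.g. [('queen', 0.71...)]

    print(model.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
    # e.g. [('Rome', ...)]
    ```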
  34. Word2vec

  35. Gensim
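
    A minimal training sketch with gensim on a toy corpus (real corpora should be streamed; `size` is `vector_size` on gensim >= 4, and older releases expose the vectors directly on the model rather than on `model.wv`):

    ```python
    from gensim.models import Word2Vec

    # Any iterable of tokenized sentences works; for large corpora gensim
    # provides streaming helpers such as gensim.models.word2vec.LineSentence.
    sentences = [
        ["the", "hotel", "was", "nice"],
        ["the", "motel", "was", "nice"],
        ["the", "food", "was", "terrible"],
    ]

    model = Word2Vec(sentences, size=50, window=2, min_count=1, workers=2)

    # Learned vector for a word and its nearest neighbours in the toy corpus.
    print(model.wv["hotel"][:5])
    print(model.wv.most_similar("hotel", topn=2))

    model.save("word2vec.model")  # persist the model for later use
    ```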

  36. APPLICATIONS

  37. What the Fuck Are Trump Supporters Thinking?

  38. What the Fuck Are Trump Supporters Thinking?

  39. What the Fuck Are Trump Supporters Thinking? They gathered four

    million tweets belonging to more than two thousand hard-core Trump supporters and trained word vectors on them. Distances between those vectors encoded the semantic distance between their associated words (e.g. the vector representation of the word morons was near idiots but far away from funny). Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
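
    The comparison described can be reproduced with gensim's similarity call; a sketch using pretrained vectors as a stand-in, since the tweet-trained model is not public:

    ```python
    from gensim.models import KeyedVectors

    # Any pretrained vectors whose vocabulary contains these words illustrate the idea.
    vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # Cosine similarity between word vectors: semantically close words score higher.
    print(vectors.similarity("morons", "idiots"))  # comparatively high
    print(vectors.similarity("morons", "funny"))   # comparatively low
    ```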
  40. Restaurant Recommendation. http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opent

  41. Restaurant Recommendation. http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opent

  42. Song Recommendations Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting

  43. TAKEAWAYS

  44. Takeaways If you don't have enough data you can use

    pre-trained models. Remember: garbage in, garbage out. Every data set will produce different results. Use Word2vec as a feature extractor.
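
    One common way to use word2vec as a feature extractor is to average the word vectors of a text and feed the result to a downstream model; a sketch assuming pretrained vectors (the averaging helper is illustrative, not a gensim API):

    ```python
    import numpy as np
    from gensim.models import KeyedVectors

    # Pretrained vectors as a stand-in for your own model; the file is the
    # commonly distributed Google News archive, downloaded separately.
    model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def document_vector(tokens, vectors):
        """Average the vectors of in-vocabulary tokens: word2vec as a feature extractor."""
        in_vocab = [vectors[token] for token in tokens if token in vectors]
        if not in_vocab:
            return np.zeros(vectors.vector_size)
        return np.mean(in_vocab, axis=0)

    features = document_vector("the hotel was nice".split(), model)
    print(features.shape)  # (300,) for these pretrained vectors
    ```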
  45.

  46. Obrigado (Thank you!)