Slide 1

WORD2VEC: FROM INTUITION TO PRACTICE USING GENSIM. Edgar Marca, [email protected]. Python Peru Meetup, September 1st, 2016, Lima - Perú.

Slide 2

About Edgar Marca: Software Engineer at Love Mondays. One of the organizers of the Data Science Lima Meetup. Machine Learning and Data Science enthusiast. I speak a little Portuguese.

Slide 3

DATA SCIENCE LIMA MEETUP

Slide 4

Data Science Lima Meetup. Facts: 5 meetups held, with the 6th just around the corner. 410 datanauts in the Meetup group. 329 people in the Facebook group. Organizers: Manuel Solorzano, Dennis Barreda, Freddy Cahuas, Edgar Marca.

Slide 5

Data Science Lima Meetup. Figure: Photo of the fifth Data Science Lima Meetup.

Slide 6

DATA

Slide 7

Data Never Sleeps. Figure: How much data is generated every minute? (Data Never Sleeps 3.0: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)

Slide 8

NATURAL LANGUAGE PROCESSING

Slide 9

Introduction. Text is the core business of internet companies today. Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking and many other tasks (spam detection, ad recommendations, email categorization, machine translation, speech recognition, etc.).

Slide 10

Natural Language Processing. Problems with text: Messy. Irregularities of the language. Hierarchical structure. Sparse nature.

Slide 11

REPRESENTATIONS FOR TEXTS

Slide 12

Contextual Representation

Slide 13

How to learn good representations?

Slide 14

One-hot Representation. One-hot encoding: represent every word as a vector in R^|V| with all 0s and a single 1 at the index of that word.

Slide 15

One-hot Representation: Example. Let V = {the, hotel, nice, motel}. Then w_the = [1, 0, 0, 0]^T, w_hotel = [0, 1, 0, 0]^T, w_nice = [0, 0, 1, 0]^T, w_motel = [0, 0, 0, 1]^T. We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.
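
To make the encoding concrete, here is a minimal sketch (my own illustration, assuming NumPy and the toy vocabulary above) of one-hot vectors and why their dot products carry no similarity information:

```python
import numpy as np

# Toy vocabulary from the slide above.
vocabulary = ["the", "hotel", "nice", "motel"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for a word."""
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    return vector

w_hotel = one_hot("hotel")
w_motel = one_hot("motel")

# The dot product of any two distinct one-hot vectors is 0, so this
# representation carries no notion of similarity between words.
print(np.dot(w_hotel, w_motel))  # 0.0
```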

Slide 16

One-hot Representation. For instance, ⟨w_hotel, w_motel⟩ = 0 and ⟨w_hotel, w_cat⟩ = 0. We can try to reduce the size of this space from R^4 to something smaller and find a subspace that encodes the relationships between words.

Slide 17

One-hot Representation: Problems. The dimension depends on the vocabulary size. Leads to data sparsity, so we need more data. Provides no useful information to the system. Encodings are arbitrary.

Slide 18

Bag-of-words representation. Sum of one-hot codes. Ignores the order of words. Example: vocabulary = (monday, tuesday, is, a, today). "Monday Monday" = [2, 0, 0, 0, 0]; "today is monday" = [1, 0, 1, 1, 1]; "today is tuesday" = [0, 1, 1, 1, 1]; "is a monday today" = [1, 0, 1, 1, 1].
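
A minimal sketch of the bag-of-words idea (my own code, reproducing the examples above with the same toy vocabulary):

```python
from collections import Counter

vocabulary = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(sentence):
    """Sum of one-hot codes: count how often each vocabulary word appears."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

print(bag_of_words("Monday Monday"))      # [2, 0, 0, 0, 0]
print(bag_of_words("today is monday"))    # [1, 0, 1, 1, 1]
print(bag_of_words("is a monday today"))  # [1, 0, 1, 1, 1] -- word order is lost
```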

Slide 19

Distributional hypothesis. "You shall know a word by the company it keeps!" Firth (1957)

Slide 20

Language Modeling (Unigrams, Bigrams, etc.). A language model is a probabilistic model that assigns a probability to any sequence of n words, P(w_1, w_2, ..., w_n). Unigrams: assuming that the word occurrences are completely independent, P(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(w_i).
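
As a hedged illustration (the toy corpus below is my own, not from the slides), unigram probabilities can be estimated by maximum likelihood from raw counts and multiplied together:

```python
from collections import Counter

# Toy corpus, for illustration only.
corpus = "the hotel is nice the motel is nice".split()
counts = Counter(corpus)
total = len(corpus)

def p_unigram(word):
    """Maximum-likelihood estimate P(w) = count(w) / N."""
    return counts[word] / total

def p_sentence_unigram(sentence):
    """Unigram model: P(w1, ..., wn) = product of P(wi), assuming independence."""
    p = 1.0
    for word in sentence.split():
        p *= p_unigram(word)
    return p

print(p_sentence_unigram("the hotel is nice"))  # (2/8) * (1/8) * (2/8) * (2/8)
```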

Slide 21

Language Modeling (Unigrams, Bigrams, etc.). Bigrams: the probability of the sequence depends on the pairwise probability of each word in the sequence and the word that precedes it: P(w_1, w_2, ..., w_n) = ∏_{i=2}^{n} P(w_i | w_{i-1}).
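
And the bigram version of the same sketch (again my own toy corpus, with no smoothing for unseen bigrams):

```python
from collections import Counter

corpus = "the hotel is nice the motel is nice".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, previous):
    """Maximum-likelihood estimate of P(word | previous); no smoothing."""
    return bigram_counts[(previous, word)] / unigram_counts[previous]

def p_sentence_bigram(sentence):
    """Bigram model: P(w1, ..., wn) = product over i >= 2 of P(wi | wi-1)."""
    words = sentence.split()
    p = 1.0
    for previous, word in zip(words, words[1:]):
        p *= p_bigram(word, previous)
    return p

print(p_sentence_bigram("the hotel is nice"))  # (1/2) * (1/1) * (2/2) = 0.5
```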

Slide 22

Word Embeddings. A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size ("continuous space"). Vector space models (VSMs) represent (embed) words in a continuous vector space. Semantically similar words are mapped to nearby points. The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
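
To illustrate "nearby points", here is a small sketch using cosine similarity on hypothetical 3-dimensional embeddings (the numbers are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity of two embedding vectors: 1.0 means they point the same way."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings: semantically similar words should end up
# with vectors pointing in similar directions.
v_hotel = np.array([0.9, 0.1, 0.3])
v_motel = np.array([0.8, 0.2, 0.25])
v_cat = np.array([-0.1, 0.9, -0.4])

print(cosine_similarity(v_hotel, v_motel))  # close to 1
print(cosine_similarity(v_hotel, v_cat))    # much lower
```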

Slide 23

WORD2VEC

Slide 24

Distributional hypothesis. "You shall know a word by the company it keeps!" Firth (1957)

Slide 25

Word2Vec. Figure: The two original papers published in association with word2vec by Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781; Distributed Representations of Words and Phrases and their Compositionality, https://arxiv.org/abs/1310.4546.

Slide 26

Continuous Bag of Words and Skip-gram
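
A rough sketch (my own illustration, not the actual word2vec training code) of how the two architectures frame the prediction task over a sliding window: CBOW predicts the center word from its context, while skip-gram predicts each context word from the center word.

```python
def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from the surrounding context words."""
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        yield context, center

def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    for context, center in cbow_examples(tokens, window):
        for context_word in context:
            yield center, context_word

tokens = "the hotel is nice".split()
print(list(cbow_examples(tokens, window=1)))      # (['hotel'], 'the'), (['the', 'is'], 'hotel'), ...
print(list(skipgram_examples(tokens, window=1)))  # ('the', 'hotel'), ('hotel', 'the'), ('hotel', 'is'), ...
```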

Slide 27

Contextual Representation. A word is represented by its context in use.

Slide 28

Contextual Representation

Slide 29

Word Vectors

Slide 30

Word Vectors

Slide 31

Word Vectors

Slide 32

Word Vectors

Slide 33

Word2Vec. v_king − v_man + v_woman ≈ v_queen; v_paris − v_france + v_italy ≈ v_rome. Learns from raw text. Made a huge splash in the NLP world. Comes pre-trained (useful if you don't have any specialized vocabulary). Word2vec is a computationally efficient model for learning word embeddings. Word2Vec is a successful example of "shallow" learning: a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
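
A hedged sketch of these analogy queries with gensim, assuming the pre-trained GoogleNews vectors have been downloaded locally (the file name below is the commonly distributed one; depending on the gensim version the loader may be KeyedVectors.load_word2vec_format or Word2Vec.load_word2vec_format, and token casing depends on the pre-trained vocabulary):

```python
from gensim.models import KeyedVectors

# Assumption: GoogleNews-vectors-negative300.bin has been downloaded beforehand.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + italy ~= rome (city names may be capitalized in this vocabulary)
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
```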

Slide 34

Word2vec

Slide 35

Gensim
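
A minimal training sketch with gensim (the toy corpus is my own; hyperparameter names vary slightly across gensim versions, e.g. older releases use `size` instead of `vector_size`):

```python
from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences. In practice you would
# stream a large corpus (e.g. with gensim's LineSentence helper).
sentences = [
    ["the", "hotel", "is", "nice"],
    ["the", "motel", "is", "nice"],
    ["i", "stayed", "at", "a", "nice", "hotel"],
]

# sg=1 selects skip-gram; sg=0 selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, workers=2)

print(model.wv.most_similar("hotel", topn=3))
model.save("word2vec.model")  # full model, can continue training later
model.wv.save("word2vec.kv")  # just the vectors, lighter to load
```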

Slide 36

APPLICATIONS

Slide 37

What the Fuck Are Trump Supporters Thinking?

Slide 38

What the Fuck Are Trump Supporters Thinking?

Slide 39

What the Fuck Are Trump Supporters Thinking? They gathered four million tweets belonging to more than two thousand hard-core Trump supporters and trained word vectors on them. Distances between those vectors encoded the semantic distance between their associated words (e.g. the vector representation of the word "morons" was near "idiots" but far away from "funny"). Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d

Slide 40

Restaurant Recommendation. http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opent

Slide 41

Restaurant Recommendation. http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opent

Slide 42

Song Recommendations. Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting

Slide 43

TAKEAWAYS

Slide 44

Takeaways. If you don't have enough data, you can use pre-trained models. Remember: garbage in, garbage out. Different datasets will produce different results. Use Word2vec as a feature extractor.
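
One hedged way to use word2vec as a feature extractor, as the last takeaway suggests: average the word vectors of a text to get a fixed-length feature vector, a common (if crude) baseline. The tiny model below is trained only so the snippet runs end to end; in practice you would reuse the model from the Gensim sketch above or load pre-trained vectors.

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny model just so the example is self-contained.
sentences = [["the", "hotel", "is", "nice"], ["the", "motel", "is", "nice"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

def text_features(tokens, vectors):
    """Average the vectors of the in-vocabulary tokens of a text.
    Returns a zero vector if no token is known (crude but common baseline)."""
    known = [vectors[word] for word in tokens if word in vectors]
    if not known:
        return np.zeros(vectors.vector_size)
    return np.mean(known, axis=0)

# Feed the resulting features to any classifier or recommender.
features = text_features("the hotel is nice".split(), model.wv)
print(features.shape)  # (50,)
```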

Slide 45


Slide 46

Obrigado (Thank you)