Word2Vec: From intuition to practice using gensim

matiskay
September 09, 2016

Transcript

  1. WORD2VEC FROM INTUITION TO PRACTICE USING GENSIM Edgar Marca matiskay@gmail.com

    Python Peru Meetup September 1st, 2016 Lima - Perú
  2. About Edgar Marca Software Engineer at Love Mondays. One of

    the organizers of the Data Science Lima Meetup. Machine Learning and Data Science enthusiast. I speak a little Portuguese.
  3. DATA SCIENCE LIMA MEETUP

  4. Data Science Lima Meetup Facts: 5 meetups held and the 6th

    just around the corner. 410 Datanauts in the Meetup group. 329 people in the Facebook group. Organizers: Manuel Solorzano, Dennis Barreda, Freddy Cahuas, Edgar Marca.
  5. Data Science Lima Meetup Figure: Photo from the fifth Data Science

    Lima Meetup.
  6. DATA

  7. Data Never Sleeps Figure: How much data is generated every

    minute? (Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)
  8. NATURAL LANGUAGE PROCESSING

  9. Introduction Text is the core business of internet companies today.

    Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking and many other tasks (spam detection, ad recommendations, email categorization, machine translation, speech recognition, etc.).
  10. Natural Language Processing Problems with text: Messy. Irregularities of the

    language. Hierarchical structure. Sparse nature.
  11. REPRESENTATIONS FOR TEXTS

  12. Contextual Representation

  13. How to learn good representations?

  14. One-hot Representation One-hot encoding: Represent every word as an R^|V|

    vector with all 0s and a single 1 at the index of that word.
  15. One-hot Representation Example: Let V = {the, hotel, nice,

    motel}. Then w_the = [1, 0, 0, 0], w_hotel = [0, 1, 0, 0], w_nice = [0, 0, 1, 0], w_motel = [0, 0, 0, 1]. We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.
  16. One-hot Representation For instance, ⟨w_hotel, w_motel⟩ = 0 (1) and ⟨w_hotel,

    w_cat⟩ = 0 (2). We can try to reduce the size of this space from R^4 to something smaller and find a subspace that encodes the relationships between words.
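
    The zero inner product above can be checked with a quick NumPy sketch (the toy vocabulary follows the example; variable names are illustrative):

    ```python
    import numpy as np

    # Toy vocabulary from the example above.
    vocabulary = ["the", "hotel", "nice", "motel"]
    index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        """Return an R^|V| vector with a single 1 at the word's index."""
        vector = np.zeros(len(vocabulary))
        vector[index[word]] = 1.0
        return vector

    w_hotel, w_motel = one_hot("hotel"), one_hot("motel")

    # The inner product of two distinct one-hot vectors is always 0, so this
    # representation carries no notion of similarity between words.
    print(np.dot(w_hotel, w_motel))  # 0.0
    ```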
  17. One-hot Representation Problems: The dimension depends on the vocabulary size.

    Leads to data sparsity, so we need more data. Provides no useful information to the system. Encodings are arbitrary.
  18. Bag-of-words representation Sum of one-hot codes. Ignores the order of words.

    Examples: vocabulary = (monday, tuesday, is, a, today) Monday Monday = [2, 0, 0, 0, 0] today is monday = [1, 0, 1, 0, 1] today is tuesday = [0, 1, 1, 0, 1] is a monday today = [1, 0, 1, 1, 1]
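
    A minimal hand-rolled sketch of the same counts (gensim's Dictionary.doc2bow gives the equivalent information as sparse (index, count) pairs):

    ```python
    # Toy vocabulary from the slide; note that word order is ignored.
    vocabulary = ["monday", "tuesday", "is", "a", "today"]

    def bag_of_words(sentence):
        """Sum of one-hot codes: count how many times each vocabulary word occurs."""
        tokens = sentence.lower().split()
        return [tokens.count(word) for word in vocabulary]

    print(bag_of_words("Monday Monday"))      # [2, 0, 0, 0, 0]
    print(bag_of_words("today is monday"))    # [1, 0, 1, 0, 1]
    print(bag_of_words("is a monday today"))  # [1, 0, 1, 1, 1]
    ```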
  19. Distributional hypothesis You shall know a word by the company

    it keeps! Firth (1957)
  20. Language Modeling (Unigrams, Bigrams, etc.) A language model is a

    probabilistic model that assigns a probability to any sequence of n words, P(w_1, w_2, . . . , w_n). Unigrams: Assuming that the word occurrences are completely independent, P(w_1, w_2, . . . , w_n) = Π_{i=1}^{n} P(w_i) (3)
  21. Language Modeling (Unigrams, Bigrams, etc.) Bigrams: The probability of the

    sequence depends on the pairwise probability of a word in the sequence and the word next to it. P(w_1, w_2, . . . , w_n) = Π_{i=2}^{n} P(w_i | w_{i−1}) (4)
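
    A rough maximum-likelihood illustration of equations (3) and (4) on a made-up toy corpus (the corpus and helper names are only for the example):

    ```python
    from collections import Counter

    # Tiny toy corpus; probabilities are plain maximum-likelihood counts.
    corpus = "today is monday . today is tuesday . monday is a day".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    total = len(corpus)

    def p_unigram(words):
        """Equation (3): P(w_1..w_n) = product of P(w_i), words assumed independent."""
        p = 1.0
        for w in words:
            p *= unigram_counts[w] / total
        return p

    def p_bigram(words):
        """Equation (4): each word is conditioned only on the previous word."""
        p = 1.0
        for prev, w in zip(words, words[1:]):
            p *= bigram_counts[(prev, w)] / unigram_counts[prev]
        return p

    print(p_unigram(["today", "is", "monday"]))
    print(p_bigram(["today", "is", "monday"]))
    ```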
  22. Word Embeddings Word embeddings: A set of language modeling and

    feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size ("continuous space"). Vector space models (VSMs) represent (embed) words in a continuous vector space. Semantically similar words are mapped to nearby points. The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
  23. WORD2VEC

  24. Distributional hypothesis You shall know a word by the company

    it keeps! Firth (1957)
  25. Word2Vec Figure: The two original papers published in association with word2vec

    by Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781. Distributed Representations of Words and Phrases and their Compositionality, https://arxiv.org/abs/1310.4546.
  26. Continuous Bag of Words and Skip-gram
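
    In gensim the two architectures are selected with a single flag; a minimal sketch, assuming a tokenized corpus and a pre-4.0 gensim (newer releases rename `size` to `vector_size`):

    ```python
    from gensim.models import Word2Vec

    # `sentences` is any iterable of tokenized sentences (lists of strings).
    sentences = [["today", "is", "monday"], ["today", "is", "tuesday"]]

    # sg=0 trains CBOW (predict a word from its surrounding context);
    # sg=1 trains Skip-gram (predict the context from the word).
    cbow = Word2Vec(sentences, size=50, window=2, min_count=1, sg=0)
    skipgram = Word2Vec(sentences, size=50, window=2, min_count=1, sg=1)
    ```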

  27. Contextual Representation A word is represented by its context in use.

  28. Contextual Representation

  29. Word Vectors

  30. Word Vectors

  31. Word Vectors

  32. Word Vectors

  33. Word2Vec v_king − v_man + v_woman ≈ v_queen, v_paris −

    v_france + v_italy ≈ v_rome. Learns from raw text. Made a huge splash in the NLP world. Comes pretrained (if you don't have any specialized vocabulary). Word2vec is a computationally efficient model for learning word embeddings. Word2Vec is a successful example of "shallow" learning: a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
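
    With gensim, the analogies above are a one-line query; a sketch assuming the publicly distributed pretrained Google News vectors have already been downloaded:

    ```python
    from gensim.models import KeyedVectors

    # Adjust the path to wherever the pretrained file lives on your machine.
    model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # v_king - v_man + v_woman ≈ v_queen: most_similar does the vector arithmetic
    # and returns the nearest vocabulary words by cosine similarity.
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # e.g. [('queen', 0.71...)]

    print(model.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
    # e.g. [('Rome', ...)]
    ```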
  34. Word2vec

  35. Gensim
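
    A minimal training sketch with gensim on a toy corpus (real corpora should be streamed; `size` is `vector_size` on gensim >= 4, and older releases expose the vectors directly on the model rather than on `model.wv`):

    ```python
    from gensim.models import Word2Vec

    # Any iterable of tokenized sentences works; for large corpora gensim
    # provides streaming helpers such as gensim.models.word2vec.LineSentence.
    sentences = [
        ["the", "hotel", "was", "nice"],
        ["the", "motel", "was", "nice"],
        ["the", "food", "was", "terrible"],
    ]

    model = Word2Vec(sentences, size=50, window=2, min_count=1, workers=2)

    # Learned vector for a word and its nearest neighbours in the toy corpus.
    print(model.wv["hotel"][:5])
    print(model.wv.most_similar("hotel", topn=2))

    model.save("word2vec.model")  # persist the model for later use
    ```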

  36. APPLICATIONS

  37. What the Fuck Are Trump Supporters Thinking?

  38. What the Fuck Are Trump Supporters Thinking?

  39. What the Fuck Are Trump Supporters Thinking? They gathered four

    million tweets belonging to more than two thousand hard-core Trump supporters and trained word vectors on them. Distances between those vectors encoded the semantic distance between their associated words (e.g. the vector representation of the word morons was near idiots but far away from funny). Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
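
    The comparison described can be reproduced with gensim's similarity call; a sketch using pretrained vectors as a stand-in, since the tweet-trained model is not public:

    ```python
    from gensim.models import KeyedVectors

    # Any pretrained vectors whose vocabulary contains these words illustrate the idea.
    vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # Cosine similarity between word vectors: semantically close words score higher.
    print(vectors.similarity("morons", "idiots"))  # comparatively high
    print(vectors.similarity("morons", "funny"))   # comparatively low
    ```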
  40. Restaurant Recommendation. http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opent

  41. Restaurant Recommendation. http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opent

  42. Song Recommendations Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting

  43. TAKEAWAYS

  44. Takeaways If you don't have enough data you can use

    pre-trained models. Remember: garbage in, garbage out. Every data set will produce different results. Use Word2vec as a feature extractor.
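
    One common way to use word2vec as a feature extractor is to average the word vectors of a text and feed the result to a downstream model; a sketch assuming pretrained vectors (the averaging helper is illustrative, not a gensim API):

    ```python
    import numpy as np
    from gensim.models import KeyedVectors

    # Pretrained vectors as a stand-in for your own model; the file is the
    # commonly distributed Google News archive, downloaded separately.
    model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def document_vector(tokens, vectors):
        """Average the vectors of in-vocabulary tokens: word2vec as a feature extractor."""
        in_vocab = [vectors[token] for token in tokens if token in vectors]
        if not in_vocab:
            return np.zeros(vectors.vector_size)
        return np.mean(in_vocab, axis=0)

    features = document_vector("the hotel was nice".split(), model)
    print(features.shape)  # (300,) for these pretrained vectors
    ```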
  45.

  46. Obrigado (Thank you!)