Word Embeddings
Rome = [0.91, 0.83, 0.17, …, 0.41]
Paris = [0.92, 0.82, 0.17, …, 0.98]
Italy = [0.32, 0.77, 0.67, …, 0.42]
France = [0.33, 0.78, 0.66, …, 0.97]
Slide 18
Slide 18 text
Word Embeddings
Rome = [0.91, 0.83, 0.17, …, 0.41]
Paris = [0.92, 0.82, 0.17, …, 0.98]
Italy = [0.32, 0.77, 0.67, …, 0.42]
France = [0.33, 0.78, 0.66, …, 0.97]
n. dimensions << vocabulary size
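To make the picture concrete, here is a minimal numpy sketch using only the four printed dimensions of the toy vectors above (so the numbers are purely illustrative): similar words end up with similar vectors, which we can check with cosine similarity.

import numpy as np

# Toy 4-dimensional vectors copied from the slide (real embeddings have
# hundreds of dimensions, still far fewer than the vocabulary size).
vectors = {
    "Rome":   np.array([0.91, 0.83, 0.17, 0.41]),
    "Paris":  np.array([0.92, 0.82, 0.17, 0.98]),
    "Italy":  np.array([0.32, 0.77, 0.67, 0.42]),
    "France": np.array([0.33, 0.78, 0.66, 0.97]),
}

def cosine(a, b):
    # 1.0 means same direction, 0.0 means unrelated (orthogonal)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors["Rome"], vectors["Paris"]))    # cities look alike
print(cosine(vectors["Italy"], vectors["France"]))  # countries look alike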
Slide 22
Slide 22 text
Word Embeddings
Rome
Paris
Italy
France
Slide 23
Slide 23 text
Word Embeddings
is-capital-of
Slide 24
Slide 24 text
Word Embeddings
Paris
Slide 25
Slide 25 text
Word Embeddings
Paris + Italy
Slide 26
Slide 26 text
Word Embeddings
Paris + Italy - France
Slide 27
Slide 27 text
Word Embeddings
Paris + Italy - France ≈ Rome
Rome
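With the toy numbers from the earlier slides this arithmetic can be checked directly; with a real trained model the equivalent query would be gensim's most_similar with positive and negative terms (a sketch, assuming such a model exists and contains these tokens).

import numpy as np

# Toy vectors from the earlier slides (truncated to the four printed values).
rome   = np.array([0.91, 0.83, 0.17, 0.41])
paris  = np.array([0.92, 0.82, 0.17, 0.98])
italy  = np.array([0.32, 0.77, 0.67, 0.42])
france = np.array([0.33, 0.78, 0.66, 0.97])

print(paris + italy - france)   # roughly [0.91 0.81 0.18 0.43], closest to rome

# With a trained gensim model the same query would (hypothetically) be:
# model.wv.most_similar(positive=['Paris', 'Italy'], negative=['France'])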
Slide 28
Slide 28 text
FROM LANGUAGE
TO VECTORS?
Slide 29
Slide 29 text
Distributional
Hypothesis
Slide 30
Slide 30 text
“You shall know a word
by the company it keeps.”
–J.R. Firth, 1957
Slide 31
Slide 31 text
“Words that occur in similar context
tend to have similar meaning.”
–Z. Harris, 1954
Slide 32
Slide 32 text
Context ≈ Meaning
Slide 33
Slide 33 text
I enjoyed eating some pizza at the restaurant
Slide 34
Slide 34 text
I enjoyed eating some pizza at the restaurant
Word
Slide 35
Slide 35 text
I enjoyed eating some pizza at the restaurant
The company it keeps
Word
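As a small illustration of “the company it keeps”, a sketch that pulls out the context window around a focus word (the window size of 2 is an arbitrary choice here):

sentence = "I enjoyed eating some pizza at the restaurant".split()

def context(tokens, focus_index, window=2):
    # The "company" of a word: up to `window` tokens on each side.
    left = tokens[max(0, focus_index - window):focus_index]
    right = tokens[focus_index + 1:focus_index + 1 + window]
    return left + right

print(context(sentence, sentence.index("pizza")))
# ['eating', 'some', 'at', 'the']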
Slide 36
Slide 36 text
I enjoyed eating some pizza at the restaurant
I enjoyed eating some Irish stew at the restaurant
Slide 38
Slide 38 text
I enjoyed eating some pizza at the restaurant
I enjoyed eating some Irish stew at the restaurant
Same Context
Slide 39
Slide 39 text
Same Context
=
?
Slide 40
Slide 40 text
WORD2VEC
Slide 41
Slide 41 text
word2vec (2013)
Slide 42
Slide 42 text
word2vec Architecture
Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space
Slide 43
Slide 43 text
Vector Calculation
Slide 44
Slide 44 text
Vector Calculation
Goal: learn vec(word)
Slide 45
Slide 45 text
Vector Calculation
Goal: learn vec(word)
1. Choose objective function
Slide 46
Slide 46 text
Vector Calculation
Goal: learn vec(word)
1. Choose objective function
2. Init: random vectors
Slide 47
Slide 47 text
Vector Calculation
Goal: learn vec(word)
1. Choose objective function
2. Init: random vectors
3. Run stochastic gradient descent
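A tiny sketch of step 2 of this recipe; the toy vocabulary and the choice of 100 dimensions are both arbitrary, and the objective plus the gradient step are sketched after the objective-function slides below.

import numpy as np

# Step 2: start from random vectors for every word in the vocabulary.
vocab = ["i", "enjoyed", "eating", "some", "pizza", "at", "the", "restaurant"]
rng = np.random.default_rng(42)
embeddings = {w: rng.normal(scale=0.1, size=100) for w in vocab}

# Step 3 then repeatedly nudges these vectors in the direction that improves
# the objective chosen in step 1; a full (toy) training loop follows later.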
Slide 50
Slide 50 text
Objective Function
Slide 51
Slide 51 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 54
Slide 54 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of a word
given its context
Slide 55
Slide 55 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of a word
given its context
e.g. P(pizza | restaurant)
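In code, this first flavour of the objective (predict the word from its context, the CBOW-style setup) amounts to building (context, target) training examples like these; window size 2 is again an arbitrary choice.

sentence = "I enjoyed eating some pizza at the restaurant".split()
window = 2

examples = []
for i, word in enumerate(sentence):
    ctx = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    examples.append((ctx, word))  # train to maximise P(word | ctx)

print(examples[4])
# (['eating', 'some', 'at', 'the'], 'pizza')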
Slide 57
Slide 57 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of the context
given the focus word
Slide 58
Slide 58 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of the context
given the focus word
e.g. P(restaurant | pizza)
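Putting the “Vector Calculation” recipe together with this second flavour (predict the context from the focus word, the skip-gram setup), here is a deliberately minimal numpy sketch. Negative sampling is an assumption on my part (Mikolov et al. also describe hierarchical softmax), and the toy corpus and hyperparameters are arbitrary; real implementations such as gensim add many refinements.

import numpy as np

corpus = [
    "i enjoyed eating some pizza at the restaurant".split(),
    "i enjoyed eating some irish stew at the restaurant".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
dim, window, lr, epochs = 25, 2, 0.05, 200
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # focus-word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for sent in corpus:
        for i, focus in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i == j:
                    continue
                f, c = idx[focus], idx[sent[j]]
                # One true (focus, context) pair plus two random "negative"
                # words (ignoring the small chance of drawing the true one).
                negatives = rng.integers(len(vocab), size=2)
                for target, label in [(c, 1.0)] + [(n, 0.0) for n in negatives]:
                    score = sigmoid(W_in[f] @ W_out[target])
                    grad = score - label            # log-loss gradient w.r.t. score
                    g_in = grad * W_out[target]
                    g_out = grad * W_in[f]
                    W_in[f] -= lr * g_in            # the stochastic gradient step
                    W_out[target] -= lr * g_out

print(W_in[idx["pizza"]][:5])   # a (toy) learned vector for "pizza"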
Slide 59
Slide 59 text
WORD2VEC IN PYTHON
Slide 60
Slide 60 text
No content
Slide 61
Slide 61 text
pip install gensim
Slide 62
Slide 62 text
Example
Slide 63
Slide 63 text
from gensim.models import Word2Vec
fname = 'my_dataset.json'
corpus = MyCorpusReader(fname)
model = Word2Vec(corpus)
Example
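MyCorpusReader above stands for your own code that streams tokenised sentences from a file; for a self-contained variant you can pass any iterable of token lists directly. A sketch, with parameter names as in gensim 4.x (older 3.x releases call these size and iter):

from gensim.models import Word2Vec

corpus = [
    "i enjoyed eating some pizza at the restaurant".split(),
    "i enjoyed eating some irish stew at the restaurant".split(),
    # ... a real corpus needs far more sentences than this
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["pizza"][:5])           # the learned vector (first 5 dimensions)
print(model.wv.most_similar("pizza"))  # nearest neighbours in this toy space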
• word2vec + morphology (sub-words)
• Pre-trained vectors on ~300 languages (Wikipedia)
fastText (2016-17)
Slide 95
Slide 95 text
• word2vec + morphology (sub-words)
• Pre-trained vectors on ~300 languages (Wikipedia)
• rare words
fastText (2016-17)
Slide 96
Slide 96 text
• word2vec + morphology (sub-words)
• Pre-trained vectors on ~300 languages (Wikipedia)
• rare words
• out-of-vocabulary words (sometimes)
fastText (2016-17)
Slide 97
Slide 97 text
• word2vec + morphology (sub-words)
• Pre-trained vectors on ~300 languages (Wikipedia)
• rare words
• out-of-vocabulary words (sometimes)
• morphologically rich languages
fastText (2016-17)
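A sketch of the sub-word idea with gensim's FastText class (same toy corpus and gensim 4.x parameter names as before); the point is that a word never seen in training can still get a vector through its character n-grams.

from gensim.models import FastText

corpus = [
    "i enjoyed eating some pizza at the restaurant".split(),
    "i enjoyed eating some irish stew at the restaurant".split(),
]

model = FastText(corpus, vector_size=50, window=2, min_count=1, epochs=50)

# "pizzeria" is out of vocabulary, but it shares character n-grams with
# "pizza", so fastText can still build a vector for it (of varying quality).
print("pizzeria" in model.wv.key_to_index)  # False: not in the vocabulary
print(model.wv["pizzeria"][:5])             # still returns a vector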
Slide 98
Slide 98 text
FINAL REMARKS
Slide 99
Slide 99 text
But we’ve been doing this for X years
Slide 100
Slide 100 text
• Approaches based on co-occurrences are not new
• … but usually outperformed by word embeddings
• … and don’t scale as well as word embeddings
But we’ve been doing this for X years
Slide 101
Slide 101 text
Garbage in, garbage out
Slide 102
Slide 102 text
Garbage in, garbage out
• Pre-trained vectors are useful … until they’re not
• The business domain is important
• The pre-processing steps are important
• > 100K words? Maybe train your own model
• > 1M words? Yep, train your own model
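For the “pre-trained vectors are useful” case, one convenient route is gensim's downloader; a sketch, assuming you are happy to fetch the roughly 1.6 GB Google News model (any other set from the gensim-data catalogue works the same way, and the domain caveats above still apply).

import gensim.downloader as api

# Downloads and caches the vectors on first use; returns a KeyedVectors object.
wv = api.load("word2vec-google-news-300")

print(wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))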
Slide 103
Slide 103 text
Summary
Slide 104
Slide 104 text
Summary
• Word Embeddings are magic!
• Big victory of unsupervised learning
• Gensim makes your life easy
Slide 105
Slide 105 text
THANK YOU
@MarcoBonzanini
speakerdeck.com/marcobonzanini
GitHub.com/bonzanini
marcobonzanini.com
Slide 106
Slide 106 text
Credits & Readings
Slide 107
Slide 107 text
Credits & Readings
Credits
• Lev Konstantinovskiy (@teagermylk)
Readings
• Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
• “GloVe: Global Vectors for Word Representation” by Pennington et al.
• “Distributed Representations of Sentences and Documents” (doc2vec)
by Le and Mikolov
• “Enriching Word Vectors with Subword Information” (fastText)
by Bojanowski et al.
Slide 108
Slide 108 text
Credits & Readings
Even More Readings
• “Man is to Computer Programmer as Woman is to Homemaker?
Debiasing Word Embeddings” by Bolukbasi et al.
• “Quantifying and Reducing Stereotypes in Word Embeddings” by
Bolukbasi et al.
• “Equality of Opportunity in Machine Learning” - Google Research Blog
https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html
Pics Credits
• Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
• Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
• Irish stew: https://commons.wikimedia.org/wiki/File:Irish_stew_(13393166514).jpg
• Pizza: https://commons.wikimedia.org/wiki/File:Eq_it-na_pizza-margherita_sep2005_sml.jpg