Word Embeddings
Rome   = [0.91, 0.83, 0.17, …, 0.41]
Paris  = [0.92, 0.82, 0.17, …, 0.98]
Italy  = [0.32, 0.77, 0.67, …, 0.42]
France = [0.33, 0.78, 0.66, …, 0.97]
Slide 18
Word Embeddings
Rome   = [0.91, 0.83, 0.17, …, 0.41]
Paris  = [0.92, 0.82, 0.17, …, 0.98]
Italy  = [0.32, 0.77, 0.67, …, 0.42]
France = [0.33, 0.78, 0.66, …, 0.97]
n. dimensions << vocabulary size
Slide 22
Word Embeddings
[2D plot of the four word vectors: Rome, Paris, Italy, France]
Slide 23
Word Embeddings
Paris + Italy - France ≈ Rome
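With trained vectors and a library like gensim (used later in this deck), the analogy above is a one-liner (a sketch, assuming a Word2Vec model trained on text that actually mentions these cities and countries; exact scores will vary):

model.most_similar(positive=['Paris', 'Italy'], negative=['France'])
# 'Rome' should appear at or near the top of the returned list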
Slide 24
THE MAIN INTUITION
Slide 25
Distributional Hypothesis
Slide 26
“You shall know a word by the company it keeps.”
–J.R. Firth, 1957
Slide 27
“Words that occur in similar context tend to have similar meaning.”
–Z. Harris, 1954
Slide 28
Context ≈ Meaning
Slide 29
I enjoyed eating some pizza at the restaurant
Slide 30
I enjoyed eating some pizza at the restaurant
Word
Slide 31
I enjoyed eating some pizza at the restaurant
The company it keeps
Word
Slide 32
I enjoyed eating some pizza at the restaurant
I enjoyed eating some fiorentina at the restaurant
Slide 34
I enjoyed eating some pizza at the restaurant
I enjoyed eating some fiorentina at the restaurant
Same context
Slide 35
I enjoyed eating some pizza at the restaurant
I enjoyed eating some fiorentina at the restaurant
Same context
Pizza = Fiorentina ?
Slide 36
A BIT OF THEORY
word2vec
Slide 39
Vector Calculation
Slide 40
Vector Calculation
Goal: learn vec(word)
Slide 41
Vector Calculation
Goal: learn vec(word)
1. Choose objective function
Slide 42
Vector Calculation
Goal: learn vec(word)
1. Choose objective function
2. Init: random vectors
Slide 43
Vector Calculation
Goal: learn vec(word)
1. Choose objective function
2. Init: random vectors
3. Run gradient descent
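A minimal end-to-end sketch of these three steps, assuming a toy one-sentence corpus and a plain full-softmax skip-gram model (not the author's code, and not the optimised word2vec implementation; real word2vec uses the dot product inside the softmax, which the cosine on later slides approximates for roughly unit-length vectors):

import numpy as np

sentences = [["i", "enjoyed", "eating", "some", "pizza", "at", "the", "restaurant"]]
vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D, lr, window = len(vocab), 10, 0.05, 2

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # 2. init: random vectors for focus words
W_out = rng.normal(scale=0.1, size=(V, D))  # random vectors for context words

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for epoch in range(50):                     # 3. gradient descent (plain SGD)
    for sent in sentences:
        for pos, focus in enumerate(sent):
            for ctx_pos in range(max(0, pos - window), min(len(sent), pos + window + 1)):
                if ctx_pos == pos:
                    continue
                f, c = idx[focus], idx[sent[ctx_pos]]
                probs = softmax(W_out @ W_in[f])  # 1. objective: maximise log P(context | focus)
                grad = probs
                grad[c] -= 1.0                    # gradient of -log P(c | f) w.r.t. the scores
                grad_in = W_out.T @ grad          # backprop to the focus-word vector
                W_out -= lr * np.outer(grad, W_in[f])
                W_in[f] -= lr * grad_in

word_vectors = {w: W_in[i] for w, i in idx.items()}  # the learned vec(word)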
Slide 44
I enjoyed eating some pizza at the restaurant
Slide 47
I enjoyed eating some pizza at the restaurant
Maximise the likelihood of the context given the focus word
Slide 48
I enjoyed eating some pizza at the restaurant
Maximise the likelihood of the context given the focus word
P(i | pizza)
P(enjoyed | pizza)
…
P(restaurant | pizza)
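Written as a single objective (one common formulation, hedged: real word2vec restricts the context to a small window around each focus word), training maximises

Σ over (focus word w, context word c) of log P( c | w )

i.e. the log of the product of probabilities like the ones listed above, taken over the whole corpus.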
Slide 49
Example
I enjoyed eating some pizza at the restaurant
Slide 50
I enjoyed eating some pizza at the restaurant
Iterate over context words
Example
Slide 51
I enjoyed eating some pizza at the restaurant
bump P( i | pizza )
Example
Slide 52
I enjoyed eating some pizza at the restaurant
bump P( enjoyed | pizza )
Example
Slide 53
I enjoyed eating some pizza at the restaurant
bump P( eating | pizza )
Example
Slide 54
I enjoyed eating some pizza at the restaurant
bump P( some | pizza )
Example
Slide 55
I enjoyed eating some pizza at the restaurant
bump P( at | pizza )
Example
Slide 56
I enjoyed eating some pizza at the restaurant
bump P( the | pizza )
Example
Slide 57
I enjoyed eating some pizza at the restaurant
bump P( restaurant | pizza )
Example
Slide 58
I enjoyed eating some pizza at the restaurant
Move to next focus word and repeat
Example
Slide 59
I enjoyed eating some pizza at the restaurant
bump P( i | at )
Example
Slide 60
I enjoyed eating some pizza at the restaurant
bump P( enjoyed | at )
Example
Slide 61
I enjoyed eating some pizza at the restaurant
… you get the picture
Example
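The walk-through above in a few lines of Python (a sketch: here the context is the whole sentence, as on the slides, whereas word2vec normally uses a small sliding window):

sentence = "i enjoyed eating some pizza at the restaurant".split()
for focus in sentence:
    for context in sentence:
        if context != focus:
            print(f"bump P( {context} | {focus} )")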
Slide 62
P( eating | pizza )
Slide 63
P( eating | pizza ) ??
Slide 64
P( eating | pizza )
(input word: pizza, output word: eating)
Slide 65
P( eating | pizza )
(input word: pizza, output word: eating)
P( vec(eating) | vec(pizza) )
Slide 66
P( vout | vin )
P( vec(eating) | vec(pizza) )
P( eating | pizza )
(input word: pizza, output word: eating)
Slide 67
P( vout | vin )
P( vec(eating) | vec(pizza) )
P( eating | pizza )
(input word: pizza, output word: eating)
???
Slide 68
P( vout | vin )
Slide 69
cosine( vout, vin )
Slide 70
cosine( vout, vin ) ∈ [-1, 1]
Slide 71
softmax(cosine( vout, vin ))
Slide 72
softmax(cosine( vout, vin )) ∈ [0, 1]
Slide 73
softmax(cosine( vout, vin ))
P( vout | vin ) = exp(cosine( vout, vin )) / Σ_{k ∈ V} exp(cosine( vk, vin ))
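A small numpy rendering of the formula above (the vectors are placeholders; in a real model the denominator sums over the entire vocabulary):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def p_out_given_in(v_out, v_in, all_vectors):
    # denominator: one term per word vector v_k in the vocabulary
    scores = np.array([cosine(v_k, v_in) for v_k in all_vectors])
    return np.exp(cosine(v_out, v_in)) / np.exp(scores).sum()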
Slide 74
Vector Calculation Recap
Slide 75
Vector Calculation Recap
Learn vec(word)
Slide 76
Vector Calculation Recap
Learn vec(word)
by gradient descent
Slide 77
Vector Calculation Recap
Learn vec(word)
by gradient descent
on the softmax probability
Slide 78
Plot Twist
Slide 81
Paragraph Vector
a.k.a.
doc2vec
i.e.
P(vout | vin, label)
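In gensim the extra label travels with each training document (a minimal sketch; the tag name is made up):

from gensim.models.doc2vec import TaggedDocument

doc = TaggedDocument(words=['i', 'enjoyed', 'eating', 'some', 'pizza'],
                     tags=['review_42'])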
Slide 82
A BIT OF PRACTICE
Slide 84
pip install gensim
Slide 85
Case Study 1: Skills and CVs
Slide 86
Case Study 1: Skills and CVs
Data set of ~300k resumes
Each experience is a “sentence”
Each experience has 3-15 skills
Approx 15k unique skills
Slide 87
Case Study 1: Skills and CVs
from gensim.models import Word2Vec
fname = 'candidates.jsonl'
corpus = ResumesCorpus(fname)
model = Word2Vec(corpus)
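The slides don't show ResumesCorpus; a minimal sketch of what such a class might look like (the JSON field names are assumptions): any restartable iterable that yields lists of tokens works as a Word2Vec corpus.

import json

class ResumesCorpus:
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # one "sentence" per experience: the list of skills it mentions
        with open(self.fname) as f:
            for line in f:
                resume = json.loads(line)
                for experience in resume.get('experiences', []):
                    yield [skill.lower() for skill in experience.get('skills', [])]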
Slide 88
Case Study 1: Skills and CVs
model.most_similar('chef')
[('cook', 0.94),
('bartender', 0.91),
('waitress', 0.89),
('restaurant', 0.76),
...]
Slide 89
Case Study 1: Skills and CVs
model.most_similar('chef', negative=['food'])
[('puppet', 0.93),
('devops', 0.92),
('ansible', 0.79),
('salt', 0.77),
...]
Once the 'food' direction is removed, the other sense of chef emerges: Chef the configuration-management tool, in the company of Puppet, Ansible and Salt.
Slide 90
Case Study 1: Skills and CVs
Useful for:
Data exploration
Query expansion/suggestion
Recommendations
Slide 91
Case Study 2: Beer!
Slide 92
Case Study 2: Beer!
Data set of ~2.9M beer reviews
89 different beer styles
635k unique tokens
185M total tokens
Slide 93
Case Study 2: Beer!
from gensim.models import Doc2Vec
fname = 'ratebeer_data.csv'
corpus = RateBeerCorpus(fname)
model = Doc2Vec(corpus)
Slide 94
Case Study 2: Beer!
from gensim.models import Doc2Vec
fname = 'ratebeer_data.csv'
corpus = RateBeerCorpus(fname)
model = Doc2Vec(corpus)
3.5h on my laptop
… remember to pickle
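RateBeerCorpus isn't shown either; for Doc2Vec the corpus has to yield TaggedDocument objects, here tagged with the beer style (a sketch: the column names are assumptions). And instead of pickling by hand, gensim models can be saved and reloaded directly.

import csv
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

class RateBeerCorpus:
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname, newline='') as f:
            for row in csv.DictReader(f):
                # one document per review, tagged with its beer style
                yield TaggedDocument(words=row['review'].lower().split(),
                                     tags=[row['style']])

model.save('ratebeer.d2v')           # persist the trained model
model = Doc2Vec.load('ratebeer.d2v')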
Slide 95
Case Study 2: Beer!
model.docvecs.most_similar('Stout')
[('Sweet Stout', 0.9877),
('Porter', 0.9620),
('Foreign Stout', 0.9595),
('Dry Stout', 0.9561),
('Imperial/Strong Porter', 0.9028),
...]
Slide 96
Case Study 2: Beer!
model.most_similar([model.docvecs['Stout']])
[('coffee', 0.6342),
('espresso', 0.5931),
('charcoal', 0.5904),
('char', 0.5631),
('bean', 0.5624),
...]
Case Study 2: Beer!
Useful for:
Understanding the language of beer enthusiasts
Planning your next pint
Classification
Slide 105
FINAL REMARKS
Slide 106
But we’ve been doing this for X years
Slide 107
But we’ve been doing this for X years
• Approaches based on co-occurrences are not new
• Think SVD / LSA / LDA
• … but they are usually outperformed by word2vec
• … and don’t scale as well as word2vec
Slide 108
Efficiency
Slide 109
Efficiency
• There is no co-occurrence matrix (vectors are learned directly)
• Softmax has complexity O(V); Hierarchical Softmax only O(log(V))
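In gensim, hierarchical softmax is a constructor flag (a sketch, reusing the corpus from the case studies; by default gensim uses negative sampling rather than the full softmax):

from gensim.models import Word2Vec
model = Word2Vec(corpus, hs=1, negative=0)  # hierarchical softmax on, negative sampling off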
Slide 110
Garbage in, garbage out
Slide 111
Garbage in, garbage out
• Pre-trained vectors are useful (see the loading sketch after this list)
• … until they’re not
• The business domain is important
• The pre-processing steps are important
• > 100K words? Maybe train your own model
• > 1M words? Yep, train your own model
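Loading pre-trained vectors, for the cases where training your own model isn't warranted (a sketch: the file is the widely shared Google News vectors, and the exact loading call differs slightly across gensim versions):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                        binary=True)
wv.most_similar('pizza')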
Slide 112
Summary
Slide 113
Summary
• Word Embeddings are magic!
• Big victory of unsupervised learning
• Gensim makes your life easy
Slide 114
Credits & Readings
Slide 115
Credits & Readings
Credits
• Lev Konstantinovskiy (@gensim_py)
• Chris E. Moody (@chrisemoody) see videos on lda2vec
Readings
• Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
• “word2vec parameter learning explained” by Xin Rong
More readings
• “GloVe: global vectors for word representation” by Pennington et al.
• “Dependency based word embeddings” and “Neural word embeddings as implicit matrix factorization” by O. Levy and Y. Goldberg
Slide 116
THANK YOU
@MarcoBonzanini
GitHub.com/bonzanini
marcobonzanini.com