Intermezzo (Gradient Descent)
• Optimisation algorithm
• Purpose: find the min (or max) of a function F
Slide 67
Slide 67 text
Intermezzo (Gradient Descent)
• Optimisation algorithm
• Purpose: find the min (or max) of a function F
• Batch-oriented (use all data points)
Slide 68
Slide 68 text
Intermezzo (Gradient Descent)
• Optimisation algorithm
• Purpose: find the min (or max) of a function F
• Batch-oriented (use all data points)
• Stochastic GD: update after each sample
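A minimal sketch of the two flavours on a toy least-squares problem (the loss, data and learning rate below are illustrative assumptions):

import numpy as np

# toy data for a one-parameter least-squares loss F(w) = mean((x*w - y)^2)
x = np.random.rand(100)
y = 3 * x + 0.1 * np.random.randn(100)

def gradient(w, xs, ys):
    # dF/dw for the squared error on the given samples
    return np.mean(2 * (xs * w - ys) * xs)

lr = 0.1

# batch gradient descent: one update per pass over ALL data points
w = 0.0
for epoch in range(100):
    w -= lr * gradient(w, x, y)

# stochastic gradient descent: one update after EACH sample
w_sgd = 0.0
for epoch in range(10):
    for xi, yi in zip(x, y):
        w_sgd -= lr * gradient(w_sgd, xi, yi)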
Slide 69
Slide 69 text
Objective Function
Slide 70
Slide 70 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 71
Slide 71 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 72
Slide 72 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 73
Slide 73 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 74
Slide 74 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of a word
given its context
Slide 75
Slide 75 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of a word
given its context
e.g. P(pizza | eating)
Slide 76
Slide 76 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 77
Slide 77 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of the context
given its focus word
Slide 78
Slide 78 text
I enjoyed eating some pizza at the restaurant
Objective Function
maximise
the likelihood of the context
given its focus word
e.g. P(eating | pizza)
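These two objectives are the CBOW (word given context) and skip-gram (context given word) variants of word2vec. In gensim the choice is a single flag; a minimal sketch (the toy corpus and min_count=1 are only there to make it runnable):

from gensim.models import Word2Vec

sentences = [['i', 'enjoyed', 'eating', 'some', 'pizza', 'at', 'the', 'restaurant']]

# CBOW: maximise P(word | context), e.g. P(pizza | eating)
cbow = Word2Vec(sentences, sg=0, min_count=1)

# skip-gram: maximise P(context | word), e.g. P(eating | pizza)
skipgram = Word2Vec(sentences, sg=1, min_count=1)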
Slide 79
Slide 79 text
Example
I enjoyed eating some pizza at the restaurant
Slide 80
Slide 80 text
I enjoyed eating some pizza at the restaurant
Iterate over context words
Example
Slide 81
Slide 81 text
I enjoyed eating some pizza at the restaurant
bump P( i | pizza )
Example
Slide 82
Slide 82 text
I enjoyed eating some pizza at the restaurant
bump P( enjoyed | pizza )
Example
Slide 83
Slide 83 text
I enjoyed eating some pizza at the restaurant
bump P( eating | pizza )
Example
Slide 84
Slide 84 text
I enjoyed eating some pizza at the restaurant
bump P( some | pizza )
Example
Slide 85
Slide 85 text
I enjoyed eating some pizza at the restaurant
bump P( at | pizza )
Example
Slide 86
Slide 86 text
I enjoyed eating some pizza at the restaurant
bump P( the | pizza )
Example
Slide 87
Slide 87 text
I enjoyed eating some pizza at the restaurant
bump P( restaurant | pizza )
Example
Slide 88
Slide 88 text
I enjoyed eating some pizza at the restaurant
Move to next focus word and repeat
Example
Slide 89
Slide 89 text
I enjoyed eating some pizza at the restaurant
bump P( i | at )
Example
Slide 90
Slide 90 text
I enjoyed eating some pizza at the restaurant
bump P( enjoyed | at )
Example
Slide 91
Slide 91 text
I enjoyed eating some pizza at the restaurant
… you get the picture
Example
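The loop sketched above can be written down directly: slide a window over the sentence and emit one (focus, context) pair per neighbour, each pair being one "bump". A minimal illustration (the window size is an arbitrary choice; on these slides the window effectively spans the whole sentence):

sentence = 'i enjoyed eating some pizza at the restaurant'.split()
window = 2  # context words considered on each side of the focus word

pairs = []
for i, focus in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            # one "bump P( context | focus )" step per pair
            pairs.append((focus, sentence[j]))

# e.g. ('pizza', 'eating'), ('pizza', 'some'), ('pizza', 'at'), ...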
Slide 92
Slide 92 text
P( eating | pizza )
Slide 93
Slide 93 text
P( eating | pizza ) ??
Slide 94
Slide 94 text
P( eating | pizza )
Input word
Output word
Slide 95
Slide 95 text
P( eating | pizza )
Input word
Output word
P( vec(eating) | vec(pizza) )
Slide 96
Slide 96 text
P( vout | vin )
P( vec(eating) | vec(pizza) )
P( eating | pizza )
Input word
Output word
Slide 97
Slide 97 text
P( vout | vin )
P( vec(eating) | vec(pizza) )
P( eating | pizza )
Input word
Output word
???
Slide 98
Slide 98 text
P( vout | vin )
Slide 99
Slide 99 text
cosine( vout, vin )
Slide 100
Slide 100 text
cosine( vout, vin ) ∈ [-1, 1]
Slide 101
Slide 101 text
softmax(cosine( vout, vin ))
Slide 102
Slide 102 text
softmax(cosine( vout, vin )) ∈ [0, 1]
Slide 103
Slide 103 text
softmax(cosine( vout, vin ))
P(vout | vin) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))
Slide 104
Slide 104 text
Vector Calculation Recap
Slide 105
Slide 105 text
Vector Calculation Recap
Learn vec(word)
Slide 106
Slide 106 text
Vector Calculation Recap
Learn vec(word)
by gradient descent
Slide 107
Slide 107 text
Vector Calculation Recap
Learn vec(word)
by gradient descent
on the softmax probability
Slide 108
Slide 108 text
Plot Twist
Slide 109
Slide 109 text
No content
Slide 110
Slide 110 text
No content
Slide 111
Slide 111 text
Paragraph Vector
a.k.a.
doc2vec
i.e.
P(vout | vin, label)
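In gensim the extra label is attached to each document with TaggedDocument; a minimal sketch (text and tag are illustrative, parameter names as in recent gensim releases):

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

docs = [TaggedDocument(words=['i', 'enjoyed', 'eating', 'some', 'pizza'],
                       tags=['review_1'])]
model = Doc2Vec(docs, vector_size=50, min_count=1)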
Slide 112
Slide 112 text
A BIT OF PRACTICE
Slide 113
Slide 113 text
No content
Slide 114
Slide 114 text
pip install gensim
Slide 115
Slide 115 text
Case Study 1: Skills and CVs
Slide 116
Slide 116 text
Case Study 1: Skills and CVs
from gensim.models import Word2Vec
fname = 'candidates.jsonl'
corpus = ResumesCorpus(fname)
model = Word2Vec(corpus)
Slide 117
Slide 117 text
Case Study 1: Skills and CVs
from gensim.models import Word2Vec
fname = 'candidates.jsonl'
corpus = ResumesCorpus(fname)
model = Word2Vec(corpus)
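ResumesCorpus is not defined on the slide; a plausible minimal version (the JSON field name resume_text is an assumption) streams and tokenises one resume per line of the .jsonl file:

import json

class ResumesCorpus:
    """Restartable iterable of tokenised resumes, one JSON object per line."""

    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                doc = json.loads(line)
                # 'resume_text' is a hypothetical field name
                yield doc['resume_text'].lower().split()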
Slide 118
Slide 118 text
Case Study 1: Skills and CVs
model.most_similar('chef')
[('cook', 0.94),
('bartender', 0.91),
('waitress', 0.89),
('restaurant', 0.76),
...]
Slide 119
Slide 119 text
Case Study 1: Skills and CVs
model.most_similar('chef',
negative=['food'])
[('puppet', 0.93),
('devops', 0.92),
('ansible', 0.79),
('salt', 0.77),
...]
Slide 120
Slide 120 text
Case Study 1: Skills and CVs
Useful for:
Data exploration
Query expansion/suggestion
Recommendations
Slide 121
Slide 121 text
Case Study 2: Beer!
Slide 122
Slide 122 text
Case Study 2: Beer!
Data set of ~2.9M beer reviews
89 different beer styles
635k unique tokens
185M total tokens
https://snap.stanford.edu/data/web-RateBeer.html
Slide 123
Slide 123 text
Case Study 2: Beer!
from gensim.models import Doc2Vec
fname = 'ratebeer_data.csv'
corpus = RateBeerCorpus(fname)
model = Doc2Vec(corpus)
Slide 124
Slide 124 text
Case Study 2: Beer!
from gensim.models import Doc2Vec
fname = 'ratebeer_data.csv'
corpus = RateBeerCorpus(fname)
model = Doc2Vec(corpus)
3.5h on my laptop
… remember to pickle
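RateBeerCorpus is likewise not shown; since the tags queried below are beer styles, a plausible sketch (column names are assumptions) yields one TaggedDocument per review, tagged with its style. Saving the trained model avoids repeating the 3.5-hour run:

import csv
from gensim.models.doc2vec import TaggedDocument

class RateBeerCorpus:
    """Restartable iterable of reviews as TaggedDocuments, tagged by beer style."""

    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for row in csv.DictReader(f):
                # 'review' and 'style' are hypothetical column names
                yield TaggedDocument(words=row['review'].lower().split(),
                                     tags=[row['style']])

# model.save('ratebeer.d2v')
# model = Doc2Vec.load('ratebeer.d2v')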
Slide 125
Slide 125 text
Case Study 2: Beer!
model.docvecs.most_similar('Stout')
[('Sweet Stout', 0.9877),
('Porter', 0.9620),
('Foreign Stout', 0.9595),
('Dry Stout', 0.9561),
('Imperial/Strong Porter', 0.9028),
...]
Slide 126
Slide 126 text
Case Study 2: Beer!
model.most_similar([model.docvecs['Stout']])
[('coffee', 0.6342),
('espresso', 0.5931),
('charcoal', 0.5904),
('char', 0.5631),
('bean', 0.5624),
...]
• word2vec + morphology (sub-words)
• Pre-trained vectors on ~300 languages
fastText (2016-17)
Slide 164
Slide 164 text
• word2vec + morphology (sub-words)
• Pre-trained vectors on ~300 languages
• Especially useful for morphologically rich languages
fastText (2016-17)
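gensim also ships a FastText class with a Word2Vec-like API; a minimal sketch (toy corpus, default subword settings):

from gensim.models import FastText

sentences = [['i', 'enjoyed', 'eating', 'some', 'pizza', 'at', 'the', 'restaurant']]

# character n-grams let the model compose vectors for rare or unseen word forms,
# which is what helps with morphologically rich languages
model = FastText(sentences, min_count=1)
vec = model.wv['pizzas']   # a vector even for a form not seen in training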
Slide 165
Slide 165 text
FINAL REMARKS
Slide 166
Slide 166 text
But we’ve been doing this for X years
Slide 167
Slide 167 text
But we’ve been doing this for X years
• Approaches based on co-occurrences are not new
Slide 168
Slide 168 text
But we’ve been doing this for X years
• Approaches based on co-occurrences are not new
• … but usually outperformed by word embeddings
Slide 169
Slide 169 text
But we’ve been doing this for X years
• Approaches based on co-occurrences are not new
• … but usually outperformed by word embeddings
• … and don’t scale as well as word embeddings
Slide 170
Slide 170 text
Garbage in, garbage out
Slide 171
Slide 171 text
Garbage in, garbage out
• Pre-trained vectors are useful … until they’re not
Slide 172
Slide 172 text
Garbage in, garbage out
• Pre-trained vectors are useful … until they’re not
• The business domain is important
Slide 173
Slide 173 text
Garbage in, garbage out
• Pre-trained vectors are useful … until they’re not
• The business domain is important
• > 100K words? Maybe train your own model
Slide 174
Slide 174 text
Garbage in, garbage out
• Pre-trained vectors are useful … until they’re not
• The business domain is important
• > 100K words? Maybe train your own model
• > 1M words? Yep, train your own model
Slide 175
Slide 175 text
Summary
Slide 176
Slide 176 text
Summary
• Word Embeddings are magic!
• Big victory of unsupervised learning
• Gensim makes your life easy
Slide 177
Slide 177 text
Credits & Readings
Slide 178
Slide 178 text
Credits & Readings
Credits
• Lev Konstantinovskiy (@teagermylk)
Readings
• Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
• “GloVe: Global Vectors for Word Representation” by Pennington et al.
• “Distributed Representations of Sentences and Documents” (doc2vec)
by Le and Mikolov
• “Enriching Word Vectors with Subword Information” (fastText)
by Bojanowski et al.
Slide 179
Slide 179 text
Credits & Readings
Even More Readings
• “Man is to Computer Programmer as Woman is to Homemaker? Debiasing
Word Embeddings” by Bolukbasi et al.
• “Quantifying and Reducing Stereotypes in Word Embeddings” by Bolukbasi et al.
• “Equality of Opportunity in Machine Learning” - Google Research Blog
https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html
Pics Credits
• Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
• Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
• Broccoli: https://commons.wikimedia.org/wiki/File:Broccoli_and_cross_section_edit.jpg
• Pizza: https://commons.wikimedia.org/wiki/File:Eq_it-na_pizza-margherita_sep2005_sml.jpg
Slide 180
Slide 180 text
THANK YOU
@MarcoBonzanini
speakerdeck.com/marcobonzanini
GitHub.com/bonzanini
marcobonzanini.com