Intermezzo (Gradient Descent)
• Optimisation algorithm
• Purpose: find the min (or max) of a function F
Slide 65
Slide 65 text
Intermezzo (Gradient Descent)
• Optimisation algorithm
• Purpose: find the min (or max) of a function F
• Batch-oriented (use all data points)
Slide 66
Slide 66 text
Intermezzo (Gradient Descent)
• Optimisation algorithm
• Purpose: find the min (or max) of a function F
• Batch-oriented (use all data points)
• Stochastic GD: update after each sample
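A minimal sketch of the idea, not code from the talk: plain gradient descent repeatedly steps against the gradient of a function F, while stochastic gradient descent does the same but estimates the gradient from one sample at a time instead of the full batch.

def gradient_descent(x0, lr=0.1, steps=100):
    # minimise F(x) = x**2, whose gradient is 2*x
    x = x0
    for _ in range(steps):
        grad = 2 * x        # gradient of F at the current point
        x = x - lr * grad   # step against the gradient
    return x

gradient_descent(5.0)  # converges towards the minimum at x = 0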
Slide 67
Slide 67 text
Objective Function
Slide 68
Slide 68 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 69
Slide 69 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 70
Slide 70 text
I enjoyed eating some pizza at the restaurant
Objective Function
Slide 71
Slide 71 text
I enjoyed eating some pizza at the restaurant
Maximise the likelihood
of the context given the focus word
Objective Function
Slide 72
Slide 72 text
I enjoyed eating some pizza at the restaurant
Maximise the likelihood
of the context given the focus word
P(i | pizza)
P(enjoyed | pizza)
…
P(restaurant | pizza)
Objective Function
Slide 73
Slide 73 text
Example
I enjoyed eating some pizza at the restaurant
Slide 74
Slide 74 text
I enjoyed eating some pizza at the restaurant
Iterate over context words
Example
Slide 75
Slide 75 text
I enjoyed eating some pizza at the restaurant
bump P( i | pizza )
Example
Slide 76
Slide 76 text
I enjoyed eating some pizza at the restaurant
bump P( enjoyed | pizza )
Example
Slide 77
Slide 77 text
I enjoyed eating some pizza at the restaurant
bump P( eating | pizza )
Example
Slide 78
Slide 78 text
I enjoyed eating some pizza at the restaurant
bump P( some | pizza )
Example
Slide 79
Slide 79 text
I enjoyed eating some pizza at the restaurant
bump P( at | pizza )
Example
Slide 80
Slide 80 text
I enjoyed eating some pizza at the restaurant
bump P( the | pizza )
Example
Slide 81
Slide 81 text
I enjoyed eating some pizza at the restaurant
bump P( restaurant | pizza )
Example
Slide 82
Slide 82 text
I enjoyed eating some pizza at the restaurant
Move to next focus word and repeat
Example
Slide 83
Slide 83 text
I enjoyed eating some pizza at the restaurant
bump P( i | at )
Example
Slide 84
Slide 84 text
I enjoyed eating some pizza at the restaurant
bump P( enjoyed | at )
Example
Slide 85
Slide 85 text
I enjoyed eating some pizza at the restaurant
… you get the picture
Example
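A minimal sketch of the loop on these slides (the helper name and window size are illustrative, not from the talk): for every focus word, visit the surrounding context words whose conditional probability we want to bump.

sentence = 'i enjoyed eating some pizza at the restaurant'.split()

def context_pairs(tokens, window=8):
    # yield (focus, context) pairs within the given window
    for i, focus in enumerate(tokens):
        for j, context in enumerate(tokens):
            if i != j and abs(i - j) <= window:
                yield focus, context

for focus, context in context_pairs(sentence):
    pass  # bump P( context | focus ) here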
Slide 86
Slide 86 text
P( eating | pizza )
Slide 87
Slide 87 text
P( eating | pizza ) ??
Slide 88
Slide 88 text
P( eating | pizza )
Input word
Output word
Slide 89
Slide 89 text
P( eating | pizza )
Input word
Output word
P( vec(eating) | vec(pizza) )
Slide 90
Slide 90 text
P( vout | vin )
P( vec(eating) | vec(pizza) )
P( eating | pizza )
Input word
Output word
Slide 91
Slide 91 text
P( vout | vin )
P( vec(eating) | vec(pizza) )
P( eating | pizza )
Input word
Output word
???
Slide 92
Slide 92 text
P( vout | vin )
Slide 93
Slide 93 text
cosine( vout, vin )
Slide 94
Slide 94 text
cosine( vout, vin ) [-1, 1]
Slide 95
Slide 95 text
softmax(cosine( vout, vin ))
Slide 96
Slide 96 text
softmax(cosine( vout, vin )) [0, 1]
Slide 97
Slide 97 text
softmax(cosine( vout, vin ))
P(vout | vin) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))
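A minimal numpy sketch of this formula (the vocabulary size, dimensionality and random vectors are made up): cosine keeps each score in [-1, 1], and the softmax turns the scores into probabilities in [0, 1] that sum to 1.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

vout_all = np.random.rand(1000, 100)   # one output vector per vocabulary word
vin = np.random.rand(100)              # vec(pizza)

sims = np.array([cosine(v, vin) for v in vout_all])   # each in [-1, 1]
probs = softmax(sims)                                  # each in [0, 1], sums to 1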
Slide 98
Slide 98 text
Vector Calculation Recap
Slide 99
Slide 99 text
Vector Calculation Recap
Learn vec(word)
Slide 100
Slide 100 text
Vector Calculation Recap
Learn vec(word)
by gradient descent
Slide 101
Slide 101 text
Vector Calculation Recap
Learn vec(word)
by gradient descent
on the softmax probability
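A toy version of one such update, assuming dot-product scores as in standard word2vec (the cosine on the previous slides is the intuition; real implementations also replace the full softmax with tricks like negative sampling or hierarchical softmax):

import numpy as np

V, d = 10, 5                                 # toy vocabulary size and vector size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))    # input (focus) vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # output (context) vectors

def sgd_step(focus, context, lr=0.025):
    v_in = W_in[focus]                       # vec(focus word)
    scores = W_out @ v_in                    # one score per vocabulary word
    p = np.exp(scores - scores.max())
    p /= p.sum()                             # softmax probabilities
    grad_scores = p.copy()
    grad_scores[context] -= 1.0              # gradient of -log p[context] w.r.t. scores
    grad_v_in = W_out.T @ grad_scores        # gradient w.r.t. vec(focus word)
    W_out[:] -= lr * np.outer(grad_scores, v_in)
    W_in[focus] -= lr * grad_v_in

sgd_step(focus=4, context=2)                 # one "bump P( context | focus )" update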
Slide 102
Slide 102 text
Plot Twist
Slide 105
Slide 105 text
Paragraph Vector
a.k.a.
doc2vec
i.e.
P(vout | vin, label)
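In gensim this means each document carries a tag alongside its words; a minimal sketch (the texts and tags below are made up):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=['i', 'enjoyed', 'eating', 'some', 'pizza'], tags=['review_1']),
    TaggedDocument(words=['great', 'pizza', 'at', 'the', 'restaurant'], tags=['review_2']),
]
model = Doc2Vec(docs, min_count=1)   # learns a vector per word and per tag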
Slide 106
Slide 106 text
A BIT OF PRACTICE
Slide 108
Slide 108 text
pip install gensim
Slide 109
Slide 109 text
Case Study 1: Skills and CVs
Slide 110
Slide 110 text
Case Study 1: Skills and CVs
Data set of ~300k resumes
Each experience is a “sentence”
Each experience has 3-15 skills
Approx 15k unique skills
Slide 111
Slide 111 text
Case Study 1: Skills and CVs
from gensim.models import Word2Vec
fname = 'candidates.jsonl'
corpus = ResumesCorpus(fname)
model = Word2Vec(corpus)
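ResumesCorpus is not part of gensim; a minimal sketch of what such an iterable might look like, assuming one JSON document per line with 'experiences' and 'skills' fields (the field names are guesses). A class with __iter__, rather than a one-shot generator, lets gensim stream over the data more than once.

import json

class ResumesCorpus:
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # yield one list of skill tokens per experience
        with open(self.fname) as f:
            for line in f:
                candidate = json.loads(line)
                for experience in candidate.get('experiences', []):
                    yield experience.get('skills', [])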
Slide 112
Slide 112 text
Case Study 1: Skills and CVs
model.most_similar('chef')
[('cook', 0.94),
('bartender', 0.91),
('waitress', 0.89),
('restaurant', 0.76),
...]
Slide 113
Slide 113 text
Case Study 1: Skills and CVs
model.most_similar('chef',
negative=['food'])
[('puppet', 0.93),
('devops', 0.92),
('ansible', 0.79),
('salt', 0.77),
...]
Slide 114
Slide 114 text
Case Study 1: Skills and CVs
Useful for:
Data exploration
Query expansion/suggestion
Recommendations
Slide 115
Slide 115 text
Case Study 2: Beer!
Slide 116
Slide 116 text
Case Study 2: Beer!
Data set of ~2.9M beer reviews
89 different beer styles
635k unique tokens
185M total tokens
Slide 117
Slide 117 text
Case Study 2: Beer!
from gensim.models import Doc2Vec
fname = 'ratebeer_data.csv'
corpus = RateBeerCorpus(fname)
model = Doc2Vec(corpus)
Slide 118
Slide 118 text
Case Study 2: Beer!
from gensim.models import Doc2Vec
fname = 'ratebeer_data.csv'
corpus = RateBeerCorpus(fname)
model = Doc2Vec(corpus)
3.5h on my laptop
… remember to pickle
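One way to avoid retraining: gensim models can be persisted with save() and loaded back later (the file name below is illustrative).

model.save('ratebeer_doc2vec.model')
model = Doc2Vec.load('ratebeer_doc2vec.model')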
Slide 119
Slide 119 text
Case Study 2: Beer!
model.docvecs.most_similar('Stout')
[('Sweet Stout', 0.9877),
('Porter', 0.9620),
('Foreign Stout', 0.9595),
('Dry Stout', 0.9561),
('Imperial/Strong Porter', 0.9028),
...]
Slide 120
Slide 120 text
Case Study 2: Beer!
model.most_similar([model.docvecs['Stout']])
[('coffee', 0.6342),
('espresso', 0.5931),
('charcoal', 0.5904),
('char', 0.5631),
('bean', 0.5624),
...]
Case Study 2: Beer!
Useful for:
Understanding the language of beer enthusiasts
Planning your next pint
Classification
Slide 129
Slide 129 text
Case Study 3: Evil AI
Slide 130
Slide 130 text
Case Study 3: Evil AI
from gensim.models.keyedvectors \
import KeyedVectors
fname = 'GoogleNews-vectors.bin'
model = KeyedVectors.load_word2vec_format(
fname,
binary=True
)
Slide 131
Slide 131 text
Case Study 3: Evil AI
model.most_similar(
positive=['king', 'woman'],
negative=['man']
)
Slide 132
Slide 132 text
Case Study 3: Evil AI
model.most_similar(
positive=['king', 'woman'],
negative=['man']
)
[('queen', 0.7118),
('monarch', 0.6189),
('princess', 0.5902),
('crown_prince', 0.5499),
('prince', 0.5377),
…]
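This is roughly the vector arithmetic behind the call (a sketch, not gensim's exact implementation, which works on normalised vectors and excludes the query words from the results):

v = model['king'] - model['man'] + model['woman']
model.similar_by_vector(v, topn=5)   # 'queen' ranks near the top (after 'king' itself)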
Slide 133
Slide 133 text
Case Study 3: Evil AI
model.most_similar(
positive=['Paris', 'Italy'],
negative=['France']
)
Slide 134
Slide 134 text
Case Study 3: Evil AI
model.most_similar(
positive=['Paris', 'Italy'],
negative=['France']
)
[('Milan', 0.7222),
('Rome', 0.7028),
('Palermo_Sicily', 0.5967),
('Italian', 0.5911),
('Tuscany', 0.5632),
…]
Slide 135
Slide 135 text
Case Study 3: Evil AI
model.most_similar(
positive=['professor', 'woman'],
negative=['man']
)
Slide 136
Slide 136 text
Case Study 3: Evil AI
model.most_similar(
positive=['professor', 'woman'],
negative=['man']
)
[('associate_professor', 0.7771),
('assistant_professor', 0.7558),
('professor_emeritus', 0.7066),
('lecturer', 0.6982),
('sociology_professor', 0.6539),
…]
Slide 137
Slide 137 text
Case Study 3: Evil AI
model.most_similar(
positive=['computer_programmer', 'woman'],
negative=['man']
)
Slide 138
Slide 138 text
Case Study 3: Evil AI
model.most_similar(
positive=['computer_programmer', 'woman'],
negative=['man']
)
[('homemaker', 0.5627),
('housewife', 0.5105),
('graphic_designer', 0.5051),
('schoolteacher', 0.4979),
('businesswoman', 0.4934),
…]
Slide 139
Slide 139 text
Case Study 3: Evil AI
• Culture is biased
Slide 140
Slide 140 text
Case Study 3: Evil AI
• Culture is biased
• Language is biased
Slide 141
Slide 141 text
Case Study 3: Evil AI
• Culture is biased
• Language is biased
• Algorithms are not?
Slide 142
Slide 142 text
Case Study 3: Evil AI
• Culture is biased
• Language is biased
• Algorithms are not?
• “Garbage in, garbage out”
Slide 143
Slide 143 text
Case Study 3: Evil AI
Slide 144
Slide 144 text
FINAL REMARKS
Slide 145
Slide 145 text
But we’ve been
doing this for X years
Slide 146
Slide 146 text
But we’ve been
doing this for X years
• Approaches based on co-occurrences are not new
• Think SVD / LSA / LDA
• … but they are usually outperformed by word2vec
• … and don’t scale as well as word2vec
Slide 147
Slide 147 text
Efficiency
Slide 148
Slide 148 text
Efficiency
• There is no co-occurrence matrix
(vectors are learned directly)
• Softmax has complexity O(V)
Hierarchical Softmax only O(log(V))
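In gensim this is a flag on the model; a minimal sketch (corpus stands for any iterable of token lists, as in the case studies; hs=1 enables hierarchical softmax and negative=0 switches off negative sampling):

from gensim.models import Word2Vec
model = Word2Vec(corpus, hs=1, negative=0)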
Slide 149
Slide 149 text
Garbage in, garbage out
Slide 150
Slide 150 text
Garbage in, garbage out
• Pre-trained vectors are useful
• … until they’re not
• The business domain is important
• The pre-processing steps are important
• > 100K words? Maybe train your own model
• > 1M words? Yep, train your own model
Slide 151
Slide 151 text
Summary
Slide 152
Slide 152 text
Summary
• Word Embeddings are magic!
• Big victory of unsupervised learning
• Gensim makes your life easy
Slide 153
Slide 153 text
Credits & Readings
Slide 154
Slide 154 text
Credits & Readings
Credits
• Lev Konstantinovskiy (@gensim_py)
• Chris E. Moody (@chrisemoody) see videos on lda2vec
Readings
• Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
• “word2vec parameter learning explained” by Xin Rong
More readings
• “GloVe: global vectors for word representation” by Pennington et al.
• “Dependency based word embeddings” and “Neural word embeddings
as implicit matrix factorization” by O. Levy and Y. Goldberg
Slide 155
Slide 155 text
Credits & Readings
Even More Readings
• “Man is to Computer Programmer as Woman is to Homemaker?
Debiasing Word Embeddings” by Bolukbasi et al.
• “Quantifying and Reducing Stereotypes in Word Embeddings” by
Bolukbasi et al.
• “Equality of Opportunity in Machine Learning” - Google Research Blog
https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html
Pics Credits
• Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
• Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
Slide 156
Slide 156 text
THANK YOU
@MarcoBonzanini
GitHub.com/bonzanini
marcobonzanini.com