
# Introduction to Word Embeddings

France is to Paris as Czechia is to _______. If I asked you to fill in the blank, you would answer "Prague" right away, even without a clue such as "the answer is a capital city". Our existing knowledge tells us that France and Paris have a country-capital relationship, and since Prague is the capital city of Czechia, that must be the answer.

However, computers don't know that Prague belongs to the same "category" as Paris and other capital cities unless we tell them so. If we want computers to understand human language as well as we do, there are far too many things we would need to teach them explicitly. Is there a better way?

With word embeddings, we represent words as series of numbers. This opens up a whole new world for computers, because now they can understand the context of a word and infer relationships between words using numbers and maths, the language they are proficient in. We'll delve into what word embeddings actually are and why we need them, popular word embedding models, what problems you can solve using word embeddings, and how you can use word embeddings with Python.
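To make this concrete, here is a toy sketch of the France : Paris :: Czechia : ? analogy. The three-dimensional vectors below are invented for illustration; real embeddings have tens to hundreds of dimensions learned from text.

```python
from math import sqrt

# Toy "embeddings" (values made up for illustration only).
vectors = {
    "france":  [0.9, 0.1, 0.2],
    "paris":   [0.8, 0.2, 0.9],
    "czechia": [0.7, 0.3, 0.1],
    "prague":  [0.6, 0.4, 0.8],
}

def cosine(a, b):
    # Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# "France is to Paris as Czechia is to ?" becomes vector arithmetic:
# paris - france + czechia should land near prague.
query = [p - f + c for p, f, c in
         zip(vectors["paris"], vectors["france"], vectors["czechia"])]
best = max((w for w in vectors if w != "czechia"),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # prague
```

Pre-trained embeddings answer the same question the same way, just in a higher-dimensional space, as we'll see with gensim later on.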

Jupyter Notebook is available here: https://github.com/galuhsahid/intro-to-word-embeddings

June 14, 2019

## Transcript

8. ### Lake Forest Ocean River

Lake: body of water. Forest: not a body of water. Ocean: body of water. River: body of water. @galuhsahid
9. ### Lake Forest Ocean River

Lake: doesn't have trees. Forest: has trees. Ocean: doesn't have trees. River: doesn't have trees.

11. ### What kind of representation do we want?

• Real numbers
• What do we want to know about a word? Whether words have the same meaning, semantic relationships, etc.
• Can we do it without labelling everything manually?
• Ideally it's not too large!
12. ### Word Embeddings to the Rescue

Represent words as vectors of real numbers with far fewer dimensions, yielding dense vectors. We're putting words, which live outside any vector space, into a vector space; hence, we're embedding the words into that vector space.
13. ### Lake

[0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132] (Spoiler alert)

15. ### Visualization

Lake, Ocean, River, Forest, Pizza plotted as points. Maybe this dimension represents the concept of whether it is a food or not…
16. ### Visualization

…or it could be something not intuitive to us. We actually have no idea; it could be anything.
17. ### One-Hot Encoding

Our vocabulary: lake, forest, ocean, river, pizza. Size: |V| = 5

Lake: [1, 0, 0, 0, 0]
Forest: [0, 1, 0, 0, 0]
Ocean: [0, 0, 1, 0, 0]
River: [0, 0, 0, 1, 0]
Pizza: [0, 0, 0, 0, 1]
18. ### One-Hot Encoding

Our vocabulary: every English word (approx. 171,476 words in use). Size: |V| = ~171,476

Aardvark: [1, 0, 0, …, 0, …, 0, 0]
Lake: [0, 0, 0, …, 1, …, 0, 0]
Zyzzogeton: [0, 0, 0, …, 0, …, 0, 1]

https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language/
19. ### Distributional Representation

"Tell me who your friends are, and I'll tell you who you are."
20. ### Distributional Representation

"You shall know a word by the company it keeps." (Firth, 1957)
21. ### Distributional Representation

Distributional Hypothesis (Harris, 1954): words that occur in similar contexts have similar meanings.
22. ### Distributional Representation

A lake is a large body of water in a body of land. An ocean is a large area of water between continents. A river is a stream of water that flows through a channel in the surface of the ground. A forest is a piece of land with many trees. Pizza is a type of food that was created in Italy.

26. ### Approaches

• Count-based methods: compute how often a word co-occurs with its neighbour words, then map the counts to a small, dense vector
27. ### Count-based

Lake: large body water. Ocean: large area water. River: stream water flows. Forest: piece land many trees. Pizza: type food created Italy.

Neighbour words (window = 4): [large, body, water], [large, area, water], [stream, water, flows], [piece, land, many], [type, food, created]
28. ### Count-based

|        | Large | Body | Water | Area | Stream | Flows | Piece | Land | Many | Type | Food | Created |
|--------|-------|------|-------|------|--------|-------|-------|------|------|------|------|---------|
| Lake   | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ocean  | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| River  | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Forest | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| Pizza  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
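The count-based matrix on the slide can be reproduced with a short sketch. The word lists come from the slide; the helper names are my own:

```python
from collections import Counter

# Target words and their neighbour words, as listed on the slide
# (stopwords already removed from the definition sentences).
contexts = {
    "lake":   ["large", "body", "water"],
    "ocean":  ["large", "area", "water"],
    "river":  ["stream", "water", "flows"],
    "forest": ["piece", "land", "many"],
    "pizza":  ["type", "food", "created"],
}

# Columns of the co-occurrence matrix: every neighbour word seen.
columns = sorted({w for ws in contexts.values() for w in ws})

def count_vector(word):
    # One row of the co-occurrence matrix: how often `word`
    # co-occurs with each column word.
    counts = Counter(contexts[word])
    return [counts[c] for c in columns]
```

"Lake" and "ocean" share the columns "large" and "water", so their count vectors overlap, while "pizza" shares nothing with either: co-occurrence counts already encode some similarity.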
29. ### Approaches

• Count-based methods: compute how often a word co-occurs with its neighbour words, then map the counts to a small, dense vector
• Reduce dimensions using Singular Value Decomposition (SVD) or Latent Dirichlet Allocation (LDA)
30. ### Approaches

• Predictive methods: try to predict a word from its neighbours in terms of small, dense embedding vectors

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238-247).
31. ### Predictive Methods

• Word2Vec: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• GloVe: Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
32. ### Predictive Methods

• FastText: https://research.fb.com/downloads/fasttext/
• ELMo: Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
34. ### Word2Vec: Architecture

• Continuous Bag-of-Words (CBOW)
• Skip-gram
35. ### Skip-gram

pizza → [0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132]
36. ### Skip-gram

"I ate the leftover pizza for dinner"
37. ### Skip-gram

"I ate the leftover pizza for dinner". Window size = 5
38. ### Skip-gram

"I ate the leftover pizza for dinner". Window size = 5. Neighbour words: the, leftover, for, dinner
39. ### Overview

"I ate the leftover pizza for dinner". Input: pizza → Projection → Output: the, leftover, for, dinner
40. ### Skip-gram
42. ### Architecture

Input: pizza as a one-hot vector [0, 0, …, 1, 0, 0, …] → Projection layer (V × D), where V = vocabulary size and D = number of dimensions
46. ### Projection Layer

The projection layer is a V × D matrix with one row per vocabulary word (Aardvark … Pizza … Zyzzogeton); each row, e.g. [0.89, 2.31, …, -0.52] for Pizza, is that word's vector.
48. ### Architecture

Input: pizza as a one-hot vector [0, 0, …, 1, 0, 0, …] → Projection (V × D) → Output (softmax): the, leftover, for, dinner
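A toy forward pass through this architecture, sketched under the assumptions above: a one-hot input selecting a row of the V × D projection matrix, then a softmax over all vocabulary words. Sizes and weights here are made up; real training would also update both matrices by gradient descent.

```python
import random
from math import exp

random.seed(0)
V, D = 7, 4  # toy vocabulary size and embedding dimension

# Projection (embedding) matrix and output weights, randomly initialized.
W_in = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
W_out = [[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def forward(word_index):
    # Multiplying a one-hot vector by W_in just selects one row:
    # that row is the input word's embedding.
    hidden = W_in[word_index]
    # The output layer scores every vocabulary word; softmax turns
    # the scores into a probability of each word being a neighbour.
    scores = [sum(hidden[d] * W_out[d][v] for d in range(D)) for v in range(V)]
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward(3)  # probabilities over all V words, summing to 1
```

After training, the output layer is thrown away and the rows of W_in are kept as the word vectors.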

51. ### Word Vector

Pizza: [0.89, 2.31, …, -0.52]. This is our word vector!
52. ### Word Vector

Pizza: [0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132]. This is our word vector!
53. ### The Intuition

"I ate the leftover pizza for dinner": pizza [0.89, 2.31, …, -0.52], neighbour words the, leftover, for, dinner. "I need some leftover chicken recipes for dinner": chicken [0.76, 2.01, …, -0.47], neighbour words some, leftover, recipes, for. Words that appear in similar contexts end up with similar vectors.
54. ### The Intuition

"The German embassy in Prague is located in…": Prague [0.32, 0.43, …, -0.21], neighbour words embassy, in, is, located. A different context, so a dissimilar vector.
55. ### Architecture

More details: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

57. ### Pre-trained Models

• Gensim has an API to download pre-trained word embedding models. The list of available models can be found here.
58. ### Loading the Model

from gensim.models import KeyedVectors

model_w2v = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
59. ### Word Vector

model_w2v["lake"]

array([-8.39843750e-02, 2.02148438e-01, 2.65625000e-01, 1.04980469e-01, -7.95898438e-02, 1.05957031e-01, -5.39550781e-02, 8.11767578e-03, 9.32617188e-02, -7.66601562e-02, 1.56250000e-01, -1.19628906e-01, … -4.15039062e-02, 4.08935547e-03, -2.47070312e-01, -1.78710938e-01, 3.33984375e-01, -1.79687500e-01], dtype=float32)
60. ### Similar Words

model_w2v.most_similar("apple")

[('apples', 0.7203598022460938), ('pear', 0.6450696587562561), ('fruit', 0.6410146355628967), ('berry', 0.6302294731140137), ('pears', 0.6133961081504822), ('strawberry', 0.6058261394500732), ('peach', 0.6025873422622681), ('potato', 0.596093475818634), ('grape', 0.5935864448547363), ('blueberry', 0.5866668224334717)]

• Similar words are nearby vectors in a vector space
• The distance is calculated using cosine similarity
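The cosine similarity scores above can be computed by hand. A minimal sketch of the formula (gensim's implementation is vectorized and optimized, but the maths is the same):

```python
from math import sqrt

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0, parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, orthogonal vectors
```

Because it measures the angle rather than the length of the vectors, cosine similarity stays between -1 and 1 regardless of vector magnitude.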
61. ### Get Similarity

model_w2v.similarity("apple", "mango")

0.57518554
62. ### Odd One Out

model_w2v.doesnt_match(["lake", "forest", "ocean", "river"])

'forest'
63. ### Analogies

• man is to uncle as woman is to ...

model_w2v.most_similar(positive=["uncle", "woman"], negative=["man"])
64. ### Analogies

• man is to uncle as woman is to ...
• uncle + woman - man: words that are similar to uncle and woman but dissimilar to man

model_w2v.most_similar(positive=["uncle", "woman"], negative=["man"])

[('aunt', 0.8022665977478027), ('mother', 0.7770732045173645), ('niece', 0.768424928188324), ('father', 0.7237852811813354), ('grandmother', 0.722037136554718), ('daughter', 0.7185647487640381), ('sister', 0.7006258368492126), ('husband', 0.6982548236846924), ('granddaughter', 0.6858304738998413), ('nephew', 0.6710714101791382)]
65. ### Analogies

• Germany is to Berlin as France is to...

model_w2v.most_similar(positive=["Berlin", "France"], negative=["Germany"])

[('Paris', 0.7672388553619385), ('French', 0.6049168109893799), ('Parisian', 0.5810437202453613), ('Colombes', 0.5599985718727112), ('Hopital_Europeen_Georges_Pompidou', 0.555890679359436), ('Melun', 0.551270067691803), ('Dinard', 0.5451847314834595), ('Brussels', 0.5420989990234375), ('Mairie_de', 0.5337448120117188), ('Cagnes_sur_Mer', 0.531246542930603)]
67. ### Analogies

• running is to run as walking is to…

model_w2v.most_similar(positive=["run", "walking"], negative=["running"])
68. ### Analogies

• running is to run as walking is to…

model_w2v.most_similar(positive=["run", "walking"], negative=["running"])

[('walk', 0.7163699865341187), ('walks', 0.5965700745582581), ('walked', 0.5833066701889038), ('stroll', 0.5236037969589233), ('pinch_hitter_Yunel_Escobar', 0.4562637209892273), ('Walking', 0.455409437417984), ('Batterymate_Miguel_Olivo', 0.4483090043067932), ('runs', 0.4462803602218628), ('pinch_hitter_Carlos_Guillen', 0.4402925372123718), ('Justin_Speier_relieved', 0.43528205156326294)]
69. ### Analogies

https://www.tensorflow.org/tutorials/representation/word2vec
70. ### http://bionlp-www.utu.fi/wv_demo/
71. ### https://indonesian-word-embedding.herokuapp.com

http://github.com/galuhsahid/indonesian-word-embedding
72. ### Training Your Own

• Sure you can!
• When to do so? Specific problem domains
• Challenge: training data
• Another alternative: continue training a pre-existing word embedding
73. ### Application: Search
74. ### Application: Neural Machine Translation

Qi, Y., Sachan, D. S., Felix, M., Padmanabhan, S. J., & Neubig, G. (2018). When and why are pre-trained word embeddings useful for neural machine translation?. arXiv preprint arXiv:1804.06323.
75. ### Application: Recommendation Engine

https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484
76. ### Challenge: Out-of-vocabulary Words

• Word2vec doesn't handle out-of-vocabulary words
• FastText handles them because it trains on character n-grams instead of whole words, breaking each word down into n-grams
• ELMo also handles them because it trains the model at the character level

With min_n = max_n = 3:
amazing → <am, ama, maz, azi, zin, ing, ng> (vectors V1 … V7)
amazin → <am, ama, maz, azi, zin, in> (V1 … V5 shared with "amazing", plus a new n-gram vector)
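A rough sketch of how FastText-style character n-grams can be extracted. FastText's real implementation also hashes the n-grams into buckets and includes the whole word as its own token; this shows just the slicing idea:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, roughly as FastText
    builds them (< and > mark the start and end of the word)."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("amazing"))
# ['<am', 'ama', 'maz', 'azi', 'zin', 'ing', 'ng>']
```

An out-of-vocabulary word like "amazin" shares most of its n-grams with "amazing", so its vector, built by summing the n-gram vectors, lands nearby.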
77. ### Challenge: Polysemy

The word "rock"
78. ### Challenge: Polysemy

The word "rock": https://unsplash.com/photos/I4zSNSxR8oA https://unsplash.com/photos/xssEs_oCv-A
80. ### Challenge: Polysemy

The word "rock": https://unsplash.com/photos/I4zSNSxR8oA https://unsplash.com/photos/xssEs_oCv-A https://www.muscleandfitness.com/workouts/athletecelebrity-workouts/dwayne-rock-johnsons-shoulder-workout
81. ### Challenge: Polysemy

He caught a fish at the bank of the river. The bank at the end of the street was robbed yesterday.
82. ### Challenge: Polysemy

He caught a fish at the bank of the river. The bank at the end of the street was robbed yesterday. Same word, different meaning.
83. ### Challenge: Polysemy

Same word, different meaning: more recent models such as ELMo and BERT will assign different vectors to the word "bank" because it appears in different contexts.

86. ### Challenge: Bias

"Our results indicate that text corpora contain recoverable and accurate imprints of our historic biases, whether morally neutral as towards insects or flowers, problematic as towards race or gender, or even simply veridical, reflecting the status quo distribution of gender with respect to careers or first names." "Certainly, caution must be used in incorporating modules constructed via unsupervised machine learning into decision-making systems."

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
87. ### Challenge: Bias

De-biasing word embeddings: Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in neural information processing systems (pp. 4349-4357).
88. ### Challenge: Bias

"We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling."

Gonen, H., & Goldberg, Y. (2019). Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. arXiv preprint arXiv:1903.03862.