
# Introduction to Word Embeddings

France is to Paris as Czechia is to _______. If I asked you to fill in the blank, you would answer "Prague" right away, even without a clue such as "the answer is a capital city". Our existing knowledge tells us that France and Paris have a country-capital relationship, and since Prague is the capital city of Czechia, that must be the answer.

However, computers don't know that Prague belongs to the same "category" as Paris and other capital cities unless we tell them so. If we want computers to understand human language as well as we do, there are far too many things we would need to teach them explicitly. Is there a better way?

With word embeddings, we represent words as series of numbers. This opens up a whole new world for computers, because now they can understand the context of a word and infer relationships between words using numbers and maths, the language they are proficient in. We'll delve into what word embeddings actually are and why we need them, popular word embedding models, what problems you can solve using word embeddings, and how you can use word embeddings with Python.
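To make this concrete, here is a toy sketch of the France : Paris :: Czechia : ? analogy. The three-dimensional vectors below are invented for illustration; real embeddings have tens to hundreds of dimensions learned from text.

```python
from math import sqrt

# Toy "embeddings" (values made up for illustration only).
vectors = {
    "france":  [0.9, 0.1, 0.2],
    "paris":   [0.8, 0.2, 0.9],
    "czechia": [0.7, 0.3, 0.1],
    "prague":  [0.6, 0.4, 0.8],
}

def cosine(a, b):
    # Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# "France is to Paris as Czechia is to ?" becomes vector arithmetic:
# paris - france + czechia should land near prague.
query = [p - f + c for p, f, c in
         zip(vectors["paris"], vectors["france"], vectors["czechia"])]
best = max((w for w in vectors if w != "czechia"),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # prague
```

Pre-trained embeddings answer the same question the same way, just in a higher-dimensional space, as we'll see with gensim later on.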

Jupyter Notebook is available here: https://github.com/galuhsahid/intro-to-word-embeddings

June 14, 2019

## Transcript

8. ### Lake Forest Ocean River

Lake: body of water. Forest: not a body of water. Ocean: body of water. River: body of water. @galuhsahid
9. ### Lake Forest Ocean River

Lake: doesn't have trees. Forest: has trees. Ocean: doesn't have trees. River: doesn't have trees.

11. ### What kind of representation do we want?

• Real numbers
• What do we want to know about a word? Whether words have the same meaning, semantic relationships, etc.
• Can we do it without labelling everything manually?
• Ideally it's not too large!
12. ### Word Embeddings to the Rescue

Represent words as vectors of real numbers with far fewer dimensions, yielding dense vectors. We're putting words, which live outside any vector space, into a vector space; hence, we're embedding the words into that vector space.
13. ### Lake

[0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132] (Spoiler alert)

15. ### Visualization

Lake, Ocean, River, Forest, Pizza plotted as points. Maybe this dimension represents the concept of whether it is a food or not…
16. ### Visualization

…or it could be something not intuitive to us. We actually have no idea; it could be anything.
17. ### One-Hot Encoding

Our vocabulary: lake, forest, ocean, river, pizza. Size: |V| = 5

Lake: [1, 0, 0, 0, 0]
Forest: [0, 1, 0, 0, 0]
Ocean: [0, 0, 1, 0, 0]
River: [0, 0, 0, 1, 0]
Pizza: [0, 0, 0, 0, 1]
18. ### One-Hot Encoding

Our vocabulary: every English word (approx. 171,476 words in use). Size: |V| = ~171,476

Aardvark: [1, 0, 0, …, 0, …, 0, 0]
Lake: [0, 0, 0, …, 1, …, 0, 0]
Zyzzogeton: [0, 0, 0, …, 0, …, 0, 1]

https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language/
19. ### Distributional Representation

"Tell me who your friends are, and I'll tell you who you are."
20. ### Distributional Representation

"You shall know a word by the company it keeps." (Firth, 1957)
21. ### Distributional Representation

Distributional Hypothesis (Harris, 1954): words that occur in similar contexts have similar meanings.
22. ### Distributional Representation

A lake is a large body of water in a body of land. An ocean is a large area of water between continents. A river is a stream of water that flows through a channel in the surface of the ground. A forest is a piece of land with many trees. Pizza is a type of food that was created in Italy.

26. ### Approaches

• Count-based methods: compute how often a word co-occurs with its neighbour words, then map the counts to a small, dense vector
27. ### Count-based

Lake: large body water. Ocean: large area water. River: stream water flows. Forest: piece land many trees. Pizza: type food created Italy.

Neighbour words (window = 4): [large, body, water], [large, area, water], [stream, water, flows], [piece, land, many], [type, food, created]
28. ### Count-based

|        | Large | Body | Water | Area | Stream | Flows | Piece | Land | Many | Type | Food | Created |
|--------|-------|------|-------|------|--------|-------|-------|------|------|------|------|---------|
| Lake   | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ocean  | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| River  | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Forest | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| Pizza  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
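The count-based matrix on the slide can be reproduced with a short sketch. The word lists come from the slide; the helper names are my own:

```python
from collections import Counter

# Target words and their neighbour words, as listed on the slide
# (stopwords already removed from the definition sentences).
contexts = {
    "lake":   ["large", "body", "water"],
    "ocean":  ["large", "area", "water"],
    "river":  ["stream", "water", "flows"],
    "forest": ["piece", "land", "many"],
    "pizza":  ["type", "food", "created"],
}

# Columns of the co-occurrence matrix: every neighbour word seen.
columns = sorted({w for ws in contexts.values() for w in ws})

def count_vector(word):
    # One row of the co-occurrence matrix: how often `word`
    # co-occurs with each column word.
    counts = Counter(contexts[word])
    return [counts[c] for c in columns]
```

"Lake" and "ocean" share the columns "large" and "water", so their count vectors overlap, while "pizza" shares nothing with either: co-occurrence counts already encode some similarity.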
29. ### Approaches

• Count-based methods: compute how often a word co-occurs with its neighbour words, then map the counts to a small, dense vector
• Reduce dimensions using Singular Value Decomposition (SVD) or Latent Dirichlet Allocation (LDA)
30. ### Approaches

• Predictive methods: try to predict a word from its neighbours in terms of small, dense embedding vectors

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238-247).
31. ### Predictive Methods

• Word2Vec: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• GloVe: Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
32. ### Predictive Methods

• FastText: https://research.fb.com/downloads/fasttext/
• ELMo: Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
34. ### Word2Vec: Architecture

• Continuous Bag-of-Words (CBOW)
• Skip-gram
35. ### Skip-gram

pizza → [0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132]
36. ### Skip-gram

"I ate the leftover pizza for dinner"
37. ### Skip-gram

"I ate the leftover pizza for dinner". Window size = 5
38. ### Skip-gram

"I ate the leftover pizza for dinner". Window size = 5. Neighbour words: the, leftover, for, dinner
39. ### Overview

"I ate the leftover pizza for dinner". Input: pizza → Projection → Output: the, leftover, for, dinner
40. ### Skip-gram
42. ### Architecture

Input: pizza as a one-hot vector [0, 0, …, 1, 0, 0, …] → Projection layer (V × D), where V = vocabulary size and D = number of dimensions
46. ### Projection Layer

The projection layer is a V × D matrix with one row per vocabulary word (Aardvark … Pizza … Zyzzogeton); each row, e.g. [0.89, 2.31, …, -0.52] for Pizza, is that word's vector.
48. ### Architecture

Input: pizza as a one-hot vector [0, 0, …, 1, 0, 0, …] → Projection (V × D) → Output (softmax): the, leftover, for, dinner
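A toy forward pass through this architecture, sketched under the assumptions above: a one-hot input selecting a row of the V × D projection matrix, then a softmax over all vocabulary words. Sizes and weights here are made up; real training would also update both matrices by gradient descent.

```python
import random
from math import exp

random.seed(0)
V, D = 7, 4  # toy vocabulary size and embedding dimension

# Projection (embedding) matrix and output weights, randomly initialized.
W_in = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
W_out = [[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def forward(word_index):
    # Multiplying a one-hot vector by W_in just selects one row:
    # that row is the input word's embedding.
    hidden = W_in[word_index]
    # The output layer scores every vocabulary word; softmax turns
    # the scores into a probability of each word being a neighbour.
    scores = [sum(hidden[d] * W_out[d][v] for d in range(D)) for v in range(V)]
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward(3)  # probabilities over all V words, summing to 1
```

After training, the output layer is thrown away and the rows of W_in are kept as the word vectors.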

51. ### Word Vector

Pizza: [0.89, 2.31, …, -0.52]. This is our word vector!
52. ### Word Vector

Pizza: [0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132]. This is our word vector!
53. ### The Intuition

"I ate the leftover pizza for dinner": pizza [0.89, 2.31, …, -0.52], neighbour words the, leftover, for, dinner. "I need some leftover chicken recipes for dinner": chicken [0.76, 2.01, …, -0.47], neighbour words some, leftover, recipes, for. Words that appear in similar contexts end up with similar vectors.
54. ### The Intuition

"The German embassy in Prague is located in…": Prague [0.32, 0.43, …, -0.21], neighbour words embassy, in, is, located. A different context, so a dissimilar vector.
55. ### Architecture

More details: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

57. ### Pre-trained Models

• Gensim has an API to download pre-trained word embedding models. The list of available models can be found here.
58. ### Loading the Model

from gensim.models import KeyedVectors

model_w2v = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
59. ### Word Vector

model_w2v["lake"]

array([-8.39843750e-02, 2.02148438e-01, 2.65625000e-01, 1.04980469e-01, -7.95898438e-02, 1.05957031e-01, -5.39550781e-02, 8.11767578e-03, 9.32617188e-02, -7.66601562e-02, 1.56250000e-01, -1.19628906e-01, … -4.15039062e-02, 4.08935547e-03, -2.47070312e-01, -1.78710938e-01, 3.33984375e-01, -1.79687500e-01], dtype=float32)
60. ### Similar Words

model_w2v.most_similar("apple")

[('apples', 0.7203598022460938), ('pear', 0.6450696587562561), ('fruit', 0.6410146355628967), ('berry', 0.6302294731140137), ('pears', 0.6133961081504822), ('strawberry', 0.6058261394500732), ('peach', 0.6025873422622681), ('potato', 0.596093475818634), ('grape', 0.5935864448547363), ('blueberry', 0.5866668224334717)]

• Similar words are nearby vectors in a vector space
• The distance is calculated using cosine similarity
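The cosine similarity scores above can be computed by hand. A minimal sketch of the formula (gensim's implementation is vectorized and optimized, but the maths is the same):

```python
from math import sqrt

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0, parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, orthogonal vectors
```

Because it measures the angle rather than the length of the vectors, cosine similarity stays between -1 and 1 regardless of vector magnitude.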
61. ### Get Similarity

model_w2v.similarity("apple", "mango")

0.57518554
62. ### Odd One Out

model_w2v.doesnt_match(["lake", "forest", "ocean", "river"])

'forest'
63. ### Analogies

• man is to uncle as woman is to ...

model_w2v.most_similar(positive=["uncle", "woman"], negative=["man"])
64. ### Analogies

• man is to uncle as woman is to ...
• uncle + woman - man: words that are similar to uncle and woman but dissimilar to man

model_w2v.most_similar(positive=["uncle", "woman"], negative=["man"])

[('aunt', 0.8022665977478027), ('mother', 0.7770732045173645), ('niece', 0.768424928188324), ('father', 0.7237852811813354), ('grandmother', 0.722037136554718), ('daughter', 0.7185647487640381), ('sister', 0.7006258368492126), ('husband', 0.6982548236846924), ('granddaughter', 0.6858304738998413), ('nephew', 0.6710714101791382)]
65. ### Analogies

• Germany is to Berlin as France is to...

model_w2v.most_similar(positive=["Berlin", "France"], negative=["Germany"])

[('Paris', 0.7672388553619385), ('French', 0.6049168109893799), ('Parisian', 0.5810437202453613), ('Colombes', 0.5599985718727112), ('Hopital_Europeen_Georges_Pompidou', 0.555890679359436), ('Melun', 0.551270067691803), ('Dinard', 0.5451847314834595), ('Brussels', 0.5420989990234375), ('Mairie_de', 0.5337448120117188), ('Cagnes_sur_Mer', 0.531246542930603)]
67. ### Analogies

• running is to run as walking is to…

model_w2v.most_similar(positive=["run", "walking"], negative=["running"])
68. ### Analogies

• running is to run as walking is to…

model_w2v.most_similar(positive=["run", "walking"], negative=["running"])

[('walk', 0.7163699865341187), ('walks', 0.5965700745582581), ('walked', 0.5833066701889038), ('stroll', 0.5236037969589233), ('pinch_hitter_Yunel_Escobar', 0.4562637209892273), ('Walking', 0.455409437417984), ('Batterymate_Miguel_Olivo', 0.4483090043067932), ('runs', 0.4462803602218628), ('pinch_hitter_Carlos_Guillen', 0.4402925372123718), ('Justin_Speier_relieved', 0.43528205156326294)]
69. ### Analogies

https://www.tensorflow.org/tutorials/representation/word2vec
70. ### http://bionlp-www.utu.fi/wv_demo/
71. ### https://indonesian-word-embedding.herokuapp.com

http://github.com/galuhsahid/indonesian-word-embedding
72. ### Training Your Own

• Sure you can!
• When to do so? Specific problem domains
• Challenge: training data
• Another alternative: continue training a pre-existing word embedding
73. ### Application: Search
74. ### Application: Neural Machine Translation

Qi, Y., Sachan, D. S., Felix, M., Padmanabhan, S. J., & Neubig, G. (2018). When and why are pre-trained word embeddings useful for neural machine translation?. arXiv preprint arXiv:1804.06323.
75. ### Application: Recommendation Engine

https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484
76. ### Challenge: Out-of-vocabulary Words

• Word2vec doesn't handle out-of-vocabulary words
• FastText handles them because it trains on character n-grams instead of whole words, breaking each word down into n-grams
• ELMo also handles them because it trains the model at the character level

With min_n = max_n = 3:
amazing → <am, ama, maz, azi, zin, ing, ng> (vectors V1 … V7)
amazin → <am, ama, maz, azi, zin, in> (V1 … V5 shared with "amazing", plus a new n-gram vector)
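A rough sketch of how FastText-style character n-grams can be extracted. FastText's real implementation also hashes the n-grams into buckets and includes the whole word as its own token; this shows just the slicing idea:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, roughly as FastText
    builds them (< and > mark the start and end of the word)."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("amazing"))
# ['<am', 'ama', 'maz', 'azi', 'zin', 'ing', 'ng>']
```

An out-of-vocabulary word like "amazin" shares most of its n-grams with "amazing", so its vector, built by summing the n-gram vectors, lands nearby.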
77. ### Challenge: Polysemy

The word "rock"
78. ### Challenge: Polysemy

The word "rock": https://unsplash.com/photos/I4zSNSxR8oA https://unsplash.com/photos/xssEs_oCv-A
80. ### Challenge: Polysemy

The word "rock": https://unsplash.com/photos/I4zSNSxR8oA https://unsplash.com/photos/xssEs_oCv-A https://www.muscleandfitness.com/workouts/athletecelebrity-workouts/dwayne-rock-johnsons-shoulder-workout
81. ### Challenge: Polysemy

He caught a fish at the bank of the river. The bank at the end of the street was robbed yesterday.
82. ### Challenge: Polysemy

He caught a fish at the bank of the river. The bank at the end of the street was robbed yesterday. Same word, different meaning.
83. ### Challenge: Polysemy

Same word, different meaning: more recent models such as ELMo and BERT will assign different vectors to the word "bank" because it appears in different contexts.

86. ### Challenge: Bias

"Our results indicate that text corpora contain recoverable and accurate imprints of our historic biases, whether morally neutral as towards insects or flowers, problematic as towards race or gender, or even simply veridical, reflecting the status quo distribution of gender with respect to careers or first names." "Certainly, caution must be used in incorporating modules constructed via unsupervised machine learning into decision-making systems."

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
87. ### Challenge: Bias

De-biasing word embeddings: Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in neural information processing systems (pp. 4349-4357).
88. ### Challenge: Bias

"We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling."

Gonen, H., & Goldberg, Y. (2019). Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. arXiv preprint arXiv:1903.03862.