
# Introduction to Word Embeddings

France is to Paris as Czechia is to _______. If I asked you to fill in the blank, you would answer "Prague" right away, even without a clue such as "the answer is a capital city". Our existing knowledge tells us that France and Paris have a country-capital relationship, and since Prague is the capital city of Czechia, that must be the answer.

However, computers don't know that Prague belongs to the same "category" as Paris and other capital cities unless we tell them so. If we want computers to understand human language as well as we do, there are far too many things we would need to teach them explicitly. Is there a better way?

With word embeddings, we represent words as series of numbers. This opens up a whole new world for computers, because now they can understand the context of a word and infer relationships between words using numbers and maths, the language they are proficient in. We'll delve into what word embeddings actually are and why we need them, popular word embedding models, what problems you can solve using word embeddings, and how you can use word embeddings with Python.
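To make this concrete, here is a toy sketch of the France : Paris :: Czechia : ? analogy. The three-dimensional vectors below are invented for illustration; real embeddings have tens to hundreds of dimensions learned from text.

```python
from math import sqrt

# Toy "embeddings" (values made up for illustration only).
vectors = {
    "france":  [0.9, 0.1, 0.2],
    "paris":   [0.8, 0.2, 0.9],
    "czechia": [0.7, 0.3, 0.1],
    "prague":  [0.6, 0.4, 0.8],
}

def cosine(a, b):
    # Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# "France is to Paris as Czechia is to ?" becomes vector arithmetic:
# paris - france + czechia should land near prague.
query = [p - f + c for p, f, c in
         zip(vectors["paris"], vectors["france"], vectors["czechia"])]
best = max((w for w in vectors if w != "czechia"),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # prague
```

Pre-trained embeddings answer the same question the same way, just in a higher-dimensional space, as we'll see with gensim later on.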

Jupyter Notebook is available here: https://github.com/galuhsahid/intro-to-word-embeddings

June 14, 2019

## Transcript

8. ### Lake Forest Ocean River

Lake: body of water. Forest: not a body of water. Ocean: body of water. River: body of water. @galuhsahid
9. ### Lake Forest Ocean River

Lake: doesn't have trees. Forest: has trees. Ocean: doesn't have trees. River: doesn't have trees.

11. ### What kind of representation do we want?

• Real numbers
• What do we want to know about a word? Whether words have the same meaning, semantic relationships, etc.
• Can we do it without labelling everything manually?
• Ideally it's not too large!
12. ### Word Embeddings to the Rescue

Represent words as vectors of real numbers with far fewer dimensions, yielding dense vectors. We're putting words, which live outside any vector space, into a vector space; hence, we're embedding the words into that vector space.
13. ### Lake

[0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132] (Spoiler alert)

15. ### Visualization

Lake, Ocean, River, Forest, Pizza plotted as points. Maybe this dimension represents the concept of whether it is a food or not…
16. ### Visualization

…or it could be something not intuitive to us. We actually have no idea; it could be anything.
17. ### One-Hot Encoding

Our vocabulary: lake, forest, ocean, river, pizza. Size: |V| = 5

Lake: [1, 0, 0, 0, 0]
Forest: [0, 1, 0, 0, 0]
Ocean: [0, 0, 1, 0, 0]
River: [0, 0, 0, 1, 0]
Pizza: [0, 0, 0, 0, 1]
18. ### One-Hot Encoding

Our vocabulary: every English word (approx. 171,476 words in use). Size: |V| = ~171,476

Aardvark: [1, 0, 0, …, 0, …, 0, 0]
Lake: [0, 0, 0, …, 1, …, 0, 0]
Zyzzogeton: [0, 0, 0, …, 0, …, 0, 1]

https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language/
19. ### Distributional Representation

"Tell me who your friends are, and I'll tell you who you are."
20. ### Distributional Representation

"You shall know a word by the company it keeps." (Firth, 1957)
21. ### Distributional Representation

Distributional Hypothesis (Harris, 1954): words that occur in similar contexts have similar meanings.
22. ### Distributional Representation

A lake is a large body of water in a body of land. An ocean is a large area of water between continents. A river is a stream of water that flows through a channel in the surface of the ground. A forest is a piece of land with many trees. Pizza is a type of food that was created in Italy.

26. ### Approaches

• Count-based methods: compute how often a word co-occurs with its neighbour words, then map the counts to a small, dense vector
27. ### Count-based

Lake: large body water. Ocean: large area water. River: stream water flows. Forest: piece land many trees. Pizza: type food created Italy.

Neighbour words (window = 4): [large, body, water], [large, area, water], [stream, water, flows], [piece, land, many], [type, food, created]
28. ### Count-based

|        | Large | Body | Water | Area | Stream | Flows | Piece | Land | Many | Type | Food | Created |
|--------|-------|------|-------|------|--------|-------|-------|------|------|------|------|---------|
| Lake   | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ocean  | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| River  | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Forest | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| Pizza  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
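The count-based matrix on the slide can be reproduced with a short sketch. The word lists come from the slide; the helper names are my own:

```python
from collections import Counter

# Target words and their neighbour words, as listed on the slide
# (stopwords already removed from the definition sentences).
contexts = {
    "lake":   ["large", "body", "water"],
    "ocean":  ["large", "area", "water"],
    "river":  ["stream", "water", "flows"],
    "forest": ["piece", "land", "many"],
    "pizza":  ["type", "food", "created"],
}

# Columns of the co-occurrence matrix: every neighbour word seen.
columns = sorted({w for ws in contexts.values() for w in ws})

def count_vector(word):
    # One row of the co-occurrence matrix: how often `word`
    # co-occurs with each column word.
    counts = Counter(contexts[word])
    return [counts[c] for c in columns]
```

"Lake" and "ocean" share the columns "large" and "water", so their count vectors overlap, while "pizza" shares nothing with either: co-occurrence counts already encode some similarity.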
29. ### Approaches

• Count-based methods: compute how often a word co-occurs with its neighbour words, then map the counts to a small, dense vector
• Reduce dimensions using Singular Value Decomposition (SVD) or Latent Dirichlet Allocation (LDA)
30. ### Approaches

• Predictive methods: try to predict a word from its neighbours in terms of small, dense embedding vectors

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238-247).
31. ### Predictive Methods

• Word2Vec: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• GloVe: Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
32. ### Predictive Methods

• FastText: https://research.fb.com/downloads/fasttext/
• ELMo: Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
34. ### Word2Vec: Architecture

• Continuous Bag-of-Words (CBOW)
• Skip-gram
35. ### Skip-gram

pizza → [0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132]
36. ### Skip-gram

"I ate the leftover pizza for dinner"
37. ### Skip-gram

"I ate the leftover pizza for dinner". Window size = 5
38. ### Skip-gram

"I ate the leftover pizza for dinner". Window size = 5. Neighbour words: the, leftover, for, dinner
39. ### Overview

"I ate the leftover pizza for dinner". Input: pizza → Projection → Output: the, leftover, for, dinner
40. ### Skip-gram
42. ### Architecture

Input: pizza as a one-hot vector [0, 0, …, 1, 0, 0, …] → Projection layer (V × D), where V = vocabulary size and D = number of dimensions
46. ### Projection Layer

The projection layer is a V × D matrix with one row per vocabulary word (Aardvark … Pizza … Zyzzogeton); each row, e.g. [0.89, 2.31, …, -0.52] for Pizza, is that word's vector.
48. ### Architecture

Input: pizza as a one-hot vector [0, 0, …, 1, 0, 0, …] → Projection (V × D) → Output (softmax): the, leftover, for, dinner
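A toy forward pass through this architecture, sketched under the assumptions above: a one-hot input selecting a row of the V × D projection matrix, then a softmax over all vocabulary words. Sizes and weights here are made up; real training would also update both matrices by gradient descent.

```python
import random
from math import exp

random.seed(0)
V, D = 7, 4  # toy vocabulary size and embedding dimension

# Projection (embedding) matrix and output weights, randomly initialized.
W_in = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
W_out = [[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def forward(word_index):
    # Multiplying a one-hot vector by W_in just selects one row:
    # that row is the input word's embedding.
    hidden = W_in[word_index]
    # The output layer scores every vocabulary word; softmax turns
    # the scores into a probability of each word being a neighbour.
    scores = [sum(hidden[d] * W_out[d][v] for d in range(D)) for v in range(V)]
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward(3)  # probabilities over all V words, summing to 1
```

After training, the output layer is thrown away and the rows of W_in are kept as the word vectors.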

51. ### Word Vector

Pizza: [0.89, 2.31, …, -0.52]. This is our word vector!
52. ### Word Vector

Pizza: [0.89254, 2.3112, -0.70036, 0.76679, -1.0815, 0.40426, -1.3462, 0.71, 0.90067, -1.043, -0.57966, 0.18669, 1.0996, -0.90042, -0.045962, 0.31492, 1.4128, 0.84963, -1.3389, -0.32252, -0.10208, -0.31783, 0.33173, 0.096593, 0.36732, -1.1466, 0.3123, 1.549, -0.13059, -0.62003, 1.774, -0.62134, 0.065215, -0.39758, 0.095832, -0.56289, -0.39552, -0.16224, 1.0035, 0.39161, -0.54489, 0.21744, 0.10831, -0.06952, -1.046, -0.36096, -0.48233, -0.90467, -0.044913, -0.52132]. This is our word vector!
53. ### The Intuition

"I ate the leftover pizza for dinner": pizza [0.89, 2.31, …, -0.52], neighbour words the, leftover, for, dinner. "I need some leftover chicken recipes for dinner": chicken [0.76, 2.01, …, -0.47], neighbour words some, leftover, recipes, for. Words that appear in similar contexts end up with similar vectors.
54. ### The Intuition

"The German embassy in Prague is located in…": Prague [0.32, 0.43, …, -0.21], neighbour words embassy, in, is, located. A different context, so a dissimilar vector.
55. ### Architecture

More details: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

57. ### Pre-trained Models

• Gensim has an API to download pre-trained word embedding models. The list of available models can be found here.
58. ### Loading the Model

from gensim.models import KeyedVectors

model_w2v = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
59. ### Word Vector

model_w2v["lake"]

array([-8.39843750e-02, 2.02148438e-01, 2.65625000e-01, 1.04980469e-01, -7.95898438e-02, 1.05957031e-01, -5.39550781e-02, 8.11767578e-03, 9.32617188e-02, -7.66601562e-02, 1.56250000e-01, -1.19628906e-01, … -4.15039062e-02, 4.08935547e-03, -2.47070312e-01, -1.78710938e-01, 3.33984375e-01, -1.79687500e-01], dtype=float32)
60. ### Similar Words

model_w2v.most_similar("apple")

[('apples', 0.7203598022460938), ('pear', 0.6450696587562561), ('fruit', 0.6410146355628967), ('berry', 0.6302294731140137), ('pears', 0.6133961081504822), ('strawberry', 0.6058261394500732), ('peach', 0.6025873422622681), ('potato', 0.596093475818634), ('grape', 0.5935864448547363), ('blueberry', 0.5866668224334717)]

• Similar words are nearby vectors in a vector space
• The distance is calculated using cosine similarity
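The cosine similarity scores above can be computed by hand. A minimal sketch of the formula (gensim's implementation is vectorized and optimized, but the maths is the same):

```python
from math import sqrt

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0, parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, orthogonal vectors
```

Because it measures the angle rather than the length of the vectors, cosine similarity stays between -1 and 1 regardless of vector magnitude.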
61. ### Get Similarity

model_w2v.similarity("apple", "mango")

0.57518554
62. ### Odd One Out

model_w2v.doesnt_match(["lake", "forest", "ocean", "river"])

'forest'
63. ### Analogies

• man is to uncle as woman is to ...

model_w2v.most_similar(positive=["uncle", "woman"], negative=["man"])
64. ### Analogies

• man is to uncle as woman is to ...
• uncle + woman - man: words that are similar to uncle and woman but dissimilar to man

model_w2v.most_similar(positive=["uncle", "woman"], negative=["man"])

[('aunt', 0.8022665977478027), ('mother', 0.7770732045173645), ('niece', 0.768424928188324), ('father', 0.7237852811813354), ('grandmother', 0.722037136554718), ('daughter', 0.7185647487640381), ('sister', 0.7006258368492126), ('husband', 0.6982548236846924), ('granddaughter', 0.6858304738998413), ('nephew', 0.6710714101791382)]
65. ### Analogies

• Germany is to Berlin as France is to...

model_w2v.most_similar(positive=["Berlin", "France"], negative=["Germany"])

[('Paris', 0.7672388553619385), ('French', 0.6049168109893799), ('Parisian', 0.5810437202453613), ('Colombes', 0.5599985718727112), ('Hopital_Europeen_Georges_Pompidou', 0.555890679359436), ('Melun', 0.551270067691803), ('Dinard', 0.5451847314834595), ('Brussels', 0.5420989990234375), ('Mairie_de', 0.5337448120117188), ('Cagnes_sur_Mer', 0.531246542930603)]
67. ### Analogies

• running is to run as walking is to…

model_w2v.most_similar(positive=["run", "walking"], negative=["running"])
68. ### Analogies

• running is to run as walking is to…

model_w2v.most_similar(positive=["run", "walking"], negative=["running"])

[('walk', 0.7163699865341187), ('walks', 0.5965700745582581), ('walked', 0.5833066701889038), ('stroll', 0.5236037969589233), ('pinch_hitter_Yunel_Escobar', 0.4562637209892273), ('Walking', 0.455409437417984), ('Batterymate_Miguel_Olivo', 0.4483090043067932), ('runs', 0.4462803602218628), ('pinch_hitter_Carlos_Guillen', 0.4402925372123718), ('Justin_Speier_relieved', 0.43528205156326294)]
69. ### Analogies

https://www.tensorflow.org/tutorials/representation/word2vec
70. ### http://bionlp-www.utu.fi/wv_demo/
71. ### https://indonesian-word-embedding.herokuapp.com

http://github.com/galuhsahid/indonesian-word-embedding
72. ### Training Your Own

• Sure you can!
• When to do so? Specific problem domains
• Challenge: training data
• Another alternative: continue training a pre-existing word embedding
73. ### Application: Search
74. ### Application: Neural Machine Translation

Qi, Y., Sachan, D. S., Felix, M., Padmanabhan, S. J., & Neubig, G. (2018). When and why are pre-trained word embeddings useful for neural machine translation?. arXiv preprint arXiv:1804.06323.
75. ### Application: Recommendation Engine

https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484
76. ### Challenge: Out-of-vocabulary Words

• Word2vec doesn't handle out-of-vocabulary words
• FastText handles them because it trains on character n-grams instead of whole words, breaking each word down into n-grams
• ELMo also handles them because it trains the model at the character level

With min_n = max_n = 3:
amazing → <am, ama, maz, azi, zin, ing, ng> (vectors V1 … V7)
amazin → <am, ama, maz, azi, zin, in> (V1 … V5 shared with "amazing", plus a new n-gram vector)
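A rough sketch of how FastText-style character n-grams can be extracted. FastText's real implementation also hashes the n-grams into buckets and includes the whole word as its own token; this shows just the slicing idea:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, roughly as FastText
    builds them (< and > mark the start and end of the word)."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("amazing"))
# ['<am', 'ama', 'maz', 'azi', 'zin', 'ing', 'ng>']
```

An out-of-vocabulary word like "amazin" shares most of its n-grams with "amazing", so its vector, built by summing the n-gram vectors, lands nearby.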
77. ### Challenge: Polysemy

The word "rock"
78. ### Challenge: Polysemy

The word "rock": https://unsplash.com/photos/I4zSNSxR8oA https://unsplash.com/photos/xssEs_oCv-A
80. ### Challenge: Polysemy

The word "rock": https://unsplash.com/photos/I4zSNSxR8oA https://unsplash.com/photos/xssEs_oCv-A https://www.muscleandfitness.com/workouts/athletecelebrity-workouts/dwayne-rock-johnsons-shoulder-workout
81. ### Challenge: Polysemy

He caught a fish at the bank of the river. The bank at the end of the street was robbed yesterday.
82. ### Challenge: Polysemy

He caught a fish at the bank of the river. The bank at the end of the street was robbed yesterday. Same word, different meaning.
83. ### Challenge: Polysemy

Same word, different meaning: more recent models such as ELMo and BERT will assign different vectors to the word "bank" because it appears in different contexts.

86. ### Challenge: Bias

"Our results indicate that text corpora contain recoverable and accurate imprints of our historic biases, whether morally neutral as towards insects or flowers, problematic as towards race or gender, or even simply veridical, reflecting the status quo distribution of gender with respect to careers or first names." "Certainly, caution must be used in incorporating modules constructed via unsupervised machine learning into decision-making systems."

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
87. ### Challenge: Bias

De-biasing word embeddings: Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in neural information processing systems (pp. 4349-4357).
88. ### Challenge: Bias

"We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling."

Gonen, H., & Goldberg, Y. (2019). Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. arXiv preprint arXiv:1903.03862.