Word Embeddings for Natural Language Processing in Python @ London Python meetup

https://www.meetup.com/LondonPython/events/240263693/

Word embeddings are a family of Natural Language Processing (NLP) algorithms where words are mapped to vectors in low-dimensional space.

The interest around word embeddings has been on the rise in the past few years, because these techniques have been driving important improvements in many NLP applications like text classification, sentiment analysis or machine translation.

This talk is an introduction to word embeddings, with a particular focus on word2vec and doc2vec.

Marco Bonzanini

September 28, 2017

Transcript

  1. Word Embeddings for NLP in Python
     Marco Bonzanini
     London Python Meet-up, September 2017
  2. Nice to meet you

  3. WORD EMBEDDINGS?

  4. Word Embeddings = Word Vectors = Distributed Representations

  5. Why should you care?

  6. Why should you care? Data representation is crucial

  7. Applications

  8. Applications Classification

  9. Applications Classification Recommender Systems

  10. Applications Classification Recommender Systems Search Engines

  11. Applications Classification Recommender Systems Search Engines Machine Translation

  12. One-hot Encoding

  13. One-hot Encoding
      Rome   = [1, 0, 0, 0, 0, 0, …, 0]
      Paris  = [0, 1, 0, 0, 0, 0, …, 0]
      Italy  = [0, 0, 1, 0, 0, 0, …, 0]
      France = [0, 0, 0, 1, 0, 0, …, 0]
  14. One-hot Encoding (same vectors, shown as a word-by-dimension matrix: each word, e.g. Rome or Paris, lights up a single position out of V)
  15. One-hot Encoding (same vectors): V = vocabulary size (huge)
  16. Bag-of-words

  17. Bag-of-words
      doc_1 = [32, 14, 1, 0, …, 6]
      doc_2 = [ 2, 12, 0, 28, …, 12]
      …
      doc_N = [13, 0, 6, 2, …, 0]
  18. Bag-of-words (same vectors, shown as a document-by-word matrix: one column per vocabulary word, e.g. Rome, Paris, … up to V)
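
(A bag-of-words sketch, again not from the slides and with a made-up vocabulary: each document becomes a vector of word counts, one dimension per vocabulary entry.)

    from collections import Counter

    vocabulary = ['pizza', 'pasta', 'beer', 'wine']  # toy vocabulary

    def bag_of_words(document, vocabulary):
        counts = Counter(document.lower().split())
        return [counts[word] for word in vocabulary]

    print(bag_of_words('pizza and beer then more pizza', vocabulary))  # [2, 0, 1, 0]
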
  19. Word Embeddings

  20. Word Embeddings
      Rome   = [0.91, 0.83, 0.17, …, 0.41]
      Paris  = [0.92, 0.82, 0.17, …, 0.98]
      Italy  = [0.32, 0.77, 0.67, …, 0.42]
      France = [0.33, 0.78, 0.66, …, 0.97]
  21. Word Embeddings (same vectors): n. dimensions << vocabulary size
  22. Word Embeddings (same vectors as above)
  23. Word Embeddings (same vectors as above)
  24. Word Embeddings (same vectors as above)
  25. Word Embeddings Rome Paris Italy France

  26. Word Embeddings is-capital-of

  27. Word Embeddings Paris

  28. Word Embeddings Paris + Italy

  29. Word Embeddings Paris + Italy - France

  30. Word Embeddings Paris + Italy - France ≈ Rome
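
(A sketch of the "Paris + Italy - France" arithmetic, not from the slides. It assumes `vectors` is a dict mapping words to numpy arrays taken from some trained embedding model; for good embeddings the top-ranked candidate is expected to be Rome.)

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def analogy(vectors, positive, negative):
        # add the positive vectors, subtract the negative ones,
        # then rank the remaining words by cosine similarity to the result
        target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
        scores = {w: cosine(v, target) for w, v in vectors.items()
                  if w not in positive and w not in negative}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    # analogy(vectors, positive=['Paris', 'Italy'], negative=['France'])[:3]
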

  31. FROM LANGUAGE TO VECTORS?

  32. Distributional Hypothesis

  33. –J.R. Firth, 1957: “You shall know a word by the company it keeps.”
  34. –Z. Harris, 1954: “Words that occur in similar context tend to have similar meaning.”
  35. Context ≈ Meaning

  36. I enjoyed eating some pizza at the restaurant

  37. I enjoyed eating some pizza at the restaurant (the Word: pizza)
  38. I enjoyed eating some pizza at the restaurant (the Word: pizza; the company it keeps: the surrounding words)
  39. I enjoyed eating some pizza at the restaurant / I enjoyed eating some pineapple at the restaurant
  40. I enjoyed eating some pizza at the restaurant / I enjoyed eating some pineapple at the restaurant
  41. I enjoyed eating some pizza at the restaurant / I enjoyed eating some pineapple at the restaurant (same context)
  42. I enjoyed eating some pizza at the restaurant / I enjoyed eating some pineapple at the restaurant (same context, so Pizza = Pineapple?)
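
(A small sketch of "the company it keeps", not from the slides: collect the context words around each focus word with a symmetric window. The window size of 2 is an arbitrary choice.)

    def contexts(tokens, window=2):
        for i, focus in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield focus, left + right

    sentence = 'i enjoyed eating some pizza at the restaurant'.split()
    for focus, context in contexts(sentence):
        if focus == 'pizza':
            print(focus, context)  # pizza ['eating', 'some', 'at', 'the']
    # swapping 'pizza' for 'pineapple' leaves the context unchanged,
    # which is why the two words end up with similar vectors
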
  43. A BIT OF THEORY word2vec

  44. None
  45. None
  46. word2vec Architecture: Mikolov et al. (2013), “Efficient Estimation of Word Representations in Vector Space”
  47. Vector Calculation

  48. Vector Calculation Goal: learn vec(word)

  49. Vector Calculation Goal: learn vec(word) 1. Choose objective function

  50. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors
  51. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors 3. Run stochastic gradient descent
  52. Intermezzo (Gradient Descent)

  53. Intermezzo (Gradient Descent) x F(x)

  54. Intermezzo (Gradient Descent) x F(x) Objective Function (to minimise)

  55. Intermezzo (Gradient Descent) x F(x) Find the optimal “x”

  56. Intermezzo (Gradient Descent) x F(x) Random Init

  57. Intermezzo (Gradient Descent) x F(x) Derivative

  58. Intermezzo (Gradient Descent) x F(x) Update

  59. Intermezzo (Gradient Descent) x F(x) Derivative

  60. Intermezzo (Gradient Descent) x F(x) Update

  61. Intermezzo (Gradient Descent) x F(x) and again

  62. Intermezzo (Gradient Descent) x F(x) Until convergence

  63. Intermezzo (Gradient Descent) • Optimisation algorithm

  64. Intermezzo (Gradient Descent) • Optimisation algorithm • Purpose: find the min (or max) for F
  65. Intermezzo (Gradient Descent) • Optimisation algorithm • Purpose: find the min (or max) for F • Batch-oriented (use all data points)
  66. Intermezzo (Gradient Descent) • Optimisation algorithm • Purpose: find the min (or max) for F • Batch-oriented (use all data points) • Stochastic GD: update after each sample
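
(A toy gradient descent sketch, not word2vec itself and not from the slides: minimise F(x) = (x - 3)^2 by starting from a random x and repeatedly stepping against the derivative.)

    import random

    def F(x):
        return (x - 3) ** 2

    def dF(x):  # derivative of F
        return 2 * (x - 3)

    x = random.uniform(-10, 10)    # random init
    learning_rate = 0.1
    for _ in range(100):           # "until convergence" (a fixed number of steps here)
        x = x - learning_rate * dF(x)

    print(round(x, 3))  # close to 3, the minimum of F
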
  67. Objective Function

  68. I enjoyed eating some pizza at the restaurant Objective Function

  69. I enjoyed eating some pizza at the restaurant Objective Function

  70. I enjoyed eating some pizza at the restaurant Objective Function

  71. Objective Function: I enjoyed eating some pizza at the restaurant. Maximise the likelihood of the context given the focus word.
  72. Objective Function: I enjoyed eating some pizza at the restaurant. Maximise the likelihood of the context given the focus word: P(i | pizza), P(enjoyed | pizza), …, P(restaurant | pizza)
  73. Example I enjoyed eating some pizza at the restaurant

  74. Example: I enjoyed eating some pizza at the restaurant. Iterate over context words.
  75. Example: I enjoyed eating some pizza at the restaurant. Bump P( i | pizza )
  76. Example: I enjoyed eating some pizza at the restaurant. Bump P( enjoyed | pizza )
  77. Example: I enjoyed eating some pizza at the restaurant. Bump P( eating | pizza )
  78. Example: I enjoyed eating some pizza at the restaurant. Bump P( some | pizza )
  79. Example: I enjoyed eating some pizza at the restaurant. Bump P( at | pizza )
  80. Example: I enjoyed eating some pizza at the restaurant. Bump P( the | pizza )
  81. Example: I enjoyed eating some pizza at the restaurant. Bump P( restaurant | pizza )
  82. Example: I enjoyed eating some pizza at the restaurant. Move to next focus word and repeat.
  83. Example: I enjoyed eating some pizza at the restaurant. Bump P( i | at )
  84. Example: I enjoyed eating some pizza at the restaurant. Bump P( enjoyed | at )
  85. Example: I enjoyed eating some pizza at the restaurant. … you get the picture.
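
(A sketch of the (focus, context) pairs behind these "bump P(context | focus)" steps, not from the slides. The window size is a parameter; the slides' example effectively uses a window spanning the whole sentence, window=2 is used here just for brevity.)

    def training_pairs(tokens, window=2):
        for i, focus in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield focus, tokens[j]

    sentence = 'i enjoyed eating some pizza at the restaurant'.split()
    for focus, context in training_pairs(sentence):
        if focus == 'pizza':
            print('bump P( %s | %s )' % (context, focus))
    # bump P( eating | pizza ), bump P( some | pizza ),
    # bump P( at | pizza ), bump P( the | pizza )
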
  86. P( eating | pizza )

  87. P( eating | pizza ) ??

  88. P( eating | pizza ) Input word Output word

  89. P( eating | pizza ) Input word Output word P( vec(eating) | vec(pizza) )
  90. P( vout | vin ) P( vec(eating) | vec(pizza) ) P( eating | pizza ) Input word Output word
  91. P( vout | vin ) P( vec(eating) | vec(pizza) ) P( eating | pizza ) Input word Output word ???
  92. P( vout | vin )

  93. cosine( vout, vin )

  94. cosine( vout, vin ) [-1, 1]

  95. softmax(cosine( vout, vin ))

  96. softmax(cosine( vout, vin )) [0, 1]

  97. softmax(cosine( vout, vin ))
      P(vout | vin) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))
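
(The same formula in numpy, as a sketch rather than production code: cosine scores live in [-1, 1], and the softmax turns them into a probability distribution over the vocabulary.)

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def p_out_given_in(v_out, v_in, all_vectors):
        # denominator: sum of exp(cosine) over every vector in the vocabulary
        denominator = sum(np.exp(cosine(v_k, v_in)) for v_k in all_vectors)
        return np.exp(cosine(v_out, v_in)) / denominator
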
  98. Vector Calculation Recap

  99. Vector Calculation Recap Learn vec(word)

  100. Vector Calculation Recap Learn vec(word) by gradient descent

  101. Vector Calculation Recap Learn vec(word) by gradient descent on the softmax probability
  102. Plot Twist

  103. None
  104. None
  105. Paragraph Vector a.k.a. doc2vec i.e. P(vout | vin, label)

  106. A BIT OF PRACTICE

  107. None
  108. pip install gensim

  109. Case Study 1: Skills and CVs

  110. Case Study 1: Skills and CVs. Data set of ~300k resumes. Each experience is a “sentence”. Each experience has 3-15 skills. Approx 15k unique skills.
  111. Case Study 1: Skills and CVs

       from gensim.models import Word2Vec

       fname = 'candidates.jsonl'
       corpus = ResumesCorpus(fname)
       model = Word2Vec(corpus)
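
(The ResumesCorpus helper is not shown in the talk; the sketch below is one possible shape for it. Word2Vec only needs an iterable of token lists, so this assumes each JSON line holds a resume whose experiences carry a "skills" list; the field names are assumptions.)

    import json

    class ResumesCorpus(object):
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            # re-opened on every pass, since Word2Vec iterates the corpus more than once
            with open(self.fname) as f:
                for line in f:
                    resume = json.loads(line)
                    for experience in resume.get('experiences', []):
                        yield experience.get('skills', [])  # one "sentence" per experience
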
  112. Case Study 1: Skills and CVs

       model.most_similar('chef')
       [('cook', 0.94),
        ('bartender', 0.91),
        ('waitress', 0.89),
        ('restaurant', 0.76),
        ...]
  113. Case Study 1: Skills and CVs

       model.most_similar('chef', negative=['food'])
       [('puppet', 0.93),
        ('devops', 0.92),
        ('ansible', 0.79),
        ('salt', 0.77),
        ...]
  114. Case Study 1: Skills and CVs. Useful for: data exploration, query expansion/suggestion, recommendations
  115. Case Study 2: Beer!

  116. Case Study 2: Beer! Data set of ~2.9M beer reviews, 89 different beer styles, 635k unique tokens, 185M total tokens
  117. Case Study 2: Beer!

       from gensim.models import Doc2Vec

       fname = 'ratebeer_data.csv'
       corpus = RateBeerCorpus(fname)
       model = Doc2Vec(corpus)
  118. Case Study 2: Beer!

       from gensim.models import Doc2Vec

       fname = 'ratebeer_data.csv'
       corpus = RateBeerCorpus(fname)
       model = Doc2Vec(corpus)

       3.5h on my laptop … remember to pickle
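
(The RateBeerCorpus helper and the "remember to pickle" step are not shown in the talk; below is one possible sketch. Doc2Vec expects TaggedDocument objects, so each review is tagged with its beer style here; the CSV column names are assumptions, and persisting with model.save() plays the role of pickling.)

    import csv
    from gensim.models.doc2vec import TaggedDocument

    class RateBeerCorpus(object):
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for row in csv.DictReader(f):
                    tokens = row['review'].lower().split()   # assumed column names
                    yield TaggedDocument(words=tokens, tags=[row['style']])

    # after the (long) training run, persist the model to disk and reload it later:
    # model.save('ratebeer_doc2vec.model')
    # model = Doc2Vec.load('ratebeer_doc2vec.model')
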
  119. Case Study 2: Beer!

       model.docvecs.most_similar('Stout')
       [('Sweet Stout', 0.9877),
        ('Porter', 0.9620),
        ('Foreign Stout', 0.9595),
        ('Dry Stout', 0.9561),
        ('Imperial/Strong Porter', 0.9028),
        ...]
  120. Case Study 2: Beer!

       model.most_similar([model.docvecs['Stout']])
       [('coffee', 0.6342),
        ('espresso', 0.5931),
        ('charcoal', 0.5904),
        ('char', 0.5631),
        ('bean', 0.5624),
        ...]
  121. Case Study 2: Beer!

       model.most_similar([model.docvecs['Wheat Ale']])
       [('lemon', 0.6103), ('lemony', 0.5909), ('wheaty', 0.5873),
        ('germ', 0.5684), ('lemongrass', 0.5653), ('wheat', 0.5649),
        ('lime', 0.55636), ('verbena', 0.5491), ('coriander', 0.5341),
        ('zesty', 0.5182)]
  122. PCA: scikit-learn — Data Viz: Bokeh
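
(A rough sketch of this step, not the talk's actual code: project a few doc2vec style vectors to 2D with scikit-learn's PCA and scatter-plot them with Bokeh. The style names are illustrative, and the real plots highlight groups of styles (dark, strong, sour, …), which this sketch skips.)

    from sklearn.decomposition import PCA
    from bokeh.plotting import figure, output_file, show

    styles = ['Stout', 'Porter', 'Wheat Ale']               # a few styles, for illustration
    vectors = [model.docvecs[style] for style in styles]    # doc vectors from the Doc2Vec model above
    coords = PCA(n_components=2).fit_transform(vectors)

    output_file('beer_styles.html')
    p = figure(title='Beer styles: doc2vec + PCA')
    p.scatter(coords[:, 0], coords[:, 1])
    show(p)
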

  123. Dark beers

  124. Strong beers

  125. Sour beers

  126. Lagers

  127. Wheat beers

  128. Case Study 2: Beer! Useful for: understanding the language of beer enthusiasts, planning your next pint, classification
  129. Case Study 3: Evil AI

  130. Case Study 3: Evil AI

       from gensim.models.keyedvectors import KeyedVectors

       fname = 'GoogleNews-vectors.bin'
       model = KeyedVectors.load_word2vec_format(fname, binary=True)
  131. Case Study 3: Evil AI

       model.most_similar(positive=['king', 'woman'], negative=['man'])
  132. Case Study 3: Evil AI

       model.most_similar(positive=['king', 'woman'], negative=['man'])
       [('queen', 0.7118), ('monarch', 0.6189), ('princess', 0.5902),
        ('crown_prince', 0.5499), ('prince', 0.5377), …]
  133. Case Study 3: Evil AI

       model.most_similar(positive=['Paris', 'Italy'], negative=['France'])
  134. Case Study 3: Evil AI

       model.most_similar(positive=['Paris', 'Italy'], negative=['France'])
       [('Milan', 0.7222), ('Rome', 0.7028), ('Palermo_Sicily', 0.5967),
        ('Italian', 0.5911), ('Tuscany', 0.5632), …]
  135. Case Study 3: Evil AI

       model.most_similar(positive=['professor', 'woman'], negative=['man'])
  136. Case Study 3: Evil AI

       model.most_similar(positive=['professor', 'woman'], negative=['man'])
       [('associate_professor', 0.7771), ('assistant_professor', 0.7558),
        ('professor_emeritus', 0.7066), ('lecturer', 0.6982),
        ('sociology_professor', 0.6539), …]
  137. Case Study 3: Evil AI

       model.most_similar(positive=['computer_programmer', 'woman'], negative=['man'])
  138. Case Study 3: Evil AI

       model.most_similar(positive=['computer_programmer', 'woman'], negative=['man'])
       [('homemaker', 0.5627), ('housewife', 0.5105), ('graphic_designer', 0.5051),
        ('schoolteacher', 0.4979), ('businesswoman', 0.4934), …]
  139. Case Study 3: Evil AI • Culture is biased

  140. Case Study 3: Evil AI • Culture is biased • Language is biased
  141. Case Study 3: Evil AI • Culture is biased • Language is biased • Algorithms are not?
  142. Case Study 3: Evil AI • Culture is biased • Language is biased • Algorithms are not? • “Garbage in, garbage out”
  143. Case Study 3: Evil AI

  144. FINAL REMARKS

  145. But we’ve been doing this for X years
  146. But we’ve been doing this for X years • Approaches based on co-occurrences are not new • Think SVD / LSA / LDA • … but they are usually outperformed by word2vec • … and don’t scale as well as word2vec
  147. Efficiency

  148. Efficiency • There is no co-occurrence matrix (vectors are learned directly) • Softmax has complexity O(V), Hierarchical Softmax only O(log(V))
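
(In gensim these choices surface as Word2Vec parameters; the values below are illustrative, with `corpus` as in the earlier slides. hs=1 switches to hierarchical softmax, while negative > 0 uses negative sampling instead of the full softmax.)

    from gensim.models import Word2Vec

    model_hs = Word2Vec(corpus, hs=1, negative=0)   # hierarchical softmax
    model_ns = Word2Vec(corpus, hs=0, negative=5)   # negative sampling
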
  149. Garbage in, garbage out

  150. Garbage in, garbage out • Pre-trained vectors are useful • … until they’re not • The business domain is important • The pre-processing steps are important • > 100K words? Maybe train your own model • > 1M words? Yep, train your own model
  151. Summary

  152. Summary • Word Embeddings are magic! • Big victory of unsupervised learning • Gensim makes your life easy
  153. Credits & Readings

  154. Credits & Readings Credits • Lev Konstantinovskiy (@gensim_py) • Chris

    E. Moody (@chrisemoody) see videos on lda2vec Readings • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/ • “word2vec parameter learning explained” by Xin Rong More readings • “GloVe: global vectors for word representation” by Pennington et al. • “Dependency based word embeddings” and “Neural word embeddings as implicit matrix factorization” by O. Levy and Y. Goldberg
  155. Credits & Readings Even More Readings • “Man is to

    Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” by Bolukbasi et al. • “Quantifying and Reducing Stereotypes in Word Embeddings” by Bolukbasi et al. • “Equality of Opportunity in Machine Learning” - Google Research Blog
 https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html Pics Credits • Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg • Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
  156. THANK YOU @MarcoBonzanini GitHub.com/bonzanini marcobonzanini.com