
Word Embeddings for Natural Language Processing in Python

Slides for my talk on word embeddings at PyCon Italy 2017 (PyCon Otto):
https://www.pycon.it/conference/talks/word-embeddings-for-natural-language-processing-in-python
Abstract:
Word embeddings are a family of Natural Language Processing (NLP) algorithms where words are mapped to vectors in low-dimensional space. The interest around word embeddings has been on the rise in the past few years, because these techniques have been driving important improvements in many NLP applications like text classification, sentiment analysis or machine translation.

In this talk we’ll describe the intuitions behind this family of algorithms, we’ll explore some of the Python tools that allow us to implement modern NLP applications and we’ll conclude with some practical considerations.


Marco Bonzanini

April 08, 2017

Transcript

  1. Word Embeddings for NLP in Python, Marco Bonzanini, PyCon Italia 2017

  2. Nice to meet you

  3. WORD EMBEDDINGS?

  4. Word Embeddings = Word Vectors = Distributed Representations

  5. Why should you care?

  6. Why should you care? Data representation is crucial

  7. Applications

  8. Applications • Classification / tagging • Recommendation Systems • Search Engines (Query Expansion) • Machine Translation

  9. One-hot Encoding

  10. One-hot Encoding: Rome = [1, 0, 0, 0, 0, 0, …, 0], Paris = [0, 1, 0, 0, 0, 0, …, 0], Italy = [0, 0, 1, 0, 0, 0, …, 0], France = [0, 0, 0, 1, 0, 0, …, 0]

  11. One-hot Encoding: Rome = [1, 0, 0, 0, 0, 0, …, 0], Paris = [0, 1, 0, 0, 0, 0, …, 0], Italy = [0, 0, 1, 0, 0, 0, …, 0], France = [0, 0, 0, 1, 0, 0, …, 0] (each word is a vector with V components)

  12. One-hot Encoding: Rome = [1, 0, 0, 0, 0, 0, …, 0], Paris = [0, 1, 0, 0, 0, 0, …, 0], Italy = [0, 0, 1, 0, 0, 0, …, 0], France = [0, 0, 0, 1, 0, 0, …, 0] V = vocabulary size (huge)

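A minimal sketch of one-hot encoding with a toy vocabulary (my own illustration, not from the slides); it only shows why each vector is as long as the vocabulary and almost entirely zero:

    import numpy as np

    vocabulary = ['Rome', 'Paris', 'Italy', 'France']  # toy vocabulary; a real V is huge
    word_index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        # a vector of length V with a single 1 at the word's index
        vec = np.zeros(len(vocabulary))
        vec[word_index[word]] = 1.0
        return vec

    print(one_hot('Rome'))   # [1. 0. 0. 0.]
    print(one_hot('Italy'))  # [0. 0. 1. 0.]
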
  13. Bag-of-words

  14. Bag-of-words: doc_1 = [32, 14, 1, 0, …, 6], doc_2 = [2, 12, 0, 28, …, 12], …, doc_N = [13, 0, 6, 2, …, 0]

  15. Bag-of-words: doc_1 = [32, 14, 1, 0, …, 6], doc_2 = [2, 12, 0, 28, …, 12], …, doc_N = [13, 0, 6, 2, …, 0] (columns are words, e.g. Rome, Paris; each document vector has V components)

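A minimal bag-of-words sketch using gensim's Dictionary on two toy documents of my own (not the talk's data); each document becomes a sparse vector of word counts over the vocabulary:

    from gensim.corpora import Dictionary

    docs = [
        ['rome', 'is', 'the', 'capital', 'of', 'italy'],
        ['paris', 'is', 'the', 'capital', 'of', 'france'],
    ]

    dictionary = Dictionary(docs)                     # maps each word to an integer id
    bow = [dictionary.doc2bow(doc) for doc in docs]   # sparse (word_id, count) pairs
    print(bow[0])
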
  16. Word Embeddings

  17. Word Embeddings: Rome = [0.91, 0.83, 0.17, …, 0.41], Paris = [0.92, 0.82, 0.17, …, 0.98], Italy = [0.32, 0.77, 0.67, …, 0.42], France = [0.33, 0.78, 0.66, …, 0.97]

  18. Word Embeddings: Rome = [0.91, 0.83, 0.17, …, 0.41], Paris = [0.92, 0.82, 0.17, …, 0.98], Italy = [0.32, 0.77, 0.67, …, 0.42], France = [0.33, 0.78, 0.66, …, 0.97] n. dimensions << vocabulary size

  19. Word Embeddings: Rome = [0.91, 0.83, 0.17, …, 0.41], Paris = [0.92, 0.82, 0.17, …, 0.98], Italy = [0.32, 0.77, 0.67, …, 0.42], France = [0.33, 0.78, 0.66, …, 0.97]

  20. Word Embeddings: Rome = [0.91, 0.83, 0.17, …, 0.41], Paris = [0.92, 0.82, 0.17, …, 0.98], Italy = [0.32, 0.77, 0.67, …, 0.42], France = [0.33, 0.78, 0.66, …, 0.97]

  21. Word Embeddings: Rome = [0.91, 0.83, 0.17, …, 0.41], Paris = [0.92, 0.82, 0.17, …, 0.98], Italy = [0.32, 0.77, 0.67, …, 0.42], France = [0.33, 0.78, 0.66, …, 0.97]

  22. Word Embeddings Rome Paris Italy France

  23. Word Embeddings: Paris + Italy - France ≈ Rome
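
A hedged sketch of how this analogy is usually queried in gensim, assuming a trained Word2Vec model named model (as in the case studies later) and lowercased tokens; the output shown is hypothetical:

    # vector arithmetic: vec(paris) + vec(italy) - vec(france) ≈ vec(rome)
    result = model.most_similar(positive=['paris', 'italy'], negative=['france'], topn=3)
    print(result)  # hypothetical output: [('rome', 0.87), ('milan', 0.71), ('turin', 0.68)]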

  24. THE MAIN INTUITION

  25. Distributional Hypothesis

  26. –J.R. Firth, 1957: “You shall know a word by the company it keeps.”

  27. –Z. Harris, 1954: “Words that occur in similar context tend to have similar meaning.”

  28. Context ≈ Meaning

  29. I enjoyed eating some pizza at the restaurant

  30. I enjoyed eating some pizza at the restaurant (the word: “pizza”)

  31. I enjoyed eating some pizza at the restaurant (the word: “pizza”; the company it keeps: the surrounding words)

  32. I enjoyed eating some pizza at the restaurant / I enjoyed eating some fiorentina at the restaurant

  33. I enjoyed eating some pizza at the restaurant / I enjoyed eating some fiorentina at the restaurant

  34. I enjoyed eating some pizza at the restaurant / I enjoyed eating some fiorentina at the restaurant (Same context)

  35. I enjoyed eating some pizza at the restaurant / I enjoyed eating some fiorentina at the restaurant (Same context. Pizza = Fiorentina ?)

  36. A BIT OF THEORY word2vec

  37. None
  38. None
  39. Vector Calculation

  40. Vector Calculation Goal: learn vec(word)

  41. Vector Calculation Goal: learn vec(word) 1. Choose objective function

  42. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors

  43. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors 3. Run gradient descent

  44. I enjoyed eating some pizza at the restaurant

  45. I enjoyed eating some pizza at the restaurant

  46. I enjoyed eating some pizza at the restaurant

  47. I enjoyed eating some pizza at the restaurant. Maximise the likelihood of the context given the focus word

  48. I enjoyed eating some pizza at the restaurant. Maximise the likelihood of the context given the focus word: P(i | pizza), P(enjoyed | pizza), …, P(restaurant | pizza)

  49. Example: I enjoyed eating some pizza at the restaurant

  50. Example: I enjoyed eating some pizza at the restaurant. Iterate over context words

  51. Example: I enjoyed eating some pizza at the restaurant. bump P( i | pizza )

  52. Example: I enjoyed eating some pizza at the restaurant. bump P( enjoyed | pizza )

  53. Example: I enjoyed eating some pizza at the restaurant. bump P( eating | pizza )

  54. Example: I enjoyed eating some pizza at the restaurant. bump P( some | pizza )

  55. Example: I enjoyed eating some pizza at the restaurant. bump P( at | pizza )

  56. Example: I enjoyed eating some pizza at the restaurant. bump P( the | pizza )

  57. Example: I enjoyed eating some pizza at the restaurant. bump P( restaurant | pizza )

  58. Example: I enjoyed eating some pizza at the restaurant. Move to next focus word and repeat

  59. Example: I enjoyed eating some pizza at the restaurant. bump P( i | at )

  60. Example: I enjoyed eating some pizza at the restaurant. bump P( enjoyed | at )

  61. Example: I enjoyed eating some pizza at the restaurant. … you get the picture

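A minimal sketch of the (focus word, context word) iteration the example above walks through, assuming a fixed context window size (my own illustration, not the talk's code):

    sentence = "i enjoyed eating some pizza at the restaurant".split()
    window = 2  # assumed window size; a larger window covers the whole sentence as in the slides

    for i, focus in enumerate(sentence):
        start, end = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                # training would "bump" P(context | focus) here
                print("P( %s | %s )" % (sentence[j], focus))
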
  62. P( eating | pizza )

  63. P( eating | pizza ) ??

  64. P( eating | pizza ) Input word Output word

  65. P( eating | pizza ) Input word Output word P( vec(eating) | vec(pizza) )

  66. P( vout | vin ) P( vec(eating) | vec(pizza) ) P( eating | pizza ) Input word Output word

  67. P( vout | vin ) P( vec(eating) | vec(pizza) ) P( eating | pizza ) Input word Output word ???

  68. P( vout | vin )

  69. cosine( vout, vin )

  70. cosine( vout, vin ) [-1, 1]

  71. softmax(cosine( vout, vin ))

  72. softmax(cosine( vout, vin )) [0, 1]

  73. softmax(cosine( vout, vin )): P( vout | vin ) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))

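A minimal numpy sketch of this softmax over cosine similarities on toy random vectors (my own illustration of the formula, not word2vec's actual implementation, which relies on further efficiency tricks):

    import numpy as np

    rng = np.random.default_rng(0)
    V, dim = 1000, 100                      # toy vocabulary size and embedding size
    vectors = rng.normal(size=(V, dim))

    def cosine(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def p_out_given_in(out_idx, in_idx):
        # softmax of cosine(v_out, v_in) over the whole vocabulary: O(V) per step
        sims = np.array([cosine(vectors[k], vectors[in_idx]) for k in range(V)])
        probs = np.exp(sims) / np.exp(sims).sum()
        return probs[out_idx]

    print(p_out_given_in(42, 7))
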
  74. Vector Calculation Recap

  75. Vector Calculation Recap Learn vec(word)

  76. Vector Calculation Recap Learn vec(word) by gradient descent

  77. Vector Calculation Recap Learn vec(word) by gradient descent on the softmax probability

  78. Plot Twist

  79. None
  80. None
  81. Paragraph Vector a.k.a. doc2vec i.e. P(vout | vin, label)

  82. A BIT OF PRACTICE

  83. None
  84. pip install gensim

  85. Case Study 1: Skills and CVs

  86. Case Study 1: Skills and CVs • Data set of ~300k resumes • Each experience is a “sentence” • Each experience has 3-15 skills • Approx 15k unique skills

  87. Case Study 1: Skills and CVs from gensim.models import Word2Vec fname = 'candidates.jsonl' corpus = ResumesCorpus(fname) model = Word2Vec(corpus)

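ResumesCorpus is the speaker's own class and its internals aren't shown; a minimal sketch of what such a streaming corpus might look like, assuming one JSON object per line with a 'skills' list (the field name is my guess):

    import json
    from gensim.models import Word2Vec

    class ResumesCorpus(object):
        """Stream each experience as a list of skill tokens (one JSON object per line)."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:
                    experience = json.loads(line)
                    yield experience['skills']  # assumed field name

    # corpus = ResumesCorpus('candidates.jsonl')
    # model = Word2Vec(corpus)
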
  88. Case Study 1: Skills and CVs model.most_similar('chef') [('cook', 0.94), ('bartender', 0.91), ('waitress', 0.89), ('restaurant', 0.76), ...]

  89. Case Study 1: Skills and CVs model.most_similar('chef', negative=['food']) [('puppet', 0.93), ('devops', 0.92), ('ansible', 0.79), ('salt', 0.77), ...]

  90. Case Study 1: Skills and CVs Useful for: Data exploration • Query expansion/suggestion • Recommendations

  91. Case Study 2: Beer!

  92. Case Study 2: Beer! Data set of ~2.9M beer reviews • 89 different beer styles • 635k unique tokens • 185M total tokens

  93. Case Study 2: Beer! from gensim.models import Doc2Vec fname = 'ratebeer_data.csv' corpus = RateBeerCorpus(fname) model = Doc2Vec(corpus)

  94. Case Study 2: Beer! from gensim.models import Doc2Vec fname = 'ratebeer_data.csv' corpus = RateBeerCorpus(fname) model = Doc2Vec(corpus) 3.5h on my laptop … remember to pickle

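RateBeerCorpus is again the speaker's own class; a minimal sketch under assumed column names, where each review becomes a TaggedDocument tagged with its beer style, together with the “remember to pickle” step via gensim's save/load:

    import csv
    from gensim.models import Doc2Vec
    from gensim.models.doc2vec import TaggedDocument

    class RateBeerCorpus(object):
        """Stream each review as a TaggedDocument tagged with its beer style."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for row in csv.DictReader(f):
                    tokens = row['review_text'].lower().split()   # assumed column name
                    yield TaggedDocument(words=tokens, tags=[row['beer_style']])  # assumed column name

    # model = Doc2Vec(RateBeerCorpus('ratebeer_data.csv'))
    # model.save('beer_doc2vec.model')            # persist the slow-to-train model
    # model = Doc2Vec.load('beer_doc2vec.model')
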
  95. Case Study 2: Beer! model.docvecs.most_similar('Stout') [('Sweet Stout', 0.9877), ('Porter', 0.9620), ('Foreign Stout', 0.9595), ('Dry Stout', 0.9561), ('Imperial/Strong Porter', 0.9028), ...]

  96. Case Study 2: Beer! model.most_similar([model.docvecs['Stout']]) [('coffee', 0.6342), ('espresso', 0.5931), ('charcoal', 0.5904), ('char', 0.5631), ('bean', 0.5624), ...]

  97. Case Study 2: Beer! model.most_similar([model.docvecs['Wheat Ale']]) [('lemon', 0.6103), ('lemony', 0.5909), ('wheaty', 0.5873), ('germ', 0.5684), ('lemongrass', 0.5653), ('wheat', 0.5649), ('lime', 0.55636), ('verbena', 0.5491), ('coriander', 0.5341), ('zesty', 0.5182)]

  98. PCA

  99. Dark beers

  100. Strong beers

  101. Sour beers

  102. Lagers

  103. Wheat beers
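
Slides 98-103 show 2D PCA plots of the beer-style vectors (dark beers, strong beers, sour beers, lagers, wheat beers). A minimal sketch of how such a projection can be computed, assuming a trained model as above and that scikit-learn is available (the style list is only illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    styles = ['Stout', 'Porter', 'Wheat Ale']          # illustrative subset of the 89 styles
    X = np.array([model.docvecs[s] for s in styles])   # one doc2vec vector per beer style

    coords = PCA(n_components=2).fit_transform(X)      # project to 2D for plotting
    for style, (x, y) in zip(styles, coords):
        print(style, x, y)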

  104. Case Study 2: Beer! Useful for: Understanding the language of beer enthusiasts • Planning your next pint • Classification

  105. FINAL REMARKS

  106. But we’ve been doing this for X years

  107. But we’ve been doing this for X years • Approaches based on co-occurrences are not new • Think SVD / LSA / LDA • … but they are usually outperformed by word2vec • … and don’t scale as well as word2vec

  108. Efficiency

  109. Efficiency • There is no co-occurrence matrix (vectors are learned directly) • Softmax has complexity O(V), Hierarchical Softmax only O(log(V))

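In gensim this trade-off is exposed as constructor flags; a hedged sketch on a toy corpus (the parameter values are illustrative, not the talk's settings):

    from gensim.models import Word2Vec

    sentences = [
        ['i', 'enjoyed', 'eating', 'some', 'pizza', 'at', 'the', 'restaurant'],
        ['i', 'enjoyed', 'eating', 'some', 'fiorentina', 'at', 'the', 'restaurant'],
    ]

    # hs=1 enables hierarchical softmax, negative=0 turns off negative sampling
    # (gensim's usual default is negative sampling, another way to avoid the full O(V) softmax)
    model = Word2Vec(sentences, hs=1, negative=0, min_count=1)
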
  110. Garbage in, garbage out

  111. Garbage in, garbage out • Pre-trained vectors are useful • … until they’re not • The business domain is important • The pre-processing steps are important • > 100K words? Maybe train your own model • > 1M words? Yep, train your own model

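For the pre-trained route, gensim can load vectors in the word2vec format directly; a minimal sketch (the file path is just an example pointing at the widely distributed Google News vectors, not something referenced in the talk):

    from gensim.models import KeyedVectors

    # load pre-trained vectors stored in the binary word2vec format
    vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    print(vectors.most_similar('pizza', topn=3))
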
  112. Summary

  113. Summary • Word Embeddings are magic! • Big victory of unsupervised learning • Gensim makes your life easy

  114. Credits & Readings

  115. Credits & Readings Credits • Lev Konstantinovskiy (@gensim_py) • Chris

    E. Moody (@chrisemoody) see videos on lda2vec Readings • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/ • “word2vec parameter learning explained” by Xin Rong More readings • “GloVe: global vectors for word representation” by Pennington et al. • “Dependency based word embeddings” and “Neural word embeddings as implicit matrix factorization” by O. Levy and Y. Goldberg
  116. THANK YOU @MarcoBonzanini GitHub.com/bonzanini marcobonzanini.com