
Introduction to word embeddings for understanding natural language

Presentation on Natural Language Processing, streamed live with the "Tarallucci, Vino e Machine Learning" meetup.

Event description:
https://www.meetup.com/Tarallucci-Vino-Machine-Learning/events/251298019/


Marco Bonzanini

June 07, 2018

Transcript

  1. Understanding Natural Language with Word Vectors (and Python) @MarcoBonzanini — Tarallucci, Vino e Machine Learning, June 2018
  2. Nice to meet you

  3. WORD EMBEDDINGS?

  4. Word Embeddings = Word Vectors = Distributed Representations

  5. Why should you care?

  6. Why should you care? Data representation is crucial

  7. Applications

  8. Applications Classification

  9. Applications Classification Recommender Systems

  10. Applications Classification Recommender Systems Search Engines

  11. Applications Classification Recommender Systems Search Engines Machine Translation

  12. One-hot Encoding

  13. One-hot Encoding: Rome = [1, 0, 0, 0, 0, 0, …, 0], Paris = [0, 1, 0, 0, 0, 0, …, 0], Italy = [0, 0, 1, 0, 0, 0, …, 0], France = [0, 0, 0, 1, 0, 0, …, 0]

  14. One-hot Encoding: same vectors, with each dimension corresponding to one vocabulary word (Rome, Paris, …, word V)

  15. One-hot Encoding: same vectors; V = vocabulary size (huge)
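
The slides show one-hot vectors only as pictures. As a quick illustration (not from the deck), a minimal sketch of how such vectors could be built for a toy vocabulary:

    # Minimal one-hot encoding sketch over a toy vocabulary (the real V is huge).
    vocabulary = ["Rome", "Paris", "Italy", "France"]
    word_index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        """Return a list of length V with a single 1 at the word's index."""
        vector = [0] * len(vocabulary)
        vector[word_index[word]] = 1
        return vector

    print(one_hot("Rome"))   # [1, 0, 0, 0]
    print(one_hot("Paris"))  # [0, 1, 0, 0]
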
  16. Bag-of-words

  17. Bag-of-words: doc_1 = [32, 14, 1, 0, …, 6], doc_2 = [2, 12, 0, 28, …, 12], …, doc_N = [13, 0, 6, 2, …, 0]

  18. Bag-of-words: same vectors, with each dimension counting the occurrences of one vocabulary word (Rome, Paris, …, word V)
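
In the same spirit, a minimal bag-of-words sketch (illustrative, not from the deck): each document becomes a vector of word counts over a fixed vocabulary.

    # Bag-of-words: count how often each vocabulary word appears in a document.
    from collections import Counter

    vocabulary = ["i", "enjoyed", "eating", "some", "pizza", "broccoli", "at", "the", "restaurant"]

    def bag_of_words(doc):
        counts = Counter(doc.lower().split())
        return [counts[word] for word in vocabulary]

    print(bag_of_words("I enjoyed eating some pizza at the restaurant"))
    # [1, 1, 1, 1, 1, 0, 1, 1, 1]
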
  19. Word Embeddings

  20. Word Embeddings: Rome = [0.91, 0.83, 0.17, …, 0.41], Paris = [0.92, 0.82, 0.17, …, 0.98], Italy = [0.32, 0.77, 0.67, …, 0.42], France = [0.33, 0.78, 0.66, …, 0.97]

  21. Word Embeddings: same vectors; n. dimensions << vocabulary size

  22. Word Embeddings: same vectors as above

  23. Word Embeddings: same vectors as above

  24. Word Embeddings: same vectors as above
  25. Word Embeddings Rome Paris Italy France

  26. Word Embeddings is-capital-of

  27. Word Embeddings Paris

  28. Word Embeddings Paris + Italy

  29. Word Embeddings Paris + Italy - France

  30. Word Embeddings Paris + Italy - France ≈ Rome Rome

  31. FROM LANGUAGE TO VECTORS?

  32. Distributional Hypothesis

  33. –J.R. Firth, 1957: “You shall know a word by the company it keeps.”

  34. –Z. Harris, 1954: “Words that occur in similar context tend to have similar meaning.”
  35. Context ≈ Meaning

  36. I enjoyed eating some pizza at the restaurant

  37. I enjoyed eating some pizza at the restaurant (Word)

  38. I enjoyed eating some pizza at the restaurant (the Word, and the company it keeps)

  39. I enjoyed eating some pizza at the restaurant / I enjoyed eating some broccoli at the restaurant

  40. I enjoyed eating some pizza at the restaurant / I enjoyed eating some broccoli at the restaurant

  41. I enjoyed eating some pizza at the restaurant / I enjoyed eating some broccoli at the restaurant (Same Context)

  42. I enjoyed eating some pizza at the restaurant / I enjoyed eating some broccoli at the restaurant = ?
  43. A BIT OF THEORY word2vec

  44. None
  45. None
  46. word2vec Architecture: Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space
  47. Vector Calculation

  48. Vector Calculation Goal: learn vec(word)

  49. Vector Calculation Goal: learn vec(word) 1. Choose objective function

  50. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors

  51. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors 3. Run stochastic gradient descent

  52. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors 3. Run stochastic gradient descent

  53. Vector Calculation Goal: learn vec(word) 1. Choose objective function 2. Init: random vectors 3. Run stochastic gradient descent
  54. Intermezzo (Gradient Descent)

  55. Intermezzo (Gradient Descent) x F(x)

  56. Intermezzo (Gradient Descent) x F(x) Objective Function (to minimise)

  57. Intermezzo (Gradient Descent) x F(x) Find the optimal “x”

  58. Intermezzo (Gradient Descent) x F(x) Random Init

  59. Intermezzo (Gradient Descent) x F(x) Derivative

  60. Intermezzo (Gradient Descent) x F(x) Update

  61. Intermezzo (Gradient Descent) x F(x) Derivative

  62. Intermezzo (Gradient Descent) x F(x) Update

  63. Intermezzo (Gradient Descent) x F(x) and again

  64. Intermezzo (Gradient Descent) x F(x) Until convergence

  65. Intermezzo (Gradient Descent) • Optimisation algorithm

  66. Intermezzo (Gradient Descent) • Optimisation algorithm • Purpose: find the min (or max) for F

  67. Intermezzo (Gradient Descent) • Optimisation algorithm • Purpose: find the min (or max) for F • Batch-oriented (use all data points)

  68. Intermezzo (Gradient Descent) • Optimisation algorithm • Purpose: find the min (or max) for F • Batch-oriented (use all data points) • Stochastic GD: update after each sample
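
To make the intermezzo concrete, here is a toy gradient descent loop (illustrative sketch, not from the deck) that minimises F(x) = (x - 3)^2 from a random init, using the derivative to update x until it settles near the optimum:

    import random

    def F(x):
        return (x - 3) ** 2   # objective function (to minimise)

    def dF(x):
        return 2 * (x - 3)    # derivative of F

    x = random.uniform(-10, 10)        # random init
    learning_rate = 0.1
    for step in range(100):            # "until convergence" (fixed number of steps here)
        x = x - learning_rate * dF(x)  # update: step against the gradient

    print(round(x, 4))  # close to the optimum x = 3
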
  69. Objective Function

  70. I enjoyed eating some pizza at the restaurant Objective Function

  71. I enjoyed eating some pizza at the restaurant Objective Function

  72. I enjoyed eating some pizza at the restaurant Objective Function

  73. I enjoyed eating some pizza at the restaurant Objective Function

  74. I enjoyed eating some pizza at the restaurant. Objective Function: maximise the likelihood of a word given its context

  75. I enjoyed eating some pizza at the restaurant. Objective Function: maximise the likelihood of a word given its context, e.g. P(pizza | eating)

  76. I enjoyed eating some pizza at the restaurant. Objective Function

  77. I enjoyed eating some pizza at the restaurant. Objective Function: maximise the likelihood of the context given its focus word

  78. I enjoyed eating some pizza at the restaurant. Objective Function: maximise the likelihood of the context given its focus word, e.g. P(eating | pizza)
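
For reference (this formula comes from the word2vec papers, not from the slides): the skip-gram variant maximises the likelihood of the context given the focus word, i.e. the average log-probability over a corpus of T words with a context window of size c,

    \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t)

while CBOW flips it around and maximises P(w_t | context), the likelihood of a word given its context.
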
  79. Example I enjoyed eating some pizza at the restaurant

  80. Example: I enjoyed eating some pizza at the restaurant. Iterate over context words

  81. Example: I enjoyed eating some pizza at the restaurant. bump P( i | pizza )

  82. Example: I enjoyed eating some pizza at the restaurant. bump P( enjoyed | pizza )

  83. Example: I enjoyed eating some pizza at the restaurant. bump P( eating | pizza )

  84. Example: I enjoyed eating some pizza at the restaurant. bump P( some | pizza )

  85. Example: I enjoyed eating some pizza at the restaurant. bump P( at | pizza )

  86. Example: I enjoyed eating some pizza at the restaurant. bump P( the | pizza )

  87. Example: I enjoyed eating some pizza at the restaurant. bump P( restaurant | pizza )

  88. Example: I enjoyed eating some pizza at the restaurant. Move to next focus word and repeat

  89. Example: I enjoyed eating some pizza at the restaurant. bump P( i | at )

  90. Example: I enjoyed eating some pizza at the restaurant. bump P( enjoyed | at )

  91. Example: I enjoyed eating some pizza at the restaurant. … you get the picture
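
The example above slides a window over the sentence, treating each word in turn as the focus and "bumping" the probability of every context word around it. A minimal sketch of how those (focus, context) training pairs could be generated (illustrative; the window size is an assumption, not from the deck):

    # Generate (focus, context) pairs with a sliding window.
    sentence = "i enjoyed eating some pizza at the restaurant".split()
    window = 4  # context words considered on each side (assumed value)

    pairs = []
    for i, focus in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((focus, sentence[j]))

    print([p for p in pairs if p[0] == "pizza"])
    # ('pizza', 'i'), ('pizza', 'enjoyed'), ('pizza', 'eating'), ('pizza', 'some'),
    # ('pizza', 'at'), ('pizza', 'the'), ('pizza', 'restaurant')
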
  92. P( eating | pizza )

  93. P( eating | pizza ) ??

  94. P( eating | pizza ) Input word Output word

  95. P( eating | pizza ) Input word, Output word: P( vec(eating) | vec(pizza) )

  96. P( vout | vin ) P( vec(eating) | vec(pizza) ) P( eating | pizza ) Input word, Output word

  97. P( vout | vin ) P( vec(eating) | vec(pizza) ) P( eating | pizza ) Input word, Output word ???
  98. P( vout | vin )

  99. cosine( vout, vin )

  100. cosine( vout, vin ) [-1, 1]

  101. softmax(cosine( vout, vin ))

  102. softmax(cosine( vout, vin )) [0, 1]

  103. softmax(cosine( vout, vin )): P(v_{out} \mid v_{in}) = \frac{\exp(\mathrm{cosine}(v_{out}, v_{in}))}{\sum_{k \in V} \exp(\mathrm{cosine}(v_{k}, v_{in}))}
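
A small numeric sketch of the previous slide (illustrative only; real word2vec implementations score against the whole vocabulary with dot products and use tricks such as negative sampling or hierarchical softmax instead of a full softmax):

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def p_out_given_in(v_out, v_in, all_vectors):
        # softmax over cosine similarities against the whole (toy) vocabulary
        scores = np.array([cosine(v_k, v_in) for v_k in all_vectors])
        return np.exp(cosine(v_out, v_in)) / np.exp(scores).sum()

    rng = np.random.default_rng(0)
    vocab_vectors = rng.normal(size=(5, 10))  # 5 toy words, 10 dimensions
    print(p_out_given_in(vocab_vectors[0], vocab_vectors[1], vocab_vectors))
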
  104. Vector Calculation Recap

  105. Vector Calculation Recap Learn vec(word)

  106. Vector Calculation Recap Learn vec(word) by gradient descent

  107. Vector Calculation Recap: Learn vec(word) by gradient descent on the softmax probability
  108. Plot Twist

  109. None
  110. None
  111. Paragraph Vector a.k.a. doc2vec i.e. P(vout | vin, label)

  112. A BIT OF PRACTICE

  113. None
  114. pip install gensim

  115. Case Study 1: Skills and CVs

  116. Case Study 1: Skills and CVs

    from gensim.models import Word2Vec

    fname = 'candidates.jsonl'
    corpus = ResumesCorpus(fname)
    model = Word2Vec(corpus)

  117. Case Study 1: Skills and CVs (same code as the previous slide)
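
ResumesCorpus is the speaker's own class and is not shown in the deck. A minimal sketch of what such a corpus could look like, assuming a JSON-lines file with one candidate per line and a hypothetical 'text' field (Word2Vec needs a restartable iterable of token lists, hence a class rather than a one-shot generator):

    import json

    class ResumesCorpus:
        """Hypothetical corpus: yields one tokenised resume per line of a .jsonl file."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:
                    doc = json.loads(line)
                    yield doc["text"].lower().split()  # naive whitespace tokenisation
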
  118. Case Study 1: Skills and CVs

    model.most_similar('chef')
    [('cook', 0.94), ('bartender', 0.91), ('waitress', 0.89), ('restaurant', 0.76), ...]

  119. Case Study 1: Skills and CVs

    model.most_similar('chef', negative=['food'])
    [('puppet', 0.93), ('devops', 0.92), ('ansible', 0.79), ('salt', 0.77), ...]
  120. Case Study 1: Skills and CVs. Useful for: data exploration, query expansion/suggestion, recommendations
  121. Case Study 2: Beer!

  122. Case Study 2: Beer! Data set of ~2.9M beer reviews, 89 different beer styles, 635k unique tokens, 185M total tokens. https://snap.stanford.edu/data/web-RateBeer.html
  123. Case Study 2: Beer!

    from gensim.models import Doc2Vec

    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)

  124. Case Study 2: Beer! (same code as the previous slide) 3.5h on my laptop … remember to pickle
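
RateBeerCorpus is not shown either. A sketch of a Doc2Vec corpus that tags every review with its beer style, so that per-style document vectors such as docvecs['Stout'] exist after training (the 'review' and 'style' column names are assumptions about the preprocessed CSV):

    import csv
    from gensim.models.doc2vec import TaggedDocument

    class RateBeerCorpus:
        """Hypothetical corpus: yields one TaggedDocument per review, tagged by beer style."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for row in csv.DictReader(f):
                    words = row["review"].lower().split()
                    yield TaggedDocument(words=words, tags=[row["style"]])
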
  125. Case Study 2: Beer!

    model.docvecs.most_similar('Stout')
    [('Sweet Stout', 0.9877), ('Porter', 0.9620), ('Foreign Stout', 0.9595), ('Dry Stout', 0.9561), ('Imperial/Strong Porter', 0.9028), ...]

  126. Case Study 2: Beer!

    model.most_similar([model.docvecs['Stout']])
    [('coffee', 0.6342), ('espresso', 0.5931), ('charcoal', 0.5904), ('char', 0.5631), ('bean', 0.5624), ...]

  127. Case Study 2: Beer!

    model.most_similar([model.docvecs['Wheat Ale']])
    [('lemon', 0.6103), ('lemony', 0.5909), ('wheaty', 0.5873), ('germ', 0.5684), ('lemongrass', 0.5653), ('wheat', 0.5649), ('lime', 0.55636), ('verbena', 0.5491), ('coriander', 0.5341), ('zesty', 0.5182)]
  128. PCA: scikit-learn — Data Viz: Bokeh
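
A sketch of the projection step behind the following plots (Bokeh plotting omitted), assuming the Doc2Vec model from the case study above; the docvecs attribute matches the gensim 3.x API used in the deck (renamed dv in gensim 4):

    import numpy as np
    from sklearn.decomposition import PCA

    styles = list(model.docvecs.doctags.keys())       # one tag per beer style
    X = np.array([model.docvecs[s] for s in styles])  # per-style document vectors
    coords = PCA(n_components=2).fit_transform(X)     # 2D coordinates for plotting
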

  129. Dark beers

  130. Strong beers

  131. Sour beers

  132. Lagers

  133. Wheat beers

  134. Case Study 2: Beer! Useful for: understanding the language of beer enthusiasts, planning your next pint, classification
  135. Pre-trained Vectors

  136. Pre-trained Vectors

    from gensim.models.keyedvectors import KeyedVectors

    fname = 'GoogleNews-vectors.bin'
    model = KeyedVectors.load_word2vec_format(fname, binary=True)
  137. Pre-trained Vectors

    model.most_similar(positive=['king', 'woman'], negative=['man'])

  138. Pre-trained Vectors

    model.most_similar(positive=['king', 'woman'], negative=['man'])
    [('queen', 0.7118), ('monarch', 0.6189), ('princess', 0.5902), ('crown_prince', 0.5499), ('prince', 0.5377), …]

  139. Pre-trained Vectors

    model.most_similar(positive=['Paris', 'Italy'], negative=['France'])

  140. Pre-trained Vectors

    model.most_similar(positive=['Paris', 'Italy'], negative=['France'])
    [('Milan', 0.7222), ('Rome', 0.7028), ('Palermo_Sicily', 0.5967), ('Italian', 0.5911), ('Tuscany', 0.5632), …]

  141. Pre-trained Vectors

    model.most_similar(positive=['professor', 'woman'], negative=['man'])

  142. Pre-trained Vectors

    model.most_similar(positive=['professor', 'woman'], negative=['man'])
    [('associate_professor', 0.7771), ('assistant_professor', 0.7558), ('professor_emeritus', 0.7066), ('lecturer', 0.6982), ('sociology_professor', 0.6539), …]

  143. Pre-trained Vectors

    model.most_similar(positive=['professor', 'man'], negative=['woman'])

  144. Pre-trained Vectors

    model.most_similar(positive=['professor', 'man'], negative=['woman'])
    [('professor_emeritus', 0.7433), ('emeritus_professor', 0.7109), ('associate_professor', 0.6817), ('Professor', 0.6495), ('assistant_professor', 0.6484), …]

  145. Pre-trained Vectors

    model.most_similar(positive=['computer_programmer', 'woman'], negative=['man'])

  146. Pre-trained Vectors

    model.most_similar(positive=['computer_programmer', 'woman'], negative=['man'])
    [('homemaker', 0.5627), ('housewife', 0.5105), ('graphic_designer', 0.5051), ('schoolteacher', 0.4979), ('businesswoman', 0.4934), …]
  147. Pre-trained Vectors Culture is biased

  148. Pre-trained Vectors Culture is biased Language is biased

  149. Pre-trained Vectors: Culture is biased. Language is biased. Algorithms are not?

  150. Pre-trained Vectors: Culture is biased. Language is biased. Algorithms are not? “Garbage in, garbage out”
  151. Pre-trained Vectors

  152. NOT ONLY WORD2VEC

  153. GloVe (2014)

  154. GloVe (2014) • Global co-occurrence matrix

  155. GloVe (2014) • Global co-occurrence matrix • Much bigger memory footprint

  156. GloVe (2014) • Global co-occurrence matrix • Much bigger memory footprint • Downstream tasks: similar performances
  157. doc2vec (2014)

  158. doc2vec (2014) • From words to documents

  159. doc2vec (2014) • From words to documents • (or sentences, paragraphs, classes, …)

  160. doc2vec (2014) • From words to documents • (or sentences, paragraphs, classes, …) • P(context | word, label)
  161. fastText (2016-17)

  162. fastText (2016-17) • word2vec + morphology (sub-words)

  163. fastText (2016-17) • word2vec + morphology (sub-words) • Pre-trained vectors on ~300 languages

  164. fastText (2016-17) • word2vec + morphology (sub-words) • Pre-trained vectors on ~300 languages • morphologically rich languages
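
Gensim also ships a FastText implementation, so a sub-word model can be trained much like Word2Vec. A minimal sketch on a toy corpus (not from the deck); thanks to character n-grams, even a word never seen in training still gets a vector:

    from gensim.models import FastText

    sentences = [
        "i enjoyed eating some pizza at the restaurant".split(),
        "i enjoyed eating some broccoli at the restaurant".split(),
    ]
    model = FastText(sentences, min_count=1)

    print(model.wv["pizzas"][:5])                  # out-of-vocabulary word, built from sub-words
    print(model.wv.most_similar("pizza", topn=3))
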
  165. FINAL REMARKS

  166. But we’ve been doing this for X years

  167. But we’ve been doing this for X years • Approaches based on co-occurrences are not new

  168. But we’ve been doing this for X years • Approaches based on co-occurrences are not new • … but usually outperformed by word embeddings

  169. But we’ve been doing this for X years • Approaches based on co-occurrences are not new • … but usually outperformed by word embeddings • … and don’t scale as well as word embeddings
  170. Garbage in, garbage out

  171. Garbage in, garbage out • Pre-trained vectors are useful … until they’re not

  172. Garbage in, garbage out • Pre-trained vectors are useful … until they’re not • The business domain is important

  173. Garbage in, garbage out • Pre-trained vectors are useful … until they’re not • The business domain is important • > 100K words? Maybe train your own model

  174. Garbage in, garbage out • Pre-trained vectors are useful … until they’re not • The business domain is important • > 100K words? Maybe train your own model • > 1M words? Yep, train your own model
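
If you do train your own model, gensim's Word2Vec exposes the main knobs directly. A hedged sketch with typical starting values (parameter names follow gensim 4; older 3.x releases call the dimensionality size instead of vector_size):

    from gensim.models import Word2Vec

    model = Word2Vec(
        corpus,            # any restartable iterable of token lists (e.g. a corpus class as above)
        vector_size=100,   # embedding dimensionality
        window=5,          # context words on each side
        min_count=5,       # drop very rare words
        workers=4,         # parallel training threads
        sg=1,              # 1 = skip-gram, 0 = CBOW
    )
    model.save("my_domain_vectors.model")
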
  175. Summary

  176. Summary • Word Embeddings are magic! • Big victory of unsupervised learning • Gensim makes your life easy
  177. Credits & Readings

  178. Credits & Readings. Credits • Lev Konstantinovskiy (@teagermylk). Readings • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/ • “GloVe: Global Vectors for Word Representation” by Pennington et al. • “Distributed Representations of Sentences and Documents” (doc2vec) by Le and Mikolov • “Enriching Word Vectors with Subword Information” (fastText) by Bojanowski et al.

  179. Credits & Readings. Even More Readings • “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” by Bolukbasi et al. • “Quantifying and Reducing Stereotypes in Word Embeddings” by Bolukbasi et al. • “Equality of Opportunity in Machine Learning”, Google Research Blog https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html. Pics Credits • Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg • Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg • Broccoli: https://commons.wikimedia.org/wiki/File:Broccoli_and_cross_section_edit.jpg • Pizza: https://commons.wikimedia.org/wiki/File:Eq_it-na_pizza-margherita_sep2005_sml.jpg
  180. THANK YOU @MarcoBonzanini speakerdeck.com/marcobonzanini GitHub.com/bonzanini marcobonzanini.com