
Introduction to word embeddings for understanding natural language

Presentation on Natural Language Processing, live-streamed with the "Tarallucci, Vino e Machine Learning" meetup.

Event description:
https://www.meetup.com/Tarallucci-Vino-Machine-Learning/events/251298019/

Marco Bonzanini

June 07, 2018


Transcript

  1. Understanding

    Natural Language
    with Word Vectors
    (and Python)
    @MarcoBonzanini
    Tarallucci, Vino e Machine Learning — June 2018


  2. Nice to meet you


  3. WORD EMBEDDINGS?


  4. Word Embeddings = Word Vectors = Distributed Representations


  5. Why should you care?


  6. Why should you care?
    Data representation is crucial


  7. Applications


  8. Applications
    Classification


  9. Applications
    Classification
    Recommender Systems


  10. Applications
    Classification
    Recommender Systems
    Search Engines


  11. Applications
    Classification
    Recommender Systems
    Search Engines
    Machine Translation


  12. One-hot Encoding


  13. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]


  14. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]
    (each vector position corresponds to one word: Rome, Paris, …, word V)


  15. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]
    V = vocabulary size (huge)

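    To make the idea concrete, here is a minimal one-hot encoding sketch in Python; the toy vocabulary and helper names are illustrative, not from the talk:

    import numpy as np

    vocabulary = ['Rome', 'Paris', 'Italy', 'France']   # real vocabularies have 10^5 to 10^6 entries
    word_index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        # a vector of length V with a single 1 at the word's position
        vec = np.zeros(len(vocabulary))
        vec[word_index[word]] = 1.0
        return vec

    one_hot('Rome')    # array([1., 0., 0., 0.])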

  16. Bag-of-words


  17. Bag-of-words
    doc_1
    doc_2

    doc_N
    = [32, 14, 1, 0, …, 6]
    = [ 2, 12, 0, 28, …, 12]

    = [13, 0, 6, 2, …, 0]


  18. Bag-of-words
    doc_1
    doc_2

    doc_N
    = [32, 14, 1, 0, …, 6]
    = [ 2, 12, 0, 28, …, 12]

    = [13, 0, 6, 2, …, 0]
    (columns correspond to words: Rome, Paris, …, word V)

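    A bag-of-words matrix like the one above can be produced, for example, with scikit-learn's CountVectorizer; this sketch is not from the talk and assumes a recent scikit-learn:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        'Rome is the capital of Italy',
        'Paris is the capital of France',
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)            # one row per document, one column per vocabulary word
    print(vectorizer.get_feature_names_out())     # column labels (get_feature_names() on older versions)
    print(X.toarray())                            # raw term counts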

  19. Word Embeddings


  20. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]


  21. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]
    n. dimensions << vocabulary size


  22. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]


  23. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]


  24. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]


  25. Word Embeddings
    [plot: Rome, Paris, Italy, France shown as points in the embedding space]


  26. Word Embeddings
    is-capital-of


  27. Word Embeddings
    Paris


  28. Word Embeddings
    Paris + Italy


  29. Word Embeddings
    Paris + Italy - France


  30. Word Embeddings
    Paris + Italy - France ≈ Rome
    Rome

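    The analogy above is just vector arithmetic plus a nearest-neighbour search by cosine similarity; a toy sketch with made-up 2-D vectors (real embeddings have hundreds of dimensions):

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # toy vectors, invented for illustration only
    vectors = {
        'Paris':  np.array([0.9, 0.8]),
        'France': np.array([0.9, 0.1]),
        'Italy':  np.array([0.2, 0.1]),
        'Rome':   np.array([0.2, 0.8]),
    }

    target = vectors['Paris'] + vectors['Italy'] - vectors['France']
    # in practice the query words themselves are excluded from the ranking
    best = max(vectors, key=lambda w: cosine(vectors[w], target))
    print(best)    # 'Rome' with these toy numbers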

  31. FROM LANGUAGE
    TO VECTORS?


  32. Distributional Hypothesis


  33. –J.R. Firth 1957
    “You shall know a word
    by the company it keeps.”


  34. –Z. Harris 1954
    “Words that occur in similar context
    tend to have similar meaning.”


  35. Context ≈ Meaning


  36. I enjoyed eating some pizza at the restaurant


  37. I enjoyed eating some pizza at the restaurant
    Word


  38. I enjoyed eating some pizza at the restaurant
    The company it keeps
    Word


  39. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some broccoli at the restaurant


  40. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some broccoli at the restaurant


  41. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some broccoli at the restaurant
    Same Context


  42. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some broccoli at the restaurant
    =
    ?


  43. A BIT OF THEORY
    word2vec


  44. word2vec Architecture
    Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space


  45. Vector Calculation


  46. Vector Calculation
    Goal: learn vec(word)


  47. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function


  48. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors


  49. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors
    3. Run stochastic gradient descent


  50. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors
    3. Run stochastic gradient descent


  51. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors
    3. Run stochastic gradient descent


  52. Intermezzo (Gradient Descent)


  53. Intermezzo (Gradient Descent)
    [plot: objective function F(x) against x]

  54. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Objective Function (to minimise)]

  55. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Find the optimal “x”]

  56. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Random Init]

  57. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Derivative]

  58. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Update]

  59. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Derivative]

  60. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Update]

  61. Intermezzo (Gradient Descent)
    [plot of F(x) against x: and again]

  62. Intermezzo (Gradient Descent)
    [plot of F(x) against x: Until convergence]

  63. Intermezzo (Gradient Descent)
    • Optimisation algorithm


  64. Intermezzo (Gradient Descent)
    • Optimisation algorithm
    • Purpose: find the min (or max) for F


  65. Intermezzo (Gradient Descent)
    • Optimisation algorithm
    • Purpose: find the min (or max) for F
    • Batch-oriented (use all data points)


  66. Intermezzo (Gradient Descent)
    • Optimisation algorithm
    • Purpose: find the min (or max) for F
    • Batch-oriented (use all data points)
    • Stochastic GD: update after each sample

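    A minimal gradient descent sketch on a one-dimensional F(x), following the steps illustrated above (random init, derivative, update, repeat); the function and the learning rate are illustrative:

    import random

    def F(x):
        return (x - 3) ** 2              # objective to minimise, optimum at x = 3

    def dF(x):
        return 2 * (x - 3)               # derivative of F

    x = random.uniform(-10, 10)          # random init
    learning_rate = 0.1

    for _ in range(100):                 # in practice: until convergence
        x = x - learning_rate * dF(x)    # step against the gradient

    print(x)                             # close to 3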

  67. Objective Function


  68. I enjoyed eating some pizza at the restaurant
    Objective Function


  69. I enjoyed eating some pizza at the restaurant
    Objective Function


  70. I enjoyed eating some pizza at the restaurant
    Objective Function


  71. I enjoyed eating some pizza at the restaurant
    Objective Function


  72. I enjoyed eating some pizza at the restaurant
    Objective Function
    maximise the likelihood of a word given its context


  73. I enjoyed eating some pizza at the restaurant
    Objective Function
    maximise the likelihood of a word given its context
    e.g. P(pizza | eating)


  74. I enjoyed eating some pizza at the restaurant
    Objective Function


  75. I enjoyed eating some pizza at the restaurant
    Objective Function
    maximise the likelihood of the context given its focus word


  76. I enjoyed eating some pizza at the restaurant
    Objective Function
    maximise the likelihood of the context given its focus word
    e.g. P(eating | pizza)


  77. Example
    I enjoyed eating some pizza at the restaurant


  78. I enjoyed eating some pizza at the restaurant
    Iterate over context words
    Example


  79. I enjoyed eating some pizza at the restaurant
    bump P( i | pizza )
    Example


  80. I enjoyed eating some pizza at the restaurant
    bump P( enjoyed | pizza )
    Example


  81. I enjoyed eating some pizza at the restaurant
    bump P( eating | pizza )
    Example


  82. I enjoyed eating some pizza at the restaurant
    bump P( some | pizza )
    Example


  83. I enjoyed eating some pizza at the restaurant
    bump P( at | pizza )
    Example


  84. I enjoyed eating some pizza at the restaurant
    bump P( the | pizza )
    Example


  85. I enjoyed eating some pizza at the restaurant
    bump P( restaurant | pizza )
    Example


  86. I enjoyed eating some pizza at the restaurant
    Move to next focus word and repeat
    Example


  87. I enjoyed eating some pizza at the restaurant
    bump P( i | at )
    Example


  88. I enjoyed eating some pizza at the restaurant
    bump P( enjoyed | at )
    Example


  89. I enjoyed eating some pizza at the restaurant
    … you get the picture
    Example

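    The iteration sketched above generates one (focus word, context word) training pair at a time; a small Python sketch of that loop (here the window spans the whole sentence, as in the example, while word2vec normally uses a smaller window such as 5):

    sentence = "I enjoyed eating some pizza at the restaurant".lower().split()
    window = len(sentence)               # whole sentence as context, as in the example above

    for i, focus in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                context = sentence[j]
                # each pair is one training example: bump P( context | focus )
                print('bump P(', context, '|', focus, ')')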

  90. P( eating | pizza )


  91. P( eating | pizza ) ??


  92. P( eating | pizza )
    Input word
    Output word


  93. P( eating | pizza )
    Input word
    Output word
    P( vec(eating) | vec(pizza) )


  94. P( vout | vin )
    P( vec(eating) | vec(pizza) )
    P( eating | pizza )
    Input word
    Output word


  95. P( vout | vin )
    P( vec(eating) | vec(pizza) )
    P( eating | pizza )
    Input word
    Output word
    ???


  96. P( vout | vin )


  97. cosine( vout, vin )


  98. cosine( vout, vin ) [-1, 1]


  99. softmax(cosine( vout, vin ))


  100. softmax(cosine( vout, vin )) [0, 1]


  101. softmax(cosine( vout, vin ))
    P(vout | vin) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))

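    A numpy sketch of the probability above, following the slide's formulation (softmax over cosine similarities); the toy vectors are random, and real word2vec implementations use dot products of separate input/output matrices plus tricks such as negative sampling:

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def softmax(scores):
        exp = np.exp(scores - np.max(scores))    # shift for numerical stability
        return exp / exp.sum()

    rng = np.random.default_rng(0)
    vocab = ['eating', 'pizza', 'broccoli', 'restaurant']
    vec = {w: rng.normal(size=10) for w in vocab}    # toy vectors

    v_in = vec['pizza']
    scores = np.array([cosine(vec[w], v_in) for w in vocab])
    probs = softmax(scores)                          # P(vout | vin) over the whole vocabulary
    print(dict(zip(vocab, probs.round(3))))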

  102. Vector Calculation Recap


  103. Vector Calculation Recap
    Learn vec(word)


  104. Vector Calculation Recap
    Learn vec(word)
    by gradient descent


  105. Vector Calculation Recap
    Learn vec(word)
    by gradient descent
    on the softmax probability


  106. Paragraph Vector
    a.k.a.
    doc2vec
    i.e.
    P(vout | vin, label)


  107. A BIT OF PRACTICE


  108. pip install gensim


  109. Case Study 1: Skills and CVs


  110. Case Study 1: Skills and CVs
    from gensim.models import Word2Vec
    fname = 'candidates.jsonl'
    corpus = ResumesCorpus(fname)
    model = Word2Vec(corpus)


  111. Case Study 1: Skills and CVs
    from gensim.models import Word2Vec
    fname = 'candidates.jsonl'
    corpus = ResumesCorpus(fname)
    model = Word2Vec(corpus)

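    ResumesCorpus is not shown in the deck; gensim's Word2Vec only needs a restartable iterable of token lists, so a hypothetical version (assuming one JSON object per line with a 'text' field) could look like this:

    import json

    class ResumesCorpus:
        """Stream tokenised resumes from a .jsonl file, one JSON object per line."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:
                    doc = json.loads(line)
                    # assumes a 'text' field; naive whitespace tokenisation
                    yield doc['text'].lower().split()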

  112. Case Study 1: Skills and CVs
    model.most_similar('chef')
    [('cook', 0.94),
    ('bartender', 0.91),
    ('waitress', 0.89),
    ('restaurant', 0.76),
    ...]


  113. Case Study 1: Skills and CVs
    model.most_similar('chef',
                       negative=['food'])
    [('puppet', 0.93),
    ('devops', 0.92),
    ('ansible', 0.79),
    ('salt', 0.77),
    ...]


  114. Case Study 1: Skills and CVs
    Useful for:
    Data exploration
    Query expansion/suggestion
    Recommendations


  115. Case Study 2: Beer!


  116. Case Study 2: Beer!
    Data set of ~2.9M beer reviews
    89 different beer styles
    635k unique tokens
    185M total tokens
    https://snap.stanford.edu/data/web-RateBeer.html


  117. Case Study 2: Beer!
    from gensim.models import Doc2Vec
    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)


  118. Case Study 2: Beer!
    from gensim.models import Doc2Vec
    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)
    3.5h on my laptop
    … remember to pickle

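    RateBeerCorpus is not shown either; Doc2Vec expects TaggedDocument objects, so a hypothetical version that tags each review with its beer style (assuming 'review' and 'style' columns in the CSV) might look like:

    import csv
    from gensim.models.doc2vec import TaggedDocument

    class RateBeerCorpus:
        """Stream reviews from a CSV file, tagged with their beer style."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for row in csv.DictReader(f):
                    yield TaggedDocument(words=row['review'].lower().split(),
                                         tags=[row['style']])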

  119. Case Study 2: Beer!
    model.docvecs.most_similar('Stout')
    [('Sweet Stout', 0.9877),
    ('Porter', 0.9620),
    ('Foreign Stout', 0.9595),
    ('Dry Stout', 0.9561),
    ('Imperial/Strong Porter', 0.9028),
    ...]


  120. Case Study 2: Beer!
    model.most_similar([model.docvecs['Stout']])

    [('coffee', 0.6342),
    ('espresso', 0.5931),
    ('charcoal', 0.5904),
    ('char', 0.5631),
    ('bean', 0.5624),
    ...]


  121. Case Study 2: Beer!
    model.most_similar([model.docvecs['Wheat Ale']])

    [('lemon', 0.6103),
    ('lemony', 0.5909),
    ('wheaty', 0.5873),
    ('germ', 0.5684),
    ('lemongrass', 0.5653),
    ('wheat', 0.5649),
    ('lime', 0.55636),
    ('verbena', 0.5491),
    ('coriander', 0.5341),
    ('zesty', 0.5182)]


  122. PCA: scikit-learn — Data Viz: Bokeh

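    A sketch of the projection behind the plot: reduce the per-style document vectors to two dimensions with scikit-learn's PCA (the Bokeh plotting code is omitted and the style list is abbreviated):

    import numpy as np
    from sklearn.decomposition import PCA

    styles = ['Stout', 'Sweet Stout', 'Porter', 'Wheat Ale']    # ... all 89 styles in practice
    vectors = np.array([model.docvecs[s] for s in styles])

    coords = PCA(n_components=2).fit_transform(vectors)         # one (x, y) point per beer style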

  123. Strong beers


  124. Case Study 2: Beer!
    Useful for:
    Understanding the language of beer enthusiasts
    Planning your next pint
    Classification


  125. Pre-trained Vectors


  126. Pre-trained Vectors
    from gensim.models.keyedvectors import KeyedVectors

    fname = 'GoogleNews-vectors.bin'
    model = KeyedVectors.load_word2vec_format(
        fname,
        binary=True
    )


  127. model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
    )
    Pre-trained Vectors


  128. model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
    )
    [('queen', 0.7118),
    ('monarch', 0.6189),
    ('princess', 0.5902),
    ('crown_prince', 0.5499),
    ('prince', 0.5377),
    …]
    Pre-trained Vectors


  129. model.most_similar(
    positive=['Paris', 'Italy'],
    negative=['France']
    )
    Pre-trained Vectors


  130. model.most_similar(
    positive=['Paris', 'Italy'],
    negative=['France']
    )
    [('Milan', 0.7222),
    ('Rome', 0.7028),
    ('Palermo_Sicily', 0.5967),
    ('Italian', 0.5911),
    ('Tuscany', 0.5632),
    …]
    Pre-trained Vectors


  131. model.most_similar(
    positive=['professor', 'woman'],
    negative=['man']
    )
    Pre-trained Vectors


  132. model.most_similar(
    positive=['professor', 'woman'],
    negative=['man']
    )
    [('associate_professor', 0.7771),
    ('assistant_professor', 0.7558),
    ('professor_emeritus', 0.7066),
    ('lecturer', 0.6982),
    ('sociology_professor', 0.6539),
    …]
    Pre-trained Vectors


  133. model.most_similar(
    positive=['professor', 'man'],
    negative=['woman']
    )
    Pre-trained Vectors


  134. model.most_similar(
    positive=['professor', 'man'],
    negative=['woman']
    )
    [('professor_emeritus', 0.7433),
    ('emeritus_professor', 0.7109),
    ('associate_professor', 0.6817),
    ('Professor', 0.6495),
    ('assistant_professor', 0.6484),
    …]
    Pre-trained Vectors


  135. model.most_similar(
    positive=['computer_programmer', 'woman'],
    negative=['man']
    )
    Pre-trained Vectors


  136. model.most_similar(
    positive=['computer_programmer', 'woman'],
    negative=['man']
    )
    Pre-trained Vectors
    [('homemaker', 0.5627),
    ('housewife', 0.5105),
    ('graphic_designer', 0.5051),
    ('schoolteacher', 0.4979),
    ('businesswoman', 0.4934),
    …]


  137. Pre-trained Vectors
    Culture is biased


  138. Pre-trained Vectors
    Culture is biased
    Language is biased


  139. Pre-trained Vectors
    Culture is biased
    Language is biased
    Algorithms are not?


  140. Culture is biased
    Language is biased
    Algorithms are not?
    “Garbage in, garbage out”
    Pre-trained Vectors


  141. Pre-trained Vectors


  142. NOT ONLY
    WORD2VEC


  143. GloVe (2014)


  144. GloVe (2014)
    • Global co-occurrence matrix


  145. GloVe (2014)
    • Global co-occurrence matrix
    • Much bigger memory footprint


  146. GloVe (2014)
    • Global co-occurrence matrix
    • Much bigger memory footprint
    • Downstream tasks: similar performances


  147. doc2vec (2014)


  148. doc2vec (2014)
    • From words to documents


  149. doc2vec (2014)
    • From words to documents
    • (or sentences, paragraphs, classes, …)


  150. doc2vec (2014)
    • From words to documents
    • (or sentences, paragraphs, classes, …)
    • P(context | word, label)


  151. fastText (2016-17)


  152. • word2vec + morphology (sub-words)
    fastText (2016-17)


  153. • word2vec + morphology (sub-words)
    • Pre-trained vectors on ~300 languages
    fastText (2016-17)


  154. • word2vec + morphology (sub-words)
    • Pre-trained vectors on ~300 languages
    • morphologically rich languages
    fastText (2016-17)

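    gensim also ships a fastText implementation with the same iterable-of-token-lists interface as Word2Vec; a minimal, hypothetical sketch (reusing a corpus iterator like the ones above):

    from gensim.models import FastText

    model = FastText(corpus)                 # trains subword (character n-gram) vectors too
    model.wv.most_similar('developer')       # subword information helps with rare or misspelled words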

  155. FINAL REMARKS


  156. But we’ve been doing this for X years


  157. But we’ve been doing this for X years
    • Approaches based on co-occurrences are not new


  158. But we’ve been doing this for X years
    • Approaches based on co-occurrences are not new
    • … but usually outperformed by word embeddings


  159. But we’ve been doing this for X years
    • Approaches based on co-occurrences are not new
    • … but usually outperformed by word embeddings
    • … and don’t scale as well as word embeddings


  160. Garbage in, garbage out


  161. Garbage in, garbage out
    • Pre-trained vectors are useful … until they’re not


  162. Garbage in, garbage out
    • Pre-trained vectors are useful … until they’re not
    • The business domain is important


  163. Garbage in, garbage out
    • Pre-trained vectors are useful … until they’re not
    • The business domain is important
    • > 100K words? Maybe train your own model


  164. Garbage in, garbage out
    • Pre-trained vectors are useful … until they’re not
    • The business domain is important
    • > 100K words? Maybe train your own model
    • > 1M words? Yep, train your own model


  165. Summary
    • Word Embeddings are magic!
    • Big victory of unsupervised learning
    • Gensim makes your life easy


  166. Credits & Readings


  167. Credits & Readings
    Credits
    • Lev Konstantinovskiy (@teagermylk)
    Readings
    • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
    • “GloVe: global vectors for word representation” by Pennington et al.
    • “Distributed Representations of Sentences and Documents” (doc2vec)
      by Le and Mikolov
    • “Enriching Word Vectors with Subword Information” (fastText)
      by Bojanowski et al.


  168. Credits & Readings
    Even More Readings
    • “Man is to Computer Programmer as Woman is to Homemaker? Debiasing
    Word Embeddings” by Bolukbasi et al.
    • “Quantifying and Reducing Stereotypes in Word Embeddings” by Bolukbasi et al.
    • “Equality of Opportunity in Machine Learning” - Google Research Blog

    https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html
    Pics Credits
    • Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
    • Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg
    • Broccoli: https://commons.wikimedia.org/wiki/File:Broccoli_and_cross_section_edit.jpg
    • Pizza: https://commons.wikimedia.org/wiki/File:Eq_it-na_pizza-margherita_sep2005_sml.jpg


  169. THANK YOU
    @MarcoBonzanini
    speakerdeck.com/marcobonzanini
    GitHub.com/bonzanini
    marcobonzanini.com
