Word Embeddings for Natural Language Processing in Python @ London Python meetup

https://www.meetup.com/LondonPython/events/240263693/

Word embeddings are a family of Natural Language Processing (NLP) algorithms where words are mapped to vectors in low-dimensional space.

The interest around word embeddings has been on the rise in the past few years, because these techniques have been driving important improvements in many NLP applications like text classification, sentiment analysis or machine translation.

This talk is an introduction to word embeddings, with a particular focus on word2vec and doc2vec.

Marco Bonzanini

September 28, 2017

Transcript

  1. Word Embeddings 

    for NLP
    in Python
    Marco Bonzanini

    London Python Meet-up
    September 2017

  2. Nice to meet you

  3. WORD EMBEDDINGS?

  4. Word Embeddings
    = Word Vectors
    = Distributed Representations

  5. Why should you care?

  6. Why should you care?
    Data representation

    is crucial

  7. Applications

  8. Applications
    Classification

  9. Applications
    Classification
    Recommender Systems

  10. Applications
    Classification
    Recommender Systems
    Search Engines

  11. Applications
    Classification
    Recommender Systems
    Search Engines
    Machine Translation

  12. One-hot Encoding

  13. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]

  14. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]
    (each vector position corresponds to one word: Rome, Paris, …, word V)

  15. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]
    V = vocabulary size (huge)
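As a side note (not on the slides), here is a minimal sketch of one-hot encoding over a toy four-word vocabulary, just to make the representation concrete; a real vocabulary has tens or hundreds of thousands of entries.

    import numpy as np

    vocabulary = ['Rome', 'Paris', 'Italy', 'France']   # toy vocabulary; the real V is huge
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        """Return a vector of length V with a single 1 at the word's position."""
        vector = np.zeros(len(vocabulary))
        vector[word_to_index[word]] = 1.0
        return vector

    print(one_hot('Rome'))   # [1. 0. 0. 0.]
    print(one_hot('Paris'))  # [0. 1. 0. 0.]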

  16. Bag-of-words

  17. Bag-of-words
    doc_1
    doc_2

    doc_N
    = [32, 14, 1, 0, …, 6]
    = [ 2, 12, 0, 28, …, 12]

    = [13, 0, 6, 2, …, 0]

  18. Bag-of-words
    doc_1
    doc_2

    doc_N
    = [32, 14, 1, 0, …, 6]
    = [ 2, 12, 0, 28, …, 12]

    = [13, 0, 6, 2, …, 0]
    (columns are counts of vocabulary words: Rome, Paris, …, word V)
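Again as an aside (not from the deck), a bare-bones bag-of-words sketch over a fixed toy vocabulary; each document becomes a vector of word counts.

    from collections import Counter

    vocabulary = ['rome', 'paris', 'italy', 'france']   # fixed toy vocabulary

    def bag_of_words(doc):
        """Count how often each vocabulary word occurs in the document."""
        counts = Counter(doc.lower().split())
        return [counts[word] for word in vocabulary]

    print(bag_of_words("rome is the capital of italy and rome is ancient"))  # [2, 0, 1, 0]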

  19. Word Embeddings

  20. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]

  21. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]
    n. dimensions << vocabulary size

  22. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]

  23. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]

  24. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]

  25. Word Embeddings
    Rome
    Paris Italy
    France

  26. Word Embeddings
    is-capital-of

  27. Word Embeddings
    Paris

  28. Word Embeddings
    Paris + Italy

  29. Word Embeddings
    Paris + Italy - France

  30. Word Embeddings
    Paris + Italy - France ≈ Rome
    Rome
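A toy illustration of the analogy arithmetic (mine, not from the deck), reusing the example values from the earlier slides truncated to four dimensions; the numbers are made up, only the relative geometry matters.

    import numpy as np

    vectors = {
        'Rome':   np.array([0.91, 0.83, 0.17, 0.41]),
        'Paris':  np.array([0.92, 0.82, 0.17, 0.98]),
        'Italy':  np.array([0.32, 0.77, 0.67, 0.42]),
        'France': np.array([0.33, 0.78, 0.66, 0.97]),
    }

    def cosine(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    target = vectors['Paris'] + vectors['Italy'] - vectors['France']
    best = max(vectors, key=lambda word: cosine(vectors[word], target))
    print(best)   # Rome: the closest vector to Paris + Italy - France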

  31. FROM LANGUAGE
    TO VECTORS?

  32. Distributional Hypothesis

  33. –J.R. Firth 1957
    “You shall know a word by the company it keeps.”

  34. –Z. Harris 1954
    “Words that occur in similar context
    tend to have similar meaning.”

  35. Context ≈ Meaning

  36. I enjoyed eating some pizza at the restaurant

  37. I enjoyed eating some pizza at the restaurant
    Word

  38. I enjoyed eating some pizza at the restaurant
    The company it keeps
    Word

  39. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some pineapple at the restaurant

  40. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some pineapple at the restaurant

  41. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some pineapple at the restaurant
    Same context

  42. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some pineapple at the restaurant
    Pizza = Pineapple ?
    Same context

  43. A BIT OF THEORY
    word2vec

  44. (image-only slide)

  45. (image-only slide)

  46. word2vec Architecture
    Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space

  47. Vector Calculation

  48. Vector Calculation
    Goal: learn vec(word)

  49. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function

  50. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors

  51. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors
    3. Run stochastic gradient descent

  52. Intermezzo (Gradient Descent)

  53. Intermezzo (Gradient Descent)
    x
    F(x)

  54. Intermezzo (Gradient Descent)
    x
    F(x)
    Objective Function (to minimise)

  55. Intermezzo (Gradient Descent)
    x
    F(x)
    Find the optimal “x”

  56. Intermezzo (Gradient Descent)
    x
    F(x)
    Random Init

  57. Intermezzo (Gradient Descent)
    x
    F(x)
    Derivative

  58. Intermezzo (Gradient Descent)
    x
    F(x)
    Update

  59. Intermezzo (Gradient Descent)
    x
    F(x)
    Derivative

  60. Intermezzo (Gradient Descent)
    x
    F(x)
    Update

  61. Intermezzo (Gradient Descent)
    x
    F(x)
    and again

  62. Intermezzo (Gradient Descent)
    x
    F(x)
    Until convergence

  63. Intermezzo (Gradient Descent)
    • Optimisation algorithm

  64. Intermezzo (Gradient Descent)
    • Optimisation algorithm
    • Purpose: find the min (or max) for F

  65. Intermezzo (Gradient Descent)
    • Optimisation algorithm
    • Purpose: find the min (or max) for F
    • Batch-oriented (use all data points)

  66. Intermezzo (Gradient Descent)
    • Optimisation algorithm
    • Purpose: find the min (or max) for F
    • Batch-oriented (use all data points)
    • Stochastic GD: update after each sample
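To make the intermezzo concrete, here is a minimal gradient descent loop (not in the slides) on the one-dimensional objective F(x) = (x - 3)^2, whose minimum is at x = 3; stochastic gradient descent follows the same recipe but computes each update from a single sample.

    def F(x):
        return (x - 3) ** 2      # objective function to minimise

    def dF(x):
        return 2 * (x - 3)       # its derivative

    x = 10.0                     # random-ish init
    learning_rate = 0.1
    for step in range(100):
        x = x - learning_rate * dF(x)   # update: move against the gradient

    print(round(x, 3))           # ≈ 3.0, the minimum of F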

  67. Objective Function

  68. I enjoyed eating some pizza at the restaurant
    Objective Function

  69. I enjoyed eating some pizza at the restaurant
    Objective Function

  70. I enjoyed eating some pizza at the restaurant
    Objective Function

  71. I enjoyed eating some pizza at the restaurant
    Maximise the likelihood of the context given the focus word
    Objective Function

  72. I enjoyed eating some pizza at the restaurant
    Maximise the likelihood of the context given the focus word
    P(i | pizza)
    P(enjoyed | pizza)

    P(restaurant | pizza)
    Objective Function

  73. Example
    I enjoyed eating some pizza at the restaurant

  74. I enjoyed eating some pizza at the restaurant
    Iterate over context words
    Example

  75. I enjoyed eating some pizza at the restaurant
    bump P( i | pizza )
    Example

  76. I enjoyed eating some pizza at the restaurant
    bump P( enjoyed | pizza )
    Example

  77. I enjoyed eating some pizza at the restaurant
    bump P( eating | pizza )
    Example

  78. I enjoyed eating some pizza at the restaurant
    bump P( some | pizza )
    Example

  79. I enjoyed eating some pizza at the restaurant
    bump P( at | pizza )
    Example

  80. I enjoyed eating some pizza at the restaurant
    bump P( the | pizza )
    Example

  81. I enjoyed eating some pizza at the restaurant
    bump P( restaurant | pizza )
    Example

  82. I enjoyed eating some pizza at the restaurant
    Move to next focus word and repeat
    Example

  83. I enjoyed eating some pizza at the restaurant
    bump P( i | at )
    Example

  84. I enjoyed eating some pizza at the restaurant
    bump P( enjoyed | at )
    Example

  85. I enjoyed eating some pizza at the restaurant
    … you get the picture
    Example
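The walkthrough above is just a sliding window over the sentence. A small sketch (mine, not from the deck) that generates the (focus, context) pairs and prints the corresponding "bump" steps; word2vec typically uses a window of around 5 words, and the slides effectively use the whole sentence as context.

    sentence = "I enjoyed eating some pizza at the restaurant".lower().split()

    def context_pairs(tokens, window=2):
        """Yield (focus, context) pairs for each focus word and its neighbours."""
        for i, focus in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield focus, tokens[j]

    for focus, context in context_pairs(sentence):
        print(f"bump P( {context} | {focus} )")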

  86. P( eating | pizza )

  87. P( eating | pizza ) ??

  88. P( eating | pizza )
    Input word
    Output word

  89. P( eating | pizza )
    Input word
    Output word
    P( vec(eating) | vec(pizza) )

  90. P( vout | vin )
    P( vec(eating) | vec(pizza) )
    P( eating | pizza )
    Input word
    Output word

  91. P( vout | vin )
    P( vec(eating) | vec(pizza) )
    P( eating | pizza )
    Input word
    Output word
    ???

  92. P( vout | vin )

  93. cosine( vout, vin )

  94. cosine( vout, vin ) [-1, 1]

  95. softmax(cosine( vout, vin ))

  96. softmax(cosine( vout, vin )) [0, 1]

  97. softmax(cosine( vout, vin ))
    P(vout | vin) = exp(cosine(vout, vin)) / Σ_{vk ∈ V} exp(cosine(vk, vin))
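A small numpy sketch (not on the slides) of the formula above, following the deck's cosine-based formulation: similarity scores between the input vector and every candidate output vector are turned into a probability distribution with a softmax. The vectors here are random stand-ins for the ones being learned.

    import numpy as np

    def cosine(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def softmax(scores):
        exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
        return exps / exps.sum()

    rng = np.random.default_rng(0)
    vocab = ['eating', 'pizza', 'restaurant', 'france']
    vectors = {word: rng.normal(size=50) for word in vocab}   # stand-ins for learned vectors

    v_in = vectors['pizza']
    scores = np.array([cosine(vectors[word], v_in) for word in vocab])
    probs = softmax(scores)       # P(vout | vin) for each candidate output word
    print(dict(zip(vocab, probs.round(3))))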

  98. Vector Calculation Recap

  99. Vector Calculation Recap
    Learn vec(word)

  100. Vector Calculation Recap
    Learn vec(word)
    by gradient descent

  101. Vector Calculation Recap
    Learn vec(word)
    by gradient descent
    on the softmax probability

  102. Plot Twist

  103. (image-only slide)

  104. (image-only slide)

  105. Paragraph Vector
    a.k.a.
    doc2vec
    i.e.
    P(vout | vin, label)

  106. A BIT OF PRACTICE

  107. (image-only slide)

  108. pip install gensim
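As a quick smoke test (not in the slides), gensim's Word2Vec can be trained on a toy corpus in a few lines; the case studies below follow the same pattern with real corpora. With such a tiny corpus the similarity scores are meaningless, but the API is identical.

    from gensim.models import Word2Vec

    # each "sentence" is a list of tokens; min_count=1 keeps every word of this tiny corpus
    sentences = [
        ['i', 'enjoyed', 'eating', 'some', 'pizza', 'at', 'the', 'restaurant'],
        ['i', 'enjoyed', 'eating', 'some', 'pineapple', 'at', 'the', 'restaurant'],
    ]
    model = Word2Vec(sentences, min_count=1, window=2)
    print(model.wv.most_similar('pizza'))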

  109. Case Study 1: Skills and CVs

  110. Case Study 1: Skills and CVs
    Data set of ~300k resumes
    Each experience is a “sentence”
    Each experience has 3-15 skills
    Approx 15k unique skills

  111. Case Study 1: Skills and CVs
    from gensim.models import Word2Vec
    fname = 'candidates.jsonl'
    corpus = ResumesCorpus(fname)
    model = Word2Vec(corpus)
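ResumesCorpus is not shown in the deck. A hypothetical sketch of such an iterable, assuming one JSON document per line in candidates.jsonl with an experiences field holding a list of skills per experience, might look like this; Word2Vec accepts any restartable iterable of token lists.

    import json

    class ResumesCorpus:
        """Hypothetical corpus: streams one 'sentence' (list of skills) per experience."""

        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:
                    candidate = json.loads(line)
                    for experience in candidate.get('experiences', []):   # assumed field names
                        yield [skill.lower() for skill in experience.get('skills', [])]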

  112. Case Study 1: Skills and CVs
    model.most_similar('chef')
    [('cook', 0.94),
    ('bartender', 0.91),
    ('waitress', 0.89),
    ('restaurant', 0.76),
    ...]

  113. Case Study 1: Skills and CVs
    model.most_similar('chef', negative=['food'])
    [('puppet', 0.93),
    ('devops', 0.92),
    ('ansible', 0.79),
    ('salt', 0.77),
    ...]

  114. Case Study 1: Skills and CVs
    Useful for:
    Data exploration
    Query expansion/suggestion
    Recommendations

  115. Case Study 2: Beer!

  116. Case Study 2: Beer!
    Data set of ~2.9M beer reviews
    89 different beer styles
    635k unique tokens
    185M total tokens

  117. Case Study 2: Beer!
    from gensim.models import Doc2Vec
    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)
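RateBeerCorpus is not shown either. For Doc2Vec, gensim expects TaggedDocument objects, so a hypothetical version (assuming a CSV with review-text and beer-style columns) could look like the sketch below. Tagging each review with its beer style is what makes model.docvecs['Stout'] meaningful later: all reviews of a style share one document vector.

    import csv
    from gensim.models.doc2vec import TaggedDocument

    class RateBeerCorpus:
        """Hypothetical corpus: one TaggedDocument per review, tagged with its beer style."""

        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for row in csv.DictReader(f):                    # column names are assumed
                    tokens = row['review_text'].lower().split()
                    yield TaggedDocument(words=tokens, tags=[row['beer_style']])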

  118. Case Study 2: Beer!
    from gensim.models import Doc2Vec
    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)
    3.5h on my laptop
    … remember to pickle
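Rather than pickling by hand, gensim models can be persisted with their own save() and load() methods, for example (the file name is just an illustration):

    model.save('ratebeer_doc2vec.model')             # after the 3.5h training run
    model = Doc2Vec.load('ratebeer_doc2vec.model')   # reload instantly next time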

  119. Case Study 2: Beer!
    model.docvecs.most_similar('Stout')
    [('Sweet Stout', 0.9877),
    ('Porter', 0.9620),
    ('Foreign Stout', 0.9595),
    ('Dry Stout', 0.9561),
    ('Imperial/Strong Porter', 0.9028),
    ...]

  120. Case Study 2: Beer!
    model.most_similar([model.docvecs['Stout']])

    [('coffee', 0.6342),
    ('espresso', 0.5931),
    ('charcoal', 0.5904),
    ('char', 0.5631),
    ('bean', 0.5624),
    ...]

  121. Case Study 2: Beer!
    model.most_similar([model.docvecs['Wheat Ale']])

    [('lemon', 0.6103),
    ('lemony', 0.5909),
    ('wheaty', 0.5873),
    ('germ', 0.5684),
    ('lemongrass', 0.5653),
    ('wheat', 0.5649),
    ('lime', 0.55636),
    ('verbena', 0.5491),
    ('coriander', 0.5341),
    ('zesty', 0.5182)]

  122. PCA: scikit-learn — Data Viz: Bokeh
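Not from the deck: a rough sketch of how the beer-style document vectors could be projected to two dimensions with scikit-learn's PCA before plotting with Bokeh. The style names here are just a subset of the 89, and in gensim 4.x the attribute is model.dv rather than model.docvecs.

    import numpy as np
    from sklearn.decomposition import PCA

    styles = ['Stout', 'Sweet Stout', 'Porter', 'Wheat Ale']     # any subset of the styles
    X = np.array([model.docvecs[style] for style in styles])

    coords = PCA(n_components=2).fit_transform(X)
    for style, (x, y) in zip(styles, coords):
        print(style, round(x, 3), round(y, 3))                   # 2D coordinates to plot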

  123. Dark beers

  124. Strong beers

  125. Sour beers

  126. Lagers

  127. Wheat beers

  128. Case Study 2: Beer!
    Useful for:
    Understanding the language of beer enthusiasts
    Planning your next pint
    Classification

  129. Case Study 3: Evil AI

  130. Case Study 3: Evil AI
    from gensim.models.keyedvectors import KeyedVectors

    fname = 'GoogleNews-vectors.bin'
    model = KeyedVectors.load_word2vec_format(
        fname, binary=True
    )

  131. Case Study 3: Evil AI
    model.most_similar(
        positive=['king', 'woman'],
        negative=['man']
    )

  132. Case Study 3: Evil AI
    model.most_similar(
        positive=['king', 'woman'],
        negative=['man']
    )
    [('queen', 0.7118),
    ('monarch', 0.6189),
    ('princess', 0.5902),
    ('crown_prince', 0.5499),
    ('prince', 0.5377),
    …]

  133. Case Study 3: Evil AI
    model.most_similar(
        positive=['Paris', 'Italy'],
        negative=['France']
    )

  134. Case Study 3: Evil AI
    model.most_similar(
        positive=['Paris', 'Italy'],
        negative=['France']
    )
    [('Milan', 0.7222),
    ('Rome', 0.7028),
    ('Palermo_Sicily', 0.5967),
    ('Italian', 0.5911),
    ('Tuscany', 0.5632),
    …]

  135. Case Study 3: Evil AI
    model.most_similar(
        positive=['professor', 'woman'],
        negative=['man']
    )

  136. Case Study 3: Evil AI
    model.most_similar(
        positive=['professor', 'woman'],
        negative=['man']
    )
    [('associate_professor', 0.7771),
    ('assistant_professor', 0.7558),
    ('professor_emeritus', 0.7066),
    ('lecturer', 0.6982),
    ('sociology_professor', 0.6539),
    …]

  137. Case Study 3: Evil AI
    model.most_similar(
        positive=['computer_programmer', 'woman'],
        negative=['man']
    )

  138. Case Study 3: Evil AI
    model.most_similar(
        positive=['computer_programmer', 'woman'],
        negative=['man']
    )
    [('homemaker', 0.5627),
    ('housewife', 0.5105),
    ('graphic_designer', 0.5051),
    ('schoolteacher', 0.4979),
    ('businesswoman', 0.4934),
    …]

  139. Case Study 3: Evil AI
    • Culture is biased

  140. Case Study 3: Evil AI
    • Culture is biased
    • Language is biased

  141. Case Study 3: Evil AI
    • Culture is biased
    • Language is biased
    • Algorithms are not?

  142. Case Study 3: Evil AI
    • Culture is biased
    • Language is biased
    • Algorithms are not?
    • “Garbage in, garbage out”

  143. Case Study 3: Evil AI

  144. FINAL REMARKS

  145. But we’ve been doing this for X years

  146. But we’ve been doing this for X years
    • Approaches based on co-occurrences are not new
    • Think SVD / LSA / LDA
    • … but they are usually outperformed by word2vec
    • … and don’t scale as well as word2vec

  147. Efficiency

  148. Efficiency
    • There is no co-occurrence matrix (vectors are learned directly)
    • Softmax has complexity O(V); Hierarchical Softmax is only O(log(V))
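In gensim this is exposed as constructor flags; a minimal sketch (assuming any iterable of token lists, as in the case studies above): hs=1 switches on hierarchical softmax, and negative=0 switches off negative sampling.

    from gensim.models import Word2Vec

    sentences = [['pizza', 'restaurant', 'eating'], ['stout', 'porter', 'coffee']]
    # hs=1: hierarchical softmax, O(log V) per update; negative=0: no negative sampling
    model = Word2Vec(sentences, min_count=1, hs=1, negative=0)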

  149. Garbage in, garbage out

  150. Garbage in, garbage out
    • Pre-trained vectors are useful
    • … until they’re not
    • The business domain is important
    • The pre-processing steps are important
    • > 100K words? Maybe train your own model
    • > 1M words? Yep, train your own model

  151. Summary

  152. Summary
    • Word Embeddings are magic!
    • Big victory of unsupervised learning
    • Gensim makes your life easy

  153. Credits & Readings

  154. Credits & Readings
    Credits
    • Lev Konstantinovskiy (@gensim_py)
    • Chris E. Moody (@chrisemoody) see videos on lda2vec
    Readings
    • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
    • “word2vec parameter learning explained” by Xin Rong
    More readings
    • “GloVe: global vectors for word representation” by Pennington et al.
    • “Dependency based word embeddings” and “Neural word embeddings
    as implicit matrix factorization” by O. Levy and Y. Goldberg

  155. Credits & Readings
    Even More Readings
    • “Man is to Computer Programmer as Woman is to Homemaker?
    Debiasing Word Embeddings” by Bolukbasi et al.
    • “Quantifying and Reducing Stereotypes in Word Embeddings” by
    Bolukbasi et al.
    • “Equality of Opportunity in Machine Learning” - Google Research Blog

    https://research.googleblog.com/2016/10/equality-of-opportunity-in-machine.html
    Pics Credits
    • Classification: https://commons.wikimedia.org/wiki/File:Cluster-2.svg
    • Translation: https://commons.wikimedia.org/wiki/File:Translation_-_A_till_%C3%85-colours.svg

  156. THANK YOU
    @MarcoBonzanini
    GitHub.com/bonzanini
    marcobonzanini.com
