
Word Embeddings for Natural Language Processing in Python

Slides for my talk on word embeddings at PyCon Italy 2017 (PyCon Otto):
https://www.pycon.it/conference/talks/word-embeddings-for-natural-language-processing-in-python
Abstract:
Word embeddings are a family of Natural Language Processing (NLP) algorithms where words are mapped to vectors in a low-dimensional space. Interest in word embeddings has been on the rise in the past few years, because these techniques have been driving important improvements in many NLP applications such as text classification, sentiment analysis and machine translation.

In this talk we’ll describe the intuitions behind this family of algorithms, we’ll explore some of the Python tools that allow us to implement modern NLP applications and we’ll conclude with some practical considerations.

Marco Bonzanini

April 08, 2017

Transcript

  1. Word Embeddings for NLP in Python
    Marco Bonzanini

    PyCon Italia 2017


  2. Nice to meet you


  3. WORD EMBEDDINGS?


  4. Word Embeddings
    = Word Vectors
    = Distributed Representations


  5. Why should you care?


  6. Why should you care?
    Data representation

    is crucial


  7. Applications


  8. Applications
    • Classification / tagging
    • Recommendation Systems
    • Search Engines (Query Expansion)
    • Machine Translation


  9. One-hot Encoding


  10. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]


  11. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]
    (each dimension corresponds to one vocabulary word: Rome, Paris, …, word V)


  12. One-hot Encoding
    Rome
    Paris
    Italy
    France
    = [1, 0, 0, 0, 0, 0, …, 0]
    = [0, 1, 0, 0, 0, 0, …, 0]
    = [0, 0, 1, 0, 0, 0, …, 0]
    = [0, 0, 0, 1, 0, 0, …, 0]
    V = vocabulary size (huge)

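    To make the idea concrete, a minimal Python sketch with a toy four-word vocabulary (real vocabularies have V entries, one per word):

    vocabulary = ["Rome", "Paris", "Italy", "France"]   # toy; real vocabularies are huge

    def one_hot(word, vocabulary):
        # V-dimensional vector with a single 1 at the word's position
        vector = [0] * len(vocabulary)
        vector[vocabulary.index(word)] = 1
        return vector

    print(one_hot("Rome", vocabulary))    # [1, 0, 0, 0]
    print(one_hot("Italy", vocabulary))   # [0, 0, 1, 0]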

  13. Bag-of-words


  14. Bag-of-words
    doc_1
    doc_2

    doc_N
    = [32, 14, 1, 0, …, 6]
    = [ 2, 12, 0, 28, …, 12]

    = [13, 0, 6, 2, …, 0]


  15. Bag-of-words
    doc_1
    doc_2

    doc_N
    = [32, 14, 1, 0, …, 6]
    = [ 2, 12, 0, 28, …, 12]

    = [13, 0, 6, 2, …, 0]
    (columns: Rome, Paris, …, word V; one count per vocabulary word)

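    The same idea as a quick Python sketch, with a made-up document and vocabulary, just to show that each document becomes a vector of word counts:

    from collections import Counter

    vocabulary = ["rome", "paris", "pizza", "wine"]   # toy vocabulary
    doc_1 = "pizza in rome , pizza and wine"

    counts = Counter(doc_1.split())
    vec_1 = [counts[word] for word in vocabulary]     # one count per vocabulary word
    print(vec_1)  # [1, 0, 2, 1]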

  16. Word Embeddings


  17. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]


  18. Word Embeddings
    Rome
    Paris
    Italy
    France
    = [0.91, 0.83, 0.17, …, 0.41]
    = [0.92, 0.82, 0.17, …, 0.98]
    = [0.32, 0.77, 0.67, …, 0.42]
    = [0.33, 0.78, 0.66, …, 0.97]
    n. dimensions << vocabulary size



  22. Word Embeddings
    (2-D plot of the vectors for Rome, Paris, Italy and France)


  23. Word Embeddings
    Paris + Italy - France ≈ Rome

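    With a trained model this arithmetic is a one-liner in gensim (introduced later in the talk). The two-sentence corpus below only makes the snippet self-contained; a meaningful analogy needs a model trained on far more data:

    from gensim.models import Word2Vec

    sentences = [["paris", "is", "the", "capital", "of", "france"],
                 ["rome", "is", "the", "capital", "of", "italy"]]
    model = Word2Vec(sentences, min_count=1)   # toy model, just to show the API

    # Paris + Italy - France ≈ ?
    print(model.wv.most_similar(positive=["paris", "italy"], negative=["france"]))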

  24. THE MAIN INTUITION


  25. Distributional Hypothesis


  26. –J.R. Firth 1957
    “You shall know a word by the company it keeps.”


  27. –Z. Harris 1954
    “Words that occur in similar context
    tend to have similar meaning.”


  28. Context ≈ Meaning


  29. I enjoyed eating some pizza at the restaurant


  30. I enjoyed eating some pizza at the restaurant
    Word


  31. I enjoyed eating some pizza at the restaurant
    The company it keeps
    Word


  32. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some fiorentina at the restaurant



  34. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some fiorentina at the restaurant
    Same context


  35. I enjoyed eating some pizza at the restaurant
    I enjoyed eating some fiorentina at the restaurant
    Same context
    Pizza = Fiorentina ?


  36. A BIT OF THEORY
    word2vec



  39. Vector Calculation


  40. Vector Calculation
    Goal: learn vec(word)


  41. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function


  42. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors


  43. Vector Calculation
    Goal: learn vec(word)
    1. Choose objective function
    2. Init: random vectors
    3. Run gradient descent


  44. I enjoyed eating some pizza at the restaurant



  47. I enjoyed eating some pizza at the restaurant
    Maximise the likelihood of the context given the focus word


  48. I enjoyed eating some pizza at the restaurant
    Maximise the likelihood of the context given the focus word
    P(i | pizza)
    P(enjoyed | pizza)

    P(restaurant | pizza)


  49. Example
    I enjoyed eating some pizza at the restaurant


  50. I enjoyed eating some pizza at the restaurant
    Iterate over context words
    Example


  51. I enjoyed eating some pizza at the restaurant
    bump P( i | pizza )
    Example


  52. I enjoyed eating some pizza at the restaurant
    bump P( enjoyed | pizza )
    Example


  53. I enjoyed eating some pizza at the restaurant
    bump P( eating | pizza )
    Example


  54. I enjoyed eating some pizza at the restaurant
    bump P( some | pizza )
    Example


  55. I enjoyed eating some pizza at the restaurant
    bump P( at | pizza )
    Example


  56. I enjoyed eating some pizza at the restaurant
    bump P( the | pizza )
    Example


  57. I enjoyed eating some pizza at the restaurant
    bump P( restaurant | pizza )
    Example


  58. I enjoyed eating some pizza at the restaurant
    Move to next focus word and repeat
    Example


  59. I enjoyed eating some pizza at the restaurant
    bump P( i | at )
    Example


  60. I enjoyed eating some pizza at the restaurant
    bump P( enjoyed | at )
    Example


  61. I enjoyed eating some pizza at the restaurant
    … you get the picture
    Example

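    The iteration in this example can be sketched in a few lines of Python. The window size is an arbitrary choice here; the example above effectively uses a window covering the whole sentence:

    sentence = "I enjoyed eating some pizza at the restaurant".lower().split()
    window = 2   # arbitrary, for the sketch

    for i, focus in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        for c in context:
            # each (focus, context) pair is one "bump P( c | focus )" step
            print(f"bump P( {c} | {focus} )")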

  62. P( eating | pizza )


  63. P( eating | pizza ) ??


  64. P( eating | pizza )
    Input word
    Output word


  65. P( eating | pizza )
    Input word
    Output word
    P( vec(eating) | vec(pizza) )


  66. P( vout | vin )
    P( vec(eating) | vec(pizza) )
    P( eating | pizza )
    Input word
    Output word


  67. P( vout | vin )
    P( vec(eating) | vec(pizza) )
    P( eating | pizza )
    Input word
    Output word
    ???


  68. P( vout | vin )


  69. cosine( vout, vin )


  70. cosine( vout, vin ) [-1, 1]


  71. softmax(cosine( vout, vin ))


  72. softmax(cosine( vout, vin )) [0, 1]


  73. softmax(cosine( vout, vin ))
    P( vout | vin ) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))

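    A direct transcription of the formula, with toy 3-dimensional vectors invented for illustration; the real computation sums over every word in the vocabulary:

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    vecs = {   # toy vectors, invented for the example
        "eating":     np.array([0.9, 0.1, 0.3]),
        "pizza":      np.array([0.8, 0.2, 0.4]),
        "restaurant": np.array([0.7, 0.3, 0.2]),
        "gradient":   np.array([0.1, 0.9, 0.8]),
    }

    def p(out_word, in_word):
        # P( vout | vin ) = exp(cosine(vout, vin)) / sum over k in V of exp(cosine(vk, vin))
        scores = {w: np.exp(cosine(v, vecs[in_word])) for w, v in vecs.items()}
        return scores[out_word] / sum(scores.values())

    print(p("eating", "pizza"))   # a value in [0, 1]; all P( . | pizza ) sum to 1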

  74. Vector Calculation Recap


  75. Vector Calculation Recap
    Learn vec(word)


  76. Vector Calculation Recap
    Learn vec(word)
    by gradient descent


  77. Vector Calculation Recap
    Learn vec(word)
    by gradient descent
    on the softmax probability


  78. Plot Twist



  81. Paragraph Vector
    a.k.a.
    doc2vec
    i.e.
    P(vout | vin, label)


  82. A BIT OF PRACTICE



  84. pip install gensim


  85. Case Study 1: Skills and CVs


  86. Case Study 1: Skills and CVs
    Data set of ~300k resumes
    Each experience is a “sentence”
    Each experience has 3-15 skills
    Approx 15k unique skills


  87. Case Study 1: Skills and CVs
    from gensim.models import Word2Vec
    fname = 'candidates.jsonl'
    corpus = ResumesCorpus(fname)
    model = Word2Vec(corpus)

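    The ResumesCorpus class is not shown in the deck. A minimal sketch of what such an iterable might look like, assuming one JSON object per line and guessing the field names ('experiences', 'skills'):

    import json

    class ResumesCorpus:
        """Restartable iterable of 'sentences': one list of skill tokens per experience."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:                        # one candidate per line
                    candidate = json.loads(line)
                    for experience in candidate.get('experiences', []):
                        yield experience['skills']    # e.g. ['chef', 'cook', 'bartender']

    Word2Vec makes several passes over the data (vocabulary building plus training epochs), which is why this is a restartable class rather than a one-shot generator.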

  88. Case Study 1: Skills and CVs
    model.most_similar('chef')
    [('cook', 0.94),
    ('bartender', 0.91),
    ('waitress', 0.89),
    ('restaurant', 0.76),
    ...]


  89. Case Study 1: Skills and CVs
    model.most_similar('chef', negative=['food'])
    [('puppet', 0.93),
    ('devops', 0.92),
    ('ansible', 0.79),
    ('salt', 0.77),
    ...]


  90. Case Study 1: Skills and CVs
    Useful for:
    Data exploration
    Query expansion/suggestion
    Recommendations


  91. Case Study 2: Beer!


  92. Case Study 2: Beer!
    Data set of ~2.9M beer reviews
    89 different beer styles
    635k unique tokens
    185M total tokens


  93. Case Study 2: Beer!
    from gensim.models import Doc2Vec
    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)


  94. Case Study 2: Beer!
    from gensim.models import Doc2Vec
    fname = 'ratebeer_data.csv'
    corpus = RateBeerCorpus(fname)
    model = Doc2Vec(corpus)
    3.5h on my laptop
    … remember to pickle

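    Again, the corpus class is not shown in the deck. Doc2Vec expects TaggedDocument objects, so a sketch might look like this (the CSV column names are guesses); tagging each review with its beer style is what makes the per-style document vectors on the next slides possible:

    import csv
    from gensim.models.doc2vec import TaggedDocument

    class RateBeerCorpus:
        """Yields one TaggedDocument per review, tagged with its beer style."""
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            with open(self.fname, newline='') as f:
                for row in csv.DictReader(f):
                    words = row['review'].lower().split()    # column names are guesses
                    yield TaggedDocument(words=words, tags=[row['style']])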

  95. Case Study 2: Beer!
    model.docvecs.most_similar('Stout')
    [('Sweet Stout', 0.9877),
    ('Porter', 0.9620),
    ('Foreign Stout', 0.9595),
    ('Dry Stout', 0.9561),
    ('Imperial/Strong Porter', 0.9028),
    ...]


  96. Case Study 2: Beer!
    model.most_similar([model.docvecs['Stout']])

    [('coffee', 0.6342),
    ('espresso', 0.5931),
    ('charcoal', 0.5904),
    ('char', 0.5631),
    ('bean', 0.5624),
    ...]


  97. Case Study 2: Beer!
    model.most_similar([model.docvecs['Wheat Ale']])

    [('lemon', 0.6103),
    ('lemony', 0.5909),
    ('wheaty', 0.5873),
    ('germ', 0.5684),
    ('lemongrass', 0.5653),
    ('wheat', 0.5649),
    ('lime', 0.55636),
    ('verbena', 0.5491),
    ('coriander', 0.5341),
    ('zesty', 0.5182)]


  98. PCA

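    The plots on the next few slides are 2-D projections of the style vectors. A sketch of how such a projection can be computed with scikit-learn, reusing the Doc2Vec model above (the style list is shortened to names mentioned in this deck):

    import numpy as np
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    styles = ['Stout', 'Sweet Stout', 'Dry Stout', 'Porter', 'Wheat Ale']
    X = np.array([model.docvecs[style] for style in styles])   # model from the previous slides

    coords = PCA(n_components=2).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), style in zip(coords, styles):
        plt.annotate(style, (x, y))
    plt.show()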

  99. Dark beers


  100. Strong beers


  101. Sour beers


  102. Lagers


  103. Wheat beers


  104. Case Study 2: Beer!
    Useful for:
    Understanding the language of beer enthusiasts
    Planning your next pint
    Classification


  105. FINAL REMARKS


  106. But we’ve been doing this for X years


  107. But we’ve been doing this for X years
    • Approaches based on co-occurrences are not new
    • Think SVD / LSA / LDA
    • … but they are usually outperformed by word2vec
    • … and don’t scale as well as word2vec

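    As a reference point for the co-occurrence approaches above, LSA boils down to a truncated SVD of the term-document count matrix; a scikit-learn sketch with toy documents:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["i enjoyed eating some pizza at the restaurant",
            "i enjoyed eating some fiorentina at the restaurant",
            "the gradient descent converged quickly"]

    counts = CountVectorizer().fit_transform(docs)                     # term-document counts
    doc_vectors = TruncatedSVD(n_components=2).fit_transform(counts)   # LSA: truncated SVD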

  108. Efficiency


  109. Efficiency
    • There is no co-occurrence matrix (vectors are learned directly)
    • Softmax has complexity O(V); Hierarchical Softmax only O(log(V))

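    In gensim, hierarchical softmax is a constructor option; a minimal sketch, where `corpus` stands for one of the corpus objects from the case studies:

    from gensim.models import Word2Vec

    # hs=1 enables hierarchical softmax; negative=0 switches off negative sampling,
    # the other common way to avoid the full O(V) softmax
    model = Word2Vec(corpus, hs=1, negative=0)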

  110. Garbage in, garbage out


  111. Garbage in, garbage out
    • Pre-trained vectors are useful
    • … until they’re not
    • The business domain is important
    • The pre-processing steps are important
    • > 100K words? Maybe train your own model
    • > 1M words? Yep, train your own model

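    Loading a set of pre-trained vectors with gensim looks like this; the file name is a placeholder for whichever pre-trained set matches your domain:

    from gensim.models import KeyedVectors

    # placeholder path; e.g. the GoogleNews vectors are distributed in this binary format
    vectors = KeyedVectors.load_word2vec_format('pretrained-vectors.bin', binary=True)
    print(vectors.most_similar('chef'))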

  112. Summary


  113. Summary
    • Word Embeddings are magic!
    • Big victory of unsupervised learning
    • Gensim makes your life easy


  114. Credits & Readings


  115. Credits & Readings
    Credits
    • Lev Konstantinovskiy (@gensim_py)
    • Chris E. Moody (@chrisemoody) see videos on lda2vec
    Readings
    • Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
    • “word2vec parameter learning explained” by Xin Rong
    More readings
    • “GloVe: global vectors for word representation” by Pennington et al.
    • “Dependency based word embeddings” and “Neural word embeddings
    as implicit matrix factorization” by O. Levy and Y. Goldberg


  116. THANK YOU
    @MarcoBonzanini
    GitHub.com/bonzanini
    marcobonzanini.com
