Word and Document Embeddings


Leland McInnes

April 12, 2021


  1. Low Dimensional Embeddings of Words and Documents, and how they might apply to Single-Cell Data
  2. Motivation

  3. NLP has seen huge advances recently

  6. How far can we get with simple methods?

  7. Embeddings

  8. The new NLP methods are based around various “embeddings”. But

    what are embeddings?
  9. A mathematical representation (often vectors) + A way to measure

    distance between representations
  10. A lot of focus falls on the first part, but distances are often critical (as we will see)
  11. Document Embeddings

  12. How do we represent a document mathematically?

  13. The “bag-of-words” approach: Discard order and count how often each

    word occurs
  14. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do

    eiusmod tempor incididunt ut labore et dolore magna aliqua. Auctor elit sed vulputate mi sit amet mauris, quis vel eros donec ac odio tempor orci
  15. adipiscing aliqua amet consectetur do dolor dolore eiusmod elit et incididunt ipsum labore lorem magna sed sit tempor ut

    ac amet auctor donec elit eros mauris mi odio orci quis sed sit tempor vel vulputate
  16. Vocabulary (columns, in order): ac adipiscing aliqua amet auctor consectetur do dolor dolore donec eiusmod elit eros et incididunt ipsum labore lorem magna mauris mi odio orci quis sed sit tempor ut vel vulputate

    Sentence 1: 0 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0
    Sentence 2: 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1
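
The bag-of-words step can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the tokenizer (lowercase, letters only) and the names `bag_of_words`, `vocabulary`, and `count_matrix` are assumptions made for the example. It uses the two lorem ipsum sentences from the slides as the two documents.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep runs of letters as tokens, and count them.

    Word order is discarded; only the counts survive.
    (The tokenization rule is an assumption for this sketch.)
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))

docs = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua.",
    "Auctor elit sed vulputate mi sit amet mauris, quis vel eros donec ac odio "
    "tempor orci",
]

bags = [bag_of_words(d) for d in docs]

# One column per distinct word, in alphabetical order.
vocabulary = sorted(set().union(*bags))

# One row per document; a Counter returns 0 for words it never saw.
count_matrix = [[bag[word] for word in vocabulary] for bag in bags]

for row in count_matrix:
    print(" ".join(str(c) for c in row))
```

With a real corpus most entries are zero, so in practice the rows would be stored in a sparse format rather than as dense lists.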
  17. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do

    eiusmod tempor incididunt ut labore et dolore magna aliqua. Sociis natoque penatibus et magnis dis parturient montes nascetur. Quis viverra nibh cras pulvinar mattis. Augue eget arcu dictum varius duis. Urna neque viverra justo nec ultrices dui sapien eget. Fringilla ut morbi tincidunt augue interdum velit. Mauris in aliquam sem fringilla ut morbi. In hac habitasse platea dictumst vestibulum rhoncus. Lobortis scelerisque fermentum dui faucibus in ornare quam. Eget nulla facilisi etiam dignissim diam quis. Venenatis lectus magna fringilla urna porttitor rhoncus dolor. Non pulvinar neque laoreet suspendisse. At varius vel pharetra vel turpis nunc eget. Ullamcorper morbi tincidunt ornare massa eget egestas purus viverra accumsan. Eu tincidunt tortor aliquam nulla facilisi cras fermentum odio. Orci nulla pellentesque dignissim enim sit amet venenatis. Blandit cursus risus at ultrices. Amet est placerat in egestas erat imperdiet sed. Consequat semper viverra nam libero justo laoreet sit. Mauris pharetra et ultrices neque ornare aenean. Non consectetur a erat nam. Dolor sit amet consectetur adipiscing elit ut aliquam purus. Aliquet lectus proin nibh nisl. Dis parturient montes nascetur ridiculus. Cras fermentum odio eu feugiat pretium nibh ipsum. Dui id ornare arcu odio ut. Risus nec feugiat in fermentum. Elementum nibh tellus molestie nunc non blandit massa enim. Porttitor eget dolor morbi non arcu risus quis varius. Fermentum dui faucibus in ornare. Suspendisse faucibus interdum posuere lorem ipsum dolor sit. Sit amet aliquam id diam maecenas ultricies mi eget mauris. Proin nibh nisl condimentum id venenatis a condimentum vitae. Sit amet nisl suscipit adipiscing bibendum est ultricies. Duis convallis convallis tellus id interdum velit laoreet id donec. Congue nisi vitae suscipit tellus mauris a diam maecenas. Sed euismod nisi porta lorem. Nisl rhoncus mattis rhoncus urna neque viverra justo. Eget magna fermentum iaculis eu non diam phasellus vestibulum. 
Feugiat nibh sed pulvinar proin gravida hendrerit lectus. Ac turpis egestas maecenas pharetra convallis. Amet commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Sed viverra tellus in hac habitasse platea. Pharetra massa massa ultricies mi quis hendrerit. Amet est placerat in egestas erat imperdiet sed euismod nisi. Id velit ut tortor pretium viverra suspendisse potenti nullam. Sit amet nisl purus in mollis nunc sed id semper. Porttitor massa id neque aliquam. Felis eget velit aliquet sagittis id. Consectetur a erat nam at lectus urna. Vel orci porta non pulvinar neque laoreet suspendisse interdum. Sit amet nisl suscipit adipiscing bibendum est ultricies integer quis. Dapibus ultrices in iaculis nunc sed augue. Molestie at elementum eu facilisis sed odio morbi. Odio facilisis mauris sit amet massa vitae tortor. Imperdiet nulla malesuada pellentesque elit eget. Ornare quam viverra orci sagittis eu. Ornare massa eget egestas purus viverra. Porta non pulvinar neque laoreet suspendisse interdum. Netus et malesuada fames ac turpis egestas sed. Congue nisi vitae suscipit tellus mauris. Vivamus arcu felis bibendum ut tristique et egestas. Suspendisse faucibus interdum posuere lorem ipsum dolor sit amet. Congue quisque egestas diam in. Vestibulum morbi blandit cursus risus at ultrices. Venenatis urna cursus eget nunc scelerisque viverra mauris. Sit amet cursus sit amet dictum sit amet justo. Mi eget mauris pharetra et ultrices neque. Massa tempor nec feugiat nisl pretium fusce id. Tristique sollicitudin nibh sit amet commodo nulla facilisi nullam.
  21. Just large sparse matrices of counts. This looks like a lot of other types of data
  22. How should we measure distance? Documents are distributions of words,

    so use a distance between distributions.
  23. Hellinger distance, approximated by cosine distance
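
One way to see the link between the two distances: for normalized count vectors, the squared Hellinger distance equals the cosine distance between the element-wise square roots of the distributions. Whether this is the exact approximation the slide has in mind is an assumption; the sketch below only checks the identity on toy counts. The function names and the example counts are made up for illustration.

```python
import math

def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two distributions over the same vocabulary."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy word counts for two documents over a four-word vocabulary (made up).
doc_a = [3, 0, 1, 2]
doc_b = [2, 1, 0, 3]

p, q = normalize(doc_a), normalize(doc_b)
h = hellinger(p, q)

# sqrt(p) and sqrt(q) are unit vectors, so their cosine distance is
# 1 - sum(sqrt(p_i * q_i)), which is exactly the squared Hellinger distance.
sqrt_p = [math.sqrt(x) for x in p]
sqrt_q = [math.sqrt(x) for x in q]

print("Hellinger squared:      ", h ** 2)
print("cosine on sqrt vectors: ", cosine_distance(sqrt_p, sqrt_q))
print("cosine on raw counts:   ", cosine_distance(doc_a, doc_b))
```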

  25. Every domain has its domain-specific transformations. NLP uses “TF-IDF”
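
TF-IDF exists in several variants, and the slides do not say which one is meant. The sketch below assumes term frequency as a proportion of the document and an inverse document frequency of log(N / df). The function name `tf_idf` and the toy count matrix are illustrative only.

```python
import math

def tf_idf(count_matrix):
    """Re-weight a documents-by-words count matrix.

    Assumed variant: tf = count / document length, idf = log(N / df),
    where df is the number of documents containing the word.
    """
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    doc_freq = [
        sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)
    ]
    idf = [math.log(n_docs / df) if df > 0 else 0.0 for df in doc_freq]

    weighted = []
    for row in count_matrix:
        total = sum(row)
        weighted.append(
            [(c / total) * idf[j] if total else 0.0 for j, c in enumerate(row)]
        )
    return weighted

# Toy counts: three documents, three words (made up for the example).
counts = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 0],
]
weights = tf_idf(counts)

# The first word occurs in every document, so its weight drops to zero;
# words that are specific to one document keep a positive weight.
for row in weights:
    print([round(w, 3) for w in row])
```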

  27. Distance matters: Hellinger, Multinomial, Independent Columns

  28. But are my columns really independent?

  29. This red rose smells sweet / That scarlet flower has a lovely scent
  30. Word Embeddings

  31. “You shall know a word by the company it keeps”

    — John Firth
  32. “Can you use it in a sentence?”

  33. Context matters. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod (figure labels: target word, context before, context after, window radius three)
  34. Represent a word as the document of all its contexts. Represent the resulting document as a bag of words
  35. A word is represented by counts of other words that

    occur “nearby”
  36. Basic operations on the resulting matrix, followed by an SVD

    are enough to produce good word vectors
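
The window-count-then-SVD recipe can be sketched as follows. The slides do not spell out which "basic operations" are applied to the matrix, so the row normalization and square root used here are assumptions, as are the function name `cooccurrence_matrix`, the toy token list, and the choice of two output dimensions. NumPy is assumed to be available.

```python
import numpy as np

def cooccurrence_matrix(tokens, radius=3):
    """Count, for each target word, the words within `radius` positions of it."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, target in enumerate(tokens):
        lo = max(0, pos - radius)
        hi = min(len(tokens), pos + radius + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                counts[index[target], index[tokens[ctx]]] += 1
    return vocab, counts

tokens = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod"
).split()
vocab, counts = cooccurrence_matrix(tokens, radius=3)

# Assumed "basic operations": turn each row into a distribution over
# context words, then take square roots (a Hellinger-style transform).
row_sums = counts.sum(axis=1, keepdims=True)
transformed = np.sqrt(counts / row_sums)

# Truncated SVD: keep the leading `dim` singular directions as word vectors.
u, s, vt = np.linalg.svd(transformed, full_matrices=False)
dim = 2
word_vectors = u[:, :dim] * s[:dim]

for word, vec in zip(vocab, word_vectors):
    print(f"{word:12s} {vec}")
```

A real vocabulary would need a sparse matrix and a sparse or randomized SVD; the dense version is only workable at toy scale.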
  38. Optimal Transport

  39. We want to measure distances between distributions

  40. Hellinger? Kullback-Leibler Divergence? Total variation?

  41. Our distributions may not have common support!

  42. Wasserstein-Kantorovich Distance

  45. “Earth-mover distance”

  47. Work = Force × Distance; Force ∝ Mass. Find the least work to fill the hole
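
In one dimension the least work has a closed form: with unit bin spacing it is the sum of absolute differences between the two cumulative histograms. The sketch below uses that fact on a toy "pile" and two "holes"; the function name and the histograms are made up for illustration, and both histograms are assumed to have the same total mass.

```python
from itertools import accumulate

def earth_mover_1d(p, q):
    """Least work to reshape histogram p into q on evenly spaced bins.

    Assumes unit spacing between bins and equal total mass in p and q.
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

pile = [1.0, 0.0, 0.0, 0.0]       # all the earth sits in bin 0
near_hole = [0.0, 1.0, 0.0, 0.0]  # hole one bin away
far_hole = [0.0, 0.0, 0.0, 1.0]   # hole three bins away

print(earth_mover_1d(pile, near_hole))
print(earth_mover_1d(pile, far_hole))
```

Neither hole shares any support with the pile, so a bin-by-bin measure such as Hellinger distance rates both as maximally distant, while the transport cost still distinguishes near from far.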
  48. Find the optimal “transport plan”

  54. Computationally expensive, but there exist ways to speed it up: Sinkhorn iterations, linear optimal transport
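
Sinkhorn iterations solve an entropy-regularized version of the transport problem by alternately rescaling the rows and columns of a kernel matrix. The sketch below is a textbook-style version, not the speaker's implementation: the regularization strength `reg`, the iteration count, and the toy histograms are arbitrary assumptions, and it omits the log-domain stabilization that small `reg` values would need. NumPy is assumed to be available.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.5, n_iters=1000):
    """Entropy-regularized transport plan between histograms a and b.

    a, b : 1-D arrays of positive masses with equal totals
    cost : matrix of pairwise ground distances
    reg  : regularization strength (assumed value; larger is blurrier)
    """
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale so column sums match b
        u = a / (kernel @ v)    # rescale so row sums match a
    plan = u[:, None] * kernel * v[None, :]
    return plan, float(np.sum(plan * cost))

# Three bins on a line; the ground distance is the gap between bin positions.
positions = np.array([0.0, 1.0, 2.0])
cost = np.abs(positions[:, None] - positions[None, :])
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.3, 0.5])

plan, total_cost = sinkhorn(a, b, cost)
print(plan.round(3))
print("regularized transport cost:", total_cost)
```

The regularized cost is an upper bound on the exact earth-mover cost for a converged plan; lowering `reg` tightens the approximation at the price of slower, less stable iterations.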
  55. Joint Word-Document Embeddings

  56. Model a document as a distribution of word vectors; use Wasserstein distance between documents
  57. This red rose smells sweet / That scarlet flower has a lovely scent
  59. Excellent results, slow performance

  60. Linear optimal transport: Transform to a Euclidean space that approximates Wasserstein distance
  62. Linear optimal transport can produce vectors for other machine learning

    tasks as well
  64. Conclusions

  65. NLP has made huge progress recently. Many of the techniques

    can be generalized for use in other domains
  66. Similarity or distance in the column space is critical

  67. Locality or co-occurrence is all you need to build column

    space distances
  68. Simple tricks like column space embeddings and linear optimal transport

    can compete with Google’s massive deep neural network techniques
  69. Optional extras

  70. Linear Optimal Transport

  71. Sinkhorn Iterations