Slide 1

Slide 1 text

Low Dimensional Embeddings of Words and Documents
And how they might apply to Single-Cell Data

Slide 2

Slide 2 text

Motivation

Slide 3

Slide 3 text

NLP has seen huge advances recently

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

How far can we get with simple methods?

Slide 7

Slide 7 text

Embeddings

Slide 8

Slide 8 text

The new NLP methods are based around various “embeddings”. But what are embeddings?

Slide 9

Slide 9 text

A mathematical representation (often vectors) + a way to measure distance between representations

Slide 10

Slide 10 text

A lot of focus falls on the first part. But distances are often critical (as we will see).

Slide 11

Slide 11 text

Document Embeddings

Slide 12

Slide 12 text

How do we represent a document mathematically?

Slide 13

Slide 13 text

The “bag-of-words” approach: Discard order and count how often each word occurs
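
A minimal sketch of this counting step, assuming scikit-learn is available and using the two placeholder sentences from the following slides:

```python
# Bag-of-words sketch: discard word order, count how often each word occurs.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua.",
    "Auctor elit sed vulputate mi sit amet mauris, quis vel eros donec ac odio tempor orci",
]

vectorizer = CountVectorizer()           # tokenize, lowercase, build the vocabulary
counts = vectorizer.fit_transform(docs)  # sparse (n_docs x n_words) count matrix

print(vectorizer.get_feature_names_out())  # the sorted vocabulary
print(counts.toarray())                    # one row of word counts per document
```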

Slide 14

Slide 14 text

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Auctor elit sed vulputate mi sit amet mauris, quis vel eros donec ac odio tempor orci

Slide 15

Slide 15 text

ac amet auctor donec elit eros mauris mi odio orci quis sed sit tempor vel vulputate adipiscing aliqua amet consectetur do dolor dolore eiusmod elit et incididunt ipsum labore lorem magna sed sit tempor ut

Slide 16

Slide 16 text

The two sentences as rows of a 0/1 count matrix over the shared, sorted vocabulary (ac, adipiscing, aliqua, amet, auctor, consectetur, do, dolor, dolore, donec, eiusmod, elit, eros, et, incididunt, ipsum, labore, lorem, magna, mauris, mi, odio, orci, quis, sed, sit, tempor, ut, vel, vulputate): each entry records how often that word occurs in that sentence.

Slide 17

Slide 17 text

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sociis natoque penatibus et magnis dis parturient montes nascetur. Quis viverra nibh cras pulvinar mattis. Augue eget arcu dictum varius duis. Urna neque viverra justo nec ultrices dui sapien eget. Fringilla ut morbi tincidunt augue interdum velit. Mauris in aliquam sem fringilla ut morbi. In hac habitasse platea dictumst vestibulum rhoncus. Lobortis scelerisque fermentum dui faucibus in ornare quam. Eget nulla facilisi etiam dignissim diam quis. Venenatis lectus magna fringilla urna porttitor rhoncus dolor. Non pulvinar neque laoreet suspendisse. At varius vel pharetra vel turpis nunc eget. Ullamcorper morbi tincidunt ornare massa eget egestas purus viverra accumsan. Eu tincidunt tortor aliquam nulla facilisi cras fermentum odio. Orci nulla pellentesque dignissim enim sit amet venenatis. Blandit cursus risus at ultrices. Amet est placerat in egestas erat imperdiet sed. Consequat semper viverra nam libero justo laoreet sit. Mauris pharetra et ultrices neque ornare aenean. Non consectetur a erat nam. Dolor sit amet consectetur adipiscing elit ut aliquam purus. Aliquet lectus proin nibh nisl. Dis parturient montes nascetur ridiculus. Cras fermentum odio eu feugiat pretium nibh ipsum. Dui id ornare arcu odio ut. Risus nec feugiat in fermentum. Elementum nibh tellus molestie nunc non blandit massa enim. Porttitor eget dolor morbi non arcu risus quis varius. Fermentum dui faucibus in ornare. Suspendisse faucibus interdum posuere lorem ipsum dolor sit. Sit amet aliquam id diam maecenas ultricies mi eget mauris. Proin nibh nisl condimentum id venenatis a condimentum vitae. Sit amet nisl suscipit adipiscing bibendum est ultricies. Duis convallis convallis tellus id interdum velit laoreet id donec. Congue nisi vitae suscipit tellus mauris a diam maecenas. Sed euismod nisi porta lorem. Nisl rhoncus mattis rhoncus urna neque viverra justo. Eget magna fermentum iaculis eu non diam phasellus vestibulum. Feugiat nibh sed pulvinar proin gravida hendrerit lectus. Ac turpis egestas maecenas pharetra convallis. Amet commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Sed viverra tellus in hac habitasse platea. Pharetra massa massa ultricies mi quis hendrerit. Amet est placerat in egestas erat imperdiet sed euismod nisi. Id velit ut tortor pretium viverra suspendisse potenti nullam. Sit amet nisl purus in mollis nunc sed id semper. Porttitor massa id neque aliquam. Felis eget velit aliquet sagittis id. Consectetur a erat nam at lectus urna. Vel orci porta non pulvinar neque laoreet suspendisse interdum. Sit amet nisl suscipit adipiscing bibendum est ultricies integer quis. Dapibus ultrices in iaculis nunc sed augue. Molestie at elementum eu facilisis sed odio morbi. Odio facilisis mauris sit amet massa vitae tortor. Imperdiet nulla malesuada pellentesque elit eget. Ornare quam viverra orci sagittis eu. Ornare massa eget egestas purus viverra. Porta non pulvinar neque laoreet suspendisse interdum. Netus et malesuada fames ac turpis egestas sed. Congue nisi vitae suscipit tellus mauris. Vivamus arcu felis bibendum ut tristique et egestas. Suspendisse faucibus interdum posuere lorem ipsum dolor sit amet. Congue quisque egestas diam in. Vestibulum morbi blandit cursus risus at ultrices. Venenatis urna cursus eget nunc scelerisque viverra mauris. Sit amet cursus sit amet dictum sit amet justo. Mi eget mauris pharetra et ultrices neque. Massa tempor nec feugiat nisl pretium fusce id. 
Tristique sollicitudin nibh sit amet commodo nulla facilisi nullam.

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Just large, sparse matrices of counts. This looks like a lot of other types of data.

Slide 22

Slide 22 text

How should we measure distance? Documents are distributions of words, so use a distance between distributions.

Slide 23

Slide 23 text

Hellinger distance, which can be approximated by cosine distance (on square-root-transformed counts)
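
A small sketch of the relationship, assuming documents are normalized to word distributions: on square-root-transformed distributions, Hellinger distance and cosine similarity carry the same information.

```python
# Hellinger distance between word distributions, and its link to cosine similarity.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

counts_a = np.array([2.0, 1.0, 0.0, 3.0])  # toy word counts for document A
counts_b = np.array([1.0, 0.0, 2.0, 1.0])  # toy word counts for document B
p, q = counts_a / counts_a.sum(), counts_b / counts_b.sum()

# sqrt(p) and sqrt(q) are unit vectors, so their dot product is a cosine
# similarity, and H(p, q)^2 = 1 - cos(sqrt(p), sqrt(q)).
cos_sim = np.dot(np.sqrt(p), np.sqrt(q))
print(hellinger(p, q), np.sqrt(1.0 - cos_sim))  # the two values agree
```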

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Every domain has its domain-specific transformations. NLP uses “TF-IDF”.
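
A sketch of the TF-IDF reweighting, assuming scikit-learn and a few toy documents:

```python
# TF-IDF sketch: downweight words that appear in many documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this red rose smells sweet",
    "that scarlet flower has a lovely scent",
    "the rose is a flower",
]

tfidf = TfidfVectorizer()      # term frequency x inverse document frequency
X = tfidf.fit_transform(docs)  # sparse (n_docs x n_words) TF-IDF matrix

print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```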

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Distance matters (comparison: Hellinger, Multinomial, Independent Columns)

Slide 28

Slide 28 text

But are my columns really independent?

Slide 29

Slide 29 text

“This red rose smells sweet”
“That scarlet flower has a lovely scent”

Slide 30

Slide 30 text

Word Embeddings

Slide 31

Slide 31 text

“You shall know a word by the company it keeps” — John Firth

Slide 32

Slide 32 text

“Can you use it in a sentence?”

Slide 33

Slide 33 text

Context matters: in “Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod”, a window of radius three around a target word defines the context before and the context after.

Slide 34

Slide 34 text

Represent a word as the document of all its contexts. Represent the resulting document as a bag of words.

Slide 35

Slide 35 text

A word is represented by counts of other words that occur “nearby”

Slide 36

Slide 36 text

Basic operations on the resulting matrix, followed by an SVD, are enough to produce good word vectors
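
One concrete reading of this recipe, sketched under assumptions (window-based co-occurrence counts, a PPMI reweighting as the “basic operations”, then a truncated SVD):

```python
# Word vectors from co-occurrence counts + PPMI + truncated SVD (toy corpus).
import numpy as np

corpus = [["this", "red", "rose", "smells", "sweet"],
          ["that", "scarlet", "flower", "has", "a", "lovely", "scent"]]
window = 3  # context radius

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of words co-occurs within the window
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information (PPMI) reweighting
total = C.sum()
row = C.sum(axis=1, keepdims=True)
col = C.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(C * total / (row * col))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD: keep the top-k dimensions as dense word vectors
U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]
print(vocab[0], word_vectors[0])
```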

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Optimal Transport

Slide 39

Slide 39 text

We want to measure distances between distributions

Slide 40

Slide 40 text

Hellinger? Kullback-Leibler divergence? Total variation?

Slide 41

Slide 41 text

Our distributions may not have common support!

Slide 42

Slide 42 text

Wasserstein-Kantorovich Distance

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

“Earth-mover distance”

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

Work = Force × Distance
Force ∝ Mass
Find the least work to fill the hole

Slide 48

Slide 48 text

Find the optimal “transport plan”
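
A toy sketch of what “finding the transport plan” means, written as the underlying linear program with SciPy (the piles, holes, and costs here are made up):

```python
# Earth-mover sketch: move mass a (piles) to mass b (holes) with least total work.
import numpy as np
from scipy.optimize import linprog

a = np.array([0.5, 0.5])     # mass available at source positions
b = np.array([0.25, 0.75])   # mass required at target positions
src = np.array([0.0, 1.0])
dst = np.array([0.0, 2.0])
M = np.abs(src[:, None] - dst[None, :])  # work per unit mass = distance moved

n, m = M.shape
A_eq = []
for i in range(n):            # each source must send out exactly a[i]
    row = np.zeros((n, m)); row[i, :] = 1; A_eq.append(row.ravel())
for j in range(m):            # each target must receive exactly b[j]
    col = np.zeros((n, m)); col[:, j] = 1; A_eq.append(col.ravel())

res = linprog(M.ravel(), A_eq=np.array(A_eq), b_eq=np.concatenate([a, b]),
              bounds=(0, None))
plan = res.x.reshape(n, m)
print("optimal transport plan:\n", plan)
print("earth-mover distance (least work):", res.fun)
```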

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Computationally expensive, but there exist ways to speed it up: Sinkhorn iterations and linear optimal transport.
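
A minimal sketch of the Sinkhorn idea, assuming an entropic regularization strength eps and the same toy marginals and cost matrix as above: alternately rescale the rows and columns of exp(-M/eps) until they match the marginals.

```python
# Sinkhorn iterations: a fast, approximate solver for regularized optimal transport.
import numpy as np

def sinkhorn(a, b, M, eps=0.1, n_iters=200):
    K = np.exp(-M / eps)                  # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # rescale columns to match marginal b
        u = a / (K @ v)                   # rescale rows to match marginal a
    plan = u[:, None] * K * v[None, :]    # approximate transport plan
    return np.sum(plan * M)               # approximate transport cost

a = np.array([0.5, 0.5])
b = np.array([0.25, 0.75])
M = np.abs(np.array([0.0, 1.0])[:, None] - np.array([0.0, 2.0])[None, :])
print(sinkhorn(a, b, M))  # close to the exact earth-mover distance
```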

Slide 55

Slide 55 text

Joint Word-Document Embeddings

Slide 56

Slide 56 text

Model a document as a distribution of word vectors. Use the Wasserstein distance between documents.
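
A sketch of that document distance, assuming the POT library (`ot`) and a dict `vectors` of pretrained word embeddings (both assumptions, not part of the slides):

```python
# Word-mover-style distance: documents as uniform distributions over word vectors.
import numpy as np
import ot  # Python Optimal Transport
from scipy.spatial.distance import cdist

def document_distance(doc_a, doc_b, vectors):
    """Wasserstein distance between two documents seen as clouds of word vectors."""
    words_a, words_b = doc_a.split(), doc_b.split()
    X_a = np.array([vectors[w] for w in words_a])
    X_b = np.array([vectors[w] for w in words_b])
    a = np.full(len(words_a), 1.0 / len(words_a))  # uniform mass on doc A's words
    b = np.full(len(words_b), 1.0 / len(words_b))  # uniform mass on doc B's words
    M = cdist(X_a, X_b)                            # cost = distance between word vectors
    return ot.emd2(a, b, M)                        # exact optimal transport cost

# e.g. document_distance("this red rose smells sweet",
#                        "that scarlet flower has a lovely scent", vectors)
```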

Slide 57

Slide 57 text

“This red rose smells sweet”
“That scarlet flower has a lovely scent”

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Excellent results, slow performance

Slide 60

Slide 60 text

Linear optimal transport: transform to a Euclidean space whose distances approximate the Wasserstein distance
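
A sketch of one common linear optimal transport construction (an assumption about the exact variant used here): fix a reference point cloud, compute the transport plan from it to each document, and use the barycentric projection as a fixed-length Euclidean vector. The POT library is again assumed.

```python
# Linear optimal transport embedding via barycentric projection (sketch).
import numpy as np
import ot
from scipy.spatial.distance import cdist

def lot_embedding(X_ref, X_doc):
    """Embed the point cloud X_doc as a vector, relative to a reference cloud X_ref."""
    n, m = len(X_ref), len(X_doc)
    a = np.full(n, 1.0 / n)                        # uniform weights on the reference
    b = np.full(m, 1.0 / m)                        # uniform weights on the document
    plan = ot.emd(a, b, cdist(X_ref, X_doc) ** 2)  # optimal transport plan
    barycentric = (plan @ X_doc) / a[:, None]      # where each reference point is sent
    return barycentric.ravel()                     # fixed-length Euclidean vector

# Euclidean distances between such embeddings approximate Wasserstein distances:
# np.linalg.norm(lot_embedding(X_ref, X_a) - lot_embedding(X_ref, X_b))
```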

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

Linear optimal transport can produce vectors for other machine learning tasks as well

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Conclusions

Slide 65

Slide 65 text

NLP has made huge progress recently. Many of the techniques can be generalized for use in other domains

Slide 66

Slide 66 text

Similarity or distance in the column space is critical

Slide 67

Slide 67 text

Locality or co-occurrence is all you need to build column space distances

Slide 68

Slide 68 text

Simple tricks like column space embeddings and linear optimal transport can compete with Google’s massive deep neural network techniques

Slide 69

Slide 69 text

Optional extras

Slide 70

Slide 70 text

Linear Optimal Transport

Slide 71

Slide 71 text

Sinkhorn Iterations