Slide 1

Deep Contextualized Word Representations: ELMo – Embeddings from Language Models

Slide 2

NLP Timeline

Slide 3

Talk about:
• What to expect from a language model?
• Previous language models
• Deep contextualized word embeddings
• Summary

Slide 4

What to expect from a language model? It should capture:
• Complex characteristics of word use (syntax and semantics)
• How word use varies across linguistic contexts, i.e., model polysemy

Slide 5

Previous Language Models
• LSTMs were used heavily for language modeling

Slide 6

Deep Contextualized Word Embeddings
• In place of a single forward LSTM, what if we use a bidirectional LSTM?

Slide 7

Deep Contextualized Word Embeddings (Cont)
• Forward layer: for a sequence of tokens, model the probability of the sentence by computing the probability of token tk given its history (t1, t2, t3, ..., tk-1)
• Cost function: shown below
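
A sketch of the cost function referenced above, written as in the ELMo formulation (Θ_x and Θ_s denote the token-embedding and softmax parameters, which are not named on the slide):

```latex
% Forward LM: factorize the sequence probability left to right
p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})

% Cost: maximize the forward log likelihood
\sum_{k=1}^{N} \log p\big(t_k \mid t_1, \dots, t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
```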

Slide 8

Deep Contextualized Word Embeddings (Cont)
• Backward layer: for a sequence of tokens, model the probability of the sentence by computing the probability of token tk given the future tokens (tk+1, tk+2, tk+3, ..., tN)
• Cost function: shown below
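
A sketch of the corresponding backward cost function, under the same notation:

```latex
% Backward LM: factorize the sequence probability right to left
p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)

% Cost: maximize the backward log likelihood
\sum_{k=1}^{N} \log p\big(t_k \mid t_{k+1}, \dots, t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)
```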

Slide 9

Deep Contextualized Word Embeddings (Cont)
Training
• The forward and backward layers are trained together
• Optimizer: gradient descent
• Loss function: cross entropy
• The log likelihoods of the forward and backward layers are jointly maximized (written out below)
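
Written out, the jointly maximized log likelihood combines the two directions; in the ELMo formulation the token-embedding parameters Θ_x and softmax parameters Θ_s are shared between them:

```latex
% Joint objective: maximize the forward and backward log likelihoods together,
% sharing the token-embedding (Theta_x) and softmax (Theta_s) parameters
\sum_{k=1}^{N} \Big(
    \log p\big(t_k \mid t_1, \dots, t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
  + \log p\big(t_k \mid t_{k+1}, \dots, t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)
\Big)
```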

Slide 10

Deep Contextualized Word Embeddings (Cont)
Add-ons
• Residual connection between the LSTM layers (sketched below)
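
A minimal PyTorch-style sketch of a residual connection between two stacked LSTM layers (the layer size and class name here are illustrative, not the exact biLM configuration):

```python
import torch.nn as nn

class ResidualStack(nn.Module):
    """Two stacked LSTM layers with a residual (skip) connection between them."""
    def __init__(self, dim=512):
        super().__init__()
        self.layer1 = nn.LSTM(dim, dim, batch_first=True)
        self.layer2 = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x):           # x: (batch, seq_len, dim)
        h1, _ = self.layer1(x)      # first-layer representations
        h2, _ = self.layer2(h1)     # second-layer representations
        return h2 + h1              # residual connection between the layers
```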

Slide 11

Deep Contextualized Word Embeddings (Cont)
Input to the model
• A representation of each word token – some form of embedding is needed
• A context-independent embedding generated by an n-gram character CNN with 2048 channels (sketched below)
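
A simplified sketch of such a character-level input encoder (the real ELMo char CNN uses multiple filter widths plus highway layers; the single kernel width, character vocabulary size, and projection here are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Context-independent token embedding built from a CNN over the token's characters."""
    def __init__(self, n_chars=262, char_dim=16, n_filters=2048, kernel_size=3, out_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size)  # n-gram filters over characters
        self.proj = nn.Linear(n_filters, out_dim)                # project down to the model dimension

    def forward(self, char_ids):                      # char_ids: (batch, max_token_len)
        x = self.char_emb(char_ids).transpose(1, 2)   # (batch, char_dim, max_token_len)
        x = torch.relu(self.conv(x))                  # (batch, n_filters, conv_len)
        x, _ = x.max(dim=2)                           # max-pool over character positions
        return self.proj(x)                           # (batch, out_dim)
```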

Slide 12

Deep Contextualized Word Embeddings (Cont)
Salient features of the input embeddings
• Pick up morphological features that word-level embeddings could miss
• Provide a valid representation for out-of-vocabulary words

Slide 13

Deep Contextualized Word Embeddings (Cont)
What if?
• Combine the lower-level representations in some weighted fashion
• This results in deep, context-rich embeddings
• What if we could combine the representations with respect to the task? How about creating task-specific deep contextual embeddings?

Slide 14

Deep Contextualized Word Embeddings (Cont)
Let's think:
• Lower-level neurons capture local properties such as morphological structure and syntax-related aspects. These can be useful for dependency parsing, POS tagging, etc.
• Higher-level neurons capture context-dependent aspects. These can be used for tasks such as word sense disambiguation, etc.
What if we expose both and combine them into a deep contextual representation of words?

Slide 15

Deep Contextualized Word Embeddings (Cont)

Slide 16

Deep Contextualized Word Embeddings (Cont)
Function f can be represented as:
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}
• Gamma (γ) – a scalar that scales the entire vector
• s_j^task – softmax-normalized weights over the biLM layers
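
A sketch of this task-specific weighted combination (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted combination of the biLM layer representations."""
    def __init__(self, num_layers=3):                 # e.g. char-CNN layer + two biLSTM layers
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j before the softmax
        self.gamma = nn.Parameter(torch.ones(1))              # task-specific scale

    def forward(self, layer_outputs):
        # layer_outputs: list of tensors, one per biLM layer, each (batch, seq_len, dim)
        weights = torch.softmax(self.scalars, dim=0)          # softmax-normalized s_j^task
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed                             # gamma^task scales the whole vector
```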

Slide 17

Summary
• 2-layer biLSTM with 4096 units and a 512-dimension projection
• Residual connection between the first and second layers
• Task-independent word representations from a 2048 n-gram char CNN model
• Weights for the forward and backward layers are tied together; the log likelihood is jointly maximized
• Weighted representation from all layers, followed by a scalar
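
Putting the summary together, a shape-level PyTorch sketch (proj_size needs PyTorch 1.8+; note that stacking bidirectional=True layers lets each direction see the other direction's lower-layer outputs, whereas the actual biLM trains separate forward and backward stacks, so this only illustrates the dimensions and the residual connection):

```python
import torch.nn as nn

class BiLMSketch(nn.Module):
    """2-layer biLSTM, 4096 hidden units, 512-dim projections, residual between layers."""
    def __init__(self, input_dim=512, hidden=4096, proj=512):
        super().__init__()
        self.layer1 = nn.LSTM(input_dim, hidden, proj_size=proj,
                              bidirectional=True, batch_first=True)
        self.layer2 = nn.LSTM(2 * proj, hidden, proj_size=proj,
                              bidirectional=True, batch_first=True)

    def forward(self, token_embeddings):       # (batch, seq_len, input_dim) from the char CNN
        h1, _ = self.layer1(token_embeddings)  # (batch, seq_len, 2 * proj)
        h2, _ = self.layer2(h1)
        h2 = h2 + h1                           # residual connection between layer 1 and layer 2
        return [h1, h2]                        # expose every layer for the weighted combination
```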

Slide 18

Thank you