Given a set of tokens, model the probability of a sentence by computing the probability of token tk given the history (t1, t2, t3, ..., tk-1) • Cost function
Given a set of tokens, model the probability of a sentence by computing the probability of token tk given the future context (tk+1, tk+2, tk+3, ..., tN) • Cost function
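As a sketch in standard bidirectional language-model notation (t1, ..., tN is the token sequence; the symbols below are assumptions, not taken from the slides), the two factorizations and their cost functions can be written as:

```latex
% Forward LM: factor the sentence probability over each token's history
p(t_1,\dots,t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})

% Backward LM: condition each token on the future context instead
p(t_1,\dots,t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)

% Cost functions: maximize the log-likelihood of each factorization
% (equivalently, minimize the per-token cross entropy)
\mathcal{L}_{\text{fwd}} = \sum_{k=1}^{N} \log p(t_k \mid t_1, \dots, t_{k-1}), \qquad
\mathcal{L}_{\text{bwd}} = \sum_{k=1}^{N} \log p(t_k \mid t_{k+1}, \dots, t_N)
```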
Forward and backward layers are trained together
• Optimizer – Gradient Descent
• Loss Function – Cross Entropy
• Log-likelihoods of the forward and backward layers are jointly maximized
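A minimal sketch of this joint objective, assuming a PyTorch biLM whose two directions emit vocabulary logits (function and variable names, and the tensor shapes, are illustrative):

```python
# Sketch of the joint biLM objective: both directions share one cross-entropy loss,
# so one gradient-descent step maximizes both log-likelihoods together.
import torch
import torch.nn.functional as F

def bilm_loss(fwd_logits: torch.Tensor, bwd_logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """fwd_logits, bwd_logits: (batch, seq_len, vocab); tokens: (batch, seq_len) ids."""
    vocab = fwd_logits.size(-1)
    # Forward LM: the state at position k (built from t1..tk) predicts token t_{k+1}.
    fwd_loss = F.cross_entropy(fwd_logits[:, :-1].reshape(-1, vocab),
                               tokens[:, 1:].reshape(-1))
    # Backward LM: the state at position k (built from tk..tN) predicts token t_{k-1}.
    bwd_loss = F.cross_entropy(bwd_logits[:, 1:].reshape(-1, vocab),
                               tokens[:, :-1].reshape(-1))
    # Minimizing this sum is equivalent to jointly maximizing the two log-likelihoods.
    return fwd_loss + bwd_loss
```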
lower-level representation in some weighted fashion
• Results in deep, context-rich embeddings
• What if we could combine the representations with respect to the task? How about creating task-specific deep contextual embeddings?
• Low-level neurons capture local properties such as morphological structure and syntax-related aspects; useful for tasks such as dependency parsing, POS tagging, etc.
• High-level neurons capture context-dependent aspects; useful for tasks such as word sense disambiguation, etc.
• What if we expose both and combine them into a deep contextual representation of words?
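One way to act on this idea is ELMo-style scalar mixing: learn task-specific softmax weights over all biLM layers plus a single scale. The sketch below assumes PyTorch; class, parameter names, and shapes are illustrative.

```python
# Sketch of task-specific mixing over biLM layers (layer 0 = word representation,
# higher layers = biLSTM outputs); the downstream task learns the mixture.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))                    # task-specific scale

    def forward(self, layer_outputs):
        """layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer."""
        # Softmax-normalized weights let the task decide how much the syntax-heavy
        # low layers vs. the context-heavy high layers contribute.
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed
```

For syntax-oriented tasks (POS tagging, parsing) the learned weights tend to favour the lower layers; for semantic tasks (word sense disambiguation) the higher ones.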
projection
Ø Residual connection between the first and second layer
Ø Task-independent word representation from a 2048 n-gram character CNN model
Ø Tied weights for the forward and backward layers; the log-likelihood is jointly maximized
Ø Weighted representation from all layers followed by a scalar
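As a rough sketch of the task-independent word representation only (the real model uses a bank of character n-gram filters of several widths totalling 2048, plus highway layers; all sizes below, such as the single kernel width, 16-dim character embeddings, and 512-dim output, are placeholders):

```python
# Sketch of a character n-gram CNN word encoder with a linear projection.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=262, char_dim=16, n_filters=2048, kernel_size=5, out_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size)  # n-gram filters over characters
        self.proj = nn.Linear(n_filters, out_dim)                # projection to the LSTM input size

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        """char_ids: (num_words, max_word_len) integer character ids per word."""
        x = self.char_emb(char_ids).transpose(1, 2)          # (words, char_dim, word_len)
        x = torch.relu(self.conv(x)).max(dim=-1).values      # max-pool over character positions
        return self.proj(x)                                  # task-independent word vectors
```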