
Bert, Transformers and Attention

Oliver Guhr
January 13, 2021


Deep neural networks have revolutionized image processing, but these successes could not simply be transferred to text processing - until 2017, when the Transformer was introduced. This new architecture has brought great advances in many areas of natural language processing and has made new applications, such as generating compelling text, possible. Together we look at how transfer learning and attention work in the Transformer architecture.
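Since the abstract mentions transfer learning, a minimal sketch may help make the idea concrete: instead of training from scratch, one loads a pretrained BERT and fine-tunes it on a downstream task. The model name and two-label task below are illustrative choices using the Hugging Face transformers library, not something prescribed by the talk:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Transfer learning in a nutshell: reuse a pretrained BERT encoder and
# attach a fresh classification head for the downstream task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pretrained weights, downloaded from the hub
    num_labels=2,         # new, randomly initialized head to fine-tune
)

inputs = tokenizer("Transformers changed NLP.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]) -- one score per label
```

Only the small classification head starts from scratch; the encoder keeps everything it learned during pretraining, which is why fine-tuning needs far less data than training a full model.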


Transcript


  2. What does Attention do? The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English-to-French translation (one of eight attention heads). Source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
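The caption above refers to a softmax distribution over the input tokens. As a rough sketch of where such a distribution comes from, here is scaled dot-product attention, the core of each Transformer attention head, in plain NumPy; the shapes and names are illustrative, not taken from the slides:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare each query against every key and use the resulting
    softmax weights to mix the values.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v).
    Returns the mixed values (n, d_v) and the weights (n, m) --
    the weights are what attention visualizations display.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 8))   # 6 tokens, d_k = 8
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
print(weights[2])  # how token 3 attends to all six tokens (sums to 1)
```

Each row of `weights` sums to 1 and records how strongly one token attends to every token in the sequence; the figure visualizes such a row for the word “it”, and a multi-head layer computes eight of these distributions in parallel.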
  4. An Image is Worth 16x16 Words
     • ImageNet and CIFAR with Transformers
       ◦ 88.55% on ImageNet
       ◦ 90.72% on ImageNet-ReaL
       ◦ 94.55% on CIFAR-100
     • Paper by Dosovitskiy et al.
     • Other approaches to vision tasks
       ◦ Taming Transformers for High-Resolution Image Synthesis
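To make the slide title concrete: the Vision Transformer of Dosovitskiy et al. cuts an image into 16x16 patches, flattens each patch, and projects it linearly, so the image becomes a sequence of tokens for a standard Transformer encoder. Below is a minimal NumPy sketch of that patch-embedding step under assumed shapes; in the real model the projection is learned, and a class token plus position embeddings are added:

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=768, seed=0):
    """Cut an image into patch_size x patch_size patches, flatten each
    patch, and project it to a d_model-dimensional token embedding."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = (image
               .reshape(h // patch_size, patch_size,
                        w // patch_size, patch_size, c)
               .transpose(0, 2, 1, 3, 4)            # group pixels per patch
               .reshape(-1, patch_size * patch_size * c))
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((patches.shape[1], d_model)) * 0.02
    return patches @ projection                     # (num_patches, d_model)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

A 224x224 RGB image thus becomes a "sentence" of 196 patch tokens, which is the sense in which an image is worth 16x16 words.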