
Bert, Transformers and Attention

Oliver Guhr
January 13, 2021


Deep neural networks have revolutionized image processing, but these successes could not simply be transferred to text processing - until 2017, when the Transformer was introduced. This new architecture has brought great advances in many areas of natural language processing and has made new applications, such as generating compelling text, possible. Together we look at how transfer learning and attention work in the Transformer architecture.
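Since the abstract mentions transfer learning, a minimal sketch may help make the idea concrete: instead of training from scratch, one loads a pretrained BERT and fine-tunes it on a downstream task. The model name and two-label task below are illustrative choices using the Hugging Face transformers library, not something prescribed by the talk:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Transfer learning in a nutshell: reuse a pretrained BERT encoder and
# attach a fresh classification head for the downstream task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pretrained weights, downloaded from the hub
    num_labels=2,         # new, randomly initialized head to fine-tune
)

inputs = tokenizer("Transformers changed NLP.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]) -- one score per label
```

Only the small classification head starts from scratch; the encoder keeps everything it learned during pretraining, which is why fine-tuning needs far less data than training a full model.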


Transcript


  2. What does Attention do? The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English-to-French translation (one of eight attention heads). Source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
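The caption above refers to a softmax distribution over the input tokens. As a rough sketch of where such a distribution comes from, here is scaled dot-product attention, the core of each Transformer attention head, in plain NumPy; the shapes and names are illustrative, not taken from the slides:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare each query against every key and use the resulting
    softmax weights to mix the values.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v).
    Returns the mixed values (n, d_v) and the weights (n, m) --
    the weights are what attention visualizations display.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 8))   # 6 tokens, d_k = 8
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
print(weights[2])  # how token 3 attends to all six tokens (sums to 1)
```

Each row of `weights` sums to 1 and records how strongly one token attends to every token in the sequence; the figure visualizes such a row for the word “it”, and a multi-head layer computes eight of these distributions in parallel.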
  4. An Image is Worth 16x16 Words
     • ImageNet and CIFAR with Transformers
       ◦ 88.55% on ImageNet
       ◦ 90.72% on ImageNet-ReaL
       ◦ 94.55% on CIFAR-100
     • Paper by Dosovitskiy et al.
     • Other approaches to vision tasks
       ◦ Taming Transformers for High-Resolution Image Synthesis
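To make the slide title concrete: the Vision Transformer of Dosovitskiy et al. cuts an image into 16x16 patches, flattens each patch, and projects it linearly, so the image becomes a sequence of tokens for a standard Transformer encoder. Below is a minimal NumPy sketch of that patch-embedding step under assumed shapes; in the real model the projection is learned, and a class token plus position embeddings are added:

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=768, seed=0):
    """Cut an image into patch_size x patch_size patches, flatten each
    patch, and project it to a d_model-dimensional token embedding."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = (image
               .reshape(h // patch_size, patch_size,
                        w // patch_size, patch_size, c)
               .transpose(0, 2, 1, 3, 4)            # group pixels per patch
               .reshape(-1, patch_size * patch_size * c))
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((patches.shape[1], d_model)) * 0.02
    return patches @ projection                     # (num_patches, d_model)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

A 224x224 RGB image thus becomes a "sentence" of 196 patch tokens, which is the sense in which an image is worth 16x16 words.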