
Bert, Transformers and Attention

Oliver Guhr
January 13, 2021


Deep neural networks have revolutionized image processing, but these successes could not simply be transferred to text processing, until the Transformer architecture was introduced in 2017. This new architecture has brought great advances in many areas of language processing, and some tasks, such as generating compelling text, are now possible. Together we look at how transfer learning and attention work in the Transformer architecture.


Transcript

  1. Bert, Transformers and Attention

  2. M.Sc. Oliver Guhr, Hochschule für Technik und Wirtschaft, Fakultät Informatik/Mathematik, Fachgebiet Künstliche Intelligenz, oliver.guhr@htw-dresden.de
  3. Topics

  4. Question Answering on SQuAD 2.0

  5. Transformer Quiz

  6. Bidirectional Encoder Representations from Transformers

  7. Applications

  8. Transfer learning with texts

  9. Bert

  10. Task One: Mask Words

  11. Task Two: Next Sentence Prediction
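
Slides 10 and 11 name BERT's two pre-training tasks. As a minimal sketch of the first one, predicting masked words, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is named in the deck):

    # Masked word prediction, illustrative only; library and checkpoint are
    # assumptions, not taken from the slides.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # BERT is trained to predict the token hidden behind [MASK].
    for prediction in fill_mask("The capital of France is [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))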

  12. Semi-supervised training of language models

  13. Bert

  14. Bert Models

  15. Bert

  16. Text Classification

  17. Bert This Bert model can process sequences of up to 512 tokens.
  18. Bert Each token generates a vector with the length of the hidden size.
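
A short sketch of the two facts above: the model accepts at most 512 tokens, and every token comes out as a vector of hidden-size length (768 for the base model). The library and checkpoint below are assumptions, not taken from the slides.

    # Each input token yields one hidden vector of length hidden_size.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    # Truncate to the model's maximum of 512 tokens (slide 17).
    inputs = tokenizer("Transformers are eating NLP.", return_tensors="pt",
                       truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    # (batch_size, sequence_length, hidden_size), here (1, sequence_length, 768)
    print(outputs.last_hidden_state.shape)
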
  19. Bert Classification

  20. Task specific training

  21. Task specific training
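
Task-specific training here means putting a small classification head on top of the pre-trained encoder and fine-tuning it on labeled examples. A hedged sketch; the checkpoint, labels and training texts are placeholders for illustration:

    # Fine-tuning a classification head on top of BERT (toy data, a few steps).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    texts = ["great talk, very clear", "the demo crashed twice"]
    labels = torch.tensor([1, 0])  # invented sentiment labels

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for _ in range(3):  # a few gradient steps, just to show the loop
        outputs = model(**batch, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()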

  22. NLP Background

  23. Distributional Hypothesis

  24. Word Vectors

  25. Word Vectors

  26. Word Vectors
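
As a toy illustration of word vectors: words become points in a vector space, and under the distributional hypothesis similar words end up close together. The numbers below are invented purely for illustration.

    # Cosine similarity between made-up word vectors (values are not real).
    import numpy as np

    vectors = {
        "king":  np.array([0.80, 0.65, 0.10]),
        "queen": np.array([0.78, 0.70, 0.12]),
        "apple": np.array([0.10, 0.05, 0.90]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0
    print(cosine(vectors["king"], vectors["apple"]))  # much smaller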

  27. Transformers

  28. How encoders work.

  29. Attention is all you need

  30. Attention is all you need

  31. Transformer Encoder

  32. Transformer Encoder

  33. Scaled dot product attention Query Key Value

  34. Scaled dot product attention Query Key Value
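
The slide names the three ingredients Query, Key and Value; scaled dot product attention combines them as softmax(QK^T / sqrt(d_k)) V. A minimal sketch in PyTorch (the framework is an assumption, the deck does not prescribe one):

    # Scaled dot product attention: softmax(Q K^T / sqrt(d_k)) V
    import math
    import torch

    def scaled_dot_product_attention(query, key, value):
        d_k = query.size(-1)
        scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)  # attention logits
        weights = torch.softmax(scores, dim=-1)                  # attention distribution
        return weights @ value                                   # weighted sum of values

    q = k = v = torch.randn(5, 64)   # self-attention: Q, K, V come from the same tokens
    print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([5, 64])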

  35. 35

  36. What does Attention do? The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads). Source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
  37. What does Attention do? The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads). Source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
  38. Attention

  39. Attention

  40. Attention

  41. Attention

  42. Matrix Calculation

  43. Matrix Calculation

  44. Multi Head Attention

  45. Multi Head Attention

  46. Multi Head Attention

  47. Multi Head Attention

  48. Multi Head Attention
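
Multi-head attention runs the attention mechanism several times in parallel, each head with its own learned Q/K/V projections, and concatenates the results. A sketch using PyTorch's built-in module, with the 8 heads and model dimension 512 of the original Transformer:

    # Multi-head self-attention with 8 heads over a model dimension of 512.
    import torch

    attention = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8,
                                            batch_first=True)

    x = torch.randn(1, 10, 512)           # (batch, tokens, model dimension)
    output, weights = attention(x, x, x)  # self-attention: query = key = value
    print(output.shape)                   # torch.Size([1, 10, 512])
    print(weights.shape)                  # averaged over heads: (1, 10, 10)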

  49. Positional Encoding

  50. Positional Encoding

  51. Positional Encoding For an embedding with a dimensionality of 4, the encodings look like this:
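
A sketch of the sinusoidal positional encoding from “Attention is all you need” for the 4-dimensional case shown on the slide:

    # Sinusoidal positional encoding: sin on even dimensions, cos on odd ones.
    import numpy as np

    def positional_encoding(num_positions, d_model=4):
        positions = np.arange(num_positions)[:, np.newaxis]   # (positions, 1)
        dims = np.arange(d_model)[np.newaxis, :]              # (1, d_model)
        angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates
        encoding = np.zeros((num_positions, d_model))
        encoding[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions
        encoding[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions
        return encoding

    print(positional_encoding(num_positions=3))  # one 4-dimensional row per position
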
  52. Add and Normalize
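
“Add and Normalize” wraps every sub-layer: its output is added back onto its input (a residual connection) and the sum is layer-normalized. A minimal PyTorch sketch with model dimension 512:

    # Residual connection followed by layer normalization.
    import torch

    layer_norm = torch.nn.LayerNorm(512)

    def add_and_normalize(x, sublayer_output):
        return layer_norm(x + sublayer_output)

    x = torch.randn(1, 10, 512)                 # input to the sub-layer
    sublayer_output = torch.randn(1, 10, 512)   # e.g. the attention output
    print(add_and_normalize(x, sublayer_output).shape)  # torch.Size([1, 10, 512])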

  53. Future...

  54. An Image is Worth 16x16 Words
     • ImageNet and CIFAR with transformers
       ◦ 88.55% on ImageNet
       ◦ 90.72% on ImageNet-ReaL
       ◦ 94.55% on CIFAR-100
     • Paper by Dosovitskiy et al.
     • Other approaches to vision tasks
       ◦ Taming Transformers for High-Resolution Image Synthesis
  55. Reformer: The Efficient Transformer
     • Context windows of 1 million words
     • Similar ideas:
  56. RealFormer: Transformer Likes Residual Attention (applies the ResNet residual idea to attention)

  57. Sources

  58. Transformer