
CS25 Transformers United

sadahry
September 01, 2022


An introduction slide deck for an online lecture series (translated from Japanese
for international students).

Online lectures specializing in Transformers, held at Stanford in the fall of 2021
(10 lectures in total)

The Class:
https://web.stanford.edu/class/cs25/

YouTube:
https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM


Transcript

  1. What's the purpose of this slide?
    For the lab
    • Make sure that each person knows which lectures in CS25 are likely to be helpful.
    For me
    • Grasp the scope and applicability of Transformers
    • Increase the options that can be considered in my own research (backchannel)
  2. What's CS25?
    An online lecture series dedicated to Transformers, held at Stanford in Fall 2021 (10 lectures in total).
    Lecture: https://web.stanford.edu/class/cs25/
    YouTube: https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM
    ※Caution: the titles and order on YouTube are slightly different from the lecture page. This slide follows the description on the lecture page.
  3. Supplementary information for the class
    There are reference links for each lecture's content on the lecture URL, so please check them.
    • Recommended Readings (≒ what to read before the lecture = papers)
    • Additional Readings (≒ what to read after the lecture)
  4. In my case.. *1
    • I read the Transformer paper first, then watched all 10 episodes.
      ◦ I should have read the paper after watching the first episode :/
    • I didn't read the Recommended Readings at first
      ◦ and stopped and referred to them when I didn't understand.
    • Used Japanese auto-translation + read the transcript *1
      ◦ The transcript is only in English.
      ◦ The auto-translation is sometimes hard to read.
    • Tried to understand the outline.
      ◦ But it still took over 4 hours per lecture,
        ▪ like reading a paper in an unfamiliar field.
  5. Contents (★: recommended)
    1. Introduction to Transformers (22:44)
    2. Transformers in Language: GPT-3, Codex (48:39), Speaker: Mark Chen (OpenAI)
    3. Applications in Vision (1:08:37), Speaker: Lucas Beyer (Google Brain)
    4. Transformers in RL & Universal Compute Engines (1:20:43), Speaker: Aditya Grover (FAIR)
    5. ★ Scaling transformers (1:05:44), Speaker: Barret Zoph (Google Brain)
    6. ★ Perceiver: Arbitrary IO with transformers (58:59), Speaker: Andrew Jaegle (DeepMind)
    7. Self Attention & Non-Parametric Transformers (1:05:43), Speaker: Aidan Gomez (University of Oxford)
    8. GLOM: Representing part-whole hierarchies in a neural network (52:48), Speaker: Geoffrey Hinton (UoT)
    9. Interpretability with transformers (59:34), Speaker: Chris Olah (AnthropicAI)
    10. Transformers for Applications in Audio, Speech and Music: From Language Modeling to Understanding to Synthesis (48:19), Speaker: Prateek Verma (Stanford)
  6. In a word…
    1. Brief description of Attention and Transformer
    2. Automatic generation of natural language and programming code
    3. Advantages of giant models and Vision Transformer
    4. Offline reinforcement learning with Transformer
    5. Transformer can be used with any huge data set
    6. Transformer for any long array of data
    7. Transformer for Kaggle-like data
    8. Understanding hierarchically with neural networks
    9. Reverse engineering Self-Attention
    10. Relationship between audio and Transformer
  7. 1. Introduction to Transformers (22:44)
    Speakers: Div Garg, Chetanya Rastogi, Advay Pal (at Stanford)
    Topics
    • The past up to LSTM and the future from the Transformer
    • From an overview of the Attention papers to an overview of Self-Attention
      ◦ The idea of query, key, value (a minimal sketch follows this slide)
    • Overview of the Transformer layer
    • Overview of BERT and GPT-3
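    (Not from the lecture: a minimal numpy sketch of the query/key/value idea above. All sizes and weight matrices are random stand-ins of my own choosing.)

        import numpy as np

        # Minimal scaled dot-product self-attention: each token builds a query, key,
        # and value vector, compares its query with every key, and takes a
        # softmax-weighted sum of the values.
        rng = np.random.default_rng(0)
        seq_len, d_model, d_k = 5, 16, 8                # toy sizes (assumptions)

        x = rng.normal(size=(seq_len, d_model))         # one token embedding per row
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

        q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to Q, K, V
        scores = q @ k.T / np.sqrt(d_k)                 # how well each query matches each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        output = weights @ v                            # each row: weighted mix of all values

        print(output.shape)                             # (5, 8)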
  8. 1. Introduction to Transformers
    I thought..
    • Easy to understand. You don't have to read the Recommended Readings to watch it.
    • You can get the general flow leading up to the Transformer.
      ◦ No particular argument from the LSTM side (they called LSTM the "prehistoric era")
    • Better understanding of Attention
      ◦ It also explains the basics beyond Self-Attention
    • Gives us a better intuitive understanding
      ◦ Shared useful articles
        ▪ https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
        ▪ https://jalammar.github.io/illustrated-transformer/
      ◦ Maybe you don't need this lecture if you just want to figure out the Transformer
    Texts in this style: quoted
  9. 2. Transformers in Language: GPT-3, Codex (48:39)
    Speaker: Mark Chen (OpenAI)
    Topics
    • Transition of natural language generation
    • (Chen's) research process leading up to GPT-3 at OpenAI
    • GPT-3 and its applications (iGPT)
    • Transition of code generation
    • Overview of Codex
  10. 2. Transformers in Language: GPT-3, Codex
    Codex overview (as it was interesting)
    • A study of inferring the return value after a def is specified and the docs are written
    • Pre-trained on 159 GB of GitHub code
      ◦ 30~40% speedup just by compressing whitespace
    • Fine-tuned with 164 def+docs+return code samples *1
    • High accuracy when assert code is passed along with the docs
      ◦ Low accuracy when passing text only
    • The metric pass@k *2 is used for evaluation (a sketch of the estimator follows this slide)
      ◦ Define k = len(picked_up), n = len(generated)
      ◦ pass@k = E(at least one of the k samples selected from n is correct)
              = E(whether a correct answer can be selected from some code suggestions)
    • The sampling temperature (t) *3 should be about 0.8
    *1 *2 *3
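    (Not part of the slide: a minimal numpy sketch of the unbiased pass@k estimator described in the Codex paper, where n samples are generated per problem and c of them pass the asserts. The example numbers below are made up.)

        import numpy as np

        # pass@k = E[1 - C(n - c, k) / C(n, k)]: the probability that at least one
        # of k samples drawn from the n generated ones is correct.
        def pass_at_k(n: int, c: int, k: int) -> float:
            if n - c < k:
                return 1.0          # every size-k subset contains a correct sample
            # numerically stable product form of 1 - C(n - c, k) / C(n, k)
            return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

        # e.g. 200 generated samples per problem, 13 of which pass the asserts
        print(pass_at_k(n=200, c=13, k=1))    # 0.065 (= c / n)
        print(pass_at_k(n=200, c=13, k=10))   # ~0.50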
  11. 2. Transformers in Language: GPT-3, Codex
    I thought..
    • You don't have to read the Recommended Readings to watch it.
    • Many of the stories are "I put it in like this and this is what happened." You can get practical knowledge from these videos.
    • "Last thing is I used to dabble in competitive programming myself, and I really wanted to create a model that could solve problems that I couldn't." Well motivated.
    • Text generation may not be fun unless you are fluent in English
    • The unsupervised LSTM on Amazon reviews was interesting *1
      ◦ It was able to "predict" positives/negatives
    • "Getting to GPT-3" is a general overview
      ◦ e.g.) a GPT-1 downstream task needs its own model
    • "GPT-3 and its applications (iGPT)" is also outlined
      ◦ Multitasking up to DALL-E (≠ DALL-E 2) is explained
      ◦ Addition/subtraction is high accuracy but multiplication is low accuracy in GPT-3 *2
    *1 *2
  12. 3. Applications in Vision (1:08:37)
    Speaker: Lucas Beyer (Google Brain)
    Topics
    • Overview of class classification in images
    • Advantages of the big model
    • Introduction to ViT
      ◦ Position encoding and attention unique to images
    • The future of ViT
  13. 3. Applications in Vision
    I thought..
    • A little difficult to understand. I couldn't grasp the details.
      ◦ Especially the definition of the graph axes.
      ◦ Better to refer to the paper as appropriate.
    • There were a lot of discussions about the size of the model
      ◦ To put it simply, "Big (model) is better than small."
        ▪ Model creation time is long but tolerable
        ▪ High score regardless of the amount of training data *1
        ▪ Fine-tune performance is also good *2
    • The answer to a question like "If you can reduce the amount of pre-training, won't it speed up the time to create the model?" was good.
      ◦ Something like "It would be nice if you had a perfect data set, but if you do pre-training on a large scale, it is easier to get inferential results because of the 'seen it before' nature of the data."
    *1 *2
  14. 3. Applications in Vision
    and…
    • Interesting story about position encoding *3
    • The answer to a question like "Do target images that span across pixels *4 affect learning?" was good.
      ◦ Something like "It's big data, so there are images with similar patterns."
    • "The future of ViT" was just for reference.
      ◦ I don't know much about it, but the latest papers might be more useful.
    *3 *4
  15. 4. Transformers in RL & Universal Compute Engines (1:20:43)
    Speaker: Aditya Grover (FAIR)
    Topics
    • Difference between RL (= reinforcement learning) and sequential data (= images and language)
    • Overview of the Decision Transformer (DT)
      ◦ Offline RL talk
      ◦ (There was no discussion of online RL)
  16. 4. Transformers in RL & Universal Compute Engines
    I thought..
    • Relatively easy to understand. You don't need to know RL to follow the flow.
      ◦ Performance comparisons would be more enjoyable with prerequisite knowledge.
        ▪ No explanation of the Atari benchmarks, Q-learning, etc.
    • You can understand the interpretation of the Transformer in the context of RL.
      ◦ DT is a tricky technique.
      ◦ Conversely, there is little knowledge to be gained about the Transformer itself..
    • There are a lot of tweaks and learnings
    • Because RL deals with data that has already been adjusted and extracted (e.g., logs), the model can be smaller compared to images, natural language, etc.
    • DT pre-training takes about 4~5 hours
    • It was also helpful that he talked about the constraints of DT
      ◦ It's not greedy, so in the worst case you need an infinite time series.
      ◦ Difficult to extract mysterious or effective behavior
    • DT might be applied to predict the future by setting γ (the discount rate), or it may even be possible to predict γ itself. However, he seemed reluctant to use DT that way.
  17. 5. Scaling transformers (1:05:44) ★
    Speakers: Barret Zoph (Google Brain), Irwan Bello (Google Brain)
    Topics
    • Switch Transformers overview
      ◦ Size comparison
      ◦ Comparison with other approaches
      ◦ Application to tasks
  18. 5. Scaling transformers ★
    Overview of Switch Transformers (it was a good paper)
    • A model for a huge amount of training data.
    • Replace the single FFN layer with multiple candidate FFNs (called experts) and select the expert with the maximum probability from the inner product with the Attention output; training is load-balanced across the experts. *1 (A routing sketch follows this slide.)
    • Backprop uses the selected maximum probability as a pseudo loss *2
    • In this architecture…
      ◦ Each expert learns in a distributed manner, making each layer low-dimensional.
      ◦ High robustness due to the existence of experts with similar characteristics
      ◦ Training efficiency is also increased by a factor of 7 *3
    *1 *2 *3
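    (Not part of the slide: a minimal, framework-free sketch of the top-1 routing idea above, under my own simplifying assumptions. The real Switch Transformer adds expert capacity limits, a load-balancing auxiliary loss, and expert parallelism.)

        import numpy as np

        # Switch-style top-1 routing: a router scores each token against every expert,
        # the token is processed only by its argmax expert, and the expert output is
        # scaled by the router probability so the router receives a gradient.
        rng = np.random.default_rng(0)
        d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8      # toy sizes (assumptions)

        tokens = rng.normal(size=(n_tokens, d_model))          # e.g. attention-layer outputs
        w_router = rng.normal(size=(d_model, n_experts))       # router weights
        experts = [(rng.normal(size=(d_model, d_ff)),          # each expert is its own FFN
                    rng.normal(size=(d_ff, d_model))) for _ in range(n_experts)]

        logits = tokens @ w_router
        logits -= logits.max(axis=-1, keepdims=True)           # stable softmax
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        chosen = probs.argmax(axis=-1)                         # top-1 expert per token

        out = np.zeros_like(tokens)
        for t in range(n_tokens):
            w_in, w_out = experts[chosen[t]]
            hidden = np.maximum(tokens[t] @ w_in, 0.0)         # expert FFN with ReLU
            out[t] = probs[t, chosen[t]] * (hidden @ w_out)    # scale by router probability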
  19. 5. Scaling transformers ★
    I thought..
    • You may need to read the Recommended Readings.
      ◦ They frequently use terms that appear only there.
    • The theory of scaling is clear, and the Q&A will further deepen your understanding.
      ◦ On the other hand, it was also something you could understand by reading the papers.
    • The answer to a question like "Whenever we're dealing with empirical science, what is the research question? Why did you decide to attack this one first?" was so cool.
      ◦ "I think that, one, empirically, it seems that sparsity has led to better empirical results."
      ◦ and he continued along the lines of "I also think this configuration is complicated, but it is better to start with something that works (in a modern configuration) than to try something that doesn't necessarily work and see if it really works. The goal is the same as LSTM. In our work, we built an approach that naturally builds up from there, by Transformers, to tackle a lot of other machine learning challenges."
    • Someone commented "I love the coffee slurping." lol
  20. 6. Perceiver: Arbitrary IO with transformers (58:59) ★
    Speaker: Andrew Jaegle (DeepMind)
    Topics
    • Motivation for generic models
      ◦ Understanding of data trends
      ◦ Integrated understanding of multimodal data
    • CNN (convolution) vs Transformer (position encoding + Self-Attention)
    • Overview of Perceiver
      ◦ Cross Attention
  21. 6. Perceiver: Arbitrary IO with transformers ★
    Overview of Perceiver (it was a good paper)
    • A model for training data with a huge array length (M).
    • The Q of QKV is reduced to length N (N << M) to cut the computational complexity from O(M^2) to O(MN). *1 (A sketch follows this slide.)
    • Method: define an N-length latent input as an array of random values → feed the input (a binary array of length M) through an Attention calculation (Cross Attention, O(MN)) → Transformer layer processing in the N-length latent space (Latent Transformer, O(N^2)) → (repeat)... *2
      The N-length vector representation is obtained from the M-row binary array.
    • Enables processing of huge arrays such as video without a convolution layer.
    • Almost anything in binary form can be processed.
      ◦ Multimodal is also OK (!)
    *1 *2
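    (Not part of the slide: a minimal numpy sketch of the cross-attention step above, under my own simplifying assumptions. The real Perceiver uses learned latents, multi-head attention, and repeated cross-attend / latent-Transformer blocks.)

        import numpy as np

        # Queries come from a short latent array (length N); keys and values come from
        # the raw input (length M). The attention matrix is N x M, so the cost is
        # O(M*N) instead of the O(M^2) of full self-attention over the input.
        rng = np.random.default_rng(0)
        M, N, d = 10_000, 256, 64                 # huge input length, small latent length

        inputs = rng.normal(size=(M, d))          # e.g. flattened pixels/audio + position encoding
        latents = rng.normal(size=(N, d))         # stands in for the learned latent array

        def softmax(x):
            x = x - x.max(axis=-1, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=-1, keepdims=True)

        attn = softmax(latents @ inputs.T / np.sqrt(d))   # (N, M) cross-attention weights
        latents = attn @ inputs                           # (N, d) summary of the whole input

        # The subsequent "latent Transformer" layers self-attend only over the N latents,
        # which costs O(N^2) no matter how large M is.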
  22. 6. Perceiver: Arbitrary IO with transformers ★
    I thought..
    • The Recommended Readings might be fine to refer to only as needed.
      ◦ The content is advanced, but the overview is well explained. I didn't need to read the whole text.
    • I liked the explanation of the N x M Attention layer, which seems abstract.
      ◦ The interpretation is which clusters one pixel belongs to, rather than a 1:1 relationship as in natural language.
      ◦ Because it is not a 1:1 correspondence, the learning process is somewhat black-boxed.
    • Careful comparisons with other models give a better understanding.
      ◦ ConvNet, Transformer, ViT, BERT, RAFT, etc.
    • Through the comparison of models in each domain, your understanding of how to compare predictive models deepens.
    • A general-purpose model that can be applied to a variety of "single tasks"
      ◦ Training models are created per task, even for the same method (e.g. video)
  23. 7. Self Attention & Non-Parametric Transformers (1:05:43)
    Speaker: Aidan Gomez (University of Oxford)
    Topics
    • Overview of the Transformer
    • The episode of when the Transformer was created
    • Overview of Non-Parametric Transformers (NPT)
  24. 7. Self Attention & Non-Parametric Transformers
    I thought..
    • Hearing the story behind the creation of the Transformer was great.
      ◦ It seems the production period was 3 months (!)
      ◦ I couldn't even imagine being "involved in the Transformer paper as an intern", so I couldn't understand how he felt.
        ▪ As he says, "I really did not appreciate the impact."
    • Non-parametric = no need for hyperparameters, like RF and others.
    • I thought of NPT as "Kaggle with Transformer".
      ◦ An approach for small tabular data
      ◦ XGBoost was the competitor *1
    • An approach that looks at row dependencies as well as column dependencies *2
      ◦ The input is all tabular features & targets (!)
      ◦ It also learns to regularize.
    *1 *ranking *2
  25. 8. GLOM: Representing part-whole hierarchies in a neural network (52:48)
    Speaker: Geoffrey Hinton (UoT)
    Topics
    • Understanding neural networks (NN)
    • GLOM overview
    • Overview of the cube demonstration
      ◦ An NN paper from the 1970s, the source of the idea for GLOM
  26. 8. GLOM: Representing part-whole hierarchies in a neural network
    I thought..
    • A good listening experience, but overall the theory and analogies are difficult to understand.
      ◦ e.g. the difference in learning the two sentences: "next weekend what we will be doing is visiting relatives." / "next weekend what we will be is visiting relatives."
    • The Transformer was not really relevant; it was only an overview of GLOM.
    • The purpose is "hierarchical knowledge understanding" *1
      ◦ e.g. (face->((nose->nasal cavity),eyes))
    • Better understanding of NN characteristics.
      ◦ There is no neuroscience talk; he keeps it to engineering talk only.
    • Personally, I found the following points interesting.
      ◦ GLOM is a "hierarchy of concrete Embeddings" and not an abstract expression
      ◦ "So you're only trying to represent one thing. You're not trying to represent some set of possible objects. If you're trying to represent some set of possible objects, you have a horrible binding problem, and you couldn't use these very vague distributions."
    *1
  27. 9. Interpretability with transformers (59:34)
    Speaker: Chris Olah (AnthropicAI)
    Topics
    • Machine interpretability
      ◦ Reverse engineering to replace it with a human-understandable algorithm
    • Self-Attention layer analysis
    (Additional) Recommended Readings:
    • In-context Learning and Induction Heads
  28. 9. Interpretability with transformers
    I thought..
    • The process of decomposing the Transformer into elements and the mathematical decomposition of Self-Attention were interesting.
    • The general framework is understandable, but it is difficult to follow the details.
      ◦ The lecture was fast and there were few supplementary explanations of the illustrations, so I was left with a lot of "?" while watching this lecture.
    • The key to in-context learning is not value conversion but induction heads (a toy illustration follows this slide)
      ◦ In Attention, there is no conversion of values, mostly copying of values.
      ◦ In the case of 2-layer Self-Attention, the highest accuracy comes from feeding q only to the 2nd layer *1
      ◦ The first Attention layer understands the connection between tokens and induces the v for q in the second layer *2
    • Embedding and position encoding are excluded.
    *1 *2
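    (Not part of the slide: a toy, non-neural illustration of the copying behavior described above. Real induction heads implement this with two attention layers; the function name here is just for this example.)

        # Predict the next token by prefix matching: find the most recent earlier
        # occurrence of the current token and "copy" the token that followed it.
        def induction_head_prediction(tokens):
            current = tokens[-1]
            for i in range(len(tokens) - 2, -1, -1):
                if tokens[i] == current:
                    return tokens[i + 1]        # copy the value that followed the match
            return None                         # nothing earlier to copy from

        # "A B C D A" -> "B": the token that followed the previous "A" is copied forward.
        print(induction_head_prediction(["A", "B", "C", "D", "A"]))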
  29. 10. Transformers for Applications in Audio, Speech and Music: From Language Modeling to Understanding to Synthesis (48:19)
    Speaker: Prateek Verma (Stanford)
    Topics
    • Position of the Transformer
    • Organizing basic speech knowledge
    • Classification of raw speech data using the Transformer
      ◦ Vector quantization
      ◦ Wavelets
    (Additional) Recommended Readings:
    • Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions
  30. 10. Transformers for Applications in Audio, Speech and Music: From Language Modeling to Understanding to Synthesis
    I thought..
    • You supposedly don't need audio knowledge to watch it.
    • "The Transformer is good at discrete data, not continuous data." was impressive.
    • An interesting idea of using raw sound as input and Wavelet transforms between layers
      ◦ The approach is to preserve the characteristics of continuous data as much as possible.
        ▪ The shape of the Wavelet layer (seems to) behave like a kind of window function *1
      ◦ However, there is also a normal Embedding layer, and the inference result was questionable *2, so this architecture might just be an idea.
    *1 *2
  31. Overall impression
    Good
    • It was very significant to get the impression that "these people are at the forefront!"
    • I could understand the "big is better than small" message about giant models from various angles.
    • The subject matter was relatively simple and easy.
    • The Q&A sessions were informative.
      ◦ The questions were quite in-depth, which gave me more perspectives.
      ◦ Especially in 5. and 6., not only smart answers but also tips for trial and error were provided.
    Not good
    • Only covers up to 2021; no mention of e.g. DALL-E 2.
      ◦ I heard somewhere in these lectures that "a few years ago" means "old technology"..
    • If you want theoretical knowledge, you should read the papers.
      ◦ If you want to hear how things were tried and tested and how they can be interpreted, I recommend watching the lectures.
  32. Remaining questions
    • My understanding of fine-tuning is still limited
      ◦ e.g. Can individual expression be achieved through fine-tuning?
    • Information from 2022
    • No real-world applications yet?
      ◦ I wanted to know about model improvement and active learning.
    • Still not grasping the true meaning of "represent one thing"
      ◦ Is emotion "one thing"? Can we say "one thing" when it comes to accuracy?
      ◦ Is it multimodal learning to capture as much of "one thing" as possible from all environmental factors in action?
    • Not enough information on Audio x Transformer
      ◦ End-to-end speech recognition, speech synthesis ideas, etc.
        ▪ Jukebox (read it someday)
      ◦ ViT from spectrogram images.
  33. Others
    • It's just my personal impression, so if you actually watch the lectures, you may come away with a different impression than mine.
    • I've paid "attention" to avoid misreadings, but please point out any mistakes you find. m(__)m
    • If you have any questions after watching the lectures, let's discuss them in the Slack channel (e.g. #random)!