
CS25 Transformers United

sadahry
September 01, 2022


An introduction slide deck for an online lecture series (translated from Japanese
for international students).

Online lectures specializing in Transformers, held at Stanford in the fall of 2021
(10 lectures in total)

The Class:
https://web.stanford.edu/class/cs25/

YouTube:
https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM


Transcript

  1. What's the purpose of this slide?
    For the lab
    • Make sure that each person knows which lectures in CS25 are likely to be helpful.
    For me
    • Grasp the scope and applicability of Transformers
    • Increase the options that can be considered in my own research (backchannel)
  2. What's CS25?
    An online lecture series dedicated to Transformers, held at Stanford in Fall 2021 (10 lectures in total).
    Lecture: https://web.stanford.edu/class/cs25/
    YouTube: https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM
    ※Caution: the titles and order on YouTube are slightly different from the lecture page. This slide follows the description on the lecture page.
  3. Supplementary information for the class
    There are reference links for each lecture's content on the lecture URL, so please check them.
    • Recommended Readings (≒ what to read before the lecture = papers)
    • Additional Readings (≒ what to read after the lecture)
  4. In my case.. *1
    • I read the Transformer paper first, then watched all 10 episodes.
      ◦ I should have read the paper after watching the first episode :/
    • I didn't read the Recommended Readings at first
      ◦ and stopped and referred to them when I didn't understand.
    • Used Japanese auto-translation + read the transcript *1
      ◦ The transcript is only in English.
      ◦ The auto-translation is sometimes hard to read.
    • Tried to understand the outline.
      ◦ But it still took over 4 hours per lecture,
        ▪ like reading a paper in an unfamiliar field.
  5. Contents (★: recommended)
    1. Introduction to Transformers (22:44)
    2. Transformers in Language: GPT-3, Codex (48:39), Speaker: Mark Chen (OpenAI)
    3. Applications in Vision (1:08:37), Speaker: Lucas Beyer (Google Brain)
    4. Transformers in RL & Universal Compute Engines (1:20:43), Speaker: Aditya Grover (FAIR)
    5. ★ Scaling transformers (1:05:44), Speaker: Barret Zoph (Google Brain)
    6. ★ Perceiver: Arbitrary IO with transformers (58:59), Speaker: Andrew Jaegle (DeepMind)
    7. Self Attention & Non-Parametric Transformers (1:05:43), Speaker: Aidan Gomez (University of Oxford)
    8. GLOM: Representing part-whole hierarchies in a neural network (52:48), Speaker: Geoffrey Hinton (UoT)
    9. Interpretability with transformers (59:34), Speaker: Chris Olah (AnthropicAI)
    10. Transformers for Applications in Audio, Speech and Music: From Language Modeling to Understanding to Synthesis (48:19), Speaker: Prateek Verma (Stanford)
  6. In a word…
    1. Brief description of Attention and Transformer
    2. Automatic generation of natural language and programming code
    3. Advantages of giant models and Vision Transformer
    4. Offline reinforcement learning with Transformer
    5. Transformer can be used with any huge data set
    6. Transformer for any long array of data
    7. Transformer for Kaggle-like data
    8. Understanding hierarchically with neural networks
    9. Reverse engineering Self-Attention
    10. Relationship between audio and Transformer
  7. 1. Introduction to Transformers (22:44)
    Speakers: Div Garg, Chetanya Rastogi, Advay Pal (at Stanford)
    Topics
    • The past up to LSTM and the future from the Transformer
    • From an overview of the Attention papers to an overview of Self-Attention
      ◦ The idea of query, key, value (a minimal sketch follows this slide)
    • Overview of the Transformer layer
    • Overview of BERT and GPT-3
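    (Not from the lecture: a minimal numpy sketch of the query/key/value idea above. All sizes and weight matrices are random stand-ins of my own choosing.)

        import numpy as np

        # Minimal scaled dot-product self-attention: each token builds a query, key,
        # and value vector, compares its query with every key, and takes a
        # softmax-weighted sum of the values.
        rng = np.random.default_rng(0)
        seq_len, d_model, d_k = 5, 16, 8                # toy sizes (assumptions)

        x = rng.normal(size=(seq_len, d_model))         # one token embedding per row
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

        q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to Q, K, V
        scores = q @ k.T / np.sqrt(d_k)                 # how well each query matches each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        output = weights @ v                            # each row: weighted mix of all values

        print(output.shape)                             # (5, 8)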
  8. 1. Introduction to Transformers
    I thought..
    • Easy to understand. You don't have to read the Recommended Readings to watch it.
    • You can get the general flow leading up to the Transformer.
      ◦ No particular argument from the LSTM side (they called LSTM the "prehistoric era")
    • Better understanding of Attention
      ◦ It also explains the basics beyond Self-Attention
    • Gives us a better intuitive understanding
      ◦ Shared useful articles
        ▪ https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
        ▪ https://jalammar.github.io/illustrated-transformer/
      ◦ Maybe you don't need this lecture if you just want to figure out the Transformer
    Texts in this style: quoted
  9. 2. Transformers in Language: GPT-3, Codex (48:39)
    Speaker: Mark Chen (OpenAI)
    Topics
    • Transition of natural language generation
    • (Chen's) research process leading up to GPT-3 at OpenAI
    • GPT-3 and its applications (iGPT)
    • Transition of code generation
    • Overview of Codex
  10. 2. Transformers in Language: GPT-3, Codex
    Codex overview (as it was interesting)
    • A study of inferring the return value after a def is specified and the docs are written
    • Pre-trained on 159 GB of GitHub code
      ◦ 30~40% speedup just by compressing whitespace
    • Fine-tuned with 164 def+docs+return code samples *1
    • High accuracy when assert code is passed along with the docs
      ◦ Low accuracy when passing text only
    • The metric pass@k *2 is used for evaluation (a sketch of the estimator follows this slide)
      ◦ Define k = len(picked_up), n = len(generated)
      ◦ pass@k = E(at least one of the k samples selected from n is correct)
              = E(whether a correct answer can be selected from some code suggestions)
    • The sampling temperature (t) *3 should be about 0.8
    *1 *2 *3
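    (Not part of the slide: a minimal numpy sketch of the unbiased pass@k estimator described in the Codex paper, where n samples are generated per problem and c of them pass the asserts. The example numbers below are made up.)

        import numpy as np

        # pass@k = E[1 - C(n - c, k) / C(n, k)]: the probability that at least one
        # of k samples drawn from the n generated ones is correct.
        def pass_at_k(n: int, c: int, k: int) -> float:
            if n - c < k:
                return 1.0          # every size-k subset contains a correct sample
            # numerically stable product form of 1 - C(n - c, k) / C(n, k)
            return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

        # e.g. 200 generated samples per problem, 13 of which pass the asserts
        print(pass_at_k(n=200, c=13, k=1))    # 0.065 (= c / n)
        print(pass_at_k(n=200, c=13, k=10))   # ~0.50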
  11. 2. Transformers in Language: GPT-3, Codex
    I thought..
    • You don't have to read the Recommended Readings to watch it.
    • Many of the stories are "I put it in like this and this is what happened." You can get practical knowledge from these videos.
    • "Last thing is I used to dabble in competitive programming myself, and I really wanted to create a model that could solve problems that I couldn't." Well motivated.
    • Text generation may not be fun unless you are fluent in English
    • The unsupervised LSTM on Amazon reviews was interesting *1
      ◦ It was able to "predict" positives/negatives
    • "Getting to GPT-3" is a general overview
      ◦ e.g.) a GPT-1 downstream task needs its own model
    • "GPT-3 and its applications (iGPT)" is also outlined
      ◦ Multitasking up to DALL-E (≠ DALL-E 2) is explained
      ◦ Addition/subtraction is high accuracy but multiplication is low accuracy in GPT-3 *2
    *1 *2
  12. 3. Applications in Vision (1:08:37)
    Speaker: Lucas Beyer (Google Brain)
    Topics
    • Overview of class classification in images
    • Advantages of the big model
    • Introduction to ViT
      ◦ Position encoding and attention unique to images
    • The future of ViT
  13. 3. Applications in Vision
    I thought..
    • A little difficult to understand. I couldn't grasp the details.
      ◦ Especially the definition of the graph axes.
      ◦ Better to refer to the paper as appropriate.
    • There were a lot of discussions about the size of the model
      ◦ To put it simply, "Big (model) is better than small."
        ▪ Model creation time is long but tolerable
        ▪ High score regardless of the amount of training data *1
        ▪ Fine-tune performance is also good *2
    • The answer to a question like "If you can reduce the amount of pre-training, won't it speed up the time to create the model?" was good.
      ◦ Something like "It would be nice if you had a perfect data set, but if you do pre-training on a large scale, it is easier to get inferential results because of the 'seen it before' nature of the data."
    *1 *2
  14. 3. Applications in Vision
    and…
    • Interesting story about position encoding *3
    • The answer to a question like "Do target images that span across pixels *4 affect learning?" was good.
      ◦ Something like "It's big data, so there are images with similar patterns."
    • "The future of ViT" was just for reference.
      ◦ I don't know much about it, but the latest papers might be more useful.
    *3 *4
  15. 4. Transformers in RL & Universal Compute Engines (1:20:43)
    Speaker: Aditya Grover (FAIR)
    Topics
    • Difference between RL (= reinforcement learning) and sequential data (= images and language)
    • Overview of the Decision Transformer (DT)
      ◦ Offline RL talk
      ◦ (There was no discussion of online RL)
  16. 4. Transformers in RL & Universal Compute Engines
    I thought..
    • Relatively easy to understand. You don't need to know RL to follow the flow.
      ◦ Performance comparisons would be more enjoyable with prerequisite knowledge.
        ▪ No explanation of the Atari benchmarks, Q-learning, etc.
    • You can understand the interpretation of the Transformer in the context of RL.
      ◦ DT is a tricky technique.
      ◦ Conversely, there is little knowledge to be gained about the Transformer itself..
    • There are a lot of tweaks and learnings
    • Because RL deals with data that has already been adjusted and extracted (e.g., logs), the model can be smaller compared to images, natural language, etc.
    • DT pre-training takes about 4~5 hours
    • It was also helpful that he talked about the constraints of DT
      ◦ It's not greedy, so in the worst case you need an infinite time series.
      ◦ Difficult to extract mysterious or effective behavior
    • DT might be applied to predict the future by setting γ (the discount rate), or it may even be possible to predict γ itself. However, he seemed reluctant to use DT that way.
  17. 5. Scaling transformers (1:05:44) ★
    Speakers: Barret Zoph (Google Brain), Irwan Bello (Google Brain)
    Topics
    • Switch Transformers overview
      ◦ Size comparison
      ◦ Comparison with other approaches
      ◦ Application to tasks
  18. 5. Scaling transformers ★
    Overview of Switch Transformers (it was a good paper)
    • A model for a huge amount of training data.
    • Replace the single FFN layer with multiple candidate FFNs (called experts) and select the expert with the maximum probability from the inner product with the Attention output; training is load-balanced across the experts. *1 (A routing sketch follows this slide.)
    • Backprop uses the selected maximum probability as a pseudo loss *2
    • In this architecture…
      ◦ Each expert learns in a distributed manner, making each layer low-dimensional.
      ◦ High robustness due to the existence of experts with similar characteristics
      ◦ Training efficiency is also increased by a factor of 7 *3
    *1 *2 *3
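    (Not part of the slide: a minimal, framework-free sketch of the top-1 routing idea above, under my own simplifying assumptions. The real Switch Transformer adds expert capacity limits, a load-balancing auxiliary loss, and expert parallelism.)

        import numpy as np

        # Switch-style top-1 routing: a router scores each token against every expert,
        # the token is processed only by its argmax expert, and the expert output is
        # scaled by the router probability so the router receives a gradient.
        rng = np.random.default_rng(0)
        d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8      # toy sizes (assumptions)

        tokens = rng.normal(size=(n_tokens, d_model))          # e.g. attention-layer outputs
        w_router = rng.normal(size=(d_model, n_experts))       # router weights
        experts = [(rng.normal(size=(d_model, d_ff)),          # each expert is its own FFN
                    rng.normal(size=(d_ff, d_model))) for _ in range(n_experts)]

        logits = tokens @ w_router
        logits -= logits.max(axis=-1, keepdims=True)           # stable softmax
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        chosen = probs.argmax(axis=-1)                         # top-1 expert per token

        out = np.zeros_like(tokens)
        for t in range(n_tokens):
            w_in, w_out = experts[chosen[t]]
            hidden = np.maximum(tokens[t] @ w_in, 0.0)         # expert FFN with ReLU
            out[t] = probs[t, chosen[t]] * (hidden @ w_out)    # scale by router probability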
  19. 5. Scaling transformers ★
    I thought..
    • You may need to read the Recommended Readings.
      ◦ They frequently use terms that appear only there.
    • The theory of scaling is clear, and the Q&A will further deepen your understanding.
      ◦ On the other hand, it was also something you could understand by reading the papers.
    • The answer to a question like "Whenever we're dealing with empirical science, what is the research question? Why did you decide to attack this one first?" was so cool.
      ◦ "I think that, one, empirically, it seems that sparsity has led to better empirical results."
      ◦ and he continued along the lines of "I also think this configuration is complicated, but it is better to start with something that works (in a modern configuration) than to try something that doesn't necessarily work and see if it really works. The goal is the same as LSTM. In our work, we built an approach that naturally builds up from there, by Transformers, to tackle a lot of other machine learning challenges."
    • Someone commented "I love the coffee slurping." lol
  20. 6. Perceiver: Arbitrary IO with transformers (58:59) ★
    Speaker: Andrew Jaegle (DeepMind)
    Topics
    • Motivation for generic models
      ◦ Understanding of data trends
      ◦ Integrated understanding of multimodal data
    • CNN (convolution) vs Transformer (position encoding + Self-Attention)
    • Overview of Perceiver
      ◦ Cross Attention
  21. 6. Perceiver: Arbitrary IO with transformers ★
    Overview of Perceiver (it was a good paper)
    • A model for training data with a huge array length (M).
    • The Q of QKV is reduced to length N (N << M) to cut the computational complexity from O(M^2) to O(MN). *1 (A sketch follows this slide.)
    • Method: define an N-length latent input as an array of random values → feed the input (a binary array of length M) through an Attention calculation (Cross Attention, O(MN)) → Transformer layer processing in the N-length latent space (Latent Transformer, O(N^2)) → (repeat)... *2
      The N-length vector representation is obtained from the M-row binary array.
    • Enables processing of huge arrays such as video without a convolution layer.
    • Almost anything in binary form can be processed.
      ◦ Multimodal is also OK (!)
    *1 *2
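    (Not part of the slide: a minimal numpy sketch of the cross-attention step above, under my own simplifying assumptions. The real Perceiver uses learned latents, multi-head attention, and repeated cross-attend / latent-Transformer blocks.)

        import numpy as np

        # Queries come from a short latent array (length N); keys and values come from
        # the raw input (length M). The attention matrix is N x M, so the cost is
        # O(M*N) instead of the O(M^2) of full self-attention over the input.
        rng = np.random.default_rng(0)
        M, N, d = 10_000, 256, 64                 # huge input length, small latent length

        inputs = rng.normal(size=(M, d))          # e.g. flattened pixels/audio + position encoding
        latents = rng.normal(size=(N, d))         # stands in for the learned latent array

        def softmax(x):
            x = x - x.max(axis=-1, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=-1, keepdims=True)

        attn = softmax(latents @ inputs.T / np.sqrt(d))   # (N, M) cross-attention weights
        latents = attn @ inputs                           # (N, d) summary of the whole input

        # The subsequent "latent Transformer" layers self-attend only over the N latents,
        # which costs O(N^2) no matter how large M is.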
  22. 6. Perceiver: Arbitrary IO with transformers ★
    I thought..
    • The Recommended Readings might be fine to refer to only as needed.
      ◦ The content is advanced, but the overview is well explained. I didn't need to read the whole text.
    • I liked the explanation of the N x M Attention layer, which seems abstract.
      ◦ The interpretation is which clusters one pixel belongs to, rather than a 1:1 relationship as in natural language.
      ◦ Because it is not a 1:1 correspondence, the learning process is somewhat black-boxed.
    • Careful comparisons with other models give a better understanding.
      ◦ ConvNet, Transformer, ViT, BERT, RAFT, etc.
    • Through the comparison of models in each domain, your understanding of how to compare predictive models deepens.
    • A general-purpose model that can be applied to a variety of "single tasks"
      ◦ Training models are created per task, even for the same method (e.g. video)
  23. 7. Self Attention & Non-Parametric Transformers (1:05:43)
    Speaker: Aidan Gomez (University of Oxford)
    Topics
    • Overview of the Transformer
    • The episode of when the Transformer was created
    • Overview of Non-Parametric Transformers (NPT)
  24. 7. Self Attention & Non-Parametric Transformers
    I thought..
    • Hearing the story behind the creation of the Transformer was great.
      ◦ It seems the production period was 3 months (!)
      ◦ I couldn't even imagine being "involved in the Transformer paper as an intern", so I couldn't understand how he felt.
        ▪ As he says, "I really did not appreciate the impact."
    • Non-parametric = no need for hyperparameters, like RF and others.
    • I thought of NPT as "Kaggle with Transformer".
      ◦ An approach for small tabular data
      ◦ XGBoost was the competitor *1
    • An approach that looks at row dependencies as well as column dependencies *2
      ◦ The input is all tabular features & targets (!)
      ◦ It also learns to regularize.
    *1 *ranking *2
  25. 8. GLOM: Representing part-whole hierarchies in a neural network (52:48)
    Speaker: Geoffrey Hinton (UoT)
    Topics
    • Understanding neural networks (NN)
    • GLOM overview
    • Overview of the cube demonstration
      ◦ An NN paper from the 1970s, the source of the idea for GLOM
  26. 8. GLOM: Representing part-whole hierarchies in a neural network
    I thought..
    • A good listening experience, but overall the theory and analogies are difficult to understand.
      ◦ e.g. the difference in learning the two sentences: "next weekend what we will be doing is visiting relatives." / "next weekend what we will be is visiting relatives."
    • The Transformer was not really relevant; it was only an overview of GLOM.
    • The purpose is "hierarchical knowledge understanding" *1
      ◦ e.g. (face->((nose->nasal cavity),eyes))
    • Better understanding of NN characteristics.
      ◦ There is no neuroscience talk; he keeps it to engineering talk only.
    • Personally, I found the following points interesting.
      ◦ GLOM is a "hierarchy of concrete Embeddings" and not an abstract expression
      ◦ "So you're only trying to represent one thing. You're not trying to represent some set of possible objects. If you're trying to represent some set of possible objects, you have a horrible binding problem, and you couldn't use these very vague distributions."
    *1
  27. 9. Interpretability with transformers (59:34)
    Speaker: Chris Olah (AnthropicAI)
    Topics
    • Machine interpretability
      ◦ Reverse engineering to replace it with a human-understandable algorithm
    • Self-Attention layer analysis
    (Additional) Recommended Readings:
    • In-context Learning and Induction Heads
  28. 9. Interpretability with transformers
    I thought..
    • The process of decomposing the Transformer into elements and the mathematical decomposition of Self-Attention were interesting.
    • The general framework is understandable, but it is difficult to follow the details.
      ◦ The lecture was fast and there were few supplementary explanations of the illustrations, so I was left with a lot of "?" while watching this lecture.
    • The key to in-context learning is not value conversion but induction heads (a toy illustration follows this slide)
      ◦ In Attention, there is no conversion of values, mostly copying of values.
      ◦ In the case of 2-layer Self-Attention, the highest accuracy comes from feeding q only to the 2nd layer *1
      ◦ The first Attention layer understands the connection between tokens and induces the v for q in the second layer *2
    • Embedding and position encoding are excluded.
    *1 *2
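    (Not part of the slide: a toy, non-neural illustration of the copying behavior described above. Real induction heads implement this with two attention layers; the function name here is just for this example.)

        # Predict the next token by prefix matching: find the most recent earlier
        # occurrence of the current token and "copy" the token that followed it.
        def induction_head_prediction(tokens):
            current = tokens[-1]
            for i in range(len(tokens) - 2, -1, -1):
                if tokens[i] == current:
                    return tokens[i + 1]        # copy the value that followed the match
            return None                         # nothing earlier to copy from

        # "A B C D A" -> "B": the token that followed the previous "A" is copied forward.
        print(induction_head_prediction(["A", "B", "C", "D", "A"]))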
  29. 10. Transformers for Applications in Audio, Speech and Music: From Language Modeling to Understanding to Synthesis (48:19)
    Speaker: Prateek Verma (Stanford)
    Topics
    • Position of the Transformer
    • Organizing basic speech knowledge
    • Classification of raw speech data using the Transformer
      ◦ Vector quantization
      ◦ Wavelets
    (Additional) Recommended Readings:
    • Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions
  30. 10. Transformers for Applications in Audio, Speech and Music: From Language Modeling to Understanding to Synthesis
    I thought..
    • You supposedly don't need audio knowledge to watch it.
    • "The Transformer is good at discrete data, not continuous data." was impressive.
    • An interesting idea of using raw sound as input and Wavelet transforms between layers
      ◦ The approach is to preserve the characteristics of continuous data as much as possible.
        ▪ The shape of the Wavelet layer (seems to) behave like a kind of window function *1
      ◦ However, there is also a normal Embedding layer, and the inference result was questionable *2, so this architecture might just be an idea.
    *1 *2
  31. Overall impression
    Good
    • It was very significant to get the impression that "these people are at the forefront!"
    • I could understand the "big is better than small" message about giant models from various angles.
    • The subject matter was relatively simple and easy.
    • The Q&A sessions were informative.
      ◦ The questions were quite in-depth, which gave me more perspectives.
      ◦ Especially in 5. and 6., not only smart answers but also tips for trial and error were provided.
    Not good
    • Only covers up to 2021; no mention of e.g. DALL-E 2.
      ◦ I heard somewhere in these lectures that "a few years ago" means "old technology"..
    • If you want theoretical knowledge, you should read the papers.
      ◦ If you want to hear how things were tried and tested and how they can be interpreted, I recommend watching the lectures.
  32. Remaining questions
    • My understanding of fine-tuning is still limited
      ◦ e.g. Can individual expression be achieved through fine-tuning?
    • Information from 2022
    • No real-world applications yet?
      ◦ I wanted to know about model improvement and active learning.
    • Still not grasping the true meaning of "represent one thing"
      ◦ Is emotion "one thing"? Can we say "one thing" when it comes to accuracy?
      ◦ Is it multimodal learning to capture as much of "one thing" as possible from all environmental factors in action?
    • Not enough information on Audio x Transformer
      ◦ End-to-end speech recognition, speech synthesis ideas, etc.
        ▪ Jukebox (read it someday)
      ◦ ViT from spectrogram images.
  33. Others
    • It's just my personal impression, so if you actually watch the lectures, you may come away with a different impression than mine.
    • I've paid "attention" to avoid misreadings, but please point out any mistakes you find. m(__)m
    • If you have any questions after watching the lectures, let's discuss them in the Slack channel (e.g. #random)!