Paper Introduction / An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Slide 1

Slide 1 text

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
shade-tree
Twitter: @shade_tree2112
Website: https://forest1988.github.io
Paper’s page at Open Review: https://openreview.net/forum?id=YicbFdNTTy
6th All-Japan Computer Vision Study Group: Transformer Paper Reading Session
2021/4/18

Slide 2

Slide 2 text

Preamble

Slide 3

Slide 3 text

The paper introduced in this talk
https://openreview.net/forum?id=YicbFdNTTy

Slide 4

Slide 4 text

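Reasons for choosing this paper (note: includes subjective opinion)
• It proposes the Vision Transformer (ViT). It is one of the representative works applying the Transformer [Vaswani+, 2017] to computer vision (CV), and one of the triggers for the buzz that "Transformers perform well in CV too."
• Prior work on applying Transformers to CV exists, but I consider the arrival of ViT to be a milestone.
• In Hugging Face Transformers [Wolf+, 2020], the de facto standard library for Transformers, ViT is the first CV model to be implemented.
  • Note: an implementation of DETR [Carion+, 2020] appears to have been attempted earlier. #9998
  • Note: V&L models are also implemented.
Figure source: “Transformers: State-of-the-Art Natural Language Processing” [Wolf+, 2020]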

Slide 5

Slide 5 text

Impact of the paper
• ICLR 2021 Open Review
  • Decision: Accept (Oral)
  • Comment: This paper has generated a lot of great discussion and it presents a very different way of doing image recognition at scale compared to current state of the art practices. All reviewers rated this paper as an accept. This work is interesting enough that in my view it really deserves further exposure and discussion and an oral presentation at ICLR would be a good way to achieve that.
• Citations
  • 197 (as of 2021/04/17)

Slide 6

Slide 6 text

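Who are you? The presenter's standpoint and perspective (1)
• shade-tree
• A doctoral student at a certain graduate school
• Main research areas:
  • NLP, Natural Language Generation, Machine Learning, Storytelling, Emotions
  • ↑ CV is not in the list...
• Someone who works on NLP in an environment full of CV experts and asks amateur questions (in the literal sense), prefaced with "I don't really understand CV, but..."
• Grew fond of the Transformer while struggling with it and complaining "I don't understand Transformers at all. What's wrong with using RNNs?" Favorite library: Hugging Face Transformers.
• These days, excited to hear that Transformers can be used for CV as well.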

Slide 7

Slide 7 text

Who are you? The presenter's standpoint and perspective (2)
• I understand V&L Transformers a little (in the literal sense)
• 森友亮†, 上原康平†, 原田達也 (†equal contribution), “視覚・言語融合 Transformer モデルによる画像からの物語文生成” (Story generation from images with a vision-and-language Transformer model), CAI+CAI first workshop (workshop of the 27th Annual Meeting of the Association for Natural Language Processing), Fukuoka (online), March 2021.
=== Human-written image narrative ===
some elephants are in a tent. They are tied by a chain. They seems to be happy. They are seeing something above a tent. A tent is made by wood.
=== Proposed method (pretrained decoder) ===
An elephant is walking. It is in a zoo. It seems to be sad. Some elephants are walking. Some trees are near by elephant.
=== Proposed method (scratch decoder) ===
Some elephants are standing. They are in a road. They seems to be happy. Some elephants are in africa. Some trees are near by elephant.

Slide 8

Slide 8 text

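Supplementary note
• References that carry an explicit note are the following:
  • Those not cited in the paper being introduced
  • Those from which figures and similar material are taken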

Slide 9

Slide 9 text

Main topic

Slide 10

Slide 10 text

What kind of paper is it?
• Excerpts from Open Review (emphasis by the presenter)
• One-sentence Summary: Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.
• Program Chairs
  • Comment: This paper has generated a lot of great discussion and it presents a very different way of doing image recognition at scale compared to current state of the art practices. All reviewers rated this paper as an accept. This work is interesting enough that in my view it really deserves further exposure and discussion and an oral presentation at ICLR would be a good way to achieve that.

Slide 11

Slide 11 text

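Contributions of the paper
• Applies the Transformer directly to image recognition.
• Unlike prior work using self-attention, it introduces no image-specific inductive bias apart from the initial patch extraction step. Instead, an image is treated as a sequence of patches, and the encoder part of the Transformer used in NLP is applied directly.
• The method is simple and scalable, and achieves high performance when pre-trained on large datasets.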

Slide 12

Slide 12 text

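Related work (1)
• Applying the Transformer to images
  • Naive application at the pixel level → does not scale to realistic image sizes
• Workarounds used in prior work
  • Apply attention only among local pixels [Parmar+, 2018]
  • Use Sparse Transformers [Child+, 2019]
  • Apply attention to blocks of varying sizes [Weissenborn+, 2019]
✘ These specialized attention structures perform well, but they are complex to implement and hard to make efficient.
→ A major contribution of this paper is showing that good performance can be obtained by using a standard Transformer with as few modifications as possible.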
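To make the scaling problem concrete, here is a rough back-of-the-envelope sketch. It counts the pairwise attention scores that one head computes in one layer, first with one token per pixel and then with one token per patch. The 224x224 resolution and the 16x16 patch size are assumptions chosen for illustration; the helper name is hypothetical.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Pairwise attention scores for one head in one layer (seq_len squared)."""
    return seq_len * seq_len


height = width = 224   # assumed input resolution
patch = 16             # assumed patch size

pixel_tokens = height * width                          # one token per pixel
patch_tokens = (height // patch) * (width // patch)    # one token per patch

pixel_cost = attention_matrix_entries(pixel_tokens)
patch_cost = attention_matrix_entries(patch_tokens)

print("pixel-level tokens:", pixel_tokens, "scores:", pixel_cost)
print("patch-level tokens:", patch_tokens, "scores:", patch_cost)
print("reduction factor:", pixel_cost // patch_cost)
```

Because self-attention is quadratic in sequence length, shrinking the sequence by a factor of 256 shrinks the attention matrix by a factor of 256 squared.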

Slide 13

Slide 13 text

Related work (2)
• The closest work is [Cordonnier+, 2020]
  • Splits the input image into 2 x 2 patches and applies full self-attention on top
Figure source: “On the relationship between self-attention and convolutional layers” [Cordonnier+, 2020]

Slide 14

Slide 14 text

Related work (3)
• CNN + Self-attention
  • augmenting feature maps for image classification [Bello+, 2019]
  • further processing the output of a CNN using self-attention
    • for object detection [Hu et al., 2018; Carion et al., 2020]
    • video processing [Wang et al., 2018; Sun et al., 2019]
    • image classification [Wu et al., 2020]
    • unsupervised object discovery [Locatello et al., 2020]
    • unified text-vision tasks [Chen et al., 2020c; Lu et al., 2019; Li et al., 2019]

Slide 15

Slide 15 text

Related work (4)
• image GPT (iGPT) [Chen et al., 2020a]
  • Applies GPT after reducing the image resolution and color space
Figure source: “Generative pretraining from pixels” [Chen+, 2020a]

Slide 16

Slide 16 text

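Proposed method: Vision Transformer (ViT) (1)
• Designed to stay as close as possible to the original Transformer
  • The intent is to benefit from the scalability and efficient implementations of Transformers in NLP
• Strictly speaking, only the encoder part of the Transformer is used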

Slide 17

Slide 17 text

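Proposed method: Vision Transformer (ViT) (2)
• To handle 2D images, the input of shape $H \times W \times C$ is reshaped to $N \times (P^2 \cdot C)$.
  • $H$, $W$, $C$: height, width, channels. $(P \times P)$ is the resolution of each patch.
  • $N = HW/P^2$: the number of patches = the sequence length fed to the Transformer
• Each patch is flattened into a 1D array and mapped to a $D$-dimensional vector by a linear layer. This is called the patch embedding.
• Following the same idea as BERT's special “[class]” token, a special learnable embedding is prepended to the sequence of patch embeddings.
• Position embeddings are added. Experiments confirm that 1D position embeddings are sufficient.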
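The reshape step on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is a nested list indexed as `image[row][col][channel]`, `patchify` is a hypothetical helper name, and the linear projection, class embedding, and position embeddings are left out.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into N flattened patches.

    Returns N = (H * W) / patch**2 vectors, each of length patch * patch * C,
    ordered row-major over the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# Toy 4 x 4 image with C = 3, where pixel (r, c) holds [r, c, 0].
toy = [[[r, c, 0] for c in range(4)] for r in range(4)]
seq = patchify(toy, patch=2)
print(len(seq), len(seq[0]))  # N and P*P*C
```

With H = W = 4, P = 2, and C = 3, the sequence has N = 16 / 4 = 4 elements, each of length 2 * 2 * 3 = 12, matching the shapes on the slide.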

Slide 18

Slide 18 text

Proposed method: Vision Transformer (ViT) (3)
(Equation figure; annotated labels: Patch Embeddings, Position Embeddings)
• MSA: multiheaded self-attention
• The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
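For reference, the forward pass that the labels on this slide refer to can be written as follows. This is reproduced from memory of the paper's formulation, so please check it against the paper itself. Here $\mathbf{x}_p^i$ is the $i$-th flattened patch, $\mathbf{E}$ is the patch embedding projection, $\mathbf{E}_{pos}$ holds the position embeddings, LN is layer normalization, and $L$ is the number of encoder layers.

```latex
\begin{aligned}
\mathbf{z}_0 &= [\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \cdots;\ \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos},
  & \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D} \\
\mathbf{z}'_\ell &= \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, & \ell = 1 \ldots L \\
\mathbf{z}_\ell &= \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, & \ell = 1 \ldots L \\
\mathbf{y} &= \mathrm{LN}(\mathbf{z}_L^0)
\end{aligned}
```

The first line combines the patch embeddings, the prepended class embedding, and the position embeddings; the image representation $\mathbf{y}$ is read from the class position of the final layer.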

Slide 19

Slide 19 text

Experiments: Datasets
• Datasets used for pre-training
  • Datasets of various sizes are used to examine model scalability
  • ILSVRC-2012 ImageNet dataset
    • 1k classes, 1.3M images
  • ImageNet-21k [Deng+, 2009]
    • 21k classes, 14M images
  • JFT [Sun+, 2017]
    • 18k classes, 303M high-resolution images
• Preprocessing and related details follow [Kolesnikov+, 2020]

Slide 20

Slide 20 text

Experiments: Benchmarks
• Transfer to benchmark tasks
  • ImageNet on the original validation labels
  • ImageNet on the cleaned-up ReaL labels [Beyer+, 2020]
  • CIFAR-10/100 [Krizhevsky, 2009]
  • Oxford-IIIT Pets [Parkhi+, 2012]
  • Oxford Flowers-102 [Nilsback & Zisserman, 2008]
  • Preprocessing and related details follow [Kolesnikov+, 2020]
• 19-task VTAB classification suite [Zhai+, 2019b]
  • low-data transfer to diverse tasks (1,000 training examples per task)

Slide 21

Slide 21 text

Experiments: Models
• Model variants are named following BERT
  • ViT-L/16 is the “Large” variant with 16×16 input patches
• Compared methods
  • Baseline CNNs: ResNet (BiT), a modified version of ResNet [He+, 2016]
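The naming scheme can be unpacked with a small helper. This is an illustrative sketch: `describe` and `VARIANTS` are hypothetical names, the layer, hidden-size, and head counts are the Base/Large/Huge values as I recall them from the paper's model table, and the 224-pixel input resolution is an assumption.

```python
import re

# Layers, hidden size D, and attention heads per size (as recalled from the paper's model table).
VARIANTS = {
    "B": {"name": "Base", "layers": 12, "hidden": 768, "heads": 12},
    "L": {"name": "Large", "layers": 24, "hidden": 1024, "heads": 16},
    "H": {"name": "Huge", "layers": 32, "hidden": 1280, "heads": 16},
}


def describe(model: str, image_size: int = 224) -> dict:
    """Parse a name such as 'ViT-L/16' and derive the patch grid for a square input."""
    match = re.fullmatch(r"ViT-([BLH])/(\d+)", model)
    if match is None:
        raise ValueError(f"unrecognized model name: {model!r}")
    size, patch = match.group(1), int(match.group(2))
    side = image_size // patch
    return {**VARIANTS[size], "patch": patch, "grid": (side, side), "tokens": side * side}


large = describe("ViT-L/16")
huge = describe("ViT-H/14")
print(large["name"], large["grid"], large["tokens"])
print(huge["name"], huge["grid"], huge["tokens"])
```

A smaller patch size means a longer sequence: at 224x224, ViT-L/16 sees a 14x14 grid of patches, while ViT-H/14 sees a 16x16 grid.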

Slide 22

Slide 22 text

Experiments: Training details
• Pre-training
  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$)
    • Confirmed to perform better than SGD for ResNets as well
  • Batch size: 4096
  • Weight decay: 0.1
  • Linear learning rate warmup and decay
• Fine-tuning
  • Optimizer: SGD with momentum
  • Batch size: 512
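The learning rate schedule mentioned for pre-training can be sketched as below. This is a minimal sketch, assuming linear warmup from zero followed by linear decay to zero. The function name and the step counts and peak rate in the usage lines are hypothetical placeholders, not values from the paper.

```python
def linear_warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given step: ramp up linearly, then decay linearly to zero."""
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Hypothetical settings, for illustration only.
TOTAL, WARMUP, PEAK = 100, 10, 1e-3
schedule = [linear_warmup_linear_decay(s, TOTAL, WARMUP, PEAK) for s in range(TOTAL + 1)]
print(schedule[0], schedule[5], schedule[10], schedule[55], schedule[100])
```

The rate starts at zero, reaches the peak at the end of warmup, and returns to zero at the final step.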

Slide 23

Slide 23 text

Experiments: Comparison with SOTA (1)
• The largest proposed models, ViT-H/14 and ViT-L/16, are compared with SOTA CNN models
  • Big Transfer (BiT) [Kolesnikov+, 2020]
  • Noisy Student [Xie+, 2020]
    • a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed.
• All models were trained on TPUv3, and training time is also reported

Slide 24

Slide 24 text

Experiments: Comparison with SOTA (2)
• ViT-L/16 (smaller) pre-trained on JFT outperforms BiT-L
• ViT-H/14 (larger) achieves even better accuracy
  • Especially on the more difficult tasks

Slide 25

Slide 25 text

Experiments: Comparison with SOTA (3)
• VTAB performance
  • Natural: Pets, CIFAR, etc.
  • Specialized: medical and satellite imagery
  • Structured: tasks that require geometric understanding like localization

Slide 26

Slide 26 text

Experiments: Dataset size (1)
• Pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M.

Slide 27

Slide 27 text

Experiments: Dataset size (2)
• Train ViT on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset.
• The inductive bias of CNNs is useful for smaller datasets. For larger datasets, however, learning the relevant patterns directly from the data is sufficient, or even beneficial.
• Further analysis of few-shot properties of ViT is an exciting direction of future work.

Slide 28

Slide 28 text

Experiments: Scaling Study
• Scaling of various models is evaluated on JFT-300M
• ViT is ahead of ResNets on the performance/compute trade-off
• Hybrids help at small compute budgets (smaller models), but the gain largely disappears for larger models

Slide 29

Slide 29 text

Experiments: What is happening inside ViT
• Self-attention allows ViT to integrate information across the entire image even in the lowest layers.
• Some heads attend to most of the image already in the lowest layers; other attention heads have consistently small attention distances in the low layers.
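To make "attention distance" concrete, here is a rough sketch for a single query patch and a single head: the attention-weighted average of the pixel distance between the query patch and every other patch. This is my simplified reading of the idea, not the paper's exact measurement (which aggregates over many images and query positions), and it ignores the class token. The function name, the 2x2 grid, and the weights are hypothetical.

```python
import math


def mean_attention_distance(weights, grid, patch, query):
    """Attention-weighted mean distance (in pixels) from one query patch.

    weights: attention weights from the query to each of grid*grid patches,
             in row-major order, assumed to sum to 1.
    """
    q_row, q_col = divmod(query, grid)
    total = 0.0
    for idx, weight in enumerate(weights):
        row, col = divmod(idx, grid)
        total += weight * math.hypot(row - q_row, col - q_col) * patch
    return total


# Toy 2 x 2 grid of 16-pixel patches; the query is the top-left patch.
local_head = [1.0, 0.0, 0.0, 0.0]       # attends only to itself
global_head = [0.25, 0.25, 0.25, 0.25]  # attends uniformly to the whole image

local_distance = mean_attention_distance(local_head, grid=2, patch=16, query=0)
global_distance = mean_attention_distance(global_head, grid=2, patch=16, query=0)
print(local_distance, global_distance)
```

A head with a small value behaves like a local, convolution-like filter, while a head with a large value integrates information from across the image.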

Slide 30

Slide 30 text

What the experiments tell us
• Transformers (for CV) do not perform well unless the training dataset is large.
  • Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
  • mid-sized datasets such as ImageNet
• When trained on large-scale data, they approach or surpass SOTA.
  • the picture changes if the models are trained on larger datasets (14M-300M images)
  • ImageNet-21k
  • JFT-300M

Slide 31

Slide 31 text

Conclusion
• Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP.
• challenges remain
  • apply ViT to other computer vision tasks
  • continue exploring self-supervised pre-training methods
    • there is still a large gap between self-supervised and large-scale supervised pre-training
  • further scaling

Slide 32

Slide 32 text

Digression

Slide 33

Slide 33 text

Question: relation to BoW
• A representative example of an NLP method being applied to CV is Bag-of-Words (BoW) becoming Bag-of-Visual-Words (BoVW).
• Could this paper be seen as part of the same trend?

Slide 34

Slide 34 text

Question: relation to BoW (already answered)
• This was discussed on Open Review.

Slide 35

Slide 35 text

BoVW, Positional Encoding
• A reviewer asked why BoVW is not discussed, and whether there is any point in using positional encoding for images.
• ViT != BoW
  • The authors argue that the two are entirely different in concept,
  • as ViT models interaction between all patches throughout the whole network through global self-attention layers.
• Positional embeddings are important for taking patch locations into account, and the accuracy improvement is discussed in the appendix.

Slide 36

Slide 36 text

Appendix: On position embeddings
• while there is a large gap between the performances of the model with no positional embedding and models with positional embedding, there is little to no difference between different ways of encoding positional information.
• We speculate that since our Transformer encoder operates on patch-level inputs, as opposed to pixel-level, the differences in how to encode spatial information is less important.

Slide 37

Slide 37 text

Discussion on Open Review (1)
• Q: why not perform pretraining using the autoregressive language model (LM) or masked LM like GPT and BERT pretraining.
• A: in our experience supervised training typically allows for better performance with the same amount of compute. (omitted) How to do it best is a matter of future research.
• Q: “An image is worth 16x16 words”, what does it mean?
• A: This is merely a wordplay based on the fact that our largest model (H/14), when trained on the standard ImageNet resolution 224x224 pixels, splits the input image into 16x16=256 patches, and we feed these patches to a transformer in the same way words are fed to transformers in NLP.

Slide 38

Slide 38 text

Discussion on Open Review (2)
• This level of experimental verifications is only possible if huge computation resources are only available, which is not accessible for most research teams, esp in academia.
• Cons: No significant technical novelty
  • The proposed model is incremental modifications of the original Transformer and its existing variants.
• The authors do not appear to have given an explicit answer, but perhaps this was addressed by "Added additional technical details and polished the text throughout."?

Slide 39

Slide 39 text

Bonus
• I would like to popularize Japanese Transformers with the Hugging Face library.
• The Japanese thread, part of “Languages at Hugging Face”
  • https://discuss.huggingface.co/t/japanese-nlp-introductions/3799