Paper Introduction / An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
shade-tree
Twitter: @shade_tree2112
Website: https://forest1988.github.io
Paper's page at Open Review: https://openreview.net/forum?id=YicbFdNTTy
The 6th All-Japan Computer Vision Study Group: Transformer paper reading session
2021/4/18

Preamble

The paper introduced today
https://openreview.net/forum?id=YicbFdNTTy

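Why this paper was chosen (note: includes subjective opinion)
• It proposes the Vision Transformer (ViT). It is one of the representative studies applying the Transformer [Vaswani+, 2017] to Computer Vision (CV), and one of the triggers for the buzz that "Transformers perform well in CV too".
• Prior work on applying Transformers to CV exists, but I consider the arrival of ViT a milestone.
• In Hugging Face Transformers [Wolf+, 2020], the de facto standard among Transformer libraries, ViT was the first CV model to be implemented.
  • Note: an implementation of DETR [Carion+, 2020] appears to have been attempted earlier. #9998
  • Note: V&L models are also implemented.
"Transformers: State-of-the-Art Natural Language Processing" [Wolf+, 2020]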

Impact of the paper
• ICLR 2021 Open Review
  • Decision: Accept (Oral)
  • Comment: This paper has generated a lot of great discussion and it presents a very different way of doing image recognition at scale compared to current state of the art practices. All reviewers rated this paper as an accept. This work is interesting enough that in my view it really deserves further exposure and discussion and an oral presentation at ICLR would be a good way to achieve that.
• Citations
  • 197 (as of 2021/04/17)

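Who are you? The presenter's position and viewpoint (1)
• shade-tree
• A PhD student at a certain graduate school
• Main research areas:
  • NLP, Natural Language Generation, Machine Learning, Storytelling, Emotions
  • ↑ CV is not on the list...
• Someone who does NLP in an environment full of CV experts and asks genuinely naive questions ("amateur questions" in the literal sense), prefaced with "I don't really understand CV, but...".
• Grew fond of the Transformer while struggling with "I don't understand Transformers at all. What's wrong with using an RNN!" Favorite library: Hugging Face Transformers.
• These days, excited to hear that Transformers can be used for CV as well.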

Who are you? The presenter's position and viewpoint (2)
• Understands V&L Transformers a little (literally just a little)
• Yusuke Mori†, Kohei Uehara†, Tatsuya Harada (†equal contribution), "Story Generation from Images with a Vision-and-Language Transformer Model" (title translated from Japanese), CAI+CAI first workshop (workshop at the 27th Annual Meeting of the Association for Natural Language Processing), Fukuoka (online), March 2021.
=== Human-written image narrative ===
some elephants are in a tent. They are tied by a chain. They seems to be happy. They are seeing something above a tent. A tent is made by wood.
=== Proposed method (pretrained decoder) ===
An elephant is walking. It is in a zoo. It seems to be sad. Some elephants are walking. Some trees are near by elephant.
=== Proposed method (scratch decoder) ===
Some elephants are standing. They are in a road. They seems to be happy. Some elephants are in africa. Some trees are near by elephant.

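Supplementary note
• References that are explicitly annotated on the slides are the following:
  • Works not cited in the paper being introduced
  • Works from which figures and similar material are taken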

Main topic

What kind of paper is it?
• Excerpts from Open Review (emphasis by the presenter)
  • One-sentence Summary: Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.
  • Program Chairs
    • Comment: This paper has generated a lot of great discussion and it presents a very different way of doing image recognition at scale compared to current state of the art practices. All reviewers rated this paper as an accept. This work is interesting enough that in my view it really deserves further exposure and discussion and an oral presentation at ICLR would be a good way to achieve that.

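Contributions of the paper
• Applied the Transformer directly to image recognition.
• Unlike prior work using self-attention, it introduces no image-specific inductive biases apart from the initial patch extraction step. Instead, an image is treated as a sequence of patches, and the Transformer encoder used in NLP is applied directly.
• The method is simple and scalable, and achieved high performance with pre-training on large datasets.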

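Related work (1)
• Applying Transformers to images
  • Applied naively at the pixel level → does not scale to realistic image sizes
• Workarounds in prior work
  • Apply attention only among local neighboring pixels [Parmar+, 2018]
  • Use Sparse Transformers [Child+, 2019]
  • Apply attention to blocks of varying size [Weissenborn+, 2019]
✘ These specialized attention structures perform well, but they are complex to implement and hard to make efficient
→ A major contribution of this paper is showing that good performance can be obtained while using a standard Transformer with as few changes as possible.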
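To see why naive pixel-level attention does not scale, a rough back-of-the-envelope calculation helps. The sketch below assumes a 224×224 input and 16×16 patches (illustrative values, not taken from this slide) and counts the entries of one self-attention matrix, which grows with the square of the sequence length.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Number of pairwise attention scores for one head in one layer."""
    return seq_len * seq_len

# Illustrative assumptions: a 224x224 image and 16x16 patches.
height, width, patch = 224, 224, 16

pixel_tokens = height * width                         # one token per pixel
patch_tokens = (height // patch) * (width // patch)   # one token per patch

pixel_cost = attention_matrix_entries(pixel_tokens)
patch_cost = attention_matrix_entries(patch_tokens)

print(pixel_tokens, pixel_cost)   # 50176 tokens
print(patch_tokens, patch_cost)   # 196 tokens
print(pixel_cost // patch_cost)   # how many times larger the pixel-level matrix is
```

Under these assumptions the pixel-level attention matrix is several orders of magnitude larger than the patch-level one, which is the scaling problem the prior work above tries to sidestep.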

Related work (2)
• The closest work is [Cordonnier+, 2020]
  • Extracts patches of size 2 x 2 from the input image and applies full self-attention on top
"On the relationship between self-attention and convolutional layers" [Cordonnier+, 2020]

Related work (3)
• CNN + Self-attention
  • augmenting feature maps for image classification [Bello+, 2019]
  • further processing the output of a CNN using self-attention
    • for object detection [Hu et al., 2018; Carion et al., 2020]
    • video processing [Wang et al., 2018; Sun et al., 2019]
    • image classification [Wu et al., 2020]
    • unsupervised object discovery [Locatello et al., 2020]
    • unified text-vision tasks [Chen et al., 2020c; Lu et al., 2019; Li et al., 2019]

Related work (4)
• image GPT (iGPT) [Chen et al., 2020a]
  • Applies GPT after reducing the image resolution and color space
"Generative pretraining from pixels" [Chen+, 2020a]

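Proposed method: Vision Transformer (ViT) (1)
• Designed to stay as close as possible to the original Transformer
• Intended to take advantage of the scalability and efficient implementations of Transformers in NLP
Strictly speaking, only the encoder part of the Transformer is used.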

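Proposed method: Vision Transformer (ViT) (2)
• To handle 2D images, the input of shape $H \times W \times C$ is reshaped to $N \times (P^2 \cdot C)$.
  • $H$, $W$, $C$: height, width, channels. $(P \times P)$ is the resolution of each patch
  • $N = HW/P^2$: the number of patches = the length of the sequence fed to the Transformer
• Each patch is flattened into a 1D array and mapped to a D-dimensional vector by a linear layer. This is called the patch embedding.
• Following the same idea as BERT's special "[class]" token, a special embedding is prepended to the sequence of patch embeddings.
• Position embeddings are added. Experiments confirmed that 1D position embeddings are sufficient.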
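The reshape step can be made concrete with a small sketch. This is a pure-Python toy under assumed sizes (H = W = 4, C = 3, P = 2); `patchify` is a hypothetical helper name, and the linear projection to D dimensions, the [class] embedding, and the position embeddings are deliberately left out.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Returns a list of N = (H * W) / patch**2 vectors, each of length
    patch * patch * C, in row-major patch order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches

# Toy example (assumed sizes): H = W = 4, C = 3, P = 2.
H, W, C, P = 4, 4, 3, 2
image = [[[(r * W + c) * C + ch for ch in range(C)] for c in range(W)]
         for r in range(H)]
patches = patchify(image, P)
print(len(patches), len(patches[0]))  # N and P*P*C
```

Each returned vector corresponds to one row of the $N \times (P^2 \cdot C)$ matrix; in the real model every row would then be multiplied by a learned projection matrix.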

Proposed method: Vision Transformer (ViT) (3)
• MSA: multi-headed self-attention
• The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
(Figure annotations: Patch Embeddings, Position Embeddings)
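For reference, the forward pass described on these slides can be summarized as follows. The notation follows the ViT paper: $\mathbf{E}$ is the patch projection, $\mathbf{E}_{pos}$ the position embeddings, LN is layer normalization, and $L$ is the number of encoder layers. Treat it as a summary rather than a full specification.

```latex
\begin{aligned}
\mathbf{z}_0 &= [\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \dots;\ \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos},
  & \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D} \\
\mathbf{z}'_\ell &= \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, & \ell = 1 \dots L \\
\mathbf{z}_\ell &= \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, & \ell = 1 \dots L \\
\mathbf{y} &= \mathrm{LN}(\mathbf{z}_L^0)
\end{aligned}
```

The first line covers the patch embeddings, the [class] embedding, and the position embeddings; the last line reads the image representation off the [class] position of the final layer.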

Experiments: Datasets
• Datasets used for pre-training
  • Datasets of various sizes are used to study model scalability
  • ILSVRC-2012 ImageNet dataset
    • 1k classes, 1.3M images
  • ImageNet-21k [Deng+, 2009]
    • 21k classes, 14M images
  • JFT [Sun+, 2017]
    • 18k classes, 303M high-resolution images
• Preprocessing and similar details follow [Kolesnikov+, 2020]

Experiments: Benchmarks
• Transfer to benchmark tasks
  • ImageNet on the original validation labels
  • ImageNet on the cleaned-up ReaL labels [Beyer+, 2020]
  • CIFAR-10/100 [Krizhevsky, 2009]
  • Oxford-IIIT Pets [Parkhi+, 2012]
  • Oxford Flowers-102 [Nilsback & Zisserman, 2008]
  • Preprocessing and similar details follow [Kolesnikov+, 2020]
• 19-task VTAB classification suite [Zhai+, 2019b]
  • low-data transfer to diverse tasks (1,000 training examples per task)

Experiments: Models
• Model variants are named following BERT
  • ViT-L/16 is the "Large" variant with an input patch size of 16×16
• Baselines
  • Baseline CNNs: ResNet (BiT), a modified version of ResNet [He+, 2016]
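The naming scheme can be unpacked mechanically. The sketch below pairs a hypothetical helper, `parse_vit_name`, with the Base/Large/Huge configurations from the paper's model-variant table (layers, hidden size D, MLP size, heads); the helper name and the dictionary layout are my own choices for illustration.

```python
# Configurations as listed in the ViT paper's model-variant table.
VIT_VARIANTS = {
    "B": {"name": "Base", "layers": 12, "hidden": 768, "mlp": 3072, "heads": 12},
    "L": {"name": "Large", "layers": 24, "hidden": 1024, "mlp": 4096, "heads": 16},
    "H": {"name": "Huge", "layers": 32, "hidden": 1280, "mlp": 5120, "heads": 16},
}

def parse_vit_name(model_name: str) -> dict:
    """Parse a name such as 'ViT-L/16' into its size config and patch size."""
    prefix, _, rest = model_name.partition("-")
    size, _, patch = rest.partition("/")
    if prefix != "ViT" or size not in VIT_VARIANTS or not patch.isdigit():
        raise ValueError(f"unrecognized model name: {model_name!r}")
    return {**VIT_VARIANTS[size], "patch": int(patch)}

config = parse_vit_name("ViT-L/16")
print(config["name"], config["layers"], config["patch"])  # Large 24 16
```

Note that a smaller patch size means a longer input sequence, so ViT-H/14 is more expensive than its parameter count alone suggests.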

Experiments: Training details
• Pre-training
  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$)
    • Confirmed that it also outperforms SGD for ResNets
  • Batch size: 4096
  • Weight decay: 0.1
  • Linear learning rate warmup and decay
• Fine-tuning
  • Optimizer: SGD with momentum
  • Batch size: 512
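The slide only states "linear learning rate warmup and decay"; one common way to realize that is sketched below. The function name, the decay-to-zero endpoint, and all numeric values are assumptions for illustration, not the paper's hyperparameters.

```python
def linear_warmup_linear_decay(step: int, base_lr: float,
                               warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

# Illustrative values only; they are not the paper's hyperparameters.
base_lr, warmup_steps, total_steps = 1e-3, 100, 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_linear_decay(step, base_lr, warmup_steps, total_steps))
```

The rate peaks at `base_lr` exactly when warmup ends and reaches zero at `total_steps`; other decay shapes (for example cosine) would slot into the same structure.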

Experiments: Comparison with SOTA (1)
• The large proposed models, ViT-H/14 and ViT-L/16, are compared with SOTA CNN models
  • Big Transfer (BiT) [Kolesnikov+, 2020]
  • Noisy Student [Xie+, 2020]
    • a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed.
• All models were trained on TPUv3, and training time is also reported

Experiments: Comparison with SOTA (2)
• ViT-L/16 (smaller), pre-trained on JFT, outperforms BiT-L
• ViT-H/14 (larger) achieves even better accuracy
  • Especially on the more difficult tasks

Experiments: Comparison with SOTA (3)
• VTAB performance
  • Natural: Pets, CIFAR, etc.
  • Specialized: medical and satellite imagery
  • Structured: tasks that require geometric understanding like localization

Experiments: On dataset size (1)
• Pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M.

Experiments: On dataset size (2)
• Train ViT on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset.
• The inductive bias of CNNs is useful on smaller datasets. On larger datasets, however, learning the relevant patterns directly from data is sufficient, or even beneficial.
• Further analysis of few-shot properties of ViT is an exciting direction of future work.

Experiments: Scaling Study
• The scaling behavior of various models is evaluated on JFT-300M
• ViT has the advantage over ResNets on the performance/compute trade-off
• Hybrids help at small scale, but the benefit largely disappears for large models

Experiments: What is happening inside ViT
• Self-attention allows ViT to integrate information across the entire image even in the lowest layers.
• Other attention heads have consistently small attention distances in the low layers.
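"Attention distance" here is the average distance between a query position and the positions it attends to, weighted by the attention weights. A minimal sketch for a single query patch follows; the grid size, patch size, and the two synthetic heads are assumptions for illustration, and the [class] token is ignored.

```python
import math

def mean_attention_distance(weights, grid, patch, query):
    """Attention-weighted mean distance (in pixels) from one query patch.

    weights: attention weights over all grid*grid patches (summing to 1),
             in row-major order.
    grid:    number of patches per side.
    patch:   patch size in pixels.
    query:   (row, col) of the query patch.
    """
    q_row, q_col = query
    total = 0.0
    for index, weight in enumerate(weights):
        row, col = divmod(index, grid)
        distance = math.hypot(row - q_row, col - q_col) * patch
        total += weight * distance
    return total

grid, patch = 4, 16  # toy 4x4 grid of 16x16-pixel patches (assumed sizes)
n = grid * grid

local_head = [0.0] * n
local_head[0] = 1.0             # attends only to the query patch itself
global_head = [1.0 / n] * n     # attends uniformly to the whole image

local_distance = mean_attention_distance(local_head, grid, patch, (0, 0))
global_distance = mean_attention_distance(global_head, grid, patch, (0, 0))
print(local_distance, global_distance)
```

A head like `global_head` has a large mean distance even in a low layer, while a head like `local_head` behaves more like a small convolutional receptive field.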

What the experiments tell us
• Transformers (for CV) do not perform well unless the training dataset is large.
  • Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
  • mid-sized datasets such as ImageNet
• When trained on large-scale data, they approach or surpass SOTA.
  • the picture changes if the models are trained on larger datasets (14M-300M images)
    • ImageNet-21k
    • JFT-300M

Conclusion
• Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP.
• challenges remain
  • apply ViT to other computer vision tasks
  • continue exploring self-supervised pre-training methods
    • there is still large gap between self-supervised and large-scale supervised pre-training
  • further scaling

Digression

Question: relationship to BoW
• A representative example of an NLP method being applied to CV is Bag-of-Words (BoW) becoming Bag-of-Visual-Words (BoVW).
• Can this paper be seen as part of the same trend?

Question: relationship to BoW (already answered)
• This was discussed on Open Review.

BoVW, Positional Encoding
• A reviewer asked why BoVW is not discussed, and whether using positional encoding on images is meaningful.
• ViT != BoW
  • The authors argue that the two are completely different in concept,
  • as ViT models interaction between all patches throughout the whole network through global self-attention layers.
• Positional embeddings are important for taking the location of each patch into account; the accuracy gain is discussed in the appendix.
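The role of position embeddings in this argument can be illustrated with a toy experiment. Without them, self-attention followed by pooling gives the same result for any ordering of the patches, much like a bag of (visual) words; with them, the order matters. The sketch below is a deliberately simplified stand-in, not ViT itself: one attention head with identity Q/K/V projections, mean pooling instead of a [class] token, and made-up embedding values.

```python
import math

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        norm = sum(exps)
        outputs.append([sum(e / norm * value[d] for e, value in zip(exps, tokens))
                        for d in range(dim)])
    return outputs

def pooled(tokens, position_embeddings=None):
    """Mean-pooled attention output, optionally adding position embeddings."""
    if position_embeddings is not None:
        tokens = [[t + p for t, p in zip(tok, pos)]
                  for tok, pos in zip(tokens, position_embeddings)]
    out = self_attention(tokens)
    return [sum(col) / len(out) for col in zip(*out)]

patches = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]     # toy patch embeddings
shuffled = [patches[2], patches[0], patches[1]]     # same patches, new order
positions = [[0.0, 0.0], [0.3, 0.1], [-0.2, 0.4]]   # made-up position embeddings

same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(pooled(patches), pooled(shuffled)))
differs = any(not math.isclose(a, b, abs_tol=1e-9)
              for a, b in zip(pooled(patches, positions),
                              pooled(shuffled, positions)))
print(same, differs)  # True True
```

Even in the order-invariant case the model is more than a histogram of patches, because every patch still interacts with every other patch through attention; that interaction is the point the authors make in their reply.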

Appendix: On Position Embeddings
• while there is a large gap between the performances of the model with no positional embedding and models with positional embedding, there is little to no difference between different ways of encoding positional information.
• We speculate that since our Transformer encoder operates on patch-level inputs, as opposed to pixel-level, the differences in how to encode spatial information is less important.

Discussion on Open Review (1)
• Q: why not perform pre-training using the autoregressive language model (LM) or masked LM like GPT and BERT pre-training?
• A: in our experience supervised training typically allows for better performance with the same amount of compute. (omitted) How to do it best is a matter of future research.
• Q: "An image is worth 16x16 words", what does it mean?
• A: This is merely a wordplay based on the fact that our largest model (H/14), when trained on the standard ImageNet resolution 224x224 pixels, splits the input image into 16x16=256 patches, and we feed these patches to a transformer in the same way words are fed to transformers in NLP.
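The arithmetic behind the wordplay is easy to check. `patch_grid` is a hypothetical helper, and a square input image is assumed.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    per_side = image_size // patch_size
    return per_side, per_side * per_side

print(patch_grid(224, 14))  # ViT-H/14: (16, 256), the "16x16 words"
print(patch_grid(224, 16))  # 16x16 patches: (14, 196)
```

So the "16x16" in the title refers to the grid of patches produced by the H/14 model at 224x224 resolution, not to the patch size of the /16 models.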

Discussion on Open Review (2)
• This level of experimental verifications is only possible if huge computation resources are available, which is not accessible for most research teams, esp in academia.
• Cons: No significant technical novelty
  • The proposed model is incremental modifications of the original Transformer and its existing variants.
• The authors do not appear to give an explicit answer to this; perhaps it was addressed by "Added additional technical details and polished the text throughout."?

Bonus
• I want to popularize Japanese Transformers with the Hugging Face libraries.
• The Japanese thread, part of "Languages at Hugging Face"
  • https://discuss.huggingface.co/t/japanese-nlp-introductions/3799


More Decks by Yusuke Mori
