
How the Vision Transformer Works

himidev
October 13, 2021

A beginner-oriented guide to understanding the Transformer and the Vision Transformer.


Transcript

  1. Computer vision tasks and their major models: object detection, trajectory forecasting, vision & language, object tracking, neural architecture search, reinforcement learning, generative models, attention, explanation, pose estimation, point clouds, graphs, 3D recognition, motion capture, matching, re-identification, distillation, captioning, dense prediction, segmentation, efficient models, video, domain adaptation, face, adversarial robustness, scene understanding, depth, single-image super-resolution — a landscape historically dominated by Convolutional Neural Networks and Recurrent Neural Networks.
  2. The age of the Transformer: the same landscape of tasks (detection, forecasting, vision & language, tracking, NAS, reinforcement learning, generation, pose estimation, point clouds, 3D recognition, re-identification, captioning, segmentation, video, depth, super-resolution, etc.) is now increasingly covered by the Transformer, whose citation count at the time of this talk is already in the tens of thousands.
  3. Transformer [A. Vaswani+, NeurIPS]
     • A model built only from the attention mechanism.
       - Replaced RNNs and CNNs as the state of the art for text generation and machine translation.
       - Thanks to its broad applicability it has also reached SoTA on other tasks (object detection, image captioning, image generation, etc.).
     [Figure: the Transformer model architecture, encoder and decoder stacks.]
     Examples: Knowledge Distillation [G. Aguilar+, AAAI], Object Detection [N. Carion+, ECCV], Image Generation [H. Zhang+, ICML], Image Captioning [S. Herdade+, NeurIPS].
  4. Background of the Transformer
     Seq2Seq (encoder–decoder over x_{t-1}, x_t → x̂_{t+1}, x̂_{t+2}, …, <EOS>):
       ☺ can reflect the dependencies (context) between input time steps
       ☹ hard to parallelize the computation
       ☹ strongly biased toward the most recent features
     Attention Seq2Seq:
       ☺ obtains per-time-step features regardless of sentence length
       ☺ captures correspondences between input and output features
       ☹ hard to parallelize the computation
       ☹ hard to model long-range dependencies
     Transformer:
       ☺ captures long-range correspondences between input and output features
       ☺ the whole input sequence is fed at once, so the computation can be parallelized
       ☹ stacking many layers makes the amount of computation large
  5. Features of the Transformer (encoder–decoder)
     • Built only from attention mechanisms
       - parallel computation, as with a CNN
       - long-range dependency modelling, as with an RNN
     • Positional Encoding
       - embeds position information at every time step
       - keeps context information without using an RNN
     • Self-Attention modules
       - capture correspondences between input and output features over long ranges
  6. Positional Encoding
     • Embed position information at every time step
       - keeps the sequence in its correct order
       - conceptually, adds the relative/absolute position information that an RNN or CNN obtains implicitly
     An RNN reads the tokens of "私 / は / リンゴ / が / 好き / です" ("I like apples") one per time step t, so order is implicit; the Transformer does not know t and must be given it explicitly.
     Formulation:
       PE(pos, 2i)   = sin(pos / 10000^{2i/d_model})
       PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
     where pos is the position in the sequence, i indexes the PE dimension, and d_model is the number of PE dimensions.
     → The PE wavelengths form a geometric progression from 2π to 10000·2π.
     [Figure: visualization of the positional encoding over sequence position × PE dimension.]
  7. Positional Encoding (continued)
     Same formulation as above. With the sin function alone, different positions can end up with identical position values; adding the cos function assigns distinct position information to every position.
     [Figure: visualization of the positional encoding.]
  8. Positional Encoding (continued)
     Visualizing each PE dimension along the sequence: the value of the sin/cos waveform at each position is what is given as explicit position information. Dimensions with a small index have a short period; dimensions with a large index have a long period. (A NumPy sketch follows below.)
     [Figure: sin and cos waves for several PE dimensions plotted against sequence position.]
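As a concrete reference, the formulas above can be written in a few lines of NumPy. This is a minimal sketch, not the presenter's code; the sequence length of 50 and d_model of 512 are arbitrary illustrative choices.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000**(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1) sequence positions
    i2 = np.arange(0, d_model, 2)[None, :]          # 2i, the even dimension indices
    angles = pos / np.power(10000.0, i2 / d_model)  # wavelengths grow geometrically
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                    # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # (50, 512); low dimensions oscillate quickly, high dimensions slowly
```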
  9. Scaled Dot-Product Attention
     • The key component of the Transformer
       - a module inside Multi-Head Attention
       - used in both the encoder and the decoder
       - self-attention built from Query, Key and Value
     Formulation:
       Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,  where d_k is the Query/Key dimensionality.
     Why the scaling? If the entries of Q and K have mean 0 and variance 1, their dot products have mean 0 and variance d_k. When some dot products are very large, the softmax gradient becomes extremely small for every element except the largest one. Scaling QKᵀ by 1/√d_k brings the scores back to mean 0 and variance 1 and gives smoother gradients. (A short sketch follows below.)
     [Figure: Scaled Dot-Product Attention and Multi-Head Attention diagrams from the original paper.]
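A minimal NumPy sketch of this formula (illustrative shapes only, not the presenter's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # without /sqrt(d_k) the scores have variance d_k
    weights = softmax(scores, axis=-1)        # attention weights; each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                  # (5, 64) (5, 5)
```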
  10. Self-Attention in detail (1)
      Each input token x_i is embedded (with the positional encoding added) into a feature e_i; the Query, Key and Value features are then obtained from e_i with separate linear transformations:
        q_i = W_q e_i,  k_i = W_k e_i,  v_i = W_v e_i
      [Figure: five embedded tokens e_1 … e_5, each mapped to (q_i, k_i, v_i).]
  11. Self-Attention in detail (2)
      Take the dot products between the Query and Key features and apply the softmax to obtain the relevance between sequence elements (the attention weights):
        α̂ = softmax(QKᵀ / √d_k)
      [Figure: the 5×5 score matrices α_{i,j} and their row-wise softmax α̂_{i,j}.]
  12. Self-Attention in detail (3)
      Multiply the attention weights by the Value features and sum them; this injects the related information and captures the relationships between features at different time steps:
        Attention(Q, K, V) = α̂ V
      [Figure: each output_i is the weighted sum Σ_j α̂_{i,j} v_j.]
      A single-head sketch combining steps (1)–(3) follows below.
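The following sketch puts slides 10–12 together for one head. The random embeddings and projection matrices are stand-ins for learned parameters.

```python
import numpy as np
rng = np.random.default_rng(0)

n_tokens, d_model = 5, 64                       # five embedded tokens e_1 ... e_5
E = rng.normal(size=(n_tokens, d_model))        # embeddings with positional encoding already added

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
Q, K, V = E @ W_q, E @ W_k, E @ W_v             # q_i = W_q e_i, k_i = W_k e_i, v_i = W_v e_i

scores = Q @ K.T / np.sqrt(d_model)             # query-key dot products, scaled
scores -= scores.max(axis=-1, keepdims=True)
alpha_hat = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax
output = alpha_hat @ V                          # weighted sum of the Value features

print(alpha_hat[0])                             # how token 1 attends to tokens 1..5
print(output.shape)                             # (5, 64)
```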
  13. Multi-Head Attention
      • Prepare h copies of the procedure above and run them independently.
        - each head attends to different time steps and therefore extracts different features
        - the multi-head structure increases representational power and can be expected to improve accuracy
      The h head outputs are concatenated and passed through a final linear projection. (From the original paper: Q, K and V are linearly projected h times to d_k, d_k and d_v dimensions; dot-product attention is much faster and more space-efficient than additive attention because it can use highly optimized matrix multiplication, and the 1/√d_k scaling counteracts the large dot products that arise for large d_k.)
      [Figure: h parallel scaled dot-product attention heads followed by Concat and a linear layer.]
      A sketch follows below.
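A minimal sketch of the multi-head computation. It uses the common efficient formulation in which one large projection is sliced into per-head sub-spaces; this is an illustrative assumption, equivalent in shape to using h separate projections.

```python
import numpy as np
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, W_q, W_k, W_v, W_o):
    """Run scaled dot-product attention in h parallel sub-spaces, concatenate, project."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):                           # each head sees its own d_head-dim slice
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_o        # Concat, then the output projection

X = rng.normal(size=(5, 64))
W_q, W_k, W_v, W_o = (rng.normal(size=(64, 64)) * 0.1 for _ in range(4))
print(multi_head_attention(X, n_heads=8, W_q=W_q, W_k=W_k, W_v=W_v, W_o=W_o).shape)  # (5, 64)
```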
  14. Overall processing of a Transformer block
      Input: a word sequence, e.g. "私は犬が好きだ。" ("I like dogs."), embedded into feature vectors of shape (num. words × num. dims).
        - self-attention: mixes the vectors along the time axis. Query, Key and Value are produced by W_q, W_k, W_v; the query–key matrix product (num. words × num. words) stores, in each row, how important every other word is to that word; a row-wise softmax gives α̂, which is multiplied with the Value features and projected by W_out.
        - normalization
        - feed-forward: transforms each vector individually along the depth direction (W_feed1, then W_feed2)
        - normalization
      The dashed block (self-attention → norm → feed-forward → norm) is repeated N times. A sketch follows below.
      [Figure: the Transformer encoder, a stack of N = 6 identical layers.]
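A minimal sketch of one encoder block, assuming a post-norm layout as on the slide. The attention function is left as a placeholder (the single-head sketch above could be dropped in); weights, sizes and N = 6 are illustrative.

```python
import numpy as np
rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_block(x, attn_fn, ffn_fn):
    """Self-attention mixes tokens along the time axis; the feed-forward layer transforms
    each token individually along depth; each sub-layer has a residual + normalization."""
    x = layer_norm(x + attn_fn(x))
    x = layer_norm(x + ffn_fn(x))
    return x

d_model, d_ff = 64, 256
W1, W2 = rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1
ffn = lambda x: np.maximum(x @ W1, 0.0) @ W2   # position-wise feed-forward (W_feed1, W_feed2)
attn = lambda x: x                             # stand-in for the self-attention sketched earlier

x = rng.normal(size=(7, d_model))              # 7 word vectors, e.g. "私/は/犬/が/好き/だ/。"
for _ in range(6):                             # repeat the dashed block N times (N = 6 here)
    x = encoder_block(x, attn, ffn)
print(x.shape)                                 # (7, 64)
```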
  15. Decoder processing at inference time (1)
      The encoder turns the input word sequence ("私は犬が好きだ。") into feature vectors with N repetitions of Multi-Head Attention → normalization → Feed-Forward → normalization.
      The decoder starts from <EOS> and, at each step, applies Masked Multi-Head Attention → normalization → Multi-Head Attention over the encoder output (Q from the decoder, K and V from the encoder) → normalization → feed-forward → normalization → classifier, emitting "I", then "like", then "dogs", ".", until END; each decoder block is likewise repeated N times.
      (From the original paper: both encoder and decoder are stacks of N = 6 identical layers with residual connections and layer normalization, all sub-layers producing outputs of dimension d_model = 512; the decoder adds a third sub-layer that attends over the encoder output, and its self-attention is masked so that the prediction for position i can depend only on positions before i.)
  16. Decoder processing at inference time (2): the mask
      Masked Multi-Head Attention applies a mask so that information from future positions is not propagated: in the (num. words × num. words) score matrix computed from query and key, the entries corresponding to positions after the current one are masked out before the softmax, so α̂ puts no weight on them (illustrated on the slide for different numbers of input words). A sketch follows below.
      [Figure: masked attention-weight matrices in the decoder.]
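A common way to implement this mask is to add -inf to the future positions of the score matrix before the softmax; this sketch assumes that convention (it is not necessarily the presenter's exact implementation).

```python
import numpy as np

def causal_mask(n):
    """-inf above the diagonal: position i may not attend to positions j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                   # Q K^T / sqrt(d_k) for 4 decoder tokens
masked = scores + causal_mask(4)                   # future positions become -inf ...
w = np.exp(masked - masked.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)                 # ... and get weight 0 after the softmax
print(np.round(w, 2))                              # upper triangle is all zeros
```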
  17. Comparison on a trajectory prediction task
      • Accuracy comparison of two models, an LSTM and a Transformer, on trajectory prediction.
      • Experimental setup
        - dataset: ETH/UCY, bird's-eye-view footage of pedestrians in urban scenes (pedestrians, cars, bicycles), composed of five scenes
        - trained and evaluated with leave-one-out over the scenes
        - trained with the Adam optimizer, with a fixed number of epochs, batch size and learning rate
        - evaluation metrics (sketched below):
          • ADE: mean L2 distance between the ground truth and the prediction over all predicted time steps
          • FDE: L2 distance between the ground truth and the prediction at the final predicted time step
      Prediction example cited from H. Minoura+, "Path predictions using object attributes and semantic environment", VISAPP.
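The two metrics are easy to state in code; this is an illustrative sketch with made-up 2D trajectories.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (timesteps, 2) trajectories.
    ADE: mean L2 error over all predicted timesteps.
    FDE: L2 error at the final predicted timestep."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return dists.mean(), dists[-1]

pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])
gt   = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]])
ade, fde = ade_fde(pred, gt)
print(round(ade, 3), round(fde, 3))   # 0.2 0.1
```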
  18. Trajectory prediction results and Transformer parameters
      [Table: ADE/FDE per scene (ETH, HOTEL, UCY, ZARA1, ZARA2, AVG), parameter count, training time [sec] and inference time [sec] for the LSTM and the Transformer; the Transformer has 44,144,642 parameters. Training and inference times are measured per sample.]
      Transformer parameter breakdown:
        Encoder parameters: 18,916,864
        Decoder parameters: 25,226,752
        Output MLP (512 × 2): 1,026
        Total: 44,144,642
  19. Transformer parameter details: Encoder
      Num. × sub-layer (shape) — parameters per layer → total:
        1 × Embedding MLP (2 × 512): 1,536 → 1,536
        1 × Positional Encoding (512): 0 (not learned)
        6 × Layer Norm (512 + 512): 1,024 → 6,144
        6 × Self-Attention block, Query/Key/Value MLPs (512 × 512 each): 787,968 → 4,727,808
        6 × Feed Forward MLP (512 × 512): 262,656 → 1,575,936
        6 × Layer Norm (512 + 512): 1,024 → 6,144
        6 × Residual-connection MLPs (512 × 2048 and 2048 × 512): 2,099,712 → 12,598,272
        1 × Layer Norm (512 + 512): 1,024 → 1,024
        Encoder total: 18,916,864
      [Figure: the Transformer model architecture (encoder stack of N = 6 layers, residual connections and layer normalization, d_model = 512).]
  20. Transformer parameter details: Decoder
      Same layout as the encoder, plus a second (masked) attention block per layer:
        1 × Embedding MLP (2 × 512): 1,536
        1 × Positional Encoding (512): 0
        6 × Layer Norm (512 + 512): 6,144
        6 × Masked Self-Attention block, Query/Key/Value MLPs (512 × 512 each): 4,727,808
        6 × Feed Forward MLP (512 × 512): 1,575,936
        6 × Layer Norm (512 + 512): 6,144
        6 × Self-Attention block, Query/Key/Value MLPs (512 × 512 each): 4,727,808
        6 × Feed Forward MLP (512 × 512): 1,575,936
        6 × Layer Norm (512 + 512): 6,144
        6 × Residual-connection MLPs (512 × 2048 and 2048 × 512): 12,598,272
        1 × Layer Norm (512 + 512): 1,024
        Decoder total: 25,226,752
      [Figure: the Transformer model architecture (decoder stack of N = 6 layers with masked self-attention and encoder–decoder attention).]
  21. What makes ViT stronger than a CNN?
      • ViT can capture features analogous to a CNN's receptive field [1].
        - the image is split into patches and the Transformer learns the relationships between patches
        - it can capture whole-image features that a CNN cannot
      [Figure: a CNN captures features only within its local receptive field after a convolution, whereas ViT splits the image into patches and relates the whole image through the Transformer.]
      [1] J. Cordonnier+, "On the Relationship between Self-Attention and Convolutional Layers", ICLR.
  22. Network overview
      • The image is split into fixed-size patches and the Transformer Encoder extracts features for each patch.
      [Figure: ViT model overview — patch + position embedding, an extra learnable [class] embedding, the Transformer Encoder (Norm → Multi-Head Attention → Norm → MLP, ×L), and an MLP head producing the class (Bird, Ball, Car, …).]
  23. CLS token
      • A new CLS token is added for the classification problem.
        - it is a learnable parameter
        - the class is predicted from the Transformer Encoder's output at the CLS token
      [Figure: ViT model overview with the extra learnable [class] embedding highlighted.]
  24. Embedding / Position Embedding
      • The input image is split into patches, each patch region is embedded, position information is added, and the result is fed to ViT.
        - each flattened patch x_p^i ∈ ℝ^{P²·C} is projected with E ∈ ℝ^{(P²·C)×D} (in the figure the image is split into just x_p^1 … x_p^4)
        - a [cls] token is prepended as the starting point
        - a learnable position embedding E_pos ∈ ℝ^{(N+1)×D} is added; positions are never given explicitly but are learned automatically through these differentiable parameters
        - the embedded patches are treated exactly like the Transformer's embedded word features; the image features are aggregated by ViT and classified at the final layer
        (N: number of patches, D: embedding dimension, C: number of channels, P: patch size; values below follow the Base model.)
      A sketch of the patch embedding follows below.
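A minimal sketch of turning an image into the (N+1) × D token matrix fed to the encoder. The 224 × 224 input, P = 16 and D = 768 follow the ViT-Base convention, but the random projection, cls token and position embedding are stand-ins for learned parameters.

```python
import numpy as np
rng = np.random.default_rng(0)

H = W = 224; C = 3; P = 16; D = 768                    # Base-model-like sizes (illustrative)
N = (H // P) * (W // P)                                # number of patches: 14 * 14 = 196

img = rng.normal(size=(H, W, C))

# Split into P x P patches and flatten each to a (P*P*C)-dim vector: x_p^i in R^{P^2 * C}
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4).reshape(N, P * P * C)

E = rng.normal(size=(P * P * C, D)) * 0.02             # learned linear projection E
cls_token = rng.normal(size=(1, D)) * 0.02             # learnable [cls] token
E_pos = rng.normal(size=(N + 1, D)) * 0.02             # learnable position embedding E_pos

tokens = np.concatenate([cls_token, patches @ E], axis=0) + E_pos
print(tokens.shape)                                    # (197, 768) -> fed to the Transformer encoder
```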
  25. Embedding / Position Embedding (continued)
      Same pipeline as above. The learned linear embedding E ends up resembling the filters of a CNN's early, low-level layers.
      [Figure 7 (left) of the ViT paper: filters of the initial linear embedding of RGB values of ViT-L/32.]
  26. Embedding / Position Embedding (continued)
      Same pipeline as above. Visualizing the similarity between one position embedding and all the others shows that similarity is high with nearby positions and low with distant ones: nearby position embeddings are learned to take similar values.
      [Figure 7 (center) of the ViT paper: cosine similarity of the position embeddings of ViT-L/32.]
  27. Overall processing of a Vision Transformer block
      • Transformer: the input is a word sequence ("私は犬が好きだ。"); self-attention mixes the vectors along the time axis, the feed-forward layer transforms each vector individually along the depth direction, each step followed by normalization, and the dashed block is repeated N times.
      • Vision Transformer: the input is the sequence of patch feature vectors plus the cls token; self-attention mixes the vectors along the spatial direction, the feed-forward layer transforms them individually along depth, repeated N times. Classification is performed using only the cls token.
  28. Feature representations acquired by ViT
      • ViT can capture features analogous to a CNN's receptive field [1]: the image is split into patches and the Transformer learns inter-patch features, capturing whole-image structure a CNN cannot.
      • Changing the patch size is expected to make ViT applicable to a variety of tasks [2][3][4]: which features of the image are captured depends on the patch size.
      • CNNs rely mainly on texture, whereas ViT relies more on object shape.
      [Figures: model sizes (ViT-B/L/H vs. ResNet variants) and robustness benchmarks on ILSVRC-2012, ImageNet-C/R/A from [2]; accuracy on original/greyscale/silhouette/edges/texture stimuli from [3]; shape-vs-texture bias of ResNet-50, AlexNet, VGG-16, GoogLeNet, ViT-B/16, ViT-L/32 and humans on the Stylized-ImageNet (SIN) dataset from [4].]
      [1] J. Cordonnier+, "On the Relationship between Self-Attention and Convolutional Layers", ICLR.
      [2] S. Bhojanapalli+, "Understanding Robustness of Transformers for Image Classification", arXiv.
      [3] R. Geirhos+, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness", ICLR.
      [4] S. Tuli+, "Are Convolutional Neural Networks or Transformers more like human vision?", arXiv.
  29. ViT model sizes
      • The parameter count grows rapidly with model size.
        Model      Layers  Hidden size D  MLP size  Heads  Params
        ViT-Base   12      768            3072      12     86M
        ViT-Large  24      1024           4096      16     307M
        ViT-Huge   32      1280           5120      16     632M
      (Table 1 of the ViT paper: the Base and Large configurations follow BERT, with the larger Huge model added; the 19-task VTAB suite is also used for evaluation.)
  30. Pre-training and transfer learning
      • Pre-training datasets
        - ImageNet: roughly 1.3M images, 1k classes
        - ImageNet-21k: roughly 14M images, 21k classes
        - JFT-300M (private): roughly 300M images, 18k classes
      • Transfer-learning datasets (the pre-trained models above are fine-tuned on these)
        - ImageNet, ImageNet-ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB (19 tasks)
  31. Transfer learning with ViT
      • Replace the MLP head and train the whole model.
        - pre-training on ImageNet-21k: the MLP head maps the cls-token feature (dims × number of pre-training classes) to a predicted class
        - transfer to CIFAR: only the MLP head is swapped to match the new number of classes (dims × number of target classes); the ImageNet-21k pre-trained Transformer Encoder is then re-trained (fine-tuned) together with the new head
      A sketch follows below.
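A minimal sketch of the head swap, assuming the cls-token feature size of the Base model; the class counts and weights here are illustrative stand-ins, not the presenter's configuration.

```python
import numpy as np
rng = np.random.default_rng(0)

D = 768                                   # cls-token feature size of the pre-trained encoder
n_classes_pretrain = 21_000               # order-of-magnitude stand-in for the ImageNet-21k head
n_classes_target = 100                    # e.g. CIFAR-100

W_head_pretrain = rng.normal(size=(D, n_classes_pretrain)) * 0.02   # discarded at transfer time
W_head_new = np.zeros((D, n_classes_target))                        # new head, sized dims x classes,
                                                                    # fine-tuned together with the encoder

cls_feature = rng.normal(size=(D,))       # cls-token output of the (re-trained) encoder
print((cls_feature @ W_head_new).shape)   # (100,) logits for the target task
```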
  32. Comparison with the state of the art
      • Pre-train on JFT-300M, then transfer to each dataset (accuracy, mean ± std over three fine-tuning runs):
                             ViT-H/14 (JFT)  ViT-L/16 (JFT)  ViT-L/16 (I21k)  BiT-L (R152x4)  Noisy Student (EffNet-L2)
        ImageNet             88.55 ± 0.04    87.76 ± 0.03    85.30 ± 0.02     87.54 ± 0.02    88.4 / 88.5*
        ImageNet ReaL        90.72 ± 0.05    90.54 ± 0.03    88.62 ± 0.05     90.54           90.55
        CIFAR-10             99.50 ± 0.06    99.42 ± 0.03    99.15 ± 0.03     99.37 ± 0.06    -
        CIFAR-100            94.55 ± 0.04    93.90 ± 0.05    93.25 ± 0.05     93.51 ± 0.08    -
        Oxford-IIIT Pets     97.56 ± 0.03    97.32 ± 0.11    94.67 ± 0.15     96.62 ± 0.23    -
        Oxford Flowers-102   99.68 ± 0.02    99.74 ± 0.00    99.61 ± 0.02     99.63 ± 0.03    -
        VTAB (19 tasks)      77.63 ± 0.23    76.28 ± 0.46    72.72 ± 0.21     76.29 ± 1.70    -
        TPUv3-core-days      2.5k            0.68k           0.23k            9.9k            12.3k
      (*Slightly improved 88.5% result reported in Touvron et al.)
      ※ TPUv3-core-days: the number of TPUv3 cores used for training (2 per chip) multiplied by the training time in days.
      → ViT pre-trained on JFT-300M achieves SoTA on all of these datasets while using substantially less pre-training compute; ViT pre-trained on the smaller public ImageNet-21k also performs well.
  33. Results of transfer to ImageNet (1)
      • Models pre-trained on ImageNet-21k or JFT-300M are transferred to ImageNet.
        - ViT can be expected to improve only when it is pre-trained on a very large dataset
      [Figures 3 and 4 of the ViT paper: large ViT models perform worse than BiT ResNets when pre-trained on small datasets but shine with large pre-training data; linear few-shot evaluation on ImageNet versus pre-training size.]
  34. Results of transfer to ImageNet (2)
      • Experiments varying the number of JFT-300M images used for pre-training.
        - CNN-based models (BiT ResNets) do better when the number of images is small, but their accuracy plateaus as it grows
        - ViT keeps improving as the number of pre-training images increases
      [Figure 4 of the ViT paper: linear few-shot evaluation on ImageNet versus pre-training set size — ResNets perform better with smaller pre-training datasets but plateau sooner than ViT.]
  35. Attention Rollout [S. Abnar+, ACL] [5]
      • Attention maps are visualized with Attention Rollout.
        - Attention Rollout explains, from the attention weights, why the model identified a particular class
        - the attention weights of every layer are multiplied together
        - the relevance between the CLS token and each patch indicates which patch features mattered most
        - this indirectly gives an explanation for the inference result
      [Figure: attention-map outputs across the L layers of the Transformer encoder.]
      [5] S. Abnar+, "Quantifying Attention Flow in Transformers", ACL.
  36. Attention Rollout [S. Abnar+, ACL] [5] (continued)
      Same procedure as above; example visualizations produced with Attention Rollout, cited from [6]. A sketch of the rollout computation follows below.
      [Figure: input images and their Attention Rollout heat maps.]
      [5] S. Abnar+, "Quantifying Attention Flow in Transformers", ACL.
      [6] A. Dosovitskiy+, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR.
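A minimal sketch of the rollout computation under the usual assumptions (head-averaged attention matrices, residual connection folded in as an identity term); the toy attention matrices are random stand-ins for a real model's weights.

```python
import numpy as np

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of (n_tokens, n_tokens) head-averaged attention matrices.
    Multiply the per-layer maps to trace how information flows to the cls token."""
    n = attn_per_layer[0].shape[0]
    rollout = np.eye(n)
    for A in attn_per_layer:
        A_hat = 0.5 * A + 0.5 * np.eye(n)            # account for the residual connection
        A_hat /= A_hat.sum(axis=-1, keepdims=True)   # keep each row normalized
        rollout = A_hat @ rollout
    return rollout

# toy example: 12 layers, 1 cls token + 196 patch tokens
rng = np.random.default_rng(0)
attn = [rng.random((197, 197)) for _ in range(12)]
attn = [a / a.sum(axis=-1, keepdims=True) for a in attn]
relevance = attention_rollout(attn)[0, 1:]           # cls-token row, patch columns
print(relevance.shape)                               # (196,) -> reshape to 14 x 14 for a heat map
```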
  37. Applications of the Transformer in computer vision
      - Image Generation [Y. Jiang+, NeurIPS]: TransGAN, a pure-transformer generator and discriminator
      - Semantic Segmentation [E. Xie+, arXiv]: SegFormer, a hierarchical Transformer encoder with a lightweight all-MLP decoder
      - Object Detection [W. Wang+, ICCV]: PVT, a four-stage pyramid backbone whose output resolution shrinks from stride 4 to stride 32
      - Video Recognition [A. Arnab+, ICCV]: ViViT, pure-transformer video classification with factorised spatial/temporal attention
      - 3D Object Detection [I. Misra+, ICCV]: 3DETR, an end-to-end Transformer encoder–decoder over point clouds
      [Figures: architecture diagrams of PVT, ViViT, SegFormer, TransGAN and 3DETR.]
  38. An application of ViT: PVT [Wang+, ICCV]
      • A ViT that uses a pyramid structure.
        - a single backbone addressing image classification, object detection and segmentation
        - the feature map obtained at each stage can be fed to existing heads that expect a CNN-style feature pyramid
        - the feature-map size shrinks as the stages get deeper (stage shapes sketched below)
      • This keeps ViT's computational cost down.
      W. Wang+, "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV.
      [Figure 3 of the PVT paper: a four-stage architecture, each stage consisting of a patch-embedding layer and an L_i-layer Transformer encoder; the output resolution shrinks progressively from 4-stride to 32-stride.]
  39. w ϐϥϛουߏ଄Λར༻ͨ͠7J5  ը૾෼ྨɼ෺ମݕग़ɼηάϝϯςʔγϣϯλεΫΛղ͘Ϟσϧ  ֤૚Ͱಘͨಛ௃ϚοϓΛطଘͷϐϥϛουߏ଄Λ࣋ͭ$//ͷಛ௃Ϛοϓͷೖྗͱͯ͠࢖༻  4UBHF͕ਂ͘ͳΔʹͭΕಛ௃ϚοϓͷαΠζΛখ͘͢͞Δ w 7J5ͷܭࢉίετΛ཈੍

    88BOH l1ZSBNJE7JTJPO5SBOTGPSNFS"7FSTBUJMF#BDLCPOFGPS%FOTF1SFEJDUJPOXJUIPVU$POWPMVUJPOT z*$$7  Patch Emb Encoder Stage 1 Patch Emb Encoder Stage 2 Patch Emb Encoder Stage 4 Patch Emb Encoder Stage 3 !! : $ 4 × ' 4 ×(! !" : $ 8 × ' 8 ×(" !# : $ 16 × ' 16 ×(# !$ : $ 32 × ' 32 ×($ $×'×3 Patch Embedding Linear Norm Reshape Transformer Encoder (.% ×) Reshape Stage i &!"# '!"# (! $ ×*! &!"# (! × '!"# (! ×((! $*!"# ) &!"# (! × '!"# (! ×*! Position Embedding Element-wise Add Feature Map Norm Norm Feed Forward Multi-Head Attention Spacial Reduction SRA Figure 3: Overall architecture of the proposed Pyramid Vision Transformer (PVT). The entire model is divided into four stages, and each stage is comprised of a patch embedding layer, and a Li-layer Transformer encoder. Following the pyramid structure, the output resolution of the four stages progressively shrinks from 4-stride to 32-stride. ෺ମݕग़ɿύϥϝʔλ਺͕ಉఔ౓Ͱ$//ϕʔεͱൺֱͯ͠ਫ਼౓޲্ Method #Param (M) GFLOPs Top-1 (%) R18* [15] 11.7 1.8 30.2 R18 [15] 11.7 1.8 31.5 DeiT-Tiny/16 [50] 5.7 1.3 27.8 PVT-Tiny (ours) 13.2 1.9 24.9 R50* [15] 25.6 4.1 23.9 R50 [15] 25.6 4.1 21.5 X50-32x4d* [56] 25.0 4.3 22.4 X50-32x4d [56] 25.0 4.3 20.5 DeiT-Small/16 [50] 22.1 4.6 20.1 PVT-Small (ours) 24.5 3.8 20.2 R101* [15] 44.7 7.9 22.6 R101 [15] 44.7 7.9 20.2 X101-32x4d* [56] 44.2 8.0 21.2 X101-32x4d [56] 44.2 8.0 19.4 ViT-Small/16 [10] 48.8 9.9 19.2 PVT-Medium (ours) 44.2 6.7 18.8 X101-64x4d* [56] 83.5 15.6 20.4 X101-64x4d [56] 83.5 15.6 18.5 ViT-Base/16 [10] 86.6 17.6 18.2 DeiT-Base/16 [50] 86.6 17.6 18.2 PVT-Large (ours) 61.4 9.8 18.3 Table 2: Image classification performance on the Ima- geNet validation set. “Top-1” denotes the top-1 error rate. “#Param” refers to the number of parameters. “GFLOPs” is calculated under the input scale of 224 ⇥ 224. “*” indicates the performance of the method trained with the strategy in its original paper. to perform the random-size cropping to 224 ⇥224, random pyramid structure may be beneficial to dense prediction tasks, but the gains it brings to image classification are lim- ited. Note that, ViT and DeiT may have limitations as they are particularly designed for classification tasks, which are not suitable for dense prediction tasks that usually require effective feature pyramids. 5.2. Object Detection Experiment Settings. We conduct object detection ex- periments on the challenging COCO benchmark [28]. All models are trained on the COCO train2017 (⇠118k im- ages) and evaluated on the val2017 (5k images). We eval- uate our PVT backbones on two standard detectors: Reti- naNet [27] and Mask R-CNN [14]. During training, we first use the pre-trained weights on ImageNet to initialize the backbone and Xavier [13] to initialize the newly added layers. Our models are trained with the batch size of 16 on 8 V100 GPUs and optimized by AdamW [33] with the initial learning rate of 1⇥10 4. Following the common set- ting [27, 14, 5], we adopt 1⇥ or 3⇥ training schedule (i.e., 12 or 36 epochs) to train all detection models. The training image is resized to the shorter side of 800 pixels, while the longer side does not exceed 1333 pixels. When using 3⇥ training schedule, we also randomly resize the shorter side of the input image within the range of [640, 800]. In the testing phase, the shorter side of the input image is fixed to 800 pixels. Results. 
(Tables: COCO object detection with RetinaNet / Mask R-CNN and ADE20K semantic segmentation with Semantic FPN. PVT backbones consistently outperform ResNet / ResNeXt backbones of similar size.)
• Object detection: with a comparable parameter count, accuracy improves over CNN-based backbones (e.g., PVT-Small is +4.1 AP over ResNet-50 with RetinaNet).
• Image classification: the pyramid structure is not well suited to classification; the gains there are limited.
• Semantic segmentation: accuracy improves just as in detection (PVT-Tiny/Small/Medium are at least 2.8 mIoU above ResNet-18/50/101).
→ The pyramid structure is effective for object detection and segmentation tasks.
Applications of ViT: PVT [Wang+, ICCV 2021]
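The spatial-reduction attention (SRA) used inside each PVT stage can be illustrated with a short sketch. The block below is a minimal PyTorch sketch, not PVT's actual implementation: keys and values are computed on a spatially downsampled token grid, so the attention cost drops roughly by the square of the reduction ratio. The class name and the `dim`, `num_heads`, `sr_ratio` values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Minimal sketch of PVT-style SRA: K/V come from a spatially reduced
    feature map, cutting attention cost by roughly sr_ratio**2."""
    def __init__(self, dim=64, num_heads=1, sr_ratio=8):
        super().__init__()
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # strided conv shrinks the H x W token grid before the K/V projection
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        kv = x
        if self.sr_ratio > 1:
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = self.sr(feat).reshape(B, C, -1).transpose(1, 2)
            kv = self.norm(feat)                       # (B, N / sr_ratio**2, C)
        out, _ = self.attn(query=x, key=kv, value=kv)  # queries keep full resolution
        return out

# usage sketch: 56x56 token grid, as in stage 1 of a 224x224 input
x = torch.randn(2, 56 * 56, 64)
print(SpatialReductionAttention()(x, 56, 56).shape)  # torch.Size([2, 3136, 64])
```

Stacking such blocks on progressively smaller token grids is what produces the stride-4 to stride-32 pyramid referred to above.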
40. Applications of ViT: TransGAN [Y. Jiang+, NeurIPS 2021]
• Both the generator and the discriminator are built from Transformer encoders.
- Generator: produces an image from a noise vector through stacked Transformer encoder blocks.
  • Upsampling uses PixelShuffle [W. Shi+, CVPR 2016].
  • Each upsampling step raises the resolution while reducing the channel count, which keeps the parameter count from growing.
- Discriminator: passes generated and real images through Transformer encoders to tell them apart.
  • A CLS token is added at the final stage.
  • Real/fake is predicted from an MLP head on the CLS token.
(Figure: pipeline of the pure-Transformer generator and discriminator of TransGAN, illustrated for 256×256 image generation.)
Y. Jiang et al., "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up," NeurIPS 2021.
41. Applications of ViT: TransGAN (results)
(The architecture is the same pure-Transformer generator/discriminator as on the previous slide.)
(Table: unconditional image generation on CIFAR-10, STL-10 and CelebA. TransGAN reaches IS/FID competitive with or better than CNN-based GANs such as WGAN-GP, AutoGAN and StyleGAN-V2, e.g. FID 9.26 on CIFAR-10, 18.28 on STL-10 and 5.28 on CelebA.)
• Image quality is better than that of CNN-based GANs.
• Differentiable data augmentation (DiffAug) is effective: on CIFAR-10 it improves TransGAN from FID 22.53 to 9.26.
• Grid self-attention keeps GPU memory manageable at high resolution, where standard self-attention runs out of memory even at batch size 1; generated samples scale from 32×32 (CIFAR-10) up to 256×256 (CelebA-HQ, LSUN Church).
Y. Jiang et al., "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up," NeurIPS 2021.
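As a concrete illustration of the PixelShuffle upsampling used in the generator described above, here is a minimal sketch (not TransGAN's code): a stage's token sequence is reshaped into a feature map, PixelShuffle trades channels for resolution, and the result is flattened back into tokens. The shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def pixelshuffle_upsample(tokens, height, width, scale=2):
    """Minimal sketch: turn a (B, H*W, C) token sequence into a
    (B, H*scale*W*scale, C/scale^2) sequence via PixelShuffle, so resolution
    grows while the channel count shrinks and parameters stay flat."""
    b, n, c = tokens.shape
    assert n == height * width and c % (scale ** 2) == 0
    x = tokens.transpose(1, 2).reshape(b, c, height, width)   # tokens -> feature map
    x = nn.functional.pixel_shuffle(x, scale)                  # (B, C/s^2, H*s, W*s)
    return x.flatten(2).transpose(1, 2)                        # back to a token sequence

# usage sketch: an 8x8 grid of 256-dim tokens becomes a 16x16 grid of 64-dim tokens
tokens = torch.randn(1, 8 * 8, 256)
print(pixelshuffle_upsample(tokens, 8, 8).shape)  # torch.Size([1, 256, 64])
```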
42. Applications of ViT: ViViT [Arnab+, ICCV 2021]
• A video recognition model built purely on the Transformer.
(Figure: overall ViViT architecture, tokenised video into a Transformer encoder and MLP head, together with the attention patterns of the factorised model variants.)
A. Arnab et al., "ViViT: A Video Vision Transformer," ICCV 2021.
• Embedding a video clip (a tubelet-embedding sketch follows below):
- Method 1: Uniform frame sampling. Sample nt frames and embed each frame's patches independently, flattening all of them into one token sequence.
- Method 2: Tubelet embedding. Extract non-overlapping spatio-temporal "tubes" (voxels) and linearly project each one to a token; an extension of ViT's patch embedding to 3D.
• Variants for handling the large number of spatio-temporal tokens:
- Factorised Encoder: spatial and temporal features are extracted by two separate encoders, a spatial encoder per frame followed by a temporal encoder over the per-frame representations ("late fusion").
- Factorised Self-Attention: within each block, self-attention is applied first spatially, then temporally.
- Factorised Dot-Product Attention: half of the attention heads operate over the spatial axes and half over the temporal axis, and their outputs are concatenated.
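The tubelet embedding in Method 2 can be written as a single 3D convolution. The following is a minimal, illustrative sketch; the tubelet size, token dimension and clip shape are assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Minimal sketch of ViViT-style tubelet embedding: a 3D convolution whose
    kernel and stride equal the tubelet size maps non-overlapping t x h x w
    video chunks to d-dimensional tokens."""
    def __init__(self, dim=768, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, dim, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

# usage sketch: a 32-frame 224x224 clip -> (32/2) * (224/16)^2 = 3136 tokens
clip = torch.randn(1, 3, 32, 224, 224)
print(TubeletEmbedding()(clip).shape)  # torch.Size([1, 3136, 768])
```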
43. Applications of ViT: ViViT (results)
(Same tokenisation and factorised model variants as the previous slide.)
(Table: comparison to the state of the art on Kinetics 400/600, Moments in Time, Epic Kitchens 100 and Something-Something v2.)
• State of the art on a range of benchmarks, e.g. ViViT-L/16x2 reaches 80.6 top-1 on Kinetics 400, and ViViT-H/16x2 pretrained on JFT reaches 84.8.
• The unfactorised spatio-temporal attention model is used for the larger datasets (Kinetics, Moments in Time); the Factorised Encoder model is used for the smaller ones (Epic Kitchens, SSv2).
→ Pretraining on a huge dataset (e.g. JFT) can be expected to improve accuracy further.
A. Arnab et al., "ViViT: A Video Vision Transformer," ICCV 2021.
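The Factorised Encoder ("late fusion") variant used for the smaller datasets can be sketched as follows. This is a minimal, illustrative PyTorch version, not the authors' code: layer counts, dimensions, the number of classes and the use of average pooling instead of a CLS token are all assumptions.

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Minimal sketch of ViViT's factorised ("late fusion") encoder: a spatial
    Transformer runs over each frame's tokens, then a temporal Transformer runs
    over the resulting per-frame representations."""
    def __init__(self, dim=192, spatial_layers=4, temporal_layers=2, heads=3):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.spatial = nn.TransformerEncoder(layer(), spatial_layers)
        self.temporal = nn.TransformerEncoder(layer(), temporal_layers)
        self.head = nn.Linear(dim, 400)  # e.g. Kinetics-400 classes (assumption)

    def forward(self, tokens):
        # tokens: (B, T, N, D), i.e. per-frame token sequences
        b, t, n, d = tokens.shape
        frames = self.spatial(tokens.reshape(b * t, n, d))   # spatial attention per frame
        frame_repr = frames.mean(dim=1).reshape(b, t, d)     # pool -> one token per frame
        video_repr = self.temporal(frame_repr).mean(dim=1)   # temporal attention across frames
        return self.head(video_repr)

clip_tokens = torch.randn(2, 8, 196, 192)      # 8 frames of 14x14 patch tokens
print(FactorisedEncoder()(clip_tokens).shape)  # torch.Size([2, 400])
```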
44. Applications of ViT: 3DETR [Misra+, ICCV 2021]
• End-to-end 3D object detection from a 3D point cloud, based on the Transformer.
- Encoder:
  • Self-attention over the downsampled point cloud captures the relations between points.
  • No positional encoding is used, because the points themselves already carry 3D coordinates.
- Decoder:
  • Query points are sampled at random from the downsampled point cloud (non-parametric queries).
  • Because the decoder cannot access the raw coordinates, (Fourier) positional encodings are added.
- Each output box is parameterised by position, size, heading angle (angle class plus residual) and object class.
(Figure: 3DETR pipeline, set-aggregation downsampling, a Transformer encoder over point features, and a Transformer decoder that turns query embeddings into a set of 3D bounding boxes.)
I. Misra et al., "An End-to-End Transformer Model for 3D Object Detection," ICCV 2021.
(Figure: one encoder layer applies self-attention followed by an MLP over the N′ × d point features; one decoder layer applies self-attention among the B × d box-query features and cross-attention from the queries to the point features, with Fourier positional encodings ~F on the decoder side; all 3DETR models use d = 256.)
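The decoder layer just described can be sketched in a few lines. The block below is a minimal illustration, not 3DETR's implementation: only d = 256 follows the figure above, and the head count, query count and omitted normalisation layers are assumptions.

```python
import torch
import torch.nn as nn

class BoxQueryDecoderLayer(nn.Module):
    """Minimal sketch of one 3DETR-style decoder layer: self-attention among the
    box queries, then cross-attention from the queries to the encoded points."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, points, query_pos):
        # queries: (B, Q, d) box features; query_pos: their positional encodings
        q = queries + query_pos
        q = q + self.self_attn(q, q, q)[0]             # interactions among box queries
        q = q + self.cross_attn(q, points, points)[0]  # queries attend to point features
        return q + self.mlp(q)

queries = torch.randn(1, 128, 256)   # 128 query embeddings
points = torch.randn(1, 1024, 256)   # encoder output for 1024 downsampled points
qpos = torch.randn(1, 128, 256)      # stand-in for Fourier positional encodings
print(BoxQueryDecoderLayer()(queries, points, qpos).shape)  # torch.Size([1, 128, 256])
```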
45. Applications of ViT: 3DETR (results)
(Same encoder/decoder design as the previous slide.)
(Figure: qualitative detection results on the SUN RGB-D validation set; boxes are predicted from point clouds alone, and colour is shown only for visualisation.)
• Accurate 3D detection (e.g. 62.7 AP25 on ScanNetV2 and 58.0 AP25 on SUN RGB-D with the Transformer encoder, Transformer decoder and set-matching loss).
• 3DETR even detects amodal objects missing from the ground truth (top right), such as the full extent of a bed.
• Taking the blue point as the reference, the decoder's attention concentrates on points inside the same instance.
→ This instance-focused attention is thought to be what makes detection easier.
I. Misra et al., "An End-to-End Transformer Model for 3D Object Detection," ICCV 2021.
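Since the decoder relies on Fourier positional encodings of the 3D coordinates, a short sketch helps make the idea concrete. This is an illustrative version in the spirit of such encodings, not the paper's exact formulation; the frequency count and scale are assumptions.

```python
import torch

def fourier_positional_encoding(xyz, num_frequencies=32, scale=1.0):
    """Minimal sketch of a Fourier positional encoding for 3D coordinates:
    project each coordinate onto random frequencies and map through sin/cos."""
    b, n, _ = xyz.shape
    freqs = torch.randn(3, num_frequencies) * scale      # fixed once in practice
    angles = 2 * torch.pi * xyz @ freqs                   # (B, N, num_frequencies)
    return torch.cat([angles.sin(), angles.cos()], -1)    # (B, N, 2 * num_frequencies)

points = torch.rand(1, 2048, 3)                           # normalised xyz coordinates
print(fourier_positional_encoding(points).shape)          # torch.Size([1, 2048, 64])
```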
46. Applications of ViT: SegFormer [Xie+, arXiv 2021]
• Applies the Transformer to the semantic segmentation task.
- Mix Transformer (MiT) encoder:
  • A hierarchical Transformer that yields multi-level (coarse and fine) features.
  • A structure designed to cut computational cost (efficient self-attention).
- A lightweight, simple all-MLP decoder (see the sketch below):
  • The Transformer encoder captures both local and global features.
  • The MLP decoder complements them with local detail, giving a powerful representation.
• Effective-receptive-field analysis on Cityscapes: the CNN-based DeepLabV3+ captures only local context, while the Transformer-based SegFormer captures both local and global context (red box: receptive-field size of a stage's self-attention; blue box: receptive-field size of the MLP decoder head).
→ The MLP head's receptive field is denser in local regions, so accurate segmentation of small objects can be expected.
(Figure: SegFormer framework, a four-stage hierarchical encoder with overlapping patch embedding, efficient self-attention and Mix-FFN, plus the all-MLP decoder that fuses the multi-level features.)
E. Xie et al., "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers," arXiv 2021.
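The all-MLP decoder described above can be sketched as follows. This is a minimal illustration, not SegFormer's code: 1×1 convolutions stand in for the per-pixel linear layers, and the channel sizes and class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """Minimal sketch of a SegFormer-style decoder: unify the channel width of
    the four encoder stages, upsample everything to the finest resolution,
    concatenate, fuse, and predict per-pixel class logits."""
    def __init__(self, in_dims=(32, 64, 160, 256), dim=256, num_classes=19):
        super().__init__()
        self.unify = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in in_dims])
        self.fuse = nn.Conv2d(4 * dim, dim, kernel_size=1)
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, features):
        # features: list of stage outputs, coarse resolutions for later stages
        target = features[0].shape[-2:]
        ups = [F.interpolate(m(f), size=target, mode="bilinear", align_corners=False)
               for m, f in zip(self.unify, features)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))  # logits at the finest stride

feats = [torch.randn(1, c, 128 // s, 128 // s) for c, s in zip((32, 64, 160, 256), (1, 2, 4, 8))]
print(AllMLPDecoder()(feats).shape)  # torch.Size([1, 19, 128, 128])
```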
47. Applications of ViT: SegFormer (qualitative results)
(Same MiT encoder and all-MLP decoder as the previous slide.)
(Figure: qualitative results on Cityscapes comparing SegFormer with SETR and DeepLabV3+.)
• Compared with SETR, SegFormer predicts masks with substantially finer detail near object boundaries; compared with DeepLabV3+, it reduces long-range errors.
→ Fine, detailed regions are segmented accurately.
E. Xie et al., "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers," arXiv 2021.
48. What SegFormer makes possible
• Train on CG (synthetic) data only and evaluate on real images.
- The result does not reach that of training on real images, but the effect of the CG data is large.
(Table: quantitative evaluation on Cityscapes, per-class scores for Flat / Construction / Object / Nature / Sky / Human / Vehicle and their mean, for HRNet and SegFormer each trained on real images or on CG images.)
(Figure: input image with the corresponding HRNet and SegFormer predictions.)
49. Self-supervised learning (SSL)
• Learn from large amounts of unlabeled data through pseudo problems (pretext tasks).
- Models built with self-supervised learning are evaluated by transferring them to downstream tasks.
• Representative approaches:
- Contrastive-learning based: SimCLR [Chen+, ICML 2020], MoCo [He+, CVPR 2020]
- Positive-pairs only: BYOL [Grill+, NeurIPS 2020], SimSiam [Chen+, CVPR 2021]
• Procedure (a transfer sketch follows below):
(1) Build a pretrained model from a large pool of unlabeled data.
(2) Transfer the SSL-pretrained model to the target task (e.g. an image-classification or object-detection model) by attaching a task head (FC layer) and training on the labeled target data.
50. DINO [M. Caron+, ICCV 2021]
• Uses two networks, a student network and a teacher network (each a ViT encoder followed by an MLP projector).
- Training pushes the student's output distribution towards the teacher's output distribution.
• Probability distributions are computed by applying a temperature softmax to the projected features; the teacher branch additionally applies centering and sharpening, and its gradients are stopped (stop-grad).
• Data augmentation produces two kinds of crops: local crops of small regions (a harder problem) and wide-range crops (an easier problem); the loss compares the resulting student and teacher distributions, and the teacher is updated as an exponential moving average of the student.
(Figure: the DINO pipeline, augmented views, student/teacher ViT + MLP, softmax with centering and sharpening, loss computation, stop-gradient and the EMA update.)
51. Centering and sharpening
• Sharpening: adjusts the output so that a single feature is emphasised.
• Centering: adjusts the output so that no single feature ends up being emphasised by every sample (this prevents collapse).
• The center c is updated with a running mean of the teacher outputs:
  c ← m·c + (1 − m)·(1/B) Σ_{i=1..B} g_{θt}(x_i)
  52. w ڭࢣͷύϥϝʔλ͸ੜెͷύϥϝʔλΛࢦ਺Ҡಈฏۉ͢Δ͜ͱͰߋ৽  $PTJOFTDIFEVMFSʹج͍ͮͯ Λมߋ 0.996 ≤ λ ≤ 1
52. Teacher update by exponential moving average
• The teacher's parameters are updated as an exponential moving average of the student's parameters.
- λ is varied according to a cosine schedule, with 0.996 ≤ λ ≤ 1.
53. Accuracy comparison
• Dataset: ImageNet
• Networks: ResNet-50, Vision Transformer (ViT)
• ResNet-50: DINO performs on par with previous self-supervised methods
• ViT: DINO exceeds previous methods

Linear and k-NN evaluation on ImageNet (Table 2 of the DINO paper; throughput in images/s measured on an NVIDIA V100 with 128 samples per forward pass, parameters in millions; * indicates runs by the DINO authors):

  Method      Arch.        Param.  im/s   Linear  k-NN
  Supervised  RN50         23      1237   79.3    79.3
  SCLR        RN50         23      1237   69.1    60.7
  MoCov2      RN50         23      1237   71.1    61.9
  InfoMin     RN50         23      1237   73.0    65.3
  BarlowT     RN50         23      1237   73.2    66.0
  OBoW        RN50         23      1237   73.8    61.9
  BYOL        RN50         23      1237   74.4    64.8
  DCv2        RN50         23      1237   75.2    67.1
  SwAV        RN50         23      1237   75.3    65.7
  DINO        RN50         23      1237   75.3    67.5
  Supervised  ViT-S        21      1007   79.8    79.8
  BYOL*       ViT-S        21      1007   71.4    66.6
  MoCov2*     ViT-S        21      1007   72.7    64.4
  SwAV*       ViT-S        21      1007   73.5    66.3
  DINO        ViT-S        21      1007   77.0    74.5
  Comparison across architectures:
  SCLR        RN50w4       375     117    76.8    69.3
  SwAV        RN50w2       93      384    77.3    67.3
  BYOL        RN50w2       93      384    77.4    –
  DINO        ViT-B/16     85      312    78.2    76.1
  SwAV        RN50w5       586     76     78.5    67.1
  BYOL        RN50w4       375     117    78.6    –
  BYOL        RN200w2      250     123    79.6    73.9
  DINO        ViT-S/8      21      180    79.7    78.3
  SCLRv2      RN152w3+SK   794     46     79.8    73.1
  DINO        ViT-B/8      85      63     80.1    77.4

[Paper excerpt: DINO is pretrained on ImageNet without labels (AdamW, batch size 1024, linear lr warm-up then cosine decay, weight decay 0.04→0.4, τ_s = 0.1, τ_t warmed up from 0.04 to 0.07, BYOL-style augmentations plus multi-crop); evaluation follows the standard linear-probe and k-NN protocols on frozen features]
54. Visualizing Self-Attention in ViT
• The input images are visualized with Attention Rollout
  - Accurate object regions (attention maps) are obtained even without any label information
• The attention maps are more accurate than those obtained with supervised training

[Figure 1 of "Emerging Properties in Self-Supervised Vision Transformers" (Caron et al., Facebook AI Research / Inria / Sorbonne University): self-attention of the [CLS] token on the heads of the last layer of a ViT with 8×8 patches trained with no supervision; the maps show that the model automatically learns class-specific features leading to unsupervised object segmentation. A further figure contrasts the attention maps of a supervised ViT with those of DINO, alongside an ablation table of the DINO pretraining components (momentum encoder, multi-crop, cross-entropy loss) evaluated with k-NN and linear probes on ViT-S/16.]
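Attention Rollout [Abnar+, ACL20] propagates attention through the layers by multiplying the per-layer attention matrices, with the identity added to account for the residual connections. A compact sketch, assuming `attn_maps` is a list of per-layer attention tensors of shape [num_heads, N, N] extracted from a ViT whose token sequence is a leading [CLS] token followed by the patch tokens:

```python
# Attention Rollout: average heads, add identity for the skip connection,
# re-normalize rows, and multiply the matrices from the first to the last layer.
import torch

def attention_rollout(attn_maps):
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=0)                              # average over heads -> [N, N]
        a = a + torch.eye(a.size(-1), device=a.device)    # identity for the residual path
        a = a / a.sum(dim=-1, keepdim=True)               # re-normalize each row
        rollout = a if rollout is None else a @ rollout
    # row 0 (the [CLS] token) gives the attention over the input patch tokens,
    # which can be reshaped to the patch grid and overlaid on the image
    return rollout[0, 1:]
```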
55. On the future of Transformers
• Tasks whose baselines were CNNs or LSTMs are being replaced by Transformers
  - because Transformers can capture the global features of an image sequence
• Datasets are becoming larger
  - Transformer-based models can be expected to gain accuracy when pretrained on large-scale datasets
• Future directions predicted from the problems of Transformers
  - Transformers have a high computational cost
    → a MobileNet-style Transformer: Mobile-Former
  - Building large-scale datasets has a high human cost
    → automatically annotated CG data may become usable for training
56. References
[Vaswani+, NeurIPS17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, "Attention is All You Need", NeurIPS, 2017.
[Dosovitskiy+, ICLR21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021.
[Abnar+, ACL20] Samira Abnar, Willem Zuidema, "Quantifying Attention Flow in Transformers", ACL, 2020.
[Wang+, ICCV21] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao, "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021.
[Y. Jiang+, NeurIPS21] Yifan Jiang, Shiyu Chang, Zhangyang Wang, "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up", NeurIPS, 2021.
[Misra+, ICCV21] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid, "ViViT: A Video Vision Transformer", ICCV, 2021.
[Xie+, arXiv21] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo, "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", arXiv, 2021.
[M. Caron+, ICCV21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin, "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021.
57. Machine Perception and Robotics Group, Chubu University

Professor Hironobu Fujiyoshi (E-mail: [email protected])
1997: completed the doctoral program at Chubu University Graduate School; 1997: Postdoctoral Fellow, Robotics Institute, Carnegie Mellon University (USA); 2000: Lecturer, Department of Computer Science, College of Engineering, Chubu University; 2004: Associate Professor, Chubu University; 2005-2006: Visiting Researcher, Robotics Institute, Carnegie Mellon University; 2010: Professor, Chubu University; 2014: Visiting Professor, Nagoya University.
Research interests: computer vision, video processing, and pattern recognition and understanding.
Awards: RoboCup Research Award (2005), IPSJ Transactions on CVIM Outstanding Paper Award (2009), IPSJ Yamashita Memorial Research Award (2009), SSII Outstanding Academic Award (2010, 2013, 2014), IEICE Information and Systems Society Paper Award (2013), and others.

Professor Takayoshi Yamashita (E-mail: [email protected])
2002: completed the master's program at Nara Institute of Science and Technology; 2002: joined OMRON Corporation; 2009: completed the doctoral program at Chubu University Graduate School (while working); 2014: Lecturer, Chubu University; 2017: Associate Professor, Chubu University; 2021: Professor, Chubu University.
Research interests: video processing, pattern recognition, and machine learning toward understanding people.
Awards: SSII Takagi Award (2009), IEICE Information and Systems Society Paper Award (2013), IEICE PRMU Research Encouragement Award (2013).

Lecturer Tsubasa Hirakawa (E-mail: [email protected])
2013: completed the master's program at Hiroshima University Graduate School; 2014: entered the doctoral program at Hiroshima University; 2017-2019: Researcher, Chubu University; 2017: completed the doctoral program at Hiroshima University; 2019: Specially Appointed Assistant Professor, Chubu University; 2021: Lecturer, Chubu University; 2014: JSPS Research Fellow (DC1); 2014-2015: Visiting Researcher, ESIEE Paris.
Research interests: computer vision, pattern recognition, and medical image processing.

Student Hiroaki Minoura (E-mail: [email protected])
2020: completed the master's program in Computer Science at Chubu University Graduate School; currently enrolled in the doctoral program in Computer Science at Chubu University Graduate School (since 2020).
Research interests: computer vision and pattern recognition.