Generation 1: Hand-crafted features – co-occurrence representations of HOG features

CoHOG [Watanabe, IPSJ]: expresses the co-occurrence of multiple features in order to capture the structural similarity of pedestrians.

Joint Haar-like features: the responses of several Haar-like features, computed on positive-class and negative-class samples, are binarized by thresholding and combined into a single joint index, e.g. j = (111)_2 = 7.

• Capturing the relationships between the gradients of HOG features
  – CoHOG [Watanabe]: a co-occurrence matrix that accumulates pairs of gradient orientations over local regions
  – Joint HOG [Mitsui]: boosting selects the relationships between local regions that are effective for classification
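To make the co-occurrence idea behind CoHOG concrete, the sketch below builds a co-occurrence matrix of quantized gradient orientations for one pixel offset within one local block. This is a minimal sketch, not the reference implementation: the 8-bin quantization, the single offset, and the names (cohog_block, offset) are assumptions chosen for illustration; the full descriptor accumulates such matrices over many offsets and blocks.

```python
import numpy as np

def cohog_block(gray, offset=(1, 1), n_bins=8):
    """Minimal CoHOG-style sketch: co-occurrence matrix of quantized
    gradient orientations for a single pixel offset within one block.
    gray: 2-D float array (one grayscale block)."""
    # Image gradients and orientation quantized into n_bins
    gy, gx = np.gradient(gray.astype(np.float64))
    ori = np.arctan2(gy, gx)                                   # range [-pi, pi]
    bins = np.floor((ori + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins

    dy, dx = offset
    h, w = bins.shape
    # Accumulate orientation pairs (pixel, pixel + offset) into an n_bins x n_bins matrix
    co = np.zeros((n_bins, n_bins), dtype=np.int64)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            co[bins[y, x], bins[y + dy, x + dx]] += 1
    return co

# Usage: feature vector for one block and one offset
block = np.random.rand(16, 16)
feat = cohog_block(block).ravel()
```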
Generation 2: Feature representation learning with CNNs

[Figure: visualization of CNN outputs on an H x W x C input (legend: convolution, pooling, upsampling) for three tasks.
  - Image classification: class probabilities ("Person") for the whole input.
  - Object detection: class probabilities and a detection region for each grid cell.
  - Semantic segmentation: class probabilities for each pixel.]
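As a concrete illustration of the figure, here is a minimal PyTorch-style sketch of one shared convolution/pooling backbone feeding three heads: per-image class probabilities (classification), per-grid-cell class scores plus box parameters (detection), and per-pixel class scores after upsampling (segmentation). The architecture, layer sizes, and names are assumptions for illustration, not the networks shown in the figure.

```python
import torch
import torch.nn as nn

class MultiTaskCNN(nn.Module):
    """Illustrative only: a shared convolutional backbone with classification,
    detection, and segmentation heads, mirroring the three outputs in the figure."""
    def __init__(self, n_classes=21, n_anchors=3):
        super().__init__()
        self.backbone = nn.Sequential(               # convolution + pooling
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Classification: class scores for the whole image
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, n_classes))
        # Detection: per grid cell, class scores + 4 box parameters per anchor
        self.det_head = nn.Conv2d(64, n_anchors * (n_classes + 4), 1)
        # Segmentation: upsample back to input resolution, per-pixel class scores
        self.seg_head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, n_classes, 1),
        )

    def forward(self, x):                            # x: (B, 3, H, W)
        f = self.backbone(x)                         # (B, 64, H/4, W/4)
        return self.cls_head(f), self.det_head(f), self.seg_head(f)

logits_cls, grid_det, pixel_seg = MultiTaskCNN()(torch.randn(1, 3, 128, 128))
```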
Transformer / ViT

Figure 1: The Transformer – model architecture.

3.1 Encoder and Decoder Stacks

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by sqrt(d_k), and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V    (1)

The two most commonly used attention functions are additive attention [2] and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/sqrt(d_k). Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
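Equation (1) maps directly to a few lines of code. The sketch below is a plain NumPy rendering of scaled dot-product attention as defined above (single head, no masking); the array shapes and the example dimensions are assumptions chosen for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  -- Eq. (1).
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # (n_q, d_v) weighted sum of values

# Example: 4 queries attending over 6 key/value pairs
Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 32)
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 32)
```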
Self-supervised learning of ViT: DINO [Caron, ICCV]

Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
Facebook AI Research / Inria / Sorbonne University (arXiv v2, 24 May 2021)

[Figure: self-attention masks obtained by keeping 60% of the mass, for a ViT-S/8 trained with supervision and with DINO (best head for each model); the accompanying table compares the Jaccard similarity between these masks and the ground truth on the PASCAL VOC12 validation images.]

Table 6: Transfer learning by finetuning pretrained models on different datasets (top-1 accuracy). Self-supervised pretraining with DINO transfers better than supervised pretraining.

                  Cifar10  Cifar100  INat18  INat19  Flwrs  Cars  INet
ViT-S/16  Sup.     99.0     89.5      70.7    76.6    98.2   92.1  79.9
          DINO     99.0     90.5      72.0    78.2    98.5   93.0  81.5
ViT-B/16  Sup.     99.0     90.8      73.2    77.7    98.4   92.1  81.8
          DINO     99.1     91.7      72.6    78.6    98.8   93.0  82.8

[Clipped text from the paper's ablation discussion (Table 7): without the momentum encoder the framework does not work and operations such as Sinkhorn-Knopp are needed to avoid collapse, while with momentum SK has little impact; smaller patch sizes greatly improve performance without adding parameters, at the expense of throughput.]

→ DINO pretraining matches or exceeds the supervised (Sup.) pretrained models.
→ Accurate object regions are obtained without any label information.
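The mechanism behind the DINO results above is self-distillation with a momentum teacher. The sketch below is a simplified, assumed rendering of the core update, not the authors' training code: cross-entropy between the sharpened, centered teacher outputs and the student outputs, an EMA update of the output center, and an EMA (momentum) update of the teacher weights. Temperatures, momentum values, and all names are illustrative.

```python
import numpy as np

def softmax(x, temp):
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_step(student_out, teacher_out, center,
              t_student=0.1, t_teacher=0.04, center_momentum=0.9):
    """One simplified DINO-style loss computation (values are illustrative).
    student_out / teacher_out: (batch, K) projection-head outputs of two
    augmented views, from the student and the momentum teacher."""
    # Teacher targets: centering + sharpening (no gradient flows through the teacher)
    t = softmax(teacher_out - center, t_teacher)
    s = softmax(student_out, t_student)
    loss = -np.mean(np.sum(t * np.log(s + 1e-12), axis=-1))   # cross-entropy H(t, s)

    # Update the center with an exponential moving average of the teacher outputs
    new_center = center_momentum * center + (1 - center_momentum) * teacher_out.mean(axis=0)
    return loss, new_center

def ema_update(teacher_params, student_params, m=0.996):
    """Momentum (EMA) update of the teacher weights from the student weights."""
    return [m * tp + (1 - m) * sp for tp, sp in zip(teacher_params, student_params)]
```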
Generation 3: Feature representation learning with ViT
• Position-independent features are acquired through fully-connected (all-to-all) relations between patches.
• Shape features that CNNs could not capture become available → robust to noise.
• Accurate object regions are obtained without label information (self-supervised learning, e.g. DINO [Caron, ICCV]).

[Figure: title page of "Emerging Properties in Self-Supervised Vision Transformers" (Caron et al., Facebook AI Research / Inria / Sorbonne University, arXiv, 24 May 2021). Its Figure 1 shows the self-attention of the [CLS] token on the heads of the last layer of a ViT with 8x8 patches trained with no supervision; the maps show that the model automatically learns class-specific features leading to unsupervised object segmentations.]

Summary: Local features – the evolution of feature representation learning in image recognition
  Generation 1: Hand-crafted features
  Generation 2: Feature representation learning with CNNs
  Generation 3: Feature representation learning with ViT