Various pretext tasks

Predict the masked region with an Encoder-Decoder model
• Colorful Image Colorization [Zhang+, ECCV16]
  Predict the ab channels of the Lab color space from a grayscale image converted from a color image
• Non-Parametric Instance Discrimination [Wu+, CVPR18]
  Learn a feature for each image that is distinct from all others (every sample is treated as its own class)
• Context Prediction [Doersch+, ICCV15]
  Predict the relative position between patches cropped as a grid of tiles
• Contrastive Predictive Coding (CPC v2) [Hénaff+, ICML20]
  From a patch's feature, predict the feature of the patch located k positions ahead

[Figure (Noroozi and Favaro, ECCV16, Fig. 3): Context Free Network (CFN). A 225×225 window is randomly cropped, divided into a 3×3 grid, and a 64×64 tile is sampled from each 75×75 cell; the 9 tiles are reordered by a permutation drawn from a predefined permutation set and fed to the siamese-ennead CFN (AlexNet-style shared layers), which predicts the index of the chosen permutation.]
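As an illustration of the jigsaw pretext task in the figure above, here is a minimal sketch of how the shuffled tiles and the permutation-index target could be generated. The tile and cell sizes follow the figure; the random permutation set and the function names are assumptions for illustration only, whereas the paper selects a predefined, maximally diverse permutation set.

```python
import torch

# Hypothetical permutation set: the paper hand-picks a diverse predefined set;
# a random one is used here purely for illustration.
num_perms = 100
perm_set = torch.stack([torch.randperm(9) for _ in range(num_perms)])   # [100, 9]

def make_puzzle(image):
    """image: [3, 225, 225] crop -> (shuffled tiles [9, 3, 64, 64], permutation index)."""
    tiles = []
    for r in range(3):
        for c in range(3):
            cell = image[:, r * 75:(r + 1) * 75, c * 75:(c + 1) * 75]   # 75x75 grid cell
            y, x = torch.randint(0, 75 - 64 + 1, (2,)).tolist()         # random 64x64 tile inside the cell
            tiles.append(cell[:, y:y + 64, x:x + 64])
    tiles = torch.stack(tiles)                                          # [9, 3, 64, 64]
    label = torch.randint(0, num_perms, (1,)).item()                    # index the network must predict
    return tiles[perm_set[label]], label

tiles, label = make_puzzle(torch.rand(3, 225, 225))
# A nine-branch (siamese-ennead) CNN would embed each tile, concatenate the
# embeddings, and classify `label` with a standard cross-entropy loss.
```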
[Figure (Pathak+, CVPR16, Fig. 2): Context Encoder. The context image is passed through an encoder (derived from AlexNet) whose features are connected to the decoder by a channel-wise fully-connected layer; the decoder then produces the missing region of the image. The model is trained with a reconstruction loss jointly with an adversarial loss.]

[Figure (Wu+, CVPR18, Fig. 2): Pipeline of unsupervised feature learning by instance-level discrimination. A backbone CNN encodes each image as a feature vector, projected to a 128-D space and L2-normalized; the embedding is learned by maximally scattering the features of the training samples over the 128-D unit sphere, with a memory bank storing one feature per instance.]
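A minimal sketch of the non-parametric softmax over instances with a memory bank, the core of the instance discrimination pipeline above. The 128-D feature size follows the figure; the dataset size n, batch size, temperature, momentum, and the use of the exact full softmax (the paper approximates it with NCE when n is large) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; n would be the number of training images.
n, dim, tau, m = 10_000, 128, 0.07, 0.5
memory_bank = F.normalize(torch.randn(n, dim), dim=1)      # one 128-D slot per image

# Stand-in for backbone features of a batch and the ids of those images.
feats = torch.randn(32, dim, requires_grad=True)
idx = torch.randint(0, n, (32,))

v = F.normalize(feats, dim=1)                              # project onto the unit sphere
logits = v @ memory_bank.t() / tau                         # similarity to every stored instance
loss = F.cross_entropy(logits, idx)                        # each image is its own class
loss.backward()

# After the optimizer step, refresh the memory slots of the batch with momentum.
with torch.no_grad():
    memory_bank[idx] = F.normalize(m * memory_bank[idx] + (1 - m) * v.detach(), dim=1)
```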
[Figure (Zhang+, ECCV16, Fig. 2): Colorization network architecture. Each conv block is 2 or 3 repeated conv/ReLU layers followed by BatchNorm; there are no pooling layers, and all changes in resolution are done by spatial down- or upsampling between blocks. The CNN maps a grayscale input to a distribution over quantized ab color values in CIE Lab space.]

[Figure (Hénaff+, ICML20): Evaluation pipeline for CPC v2. A patched ResNet-161 feature extractor f_theta is pre-trained self-supervised (100% of images, 0% of labels) together with a masked-ConvNet context network g_phi, then evaluated by linear classification, efficient classification with 1% to 100% of images and labels, and transfer learning.]
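Below is a minimal sketch of the k-steps-ahead patch prediction behind CPC: patch features are summarized into a context, a linear head predicts the feature of the patch k grid rows ahead, and a contrastive (InfoNCE-style) loss scores the prediction against all other patches. The grid sizes, the mean-pooled context (standing in for the masked-ConvNet context network), and the single prediction head are simplifying assumptions, not the CPC v2 configuration.

```python
import torch
import torch.nn.functional as F

# z[b, r, c, :] is the (stand-in) encoded feature of the patch at grid cell (r, c).
B, R, C, D, k = 8, 7, 7, 64, 2
z = F.normalize(torch.randn(B, R, C, D), dim=-1)

# Context for row r: mean of all rows up to r (a stand-in for the masked
# ConvNet context network g_phi, which only sees patches above the target).
ctx = torch.cumsum(z, dim=1) / torch.arange(1, R + 1).view(1, R, 1, 1)

W_k = torch.nn.Linear(D, D, bias=False)    # one prediction head per offset k

pred = W_k(ctx[:, : R - k])                # predicted features k rows below, [B, R-k, C, D]
target = z[:, k:]                          # actual features k rows below

# InfoNCE-style objective: each prediction must identify its own target patch
# among all candidate patches in the batch.
pred, target = pred.reshape(-1, D), target.reshape(-1, D)
logits = pred @ target.t()
loss = F.cross_entropy(logits, torch.arange(logits.size(0)))
loss.backward()                            # in practice this also trains the patch encoder
```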
[Figure (Doersch+, ICCV15, Fig. 2): The algorithm receives two patches in one of eight possible spatial arrangements, without any further context, and must classify which configuration was sampled.]

Figures reproduced from [Noroozi and Favaro, ECCV16], [Pathak+, CVPR16], [Wu+, CVPR18], [Zhang+, ECCV16], [Hénaff+, ICML20], and [Doersch+, ICCV15].
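Finally, a minimal sketch of how training pairs for the eight-way relative-position task of Context Prediction could be sampled. The patch size, gap, and image size are illustrative assumptions; the paper also applies jitter and countermeasures against low-level shortcuts such as chromatic aberration, which are omitted here.

```python
import torch

patch, gap = 96, 48                        # assumed sizes, not the paper's exact values
step = patch + gap
# The 8 neighbour positions around the centre patch, indexed 0..7 (the label).
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair(image):
    """image: [3, H, W] -> (centre patch, neighbour patch, label in 0..7)."""
    _, H, W = image.shape
    cy = torch.randint(step, H - step - patch, (1,)).item()
    cx = torch.randint(step, W - step - patch, (1,)).item()
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = offsets[label]
    ny, nx = cy + dy * step, cx + dx * step
    centre = image[:, cy:cy + patch, cx:cx + patch]
    neighbour = image[:, ny:ny + patch, nx:nx + patch]
    return centre, neighbour, label

c, nb, y = sample_pair(torch.rand(3, 512, 512))
# A two-branch (siamese) CNN embeds both patches; the concatenated embeddings
# feed an 8-way classifier trained with cross-entropy on `y`.
```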