Self-supervised learning (SSL: Self-supervised Learning)
Contrastive: SimCLR [Chen+, ICML], MoCo [He+, CVPR]; Negative-free: BYOL [Grill+, NeurIPS], SimSiam [Chen+, CVPR]; Masked Image Modeling: SimMIM [Xie+, CVPR], MAE [He+, CVPR]
① Build a pre-trained model from a large amount of unlabeled data
② Fine-tune the SSL pre-trained model on the target task
[Figure: large unlabeled data → pre-trained model; pre-trained model + FC / task head fine-tuned with teacher labels (e.g. "Pelican") → classification model, object detection model, ...]
Evaluation by the kNN method
- Extract features of the training and evaluation data with the model trained by self-supervised learning (on its own unlabeled dataset)
- Training data: features with teacher labels (e.g. Airplane, Cat); evaluation data: features with teacher labels (e.g. Cat, Dog)
- Classify each evaluation sample from its K nearest neighbors in feature space
[Figure: dataset used for self-supervised learning → SSL-trained model → features; kNN classification over the labeled features]
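A minimal sketch of this kNN evaluation protocol, assuming the frozen features and labels have already been extracted into tensors (train_feats, train_labels, test_feats are placeholder names; the similarity-weighted vote used by some methods is simplified to a plain majority vote):

```python
import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, k=20, num_classes=10):
    """Classify each test sample by a majority vote over the labels of its
    k nearest training features (cosine similarity on L2-normalized features)."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sim = test_feats @ train_feats.t()          # [N_test, N_train] similarities
    _, idx = sim.topk(k, dim=1)                 # indices of the k nearest neighbors
    neighbor_labels = train_labels[idx]         # [N_test, k]
    votes = F.one_hot(neighbor_labels, num_classes).sum(dim=1)
    return votes.argmax(dim=1)                  # predicted class per test sample
```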
Evaluation by linear probing
- Extract features of the training data with the SSL-trained model
- Train only a randomly initialized FC layer with supervised learning on the labeled features (e.g. Airplane, Cat); the backbone stays frozen
[Figure: dataset used for self-supervised learning → SSL-trained model → features → randomly initialized FC trained with teacher labels]
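A minimal linear-probing sketch under the same assumptions (backbone, feat_dim and loader are placeholders; only the randomly initialized FC layer receives gradients):

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    """Train a randomly initialized linear head on top of frozen SSL features."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False                 # the pre-trained encoder stays frozen
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)        # frozen features
            loss = criterion(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```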
Improving the pretext task: Image Rotations, Colorful Image Colorization
- Apply one of four rotations (0°, 90°, 180°, 270°) to each image
- Predict which of the four rotations was applied (4-class classification)
[Figure: RotNet pipeline — rotate the image with g(X, y), a ConvNet F(.) maximizes the probability of the correct rotation label y; Colorful Image Colorization network (blocks of conv + ReLU + BatchNorm, no pooling, resolution changed only by spatial down/upsampling)]
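A minimal sketch of the rotation pretext labels described above, assuming images is an NCHW tensor; the downstream ConvNet and cross-entropy step are indicated in the usage comment:

```python
import torch

def make_rotation_batch(images):
    """Build the 4-way rotation pretext task: every image is rotated by
    0/90/180/270 degrees and labeled with its rotation index (0..3)."""
    rotated, labels = [], []
    for k in range(4):                          # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# usage: x_rot, y_rot = make_rotation_batch(images)
#        loss = torch.nn.functional.cross_entropy(convnet(x_rot), y_rot)
```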
Improving the pretext task: Context Encoders, Solving Jigsaw Puzzles
- Predict the masked (removed) region of an image with an encoder–decoder model
- Compared with discriminative patch-arrangement tasks, the supervisory signal is much richer: a context encoder predicts roughly 15,000 real values per training example instead of one choice among 8 patch configurations
[Figure: Context Encoder — the context image is passed through the encoder, connected to the decoder via a channel-wise fully-connected layer, and the decoder produces the missing region; Context-Free Network illustrating how a jigsaw puzzle is generated]
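A minimal sketch of the inpainting-style objective, assuming net is an encoder–decoder that outputs a full-resolution image; masking a central square and using a plain L2 loss on the hole is a simplification of the Context Encoders setup:

```python
import torch
import torch.nn.functional as F

def inpainting_loss(net, images, mask_size=64):
    """Remove a central square region and train the encoder-decoder to
    reconstruct it; the reconstruction loss is computed only on the hole."""
    n, c, h, w = images.shape
    mask = torch.zeros(n, 1, h, w, device=images.device)
    top, left = (h - mask_size) // 2, (w - mask_size) // 2
    mask[:, :, top:top + mask_size, left:left + mask_size] = 1.0
    corrupted = images * (1.0 - mask)           # context only, hole zeroed out
    pred = net(corrupted)                       # predicted full image
    return F.mse_loss(pred * mask, images * mask)
```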
Improving the pretext task: Instance Discrimination, Learning to Count
- Instance Discrimination: a backbone CNN encodes each image into a feature vector, projected to 128 dimensions and L2-normalized onto the unit sphere; the embedding is learned by instance-level discrimination with a non-parametric softmax over a memory bank, which maximally scatters the features of the training samples over the 128-d sphere; the temperature parameter τ controls the concentration of the distribution on the sphere
- Learning to Count: a feature transform is learned so that the sum of the features of the four tiles of an image matches the feature of the downsampled whole image (counting visual primitives), with a contrastive margin term against other images
[Figure: pipelines of Learning to Count (tile / downsampling branches with shared weights φ) and Instance Discrimination (CNN backbone → 128-d L2-normalized features → non-parametric softmax with memory bank)]
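A minimal sketch of the temperature-scaled, non-parametric softmax behind instance discrimination, assuming memory_bank holds one L2-normalized 128-d vector per training image and indices gives each image's own instance id (the original method approximates this softmax with NCE and updates the memory bank with a momentum rule):

```python
import torch.nn.functional as F

def instance_discrimination_loss(features, indices, memory_bank, tau=0.07):
    """Treat every image as its own class: the probability of instance i is a
    softmax over similarities to all memory-bank entries, scaled by temperature tau."""
    v = F.normalize(features, dim=1)            # [B, 128] current embeddings
    logits = v @ memory_bank.t() / tau          # [B, N] similarity to every instance
    return F.cross_entropy(logits, indices)     # target: each image's own index
```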
Improving the pretext task: Contrastive Predictive Coding [..., ICML], Context Prediction
- CPC: from the features of the patches seen so far, predict the features of the patch k positions ahead; training uses the InfoNCE loss
- Context Prediction: the network receives two patches in one of eight possible spatial arrangements, without any other context, and must classify which configuration was sampled
[Figure: CPC pipeline — feature extractor f_θ, context network g_φ, InfoNCE objective; evaluation settings (linear classification, efficient classification with 1%–100% of labels, transfer learning); Context Prediction 3×3 patch layout]
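A minimal InfoNCE sketch for this kind of prediction task, assuming pred is the context network's prediction of the feature k patches ahead and targets are the true patch features; using cosine similarity is a simplification of CPC's log-bilinear score:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(pred, targets, tau=1.0):
    """Each prediction should score highest against its own target (positive)
    among all targets in the batch (negatives)."""
    pred = F.normalize(pred, dim=1)             # [B, D]
    targets = F.normalize(targets, dim=1)       # [B, D]
    logits = pred @ targets.t() / tau           # [B, B] similarity matrix
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are the positives
```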
What makes good views for contrastive learning: mutual information between views
- The InfoNCE loss lower-bounds the mutual information between the two views: I(v1; v2) ≥ log(K) − L_NCE = I_NCE(v1; v2)
- As the mutual information between views is changed, information about the downstream task and nuisance information can be selectively included or excluded, biasing the learned representation: if reducing I(v1; v2) only discards nuisance information, downstream accuracy improves; once it also discards task-relevant information, accuracy drops
- Sweet spot: I(v1; v2) = I(x; y) — with too much mutual information excess (noise) bits are kept, with too little there is not enough signal, so transfer performance follows a reverse-U shape as I_NCE is reduced
- Experiment: views are pairs of image patches at various offsets; mutual information decreases (large → small) as the patch distance grows. Self-supervised learning on DIV2K → linear evaluation on CIFAR-10 shows the reverse-U shape of downstream accuracy
[Figure: reverse-U curve of CIFAR-10 accuracy vs. I_NCE(v1; v2); diagram of captured / missing / excess information relative to I(x; y)]
Relation between the temperature parameter and the feature similarity of positive / negative pairs [..., CVPR]
[Figure 8 of the cited paper: similarity distributions of positive samples and of the top-10 nearest negative samples, for different temperature values]
BYOL: accuracy comparison
- Linear evaluation on ImageNet-1K: classification performance is evaluated with a linear classifier on frozen features; with wider models (2x, 4x) BYOL reaches performance comparable to supervised learning
- Accuracy comparison under fine-tuning (self-supervised pre-training: ImageNet-1K; object detection: VOC; semantic segmentation: VOC): BYOL surpasses the ImageNet supervised pre-trained model

Transfer results in semantic segmentation and object detection:
Method           AP50   mIoU
Supervised-IN    74.4   74.4
MoCo             74.9   72.5
SimCLR (repro)   75.2   75.2
BYOL             77.5   76.3

[Figure: ImageNet top-1 accuracy vs. number of parameters for Sup., SimCLR, MoCo, MoCo v2, CPC v2-L, CMC, AMDIM, InfoMin and BYOL at 1x/2x/4x widths]
Negative-free methods: SimSiam [Chen+, CVPR]
- BYOL: exponential moving average (momentum) encoder + predictor
- SwAV: online clustering with Sinkhorn-Knopp instead of direct feature comparison
- SimSiam: no negatives, no momentum encoder, no clustering — a shared encoder with a predictor on one branch and stop-gradient on the other
[Figure: architecture comparison of SimCLR (similarity & dissimilarity between two encoders), BYOL (momentum encoder + predictor, moving average), SwAV (Sinkhorn-Knopp), and SimSiam (encoder + predictor, gradients flow only through the predictor branch)]
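A minimal SimSiam-style sketch, assuming f is the shared encoder (backbone + projection MLP) and h the predictor; the stop-gradient on the target branch is what allows training without negatives:

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    """Negative-free objective: each view's prediction is pulled toward the
    other view's detached embedding via negative cosine similarity."""
    z1, z2 = f(x1), f(x2)                       # embeddings of the two augmented views
    p1, p2 = h(z1), h(z2)                       # predictor outputs

    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # stop-gradient on z

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```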
DINO: linear and k-NN classification on ImageNet (top-1 accuracy on the validation set)
Method      Arch.     Linear  k-NN
Supervised  RN50      79.3    79.3
SimCLR      RN50      69.1    60.7
MoCo v2     RN50      71.1    61.9
InfoMin     RN50      73.0    65.3
BarlowT     RN50      73.2    66.0
BYOL        RN50      74.4    64.8
SwAV        RN50      75.3    65.7
DINO        RN50      75.3    67.5
Supervised  ViT-S     79.8    79.8
BYOL*       ViT-S     71.4    66.6
MoCo v2*    ViT-S     72.7    64.4
SwAV*       ViT-S     73.5    66.3
DINO        ViT-S     77.0    74.5
DINO        ViT-B/8   80.1    77.4
(* = run by the DINO authors)
- ResNet: DINO performs on par with previous methods
- Vision Transformer (ViT): DINO exceeds previous methods
[Excerpt of DINO implementation details: ImageNet pre-training without labels, AdamW with batch size 1024, linear learning-rate warm-up for 10 epochs then cosine decay, weight decay 0.04 → 0.4, temperatures τ_s = 0.1 and τ_t warmed up from 0.04 to 0.07, BYOL-style augmentations plus multi-crop]
A training method for ViT: DINO [Caron+, ICCV]
- Transfer learning by fine-tuning the pre-trained models on different datasets (top-1 accuracy): self-supervised pre-training with DINO transfers better than supervised pre-training, i.e. DINO exceeds the supervised pre-trained model

               CIFAR10  CIFAR100  INat18  INat19  Flwrs  Cars  INet
ViT-S/16 Sup.   99.0     89.5      70.7    76.6    98.2   92.1  79.9
ViT-S/16 DINO   99.0     90.5      72.0    78.2    98.5   93.0  81.5
ViT-B/16 Sup.   99.0     90.8      73.2    77.7    98.4   92.1  81.8
ViT-B/16 DINO   99.1     91.7      72.6    78.6    98.8   93.0  82.8

[Figures from the DINO paper: segmentation masks obtained by thresholding the self-attention maps to keep 60% of the mass (supervised ViT-S/8 vs. DINO), with Jaccard similarity to ground truth on PASCAL VOC12; ablation of DINO components (momentum encoder, multi-crop, cross-entropy loss); k-NN accuracy and throughput as a function of the input patch size]
Emerging properties in self-supervised Vision Transformers (DINO)
- Self-attention of the [CLS] token on the heads of the last layer of a ViT with 8×8 patches trained with no supervision: the model automatically learns class-specific features, leading to unsupervised object segmentation
- Accurate object regions are obtained without any label information
- Compared with supervised training, DINO's attention concentrates on the object regions
[Figure: attention maps of a supervised ViT vs. DINO; ablation table of the components that matter for self-supervised ViT pre-training — the best combination is the momentum encoder with the multi-crop augmentation and the cross-entropy loss]
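A minimal sketch of the DINO objective, assuming student_out / teacher_out are the head outputs for two views of the same image and center is a running mean of teacher outputs; the temperature values follow the ones quoted above, everything else is a placeholder:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened, centered teacher distribution and
    the student distribution; the teacher branch is never backpropagated."""
    t = F.softmax((teacher_out - center) / tau_t, dim=1).detach()
    log_s = F.log_softmax(student_out / tau_s, dim=1)
    return -(t * log_s).sum(dim=1).mean()

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """The teacher (momentum encoder) tracks the student by exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```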
Multi-modal: VoLTA [Pramanick+, arXiv 2022]
Processing in the image branch (α = 0 recovers a plain ViT block; α ≠ 0 adds gated cross-attention to the text):
  x̂ = SelfAtt(x)
  x = x + x̂ + α · CrossAtt(x̂, y)
  x = x + FFN(x)
x: patch features of the image; y: token features of the caption.
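A minimal sketch of the block described by the equations above, assuming standard multi-head attention modules; the dimensions, the missing layer norms, and the learnable scalar gate α initialized to 0 are assumptions based on the slide, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """x_hat = SelfAtt(x); x = x + x_hat + alpha * CrossAtt(x_hat, y); x = x + FFN(x)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # alpha = 0 at initialization: the block starts as a plain image-only block.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, y):
        # x: image patch tokens (B, N, dim); y: caption tokens (B, M, dim).
        x_hat, _ = self.self_attn(x, x, x)
        cross, _ = self.cross_attn(x_hat, y, y)  # queries from image, keys/values from text
        x = x + x_hat + self.alpha * cross
        x = x + self.ffn(x)
        return x
```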
Objective (Barlow Twins-style redundancy reduction):
  L = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} (C_ij)²
  C_ij = Σ_b z^A_{b,i} z^B_{b,j} / ( √(Σ_b (z^A_{b,i})²) · √(Σ_b (z^B_{b,j})²) )
i, j: indices over feature dimensions; b: index over the mini-batch; z^A, z^B: feature vectors of the two views; λ: positive weighting factor.
[Figure: C_ij is the cross-correlation between dimension i of z^A and dimension j of z^B, computed over the mini-batch]
-> Reduces redundancy between feature dimensions.
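A direct transcription of the loss above into PyTorch. The column-wise L2 normalization implements the denominator of C_ij; the default value of λ is a hedged placeholder, not the value used in the paper.

```python
import torch

def redundancy_reduction_loss(z_a, z_b, lam=5e-3, eps=1e-6):
    """L = sum_i (1 - C_ii)^2 + lam * sum_{i != j} C_ij^2.

    z_a, z_b: (batch, dim) embeddings of two views.
    C is the cross-correlation matrix between feature dimensions,
    computed over the mini-batch.
    """
    # Normalize each feature dimension over the batch (denominator of C_ij).
    z_a = z_a / (z_a.norm(dim=0, keepdim=True) + eps)
    z_b = z_b / (z_b.norm(dim=0, keepdim=True) + eps)
    c = z_a.T @ z_b                                   # (dim, dim) correlation matrix
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()    # push C_ii toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push C_ij (i != j) toward 0
    return on_diag + lam * off_diag
```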
Uni-modal downstream evaluation of VoLTA (linear probing on the ImageNet validation set, top-1 %; COCO pre-training unless noted): MoCo 44.5, MoCo-v2 49.3, VirTex (caption) 52.8, ICMLM (caption) 51.9, MCT (caption) 54.9, MCT (caption+tag) 55.3, VoLTA w/o CMAF RN50 (caption) 55.3, Swin-T 56.3, Swin-B 62.5, VoLTA Swin-B 62.5. Supervised IN-1K RN50 reaches 76.5.
Linear probing on VOC07 (mAP): VoLTA RN50 reaches 89.6, ahead of SwAV (88.9), BYOL (86.6), Barlow Twins (86.2), and VICReg (86.6); it also improves the COCO multi-label F1 scores.
Object detection and instance segmentation on VOC07+12 (Faster R-CNN) and COCO (Mask R-CNN, C4 backbone) are also benchmarked.
-> ResNet-50: high accuracy is achieved using captions only.
-> Swin Transformer: VoLTA also reaches high accuracy with ViT-style backbones.
Masked Language Modeling: BERT [Devlin+, NAACL 2019]
Next Sentence Prediction: predict whether Sentence B is the continuation of Sentence A.
[Figure: BERT pre-training (Masked LM + NSP on unlabeled sentence pairs A/B) and fine-tuning (e.g., SQuAD question answering, NER, MNLI) use the same architecture; only the output heads differ]
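To make the Masked LM objective concrete, here is a minimal sketch of BERT's token-masking recipe (mask 15% of tokens; of those, 80% become [MASK], 10% a random token, 10% unchanged). The token-id arguments are placeholders supplied by the caller.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, ignore_index=-100):
    """Prepare inputs/labels for Masked Language Modeling (BERT-style)."""
    device = input_ids.device
    labels = input_ids.clone()
    # Choose 15% of positions to predict.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob, device=device)).bool()
    labels[~masked] = ignore_index        # loss is computed only on masked positions

    # 80% of the chosen positions -> [MASK]
    replace = torch.bernoulli(torch.full(labels.shape, 0.8, device=device)).bool() & masked
    input_ids[replace] = mask_token_id
    # 10% -> a random token (half of the remaining 20%)
    rnd = torch.bernoulli(torch.full(labels.shape, 0.5, device=device)).bool() & masked & ~replace
    input_ids[rnd] = torch.randint(vocab_size, labels.shape, device=device)[rnd]
    # The remaining 10% are left unchanged.
    return input_ids, labels
```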
MAE: examples of masked input, reconstruction, and original image.
[Figure: masking ratio ablation — a high masking ratio (75%) works well for both fine-tuning and linear probing; the y-axis is ImageNet-1K validation accuracy]
The decoder is narrower and shallower than the encoder; mask tokens are only processed by the decoder, which significantly reduces computation. The decoder reconstructs the input by predicting a vector of pixel values per masked patch through a final linear layer, and the loss is the mean squared error between the reconstruction and the original image in pixel space, computed only on masked patches (optionally on per-patch normalized pixels), similar to BERT.
ViT-L/16 on ImageNet-1K: 76.5 (from scratch, original recipe) / 82.5 (from scratch, improved recipe) / 84.9 (fine-tuned from MAE pre-training); supervised ViT-L is nontrivial to train from scratch and tends to overfit.
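A minimal sketch of per-sample random masking for MAE-style pre-training (keep 25% of patch tokens and encode only those). The 0.75 ratio follows the slide; the shapes and helper structure are illustrative assumptions.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings. Returns visible tokens, mask, and restore indices."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # ascending: smallest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)         # to undo the shuffle in the decoder
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)    # 1 = masked, 0 = visible
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # mask in the original patch order
    return visible, mask, ids_restore
```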
MaskFeat pre-training: input space-time cubes of a video are randomly replaced with a [MASK] token, and the model directly regresses features (e.g., HOG) of the masked regions; after pre-training, the Transformer is fine-tuned on the end task.
[Tables: comparing target features — pixel RGB, HOG image descriptors, dVAE (DALL-E) tokens, unsupervised features (e.g., DINO ViT-B), supervised features (MViT-B), and pseudo-labels; fine-tuning accuracy on Kinetics-400 (video) and IN-1K (image)]
-> Predicting HOG features reaches accuracy comparable to predicting the features of a pre-trained model.
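A hedged sketch of regressing a hand-crafted descriptor for masked patches, using scikit-image's HOG as a stand-in target; the patch size, HOG parameters, and loss masking are illustrative assumptions, not MaskFeat's exact configuration.

```python
import numpy as np
import torch
from skimage.feature import hog

def hog_targets(gray_image, patch=16):
    """Compute a HOG descriptor per 16x16 patch of a grayscale image (H, W) ndarray."""
    H, W = gray_image.shape
    feats = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            f = hog(gray_image[i:i + patch, j:j + patch],
                    orientations=9, pixels_per_cell=(8, 8),
                    cells_per_block=(1, 1), feature_vector=True)
            feats.append(f)
    return torch.tensor(np.stack(feats), dtype=torch.float32)  # (N_patches, feat_dim)

def masked_feature_loss(pred, target, mask):
    """MSE between predicted and target descriptors, on masked patches only.

    pred, target: (B, N, feat_dim); mask: (B, N) with 1 = masked.
    """
    loss = ((pred - target) ** 2).mean(dim=-1)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```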
Improved masking: Attention-Guided MIM (AttMask) [Kakogeorgiou+, ECCV 2022]
- The teacher is an exponential moving average of the student.
- Unlike random or block-wise masking, AttMask uses the attention map arising in the encoder to mask the most highly attended patches.
[Figure: input image, random masking (30% / 75%), block-wise masking, attention map, and the AttMask-High / AttMask-Low variants]
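A minimal sketch of the idea: use the teacher's [CLS] attention to decide which patches to hide, masking the most highly attended ones. Averaging over heads and the masking ratio are assumptions, not the paper's exact settings.

```python
import torch

def attention_guided_mask(cls_attn, mask_ratio=0.5):
    """cls_attn: (B, heads, N) attention from the [CLS] token to the N patch tokens
    in the teacher's last layer. Returns a boolean mask (B, N), True = masked."""
    B, H, N = cls_attn.shape
    scores = cls_attn.mean(dim=1)                               # average over heads -> (B, N)
    n_mask = int(N * mask_ratio)
    idx = scores.argsort(dim=1, descending=True)[:, :n_mask]    # most attended patches
    mask = torch.zeros(B, N, dtype=torch.bool, device=cls_attn.device)
    mask[torch.arange(B, device=cls_attn.device).unsqueeze(1), idx] = True
    return mask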
SdAE (self-distillated masked autoencoder) [Chen+]: fine-tuning accuracy improves.
ImageNet-1K classification, top-1 accuracy with ViT-B ("Epochs" = pre-training epochs; MoCo v3 and DINO use multi-crop pre-training — MoCo v3: 2 global 224x224 crops, DINO: 2 global 224x224 crops and 10 local 96x96 crops):
Method               Epochs  Crops  Finetune  Linear
Train from scratch     300     -      81.8      -
MoCo v3                300     2      83.2     76.2
DINO                   400    12      83.3     77.3
BEiT                   300     1      83.0     49.4
MAE                    100     1      82.1     54.8
MAE                    300     1      82.9     61.5
MAE                   1600     1      83.6     67.8
CAE                    300     1      83.3     64.2
SdAE                   100     1      83.5     60.3
SdAE                   300     1      84.1     64.9
Compared with the recently proposed CAE, SdAE gains 0.8% top-1 accuracy, and with only 100 pre-training epochs it is already comparable to MAE trained far longer.
I-JEPA: linear evaluation on ImageNet-1K and semi-supervised evaluation with 1% of the labels.
Linear evaluation, methods without view data augmentations — data2vec ViT-L/16 (1600 ep) 53.5; MAE ViT-B/16 68.0, ViT-L/16 76.0, ViT-H/14 77.2; I-JEPA ViT-B/16 (600 ep) 72.9, ViT-L/16 (600 ep) 77.5, ViT-H/14 (300 ep) 79.3, ViT-H/16 at 448px (300 ep) 81.1. With extra view augmentations — SimCLR v2 RN152 (2x) 79.1, DINO ViT-B/8 80.1, iBOT ViT-L/16 81.0.
ImageNet-1% (fine-tuning or linear probing, whichever is best per method): MAE ViT-L/16 67.1, ViT-H/14 71.5; I-JEPA ViT-L/16 69.4, ViT-H/14 73.3, ViT-H/16 at 448px 77.3; DINO ViT-B/8 70.0, iBOT ViT-B/16 69.7, MSN ViT-B/4 75.7. I-JEPA outperforms MAE, benefits from scale, and the largest model matches view-invariance approaches without hand-crafted view augmentations.
[Figure: ImageNet linear evaluation vs. pre-training GPU hours — I-JEPA reaches higher accuracy with fewer GPU hours than MAE]
-> Accuracy improves over conventional MIM, contrastive, and negative-free approaches.
-> Compared with MAE, I-JEPA achieves higher accuracy with less training time.
Introducing contrastive learning: MSN [Assran+, ECCV 2022]
[Figure: the anchor view is patchified and masked, the target view is patchified without masking; an encoder f_q produces the anchor representation z and an EMA encoder produces the target representation z+; both are softly assigned to learnable prototypes, and the cross-entropy H(p+, p) between the target and anchor cluster assignments is minimized]
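A minimal sketch of matching cluster assignments between the masked anchor view and the unmasked target view over a set of learnable prototypes; the temperatures are hedged defaults, and regularizers used in practice (e.g., a mean-entropy term) are omitted.

```python
import torch
import torch.nn.functional as F

def msn_style_loss(z_anchor, z_target, prototypes, tau_a=0.1, tau_t=0.025):
    """z_anchor: (B, D) embedding of the masked view (student encoder).
    z_target: (B, D) embedding of the full view (EMA encoder, no gradient).
    prototypes: (K, D) learnable cluster centers."""
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_target = F.normalize(z_target, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    p_anchor = F.softmax(z_anchor @ protos.T / tau_a, dim=-1)      # soft assignments
    with torch.no_grad():
        p_target = F.softmax(z_target @ protos.T / tau_t, dim=-1)  # sharper target assignments
    # Cross-entropy H(p_target, p_anchor) between the two assignment distributions.
    return -(p_target * torch.log(p_anchor + 1e-8)).sum(dim=-1).mean()
```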
Linear evaluation on ImageNet-1K with 100% of the labels: MSN ViT-S/16 (600 ep) 76.9 and ViT-L/7 (200 ep) 80.7, versus DINO ViT-S/16 77.0 and ViT-B/8 80.1, iBOT ViT-S/16 77.9, MoCo v3 ViT-BN-L/7 81.0, and MAE ViT-H/14 76.6.
End-to-end fine-tuning of ViT-B/16 with 100% of the labels: DINO 83.6, BEiT 83.2, iBOT 83.8, MAE 83.6, SimMIM 83.8, MaskFeat 84.0 — MSN is competitive with both joint-embedding and auto-encoding approaches.
Extreme low-shot evaluation (1 / 2 / 5 labelled images per class, mean top-1 over 3 splits): MSN ViT-L/7 reaches 57.1 / 66.4 / 72.1, while MAE ViT-H/14 reaches 11.6 / 18.6 / 32.8 and DINO ViT-B/8 reaches 45.8 / 55.9 / 64.6. Unlike patch-level losses (iBOT, SplitMask, data2vec), MSN matches only global view representations and can ignore masked patches entirely, reducing computation and memory.
-> The linear evaluation above uses all training labels; the low-shot evaluation uses only 1-5 samples per class.
-> In few-shot settings, the contrastive-style MSN clearly outperforms conventional MIM.
Introducing contrastive learning: FLIP [Li+, arXiv 2022]
Zero-shot accuracy on ImageNet-1K classification (image size 224; FLIP uses a 64k batch, 50% masking ratio, and unmasked tuning):
Method            Data         Epochs  B/16  L/16  L/14  H/14
CLIP              WIT-400M       32    68.6   -    75.3   -
OpenCLIP          LAION-400M     32    67.1   -    72.8   -
CLIP (repro.)     LAION-400M     32    68.2  72.4  73.1   -
FLIP              LAION-400M     32    68.0  74.3  74.6  75.5
Linear probing and fine-tuning on ImageNet-1K (ViT-L): FLIP reaches 74.3 zero-shot / 83.6 linear / 86.9 fine-tune vs. 72.4 / 82.6 / 86.3 for the CLIP reproduction.
The CLIP baseline takes about 10 days on 256 TPU-v3 cores, so the 2-3x speedup from masking saves several days of wall-clock time and makes scaling studies practical.
Zero-shot accuracy on a wider set of 25 classification datasets (ViT-L/14, image size 224): among the models trained on LAION-400M, FLIP gives the best result on the majority of the datasets.
-> Accuracy improves on most of the evaluated datasets.
-> Accuracy improves in both zero-shot transfer and linear evaluation.
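A rough sketch of the key change in FLIP relative to CLIP: randomly drop a fraction (e.g., 50%) of image patch tokens before the image encoder, then apply the usual symmetric contrastive (InfoNCE) loss. The encoders themselves are placeholders; only the masking and the CLIP-style loss are shown.

```python
import torch
import torch.nn.functional as F

def drop_patches(patch_tokens, mask_ratio=0.5):
    """Randomly keep (1 - mask_ratio) of the patch tokens per image: (B, N, D) -> (B, N_keep, D)."""
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    ids = torch.rand(B, N, device=patch_tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(patch_tokens, 1, ids.unsqueeze(-1).expand(-1, -1, D))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and text embeddings of shape (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```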
Applying MIM to different modalities: MultiMAE [Bachmann+, ECCV 2022]
[Figure: MultiMAE pre-training — a small subset of randomly sampled patches from multiple modalities (RGB, depth, semantic segmentation) is linearly projected to tokens of a fixed dimension and encoded with a Transformer; task-specific decoders reconstruct the masked patches by first cross-attending from queries to the encoded tokens, followed by a shallow Transformer. The queries consist of mask tokens, with the task-specific encoded tokens added at their respective positions. Fine-tuning can then be single-modal or multi-modal.]
Depth and semantic segmentation data must be prepared in advance (or pseudo-labelled).
MultiMAE decoder: each decoder first linearly projects the encoder outputs to the decoder dimension, adds sine-cosine positional embeddings and learned modality embeddings, and then applies a cross-attention layer, an MLP, and two Transformer blocks.
Query: the mask tokens of the target modality (with the task-specific encoded tokens at their positions); Key / Value: the linearly projected tokens of all modalities.
To keep the number of segmentation patches constant, the semantic segmentation input is downsampled by a factor of 4 and 4x4 patches are used.
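A minimal sketch of a task-specific decoder of the kind described above: mask-token queries cross-attend to the projected tokens of all modalities, followed by a shallow Transformer. The dimensions, depth, and omission of positional/modality embeddings are assumptions for brevity.

```python
import torch
import torch.nn as nn

class TaskDecoder(nn.Module):
    """Cross-attention from mask-token queries to all encoded modality tokens."""

    def __init__(self, enc_dim=768, dec_dim=256, heads=8, depth=2):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)          # adapt encoder outputs to decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.cross_attn = nn.MultiheadAttention(dec_dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, heads, dim_feedforward=4 * dec_dim,
                                       batch_first=True),
            num_layers=depth)
        # Positional and modality embeddings are omitted in this sketch.

    def forward(self, encoded_tokens, n_masked):
        # encoded_tokens: (B, N_visible_over_all_modalities, enc_dim)
        kv = self.proj(encoded_tokens)                           # keys / values
        q = self.mask_token.expand(kv.size(0), n_masked, -1)     # one query per masked patch
        q, _ = self.cross_attn(q, kv, kv)
        return self.blocks(q)                                    # (B, n_masked, dec_dim)
```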
"MultiMAE: Multi-modal Multi-task Masked Autoencoders" (Bachmann, Mizrahi, Atanov, Zamir; EPFL). https://multimae.epfl.ch
[Figure: pre-training objective — 1/6 of all 16x16 patches is randomly selected across the RGB, depth, and semantic modalities, and the model learns to reconstruct the remaining 5/6 masked patches; validation examples from ImageNet show masked inputs, MultiMAE predictions, and targets]
Fine-tuning with RGB only (top-1 accuracy on ImageNet-1K classification (C), mIoU on ADE20K / Hypersim / NYUv2 semantic segmentation (S), δ1 accuracy on NYUv2 depth (D); all methods pre-trained on ImageNet-1K, with pseudo labels for MultiMAE):
Method        IN-1K(C)  ADE20K(S)  Hypersim(S)  NYUv2(S)  NYUv2(D)
Supervised      81.8      45.8        33.9        50.1      80.7
DINO            83.1      44.6        32.5        47.9      81.3
MoCo-v3         82.8      43.7        31.7        46.6      80.9
MAE             83.3      46.2        36.5        50.8      85.1
MultiMAE        83.3      46.2        37.0        52.0      86.4
Fine-tuning with RGB and ground-truth depth (semantic segmentation mIoU; for MAE, a new input projection is trained for the additional modality during fine-tuning):
            Hypersim RGB / D / RGB-D    NYUv2 RGB / D / RGB-D
MAE              36.5 / 32.5 / 36.9        50.8 / 23.4 / 49.3
MultiMAE         37.0 / 38.5 / 47.6        52.0 / 41.4 / 56.0
Fine-tuning with RGB and pseudo labels (e.g., NYUv2 semantic segmentation: MAE RGB 50.1 -> RGB-pD-pS 49.3, MultiMAE RGB 52.0 -> RGB-pD-pS 54.0): MultiMAE benefits much more than MAE from pseudo-labelled modalities as input.
(pD): depth obtained by pseudo-labelling; (pS): semantic segmentation obtained by pseudo-labelling.
-> MultiMAE can effectively leverage additional modalities such as depth, while MAE cannot; fine-tuning with RGB plus pseudo depth and pseudo segmentation yields the highest recognition performance.
MAE as Spatiotemporal Learners [Feichtenhofer+, NeurIPS 2022] — masking applied to video
[Figure: a large subset (e.g., 90%) of random spacetime patches is masked; the encoder operates only on the visible patches and a small decoder then processes the full set of encoded patches and mask tokens to reconstruct the input; apart from the patch and positional embeddings, neither the encoder, the decoder, nor the masking strategy has any spatiotemporal inductive bias]
Because slow motion is more likely than fast motion in natural video, the masking ratio can be very high. A 90% ratio reduces encoder time and memory to below 1/10, yielding a theoretical 7.7x reduction in computation and a measured 4.1x wall-clock speedup (data loading becomes the new bottleneck). On Kinetics-400, MAE pre-training improves ViT-Large accuracy by an absolute 13% over training from scratch while taking less overall wall-clock time (pre-training plus fine-tuning).
[Figure: Kinetics-400 validation examples at 90% and 95% masking — original video, masked video, and MAE output; videos of size 16x224x224 with 2x16x16 spacetime patches give 8x14x14 = 1568 tokens, of which 156 are visible at 90% masking]
MAE that Listen (Audio-MAE) [Huang+, NeurIPS 2022] — masking applied to the spectrogram
[Figure: an audio recording is first transformed into a spectrogram and split into patches; a large subset (80%) of patch embeddings is masked, the encoder operates on the visible 20%, and a decoder processes the order-restored embeddings plus mask tokens to reconstruct the input, minimizing the MSE on the masked portion of the spectrogram]
[Figure: accuracy on the Kinetics-400 validation set vs. wall-clock training time on 128 A100 GPUs — MAE pre-training (800 epochs) plus fine-tuning (100 epochs) is both more accurate and faster than training from scratch for 400 epochs]
-> MAE pre-training plus fine-tuning reaches higher performance in less training time.