Reference material on Self-Supervised Learning
Created January 24, 2023
Naoki Okamoto (Chubu University, Machine Perception & Robotics Group)

Self-Supervised Learning
Naoki Okamoto, Hironobu Fujiyoshi, Tsubasa Hirakawa, Takayoshi Yamashita (Chubu University, Machine Perception & Robotics Group)
Masanori Suganuma (Tohoku University)
http://mprg.jp
Self-supervised learning
• Learn from large amounts of unlabeled data through an artificial pretext task
  - The model trained by self-supervised learning is then used as a pre-trained model
• Representative approaches
  - Contrastive learning: SimCLR [Chen+, ICML 2020], MoCo [He+, CVPR 2020]
  - Negative-free: BYOL [Grill+, NeurIPS 2020], SimSiam [Chen+, CVPR 2021]
  - Masked Image Modeling: SimMIM [Xie+, CVPR 2022], MAE [He+, CVPR 2022]
Self-supervised learning (SSL: Self-supervised Learning)
(1) Build a pre-trained model from a large amount of unlabeled data
(2) Fine-tune the pre-trained model built by SSL on the target task
[Diagram: a large unlabeled dataset → pre-trained model; the pre-trained model is then combined with an FC layer or a task-specific head and trained with supervised labels (e.g. "Pelican")]
→ downstream models: image classification model, object detection model
Self-supervised learning
• Artificial pretext tasks are built by applying data augmentations such as geometric image transforms
• Example pretext task: SimCLR (random crop + color distortion)
  - Random crop: creates prediction tasks between views at the same location and at neighboring locations
  - Color distortion: creates a color-prediction task and prevents the location task from being solved from color cues alone
[Figure: artificial pretext tasks — same-location prediction, neighboring-location prediction, color prediction; random crop, color distortion]
How self-supervised models are evaluated
• Evaluate the feature representation obtained by self-supervised learning
  - Evaluation with the KNN method
  - Linear evaluation
• Evaluate transferability as a pre-trained model
  - Fine-tuning
How self-supervised models are evaluated: KNN evaluation
• Apply the KNN method to the features extracted by the model and the ground-truth labels
  - Accuracy varies little with hyperparameters, which allows a unified comparison across methods
[Diagram: the model trained by self-supervised learning (on the SSL dataset) extracts features; features and labels of the training data (e.g. Airplane, Cat) and of the evaluation data (e.g. Cat, Dog) are fed to the KNN method, and each sample is classified from its K nearest neighbors]
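A minimal sketch of this KNN evaluation protocol, assuming `model` is the frozen SSL-trained feature extractor and `train_loader`/`test_loader` are placeholder data loaders of (image, label) batches; scikit-learn's KNeighborsClassifier handles the neighbor search:

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    model.eval()
    feats, labels = [], []
    for x, y in loader:
        f = model(x.to(device))                          # (B, D) features from the SSL model
        f = torch.nn.functional.normalize(f, dim=1)      # L2-normalize before KNN
        feats.append(f.cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def knn_evaluate(model, train_loader, test_loader, k=20):
    train_f, train_y = extract_features(model, train_loader)
    test_f, test_y = extract_features(model, test_loader)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_f, train_y)                            # "training" = storing labeled features
    return knn.score(test_f, test_y)                     # accuracy from the K nearest neighbors
```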
How self-supervised models are evaluated: linear evaluation
• Train an FC layer in a supervised manner on the features extracted by the model and the ground-truth labels
  - The optimal supervised-training hyperparameters differ between self-supervised learning methods
[Diagram: the model trained by self-supervised learning (on the SSL dataset) extracts features; a randomly initialized FC layer is trained with supervised labels (e.g. Airplane, Cat) on top of the frozen features]
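A minimal linear-evaluation sketch, assuming `backbone` is the frozen SSL-trained encoder with feature dimension `feat_dim` and `train_loader` yields (image, label) batches; names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def linear_eval(backbone, train_loader, feat_dim, num_classes,
                epochs=90, lr=0.1, device="cuda"):
    backbone.eval()                                      # backbone stays frozen
    for p in backbone.parameters():
        p.requires_grad_(False)
    fc = nn.Linear(feat_dim, num_classes).to(device)     # randomly initialized FC layer
    opt = torch.optim.SGD(fc.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                f = backbone(x.to(device))               # features from the frozen model
            loss = nn.functional.cross_entropy(fc(f), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return fc
```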
How self-supervised models are evaluated: fine-tuning
• Fine-tune on a dataset (downstream task) different from the one used during self-supervised learning
[Diagram: the model trained by self-supervised learning is combined with a task-specific architecture (an FC layer trained with labels such as "Pelican" for classification, or a detection head) and trained in a supervised manner, yielding an image classification model or an object detection model]
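A minimal fine-tuning sketch for a downstream classification task, assuming `pretrained` is a torchvision-style backbone whose final `fc` attribute can be replaced and `loader` yields downstream (image, label) batches; all names are illustrative:

```python
import torch
import torch.nn as nn

def finetune(pretrained, loader, num_classes, epochs=100, lr=0.01, device="cuda"):
    pretrained.fc = nn.Linear(pretrained.fc.in_features, num_classes)   # new task-specific head
    pretrained.to(device).train()
    opt = torch.optim.SGD(pretrained.parameters(), lr=lr, momentum=0.9) # update all layers
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.cross_entropy(pretrained(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return pretrained
```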
Representative self-supervised learning methods (roadmap)

Self-supervised learning for CNNs — improving the pretext task:
• Context Prediction [C. Doersch+, ICCV 2015]: predicts the relative position between patches of a patch-divided image
• Context Encoders [D. Pathak+, CVPR 2016]: predicts the pixels of a masked region
• Colorization [R. Zhang+, ECCV 2016]: predicts color information
• Jigsaw [M. Noroozi and P. Favaro, ECCV 2016]: solves jigsaw puzzles
• Counting [M. Noroozi+, ICCV 2017]: trains so that the summed patch outputs match the output of the whole image
• Image Rotations [S. Gidaris+, ICLR 2018]: predicts the rotation angle
• Jigsaw++ [M. Noroozi+, CVPR 2018]: mixes the puzzles of two images
• Spot Artifacts [S. Jenni and P. Favaro, CVPR 2018]
• Instance Discrimination [Z. Wu+, CVPR 2018]: learns each image as its own class
• Unsupervised Embedding Learning [M. Ye+, CVPR 2019]

Contrastive learning:
• CPC [A. van den Oord+, arXiv 2018]: builds pairs between image patches for contrastive learning (neighboring-word prediction from NLP applied to images)
• CPC v2 [O. J. Hénaff+, ICML 2020]: improves pair construction, model architecture, etc.
• PIRL [I. Misra and L. van der Maaten, CVPR 2020]: introduces jigsaw puzzles into contrastive learning
• SimCLR [T. Chen+, ICML 2020]: proposes a simple contrastive-learning framework; SimCLRv2 [T. Chen+, NeurIPS 2020]: introduces large-scale networks
• MoCo [K. He+, CVPR 2020]: reuses past outputs as negative pairs; MoCo v2 [X. Chen+, arXiv 2020]: adopts SimCLR's techniques
• PCL [J. Li+, ICLR 2021]: introduces prototypes
• Barlow Twins [J. Zbontar+, ICML 2021]: uses statistics over the batch dimension

Negative-free:
• BYOL [J. Grill+, NeurIPS 2020]: learns from positive pairs only
• SimSiam [X. Chen+, CVPR 2021]: proposes an even simpler training scheme
• SwAV [M. Caron+, NeurIPS 2020]: estimates the cluster to which a positive pair belongs
• Analysis — BYOL works even without batch statistics [P. Richemond+, arXiv 2020]: are batch-normalization statistics an implicit negative pair? → what matters is the training stabilization provided by normalization

Natural language processing: Word2vec [T. Mikolov+, arXiv 2013] (neighboring-word prediction), BERT [J. Devlin+, NAACL 2019] (Masked Language Modeling, MLM, later applied to images as Masked Image Modeling)

Self-supervised learning for ViT:
• MoCo v3 [X. Chen+, ICCV 2021]: evaluates effectiveness on ViT
• MoBY (MoCo + BYOL) [Z. Xie+, arXiv 2021]: a training method for ViT
• DINO [M. Caron+, ICCV 2021]: proposes contrastive learning using data augmentation and multiple views of an image (predicts global from local and global from global)
• EsViT [C. Li+, ICLR 2022]: adds local-to-local prediction
• Masked Image Modeling (MIM):
  - BEiT [H. Bao+, ICLR 2022], iBOT [J. Zhou+, ICLR 2022]: mask the features and predict the features of the masked regions
  - MAE [K. He+, CVPR 2022], SimMIM [Z. Xie+, CVPR 2022]: predict the pixels of the masked regions

Multimodal extensions (image + text): MCT [X. Yuan+, CVPR 2021], CLIP [A. Radford+, ICML 2021] (zero-shot transfer), VoLTA [S. Pramanick+, arXiv 2022] (local feature alignment)

Analysis:
• Understanding the Behaviour of Contrastive Loss [F. Wang and H. Liu, CVPR 2021]: analyzes loss design and its effect on training
• How Well Do Self-Supervised Models Transfer? [L. Ericsson+, CVPR 2021]: evaluates transferability under various settings
• When Does Contrastive Visual Representation Learning Work? [E. Cole+, CVPR 2022]: analyzes the relationship with the dataset
• InfoMin [Y. Tian+, NeurIPS 2020]: analyzes how positive pairs are combined
Improving the pretext task
• Colorful Image Colorization [Zhang+, ECCV 2016]
  - Create a grayscale image from a color image
  - Predict the a/b channels of the Lab color space from the grayscale image
• Predicting Image Rotations [Gidaris+, ICLR 2018]
  - Apply one of four rotations (0°, 90°, 180°, 270°) to the image
  - Predict which of the four rotations was applied (classification); see the sketch after the figures below
[Figure: Colorful Image Colorization network (Zhang, Isola, Efros, Fig. 2) — each conv block consists of 2 or 3 repeated conv+ReLU layers followed by BatchNorm; the network has no pooling layers, and all resolution changes are done by spatial down/upsampling]
[Figure: Image Rotations pipeline (Gidaris+, ICLR 2018, Fig. 1) — the image X is rotated by 0/90/180/270 degrees via g(X, y); a shared ConvNet model F(.) is trained to maximize the probability of the applied rotation label y (predict 0/90/180/270 degrees)]
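A minimal sketch of the rotation-prediction pretext task described above, assuming `model` outputs 4 logits; `rotation_pretext_loss` is an illustrative name, not the authors' code:

```python
import torch
import torch.nn.functional as F

def rotation_pretext_loss(model, x):
    # x: (B, C, H, W). Build the four rotated copies and their labels 0..3.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    inputs = torch.cat(rotations, dim=0)                          # (4B, C, H, W)
    labels = torch.arange(4).repeat_interleave(x.size(0)).to(x.device)
    logits = model(inputs)                                        # (4B, 4) rotation logits
    return F.cross_entropy(logits, labels)                        # classify the applied rotation
```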
Improving the pretext task
• Solving Jigsaw Puzzles
  - Create nine tiled patches and shuffle them
  - Predict the index of the pre-defined shuffle permutation that was applied
• Context Encoders [Pathak+, CVPR 2016]
  - Predict the masked region with an encoder-decoder model
[Figure: Context Free Network (Noroozi and Favaro, Fig. 3) — illustrates how a jigsaw puzzle is generated from image tiles]
[Figure: Context Encoder (Pathak+, Fig. 2) — the context image is passed through the encoder; its features are connected to the decoder through a channel-wise fully-connected layer, and the decoder produces the missing region]
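A minimal sketch of a Context Encoders-style masked-reconstruction pretext task, assuming `net` is an encoder-decoder model; only the L2 loss on the masked region is shown here, while the original method also adds an adversarial loss:

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(net, x, mask_size=64):
    B, C, H, W = x.shape                                   # assumes H, W > mask_size
    top = torch.randint(0, H - mask_size + 1, (1,)).item()
    left = torch.randint(0, W - mask_size + 1, (1,)).item()
    mask = torch.zeros(1, 1, H, W, device=x.device)
    mask[..., top:top + mask_size, left:left + mask_size] = 1.0
    corrupted = x * (1 - mask)                             # zero out the masked region
    recon = net(corrupted)                                 # encoder-decoder predicts the image
    return F.mse_loss(recon * mask, x * mask)              # loss only on the masked region
```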
Improving the pretext task
• Learning to Count
  - Train so that the features of the whole image match the (summed) features of its split patches
• Non-Parametric Instance Discrimination [Wu+, CVPR 2018]
  - Train the features of each image to be independent (each sample is treated as its own class)
[Figure: Learning to Count — a shared-weight feature extractor φ is applied to a downsampled image D∘x and to its tiles T_1∘x ... T_4∘x; the loss pushes φ(D∘x) toward the sum of the tile features, with a contrastive margin term max{0, M − |c − t|²} against a different image y]
[Figure 2 (Wu+): Instance Discrimination pipeline — a backbone CNN encodes each image into a 128-D, L2-normalized feature; a non-parametric softmax over a memory bank performs instance-level discrimination, scattering the training samples over the 128-D unit sphere; the temperature τ controls the concentration of the distribution]
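A minimal sketch of non-parametric instance discrimination with a memory bank, assuming `memory_bank` is an (N, 128) buffer of L2-normalized features and `indices` are the dataset indices of the batch; a full softmax over all instances is used here for simplicity, whereas the paper approximates it with NCE:

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(features, indices, memory_bank, tau=0.07, momentum=0.5):
    f = F.normalize(features, dim=1)                 # (B, 128) points on the unit sphere
    logits = f @ memory_bank.t() / tau               # similarity to every stored instance
    loss = F.cross_entropy(logits, indices)          # the "class" is the instance index
    with torch.no_grad():                            # momentum update of the memory bank
        updated = momentum * memory_bank[indices] + (1 - momentum) * f
        memory_bank[indices] = F.normalize(updated, dim=1)
    return loss
```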
Improving the pretext task
• Context Prediction [Doersch+, ICCV 2015]
  - Predict the relative position between patches cropped in a tiled layout
• Contrastive Predictive Coding (CPC v2) [Hénaff+, ICML 2020]
  - From a patch feature, predict the features of patches located k positions ahead
[Figure (Hénaff+): CPC pipeline — a patched ResNet-161 feature extractor f_θ maps a 256×256 image to a 7×7×4096 feature grid z; a masked ConvNet context network g_φ produces the context c, trained with the InfoNCE loss; the pre-trained (fixed or tuned) encoder is then evaluated by linear classification, efficient classification with 1%-100% of the labels, and transfer learning (e.g. Faster-RCNN), against a supervised ResNet baseline]
[Figure 2 (Doersch+): the algorithm receives two patches in one of eight possible spatial arrangements, without any context, and must classify which configuration was sampled]
Contrastive learning: SimCLR [Chen+, ICML 2020]
• SimCLR: a Simple Framework for Contrastive Learning of Visual Representations
• Train so that the similarity between features from the same source image is large and the similarity between features from different images is small
  - Task setting: find the pair of features that comes from the same source image
[Diagram: mini-batch → data augmentation → encoder (network) → projector (MLP) → features → NT-Xent loss; features of the same image are pulled together (positive pair), features of different images are pushed apart (negative pair); gradients are backpropagated through both branches]
• Encoder: the network without its output layer
• Projector: a 2-layer MLP
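A minimal sketch of SimCLR-style positive-pair construction with torchvision (random crop + color distortion, plus flip and grayscale; the exact augmentation parameters are illustrative, and the paper additionally uses Gaussian blur):

```python
import torchvision.transforms as T

simclr_augment = T.Compose([
    T.RandomResizedCrop(224),                                   # random crop
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # color distortion
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def make_views(pil_image):
    # Two independent stochastic views of the same image form one positive pair;
    # views of different images in the mini-batch act as negative pairs.
    return simclr_augment(pil_image), simclr_augment(pil_image)
```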
Contrastive learning: SimCLR [Chen+, ICML 2020]
• Analysis of data augmentation: how linear-evaluation accuracy changes with the combination of two augmentations
• Accuracy changes depending on how augmentations are combined
  - Crop + color distortion is the best combination → it became the standard setting for methods after SimCLR
Contrastive learning: SimCLR [Chen+, ICML 2020]
• The Normalized Temperature-scaled Cross-Entropy loss (NT-Xent) is used as the loss function
  - The similarity relations between samples are expressed as probabilities and a cross-entropy loss is computed

L_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

sim(·,·): cosine similarity; (z_i, z_j): positive pair; denominator: similarities over all pairs; τ: temperature parameter
The same loss can be read as a temperature-scaled softmax followed by cross-entropy. Example where samples 1 and 2 form the positive pair (i = 1, j = 2):

p_{i,j} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

logits sim(z_1, z_2), sim(z_1, z_3), ..., sim(z_1, z_{2N}) → temperature-scaled softmax → probabilities p_{1,2}, p_{1,3}, ..., p_{1,2N} → cross-entropy with the labels y_{1,2}, y_{1,3}, ..., y_{1,2N}:

L_{i,j} = -\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \, y_{i,k} \log p_{i,k} = -\log p_{i,j}
Effect of the temperature parameter τ (same example, i = 1, j = 2):
  - τ < 1.0: sharpens the probability distribution p_{1,k}
  - τ > 1.0: flattens the probability distribution
  - Compared with the ordinary softmax (τ = 1.0), SimCLR uses temperature values smaller than 1.0
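A minimal NT-Xent sketch in PyTorch matching the formula above, assuming `z1`, `z2` are the projector outputs of the two views and row i of each forms a positive pair; `nt_xent` is an illustrative name, not the authors' code:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D), unit norm
    sim = z @ z.t() / tau                                     # cosine similarity / temperature
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # drop the k == i terms
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)                      # mean of L_{i,j} over the 2N anchors
```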
Analysis of contrastive learning: What Makes for Good Views for Contrastive Learning [Tian+, NeurIPS 2020]
• Analyzes what makes a good positive pair from the viewpoint of mutual information
• Shows experimentally that the mutual information of a pair should be neither too large nor too small
Effect of the crop location in random cropping
[Figure (Tian+): views are built by cropping two patches at various spatial offsets (self-supervised learning on DIV2K, linear evaluation on CIFAR-10); mutual information is large for small patch distances and small for large ones, and downstream accuracy first increases and then decreases as I_NCE grows (a reverse-U shape)]

I(v_1; v_2) \ge \log(K) - L_{NCE} = I_{NCE}(v_1; v_2)
[Figure (Tian+): transfer performance versus the number of bits captured — too little mutual information misses task-relevant signal ("missing info", not enough signal), too much adds nuisance bits ("excess info", noise); the sweet spot lies where I(v_1; v_2) = I(x; y)]
Figure 2 (Tian+): as the mutual information between views is changed, information about the downstream task and nuisance variables can be selectively included or excluded, biasing the learned representation; (a) depicts views chosen to preserve downstream-task information while throwing out nuisance information, while in (b) reducing MI always throws out information relevant for the task, so performance decreases as MI is reduced.
Analysis of contrastive learning: Understanding the Behaviour of Contrastive Loss [Wang and Liu, CVPR 2021]
• Analyzes the relationship between the temperature parameter of the loss and the learned feature representation
  - Small temperature: positive and negative pairs are pushed strongly apart (high uniformity, low tolerance)
  - Large temperature: the similarity of positive pairs approaches 1
• Good performance is achieved at intermediate temperatures (0.3 gives the best accuracy for the ordinary contrastive loss in the table below)
Dataset / metric | Contrastive loss (τ = 0.07, 0.3, 0.7, 1.0) | Simple loss | Hard contrastive loss (τ = 0.07, 0.3, 0.7, 1.0) | Hard simple loss
(each row below lists its ten values in this column order)
CIFAR10
accuracy 79.75 83.27 82.69 82.21 74.83 79.2 83.63 84.19 84.19 84.84
uniformity 3.86 3.60 3.17 2.96 1.68 3.88 3.89 3.87 3.86 3.85
tolerance 0.04 0.178 0.333 0.372 0.61 0.034 0.0267 0.030 0.030 0.030
CIFAR100
accuracy 51.82 56.44 50.99 48.33 39.31 50.77 56.55 57.54 56.77 55.71
uniformity 3.86 3.60 3.18 2.96 2.12 3.87 3.88 3.87 3.86 3.86
tolerance 0.10 0.269 0.331 0.343 0.39 0.088 0.124 0.158 0.172 0.174
SVHN
accuracy 92.55 95.47 94.17 92.07 70.83 91.82 94.79 95.02 95.26 94.99
uniformity 3.88 3.65 3.27 3.05 1.50 3.89 3.91 3.90 3.88 3.85
tolerance 0.032 0.137 0.186 0.197 0.074 0.025 0.021 0.021 0.023 0.026
ImageNet100
accuracy 71.53 75.10 69.03 63.57 48.09 68.33 74.21 74.70 74.28 74.31
uniformity 3.917 3.693 3.323 3.08 1.742 3.929 3.932 3.927 3.923 3.917
tolerance 0.093 0.380 0.427 0.456 0.528 0.067 0.096 0.121 0.134 0.157
Table 1. We report the accuracy of linear classification on CIFAR10, CIFAR100 and SVHN, including models trained with the ordinary
contrastive loss, simple contrastive loss, hard contrastive loss and hard simple contrastive loss. For models trained on ordinary contrastive
loss and hard contrastive loss, we select several representative temperatures. More results are shown in the supplementary material.
Accuracy change with the temperature parameter
[Figure 8 (Wang and Liu): similarity distributions of positive samples and of the top-10 nearest negative samples at different temperatures]
Relationship between the temperature parameter and the feature similarity of positive and negative pairs
Negative-free: BYOL [Grill+, NeurIPS 2020]
• BYOL: Bootstrap Your Own Latent
• Uses two networks: an online network and a target network
• Trains so that the similarity between features of the same source image is large (only positive pairs are used)
  - Task setting: from the features extracted by the online network, predict the target network's features of another view
[Diagram: mini-batch → data augmentation → online network (encoder → projector (MLP) → predictor (MLP)) and target network (encoder → projector (MLP), stop-grad); features → MSE loss]
• Encoder: the network without its output layer
• Projector: a 2-layer MLP
• Predictor: a 2-layer MLP
Negative-free: BYOL [Grill+, NeurIPS 2020]
• The target parameters are updated as an exponential moving average of the online parameters
  - λ is varied according to a cosine scheduler, with 0.996 ≤ λ ≤ 1

θ_t ← λ θ_t + (1 − λ) θ_o

θ_t: target parameters; θ_o: online parameters; λ: weight applied in the parameter update
[Diagram: mini-batch → data augmentation → online network (encoder → projector (MLP) → predictor (MLP)) and target network updated by the exponential moving average (stop-grad); features → MSE loss; backprop only through the online network]
• Encoder: the network without its output layer
• Projector: a 2-layer MLP
• Predictor: a 2-layer MLP
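A minimal sketch of the target update and the cosine schedule for λ described above, assuming `online` and `target` are architecturally identical modules; function names are illustrative:

```python
import math
import torch

@torch.no_grad()
def update_target(online, target, lam):
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.data.mul_(lam).add_(po.data, alpha=1.0 - lam)   # theta_t <- lam*theta_t + (1-lam)*theta_o

def cosine_lambda(step, total_steps, base_lam=0.996):
    # lambda is ramped from base_lam toward 1 over training with a cosine schedule
    return 1.0 - (1.0 - base_lam) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```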
Negative-free: BYOL [Grill+, NeurIPS 2020]
• Accuracy comparison on ImageNet-1K: classification performance measured by linear evaluation
[Figure (Grill+): ImageNet top-1 accuracy (about 68-80%) versus number of parameters (25M-400M); BYOL and its 2x/4x-wide variants compared with supervised baselines (Sup., Sup. 2x/4x) and with SimCLR, MoCo v2, InfoMin, CPCv2-L, MoCo, CMC, AMDIM]
→ With a large number of parameters, BYOL reaches performance comparable to supervised learning
• Accuracy comparison under fine-tuning
  - Self-supervised learning: ImageNet-1K
  - Object detection: VOC
  - Semantic segmentation: VOC
Transfer results in semantic segmentation and object detection (Table 4a, Grill+):
Method | AP50 | mIoU
Supervised-IN | 74.4 | 74.4
MoCo | 74.9 | 72.5
SimCLR (repro) | 75.2 | 75.2
BYOL (ours) | 77.5 | 76.3
→ On these transfer tasks BYOL surpasses the ImageNet supervised pre-trained model
Negative-free: SimSiam [Chen+, CVPR 2021]
• SimSiam: simple Siamese networks
• Proposes an even simpler method than BYOL
  - No need for the complex components of existing methods, such as an exponential moving average or clustering
  - Can be trained with small mini-batches and few training epochs
[Diagram: mini-batch → data augmentation → encoder → projector (MLP) → predictor (MLP); features → negative cosine similarity loss; stop-grad on one branch, backprop through the other]
• Encoder: the network without its output layer
• Projector: a 3-layer MLP
• Predictor: a 2-layer MLP (bottleneck structure)
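A minimal sketch of the SimSiam objective described above, assuming `f` is the encoder + projector, `h` is the predictor, and `x1`, `x2` are the two augmented views; the loss is the symmetrized negative cosine similarity with a stop-gradient (`.detach()`) on the projector branch:

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)          # projector outputs
    p1, p2 = h(z1), h(z2)          # predictor outputs

    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # stop-grad on z

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```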
Negative-free: SimSiam [Chen+, CVPR 2021]
• SimSiam can be viewed as a common framework for contrastive learning and negative-free methods
• Adding or removing existing techniques from SimSiam expresses the different methods
  - SimCLR: add negative pairs (no predictor, no stop-grad)
  - BYOL: add an exponential-moving-average (momentum) model
  - SwAV: add clustering (Sinkhorn-Knopp), no predictor
[Figure (Chen and He): architecture comparison — SimSiam: encoder + predictor with stop-gradient on one branch, similarity loss; SimCLR: two encoders with gradients on both branches, similarity and dissimilarity (negative pairs); BYOL: online encoder + predictor versus a momentum encoder updated by a moving average, gradient only on the online branch; SwAV: two encoders with Sinkhorn-Knopp clustering]
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation on ImageNet-1K
  Network: ResNet-50
Negative-free: SimSiam [Chen+, CVPR21]
  (Table 4 of the paper: ImageNet linear classification, ResNet-50 pre-trained with two 224×224 views, single-crop evaluation; "+" denotes an improved reproduction)
  method             batch  neg. pairs  momentum enc.  100 ep  200 ep  400 ep  800 ep
  SimCLR (repro.+)   4096   yes         no             66.5    68.3    69.8    70.4
  MoCo v2 (repro.+)   256   yes         yes            67.4    69.9    71.0    72.2
  BYOL (repro.)      4096   no          yes            66.5    70.6    73.2    74.3
  SwAV (repro.+)     4096   no          no             66.5    69.1    70.7    71.8
  SimSiam             256   no          no             68.1    70.0    70.8    71.3
SimSiam achieves high performance with a small batch size and few training epochs
• Accuracy comparison on downstream tasks
  Self-supervised learning on ImageNet-1K → fine-tuning on the downstream task
  Each method is compared after 200 epochs of self-supervised pre-training
Negative-free: SimSiam [Chen+, CVPR21]
  (Table 5 of the paper: transfer learning; VOC 07 / VOC 07+12 detection with Faster R-CNN, COCO detection and instance segmentation with Mask R-CNN, C4 backbone; VOC results averaged over 5 trials)
  pre-train            VOC07 AP50/AP/AP75   VOC07+12 AP50/AP/AP75   COCO det AP50/AP/AP75   COCO seg AP50/AP/AP75
  scratch              35.9 / 16.8 / 13.0   60.2 / 33.8 / 33.1      44.0 / 26.4 / 27.8      46.9 / 29.3 / 30.8
  ImageNet supervised  74.4 / 42.4 / 42.7   81.3 / 53.5 / 58.8      58.2 / 38.2 / 41.2      54.7 / 33.3 / 35.2
  SimCLR (repro.+)     75.9 / 46.8 / 50.1   81.8 / 55.5 / 61.4      57.7 / 37.9 / 40.9      54.6 / 33.3 / 35.3
  MoCo v2 (repro.+)    77.1 / 48.5 / 52.5   82.3 / 57.0 / 63.3      58.8 / 39.2 / 42.5      55.5 / 34.3 / 36.6
  BYOL (repro.)        77.1 / 47.0 / 49.9   81.4 / 55.3 / 61.1      57.8 / 37.9 / 40.9      54.3 / 33.2 / 35.0
  SwAV (repro.+)       75.5 / 46.5 / 49.6   81.5 / 55.4 / 61.4      57.6 / 37.6 / 40.3      54.2 / 33.1 / 35.1
  SimSiam, base        75.5 / 47.0 / 50.2   82.0 / 56.4 / 62.8      57.5 / 37.9 / 40.9      54.2 / 33.2 / 35.2
  SimSiam, optimal     77.3 / 48.5 / 52.5   82.4 / 57.0 / 63.7      59.3 / 39.2 / 42.1      56.0 / 34.4 / 36.7
With a simple training procedure, SimSiam matches the performance of prior methods
ࣗݾڭࢣ͋Γֶशͷදతͳख๏
#BSMPX5XJOT
<+;CPOUBS
*$.-`>
CBUDIEJNFOTJPO
$1$
<"WE0PSE
BS9JW`>
ύονؒͰϖΞΛ
࡞ͯ͠ରরֶश
$1$W
<0+)ÉOB
ff
*$.-`>
ϖΞͷ࡞
ϞσϧߏͳͲΛվળ
ϚεΫͨ͠ྖҬͷϐΫηϧΛ༧ଌ
$POUFYU&ODPEFST
<%1BUIBL
$713`>
δάιʔύζϧ
৭ใΛ༧ଌ ճస֯Λ༧ଌ
*NBHF3PUBUJPOT
<4(JEBSJT
*$-3`>
$POUFYU1SFEJDUJPO
<$%PFSTDI
*$$7`>
ύονׂͨ͠ը૾ͷ
ύονؒͷ૬ରҐஔΛ༧ଌ
1*3-
<*.JTSBBOE-.BBUFO
$713`>
δάιʔύζϧΛಋೖ
4JN$-3
<5$IFO
*$.-`>
4JN$-3W
<5$IFO
/FVS*14`>
.P$P
<,)F
$713> .P$PW
<9$IFO
BS9JW>
4JN$-3ͷςΫχοΫΛಋೖ
ରরֶश
$PVOUJOH
<./PSPP[J
*$$7`>
֤ύονग़ྗͷͱը૾શମͷग़ྗ
͕Ұக͢ΔΑ͏ʹֶश
+JHTBX
<./PSPP[J
$713`>
ͭͷը૾ͷύζϧΛϛοΫε
&NCFEEJOH-FBSOJOH
<.:F
$713`>
6OTVQFSWJTFE
8PSEWFD
<5.JLPMPW
BS9JW>
#&35
<+%FWMJO
/""$->
ࣗવݴޠॲཧ
1$-
<+-J
*$-3`>
ϓϩτλΠϓΛಋೖ
MPDBM͔ΒMPDBM༧ଌ
&T7J5
<$-J
*$-3`>
.BTLFE*NBHF.PEFMJOH .*.
$//Λରͱͨࣗ͠ݾڭࢣ͋Γֶश
7J5Λରͱͨࣗ͠ݾڭࢣ͋Γֶश
1SFUFYUλεΫͷվળ
ྡ͢Δ୯ޠͷ༧ଌ
Λը૾Ԡ༻
%*/0
<.$BSPO
*$$7>
σʔλ૿෯ͱෳͷը૾
Λ༻͍ͨରরֶशΛఏҊ
ը૾ʹΫϥεͱֶͯ͠श
աڈͷग़ྗΛ
ωΨςΟϒϖΞͱͯ͠׆༻
.BTLFE-BOHVBHF.PEFMJOH
.-.
Λը૾Ԡ༻
େنωοτϫʔΫͷಋೖ
+JHTBX
<./PSPP[JBOE1'BWBSP
&$$7`>
$PMPSJ[BUJPO
<3;IBOH
&$$7`>
*OTUBODF%JTDSJNJOBUJPO
<;8V
$713`>
MPDBM͔ΒHMPCBM
HMPCBM͔ΒHMPCBMΛ༧ଌ
ಛྔʹϚεΫΩϯά
ϚεΫྖҬͷಛྔΛ༧ଌ
#&J5
<)#BP
*$-3`>
J#05
<+;IPV
*$-3`>
ϚεΫྖҬͷըૉΛ༧ଌ
."&
<,)F
$713`>
4JN.*.
<;9JF
$713`>
4QPU"SUJGBDUT
<4+FOOJBOE1'BWBSP
$713`>
.$5
<9:VBO
$713`>
ϚϧνϞʔμϧ֦ு
ʢը૾ʴςΩετʣ
.P$P#:0-
.P#:
<;9JF
BS9JW`>
.P$PW
<9$IFO
*$$7>
7J5Ͱͷ༗ޮੑΛධՁ
7P-5"
<41SBNBOJDL
BS9JW`>
MPDBMGFBUVSF"MJHONFOU
ϚϧνϞʔμϧʢը૾ʴςΩετʣ
γϯϓϧͳରরֶशΛఏҊ
$-*1
<"3BEGPSE
*$.-`>
;FSP4IPU5SBOTGFS
ϚϧνϞʔμϧʢը૾ʴςΩετʣ
7J5ͷͨΊͷֶशํ๏
ੳ
6OEFSTUBOEJOHUIF#FIBWJPVS
<'8BOHBOE)-JV
$713`>
ଛࣦઃܭֶशޮՌʹ͍ͭͯੳ
)PX8FMM%P4FMG4VQFSWJTFE.PEFMT5SBOTGFS
<-&SJDTTPO
$713`>
༷ʑͳઃఆͷసҠੑΛධՁ
8IFO%PFT$POUSBTUJWF7JTVBM3FQSFTFOUBUJPO-FBSOJOH8PSL
<&$PMF
$713`>
σʔληοτͱͷؔੑʹ͍ͭͯੳ
*OGP.JO
<:5JBO
/FVS*14`>
ϙδςΟϒϖΞͷΈ߹Θͤʹ͍ͭͯੳ
#:0-
<+(SJMM
/FVS*14>
ϙδςΟϒϖΞͷΈͰֶश
4JN4JBN
<9$IFO
$713>
ΑΓγϯϓϧͳֶशΛఏҊ
ੳ
#:0-XPSLTFWFOXJUIPVUCBUDITUBUJTUJDT
<13JDIFNPOE
BS9JW`>
όονਖ਼نԽͷ౷ܭใ͕҉తͳωΨςΟϒϖΞͳͷͰʁ
ˠਖ਼نԽʹΑΔֶशͷ҆ఆԽ͕ॏཁ
4X"7
<.$BSPO
/FVS*14`>
ϙδςΟϒϖΞͷଐ͢Δ
ΫϥελΛਪఆ
/FHBUJWFGSFF
• Analyzes transferability to five task types and the correlation between ImageNet accuracy and downstream performance
  Many-shot recognition
  Few-shot recognition
  Object detection
  Semantic segmentation
  Surface-normal estimation
  (each task type evaluated on several datasets)
Transferability of self-supervised learning: How Well Do Self-Supervised Models Transfer? [Ericsson+, CVPR21]
Which downstream tasks a model is good at differs between self-supervised learning methods
• DINO: self-distillation with no labels
• The student network is trained so that its output approaches the teacher network's output
• Probabilities are computed by applying a temperature-scaled softmax to the features
Training method for ViT: DINO [Caron+, ICCV21]
  Encoder  : CNN without its output layer / ViT without the MLP head
  Projector: 3-layer MLP (bottleneck structure)
(Figure: data augmentation produces local crops (small regions, student only) and global crops (wide regions, both networks); the student ViT+MLP is trained by backprop and the teacher ViT+MLP is an EMA of the student; the teacher output is centered, both outputs pass through a sharpened softmax, the loss is computed between the two probability distributions, and stop-gradient is applied to the teacher)
• sharpening: adjusts the distribution so that a single feature dimension is emphasized
• centering : adjusts the teacher output so that the same dimension is not emphasized for every image (prevents collapse)
• Student and teacher distributions, and the centering update
Training method for ViT: DINO [Caron+, ICCV21]
  Student network: P_s(x)^(i) = exp(g_{θ_s}(x)^(i) / τ_s) / Σ_{k=1}^{K} exp(g_{θ_s}(x)^(k) / τ_s)
  Teacher network: P_t(x)^(i) = exp((g_{θ_t}(x)^(i) − c) / τ_t) / Σ_{k=1}^{K} exp((g_{θ_t}(x)^(k) − c) / τ_t)
  Centering update: c ← m·c + (1 − m) · (1/B) · Σ_{i=1}^{B} g_{θ_t}(x_i)
    τ_s, τ_t: temperature parameters (sharpening)
    c: center   m: momentum hyperparameter   B: batch size
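The following is a minimal PyTorch sketch of the loss implied by the formulas above: sharpened student softmax, centered teacher softmax with stop-gradient, and an EMA update of the center. The temperatures and the momentum value are illustrative, not the exact settings of the paper.

# Minimal DINO-style loss sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

def dino_loss(student_out: torch.Tensor,      # (B, K) student projector outputs
              teacher_out: torch.Tensor,      # (B, K) teacher projector outputs
              center: torch.Tensor,           # (K,) running center c
              tau_s: float = 0.1,
              tau_t: float = 0.04,
              m: float = 0.9):
    p_s = F.log_softmax(student_out / tau_s, dim=-1)                  # sharpening (student)
    p_t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # centering + stop-grad (teacher)
    loss = -(p_t * p_s).sum(dim=-1).mean()                            # cross-entropy H(P_t, P_s)
    with torch.no_grad():                                             # c <- m*c + (1-m)*mean(teacher outputs)
        new_center = m * center + (1.0 - m) * teacher_out.mean(dim=0)
    return loss, new_center

center = torch.zeros(256)
loss, center = dino_loss(torch.randn(16, 256), torch.randn(16, 256), center)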
• The teacher's parameters are updated as an exponential moving average of the student's parameters
  λ is varied with a cosine scheduler, 0.996 ≤ λ ≤ 1
Training method for ViT: DINO [Caron+, ICCV21]
  θ_t ← λ·θ_t + (1 − λ)·θ_s
    θ_t: teacher parameters
    θ_s: student parameters
    λ  : weight on the current parameters in the update
(Figure: same pipeline as the previous slide — student ViT+MLP updated by backprop, teacher ViT+MLP updated by EMA, sharpened softmax on both branches, centering and stop-gradient on the teacher)
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation and k-NN evaluation on ImageNet-1K
Training method for ViT: DINO [Caron+, ICCV21]
  (Table 2 of the paper: top-1 accuracy for linear and k-NN evaluation on the ImageNet validation set; throughput measured on a V100 GPU; * = run by the DINO authors)
  Method      Arch.       Param.  im/s  Linear  k-NN
  Supervised  RN50        23      1237  79.3    79.3
  SCLR        RN50        23      1237  69.1    60.7
  MoCov2      RN50        23      1237  71.1    61.9
  InfoMin     RN50        23      1237  73.0    65.3
  BarlowT     RN50        23      1237  73.2    66.0
  OBoW        RN50        23      1237  73.8    61.9
  BYOL        RN50        23      1237  74.4    64.8
  DCv2        RN50        23      1237  75.2    67.1
  SwAV        RN50        23      1237  75.3    65.7
  DINO        RN50        23      1237  75.3    67.5
  Supervised  ViT-S       21      1007  79.8    79.8
  BYOL*       ViT-S       21      1007  71.4    66.6
  MoCov2*     ViT-S       21      1007  72.7    64.4
  SwAV*       ViT-S       21      1007  73.5    66.3
  DINO        ViT-S       21      1007  77.0    74.5
  Comparison across architectures:
  SCLR        RN50w4      375      117  76.8    69.3
  SwAV        RN50w2       93      384  77.3    67.3
  BYOL        RN50w2       93      384  77.4    –
  DINO        ViT-B/16     85      312  78.2    76.1
  SwAV        RN50w5      586       76  78.5    67.1
  BYOL        RN50w4      375      117  78.6    –
  BYOL        RN200w2     250      123  79.6    73.9
  DINO        ViT-S/8      21      180  79.7    78.3
  SCLRv2      RN152w3+SK  794       46  79.8    73.1
  DINO        ViT-B/8      85       63  80.1    77.4
ResNet-50: performance comparable to prior methods
Vision Transformer (ViT): performance exceeding prior methods
• Accuracy comparison on downstream tasks
  Sup. : supervised learning on ImageNet-1K → fine-tuning on the downstream task
  DINO: self-supervised learning on ImageNet-1K → fine-tuning on the downstream task
Training method for ViT: DINO [Caron+, ICCV21]
  (Table 6 of the paper: transfer learning by fine-tuning, top-1 accuracy)
  Backbone   Method  Cifar10  Cifar100  INat18  INat19  Flwrs  Cars  INet
  ViT-S/16   Sup.    99.0     89.5      70.7    76.6    98.2   92.1  79.9
  ViT-S/16   DINO    99.0     90.5      72.0    78.2    98.5   93.0  81.5
  ViT-B/16   Sup.    99.0     90.8      73.2    77.7    98.4   92.1  81.8
  ViT-B/16   DINO    99.1     91.7      72.6    78.6    98.8   93.0  82.8
Self-supervised pre-training with DINO transfers better than the supervised pre-trained model
• Visualizes the attention weights for the [CLS] token of a ViT trained with DINO
  The head of the last multi-head self-attention layer that attends most to the foreground is visualized
• Visualization after thresholding the attention weights
Training method for ViT: DINO [Caron+, ICCV21]
(Figure 1 of the paper: self-attention of the [CLS] token of a ViT with 8×8 patches trained without supervision; the maps show that the model learns class-specific features leading to unsupervised object segmentation)
Accurate object regions are obtained without any label information
(Figure 4 of the paper: segmentation masks obtained by thresholding the self-attention maps to keep 60% of the mass, supervised vs. DINO; Jaccard similarity with the ground truth on PASCAL VOC12 validation images)
  ViT-S/16: Random 22.0, Supervised 27.3, DINO 45.9
Compared with supervised training, DINO's attention map concentrates on the object region
• CLIP: Contrastive Language-Image Pre-training
• Proposes multimodal self-supervised learning using images and text
  The image-text pairs provided by the dataset are used as positive pairs for contrastive learning
• Applicable to zero-shot image classification
  The similarity between the image feature and the feature of the prompt template "A photo of a {class name}" is used as the class score
Multimodal: CLIP [Radford+, ICML21]
(Figure 1 of the paper: (1) contrastive pre-training — an image encoder and a text encoder embed the N images and N texts of a batch, and the N×N matrix of similarities I_i·T_j is trained so that the diagonal (matching) pairs score highest; (2) a dataset classifier is created from label text with the template "A photo of a {object}."; (3) zero-shot prediction picks the class whose text embedding is most similar to the image embedding)
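A minimal sketch of the zero-shot classification procedure described above; `image_encoder`, `text_encoder` and `tokenizer` are hypothetical stand-ins for a pretrained CLIP-style model and are not part of the original material.

# Minimal zero-shot classification sketch with a CLIP-style model, under the assumptions above.
import torch
import torch.nn.functional as F

def zero_shot_classify(image: torch.Tensor, class_names, image_encoder, text_encoder, tokenizer):
    # Build one prompt per class from the template "A photo of a {class name}."
    prompts = [f"A photo of a {name}." for name in class_names]
    with torch.no_grad():
        img_feat = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, D)
        txt_feat = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)    # (C, D)
    scores = img_feat @ txt_feat.t()                                        # cosine similarities = class scores
    return class_names[scores.argmax(dim=-1).item()], scores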
• Self-supervised training on the WebImageText (WIT) dataset built by the paper's authors
  Consists of 400 million image-text pairs collected from the internet
• Zero-shot transfer accuracy compared with supervised models
  Network: ResNet
Multimodal: CLIP [Radford+, ICML21]
Zero-shot CLIP improves accuracy on a majority of the evaluated datasets
Accuracy drops substantially on more complex tasks such as satellite-image or traffic-sign classification
• Transferability to downstream tasks (linear evaluation) compared with supervised learning and prior self-supervised methods
Multimodal: CLIP [Radford+, ICML21]
CLIP pre-training achieves high performance regardless of whether the backbone is a ResNet or a Vision Transformer (ViT)
• VoLTA: Vision-Language Transformer with weakly-supervised local-feature Alignment
• Proposes multimodal self-supervised learning using images and text (captions)
  Learns detailed relations in an image from captions only, without using bounding boxes
  Introduces cross-attention with a gating mechanism to keep the parameter count low
• Trained with multiple self-supervised tasks
Multimodal: VoLTA [Pramanick+, arXiv22]
• Cross-attention with a gating mechanism is inserted after the self-attention of each modality's model
  A learnable gating scalar α is introduced
  Setting α = 0 switches the cross-attention mechanism off
  The gate is switched on or off depending on the self-supervised task
• No additional layers are needed for cross-modal fusion, which keeps the parameter count low
Multimodal: VoLTA [Pramanick+, arXiv22]
  Processing in the image model (x: image patch features, y: caption token features):
    x̂ = SelfAtt(x)
    x = x + x̂ + α · CrossAtt(x̂, y)
    x = x + FFN(x)
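A minimal PyTorch sketch of the gated cross-attention block defined by the equations above; the dimensions, the use of nn.MultiheadAttention, and the zero initialization of α are illustrative assumptions, not the paper's exact implementation.

# Minimal gated cross-attention block sketch, under the assumptions stated above.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn   = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))     # gating scalar; alpha = 0 turns cross-attention off

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: image patch tokens (B, N, D), y: caption tokens (B, M, D)
        x_hat, _ = self.self_attn(x, x, x)            # x_hat = SelfAtt(x)
        cross, _ = self.cross_attn(x_hat, y, y)       # CrossAtt(x_hat, y)
        x = x + x_hat + self.alpha * cross            # gated residual fusion
        return x + self.ffn(x)                        # x = x + FFN(x)

block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 49, 256), torch.randn(2, 12, 256))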
• Each encoder is trained with a three-step procedure
  Step 1: with cross-attention off, feed the image and caption and compute the losses L_BT^{IT'}, L_BT^{II'}, L_BT^{I'T}, L_BT^{TT'} and L_GOT
  Step 2: with cross-attention on, feed the image and caption and compute the losses L_MLM and L_ITM
  Step 3: sum all losses and backpropagate
Multimodal: VoLTA [Pramanick+, arXiv22]
Intra- and inter-modal contrastive learning (Barlow Twins)
  Contrastive learning based on Barlow Twins trains each dimension of the feature vectors to be an independent feature, reducing redundancy between dimensions
Multimodal: VoLTA [Pramanick+, arXiv22]
  L_BT^{AB} = Σ_i (1 − C_ii)² + λ · Σ_i Σ_{j≠i} (C_ij)²
  C_ij = Σ_b z^A_{b,i} · z^B_{b,j} / ( sqrt(Σ_b (z^A_{b,i})²) · sqrt(Σ_b (z^B_{b,j})²) )
    i, j    : indices of the feature-vector dimensions
    b       : index within the mini-batch
    z^A, z^B: feature vectors of the two branches
    λ       : positive weighting factor
  (C is the cross-correlation matrix between the two sets of features: its diagonal is pulled toward 1 and the off-diagonal entries toward 0)
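A minimal PyTorch sketch of the Barlow Twins loss above, applied to two batches of embeddings; the weighting factor value and the batch-normalization of each dimension are illustrative.

# Minimal Barlow Twins loss sketch, under the assumptions stated above.
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambda_bt: float = 5e-3) -> torch.Tensor:
    B, D = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)     # normalize each dimension over the batch
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.t() @ z_b) / B                             # cross-correlation matrix C (D x D)
    on_diag  = (torch.diagonal(c) - 1.0).pow(2).sum()                      # sum_i (1 - C_ii)^2
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()      # sum_{i != j} C_ij^2
    return on_diag + lambda_bt * off_diag

loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))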
Graph Optimal Transport (Wasserstein / Gromov-Wasserstein distance)
  Relations between image patches and between caption tokens are represented as graphs
  Nodes are the patch / token feature vectors; edges are the similarities between feature vectors
  Local features are aligned across modalities by optimal transport of the nodes and of the edges
Multimodal: VoLTA [Pramanick+, arXiv22]
  D_W(ϕ, ψ)  = min_{T ∈ Π(u,v)} Σ_i Σ_j T_ij · c(x_i, y_j)
  D_GW(ϕ, ψ) = min_{T̂ ∈ Π(u,v)} Σ_{i,i',j,j'} T̂_ij · T̂_{i'j'} · ‖ c1(x_i, x_{i'}) − c2(y_j, y_{j'}) ‖
  L_GOT(ϕ, ψ) = γ · D_W(ϕ, ψ) + (1 − γ) · D_GW(ϕ, ψ)
    x_i, x_{i'}: image patch features       y_j, y_{j'}: caption token features
    T, T̂       : transport plans            c(·,·), c1(·,·), c2(·,·): cosine-similarity-based costs
    γ          : weight balancing the two losses
Masked Language Modeling
  Masking is applied to a portion of the caption tokens
  The masked tokens are predicted from the caption token features by an MLM head
Image-Text Matching
  The caption paired with an image is randomly replaced
  An ITM head predicts from the features of both modalities whether the image and caption are a correct pair
Multimodal: VoLTA [Pramanick+, arXiv22]
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on the COCO dataset
  Linear evaluation of the image encoder with cross-attention switched off
  w/o CMAF: trained with cross-attention off for all self-supervised tasks
Multimodal: VoLTA [Pramanick+, arXiv22]
(Table 1 of the paper: uni-modal downstream image classification — linear probing top-1 on ImageNet, mAP on VOC07 and F1 on COCO; VoLTA pre-trained on COCO with captions only is compared against supervised and self-supervised baselines, e.g. VoLTA (w/o CMAF) RN50 55.3, Swin-T 56.3, Swin-B 62.5 top-1 on ImageNet)
ResNet-50: achieves high accuracy using captions only
Swin Transformer: high accuracy also with ViT-style models
• Visualizes the matching between caption tokens and image patches obtained by Graph Optimal Transport
  The image patches matched to a highlighted caption token are shown in color
• Uses a model self-supervised on the COCO dataset
Multimodal: VoLTA [Pramanick+, arXiv22]
Correspondences between the image and the caption are acquired from captions alone
Representative Masked Image Modeling (MIM) methods
  Natural language processing — Masked Language Modeling (MLM): BERT [Devlin+, NAACL19]
  Application to images — Masked Image Modeling (MIM): BEiT [Bao+, ICLR22], iBOT [Zhou+, ICLR22]
  Image reconstruction (predict the pixels of masked regions): MAE [He+, CVPR22], SimMIM [Xie+, CVPR22]
  Tokenizer-free (predict HOG features): Masked Feature Prediction [Wei+, CVPR22]
  Improving the mask:
    multi-fold masking strategy — SdAE [Chen+, ECCV22]
    mask creation based on attention weights — Attention-Guided MIM [Kakogeorgiou+, ECCV22]
    multi-block masking strategy — I-JEPA [Assran+, arXiv23]
  Architecture improvements (multi-scale features): MCMAE [Gao+, NeurIPS22]
  Introducing contrastive learning:
    multi-task SimMIM + contrastive learning — SiT [Atito+, arXiv21]
    multi-task MAE + contrastive learning — CMAE [Mao+, arXiv22]
    CLIP with masked images — FLIP [Li+, arXiv22]
    Negative-free with masked images — MSN [Assran+, ECCV22]
  Application to other modalities:
    audio — MAE that Listen [Huang+, NeurIPS22]
    video — MAE As Spatiotemporal Learners [Feichtenhofer+, NeurIPS22]
    multimodal — MultiMAE [Bachmann+, ECCV22]
• BERT: Bidirectional Encoder Representations from Transformers
• A bidirectional Transformer is trained in two steps: pre-training and fine-tuning
• Pre-training consists of two tasks
  Masked Language Modeling: 15% of the tokens are masked and the words at the masked positions are predicted
  Next Sentence Prediction: predict whether Sentence B is the continuation of Sentence A
Masked Language Modeling: BERT [Devlin+, NAACL19]
(Figure 1 of the paper: the same BERT architecture is used for pre-training on unlabeled sentence pairs with the Mask LM and NSP objectives, and for fine-tuning on downstream tasks such as SQuAD question answering, NER and MNLI, with task-specific inputs and output heads)
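A minimal sketch of BERT-style MLM input corruption (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged); the vocabulary size and special-token id are illustrative assumptions.

# Minimal BERT-style MLM masking sketch, under the assumptions stated above.
import torch

def mlm_mask(tokens: torch.Tensor, vocab_size: int = 30522, mask_id: int = 103,
             mask_prob: float = 0.15):
    labels = tokens.clone()
    selected = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    labels[~selected] = -100                       # only masked positions contribute to the loss
    corrupted = tokens.clone()
    r = torch.rand_like(tokens, dtype=torch.float)
    corrupted[selected & (r < 0.8)] = mask_id                    # 80% of selected -> [MASK]
    random_ids = torch.randint(0, vocab_size, tokens.shape)
    swap = selected & (r >= 0.8) & (r < 0.9)                     # 10% of selected -> random token
    corrupted[swap] = random_ids[swap]
    return corrupted, labels                       # remaining 10% of selected tokens stay unchanged

inp, labels = mlm_mask(torch.randint(0, 30522, (2, 16)))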
• BEiT: Bidirectional Encoder representation from Image Transformers
• Predicts the visual tokens (features) of the masked patches
• Uses the output of a tokenizer trained with an autoencoder structure as the ground-truth targets
  Tokenizer: the pre-trained discrete variational autoencoder of DALL-E [Ramesh+, ICML21]
Masked Image Modeling: BEiT [Bao+, ICLR22]
(Figure of the paper: the original image is split into patches and, in parallel, converted into visual tokens by the tokenizer; blockwise masking replaces some patch embeddings with a mask token, the BEiT encoder processes the sequence with position embeddings, and the masked-image-modeling head predicts the visual tokens of the masked positions; the tokenizer's decoder is unused during pre-training)
• iBOT: image BERT pre-Training with Online Tokenizer
• Predicts the features of masked patches and the class token of a different, unmasked view
  Trained with two losses: the patch-feature prediction L_MIM and the cross-view class-token prediction L_[CLS]
• Uses the output of an exponential-moving-average model (online tokenizer) as the ground-truth targets
Masked Image Modeling: iBOT [Zhou+, ICLR22]
(Figure of the paper: two augmented views are encoded by a student network and an EMA teacher (the online tokenizer); the student's masked patch tokens are matched to the teacher's patch tokens (L_MIM) and the student's [CLS] token is matched to the teacher's [CLS] token of the other view (L_[CLS]), with stop-gradient on the teacher)
• MAE: Masked Autoencoder
• Predicts the pixels of the masked patches
  Encoder: a ViT that takes only the unmasked patches as input
  Decoder: a small ViT that reconstructs the image from the patch tokens and mask tokens
Image reconstruction: MAE [He+, CVPR22]
  Loss: MSE, computed only on the outputs corresponding to the mask tokens
(Figure: input → encoder with position embeddings on the visible patches → decoder with position embeddings on the encoded patch tokens plus mask tokens → reconstruction; after self-supervised training only the encoder is kept)
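A minimal PyTorch sketch of the MAE recipe above (random 75% masking, encoding only the visible patches, decoding with mask tokens, MSE on the masked patches only); the tiny MLP encoder/decoder stand in for the ViT encoder and the small ViT decoder, and position embeddings and unshuffling are omitted for brevity.

# Minimal MAE-style masking and reconstruction-loss sketch, under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, patch_dim, mask_ratio = 128, 16 * 16 * 3, 0.75
encoder    = nn.Sequential(nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim))
decoder    = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim))
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

def mae_loss(patches: torch.Tensor) -> torch.Tensor:      # patches: (B, N, patch_dim)
    B, N, _ = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)                  # random patch permutation per image
    keep_idx, mask_idx = idx[:, :n_keep], idx[:, n_keep:]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    enc = encoder(visible)                                 # encode the visible patches only
    dec_in = torch.cat([enc, mask_token.expand(B, N - n_keep, dim)], dim=1)
    recon = decoder(dec_in)[:, n_keep:]                    # predictions at the mask-token positions
    target = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    return F.mse_loss(recon, target)                       # MSE only on the masked patches

loss = mae_loss(torch.randn(4, 196, patch_dim))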
• Reconstruction results on the ImageNet-1K validation data
  The whole image can be reconstructed from the unmasked patches
  (Figure: triplets of input image / reconstruction / original image)
• Accuracy comparison on ImageNet-1K (baseline MAE: MAE pre-training → fine-tuning)
Image reconstruction: MAE [He+, CVPR22]
  ViT-L/16 top-1: scratch (original recipe) 76.5, scratch (improved recipe) 82.5, baseline MAE 84.9
  (Figure 5 of the paper: a high masking ratio of 75% works well for both fine-tuning and linear probing)
• SimMIM: a Simple Framework for Masked Image Modeling
• Both the masked and the unmasked patches are fed into the encoder
• A single linear layer is used as the decoder
• Experimentally analyzes how the masking strategy and masking ratio affect accuracy
  AvgDist: the average Euclidean distance from each masked pixel to its nearest unmasked pixel
Image reconstruction: SimMIM [Xie+, CVPR22]
• Predicts the features of the masked patches
• Uses HOG features as the ground-truth targets
Tokenizer-free: Masked Feature Prediction [Wei+, CVPR22]
(Figure 2 of the paper: input space-time cubes are randomly replaced with a [MASK] token and a Transformer with a linear head directly regresses the HOG features of the masked regions; Tables 1 and 2 compare target features such as pixels, HOG, dVAE tokens, and unsupervised / supervised teacher features)
Predicting HOG features reaches accuracy comparable to predicting the features of a pre-trained model
• Masks are created based on the teacher's attention weights for the class token
  AttMask-High: masks the regions with high attention weight
  AttMask-Hint: masks high-attention regions while leaving part of them visible as a hint
  AttMask-Low : masks the regions with low attention weight
• The teacher is an exponential-moving-average model of the student
Improving the mask: Attention-Guided MIM [Kakogeorgiou+, ECCV22]
(Figure 1 of the paper: compared with random and block-wise masking, AttMask uses the encoder's attention map to mask the most highly attended regions)
• The attention-guided masking strategies are evaluated by plugging them into iBOT
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → k-NN evaluation, linear evaluation and fine-tuning
Improving the mask: Attention-Guided MIM [Kakogeorgiou+, ECCV22]
  (Table 1 of the paper: iBOT pre-trained on 20% of ImageNet; top-1 accuracy for k-NN and linear probing on ImageNet, fine-tuning on CIFAR10/100)
  Masking strategy     Ratio (%)  k-NN  Linear  CIFAR10  CIFAR100
  Random block-wise    10-50      46.7  56.4    98.0     86.0
  Random               75         47.3  55.5    97.7     85.5
  Random               10-50      47.8  56.7    98.0     86.1
  AttMask-Low          10-50      44.0  53.4    97.6     84.6
  AttMask-Hint         10-50      49.5  57.5    98.1     86.6
  AttMask-High         10-50      49.7  57.9    98.2     86.6
Masking the regions with high attention weight improves accuracy
• SdAE: Self-distillated Masked Autoencoder
• Introduces masking into the creation of the target features
  Student: predicts the features of the masked patches with an MAE-style encoder-decoder
  Teacher: masks a different set of patches (multi-fold masking) and extracts their features to create the targets
• The teacher is an exponential-moving-average model of the student
Improving the mask: SdAE [Chen+, ECCV22]
(Figure: multi-fold mask → the EMA teacher encoder produces normalized target features; the student encoder + decoder predicts the features, trained with a cosine-similarity loss on the selected features)
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → fine-tuning and linear evaluation on ImageNet-1K
Improving the mask: SdAE [Chen+, ECCV22]
  (Table 2 of the paper: ViT-B, top-1 accuracy; MoCo v3 and DINO use multi-crop augmentation)
  Method              Epochs  Fine-tune  Linear
  Train from scratch  300     81.8       –
  MoCo v3             300     83.2       76.2
  DINO                400     83.3       77.3
  BEiT                300     83.0       49.4
  MAE                 100     82.1       54.8
  MAE                 300     82.9       61.5
  MAE                 1600    83.6       67.8
  CAE                 300     83.3       64.2
  SdAE                100     83.5       60.3
  SdAE                300     84.1       64.9
Fine-tuning accuracy improves compared with contrastive learning and previous MIM methods
• I-JEPA: Image-based Joint-Embedding Predictive Architecture
• Splits the image into target regions and predicts the patch features of each target region
  target : several target blocks are sampled within fixed scale and aspect-ratio ranges
  context: a context block is sampled within a fixed scale range, and regions overlapping the targets are removed
• The target encoder is an exponential-moving-average model of the context encoder
Improving the mask: I-JEPA [Assran+, arXiv23]
(Figure: original image → context block and target blocks; the context encoder and predictor predict the target-encoder features of each target block with an L2 loss)
• Feature extraction
  target : the unmasked input image is fed to the target encoder
  context: only the patches of the context block are fed to the context encoder
• Prediction of the target blocks
  For each target block, the predictor predicts the patch features from the context patch tokens and mask tokens
  A small ViT is used as the predictor
Improving the mask: I-JEPA [Assran+, arXiv23]
(Figure: same architecture as the previous slide — context encoder, predictor, and EMA target encoder with an L2 loss)
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation on ImageNet-1K
Improving the mask: I-JEPA [Assran+, arXiv23]
  (Table 1 of the paper: linear evaluation on ImageNet-1K, top-1 accuracy)
  Methods without view data augmentations:
    data2vec ViT-L/16 (1600 ep) 53.5; MAE ViT-B/16 68.0, ViT-L/16 76.0, ViT-H/14 77.2 (1600 ep)
    I-JEPA ViT-B/16 (600 ep) 72.9, ViT-L/16 (600 ep) 77.5, ViT-H/14 (300 ep) 79.3, ViT-H/16 at 448 resolution (300 ep) 81.1
  Methods using extra view data augmentations:
    SimCLR v2 RN152 (2×) 79.1; DINO ViT-B/8 80.1; iBOT ViT-L/16 81.0
  (Table 2 of the paper: semi-supervised evaluation with 1% of the ImageNet labels; I-JEPA outperforms MAE and benefits from scale)
  (Figure: ImageNet linear evaluation vs. pre-training GPU hours; I-JEPA reaches higher accuracy than MAE with fewer GPU hours)
Accuracy improves compared with previous MIM, contrastive, and negative-free methods, and high accuracy is reached with less training time than MAE
• SiT: Self-supervised vIsion Transformer
• Adds pair prediction (contrastive learning) to SimMIM-style pixel prediction of masked regions
  Random noise is used as the mask
  A contrastive token is added to the ViT and used for the contrastive objective
Introducing contrastive learning: SiT [Atito+, arXiv21]
(Figure 1 of the paper: pixel-corrupted image → linear projection of flattened patches with position embeddings → Vision Transformer; the data tokens are projected back to image space to reconstruct the image, and a contrastive head operates on the contrastive embedding)
• Prediction of the masked regions
  Patches are randomly replaced with noise and fed to the ViT
  The patch tokens output by the ViT are fed to a decoder (a small MLP) that reconstructs the pixels of each patch
  The L1 loss between the original image and the reconstructed image is used as the loss function
Introducing contrastive learning: SiT [Atito+, arXiv21]
(Figure: reconstruction branch of the SiT pipeline — corrupted patches in, projection back to image space out)
• Pair prediction
  Two views forming a positive pair are created from one image by data augmentation
  View 1: patches are randomly replaced with noise and fed to the ViT
  View 2: fed without masking to an exponential-moving-average model of the ViT
  Contrastive learning uses the outputs at the contrastive token
  The normalized temperature-scaled cross-entropy loss is used as the loss function
Introducing contrastive learning: SiT [Atito+, arXiv21]
(Figure: contrastive branch of the SiT pipeline — contrastive head applied to the contrastive embedding of both views)
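A minimal PyTorch sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss used for the contrastive-token outputs above: for each sample, the matching embedding from the other view is the positive and all other embeddings in the batch act as negatives. The temperature value is illustrative.

# Minimal NT-Xent loss sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)            # (2B, D) normalized embeddings
    sim = z @ z.t() / tau                                          # cosine similarity / temperature
    eye = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))                      # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])    # index of the positive for each row
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))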
• Accuracy comparison on downstream tasks
  Self-supervised learning on ImageNet-1K → fine-tuning on the downstream task
Introducing contrastive learning: SiT [Atito+, arXiv21]
  (Table 3 of the paper: domain transfer of SiT pre-trained on ImageNet-1K, ViT-S/16, top-1 accuracy)
  Method        Flowers  Pets  CUB   Aircraft  STL10  Cars  CIFAR10  CIFAR100  ImageNet-1K
  Random init.  68.8     47.5  25.3  31.1      77.1   27.4  96.9     77.8      –
  Supervised    98.1     91.1  82.7  80.8      98.2   91.7  98.3     86.9      79.9
  MoCo-v3       97.7     92.3  82.6  87.3      98.0   93.0  98.2     86.6      81.4
  DINO*         97.8     89.4  80.8  83.8      96.7   93.1  98.6     87.1      81.5
  SiT           98.2     92.6  84.6  87.6      98.8   93.2  99.0     90.8      82.0
Accuracy improves over supervised learning and single-task self-supervised methods
• CMAE: Contrastive Masked Autoencoders
• Adds pair prediction (contrastive learning) to MAE
  A feature decoder that predicts the features of each patch is added
  A masked image and a pixel-shifted view of the same source image form the positive pair for contrastive learning
Introducing contrastive learning: CMAE [Mao+, arXiv22]
(Figure: input → masked image into the online encoder, pixel-shifted view into the target encoder; a pixel decoder computes the reconstruction loss and a feature decoder with a projection head computes the contrastive loss)
• Online encoder
  Applies patch-level masking and takes only the unmasked patches as input
• Target encoder / projection head
  Takes all patches as input, without masking
Introducing contrastive learning: CMAE [Mao+, arXiv22]
• Pixel decoder
  Takes the online encoder's patch tokens and mask tokens and outputs the pixels of each patch
• Feature decoder
  Takes the online encoder's patch tokens and mask tokens and outputs the features of each patch
Introducing contrastive learning: CMAE [Mao+, arXiv22]
• Reconstruction loss
  MSE loss against the corresponding input patches, computed only on the outputs at the mask tokens
• Contrastive loss
  InfoNCE loss with the masked image and the pixel-shifted view of the same source image as the positive pair
  The average of the patch tokens is used as the feature of the whole image
Introducing contrastive learning: CMAE [Mao+, arXiv22]
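A minimal PyTorch sketch of the contrastive branch described above: patch tokens are mean-pooled into an image-level feature, and an InfoNCE loss treats the masked view and the pixel-shifted view of the same source image as the positive pair, with the other images in the batch as negatives. The temperature is illustrative.

# Minimal InfoNCE sketch for mean-pooled patch tokens, under the assumptions stated above.
import torch
import torch.nn.functional as F

def contrastive_infonce(online_tokens: torch.Tensor,   # (B, N_vis, D) tokens of the masked view
                        target_tokens: torch.Tensor,   # (B, N, D) tokens of the pixel-shifted view
                        tau: float = 0.07) -> torch.Tensor:
    q = F.normalize(online_tokens.mean(dim=1), dim=-1)   # image-level feature = mean of patch tokens
    k = F.normalize(target_tokens.mean(dim=1), dim=-1)
    logits = q @ k.t() / tau                              # (B, B) similarity matrix
    labels = torch.arange(q.size(0))                      # diagonal entries are the positive pairs
    return F.cross_entropy(logits, labels)

loss = contrastive_infonce(torch.randn(8, 49, 256), torch.randn(8, 196, 256))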
• Parameter updates
  Online encoder, decoders, projection head: updated by gradient descent on the losses
  Target encoder: updated as an exponential moving average of the online encoder's parameters
Introducing contrastive learning: CMAE [Mao+, arXiv22]
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → fine-tuning on ImageNet-1K
Introducing contrastive learning: CMAE [Mao+, arXiv22]
  (Table 2 of the paper: ViT-B, top-1 accuracy; * = ConvMAE-style backbone)
  Method    Pre-training epochs  Supervision          Accuracy
  MoCo-v3   300                  RGB                  83.2
  DINO      300                  RGB                  82.8
  CIM       300                  RGB                  83.3
  BEiT      800                  DALLE                83.2
  SimMIM    800                  RGB                  83.8
  PeCo      800                  Perceptual codebook  84.5
  MaskFeat  1600                 HOG                  84.0
  CAE       1600                 DALLE+RGB            83.9
  iBOT      1600                 RGB                  84.0
  SIM       1600                 RGB                  83.8
  MAE       1600                 RGB                  83.6
  CMAE      800 / 1600           RGB                  84.4 / 84.7
  ConvMAE*  800 / 1600           RGB                  84.6 / 84.6
  CMAE*     800 / 1600           RGB                  85.0 / 85.3
Accuracy improves over single-task self-supervised methods
• MSN: Masked Siamese Networks
• Proposes a negative-free method that uses masked views
  Trained so that the probability distributions of a masked view and a different, unmasked view match
  The probabilities are obtained from the similarities between the class token and a set of prototypes, used as class scores
  The prototypes are learnable parameters updated together with the ViT
Introducing contrastive learning: MSN [Assran+, ECCV22]
(Figure: the anchor view is patchified, masked, and encoded by the student; the target view is encoded by an EMA encoder; both representations are compared with the prototypes to produce cluster-assignment distributions, and the anchor prediction p is trained to match the target p+ via a cross-entropy loss H(p+, p))
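A minimal PyTorch sketch of the prototype-matching objective described above; the number of prototypes, the temperatures, and the use of plain cross-entropy (without the paper's additional regularization such as mean-entropy maximization) are simplifying assumptions.

# Minimal MSN-style prototype-matching loss sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

K, D = 1024, 256
prototypes = torch.nn.Parameter(torch.randn(K, D))        # learnable prototypes, updated with the ViT

def msn_loss(anchor_cls: torch.Tensor, target_cls: torch.Tensor,
             tau_a: float = 0.1, tau_t: float = 0.025) -> torch.Tensor:
    protos = F.normalize(prototypes, dim=-1)
    p_anchor = F.softmax(F.normalize(anchor_cls, dim=-1) @ protos.t() / tau_a, dim=-1)
    with torch.no_grad():                                  # target assignments receive no gradient
        p_target = F.softmax(F.normalize(target_cls, dim=-1) @ protos.t() / tau_t, dim=-1)
    return -(p_target * torch.log(p_anchor + 1e-8)).sum(dim=-1).mean()   # cross-entropy H(p+, p)

loss = msn_loss(torch.randn(16, D), torch.randn(16, D))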
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation on ImageNet-1K
Introducing contrastive learning: MSN [Assran+, ECCV22]
  (Table 3 of the paper: linear evaluation using 100% of the labels, top-1 accuracy)
  Similar architectures: SimCLRv2 RN50 71.7; BYOL RN50 74.4; DINO ViT-S/16 77.0; iBOT ViT-S/16 77.9; MSN ViT-S/16 76.9
  Larger architectures : MAE ViT-H/14 76.6; BYOL RN200 (2×) 79.6; SimCLRv2 RN151+SK (3×) 79.8; iBOT ViT-B/16 79.4; DINO ViT-B/8 80.1; MoCov3 ViT-BN-L/7 81.0; MSN ViT-L/7 80.7
  Linear evaluation here uses 100% of the training data
  (Table 1 of the paper: extreme low-shot evaluation with 1, 2 or 5 labeled images per class; e.g. with 1 image per class MSN ViT-L/7 reaches 57.1% vs. 46.1% for iBOT ViT-B/16 and 12.3% for MAE ViT-L/16)
  Linear evaluation here uses only 1-5 labeled samples per class
In the few-shot setting, accuracy improves over contrastive learning and previous MIM methods
• FLIP: Fast Language-Image Pre-training
• Proposes a CLIP variant trained on masked images (a sketch follows the figure below)
  A Vision Transformer is used as the image encoder
  Only the image patches that are not masked are fed to the image encoder during training
• After pre-training, an unmasked tuning strategy runs a small number of additional FLIP (= CLIP) steps with 0% masking
  This absorbs the gap between masked and unmasked images
Introducing contrastive learning: FLIP [Li+, arXiv2022]
[Figure: FLIP overview — the visible patches of the masked image go through the image encoder, the text through the text encoder, and the two are trained with a contrastive loss.]
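A minimal sketch of a FLIP-style training step under stated assumptions: the encoders are replaced by stand-ins (mean-pooled random patch features and random text features), and the shapes and temperature are illustrative. It shows the two ingredients above — only the visible patches survive the random masking, and the pooled image features enter a standard CLIP-style symmetric contrastive loss.

```python
# Minimal sketch (not the authors' code) of the FLIP training idea: randomly drop a
# fraction of image patches, encode only the visible ones, and apply the usual CLIP-style
# contrastive loss between pooled image features and text features.
import numpy as np

rng = np.random.default_rng(0)

def random_visible(patches, mask_ratio=0.5):
    """Keep only (1 - mask_ratio) of the patch tokens; the rest are discarded entirely."""
    n = patches.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - mask_ratio))]
    return patches[keep]

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def clip_loss(img_feat, txt_feat, temp=0.07):
    """Symmetric InfoNCE over a batch of paired image/text features."""
    logits = l2norm(img_feat) @ l2norm(txt_feat).T / temp
    labels = np.arange(len(logits))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

# toy batch: 4 images x 196 patches x 768 dims, plus matching text features
patches = rng.normal(size=(4, 196, 768))
img_feat = np.stack([random_visible(p).mean(axis=0) for p in patches])  # stand-in image encoder: mean-pool visible patches
txt_feat = rng.normal(size=(4, 768))                                    # stand-in text encoder output
print(clip_loss(img_feat, txt_feat))
# After pre-training, FLIP runs a short unmasked tuning phase (mask_ratio = 0) to close the
# gap between masked training images and full test images.
```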
[Plot: zero-shot accuracy (%) vs. training time (hours) for mask 0% (our CLIP repro.), mask 50%, and mask 75%; masking yields a 3.7× speedup.]
The masking strategy substantially shortens training time.
• Accuracy comparison on ImageNet-1K
  Self-supervised pre-training on LAION-400M → zero-shot transfer, linear evaluation, and fine-tuning on ImageNet-1K
• Accuracy evaluation on downstream classification tasks
  Self-supervised pre-training on LAION-400M → zero-shot transfer to the downstream tasks
Introducing contrastive learning: FLIP [Li+, arXiv2022]
case data epochs B/16 L/16 L/14 H/14
CLIP [52] WIT-400M 32 68.6 - 75.3 -
OpenCLIP [36] LAION-400M 32 67.1 - 72.8 -
CLIP, our repro. LAION-400M 32 68.2 72.4 73.1 -
FLIP LAION-400M 32 68.0 74.3 74.6 75.5
Table 2. Zero-shot accuracy on ImageNet-1K classification, compared with various CLIP baselines. The image size is 224. The entries
noted by grey are pre-trained on a different dataset. Our models use a 64k batch, 50% masking ratio, and unmasked tuning.
case data epochs model zero-shot linear probe fine-tune
CLIP [52] WIT-400M 32 L/14 75.3 83.9† -
CLIP [52], our transfer WIT-400M 32 L/14 75.3 83.0 87.4
OpenCLIP [36] LAION-400M 32 L/14 72.8 82.1 86.2
CLIP, our repro. LAION-400M 32 L/16 72.4 82.6 86.3
FLIP LAION-400M 32 L/16 74.3 83.6 86.9
Table 3. Linear probing and fine-tuning accuracy on ImageNet-1K classification, compared with various CLIP baselines. The entries
noted by grey are pre-trained on a different dataset. The image size is 224. †: CLIP in [52] optimizes with L-BFGS; we use SGD instead.
The speedup of our method is of great practical value.
The CLIP baseline takes ~10 days training in 256 TPU-v3
cores, so a speedup of 2–3× saves many days in wall-clock
time. This speedup facilitates exploring the scaling behav-
ior, as we will discuss later in Sec. 4.3.
4.2. Comparisons with CLIP
In this section, we compare with various CLIP baselines
in a large variety of scenarios. We show that our method is
a competitive alternative to CLIP; as such, our fast training
method is a more desirable choice in practice.
We consider the following CLIP baselines:
• The original CLIP checkpoints [52], trained on the pri-
vate dataset WIT-400M.
Table 2 reports the results of our FLIP models, using
the best practice as we have ablated in Table 1 (a 64k
batch, 50% masking ratio, and unmasked tuning). For
ViT-L/14, our method has 74.6% accuracy, which is 1.8%
higher than OpenCLIP and 1.5% higher than our CLIP re-
production. Comparing with the original CLIP, our method
reduces the gap to 0.7%. We hope our method will improve
the original CLIP result if it were trained on the WIT data.
ImageNet linear probing. Table 3 compares the linear
probing results, i.e., training a linear classifier on the tar-
get dataset with frozen features. FLIP has 83.6% accuracy,
1.0% higher than our CLIP counterpart. It is also 0.6%
higher than our transfer of the original CLIP checkpoint,
using the same SGD trainer.
data: Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, Cars, Aircraft, VOC2007, DTD, Oxford Pets, Caltech101, Flowers102, MNIST, STL10, EuroSAT, RESISC45, GTSRB, KITTI, Country211, PCam, UCF101, Kinetics700, CLEVR, HatefulMemes, SST2 (column order for the per-dataset accuracies in the rows below)
CLIP [52] WIT-400M 92.9 96.2 77.9 48.3 67.7 77.3 36.1 84.1 55.3 93.5 92.6 78.7 87.2 99.3 59.9 71.6 50.3 23.1 32.7 58.8 76.2 60.3 24.3 63.3 64.0
CLIP [52], our eval. WIT-400M 91.0 95.2 75.6 51.2 66.6 75.0 32.3 83.3 55.0 93.6 92.4 77.7 76.0 99.3 62.0 71.6 51.6 26.9 30.9 51.6 76.1 59.5 22.2 55.3 67.3
OpenCLIP [36], our eval. LAION-400M 87.4 94.1 77.1 61.3 70.7 86.2 21.8 83.5 54.9 90.8 94.0 72.1 71.5 98.2 53.3 67.7 47.3 29.3 21.6 51.1 71.3 50.5 22.0 55.3 57.1
CLIP, our repro. LAION-400M 88.1 96.0 81.3 60.5 72.3 89.1 25.8 81.1 59.3 93.2 93.2 74.6 69.1 96.5 50.7 69.2 50.2 29.4 21.4 53.1 71.5 53.5 18.5 53.3 57.2
FLIP LAION-400M 89.3 97.2 84.1 63.0 73.1 90.7 29.1 83.1 60.4 92.6 93.8 75.0 80.3 98.5 53.5 70.8 41.4 34.8 23.1 50.3 74.1 55.8 22.7 54.0 58.5
Table 4. Zero-shot accuracy on more classification datasets, compared with various CLIP baselines. This table follows Table 11 in [52].
The model is ViT-L/14 with an image size of 224, for all entries. Entries in green are the best ones using the LAION-400M data.
Among the 25 classification datasets, FLIP improves accuracy on the majority of them (relative to the CLIP baselines trained on LAION-400M).
Accuracy also improves in both zero-shot transfer and linear evaluation.
Representative Masked Image Modeling methods
• Masked Language Modeling (MLM) in natural language processing: BERT [J. Devlin+, NAACL2019]
• Masked Image Modeling (MIM) for images: BEiT [H. Bao+, ICLR2022], iBOT [J. Zhou+, ICLR2022]
• Tokenizer-free, reconstructing the image by predicting the pixels of masked regions: MAE [K. He+, CVPR2022], SimMIM [Z. Xie+, CVPR2022]
• Predicting HOG features: Masked Feature Prediction [C. Wei+, CVPR2022]
• Mask improvements: multi-fold masking strategy (SdAE [Y. Chen+, ECCV2022]), masks built from attention weights (Attention-Guided MIM [I. Kakogeorgiou+, ECCV2022]), multi-block masking strategy (I-JEPA [M. Assran+, arXiv2023])
• Architecture improvements: acquiring multi-scale features (MCMAE [P. Gao+, NeurIPS2022])
• Introducing contrastive learning: multi-task SimMIM + contrastive learning (SiT [S. Atito+, arXiv2021]), multi-task MAE + contrastive learning (CMAE [J. Mao+, arXiv2022]), CLIP with masked images (FLIP [Y. Li+, arXiv2022]), negative-free learning with masked images (MSN [M. Assran+, ECCV2022])
• Application to other modalities: audio (MAE that Listen [P. Huang+, NeurIPS2022]), video (MAE As Spatiotemporal Learners [C. Feichtenhofer+, NeurIPS2022]), multi-modal (MultiMAE [R. Bachmann+, ECCV2022])
• MCMAE: Masked Convolution Meets Masked Autoencoders
• Proposes an MAE that can acquire multi-scale features
  Masks are created block-wise so that they align with the patch regions of the Transformer blocks
  Masked Convolution is introduced so that features of masked and unmasked regions do not mix (see the sketch after the figure)
Architecture improvements: MCMAE [Gao+, NeurIPS2022]
[Figure: MCMAE architecture — Stage 1 and Stage 2 apply Patch Embedding and Masked Convolution Blocks (DepthWise Conv + FFN, with the mask applied) at H/4×W/4×C1 and H/8×W/8×C2 resolution; Stage 3 applies Transformer Blocks to (H/16×W/16)×C3 tokens under block-wise masking. Multi-scale fusion (StrideConv+Flatten, UpSample) combines the stages, and a linear decoder reconstructs the H×W×3 image.]
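The masked-convolution idea can be sketched as follows — a toy NumPy version with assumed feature-map sizes, not the MCMAE implementation. The mask is applied both before and after a depthwise convolution, so the convolution never leaks information between masked and visible regions.

```python
# Minimal NumPy sketch (assumptions, not the MCMAE code) of "masked convolution": zero the
# masked positions before a depthwise convolution and re-apply the mask afterwards, so that
# features of visible regions never mix with features of masked regions.
import numpy as np

def depthwise_conv3x3(x, w):
    """Naive 3x3 depthwise convolution with zero padding. x: (C, H, W), w: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += w[:, dy, dx][:, None, None] * xp[:, dy:dy + H, dx:dx + W]
    return out

def masked_depthwise_conv(x, mask, w):
    """mask: (H, W) with 1 = visible, 0 = masked."""
    x = x * mask            # masked positions contribute nothing to the convolution
    out = depthwise_conv3x3(x, w)
    return out * mask       # and receive nothing back, keeping the two regions separate

rng = np.random.default_rng(0)
feat = rng.normal(size=(64, 56, 56))                     # Stage-1 feature map (H/4 x W/4, assumed)
mask = (rng.random((56, 56)) > 0.6).astype(feat.dtype)   # toy mask (in MCMAE it is block-wise, aligned to the ViT patches)
kernel = rng.normal(size=(64, 3, 3))
y = masked_depthwise_conv(feat, mask, kernel)
```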
• MultiMAE: Multi-modal Multi-task Masked Autoencoders
• Extends MAE to multi-modal inputs
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
[Figure 2 overview: MultiMAE pre-training (left) encodes selected input patches from RGB, depth, and semantic images with a shared Transformer encoder and reconstructs the masked targets with per-task decoders; fine-tuning (right) reuses the pre-trained MultiMAE encoder with task-specific heads for single-modal or multi-modal downstream tasks. The full caption is reproduced further below.]
• Multi-modal data
  Depth and semantic segmentation maps are created from the RGB images by pseudo labeling
  The outputs of models already trained for segmentation and depth estimation are used as the pseudo labels
• Encoder
  The unmasked patches of all modalities are concatenated and fed in together (see the sketch below)
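A rough NumPy sketch of this input pipeline, with hypothetical shapes; a plain uniform random sampling per modality is used purely for illustration. It only shows the idea of modality-specific linear projections followed by concatenation of the visible tokens for a shared encoder.

```python
# Minimal sketch (hypothetical shapes, not the authors' code) of the MultiMAE input
# pipeline: each modality (RGB, pseudo depth, pseudo semantic segmentation) is patchified
# and linearly projected by its own projection, a random subset of patches per modality is
# kept, and the visible tokens of all modalities are concatenated for a shared encoder.
import numpy as np

rng = np.random.default_rng(0)
D = 768                                   # shared token dimension (assumed)

def patchify(img, patch=16):
    """(C, H, W) -> (num_patches, C*patch*patch)."""
    C, H, W = img.shape
    img = img.reshape(C, H // patch, patch, W // patch, patch)
    return img.transpose(1, 3, 0, 2, 4).reshape(-1, C * patch * patch)

modalities = {
    "rgb": rng.normal(size=(3, 224, 224)),
    "depth": rng.normal(size=(1, 224, 224)),      # pseudo label from an off-the-shelf depth model
    "semseg": rng.normal(size=(1, 224, 224)),     # pseudo label from a segmentation model
}
projections = {k: rng.normal(size=(patchify(v).shape[1], D)) * 0.01
               for k, v in modalities.items()}    # per-modality linear projections

visible_tokens, keep_ratio = [], 1 / 6            # roughly 1/6 of all patches are kept
for name, img in modalities.items():
    tokens = patchify(img) @ projections[name]    # (196, D) per modality
    keep = rng.permutation(len(tokens))[: int(len(tokens) * keep_ratio)]
    visible_tokens.append(tokens[keep])

encoder_input = np.concatenate(visible_tokens, axis=0)   # fed to the shared Transformer encoder
print(encoder_input.shape)
```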
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
Figure 2. (Left) MultiMAE pre-training: A small subset of randomly sampled patches from multiple modalities (e.g., RGB, depth, and
semantic segmentation) is linearly projected to tokens with a fixed dimension and encoded using a Transformer. Task-specific decoders
reconstruct the masked-out patches by first performing a cross-attention step from queries to the encoded tokens, followed by a shallow
Transformer. The queries consist of mask tokens (in gray), with the task-specific encoded tokens added at their respective positions. (Right)
Fine-tuning: By pre-training on multiple modalities, MultiMAE lends itself to fine-tuning on single-modal and multi-modal downstream tasks.
Semantic and Depth inputs: data that must be prepared in advance (here obtained by pseudo labeling).
• Decoder
  Positional and modality embeddings are added to the linearly projected encoder outputs
  Cross-attention produces tokens that take cross-modality relations into account, which are then fed to Transformer blocks (a sketch follows the decoder details below)
  ‣ Query: the tokens of each individual modality after linear projection
  ‣ Key, Value: the tokens of all modalities after linear projection
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
To keep the number of segmentation patches constant, we downsample the semantic segmentation input by a factor of 4 and use patches of size 4×4.
MultiMAE decoder. We illustrate the MultiMAE decoder
in Fig 7. Following MAE [35], each decoder has a linear
projection layer to adapt the outputs from the encoder to
the decoder dimension. After this linear projection, we add
both sine-cosine positional embeddings and learned modal-
ity embeddings to the decoder inputs. This is then followed
by a cross-attention layer, a MLP, and two Transformer
blocks.
[Figure 7: MultiMAE decoders — the queries (mask tokens plus that modality's encoded tokens) cross-attend to keys and values formed from the encoded tokens of all modalities.]
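A single-head cross-attention sketch of this decoder step, with assumed dimensions (not the authors' code): the queries come from one modality's decoder tokens, while the keys and values are built from the encoded tokens of all modalities.

```python
# Minimal single-head cross-attention sketch (assumed dimensions, not the paper's code) of
# the MultiMAE decoder: queries are one modality's decoder tokens (mask tokens plus that
# modality's encoded tokens, with positional and modality embeddings added), while keys and
# values come from the encoded tokens of *all* modalities.
import numpy as np

rng = np.random.default_rng(0)
d = 256                                     # decoder dimension (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """queries: (Nq, d) for one modality; context: (Nc, d) encoded tokens of all modalities."""
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v                          # an MLP and shallow Transformer blocks follow

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
depth_queries = rng.normal(size=(196, d))    # mask tokens + encoded depth tokens (one per patch)
all_context = rng.normal(size=(98, d))       # linearly projected encoder outputs, all modalities
out = cross_attention(depth_queries, all_context, Wq, Wk, Wv)
print(out.shape)
```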
• Visualization of reconstructed images
  In each of the three modalities, 5/6 of all patches are randomly masked and then reconstructed
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Roman Bachmann* David Mizrahi* Andrei Atanov Amir Zamir
Swiss Federal Institute of Technology Lausanne (EPFL)
https://multimae.epfl.ch
[Figure 1 panels: masked inputs, MultiMAE predictions, and targets for the semantic, depth, and RGB modalities.]
Figure 1. MultiMAE pre-training objective. We randomly select 1/6 of all 16⇥16 image patches from multiple modalities and learn to
reconstruct the remaining 5/6 masked patches from them. The figure shows validation examples from ImageNet, where masked inputs (left),
• Accuracy comparison on downstream tasks (models self-supervised on ImageNet-1K are fine-tuned)
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
Method IN-1K (C) ADE20K (S) Hypersim (S) NYUv2 (S) NYUv2 (D)
Supervised [81] 81.8 45.8 33.9 50.1 80.7
DINO [12] 83.1 44.6 32.5 47.9 81.3
MoCo-v3 [17] 82.8 43.7 31.7 46.6 80.9
MAE [35] 83.3 46.2 36.5 50.8 85.1
MultiMAE 83.3 46.2 37.0 52.0 86.4
Table 1. Fine-tuning with RGB-only. We report the top-1 accuracy (↑) on ImageNet-1K (IN-1K) [23] classification (C), mIoU (↑) on ADE20K [102], Hypersim [68], and NYUv2 [73] semantic segmentation (S), as well as δ1 accuracy (↑) on NYUv2 depth (D). Text in bold and underline indicates the first and second-best results, respectively. All methods are pre-trained on ImageNet-1K (with pseudo labels for MultiMAE).
            Hypersim (S)           NYUv2 (S)
Method      RGB   D     RGB-D      RGB   D     RGB-D
MAE         36.5  32.5  36.9       50.8  23.4  49.3
MultiMAE    37.0  38.5  47.6       52.0  41.4  56.0
Table 2. Fine-tuning with RGB and ground truth depth. We report semantic segmentation transfer results from combinations of RGB and depth, measured in mIoU (↑). MultiMAE can effectively leverage additional modalities such as depth, while MAE cannot. Text in gray indicates a modality that the model was not pre-trained on.

            ADE20K (S)                                Hypersim (S)                              NYUv2 (S)
Method      RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS    RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS    RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS
MAE         46.2  20.0  46.3    46.2    46.3         36.5  21.0  36.9    37.7    37.3         50.1  23.8  49.1    50.1    49.3
MultiMAE    46.2  34.4  46.8    45.7    47.1         37.0  30.6  37.9    38.4    40.1         52.0  39.9  53.6    53.5    54.0
Table 3. Fine-tuning with RGB and pseudo labels. Semantic segmentation transfer results using pseudo labeled depth and semantic segmentation maps, measured in mIoU (↑). MultiMAE benefits much more than MAE from pseudo labeled modalities as input. Text in gray indicates a modality that the model was not pre-trained on.

For performing multi-modal transfers with the standard MAE, we train a new input projection for the additional modalities while fine-tuning. As shown in Table 3, MultiMAE can use pseudo depth or semantic segmentation to boost performance beyond the RGB-only setting.
(pD): depth maps obtained by pseudo labeling
(pS): semantic segmentation maps obtained by pseudo labeling
Fine-tuning with all three modalities (RGB-pD-pS) achieves the highest recognition performance.
• MAE applied to video and audio data
Application to other modalities
Figure 1: Masked Autoencoders as spatiotemporal learners. We mask a large subset (e.g., 90%)
of random patches in spacetime. An encoder operates on the set of visible patches. A small decoder
then processes the full set of encoded patches and mask tokens to reconstruct the input. Except for
patch and positional embeddings, neither the encoder, the decoder, nor the masking strategy, has any
spatiotemporal inductive bias.
To the extreme, if a video has T identical static frames, randomly sampling 1/T of all spacetime
patches would reveal most of the static frame. Because slow motion is more likely than fast motion
in natural videos, the masking ratio can be very high as we observe empirically.
The higher masking ratio leads to a more efficient solution in practice. Following the MAE in [31]
that applies the encoder only on visible tokens, a masking ratio of 90% reduces the encoder time and
memory complexity to <1/10. Put together with a small decoder [31], the MAE pre-training can
achieve a theoretically 7.7× reduction in computation vs. encoding all tokens. In fact, the computation
reduction is so large that the data loading time becomes a new bottleneck; even so, we record a 4.1×
wall-clock speedup. Such a significant speedup is of great importance for video research that is
large-scale and time-consuming.
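The efficiency argument can be illustrated with a small sketch using the sizes quoted in the figure caption (16×224×224 video, 2×16×16 spacetime patches, 90% masking); it is a toy illustration, not the paper's data pipeline.

```python
# Small sketch (toy sizes consistent with the figure caption; not the authors' code) showing
# why the encoder cost drops: with 90% of the 1568 spacetime patches masked, only ~156
# tokens are ever passed through the encoder.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, pt, ph, pw = 16, 224, 224, 2, 16, 16
video = rng.normal(size=(T, H, W, 3))

# (T, H, W, 3) -> (num_spacetime_patches, patch_voxels)
patches = (video
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, 3)
           .transpose(0, 2, 4, 1, 3, 5, 6)
           .reshape(-1, pt * ph * pw * 3))
assert patches.shape[0] == 8 * 14 * 14 == 1568

mask_ratio = 0.9
keep = rng.permutation(len(patches))[: int(len(patches) * (1 - mask_ratio))]
visible = patches[keep]                      # only these ~156 tokens go through the encoder
print(visible.shape)
```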
We report strong results on a variety of video recognition datasets. Our MAE pre-training greatly
improves generalization performance: on Kinetics-400 [35], it increases the accuracy of ViT-Large
[18] by absolute 13% vs. training from scratch, while it takes less wall-clock training time overall
(pre-training plus fine-tuning). Our MAE pre-training can outperform its supervised pre-training counterpart.
MAE As Spatiotemporal Learners [Feichtenhofer+, NeurIPS2022]
Figure 1: Audio-MAE for audio self-supervised learning. An audio recording is first transformed
into a spectrogram and split into patches. We embed patches and mask out a large subset (80%).
An encoder then operates on the visible (20%) patch embeddings. Finally, a decoder processes the
order-restored embeddings and mask tokens to reconstruct the input. Audio-MAE is minimizing the
mean square error (MSE) on the masked portion of the reconstruction and the input spectrogram.
This computational burden has been addressed in different ways. A popular approach is to reduce the
sequence length in self-attention. Various ViT-based architectures have been developed to alleviate
such issues for image and video understanding. For example, Swin-Transformer [19] only performs
local attention within windows that shift across layers. MViT [20] employs pooling attention to
construct a hierarchy of Transformers where sequence lengths are downsampled. For self-supervised
learning, MAE [1] efficiently encodes only a small portion (25%) of visual patches while the majority
of patches is discarded. The simplicity and scalability in MAE make it a promising framework for
large-scale self-supervised learning.
In this work, we study MAE for sound recognition and the unique challenges of the audio domain.
We present Audio-MAE (Fig. 1) as a unified and scalable framework for learning self-supervised audio
representations. Similar to MAE, it is composed of a pair of a Transformer encoder and decoder.
Figure 2: Visualizations on the Kinetics-400 [35] validation set (masking ratio 90%). We show the original
video (top), masked video (middle), and MAE output (bottom) for each sample. This model reconstructs the
original pixels. The video size is 16×224×224 and the spacetime patch size is 2×16×16 (the temporal patch
size of 2 is not visualized here). Each sample has 8×14×14=1568 tokens with 156 being visible. For better
visualizations, the known patches in the output are from the original input. Fig. 7 shows more examples.
Figure 3: Visualizations of the same pre-trained model in Fig. 2 but with a masking ratio of 95%.
MAE that Listen [Huang+, NeurIPS2022]
Masking is applied to the spectrogram (Audio-MAE)
Masking is applied to the video (MAE As Spatiotemporal Learners)
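As a small illustration of the Audio-MAE objective described above, the sketch below computes the MSE only over the masked time-frequency patches; the 80% masking ratio follows the description, while the spectrogram size, patching, and feature values are placeholders.

```python
# Tiny sketch (assumed shapes, not the authors' code) of the Audio-MAE objective: the
# decoder output is compared to the patchified input spectrogram with a mean squared error
# computed only over the masked patches, exactly as in image MAE but on time-frequency patches.
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim = 512, 256                      # e.g. a spectrogram split into 16x16 patches (assumed)
target = rng.normal(size=(num_patches, patch_dim))     # patchified input spectrogram (placeholder)
recon = rng.normal(size=(num_patches, patch_dim))      # decoder output (placeholder)
masked = rng.random(num_patches) < 0.8                 # 80% of the patches are masked

loss = ((recon - target)[masked] ** 2).mean()          # MSE on the masked portion only
print(loss)
```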
• Effectiveness of MAE As Spatiotemporal Learners (MAE for video)
  Training-time comparison: MAE pre-training + fine-tuning vs. training from scratch
Application to other modalities
[Figure 5: accuracy (%) on Kinetics-400 validation vs. wall-clock training time in hours (128 A100 GPUs) — MAE pre-training (800 epochs) followed by fine-tuning (100 epochs) is compared with training from scratch (400 epochs, 1-view and multi-view); MAE pre-training plus fine-tuning is much more accurate and faster than training from scratch.]
MAE pre-training + fine-tuning reaches higher performance in less training time.
• Large amounts of unlabeled data are learned through pseudo pretext tasks
• The model trained with self-supervised learning is then used as a pre-trained model
• Self-supervised learning for CNNs
  Representative methods: pretext-task improvements → contrastive learning → negative-free methods
  Contrastive learning greatly improved performance
  In some settings, such as object detection and segmentation, it matches or exceeds supervised pre-training
• Self-supervised learning for ViTs
  Representative methods: contrastive learning and negative-free methods → Masked Image Modeling
  Methods are designed to exploit the structure of the Vision Transformer
Summary