
Pre-training with Self-Supervised Learning (CVIM Tutorial)

Naoki Okamoto
January 25, 2024

Slides for a tutorial talk given at the CVIM workshop held on January 26, 2024.

January 25, 2024: slides published
February 2, 2024: fixed notation inconsistencies


Transcript

1. Self-introduction
- Naoki Okamoto (岡本直樹)
- Research topics: knowledge distillation, semi-supervised learning, self-supervised learning
- Second-year doctoral student, Graduate School of Engineering, Chubu University (Fujiyoshi Lab)
- Survey slides on self-supervised learning are available on Speaker Deck
- Knowledge distillation for ensemble learning [ECCV]
(Figure: ensemble of ResNet18_ABN models and their individual/ensemble accuracies, from the knowledge-distillation work.)

2. Self-Supervised Learning (SSL)
- Pre-train on a large amount of unlabeled data, without any ground-truth labels
- Goal: extract features that are effective for a wide range of transfer-learning / fine-tuning target tasks
- Step (1): pre-train a randomly initialized model on unlabeled data
- Step (2): transfer / fine-tune the SSL pre-trained model to the target task (e.g., attach an FC layer for a classification model or a detection head for an object-detection model; ground-truth label "Pelican" in the example)
(Figure: attention examples from the ViT paper.)

3. Self-Supervised Learning (SSL)
- Same diagram as slide 2; step (1), the task learned on unlabeled data, is called the pretext task.

4. Self-Supervised Learning (SSL)
- Same diagram; step (2), the target task, is called the downstream task.

5. Self-Supervised Learning (SSL)
- Same diagram → Which pretext task should be learned?

6. Pretext task
- A task whose ground-truth labels can be generated automatically from the data itself
- Example: Predicting Image Rotations [Gidaris+, ICLR18]
  (1) Rotate each image by 0, 90, 180, or 270 degrees
  (2) Predict which rotation was applied to the input image (4-way classification)
(Figure: rotation-prediction pipeline, quoted from [Gidaris+, ICLR18].)
→ Predicting the rotation requires understanding object concepts (position, pose, category, ...)
→ The ground-truth label is created automatically from the data augmentation applied to the image

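A minimal sketch of this rotation pretext task (illustrative PyTorch code, not the authors' implementation): each image yields four rotated copies, and the rotation index serves as the automatically generated label.

```python
import torch

def rotation_batch(images: torch.Tensor):
    """images: (B, C, H, W) -> (4B, C, H, W) rotated copies and (4B,) labels in {0,1,2,3}."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns: 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Training then reduces to ordinary 4-class classification, e.g.:
# logits = model(x_rot); loss = F.cross_entropy(logits, y_rot)
```
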
7. Evolution of pretext tasks (timeline figure)
- NLP lineage: word2vec [Mikolov+, arXiv13] (predict neighboring words) and BERT [Devlin+, NAACL19] (Masked Language Modeling, MLM); both ideas were later carried over to images
- Early image pretext tasks: Context Prediction [Doersch+, ICCV15] (relative position between image patches), Context Encoders [Pathak+, CVPR16] (predict the pixels of masked regions), Jigsaw [Noroozi and Favaro, ECCV16] (jigsaw puzzles), Colorization [Zhang+, ECCV16] (predict color), Counting [Noroozi+, ICCV17] (the sum of patch outputs should match the whole-image output), Jigsaw++ [Noroozi+, CVPR18] (mix the puzzles of two images), Image Rotations [Gidaris+, ICLR18] (predict the rotation angle), Spot Artifacts [Jenni and Favaro, CVPR18] (masking in feature space), Instance Discrimination [Wu+, CVPR18] (number of images = number of classes), CPC [van den Oord+, arXiv18] (contrastive learning over patch pairs) → CPC v2 [Hénaff+, ICML20] (improved pair construction and model architecture)
- Contrastive learning: SimCLR [Chen+, ICML20] (simple contrastive learning) / SimCLR v2 [Chen+, NeurIPS20] (larger networks), MoCo [He+, CVPR20] (past outputs reused as negatives) / MoCo v2 [Chen+, arXiv20] (adopts SimCLR's techniques) / MoCo v3 [Chen+, ICCV21] (evaluated on ViT), Embedding Learning [Ye+, CVPR19], clustering-based PCL [Li+, ICLR21] (prototypes) and SwAV [Caron+, NeurIPS20] (estimate the cluster the positive belongs to); negative-free BYOL [Grill+, NeurIPS20] (positive pairs only), SimSiam [Chen+, CVPR21] (even simpler training), DINO [Caron+, ICCV21] (predict global from local and global views); multimodal extensions CLIP [Radford+, ICML21] (image + language), CM-ACC [Ma+, ICLR21] (video + audio), MCN [Chen+, ICCV21] (video + audio + language), SLidR [Sautier+, CVPR22] (image + point cloud)
- Masked Image Modeling (MIM): predict masked-region features — BEiT [Bao+, ICLR22], iBOT [Zhou+, ICLR22]; predict masked-region pixels — MAE [He+, CVPR22], SimMIM [Xie+, CVPR22]; improved masking strategies — SdAE [Chen+, ECCV22], Attention-Guided MIM [Kakogeorgiou+, ECCV22], I-JEPA [Assran+, CVPR23]; multimodal extensions — MultiMAE [Bachmann+, ECCV22] (RGB + semantic segmentation + depth), MaskVLM [Kwon+, ICLR23] (RGB + language), CAV-MAE [Gong+, ICLR23] (RGB + audio); contrastive learning + MIM — SiT [Atito+, arXiv], CMAE [Huang+, arXiv] (MAE + contrastive learning); analysis — What Do Self-Supervised ViTs Learn? [Park+, ICLR23] (differences between contrastive learning and MIM)

8. Evolution of pretext tasks
- Same timeline as slide 7, shown again as an overview.

9. Various pretext tasks
- Solving Jigsaw Puzzles [Noroozi and Favaro, ECCV16]: (1) split the image into nine tile-like patches and shuffle them; (2) predict the index of the predefined shuffle permutation
- Context Encoders [Pathak+, CVPR16]: predict the masked region with an encoder-decoder model
(Figures quoted from [Noroozi and Favaro, ECCV16] and [Pathak+, CVPR16].)

10. Various pretext tasks
- Colorful Image Colorization [Zhang+, ECCV16]: (1) create a grayscale image from a color image; (2) predict the ab values of the Lab color space from the grayscale image
- Non-Parametric Instance Discrimination [Wu+, CVPR18]: learn so that the feature of each image is independent of all others (the number of images is treated as the number of classes)
(Figures quoted from [Wu+, CVPR18] and [Zhang+, ECCV16].)

11. Various pretext tasks
- Context Prediction [Doersch+, ICCV15]: predict the relative position between patches cropped in a tile pattern
- Contrastive Predictive Coding (CPC v2) [Hénaff+, ICML20]: from a patch's features, predict the features of the patch located k positions ahead
(Figures quoted from [Hénaff+, ICML20] and [Doersch+, ICCV15].)

12. Evolution of pretext tasks (same timeline as slide 7)
- Contrastive learning matches or exceeds supervised pre-training → contrastive learning becomes the mainstream.

13. Evolution of pretext tasks (same timeline as slide 7)
- After the rise of ViT, Masked Image Modeling — BERT's MLM applied to images — appears.

14. Evaluating self-supervised models: linear evaluation (linear probing, linear classification)
- Train a linear layer, with supervision, on the features extracted by the frozen pre-trained model and the ground-truth labels
(Diagram: SSL model → features → randomly initialized linear layer → predictions such as "Airplane" / "Cat"; a sketch follows.)

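A minimal sketch of linear evaluation, assuming a frozen pre-trained `encoder`, a labeled `loader`, and illustrative feature/class sizes (all hypothetical names):

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 2048, 1000          # e.g., ResNet-50 features, ImageNet-1K classes (assumed)
linear = nn.Linear(feat_dim, num_classes)   # randomly initialized linear head
optimizer = torch.optim.SGD(linear.parameters(), lr=0.1, momentum=0.9)

encoder.eval()                              # encoder weights stay frozen
for images, labels in loader:
    with torch.no_grad():
        feats = encoder(images)             # features from the SSL model
    loss = nn.functional.cross_entropy(linear(feats), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```
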
15. Evaluating self-supervised models: k-NN evaluation
- Apply a k-NN classifier to the features extracted by the model and the ground-truth labels
- Extract features for the training data and the evaluation data, then classify each evaluation sample from its K nearest neighbors
(Diagram: features + labels of training data and evaluation data, k-NN classification; a sketch follows.)

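A sketch of the k-NN protocol, assuming `train_feats`, `train_labels`, `test_feats`, `test_labels` were already extracted with the frozen SSL encoder (simple majority vote; DINO-style weighted voting is a common refinement):

```python
import torch.nn.functional as F

train_feats = F.normalize(train_feats, dim=1)   # L2-normalize so dot product = cosine similarity
test_feats = F.normalize(test_feats, dim=1)

sim = test_feats @ train_feats.T                # (N_test, N_train) similarity matrix
topk = sim.topk(k=20, dim=1).indices            # indices of the K nearest training samples
preds = train_labels[topk].mode(dim=1).values   # majority vote over the K neighbors
acc = (preds == test_labels).float().mean()
```
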
16. Evaluating self-supervised models: fine-tuning
- Fine-tune to various downstream tasks such as classification, object detection, and segmentation (attach a task-specific head to the pre-trained model and train with supervision)

17. Choosing an evaluation protocol
- Linear evaluation and k-NN are usually reported as a set; both evaluate the learned representation indirectly
  - Linear evaluation — merit: relatively high accuracy; drawback: accuracy varies with the training conditions (hyperparameters) of the linear layer
  - k-NN — merit: little sensitivity to hyperparameters; drawback: often lower accuracy than linear evaluation
- Fine-tuning evaluates transferability as a pre-trained model
  - merit: can be evaluated on a wide range of downstream tasks; drawback: the evaluation itself requires training time

18. Summary of SSL basics
- Self-supervised learning: pre-train a model by learning a pretext task on unlabeled data; the aim is to extract features that are effective for various downstream tasks
- Pretext task: a task whose ground-truth labels can be created automatically from the data; many methods have been proposed, and contrastive learning and Masked Image Modeling are currently the mainstream
- Evaluation: the representation cannot be evaluated directly, so it is evaluated through downstream-task accuracy; linear evaluation and k-NN evaluate the representation indirectly, while fine-tuning evaluates transferability to various downstream tasks

19. Contrastive Learning
- A way of learning a pretext task that finds matching pairs within a mini-batch
- Discriminates the similarities and differences between data created by data augmentation
- Anchor image → data augmentation → positive; other images → negatives
- Positive pair: a relation whose feature similarity should be made high; negative pair: a relation whose feature similarity should be made low
- Learns the relations between patterns and parts within the data

20. SimCLR: data augmentation
- Augmentation analysis: how linear-evaluation accuracy changes with the combination of two data augmentations
- Accuracy is low when random crop, Cutout, or color transformation is not used
(Figure: augmentation-pair ablation on ImageNet-1K, quoted and modified from [Chen+, ICML20].)

21. SimCLR: data augmentation
- The combination of random crop and color transformation achieves the highest accuracy
- Color transformation: grayscale conversion and color jitter
(Figure: same ablation on ImageNet-1K, quoted and modified from [Chen+, ICML20].)

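A sketch of a SimCLR-style augmentation pipeline with torchvision; the exact parameters here are illustrative rather than the paper's precise settings:

```python
from torchvision import transforms

simclr_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),                                        # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),                                        # color transformation
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# Two independent draws of simclr_aug on the same image give the positive pair.
```
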
22. SimCLR: model structure
- A two-layer MLP (projection head) is added after the encoder
- Ablation on the projection head: without it, linear-evaluation accuracy drops by roughly 10 points → the encoder output becomes over-specialized to the contrastive task
- Introducing the projection head keeps the encoder's representation from specializing to contrastive learning
(Figure quoted from [Chen+, ICML20].)

23. SimCLR: loss function
- Uses the Normalized Temperature-scaled Cross Entropy loss (NT-Xent):
  L_{i,j} = -log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(z_i, z_k)/τ) ]
  where sim(·,·) is cosine similarity, (i, j) is the positive pair, the denominator runs over the similarities to all other samples, and τ is the temperature parameter
- The similarity relations between samples are expressed as a probability distribution, and a cross-entropy loss is computed

24. SimCLR: loss function (example where samples 1 and 2 form the positive pair, i.e., i = 1, j = 2)
- Logits: sim(z_1, z_2), sim(z_1, z_3), ..., sim(z_1, z_{2N})
- Temperature-scaled softmax: p_{i,j} = exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(z_i, z_k)/τ)
- Cross entropy against the target labels y_{1,2}, y_{1,3}, ..., y_{1,2N}:
  L_{i,j} = -Σ_{k=1}^{2N} 1[k≠i] y_{i,k} log p_{i,k} = -log p_{i,j}

25. SimCLR: loss function (temperature)
- Compared with the ordinary softmax (τ = 1.0), τ < 1.0 sharpens the probability distribution and τ > 1.0 flattens it
- SimCLR uses a temperature smaller than 1

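A minimal NT-Xent sketch. It assumes `z` is a (2N, d) tensor in which rows i and i+N come from the two augmented views of the same image (an assumed layout, not SimCLR's exact code); the default temperature is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent(z: torch.Tensor, tau: float = 0.1):
    z = F.normalize(z, dim=1)                       # cosine similarity via dot products
    sim = z @ z.T / tau                             # (2N, 2N) temperature-scaled logits
    sim.fill_diagonal_(float('-inf'))               # exclude k == i from the denominator
    n = z.size(0) // 2
    # the positive of sample i is its other view: i+N (or i-N)
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, pos)                # softmax over all pairs + cross entropy
```
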
26. SimCLR: effect of training conditions
- Accuracy changes with batch size and number of epochs, and with model size
- The larger the batch size, the number of epochs, and the model size, the larger the benefit of contrastive learning
(Figures quoted / modified from [Chen+, ICML20]: color-distortion-strength table and linear evaluation for models of varied depth and width, with supervised baselines for reference.)

27. Various contrastive-learning methods
- Many variants have been devised to improve downstream performance or to reduce computational cost:
  increasing the number of positives; increasing the number of negatives; improving the choice of positives and negatives; improving data augmentation; using large-scale models; learning without negatives; applying to modalities other than images; extending to multimodal settings

28. Various contrastive-learning methods
- Same list as slide 27, used as a section overview.

29. Increasing the number of negatives: MoCo [He+, CVPR20]
- The parameters of the momentum model are updated as an exponential moving average (EMA) of the encoder:
  θ_m ← λ θ_m + (1 − λ) θ_e, where θ_m are the momentum-model parameters, θ_e the encoder / linear-layer parameters, and λ the weight on the previous parameters
- λ is set close to 1 (0.999 in the paper) → the momentum model changes only slightly at each update
(Diagram: mini-batch → data augmentation → encoder + linear layer and momentum model; momentum-model outputs are stored in a dictionary (queue) and reused as negatives in the loss.)

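A sketch of the momentum (EMA) update, with `encoder` and `momentum_encoder` assumed to be two architecturally identical modules:

```python
import torch

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, lam: float = 0.999):
    # theta_m <- lam * theta_m + (1 - lam) * theta_e
    for p_e, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(lam).add_(p_e.data, alpha=1.0 - lam)
```
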
30. Improving positives and negatives
- Positives and negatives are defined purely by data augmentation
- Problem: different images of the same object category inside the mini-batch are still treated as negatives
- Treat part of the negatives as positives based on feature-vector similarity: NNCLR [Dwibedi+, ICCV21], ReSSL [Zheng+, NeurIPS21]
- Cluster in feature space and contrast based on the clustering result: SwAV [Caron+, NeurIPS20], PCL [Li+, ICLR21], SMoG [Pang+, ECCV22]

31. Contrastive learning without negatives: SimSiam [Chen+, CVPR21]
- Why can it be trained without collapsing?
- Analyses: Contrasting the landscape of contrastive and non-contrastive learning [Pokle+, AISTATS22]; Exploring the Equivalence of Siamese SSL via A Unified Gradient Framework [Tao+, CVPR22]; Bridging the Gap from Asymmetry Tricks to Decorrelation Principles in Non-contrastive SSL [Liu+, NeurIPS22]; On the duality between contrastive and non-contrastive SSL [Garrido+, ICLR23]; Implicit variance regularization in non-contrastive SSL [Halvagal+, NeurIPS23]
- Many analyses from different viewpoints, but the question remains open

32. Contrastive learning without negatives: SimSiam [Chen+, CVPR21]
- Empirical tendencies in SimSiam (linear evaluation on ImageNet-1K): with vs. without the prediction head, with vs. without stop-gradient
- Without stop-gradient the representation collapses immediately; without the prediction MLP accuracy also drops to chance level
- The combination of the stop-gradient operation and the asymmetric architecture created by the prediction head is crucial
(Tables and plots quoted from [Chen+, CVPR21].)

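A sketch of the SimSiam objective: negative cosine similarity with a stop-gradient on the target branch. `f` (encoder + projector) and `h` (prediction head) are hypothetical module names.

```python
import torch.nn.functional as F

def simsiam_loss(x1, x2, f, h):
    z1, z2 = f(x1), f(x2)            # projections of the two augmented views
    p1, p2 = h(z1), h(z2)            # predictions (asymmetric branch)
    # stop-gradient: the targets z1, z2 are treated as constants
    loss = -(F.cosine_similarity(p1, z2.detach(), dim=1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=1).mean()) / 2
    return loss
```
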
33. Contrastive learning without negatives: DINO [Caron+, ICCV21]
- DINO: self-distillation with no labels
- Very effective pre-training for Vision Transformers (ViT)
- Replaces the prediction head with centering and sharpening operations
- The student's output distribution is trained to match the teacher model's (momentum model's) output distribution
(Diagram: multi-crop data augmentation → local views to the student ViT + projection head (MLP), global views to the teacher (EMA of the student); teacher output is centered, both are turned into probability distributions by a sharpened softmax, and a cross-entropy loss is computed with stop-gradient on the teacher; the class-token feature vector is used.)

34. Contrastive learning without negatives: DINO [Caron+, ICCV21]
- Sharpening: adjusts the distribution so that one dimension of the feature vector is emphasized
- Centering: adjusts so that the same dimension is not emphasized for every image
- Student: P_s(x)^(i) = exp(g_{θs}(x)^(i)/τ_s) / Σ_{k=1}^{K} exp(g_{θs}(x)^(k)/τ_s)
- Teacher: P_t(x)^(i) = exp((g_{θt}(x)^(i) − c)/τ_t) / Σ_{k=1}^{K} exp((g_{θt}(x)^(k) − c)/τ_t)
- Centering value: c ← m c + (1 − m) (1/B) Σ_{i=1}^{B} g_{θt}(x_i), where B is the mini-batch size and m a hyperparameter
- τ_s, τ_t: temperature parameters; c: centering value

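A sketch of these formulas in code (variable names and the temperature / momentum defaults are illustrative):

```python
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04, m=0.9):
    # sharpening: a small teacher temperature tau_t makes P_t emphasize one dimension
    p_t = F.softmax((teacher_out - center) / tau_t, dim=1).detach()   # centering + stop-gradient
    log_p_s = F.log_softmax(student_out / tau_s, dim=1)
    loss = -(p_t * log_p_s).sum(dim=1).mean()                         # cross entropy
    # centering: EMA of the teacher outputs over the mini-batch
    center = m * center + (1 - m) * teacher_out.mean(dim=0)
    return loss, center
```
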
35. Contrastive learning without negatives: DINO [Caron+, ICCV21]
- Visualizes the attention weights on the class token of a ViT trained with DINO (the head of the last multi-head attention layer that attends most to the foreground)
- Visualization after thresholding the attention weights
→ Accurate object regions are obtained without any label information
→ Compared with supervised training, DINO's attention concentrates on the object regions
(Figures quoted from [Caron+, ICCV21].)

36. Accuracy comparison of contrastive methods
- Linear-evaluation accuracy on ImageNet-1K for a ResNet-50 pre-trained with each method, after 100 / 200 / 800 epochs of contrastive learning (accuracies quoted from the respective papers)
- Compared settings (mini-batch size, negatives, dictionary, clustering, multi-crop, prediction head):
  SimCLR [Chen+, ICML20] — negatives; MoCo v2 [He+, arXiv20] — negatives + dictionary (queue); SwAV [Caron+, NeurIPS20] — clustering, reported without and with multi-crop; SimSiam [Chen+, CVPR21] — prediction head, no negatives; DINO [Caron+, ICCV21] — multi-crop and centering/sharpening, no negatives

37. Accuracy comparison (same table)
- Increasing the number of contrastive-learning epochs improves accuracy.

38. Accuracy comparison (same table)
- The dictionary allows strong results with a much smaller mini-batch.

39. Accuracy comparison (same table)
- Learning from clustering results improves accuracy.

40. Accuracy comparison (same table)
- Increasing the number of positives with multi-crop improves accuracy.

41. Accuracy comparison (same table)
- SimSiam can be trained with a small mini-batch and a simple setup.

42. Accuracy comparison (same table)
- High accuracy can already be reached with a small number of epochs.

43. Accuracy comparison (same table)
- Centering and sharpening prevent collapse without a prediction head that has learnable parameters.

44. Multimodal contrastive learning: CLIP [Radford+, ICML21]
- CLIP: Contrastive Language-Image Pre-training
- (1) Pre-training: self-supervised training so that the features of paired images and texts match
- (2) Image classification: extract features from texts that mention the classes
- (3) Image classification: classify the image from the feature similarity between the image and the texts
- Learns the correspondence between images and text → this correspondence enables image classification without any additional training
(Figure quoted from [Radford+, ICML21].)

45. Multimodal contrastive learning: CLIP [Radford+, ICML21]
- Contrastive learning with the pairs defined by the dataset as positives
  - Positive pair: an image-text pair defined in the dataset
  - Negative pairs: all other image-text combinations within the mini-batch
- The loss compares the feature-similarity matrix (cosine similarities) with the ideal similarity relation (matching pairs on the diagonal)
(Figure quoted and modified from [Radford+, ICML21].)

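A sketch of this symmetric contrastive loss over a batch of image-text pairs, assuming `img_feats` and `txt_feats` are already L2-normalized (B, d) tensors and the temperature is illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, tau: float = 0.07):
    logits = img_feats @ txt_feats.T / tau                         # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=img_feats.device)  # diagonal = true pairs
    loss_i = F.cross_entropy(logits, targets)                      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)                    # text -> image direction
    return (loss_i + loss_t) / 2
```
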
46. Multimodal contrastive learning: CLIP [Radford+, ICML21]
- Can be applied to image classification without any additional training (zero-shot)
- The class of the input image is predicted from the feature similarities with texts containing the class names (e.g., features extracted from "A photo of a plane / car / dog / bird" → judged as the "dog" class)
- Prompt template: "A photo of a {object}."
(Figure quoted and modified from [Radford+, ICML21].)

47. Multimodal contrastive learning: CLIP [Radford+, ICML21]
- Text features are extracted from the prompt template with each class name filled in
- The prompt template is designed by hand to match the images and the problem setting
  - Bird recognition: "A photo of a {object}, a type of bird."
  - Traffic-sign recognition: "A zoomed in photo of a {object} traffic sign."
(Figure quoted and modified from [Radford+, ICML21].)

48. Multimodal contrastive learning: CLIP [Radford+, ICML21] — zero-shot classification, step 1
- Extract features from the image to be classified
- (Then compute the cosine similarity between the image and each text, and output the class name of the most similar text)
(Figure quoted and modified from [Radford+, ICML21].)

49. Zero-shot classification, step 2
- Compute the cosine similarity between the image features and the features of each text
(Figure quoted and modified from [Radford+, ICML21].)

50. Zero-shot classification, step 3
- The class name contained in the most similar text is returned as the prediction (e.g., the "dog" class)
(Figure quoted and modified from [Radford+, ICML21]; a sketch of the whole procedure follows.)

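A sketch of zero-shot classification using the official `clip` package API (`clip.load`, `clip.tokenize`, `encode_image`, `encode_text`); the image path and class list are hypothetical examples.

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
classes = ["plane", "car", "dog", "bird"]
texts = clip.tokenize([f"A photo of a {c}." for c in classes])   # prompt template + class names

with torch.no_grad():
    img = preprocess(Image.open("example.jpg")).unsqueeze(0)     # hypothetical input image
    img_f = model.encode_image(img)
    txt_f = model.encode_text(texts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)             # cosine similarity
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    pred = (img_f @ txt_f.T).argmax(dim=-1)                      # most similar prompt wins

print(classes[pred.item()])                                      # e.g., "dog"
```
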
51. Multimodal contrastive learning: CLIP [Radford+, ICML21]
- Many methods exploit the image-language correspondence established by CLIP:
  image generation — DALL·E 2 (unCLIP) [Ramesh+, arXiv22]; open-vocabulary segmentation — OVSeg [Liang+, CVPR23]; prompt engineering on the image modality — Visual prompt engineering [Shtedritski+, ICCV23]; out-of-distribution detection — CLIPN [Wang+, ICCV23]; analysis of misclassifications — DOMINO [Eyuboglu+, ICLR22]
(Chart: number of papers whose title or keywords include "CLIP" at ICLR 2022 through ICCV 2023.)

52. Multimodal contrastive learning: extensions to various modality combinations
- Multimodal contrastive learning has been proposed for many modality combinations:
  - Image x audio: CM-ACC [Ma+, ICLR21]
  - Image x language: MCT [Yuan+, CVPR21], CLIP [Radford+, ICML21], FLIP [Li+, CVPR23]
  - Image x optical flow: CoCLR [Han+, NeurIPS20]
  - Image x point cloud: SLidR [Sautier+, CVPR22]
  - Video x audio x language: MMV Networks [Alayrac+, NeurIPS20], Everything at Once [Shvetsova+, CVPR22], MCN [Chen+, ICCV21]
  - Image x language x point cloud: CLIP2Scene [Chen+, CVPR23], CLIP2 [Zeng+, CVPR23], CLIP2Point [Huang+, ICCV23]
(Figures quoted from [Alayrac+, NeurIPS20] and [Zeng+, CVPR23].)

53. Summary of contrastive learning
- Trained to discriminate positives and negatives created by data augmentation
- The quality of the learned representation depends on the positives and negatives → many approaches focus on how positives and negatives are chosen
- With stop-gradient and an asymmetric architecture, training is possible without negatives → why this avoids collapse is still not fully understood
- Using dataset-defined pairs as positives extends contrastive learning to multimodal data → contrastive learning aligns the modalities → the image-language correspondence captured by CLIP [Radford+, ICML21] can be exploited in many tasks

54. Evolution of pretext tasks
- Same timeline as slide 7 (transition to the Masked Image Modeling part).

55. Pre-training in NLP: BERT [Devlin+, NAACL19]
- BERT: Bidirectional Encoder Representations from Transformers
- A bidirectional Transformer trained in two steps: pre-training and fine-tuning
- Two pre-training tasks:
  - Masked Language Modeling: mask a portion of the tokens (15% in BERT) and predict the masked words
  - Next Sentence Prediction: predict whether Sentence B is the continuation of Sentence A
(Figure quoted from [Devlin+, NAACL19].)

56. Applying Masked Language Modeling to images
- Goal: apply BERT's Masked Language Modeling to the pre-training of image models
- Problem: position embeddings and mask tokens, which rely on the Transformer architecture, are hard to introduce into CNNs
- With the rise of the Vision Transformer (ViT) [Dosovitskiy+, ICLR21], Masked Language Modeling can be applied to ViT pre-training
(Figure: ViT architecture, quoted from [Dosovitskiy+, ICLR21].)

57. Representative MIM: MAE [He+, CVPR22]
- MAE: Masked Autoencoder
- A simple MIM method that needs neither a special architecture nor a pre-trained model
- Predicts the pixels of masked patches with an encoder-decoder structure; the MSE loss is computed only on the outputs for the mask tokens
- Only the encoder is used after self-supervised pre-training
(Diagram: input patches → encoder on the visible patches with positional embeddings (PE) → decoder with mask tokens → reconstruction; a sketch follows.)

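A sketch of MAE-style random masking and the masked-patch MSE, assuming `patches` is a (B, N, D) tensor of patch embeddings and a 75% masking ratio (illustrative, not the authors' code):

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)   # random score per patch
    ids_shuffle = noise.argsort(dim=1)                # random permutation of patch indices
    ids_keep = ids_shuffle[:, :n_keep]                # visible patches fed to the encoder
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)                   # 1 = masked, 0 = visible
    return visible, mask

# Loss on masked patches only, given decoder predictions `pred` and pixel targets `target`:
# loss = ((pred - target) ** 2).mean(dim=-1)          # per-patch MSE
# loss = (loss * mask).sum() / mask.sum()             # average over masked patches
```
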
58. Representative MIM: MAE [He+, CVPR22]
- Performance vs. masking ratio under the random masking strategy
- Accuracy changes with the masking ratio; a 75% masking ratio gives high accuracy in both linear evaluation and fine-tuning → much higher than BERT's masking ratio (15%)
(Plots: ImageNet-1K linear-evaluation and fine-tuning accuracy vs. masking ratio, quoted from [He+, CVPR22].)

59. Representative MIM: MAE [He+, CVPR22]
- Same masking-ratio plots, together with masking examples (block 50%, grid 75%, random 75%)
- The task cannot be solved by simply extending lines or textures → it pushes the model toward understanding whole objects and scenes

60. Representative MIM: MAE [He+, CVPR22]
- Performance vs. masking strategy:
  - Random: highest accuracy in both fine-tuning and linear evaluation
  - Block: contiguous regions are masked, so prediction is difficult
  - Grid: high-quality reconstruction from the surrounding context, yet low linear-evaluation accuracy
- High prediction quality on masked patches does not necessarily mean a good representation
(Mask examples, per-patch reconstructions, and ablation tables quoted from [He+, CVPR22].)

61. Predicting feature vectors
- We want to capture not only low-level information such as shape and structure but also high-level semantic information → this is hard with pixel prediction
- MIM that predicts the feature vectors produced by a trained model:
  - BEiT [Bao+, ICLR22]: predicts the output of the encoder of a dVAE (an image-generation model)
  - EVA [Fang+, CVPR23]: predicts the output of CLIP's image encoder
  - iBOT [Zhou+, ICLR22]: prepares the prediction target online

62. Predicting feature vectors: iBOT [Zhou+, ICLR22]
- iBOT: image BERT pre-Training with Online Tokenizer
- The output of a momentum model (the online tokenizer) is used as the target
- Trained with two losses: a negative-free contrastive loss L_[CLS] and a feature-prediction MIM loss L_MIM
- Performs contrastive learning while using the information it captures for MIM
(Diagram quoted from [Zhou+, ICLR22].)

63. Improving the masking
- In MIM, the masking scheme determines both the difficulty of masked-patch prediction and the representation that is learned (cf. MAE's masking-ratio and masking-strategy ablations)
- Masking schemes and MIM strategies are therefore refined to make pre-training more effective
(Plots and tables quoted from [He+, CVPR22].)

64. Improving the masking: Attention-Guided MIM [Kakogeorgiou+, ECCV22]
- Images are spatially redundant, so random masking rarely masks the semantically important regions
- Introduces masking guided by the attention weights of the momentum model (teacher) into iBOT; the attention weights are those of each patch with respect to the class token
- Masking the high-attention regions is compared with masking the low-attention regions → masking the high-attention regions improves performance
(Figure: AttMask examples vs. random and block-wise masking, quoted from [Kakogeorgiou+, ECCV22].)

65. Accuracy comparison of MIM methods
- ViT-B models self-supervised on ImageNet-1K; linear-evaluation and fine-tuning accuracy on ImageNet-1K, together with the number of pre-training epochs (accuracies quoted from the respective papers)
- Compared settings (prediction target, tokenizer, use of contrastive learning, improved masking, improved prediction):
  contrastive baseline DINO; MAE [He+, CVPR22] — pixel values, no tokenizer; BEiT [Bao+, ICLR22] — features, pre-trained dVAE tokenizer; EVA [Fang+, CVPR23] — features, pre-trained CLIP tokenizer; iBOT [Zhou+, ICLR22] — features, momentum-model tokenizer, with contrastive learning; Attention-Guided MIM [Kakogeorgiou+, ECCV22] — features, momentum-model tokenizer, contrastive learning, improved masking; I-JEPA [Assran+, CVPR23] — features, momentum-model tokenizer, improved masking and prediction

66. Accuracy comparison (same table)
- Compared with contrastive learning (DINO), MIM improves fine-tuning accuracy.

67. Accuracy comparison (same table)
- Accuracy changes with the prediction target.

68. Accuracy comparison (same table)
- An online tokenizer removes the need to prepare a separately trained model.

69. Accuracy comparison (same table)
- By exploiting semantic information, training is possible without contrastive learning.

  70. Application to modalities other than images
• Splitting data into patch-like units makes MIM applicable to modalities other than images (see the masking sketch after this slide)
  - Video ("MAE As Spatiotemporal Learners"): masking is applied to space-time patches of the video [Feichtenhofer+, NeurIPS22]
  - Audio (Audio-MAE, "Masked Autoencoders that Listen"): masking is applied to spectrogram patches [Huang+, NeurIPS22]
[Figure: video MAE pipeline — a large subset (e.g., 90%) of random space-time patches is masked, the encoder operates only on the visible patches, and a small decoder reconstructs the input; reconstruction examples on Kinetics-400 at 90% and 95% masking (a 16x224x224 clip with 2x16x16 patches gives 8x14x14 = 1568 tokens, of which 156 are visible). Cited from [Feichtenhofer+, NeurIPS22].]
[Figure: Audio-MAE pipeline — an audio recording is converted to a spectrogram and split into patches, a large subset (80%) is masked, and the decoder reconstructs the masked portion with an MSE loss. Cited from [Huang+, NeurIPS22].]
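To make the shared recipe concrete, the sketch below shows per-sample random masking over a flattened set of space-time patch tokens at a 90% ratio; the tensor shapes and the helper name random_masking are assumptions for illustration, not the papers' code.

```python
# Minimal sketch of random masking over space-time patch tokens (assumed
# shapes). Only the visible tokens are passed to the encoder; the mask marks
# which patches must be reconstructed.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.9):
    """tokens: (B, N, D) patch embeddings; keep (1 - mask_ratio) of them."""
    B, N, D = tokens.shape
    len_keep = int(N * (1.0 - mask_ratio))

    noise = torch.rand(B, N)                  # one random score per patch
    ids_shuffle = noise.argsort(dim=1)        # random permutation of patches
    ids_keep = ids_shuffle[:, :len_keep]      # indices of the visible patches

    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                   # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_keep

# A 16x224x224 clip with 2x16x16 space-time patches gives 8*14*14 = 1568 tokens;
# at a 90% masking ratio the encoder only sees 156 of them.
visible, mask, _ = random_masking(torch.randn(2, 1568, 768))
print(visible.shape)   # torch.Size([2, 156, 768])
```

The same routine applies unchanged to spectrogram patches; only the patch-embedding layer and the masking ratio (80% in Audio-MAE) differ.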
  71. Application to modalities other than images
• Effect in "MAE As Spatiotemporal Learners" (MAE for video)
  - Training-time comparison: MAE pre-training + fine-tuning vs. full training from scratch
[Figure: accuracy (%) vs. wall-clock time (hours, 128 A100 GPUs) on Kinetics-400 validation — MAE pre-training (800 epochs) plus fine-tuning (100 epochs) is both faster and more accurate than 400 epochs of training from scratch. Cited from [Feichtenhofer+, NeurIPS22].]
→ MAE + fine-tuning reaches higher performance with less total training time
  72. Extension to multiple modalities: MultiMAE [Bachmann+, ECCV22]
• MultiMAE: Multi-modal Multi-task Masked Autoencoders
• Extends MAE to multiple input modalities
[Figure: MultiMAE overview — a small subset of randomly sampled patches from multiple modalities (RGB, depth, semantic segmentation) is encoded by a shared Transformer, and task-specific decoders reconstruct the masked patches; the pre-trained encoder is then fine-tuned on single-modal or multi-modal downstream tasks. Cited from [Bachmann+, ECCV22].]
  73. Extension to multiple modalities: MultiMAE [Bachmann+, ECCV22]
• Multimodal data
  - Depth and semantic-segmentation maps are created from the RGB images by pseudo-labeling
  - The pseudo-labels are the outputs of models already trained on segmentation / depth tasks
• Encoder
  - The unmasked patches of all modalities are fed into the encoder together (a tokenization sketch follows this slide)
[Figure: MultiMAE pre-training inputs — RGB plus pseudo-labeled Depth and Semantic maps, i.e., the data that must be prepared in advance. Cited from [Bachmann+, ECCV22], partially modified.]
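The sketch below illustrates the tokenization idea described above: each modality is patchified and linearly projected to a shared dimension, and the resulting token sets are concatenated before masking and encoding. The module names, patch sizes, and the 64-channel class-embedding input for the segmentation map are assumptions of this sketch, not the official MultiMAE code.

```python
# Hedged sketch of MultiMAE-style multimodal tokenization (assumed shapes).
import torch
import torch.nn as nn

D = 768                                                     # shared token dimension
proj = nn.ModuleDict({
    "rgb":    nn.Conv2d(3,  D, kernel_size=16, stride=16),  # 16x16 patches
    "depth":  nn.Conv2d(1,  D, kernel_size=16, stride=16),  # pseudo-labeled depth
    "semseg": nn.Conv2d(64, D, kernel_size=4,  stride=4),   # pseudo-labeled class embeddings
})

def tokenize(inputs: dict) -> dict:
    """inputs: modality -> (B, C, H, W); returns modality -> (B, N, D) tokens."""
    return {m: proj[m](x).flatten(2).transpose(1, 2) for m, x in inputs.items()}

inputs = {
    "rgb":    torch.randn(2, 3, 224, 224),
    "depth":  torch.randn(2, 1, 224, 224),   # output of a pre-trained depth network
    "semseg": torch.randn(2, 64, 56, 56),    # output of a pre-trained segmentation network
}
tokens = tokenize(inputs)
all_tokens = torch.cat(list(tokens.values()), dim=1)  # joint encoder input before masking
print(all_tokens.shape)   # torch.Size([2, 588, 768]) = 196 + 196 + 196 tokens
```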
  74. Extension to multiple modalities: MultiMAE [Bachmann+, ECCV22]
• Decoder
  - Positional and modality embeddings are added to the linearly projected encoder outputs
  - Cross-attention produces tokens that account for the relations between modalities, which are then fed to Transformer blocks (a decoder sketch follows this slide)
    ‣ Query: the tokens of each individual modality after linear projection
    ‣ Key / Value: the tokens of all modalities after linear projection
[Figure: MultiMAE decoders — each decoder linearly projects the encoder outputs to the decoder dimension, adds sine-cosine positional and learned modality embeddings, then applies a cross-attention layer, an MLP, and two Transformer blocks; the semantic-segmentation input is downsampled by a factor of 4 and uses 4x4 patches. Cited from [Bachmann+, ECCV22], partially modified.]
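A hedged sketch of the decoder structure described above: encoder outputs are projected to a smaller decoder width, mask tokens fill the missing positions, positional and modality embeddings are added, and the queries of one modality cross-attend to the tokens of all modalities before a shallow Transformer and a reconstruction head. Module names, widths, and depths are assumptions for illustration, not the official implementation.

```python
# Hedged sketch of a MultiMAE-style per-modality decoder (assumed sizes).
import torch
import torch.nn as nn

class ModalityDecoder(nn.Module):
    def __init__(self, enc_dim=768, dec_dim=256, num_patches=196, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)            # adapt encoder width to the decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.mod_emb = nn.Parameter(torch.zeros(1, 1, dec_dim))   # this modality's embedding
        self.cross_attn = nn.MultiheadAttention(dec_dim, num_heads=8, batch_first=True)
        layer = nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dec_dim, patch_pixels)       # per-patch reconstruction target

    def forward(self, own_tokens, own_idx, all_tokens):
        # own_tokens: (B, Nv, enc_dim) encoded visible tokens of this modality
        # own_idx:    (B, Nv)          their patch positions within this modality
        # all_tokens: (B, Na, enc_dim) encoded visible tokens of every modality
        B = own_tokens.shape[0]
        queries = self.mask_token.expand(B, self.pos_emb.shape[1], -1).clone()
        own = self.proj(own_tokens)
        queries.scatter_(1, own_idx.unsqueeze(-1).expand_as(own), own)  # place encoded tokens
        queries = queries + self.pos_emb + self.mod_emb    # positional + modality information
        kv = self.proj(all_tokens)                         # keys/values: all modalities
        x, _ = self.cross_attn(queries, kv, kv)            # queries: this modality only
        x = self.blocks(x)
        return self.head(x)                                # (B, num_patches, patch_pixels)
```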
  75. Extension to multiple modalities: MultiMAE [Bachmann+, ECCV22]
• Across the three modalities, patches are masked at random so that only 1/6 of the total patch count stays visible and the remaining 5/6 must be reconstructed (a sampling sketch follows this slide)
[Figure: MultiMAE pre-training objective — masked inputs, MultiMAE predictions, and targets for RGB (top), depth (middle), and semantic segmentation (bottom) on ImageNet validation images; no loss is computed on non-masked patches. Cited from [Bachmann+, ECCV22].]
→ Even a completely masked modality can be predicted from the information in the other modalities
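The sketch below shows one way to realize the sampling just described: a global budget of 1/6 of all patches is split across the modalities (here with a symmetric Dirichlet draw, which is an assumption of this sketch) and that many patch indices are drawn uniformly within each modality.

```python
# Hedged sketch of sampling which patches stay visible across modalities.
import torch

def sample_visible(num_patches: dict, keep_fraction: float = 1 / 6, alpha: float = 1.0):
    total_keep = int(sum(num_patches.values()) * keep_fraction)
    modalities = list(num_patches)
    # Split the global budget between modalities; a modality may receive ~0 visible
    # patches, in which case it must be reconstructed from the other modalities.
    props = torch.distributions.Dirichlet(torch.full((len(modalities),), alpha)).sample()
    budgets = (props * total_keep).round().long().tolist()
    visible = {}
    for m, k in zip(modalities, budgets):
        k = min(k, num_patches[m])
        visible[m] = torch.randperm(num_patches[m])[:k]   # indices of visible patches
    return visible

vis = sample_visible({"rgb": 196, "depth": 196, "semseg": 196})
print({m: len(idx) for m, idx in vis.items()})   # counts vary per draw; total ~= 98
```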
  76. Extension to multiple modalities: deployment to various modality combinations
• RGB × language: M3AE [X. Geng+, arXiv'22], MaskVLM [G. Kwon+, ICLR'23], MAGVLT [S. Kim+, CVPR'23]
• RGB × audio: CAV-MAE [Y. Gong+, ICLR'23], XKD [G. Kwon+, arXiv'22], Audiovisual MAE [M. Georgescu+, ICCV'23], MAViL [P. Huang+, NeurIPS'23]
• RGB × point cloud: PiMAE [A. Chen+, CVPR'23], GeoMIM [J. Liu+, ICCV'23]
• RGB × Hematoxylin × Eosin (cell imagery): MMAE [W. Ikezogwo+, ML4H'22]
• Optical image × SAR image × DEM × map: Remote Sensing Data Fusion [M. Chen+, arXiv'23]
• RGB × depth: CoMAE [J. Yang+, AAAI'23]
• RGB × depth × infrared: ActionMAE [S. Woo+, AAAI'23]
[Figure: architecture diagrams of the listed methods, e.g., the two-stage CoMAE pipeline (stage 1 without positional embeddings, stage 2 with a contrastive loss) and ActionMAE's fusion with memory tokens, random modality drop, and RGB-only inference. Cited from the respective papers.]
  77. Summary of Masked Image Modeling (MIM)
• Self-supervised learning designed with the Vision Transformer (ViT) as the model to be trained
• Patches are masked, and the pixel values or features of the masked patches are predicted from the unmasked ones
  - Masked and unmasked patches are tied together, so the model captures contextual information within the image
• The quality of the learned representation depends on how patches are masked and on what is predicted
  - Many approaches focus on the masking strategy, on how the masked patches are predicted, and on the prediction target
• Because data can be cut into patch-like units, MIM is applicable to many modalities
  - Multimodal MIM has been proposed for various combinations of modalities (a minimal MIM loss sketch follows below)
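As a reference point for the family summarized above, the sketch below shows the generic MIM objective: given per-patch predictions and targets (pixels or features), the reconstruction loss is averaged only over masked positions. Shapes and the MSE choice are assumptions; individual methods differ in target and loss.

```python
# Minimal sketch of the generic MIM objective: the loss is computed only on
# masked patches, so unmasked patches serve purely as context.
import torch
import torch.nn.functional as F

def mim_loss(pred, target, mask):
    """pred, target: (B, N, P) per-patch values; mask: (B, N) with 1 = masked."""
    per_patch = F.mse_loss(pred, target, reduction="none").mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

pred   = torch.randn(2, 196, 768)
target = torch.randn(2, 196, 768)
mask   = (torch.rand(2, 196) < 0.75).float()   # e.g. a 75% masking ratio
print(mim_loss(pred, target, mask))
```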
  78. Hybrid methods combining contrastive learning and MIM
• When the target model is a ViT, contrastive learning and MIM operate at different levels
  - Contrastive learning: image-level learning (class token)
  - MIM: patch-level learning
• Hybrid methods that perform contrastive learning and MIM at the same time
  - SiT [Atito+, arXiv21]
  - CMAE [Huang+, arXiv22]
  - What Do Self-Supervised ViTs Learn? [Park+, ICLR23]
  79. Hybrid method: What Do Self-Supervised ViTs Learn? [Park+, ICLR23]
• MIM and contrastive learning produce different learning effects in terms of self-attention, feature extraction, and which layers matter
• Training with a combination of MIM and contrastive learning shows that the two objectives are complementary

  L = (1 − λ) L_MIM + λ L_CL
  λ: weight that adjusts the balance, L_MIM: MIM loss, L_CL: contrastive loss

[Figure: mutual information of self-attention, Fourier analysis of the features, and accuracy on ImageNet-1K as λ varies. Cited from [Park+, ICLR23].]
→ Tuning the balance appropriately exploits the strengths of both contrastive learning and MIM (a combined-loss sketch follows below)
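A minimal sketch of the combined objective above: the contrastive term is computed on the image-level [CLS] embeddings of two views, the MIM term on the patch tokens, and λ balances the two. The loss functions themselves are assumed to exist elsewhere (e.g., an InfoNCE-style loss and a masked reconstruction loss such as the mim_loss sketch earlier).

```python
# Minimal sketch of the hybrid objective L = (1 - lambda) * L_MIM + lambda * L_CL.
def hybrid_loss(cls_view1, cls_view2, patch_pred, patch_target, mask,
                contrastive_loss, mim_loss, lam: float = 0.5):
    l_cl = contrastive_loss(cls_view1, cls_view2)     # image level (class token)
    l_mim = mim_loss(patch_pred, patch_target, mask)  # patch level
    return (1.0 - lam) * l_mim + lam * l_cl
```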
  80. Scaling up the dataset: DINOv2 [Oquab+, arXiv23]
• Investigates the behavior of image self-supervised learning with large-scale data of roughly 142 million samples
• LVD-142M is built by combining several existing datasets with images collected from the web
• DINOv2 is designed on top of iBOT by combining existing losses and techniques
[Table: composition of the LVD-142M dataset — existing datasets (ImageNet-22k, ImageNet-1k, fine-grained classification, segmentation, depth-estimation, and retrieval sets) are expanded with retrieved web images and rebalanced across datasets, giving 142,109,386 images in total. Cited from [Oquab+, arXiv23].]
  81. Scaling up the dataset: DINOv2 [Oquab+, arXiv23]
• For each dataset, the number of images is augmented with web images through three steps (a similarity-based sketch follows below)
  - Embedding: features are extracted with a ViT-H self-supervised pre-trained on ImageNet-22k
  - Deduplication: feature-based copy detection removes near-duplicate web images
  - Retrieval: the N web images most similar to the dataset are added to it
[Figure: curation pipeline — existing curated datasets and uncurated web images pass through Embedding, Deduplication, and Retrieval to form the augmented curated dataset. Cited from [Oquab+, arXiv23].]
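The sketch below illustrates the curation idea with plain cosine similarity over frozen features: near-duplicate web images are dropped first, then the web images closest to each curated image are retrieved. The threshold, k, and the brute-force similarity search are assumptions of this sketch; the actual DINOv2 pipeline relies on a dedicated copy-detection model and efficient indexing at a much larger scale.

```python
# Hedged sketch of embedding-based deduplication + retrieval (assumed thresholds).
import torch
import torch.nn.functional as F

def curate(curated_emb, web_emb, dedup_thresh=0.95, k=4):
    curated = F.normalize(curated_emb, dim=1)   # features of the existing datasets
    web = F.normalize(web_emb, dim=1)           # features of uncurated web images

    # 1) Deduplication: drop web images that are near-copies of an earlier web image
    sim_web = web @ web.T
    dup = (sim_web.triu(diagonal=1) > dedup_thresh).any(dim=0)
    web = web[~dup]

    # 2) Retrieval: for each curated image, keep its k most similar web images
    sim = curated @ web.T                       # (n_curated, n_kept_web)
    retrieved = sim.topk(k, dim=1).indices      # indices into the kept web images
    return ~dup, retrieved

kept, retrieved = curate(torch.randn(100, 384), torch.randn(1000, 384))
print(retrieved.shape)                          # torch.Size([100, 4])
```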
  85. Scaling up the dataset: DINOv2 [Oquab+, arXiv23]
• Investigates the behavior of image self-supervised learning with large-scale data of 142M samples
• LVD-142M is built by combining existing datasets with images from the internet
• DINOv2 is designed on top of iBOT by combining existing losses and techniques
[Diagram: self-supervised training with DINOv2 — two augmented views of the input image are processed by a ViT (student) and a momentum ViT (teacher, updated by an exponential moving average); the [CLS] features are trained with a negative-free contrastive objective over prototypes (softmax vs. Sinkhorn-Knopp cluster assignment) and the patch tokens with an MIM objective. The DINOv2-trained model is then compressed by knowledge distillation: a pre-trained large ViT teaches a small ViT (again with a momentum model, a negative-free contrastive loss, and MIM), and the small ViT is what is ultimately used. An EMA teacher-update sketch follows this slide.]
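One concrete ingredient shared by the student/teacher diagram above and by the distillation stage is the momentum (teacher) update. The sketch below shows the exponential-moving-average rule; the momentum value and the Linear stand-in module are assumptions for illustration.

```python
# Minimal sketch of the EMA update for a momentum teacher (no gradients flow
# through the teacher; it simply tracks the student's weights).
import copy
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

student = torch.nn.Linear(8, 8)       # stand-in for the student ViT
teacher = copy.deepcopy(student)      # teacher starts as a copy of the student
# ... after every optimizer step on the student:
ema_update(student, teacher)
```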
  86. Scaling up the dataset: DINOv2 [Oquab+, arXiv23]
• Accuracy comparison on ImageNet-1K
  - Weakly supervised: image-language multimodal methods
  - Data: the pre-training dataset
[Table: linear evaluation of frozen features on ImageNet-1k — DINOv2 (ViT-g/14, LVD-142M) reaches 86.5% top-1, above prior self-supervised methods (e.g., iBOT ViT-L/16 at 82.3%) and on par with weakly supervised image-text models such as OpenCLIP ViT-G/14 (86.2%). Cited from [Oquab+, arXiv23].]
→ Using images only, DINOv2 reaches higher accuracy than previous self-supervised methods
• Effect of knowledge distillation
  - Teacher (ViT-g) parameter count: roughly 1.1 billion
  - Student (ViT-L) parameter count: roughly 300 million
→ The distilled student reaches higher accuracy than training the same model from scratch
  87. Scaling up the dataset: DINOv2 [Oquab+, arXiv23]
• Patch features are analyzed by applying principal component analysis (PCA) twice (a sketch follows below)
  1. Apply PCA to all patch features of several images
     • Threshold the first principal component to split the patches into foreground and background
  2. Apply PCA to the patch features judged as foreground across those images
     • Color each patch using the values of the first, second, and third principal components as RGB
[Figure: visualization of the first PCA components — the same object parts are matched between related images despite changes of pose, style, or even object, and the background is removed by thresholding the first component. Cited from [Oquab+, arXiv23]; see also the DINOv2 blog post on ai.facebook.com.]
→ Part-level relations between objects are learned without any human-annotated labels
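The sketch below reproduces the two-stage PCA analysis in a few lines of NumPy/scikit-learn: a first PCA thresholded on its leading component separates foreground from background patches, and a second PCA over the foreground patches supplies three components used as RGB. The threshold of 0 and the feature shapes are assumptions.

```python
# Hedged sketch of the two-stage PCA visualization over frozen patch features.
import numpy as np
from sklearn.decomposition import PCA

def pca_patch_colors(feats: np.ndarray, threshold: float = 0.0):
    """feats: (num_images * num_patches, dim) patch features from a frozen model."""
    # Stage 1: threshold the first principal component to get a foreground mask
    pc1 = PCA(n_components=1).fit_transform(feats)[:, 0]
    foreground = pc1 > threshold

    # Stage 2: PCA on foreground patches only; the first 3 components become RGB
    rgb = np.zeros((feats.shape[0], 3))
    fg = PCA(n_components=3).fit_transform(feats[foreground])
    fg = (fg - fg.min(axis=0)) / (fg.max(axis=0) - fg.min(axis=0) + 1e-8)  # scale to [0, 1]
    rgb[foreground] = fg
    return rgb, foreground

feats = np.random.randn(4 * 256, 1024)     # e.g. 4 images x 256 patches of ViT-L features
rgb, fg = pca_patch_colors(feats)
print(rgb.shape)                           # (1024, 3): one color per patch
```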
  88. Challenges of self-supervised learning [1]
• Evaluating and selecting self-supervised pre-trained models
  - Typical evaluation: accuracy on ImageNet-1K plus accuracy on some arbitrarily chosen downstream tasks
  - Problem: the downstream tasks evaluated besides ImageNet-1K differ from method to method
→ When reusing a self-supervised pre-trained model, you cannot tell whether it has the transferability you need until you evaluate it
• How Well Do Self-Supervised Models Transfer? [Ericsson+, CVPR21]: evaluates existing methods on a total of 40 downstream tasks
  - Evaluates transferability to downstream tasks: the relation between ImageNet-1K accuracy and downstream accuracy
  - Which downstream tasks a method is strong or weak on differs between methods
  - Some downstream tasks show little correlation with ImageNet-1K accuracy
[Figure: transfer performance is highly correlated with ImageNet accuracy for many-shot recognition but increasingly less correlated for few-shot recognition, object detection, and dense prediction. Cited from [Ericsson+, CVPR21].]
  89. Challenges of self-supervised learning [2]
• Evaluating and selecting hyperparameters and method components
  - Typical evaluation: based on how performance changes on a downstream task
  - Problem: it is hard to optimize the training conditions (hyperparameters) with unlabeled data alone
→ Calls for evaluation that uses only unlabeled data, i.e., evaluating the representation itself
• SelfAugment [Reed+, CVPR21]: hyperparameter search aimed at applying AutoAugment / RandAugment to contrastive learning
  - Studies the relation between the contrastive loss, contrastive accuracy, rotation-prediction accuracy, and linear evaluation on SVHN
  - The contrastive loss and contrastive accuracy are uncorrelated with linear-evaluation accuracy
  - A pretext task that is not used during training (rotation prediction) can serve as the metric for hyperparameter search (a proxy-metric sketch follows below)
[Figure: SVHN supervised classification accuracy plotted against the InfoNCE loss (left), contrastive top-1 accuracy (middle), and linear rotation-prediction accuracy (right); only the rotation accuracy shows a strong linear relationship. Cited from [Reed+, CVPR21].]
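A hedged sketch of that proxy metric: freeze the encoder trained under a candidate hyperparameter setting, train a small linear head to predict image rotations (0/90/180/270 degrees, a pretext task not used during pre-training), and use its accuracy to rank candidates. The head, optimizer, step count, and the reuse of the training loader for the final accuracy are all simplifications of this sketch.

```python
# Hedged sketch of a rotation-prediction proxy score for ranking hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_batch(images: torch.Tensor):
    """images: (B, C, H, W) -> (4B, C, H, W) rotated copies and their rotation labels."""
    rots = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.shape[0])
    return torch.cat(rots), labels

def rotation_score(frozen_encoder: nn.Module, head: nn.Module, loader, steps: int = 100):
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    frozen_encoder.eval()
    for _, (images, _) in zip(range(steps), loader):       # train the linear head only
        x, y = rotation_batch(images)
        with torch.no_grad():
            feats = frozen_encoder(x)
        loss = F.cross_entropy(head(feats), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                   # accuracy = selection metric
        x, y = rotation_batch(next(iter(loader))[0])
        acc = (head(frozen_encoder(x)).argmax(dim=1) == y).float().mean()
    return acc.item()
```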
  90. Challenges of self-supervised learning [3]
• When to stop training
  - Trend: for most methods, performance improves as the number of epochs increases
  - Problem: it is unclear how long to train and when training should be stopped
[Figure: trend in SimCLR, cited from [Chen+, ICML20]; trend in MAE — longer training schedules (100 to 1600 epochs) keep improving both fine-tuning and linear-probing accuracy, cited from [He+, CVPR22].]
  91. Challenges of self-supervised learning [4]
• Self-supervised learning for small models
  - Trend: models with more parameters achieve higher performance on downstream tasks
  - Problem: when the model has few parameters, the benefit of self-supervised learning drops
[Figure: trend in SimCLR (linear evaluation) — linear evaluation of models with varied depth and width, comparing 100-epoch self-supervised training, 1000-epoch self-supervised training, and 90-epoch supervised training; cited from [Chen+, ICML20]. Trend in MAE (fine-tuning) — MAE pre-training vs. supervised pre-training on ImageNet-1K for ViT-B/L/H; cited from [He+, CVPR22].]