Naoki Okamoto
November 20, 2023

# The 59th Nagoya CV/PRMU Study Group: ICCV 2023 Paper Introductions (Self-Supervised Learning)

These are the slides used for the ICCV 2023 paper-introduction session at the 59th Nagoya CV/PRML study group held on November 11. Six papers on self-supervised learning are introduced.


## Transcript

2. ### Self-introduction

- Naoki Okamoto (岡本直樹) — @naokok, https://github.com/naok…
- Research themes: ensemble learning, knowledge distillation, self-supervised learning

d த෦େֶ޻ֶݚڀՊϩϘοτཧ޻ֶઐ߈ത࢜ޙظ՝ఔ̎೥ੜ ౻٢߂࿱ݚڀࣨॴଐ ࣗݾڭࢣ͋Γֶशͷ࿦จΛ঺հ͠·͢ʂ
3. ### Self-Supervised Learning (SSL)

- Learns from large amounts of unlabeled data by solving an artificial problem (pretext task)
- A model trained with SSL is then used as a pretrained model
- ① Build a pretrained model from a large pool of unlabeled data
- ② Transfer / fine-tune the SSL pretrained model to the target task (e.g. attach a classification or object-detection head on top of the pretrained backbone; the diagram's example label is "Pelican")
- (Figure: attention and patch-embedding visualizations excerpted from the ViT paper)
4. ### Representative Self-Supervised Learning Methods

- Masked Image Modeling (MIM)
  - Predict the features of masked patches: BEiT [H. Bao+, ICLR22], BEiTv2 [Z. Peng+, arXiv22]
  - Predict the pixels of masked patches: MAE [K. He+, CVPR22], SimMIM [Z. Xie+, CVPR22]
- Contrastive Learning (CL)
  - Compare image-level features extracted from multiple different views: MoCo v3 [X. Chen+, ICCV21], DINO [M. Caron+, ICCV21]
- (Diagrams — MAE training: Encoder → masking → mask-token insertion → Decoder → loss; DINO training: two Encoder + MLP branches with an exponential-moving-average teacher, Softmax with centering over local/global views, loss)
5. ### Representative Self-Supervised Learning Methods (continued)

- Multimodal CL
  - CL on image-text pairs: CLIP [A. Radford+, ICML21]
- Multimodal MIM
  - MIM on RGB, depth, and semantic maps simultaneously: MultiMAE [R. Bachmann+, ECCV22]
- Hybrid
  - Train CL and MIM jointly: iBOT [J. Zhou+, ICLR22], DINOv2 [M. Oquab+, arXiv23]
- (The slide, titled "Self-supervised learning targeting ViT," repeats the MIM / CL taxonomy and the MAE / DINO training diagrams from the previous slide for context)
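CLIP's image-text contrastive objective from the slide can be sketched as a symmetric InfoNCE loss over a batch (a toy numpy version; the temperature value is an illustrative assumption):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs on the diagonal are
    positives; every other pairing in the batch is a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent_diag(l):
        # cross-entropy with the correct class on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average of image-to-text and text-to-image directions
    return (xent_diag(logits) + xent_diag(logits.T)) / 2
```

Perfectly aligned pairs drive the loss toward zero; shuffling which caption goes with which image makes it large.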

7. ### Survey of ICCV 2023 Papers

- Searched for papers whose titles contain keywords related to self-supervised learning: "Self-supervised," "Contrastive," "Masked" / "MAE," "CLIP," and "Open Vocabulary" (the per-keyword paper counts are given on the slide)
- Introduces the six papers that Okamoto found interesting or whose approach does not appear in prior work
8. ### Papers Introduced

- Achieving a stronger pretraining effect by scaling up the dataset
  - The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
- Improving a foundation model's performance by distillation from a foundation model (self-distillation)
  - Unmasked Teacher: Towards Training-Efficient Video Foundation Models
  - Improving CLIP Fine-tuning Performance
- Proposing a new prompt for the image side of CLIP models
  - What does CLIP know about a red circle? Visual prompt engineering for VLMs
- Teaching CLIP models new relationships through additional training
  - CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
  - Teaching CLIP to Count to Ten
9. ### w ԯαϯϓϧͷେن໛σʔλΛ༻͍ͨ৔߹ͷ."&ͷֶशޮՌΛௐࠪ  *OTUBHSBNͰऩूͨ͠ԯͷը૾ͱऑϥϕϧʢը૾ͷ౤ߘʹ෇ਵ͢ΔϋογϡλάʣΛ࢖༻ w ."&ˠऑڭࢣ͋Γֶशͷ̎ஈ֊ͷࣄલֶशʹΑͬͯΑΓߴ͍ࣄલֶशͷޮՌΛൃش  ."&ɼऑڭࢣ͸ڞʹFQPDIͷֶशͷΈ w ը૾ϞσϧΛݻఆͯ͠ݴޠϞσϧΛ\$-*1ܗࣜͰֶशˠ;FSPTIPUධՁͰը૾ϞσϧΛධՁ

5IF& ff FDUJWFOFTTPG."&1SF1SFUSBJOJOHGPS#JMMJPO4DBMF1SFUSBJOJOH <.4JOHI *\$\$7>  ViT models at various scales in terms of number of param- eters, including ViT-B (86M), ViT-L (307M), and ViT-H (632M). We also train on larger 1.9B and 6.5B parameter ViT models, which we call ViT-2B and ViT-6.5B, respec- tively (Appendix Table 8). As is common practice [23, 84], we train models of sizes ViT-B, ViT-L with a patch size of 16 and larger models with a patch size of 14. We pretrain with a 224 × 224 resolution for all models. Pre-pretraining (MAE) [33] learns visual representations from image datasets without using any labels. We choose this approach as it is simple to implement and scales very Dataset Task #cls #train #val ImageNet-1k (IN1k) [64] Image cls. 1000 1M 50K iNaturalist-18 (iNat18) [36] Fine-grained cls. 8142 437K 24K ImageNetv2 (INv2) [61] Image cls. 1000 – 10K ImageNet-ReaL (IN-ReaL) [7] Image cls. 1000 – 50K ObjectNet (ON) [6] Image cls. 113 – 19K Food-101 (F-101) [9] Image cls. 101 N/A 25K COCO [49] Obj. det. 80 118K 5K LVIS [32] Obj. det. 1K 100K 20K Kinetics-400 (K400) [43] Action cls. 400 220K 20K Something Something-v2 (SSv2) [30] Action cls. 174 169K 25K Table 1: Evaluation datasets used to evaluate MAE→WSP on image classiﬁcation, object detection, and video action recognition
10. ### w ԯαϯϓϧͷେن໛σʔλΛ༻͍ͨ৔߹ͷ."&ͷֶशޮՌΛௐࠪ  *OTUBHSBNͰऩूͨ͠ԯͷը૾ͱऑϥϕϧʢը૾ͷ౤ߘʹ෇ਵ͢ΔϋογϡλάʣΛ࢖༻ w ."&ˠऑڭࢣ͋Γֶशͷ̎ஈ֊ͷࣄલֶशʹΑͬͯΑΓߴ͍ࣄલֶशͷޮՌΛൃش  ."&ɼऑڭࢣ͸ڞʹFQPDIͷֶशͷΈ w ը૾ϞσϧΛݻఆͯ͠ݴޠϞσϧΛ\$-*1ܗࣜͰֶशˠ;FSPTIPUධՁͰը૾ϞσϧΛධՁ

5IF& ff FDUJWFOFTTPG."&1SF1SFUSBJOJOHGPS#JMMJPO4DBMF1SFUSBJOJOH <.4JOHI *\$\$7>  ϙελʔηογϣϯͰͷஶऀΒ ͲΕ͚ͩ΍ͬͯ΋ऩଋ͠ͳ͍  FQPDIͰे෼ͳੑೳΛൃش SFBTPOBCMZFOPVHI
11. ### Unmasked Teacher: Towards Training-Efficient Video Foundation Models [K. Li+, ICCV23]

- Image foundation models are difficult to apply directly to video
- Trains a video foundation model from scratch with self-supervised learning, using an image foundation model as the teacher
  - Stage 1: train so that the features of the unmasked patches match the teacher's
  - Stage 2: train with video-language contrastive learning and token matching, plus masked language modeling on the text side
12. ### Improving CLIP Fine-tuning Performance [Y. Wei+, ICCV23]

- CLIP image models reach lower fine-tuning accuracy than models trained with MIM
  - Attributed to the differences between the CLIP and MIM training recipes (input, target, loss, and the presence of semantic labels)
- Proposes FD-CLIP, a training method that incorporates MIM's properties while preserving CLIP's semantic information
  - With the CLIP image model as the teacher, a student is trained from scratch by intermediate-layer distillation of every token
  - → Fine-tuning performance improves while CLIP's zero-shot performance is maintained
- (Diagrams: the differences between prior MIM methods and CLIP training; FD-CLIP's training method)

Table 3: Ingredients comparison between CLIP and MIM methods from the perspective of input ratios, training target granularity and loss format.

| Method | Input | Target | Loss | Semantics |
| --- | --- | --- | --- | --- |
| BeiT [2] | Partial | Token-level | Cross-entropy | |
| MAE [17] | Partial | Token-level | Regression | |
| CLIP [42] | Full | Image-level | Cross-entropy | ✓ |
| FD-CLIP | Full | Token-level | Regression | ✓ |
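The FD-CLIP row of Table 3 (full input, token-level target, regression loss) can be sketched like this; the L2 normalization of both feature sets and the plain squared-error loss are simplifying assumptions of this sketch rather than the paper's exact formulation:

```python
import numpy as np

def token_distill_loss(student_tokens, teacher_tokens):
    """Regression of every student token onto the corresponding CLIP
    teacher token (full image as input, no masking)."""
    s = student_tokens / np.linalg.norm(student_tokens, axis=-1, keepdims=True)
    t = teacher_tokens / np.linalg.norm(teacher_tokens, axis=-1, keepdims=True)
    return ((s - t) ** 2).sum(axis=-1).mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(197, 768))      # CLIP image tokens ([CLS] + 196 patches)
student = rng.normal(size=(197, 768))      # student trained from scratch
loss = token_distill_loss(student, teacher)
```

Unlike MAE's loss, every token contributes, and unlike CLIP's loss, the target is per-token rather than a single image-level embedding.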
13. ### What does CLIP know about a red circle? Visual prompt engineering for VLMs [A. Shtedritski+, ICCV23]

- Studies prompt engineering focused on the image modality of CLIP
- Proposes drawing a red circle around an object in the image as an image-modality prompt
- Adding the red circle to the image raises the similarity with text related to the object inside the circle
- (Figure: tasks solved with the red-circle prompt — referring-expression comprehension ("the cub on the right"), classification ("this is a cat / bird"), and keypoint naming ("the ear / eye / nose of a bear"))
14. ### CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No [H. Wang+, ICCV23]

- Studies zero-shot out-of-distribution (OOD) detection for images using CLIP with class-name prompts
- Proposes OOD detection that uses both positive and negative prompts
  - Introduces learnable "no" class prompts and a "no" text encoder initialized with the weights of CLIP's text encoder
  - The "no" text encoder is fine-tuned on the CC-3M dataset with "no" texts
  - OOD detection uses the similarities to the class prompts and to the "no" class prompts
- (Figure: feature spaces of standard methods vs. CLIPN, with hard- and easy-to-distinguish OOD samples and the "no"-logit feature space; the ID probabilities and the "saying-no" probabilities are combined by element-wise multiplication under two decision rules, "competing-to-win" and "agreeing-to-differ")
15. ### Teaching CLIP to Count to Ten [R. Paiss+, ICCV23]

- CLIP struggles with text about the number of objects in an image
  - Creates training data whose captions state the correct object count
  - Proposes CountBench, a new benchmark for evaluating a model's counting ability
- Designs a training objective that distinguishes captions with the correct count from captions with a swapped, incorrect count, and fine-tunes the CLIP model with it
- (Figure: counting vs. non-counting caption subsets with number-swapped counterfactual captions, e.g. "Four running horses…" vs. "Three running horses…"; image-retrieval results and relevancy maps comparing the original CLIP with the fine-tuned model)
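The counterfactual objective can be sketched as a two-way contrastive choice between the true caption and its number-swapped copy (toy numpy; the embeddings and the temperature are illustrative assumptions):

```python
import numpy as np

def counting_loss(img_emb, cap_emb, cf_cap_emb, temperature=0.07):
    """Cross-entropy pushing the image toward its true caption
    ("Four running horses...") and away from the number-swapped
    counterfactual caption ("Three running horses...")."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(img_emb, cap_emb), cos(img_emb, cf_cap_emb)])
    logits = logits / temperature
    logits -= logits.max()
    logp = logits - np.log(np.exp(logits).sum())
    return -logp[0]                    # target: the true caption wins

img = np.array([1.0, 0.0])
true_cap = np.array([0.9, 0.1])        # close to the image
cf_cap = np.array([0.1, 0.9])          # number swapped: far from the image
loss = counting_loss(img, true_cap, cf_cap)
```

Because the two captions differ only in the stated number, the gradient has to come from counting rather than from unrelated caption content.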
16. ### Summary

- Achieving a stronger pretraining effect by scaling up the dataset
  - The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
- Improving a foundation model's performance by distillation from a foundation model (self-distillation)
  - Unmasked Teacher: Towards Training-Efficient Video Foundation Models
  - Improving CLIP Fine-tuning Performance
- Proposing a new prompt for the image side of CLIP models
  - What does CLIP know about a red circle? Visual prompt engineering for VLMs
- Teaching CLIP models new relationships through additional training
  - CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
  - Teaching CLIP to Count to Ten