
The 59th Nagoya CV/PRMU Study Group: ICCV 2023 Paper Introductions (Self-Supervised Learning)

Naoki Okamoto
November 20, 2023


Slides used for the ICCV 2023 paper introduction session at the 59th Nagoya CV/PRMU study group held on November 11. Six papers on self-supervised learning are introduced.



Transcript

1. Self-introduction
   • Naoki Okamoto (岡本直樹), @naok… / https://github.com/naok…
   • Research topics: ensemble learning, knowledge distillation, self-supervised learning
   • Second-year doctoral student, Department of Robotic Science and Technology, Graduate School of Engineering, Chubu University (Fujiyoshi laboratory)
   • In this talk I introduce papers on self-supervised learning!
2. Self-supervised learning (SSL: Self-supervised Learning)
   • Learns from large amounts of unlabeled data through a pseudo problem (pretext task)
   • A model trained with SSL is then used as a pre-trained model
   ① Build a pre-trained model from a large amount of unlabeled data
   ② Transfer / fine-tune the SSL pre-trained model to the target task, e.g. a classification model (FC head trained with teacher labels such as "Pelican") or an object detection model (a sketch of this two-stage pipeline follows below)
   [Figure: the two-stage pipeline (unlabeled data → pre-trained model → classification / detection models); the slide also shows an excerpt from the ViT paper (patch-embedding principal components, representative attention examples, Figure 6)]
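
As referenced above, here is a minimal PyTorch-style sketch of the two-stage pipeline (① self-supervised pre-training on unlabeled data, ② attach a head and fine-tune on the target task). The rotation-prediction pretext task, the toy Encoder, and every dimension are illustrative assumptions; the slide does not commit to a particular pretext task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Backbone pre-trained without labels, later reused as the pre-trained model."""
    def __init__(self, dim=128):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # patchify-like stem
        self.pool = nn.AdaptiveAvgPool2d(1)
    def forward(self, x):
        return self.pool(self.stem(x)).flatten(1)                  # (B, dim)

encoder = Encoder()

# Stage 1: self-supervised pre-training with a simple pretext task (rotation prediction)
rot_head = nn.Linear(128, 4)                                       # which of 0/90/180/270 degrees?
opt = torch.optim.AdamW(list(encoder.parameters()) + list(rot_head.parameters()), lr=1e-4)

images = torch.randn(8, 3, 224, 224)                               # stand-in for unlabeled data
ks = torch.randint(0, 4, (images.size(0),))                        # random rotation per image
rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2)) for img, k in zip(images, ks)])
loss = F.cross_entropy(rot_head(encoder(rotated)), ks)
loss.backward()
opt.step()
opt.zero_grad()

# Stage 2: transfer to the target task by attaching a new head and fine-tuning
cls_head = nn.Linear(128, 1000)                                    # e.g. 1000-way classification
model = nn.Sequential(encoder, cls_head)
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
finetune_loss = F.cross_entropy(model(x), y)
finetune_loss.backward()
```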
3. Representative self-supervised learning methods
   • Masked Image Modeling (MIM)
     - Predict the features of masked patches: BEiT [H. Bao+, ICLR 2022], BEiT v2 [Z. Peng+, arXiv]
     - Predict the pixels of masked patches: MAE [K. He+, CVPR 2022], SimMIM [Z. Xie+, CVPR 2022] (a minimal MAE-style sketch follows below)
   • Contrastive Learning (CL)
     - Compare image-level features extracted from multiple different views: MoCo v3 [X. Chen+, ICCV 2021], DINO [M. Caron+, ICCV 2021]
   [Figures: the MAE training pipeline (masking, Encoder on visible patches, adding mask tokens, Decoder, loss computation) and the DINO training pipeline (Local/Global views, student and teacher Encoder + MLP, exponential moving average, Softmax with centering, loss computation)]
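
A minimal sketch of the MAE-style masked image modeling recipe named above (mask random patches, encode only the visible ones, add mask tokens, decode, regress the pixels of the masked patches). It drops positional embeddings, the real ViT blocks, and the unshuffling of token order, and all sizes and the 75% mask ratio are assumptions, so treat it as a reading of the diagram rather than the papers' implementation.

```python
import torch
import torch.nn as nn

B, N, P, D = 4, 196, 16 * 16 * 3, 128           # batch, patches per image, pixels per patch, width
patches = torch.randn(B, N, P)                   # images already split into flattened patches

embed   = nn.Linear(P, D)                        # patch embedding
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), 2)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), 1)
to_pixels  = nn.Linear(D, P)
mask_token = nn.Parameter(torch.zeros(1, 1, D))

mask_ratio = 0.75
num_keep = int(N * (1 - mask_ratio))
perm = torch.rand(B, N).argsort(dim=1)           # random per-image patch order
keep_idx, mask_idx = perm[:, :num_keep], perm[:, num_keep:]

# Encoder sees only the visible patches (this is what makes MAE efficient)
visible = torch.gather(embed(patches), 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
encoded = encoder(visible)

# Decoder gets encoded visible tokens plus mask tokens for the missing positions
# (real MAE restores the original patch order before decoding; omitted here)
masked_slots = mask_token.expand(B, N - num_keep, D)
decoded = decoder(torch.cat([encoded, masked_slots], dim=1))

# Loss: regress the raw pixels of the masked patches only
pred_masked   = to_pixels(decoded[:, num_keep:])
target_masked = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, P))
loss = (pred_masked - target_masked).pow(2).mean()
```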
4. Representative self-supervised learning methods
   • Multimodal CL: CL on image-text pairs, e.g. CLIP [A. Radford+, ICML 2021] (a sketch of the image-text contrastive loss follows below)
   • Multimodal MIM: MIM on RGB, depth, and semantic maps simultaneously, e.g. MultiMAE [R. Bachmann+, ECCV 2022]
   • Hybrid: train CL and MIM jointly, e.g. iBOT [J. Zhou+, ICLR 2022], DINOv2 [M. Oquab+, arXiv]
   [The slide also repeats the MIM / CL taxonomy and the MAE / DINO training diagrams from the previous slide as part of an overview figure on self-supervised learning for ViT]
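
For the multimodal CL entry above, a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective used by CLIP-style training; the feature dimensions and the fixed temperature are placeholders (CLIP itself learns the temperature as a logit scale).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Symmetric InfoNCE: matched (image, text) pairs are positives, all other pairs negatives."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(img.size(0))           # the i-th image matches the i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# toy usage with random features standing in for encoder outputs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```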
5. Survey of ICCV 2023 papers
   • Searched for papers whose titles contain keywords related to self-supervised learning:
     - Self-supervised: … papers
     - Contrastive: … papers
     - Masked / MAE: … papers
     - CLIP: … papers
     - Open Vocabulary: … papers
   • Of these, I introduce six papers that I found interesting or that take approaches not found in prior work
6. Papers introduced
   • Achieving a stronger pre-training effect by scaling up the dataset
     - The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
   • Improving a foundation model by distillation from a foundation model (self-distillation)
     - Unmasked Teacher: Towards Training-Efficient Video Foundation Models
     - Improving CLIP Fine-tuning Performance
   • Proposing new prompts for CLIP models that focus on the image
     - What does CLIP know about a red circle? Visual prompt engineering for VLMs
   • Teaching CLIP models new relationships through additional training
     - CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
     - Teaching CLIP to Count to Ten
7. The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV 2023]
   • Investigates the pre-training effect of MAE at billion scale
     - Uses about 3 billion Instagram images together with weak labels (the hashtags attached to each post)
   • Two-stage pre-training, MAE followed by weakly supervised learning, gives a stronger pre-trained model
     - Both the MAE stage and the weakly supervised stage are trained for only … epoch(s)
   • The image model is evaluated by freezing it, training a language model on top in CLIP style, and then scoring the image model with zero-shot evaluation (a sketch of this protocol follows below)
   [Excerpt from the paper shown on the slide: ViT models at several scales (ViT-B 86M, ViT-L 307M, ViT-H 632M, plus ViT-2B 1.9B and ViT-6.5B 6.5B parameters; patch size 16 for ViT-B/L and 14 for larger models; 224 x 224 pre-training resolution), and Table 1 listing the evaluation datasets for MAE→WSP: ImageNet-1k, iNaturalist-18, ImageNetv2, ImageNet-ReaL, ObjectNet, Food-101 (image classification), COCO, LVIS (object detection), Kinetics-400, Something-Something-v2 (video action recognition)]
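
A sketch of the evaluation protocol in the last bullet: the pre-trained image model is frozen, a text encoder is trained against it with a CLIP-style contrastive loss, and the image features are then scored by zero-shot classification. The linear stand-ins for the two towers, their dimensions, and the optimizer settings are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(2048, 512)   # stands in for the frozen pre-trained image model
text_encoder  = nn.Linear(300, 512)    # stands in for the trainable text tower

for p in image_encoder.parameters():   # the image model is kept fixed
    p.requires_grad_(False)

opt = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

def contrastive_step(img_inputs, txt_inputs, temperature=0.07):
    img = F.normalize(image_encoder(img_inputs), dim=-1)
    txt = F.normalize(text_encoder(txt_inputs), dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    loss.backward()
    opt.step()
    opt.zero_grad()

contrastive_step(torch.randn(8, 2048), torch.randn(8, 300))

# Zero-shot evaluation: embed one text prompt per class, classify by the nearest prompt
class_prompts = torch.randn(1000, 300)                        # "a photo of a {class}" features
prompt_emb = F.normalize(text_encoder(class_prompts), dim=-1)
img_emb = F.normalize(image_encoder(torch.randn(8, 2048)), dim=-1)
pred = (img_emb @ prompt_emb.t()).argmax(dim=-1)
```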
8. Improving CLIP Fine-tuning Performance [Y. Wei+, ICCV 2023]
   • CLIP's image model reaches lower fine-tuning accuracy than models trained with MIM
     - Attributed to differences in the training recipes of CLIP and MIM (input, target, loss, and whether semantic labels are used)
   • Proposes FD-CLIP, a training method that keeps CLIP's semantic information while incorporating the properties of MIM
     - With the CLIP image model as the teacher, a student is trained from scratch by distilling the teacher's intermediate features for every token (a distillation sketch follows below)
     - As a result, fine-tuning performance improves while CLIP's zero-shot performance is maintained
   [Table 3 of the paper, comparing the ingredients of CLIP and MIM methods (input ratio, target granularity, loss format, semantics):
     BEiT: partial input, token-level target, cross-entropy loss, no semantics
     MAE: partial input, token-level target, regression loss, no semantics
     CLIP: full input, image-level target, cross-entropy loss, with semantics
     FD-CLIP: full input, token-level target, regression loss, with semantics]
   [Figures: differences between conventional MIM methods and CLIP training; the FD-CLIP training pipeline]
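
A minimal sketch of the token-level feature distillation described above: the frozen CLIP image model acts as the teacher and a student is trained from scratch to regress the teacher's per-token features. The tiny transformer stacks, the projection head, and the smooth-L1 choice of regression loss are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D_t, D_s = 4, 197, 768, 768               # batch, tokens (cls + patches), teacher/student widths

teacher = nn.TransformerEncoder(nn.TransformerEncoderLayer(D_t, nhead=8, batch_first=True), 2)
student = nn.TransformerEncoder(nn.TransformerEncoderLayer(D_s, nhead=8, batch_first=True), 2)
proj = nn.Linear(D_s, D_t)                      # maps student tokens into the teacher's feature space

for p in teacher.parameters():                  # the teacher (CLIP image model) stays frozen
    p.requires_grad_(False)

tokens = torch.randn(B, N, D_t)                 # stand-in for embedded image patches (full image, no masking)
with torch.no_grad():
    target = teacher(tokens)                    # teacher's per-token features

pred = proj(student(tokens))
loss = F.smooth_l1_loss(pred, target)           # token-level regression on every token
loss.backward()
```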
9. CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No [H. Wang+, ICCV 2023]
   • A study of zero-shot out-of-distribution (OOD) detection for images using CLIP with class-name prompts
   • Proposes OOD detection that uses both positive and negative prompts
     - Introduces learnable "no" class prompts and a "no" text encoder initialized with the weights of CLIP's text encoder
   • The "no" text encoder is fine-tuned on the CC3M dataset with "no" texts
     - OOD detection uses the similarities to the class prompts and to the "no" class prompts (a scoring sketch follows below)
   [Figure: feature space of standard methods vs. CLIPN (easy- vs. hard-to-distinguish OOD samples, the feature space of the "no" logit); standard class prompts ("A photo of a cow / cat / fish") and learnable "no" class prompts ("A photo of no cow / cat / fish") fed to the Text Encoder, the "no" Text Encoder, and the Image Encoder; ID probabilities and "saying no" probabilities are combined by the competing-to-win and agreeing-to-differ decision rules]
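
A sketch of zero-shot OOD scoring with paired class prompts and "no" class prompts, in the spirit of the slide above. The combination rule (down-weighting each class by its "no" probability and thresholding what remains) is an illustrative simplification and may not match the paper's exact competing-to-win / agreeing-to-differ formulas.

```python
import torch
import torch.nn.functional as F

def ood_score(img_emb, class_prompt_emb, no_prompt_emb, temperature=0.01):
    """All embeddings are assumed to be L2-normalized CLIP features.
    class_prompt_emb[i] ~ "a photo of a {class_i}",
    no_prompt_emb[i]    ~ the learned "no" prompt for class_i (from the "no" text encoder)."""
    yes_sim = img_emb @ class_prompt_emb.t() / temperature     # (B, C)
    no_sim  = img_emb @ no_prompt_emb.t() / temperature        # (B, C)
    id_probs = F.softmax(yes_sim, dim=-1)                      # which ID class, if any
    # probability that the image is NOT class i, from the paired yes/no similarities
    p_no = torch.softmax(torch.stack([yes_sim, no_sim], dim=-1), dim=-1)[..., 1]
    # an image is likely in-distribution if some class is both probable and not negated
    id_evidence = (id_probs * (1.0 - p_no)).sum(dim=-1)
    return 1.0 - id_evidence                                   # higher score = more likely OOD

img_emb = F.normalize(torch.randn(8, 512), dim=-1)
cls_emb = F.normalize(torch.randn(10, 512), dim=-1)
no_emb  = F.normalize(torch.randn(10, 512), dim=-1)
scores = ood_score(img_emb, cls_emb, no_emb)
is_ood = scores > 0.5   # threshold chosen only for illustration
```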
10. Teaching CLIP to Count to Ten [R. Paiss+, ICCV 2023]
   • CLIP handles text about the number of objects in an image poorly
     - Creates training data whose captions state the correct object counts
     - Proposes CountBench, a new benchmark for evaluating a model's counting ability
   • Designs a training objective that separates captions with the correct count from captions with a wrong count, and fine-tunes the CLIP model with it (a sketch of such an objective follows below)
   [Figure: counting and non-counting caption subsets; counterfactual captions created by swapping the number word (e.g. "Three running horses..." vs. "Four running horses..."); the CLIP image and text encoders; qualitative image-retrieval results and relevancy maps comparing the original CLIP with the fine-tuned model]
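
A sketch of a counting objective of the kind described above: for each image in the counting subset the model must prefer the true caption over a counterfactual caption whose number word was swapped. The two-way cross-entropy form below is an illustrative reading of the slide, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def counting_loss(img_emb, caption_emb, counterfactual_emb, temperature=0.07):
    """img_emb: (B, D); caption_emb / counterfactual_emb: (B, D), all L2-normalized.
    counterfactual_emb[i] encodes e.g. "Four running horses..." when the true
    caption was "Three running horses..." (number swap)."""
    pos = (img_emb * caption_emb).sum(-1) / temperature         # similarity to the true caption
    neg = (img_emb * counterfactual_emb).sum(-1) / temperature  # similarity to the number-swapped caption
    logits = torch.stack([pos, neg], dim=-1)                    # (B, 2)
    targets = torch.zeros(img_emb.size(0), dtype=torch.long)    # index 0 = true caption
    return F.cross_entropy(logits, targets)

img = F.normalize(torch.randn(8, 512), dim=-1)
cap = F.normalize(torch.randn(8, 512), dim=-1)
cf  = F.normalize(torch.randn(8, 512), dim=-1)
loss = counting_loss(img, cap, cf)
```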
11. Summary
   • Achieving a stronger pre-training effect by scaling up the dataset
     - The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
   • Improving a foundation model by distillation from a foundation model (self-distillation)
     - Unmasked Teacher: Towards Training-Efficient Video Foundation Models
     - Improving CLIP Fine-tuning Performance
   • Proposing new prompts for CLIP models that focus on the image
     - What does CLIP know about a red circle? Visual prompt engineering for VLMs
   • Teaching CLIP models new relationships through additional training
     - CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
     - Teaching CLIP to Count to Ten