Self-supervised Learning (SSL)
• Learn from large amounts of data without teacher labels through pseudo "pretext tasks"
• Use the model trained with self-supervised learning as a pre-trained model and fine-tune it on the target task (a minimal sketch of this recipe follows the figure note)
(Figure: ① build a pre-trained model from a large amount of unlabeled data; ② transfer / fine-tune the SSL pre-trained model to the target task, e.g. attach a new FC head and train an image-classification model or an object-detection model with teacher labels such as "Pelican". The right-hand panel reproduces attention visualizations from the ViT paper, "Representative examples of attention".)
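To make the two-step recipe on this slide concrete, here is a minimal toy sketch in PyTorch: pretrain an encoder with a pretext task (rotation prediction is used purely as an example), then reuse it with a new FC head on a labeled target task. The tiny encoder, the rotation task, and all hyperparameters are illustrative assumptions, not the setup of any particular paper.

```python
# Toy sketch: (1) SSL pretraining with a pretext task, (2) transfer with a new head.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in backbone; in practice this would be a ViT or ResNet."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
    def forward(self, x):
        return self.net(x)

# --- (1) pretext task: predict which of 4 rotations was applied (pseudo labels) ---
encoder = Encoder()
pretext_head = nn.Linear(128, 4)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(pretext_head.parameters()), lr=1e-3)

unlabeled = torch.randn(64, 3, 32, 32)            # toy unlabeled batch
rot = torch.randint(0, 4, (64,))                  # pseudo labels derived from the data itself
rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2)) for img, k in zip(unlabeled, rot)])
loss = nn.functional.cross_entropy(pretext_head(encoder(rotated)), rot)
opt.zero_grad(); loss.backward(); opt.step()

# --- (2) transfer: keep the pre-trained encoder, attach a new FC head, fine-tune with labels ---
num_classes = 10
fc_head = nn.Linear(128, num_classes)             # new head for the target task
ft_opt = torch.optim.AdamW(list(encoder.parameters()) + list(fc_head.parameters()), lr=1e-4)

images, labels = torch.randn(32, 3, 32, 32), torch.randint(0, num_classes, (32,))
ft_loss = nn.functional.cross_entropy(fc_head(encoder(images)), labels)
ft_opt.zero_grad(); ft_loss.backward(); ft_opt.step()
```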
• Multimodal CL: contrastive learning (CL) on image-text pairs (a rough loss sketch is shown after this list)
  • CLIP [A. Radford+, ICML 2021]
• Multimodal MIM: masked image modeling (MIM) on RGB, depth, and semantic maps simultaneously
  • MultiMAE [R. Bachmann+, ECCV 2022]
• Hybrid: train CL and MIM jointly
  • iBOT [J. Zhou+, ICLR 2022]
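As a reference for the "multimodal CL" entry, here is a rough sketch of a CLIP-style symmetric image-text contrastive (InfoNCE) loss. The feature dimension and temperature are placeholder values, and the encoders that would produce these features are omitted.

```python
# Rough sketch of a CLIP-style symmetric contrastive loss over image-text pairs.
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE: matched (i, i) pairs are positives, all other pairs negatives."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) cosine-similarity logits
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# toy batch of already-encoded features
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
print(clip_style_loss(image_features, text_features))
```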
The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV 2023]
• Investigates how effective MAE pretraining is when billions of samples are available
  • Uses billions of images collected from Instagram together with weak labels (the hashtags attached to each post)
• Two-stage pretraining, MAE → weakly-supervised learning, yields a stronger pretraining effect
  • Both the MAE stage and the weakly-supervised stage are trained for only a small number of epochs
• The image model is frozen and a language model is trained CLIP-style → the image model is evaluated via zero-shot evaluation (see the sketch after the table below)
(Excerpt from the paper: ViT models at several scales are trained, ViT-B (86M), ViT-L (307M), and ViT-H (632M), plus larger 1.9B- and 6.5B-parameter models called ViT-2B and ViT-6.5B; patch size 16 for ViT-B/L and 14 for the larger models, all pretrained at 224×224 resolution. Pre-pretraining (MAE) learns visual representations from image datasets without using any labels and scales well.)

Table 1: Evaluation datasets used to evaluate MAE→WSP on image classification, object detection, and video action recognition.

| Dataset | Task | #cls | #train | #val |
| ImageNet-1k (IN1k) [64] | Image cls. | 1000 | 1M | 50K |
| iNaturalist-18 (iNat18) [36] | Fine-grained cls. | 8142 | 437K | 24K |
| ImageNetv2 (INv2) [61] | Image cls. | 1000 | – | 10K |
| ImageNet-ReaL (IN-ReaL) [7] | Image cls. | 1000 | – | 50K |
| ObjectNet (ON) [6] | Image cls. | 113 | – | 19K |
| Food-101 (F-101) [9] | Image cls. | 101 | N/A | 25K |
| COCO [49] | Obj. det. | 80 | 118K | 5K |
| LVIS [32] | Obj. det. | 1K | 100K | 20K |
| Kinetics-400 (K400) [43] | Action cls. | 400 | 220K | 20K |
| Something Something-v2 (SSv2) [30] | Action cls. | 174 | 169K | 25K |
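A hedged, toy-scale sketch of the two-stage idea (MAE pre-pretraining followed by weakly-supervised training on hashtag pseudo-labels). The encoder, the pixel-level masking, and the hashtag vocabulary size are stand-ins for the paper's billion-scale ViT setup, not its actual implementation.

```python
# Toy sketch of MAE -> weakly-supervised (hashtag) two-stage pretraining.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU())
decoder = nn.Linear(256, 3 * 32 * 32)             # MAE-style pixel reconstruction head
hashtag_head = nn.Linear(256, 1000)               # weak labels = multi-hot hashtag vector

images = torch.randn(16, 3, 32, 32)

# Stage 1: MAE pre-pretraining (reconstruct masked content; patch masking simplified to pixels)
mask = (torch.rand_like(images) > 0.75).float()   # keep ~25% of pixels visible
recon = decoder(encoder(images * mask)).view_as(images)
mae_loss = F.mse_loss(recon * (1 - mask), images * (1 - mask))   # loss only on masked region

# Stage 2: weakly-supervised pretraining on hashtag pseudo-labels (multi-label BCE)
hashtags = (torch.rand(16, 1000) > 0.99).float()
wsp_loss = F.binary_cross_entropy_with_logits(hashtag_head(encoder(images)), hashtags)

# Evaluation idea from the slide: freeze the image model, train only a text encoder
# with a CLIP-style loss, then measure zero-shot accuracy of the frozen image model.
```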
Unmasked Teacher: Towards Training-Efficient Video Foundation Models [K. Li+, ICCV 2023]
• Image foundation models are difficult to apply directly to video
• Using an image foundation model as the teacher, a video foundation model is trained from scratch with self-supervised learning (Stage 1 is sketched below)
  • Stage 1: train so that the features of the unmasked patches match the teacher's features
  • Stage 2: train with video-language contrastive learning, video-text token matching, and masked language modeling
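The Stage-1 objective can be sketched roughly as below: only the unmasked tokens of the student are regressed toward the corresponding tokens of a frozen image teacher. The linear stand-in "encoders", masking ratio, and MSE loss are simplifying assumptions, not the exact UMT formulation.

```python
# Rough sketch: align unmasked patch tokens of a video student with a frozen image teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens, dim = 196, 384
teacher = nn.Linear(768, dim)   # stands in for a frozen image foundation model (e.g. a CLIP ViT)
student = nn.Linear(768, dim)   # video model trained from scratch
for p in teacher.parameters():
    p.requires_grad = False

patch_tokens = torch.randn(4, num_tokens, 768)     # patch embeddings of sampled frames
keep = torch.rand(4, num_tokens) > 0.8             # keep ~20% of tokens unmasked

with torch.no_grad():
    target = teacher(patch_tokens)                 # teacher features for all tokens
pred = student(patch_tokens)                       # student features

# align only the unmasked (kept) tokens with the teacher
align_loss = F.mse_loss(pred[keep], target[keep])
align_loss.backward()
```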
Improving CLIP Fine-tuning Performance [Y. Wei+, ICCV 2023]
• Fine-tuning accuracy of the CLIP image model is lower than that of models trained with MIM
  • Attributed to differences in the training recipes of CLIP and MIM (input, target, loss, and whether semantic labels are used)
• Proposes FD-CLIP, a training scheme that incorporates the properties of MIM while keeping CLIP's semantic information
  • With the CLIP image model as the teacher, a student is trained from scratch by distilling the intermediate features of every token (sketched after the table below)
  • → Fine-tuning performance improves while CLIP's zero-shot performance is maintained
Table 3 (from the paper): Ingredients comparison between CLIP and MIM methods from the perspective of input ratio, training-target granularity, and loss format.

| Method | Input | Target | Loss | Semantics |
| BeiT [2] | Partial | Token-level | Cross-entropy | |
| MAE [17] | Partial | Token-level | Regression | |
| CLIP [42] | Full | Image-level | Cross-entropy | ✓ |
| FD-CLIP | Full | Token-level | Regression | ✓ |

(Figures: differences between conventional MIM methods and CLIP training / the FD-CLIP training scheme.)
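Following the table, a hedged sketch of token-level feature distillation in the spirit of FD-CLIP: full (unmasked) images, a frozen CLIP image encoder as the teacher, and a per-token regression loss for a student trained from scratch. The linear stand-in models, the normalization, and the smooth-L1 loss are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of token-level feature distillation from a frozen CLIP image encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

tokens, dim = 197, 768
teacher = nn.Linear(dim, dim)          # stands in for the frozen CLIP image encoder
student = nn.Linear(dim, dim)          # student trained from scratch
for p in teacher.parameters():
    p.requires_grad = False

x = torch.randn(8, tokens, dim)        # per-token embeddings of a full (unmasked) image
with torch.no_grad():
    target = F.normalize(teacher(x), dim=-1)   # teacher token features as regression targets
pred = F.normalize(student(x), dim=-1)

distill_loss = F.smooth_l1_loss(pred, target)  # token-level regression, not cross-entropy
distill_loss.backward()
```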
What does CLIP know about a red circle? Visual prompt engineering for VLMs [A. Shtedritski+, ICCV 2023]
• A study of prompt engineering that focuses on the image modality of CLIP
• Proposes drawing a red circle around the target object in the image as a visual prompt (sketched below the figure note)
• Adding the red circle to the image raises the similarity to text related to the object inside the circle
(Figure: the red-circle prompt used with a VLM for referring expression comprehension ("The cub on the right"), classification ("This is a cat / bird"), and keypoint naming ("The ear / eye / nose of a bear").)
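A small sketch of the red-circle visual prompt using the OpenAI CLIP package and Pillow: draw a red circle on the image, then score it against candidate texts. The image path, circle coordinates, and prompt strings are made-up placeholders.

```python
# Sketch: draw a red circle as a visual prompt, then score image-text similarity with CLIP.
import torch
import clip                      # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)
from PIL import Image, ImageDraw

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = Image.open("example.jpg").convert("RGB")                 # placeholder image path
draw = ImageDraw.Draw(image)
draw.ellipse([100, 80, 220, 200], outline=(255, 0, 0), width=6)  # red circle = visual prompt

texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)
with torch.no_grad():
    img_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)   # similarity should rise for the circled object
print(sims)
```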
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say "No" [H. Wang+, ICCV 2023]
• A study of zero-shot out-of-distribution (OOD) detection for images using CLIP with class-name prompts
• Proposes OOD detection that uses both positive and negative prompts
  • Introduces "no" class prompts and a "no" text encoder initialized with the weights of CLIP's text encoder
• The "no" text encoder is fine-tuned on the CC3M dataset with "no" texts
  • OOD detection uses the similarities to the class prompts and to the "no" class prompts (a simplified scoring sketch follows the figure note)
(Figure: (a) feature space of standard methods vs. (b) feature space of CLIPN, where the "no" logit separates hard-to-distinguish OOD samples; standard class prompts ("A photo of a cow / cat / fish") and learnable "no" class prompts ("A photo of no cow / cat / fish") feed the text encoder and the "no" text encoder, and the ID and "saying no" probabilities are combined via the competing-to-win and agreeing-to-differ decisions.)
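A simplified scoring sketch of how the in-distribution probabilities from the standard class prompts and the "saying no" probabilities from the "no" prompts could be combined into an OOD score. The combination below only loosely follows the paper's agreeing-to-differ rule, and all numbers are toy values.

```python
# Simplified CLIPN-style zero-shot OOD scoring sketch.
import torch

def ood_score(id_logits, no_probs):
    """id_logits: (C,) similarities to 'a photo of a <class>' prompts.
    no_probs: (C,) probability that 'a photo of no <class>' fits the image."""
    p_id = torch.softmax(id_logits, dim=-1)          # in-distribution class probabilities
    # probability mass left for "none of the known classes"
    p_ood = 1.0 - torch.sum((1.0 - no_probs) * p_id)
    return p_ood

id_logits = torch.tensor([2.0, 0.5, 0.3])            # toy similarities for [cow, cat, fish]
no_probs = torch.tensor([0.8, 0.6, 0.6])             # toy "no" probabilities per class
print(ood_score(id_logits, no_probs))                # higher -> more likely OOD
```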
Teaching CLIP to Count to Ten [R. Paiss+, ICCV 2023]
• CLIP is weak at text that refers to the number of objects in an image
  • Creates training data whose captions state the correct object count
  • Proposes CountBench, a new benchmark for evaluating a model's counting ability
• Designs a training objective that distinguishes captions with the correct count from captions with a wrong count, and fine-tunes the CLIP model with it (sketched below the figure note)
(Figures: (b) Teaching CLIP to count — each counting caption ("Three running horses...") is paired with a counterfactual caption obtained by a number swap ("Four running horses..."), alongside a non-counting subset ("Landscape in autumn..."), and both are fed through the CLIP image and text encoders; (a) image-retrieval results and relevancy maps comparing the original CLIP model with the proposed model.)
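A hedged sketch of the counting objective described above: for each image, the caption with the correct count must score higher than a counterfactual caption in which the number is swapped. The feature shapes, temperature, and the cross-entropy formulation are illustrative assumptions, not the paper's exact loss.

```python
# Sketch: prefer the true-count caption over a number-swapped counterfactual caption.
import torch
import torch.nn.functional as F

def counting_loss(img_feat, true_txt_feat, cf_txt_feat, temperature=0.07):
    """All inputs are (B, D) feature batches; the counterfactual caption differs
    from the true one only in the stated object count."""
    img = F.normalize(img_feat, dim=-1)
    pos = (img * F.normalize(true_txt_feat, dim=-1)).sum(-1) / temperature
    neg = (img * F.normalize(cf_txt_feat, dim=-1)).sum(-1) / temperature
    logits = torch.stack([pos, neg], dim=-1)          # (B, 2): [true caption, number-swapped]
    targets = torch.zeros(img.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)           # push the true count above the swap

img_feat = torch.randn(4, 512)
true_txt_feat = torch.randn(4, 512)
cf_txt_feat = torch.randn(4, 512)
print(counting_loss(img_feat, true_txt_feat, cf_txt_feat))
```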