Self-supervised learning (slide diagram): ① build a pretrained model from a large amount of unlabeled data (self-supervised pretraining); then, for finetuning, attach a head (e.g., an FC layer) to the pretrained model and train it with teacher labels (e.g., the class label "Pelican") to obtain a classification model or an object detection model.
[Paper screenshot; recoverable caption: "Figure 6: Representative examples of attention."]
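To make the diagram concrete, here is a minimal sketch (not from any of the surveyed papers; module and variable names are illustrative) of attaching a new FC head to a self-supervised pretrained backbone for finetuning:

```python
# Minimal sketch, assuming a generic feature-extractor backbone obtained from
# self-supervised pretraining (step ①); names are illustrative placeholders.
import torch.nn as nn

class FinetuneClassifier(nn.Module):
    def __init__(self, pretrained_backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = pretrained_backbone            # weights from pretraining on unlabeled data
        self.head = nn.Linear(feat_dim, num_classes)   # new FC head trained with teacher labels

    def forward(self, x):
        features = self.backbone(x)   # (B, feat_dim) image features
        return self.head(features)    # class logits, e.g. "Pelican"
```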
The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV23]

ViT models at various scales in terms of number of parameters, including ViT-B (86M), ViT-L (307M), and ViT-H (632M). We also train on larger 1.9B and 6.5B parameter ViT models, which we call ViT-2B and ViT-6.5B, respectively (Appendix Table 8). As is common practice [23, 84], we train models of sizes ViT-B, ViT-L with a patch size of 16 and larger models with a patch size of 14. We pretrain with a 224 × 224 resolution for all models.
Pre-pretraining (MAE) [33] learns visual representations from image datasets without using any labels. We choose this approach as it is simple to implement and scales very well.

Table 1: Evaluation datasets used to evaluate MAE→WSP on image classification, object detection, and video action recognition.
Dataset | Task | #cls | #train | #val
ImageNet-1k (IN1k) [64] | Image cls. | 1000 | 1M | 50K
iNaturalist-18 (iNat18) [36] | Fine-grained cls. | 8142 | 437K | 24K
ImageNetv2 (INv2) [61] | Image cls. | 1000 | – | 10K
ImageNet-ReaL (IN-ReaL) [7] | Image cls. | 1000 | – | 50K
ObjectNet (ON) [6] | Image cls. | 113 | – | 19K
Food-101 (F-101) [9] | Image cls. | 101 | N/A | 25K
COCO [49] | Obj. det. | 80 | 118K | 5K
LVIS [32] | Obj. det. | 1K | 100K | 20K
Kinetics-400 (K400) [43] | Action cls. | 400 | 220K | 20K
Something Something-v2 (SSv2) [30] | Action cls. | 174 | 169K | 25K
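As a quick reference for the model scales quoted above, the following sketch collects the reported configurations (parameter counts, patch sizes, and the 224 × 224 resolution come from the excerpt; the config class itself and the patch-count computation are illustrative, not the authors' code):

```python
# Minimal sketch of the ViT configurations used for MAE pre-pretraining,
# as described in the excerpt above. Not the authors' training code.
from dataclasses import dataclass

@dataclass
class ViTConfig:
    name: str
    params: str        # approximate parameter count
    patch_size: int    # 16 for ViT-B/L, 14 for larger models
    image_size: int = 224

VIT_CONFIGS = [
    ViTConfig("ViT-B",    "86M",  patch_size=16),
    ViTConfig("ViT-L",    "307M", patch_size=16),
    ViTConfig("ViT-H",    "632M", patch_size=14),
    ViTConfig("ViT-2B",   "1.9B", patch_size=14),
    ViTConfig("ViT-6.5B", "6.5B", patch_size=14),
]

for cfg in VIT_CONFIGS:
    n_patches = (cfg.image_size // cfg.patch_size) ** 2
    print(f"{cfg.name}: {cfg.params} params, {n_patches} patches per image")
```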
Improving CLIP Fine-tuning Performance [Y. Wei+, ICCV23]

Using CLIP's image model as the teacher, a student is trained from scratch by distilling the intermediate features of each token.
→ Fine-tuning performance improves while CLIP's zero-shot performance is maintained.

Table 3: Ingredients comparison between CLIP and MIM methods from the perspective of input ratios, training target granularity and loss format.
Method | Input | Target | Loss | Semantics
BeiT [2] | Partial | Token-level | Cross-entropy |
MAE [17] | Partial | Token-level | Regression |
CLIP [42] | Full | Image-level | Cross-entropy | ✓
FD-CLIP | Full | Token-level | Regression | ✓

…widely used in deep reinforcement learning [48, 4], lifelong learning [62], and recommendation systems [7]. Our work does not aim to propose a new distillation approach…

[Figure labels] Differences between conventional MIM methods and CLIP's training method; FD-CLIP's training method.
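Below is a minimal sketch of the token-level feature-distillation objective summarized in Table 3 for FD-CLIP (full input, token-level target, regression loss). The `teacher`/`student` encoders are placeholder modules, and the loss and feature normalization are assumptions; the paper's exact recipe (e.g., any feature whitening) may differ:

```python
# Minimal sketch of token-level feature distillation from a frozen CLIP image
# encoder. teacher/student are placeholder ViT-style modules returning
# (B, N, D) per-token features; loss details may differ from FD-CLIP.
import torch
import torch.nn.functional as F

def fd_loss(student: torch.nn.Module, teacher: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        target = teacher(images)            # full image, per-token teacher features
    pred = student(images)                  # student trained from scratch
    return F.smooth_l1_loss(pred, target)   # token-level regression target
```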
[Figure] Vision-language model (VLM) tasks posed as question/answer pairs: classification ("This is a cat / bird"), referring expression comprehension ("The cub on the right"), and naming keypoints ("The ear / eye / nose of a bear"), each handled by a VLM.
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say "No" [H. Wang+, ICCV23]

[Figure] (a) Feature space of standard methods vs. (b) feature space of CLIPN: adding a feature space for the "no" logit separates hard-to-distinguish OOD samples from the in-domain classes.
[Figure] CLIPN inference: standard class prompts ("A photo of a cow / cat / fish") go through the text encoder, learnable "no" class prompts ("A photo of no cow / cat / fish") go through a "no" text encoder, and the image goes through the image encoder. The resulting ID probabilities and saying-"no" probabilities are combined (element-wise multiplication) via the competing-to-win / agreeing-to-differ decision rules to produce the OOD decision.
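A minimal sketch of how the two probability maps might be combined at inference time, following the agreeing-to-differ idea described above (tensor names and the threshold are illustrative; see the paper for the exact formulation of both decision rules):

```python
# Minimal sketch of CLIPN-style OOD scoring: id_probs come from the standard
# class prompts, no_probs from the learnable "no" prompts. Illustrative only.
import torch

def agreeing_to_differ(id_probs: torch.Tensor, no_probs: torch.Tensor) -> torch.Tensor:
    """id_probs, no_probs: (B, C). Returns (B,) probability mass of an extra OOD class."""
    # Each class keeps (1 - p_no) of its ID probability; the remainder is OOD mass.
    return 1.0 - ((1.0 - no_probs) * id_probs).sum(dim=-1)

def is_ood(id_probs: torch.Tensor, no_probs: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Flag an image as OOD when the combined OOD probability exceeds the threshold.
    return agreeing_to_differ(id_probs, no_probs) > threshold
```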