Slide 9
Slide 9 text
• Investigates how effective MAE pretraining is when billion-scale data is used
  - Uses billions of images collected from Instagram together with weak labels (the hashtags attached to each image post)
• Two-stage pretraining, MAE followed by weakly-supervised learning, yields a stronger pretraining effect (see the first sketch below)
  - Both the MAE stage and the weakly-supervised stage require only a small number of epochs of training
• The image model is frozen and a language model is trained against it in CLIP fashion → the image model is then assessed with zero-shot evaluation (see the second sketch below)
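To make the two-stage recipe in the bullets above concrete, here is a minimal sketch of the idea under simplifying assumptions: a stage-1 MAE-style step that reconstructs masked patches without any labels, followed by a stage-2 weakly-supervised step that reuses the same encoder for multi-label hashtag prediction. The tiny encoder, the crude one-vector pixel decoder, and names such as `TinyViTEncoder` and `num_hashtags` are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Patchify + Transformer encoder; a small stand-in for the paper's ViT."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x, keep_idx=None):
        tok = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        if keep_idx is not None:  # MAE stage: encode only the visible (unmasked) patches
            tok = torch.gather(tok, 1, keep_idx[..., None].expand(-1, -1, tok.size(-1)))
        return self.blocks(tok)  # (B, tokens, dim)

def mae_step(encoder, decoder, images, mask_ratio=0.75):
    """Stage 1 (pre-pretraining): reconstruct masked patches, no labels needed."""
    B, _, H, W = images.shape
    p = encoder.patch
    n = (H // p) * (W // p)
    n_keep = int(n * (1 - mask_ratio))
    perm = torch.rand(B, n).argsort(dim=1)
    keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]
    latent = encoder(images, keep_idx)                 # (B, n_keep, dim)
    pred = decoder(latent.mean(dim=1))                 # (B, 3*p*p): one crude pixel prediction
    # Flatten the image into per-patch pixel targets and pick out the masked ones.
    patches = (images.unfold(2, p, p).unfold(3, p, p)
               .reshape(B, 3, n, p * p).permute(0, 2, 1, 3).reshape(B, n, -1))
    target = torch.gather(patches, 1, mask_idx[..., None].expand(-1, -1, patches.size(-1)))
    return ((pred[:, None, :] - target) ** 2).mean()   # MSE on masked patches only

def wsp_step(encoder, head, images, hashtag_targets):
    """Stage 2 (weakly-supervised): multi-label hashtag prediction with the same encoder."""
    feats = encoder(images).mean(dim=1)                # global average pool over patch tokens
    return nn.functional.binary_cross_entropy_with_logits(head(feats), hashtag_targets)

# Toy usage: one optimisation step per stage (real training loops over the Instagram data).
enc = TinyViTEncoder()
mae_decoder = nn.Linear(256, 3 * 16 * 16)
opt1 = torch.optim.AdamW(list(enc.parameters()) + list(mae_decoder.parameters()), lr=1e-4)
x = torch.randn(2, 3, 224, 224)
loss1 = mae_step(enc, mae_decoder, x)
loss1.backward()
opt1.step()

num_hashtags = 1000                                    # assumed label-vocabulary size
head = nn.Linear(256, num_hashtags)
opt2 = torch.optim.AdamW(list(enc.parameters()) + list(head.parameters()), lr=1e-4)
weak_labels = (torch.rand(2, num_hashtags) > 0.99).float()
loss2 = wsp_step(enc, head, x, weak_labels)
loss2.backward()
opt2.step()
```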
The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV 2023]
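The last bullet describes the evaluation protocol: the pretrained image encoder stays frozen, only a text tower is trained with a CLIP-style contrastive loss, and classification accuracy is then measured zero-shot by matching image embeddings against embeddings of class prompts. The sketch below illustrates that protocol with dummy linear towers; the real setup uses the pretrained ViT and a proper text encoder and tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, caption) pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

@torch.no_grad()
def zero_shot_accuracy(image_encoder, text_encoder, images, labels, class_prompt_tokens):
    """Classify each image by its nearest class-prompt embedding; no classifier is trained."""
    img = F.normalize(image_encoder(images), dim=-1)              # (B, D)
    cls = F.normalize(text_encoder(class_prompt_tokens), dim=-1)  # (C, D)
    pred = (img @ cls.t()).argmax(dim=-1)
    return (pred == labels).float().mean().item()

# Dummy towers mapping into a shared 256-d space (stand-ins for the frozen ViT and a text model).
D = 256
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))
for p in image_encoder.parameters():
    p.requires_grad_(False)                                   # the image model stays frozen
text_encoder = nn.Sequential(nn.Embedding(1000, D), nn.Flatten(start_dim=1), nn.Linear(8 * D, D))
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

images, captions = torch.randn(4, 3, 32, 32), torch.randint(0, 1000, (4, 8))
loss = clip_contrastive_loss(image_encoder(images), text_encoder(captions))
loss.backward()
optimizer.step()

# Zero-shot evaluation on a toy 10-class problem: one tokenised prompt per class.
class_prompts = torch.randint(0, 1000, (10, 8))
labels = torch.randint(0, 10, (4,))
print(zero_shot_accuracy(image_encoder, text_encoder, images, labels, class_prompts))
```

Because only the text tower is updated, the zero-shot score acts purely as a probe of the frozen image representation.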
[…] ViT models at various scales in terms of number of parameters, including ViT-B (86M), ViT-L (307M), and ViT-H (632M). We also train on larger 1.9B and 6.5B parameter ViT models, which we call ViT-2B and ViT-6.5B, respectively (Appendix Table 8). As is common practice [23, 84], we train models of sizes ViT-B, ViT-L with a patch size of 16 and larger models with a patch size of 14. We pretrain with a 224 × 224 resolution for all models.
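As a rough sanity check on the parameter counts quoted above, a back-of-the-envelope estimate (about 12·width² parameters per transformer block, plus patch and positional embeddings) reproduces the 86M / 307M / 632M figures to within a few percent. The constants in the formula are approximations; the ViT-2B and ViT-6.5B configurations live in the paper's appendix and are not reproduced here.

```python
def vit_params(depth: int, width: int, patch: int, img_size: int = 224) -> int:
    """Approximate ViT parameter count from its depth, hidden width, and patch size."""
    patch_embed = 3 * patch * patch * width + width                  # conv kernel + bias
    pos_embed = ((img_size // patch) ** 2 + 1) * width               # +1 for the class token
    per_block = 12 * width * width + 13 * width                      # attn + MLP weights, biases, LayerNorms
    return patch_embed + pos_embed + depth * per_block + 2 * width   # + final LayerNorm

for name, (depth, width, patch) in {
    "ViT-B/16": (12, 768, 16),
    "ViT-L/16": (24, 1024, 16),
    "ViT-H/14": (32, 1280, 14),
}.items():
    print(f"{name}: ~{vit_params(depth, width, patch) / 1e6:.0f}M parameters")
```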
Pre-pretraining (MAE) [33] learns visual representations from image datasets without using any labels. We choose this approach as it is simple to implement and scales very […]
Dataset                              Task               #cls  #train  #val
ImageNet-1k (IN1k) [64]              Image cls.         1000  1M      50K
iNaturalist-18 (iNat18) [36]         Fine-grained cls.  8142  437K    24K
ImageNetv2 (INv2) [61]               Image cls.         1000  –       10K
ImageNet-ReaL (IN-ReaL) [7]          Image cls.         1000  –       50K
ObjectNet (ON) [6]                   Image cls.         113   –       19K
Food-101 (F-101) [9]                 Image cls.         101   N/A     25K
COCO [49]                            Obj. det.          80    118K    5K
LVIS [32]                            Obj. det.          1K    100K    20K
Kinetics-400 (K400) [43]             Action cls.        400   220K    20K
Something Something-v2 (SSv2) [30]   Action cls.        174   169K    25K

Table 1: Evaluation datasets used to evaluate MAE→WSP on image classification, object detection, and video action recognition.