Slide 1

Slide 1 text

All-Japan Computer Vision Study Group: Large-Scale Datasets for Self-Supervised Learning of Images
Naoki Okamoto (Chubu University, Machine Perception and Robotics Group) http://mprg.jp/

Slide 2

Slide 2 text

Self-introduction: Naoki Okamoto
- Doctoral student, Graduate School of Engineering, Chubu University (Fujiyoshi Laboratory)
- Research theme: automatic design of training methods through hyperparameter search
- Research areas: knowledge distillation, semi-supervised learning, self-supervised learning
- Survey slides on self-supervised learning: https://speakerdeck.com/naok/self-supervised-learning
- Knowledge distillation for ensemble learning [ECCV]
- Article explaining DINO
[Figure: knowledge-distillation graph over ResNet18_ABN students with probability/attention transfer; ensemble accuracy 74.52%, individual students 68.1-72.09%.]

Slide 3

Slide 3 text

Papers introduced
- Learning Transferable Visual Models From Natural Language Supervision
  - Builds a large-scale image-language multimodal dataset and uses it for training
- DINOv2: Learning Robust Visual Features without Supervision
  - Proposes a framework for building a large-scale image dataset
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates how MAE behaves on Instagram-3B, a billion-scale dataset
For each paper, the following points are covered:
1. The training method that exploits the large-scale dataset
2. How the large-scale dataset is constructed
3. The training effect obtained from the large-scale dataset

Slide 4

Slide 4 text

Self-Supervised Learning (SSL)
- Pretraining on a large amount of unlabeled data to which no ground-truth labels have been assigned
- Goal: extract features that are effective for various downstream transfer-learning / fine-tuning tasks
(1) Pretrain a model (starting from random initialization) on the unlabeled data
(2) Transfer / fine-tune the SSL-pretrained model to the target task (a sketch of this step follows below)
[Figure: SSL pipeline, unlabeled data -> pretrained model -> classification model (FC head, teacher label "Pelican") and object-detection model (task head); the slide also includes an excerpt of a ViT-paper figure on attention and the principal components of patch embeddings.]
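
As an illustration of step (2), here is a minimal sketch of reusing an SSL-pretrained backbone for a downstream classification task. The checkpoint path, the ResNet-50 backbone, and the number of target classes are placeholder assumptions, not details from the talk:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# (1) backbone pretrained with SSL; the checkpoint path is a placeholder
backbone = resnet50()
state = torch.load("ssl_pretrained_resnet50.pth", map_location="cpu")
backbone.load_state_dict(state, strict=False)            # encoder weights only, no head

# (2) replace the classification head for the target task and fine-tune
backbone.fc = nn.Linear(backbone.fc.in_features, 10)     # e.g. 10 target classes

optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One fine-tuning step on a labeled mini-batch of the target task."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```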

Slide 5

Slide 5 text

Representative self-supervised learning methods
- Masked Image Modeling (MIM)
  - Predict the features of masked patches: BEiT [H. Bao+, ICLR], BEiTv2 [Z. Peng+, arXiv]
  - Predict the pixels of masked patches: MAE [K. He+, CVPR], SimMIM [Z. Xie+, CVPR]
- Contrastive Learning (CL)
  - Compare image-level features extracted from multiple different views: MoCo v3 [X. Chen+, ICCV], DINO [M. Caron+, ICCV]
A sketch of the MIM idea is shown below.
[Figure: MAE training (masking, encoder over visible patches, mask-token insertion, decoder, reconstruction loss) and DINO training (local/global views, encoder + MLP, exponential-moving-average branch, softmax / centering, loss).]
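
A minimal sketch of the MIM objective behind MAE: mask a large fraction of patches at random, encode only the visible ones, and compute the reconstruction loss on the masked patches only. This is a simplified illustration of the idea, not the authors' implementation; `encoder` and `decoder` stand for arbitrary ViT-style modules with the interface assumed in the comments.

```python
import torch

def mae_style_loss(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened image patches (the regression targets)."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # random permutation per image; the first `num_keep` patches stay visible
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                    # encode the visible patches only

    # assumed interface: the decoder inserts mask tokens at the masked positions
    # and predicts the pixel values of every patch, returning a (B, N, D) tensor
    pred = decoder(latent, ids_keep, ids_mask)

    masked_pred = torch.gather(pred, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    masked_target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    return ((masked_pred - masked_target) ** 2).mean()   # loss on masked patches only
```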

Slide 6

Slide 6 text

Representative self-supervised learning methods (self-supervised learning targeting ViT)
- Masked Image Modeling (MIM): predict the features of masked patches (BEiT [H. Bao+, ICLR], BEiTv2 [Z. Peng+, arXiv]); predict the pixels of masked patches (MAE [K. He+, CVPR], SimMIM [Z. Xie+, CVPR])
- Contrastive Learning (CL): compare image-level features extracted from multiple different views (MoCo v3 [X. Chen+, ICCV], DINO [M. Caron+, ICCV])
- Multimodal CL: contrastive learning on image-text pairs (CLIP [A. Radford+, ICML])
- Multimodal MIM: MIM on RGB, depth, and semantics simultaneously (MultiMAE [R. Bachmann+, ECCV])
- Hybrid: train CL and MIM simultaneously (iBOT [J. Zhou+, ICLR], DINOv2 [M. Oquab+, arXiv])
[Figure: the MAE and DINO training diagrams from the previous slide, repeated.]

Slide 7

Slide 7 text

Representative self-supervised learning methods
- Tutorial at the CVIM research meeting: https://speakerdeck.com/naok/zijijiaoshiarixuexiniyorushiqianxuexicvimtiyutoriaru

Slide 8

Slide 8 text

Papers introduced
- Learning Transferable Visual Models From Natural Language Supervision
  - Builds a large-scale image-language multimodal dataset and uses it for training
- DINOv2: Learning Robust Visual Features without Supervision
  - Proposes a framework for building a large-scale image dataset
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates how MAE behaves on Instagram-3B, a billion-scale dataset

Slide 9

Slide 9 text

Multimodal contrastive learning
- In training that uses multiple modalities, associating the modalities with each other is one of the main challenges -> align the modalities through contrastive learning
- Multimodal datasets: one sample is defined as a pair of information from different modalities
  - The pairs defined in the dataset are used in place of pairs generated by data augmentation
  - Image × language: the Conceptual Captions datasets (example captions: "Broken glass mobile phone on a white background", "PERSON beating on the phone stock images")
  - Image × point cloud: the KITTI dataset

Slide 10

Slide 10 text

Learning Transferable Visual Models From Natural Language Supervision [A. Radford+, ICML]
- CLIP: Contrastive Language-Image Pre-training
  - Pretraining: self-supervised learning so that the features of paired images and texts agree
  - Image classification: extract features from texts that describe the classes
  - Image classification: classify the image from the feature similarity between the image and the texts
- The model learns the correspondence between images and text -> that correspondence allows image classification with no additional training
[Figure from the CLIP paper: (1) contrastive pre-training of an image encoder and a text encoder on captions such as "Pepper the aussie pup"; (2) creating a dataset classifier from label text with the template "A photo of a {object}."; (3) zero-shot prediction by comparing the image feature I1 with the text features T1...TN.]

Slide 11

Slide 11 text

Self-supervised learning so that the features of paired images and texts agree
- Contrastive learning that uses the pairs defined in the dataset as positives
  - Positive pair: an image-text pair defined in the dataset
  - Negative pairs: every other image-text combination within the mini-batch
- WebImageText is used as the dataset
- The loss compares the actual similarity matrix of the features (cosine similarity) against the ideal similarity pattern, in which only the dataset-defined pairs match (see the loss sketch below)
[Figure: the contrastive pre-training matrix of image features I1...IN against text features T1...TN, with the diagonal entries as positives.]
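
A minimal sketch of this loss, assuming image and text features have already been extracted for one mini-batch: cosine similarities of every image-text combination are computed, and the dataset-defined pairs (the diagonal) serve as the positives of a symmetric cross-entropy. The fixed temperature value is a placeholder; CLIP actually learns it:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: (B, D); row i of each comes from one dataset-defined pair."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # cosine similarity of every image with every text in the mini-batch
    logits = image_features @ text_features.t() / temperature      # (B, B)

    # ideal similarity pattern: pair (i, i) is the positive, all other entries are negatives
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```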

Slide 12

Slide 12 text

WebImageText dataset
- Consists of 400 million image-text pairs collected from the web
  - The dataset itself is not released because of copyright issues
- Diverse data is collected in a balanced way based on words that appear frequently in Wikipedia:
1. Build a query list of the words used at least 100 times in English Wikipedia
2. Collect image-text pairs whose text contains a word from the query list, with an upper limit of 20,000 pairs per query word (a sketch of this balancing step follows below)
3. In the end, 400 million image-text pairs are collected (not public)
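
A rough sketch of the balancing step (2), under the assumption that the raw crawl is available as (image_url, caption) records; the matching rule and the per-query cap are simplified stand-ins for the procedure described in the paper:

```python
from collections import defaultdict

def collect_pairs(crawl, query_list, max_pairs_per_query=20_000):
    """crawl: iterable of (image_url, caption); query_list: frequent English-Wikipedia words."""
    queries = {q.lower() for q in query_list}
    per_query_count = defaultdict(int)
    dataset = []

    for image_url, caption in crawl:
        words = set(caption.lower().split())
        matched = words & queries                                  # queries found in the caption
        usable = [q for q in matched if per_query_count[q] < max_pairs_per_query]
        if usable:                                                 # keep only while under the cap
            dataset.append((image_url, caption))
            for q in usable:
                per_query_count[q] += 1
    return dataset
```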

Slide 13

Slide 13 text

Zero-shot image classification
- CLIP can be applied to image classification with no additional training at all (zero-shot)
- The class of the input image is predicted from the feature-similarity relationship with texts that contain the class names
[Figure: texts produced by the prompt template "A photo of a {object}." for plane / car / dog / bird are encoded into features; the image feature is compared with each of them and the image is judged to be the "dog" class.]

Slide 14

Slide 14 text

Extracting features from texts that describe the classes
- Features are extracted from texts obtained by inserting each class name into a prompt template
  - Prompt templates are designed by hand to match the images and the problem setting
  - Bird recognition: "A photo of a {object}, a type of bird."
  - Traffic-sign recognition: "A zoomed in photo of a {object} traffic sign."
[Figure: the text encoder converts "A photo of a plane / car / dog / bird." into the text features T1...TN.]
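
This step can be sketched with the publicly released CLIP package (github.com/openai/CLIP); the class names and the template below are examples, not the ones used in the talk:

```python
import torch
import clip   # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["plane", "car", "dog", "bird"]
template = "A photo of a {}."                             # hand-designed prompt template
texts = [template.format(name) for name in class_names]

with torch.no_grad():
    tokens = clip.tokenize(texts).to(device)
    text_features = model.encode_text(tokens)             # one feature vector per class text
    text_features /= text_features.norm(dim=-1, keepdim=True)
```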

Slide 15

Slide 15 text

Classifying the image from the feature similarity between the image and the texts
- Extract the feature of the image to be classified
- Compute the cosine similarity between the image feature and each text feature
- Take the class name in the most similar text as the prediction
[Figure: step (3) of the CLIP overview; here the image encoder extracts the feature I1 of the input image.]

Slide 16

Slide 16 text

Classifying the image from the feature similarity between the image and the texts
- Extract the feature of the image to be classified
- Compute the cosine similarity between the image feature and each text feature
- Take the class name in the most similar text as the prediction
[Figure: step (3) of the CLIP overview; here the similarities I1·T1 ... I1·TN between the image feature and the text features are computed.]

Slide 17

Slide 17 text

Classifying the image from the feature similarity between the image and the texts
- Extract the feature of the image to be classified
- Compute the cosine similarity between the image feature and each text feature
- Take the class name in the most similar text as the prediction
[Figure: step (3) of the CLIP overview; here the most similar text is selected and the image is judged to be the "dog" class. A full sketch of this pipeline follows below.]
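
Putting the three steps together, a self-contained sketch with the openai/CLIP package; the image file name is a placeholder:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["plane", "car", "dog", "bird"]
texts = clip.tokenize([f"A photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)   # placeholder image file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    similarity = image_features @ text_features.T        # cosine similarities, shape (1, 4)
    predicted = class_names[similarity.argmax(dim=-1).item()]
    print(predicted)                                      # expected: "dog" for a dog photo
```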

Slide 18

Slide 18 text

Evaluation by zero-shot image classification
- Compared against the accuracy obtained by transferring a model supervised-trained on ImageNet-1k
- Network: ResNet-50
- CLIP achieves higher accuracy on 16 of the 27 evaluated datasets
- Accuracy drops considerably on more specialized tasks such as satellite-image or road-traffic-sign classification

Slide 19

Slide 19 text

Trying it out
- Computed the feature similarity between images and prompts for the airplane class of CIFAR-10
- Each image is framed with the color of its most similar prompt
  - Blue frame: "a photo of a airplane flying through the blue sky"
  - Red frame: "a black and white photo of a airplane"
  - Green frame: "a photo of a airplane that landed on the ground"

Slide 20

Slide 20 text

Trying it out
- Computed the feature similarity between images and prompts for the airplane class of CIFAR-10
- Each image is framed with the color of its most similar prompt
  - Blue frame: "a photo of a airplane flying through the blue sky"
  - Red frame: "a black and white photo of a airplane"
  - Green frame: "a photo of a airplane that landed on the ground"

Slide 21

Slide 21 text

Papers introduced
- Learning Transferable Visual Models From Natural Language Supervision
  - Builds a large-scale image-language multimodal dataset and uses it for training
- DINOv2: Learning Robust Visual Features without Supervision
  - Proposes a framework for building a large-scale image dataset
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates how MAE behaves on Instagram-3B, a billion-scale dataset

Slide 22

Slide 22 text

Scaling up the dataset
- Self-supervised learning has mostly used ImageNet-1K
  - Small compared with the datasets used for foundation models in natural language processing
- Datasets are being scaled up with the goal of building foundation models for images
  - DINOv2 [M. Oquab+, arXiv]: builds roughly 142 million samples from existing datasets and images on the web
  - MAWS [M. Singh+, ICCV]: uses roughly 3 billion images collected from Instagram

Slide 23

Slide 23 text

Scaling up the dataset
- Self-supervised learning has mostly used ImageNet-1K
  - Small compared with the datasets used for foundation models in natural language processing
- Datasets are being scaled up with the goal of building foundation models for images
  - DINOv2 [M. Oquab+, arXiv]: builds roughly 142 million samples from existing datasets and images on the web
  - MAWS [M. Singh+, ICCV]: uses roughly 3 billion images collected from Instagram

Slide 24

Slide 24 text

DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
- Investigates the behavior of image self-supervised learning with large-scale data of roughly 142 million samples
  - LVD-142M is built by combining several existing datasets with images from the web
  - DINOv2 is designed on top of iBOT by combining existing losses and techniques

Slide 25

Slide 25 text

DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
- Investigates the behavior of image self-supervised learning with large-scale data of roughly 142 million samples
  - LVD-142M is built by combining several existing datasets with images from the web
  - DINOv2 is designed on top of iBOT by combining existing losses and techniques
- The number of samples is augmented and the balance between the datasets is adjusted to obtain the final LVD-142M
[Table 15 of the paper: composition of the LVD-142M dataset. For each task and source dataset/split it lists the number of images, the retrieval mode (as is / sample / cluster), the number of retrieved web images, and the final count, e.g. ImageNet-22k kept as is (14,197,086) and sampled to 56,788,344, ImageNet-1k train sampled to 40,997,344, and the fine-grained classification, segmentation, depth-estimation, and retrieval datasets each clustered and capped at about 1,000,000, for a total of 142,109,386 images.]

Slide 26

Slide 26 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps (a sketch follows below):
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: curated data (existing datasets) and uncurated data (images collected from the web) pass through embedding, deduplication, and retrieval to produce the augmented curated dataset.]
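
A rough sketch of the deduplication and retrieval steps using cosine similarity over the embeddings, with FAISS as an example index. The embeddings are assumed to come from the ViT-H above and to be L2-normalized float32 arrays; the duplicate threshold and the number of neighbors per image are placeholder values, and the paper's actual copy-detection and retrieval pipeline is more elaborate:

```python
import numpy as np
import faiss

def dedup_and_retrieve(curated_emb, web_emb, n_per_image=4, dup_threshold=0.95):
    """curated_emb: (C, D), web_emb: (W, D); L2-normalized float32 embeddings from the ViT-H."""
    dim = web_emb.shape[1]

    # Deduplication: drop web images that are near-copies of an already kept web image
    index = faiss.IndexFlatIP(dim)               # inner product == cosine for normalized vectors
    keep = []
    for i, vec in enumerate(web_emb):
        if index.ntotal > 0:
            sim, _ = index.search(vec[None, :], 1)
            if sim[0, 0] >= dup_threshold:       # too close to something already kept
                continue
        index.add(vec[None, :])
        keep.append(i)
    web_kept = web_emb[keep]

    # Retrieval: for every curated image, fetch its N most similar web images
    web_index = faiss.IndexFlatIP(dim)
    web_index.add(web_kept)
    _, nn_ids = web_index.search(curated_emb, n_per_image)
    retrieved = np.unique(nn_ids.reshape(-1))
    return [keep[j] for j in retrieved]          # indices into the original web image pool
```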

Slide 27

Slide 27 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps:
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: the same curation pipeline (existing datasets / web images -> embedding -> deduplication -> retrieval -> augmented dataset), repeated with one step highlighted.]

Slide 28

Slide 28 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps:
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: the same curation pipeline, repeated with one step highlighted.]

Slide 29

Slide 29 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps:
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: the same curation pipeline, repeated with one step highlighted.]

Slide 30

Slide 30 text

DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
- Investigates the behavior of image self-supervised learning with 142M samples of large-scale data
  - LVD-142M is built by combining existing datasets with images from the internet
  - DINOv2 is designed on top of iBOT by combining existing losses and techniques
[Figure: self-supervised learning with DINOv2. Two augmented views of the input image are fed to a ViT and to a momentum ViT updated by exponential moving average; [CLS] and patch tokens are scored against prototypes, the teacher assignments are produced with Sinkhorn-Knopp (cluster assignment) and softmax, and the model is trained with negative-free contrastive learning plus MIM. Compression of the DINOv2-trained model by knowledge distillation: a pretrained large ViT teaches a small ViT (with its own momentum copy) using the same negative-free contrastive and MIM losses; the small ViT is the model that is finally used.]
Two of these ingredients are sketched below.
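
Two of the ingredients in the diagram, sketched in isolation: the exponential-moving-average update of the momentum (teacher) model, and a Sinkhorn-Knopp normalization that turns the teacher's prototype scores into balanced cluster assignments. Both are simplified versions written from the description above rather than the authors' code; the momentum, epsilon, and iteration count are placeholder values.

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher (momentum model) weights follow the student by exponential moving average."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

@torch.no_grad()
def sinkhorn_knopp(scores, eps=0.05, n_iters=3):
    """scores: (B, K) teacher similarities to K prototypes -> balanced soft assignments."""
    Q = torch.exp(scores / eps).t()            # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)        # rows: equal total mass per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)        # columns: one unit of mass per sample
        Q /= B
    return (Q * B).t()                         # (B, K), each row sums to 1
```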

Slide 31

Slide 31 text

Learning effect of the large-scale dataset: recognition performance
- Accuracy comparison on ImageNet-1K
  - Weakly: image-language multimodal (weakly supervised) methods
  - Data: the pretraining dataset
  - Using images only, DINOv2 achieves higher accuracy than previous self-supervised methods
- Effect of knowledge distillation
  - Teacher ViT-g: roughly 1.1 billion parameters
  - Student ViT-L: roughly 300 million parameters
  - The distilled student achieves higher accuracy than the same architecture trained from scratch
[Paper excerpts: an ablation table for the KoLeo and MIM loss terms, a figure on the effectiveness of knowledge distillation (ViT-g/14 teacher versus ViT-L/14 students), and Table 4, linear evaluation on ImageNet-1k of frozen pretrained features: weakly supervised models (CLIP, SWAG, OpenCLIP, EVA-CLIP) versus self-supervised models (MAE, DINO, SEERv2, MSN, EsViT, Mugs, iBOT) and DINOv2 with ViT-S/B/L/g on LVD-142M, where DINOv2 ViT-g/14 reaches 86.5 top-1 with a linear probe and 83.5 with kNN.]

Slide 32

Slide 32 text

Learning effect of the large-scale dataset: analysis of the features
- Patch features are analyzed by applying principal component analysis (PCA) twice (a sketch follows below)
  1. Apply PCA to all patch features of several images
     - Threshold the first principal component to split the patches into foreground and background
  2. Apply PCA to the features of the patches judged to be foreground
     - Color each patch using the values of the first, second, and third principal components as RGB
- Part-level relations between objects are learned without any human-annotated labels
[Figure 1 of the paper: visualization of the first PCA components. PCA is computed between the patches of images from the same column and the first three components are mapped to color channels; the same parts are matched between related images despite changes of pose, style, or even object, and the background is removed by thresholding the first PCA component.]
Figure cited from https://ai.facebook.com/blog/dino-v2-computer-vision-self-supervised-learning and [Oquab+, arXiv]
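
A minimal sketch of the two-stage PCA analysis, assuming the torch.hub entry point of the public facebookresearch/dinov2 repository and the `x_norm_patchtokens` key of `forward_features`; the image paths, input resolution, and the sign of the foreground threshold are assumptions that may need adjusting:

```python
import torch
from PIL import Image
from sklearn.decomposition import PCA
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

prep = transforms.Compose([
    transforms.Resize((448, 448)),              # 448 / 14 = 32 patches per side
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
imgs = torch.stack([prep(Image.open(p).convert("RGB")) for p in ["a.jpg", "b.jpg"]])

with torch.no_grad():
    feats = model.forward_features(imgs)["x_norm_patchtokens"]   # (B, N, D) patch tokens
B, N, D = feats.shape
flat = feats.reshape(B * N, D).cpu().numpy()

# 1st PCA over all patches of all images: threshold the first component -> foreground mask
pc1 = PCA(n_components=1).fit_transform(flat)[:, 0]
foreground = pc1 > 0                             # assumed sign; flip if the background is selected

# 2nd PCA over the foreground patches only: first three components used as RGB
rgb = PCA(n_components=3).fit_transform(flat[foreground])
rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)      # scale to [0, 1] per channel
# `foreground` marks where each RGB value goes back onto the 32x32 patch grid of each image
```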

Slide 33

Slide 33 text

Trying it out
- Reproduced based on the pseudo-code posted in the Issues of the DINOv2 GitHub repository
  - Images from a website that provides free stock photos were used for the visualization
- Shown: the histogram of the first principal component of the first PCA, and images colored with the three principal components of the second PCA
- Patch features that are similar within each object part are extracted

Slide 34

Slide 34 text

Trying it out
- Reproduced based on the pseudo-code posted in the Issues of the DINOv2 GitHub repository
  - Images from a website that provides free stock photos were used for the visualization
- Shown: the histogram of the first principal component of the first PCA, and images colored with the three principal components of the second PCA
- Patch features that are similar within each object part are extracted

Slide 35

Slide 35 text

Scaling up the dataset
- Self-supervised learning has mostly used ImageNet-1K
  - Small compared with the datasets used for foundation models in natural language processing
- Datasets are being scaled up with the goal of building foundation models for images
  - DINOv2 [M. Oquab+, arXiv]: builds roughly 142 million samples from existing datasets and images on the web
  - MAWS [M. Singh+, ICCV]: uses roughly 3 billion images collected from Instagram

Slide 36

Slide 36 text

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV]
- Investigates the training effect of MAE when roughly 3 billion samples of large-scale data are used
  - Uses 3 billion images collected from Instagram together with weak labels (the hashtags attached to each post)
- Two-stage pretraining, MAE followed by weakly supervised learning, gives a stronger pretraining effect (see the sketch below)
  - Both the MAE stage and the weakly supervised stage are trained for only one epoch
[Paper excerpts: the ViT scales used (ViT-B 86M, ViT-L 307M, ViT-H 632M parameters, plus larger ViT-2B and ViT-6.5B models), all pretrained at 224 x 224 resolution; and Table 1, the evaluation datasets for image classification, object detection, and video action recognition (ImageNet-1k, iNaturalist-18, ImageNetv2, ImageNet-ReaL, ObjectNet, Food-101, COCO, LVIS, Kinetics-400, Something-Something v2).]
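
A conceptual sketch of the two-stage schedule described above: one pass of MAE-style pre-pretraining on the unlabeled images, followed by weakly supervised pretraining that predicts the hashtags as multi-hot labels. `mae_loss` stands for a masked-reconstruction objective like the one sketched earlier; the multi-label head, loss, and hyperparameters are illustrative assumptions, and the paper's actual recipe differs in detail.

```python
import torch
import torch.nn as nn

def pre_pretrain_mae(encoder, decoder, loader, mae_loss, lr=1.5e-4):
    """Stage 1: a single epoch of MAE over the unlabeled Instagram images."""
    optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for images, _ in loader:                    # hashtags are ignored in this stage
        loss = mae_loss(images, encoder, decoder)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def weakly_supervised_pretrain(encoder, loader, num_hashtags, feat_dim, lr=1e-4):
    """Stage 2: start from the MAE encoder, predict hashtags as weak multi-hot labels."""
    head = nn.Linear(feat_dim, num_hashtags)
    optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for images, hashtag_targets in loader:      # targets: multi-hot tensor (B, num_hashtags)
        logits = head(encoder(images))          # pooled encoder features -> hashtag logits
        loss = criterion(logits, hashtag_targets.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```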

Slide 37

Slide 37 text

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV]
- Investigates the training effect of MAE when roughly 3 billion samples of large-scale data are used
  - Uses 3 billion images collected from Instagram together with weak labels (the hashtags attached to each post)
- Two-stage pretraining, MAE followed by weakly supervised learning, gives a stronger pretraining effect
  - Both the MAE stage and the weakly supervised stage are trained for only one epoch
- The authors at the poster session: the training never fully converges no matter how long you run it, but one epoch already gives sufficient recognition performance ("reasonably enough")
[Paper excerpts: the ViT scales used and Table 1 with the evaluation datasets, as on the previous slide.]

Slide 38

Slide 38 text

Summary
- Learning Transferable Visual Models From Natural Language Supervision
  - Diverse image-text pairs are collected from the web in a balanced way, based on words that appear frequently in Wikipedia -> the WebImageText dataset of 400 million image-text pairs
  - Multimodal contrastive learning acquires the correspondence between diverse images and texts
- DINOv2: Learning Robust Visual Features without Supervision
  - Existing datasets are augmented with web images so that the number of samples stays balanced -> combining the augmented datasets yields LVD-142M, consisting of 142 million images
  - Part-level relations between objects are learned without ground-truth labels
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates the behavior of MAE on Instagram-3B, which consists of 3 billion images with weak labels
  - One epoch of self-supervised learning plus weakly supervised learning achieves sufficient recognition performance