
12th All-Japan Computer Vision Study Group (全日本コンピュータビジョン勉強会): Large-Scale Datasets in Self-Supervised Learning of Images

Presentation slides from the 12th All-Japan Computer Vision Study Group, held on March 3, 2024.

Naoki Okamoto

March 03, 2024

Transcript

1. Self-introduction
Naoki Okamoto (岡本直樹) — doctoral student (second year), Graduate School of Engineering, Chubu University; member of the Fujiyoshi Laboratory.
• Survey slides on self-supervised learning (published on Speaker Deck)
• "Knowledge distillation for ensemble learning" [ECCV]
• An explanatory article on DINO
Research theme: automatic design of training methods through hyperparameter search.
Research areas: knowledge distillation, semi-supervised learning, self-supervised learning.
(Slide figure: an ensemble of five ResNet18_ABN models reaches 74.52%, versus 68.1–72.09% for the individual models.)
2. Papers covered
• Learning Transferable Visual Models From Natural Language Supervision
  Builds a large-scale multimodal image-and-language dataset and uses it for training.
• DINOv2: Learning Robust Visual Features without Supervision
  Proposes a framework for constructing a large-scale image dataset.
• The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  Investigates how MAE behaves when trained on the billion-scale Instagram dataset.
For each paper, three points are covered: 1. the training method that exploits the large-scale dataset, 2. how the large-scale dataset is built, and 3. the learning effect obtained from the large-scale dataset.
3. Self-supervised learning (SSL)
• Pre-train on a large amount of unlabeled data, without any ground-truth labels.
• Goal: extract features that are effective for a wide range of transfer-learning / fine-tuning target tasks.
(1) Pre-train a randomly initialized model on unlabeled data.
(2) Transfer / fine-tune the SSL pre-trained model to the target task, e.g. a classification model with an FC layer (supervised with the label "Pelican") or an object-detection model with a task head on top of the pre-trained model.
(Slide figure: diagram of the SSL pipeline, plus figure excerpts from the ViT paper on learned embedding filters and attention examples.)
A minimal sketch of this two-stage workflow follows below.
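The sketch below illustrates the two stages described on this slide in PyTorch. The `Encoder`, the `ssl_loss` callable, and the data loaders are placeholders introduced for illustration, not anything from the presentation; the SSL objective itself is left abstract.

```python
# Minimal sketch of the SSL workflow on this slide (placeholders, not the presenter's code).
import torch
import torch.nn as nn

class Encoder(nn.Module):                      # backbone to be pre-trained without labels
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
    def forward(self, x):
        return self.net(x)

def pretrain(encoder, unlabeled_loader, ssl_loss, epochs=1):
    # Step (1): self-supervised pre-training on unlabeled images only.
    opt = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
    for _ in range(epochs):
        for images in unlabeled_loader:
            loss = ssl_loss(encoder, images)   # e.g. a masked-prediction or contrastive loss
            opt.zero_grad(); loss.backward(); opt.step()

def finetune(encoder, labeled_loader, num_classes, epochs=1):
    # Step (2): attach a task head and transfer / fine-tune on the target task.
    head = nn.Linear(128, num_classes)
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
    for _ in range(epochs):
        for images, labels in labeled_loader:
            logits = head(encoder(images))
            loss = nn.functional.cross_entropy(logits, labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```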
4. Representative self-supervised learning methods (1/2)
• Masked Image Modeling (MIM)
  Predict the features of masked patches: BEiT [H. Bao+, ICLR], BEiTv2 [Z. Peng+, arXiv]
  Predict the pixels of masked patches: MAE [K. He+, CVPR], SimMIM [Z. Xie+, CVPR]
• Contrastive Learning (CL)
  Compare image-level features extracted from multiple different views: MoCo v3 [X. Chen+, ICCV], DINO [M. Caron+, ICCV]
(Slide figures: the MAE pipeline — masking, encoder, mask tokens, decoder, reconstruction loss — and the DINO pipeline — local/global views, encoder + MLP, centering, softmax, exponential moving average, loss computation.)
A simplified MIM objective is sketched after this slide.
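As a rough illustration of the MIM family on this slide, here is a toy masked-image-modeling objective. Unlike the real MAE, it feeds all patches to the model and merely restricts the loss to the masked ones; the patch size, mask ratio, and the example model are illustrative assumptions.

```python
# Simplified sketch of a masked-image-modeling (MIM) objective in the spirit of MAE.
import torch
import torch.nn as nn

def patchify(images, patch=16):
    # (B, 3, H, W) -> (B, N, 3*patch*patch) flattened patches
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # B, C, H/p, W/p, p, p
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

def mim_loss(model, images, mask_ratio=0.75, patch=16):
    targets = patchify(images, patch)                            # pixel targets per patch
    B, N, D = targets.shape
    mask = torch.rand(B, N, device=images.device) < mask_ratio   # True = masked patch
    inputs = targets.clone()
    inputs[mask] = 0.0                                           # hide the masked patches
    pred = model(inputs)                                         # (B, N, D) reconstruction
    return ((pred - targets) ** 2)[mask].mean()                  # loss only on masked patches

# e.g. model = nn.Sequential(nn.Linear(3*16*16, 256), nn.GELU(), nn.Linear(256, 3*16*16))
```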
5. Representative self-supervised learning methods (2/2) — self-supervised learning targeting ViT
• Multimodal CL: contrastive learning on image-text pairs — CLIP [A. Radford+, ICML]
• Multimodal MIM: MIM on RGB, depth, and semantics simultaneously — MultiMAE [R. Bachmann+, ECCV]
• Hybrid: train CL and MIM simultaneously — iBOT [J. Zhou+, ICLR], DINOv2 [M. Oquab+, arXiv]
(The slide repeats the MIM / CL taxonomy and the MAE and DINO pipeline diagrams from the previous slide.)
6. Learning Transferable Visual Models From Natural Language Supervision [A. Radford+, ICML]
• CLIP: Contrastive Language-Image Pre-training
  Pre-training: self-supervised learning that makes the features of paired images and texts agree.
  Image classification: extract features from texts about the classes, then classify the image by the similarity between the image features and the text features.
• Because the image-text correspondence is learned, images can be classified from that correspondence without any additional training.
(Slide figure: the CLIP overview — (1) contrastive pre-training with an image encoder and a text encoder, (2) creating a dataset classifier from label text with the prompt "A photo of a {object}.", (3) zero-shot prediction.)
7. Self-supervised learning that makes the features of paired images and texts agree
• Contrastive learning that treats the pairs defined by the dataset as positives:
  Positive pairs: the image-text pairs defined in the dataset.
  Negative pairs: every other image-text combination within the mini-batch.
• The dataset used is WebImageText.
(Slide figure: the contrastive pre-training part of the CLIP overview — the cosine-similarity matrix between image features I1…IN and text features T1…TN, the loss computation, and the ideal similarity pattern with the positive pairs on the diagonal.)
A minimal sketch of this symmetric contrastive loss follows.
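The sketch below is a minimal CLIP-style contrastive objective as described on this slide: in-batch image-text pairs on the diagonal are positives, everything else is a negative. The encoders and the temperature value are assumptions, not the original implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # image_feats, text_feats: (B, D) embeddings of paired images and texts
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature                       # (B, B) cosine-similarity matrix
    targets = torch.arange(img.size(0), device=img.device)     # diagonal = positive pairs
    loss_i = F.cross_entropy(logits, targets)                  # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)              # text -> image direction
    return (loss_i + loss_t) / 2

# usage: loss = clip_contrastive_loss(image_encoder(images), text_encoder(tokens))
```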
8. Zero-shot image classification
• CLIP can be applied to image classification with no additional training at all (zero-shot).
• The class of an input image is predicted from the feature-similarity relation with texts that contain the class names.
(Slide figure: features extracted from "A photo of a plane / car / dog / bird" via the prompt template; the input image is judged to be the "dog" class.)
9. Extracting features from texts about the classes
• Features are extracted from texts obtained by inserting each class name into a prompt template.
  Prompt templates are designed by hand to match the images and the task, e.g.
  Bird recognition: "A photo of a {object}, a type of bird."
  Road-sign recognition: "A zoomed in photo of a {object} traffic sign."
(Slide figure: the text encoder turns the templated class texts into the features T1…TN.)
A sketch of building these per-class text features follows.
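The following sketch builds zero-shot "classifier weights" from a prompt template, as described on this slide. The `tokenize` and `text_encoder` callables stand in for whatever CLIP implementation is used (e.g. open_clip); they are assumptions, not a specific library API.

```python
import torch
import torch.nn.functional as F

def build_text_features(class_names, template, tokenize, text_encoder):
    # e.g. template = "A photo of a {}, a type of bird."            (bird recognition)
    #      template = "A zoomed in photo of a {} traffic sign."     (road signs)
    prompts = [template.format(name) for name in class_names]
    with torch.no_grad():
        feats = text_encoder(tokenize(prompts))      # (num_classes, D) text features
    return F.normalize(feats, dim=-1)                # one unit-norm feature per class
```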
10. Classifying an image by the similarity between image and text features (1/3)
• Extract features from the image to be classified.
• Compute the cosine similarity between the image features and each text feature.
• Output the class name of the most similar text as the prediction.
(Slide figure: step highlighted — extracting the features of the input image with the image encoder.)
11. Classifying an image by the similarity between image and text features (2/3)
• Same three steps as the previous slide.
(Slide figure: step highlighted — computing the feature similarity between the image and the texts.)
12. Classifying an image by the similarity between image and text features (3/3)
• Same three steps as slide 10.
(Slide figure: step highlighted — checking the text with the highest similarity; the image is judged to be the "dog" class.)
A compact sketch of the full zero-shot procedure follows.
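Putting slides 10–12 together, the sketch below encodes the image, compares it against every class-text feature by cosine similarity, and returns the most similar class. `image_encoder` and the feature tensors are placeholders; `text_feats` is assumed to come from the `build_text_features` sketch above.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, image_encoder, text_feats, class_names):
    # image: (C, H, W); text_feats: (num_classes, D) unit-norm class-text features
    with torch.no_grad():
        img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, D)
    sims = (img @ text_feats.t()).squeeze(0)      # cosine similarity to each class text
    return class_names[int(sims.argmax())]        # class of the most similar text, e.g. "dog"
```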
13. DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
• Investigates trends in image self-supervised learning with large-scale data of roughly 142 million samples.
  LVD-142M is built by combining several existing datasets with images from the web.
  DINOv2 is designed on top of iBOT by combining existing losses and techniques.
(Slide table, from the DINOv2 paper: composition of LVD-142M — curated sources such as ImageNet-22k, ImageNet-1k, fine-grained classification sets (Caltech 101, CUB-200-2011, DTD, FGVC-Aircraft, Flowers-102, Food-101, Oxford-IIIT Pet, Stanford Cars, SUN397, Pascal VOC 2007), segmentation sets (ADE20K, Cityscapes, Pascal VOC 2012), depth-estimation sets (Mapillary SLS, KITTI, NYU Depth V2, SUN RGB-D), and retrieval sets (Google Landmarks v2, AmsterTime, Met, Revisiting Oxford/Paris), each augmented with web images via "as is", "sample", or "cluster" retrieval, with per-dataset caps to balance the mixture; total 142,109,386 images. Slide annotations: "augment the number of samples", "balance across datasets", "the final LVD-142M".)
14. Augmenting the number of samples with images from the web
• For each dataset, the sample count is augmented with web images in three steps:
  Embedding: extract features with a ViT-H trained by self-supervised learning on ImageNet-22k.
  Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images.
  Retrieval: add the N web images most similar to the dataset.
(Slide figure: pipeline from Curated Data and Uncurated Data through Embedding, Deduplication, and Retrieval to the Augmented Curated Data; labels: "existing dataset", "images collected from the web", "dataset after augmentation".)
15. (Same content as slide 14, with the Embedding / Deduplication / Retrieval pipeline diagram repeated.)
16. (Same content as slide 14, with the pipeline diagram repeated.)
17. (Same content as slide 14, with the pipeline diagram repeated.)
A rough sketch of the deduplication and retrieval steps is given below.
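The sketch below illustrates the deduplication and retrieval steps from slides 14–17 with plain cosine similarity instead of the paper's large-scale nearest-neighbour pipeline. The threshold, the per-query count, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def deduplicate(web_feats, threshold=0.95):
    # Drop web images whose feature is nearly identical to an earlier one (copy detection).
    feats = F.normalize(web_feats, dim=-1)
    keep = []
    for i in range(feats.size(0)):
        if not keep or (feats[i] @ feats[keep].t()).max() < threshold:
            keep.append(i)
    return torch.tensor(keep)                        # indices of retained web images

def retrieve(curated_feats, web_feats, n_per_query=4):
    # For each curated image, pull the N most similar (deduplicated) web images.
    c = F.normalize(curated_feats, dim=-1)
    w = F.normalize(web_feats, dim=-1)
    sims = c @ w.t()                                  # (num_curated, num_web) similarities
    return sims.topk(n_per_query, dim=1).indices      # indices of retrieved web images
```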
18. DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
• Investigates trends in image self-supervised learning with the 142M-sample large-scale data.
  LVD-142M is built by combining existing datasets with images from the internet.
  DINOv2 is designed on top of iBOT by combining existing losses and techniques.
(Slide figure, self-supervised learning with DINOv2: two augmented views of the input image go to a ViT and to a momentum ViT updated by exponential moving average; [CLS] and patch features are trained with contrastive learning without negatives plus MIM, with Softmax on the student side and Sinkhorn-Knopp cluster assignment onto prototypes on the teacher side. A second diagram shows compression of the DINOv2-trained model by knowledge distillation: a pretrained large ViT is distilled into a small ViT (with an EMA copy), again with negative-free contrastive learning and MIM; the small ViT is what is used in the end.)
Two ingredients of this diagram, the EMA teacher update and Sinkhorn-Knopp assignment, are sketched below.
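The sketch below shows the exponential-moving-average (momentum) teacher update and a Sinkhorn-Knopp normalization for cluster assignment, two of the components named in the diagram. The momentum value, epsilon, and iteration count are illustrative, not the paper's settings.

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

@torch.no_grad()
def sinkhorn_knopp(scores, eps=0.05, iters=3):
    # scores: (B, K) teacher similarities to K prototypes; returns soft cluster assignments
    Q = torch.exp(scores / eps).t()                 # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K     # normalize over prototypes
        Q /= Q.sum(dim=0, keepdim=True); Q /= B     # normalize over samples
    return (Q * B).t()                              # (B, K), each row sums to 1
```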
19. Learning effect of the large-scale dataset: recognition performance
• Accuracy comparison on ImageNet-1k.
  Weakly supervised: multimodal image-and-language methods. Data: the pre-training dataset.
  Using images alone, DINOv2 achieves higher accuracy than previous methods.
• Effect of knowledge distillation
  Teacher (ViT-g): roughly 1.1 billion parameters; student (ViT-L): roughly 300 million parameters.
  The distilled student achieves higher accuracy than training the same model from scratch.
(Slide tables and figure from the DINOv2 paper: an ablation of the KoLeo and MIM loss terms; linear and kNN evaluation of frozen features on ImageNet-1k, where DINOv2 ViT-g/14 on LVD-142M reaches 86.5% linear / 83.5% kNN top-1, compared with weakly supervised models such as OpenCLIP ViT-G/14 (86.2%) and EVA-CLIP ViT-g/14 (86.4%); and a figure on the effectiveness of knowledge distillation.)
A sketch of the linear-probing protocol behind this table follows.
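The sketch below illustrates the "linear evaluation of frozen features" protocol referenced by the table: the pretrained backbone is frozen and only a linear classifier is trained on top. Loader names, the optimizer, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, num_classes, feat_dim, epochs=1):
    backbone.eval()                                   # frozen pretrained features
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)              # no gradient into the backbone
            loss = nn.functional.cross_entropy(clf(feats), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf
```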
20. Learning effect of the large-scale dataset: analyzing the features
• Patch features are analyzed by applying principal component analysis (PCA) twice:
  1. Apply PCA to all patch features of several images, threshold the first principal component, and split the patches into foreground and background.
  2. Apply PCA to the patch features judged to be foreground, and color each patch using the values of the first, second, and third principal components as RGB.
• Part-level relations between objects are learned without any human-annotated labels.
(Slide figure, DINOv2 paper Figure 1: "Visualization of the first PCA components. We compute a PCA between the patches of the images from the same column (a, b, c and d) and show their first 3 components. Each component is matched to a different color channel. Same parts are matched between related images despite changes of pose, style or even objects. Background is removed by thresholding the first PCA component." Cited from the Meta AI DINOv2 blog post and [Oquab+, arXiv].)
A sketch of this two-stage PCA coloring follows.
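The sketch below reproduces the two-stage PCA procedure described on this slide: PCA over all patch features, thresholding the first component to keep foreground patches, a second PCA on the foreground patches, and RGB coloring from components 1–3. The zero threshold and the min-max scaling are illustrative choices, not the paper's exact recipe.

```python
import torch

def pca(feats, k):
    # Project centered features onto their top-k principal directions.
    centered = feats - feats.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:k].t()                    # (N, k) principal components

def pca_rgb(patch_feats):
    # patch_feats: (N, D) patch features gathered from several images
    first = pca(patch_feats, 1).squeeze(-1)
    fg = first > 0                                  # threshold 1st component: foreground vs. background
    comps = pca(patch_feats[fg], 3)                 # second PCA on foreground patches only
    rgb = (comps - comps.min(0).values) / (comps.max(0).values - comps.min(0).values + 1e-8)
    return fg, rgb                                  # color each foreground patch by (PC1, PC2, PC3)
```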
21. The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV]
• Investigates the learning effect of MAE when billions of samples are used.
  Uses roughly 3 billion images collected from Instagram together with weak labels (the hashtags attached to each post).
• Two-stage pre-training — MAE followed by weakly supervised learning — yields a stronger pre-training effect.
  Both the MAE and the weakly supervised stages are trained for only a small number of epochs.
(Slide excerpts from the paper: the ViT model family used — ViT-B (86M), ViT-L (307M), ViT-H (632M), plus larger ViT-2B (1.9B) and ViT-6.5B (6.5B) models, all pre-trained at 224×224 — and Table 1, the evaluation datasets for MAE→WSP: ImageNet-1k, iNaturalist-18, ImageNetv2, ImageNet-ReaL, ObjectNet, and Food-101 for image classification; COCO and LVIS for object detection; Kinetics-400 and Something Something-v2 for video action recognition.)
22. The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV]
• Same findings as the previous slide: roughly 3 billion Instagram images with weak hashtag labels; MAE followed by weakly supervised pre-training gives a stronger pre-training effect, with each stage trained for only a few epochs.
• Comment from the authors at the poster session: no matter how long you train it does not converge, yet a few epochs already give sufficient recognition performance ("reasonably enough").
(Slide excerpts repeat the model-scale description and Table 1 from the paper.)
A sketch of the two-stage MAE → weakly-supervised schedule follows.
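The sketch below outlines the two-stage schedule from slides 21–22: MAE-style pre-pretraining on images alone, then weakly supervised training on hashtag labels starting from the pre-pretrained weights. The generic multi-label hashtag loss, the callables, and the loaders are assumptions standing in for the paper's actual setup; `mae_loss` could be an objective like the MIM sketch shown earlier.

```python
import torch
import torch.nn as nn

def pre_pretrain_then_wsp(backbone, head, mae_loss, image_loader, hashtag_loader):
    # Stage 1: MAE pre-pretraining on images alone (no labels).
    opt = torch.optim.AdamW(backbone.parameters(), lr=1e-4)
    for images in image_loader:
        loss = mae_loss(backbone, images)                  # masked-patch reconstruction
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: weakly supervised pre-training on hashtags (multi-label targets),
    # starting from the stage-1 weights.
    opt = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
    for images, tags in hashtag_loader:                    # tags: (B, num_tags) multi-hot
        logits = head(backbone(images))
        loss = nn.functional.binary_cross_entropy_with_logits(logits, tags.float())
        opt.zero_grad(); loss.backward(); opt.step()
    return backbone, head
```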
23. Summary
• Learning Transferable Visual Models From Natural Language Supervision
  Diverse image-text pairs are collected from the web in a balanced way, based on words that appear frequently in Wikipedia → the WebImageText dataset of 400 million image-text pairs.
  Multimodal contrastive learning acquires correspondences between diverse images and texts.
• DINOv2: Learning Robust Visual Features without Supervision
  Existing datasets are augmented with web images so that the sample counts stay balanced → the augmented datasets are combined into LVD-142M, consisting of roughly 142 million images.
  Part-level relations between objects are learned without ground-truth labels.
• The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  Investigates how MAE behaves on the Instagram dataset of roughly 3 billion images with weak labels.
  A few epochs of self-supervised plus weakly supervised learning already achieve sufficient recognition performance.