
The 60th Nagoya CV & PRML Study Group: CVPR 2024 Paper Introduction (AM-RADIO)


These slides were prepared for the CVPR 2024 paper-introduction session of the 60th Nagoya CV & PRML Study Group held on July 20. They introduce AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One, which performs knowledge distillation from multiple foundation models.


Naoki Okamoto

July 19, 2024


Transcript

1. Self-introduction
Naoki Okamoto — Fujiyoshi Laboratory, Chubu University (doctoral course)
• Research theme: automatic design of training methods via hyperparameter search
• Research areas: knowledge distillation, semi-supervised learning, self-supervised learning
• Tutorial on self-supervised learning: https://speakerdeck.com/naok/…
• SSII anniversary technology map: knowledge distillation

[Figure: the SSII anniversary technology map of knowledge distillation — a timeline organizing KD research by model combination, knowledge type, and transfer method. Legend: teacher = pre-trained model, student = untrained model; frozen vs. updated parameters. Major threads shown: the origins in Model Compression [Buciluǎ+, SIGKDD] and Dark Knowledge / Knowledge Distillation [Hinton+, NIPS workshop], where the teacher's probability distribution (knowledge) trains the student; offline vs. online distillation; ensembles of multiple teachers (Multiple Teacher [You+, KDD] aggregating probability distributions, FEED [Park & Kwak, ECAI] aggregating feature maps, Large-scale distributed [Anil+, ICLR], DualNet [Hou+, ICCV]); self-distillation, transferring deep-layer knowledge to shallow layers (Learning a Unified Classifier [Hou+, CVPR], Be Your Own Teacher [Zhang+, ICCV]) and sequential self-distillation (BAN [Furlanello+, ICML]; Teacher Assistant [Mirzadeh+, AAAI] for the capacity-gap problem; Data-Distortion-Guided Self-Distillation [Xu & Liu, AAAI]); training with students only (DML [Zhang+, CVPR], ONE [Lan+, NeurIPS], Collaborative Learning [Song & Chai, NeurIPS]); extracting knowledge from intermediate layers (FitNets [Romero+, ICLR] with feature maps, Attention Transfer [Zagoruyko+, ICLR], RKD [Park+, CVPR] with sample relations, Flow of Solution Procedure [Yim+, CVPR]); improved intermediate-layer transfer (VID [Ahn+, CVPR] mutual information, CRD [Tian+, ICLR] contrastive learning, AFD [Chung+, ICML] adversarial learning, Knowledge Review [Chen+, CVPR], MGD [Yang+, ECCV], Knowledge Diffusion [Huang+, NeurIPS]); improved output-layer transfer (Data Distillation [Radosavovic+, CVPR], Preparing Lessons [Wen+, Neurocomputing], Gradual Sampling Gate [Minami+, MVA], Function Matching [Beyer+, CVPR], function matching on unlabeled driving-scene data [Yashima+, ECCVW], DIST [Huang+, NeurIPS] transferring intra-class in addition to inter-class correlations, RCO [Jin+, ICCV] and On the Efficacy [Cho & Hariharan, ICCV] using early-stopped teachers); automated KD design (KTG [Minami+, ACCV] for model/loss combinations, AutoKD [Li+, ICCV] for intermediate representations, Ensemble-KTG [Okamoto+, ECCV] and KD-Zero [Li+, NeurIPS] for knowledge/loss combinations); assistant models that bridge the knowledge gap (Residual KD [Gao+, arXiv]); transfer across different architectures (DeiT [Touvron+, ICML] distilling CNN→ViT via probability distributions, One-for-All [Hao+, NeurIPS] projecting intermediate outputs into logit space); aggregating teachers with different class sets or tasks (Amalgamating Knowledge [Shen+, AAAI], Student Becoming the Master [Ye+, CVPR] combining semantic-segmentation and depth-estimation teachers, Oracle KD [Kang+, AAAI]); task/model-specific knowledge design (CLIP-KD [Fang+, CVPR], MiniViT [Zhang+, CVPR] with attention weights and patch tokens, Manifold Distillation [Hao+, NeurIPS] with patch relations, incremental learning [Wu+, CVPR], fast segmentation [Xie+, BMVC] with neighboring-pixel logit relations, SEED [Fang+, ICLR] for self-supervised learning, object detection [Chen+, NeurIPS] with object-region boxes); dataset distillation [Wang+, arXiv]; and AM-RADIO [Ranzinger+, CVPR], which distills multiple foundation models (DINOv2, CLIP, SAM) into one. Map: https://confit.atlas.jp/guide/event/ssii…/static/special_project_tech_map]
2. Vision foundation models
• Pre-trained models that are trained on large-scale datasets and achieve high performance on a variety of downstream tasks
• Representative vision foundation models:
  - CLIP: trained on image-caption pairs; strong zero-shot performance
  - DINOv2: outperforms CLIP on dense tasks such as ADE20K and Pascal VOC
  - SAM: strong open-vocabulary instance segmentation

[Figure: the CLIP contrastive pre-training / zero-shot prediction diagram; the SAM architecture (image encoder, prompt encoder for points/box/text, mask decoder); the DINOv2 self-distillation diagram (student ViT vs. momentum-teacher ViT, Softmax / Sinkhorn-Knopp cluster assignment over prototypes, contrastive learning without negatives plus masked image modeling), and compression of the pretrained large ViT into small ViTs. DINOv2: built the LVD-142M dataset from existing datasets plus internet images, and combined existing losses and techniques on top of iBOT.]
3. CLIP [Radford+, ICML'21]
• Contrastive Language-Image Pre-training
  1. Pre-training: self-supervised learning that aligns the feature vectors of paired images and texts
  2. Image classification: extract a feature vector from a text prompt describing each class
  3. Image classification: classify the image by the similarity between the image feature and each text feature
• Learns image-text correspondences → the correspondences enable image classification with no additional training

[Figure: (1) contrastive pre-training of the image and text encoders; (2) building a classifier from label text ("A photo of a {object}."); (3) zero-shot prediction via the image-text similarity matrix]
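The zero-shot pipeline on this slide reduces to a cosine-similarity lookup: embed the image and one text prompt per class, then pick the class whose text embedding is most similar. A minimal numpy sketch (the embeddings and the function name are toy stand-ins for the CLIP encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Classify an image by cosine similarity to per-class text embeddings.

    image_emb: (d,) image feature; text_embs: (num_classes, d) features of
    prompts like "A photo of a {object}.". Names here are illustrative.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                            # cosine similarity per class
    probs = np.exp(sims) / np.exp(sims).sum()   # softmax over classes
    return int(np.argmax(sims)), probs

# toy example: the second text embedding points the same way as the image
image = np.array([1.0, 0.0, 0.0])
texts = np.array([[0.0, 1.0, 0.0],   # "plane"
                  [0.9, 0.1, 0.0],   # "dog"
                  [0.0, 0.0, 1.0]])  # "bird"
pred, probs = zero_shot_classify(image, texts)
```

Because classification happens entirely in the shared embedding space, swapping the class set only means re-embedding new prompts.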
4. DINOv2 [Oquab+, TMLR]
• Self-Distillation with No Labels
• Investigates trends in image self-supervised learning using a large-scale dataset of ~142M samples
  - Built the LVD-142M dataset by combining multiple existing datasets with web images
• Designed DINOv2 by combining existing losses and techniques on top of iBOT

[Figure: DINOv2 self-distillation — two augmented views of the input image feed a student ViT and a momentum (exponential-moving-average) teacher ViT; the [CLS] and patch tokens are matched via Softmax and Sinkhorn-Knopp cluster assignment over prototypes (contrastive learning without negatives, plus masked image modeling). A second panel shows compressing the pretrained large ViT into small ViTs by distillation.]
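The student / momentum-teacher interplay in the diagram can be written in a few lines. A toy numpy sketch of a DINO-style step (temperatures and the momentum value follow the usual DINO recipe but are illustrative; real DINOv2 adds iBOT-style masked-token losses and Sinkhorn-Knopp assignment):

```python
import numpy as np

def softmax(x, t):
    z = np.exp((x - x.max()) / t)
    return z / z.sum()

def dino_step(student_logits, teacher_logits, center, t_s=0.1, t_t=0.04):
    """One self-distillation loss: cross-entropy between the sharpened,
    centered teacher distribution and the student distribution (no negatives)."""
    p_t = softmax(teacher_logits - center, t_t)   # center + sharpen teacher
    p_s = softmax(student_logits, t_s)
    return float(-(p_t * np.log(p_s + 1e-12)).sum())

def ema_update(teacher_w, student_w, m=0.996):
    """Momentum (EMA) teacher update: the teacher slowly tracks the student."""
    return m * teacher_w + (1 - m) * student_w

logits = np.array([2.0, 0.5, -1.0])
loss = dino_step(logits, logits, center=np.zeros(3))  # same logits, diff. temps
```

Centering and sharpening together are what keep this negative-free objective from collapsing to a constant output.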
5. DINOv2: building the large-scale dataset
• Augment existing datasets with images from the web
  - Adding web images similar to in-dataset images increases data volume while maintaining diversity
• The LVD-142M dataset is built by combining sources with adjusted sample-count balance

Table 15 (from the DINOv2 paper): composition of the LVD-142M dataset — the datasets and splits used, and how each was included (as is, or via sample-based or cluster-based retrieval). Slide annotations: retrieval augments the data volume; the final column adjusts the balance across datasets.

task | dataset / split | images | retrieval | retrieved | final
classification | ImageNet-22k / – | 14,197,086 | as is | – | 14,197,086
classification | ImageNet-22k / – | 14,197,086 | sample | 56,788,344 | 56,788,344
classification | ImageNet-1k / train | 1,281,167 | sample | 40,997,344 | 40,997,344
fine-grained classif. | Caltech 101 / train | 3,030 | cluster | 2,630,000 | 1,000,000
fine-grained classif. | CUB-200-2011 / train | 5,994 | cluster | 1,300,000 | 1,000,000
fine-grained classif. | DTD / train1 | 1,880 | cluster | 1,580,000 | 1,000,000
fine-grained classif. | FGVC-Aircraft / train | 3,334 | cluster | 1,170,000 | 1,000,000
fine-grained classif. | Flowers-102 / train | 1,020 | cluster | 1,060,000 | 1,000,000
fine-grained classif. | Food-101 / train | 75,750 | cluster | 21,670,000 | 1,000,000
fine-grained classif. | Oxford-IIIT Pet / trainval | 3,680 | cluster | 2,750,000 | 1,000,000
fine-grained classif. | Stanford Cars / train | 8,144 | cluster | 7,220,000 | 1,000,000
fine-grained classif. | SUN397 / train1 | 19,850 | cluster | 18,950,000 | 1,000,000
fine-grained classif. | Pascal VOC 2007 / train | 2,501 | cluster | 1,010,000 | 1,000,000
segmentation | ADE20K / train | 20,210 | cluster | 20,720,000 | 1,000,000
segmentation | Cityscapes / train | 2,975 | cluster | 1,390,000 | 1,000,000
segmentation | Pascal VOC 2012 (seg.) / trainaug | 1,464 | cluster | 10,140,000 | 1,000,000
depth estimation | Mapillary SLS / train | 1,434,262 | as is | – | 1,434,262
depth estimation | KITTI / train (Eigen) | 23,158 | cluster | 3,700,000 | 1,000,000
depth estimation | NYU Depth V2 / train | 24,231 | cluster | 10,850,000 | 1,000,000
depth estimation | SUN RGB-D / train | 4,829 | cluster | 4,870,000 | 1,000,000
retrieval | Google Landmarks v2 / train (clean) | 1,580,470 | as is | – | 1,580,470
retrieval | Google Landmarks v2 / train (clean) | 1,580,470 | sample | 6,321,880 | 6,321,880
retrieval | AmsterTime / new | 1,231 | cluster | 960,000 | 960,000
retrieval | AmsterTime / old | 1,231 | cluster | 830,000 | 830,000
retrieval | Met / train | 397,121 | cluster | 62,860,000 | 1,000,000
retrieval | Revisiting Oxford / base | 4,993 | cluster | 3,680,000 | 1,000,000
retrieval | Revisiting Paris / base | 6,322 | cluster | 3,660,000 | 1,000,000
total | | | | | 142,109,386
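The retrieval step behind LVD-142M — find web images near curated seed images, then rebalance — can be sketched with plain cosine retrieval. This is a simplification: the paper runs large-scale nearest-neighbor / cluster retrieval over learned embeddings, and the `cap` argument here is only a stand-in for its per-dataset balancing.

```python
import numpy as np

def curate(seed_embs, web_embs, k=4, cap=6):
    """For each curated seed image, retrieve its k nearest web images by
    cosine similarity; cap the total kept to keep sources balanced."""
    s = seed_embs / np.linalg.norm(seed_embs, axis=1, keepdims=True)
    w = web_embs / np.linalg.norm(web_embs, axis=1, keepdims=True)
    sims = s @ w.T                                # (num_seeds, num_web)
    picked = set()
    for row in sims:
        picked.update(np.argsort(row)[::-1][:k].tolist())  # top-k per seed
    return sorted(picked)[:cap]                   # balance: keep at most `cap`

rng = np.random.default_rng(0)
kept = curate(rng.normal(size=(3, 8)), rng.normal(size=(20, 8)))
```

Retrieving near the seeds is what keeps the added web images on-distribution while still growing the pool.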
6. SAM: building the large-scale dataset
• Pre-training requires ground-truth annotations describing the ideal region segmentation of each input image
  - Manually annotating a huge number of images is impractical → annotations are created with a model in the loop
• Model in the loop: ground truth is created from SAM's outputs, and SAM is then trained on that ground truth
  - Training proceeds over three model-in-the-loop stages with growing amounts of data:
  1. Assisted-manual stage: annotators correct the region masks SAM predicts
  2. Semi-automatic stage: annotators add regions SAM failed to predict, over additional images
  3. Fully automatic stage: SAM's predicted masks are used as ground truth as-is
  → Ultimately, masks are created for 11 million images (over 1 billion masks) and used for training

[Figure: (b) the Segment Anything Model (image encoder, prompt encoder, lightweight mask decoder); (c) the data engine (annotate → train → repeat) and the resulting Segment Anything 1B (SA-1B) dataset: 1+ billion masks, 11 million privacy-respecting, licensed images]
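At its core, the three-stage data engine is a loop that alternates prediction, correction, and retraining. A deliberately toy sketch (the `predict` / `correct` callables stand in for SAM and the human annotators; no real training happens here):

```python
def model_in_the_loop(images, predict, correct, rounds=3):
    """Toy sketch of SAM's data engine: in early rounds model predictions are
    corrected by annotators; in the final round predictions are kept as-is."""
    dataset = []
    for r in range(rounds):
        for img in images:
            mask = predict(img)
            # earlier stages: humans fix the prediction; last stage: automatic
            dataset.append(mask if r == rounds - 1 else correct(img, mask))
        # in the real pipeline, the model is retrained on `dataset` here
    return dataset

auto = model_in_the_loop([1, 2], predict=lambda x: x * 10,
                         correct=lambda x, m: m + 1)
```

The key design choice is that annotation cost per mask drops each round, because the model absorbs more of the work before humans intervene.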
7. Vision foundation models
• (Same overview as slide 2: pre-trained models that perform well across diverse downstream tasks; CLIP trained on image-caption pairs with strong zero-shot performance, DINOv2 beating CLIP on dense tasks such as ADE20K and Pascal VOC, SAM strong at open-vocabulary instance segmentation, with the corresponding CLIP / SAM / DINOv2 diagrams.)
8. Vision foundation models
• (Same overview and diagrams as slides 2 and 7.)
• Each vision foundation model excels at different tasks
  → we want to aggregate their knowledge into a single model
9. AM-RADIO [Ranzinger+, CVPR'24]
• Agglomerative Vision Foundation Model Reduce All Domains Into One
• Trains a model that inherits the properties of multiple foundation models via knowledge distillation, i.e., by imitating their outputs
  - Pre-trained models being imitated: teacher models
  - Model trained by imitation: student model
• Also investigates hardware-efficient architectures for the student model

[Figure: a student vision foundation model with per-teacher heads — Student head 1: DINOv2, head 2: CLIP, head 3: SAM — each trained with a multi-component distillation loss against its frozen teacher on image-only data. The student's summary token and spatial features support semantic segmentation (from scratch), text grounding ("A vintage radio on white sandy beach, a colorful beach ball nearby"), and pixel-level visual tasks.]
10. Knowledge-distillation design
• Typical knowledge distillation:
  - Teacher models: a single model, or multiple models trained on the same task
  - Imitated outputs: final outputs (e.g., class probability distributions) or intermediate outputs
  - Dataset: the dataset the teacher was trained on
• Knowledge distillation in AM-RADIO:
  - Teacher models: multiple models trained with different training methods, tasks, and datasets
  - Imitated outputs: intermediate outputs, whose dimensionality differs per teacher
    → adopt a multi-head student architecture
  - Dataset: the teachers' training datasets are unavailable
    → study how the dataset used for distillation affects downstream tasks
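The "final output" flavor of distillation mentioned above is the classic softened-logit loss of Hinton et al.: KL divergence between teacher and student class distributions at temperature T. A minimal sketch (the temperature value is illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Output-level distillation: KL(teacher || student) over distributions
    softened by temperature T; the T^2 factor keeps gradient scale stable."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float(np.sum(p * np.log(p / q))) * T * T

same = kd_loss(np.array([3.0, 1.0, 0.2]), np.array([3.0, 1.0, 0.2]))
diff = kd_loss(np.array([0.0, 1.0, 0.2]), np.array([3.0, 1.0, 0.2]))
```

AM-RADIO departs from this recipe precisely because its teachers share no common output space, which is what pushes it toward intermediate-feature matching.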
11. Student architecture: adapter heads
• Because the dimensionality of the intermediate outputs differs per teacher, the student adopts a multi-head structure
  - For each teacher, one head for the summary token and one for the spatial features
• Each head is a small MLP

[Figure: the AM-RADIO architecture diagram from slide 9 — frozen DINOv2 / CLIP / SAM teachers, trainable student backbone with per-teacher heads, and a multi-component distillation loss per teacher.]
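The multi-head design can be sketched directly: one small MLP per teacher maps shared backbone tokens into that teacher's feature space, and the distillation loss sums per-teacher terms. A toy numpy sketch (the dimensions, the two-layer head, and the plain-MSE loss are our simplifications of the paper's summary-token / spatial-feature losses):

```python
import numpy as np

def mlp_head(x, w1, b1, w2, b2):
    """Adapter head: a small MLP projecting student backbone features into
    one teacher's feature space (teacher dims differ, hence one head each)."""
    h = np.maximum(x @ w1 + b1, 0.0)        # hidden layer with ReLU
    return h @ w2 + b2                      # project to the teacher's dim

def make_head(rng, d_in, d_hidden, d_out):
    return (rng.normal(size=(d_in, d_hidden)) * 0.1, np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)) * 0.1, np.zeros(d_out))

def multi_teacher_loss(backbone_tokens, heads, teacher_feats):
    """Sum per-teacher distillation losses over the adapter-head outputs."""
    total = 0.0
    for name, params in heads.items():
        pred = mlp_head(backbone_tokens, *params)
        total += np.mean((pred - teacher_feats[name]) ** 2)
    return total

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))            # 5 student tokens, dim 8 (toy)
teacher_dims = {"DINOv2": 12, "CLIP": 10, "SAM": 16}
heads = {n: make_head(rng, 8, 16, d) for n, d in teacher_dims.items()}
targets = {n: mlp_head(tokens, *heads[n]) for n in heads}  # pretend teachers
loss = multi_teacher_loss(tokens, heads, targets)          # exact match -> 0
```

Only the heads are teacher-specific; the gradients from all three losses flow into the single shared backbone, which is what agglomerates the teachers' knowledge.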
12. Choice of distillation dataset
• Studies how the dataset used during distillation affects the student's downstream-task performance
  - k-NN / Zero-shot: classification performance on ImageNet-1K
  - ADE20K: segmentation performance on ADE20K (linear-probe mIOU)
• Using ImageNet-1K yields the best image-classification numbers
  - But training and evaluation then share the same dataset, so zero-shot performance cannot be evaluated fairly
• DataComp-1B is therefore used as the distillation dataset

Table 2 (from the paper): ablation on the choice of training dataset, using MetaCLIP ViT-H/14 and DINOv2 ViT-g/14 teachers and a ViT-L/14 student with CPE. "k-NN" and "Zero Shot" are on ImageNet-1K; ADE20K is linear-probe mIOU.

Dataset | k-NN | Zero Shot | ADE20K
ImageNet-1K | 84.79 | 80.44 | 48.11
ImageNet-21K | 84.61 | 80.10 | 48.65
LAION-400M | 83.77 | 77.46 | 48.6
DataComp-1B | 83.91 | 78.51 | 49.01
13. Evaluating AM-RADIO
• The output of each student head is evaluated on a variety of tasks
  - Example: the output of the student's SAM head is fed into SAM's mask decoder for segmentation
• RADIO-ViT-H matches or exceeds its teachers
  → knowledge distillation can aggregate the knowledge of multiple foundation models into a single model

Table 1 (from the paper): comparison of vision foundation models and RADIO.

Model | Params (M) | Resolution | Throughput | ImageNet-1K Zero-shot | k-NN | ADE20k | VOC | GQA | POPE | TextVQA | VQAv2 | COCO (SAM)
OpenCLIP-H/14 | 632 | 224 | 503 | 77.19 | 81.10 | 40.04 | 68.03 | 57.94 | 83.61 | 50.48 | 72.24 | –
MetaCLIP-H/14 | 632 | 224 | 486 | 80.51 | 82.12 | 35.39 | 62.62 | 60.57 | 84.76 | 53.65 | 75.71 | –
SigLIP-M/14 | 428 | 384 | 241 | 82.61 | 85.16 | 40.53 | 70.31 | 57.70 | 84.85 | 56.65 | 71.94 | –
Intern-ViT-6B | 5,902 | 224 | 63 | 83.20 | 78.43 | 47.20 | 76.85 | 60.18 | 84.02 | 52.45 | 76.75 | –
Intern-ViT-6B | 5,537 | 448 | 14 | – | 68.64 | 42.78 | 74.43 | 61.19 | 87.23 | 60.36 | 78.83 | –
*DFN CLIP-H/14 | 633 | 378 | 170 | 83.90 | 85.27 | 39.00 | 70.29 | 61.73 | 85.91 | 56.78 | 78.78 | –
*OpenAI CLIP-L/14 | 305 | 336 | 414 | 75.54 | 79.80 | 36.51 | 67.04 | 62.20 | 86.09 | 57.92 | 78.49 | –
*DINOv2-g/14-reg | 1,137 | 224 | 294 | – | 83.41 | 48.68 | 82.78 | 61.88 | 85.62 | 47.18 | 76.23 | –
*SAM-H/16 | 637 | 1024 | 12 | – | 22.12 | 28.08 | 34.34 | 49.92 | 81.76 | 43.91 | 57.65 | 77.18
E-RADIO-L (ours) | 391 | 512 | 468 | 80.73 | 83.89 | 48.22 | 81.64 | 61.70 | 85.07 | 51.47 | 76.73 | 76.31
RADIO-ViT-H/16 (ours) | 653 | 432 | 158 | 82.93 | 86.06 | 51.34 | 84.71 | 63.01 | 86.20 | 56.32 | 79.28 | 76.23

Zero-Shot and k-NN are computed on ImageNet-1K. ADE20K and VOC (PascalVOC2012) are linear-probe semantic-segmentation mIOU. GQA, POPE (popular), TextVQA, and VQAv2 are obtained via LLaVa-1.5 by replacing its vision encoder. COCO is the instance-segmentation metric used to evaluate SAM distillation. RADIO attains the best metrics on most benchmarks and is competitive on the rest, while E-RADIO enables high-quality results in resource-constrained settings. Zero-Shot and COCO use the teacher's decoder head without finetuning. Throughput measured on an NVIDIA A100 GPU at the stated resolution with TensorRT. *Denotes teachers used to train the final RADIO.
14. Designing an efficient student model
• Prior work has proposed architectures that target high throughput on GPUs
• These high-throughput architectures are evaluated within AM-RADIO

Backbone | Params | Throughput | Zero-shot | k-NN | ADE20k | FD loss
Teachers:
DINOv2 G/14 | 1.14B | 313 | N/A | 83.41 | 47.53 |
OpenCLIP H/14 | 632M | 556 | 77.19 | 81.10 | 40.04 |
Existing efficient models:
EfficientNetV2-S | 21M | 9017 | 65.37 | 70.72 | 27.75 | 0.415
ResNetv2-101 | 44M | 7283 | 69.58 | 75.32 | 29.61 | 0.405
RegNetY-064 | 30M | 6573 | 69.84 | 74.59 | 28.9 | 0.394
EfficientViT-L1 | 38M | 6048 | 71.73 | 79.90 | 33.12 | 0.376
ConvNext-B | 88M | 1805 | 75.43 | 81.73 | 38.95 | 0.358
NFNet-F3 | 254M | 1777 | 76.93 | 80.50 | 38.31 | 0.340
SwinV2-S | 49M | 1497 | 74.70 | 81.12 | 35.57 | 0.364
MaxViT-B | 119M | 1486 | 77.49 | 79.34 | 38.46 | 0.340
PoolformerV2-M36 | 56M | 1194 | 74.46 | 80.49 | 35.05 | 0.377
MViTV2-B | 51M | 975 | 75.92 | 81.39 | 41.39 | 0.345
Proposed architecture:
E-RADIO-B | 118M | 6422 | 75.19 | 82.21 | 44.03 | 0.319
E-RADIO-B w/o upsample | 113M | 7040 | 75.45 | 82.05 | 41.26 | 0.353
E-RADIO-L | 265M | 3472 | 77.87 | 83.73 | 45.5 | 0.265

• Throughput: CNN-based backbones are better; accuracy: ViT-based backbones are better
  → design a hybrid architecture, E-RADIO
15. E(fficient)-RADIO
(The backbone-comparison table from slide 14 is shown again.)
• Input stage: convolution layers downsample the input image
• Intermediate stages: YOLOv8 C2f convolution blocks followed by Transformer blocks
  - The Transformer blocks alternate local and global attention windows
• A deconvolution layer is applied at the end of the intermediate stages
→ Combining existing techniques yields a more efficient architecture that achieves high throughput and high accuracy at the same time
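The local/global window alternation can be illustrated with plain reshapes over a token grid: local windows group contiguous patches, while the "global" variant groups strided patches so each window spans the whole image. This is a toy illustration of the windowing idea, not E-RADIO's exact block:

```python
import numpy as np

def local_windows(tokens, win):
    """Partition an (H, W, C) token grid into non-overlapping win x win
    windows; attention within each window is 'local'."""
    H, W, C = tokens.shape
    t = tokens.reshape(H // win, win, W // win, win, C)
    return t.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def global_windows(tokens, win):
    """Strided ('dilated') grouping: tokens with the same residue class land
    in one window, so each window spans the whole grid, a cheap stand-in
    for global attention at window-attention cost."""
    H, W, C = tokens.shape
    t = tokens.reshape(win, H // win, win, W // win, C)
    return t.transpose(1, 3, 0, 2, 4).reshape(-1, win * win, C)

grid = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
loc = local_windows(grid, 4)    # 4 windows of 16 contiguous tokens
glo = global_windows(grid, 4)   # 4 windows, each sampling the full grid
```

Alternating the two layouts lets every token exchange information with the whole grid after two blocks while keeping per-block attention cost quadratic only in the window size.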
16. Summary
• Vision foundation models
  - Trained on purpose-built large-scale datasets with supervised or self-supervised learning
  - Different training objectives (loss functions, tasks) and datasets lead to different downstream-task strengths
• AM-RADIO [Ranzinger+, CVPR'24]
  - Distills multiple foundation-model teachers into a single student model at the intermediate layers
  - Adopts a multi-head student because the intermediate output dimensionality differs across foundation models
  - The student matches or exceeds its teachers → knowledge distillation can aggregate the knowledge of multiple foundation models into one model
  - The relationship between student efficiency (throughput) and accuracy depends on the student architecture