
The 60th Nagoya CV/PRML Study Group: CVPR 2024 Paper Introduction (AM-RADIO)


These are the slides that were scheduled to be presented in the CVPR 2024 paper-introduction session of the 60th Nagoya CV/PRML Study Group, held on July 20. They introduce AM-RADIO: Agglomerative Vision Foundation Model — Reduce All Domains Into One, which performs knowledge distillation from multiple foundation models.

Naoki Okamoto

July 19, 2024

Transcript

  1. Self-introduction
• Naoki Okamoto (岡本直樹), doctoral student, Fujiyoshi Laboratory, Chubu University
• Research theme: automatic design of training methods via hyperparameter search
• Research areas: knowledge distillation, semi-supervised learning, self-supervised learning
• Related decks: a tutorial on self-supervised learning, and the SSII anniversary technology map on knowledge distillation (https://speakerdeck.com/naok…)

[Slide: knowledge-distillation technology map (SSII anniversary technology map), organized by year, by model combination, and by knowledge type / transfer method]
• Origins: Model compression [Buciluǎ+, SIGKDD'06] trains a single neural network using ensemble outputs as labels; Dark Knowledge / Knowledge Distillation [Hinton+, NIPSW'14] trains the student using the teacher's probability distribution (the "knowledge")
• Output-layer knowledge, improved transfer: Data distillation [Radosavovic+, CVPR] ensembles a single teacher over data augmentations; Preparing Lessons [Wen+, Neurocomputing] adjusts the knowledge of misrecognized samples and uncertain knowledge; Gradual Sampling Gate [Minami+, MVA] transfers only the knowledge of correctly classified samples; Function Matching [Beyer+, CVPR] rethinks distillation as matching teacher and student as functions over diverse mixup images (applied to unlabeled driving-scene data in [Yashima+, ECCVW]); DIST [Huang+, NeurIPS] transfers intra-class correlations in addition to inter-class ones
• Intermediate-layer knowledge (carries richer information): FitNets [Romero+, ICLR'15] feature maps; Attention Transfer [Zagoruyko+, ICLR] attention maps; Flow of Solution Procedure [Yim+, CVPR] correlations between layer outputs; RKD [Park+, CVPR] inter-sample relations; VID [Ahn+, CVPR] mutual information; CRD [Tian+, ICLR] contrastive learning; AFD [Chung+, ICML] adversarial learning; Knowledge Review [Chen+, CVPR] transfers knowledge between layers of different depth; MGD [Yang+, ECCV] predicts the teacher's feature maps from masked student feature maps; Knowledge Diffusion [Huang+, NeurIPS] uses diffusion-model training
• Multiple teachers: aggregate probability distributions (Multiple Teacher [You+, KDD]; Large-scale distributed [Anil+, ICLR]) or feature maps (FEED [Park & Kwak, ECAI]; Dualnet [Hou+, ICCV]); teachers with different class sets or tasks aggregated into one student (Amalgamating Knowledge [Shen+, AAAI] — teachers for different classification tasks; Student becoming the master [Ye+, CVPR] — a semantic-segmentation teacher and a depth-estimation teacher); AM-RADIO [Ranzinger+, CVPR'24] — multiple foundation models (DINOv2, CLIP, SAM)
• Online distillation (students only): DML [Zhang+, CVPR] mutual distillation between students improves accuracy; ONE [Lan+, NeurIPS]; Collaborative learning [Song & Chai, NeurIPS] shares the students' shallow layers to reduce parameters
• Self-distillation: transfer deep-layer knowledge to shallow layers (Learning a unified classifier [Hou+, CVPR]; Be your own teacher [Zhang+, ICCV]); BAN [Furlanello+, ICML] Small→Small→…; Teacher Assistant [Mirzadeh+, AAAI] Large→Middle→Small (addresses the capacity-gap problem); Data-distortion guided self-distillation [Xu & Liu, AAAI] predicts the output of one augmentation of a sample from another (data-to-data self-distillation)
• Capacity gap: RCO [Jin+, ICCV] uses teachers whose training was stopped early; On the efficacy [Cho & Hariharan, ICCV]
• Automatic design of distillation: KTG [Minami+, ACCV] model/loss combinations; Oracle Knowledge Distillation [Kang+, AAAI] student architecture for an ensemble teacher; AutoKD [Li+, ICCV] intermediate knowledge representations; Ensemble-KTG [Okamoto+, ECCV] and KD-Zero [Li+, NeurIPS] knowledge/loss combinations; Residual-KD [Gao+, arXiv] adds an assistant model that complements the knowledge gap
• Across architectures: DeiT [Touvron+, ICML'21] distills CNN→ViT via probability distributions; One-for-All [Hao+, NeurIPS] projects intermediate outputs into logit space to enable intermediate-layer distillation between different architectures
• Task- and model-specific knowledge: CLIP-KD [Fang+, CVPR] studies which conventional knowledge is effective for CLIP; MiniViT [Zhang+, CVPR] attention weights and patch tokens; Manifold Distillation [Hao+, NeurIPS] inter-patch relations; Large-scale incremental learning [Wu+, CVPR] distributions of models trained on past tasks; fast segmentation with teacher-student learning [Xie+, BMVC] logit relations with neighboring pixels; SEED [Fang+, ICLR] inter-sample relations for self-supervised learning; learning efficient object-detection models with knowledge distillation — object-region boxes
• Other: Dataset Distillation [Wang+, arXiv] optimizes input noise so that models trained on it reach high accuracy
Legend: teacher = pretrained model (parameters frozen); student = untrained model (parameters updated); offline distillation transfers teacher knowledge to the student, online distillation transfers knowledge among students only
  2. Vision foundation models
• Pretrained models trained on large-scale datasets that achieve high performance across a wide range of downstream tasks
• Representative vision foundation models:
  - CLIP: trained on image-caption pairs; strong zero-shot classification
  - DINOv2: outperforms CLIP on dense tasks such as ADE20K and Pascal VOC

  - SAM: strong open-vocabulary instance segmentation
[Figure: overview diagrams of the three models — CLIP's contrastive pre-training (image/text encoders, similarity matrix, zero-shot prediction from label-text prompts), SAM's promptable architecture (image encoder, prompt encoder with points/box/text, mask decoder producing valid masks with scores), and DINOv2's self-supervised training pipeline with subsequent compression into smaller ViTs]
  3. CLIP [Radford+, ICML'21]
• Contrastive Language-Image Pre-training
  - Pre-training: self-supervised learning that aligns the features of paired images and texts
  - Image classification: extract text features from class-name prompts

  - Image classification: classify an image by the similarity between its features and each class's text features
• Learns image-text correspondences → those correspondences enable image classification with no additional training (zero-shot)
[Figure: the CLIP overview — (1) contrastive pre-training, (2) building a dataset classifier from label text such as "A photo of a {object}.", (3) zero-shot prediction]
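The zero-shot procedure on this slide (encode class prompts, encode the image, pick the most similar class) can be sketched in a few lines. This is a minimal sketch: the embeddings below are random stand-ins for real CLIP encoder outputs, and `zero_shot_classify` is a hypothetical helper, not the CLIP API.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity between one
    image embedding and one text embedding per class, then argmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # one similarity score per class
    return int(np.argmax(sims)), sims

# Toy embeddings standing in for encoder outputs (hypothetical values).
rng = np.random.default_rng(0)
dog_text = rng.normal(size=8)             # "A photo of a dog."
cat_text = rng.normal(size=8)             # "A photo of a cat."
image = dog_text + 0.1 * rng.normal(size=8)   # image embedding close to "dog"
pred, _ = zero_shot_classify(image, np.stack([dog_text, cat_text]))
print(pred)  # 0 (the "dog" class)
```

Because classification reduces to a similarity lookup, new classes can be added by encoding new prompts, with no retraining.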
  4. DINOv2 [Oquab+, TMLR'24]
• Self-Distillation with No Labels
• Investigates trends in self-supervised image representation learning using roughly 142M samples of large-scale data
  - Built LVD-142M by combining several existing datasets with images from the web
• Designed DINOv2 on top of iBOT by combining existing losses and techniques

[Figure: DINOv2 self-supervised training — two augmented views of the input image are fed to a ViT and a momentum (exponential-moving-average) ViT; the [CLS] and patch features are matched against prototypes via softmax and Sinkhorn-Knopp cluster assignment; training combines contrastive learning without negatives, masked image modeling (MIM), and data augmentation; the DINOv2-trained large ViT is finally compressed into smaller ViTs by distillation]
  5. DINOv2: building the large-scale dataset
• Data volume is increased by adding web images to existing datasets
  - Adding web images that resemble the in-dataset images grows the data while preserving its diversity
• The LVD-142M dataset is created by adjusting the balance of data counts across datasets and combining them

Table 15: Composition of the LVD-142M dataset — the datasets and splits used to build it, and how each was included (as is without retrieval, or via sample-based or cluster-based retrieval).

| Task | Dataset / Split | Images | Retrieval | Retrieved | Final |
|---|---|---|---|---|---|
| classification | ImageNet-22k / – | 14,197,086 | as is | – | 14,197,086 |
| classification | ImageNet-22k / – | 14,197,086 | sample | 56,788,344 | 56,788,344 |
| classification | ImageNet-1k / train | 1,281,167 | sample | 40,997,344 | 40,997,344 |
| fine-grained classif. | Caltech 101 / train | 3,030 | cluster | 2,630,000 | 1,000,000 |
| fine-grained classif. | CUB-200-2011 / train | 5,994 | cluster | 1,300,000 | 1,000,000 |
| fine-grained classif. | DTD / train1 | 1,880 | cluster | 1,580,000 | 1,000,000 |
| fine-grained classif. | FGVC-Aircraft / train | 3,334 | cluster | 1,170,000 | 1,000,000 |
| fine-grained classif. | Flowers-102 / train | 1,020 | cluster | 1,060,000 | 1,000,000 |
| fine-grained classif. | Food-101 / train | 75,750 | cluster | 21,670,000 | 1,000,000 |
| fine-grained classif. | Oxford-IIIT Pet / trainval | 3,680 | cluster | 2,750,000 | 1,000,000 |
| fine-grained classif. | Stanford Cars / train | 8,144 | cluster | 7,220,000 | 1,000,000 |
| fine-grained classif. | SUN397 / train1 | 19,850 | cluster | 18,950,000 | 1,000,000 |
| fine-grained classif. | Pascal VOC 2007 / train | 2,501 | cluster | 1,010,000 | 1,000,000 |
| segmentation | ADE20K / train | 20,210 | cluster | 20,720,000 | 1,000,000 |
| segmentation | Cityscapes / train | 2,975 | cluster | 1,390,000 | 1,000,000 |
| segmentation | Pascal VOC 2012 (seg.) / trainaug | 1,464 | cluster | 10,140,000 | 1,000,000 |
| depth estimation | Mapillary SLS / train | 1,434,262 | as is | – | 1,434,262 |
| depth estimation | KITTI / train (Eigen) | 23,158 | cluster | 3,700,000 | 1,000,000 |
| depth estimation | NYU Depth V2 / train | 24,231 | cluster | 10,850,000 | 1,000,000 |
| depth estimation | SUN RGB-D / train | 4,829 | cluster | 4,870,000 | 1,000,000 |
| retrieval | Google Landmarks v2 / train (clean) | 1,580,470 | as is | – | 1,580,470 |
| retrieval | Google Landmarks v2 / train (clean) | 1,580,470 | sample | 6,321,880 | 6,321,880 |
| retrieval | AmsterTime / new | 1,231 | cluster | 960,000 | 960,000 |
| retrieval | AmsterTime / old | 1,231 | cluster | 830,000 | 830,000 |
| retrieval | Met / train | 397,121 | cluster | 62,860,000 | 1,000,000 |
| retrieval | Revisiting Oxford / base | 4,993 | cluster | 3,680,000 | 1,000,000 |
| retrieval | Revisiting Paris / base | 6,322 | cluster | 3,660,000 | 1,000,000 |
| total | | | | | 142,109,386 |
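The retrieval step described above (find web images similar to each curated seed image, to enlarge the dataset without losing diversity) can be sketched as a cosine-similarity nearest-neighbour search. This is a minimal sketch under stated assumptions: embeddings are random stand-ins for a real image encoder's outputs, and `retrieve_similar` is a hypothetical helper, not DINOv2's actual curation pipeline (which also deduplicates and clusters).

```python
import numpy as np

def retrieve_similar(seed_embs, web_embs, k):
    """For each curated seed image, return the indices of the k most
    cosine-similar images from an uncurated web pool."""
    s = seed_embs / np.linalg.norm(seed_embs, axis=1, keepdims=True)
    w = web_embs / np.linalg.norm(web_embs, axis=1, keepdims=True)
    sims = s @ w.T                           # seeds x web similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]  # top-k web indices per seed

rng = np.random.default_rng(1)
web = rng.normal(size=(100, 16))             # embeddings of 100 web images
# Two seeds constructed to lie near web images 3 and 42.
seeds = web[[3, 42]] + 0.05 * rng.normal(size=(2, 16))
picked = retrieve_similar(seeds, web, k=4)
print(picked[0][0], picked[1][0])  # 3 42
```

Each seed pulls in its nearest web neighbours, which is how the "Retrieved" column above grows the small fine-grained datasets to about a million images each.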
  6. SAM: building the large-scale dataset
• Pre-training requires annotations describing the ideal region segmentation of each input image
  - Manually annotating a huge number of images is impractical → annotations are created model-in-the-loop
• Model-in-the-loop: annotations are created from SAM's outputs, and SAM is then trained on those annotations
  - Training proceeds through a 3-stage model-in-the-loop process while the amount of data grows

1. Assisted-manual stage: annotators manually correct the region masks predicted by SAM
2. Semi-automatic stage: annotators add the regions SAM failed to predict, over an enlarged set of images
3. Fully-automatic stage: SAM's region predictions are used as annotations as-is
→ Finally, masks are created for 11 million images (over 1 billion masks in total) and used for training: Segment Anything 1B (SA-1B) — 1+ billion masks, 11 million images, privacy respecting, licensed images
[Figure: (b) the Segment Anything Model (image encoder, prompt encoder, lightweight mask decoder) and (c) its data engine — an annotate → train → annotate loop over the growing dataset]
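The three-stage data engine is, structurally, one loop: predict with the current model, correct the predictions, retrain, repeat. The sketch below shows only that control flow; the callables are toy stand-ins (integers instead of masks), and `model_in_the_loop` is a hypothetical helper, not SAM's actual engine.

```python
def model_in_the_loop(predict, annotate, retrain, images, stages=3):
    """Sketch of a model-in-the-loop data engine: at each stage the current
    model proposes masks, an annotation step corrects/extends them, and the
    model is retrained on the growing labelled pool."""
    labelled = []
    for _ in range(stages):
        proposals = [predict(img) for img in images]
        labelled += [annotate(img, p) for img, p in zip(images, proposals)]
        predict = retrain(labelled)          # new model from all labels so far
    return predict, labelled

# Toy stand-ins: a "mask" is an int, annotation bumps it by one,
# retraining returns a model that predicts the best label seen so far.
predict = lambda img: 0
annotate = lambda img, p: p + 1
retrain = lambda pool: (lambda img: max(pool))
model, pool = model_in_the_loop(predict, annotate, retrain, images=[1, 2], stages=3)
print(len(pool))  # 6  (2 images x 3 stages)
```

In SAM's case the human effort per stage decreases (manual correction → adding missed regions → none), while the pool grows to the full SA-1B.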
  7. Vision foundation models (recap)
• Pretrained models trained on large-scale datasets that achieve high performance across a wide range of downstream tasks
• Representative vision foundation models:
  - CLIP: trained on image-caption pairs; strong zero-shot classification
  - DINOv2: outperforms CLIP on dense tasks such as ADE20K and Pascal VOC

  - SAM: strong open-vocabulary instance segmentation
[Figure: the same overview diagrams of CLIP, SAM, and DINOv2 as on slide 2]
  8. Vision foundation models (recap, continued)
• Pretrained models trained on large-scale datasets that achieve high performance across a wide range of downstream tasks
• Representative vision foundation models:
  - CLIP: trained on image-caption pairs; strong zero-shot classification
  - DINOv2: outperforms CLIP on dense tasks such as ADE20K and Pascal VOC

  - SAM: strong open-vocabulary instance segmentation
Each vision foundation model excels at different tasks
↓
We want to aggregate their knowledge into a single model
[Figure: the same overview diagrams of CLIP, SAM, and DINOv2 as on slide 2]
  9. AM-RADIO [Ranzinger+, CVPR'24]
• Agglomerative Vision Foundation Model — Reduce All Domains Into One
• Trains a model that inherits the properties of multiple foundation models through knowledge distillation, i.e. by imitating the models' outputs
  - The pretrained models being imitated: teacher models
  - The model trained by imitation: student model
• Also investigates hardware-efficient architectures for the student model

[Figure: AM-RADIO overview — a vision-foundation-model student (trained from scratch) with per-teacher heads; each head is matched to its frozen teacher (DINOv2, CLIP, SAM) on both the summary token and the spatial features via a multi-component distillation loss, using image-only data; the resulting model supports pixel-level visual tasks, text grounding (e.g. "A vintage radio on white sandy beach, a colorful beach ball nearby"), and semantic segmentation]
  10. Knowledge-distillation design
• Typical knowledge-distillation design
  - Teacher models: a single model, or several models trained on the same task
  - Imitated outputs: final outputs (e.g. probability distributions) and intermediate outputs
  - Dataset: the dataset used to train the teacher model
• Knowledge-distillation design in AM-RADIO

  - Teacher models: several models trained with different training methods, tasks, and datasets
  - Imitated outputs: intermediate outputs (whose dimensionality differs between teachers)
    → a multi-head student model is adopted
  - Dataset: the teachers' training datasets are not available
    → the effect of the dataset used for distillation on downstream tasks is investigated
  11. Student architecture: adapter heads
• Because the dimensionality of the intermediate outputs differs per teacher, the student adopts a multi-head design
  - For each teacher, the student has one head for the summary token and one head for the spatial features
• Each head is a 2-layer MLP

[Figure: the AM-RADIO overview from slide 9 again, highlighting the three student heads (DINOv2, CLIP, SAM) and their multi-component distillation losses against the frozen teachers]
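The multi-head design above can be sketched as a shared backbone plus one small MLP adapter per teacher, trained with a per-teacher feature-matching loss. This is a simplified sketch: the dimensions are made up, the weights are random stand-ins, and a plain L1 match stands in for AM-RADIO's actual multi-component loss (which treats summary tokens and spatial features separately).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: one shared student width, one width per teacher.
D_STUDENT = 32
TEACHER_DIMS = {"dinov2": 24, "clip": 16, "sam": 48}

def mlp_head(dim_out, dim_in=D_STUDENT, hidden=64):
    """A 2-layer MLP adapter head (weights are random stand-ins)."""
    w1 = rng.normal(size=(dim_in, hidden)) * 0.1
    w2 = rng.normal(size=(hidden, dim_out)) * 0.1
    return lambda x: np.maximum(x @ w1, 0) @ w2   # ReLU MLP

heads = {name: mlp_head(d) for name, d in TEACHER_DIMS.items()}

def distill_loss(student_feats, teacher_feats):
    """Sum over teachers of an L1 match between the adapted student features
    and that teacher's features (simplified stand-in for the paper's loss)."""
    loss = 0.0
    for name, t in teacher_feats.items():
        s = heads[name](student_feats)       # project into the teacher's space
        loss += np.abs(s - t).mean()
    return loss

backbone_out = rng.normal(size=(4, D_STUDENT))               # 4 student tokens
teachers = {n: rng.normal(size=(4, d)) for n, d in TEACHER_DIMS.items()}
loss = distill_loss(backbone_out, teachers)
```

The key point the heads make possible: a single backbone output can be compared against teachers of three different widths, so no teacher constrains the student's own feature dimension.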
  12. Dataset used for distillation
• Investigates how the dataset used at distillation time affects the student's downstream-task performance
  - k-NN / Zero-shot: classification performance evaluated on ImageNet-1K

  - ADE20K: segmentation performance (mIoU, linear probe) evaluated on ADE20K
• Distilling on ImageNet-1K achieves the best performance on the classification task
  - But because the same dataset is used for training and evaluation, zero-shot performance cannot be evaluated fairly ("we argue that it doesn't fairly measure 'zero shot' performance")
• DataComp-1B is therefore used as the distillation dataset

Table 2. Ablation on the choice of training dataset (MetaCLIP ViT-H/14 and DINOv2 ViT-g/14 teachers; ViT-L/14 student with CPE). k-NN and Zero Shot are on ImageNet-1K; ADE20K is linear-probe mIoU.

| Dataset      | k-NN  | Zero Shot | ADE20K |
|--------------|-------|-----------|--------|
| ImageNet 1K  | 84.79 | 80.44     | 48.11  |
| ImageNet 21K | 84.61 | 80.10     | 48.65  |
| LAION-400M   | 83.77 | 77.46     | 48.6   |
| DataComp-1B  | 83.91 | 78.51     | 49.01  |
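The k-NN metric in Table 2 evaluates frozen features directly: classify each query by a majority vote over its nearest training features, with no learned classifier. A minimal sketch, using random features as stand-ins for encoder outputs and a hypothetical `knn_predict` helper:

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """k-NN classification on frozen features: cosine similarity,
    then a majority vote among the k nearest training samples."""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    nn = np.argsort(-(q @ tr.T), axis=1)[:, :k]    # indices of k neighbours
    votes = train_labels[nn]                       # their labels
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(2)
class0 = rng.normal(loc=+2.0, size=(20, 8))        # toy "features", class 0
class1 = rng.normal(loc=-2.0, size=(20, 8))        # toy "features", class 1
feats = np.vstack([class0, class1])
labels = np.array([0] * 20 + [1] * 20)
queries = np.array([[2.0] * 8, [-2.0] * 8])
preds = knn_predict(feats, labels, queries)
print(preds)  # [0 1]
```

Because no parameters are fit, k-NN accuracy reflects the quality of the representation itself, which is why it is reported alongside zero-shot and linear-probe numbers.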
  13. Evaluating AM-RADIO
• The output of each student head is used to evaluate a variety of tasks
  - Example: the output of the student's SAM head is fed into SAM's mask decoder to perform segmentation
• RADIO-ViT-H matches or exceeds its teachers
  → knowledge distillation can aggregate the knowledge of multiple foundation models into a single model

Table 1. Comparison of vision foundation and RADIO models. Zero-Shot and k-NN are computed on ImageNet-1K; ADE20K and VOC (PascalVOC2012) are linear-probe semantic-segmentation mIoU; GQA, POPE (popular), TextVQA, and VQAv2 are obtained via LLaVa-1.5 by replacing the vision encoder; COCO is the instance-segmentation metric used to evaluate SAM distillation (Zero-Shot and COCO use the teacher's decoder head without fine-tuning). Throughput computed on an NVIDIA A100 GPU at the stated resolution with TensorRT v8.6.0.1. * denotes teachers used to train the final RADIO. RADIO attains the best metrics on most benchmarks and is competitive on the rest, while E-RADIO enables high-quality results in resource-constrained settings.

| Model | Params (M) | Resolution | Throughput | Zero-shot | k-NN | ADE20K | VOC | GQA | POPE | TextVQA | VQAv2 | COCO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP-H/14 | 632 | 224 | 503 | 77.19 | 81.10 | 40.04 | 68.03 | 57.94 | 83.61 | 50.48 | 72.24 | – |
| MetaCLIP-H/14 | 632 | 224 | 486 | 80.51 | 82.12 | 35.39 | 62.62 | 60.57 | 84.76 | 53.65 | 75.71 | – |
| SigLIP-M/14 | 428 | 384 | 241 | 82.61 | 85.16 | 40.53 | 70.31 | 57.70 | 84.85 | 56.65 | 71.94 | – |
| Intern-ViT-6B | 5,902 | 224 | 63 | 83.20 | 78.43 | 47.20 | 76.85 | 60.18 | 84.02 | 52.45 | 76.75 | – |
| Intern-ViT-6B | 5,537 | 448 | 14 | | 68.64 | 42.78 | 74.43 | 61.19 | 87.23 | 60.36 | 78.83 | – |
| *DFN CLIP-H/14 | 633 | 378 | 170 | 83.90 | 85.27 | 39.00 | 70.29 | 61.73 | 85.91 | 56.78 | 78.78 | – |
| *OpenAI CLIP-L/14 | 305 | 336 | 414 | 75.54 | 79.80 | 36.51 | 67.04 | 62.20 | 86.09 | 57.92 | 78.49 | – |
| *DINOv2-g/14-reg | 1,137 | 224 | 294 | – | 83.41 | 48.68 | 82.78 | 61.88 | 85.62 | 47.18 | 76.23 | – |
| *SAM-H/16 | 637 | 1024 | 12 | – | 22.12 | 28.08 | 34.34 | 49.92 | 81.76 | 43.91 | 57.65 | 77.18 |
| E-RADIO-L (Ours) | 391 | 512 | 468 | 80.73 | 83.89 | 48.22 | 81.64 | 61.70 | 85.07 | 51.47 | 76.73 | 76.31 |
| RADIO-ViT-H/16 (Ours) | 653 | 432 | 158 | 82.93 | 86.06 | 51.34 | 84.71 | 63.01 | 86.20 | 56.32 | 79.28 | 76.23 |
  14. Designing an efficient student model
• Prior work has proposed architectures designed for high throughput on GPUs
• These high-throughput architectures are evaluated within AM-RADIO

Table. Backbone ablation: parameter count, throughput, ImageNet-1K Zero-Shot and k-NN, ADE20K mIoU, and feature-distillation (FD) loss.

| Backbone | Params | Throughput | Zero-Shot | k-NN | ADE20K | FD loss |
|---|---|---|---|---|---|---|
| Teachers | | | | | | |
| DINOv2 G/14 | 1.14B | 313 | N/A | 83.41 | 47.53 | – |
| OpenCLIP H/14 | 632M | 556 | 77.19 | 81.10 | 40.04 | – |
| Existing efficient models | | | | | | |
| EfficientNetV2-S | 21M | 9017 | 65.37 | 70.72 | 27.75 | 0.415 |
| ResNetv2-101 | 44M | 7283 | 69.58 | 75.32 | 29.61 | 0.405 |
| RegNetY-064 | 30M | 6573 | 69.84 | 74.59 | 28.9 | 0.394 |
| EfficientViT-L1 | 38M | 6048 | 71.73 | 79.90 | 33.12 | 0.376 |
| ConvNext-B | 88M | 1805 | 75.43 | 81.73 | 38.95 | 0.358 |
| NFNet-F3 | 254M | 1777 | 76.93 | 80.50 | 38.31 | 0.340 |
| SwinV2-S | 49M | 1497 | 74.70 | 81.12 | 35.57 | 0.364 |
| MaxViT-B | 119M | 1486 | 77.49 | 79.34 | 38.46 | 0.340 |
| PoolformerV2-M36 | 56M | 1194 | 74.46 | 80.49 | 35.05 | 0.377 |
| MViTV2-B | 51M | 975 | 75.92 | 81.39 | 41.39 | 0.345 |
| Proposed architecture | | | | | | |
| E-RADIO-B | 118M | 6422 | 75.19 | 82.21 | 44.03 | 0.319 |
| (w/o upsample) | 113M | 7040 | 75.45 | 82.05 | 41.26 | 0.353 |
| E-RADIO-L | 265M | 3472 | 77.87 | 83.73 | 45.5 | 0.265 |

Throughput: CNN-based backbones are better; accuracy: ViT-based backbones are better
↓
A hybrid architecture, E-RADIO, is designed
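The throughput column above is images per second at a fixed batch and resolution. A minimal sketch of such a measurement, with warmup iterations excluded: the "model" here is a toy dense layer, and `throughput` is a hypothetical helper, a rough analogue of the paper's TensorRT measurements rather than its actual benchmarking code.

```python
import time
import numpy as np

def throughput(model, batch, n_iters=20, warmup=3):
    """Measure images/second for a callable model, discarding warmup runs
    so one-time costs don't distort the steady-state rate."""
    for _ in range(warmup):
        model(batch)
    t0 = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    dt = time.perf_counter() - t0
    return n_iters * len(batch) / dt

# Toy stand-in "model": a single dense layer over flattened 32x32 images.
w = np.random.default_rng(0).normal(size=(32 * 32, 10))
model = lambda x: x.reshape(len(x), -1) @ w
batch = np.zeros((16, 32, 32))
rate = throughput(model, batch)          # images per second
```

Comparing backbones this way only makes sense at matched batch size, resolution, and hardware, which is why the table fixes all three.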
  15. Designing an efficient student model (continued)

[Table: the same backbone ablation as slide 14]
• Efficient RADIO (E-RADIO)
• Input layers: convolution layers reduce the spatial size of the image
• Middle layers: stages of YOLOv8-style C2f convolution blocks followed by stages of Transformer blocks
  - Transformer blocks: local and global attention windows are adopted alternately
• Middle layers: a deconvolution layer is applied at the end (to upsample the features)
• Combining existing techniques yields a more efficient architecture
  → achieves high throughput and high accuracy at the same time
  16. Summary
• Vision foundation models
  - Trained by supervised or self-supervised learning on purpose-built large-scale datasets
  - Because the training objectives (loss functions, tasks) and datasets differ, each model excels at different downstream tasks
• AM-RADIO [Ranzinger+, CVPR'24]

  - Intermediate-layer distillation from multiple foundation-model teachers into a single student model
  - Because the intermediate-output dimensionality differs between foundation models, a multi-head student is adopted
  - The student matches or exceeds its teachers → knowledge distillation can aggregate the knowledge of multiple foundation models into one model
  - The trade-off between student efficiency (throughput) and accuracy depends on the student architecture