
Gakujutsu Bar Q (学術バーQ) AI Research Frontline: Pre-training Image Models with Self-Supervised Learning



Slides presented at Gakujutsu Bar Q (学術バーQ) AI Research Frontline on June 16, 2026.


Naoki Okamoto

June 16, 2026


Transcript

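  1. Self-introduction: Naoki Okamoto (岡本直樹)

    Career
    • Doctoral program, Graduate School of Chubu University (Hironobu Fujiyoshi Laboratory)
    • Researcher, Chubu University
    • Present: DENSO CORPORATION

    Research fields as a student: knowledge distillation, semi-supervised learning, self-supervised learning
    • Knowledge distillation for ensemble learning (ECCV paper, hosted on www.ecva.net)
      [Figure: "Best graph for ensemble"; → acquired different attention maps (diversity suitable for ensembles); edge labels: input, "Bring closer to", "Bring closer each other", "Separate from"]
    • Tutorial on self-supervised learning (slides on speakerdeck.com)
    • SSII anniversary technology map: knowledge distillation (confit.atlas.jp, special_project_tech_map)
    • Links: personal page, X

    SSII anniversary technology map: knowledge distillation
    Axes: combination of models; kind of knowledge and how it is transferred; year
    Legend: teacher = trained model (parameters fixed); student = untrained model (parameters updated)

    Origins
    • Model compression [Buciluǎ, SIGKDD]: train a single neural network using the outputs of an ensemble as labels
    • Dark Knowledge / Knowledge Distillation [Hinton, NIPSW]: train the student using the teacher's probability distribution (the "knowledge")

    Offline distillation: transfer the teacher's knowledge to the student
    • Using intermediate-layer outputs, which carry more diverse information
      - FitNets [Romero, ICLR]: uses feature maps as intermediate-layer knowledge
    • Extracting knowledge from intermediate-layer outputs
      - Attention Transfer [Zagoruyko, ICLR]: attention maps
      - Flow of Solution Procedure [Yim, CVPR]: correlations between the outputs of layers
      - RKD [Park, CVPR]: relations between samples
    • Improving how intermediate-layer knowledge is transferred
      - VID [Ahn, CVPR]: mutual information
      - CRD [Tian, ICLR]: contrastive learning
      - AFD [Chung, ICML]: adversarial learning
      - Knowledge Diffusion [Huang, NeurIPS]: training procedure of diffusion models
      - Knowledge Review [Chen, CVPR]: transfers knowledge between layers of different depth
      - MGD [Yang, ECCV]: predicts the teacher's feature map from a masked student feature map
    • Improving how output-layer knowledge is transferred
      - Preparing Lessons [Wen, Neurocomputing]: adjusts knowledge on misclassified data and uncertain knowledge
      - Gradual Sampling Gate [Minami, MVA]: transfers only the knowledge on correctly classified data
      - DIST [Huang, NeurIPS]: transfers intra-class correlations in addition to inter-class ones
    • Rethinking knowledge distillation as function matching
      - Function Matching [Beyer, CVPR]: function matching between teacher and student using diverse images produced by mixup
      - Effectiveness of function matching in driving scene recognition [Yashima, ECCVW]: function matching using unlabeled data
    • Using a teacher whose training was stopped early
      - RCO [Jin, ICCV]
      - On the efficacy [Cho & Hariharan, ICCV]: addresses the capacity-gap problem
    • Transferring knowledge in stages
      - BAN [Furlanello, ICML]: Small → Small → … (self-distillation)
      - Teacher Assistant [Mirzadeh, AAAI]: Large → Middle → Small (addresses the capacity-gap problem)
    • Adding a model that assists the transfer
      - Residual KD [Gao, arXiv]: an assistant that compensates for the gap in knowledge
    • Transferring knowledge between different architectures
      - DeiT [Touvron, ICML]: distills from a CNN to a ViT using the probability distribution as knowledge
      - One-for-All [Hao, NeurIPS]: intermediate-layer distillation between different architectures by projecting intermediate outputs into logit space
    • Using an ensemble of multiple teachers
      - Multiple Teacher [You, KDD]: aggregates probability distributions
      - FEED [Park & Kwak, ECAI]: aggregates feature maps
    • Ensemble from a single teacher
      - Data distillation [Radosavovic, CVPR]: uses data augmentation
    • Aggregating into one student the knowledge of multiple teachers with different class sets or tasks
      - Student becoming the master [Ye, CVPR]: one teacher trained on semantic segmentation and one on depth estimation
      - Amalgamating Knowledge [Shen, AAAI]: multiple teachers trained on different classification tasks
      - AM-RADIO [Ranzinger, CVPR]: multiple foundation models (DINOv2, CLIP, SAM)
    • Knowledge designed for a specific task, training scheme, or model
      - CLIP-KD [Yang, CVPR]: CLIP; examines the effectiveness of conventional knowledge for CLIP
      - MiniViT [Zhang, CVPR]: Vision Transformer; attention weights and patch tokens
      - Manifold Distillation [Hao, NeurIPS]: Vision Transformer; relations between patches
      - Large scale incremental learning [Wu, CVPR]: continual learning; probability distribution of the model trained on past tasks
      - Improving fast segmentation with teacher-student learning [Xie, BMVC]: semantic segmentation; logit relations with neighboring pixels
      - SEED [Fang, ICLR]: self-supervised learning; relations between samples
      - Learning efficient object detection models with knowledge distillation [Chen, NIPS]: object detection; bounding boxes of object regions
    • Automatic design of knowledge distillation
      - KTG [Minami, ACCV]: combinations of models and losses
      - Oracle Knowledge Distillation [Kang, AAAI]: student architecture for an ensemble teacher
      - AutoKD [Li, ICCV]: knowledge representation of intermediate layers
      - Ensemble KTG [Okamoto, ECCV]: combinations of knowledge and losses
      - KD-Zero [Li, NeurIPS]: combinations of knowledge and losses

    Online distillation: transfer knowledge among students, using students only
    • Training with multiple students only
      - DML [Zhang, CVPR]: distillation between students improves accuracy
      - ONE [Lan, NeurIPS]
      - Collaborative learning [Song & Chai, NeurIPS]: shares the weights of the students' shallow layers to reduce the parameter count
    • Using an ensemble of multiple students
      - Large scale distributed [Anil, ICLR]: aggregates probability distributions
      - Dualnet [Hou, ICCV]: aggregates feature maps

    Self-distillation: using the model's own knowledge
    • Transferring knowledge from deep layers to shallow layers
      - Learning a unified classifier [Hou, CVPR]
      - Be your own teacher [Zhang, ICCV]
    • Data-distortion guided self-distillation [Xu and Liu, AAAI]: predicts the output for differently augmented versions of the same source data (self-distillation from data to data)

    Other: distillation of datasets
    • Dataset Distillation [Wang, arXiv]: optimizes input noise so that the accuracy of a trained model becomes high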
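  2. Today's content and purpose

    • Content: a tutorial on self-supervised learning, plus a recent trend: scaling up datasets
    • Goal: get a rough understanding of what is inside the DINOv-series models
    • Disclaimer: the views expressed are my own and are unrelated to my current affiliation

    Kojima-san: "Could you introduce some cutting-edge papers around representation learning or self-supervised learning?"
    Building on the tutorial talk I gave as a student [CVIM workshop], I will introduce "pre-training image models with self-supervised learning"!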
  3. Self-supervised learning (SSL: Self-supervised Learning)

    • Pre-train on a large amount of unlabeled data, that is, data with no ground-truth labels attached
    • Goal: extract features that are useful for the various tasks the model is later transferred to or fine-tuned on

    (1) Pre-train a model on unlabeled data: unlabeled data → model (randomly initialized). This stage is the pretext task.
    (2) Transfer or fine-tune the SSL pre-trained model on the target task with labels (e.g. "Pelican"): pre-trained model + FC → classification model; pre-trained model + Head → object detection model. This stage is the downstream task.

    → What kind of pretext task should we design and train on?
  4. Pretext task

    • A task whose ground-truth labels can be created automatically from the data
    • Example: Predicting Image Rotations [Gidaris, ICLR'18]
      - Apply one of four rotations (0, 90, 180, or 270 degrees) to an image
      - Predict which rotation was applied to the input image (4-class classification)

    [Figure 2, quoted from [Gidaris, ICLR 2018]: "Illustration of the self-supervised task that we propose for semantic feature learning. Given four possible geometric transformations, the 0, 90, 180, and 270 degrees rotations, we train a ConvNet model F(.) to recognize the rotation that is applied to the image that it gets as input."]

    → Predicting the rotation requires an understanding of object concepts (position, pose, category, and so on)
    → The ground-truth label can be created automatically from the information about the augmentation applied to the image
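The rotation-prediction pretext task described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the image is assumed to be a plain 2-D list of pixel values, and the helper names `rotate90` and `make_rotation_sample` are hypothetical.

```python
import random


def rotate90(image):
    """Rotate a 2-D grid (list of rows) by 90 degrees counter-clockwise."""
    # zip(*image) transposes the grid; reversing the row order completes the rotation.
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_sample(image, rng=random):
    """Return (rotated_image, label), where label y in {0, 1, 2, 3} means y * 90 degrees.

    The label comes for free from the transformation we applied ourselves,
    so no human annotation is needed.
    """
    label = rng.randrange(4)
    rotated = [list(row) for row in image]
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label


if __name__ == "__main__":
    toy_image = [[1, 2], [3, 4]]
    rotated, label = make_rotation_sample(toy_image)
    print(label, rotated)
```

A classifier trained on such pairs solves a 4-class problem; the interesting part is the representation it learns along the way, not the rotation labels themselves.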
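  5. Progress of pretext tasks

    Natural language processing
    • Word2vec [T. Mikolov, arXiv 2013]: prediction of neighboring words, later applied to images
    • BERT [J. Devlin, NAACL 2019]: Masked Language Modeling (MLM), later applied to images

    Early pretext tasks and their refinements
    • Context Prediction [C. Doersch, ICCV'15]: predicts the relative position between patches of an image split into patches
    • Context Encoders [D. Pathak, CVPR'16]: predicts the pixels of a masked region
    • Jigsaw [M. Noroozi and P. Favaro, ECCV'16]: jigsaw puzzle
    • Colorization [R. Zhang, ECCV'16]: predicts color information
    • Counting [M. Noroozi, ICCV'17]: trains so that the sum of the per-patch outputs matches the output for the whole image
    • Image Rotations [S. Gidaris, ICLR'18]: predicts the rotation angle
    • Jigsaw++ [M. Noroozi, CVPR'18]: mixes the puzzles of two images
    • Spot Artifacts [S. Jenni and P. Favaro, CVPR'18]: applies masking to features
    • Instance Discrimination [Z. Wu, CVPR'18]: trains with the number of classes equal to the number of images
    • CPC [A. v. d. Oord, arXiv'18]: builds pairs between patches for contrastive learning
    • CPC v2 [O. J. Hénaff, ICML'20]: improves pair construction, model architecture, and more

    Contrastive learning
    • Unsupervised Embedding Learning [M. Ye, CVPR'19]: proposes contrastive learning using data augmentation and multiple images
    • MoCo [K. He, CVPR 2020]: uses past outputs as negatives
    • SimCLR [T. Chen, ICML'20]: proposes a simple contrastive learning method
    • MoCo v2 [X. Chen, arXiv 2020]: introduces SimCLR's techniques
    • SimCLR v2 [T. Chen, NeurIPS'20]: introduces large-scale networks
    • MoCo v3 [X. Chen, ICCV 2021]: evaluates effectiveness on ViT
    • Introducing clustering
      - SwAV [M. Caron, NeurIPS'20]: estimates the cluster that the positive belongs to
      - PCL [J. Li, ICLR'21]: introduces prototypes
    • Negative-free methods
      - BYOL [J. Grill, NeurIPS 2020]: trains with positive pairs only
      - SimSiam [X. Chen, CVPR 2021]: proposes an even simpler training method
      - DINO [M. Caron, ICCV 2021]: predicts global from local, and global from global
    • Extension to multiple modalities
      - CLIP [A. Radford, ICML'21]: image + language
      - CM-ACC [S. Ma, ICLR'21]: video + audio
      - MCN [B. Chen, ICCV'21]: video + audio + language
      - SLidR [C. Sautier, CVPR'22]: image + point cloud

    Masked Image Modeling (MIM)
    • Predicting the features of the masked region: BEiT [H. Bao, ICLR'22], iBOT [J. Zhou, ICLR'22]
    • Predicting the pixel values of the masked region: MAE [K. He, CVPR'22], SimMIM [Z. Xie, CVPR'22]
    • Improving the masking strategy: SdAE [Y. Chen, ECCV'22], Attention-Guided MIM [I. Kakogeorgiou, ECCV'22], I-JEPA [M. Assran, CVPR'23]
    • Extension to multiple modalities
      - MultiMAE [R. Bachmann, ECCV'22]: RGB + semantic segmentation + depth
      - MaskVLM [G. Kwon, ICLR'23]: RGB + language
      - CAV-MAE [Y. Gong, ICLR'23]: RGB + audio

    Contrastive learning + MIM
    • SiT [S. Atito, arXiv'21]: SimMIM + contrastive learning
    • CMAE [Z. Huang, arXiv'22]: MAE + contrastive learning
    • What Do SS ViT Learn? [N. Park, ICLR'23]: analyzes how the training effects of contrastive learning and MIM differ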
  6. Progress of pretext tasks

    [Overview map of pretext tasks, identical to slide 5]
    • A wide variety of pretext tasks has emerged from many different perspectives
  7. Various pretext tasks

    • Solving Jigsaw Puzzles [Noroozi and Favaro, ECCV'16]
      - Cut 9 patches in a tiled layout and shuffle them
      - Predict the index of the shuffle order, chosen from a predefined set of orders
    • Context Encoders [Pathak, CVPR'16]
      - Predict the masked region with an encoder-decoder model

    [Figure, quoted from [Noroozi and Favaro, ECCV 2016]: "Fig. 3: Context Free Network. The figure illustrates how a puzzle is generated"]
    [Figure, quoted from [Pathak, CVPR 2016]: "Figure 2: Context Encoder. The context image is passed through the encoder to obtain features which are connected to the decoder using channel-wise fully-connected layer"]
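The jigsaw task above can be sketched as follows. This is an illustrative sketch only: the image is assumed to be a 2-D list whose height and width are divisible by 3, the function names are hypothetical, and the four permutations in `PERMUTATIONS` are placeholders rather than the predefined set used in the paper.

```python
import random

# Placeholder set of shuffle orders (an assumption for illustration only).
PERMUTATIONS = [
    (0, 1, 2, 3, 4, 5, 6, 7, 8),
    (8, 7, 6, 5, 4, 3, 2, 1, 0),
    (2, 0, 1, 5, 3, 4, 8, 6, 7),
    (4, 8, 0, 7, 2, 6, 1, 5, 3),
]


def split_into_tiles(image, grid=3):
    """Cut a 2-D grid into grid x grid tiles, listed in row-major order."""
    tile_h = len(image) // grid
    tile_w = len(image[0]) // grid
    tiles = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * tile_h:(gy + 1) * tile_h]
            tiles.append([row[gx * tile_w:(gx + 1) * tile_w] for row in rows])
    return tiles


def make_jigsaw_sample(image, rng=random):
    """Return (shuffled_tiles, label); label is the index of the applied permutation."""
    tiles = split_into_tiles(image)
    label = rng.randrange(len(PERMUTATIONS))
    shuffled = [tiles[i] for i in PERMUTATIONS[label]]
    return shuffled, label


if __name__ == "__main__":
    toy = [[r * 6 + c for c in range(6)] for r in range(6)]
    tiles, label = make_jigsaw_sample(toy)
    print(label, tiles[0])
```

Framing the target as an index into a fixed set of orders turns puzzle solving into ordinary classification, which is the formulation the slide describes.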
  8. Various pretext tasks

    • Colorful Image Colorization [Zhang, ECCV'16]
      - Create a grayscale image from a color image
      - Predict the ab values in the Lab color space from the grayscale image
    • Non-Parametric Instance Discrimination [Wu, CVPR'18]
      - Train so that the features of each image are separated from one another (training with the number of classes equal to the number of data samples)

    [Figure, quoted from [Wu, CVPR 2018]: "Figure 2: The pipeline of our unsupervised feature learning approach. We use a backbone CNN to encode each image as a feature vector, which is projected to a 128-dimensional space and L2 normalized. The optimal feature embedding is learned via instance-level discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere."]
    [Figure, quoted from [Zhang, ECCV 2016]: "Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in resolution are achieved through spatial downsampling or upsampling between conv blocks."]
  9. Various pretext tasks

    • Context Prediction [Doersch, ICCV'15]
      - Predict the relative position between patches cropped in a tiled layout
    • Contrastive Predictive Coding (CPC v2) [Hénaff, ICML'20]
      - From the features of a patch, predict the features of the patch located k positions ahead

    [Figure, quoted from [Hénaff, ICML 2020]: self-supervised pre-training (feature extractor fθ, context network gφ, InfoNCE loss) followed by the evaluation protocols: linear classification, efficient classification, transfer learning, and a supervised baseline]
    [Figure, quoted from [Doersch, ICCV 2015]: "Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled."]
  10. Progress of pretext tasks

    [Overview map of pretext tasks, identical to slide 5]
    • Contrastive learning delivers a training effect equal to or better than supervised pre-training → contrastive learning became the mainstream
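  11. Q. Which data augmentations are effective for contrastive learning?

    A. Random crop + color transformation
    • SimCLR [Chen, ICML'20]
      - Analyzed which model architectures and data augmentations are effective for contrastive learning, and became the foundation of later contrastive methods

    [Figure, quoted from [Chen, ICML 2020]: change in accuracy (ImageNet-1K) for each pairwise combination of two data augmentations]
    • Accuracy changes depending on how the augmentations are combined
    → In contrastive learning, the choice of data augmentation is important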
  12. Q. Which data augmentations are effective for contrastive learning?

    A. Random crop + color transformation (SimCLR [Chen, ICML'20])

    [Same figure as slide 11, quoted from [Chen, ICML 2020]: accuracy (ImageNet-1K) for each pairwise combination of two data augmentations]
    • Accuracy is low when none of random crop, Cutout, or color transformation is used
  13. Q. Which data augmentations are effective for contrastive learning?

    A. Random crop + color transformation (SimCLR [Chen, ICML'20])

    [Same figure as slide 11, quoted from [Chen, ICML 2020]: accuracy (ImageNet-1K) for each pairwise combination of two data augmentations]
    • The combination of random crop and color transformation achieves the highest accuracy
      (color transformation: grayscale conversion and color jitter)
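To make the "random crop + color transformation" recipe concrete, here is a small sketch that builds two augmented views of one image, which is the positive pair used in contrastive learning. It is an illustration under assumptions, not SimCLR's pipeline: the image is a 2-D list of (R, G, B) tuples, color jitter is reduced to a brightness scale, and the parameters `p_gray` and `max_jitter` are made-up defaults.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size window at a random location."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def jitter_brightness(image, factor):
    """Scale every channel by `factor`, clamped to the 0..255 range."""
    return [[tuple(min(255, max(0, round(c * factor))) for c in px) for px in row]
            for row in image]


def to_grayscale(image):
    """Replace each pixel by its luma, repeated over the three channels."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


def augment(image, size, rng, p_gray=0.2, max_jitter=0.4):
    """Random crop, then brightness jitter, then grayscale with probability p_gray."""
    view = random_crop(image, size, rng)
    view = jitter_brightness(view, 1.0 + rng.uniform(-max_jitter, max_jitter))
    if rng.random() < p_gray:
        view = to_grayscale(view)
    return view


def two_views(image, size, seed=None):
    """Return two independently augmented views of the same image (a positive pair)."""
    rng = random.Random(seed)
    return augment(image, size, rng), augment(image, size, rng)


if __name__ == "__main__":
    img = [[(x * 30 % 256, y * 30 % 256, 128) for x in range(8)] for y in range(8)]
    view_a, view_b = two_views(img, size=4, seed=0)
    print(view_a[0][0], view_b[0][0])
```

In practice one would use a library transform pipeline; the point here is only that both views come from the same source image and differ in crop location and color statistics.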
  14. Progress of pretext tasks

    [Overview map of pretext tasks, identical to slide 5]
    • After the rise of ViT, Masked Image Modeling appeared, applying BERT to images
  15. Pre-training in natural language processing: BERT [Devlin, NAACL'19]

    • BERT: Bidirectional Encoder Representations from Transformers
    • Trains a bidirectional Transformer in two steps: pre-training and fine-tuning
    • Pre-training learns two tasks
      - Masked Language Modeling: mask 15% of the tokens and predict the word of each masked token
      - Next Sentence Prediction: predict whether Sentence B is the continuation of Sentence A

    [Figure, quoted from [Devlin, NAACL 2019]: pre-training (NSP and Mask LM on unlabeled sentence pairs A and B) and fine-tuning (MNLI, NER, SQuAD)]
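The Masked Language Modeling step above (mask 15% of the tokens, then predict them) can be sketched like this. It is a simplified illustration, not BERT's preprocessing code: every selected position is replaced by a `[MASK]` placeholder, and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio=0.15, rng=random):
    """Replace a random subset of tokens with MASK.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should predict.
    """
    if not tokens:
        return [], {}
    # Mask at least one token so that every sequence yields a training signal.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


if __name__ == "__main__":
    sentence = "self supervised learning builds labels from the data itself".split()
    masked, targets = mask_tokens(sentence, rng=random.Random(0))
    print(masked)
    print(targets)
```

The loss is computed only at the positions stored in `targets`, which is what makes the labels "free": they are the tokens that were hidden.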
  16. Applying Masked Language Modeling to images

    • We would like to apply BERT's Masked Language Modeling to the pre-training of image models
      - Problem: position embeddings and mask tokens rely on the Transformer architecture and are difficult to introduce into CNNs
    • The rise of the Vision Transformer (ViT)
      - Masked Language Modeling can be applied to the pre-training of ViT

    [Figure, quoted from [Dosovitskiy, ICLR'21]: Vision Transformer (ViT) overview: linear projection of flattened patches, patch + position embedding, extra learnable [class] embedding, Transformer encoder (Norm, Multi-Head Attention, MLP, repeated L times), MLP head]
  17. Q. What kind of masking is effective for MIM?

    A. Masking that is hard to fill in by merely extending the surrounding lines and textures
    • Masked Autoencoder (MAE) [He, CVPR'22]
      - Proposes a computationally efficient encoder-decoder architecture that predicts the pixels of the masked region
      - Investigates how performance changes with the masking ratio and the masking strategy

    [Figure, quoted from [He, CVPR 2022]: "Figure 5. Masking ratio. A high masking ratio (75%) works well for both fine-tuning (top) and linear probing (bottom). The y-axes are ImageNet-1K validation accuracy (%) in all plots in this paper." x-axis: masking ratio (%)]
    Change in accuracy with the masking ratio (ImageNet-1K)
    • Both linear probing and fine-tuning reach high accuracy at a masking ratio of 75% (a high ratio compared with BERT's 15%)
    • Masking examples: block 50%, random 75%
    → Pushes the model toward understanding the overall picture of objects and scenes
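Random masking at a 75% ratio, as discussed on this slide, amounts to picking a random subset of patch indices. The sketch below assumes a ViT-style 14 x 14 patch grid as an example; the function name `random_patch_mask` is hypothetical and this is not the MAE reference code.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, rng=random):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


if __name__ == "__main__":
    # Example: a 224 x 224 image cut into 16 x 16 patches gives 14 * 14 = 196 patches.
    visible, masked = random_patch_mask(14 * 14, rng=random.Random(0))
    print(len(visible), "visible /", len(masked), "masked")
```

With 196 patches this leaves 49 visible patches, so an encoder that only sees the visible ones processes a quarter of the tokens, while the decoder reconstructs the 147 hidden ones.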
  18. Q. What kind of masking is effective for MIM?

    A. Masking that is hard to fill in by merely extending the surrounding lines and textures (MAE [He, CVPR'22])

    Masking examples and per-patch predictions (block 50%, grid 75%, random 75%); performance on ImageNet-1K
    [Table, quoted from [He, CVPR 2022], panel (f) "Mask sampling. Random sampling works the best."; ViT-L/16 on ImageNet-1K; fine-tuning (ft) and linear probing (lin) accuracy (%)]

    case     ratio (%)   ft     lin
    random   75          84.9   73.5
    block    50          83.9   72.3
    block    75          82.8   63.9
    grid     75          84.0   66.0

    • Random masking (75%) achieves the highest accuracy in both fine-tuning (ft) and linear probing (lin)
  19. Q. What kind of masking is effective for MIM?

    [Same masking examples and mask-sampling table as slide 18, quoted from [He, CVPR 2022]]
    • Block-wise masking (75%): because the masked patches are clustered together, prediction is difficult, and accuracy is lower than with random masking
  20. Q. What kind of masking is effective for MIM?

    [Same masking examples and mask-sampling table as slide 18, quoted from [He, CVPR 2022]]
    • Grid masking (75%): high-quality predictions are possible from the surrounding information, yet linear probing (lin) accuracy is low
  21. Q. What kind of masking is effective for MIM?

    [Same masking examples and mask-sampling table as slide 18, quoted from [He, CVPR 2022]]
    • High prediction quality on the masked patches does not necessarily mean that good feature representations were learned
  22. Progress of pretext tasks

    [Overview map of pretext tasks, identical to slide 5]
    • A hybrid of contrastive learning and MIM makes effective use of the strengths of both
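  23. Q. Should we use contrastive learning or MIM?

    A. A hybrid method: contrastive learning + MIM
    • What Do Self-Supervised Vision Transformers Learn? [Park, ICLR'23]
      - Contrastive learning and MIM produce different training effects in terms of self-attention, feature extraction, and which layers matter
      - Training with the two combined is used to evaluate that they are complementary

    $\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\mathrm{MIM}} + \lambda\,\mathcal{L}_{\mathrm{CL}}$
    λ: weight that adjusts the balance; $\mathcal{L}_{\mathrm{MIM}}$: MIM loss; $\mathcal{L}_{\mathrm{CL}}$: contrastive learning loss

    [Figure, quoted from [Park, ICLR 2023]: accuracy on ImageNet-1K as λ varies from MIM only to contrastive learning only, when updating the output layer only and when updating the whole model]
    • Tuning λ to an appropriate balance makes effective use of the strengths of both contrastive learning and MIM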
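The weighted objective on this slide is a convex combination of the two losses. The following sketch only illustrates that arithmetic; the scalar inputs stand in for loss values computed elsewhere, and the function name `hybrid_loss` is hypothetical.

```python
def hybrid_loss(loss_mim, loss_cl, lam):
    """Return (1 - lam) * loss_mim + lam * loss_cl.

    lam = 0 trains with MIM only, lam = 1 with contrastive learning only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loss_mim + lam * loss_cl


if __name__ == "__main__":
    # Sweep the balance weight for two example loss values.
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(lam, hybrid_loss(2.0, 4.0, lam))
```

Because the weights sum to one, the combined value always stays between the two individual losses, so sweeping `lam` interpolates between the two training regimes.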
  24. Progress of pretext tasks

    [Overview map of pretext tasks, identical to slide 5]
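  25. Recent trend: scaling up datasets

    • Self-supervised learning usually uses ImageNet-1K (about 1.28 million images)
      - This is small compared with the datasets used for foundation models in natural language processing
    • Datasets are being scaled up with the aim of building foundation models for images
      - SEER [Goyal, arXiv'21]: contrastive learning on about 1 billion images collected from Instagram
      - MAWS [Singh, ICCV'23]: MIM on about 3 billion images collected from Instagram
      - DINOv2 [Oquab, TMLR'24]: contrastive learning + MIM on about 142 million images from existing datasets and the web
      - DINOv3 [Siméoni, arXiv'25]: builds three datasets with different purposes from about 17 billion images and trains on them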
  26. Recent trend: scaling up datasets

    [Same content as slide 25]
  27. Scaling up datasets: DINOv2 [Oquab, TMLR]
    • Investigates how self-supervised learning behaves on a large dataset of about 142 million images
    • Builds LVD-142M by combining multiple existing datasets with images from the web
    • Designs DINOv2 by taking iBOT as the base and combining existing losses and techniques

    Task                   Dataset / Split                        Images      Retrieval  Retrieved    Final
    classification         ImageNet-22k / –                       14,197,086  as is      –            14,197,086
    classification         ImageNet-22k / –                       14,197,086  sample     56,788,344   56,788,344
    classification         ImageNet-1k / train                    1,281,167   sample     40,997,344   40,997,344
    fine-grained classif.  Caltech 101 / train                    3,030       cluster    2,630,000    1,000,000
    fine-grained classif.  CUB-200-2011 / train                   5,994       cluster    1,300,000    1,000,000
    fine-grained classif.  DTD / train1                           1,880       cluster    1,580,000    1,000,000
    fine-grained classif.  FGVC-Aircraft / train                  3,334       cluster    1,170,000    1,000,000
    fine-grained classif.  Flowers-102 / train                    1,020       cluster    1,060,000    1,000,000
    fine-grained classif.  Food-101 / train                       75,750      cluster    21,670,000   1,000,000
    fine-grained classif.  Oxford-IIIT Pet / trainval             3,680       cluster    2,750,000    1,000,000
    fine-grained classif.  Stanford Cars / train                  8,144       cluster    7,220,000    1,000,000
    fine-grained classif.  SUN397 / train1                        19,850      cluster    18,950,000   1,000,000
    fine-grained classif.  Pascal VOC 2007 / train                2,501       cluster    1,010,000    1,000,000
    segmentation           ADE20K / train                         20,210      cluster    20,720,000   1,000,000
    segmentation           Cityscapes / train                     2,975       cluster    1,390,000    1,000,000
    segmentation           Pascal VOC 2012 (seg.) / trainaug      1,464       cluster    10,140,000   1,000,000
    depth estimation       Mapillary SLS / train                  1,434,262   as is      –            1,434,262
    depth estimation       KITTI / train (Eigen)                  23,158      cluster    3,700,000    1,000,000
    depth estimation       NYU Depth V2 / train                   24,231      cluster    10,850,000   1,000,000
    depth estimation       SUN RGB-D / train                      4,829       cluster    4,870,000    1,000,000
    retrieval              Google Landmarks v2 / train (clean)    1,580,470   as is      –            1,580,470
    retrieval              Google Landmarks v2 / train (clean)    1,580,470   sample     6,321,880    6,321,880
    retrieval              AmsterTime / new                       1,231       cluster    960,000      960,000
    retrieval              AmsterTime / old                       1,231       cluster    830,000      830,000
    retrieval              Met / train                            397,121     cluster    62,860,000   1,000,000
    retrieval              Revisiting Oxford / base               4,993       cluster    3,680,000    1,000,000
    retrieval              Revisiting Paris / base                6,322       cluster    3,660,000    1,000,000
    Total                                                                                             142,109,386

    Table 15: Composition of our LVD-142M dataset. We report the list of datasets and associated splits
    Slide annotations: the Retrieved column augments the number of samples; the Final column adjusts the balance between datasets; the total is the final LVD-142M.
    Quoted from [Oquab, arXiv]
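
    As a quick arithmetic check on the table above, the Final column can be summed to confirm the reported total. This is only an illustrative sketch: the list transcribes the Final values in row order, and the variable names are made up for the example.

```python
# Final column of the LVD-142M composition table, transcribed in row order.
classification = [14_197_086, 56_788_344, 40_997_344]
fine_grained = [1_000_000] * 10          # ten fine-grained datasets, capped at 1M each
segmentation = [1_000_000] * 3           # ADE20K, Cityscapes, Pascal VOC 2012
depth = [1_434_262] + [1_000_000] * 3    # Mapillary SLS kept as is, plus three capped sets
retrieval = [1_580_470, 6_321_880, 960_000, 830_000] + [1_000_000] * 3

final_counts = classification + fine_grained + segmentation + depth + retrieval
total = sum(final_counts)
print(f"{len(final_counts)} rows, {total:,} images")
```

    The sum matches the 142,109,386 shown in the table, which is where the "142M" in the dataset name comes from.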
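  28. Scaling up datasets: DINOv2 [Oquab, TMLR]
    • Each dataset is augmented with images from the web in three steps
      - Embedding: extract features with a ViT-H trained by self-supervised learning on ImageNet-22K
      - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
      - Retrieval: add to the dataset the N web images most similar to the samples in the dataset

    [Figure: Uncurated Data (images collected from the web) and Curated Data (existing datasets) pass through Embedding, Deduplication, and Retrieval to produce Augmented Curated Data (the augmented dataset).]
    Quoted from [Oquab, arXiv]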
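
    The deduplication and retrieval steps can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the paper: cosine similarity, the greedy duplicate filter, the `threshold` value, and the function names are all assumptions made for the example, and the embeddings are toy 2-D vectors standing in for ViT features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def deduplicate(embs, threshold=0.95):
    """Greedy near-duplicate removal (hypothetical rule): keep an embedding
    only if it is not too similar to any embedding already kept."""
    kept = []
    for i, e in enumerate(embs):
        if all(cosine(e, embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

def retrieve(curated, web, n=2):
    """For each curated embedding, select the n most similar web embeddings."""
    selected = set()
    for c in curated:
        ranked = sorted(range(len(web)), key=lambda j: cosine(c, web[j]), reverse=True)
        selected.update(ranked[:n])
    return sorted(selected)

# Toy data: web[1] is a near-copy of web[0].
web = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
curated = [[0.9, 0.1]]

kept = deduplicate(web)                      # indices of web images that survive
web_unique = [web[i] for i in kept]
added = retrieve(curated, web_unique, n=2)   # indices into web_unique to add
print(kept, added)
```

    At real scale the pairwise comparisons would be replaced by an approximate nearest-neighbor index; the brute-force loops here are only meant to make the two steps concrete.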
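  32. Scaling up datasets: DINOv2 [Oquab, TMLR]
    • Investigates how self-supervised learning behaves on a large dataset of about 142 million images
    • Builds LVD-142M by combining existing datasets with images from the internet
    • Designs DINOv2 by taking iBOT [Zhou, ICLR] as the base and combining existing losses and techniques

    Large models are trained from scratch with self-supervised learning
    [Figure: data augmentation turns the input image into View 1 and View 2. View 1 goes through a Large ViT and a softmax over prototypes; View 2 goes through a Large ViT momentum model, updated as an exponential moving average, whose feature vectors are assigned to clusters by Sinkhorn-Knopp. The [CLS] token is trained with contrastive learning that uses no negatives, and the patch tokens are trained with MIM.]

    Small models are trained by distillation from the large model
    [Figure: a pretrained Large ViT supervises a Small ViT on the [CLS] and patch tokens with the same negative-free contrastive and MIM objectives. An exponential moving average of the Small ViT (the momentum model) is the model used in the end.]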
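
    Two ingredients named in the diagram, the exponential moving average that defines the momentum model and the Sinkhorn-Knopp cluster assignment, can be sketched as follows. This is a simplified illustration under stated assumptions: the `momentum`, `eps`, and `n_iters` values are placeholders, weights are a flat list of floats, and the normalization follows the commonly used SwAV-style formulation rather than any confirmed detail of the DINOv2 code.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Momentum-model weights track the student by exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Turn teacher scores (samples x prototypes) into soft cluster assignments
    in which every prototype receives roughly the same total mass."""
    q = [[math.exp(s / eps) for s in row] for row in scores]
    n, k = len(q), len(q[0])
    for _ in range(n_iters):
        # Column step: give every prototype the same share of the mass.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: give every sample the same share of the mass.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    return [[v * n for v in row] for row in q]  # rescale so each row sums to 1

teacher = ema_update(teacher=[1.0, 0.0], student=[0.0, 1.0], momentum=0.9)
assignments = sinkhorn_knopp([[0.9, 0.1], [0.8, 0.2], [0.2, 0.7]])
print(teacher, assignments)
```

    The student is then trained to match these assignments with its softmax output, which is why no negative pairs are needed.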
  33. Scaling up datasets: DINOv2 [Oquab, TMLR]
    • Accuracy comparison on ImageNet-1K
      - Weakly: multimodal methods that use images and language
      - Data: the dataset used for pretraining
    • Effect of knowledge distillation
      - Number of parameters of the teacher (ViT-g): about 1.1 billion
      - Number of parameters of the student (ViT-L): about 300 million

    [Figure 5: Effectiveness of knowledge distillation. (a) Comparison on individual metrics for ViT-g/14 and ViT-L/14.]
    The distilled model reaches higher accuracy than the same model trained from scratch.
    Using images only, DINOv2 reaches higher accuracy than prior methods.

                                                                 kNN    linear
    Method      Arch.          Data       Text sup.              val    val    ReaL   V2
    Weakly supervised
    CLIP        ViT-L/14       WIT-400M   yes                    79.8   84.3   88.1   75.3
    CLIP        ViT-L/14 336   WIT-400M   yes                    80.5   85.3   88.8   75.8
    SWAG        ViT-H/14       IG3.6B     yes                    82.6   85.7   88.7   77.6
    OpenCLIP    ViT-H/14       LAION      yes                    81.7   84.4   88.4   75.5
    OpenCLIP    ViT-G/14       LAION      yes                    83.2   86.2   89.4   77.2
    EVA-CLIP    ViT-g/14       custom*    yes                    83.5   86.4   89.3   77.4
    Self-supervised
    MAE         ViT-H/14       INet-1k    no                     49.4   76.6   83.3   64.8
    DINO        ViT-S/8        INet-1k    no                     78.6   79.2   85.5   68.2
    SEERv2      RG10B          IG2B       no                     –      79.8   –      –
    MSN         ViT-L/7        INet-1k    no                     79.2   80.7   86.0   69.7
    EsViT       Swin-B/W=14    INet-1k    no                     79.4   81.3   87.0   70.4
    Mugs        ViT-L/16       INet-1k    no                     80.2   82.1   86.9   70.8
    iBOT        ViT-L/16       INet-22k   no                     72.9   82.3   87.5   72.4
    DINOv2      ViT-S/14       LVD-142M   no                     79.0   81.1   86.6   70.9
                ViT-B/14       LVD-142M   no                     82.1   84.5   88.3   75.1
                ViT-L/14       LVD-142M   no                     83.5   86.3   89.5   78.0
                ViT-g/14       LVD-142M   no                     83.5   86.5   89.6   78.4

    Table 4: Linear evaluation on ImageNet-1k of frozen pretrained features. We report Top-1 accuracy on the validation set for publicly available models trained on public or private data, and with or without text supervision (text sup.). For reference, we also report the kNN performance on the validation set. We compare across any possible architectures (Arch.), at resolution 224 × 224 unless stated otherwise. The dataset used for training EVA-CLIP is a custom mixture, see paper for details (Fang et al., 2023).
    Quoted from [Oquab, arXiv]
  34. Scaling up datasets: DINOv2 [Oquab, TMLR]
    • Patch features are analyzed by applying principal component analysis (PCA) twice
      1. Apply PCA to all patch features of multiple images
         • Threshold the first principal component to split the patches into foreground and background
      2. Apply PCA to the patch features judged to be foreground across those images
         • Color each patch by using the values of the first, second, and third principal components as RGB values

    The model learns part-level correspondences between objects without human-annotated labels.

    Figure 1: Visualization of the first PCA components. We compute a PCA between the patches of the images from the same column (a, b, c and d) and show their first 3 components. Each component is matched to a different color channel. Same parts are matched between related images despite changes of pose, style or even objects. Background is removed by thresholding the first PCA component.
    Quoted from https://ai.facebook.com/blog/dino-v2-computer-vision-self-supervised-learning/ and [Oquab, arXiv]
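
    The two-stage PCA procedure can be sketched in plain Python. This is a toy illustration, not the authors' code: the power-iteration PCA, the zero threshold, and the min-max mapping to RGB are assumptions for the example, and the 3-D vectors stand in for real patch features. The sign of a principal component is arbitrary, so which side of the threshold is "foreground" has to be checked by eye in practice.

```python
import math

def center(rows):
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [[r[j] - mean[j] for j in range(dim)] for r in rows]

def principal_axes(rows, k, n_iters=100):
    """Top-k principal axes via power iteration with deflation."""
    x = center(rows)
    dim = len(x[0])
    cov = [[sum(r[a] * r[b] for r in x) / len(x) for b in range(dim)] for a in range(dim)]
    axes = []
    for c in range(k):
        v = [1.0 / (i + c + 1) for i in range(dim)]  # deterministic start vector
        for _ in range(n_iters):
            w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(t * t for t in w))
            if norm == 0.0:
                break
            v = [t / norm for t in w]
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(dim) for b in range(dim))
        # Deflate: remove the component just found before searching for the next.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(dim)] for a in range(dim)]
        axes.append(v)
    return x, axes

def project(x, axis):
    return [sum(a * b for a, b in zip(row, axis)) for row in x]

def to_unit(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

patches = [
    [0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.1],                    # background-like
    [5.0, 1.0, 0.0], [5.0, 0.0, 1.0], [6.0, 2.0, 3.0], [4.0, 3.0, 1.0],   # foreground-like
]

# Stage 1: threshold the first principal component of all patches.
x_all, axes_all = principal_axes(patches, k=1)
scores = project(x_all, axes_all[0])
foreground = [i for i, s in enumerate(scores) if s > 0.0]

# Stage 2: PCA on the foreground patches only, three components mapped to RGB.
x_fg, axes_fg = principal_axes([patches[i] for i in foreground], k=3)
colors = list(zip(*(to_unit(project(x_fg, axis)) for axis in axes_fg)))
print(foreground, colors)
```

    Running the second PCA on foreground patches only keeps the color range from being spent on the foreground/background contrast, so the three channels can separate object parts instead.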
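  35. Scaling up datasets: DINOv3 [Siméoni, arXiv]
    • Scales up both model size and dataset, taking DINOv2 as the base
      - Trains a ViT with 7B parameters on three datasets built from about 17 billion images
      - Adds Gram Anchoring to the DINOv2 loss so that dense features and global features can coexist
      - Trains multiple models of different sizes from the trained ViT in a single knowledge-distillation run (model compression)

    Source pool: about 17 billion images (images collected from Instagram, plus existing datasets)
    Web image dataset (about 1.69 billion images)
      - Construction: built by balancing data diversity based on clustering of DINOv2 features
      - Purpose: cover the visual concepts found on the web
    Existing datasets augmented with web images
      - Construction: built by applying the DINOv2 dataset pipeline
      - Purpose: cover the visual concepts of downstream tasks
    Existing datasets
      - Construction: existing datasets used as they are (ImageNet-1k, ImageNet-22k, Mapillary Street-level Sequences, etc.)
      - Purpose: optimize model performance
    When a mini-batch is built, either ImageNet-1K only or the mixed data is selected at random for training
    (in DINOv3, 10% of training uses ImageNet-1K only)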
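
    The random choice between a homogeneous ImageNet-1K batch and a mixed batch can be pictured with a small sampler. This is only a sketch: `pick_batch_source` and `p_imagenet_only` are hypothetical names, and the probability 0.1 is an illustrative value, not a setting taken from any released configuration.

```python
import random

def pick_batch_source(rng, p_imagenet_only):
    """Decide, per mini-batch, whether to draw from ImageNet-1K alone
    or from the mixture of all datasets (hypothetical helper)."""
    return "imagenet1k_only" if rng.random() < p_imagenet_only else "mixed"

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sources = [pick_batch_source(rng, p_imagenet_only=0.1) for _ in range(10_000)]
share = sources.count("imagenet1k_only") / len(sources)
print(f"share of ImageNet-1K-only batches: {share:.3f}")
```

    Choosing the source per batch, rather than per sample, keeps each batch homogeneous, which is the property the slide describes.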
  36. Scaling up datasets: DINOv3 [Siméoni, arXiv]
    Performance change over long training
      - Class classification (a global prediction task) improves
      - Segmentation (a dense prediction task) degrades
    [Plot (b) ViT-g: IN1k and VOC performance against training iterations, 250k to 1M.]

    Cosine similarity between the patch marked in red and the other patches
    [Figure panels: Image, 200k, 400k, 600k, 800k, 1M]
    Figure 6: Evolution of the cosine similarity between the patch noted in red and all other patches. As training progresses, the features produced by the model become less localized and the similarity maps become noisier.

    4 Gram Anchoring: A Regularization for Dense Features
    To fully leverage the benefits of large-scale training, we aim to train the 7B model for an extended duration, with the notion that it could potentially train indefinitely. As expected, prolonged training leads to improvements on global benchmarks. However, as training progresses, the performance degrades on dense tasks (Figs. 5b and 5c). This phenomenon, which is due to the emergence of patch-level inconsistencies in

    As training progresses, the locality of the features weakens and noise increases.
    Goal: regularize the patch features so that high global performance is kept while the degradation of dense performance is suppressed.
    Quoted from [Siméoni, arXiv]
  37. Scaling up datasets: DINOv3 [Siméoni, arXiv]
    Gram Anchoring only encourages the inner-product relations to be preserved, so the features themselves remain free to be updated.
    [Figure: data augmentation turns the input image into View 1 and View 2. A ViT and its momentum model (a ViT updated by exponential moving average) produce [CLS] and patch tokens. A Gram Teacher, which keeps a model from early in training when dense performance is high and is refreshed every fixed number of iterations, also produces patch tokens. The inner products between patch features are computed for both, and the model is trained so that the two matrices match: Gram Anchoring.]
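
    The idea of matching inner products between patch features can be written down compactly. The sketch below assumes L2-normalized patch features and a squared Frobenius distance between the two Gram matrices; the function names are made up for the example and the inputs are toy 2-D vectors rather than ViT outputs.

```python
import math

def l2_normalize(rows):
    out = []
    for r in rows:
        n = math.sqrt(sum(v * v for v in r)) or 1.0
        out.append([v / n for v in r])
    return out

def gram(rows):
    """Matrix of pairwise inner products between patch features."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def gram_anchoring_loss(student_patches, gram_teacher_patches):
    """Squared Frobenius distance between the student's and the Gram teacher's
    patch-similarity matrices (illustrative formulation)."""
    gs = gram(l2_normalize(student_patches))
    gt = gram(l2_normalize(gram_teacher_patches))
    return sum((a - b) ** 2 for rs, rt in zip(gs, gt) for a, b in zip(rs, rt))

gram_teacher = [[1.0, 0.0], [0.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0]]     # same patch-to-patch structure, different features
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # all patches look alike

loss_rotated = gram_anchoring_loss(rotated, gram_teacher)
loss_collapsed = gram_anchoring_loss(collapsed, gram_teacher)
print(loss_rotated, loss_collapsed)
```

    The rotated features incur no penalty even though every vector changed, while the collapsed features are penalized. That is the sense in which only the relations between patches are anchored and the features stay free to move.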
  38. Scaling up datasets: DINOv3 [Siméoni, arXiv]
    Sharing the trained model among the models being trained at the same time reduces computation time and cost.
    [Figure: timeline of one iteration across GPUs. Every GPU loads B/N_T samples and runs teacher inference (B/N_T samples per GPU); the samples and teacher inference results are all-gathered; students S1, S2, and S3 are then trained on their own GPU groups (B/N_S1, B/N_S2, and B/N_S3 samples per GPU), each synchronizing its model and waiting at a synchronization barrier.]
      - Inference with the trained model uses all GPUs
      - Each distillation target is given a number of GPUs set according to the model
    Quoted from [Siméoni, arXiv]
  39. Scaling up datasets: DINOv3 [Siméoni, arXiv]
    Evaluating DINOv3's Performance
    DINOv3 sets a new standard in vision foundation models. For the first time, a model trained with SSL outperforms weakly-supervised models on a broad range of probing tasks, from fine-grained image classification, to semantic segmentation, to object tracking in video.

    DINOv3 delivers high performance across a variety of tasks.
    Quoted from https://ai.meta.com/dinov3/
  40. Scaling up datasets: DINOv3 [Siméoni, arXiv]
    • Visualization of features using principal component analysis

    [Figure panels: Input, SigLIP 2, PE Spatial, DINOv2 w/reg, DINOv3]
    Figure 13: Comparison of dense features. We compare several vision backbones by projecting their dense outputs using PCA and mapping them to RGB. From left to right: SigLIP 2 ViT-g/16, PEspatial ViT-G/14, DINOv2 ViT-g/14 with registers, DINOv3 ViT-7B/16. Images are forwarded at resolution 1280×960 for
    Quoted from [Siméoni, arXiv] and https://ai.meta.com/dinov3/
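  41. Summary
    • What is self-supervised learning?
      - A model is pretrained by learning a pretext task on unlabeled data
      - The aim is to extract features that are effective across a variety of downstream tasks
      - Progress of the methods: design of effective pretext tasks → contrastive learning → MIM → scaling up datasets

    [Figure: lineage from DINOv2 to DINOv3. DINOv2: a ViT-H self-supervised on ImageNet-22K is used to build LVD-142M, with diversity ensured by copy detection, on which ViT-g is trained. DINOv3: features from DINOv2 are used to build three kinds of datasets, with diversity ensured by clustering, on which ViT-7B is trained by self-supervised learning. The data counts and parameter counts annotated in the figure are not recoverable from the transcript.]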
  42. Summary
    • Topics this talk could not cover
      - Details of each method and an introduction to derived methods
      - Analysis of self-supervised learning
      - Use of self-supervised models
      - Model compression of self-supervised models, etc.

    Aggregating the knowledge of multiple foundation models
      AM-RADIO [Ranzinger, CVPR]
        • Aggregates the knowledge of DINOv2, CLIP, and SAM into a single model
      SigLino [Chaybouti, CVPR]
        • Aggregates the knowledge of DINO-series and SigLIP models into a single model