Slide 1

Slide 1 text

All-Japan Computer Vision Study Group: Large-Scale Datasets for Self-Supervised Learning of Images
Naoki Okamoto (Chubu University, Machine Perception and Robotics Group) http://mprg.jp/

Slide 2

Slide 2 text

Self-introduction: Naoki Okamoto
- Doctoral student, Graduate School of Engineering, Chubu University (Fujiyoshi Laboratory)
- Research theme: automatic design of training methods through hyperparameter search
- Research areas: knowledge distillation, semi-supervised learning, self-supervised learning
- Survey slides on self-supervised learning: https://speakerdeck.com/naok/self-supervised-learning
- Knowledge distillation for ensemble learning [ECCV]
- Article explaining DINO
[Figure: knowledge-distillation graph over ResNet18_ABN students with probability/attention transfer; ensemble accuracy 74.52%, individual students 68.1-72.09%.]

Slide 3

Slide 3 text

Papers introduced
- Learning Transferable Visual Models From Natural Language Supervision
  - Builds a large-scale image-language multimodal dataset and uses it for training
- DINOv2: Learning Robust Visual Features without Supervision
  - Proposes a framework for building a large-scale image dataset
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates how MAE behaves on Instagram-3B, a billion-scale dataset
For each paper, the following points are covered:
1. The training method that exploits the large-scale dataset
2. How the large-scale dataset is constructed
3. The training effect obtained from the large-scale dataset

Slide 4

Slide 4 text

Self-Supervised Learning (SSL)
- Pretraining on a large amount of unlabeled data to which no ground-truth labels have been assigned
- Goal: extract features that are effective for various downstream transfer-learning / fine-tuning tasks
(1) Pretrain a model (starting from random initialization) on the unlabeled data
(2) Transfer / fine-tune the SSL-pretrained model to the target task (a sketch of this step follows below)
[Figure: SSL pipeline, unlabeled data -> pretrained model -> classification model (FC head, teacher label "Pelican") and object-detection model (task head); the slide also includes an excerpt of a ViT-paper figure on attention and the principal components of patch embeddings.]
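
As an illustration of step (2), here is a minimal sketch of reusing an SSL-pretrained backbone for a downstream classification task. The checkpoint path, the ResNet-50 backbone, and the number of target classes are placeholder assumptions, not details from the talk:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# (1) backbone pretrained with SSL; the checkpoint path is a placeholder
backbone = resnet50()
state = torch.load("ssl_pretrained_resnet50.pth", map_location="cpu")
backbone.load_state_dict(state, strict=False)            # encoder weights only, no head

# (2) replace the classification head for the target task and fine-tune
backbone.fc = nn.Linear(backbone.fc.in_features, 10)     # e.g. 10 target classes

optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One fine-tuning step on a labeled mini-batch of the target task."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```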

Slide 5

Slide 5 text

Representative self-supervised learning methods
- Masked Image Modeling (MIM)
  - Predict the features of masked patches: BEiT [H. Bao+, ICLR], BEiTv2 [Z. Peng+, arXiv]
  - Predict the pixels of masked patches: MAE [K. He+, CVPR], SimMIM [Z. Xie+, CVPR]
- Contrastive Learning (CL)
  - Compare image-level features extracted from multiple different views: MoCo v3 [X. Chen+, ICCV], DINO [M. Caron+, ICCV]
A sketch of the MIM idea is shown below.
[Figure: MAE training (masking, encoder over visible patches, mask-token insertion, decoder, reconstruction loss) and DINO training (local/global views, encoder + MLP, exponential-moving-average branch, softmax / centering, loss).]
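
A minimal sketch of the MIM objective behind MAE: mask a large fraction of patches at random, encode only the visible ones, and compute the reconstruction loss on the masked patches only. This is a simplified illustration of the idea, not the authors' implementation; `encoder` and `decoder` stand for arbitrary ViT-style modules with the interface assumed in the comments.

```python
import torch

def mae_style_loss(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened image patches (the regression targets)."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # random permutation per image; the first `num_keep` patches stay visible
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                    # encode the visible patches only

    # assumed interface: the decoder inserts mask tokens at the masked positions
    # and predicts the pixel values of every patch, returning a (B, N, D) tensor
    pred = decoder(latent, ids_keep, ids_mask)

    masked_pred = torch.gather(pred, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    masked_target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    return ((masked_pred - masked_target) ** 2).mean()   # loss on masked patches only
```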

Slide 6

Slide 6 text

Representative self-supervised learning methods (self-supervised learning targeting ViT)
- Masked Image Modeling (MIM): predict the features of masked patches (BEiT [H. Bao+, ICLR], BEiTv2 [Z. Peng+, arXiv]); predict the pixels of masked patches (MAE [K. He+, CVPR], SimMIM [Z. Xie+, CVPR])
- Contrastive Learning (CL): compare image-level features extracted from multiple different views (MoCo v3 [X. Chen+, ICCV], DINO [M. Caron+, ICCV])
- Multimodal CL: contrastive learning on image-text pairs (CLIP [A. Radford+, ICML])
- Multimodal MIM: MIM on RGB, depth, and semantics simultaneously (MultiMAE [R. Bachmann+, ECCV])
- Hybrid: train CL and MIM simultaneously (iBOT [J. Zhou+, ICLR], DINOv2 [M. Oquab+, arXiv])
[Figure: the MAE and DINO training diagrams from the previous slide, repeated.]

Slide 7

Slide 7 text

Representative self-supervised learning methods
- Tutorial at the CVIM research meeting: https://speakerdeck.com/naok/zijijiaoshiarixuexiniyorushiqianxuexicvimtiyutoriaru

Slide 8

Slide 8 text

Papers introduced
- Learning Transferable Visual Models From Natural Language Supervision
  - Builds a large-scale image-language multimodal dataset and uses it for training
- DINOv2: Learning Robust Visual Features without Supervision
  - Proposes a framework for building a large-scale image dataset
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates how MAE behaves on Instagram-3B, a billion-scale dataset

Slide 9

Slide 9 text

Multimodal contrastive learning
- In training that uses multiple modalities, associating the modalities with each other is one of the main challenges -> align the modalities through contrastive learning
- Multimodal datasets: one sample is defined as a pair of information from different modalities
  - The pairs defined in the dataset are used in place of pairs generated by data augmentation
  - Image × language: the Conceptual Captions datasets (example captions: "Broken glass mobile phone on a white background", "PERSON beating on the phone stock images")
  - Image × point cloud: the KITTI dataset

Slide 10

Slide 10 text

Learning Transferable Visual Models From Natural Language Supervision [A. Radford+, ICML]
- CLIP: Contrastive Language-Image Pre-training
  - Pretraining: self-supervised learning so that the features of paired images and texts agree
  - Image classification: extract features from texts that describe the classes
  - Image classification: classify the image from the feature similarity between the image and the texts
- The model learns the correspondence between images and text -> that correspondence allows image classification with no additional training
[Figure from the CLIP paper: (1) contrastive pre-training of an image encoder and a text encoder on captions such as "Pepper the aussie pup"; (2) creating a dataset classifier from label text with the template "A photo of a {object}."; (3) zero-shot prediction by comparing the image feature I1 with the text features T1...TN.]

Slide 11

Slide 11 text

Self-supervised learning so that the features of paired images and texts agree
- Contrastive learning that uses the pairs defined in the dataset as positives
  - Positive pair: an image-text pair defined in the dataset
  - Negative pairs: every other image-text combination within the mini-batch
- WebImageText is used as the dataset
- The loss compares the actual similarity matrix of the features (cosine similarity) against the ideal similarity pattern, in which only the dataset-defined pairs match (see the loss sketch below)
[Figure: the contrastive pre-training matrix of image features I1...IN against text features T1...TN, with the diagonal entries as positives.]
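
A minimal sketch of this loss, assuming image and text features have already been extracted for one mini-batch: cosine similarities of every image-text combination are computed, and the dataset-defined pairs (the diagonal) serve as the positives of a symmetric cross-entropy. The fixed temperature value is a placeholder; CLIP actually learns it:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: (B, D); row i of each comes from one dataset-defined pair."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # cosine similarity of every image with every text in the mini-batch
    logits = image_features @ text_features.t() / temperature      # (B, B)

    # ideal similarity pattern: pair (i, i) is the positive, all other entries are negatives
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```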

Slide 12

Slide 12 text

WebImageText dataset
- Consists of 400 million image-text pairs collected from the web
  - The dataset itself is not released because of copyright issues
- Diverse data is collected in a balanced way based on words that appear frequently in Wikipedia:
1. Build a query list of the words used at least 100 times in English Wikipedia
2. Collect image-text pairs whose text contains a word from the query list, with an upper limit of 20,000 pairs per query word (a sketch of this balancing step follows below)
3. In the end, 400 million image-text pairs are collected (not public)
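
A rough sketch of the balancing step (2), under the assumption that the raw crawl is available as (image_url, caption) records; the matching rule and the per-query cap are simplified stand-ins for the procedure described in the paper:

```python
from collections import defaultdict

def collect_pairs(crawl, query_list, max_pairs_per_query=20_000):
    """crawl: iterable of (image_url, caption); query_list: frequent English-Wikipedia words."""
    queries = {q.lower() for q in query_list}
    per_query_count = defaultdict(int)
    dataset = []

    for image_url, caption in crawl:
        words = set(caption.lower().split())
        matched = words & queries                                  # queries found in the caption
        usable = [q for q in matched if per_query_count[q] < max_pairs_per_query]
        if usable:                                                 # keep only while under the cap
            dataset.append((image_url, caption))
            for q in usable:
                per_query_count[q] += 1
    return dataset
```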

Slide 13

Slide 13 text

Zero-shot image classification
- CLIP can be applied to image classification with no additional training at all (zero-shot)
- The class of the input image is predicted from the feature-similarity relationship with texts that contain the class names
[Figure: texts produced by the prompt template "A photo of a {object}." for plane / car / dog / bird are encoded into features; the image feature is compared with each of them and the image is judged to be the "dog" class.]

Slide 14

Slide 14 text

Extracting features from texts that describe the classes
- Features are extracted from texts obtained by inserting each class name into a prompt template
  - Prompt templates are designed by hand to match the images and the problem setting
  - Bird recognition: "A photo of a {object}, a type of bird."
  - Traffic-sign recognition: "A zoomed in photo of a {object} traffic sign."
[Figure: the text encoder converts "A photo of a plane / car / dog / bird." into the text features T1...TN.]
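
This step can be sketched with the publicly released CLIP package (github.com/openai/CLIP); the class names and the template below are examples, not the ones used in the talk:

```python
import torch
import clip   # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["plane", "car", "dog", "bird"]
template = "A photo of a {}."                             # hand-designed prompt template
texts = [template.format(name) for name in class_names]

with torch.no_grad():
    tokens = clip.tokenize(texts).to(device)
    text_features = model.encode_text(tokens)             # one feature vector per class text
    text_features /= text_features.norm(dim=-1, keepdim=True)
```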

Slide 15

Slide 15 text

Classifying the image from the feature similarity between the image and the texts
- Extract the feature of the image to be classified
- Compute the cosine similarity between the image feature and each text feature
- Take the class name in the most similar text as the prediction
[Figure: step (3) of the CLIP overview; here the image encoder extracts the feature I1 of the input image.]

Slide 16

Slide 16 text

Classifying the image from the feature similarity between the image and the texts
- Extract the feature of the image to be classified
- Compute the cosine similarity between the image feature and each text feature
- Take the class name in the most similar text as the prediction
[Figure: step (3) of the CLIP overview; here the similarities I1·T1 ... I1·TN between the image feature and the text features are computed.]

Slide 17

Slide 17 text

Classifying the image from the feature similarity between the image and the texts
- Extract the feature of the image to be classified
- Compute the cosine similarity between the image feature and each text feature
- Take the class name in the most similar text as the prediction
[Figure: step (3) of the CLIP overview; here the most similar text is selected and the image is judged to be the "dog" class. A full sketch of this pipeline follows below.]
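
Putting the three steps together, a self-contained sketch with the openai/CLIP package; the image file name is a placeholder:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["plane", "car", "dog", "bird"]
texts = clip.tokenize([f"A photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)   # placeholder image file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    similarity = image_features @ text_features.T        # cosine similarities, shape (1, 4)
    predicted = class_names[similarity.argmax(dim=-1).item()]
    print(predicted)                                      # expected: "dog" for a dog photo
```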

Slide 18

Slide 18 text

Evaluation by zero-shot image classification
- Compared against the accuracy obtained by transferring a model supervised-trained on ImageNet-1k
- Network: ResNet-50
- CLIP achieves higher accuracy on 16 of the 27 evaluated datasets
- Accuracy drops considerably on more specialized tasks such as satellite-image or road-traffic-sign classification

Slide 19

Slide 19 text

Trying it out
- Computed the feature similarity between images and prompts for the airplane class of CIFAR-10
- Each image is framed with the color of its most similar prompt
  - Blue frame: "a photo of a airplane flying through the blue sky"
  - Red frame: "a black and white photo of a airplane"
  - Green frame: "a photo of a airplane that landed on the ground"

Slide 20

Slide 20 text

Trying it out
- Computed the feature similarity between images and prompts for the airplane class of CIFAR-10
- Each image is framed with the color of its most similar prompt
  - Blue frame: "a photo of a airplane flying through the blue sky"
  - Red frame: "a black and white photo of a airplane"
  - Green frame: "a photo of a airplane that landed on the ground"

Slide 21

Slide 21 text

Papers introduced
- Learning Transferable Visual Models From Natural Language Supervision
  - Builds a large-scale image-language multimodal dataset and uses it for training
- DINOv2: Learning Robust Visual Features without Supervision
  - Proposes a framework for building a large-scale image dataset
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates how MAE behaves on Instagram-3B, a billion-scale dataset

Slide 22

Slide 22 text

Scaling up the dataset
- Self-supervised learning has mostly used ImageNet-1K
  - Small compared with the datasets used for foundation models in natural language processing
- Datasets are being scaled up with the goal of building foundation models for images
  - DINOv2 [M. Oquab+, arXiv]: builds roughly 142 million samples from existing datasets and images on the web
  - MAWS [M. Singh+, ICCV]: uses roughly 3 billion images collected from Instagram

Slide 23

Slide 23 text

Scaling up the dataset
- Self-supervised learning has mostly used ImageNet-1K
  - Small compared with the datasets used for foundation models in natural language processing
- Datasets are being scaled up with the goal of building foundation models for images
  - DINOv2 [M. Oquab+, arXiv]: builds roughly 142 million samples from existing datasets and images on the web
  - MAWS [M. Singh+, ICCV]: uses roughly 3 billion images collected from Instagram

Slide 24

Slide 24 text

DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
- Investigates the behavior of image self-supervised learning with large-scale data of roughly 142 million samples
  - LVD-142M is built by combining several existing datasets with images from the web
  - DINOv2 is designed on top of iBOT by combining existing losses and techniques

Slide 25

Slide 25 text

DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
- Investigates the behavior of image self-supervised learning with large-scale data of roughly 142 million samples
  - LVD-142M is built by combining several existing datasets with images from the web
  - DINOv2 is designed on top of iBOT by combining existing losses and techniques
- The number of samples is augmented and the balance between the datasets is adjusted to obtain the final LVD-142M
[Table 15 of the paper: composition of the LVD-142M dataset. For each task and source dataset/split it lists the number of images, the retrieval mode (as is / sample / cluster), the number of retrieved web images, and the final count, e.g. ImageNet-22k kept as is (14,197,086) and sampled to 56,788,344, ImageNet-1k train sampled to 40,997,344, and the fine-grained classification, segmentation, depth-estimation, and retrieval datasets each clustered and capped at about 1,000,000, for a total of 142,109,386 images.]

Slide 26

Slide 26 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps (a sketch follows below):
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: curated data (existing datasets) and uncurated data (images collected from the web) pass through embedding, deduplication, and retrieval to produce the augmented curated dataset.]
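
A rough sketch of the deduplication and retrieval steps using cosine similarity over the embeddings, with FAISS as an example index. The embeddings are assumed to come from the ViT-H above and to be L2-normalized float32 arrays; the duplicate threshold and the number of neighbors per image are placeholder values, and the paper's actual copy-detection and retrieval pipeline is more elaborate:

```python
import numpy as np
import faiss

def dedup_and_retrieve(curated_emb, web_emb, n_per_image=4, dup_threshold=0.95):
    """curated_emb: (C, D), web_emb: (W, D); L2-normalized float32 embeddings from the ViT-H."""
    dim = web_emb.shape[1]

    # Deduplication: drop web images that are near-copies of an already kept web image
    index = faiss.IndexFlatIP(dim)               # inner product == cosine for normalized vectors
    keep = []
    for i, vec in enumerate(web_emb):
        if index.ntotal > 0:
            sim, _ = index.search(vec[None, :], 1)
            if sim[0, 0] >= dup_threshold:       # too close to something already kept
                continue
        index.add(vec[None, :])
        keep.append(i)
    web_kept = web_emb[keep]

    # Retrieval: for every curated image, fetch its N most similar web images
    web_index = faiss.IndexFlatIP(dim)
    web_index.add(web_kept)
    _, nn_ids = web_index.search(curated_emb, n_per_image)
    retrieved = np.unique(nn_ids.reshape(-1))
    return [keep[j] for j in retrieved]          # indices into the original web image pool
```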

Slide 27

Slide 27 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps:
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: the same curation pipeline (existing datasets / web images -> embedding -> deduplication -> retrieval -> augmented dataset), repeated with one step highlighted.]

Slide 28

Slide 28 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps:
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: the same curation pipeline, repeated with one step highlighted.]

Slide 29

Slide 29 text

Augmenting the number of samples with images from the web
- For each dataset, the number of samples is augmented with web images through three steps:
  - Embedding: extract features with a ViT-H that was self-supervised-pretrained on ImageNet-22K
  - Deduplication: apply feature-based copy detection to the web images and remove near-duplicate images
  - Retrieval: add the N web images most similar to the dataset to that dataset
[Figure: the same curation pipeline, repeated with one step highlighted.]

Slide 30

Slide 30 text

DINOv2: Learning Robust Visual Features without Supervision [M. Oquab+, arXiv]
- Investigates the behavior of image self-supervised learning with 142M samples of large-scale data
  - LVD-142M is built by combining existing datasets with images from the internet
  - DINOv2 is designed on top of iBOT by combining existing losses and techniques
[Figure: self-supervised learning with DINOv2. Two augmented views of the input image are fed to a ViT and to a momentum ViT updated by exponential moving average; [CLS] and patch tokens are scored against prototypes, the teacher assignments are produced with Sinkhorn-Knopp (cluster assignment) and softmax, and the model is trained with negative-free contrastive learning plus MIM. Compression of the DINOv2-trained model by knowledge distillation: a pretrained large ViT teaches a small ViT (with its own momentum copy) using the same negative-free contrastive and MIM losses; the small ViT is the model that is finally used.]
Two of these ingredients are sketched below.
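
Two of the ingredients in the diagram, sketched in isolation: the exponential-moving-average update of the momentum (teacher) model, and a Sinkhorn-Knopp normalization that turns the teacher's prototype scores into balanced cluster assignments. Both are simplified versions written from the description above rather than the authors' code; the momentum, epsilon, and iteration count are placeholder values.

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher (momentum model) weights follow the student by exponential moving average."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

@torch.no_grad()
def sinkhorn_knopp(scores, eps=0.05, n_iters=3):
    """scores: (B, K) teacher similarities to K prototypes -> balanced soft assignments."""
    Q = torch.exp(scores / eps).t()            # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)        # rows: equal total mass per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)        # columns: one unit of mass per sample
        Q /= B
    return (Q * B).t()                         # (B, K), each row sums to 1
```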

Slide 31

Slide 31 text

Learning effect of the large-scale dataset: recognition performance
- Accuracy comparison on ImageNet-1K
  - Weakly: image-language multimodal (weakly supervised) methods
  - Data: the pretraining dataset
  - Using images only, DINOv2 achieves higher accuracy than previous self-supervised methods
- Effect of knowledge distillation
  - Teacher ViT-g: roughly 1.1 billion parameters
  - Student ViT-L: roughly 300 million parameters
  - The distilled student achieves higher accuracy than the same architecture trained from scratch
[Paper excerpts: an ablation table for the KoLeo and MIM loss terms, a figure on the effectiveness of knowledge distillation (ViT-g/14 teacher versus ViT-L/14 students), and Table 4, linear evaluation on ImageNet-1k of frozen pretrained features: weakly supervised models (CLIP, SWAG, OpenCLIP, EVA-CLIP) versus self-supervised models (MAE, DINO, SEERv2, MSN, EsViT, Mugs, iBOT) and DINOv2 with ViT-S/B/L/g on LVD-142M, where DINOv2 ViT-g/14 reaches 86.5 top-1 with a linear probe and 83.5 with kNN.]

Slide 32

Slide 32 text

Learning effect of the large-scale dataset: analysis of the features
- Patch features are analyzed by applying principal component analysis (PCA) twice (a sketch follows below)
  1. Apply PCA to all patch features of several images
     - Threshold the first principal component to split the patches into foreground and background
  2. Apply PCA to the features of the patches judged to be foreground
     - Color each patch using the values of the first, second, and third principal components as RGB
- Part-level relations between objects are learned without any human-annotated labels
[Figure 1 of the paper: visualization of the first PCA components. PCA is computed between the patches of images from the same column and the first three components are mapped to color channels; the same parts are matched between related images despite changes of pose, style, or even object, and the background is removed by thresholding the first PCA component.]
Figure cited from https://ai.facebook.com/blog/dino-v2-computer-vision-self-supervised-learning and [Oquab+, arXiv]
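
A minimal sketch of the two-stage PCA analysis, assuming the torch.hub entry point of the public facebookresearch/dinov2 repository and the `x_norm_patchtokens` key of `forward_features`; the image paths, input resolution, and the sign of the foreground threshold are assumptions that may need adjusting:

```python
import torch
from PIL import Image
from sklearn.decomposition import PCA
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

prep = transforms.Compose([
    transforms.Resize((448, 448)),              # 448 / 14 = 32 patches per side
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
imgs = torch.stack([prep(Image.open(p).convert("RGB")) for p in ["a.jpg", "b.jpg"]])

with torch.no_grad():
    feats = model.forward_features(imgs)["x_norm_patchtokens"]   # (B, N, D) patch tokens
B, N, D = feats.shape
flat = feats.reshape(B * N, D).cpu().numpy()

# 1st PCA over all patches of all images: threshold the first component -> foreground mask
pc1 = PCA(n_components=1).fit_transform(flat)[:, 0]
foreground = pc1 > 0                             # assumed sign; flip if the background is selected

# 2nd PCA over the foreground patches only: first three components used as RGB
rgb = PCA(n_components=3).fit_transform(flat[foreground])
rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)      # scale to [0, 1] per channel
# `foreground` marks where each RGB value goes back onto the 32x32 patch grid of each image
```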

Slide 33

Slide 33 text

Trying it out
- Reproduced based on the pseudo-code posted in the Issues of the DINOv2 GitHub repository
  - Images from a website that provides free stock photos were used for the visualization
- Shown: the histogram of the first principal component of the first PCA, and images colored with the three principal components of the second PCA
- Patch features that are similar within each object part are extracted

Slide 34

Slide 34 text

Trying it out
- Reproduced based on the pseudo-code posted in the Issues of the DINOv2 GitHub repository
  - Images from a website that provides free stock photos were used for the visualization
- Shown: the histogram of the first principal component of the first PCA, and images colored with the three principal components of the second PCA
- Patch features that are similar within each object part are extracted

Slide 35

Slide 35 text

Scaling up the dataset
- Self-supervised learning has mostly used ImageNet-1K
  - Small compared with the datasets used for foundation models in natural language processing
- Datasets are being scaled up with the goal of building foundation models for images
  - DINOv2 [M. Oquab+, arXiv]: builds roughly 142 million samples from existing datasets and images on the web
  - MAWS [M. Singh+, ICCV]: uses roughly 3 billion images collected from Instagram

Slide 36

Slide 36 text

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV]
- Investigates the training effect of MAE when roughly 3 billion samples of large-scale data are used
  - Uses 3 billion images collected from Instagram together with weak labels (the hashtags attached to each post)
- Two-stage pretraining, MAE followed by weakly supervised learning, gives a stronger pretraining effect (see the sketch below)
  - Both the MAE stage and the weakly supervised stage are trained for only one epoch
[Paper excerpts: the ViT scales used (ViT-B 86M, ViT-L 307M, ViT-H 632M parameters, plus larger ViT-2B and ViT-6.5B models), all pretrained at 224 x 224 resolution; and Table 1, the evaluation datasets for image classification, object detection, and video action recognition (ImageNet-1k, iNaturalist-18, ImageNetv2, ImageNet-ReaL, ObjectNet, Food-101, COCO, LVIS, Kinetics-400, Something-Something v2).]
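
A conceptual sketch of the two-stage schedule described above: one pass of MAE-style pre-pretraining on the unlabeled images, followed by weakly supervised pretraining that predicts the hashtags as multi-hot labels. `mae_loss` stands for a masked-reconstruction objective like the one sketched earlier; the multi-label head, loss, and hyperparameters are illustrative assumptions, and the paper's actual recipe differs in detail.

```python
import torch
import torch.nn as nn

def pre_pretrain_mae(encoder, decoder, loader, mae_loss, lr=1.5e-4):
    """Stage 1: a single epoch of MAE over the unlabeled Instagram images."""
    optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for images, _ in loader:                    # hashtags are ignored in this stage
        loss = mae_loss(images, encoder, decoder)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def weakly_supervised_pretrain(encoder, loader, num_hashtags, feat_dim, lr=1e-4):
    """Stage 2: start from the MAE encoder, predict hashtags as weak multi-hot labels."""
    head = nn.Linear(feat_dim, num_hashtags)
    optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for images, hashtag_targets in loader:      # targets: multi-hot tensor (B, num_hashtags)
        logits = head(encoder(images))          # pooled encoder features -> hashtag logits
        loss = criterion(logits, hashtag_targets.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```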

Slide 37

Slide 37 text

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining [M. Singh+, ICCV]
- Investigates the training effect of MAE when roughly 3 billion samples of large-scale data are used
  - Uses 3 billion images collected from Instagram together with weak labels (the hashtags attached to each post)
- Two-stage pretraining, MAE followed by weakly supervised learning, gives a stronger pretraining effect
  - Both the MAE stage and the weakly supervised stage are trained for only one epoch
- The authors at the poster session: the training never fully converges no matter how long you run it, but one epoch already gives sufficient recognition performance ("reasonably enough")
[Paper excerpts: the ViT scales used and Table 1 with the evaluation datasets, as on the previous slide.]

Slide 38

Slide 38 text

Summary
- Learning Transferable Visual Models From Natural Language Supervision
  - Diverse image-text pairs are collected from the web in a balanced way, based on words that appear frequently in Wikipedia -> the WebImageText dataset of 400 million image-text pairs
  - Multimodal contrastive learning acquires the correspondence between diverse images and texts
- DINOv2: Learning Robust Visual Features without Supervision
  - Existing datasets are augmented with web images so that the number of samples stays balanced -> combining the augmented datasets yields LVD-142M, consisting of 142 million images
  - Part-level relations between objects are learned without ground-truth labels
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
  - Investigates the behavior of MAE on Instagram-3B, which consists of 3 billion images with weak labels
  - One epoch of self-supervised learning plus weakly supervised learning achieves sufficient recognition performance