Reference material on Self-Supervised Learning
Created January 24, 2023
Naoki Okamoto (Chubu University, Machine Perception & Robotics Group)

Self-Supervised Learning
Naoki Okamoto, Hironobu Fujiyoshi, Tsubasa Hirakawa, Takayoshi Yamashita (Chubu University, Machine Perception & Robotics Group)
Masanori Suganuma (Tohoku University)
http://mprg.jp
Self-supervised learning
• Learn from large amounts of unlabeled data through an artificial pretext task
  - The model trained by self-supervised learning is then used as a pre-trained model
• Representative approaches
  - Contrastive learning: SimCLR [Chen+, ICML 2020], MoCo [He+, CVPR 2020]
  - Negative-free: BYOL [Grill+, NeurIPS 2020], SimSiam [Chen+, CVPR 2021]
  - Masked Image Modeling: SimMIM [Xie+, CVPR 2022], MAE [He+, CVPR 2022]
Self-supervised learning (SSL: Self-supervised Learning)
(1) Build a pre-trained model from a large amount of unlabeled data
(2) Fine-tune the pre-trained model built by SSL on the target task
[Diagram: a large unlabeled dataset → pre-trained model; the pre-trained model is then combined with an FC layer or a task-specific head and trained with supervised labels (e.g. "Pelican")]
→ downstream models: image classification model, object detection model
Self-supervised learning
• Artificial pretext tasks are built by applying data augmentations such as geometric image transforms
• Example pretext task: SimCLR (random crop + color distortion)
  - Random crop: creates prediction tasks between views at the same location and at neighboring locations
  - Color distortion: creates a color-prediction task and prevents the location task from being solved from color cues alone
[Figure: artificial pretext tasks — same-location prediction, neighboring-location prediction, color prediction; random crop, color distortion]
How self-supervised models are evaluated
• Evaluate the feature representation obtained by self-supervised learning
  - Evaluation with the KNN method
  - Linear evaluation
• Evaluate transferability as a pre-trained model
  - Fine-tuning
How self-supervised models are evaluated: KNN evaluation
• Apply the KNN method to the features extracted by the model and the ground-truth labels
  - Accuracy varies little with hyperparameters, which allows a unified comparison across methods
[Diagram: the model trained by self-supervised learning (on the SSL dataset) extracts features; features and labels of the training data (e.g. Airplane, Cat) and of the evaluation data (e.g. Cat, Dog) are fed to the KNN method, and each sample is classified from its K nearest neighbors]
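A minimal sketch of this KNN evaluation protocol, assuming `model` is the frozen SSL-trained feature extractor and `train_loader`/`test_loader` are placeholder data loaders of (image, label) batches; scikit-learn's KNeighborsClassifier handles the neighbor search:

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    model.eval()
    feats, labels = [], []
    for x, y in loader:
        f = model(x.to(device))                          # (B, D) features from the SSL model
        f = torch.nn.functional.normalize(f, dim=1)      # L2-normalize before KNN
        feats.append(f.cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def knn_evaluate(model, train_loader, test_loader, k=20):
    train_f, train_y = extract_features(model, train_loader)
    test_f, test_y = extract_features(model, test_loader)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_f, train_y)                            # "training" = storing labeled features
    return knn.score(test_f, test_y)                     # accuracy from the K nearest neighbors
```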
How self-supervised models are evaluated: linear evaluation
• Train an FC layer in a supervised manner on the features extracted by the model and the ground-truth labels
  - The optimal supervised-training hyperparameters differ between self-supervised learning methods
[Diagram: the model trained by self-supervised learning (on the SSL dataset) extracts features; a randomly initialized FC layer is trained with supervised labels (e.g. Airplane, Cat) on top of the frozen features]
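A minimal linear-evaluation sketch, assuming `backbone` is the frozen SSL-trained encoder with feature dimension `feat_dim` and `train_loader` yields (image, label) batches; names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def linear_eval(backbone, train_loader, feat_dim, num_classes,
                epochs=90, lr=0.1, device="cuda"):
    backbone.eval()                                      # backbone stays frozen
    for p in backbone.parameters():
        p.requires_grad_(False)
    fc = nn.Linear(feat_dim, num_classes).to(device)     # randomly initialized FC layer
    opt = torch.optim.SGD(fc.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                f = backbone(x.to(device))               # features from the frozen model
            loss = nn.functional.cross_entropy(fc(f), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return fc
```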
How self-supervised models are evaluated: fine-tuning
• Fine-tune on a dataset (downstream task) different from the one used during self-supervised learning
[Diagram: the model trained by self-supervised learning is combined with a task-specific architecture (an FC layer trained with labels such as "Pelican" for classification, or a detection head) and trained in a supervised manner, yielding an image classification model or an object detection model]
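A minimal fine-tuning sketch for a downstream classification task, assuming `pretrained` is a torchvision-style backbone whose final `fc` attribute can be replaced and `loader` yields downstream (image, label) batches; all names are illustrative:

```python
import torch
import torch.nn as nn

def finetune(pretrained, loader, num_classes, epochs=100, lr=0.01, device="cuda"):
    pretrained.fc = nn.Linear(pretrained.fc.in_features, num_classes)   # new task-specific head
    pretrained.to(device).train()
    opt = torch.optim.SGD(pretrained.parameters(), lr=lr, momentum=0.9) # update all layers
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.cross_entropy(pretrained(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return pretrained
```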
Representative self-supervised learning methods (roadmap)

Self-supervised learning for CNNs — improving the pretext task:
• Context Prediction [C. Doersch+, ICCV 2015]: predicts the relative position between patches of a patch-divided image
• Context Encoders [D. Pathak+, CVPR 2016]: predicts the pixels of a masked region
• Colorization [R. Zhang+, ECCV 2016]: predicts color information
• Jigsaw [M. Noroozi and P. Favaro, ECCV 2016]: solves jigsaw puzzles
• Counting [M. Noroozi+, ICCV 2017]: trains so that the summed patch outputs match the output of the whole image
• Image Rotations [S. Gidaris+, ICLR 2018]: predicts the rotation angle
• Jigsaw++ [M. Noroozi+, CVPR 2018]: mixes the puzzles of two images
• Spot Artifacts [S. Jenni and P. Favaro, CVPR 2018]
• Instance Discrimination [Z. Wu+, CVPR 2018]: learns each image as its own class
• Unsupervised Embedding Learning [M. Ye+, CVPR 2019]

Contrastive learning:
• CPC [A. van den Oord+, arXiv 2018]: builds pairs between image patches for contrastive learning (neighboring-word prediction from NLP applied to images)
• CPC v2 [O. J. Hénaff+, ICML 2020]: improves pair construction, model architecture, etc.
• PIRL [I. Misra and L. van der Maaten, CVPR 2020]: introduces jigsaw puzzles into contrastive learning
• SimCLR [T. Chen+, ICML 2020]: proposes a simple contrastive-learning framework; SimCLRv2 [T. Chen+, NeurIPS 2020]: introduces large-scale networks
• MoCo [K. He+, CVPR 2020]: reuses past outputs as negative pairs; MoCo v2 [X. Chen+, arXiv 2020]: adopts SimCLR's techniques
• PCL [J. Li+, ICLR 2021]: introduces prototypes
• Barlow Twins [J. Zbontar+, ICML 2021]: uses statistics over the batch dimension

Negative-free:
• BYOL [J. Grill+, NeurIPS 2020]: learns from positive pairs only
• SimSiam [X. Chen+, CVPR 2021]: proposes an even simpler training scheme
• SwAV [M. Caron+, NeurIPS 2020]: estimates the cluster to which a positive pair belongs
• Analysis — BYOL works even without batch statistics [P. Richemond+, arXiv 2020]: are batch-normalization statistics an implicit negative pair? → what matters is the training stabilization provided by normalization

Natural language processing: Word2vec [T. Mikolov+, arXiv 2013] (neighboring-word prediction), BERT [J. Devlin+, NAACL 2019] (Masked Language Modeling, MLM, later applied to images as Masked Image Modeling)

Self-supervised learning for ViT:
• MoCo v3 [X. Chen+, ICCV 2021]: evaluates effectiveness on ViT
• MoBY (MoCo + BYOL) [Z. Xie+, arXiv 2021]: a training method for ViT
• DINO [M. Caron+, ICCV 2021]: proposes contrastive learning using data augmentation and multiple views of an image (predicts global from local and global from global)
• EsViT [C. Li+, ICLR 2022]: adds local-to-local prediction
• Masked Image Modeling (MIM):
  - BEiT [H. Bao+, ICLR 2022], iBOT [J. Zhou+, ICLR 2022]: mask the features and predict the features of the masked regions
  - MAE [K. He+, CVPR 2022], SimMIM [Z. Xie+, CVPR 2022]: predict the pixels of the masked regions

Multimodal extensions (image + text): MCT [X. Yuan+, CVPR 2021], CLIP [A. Radford+, ICML 2021] (zero-shot transfer), VoLTA [S. Pramanick+, arXiv 2022] (local feature alignment)

Analysis:
• Understanding the Behaviour of Contrastive Loss [F. Wang and H. Liu, CVPR 2021]: analyzes loss design and its effect on training
• How Well Do Self-Supervised Models Transfer? [L. Ericsson+, CVPR 2021]: evaluates transferability under various settings
• When Does Contrastive Visual Representation Learning Work? [E. Cole+, CVPR 2022]: analyzes the relationship with the dataset
• InfoMin [Y. Tian+, NeurIPS 2020]: analyzes how positive pairs are combined
Improving the pretext task
• Colorful Image Colorization [Zhang+, ECCV 2016]
  - Create a grayscale image from a color image
  - Predict the a/b channels of the Lab color space from the grayscale image
• Predicting Image Rotations [Gidaris+, ICLR 2018]
  - Apply one of four rotations (0°, 90°, 180°, 270°) to the image
  - Predict which of the four rotations was applied (classification); see the sketch after the figures below
[Figure: Colorful Image Colorization network (Zhang, Isola, Efros, Fig. 2) — each conv block consists of 2 or 3 repeated conv+ReLU layers followed by BatchNorm; the network has no pooling layers, and all resolution changes are done by spatial down/upsampling]
[Figure: Image Rotations pipeline (Gidaris+, ICLR 2018, Fig. 1) — the image X is rotated by 0/90/180/270 degrees via g(X, y); a shared ConvNet model F(.) is trained to maximize the probability of the applied rotation label y (predict 0/90/180/270 degrees)]
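A minimal sketch of the rotation-prediction pretext task described above, assuming `model` outputs 4 logits; `rotation_pretext_loss` is an illustrative name, not the authors' code:

```python
import torch
import torch.nn.functional as F

def rotation_pretext_loss(model, x):
    # x: (B, C, H, W). Build the four rotated copies and their labels 0..3.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    inputs = torch.cat(rotations, dim=0)                          # (4B, C, H, W)
    labels = torch.arange(4).repeat_interleave(x.size(0)).to(x.device)
    logits = model(inputs)                                        # (4B, 4) rotation logits
    return F.cross_entropy(logits, labels)                        # classify the applied rotation
```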
Improving the pretext task
• Solving Jigsaw Puzzles
  - Create nine tiled patches and shuffle them
  - Predict the index of the pre-defined shuffle permutation that was applied
• Context Encoders [Pathak+, CVPR 2016]
  - Predict the masked region with an encoder-decoder model
[Figure: Context Free Network (Noroozi and Favaro, Fig. 3) — illustrates how a jigsaw puzzle is generated from image tiles]
[Figure: Context Encoder (Pathak+, Fig. 2) — the context image is passed through the encoder; its features are connected to the decoder through a channel-wise fully-connected layer, and the decoder produces the missing region]
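A minimal sketch of a Context Encoders-style masked-reconstruction pretext task, assuming `net` is an encoder-decoder model; only the L2 loss on the masked region is shown here, while the original method also adds an adversarial loss:

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(net, x, mask_size=64):
    B, C, H, W = x.shape                                   # assumes H, W > mask_size
    top = torch.randint(0, H - mask_size + 1, (1,)).item()
    left = torch.randint(0, W - mask_size + 1, (1,)).item()
    mask = torch.zeros(1, 1, H, W, device=x.device)
    mask[..., top:top + mask_size, left:left + mask_size] = 1.0
    corrupted = x * (1 - mask)                             # zero out the masked region
    recon = net(corrupted)                                 # encoder-decoder predicts the image
    return F.mse_loss(recon * mask, x * mask)              # loss only on the masked region
```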
Improving the pretext task
• Learning to Count
  - Train so that the features of the whole image match the (summed) features of its split patches
• Non-Parametric Instance Discrimination [Wu+, CVPR 2018]
  - Train the features of each image to be independent (each sample is treated as its own class)
[Figure: Learning to Count — a shared-weight feature extractor φ is applied to a downsampled image D∘x and to its tiles T_1∘x ... T_4∘x; the loss pushes φ(D∘x) toward the sum of the tile features, with a contrastive margin term max{0, M − |c − t|²} against a different image y]
[Figure 2 (Wu+): Instance Discrimination pipeline — a backbone CNN encodes each image into a 128-D, L2-normalized feature; a non-parametric softmax over a memory bank performs instance-level discrimination, scattering the training samples over the 128-D unit sphere; the temperature τ controls the concentration of the distribution]
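A minimal sketch of non-parametric instance discrimination with a memory bank, assuming `memory_bank` is an (N, 128) buffer of L2-normalized features and `indices` are the dataset indices of the batch; a full softmax over all instances is used here for simplicity, whereas the paper approximates it with NCE:

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(features, indices, memory_bank, tau=0.07, momentum=0.5):
    f = F.normalize(features, dim=1)                 # (B, 128) points on the unit sphere
    logits = f @ memory_bank.t() / tau               # similarity to every stored instance
    loss = F.cross_entropy(logits, indices)          # the "class" is the instance index
    with torch.no_grad():                            # momentum update of the memory bank
        updated = momentum * memory_bank[indices] + (1 - momentum) * f
        memory_bank[indices] = F.normalize(updated, dim=1)
    return loss
```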
Improving the pretext task
• Context Prediction [Doersch+, ICCV 2015]
  - Predict the relative position between patches cropped in a tiled layout
• Contrastive Predictive Coding (CPC v2) [Hénaff+, ICML 2020]
  - From a patch feature, predict the features of patches located k positions ahead
[Figure (Hénaff+): CPC pipeline — a patched ResNet-161 feature extractor f_θ maps a 256×256 image to a 7×7×4096 feature grid z; a masked ConvNet context network g_φ produces the context c, trained with the InfoNCE loss; the pre-trained (fixed or tuned) encoder is then evaluated by linear classification, efficient classification with 1%-100% of the labels, and transfer learning (e.g. Faster-RCNN), against a supervised ResNet baseline]
[Figure 2 (Doersch+): the algorithm receives two patches in one of eight possible spatial arrangements, without any context, and must classify which configuration was sampled]
Contrastive learning: SimCLR [Chen+, ICML 2020]
• SimCLR: a Simple Framework for Contrastive Learning of Visual Representations
• Train so that the similarity between features from the same source image is large and the similarity between features from different images is small
  - Task setting: find the pair of features that comes from the same source image
[Diagram: mini-batch → data augmentation → encoder (network) → projector (MLP) → features → NT-Xent loss; features of the same image are pulled together (positive pair), features of different images are pushed apart (negative pair); gradients are backpropagated through both branches]
• Encoder: the network without its output layer
• Projector: a 2-layer MLP
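A minimal sketch of SimCLR-style positive-pair construction with torchvision (random crop + color distortion, plus flip and grayscale; the exact augmentation parameters are illustrative, and the paper additionally uses Gaussian blur):

```python
import torchvision.transforms as T

simclr_augment = T.Compose([
    T.RandomResizedCrop(224),                                   # random crop
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # color distortion
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def make_views(pil_image):
    # Two independent stochastic views of the same image form one positive pair;
    # views of different images in the mini-batch act as negative pairs.
    return simclr_augment(pil_image), simclr_augment(pil_image)
```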
Contrastive learning: SimCLR [Chen+, ICML 2020]
• Analysis of data augmentation: how linear-evaluation accuracy changes with the combination of two augmentations
• Accuracy changes depending on how augmentations are combined
  - Crop + color distortion is the best combination → it became the standard setting for methods after SimCLR
Contrastive learning: SimCLR [Chen+, ICML 2020]
• The Normalized Temperature-scaled Cross-Entropy loss (NT-Xent) is used as the loss function
  - The similarity relations between samples are expressed as probabilities and a cross-entropy loss is computed

L_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

sim(·,·): cosine similarity; (z_i, z_j): positive pair; denominator: similarities over all pairs; τ: temperature parameter
The same loss can be read as a temperature-scaled softmax followed by cross-entropy. Example where samples 1 and 2 form the positive pair (i = 1, j = 2):

p_{i,j} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

logits sim(z_1, z_2), sim(z_1, z_3), ..., sim(z_1, z_{2N}) → temperature-scaled softmax → probabilities p_{1,2}, p_{1,3}, ..., p_{1,2N} → cross-entropy with the labels y_{1,2}, y_{1,3}, ..., y_{1,2N}:

L_{i,j} = -\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \, y_{i,k} \log p_{i,k} = -\log p_{i,j}
Effect of the temperature parameter τ (same example, i = 1, j = 2):
  - τ < 1.0: sharpens the probability distribution p_{1,k}
  - τ > 1.0: flattens the probability distribution
  - Compared with the ordinary softmax (τ = 1.0), SimCLR uses temperature values smaller than 1.0
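A minimal NT-Xent sketch in PyTorch matching the formula above, assuming `z1`, `z2` are the projector outputs of the two views and row i of each forms a positive pair; `nt_xent` is an illustrative name, not the authors' code:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D), unit norm
    sim = z @ z.t() / tau                                     # cosine similarity / temperature
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # drop the k == i terms
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)                      # mean of L_{i,j} over the 2N anchors
```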
Analysis of contrastive learning: What Makes for Good Views for Contrastive Learning [Tian+, NeurIPS 2020]
• Analyzes what makes a good positive pair from the viewpoint of mutual information
• Shows experimentally that the mutual information of a pair should be neither too large nor too small
Effect of the crop location in random cropping
[Figure (Tian+): views are built by cropping two patches at various spatial offsets (self-supervised learning on DIV2K, linear evaluation on CIFAR-10); mutual information is large for small patch distances and small for large ones, and downstream accuracy first increases and then decreases as I_NCE grows (a reverse-U shape)]

I(v_1; v_2) \ge \log(K) - L_{NCE} = I_{NCE}(v_1; v_2)
[Figure (Tian+): transfer performance versus the number of bits captured — too little mutual information misses task-relevant signal ("missing info", not enough signal), too much adds nuisance bits ("excess info", noise); the sweet spot lies where I(v_1; v_2) = I(x; y)]
Figure 2 (Tian+): as the mutual information between views is changed, information about the downstream task and nuisance variables can be selectively included or excluded, biasing the learned representation; (a) depicts views chosen to preserve downstream-task information while throwing out nuisance information, while in (b) reducing MI always throws out information relevant for the task, so performance decreases as MI is reduced.
Analysis of contrastive learning: Understanding the Behaviour of Contrastive Loss [Wang and Liu, CVPR 2021]
• Analyzes the relationship between the temperature parameter of the loss and the learned feature representation
  - Small temperature: positive and negative pairs are pushed strongly apart (high uniformity, low tolerance)
  - Large temperature: the similarity of positive pairs approaches 1
• Good performance is achieved at intermediate temperatures (0.3 gives the best accuracy for the ordinary contrastive loss in the table below)
Dataset / metric | Contrastive loss (τ = 0.07, 0.3, 0.7, 1.0) | Simple loss | Hard contrastive loss (τ = 0.07, 0.3, 0.7, 1.0) | Hard simple loss
(each row below lists its ten values in this column order)
CIFAR10
accuracy 79.75 83.27 82.69 82.21 74.83 79.2 83.63 84.19 84.19 84.84
uniformity 3.86 3.60 3.17 2.96 1.68 3.88 3.89 3.87 3.86 3.85
tolerance 0.04 0.178 0.333 0.372 0.61 0.034 0.0267 0.030 0.030 0.030
CIFAR100
accuracy 51.82 56.44 50.99 48.33 39.31 50.77 56.55 57.54 56.77 55.71
uniformity 3.86 3.60 3.18 2.96 2.12 3.87 3.88 3.87 3.86 3.86
tolerance 0.10 0.269 0.331 0.343 0.39 0.088 0.124 0.158 0.172 0.174
SVHN
accuracy 92.55 95.47 94.17 92.07 70.83 91.82 94.79 95.02 95.26 94.99
uniformity 3.88 3.65 3.27 3.05 1.50 3.89 3.91 3.90 3.88 3.85
tolerance 0.032 0.137 0.186 0.197 0.074 0.025 0.021 0.021 0.023 0.026
ImageNet100
accuracy 71.53 75.10 69.03 63.57 48.09 68.33 74.21 74.70 74.28 74.31
uniformity 3.917 3.693 3.323 3.08 1.742 3.929 3.932 3.927 3.923 3.917
tolerance 0.093 0.380 0.427 0.456 0.528 0.067 0.096 0.121 0.134 0.157
Table 1. We report the accuracy of linear classification on CIFAR10, CIFAR100 and SVHN, including models trained with the ordinary
contrastive loss, simple contrastive loss, hard contrastive loss and hard simple contrastive loss. For models trained on ordinary contrastive
loss and hard contrastive loss, we select several representative temperatures. More results are shown in the supplementary material.
Accuracy change with the temperature parameter
[Figure 8 (Wang and Liu): similarity distributions of positive samples and of the top-10 nearest negative samples at different temperatures]
Relationship between the temperature parameter and the feature similarity of positive and negative pairs
Negative-free: BYOL [Grill+, NeurIPS 2020]
• BYOL: Bootstrap Your Own Latent
• Uses two networks: an online network and a target network
• Trains so that the similarity between features of the same source image is large (only positive pairs are used)
  - Task setting: from the features extracted by the online network, predict the target network's features of another view
[Diagram: mini-batch → data augmentation → online network (encoder → projector (MLP) → predictor (MLP)) and target network (encoder → projector (MLP), stop-grad); features → MSE loss]
• Encoder: the network without its output layer
• Projector: a 2-layer MLP
• Predictor: a 2-layer MLP
Negative-free: BYOL [Grill+, NeurIPS 2020]
• The target parameters are updated as an exponential moving average of the online parameters
  - λ is varied according to a cosine scheduler, with 0.996 ≤ λ ≤ 1

θ_t ← λ θ_t + (1 − λ) θ_o

θ_t: target parameters; θ_o: online parameters; λ: weight applied in the parameter update
[Diagram: mini-batch → data augmentation → online network (encoder → projector (MLP) → predictor (MLP)) and target network updated by the exponential moving average (stop-grad); features → MSE loss; backprop only through the online network]
• Encoder: the network without its output layer
• Projector: a 2-layer MLP
• Predictor: a 2-layer MLP
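A minimal sketch of the target update and the cosine schedule for λ described above, assuming `online` and `target` are architecturally identical modules; function names are illustrative:

```python
import math
import torch

@torch.no_grad()
def update_target(online, target, lam):
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.data.mul_(lam).add_(po.data, alpha=1.0 - lam)   # theta_t <- lam*theta_t + (1-lam)*theta_o

def cosine_lambda(step, total_steps, base_lam=0.996):
    # lambda is ramped from base_lam toward 1 over training with a cosine schedule
    return 1.0 - (1.0 - base_lam) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```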
Negative-free: BYOL [Grill+, NeurIPS 2020]
• Accuracy comparison on ImageNet-1K: classification performance measured by linear evaluation
[Figure (Grill+): ImageNet top-1 accuracy (about 68-80%) versus number of parameters (25M-400M); BYOL and its 2x/4x-wide variants compared with supervised baselines (Sup., Sup. 2x/4x) and with SimCLR, MoCo v2, InfoMin, CPCv2-L, MoCo, CMC, AMDIM]
→ With a large number of parameters, BYOL reaches performance comparable to supervised learning
• Accuracy comparison under fine-tuning
  - Self-supervised learning: ImageNet-1K
  - Object detection: VOC
  - Semantic segmentation: VOC
Transfer results in semantic segmentation and object detection (Table 4a, Grill+):
Method | AP50 | mIoU
Supervised-IN | 74.4 | 74.4
MoCo | 74.9 | 72.5
SimCLR (repro) | 75.2 | 75.2
BYOL (ours) | 77.5 | 76.3
→ On these transfer tasks BYOL surpasses the ImageNet supervised pre-trained model
Negative-free: SimSiam [Chen+, CVPR 2021]
• SimSiam: simple Siamese networks
• Proposes an even simpler method than BYOL
  - No need for the complex components of existing methods, such as an exponential moving average or clustering
  - Can be trained with small mini-batches and few training epochs
[Diagram: mini-batch → data augmentation → encoder → projector (MLP) → predictor (MLP); features → negative cosine similarity loss; stop-grad on one branch, backprop through the other]
• Encoder: the network without its output layer
• Projector: a 3-layer MLP
• Predictor: a 2-layer MLP (bottleneck structure)
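A minimal sketch of the SimSiam objective described above, assuming `f` is the encoder + projector, `h` is the predictor, and `x1`, `x2` are the two augmented views; the loss is the symmetrized negative cosine similarity with a stop-gradient (`.detach()`) on the projector branch:

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)          # projector outputs
    p1, p2 = h(z1), h(z2)          # predictor outputs

    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # stop-grad on z

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```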
Negative-free: SimSiam [Chen+, CVPR 2021]
• SimSiam can be viewed as a common framework for contrastive learning and negative-free methods
• Adding or removing existing techniques from SimSiam expresses the different methods
  - SimCLR: add negative pairs (no predictor, no stop-grad)
  - BYOL: add an exponential-moving-average (momentum) model
  - SwAV: add clustering (Sinkhorn-Knopp), no predictor
[Figure (Chen and He): architecture comparison — SimSiam: encoder + predictor with stop-gradient on one branch, similarity loss; SimCLR: two encoders with gradients on both branches, similarity and dissimilarity (negative pairs); BYOL: online encoder + predictor versus a momentum encoder updated by a moving average, gradient only on the online branch; SwAV: two encoders with Sinkhorn-Knopp clustering]
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation on ImageNet-1K
  Network: ResNet-50
Negative-free: SimSiam [Chen+, CVPR21]
  (Table 4 of the paper: ImageNet linear classification, ResNet-50 pre-trained with two 224×224 views, single-crop evaluation; "+" denotes an improved reproduction)
  method             batch  neg. pairs  momentum enc.  100 ep  200 ep  400 ep  800 ep
  SimCLR (repro.+)   4096   yes         no             66.5    68.3    69.8    70.4
  MoCo v2 (repro.+)   256   yes         yes            67.4    69.9    71.0    72.2
  BYOL (repro.)      4096   no          yes            66.5    70.6    73.2    74.3
  SwAV (repro.+)     4096   no          no             66.5    69.1    70.7    71.8
  SimSiam             256   no          no             68.1    70.0    70.8    71.3
SimSiam achieves high performance with a small batch size and few training epochs
• Accuracy comparison on downstream tasks
  Self-supervised learning on ImageNet-1K → fine-tuning on the downstream task
  Each method is compared after 200 epochs of self-supervised pre-training
Negative-free: SimSiam [Chen+, CVPR21]
  (Table 5 of the paper: transfer learning; VOC 07 / VOC 07+12 detection with Faster R-CNN, COCO detection and instance segmentation with Mask R-CNN, C4 backbone; VOC results averaged over 5 trials)
  pre-train            VOC07 AP50/AP/AP75   VOC07+12 AP50/AP/AP75   COCO det AP50/AP/AP75   COCO seg AP50/AP/AP75
  scratch              35.9 / 16.8 / 13.0   60.2 / 33.8 / 33.1      44.0 / 26.4 / 27.8      46.9 / 29.3 / 30.8
  ImageNet supervised  74.4 / 42.4 / 42.7   81.3 / 53.5 / 58.8      58.2 / 38.2 / 41.2      54.7 / 33.3 / 35.2
  SimCLR (repro.+)     75.9 / 46.8 / 50.1   81.8 / 55.5 / 61.4      57.7 / 37.9 / 40.9      54.6 / 33.3 / 35.3
  MoCo v2 (repro.+)    77.1 / 48.5 / 52.5   82.3 / 57.0 / 63.3      58.8 / 39.2 / 42.5      55.5 / 34.3 / 36.6
  BYOL (repro.)        77.1 / 47.0 / 49.9   81.4 / 55.3 / 61.1      57.8 / 37.9 / 40.9      54.3 / 33.2 / 35.0
  SwAV (repro.+)       75.5 / 46.5 / 49.6   81.5 / 55.4 / 61.4      57.6 / 37.6 / 40.3      54.2 / 33.1 / 35.1
  SimSiam, base        75.5 / 47.0 / 50.2   82.0 / 56.4 / 62.8      57.5 / 37.9 / 40.9      54.2 / 33.2 / 35.2
  SimSiam, optimal     77.3 / 48.5 / 52.5   82.4 / 57.0 / 63.7      59.3 / 39.2 / 42.1      56.0 / 34.4 / 36.7
With a simple training procedure, SimSiam matches the performance of prior methods
ࣗݾڭࢣ͋Γֶशͷදతͳख๏
#BSMPX5XJOT
<+;CPOUBS
*$.-`>
CBUDIEJNFOTJPO
$1$
<"WE0PSE
BS9JW`>
ύονؒͰϖΞΛ
࡞ͯ͠ରরֶश
$1$W
<0+)ÉOB
ff
*$.-`>
ϖΞͷ࡞
ϞσϧߏͳͲΛվળ
ϚεΫͨ͠ྖҬͷϐΫηϧΛ༧ଌ
$POUFYU&ODPEFST
<%1BUIBL
$713`>
δάιʔύζϧ
৭ใΛ༧ଌ ճస֯Λ༧ଌ
*NBHF3PUBUJPOT
<4(JEBSJT
*$-3`>
$POUFYU1SFEJDUJPO
<$%PFSTDI
*$$7`>
ύονׂͨ͠ը૾ͷ
ύονؒͷ૬ରҐஔΛ༧ଌ
1*3-
<*.JTSBBOE-.BBUFO
$713`>
δάιʔύζϧΛಋೖ
4JN$-3
<5$IFO
*$.-`>
4JN$-3W
<5$IFO
/FVS*14`>
.P$P
<,)F
$713> .P$PW
<9$IFO
BS9JW>
4JN$-3ͷςΫχοΫΛಋೖ
ରরֶश
$PVOUJOH
<./PSPP[J
*$$7`>
֤ύονग़ྗͷͱը૾શମͷग़ྗ
͕Ұக͢ΔΑ͏ʹֶश
+JHTBX
<./PSPP[J
$713`>
ͭͷը૾ͷύζϧΛϛοΫε
&NCFEEJOH-FBSOJOH
<.:F
$713`>
6OTVQFSWJTFE
8PSEWFD
<5.JLPMPW
BS9JW>
#&35
<+%FWMJO
/""$->
ࣗવݴޠॲཧ
1$-
<+-J
*$-3`>
ϓϩτλΠϓΛಋೖ
MPDBM͔ΒMPDBM༧ଌ
&T7J5
<$-J
*$-3`>
.BTLFE*NBHF.PEFMJOH .*.
$//Λରͱͨࣗ͠ݾڭࢣ͋Γֶश
7J5Λରͱͨࣗ͠ݾڭࢣ͋Γֶश
1SFUFYUλεΫͷվળ
ྡ͢Δ୯ޠͷ༧ଌ
Λը૾Ԡ༻
%*/0
<.$BSPO
*$$7>
σʔλ૿෯ͱෳͷը૾
Λ༻͍ͨରরֶशΛఏҊ
ը૾ʹΫϥεͱֶͯ͠श
աڈͷग़ྗΛ
ωΨςΟϒϖΞͱͯ͠׆༻
.BTLFE-BOHVBHF.PEFMJOH
.-.
Λը૾Ԡ༻
େنωοτϫʔΫͷಋೖ
+JHTBX
<./PSPP[JBOE1'BWBSP
&$$7`>
$PMPSJ[BUJPO
<3;IBOH
&$$7`>
*OTUBODF%JTDSJNJOBUJPO
<;8V
$713`>
MPDBM͔ΒHMPCBM
HMPCBM͔ΒHMPCBMΛ༧ଌ
ಛྔʹϚεΫΩϯά
ϚεΫྖҬͷಛྔΛ༧ଌ
#&J5
<)#BP
*$-3`>
J#05
<+;IPV
*$-3`>
ϚεΫྖҬͷըૉΛ༧ଌ
."&
<,)F
$713`>
4JN.*.
<;9JF
$713`>
4QPU"SUJGBDUT
<4+FOOJBOE1'BWBSP
$713`>
.$5
<9:VBO
$713`>
ϚϧνϞʔμϧ֦ு
ʢը૾ʴςΩετʣ
.P$P#:0-
.P#:
<;9JF
BS9JW`>
.P$PW
<9$IFO
*$$7>
7J5Ͱͷ༗ޮੑΛධՁ
7P-5"
<41SBNBOJDL
BS9JW`>
MPDBMGFBUVSF"MJHONFOU
ϚϧνϞʔμϧʢը૾ʴςΩετʣ
γϯϓϧͳରরֶशΛఏҊ
$-*1
<"3BEGPSE
*$.-`>
;FSP4IPU5SBOTGFS
ϚϧνϞʔμϧʢը૾ʴςΩετʣ
7J5ͷͨΊͷֶशํ๏
ੳ
6OEFSTUBOEJOHUIF#FIBWJPVS
<'8BOHBOE)-JV
$713`>
ଛࣦઃܭֶशޮՌʹ͍ͭͯੳ
)PX8FMM%P4FMG4VQFSWJTFE.PEFMT5SBOTGFS
<-&SJDTTPO
$713`>
༷ʑͳઃఆͷసҠੑΛධՁ
8IFO%PFT$POUSBTUJWF7JTVBM3FQSFTFOUBUJPO-FBSOJOH8PSL
<&$PMF
$713`>
σʔληοτͱͷؔੑʹ͍ͭͯੳ
*OGP.JO
<:5JBO
/FVS*14`>
ϙδςΟϒϖΞͷΈ߹Θͤʹ͍ͭͯੳ
#:0-
<+(SJMM
/FVS*14>
ϙδςΟϒϖΞͷΈͰֶश
4JN4JBN
<9$IFO
$713>
ΑΓγϯϓϧͳֶशΛఏҊ
ੳ
#:0-XPSLTFWFOXJUIPVUCBUDITUBUJTUJDT
<13JDIFNPOE
BS9JW`>
όονਖ਼نԽͷ౷ܭใ͕҉తͳωΨςΟϒϖΞͳͷͰʁ
ˠਖ਼نԽʹΑΔֶशͷ҆ఆԽ͕ॏཁ
4X"7
<.$BSPO
/FVS*14`>
ϙδςΟϒϖΞͷଐ͢Δ
ΫϥελΛਪఆ
/FHBUJWFGSFF
• Analyzes transferability to five task types and the correlation between ImageNet accuracy and downstream performance
  Many-shot recognition
  Few-shot recognition
  Object detection
  Semantic segmentation
  Surface-normal estimation
  (each task type evaluated on several datasets)
Transferability of self-supervised learning: How Well Do Self-Supervised Models Transfer? [Ericsson+, CVPR21]
Which downstream tasks a model is good at differs between self-supervised learning methods
• DINO: self-distillation with no labels
• The student network is trained so that its output approaches the teacher network's output
• Probabilities are computed by applying a temperature-scaled softmax to the features
Training method for ViT: DINO [Caron+, ICCV21]
  Encoder  : CNN without its output layer / ViT without the MLP head
  Projector: 3-layer MLP (bottleneck structure)
(Figure: data augmentation produces local crops (small regions, student only) and global crops (wide regions, both networks); the student ViT+MLP is trained by backprop and the teacher ViT+MLP is an EMA of the student; the teacher output is centered, both outputs pass through a sharpened softmax, the loss is computed between the two probability distributions, and stop-gradient is applied to the teacher)
• sharpening: adjusts the distribution so that a single feature dimension is emphasized
• centering : adjusts the teacher output so that the same dimension is not emphasized for every image (prevents collapse)
• Student and teacher distributions, and the centering update
Training method for ViT: DINO [Caron+, ICCV21]
  Student network: P_s(x)^(i) = exp(g_{θ_s}(x)^(i) / τ_s) / Σ_{k=1}^{K} exp(g_{θ_s}(x)^(k) / τ_s)
  Teacher network: P_t(x)^(i) = exp((g_{θ_t}(x)^(i) − c) / τ_t) / Σ_{k=1}^{K} exp((g_{θ_t}(x)^(k) − c) / τ_t)
  Centering update: c ← m·c + (1 − m) · (1/B) · Σ_{i=1}^{B} g_{θ_t}(x_i)
    τ_s, τ_t: temperature parameters (sharpening)
    c: center   m: momentum hyperparameter   B: batch size
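The following is a minimal PyTorch sketch of the loss implied by the formulas above: sharpened student softmax, centered teacher softmax with stop-gradient, and an EMA update of the center. The temperatures and the momentum value are illustrative, not the exact settings of the paper.

# Minimal DINO-style loss sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

def dino_loss(student_out: torch.Tensor,      # (B, K) student projector outputs
              teacher_out: torch.Tensor,      # (B, K) teacher projector outputs
              center: torch.Tensor,           # (K,) running center c
              tau_s: float = 0.1,
              tau_t: float = 0.04,
              m: float = 0.9):
    p_s = F.log_softmax(student_out / tau_s, dim=-1)                  # sharpening (student)
    p_t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # centering + stop-grad (teacher)
    loss = -(p_t * p_s).sum(dim=-1).mean()                            # cross-entropy H(P_t, P_s)
    with torch.no_grad():                                             # c <- m*c + (1-m)*mean(teacher outputs)
        new_center = m * center + (1.0 - m) * teacher_out.mean(dim=0)
    return loss, new_center

center = torch.zeros(256)
loss, center = dino_loss(torch.randn(16, 256), torch.randn(16, 256), center)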
• The teacher's parameters are updated as an exponential moving average of the student's parameters
  λ is varied with a cosine scheduler, 0.996 ≤ λ ≤ 1
Training method for ViT: DINO [Caron+, ICCV21]
  θ_t ← λ·θ_t + (1 − λ)·θ_s
    θ_t: teacher parameters
    θ_s: student parameters
    λ  : weight on the current parameters in the update
(Figure: same pipeline as the previous slide — student ViT+MLP updated by backprop, teacher ViT+MLP updated by EMA, sharpened softmax on both branches, centering and stop-gradient on the teacher)
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation and k-NN evaluation on ImageNet-1K
Training method for ViT: DINO [Caron+, ICCV21]
  (Table 2 of the paper: top-1 accuracy for linear and k-NN evaluation on the ImageNet validation set; throughput measured on a V100 GPU; * = run by the DINO authors)
  Method      Arch.       Param.  im/s  Linear  k-NN
  Supervised  RN50        23      1237  79.3    79.3
  SCLR        RN50        23      1237  69.1    60.7
  MoCov2      RN50        23      1237  71.1    61.9
  InfoMin     RN50        23      1237  73.0    65.3
  BarlowT     RN50        23      1237  73.2    66.0
  OBoW        RN50        23      1237  73.8    61.9
  BYOL        RN50        23      1237  74.4    64.8
  DCv2        RN50        23      1237  75.2    67.1
  SwAV        RN50        23      1237  75.3    65.7
  DINO        RN50        23      1237  75.3    67.5
  Supervised  ViT-S       21      1007  79.8    79.8
  BYOL*       ViT-S       21      1007  71.4    66.6
  MoCov2*     ViT-S       21      1007  72.7    64.4
  SwAV*       ViT-S       21      1007  73.5    66.3
  DINO        ViT-S       21      1007  77.0    74.5
  Comparison across architectures:
  SCLR        RN50w4      375      117  76.8    69.3
  SwAV        RN50w2       93      384  77.3    67.3
  BYOL        RN50w2       93      384  77.4    –
  DINO        ViT-B/16     85      312  78.2    76.1
  SwAV        RN50w5      586       76  78.5    67.1
  BYOL        RN50w4      375      117  78.6    –
  BYOL        RN200w2     250      123  79.6    73.9
  DINO        ViT-S/8      21      180  79.7    78.3
  SCLRv2      RN152w3+SK  794       46  79.8    73.1
  DINO        ViT-B/8      85       63  80.1    77.4
ResNet-50: performance comparable to prior methods
Vision Transformer (ViT): performance exceeding prior methods
• Accuracy comparison on downstream tasks
  Sup. : supervised learning on ImageNet-1K → fine-tuning on the downstream task
  DINO: self-supervised learning on ImageNet-1K → fine-tuning on the downstream task
Training method for ViT: DINO [Caron+, ICCV21]
  (Table 6 of the paper: transfer learning by fine-tuning, top-1 accuracy)
  Backbone   Method  Cifar10  Cifar100  INat18  INat19  Flwrs  Cars  INet
  ViT-S/16   Sup.    99.0     89.5      70.7    76.6    98.2   92.1  79.9
  ViT-S/16   DINO    99.0     90.5      72.0    78.2    98.5   93.0  81.5
  ViT-B/16   Sup.    99.0     90.8      73.2    77.7    98.4   92.1  81.8
  ViT-B/16   DINO    99.1     91.7      72.6    78.6    98.8   93.0  82.8
Self-supervised pre-training with DINO transfers better than the supervised pre-trained model
• Visualizes the attention weights for the [CLS] token of a ViT trained with DINO
  The head of the last multi-head self-attention layer that attends most to the foreground is visualized
• Visualization after thresholding the attention weights
Training method for ViT: DINO [Caron+, ICCV21]
(Figure 1 of the paper: self-attention of the [CLS] token of a ViT with 8×8 patches trained without supervision; the maps show that the model learns class-specific features leading to unsupervised object segmentation)
Accurate object regions are obtained without any label information
(Figure 4 of the paper: segmentation masks obtained by thresholding the self-attention maps to keep 60% of the mass, supervised vs. DINO; Jaccard similarity with the ground truth on PASCAL VOC12 validation images)
  ViT-S/16: Random 22.0, Supervised 27.3, DINO 45.9
Compared with supervised training, DINO's attention map concentrates on the object region
• CLIP: Contrastive Language-Image Pre-training
• Proposes multimodal self-supervised learning using images and text
  The image-text pairs provided by the dataset are used as positive pairs for contrastive learning
• Applicable to zero-shot image classification
  The similarity between the image feature and the feature of the prompt template "A photo of a {class name}" is used as the class score
Multimodal: CLIP [Radford+, ICML21]
(Figure 1 of the paper: (1) contrastive pre-training — an image encoder and a text encoder embed the N images and N texts of a batch, and the N×N matrix of similarities I_i·T_j is trained so that the diagonal (matching) pairs score highest; (2) a dataset classifier is created from label text with the template "A photo of a {object}."; (3) zero-shot prediction picks the class whose text embedding is most similar to the image embedding)
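A minimal sketch of the zero-shot classification procedure described above; `image_encoder`, `text_encoder` and `tokenizer` are hypothetical stand-ins for a pretrained CLIP-style model and are not part of the original material.

# Minimal zero-shot classification sketch with a CLIP-style model, under the assumptions above.
import torch
import torch.nn.functional as F

def zero_shot_classify(image: torch.Tensor, class_names, image_encoder, text_encoder, tokenizer):
    # Build one prompt per class from the template "A photo of a {class name}."
    prompts = [f"A photo of a {name}." for name in class_names]
    with torch.no_grad():
        img_feat = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, D)
        txt_feat = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)    # (C, D)
    scores = img_feat @ txt_feat.t()                                        # cosine similarities = class scores
    return class_names[scores.argmax(dim=-1).item()], scores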
• Self-supervised training on the WebImageText (WIT) dataset built by the paper's authors
  Consists of 400 million image-text pairs collected from the internet
• Zero-shot transfer accuracy compared with supervised models
  Network: ResNet
Multimodal: CLIP [Radford+, ICML21]
Zero-shot CLIP improves accuracy on a majority of the evaluated datasets
Accuracy drops substantially on more complex tasks such as satellite-image or traffic-sign classification
• Transferability to downstream tasks (linear evaluation) compared with supervised learning and prior self-supervised methods
Multimodal: CLIP [Radford+, ICML21]
CLIP pre-training achieves high performance regardless of whether the backbone is a ResNet or a Vision Transformer (ViT)
• VoLTA: Vision-Language Transformer with weakly-supervised local-feature Alignment
• Proposes multimodal self-supervised learning using images and text (captions)
  Learns detailed relations in an image from captions only, without using bounding boxes
  Introduces cross-attention with a gating mechanism to keep the parameter count low
• Trained with multiple self-supervised tasks
Multimodal: VoLTA [Pramanick+, arXiv22]
• Cross-attention with a gating mechanism is inserted after the self-attention of each modality's model
  A learnable gating scalar α is introduced
  Setting α = 0 switches the cross-attention mechanism off
  The gate is switched on or off depending on the self-supervised task
• No additional layers are needed for cross-modal fusion, which keeps the parameter count low
Multimodal: VoLTA [Pramanick+, arXiv22]
  Processing in the image model (x: image patch features, y: caption token features):
    x̂ = SelfAtt(x)
    x = x + x̂ + α · CrossAtt(x̂, y)
    x = x + FFN(x)
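A minimal PyTorch sketch of the gated cross-attention block defined by the equations above; the dimensions, the use of nn.MultiheadAttention, and the zero initialization of α are illustrative assumptions, not the paper's exact implementation.

# Minimal gated cross-attention block sketch, under the assumptions stated above.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn   = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))     # gating scalar; alpha = 0 turns cross-attention off

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: image patch tokens (B, N, D), y: caption tokens (B, M, D)
        x_hat, _ = self.self_attn(x, x, x)            # x_hat = SelfAtt(x)
        cross, _ = self.cross_attn(x_hat, y, y)       # CrossAtt(x_hat, y)
        x = x + x_hat + self.alpha * cross            # gated residual fusion
        return x + self.ffn(x)                        # x = x + FFN(x)

block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 49, 256), torch.randn(2, 12, 256))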
• Each encoder is trained with a three-step procedure
  Step 1: with cross-attention off, feed the image and caption and compute the losses L_BT^{IT'}, L_BT^{II'}, L_BT^{I'T}, L_BT^{TT'} and L_GOT
  Step 2: with cross-attention on, feed the image and caption and compute the losses L_MLM and L_ITM
  Step 3: sum all losses and backpropagate
Multimodal: VoLTA [Pramanick+, arXiv22]
Intra- and inter-modal contrastive learning (Barlow Twins)
  Contrastive learning based on Barlow Twins trains each dimension of the feature vectors to be an independent feature, reducing redundancy between dimensions
Multimodal: VoLTA [Pramanick+, arXiv22]
  L_BT^{AB} = Σ_i (1 − C_ii)² + λ · Σ_i Σ_{j≠i} (C_ij)²
  C_ij = Σ_b z^A_{b,i} · z^B_{b,j} / ( sqrt(Σ_b (z^A_{b,i})²) · sqrt(Σ_b (z^B_{b,j})²) )
    i, j    : indices of the feature-vector dimensions
    b       : index within the mini-batch
    z^A, z^B: feature vectors of the two branches
    λ       : positive weighting factor
  (C is the cross-correlation matrix between the two sets of features: its diagonal is pulled toward 1 and the off-diagonal entries toward 0)
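A minimal PyTorch sketch of the Barlow Twins loss above, applied to two batches of embeddings; the weighting factor value and the batch-normalization of each dimension are illustrative.

# Minimal Barlow Twins loss sketch, under the assumptions stated above.
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambda_bt: float = 5e-3) -> torch.Tensor:
    B, D = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)     # normalize each dimension over the batch
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.t() @ z_b) / B                             # cross-correlation matrix C (D x D)
    on_diag  = (torch.diagonal(c) - 1.0).pow(2).sum()                      # sum_i (1 - C_ii)^2
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()      # sum_{i != j} C_ij^2
    return on_diag + lambda_bt * off_diag

loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))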
Graph Optimal Transport (Wasserstein / Gromov-Wasserstein distance)
  Relations between image patches and between caption tokens are represented as graphs
  Nodes are the patch / token feature vectors; edges are the similarities between feature vectors
  Local features are aligned across modalities by optimal transport of the nodes and of the edges
Multimodal: VoLTA [Pramanick+, arXiv22]
  D_W(ϕ, ψ)  = min_{T ∈ Π(u,v)} Σ_i Σ_j T_ij · c(x_i, y_j)
  D_GW(ϕ, ψ) = min_{T̂ ∈ Π(u,v)} Σ_{i,i',j,j'} T̂_ij · T̂_{i'j'} · ‖ c1(x_i, x_{i'}) − c2(y_j, y_{j'}) ‖
  L_GOT(ϕ, ψ) = γ · D_W(ϕ, ψ) + (1 − γ) · D_GW(ϕ, ψ)
    x_i, x_{i'}: image patch features       y_j, y_{j'}: caption token features
    T, T̂       : transport plans            c(·,·), c1(·,·), c2(·,·): cosine-similarity-based costs
    γ          : weight balancing the two losses
Masked Language Modeling
  Masking is applied to a portion of the caption tokens
  The masked tokens are predicted from the caption token features by an MLM head
Image-Text Matching
  The caption paired with an image is randomly replaced
  An ITM head predicts from the features of both modalities whether the image and caption are a correct pair
Multimodal: VoLTA [Pramanick+, arXiv22]
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on the COCO dataset
  Linear evaluation of the image encoder with cross-attention switched off
  w/o CMAF: trained with cross-attention off for all self-supervised tasks
Multimodal: VoLTA [Pramanick+, arXiv22]
(Table 1 of the paper: uni-modal downstream image classification — linear probing top-1 on ImageNet, mAP on VOC07 and F1 on COCO; VoLTA pre-trained on COCO with captions only is compared against supervised and self-supervised baselines, e.g. VoLTA (w/o CMAF) RN50 55.3, Swin-T 56.3, Swin-B 62.5 top-1 on ImageNet)
ResNet-50: achieves high accuracy using captions only
Swin Transformer: high accuracy also with ViT-style models
• Visualizes the matching between caption tokens and image patches obtained by Graph Optimal Transport
  The image patches matched to a highlighted caption token are shown in color
• Uses a model self-supervised on the COCO dataset
Multimodal: VoLTA [Pramanick+, arXiv22]
Correspondences between the image and the caption are acquired from captions alone
Representative Masked Image Modeling (MIM) methods
  Natural language processing — Masked Language Modeling (MLM): BERT [Devlin+, NAACL19]
  Application to images — Masked Image Modeling (MIM): BEiT [Bao+, ICLR22], iBOT [Zhou+, ICLR22]
  Image reconstruction (predict the pixels of masked regions): MAE [He+, CVPR22], SimMIM [Xie+, CVPR22]
  Tokenizer-free (predict HOG features): Masked Feature Prediction [Wei+, CVPR22]
  Improving the mask:
    multi-fold masking strategy — SdAE [Chen+, ECCV22]
    mask creation based on attention weights — Attention-Guided MIM [Kakogeorgiou+, ECCV22]
    multi-block masking strategy — I-JEPA [Assran+, arXiv23]
  Architecture improvements (multi-scale features): MCMAE [Gao+, NeurIPS22]
  Introducing contrastive learning:
    multi-task SimMIM + contrastive learning — SiT [Atito+, arXiv21]
    multi-task MAE + contrastive learning — CMAE [Mao+, arXiv22]
    CLIP with masked images — FLIP [Li+, arXiv22]
    Negative-free with masked images — MSN [Assran+, ECCV22]
  Application to other modalities:
    audio — MAE that Listen [Huang+, NeurIPS22]
    video — MAE As Spatiotemporal Learners [Feichtenhofer+, NeurIPS22]
    multimodal — MultiMAE [Bachmann+, ECCV22]
• BERT: Bidirectional Encoder Representations from Transformers
• A bidirectional Transformer is trained in two steps: pre-training and fine-tuning
• Pre-training consists of two tasks
  Masked Language Modeling: 15% of the tokens are masked and the words at the masked positions are predicted
  Next Sentence Prediction: predict whether Sentence B is the continuation of Sentence A
Masked Language Modeling: BERT [Devlin+, NAACL19]
(Figure 1 of the paper: the same BERT architecture is used for pre-training on unlabeled sentence pairs with the Mask LM and NSP objectives, and for fine-tuning on downstream tasks such as SQuAD question answering, NER and MNLI, with task-specific inputs and output heads)
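A minimal sketch of BERT-style MLM input corruption (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged); the vocabulary size and special-token id are illustrative assumptions.

# Minimal BERT-style MLM masking sketch, under the assumptions stated above.
import torch

def mlm_mask(tokens: torch.Tensor, vocab_size: int = 30522, mask_id: int = 103,
             mask_prob: float = 0.15):
    labels = tokens.clone()
    selected = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    labels[~selected] = -100                       # only masked positions contribute to the loss
    corrupted = tokens.clone()
    r = torch.rand_like(tokens, dtype=torch.float)
    corrupted[selected & (r < 0.8)] = mask_id                    # 80% of selected -> [MASK]
    random_ids = torch.randint(0, vocab_size, tokens.shape)
    swap = selected & (r >= 0.8) & (r < 0.9)                     # 10% of selected -> random token
    corrupted[swap] = random_ids[swap]
    return corrupted, labels                       # remaining 10% of selected tokens stay unchanged

inp, labels = mlm_mask(torch.randint(0, 30522, (2, 16)))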
• BEiT: Bidirectional Encoder representation from Image Transformers
• Predicts the visual tokens (features) of the masked patches
• Uses the output of a tokenizer trained with an autoencoder structure as the ground-truth targets
  Tokenizer: the pre-trained discrete variational autoencoder of DALL-E [Ramesh+, ICML21]
Masked Image Modeling: BEiT [Bao+, ICLR22]
(Figure of the paper: the original image is split into patches and, in parallel, converted into visual tokens by the tokenizer; blockwise masking replaces some patch embeddings with a mask token, the BEiT encoder processes the sequence with position embeddings, and the masked-image-modeling head predicts the visual tokens of the masked positions; the tokenizer's decoder is unused during pre-training)
• iBOT: image BERT pre-Training with Online Tokenizer
• Predicts the features of masked patches and the class token of a different, unmasked view
  Trained with two losses: the patch-feature prediction L_MIM and the cross-view class-token prediction L_[CLS]
• Uses the output of an exponential-moving-average model (online tokenizer) as the ground-truth targets
Masked Image Modeling: iBOT [Zhou+, ICLR22]
(Figure of the paper: two augmented views are encoded by a student network and an EMA teacher (the online tokenizer); the student's masked patch tokens are matched to the teacher's patch tokens (L_MIM) and the student's [CLS] token is matched to the teacher's [CLS] token of the other view (L_[CLS]), with stop-gradient on the teacher)
• MAE: Masked Autoencoder
• Predicts the pixels of the masked patches
  Encoder: a ViT that takes only the unmasked patches as input
  Decoder: a small ViT that reconstructs the image from the patch tokens and mask tokens
Image reconstruction: MAE [He+, CVPR22]
  Loss: MSE, computed only on the outputs corresponding to the mask tokens
(Figure: input → encoder with position embeddings on the visible patches → decoder with position embeddings on the encoded patch tokens plus mask tokens → reconstruction; after self-supervised training only the encoder is kept)
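A minimal PyTorch sketch of the MAE recipe above (random 75% masking, encoding only the visible patches, decoding with mask tokens, MSE on the masked patches only); the tiny MLP encoder/decoder stand in for the ViT encoder and the small ViT decoder, and position embeddings and unshuffling are omitted for brevity.

# Minimal MAE-style masking and reconstruction-loss sketch, under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, patch_dim, mask_ratio = 128, 16 * 16 * 3, 0.75
encoder    = nn.Sequential(nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim))
decoder    = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim))
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

def mae_loss(patches: torch.Tensor) -> torch.Tensor:      # patches: (B, N, patch_dim)
    B, N, _ = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)                  # random patch permutation per image
    keep_idx, mask_idx = idx[:, :n_keep], idx[:, n_keep:]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    enc = encoder(visible)                                 # encode the visible patches only
    dec_in = torch.cat([enc, mask_token.expand(B, N - n_keep, dim)], dim=1)
    recon = decoder(dec_in)[:, n_keep:]                    # predictions at the mask-token positions
    target = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    return F.mse_loss(recon, target)                       # MSE only on the masked patches

loss = mae_loss(torch.randn(4, 196, patch_dim))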
• Reconstruction results on the ImageNet-1K validation data
  The whole image can be reconstructed from the unmasked patches
  (Figure: triplets of input image / reconstruction / original image)
• Accuracy comparison on ImageNet-1K (baseline MAE: MAE pre-training → fine-tuning)
Image reconstruction: MAE [He+, CVPR22]
  ViT-L/16 top-1: scratch (original recipe) 76.5, scratch (improved recipe) 82.5, baseline MAE 84.9
  (Figure 5 of the paper: a high masking ratio of 75% works well for both fine-tuning and linear probing)
• SimMIM: a Simple Framework for Masked Image Modeling
• Both the masked and the unmasked patches are fed into the encoder
• A single linear layer is used as the decoder
• Experimentally analyzes how the masking strategy and masking ratio affect accuracy
  AvgDist: the average Euclidean distance from each masked pixel to its nearest unmasked pixel
Image reconstruction: SimMIM [Xie+, CVPR22]
• Predicts the features of the masked patches
• Uses HOG features as the ground-truth targets
Tokenizer-free: Masked Feature Prediction [Wei+, CVPR22]
(Figure 2 of the paper: input space-time cubes are randomly replaced with a [MASK] token and a Transformer with a linear head directly regresses the HOG features of the masked regions; Tables 1 and 2 compare target features such as pixels, HOG, dVAE tokens, and unsupervised / supervised teacher features)
Predicting HOG features reaches accuracy comparable to predicting the features of a pre-trained model
• Masks are created based on the teacher's attention weights for the class token
  AttMask-High: masks the regions with high attention weight
  AttMask-Hint: masks high-attention regions while leaving part of them visible as a hint
  AttMask-Low : masks the regions with low attention weight
• The teacher is an exponential-moving-average model of the student
Improving the mask: Attention-Guided MIM [Kakogeorgiou+, ECCV22]
(Figure 1 of the paper: compared with random and block-wise masking, AttMask uses the encoder's attention map to mask the most highly attended regions)
• The attention-guided masking strategies are evaluated by plugging them into iBOT
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → k-NN evaluation, linear evaluation and fine-tuning
Improving the mask: Attention-Guided MIM [Kakogeorgiou+, ECCV22]
  (Table 1 of the paper: iBOT pre-trained on 20% of ImageNet; top-1 accuracy for k-NN and linear probing on ImageNet, fine-tuning on CIFAR10/100)
  Masking strategy     Ratio (%)  k-NN  Linear  CIFAR10  CIFAR100
  Random block-wise    10-50      46.7  56.4    98.0     86.0
  Random               75         47.3  55.5    97.7     85.5
  Random               10-50      47.8  56.7    98.0     86.1
  AttMask-Low          10-50      44.0  53.4    97.6     84.6
  AttMask-Hint         10-50      49.5  57.5    98.1     86.6
  AttMask-High         10-50      49.7  57.9    98.2     86.6
Masking the regions with high attention weight improves accuracy
• SdAE: Self-distillated Masked Autoencoder
• Introduces masking into the creation of the target features
  Student: predicts the features of the masked patches with an MAE-style encoder-decoder
  Teacher: masks a different set of patches (multi-fold masking) and extracts their features to create the targets
• The teacher is an exponential-moving-average model of the student
Improving the mask: SdAE [Chen+, ECCV22]
(Figure: multi-fold mask → the EMA teacher encoder produces normalized target features; the student encoder + decoder predicts the features, trained with a cosine-similarity loss on the selected features)
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → fine-tuning and linear evaluation on ImageNet-1K
Improving the mask: SdAE [Chen+, ECCV22]
  (Table 2 of the paper: ViT-B, top-1 accuracy; MoCo v3 and DINO use multi-crop augmentation)
  Method              Epochs  Fine-tune  Linear
  Train from scratch  300     81.8       –
  MoCo v3             300     83.2       76.2
  DINO                400     83.3       77.3
  BEiT                300     83.0       49.4
  MAE                 100     82.1       54.8
  MAE                 300     82.9       61.5
  MAE                 1600    83.6       67.8
  CAE                 300     83.3       64.2
  SdAE                100     83.5       60.3
  SdAE                300     84.1       64.9
Fine-tuning accuracy improves compared with contrastive learning and previous MIM methods
• I-JEPA: Image-based Joint-Embedding Predictive Architecture
• Splits the image into target regions and predicts the patch features of each target region
  target : several target blocks are sampled within fixed scale and aspect-ratio ranges
  context: a context block is sampled within a fixed scale range, and regions overlapping the targets are removed
• The target encoder is an exponential-moving-average model of the context encoder
Improving the mask: I-JEPA [Assran+, arXiv23]
(Figure: original image → context block and target blocks; the context encoder and predictor predict the target-encoder features of each target block with an L2 loss)
• Feature extraction
  target : the unmasked input image is fed to the target encoder
  context: only the patches of the context block are fed to the context encoder
• Prediction of the target blocks
  For each target block, the predictor predicts the patch features from the context patch tokens and mask tokens
  A small ViT is used as the predictor
Improving the mask: I-JEPA [Assran+, arXiv23]
(Figure: same architecture as the previous slide — context encoder, predictor, and EMA target encoder with an L2 loss)
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation on ImageNet-1K
Improving the mask: I-JEPA [Assran+, arXiv23]
  (Table 1 of the paper: linear evaluation on ImageNet-1K, top-1 accuracy)
  Methods without view data augmentations:
    data2vec ViT-L/16 (1600 ep) 53.5; MAE ViT-B/16 68.0, ViT-L/16 76.0, ViT-H/14 77.2 (1600 ep)
    I-JEPA ViT-B/16 (600 ep) 72.9, ViT-L/16 (600 ep) 77.5, ViT-H/14 (300 ep) 79.3, ViT-H/16 at 448 resolution (300 ep) 81.1
  Methods using extra view data augmentations:
    SimCLR v2 RN152 (2×) 79.1; DINO ViT-B/8 80.1; iBOT ViT-L/16 81.0
  (Table 2 of the paper: semi-supervised evaluation with 1% of the ImageNet labels; I-JEPA outperforms MAE and benefits from scale)
  (Figure: ImageNet linear evaluation vs. pre-training GPU hours; I-JEPA reaches higher accuracy than MAE with fewer GPU hours)
Accuracy improves compared with previous MIM, contrastive, and negative-free methods, and high accuracy is reached with less training time than MAE
• SiT: Self-supervised vIsion Transformer
• Adds pair prediction (contrastive learning) to SimMIM-style pixel prediction of masked regions
  Random noise is used as the mask
  A contrastive token is added to the ViT and used for the contrastive objective
Introducing contrastive learning: SiT [Atito+, arXiv21]
(Figure 1 of the paper: pixel-corrupted image → linear projection of flattened patches with position embeddings → Vision Transformer; the data tokens are projected back to image space to reconstruct the image, and a contrastive head operates on the contrastive embedding)
• Prediction of the masked regions
  Patches are randomly replaced with noise and fed to the ViT
  The patch tokens output by the ViT are fed to a decoder (a small MLP) that reconstructs the pixels of each patch
  The L1 loss between the original image and the reconstructed image is used as the loss function
Introducing contrastive learning: SiT [Atito+, arXiv21]
(Figure: reconstruction branch of the SiT pipeline — corrupted patches in, projection back to image space out)
• Pair prediction
  Two views forming a positive pair are created from one image by data augmentation
  View 1: patches are randomly replaced with noise and fed to the ViT
  View 2: fed without masking to an exponential-moving-average model of the ViT
  Contrastive learning uses the outputs at the contrastive token
  The normalized temperature-scaled cross-entropy loss is used as the loss function
Introducing contrastive learning: SiT [Atito+, arXiv21]
(Figure: contrastive branch of the SiT pipeline — contrastive head applied to the contrastive embedding of both views)
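A minimal PyTorch sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss used for the contrastive-token outputs above: for each sample, the matching embedding from the other view is the positive and all other embeddings in the batch act as negatives. The temperature value is illustrative.

# Minimal NT-Xent loss sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)            # (2B, D) normalized embeddings
    sim = z @ z.t() / tau                                          # cosine similarity / temperature
    eye = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))                      # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])    # index of the positive for each row
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))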
• Accuracy comparison on downstream tasks
  Self-supervised learning on ImageNet-1K → fine-tuning on the downstream task
Introducing contrastive learning: SiT [Atito+, arXiv21]
  (Table 3 of the paper: domain transfer of SiT pre-trained on ImageNet-1K, ViT-S/16, top-1 accuracy)
  Method        Flowers  Pets  CUB   Aircraft  STL10  Cars  CIFAR10  CIFAR100  ImageNet-1K
  Random init.  68.8     47.5  25.3  31.1      77.1   27.4  96.9     77.8      –
  Supervised    98.1     91.1  82.7  80.8      98.2   91.7  98.3     86.9      79.9
  MoCo-v3       97.7     92.3  82.6  87.3      98.0   93.0  98.2     86.6      81.4
  DINO*         97.8     89.4  80.8  83.8      96.7   93.1  98.6     87.1      81.5
  SiT           98.2     92.6  84.6  87.6      98.8   93.2  99.0     90.8      82.0
Accuracy improves over supervised learning and single-task self-supervised methods
• CMAE: Contrastive Masked Autoencoders
• Adds pair prediction (contrastive learning) to MAE
  A feature decoder that predicts the features of each patch is added
  A masked image and a pixel-shifted view of the same source image form the positive pair for contrastive learning
Introducing contrastive learning: CMAE [Mao+, arXiv22]
(Figure: input → masked image into the online encoder, pixel-shifted view into the target encoder; a pixel decoder computes the reconstruction loss and a feature decoder with a projection head computes the contrastive loss)
• Online encoder
  Applies patch-level masking and takes only the unmasked patches as input
• Target encoder / projection head
  Takes all patches as input, without masking
Introducing contrastive learning: CMAE [Mao+, arXiv22]
• Pixel decoder
  Takes the online encoder's patch tokens and mask tokens and outputs the pixels of each patch
• Feature decoder
  Takes the online encoder's patch tokens and mask tokens and outputs the features of each patch
Introducing contrastive learning: CMAE [Mao+, arXiv22]
• Reconstruction loss
  MSE loss against the corresponding input patches, computed only on the outputs at the mask tokens
• Contrastive loss
  InfoNCE loss with the masked image and the pixel-shifted view of the same source image as the positive pair
  The average of the patch tokens is used as the feature of the whole image
Introducing contrastive learning: CMAE [Mao+, arXiv22]
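A minimal PyTorch sketch of the contrastive branch described above: patch tokens are mean-pooled into an image-level feature, and an InfoNCE loss treats the masked view and the pixel-shifted view of the same source image as the positive pair, with the other images in the batch as negatives. The temperature is illustrative.

# Minimal InfoNCE sketch for mean-pooled patch tokens, under the assumptions stated above.
import torch
import torch.nn.functional as F

def contrastive_infonce(online_tokens: torch.Tensor,   # (B, N_vis, D) tokens of the masked view
                        target_tokens: torch.Tensor,   # (B, N, D) tokens of the pixel-shifted view
                        tau: float = 0.07) -> torch.Tensor:
    q = F.normalize(online_tokens.mean(dim=1), dim=-1)   # image-level feature = mean of patch tokens
    k = F.normalize(target_tokens.mean(dim=1), dim=-1)
    logits = q @ k.t() / tau                              # (B, B) similarity matrix
    labels = torch.arange(q.size(0))                      # diagonal entries are the positive pairs
    return F.cross_entropy(logits, labels)

loss = contrastive_infonce(torch.randn(8, 49, 256), torch.randn(8, 196, 256))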
• Parameter updates
  Online encoder, decoders, projection head: updated by gradient descent on the losses
  Target encoder: updated as an exponential moving average of the online encoder's parameters
Introducing contrastive learning: CMAE [Mao+, arXiv22]
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → fine-tuning on ImageNet-1K
Introducing contrastive learning: CMAE [Mao+, arXiv22]
  (Table 2 of the paper: ViT-B, top-1 accuracy; * = ConvMAE-style backbone)
  Method    Pre-training epochs  Supervision          Accuracy
  MoCo-v3   300                  RGB                  83.2
  DINO      300                  RGB                  82.8
  CIM       300                  RGB                  83.3
  BEiT      800                  DALLE                83.2
  SimMIM    800                  RGB                  83.8
  PeCo      800                  Perceptual codebook  84.5
  MaskFeat  1600                 HOG                  84.0
  CAE       1600                 DALLE+RGB            83.9
  iBOT      1600                 RGB                  84.0
  SIM       1600                 RGB                  83.8
  MAE       1600                 RGB                  83.6
  CMAE      800 / 1600           RGB                  84.4 / 84.7
  ConvMAE*  800 / 1600           RGB                  84.6 / 84.6
  CMAE*     800 / 1600           RGB                  85.0 / 85.3
Accuracy improves over single-task self-supervised methods
• MSN: Masked Siamese Networks
• Proposes a negative-free method that uses masked views
  Trained so that the probability distributions of a masked view and a different, unmasked view match
  The probabilities are obtained from the similarities between the class token and a set of prototypes, used as class scores
  The prototypes are learnable parameters updated together with the ViT
Introducing contrastive learning: MSN [Assran+, ECCV22]
(Figure: the anchor view is patchified, masked, and encoded by the student; the target view is encoded by an EMA encoder; both representations are compared with the prototypes to produce cluster-assignment distributions, and the anchor prediction p is trained to match the target p+ via a cross-entropy loss H(p+, p))
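A minimal PyTorch sketch of the prototype-matching objective described above; the number of prototypes, the temperatures, and the use of plain cross-entropy (without the paper's additional regularization such as mean-entropy maximization) are simplifying assumptions.

# Minimal MSN-style prototype-matching loss sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

K, D = 1024, 256
prototypes = torch.nn.Parameter(torch.randn(K, D))        # learnable prototypes, updated with the ViT

def msn_loss(anchor_cls: torch.Tensor, target_cls: torch.Tensor,
             tau_a: float = 0.1, tau_t: float = 0.025) -> torch.Tensor:
    protos = F.normalize(prototypes, dim=-1)
    p_anchor = F.softmax(F.normalize(anchor_cls, dim=-1) @ protos.t() / tau_a, dim=-1)
    with torch.no_grad():                                  # target assignments receive no gradient
        p_target = F.softmax(F.normalize(target_cls, dim=-1) @ protos.t() / tau_t, dim=-1)
    return -(p_target * torch.log(p_anchor + 1e-8)).sum(dim=-1).mean()   # cross-entropy H(p+, p)

loss = msn_loss(torch.randn(16, D), torch.randn(16, D))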
• Accuracy comparison on ImageNet-1K
  Self-supervised learning on ImageNet-1K → linear evaluation on ImageNet-1K
Introducing contrastive learning: MSN [Assran+, ECCV22]
  (Table 3 of the paper: linear evaluation using 100% of the labels, top-1 accuracy)
  Similar architectures: SimCLRv2 RN50 71.7; BYOL RN50 74.4; DINO ViT-S/16 77.0; iBOT ViT-S/16 77.9; MSN ViT-S/16 76.9
  Larger architectures : MAE ViT-H/14 76.6; BYOL RN200 (2×) 79.6; SimCLRv2 RN151+SK (3×) 79.8; iBOT ViT-B/16 79.4; DINO ViT-B/8 80.1; MoCov3 ViT-BN-L/7 81.0; MSN ViT-L/7 80.7
  Linear evaluation here uses 100% of the training data
  (Table 1 of the paper: extreme low-shot evaluation with 1, 2 or 5 labeled images per class; e.g. with 1 image per class MSN ViT-L/7 reaches 57.1% vs. 46.1% for iBOT ViT-B/16 and 12.3% for MAE ViT-L/16)
  Linear evaluation here uses only 1-5 labeled samples per class
In the few-shot setting, accuracy improves over contrastive learning and previous MIM methods
• FLIP: Fast Language-Image Pre-training
• Proposes a CLIP variant trained on masked images (a sketch follows the figure below)
  A Vision Transformer is used as the image encoder
  Only the image patches that are not masked are fed to the image encoder during training
• After pre-training, an unmasked tuning strategy runs a small number of additional FLIP (= CLIP) steps with 0% masking
  This absorbs the gap between masked and unmasked images
Introducing contrastive learning: FLIP [Li+, arXiv2022]
[Figure: FLIP overview — the visible patches of the masked image go through the image encoder, the text through the text encoder, and the two are trained with a contrastive loss.]
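A minimal sketch of a FLIP-style training step under stated assumptions: the encoders are replaced by stand-ins (mean-pooled random patch features and random text features), and the shapes and temperature are illustrative. It shows the two ingredients above — only the visible patches survive the random masking, and the pooled image features enter a standard CLIP-style symmetric contrastive loss.

```python
# Minimal sketch (not the authors' code) of the FLIP training idea: randomly drop a
# fraction of image patches, encode only the visible ones, and apply the usual CLIP-style
# contrastive loss between pooled image features and text features.
import numpy as np

rng = np.random.default_rng(0)

def random_visible(patches, mask_ratio=0.5):
    """Keep only (1 - mask_ratio) of the patch tokens; the rest are discarded entirely."""
    n = patches.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - mask_ratio))]
    return patches[keep]

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def clip_loss(img_feat, txt_feat, temp=0.07):
    """Symmetric InfoNCE over a batch of paired image/text features."""
    logits = l2norm(img_feat) @ l2norm(txt_feat).T / temp
    labels = np.arange(len(logits))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

# toy batch: 4 images x 196 patches x 768 dims, plus matching text features
patches = rng.normal(size=(4, 196, 768))
img_feat = np.stack([random_visible(p).mean(axis=0) for p in patches])  # stand-in image encoder: mean-pool visible patches
txt_feat = rng.normal(size=(4, 768))                                    # stand-in text encoder output
print(clip_loss(img_feat, txt_feat))
# After pre-training, FLIP runs a short unmasked tuning phase (mask_ratio = 0) to close the
# gap between masked training images and full test images.
```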
[Plot: zero-shot accuracy (%) vs. training time (hours) for mask 0% (our CLIP repro.), mask 50%, and mask 75%; masking yields a 3.7× speedup.]
The masking strategy substantially shortens training time.
• Accuracy comparison on ImageNet-1K
  Self-supervised pre-training on LAION-400M → zero-shot transfer, linear evaluation, and fine-tuning on ImageNet-1K
• Accuracy evaluation on downstream classification tasks
  Self-supervised pre-training on LAION-400M → zero-shot transfer to the downstream tasks
Introducing contrastive learning: FLIP [Li+, arXiv2022]
case data epochs B/16 L/16 L/14 H/14
CLIP [52] WIT-400M 32 68.6 - 75.3 -
OpenCLIP [36] LAION-400M 32 67.1 - 72.8 -
CLIP, our repro. LAION-400M 32 68.2 72.4 73.1 -
FLIP LAION-400M 32 68.0 74.3 74.6 75.5
Table 2. Zero-shot accuracy on ImageNet-1K classification, compared with various CLIP baselines. The image size is 224. The entries
noted by grey are pre-trained on a different dataset. Our models use a 64k batch, 50% masking ratio, and unmasked tuning.
case data epochs model zero-shot linear probe fine-tune
CLIP [52] WIT-400M 32 L/14 75.3 83.9† -
CLIP [52], our transfer WIT-400M 32 L/14 75.3 83.0 87.4
OpenCLIP [36] LAION-400M 32 L/14 72.8 82.1 86.2
CLIP, our repro. LAION-400M 32 L/16 72.4 82.6 86.3
FLIP LAION-400M 32 L/16 74.3 83.6 86.9
Table 3. Linear probing and fine-tuning accuracy on ImageNet-1K classification, compared with various CLIP baselines. The entries
noted by grey are pre-trained on a different dataset. The image size is 224. †: CLIP in [52] optimizes with L-BFGS; we use SGD instead.
The speedup of our method is of great practical value.
The CLIP baseline takes ~10 days training in 256 TPU-v3
cores, so a speedup of 2–3× saves many days in wall-clock
time. This speedup facilitates exploring the scaling behav-
ior, as we will discuss later in Sec. 4.3.
4.2. Comparisons with CLIP
In this section, we compare with various CLIP baselines
in a large variety of scenarios. We show that our method is
a competitive alternative to CLIP; as such, our fast training
method is a more desirable choice in practice.
We consider the following CLIP baselines:
• The original CLIP checkpoints [52], trained on the pri-
vate dataset WIT-400M.
Table 2 reports the results of our FLIP models, using
the best practice as we have ablated in Table 1 (a 64k
batch, 50% masking ratio, and unmasked tuning). For
ViT-L/14, our method has 74.6% accuracy, which is 1.8%
higher than OpenCLIP and 1.5% higher than our CLIP re-
production. Comparing with the original CLIP, our method
reduces the gap to 0.7%. We hope our method will improve
the original CLIP result if it were trained on the WIT data.
ImageNet linear probing. Table 3 compares the linear
probing results, i.e., training a linear classifier on the tar-
get dataset with frozen features. FLIP has 83.6% accuracy,
1.0% higher than our CLIP counterpart. It is also 0.6%
higher than our transfer of the original CLIP checkpoint,
using the same SGD trainer.
data: Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, Cars, Aircraft, VOC2007, DTD, Oxford Pets, Caltech101, Flowers102, MNIST, STL10, EuroSAT, RESISC45, GTSRB, KITTI, Country211, PCam, UCF101, Kinetics700, CLEVR, HatefulMemes, SST2 (column order for the per-dataset accuracies in the rows below)
CLIP [52] WIT-400M 92.9 96.2 77.9 48.3 67.7 77.3 36.1 84.1 55.3 93.5 92.6 78.7 87.2 99.3 59.9 71.6 50.3 23.1 32.7 58.8 76.2 60.3 24.3 63.3 64.0
CLIP [52], our eval. WIT-400M 91.0 95.2 75.6 51.2 66.6 75.0 32.3 83.3 55.0 93.6 92.4 77.7 76.0 99.3 62.0 71.6 51.6 26.9 30.9 51.6 76.1 59.5 22.2 55.3 67.3
OpenCLIP [36], our eval. LAION-400M 87.4 94.1 77.1 61.3 70.7 86.2 21.8 83.5 54.9 90.8 94.0 72.1 71.5 98.2 53.3 67.7 47.3 29.3 21.6 51.1 71.3 50.5 22.0 55.3 57.1
CLIP, our repro. LAION-400M 88.1 96.0 81.3 60.5 72.3 89.1 25.8 81.1 59.3 93.2 93.2 74.6 69.1 96.5 50.7 69.2 50.2 29.4 21.4 53.1 71.5 53.5 18.5 53.3 57.2
FLIP LAION-400M 89.3 97.2 84.1 63.0 73.1 90.7 29.1 83.1 60.4 92.6 93.8 75.0 80.3 98.5 53.5 70.8 41.4 34.8 23.1 50.3 74.1 55.8 22.7 54.0 58.5
Table 4. Zero-shot accuracy on more classification datasets, compared with various CLIP baselines. This table follows Table 11 in [52].
The model is ViT-L/14 with an image size of 224, for all entries. Entries in green are the best ones using the LAION-400M data.
Among the 25 classification datasets, FLIP improves accuracy on the majority of them (relative to the CLIP baselines trained on LAION-400M).
Accuracy also improves in both zero-shot transfer and linear evaluation.
Representative Masked Image Modeling methods
• Masked Language Modeling (MLM) in natural language processing: BERT [J. Devlin+, NAACL2019]
• Masked Image Modeling (MIM) for images: BEiT [H. Bao+, ICLR2022], iBOT [J. Zhou+, ICLR2022]
• Tokenizer-free, reconstructing the image by predicting the pixels of masked regions: MAE [K. He+, CVPR2022], SimMIM [Z. Xie+, CVPR2022]
• Predicting HOG features: Masked Feature Prediction [C. Wei+, CVPR2022]
• Mask improvements: multi-fold masking strategy (SdAE [Y. Chen+, ECCV2022]), masks built from attention weights (Attention-Guided MIM [I. Kakogeorgiou+, ECCV2022]), multi-block masking strategy (I-JEPA [M. Assran+, arXiv2023])
• Architecture improvements: acquiring multi-scale features (MCMAE [P. Gao+, NeurIPS2022])
• Introducing contrastive learning: multi-task SimMIM + contrastive learning (SiT [S. Atito+, arXiv2021]), multi-task MAE + contrastive learning (CMAE [J. Mao+, arXiv2022]), CLIP with masked images (FLIP [Y. Li+, arXiv2022]), negative-free learning with masked images (MSN [M. Assran+, ECCV2022])
• Application to other modalities: audio (MAE that Listen [P. Huang+, NeurIPS2022]), video (MAE As Spatiotemporal Learners [C. Feichtenhofer+, NeurIPS2022]), multi-modal (MultiMAE [R. Bachmann+, ECCV2022])
• MCMAE: Masked Convolution Meets Masked Autoencoders
• Proposes an MAE that can acquire multi-scale features
  Masks are created block-wise so that they align with the patch regions of the Transformer blocks
  Masked Convolution is introduced so that features of masked and unmasked regions do not mix (see the sketch after the figure)
Architecture improvements: MCMAE [Gao+, NeurIPS2022]
[Figure: MCMAE architecture — Stage 1 and Stage 2 apply Patch Embedding and Masked Convolution Blocks (DepthWise Conv + FFN, with the mask applied) at H/4×W/4×C1 and H/8×W/8×C2 resolution; Stage 3 applies Transformer Blocks to (H/16×W/16)×C3 tokens under block-wise masking. Multi-scale fusion (StrideConv+Flatten, UpSample) combines the stages, and a linear decoder reconstructs the H×W×3 image.]
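The masked-convolution idea can be sketched as follows — a toy NumPy version with assumed feature-map sizes, not the MCMAE implementation. The mask is applied both before and after a depthwise convolution, so the convolution never leaks information between masked and visible regions.

```python
# Minimal NumPy sketch (assumptions, not the MCMAE code) of "masked convolution": zero the
# masked positions before a depthwise convolution and re-apply the mask afterwards, so that
# features of visible regions never mix with features of masked regions.
import numpy as np

def depthwise_conv3x3(x, w):
    """Naive 3x3 depthwise convolution with zero padding. x: (C, H, W), w: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += w[:, dy, dx][:, None, None] * xp[:, dy:dy + H, dx:dx + W]
    return out

def masked_depthwise_conv(x, mask, w):
    """mask: (H, W) with 1 = visible, 0 = masked."""
    x = x * mask            # masked positions contribute nothing to the convolution
    out = depthwise_conv3x3(x, w)
    return out * mask       # and receive nothing back, keeping the two regions separate

rng = np.random.default_rng(0)
feat = rng.normal(size=(64, 56, 56))                     # Stage-1 feature map (H/4 x W/4, assumed)
mask = (rng.random((56, 56)) > 0.6).astype(feat.dtype)   # toy mask (in MCMAE it is block-wise, aligned to the ViT patches)
kernel = rng.normal(size=(64, 3, 3))
y = masked_depthwise_conv(feat, mask, kernel)
```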
• MultiMAE: Multi-modal Multi-task Masked Autoencoders
• Extends MAE to multi-modal inputs
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
[Figure 2 overview: MultiMAE pre-training (left) encodes selected input patches from RGB, depth, and semantic images with a shared Transformer encoder and reconstructs the masked targets with per-task decoders; fine-tuning (right) reuses the pre-trained MultiMAE encoder with task-specific heads for single-modal or multi-modal downstream tasks. The full caption is reproduced further below.]
• Multi-modal data
  Depth and semantic segmentation maps are created from the RGB images by pseudo labeling
  The outputs of models already trained for segmentation and depth estimation are used as the pseudo labels
• Encoder
  The unmasked patches of all modalities are concatenated and fed in together (see the sketch below)
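A rough NumPy sketch of this input pipeline, with hypothetical shapes; a plain uniform random sampling per modality is used purely for illustration. It only shows the idea of modality-specific linear projections followed by concatenation of the visible tokens for a shared encoder.

```python
# Minimal sketch (hypothetical shapes, not the authors' code) of the MultiMAE input
# pipeline: each modality (RGB, pseudo depth, pseudo semantic segmentation) is patchified
# and linearly projected by its own projection, a random subset of patches per modality is
# kept, and the visible tokens of all modalities are concatenated for a shared encoder.
import numpy as np

rng = np.random.default_rng(0)
D = 768                                   # shared token dimension (assumed)

def patchify(img, patch=16):
    """(C, H, W) -> (num_patches, C*patch*patch)."""
    C, H, W = img.shape
    img = img.reshape(C, H // patch, patch, W // patch, patch)
    return img.transpose(1, 3, 0, 2, 4).reshape(-1, C * patch * patch)

modalities = {
    "rgb": rng.normal(size=(3, 224, 224)),
    "depth": rng.normal(size=(1, 224, 224)),      # pseudo label from an off-the-shelf depth model
    "semseg": rng.normal(size=(1, 224, 224)),     # pseudo label from a segmentation model
}
projections = {k: rng.normal(size=(patchify(v).shape[1], D)) * 0.01
               for k, v in modalities.items()}    # per-modality linear projections

visible_tokens, keep_ratio = [], 1 / 6            # roughly 1/6 of all patches are kept
for name, img in modalities.items():
    tokens = patchify(img) @ projections[name]    # (196, D) per modality
    keep = rng.permutation(len(tokens))[: int(len(tokens) * keep_ratio)]
    visible_tokens.append(tokens[keep])

encoder_input = np.concatenate(visible_tokens, axis=0)   # fed to the shared Transformer encoder
print(encoder_input.shape)
```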
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
Figure 2. (Left) MultiMAE pre-training: A small subset of randomly sampled patches from multiple modalities (e.g., RGB, depth, and
semantic segmentation) is linearly projected to tokens with a fixed dimension and encoded using a Transformer. Task-specific decoders
reconstruct the masked-out patches by first performing a cross-attention step from queries to the encoded tokens, followed by a shallow
Transformer. The queries consist of mask tokens (in gray), with the task-specific encoded tokens added at their respective positions. (Right)
Fine-tuning: By pre-training on multiple modalities, MultiMAE lends itself to fine-tuning on single-modal and multi-modal downstream tasks.
Semantic and Depth inputs: data that must be prepared in advance (here obtained by pseudo labeling).
• Decoder
  Positional and modality embeddings are added to the linearly projected encoder outputs
  Cross-attention produces tokens that take cross-modality relations into account, which are then fed to Transformer blocks (a sketch follows the decoder details below)
  ‣ Query: the tokens of each individual modality after linear projection
  ‣ Key, Value: the tokens of all modalities after linear projection
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
To keep the number of segmentation patches constant, we downsample the semantic segmentation input by a factor of 4 and use patches of size 4×4.
MultiMAE decoder. We illustrate the MultiMAE decoder
in Fig 7. Following MAE [35], each decoder has a linear
projection layer to adapt the outputs from the encoder to
the decoder dimension. After this linear projection, we add
both sine-cosine positional embeddings and learned modal-
ity embeddings to the decoder inputs. This is then followed
by a cross-attention layer, a MLP, and two Transformer
blocks.
[Figure 7: MultiMAE decoders — the queries (mask tokens plus that modality's encoded tokens) cross-attend to keys and values formed from the encoded tokens of all modalities.]
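A single-head cross-attention sketch of this decoder step, with assumed dimensions (not the authors' code): the queries come from one modality's decoder tokens, while the keys and values are built from the encoded tokens of all modalities.

```python
# Minimal single-head cross-attention sketch (assumed dimensions, not the paper's code) of
# the MultiMAE decoder: queries are one modality's decoder tokens (mask tokens plus that
# modality's encoded tokens, with positional and modality embeddings added), while keys and
# values come from the encoded tokens of *all* modalities.
import numpy as np

rng = np.random.default_rng(0)
d = 256                                     # decoder dimension (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """queries: (Nq, d) for one modality; context: (Nc, d) encoded tokens of all modalities."""
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v                          # an MLP and shallow Transformer blocks follow

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
depth_queries = rng.normal(size=(196, d))    # mask tokens + encoded depth tokens (one per patch)
all_context = rng.normal(size=(98, d))       # linearly projected encoder outputs, all modalities
out = cross_attention(depth_queries, all_context, Wq, Wk, Wv)
print(out.shape)
```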
• Visualization of reconstructed images
  In each of the three modalities, 5/6 of all patches are randomly masked and then reconstructed
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Roman Bachmann* David Mizrahi* Andrei Atanov Amir Zamir
Swiss Federal Institute of Technology Lausanne (EPFL)
https://multimae.epfl.ch
[Figure 1 panels: masked inputs, MultiMAE predictions, and targets for the semantic, depth, and RGB modalities.]
Figure 1. MultiMAE pre-training objective. We randomly select 1/6 of all 16⇥16 image patches from multiple modalities and learn to
reconstruct the remaining 5/6 masked patches from them. The figure shows validation examples from ImageNet, where masked inputs (left),
• Accuracy comparison on downstream tasks (models self-supervised on ImageNet-1K are fine-tuned)
Application to other modalities: MultiMAE [Bachmann+, ECCV2022]
Method IN-1K (C) ADE20K (S) Hypersim (S) NYUv2 (S) NYUv2 (D)
Supervised [81] 81.8 45.8 33.9 50.1 80.7
DINO [12] 83.1 44.6 32.5 47.9 81.3
MoCo-v3 [17] 82.8 43.7 31.7 46.6 80.9
MAE [35] 83.3 46.2 36.5 50.8 85.1
MultiMAE 83.3 46.2 37.0 52.0 86.4
Table 1. Fine-tuning with RGB-only. We report the top-1 accuracy (↑) on ImageNet-1K (IN-1K) [23] classification (C), mIoU (↑) on ADE20K [102], Hypersim [68], and NYUv2 [73] semantic segmentation (S), as well as δ1 accuracy (↑) on NYUv2 depth (D). Text in bold and underline indicates the first and second-best results, respectively. All methods are pre-trained on ImageNet-1K (with pseudo labels for MultiMAE).
            Hypersim (S)           NYUv2 (S)
Method      RGB   D     RGB-D      RGB   D     RGB-D
MAE         36.5  32.5  36.9       50.8  23.4  49.3
MultiMAE    37.0  38.5  47.6       52.0  41.4  56.0
Table 2. Fine-tuning with RGB and ground truth depth. We report semantic segmentation transfer results from combinations of RGB and depth, measured in mIoU (↑). MultiMAE can effectively leverage additional modalities such as depth, while MAE cannot. Text in gray indicates a modality that the model was not pre-trained on.

            ADE20K (S)                                Hypersim (S)                              NYUv2 (S)
Method      RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS    RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS    RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS
MAE         46.2  20.0  46.3    46.2    46.3         36.5  21.0  36.9    37.7    37.3         50.1  23.8  49.1    50.1    49.3
MultiMAE    46.2  34.4  46.8    45.7    47.1         37.0  30.6  37.9    38.4    40.1         52.0  39.9  53.6    53.5    54.0
Table 3. Fine-tuning with RGB and pseudo labels. Semantic segmentation transfer results using pseudo labeled depth and semantic segmentation maps, measured in mIoU (↑). MultiMAE benefits much more than MAE from pseudo labeled modalities as input. Text in gray indicates a modality that the model was not pre-trained on.

For performing multi-modal transfers with the standard MAE, we train a new input projection for the additional modalities while fine-tuning. As shown in Table 3, MultiMAE can use pseudo depth or semantic segmentation to boost performance beyond the RGB-only setting.
(pD): depth maps obtained by pseudo labeling
(pS): semantic segmentation maps obtained by pseudo labeling
Fine-tuning with all three modalities (RGB-pD-pS) achieves the highest recognition performance.
• MAE applied to video and audio data
Application to other modalities
Figure 1: Masked Autoencoders as spatiotemporal learners. We mask a large subset (e.g., 90%)
of random patches in spacetime. An encoder operates on the set of visible patches. A small decoder
then processes the full set of encoded patches and mask tokens to reconstruct the input. Except for
patch and positional embeddings, neither the encoder, the decoder, nor the masking strategy, has any
spatiotemporal inductive bias.
To the extreme, if a video has T identical static frames, randomly sampling 1/T of all spacetime
patches would reveal most of the static frame. Because slow motion is more likely than fast motion
in natural videos, the masking ratio can be very high as we observe empirically.
The higher masking ratio leads to a more efficient solution in practice. Following the MAE in [31]
that applies the encoder only on visible tokens, a masking ratio of 90% reduces the encoder time and
memory complexity to <1/10. Put together with a small decoder [31], the MAE pre-training can
achieve a theoretically 7.7× reduction in computation vs. encoding all tokens. In fact, the computation
reduction is so large that the data loading time becomes a new bottleneck; even so, we record a 4.1×
wall-clock speedup. Such a significant speedup is of great importance for video research that is
large-scale and time-consuming.
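The efficiency argument can be illustrated with a small sketch using the sizes quoted in the figure caption (16×224×224 video, 2×16×16 spacetime patches, 90% masking); it is a toy illustration, not the paper's data pipeline.

```python
# Small sketch (toy sizes consistent with the figure caption; not the authors' code) showing
# why the encoder cost drops: with 90% of the 1568 spacetime patches masked, only ~156
# tokens are ever passed through the encoder.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, pt, ph, pw = 16, 224, 224, 2, 16, 16
video = rng.normal(size=(T, H, W, 3))

# (T, H, W, 3) -> (num_spacetime_patches, patch_voxels)
patches = (video
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, 3)
           .transpose(0, 2, 4, 1, 3, 5, 6)
           .reshape(-1, pt * ph * pw * 3))
assert patches.shape[0] == 8 * 14 * 14 == 1568

mask_ratio = 0.9
keep = rng.permutation(len(patches))[: int(len(patches) * (1 - mask_ratio))]
visible = patches[keep]                      # only these ~156 tokens go through the encoder
print(visible.shape)
```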
We report strong results on a variety of video recognition datasets. Our MAE pre-training greatly
improves generalization performance: on Kinetics-400 [35], it increases the accuracy of ViT-Large
[18] by absolute 13% vs. training from scratch, while it takes less wall-clock training time overall
(pre-training plus fine-tuning). Our MAE pre-training can outperform its supervised pre-training counterpart.
MAE As Spatiotemporal Learners [Feichtenhofer+, NeurIPS2022]
Figure 1: Audio-MAE for audio self-supervised learning. An audio recording is first transformed
into a spectrogram and split into patches. We embed patches and mask out a large subset (80%).
An encoder then operates on the visible (20%) patch embeddings. Finally, a decoder processes the
order-restored embeddings and mask tokens to reconstruct the input. Audio-MAE is minimizing the
mean square error (MSE) on the masked portion of the reconstruction and the input spectrogram.
This computational burden has been addressed in different ways. A popular approach is to reduce the
sequence length in self-attention. Various ViT-based architectures have been developed to alleviate
such issues for image and video understanding. For example, Swin-Transformer [19] only performs
local attention within windows that shift across layers. MViT [20] employs pooling attention to
construct a hierarchy of Transformers where sequence lengths are downsampled. For self-supervised
learning, MAE [1] efficiently encodes only a small portion (25%) of visual patches while the majority
of patches is discarded. The simplicity and scalability in MAE make it a promising framework for
large-scale self-supervised learning.
In this work, we study MAE for sound recognition and the unique challenges of the audio domain.
We present Audio-MAE (Fig. 1) as a unified and scalable framework for learning self-supervised audio
representations. Similar to MAE, it is composed of a pair of a Transformer encoder and decoder.
Figure 2: Visualizations on the Kinetics-400 [35] validation set (masking ratio 90%). We show the original
video (top), masked video (middle), and MAE output (bottom) for each sample. This model reconstructs the
original pixels. The video size is 16×224×224 and the spacetime patch size is 2×16×16 (the temporal patch
size of 2 is not visualized here). Each sample has 8×14×14=1568 tokens with 156 being visible. For better
visualizations, the known patches in the output are from the original input. Fig. 7 shows more examples.
Figure 3: Visualizations of the same pre-trained model in Fig. 2 but with a masking ratio of 95%.
MAE that Listen [Huang+, NeurIPS2022]
Masking is applied to the spectrogram (Audio-MAE)
Masking is applied to the video (MAE As Spatiotemporal Learners)
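As a small illustration of the Audio-MAE objective described above, the sketch below computes the MSE only over the masked time-frequency patches; the 80% masking ratio follows the description, while the spectrogram size, patching, and feature values are placeholders.

```python
# Tiny sketch (assumed shapes, not the authors' code) of the Audio-MAE objective: the
# decoder output is compared to the patchified input spectrogram with a mean squared error
# computed only over the masked patches, exactly as in image MAE but on time-frequency patches.
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim = 512, 256                      # e.g. a spectrogram split into 16x16 patches (assumed)
target = rng.normal(size=(num_patches, patch_dim))     # patchified input spectrogram (placeholder)
recon = rng.normal(size=(num_patches, patch_dim))      # decoder output (placeholder)
masked = rng.random(num_patches) < 0.8                 # 80% of the patches are masked

loss = ((recon - target)[masked] ** 2).mean()          # MSE on the masked portion only
print(loss)
```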
• Effectiveness of MAE As Spatiotemporal Learners (MAE for video)
  Training-time comparison: MAE pre-training + fine-tuning vs. training from scratch
Application to other modalities
[Figure 5: accuracy (%) on Kinetics-400 validation vs. wall-clock training time in hours (128 A100 GPUs) — MAE pre-training (800 epochs) followed by fine-tuning (100 epochs) is compared with training from scratch (400 epochs, 1-view and multi-view); MAE pre-training plus fine-tuning is much more accurate and faster than training from scratch.]
MAE pre-training + fine-tuning reaches higher performance in less training time.
• Large amounts of unlabeled data are learned through pseudo pretext tasks
• The model trained with self-supervised learning is then used as a pre-trained model
• Self-supervised learning for CNNs
  Representative methods: pretext-task improvements → contrastive learning → negative-free methods
  Contrastive learning greatly improved performance
  In some settings, such as object detection and segmentation, it matches or exceeds supervised pre-training
• Self-supervised learning for ViTs
  Representative methods: contrastive learning and negative-free methods → Masked Image Modeling
  Methods are designed to exploit the structure of the Vision Transformer
Summary