Slide 101
Slide 101 text
• Accuracy comparison on downstream tasks (fine-tuning models self-supervised on ImageNet-1K)
• Applying different modalities: MultiMAE [Bachmann+, ECCV'22]
Method            IN-1K (C)   ADE20K (S)   Hypersim (S)   NYUv2 (S)   NYUv2 (D)
Supervised [81]   81.8        45.8         33.9           50.1        80.7
DINO [12]         83.1        44.6         32.5           47.9        81.3
MoCo-v3 [17]      82.8        43.7         31.7           46.6        80.9
MAE [35]          83.3        46.2         36.5           50.8        85.1
MultiMAE          83.3        46.2         37.0           52.0        86.4
Table 1. Fine-tuning with RGB-only. We report the top-1 accuracy (↑) on ImageNet-1K (IN-1K) [23] classification (C), mIoU (↑) on ADE20K [102], Hypersim [68], and NYUv2 [73] semantic segmentation (S), as well as δ1 accuracy (↑) on NYUv2 depth (D). Text in bold and underline indicates the first and second-best results, respectively. All methods are pre-trained on ImageNet-1K (with pseudo labels for MultiMAE).
                  Hypersim (S)            NYUv2 (S)
Method            RGB    D      RGB-D    RGB    D      RGB-D
MAE               36.5   32.5   36.9     50.8   23.4   49.3
MultiMAE          37.0   38.5   47.6     52.0   41.4   56.0
Table 2. Fine-tuning with RGB and ground truth depth. We report semantic segmentation transfer results from combinations of RGB and depth, measured in mIoU (↑). MultiMAE can effectively leverage additional modalities such as depth, while MAE cannot. Text in gray indicates a modality that the model was not pre-trained on.
                 ADE20K (S)                           Hypersim (S)                          NYUv2 (S)
Method           RGB   pD    RGB-pD RGB-pS RGB-pD-pS  RGB   pD    RGB-pD RGB-pS RGB-pD-pS  RGB   pD    RGB-pD RGB-pS RGB-pD-pS
MAE              46.2  20.0  46.3   46.2   46.3       36.5  21.0  36.9   37.7   37.3       50.1  23.8  49.1   50.1   49.3
MultiMAE         46.2  34.4  46.8   45.7   47.1       37.0  30.6  37.9   38.4   40.1       52.0  39.9  53.6   53.5   54.0
Table 3. Fine-tuning with RGB and pseudo labels. Semantic segmentation transfer results using pseudo labeled depth and semantic segmentation maps, measured in mIoU (↑). MultiMAE benefits much more than MAE from pseudo labeled modalities as input. Text in gray indicates a modality that the model was not pre-trained on.
[Figure: MultiMAE overview. (Left) Pre-training: sampled patches from multiple modalities (e.g., RGB, depth, and semantic segmentation) are projected and encoded using a Transformer; task-specific decoders reconstruct the masked targets via cross-attention from queries to the encoded tokens, followed by a shallow decoder, with task-specific encoded tokens added at their respective positions. (Right) The pre-trained MultiMAE encoder, with task-specific head(s), transfers to fine-tuning on single-modal and multi-modal downstream tasks.]
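To make the figure's pipeline concrete, here is a minimal sketch of the encoder side in PyTorch. It is an illustration under assumptions rather than the authors' implementation: the class name, the per-modality channel counts, and the plain nn.TransformerEncoder are all hypothetical, and positional/modality embeddings and the masking step are omitted.

import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Sketch of a MultiMAE-style encoder: one patch projection per
    modality, a single shared Transformer over the union of tokens."""

    def __init__(self, modalities=("rgb", "depth", "semseg"),
                 in_chans=(3, 1, 64), patch=16, dim=768, layers=12):
        super().__init__()
        # Per-modality linear patch projections (channel counts are
        # illustrative; the real model embeds semseg classes differently).
        self.proj = nn.ModuleDict({
            m: nn.Conv2d(c, dim, kernel_size=patch, stride=patch)
            for m, c in zip(modalities, in_chans)
        })
        block = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Project each modality to a token sequence, concatenate all
        # sequences, and encode them jointly with the shared Transformer.
        tokens = [self.proj[m](x).flatten(2).transpose(1, 2)
                  for m, x in inputs.items()]
        return self.encoder(torch.cat(tokens, dim=1))

During pre-training only a random subset of these tokens would be encoded; during multi-modal fine-tuning all tokens from every available modality are passed in, which is why the cost grows with the number of modalities, as the paper excerpt below notes.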
[Excerpt from the paper:] "... than two modalities during transfer quickly becomes computationally expensive, since without masking, our method now scales with the full number of modalities and tokens. For performing multi-modal transfers with the standard MAE, we train a new input projection for the additional modalities while fine-tuning. Further training details can [...] depth model was partially trained on this dataset. For semantic segmentation pseudo labels, we use the same Mask2Former model as in pre-training. As shown in Table 3, MultiMAE can use pseudo labeled depth or semantic segmentation to boost performance beyond the RGB-only setting, although the gains [...]"
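The MAE-baseline detail in the excerpt (a new input projection trained for each added modality) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: `mae_encoder` stands for a generic pre-trained ViT-style model with `patch_embed` and `blocks` attributes (as in timm ViTs), and the attribute name `depth_patch_embed` is hypothetical.

import torch
import torch.nn as nn

def add_input_projection(mae_encoder: nn.Module, in_chans: int,
                         patch: int = 16, dim: int = 768) -> nn.Conv2d:
    """Attach a freshly initialized patch projection for an extra
    modality (e.g., depth) to a pre-trained RGB-only MAE encoder.
    Only this projection starts from scratch; it is then trained
    jointly with the rest of the encoder during fine-tuning."""
    proj = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)
    mae_encoder.add_module("depth_patch_embed", proj)
    return proj

# Usage sketch (attribute names follow timm-style ViTs):
# rgb_tokens   = mae_encoder.patch_embed(rgb)               # pre-trained
# depth_tokens = proj(depth).flatten(2).transpose(1, 2)     # new
# x = mae_encoder.blocks(torch.cat([rgb_tokens, depth_tokens], dim=1))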
• (pD): depth information obtained via pseudo labeling (see the sketch below)
• (pS): semantic segmentation information obtained via pseudo labeling
• Fine-tuning with all three modalities achieves high recognition performance
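How pD and pS are produced can be sketched as follows: frozen off-the-shelf predictors are run over the RGB images (the paper mentions a Mask2Former model for the segmentation labels). The callables `depth_model` and `semseg_model` below are placeholders for whatever pre-trained networks are available, not specific library APIs.

import torch

@torch.no_grad()
def make_pseudo_labels(rgb_batch: torch.Tensor, depth_model, semseg_model):
    """Generate pseudo depth (pD) and pseudo semantic segmentation (pS)
    for a batch of RGB images with frozen, pre-trained predictors.
    Both models are placeholders (the paper uses a Mask2Former model
    for the segmentation pseudo labels)."""
    depth_model.eval()
    semseg_model.eval()
    pseudo_depth = depth_model(rgb_batch)        # (B, 1, H, W) depth map
    sem_logits = semseg_model(rgb_batch)         # (B, num_classes, H, W)
    pseudo_semseg = sem_logits.argmax(dim=1)     # (B, H, W) class indices
    return pseudo_depth, pseudo_semseg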