[Figure: MultiMAE overview. (Left) Pre-training: a subset of sampled patches from multiple modalities (e.g., RGB, depth, and semantic segmentation) is projected and encoded using a Transformer. Task-specific decoders reconstruct the masked targets via cross-attention from queries to the encoded tokens, followed by a shallow Transformer, with task-specific encoded tokens added at their respective positions. (Right) The pre-trained MultiMAE encoder transfers to fine-tuning on single-modal and multi-modal downstream tasks using task-specific head(s).]

Method           IN-1K (C)  ADE20K (S)  Hypersim (S)  NYUv2 (S)  NYUv2 (D)
Supervised [81]    81.8       45.8        33.9          50.1       80.7
DINO [12]          83.1       44.6        32.5          47.9       81.3
MoCo-v3 [17]       82.8       43.7        31.7          46.6       80.9
MAE [35]           83.3       46.2        36.5          50.8       85.1
MultiMAE           83.3       46.2        37.0          52.0       86.4

Table 1. Fine-tuning with RGB-only. We report the top-1 accuracy (↑) on ImageNet-1K (IN-1K) [23] classification (C), mIoU (↑) on ADE20K [102], Hypersim [68], and NYUv2 [73] semantic segmentation (S), as well as δ1 accuracy (↑) on NYUv2 depth (D). Text in bold and underline indicates the first and second-best results, respectively. All methods are pre-trained on ImageNet-1K (with pseudo labels for MultiMAE).

             Hypersim (S)           NYUv2 (S)
Method      RGB    D     RGB-D    RGB    D     RGB-D
MAE         36.5   32.5  36.9     50.8   23.4  49.3
MultiMAE    37.0   38.5  47.6     52.0   41.4  56.0

Table 2. Fine-tuning with RGB and ground truth depth. We report semantic segmentation transfer results from combinations of RGB and depth, measured in mIoU (↑). MultiMAE can effectively leverage additional modalities such as depth, while MAE cannot. Text in gray indicates a modality that the model was not pre-trained on.

            ADE20K (S)                             Hypersim (S)                            NYUv2 (S)
Method      RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS  RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS  RGB   pD    RGB-pD  RGB-pS  RGB-pD-pS
MAE         46.2  20.0  46.3    46.2    46.3       36.5  21.0  36.9    37.7    37.3       50.1  23.8  49.1    50.1    49.3
MultiMAE    46.2  34.4  46.8    45.7    47.1       37.0  30.6  37.9    38.4    40.1       52.0  39.9  53.6    53.5    54.0

Table 3. Fine-tuning with RGB and pseudo labels. Semantic segmentation transfer results using pseudo labeled depth and semantic segmentation maps, measured in mIoU (↑). MultiMAE benefits much more than MAE from pseudo labeled modalities as input. Text in gray indicates a modality that the model was not pre-trained on.
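The pre-training scheme summarized in the figure placeholder above can be illustrated with a minimal sketch. This is not the authors' released code: module names (`MultiModalMaskedEncoder`, `proj`, `encoder`) are placeholders, positional and modality embeddings are omitted, and visible tokens are kept uniformly at random rather than with MultiMAE's per-modality sampling.

```python
import torch
import torch.nn as nn

class MultiModalMaskedEncoder(nn.Module):
    """Toy multi-modal masked encoder (positional/modality embeddings omitted)."""

    def __init__(self, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        # One patch projection per modality: RGB has 3 channels, depth has 1.
        self.proj = nn.ModuleDict({
            "rgb":   nn.Conv2d(3, dim, kernel_size=patch, stride=patch),
            "depth": nn.Conv2d(1, dim, kernel_size=patch, stride=patch),
        })
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, inputs, keep_ratio=0.25):
        # Tokenize each modality and concatenate along the sequence axis.
        tokens = torch.cat(
            [self.proj[m](x).flatten(2).transpose(1, 2) for m, x in inputs.items()],
            dim=1)                                    # (B, N_total, dim)
        b, n, d = tokens.shape
        # Keep a random subset of tokens, drawn jointly across all modalities
        # (the paper's per-modality sampling is simplified to uniform here).
        keep = torch.rand(b, n).argsort(dim=1)[:, : int(n * keep_ratio)]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))
        # Only visible tokens are encoded; masked ones are left to the decoders.
        return self.encoder(visible)

enc = MultiModalMaskedEncoder()
out = enc({"rgb": torch.randn(2, 3, 64, 64), "depth": torch.randn(2, 1, 64, 64)})
print(out.shape)  # torch.Size([2, 8, 256]): 16 tokens per modality, 25% kept
```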
Using more than two modalities during transfer quickly becomes computationally expensive, since without masking, our method now scales with the full number of modalities and tokens. For performing multi-modal transfers with the standard MAE, we train a new input projection for the additional modalities while fine-tuning. Further training details can be found in the supplementary material.
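As a hedged illustration of this setup, the sketch below adds a newly initialized input projection for depth next to a pre-trained RGB projection and feeds the concatenated token sequence to the encoder; the doubled sequence length makes the cost of unmasked multi-modal transfer explicit. All module names are illustrative assumptions, not the released MultiMAE code.

```python
import torch
import torch.nn as nn

dim, patch = 256, 16
# Projection carried over from RGB pre-training vs. a new, randomly
# initialized projection trained only during multi-modal fine-tuning.
pretrained_rgb_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
new_depth_proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)  # pre-trained in practice

rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224)
rgb_tokens = pretrained_rgb_proj(rgb).flatten(2).transpose(1, 2)   # (2, 196, dim)
depth_tokens = new_depth_proj(depth).flatten(2).transpose(1, 2)    # (2, 196, dim)

# Without masking, sequence length grows linearly with the number of
# modalities (196 -> 392 tokens here), and self-attention cost grows
# quadratically in that length, which is why transfers with more than two
# modalities quickly become expensive.
features = encoder(torch.cat([rgb_tokens, depth_tokens], dim=1))   # (2, 392, dim)
print(features.shape)
```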
Regarding the pseudo labels: the NYUv2 pseudo depth results should be interpreted with care, since the depth model was partially trained on this dataset. For semantic segmentation pseudo labels, we use the same Mask2Former model as in pre-training.

As shown in Table 3, MultiMAE can use pseudo labeled depth or semantic segmentation to boost performance beyond the RGB-only setting, although the gains are smaller than with ground truth depth (cf. Table 2).
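The pseudo labeling step might look like the following schematic sketch. `depth_model` and `segmenter` are hypothetical stand-ins for the off-the-shelf depth network and the Mask2Former model mentioned above; they are replaced here by dummy callables (with an illustrative class count) so that the snippet runs.

```python
import torch

def depth_model(rgb):
    # Hypothetical stand-in for the off-the-shelf monocular depth network:
    # maps (B, 3, H, W) RGB to a (B, 1, H, W) depth map.
    return rgb.mean(dim=1, keepdim=True)

def segmenter(rgb, num_classes=133):
    # Hypothetical stand-in for the Mask2Former model: per-pixel class ids.
    # num_classes is illustrative, not taken from the paper.
    return torch.randint(0, num_classes, rgb.shape[:1] + rgb.shape[2:])

rgb = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    pseudo_depth = depth_model(rgb)  # "pD" in Table 3
    pseudo_seg = segmenter(rgb)      # "pS" in Table 3
# The (rgb, pseudo_depth, pseudo_seg) triplet then feeds the multi-modal
# fine-tuning setup sketched earlier (RGB-pD, RGB-pS, or RGB-pD-pS inputs).
print(pseudo_depth.shape, pseudo_seg.shape)
```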
Note: pD denotes depth obtained via pseudo labeling, and pS denotes semantic segmentation obtained via pseudo labeling. Fine-tuning with these two additional modalities yields high recognition performance.