EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI [CVPR 2024 & NeurIPS 2024]

伊東聖矢 (NICT), 2025.2.5

Introduction
• This talk introduces papers on datasets and benchmarks for 3D scene understanding in multi-modal, egocentric settings
• It covers the following papers
  • EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI [Wang+ ’24]
  • MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [Lyu+ ’24]
Note: unless otherwise stated, figures and tables are taken from the respective papers

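Embodied AI [Liu+ ’24]
• Traces back to the "Embodied Turing Test" proposed by Alan Turing
• Evaluates whether an agent can cope with the complexity and unpredictability of the real world, not only solve abstract problems in a virtual environment
• Disembodied AI: conversational AI such as ChatGPT
• Embodied AI: AI with a physical body, such as robots and cars
• The goal is to achieve more advanced intelligence through interaction with the physical environment
• A deeper understanding of 3D space and the ability to handle dynamic environments are important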

Holistic 3D scene understanding? – 3D tasks in computer vision
• Detection (part of a figure from [Brazil+ ’23])
• Segmentation (from [Dai+ ’17])
• Scene Completion (from [Cao+ ’22])
• Object Instance Re-localization (from [Wand+ ’19])
• Visual Grounding (from [Ding+ ’23])
• Captioning (from [Chen+ ’20])

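EmbodiedScan
Requirement for embodied agents: holistic understanding of a 3D scene from egocentric observations
Proposes EmbodiedScan, a real-world (indoor) multi-modal benchmark dataset containing egocentric RGB-D data, and Embodied Perceptron, a baseline method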

Comparison with existing datasets
Dataset | #Scans | #Imgs | #Objs | #Cats | #Prompts | Ego Capture | 3D Annotations
Replica | 35 | - | - | - | - | ✗ | ✗
NYU v2 | 464 | 1.4k | 35k | 14 | - | ✓ | ✗
SUN RGB-D | - | 10k | - | 37 | - | Mono. | Box
ScanNet | 1513 | 264k | 36k | 18 | 52k | ✓ | Seg., Lang.
Matterport3D | 2056 | 194k | 51k | 40 | - | Multi-View | Seg.
3RScan | 1482 | 363k | - | - | - | ✓ | Seg.
ArkitScenes | 5047 | 450k | 51k | 17 | - | ✓ | Box
HyperSim | 461 | 77k | - | 40+ | - | Mono. & Syn. | Box
EmbodiedScan | 5185 | 890k | 160k | 762 | 970k | ✓ | Box, Occ., Lang.
Table 1. Comparison with other 3D indoor scene datasets. “Cats” refers to the categories with box annotations for the 3D detection benchmark. EmbodiedScan features more than 10× categories, prompts, and the most diverse annotations. The numbers are still scaling up with further annotations. Mono./Syn./Lang. means Monocular/Synthetic/Language.

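Composition of EmbodiedScan
Frame selection and scene division
Unifies the data format, sampling frequency, and relations between views
• Handles general multi-view cases while preserving temporal continuity
• Adjusted so that the sampling frequency is roughly uniform
Global coordinate system
Aggregates multi-view observations and provides a global coordinate system as the reference for outputs
• Following ScanNet, the origin is placed at the scene center, the horizontal plane on the floor, and the axes are aligned with the walls
• In real applications there is no predefined global coordinate system, and the coordinate system changes with the observations
Datasets with posed RGB-D data are selected from existing indoor 3D datasets → ScanNet, Matterport3D, 3RScan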

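Annotation
3D bounding boxes
• Boxes defined by 3D center, size, and ZXY Euler angles (orientation)
• A SAM-based annotation tool is used to supplement orientations and small objects
Annotation procedure
• Sample keyframes
• Generate axis-aligned boxes from SAM masks
• Adjust
Quality control
• Checked by the annotation team
• 10 to 30 minutes per scene
From https://tai-wang.github.io/embodiedscan/ (cropped portion of the screen)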
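A minimal sketch of how the 9-DoF box described above (3D center, size, ZXY Euler angles) could be expanded into its eight corners. The rotation composition R = Rz · Rx · Ry and the names `euler_zxy_to_matrix` and `box_corners` are assumptions for illustration; the slides only state that the orientation is given as ZXY Euler angles.

```python
import numpy as np


def euler_zxy_to_matrix(alpha, beta, gamma):
    """Rotation matrix for angles (alpha, beta, gamma) about Z, X, Y.

    Assumed convention: R = Rz(alpha) @ Rx(beta) @ Ry(gamma).
    """
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cb, -sb], [0.0, sb, cb]])
    ry = np.array([[cg, 0.0, sg], [0.0, 1.0, 0.0], [-sg, 0.0, cg]])
    return rz @ rx @ ry


def box_corners(center, size, euler_zxy):
    """Return the (8, 3) corner coordinates of an oriented 3D box."""
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    # Corners in the box's local frame, centered at the origin.
    local = 0.5 * signs * np.asarray(size, dtype=float)
    rotation = euler_zxy_to_matrix(*euler_zxy)
    # Rotate each row vector, then translate to the box center.
    return local @ rotation.T + np.asarray(center, dtype=float)
```

With all angles set to zero the result reduces to an axis-aligned box, which is the form the SAM masks produce before the annotator adjusts the orientation.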

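Annotation
Semantic occupancy
• Captures the precise boundaries of semantic regions without considering object pose
• Each voxel is assigned, as its semantic label, the category of the majority of the points inside the cell
• 40×40×16 occupancy map
• Horizontal plane (XY): [-3.2 m, 3.2 m]; vertical axis (Z): [-0.78 m, 1.78 m]
Language descriptions
• Language prompts for 3D visual grounding
• Generated from the 3D bounding box annotations
• Following Sr3D [Achlioptas+ ’20], spatial relations are described taking object position and orientation into account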
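The majority-vote voxel labeling described under semantic occupancy can be sketched as follows. The grid bounds and resolution come from the slide (6.4 m / 40 = 0.16 m per voxel horizontally, 2.56 m / 16 = 0.16 m vertically). The function name, the empty-voxel label -1, and tie-breaking by lowest class index are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

GRID_SHAPE = (40, 40, 16)
LOWER = np.array([-3.2, -3.2, -0.78])
UPPER = np.array([3.2, 3.2, 1.78])
VOXEL_SIZE = (UPPER - LOWER) / np.array(GRID_SHAPE)  # 0.16 m on every axis


def majority_vote_occupancy(points, labels, num_classes, empty_label=-1):
    """Label each voxel with the most frequent category among its points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Voxel index of every point; points outside the grid are discarded.
    idx = np.floor((points - LOWER) / VOXEL_SIZE).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(GRID_SHAPE)), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Per-voxel histogram over categories.
    flat = np.ravel_multi_index(idx.T, GRID_SHAPE)
    counts = np.zeros((np.prod(GRID_SHAPE), num_classes), dtype=int)
    np.add.at(counts, (flat, labels), 1)

    occupancy = counts.argmax(axis=1)
    occupancy[counts.sum(axis=1) == 0] = empty_label  # voxels with no points
    return occupancy.reshape(GRID_SHAPE)
```

Because the label depends only on which points fall into a cell, the result is independent of object pose, which matches the motivation given on the slide.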

Statistics
Vocabulary construction
• Annotators describe semantic categories with an open vocabulary
• Similar categories are clustered using text embeddings and matched against WordNet
• Finally corrected and merged by hand
Instances
• Covers more than 760 categories
• More than 288 categories have 10 or more instances ← 20× prior work
• About 400 categories have 5 or more instances ← 3× ScanNet
• 3D object detection benchmark
  • {wall, ceiling, floor, object} are removed
  • The remaining categories are split into {head, common, tail}
Figure 3. EmbodiedScan annotation and statistics. (a) SAM-assisted oriented 3D bounding box annotation. (b) 3D boxes and language prompt statistics. (c) Instance statistics (increase w.r.t. ScanNet). (d) Occupancy statistics.

Statistics
Semantic occupancy
• Clarifies the categories relevant to navigation and action planning
• 80 categories are selected based on their distribution and importance in downstream tasks
Language prompts
• Object-object spatial relations are classified into 5 types
  • Horizontal Proximity
  • Vertical Proximity
  • Support
  • Allocentric
  • Between
• Valid when a scene contains 2 to 6 instances of a given category
• About 10× the size of Sr3D [Achlioptas+ ’20]

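Embodied Perceptron
A baseline that extracts multi-modal representations from multi-modal inputs (RGB images, point clouds obtained from depth maps, and language prompts) and solves downstream tasks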

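Multi-modal 3D encoder
Encoders for each modality
• 2D images: ResNet50 and FPN
• Point clouds: Minkowski ResNet34
• Text: BERT
Scalability with respect to input views
• Supports an arbitrary number of RGB-D inputs (order-agnostic)
• Depth maps are aggregated by transforming the point clouds into the global coordinate system and downsampling
• For each 3D point, the corresponding 2D features are queried across multiple images and averaged
• A small number of views (20) is used for training; more views (50) can be used at inference to boost performance
Multi-level, multi-modality fusion
• Adopts a simple fusion scheme on a predefined grid
• Multi-level features are extracted from 2D images and point clouds and fused at each level
Vision-language fusion
• Multi-level visual features and text features are fused with a Transformer
• Visual features are refined with self-attention blocks, then interact with text through cross-modal attention blocks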
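The step of querying 2D features for each 3D point across several images and averaging them can be sketched as below. This is a simplified illustration under stated assumptions: a pinhole camera with intrinsics `K`, world-to-camera 4×4 poses, nearest-pixel lookup, and no occlusion check. The function and argument names are hypothetical and not taken from the EmbodiedScan code.

```python
import numpy as np


def average_multiview_features(points, feature_maps, intrinsics, world_to_cams):
    """Average per-point 2D features over all views in which the point projects.

    points:        (N, 3) points in the global coordinate system
    feature_maps:  list of (H, W, C) arrays, one per view
    intrinsics:    list of (3, 3) pinhole matrices
    world_to_cams: list of (4, 4) world-to-camera transforms
    Returns (N, C) averaged features and an (N,) mask of points seen at least once.
    """
    points = np.asarray(points, dtype=float)
    num_points = points.shape[0]
    channels = feature_maps[0].shape[-1]
    summed = np.zeros((num_points, channels))
    hits = np.zeros(num_points)
    homogeneous = np.concatenate([points, np.ones((num_points, 1))], axis=1)

    for feats, k, w2c in zip(feature_maps, intrinsics, world_to_cams):
        cam = (homogeneous @ np.asarray(w2c).T)[:, :3]
        depth = cam[:, 2]
        in_front = depth > 1e-6
        safe_depth = np.where(in_front, depth, 1.0)  # avoid dividing by <= 0
        pixels = cam @ np.asarray(k).T
        u = np.round(pixels[:, 0] / safe_depth).astype(int)
        v = np.round(pixels[:, 1] / safe_depth).astype(int)
        height, width = feats.shape[:2]
        valid = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        summed[valid] += feats[v[valid], u[valid]]
        hits[valid] += 1

    seen = hits > 0
    summed[seen] /= hits[seen, None]
    return summed, seen
```

Because the result is a plain average over whichever views see a point, it does not depend on the number or order of input views, which is what allows training with 20 views and testing with 50.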

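Decoder
3D box prediction
• Features are upsampled as in FCAF3D [Rukhovich+ ’22]
• Classification, regression, and centerness prediction heads are added
• Trained with a combination of classification loss, centerness loss, and disentangled Chamfer distance loss (using the 8 corners)
Semantic occupancy prediction
• Features are fed to a 3D FPN, and occupancy is predicted at each scale
• Trained with losses whose weights decay by half from high to low resolution
• Uses cross-entropy loss and scene-class affinity loss [Wei+ ’23]
3D visual grounding
• Uses a head with the same architecture as 3D detection
• Trained using the outputs of all prediction heads at every layer
• In addition to the same losses as 3D detection, a contrastive loss pulls text features toward their corresponding visual features and pushes them away from other visual or text features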
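To illustrate why a corner-based Chamfer distance suits oriented boxes, here is a minimal sketch of a plain symmetric Chamfer distance between two sets of eight corners. It is not the disentangled variant used for training, and the function name is hypothetical. The point it shows is that the distance does not depend on corner ordering, so two parameterizations of the same physical box (for example, Euler angles that differ by a symmetry of the box) incur zero loss.

```python
import numpy as np


def corner_chamfer_distance(corners_a, corners_b):
    """Symmetric Chamfer distance between two (8, 3) corner sets."""
    a = np.asarray(corners_a, dtype=float)
    b = np.asarray(corners_b, dtype=float)
    # pairwise[i, j] = Euclidean distance between a[i] and b[j]
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbor distance in both directions.
    return pairwise.min(axis=1).mean() + pairwise.min(axis=0).mean()
```

For a unit cube compared with a copy shifted by one edge length, half of the corners coincide with a corner of the other cube and half are one unit away, so each direction contributes 0.5.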

Demo – In-the-Wild Test
From https://tai-wang.github.io/embodiedscan/ (cropped portion of the screen)

Benchmark – 3D object detection
Dataset and evaluation
• Targets continuous and multi-view RGB-D sequences
• Large-vocabulary setting covering more than 760 categories
• Train/val/test: 3,930/703/552 scans
• Evaluated with 3D IoU-based average precision (AP)
Results
• Both RGB and depth matter; RGB-D approaches perform best
• Predicting oriented 3D boxes is harder than predicting axis-aligned 3D boxes → swapping in the proposed decoder improves accuracy
• The proposed method outperforms the other methods and is suitable as a baseline
Figure 8. Qualitative results of different tasks on EmbodiedScan.
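As a simplified illustration of the 3D IoU that underlies the AP metric, the sketch below computes the IoU of two axis-aligned boxes given as (center, size). This is an assumption-laden simplification: the benchmark itself evaluates oriented boxes, whose intersection volume requires polyhedron clipping and is not shown here. The function name is hypothetical.

```python
import numpy as np


def axis_aligned_iou_3d(center_a, size_a, center_b, size_b):
    """IoU of two axis-aligned 3D boxes, each given by center and size."""
    center_a, size_a = np.asarray(center_a, float), np.asarray(size_a, float)
    center_b, size_b = np.asarray(center_b, float), np.asarray(size_b, float)
    min_a, max_a = center_a - size_a / 2, center_a + size_a / 2
    min_b, max_b = center_b - size_b / 2, center_b + size_b / 2

    # Overlap length per axis, clipped at zero when the boxes are disjoint.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    intersection = overlap.prod()
    union = size_a.prod() + size_b.prod() - intersection
    return float(intersection / union) if union > 0 else 0.0
```

A prediction counts as a true positive for AP when its IoU with a ground-truth box of the same category exceeds the chosen threshold.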

Benchmark – semantic occupancy
Dataset and evaluation
• Inputs and data splits are the same as for 3D object detection
• Evaluated with mIoU over 80 common categories
Results
• A notable gap appears between depth-only and RGB-D approaches
• Categories whose shape resembles a wall, such as doors and curtains, are hard to distinguish
• Depth contributes strongly to predicting empty space, floors, and walls
• The camera-only proposed method is slightly better than prior methods → per-modality accuracy trends are similar for continuous and multi-view inputs
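The mIoU metric mentioned above can be sketched as follows for voxel label grids. Skipping classes that appear in neither the prediction nor the ground truth is an assumption made here; benchmark implementations may handle absent classes differently. The function name is hypothetical.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for integer label arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for cls in range(num_classes):
        pred_mask, target_mask = pred == cls, target == cls
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_mask, target_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")
```

For example, with predictions [0, 0, 1, 1] and ground truth [0, 1, 1, 1], class 0 has IoU 1/2 and class 1 has IoU 2/3, giving an mIoU of 7/12.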

Benchmark – monocular 3D object detection
Dataset and evaluation
• Targets objects visible in the image
• Train/val/test: 689k/115k/86k
• Evaluated with average precision (AP)
Results
• 3D information is hard to estimate because stereo geometric information is missing → large gap between AP and AR
• Outperforms the other methods and is suitable as a baseline
• Designing the decoder to account for 3D box rotation is important

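Benchmark – 3D visual grounding
Dataset and evaluation
• Grounding with multi-view RGB-D inputs (order-agnostic)
• Data splits follow the originals
• Ground-truth boxes are not provided as candidates during evaluation, so the validation reflects a practical setting
Results
• Outperforms the other methods
• Lower performance than on previous benchmarks because of the changes in inputs and descriptions → effect of more categories, smaller objects, and more complex input prompts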

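Analysis
Axis-aligned vs. oriented boxes
• Accuracy drops markedly when rotation is included in the prediction targets
• Rotation estimation is the harder problem
Reconstructed point clouds vs. multi-view RGB-D
• Reconstructed point clouds give higher accuracy than multi-view RGB-D
• Incorporating reconstruction into the perception loop may yield better results
Real images vs. rendered images
• Real and rendered images differ visibly
• A model trained on rendered images degrades markedly when tested on real images
Benefits of training on EmbodiedScan
• Significant improvement on both the ScanNet and EmbodiedScan validation sets
• The improvement is especially pronounced for head categories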

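Summary so far – EmbodiedScan
• A large-scale benchmark dataset with diverse modalities, aimed at 3D scene understanding from an egocentric view
  • More than 5,000 scans, about 1 million RGB-D images, more than 760 categories
  • More than 160k 3D bounding boxes, semantic occupancy with 80 common categories, and 1 million language descriptions focused on spatial relations between objects
• Proposes a baseline framework with a unified multi-modal encoder that handles any number of input views and task-specific decoders
• Models trained on EmbodiedScan improve substantially on the ScanNet and EmbodiedScan validation sets
• Analysis of the dataset points to new challenges and opportunities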

MMScan
Previous multi-modal 3D perception: limited to understanding object properties and spatial relations between objects
Proposes MMScan, a large-scale multi-modal 3D scene dataset with hierarchical language annotations
Authors: Ruiyuan Lyu*, Tai Wang*, Jingli Lin*, Shuai Yang*, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang† (Shanghai AI Laboratory, Tsinghua University, Shanghai Jiao Tong University, Zhejiang University, The Chinese University of Hong Kong, The University of Hong Kong; * equal contribution, † corresponding author)

Comparison with existing datasets

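Meta-annotation
Top-down logic produces holistic, hierarchical annotations of a scene
Top-down logic
• A scene is divided into regions and objects, and the whole scene is annotated (Note: objects will be further divided into parts in the future)
• Information is captured at multiple levels of granularity
• At each level, humans or VLMs describe the key properties and the relations between targets, focusing on spatial and attribute understanding
• Yields scene captions that carry the hierarchical structure of each level together with an overall description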

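Object-level language annotation
Generation process
• Descriptions are attached based on the EmbodiedScan bounding boxes
  • Spatial: shape, pose
  • Attribute: category, appearance, material, state, function
• A VLM generates an initial description from the image of the best view, and humans refine it
  • Best view: after assessing image sharpness, the view in which the object center lies within the central 25% region of the image and the visibility of the object's surface points is highest
• CogVLM [Wang+ ’23] and InternVL-Chat [Chen+ ’24] are used as the VLMs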
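The best-view criterion can be sketched as a filter followed by an arg-max. Several details here are assumptions: the "central 25% region" is interpreted as a centered rectangle covering 25% of the image area (half the width and half the height), the sharpness check is reduced to a precomputed boolean, and the `CandidateView` fields are hypothetical names rather than anything from the MMScan pipeline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class CandidateView:
    name: str
    is_sharp: bool                   # outcome of a separate sharpness check
    center_uv: Tuple[float, float]   # projected object center in pixels
    image_size: Tuple[int, int]      # (width, height)
    visible_points: int              # visible object surface points


def center_is_central(view: CandidateView, area_fraction: float = 0.25) -> bool:
    """True if the object center falls in a centered rectangle of the given area."""
    side = area_fraction ** 0.5  # 0.25 of the area -> half of each dimension
    width, height = view.image_size
    u, v = view.center_uv
    return (abs(u - width / 2) <= side * width / 2
            and abs(v - height / 2) <= side * height / 2)


def select_best_view(views: Sequence[CandidateView]) -> Optional[CandidateView]:
    """Among sharp, well-centered views, pick the one seeing the most surface points."""
    eligible = [v for v in views if v.is_sharp and center_is_central(v)]
    return max(eligible, key=lambda v: v.visible_points, default=None)
```

Returning `None` when no view qualifies leaves it to the caller to decide whether to relax the criteria or skip the object.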

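Region annotation
Region segmentation annotation
• Each region of a scene is assigned a predefined category, e.g., living, study, resting, dining, cooking, bathing, storage, toilet, corridor, ...
• Annotators assign categories by drawing 2D polygons on the bird's-eye view (BEV)
• Views of the related objects can be accessed from the BEV to ensure annotation accuracy
• 7,692 valid regions obtained (excluding "other" and "open space")
Region-level language annotation
• The object-level language annotation pipeline is adapted to the region level
• The VLMs and the view selection criteria are the same as for object-level annotation
• A greedy algorithm selects a minimal set of views that covers all objects
• Descriptions are attached for each region's intrinsic properties and the relations between entities
  • Intrinsic properties: location, function, spatial layout, dimensions, architectural elements, decoration, etc.
  • Relations between entities
    • Spatial relations (based on Sr3D [Achlioptas+ ’20]) and object-object relations
    • Object-region relations
    • Advanced QA about the scene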
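The greedy view selection mentioned above is an instance of greedy set cover, sketched below. The input format (a mapping from view id to the set of object ids visible in that view) and the function name are assumptions for illustration. Note that greedy set cover yields a small covering set but does not guarantee the true minimum.

```python
def greedy_view_cover(visible_objects):
    """Pick views one at a time, each covering the most still-uncovered objects.

    visible_objects: dict mapping view id -> set of object ids visible in it.
    Returns the list of chosen view ids in selection order.
    """
    uncovered = set().union(*visible_objects.values())
    chosen = []
    while uncovered:
        # View that adds the most new objects (first such view on ties).
        best = max(visible_objects, key=lambda v: len(visible_objects[v] & uncovered))
        chosen.append(best)
        uncovered -= visible_objects[best]
    return chosen
```

Usage: with views `{"v1": {1, 2, 3}, "v2": {3, 4}, "v3": {4, 5}, "v4": {5}}`, the first pick is `v1` (three new objects) and the second is `v3` (two new objects), after which every object is covered.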

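Region annotation
Ensuring consistency across multiple views
• 3D bounding boxes with unique IDs are overlaid on the images as visual cues
• The images are fed to GPT-4 to produce descriptions
• Instructing it to write with templates makes the output usable for training grounding models and evaluating 3D LLMs
• Among existing VLMs, only GPT-4 can generate reliable descriptions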

Post-processing
Post-processing for benchmarks
• Meta-annotations cover the objects, regions, and inter-entity relations of each scene
• Visual grounding and QA require responses to specific targets or questions → samples are divided into two classes: single-target and inter-target
Figure 3: Post-processed annotations for benchmarks. “O” and “R” mean “objects” and “regions”. Apart from samples shown in the figure, there is a minor part of QA samples for advanced understanding and reasoning, such as situated QA related to everyday life, accounting for 2.18%.

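Post-processing – single-target
Single-target
Converts the description of each entity into questions
• Asks directly about the spatial features or attributes of a single object or region
• QA benchmark questions include ones about the existence or number of specific targets
• A spatial feature or attribute may be unique to one entity in the scene or shared by several entities
Procedure
• ChatGPT extracts the description of each spatial feature and attribute and groups it into coarse categories, e.g., placement → {standing upright, piled up, leaning, lying flat, hanging}
• Data samples are generated by combining the original detailed object-level descriptions with human annotations of the distinctive aspects of objects in each region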

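Post-processing – inter-target
Inter-target
• Divided by spatial and attribute relations between objects and regions
• Covers object-object and object-region relations
• Region-region relations are excluded
• Object-object spatial relations are already generated in EmbodiedScan
Object-object attribute relations and object-region relations
• Obtained from the region-level meta-annotations
• Initial samples are created and refined with templates
• ChatGPT is used to expand the dataset
• The object-object subclass takes two forms: asking directly about the relation between two objects, or identifying an object through its relation to another object and then asking about its properties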

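Post-processing – training
Generating grounded scene captions
• Previous multi-modal 3D datasets typically generate overall descriptions of objects or scenes without hierarchical grounding
• MMScan can integrate object-level and region-level descriptions
• Correspondences between 3D features and text phrases improve the ability to understand complex sentences and to localize objects at the phrase level
Instruction tuning with general captions
• MMScan's meta-annotations contain descriptions at various granularities and from various viewpoints
• Object-level, region-level, and scene-level descriptions are used for instruction tuning of 3D-LLMs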

Statistics
Dataset composition
• 6.9M language annotations
• 114M tokens
Meta-annotations
• 1.4M meta-annotations
  • 1.05M property descriptions
  • 380k complete descriptions
• 18.3M tokens in total
Data samples
• 1.76M QA (4.06M descriptions)
• 1.28M visual grounding
• 97k grounded scene captions (90 tokens per caption)
Figure 18: Statistics. (a) EmbodiedScan word cloud. (b) MMScan word cloud. (c) Regions’ category distribution. (d) Objects’ category distribution. (a)(b) Comparing the word clouds of EmbodiedScan and MMScan, we can observe the significant diversity improvement in the language annotations, from focusing on inter-object spatial relationships only to holistic understanding. (c)(d) The distributions of region and object annotations.

3D Visual Grounding
Dataset and evaluation metrics
• Train/val/test: 848k/217k/209k
• Data samples are categorized into {ST-attr, ST-space, OO-attr, OO-space, OR*}
• Evaluated with average precision (AP)
• Images are not used, for a fair comparison
Results
• Lower accuracy than on previous grounding benchmarks (just over 48% on ScanRefer) → attributed to complex prompts, 9-DoF box estimation, and an unspecified number of targets
• Single-target performance is generally lower than inter-target performance → suggests that the models can understand relations between targets
* OR = Object-Region
Figure 10: Visual grounding qualitative results, covering single and inter-target, spatial and attribute understanding.

3D Question Answering
Dataset and evaluation metrics
• Train/val/test: 1.1M/297k/295k
• Answer accuracy is evaluated by humans and by LLMs
• Data-driven metrics and traditional metrics
Results
• Fine-tuning on MMScan is effective
• Up to 25.6 points of accuracy improvement (LEO)
• Zero-shot performance is not as good as expected
• Accuracy is higher for inter-target than for single-target samples
Figure 11: Question answering qualitative results, covering existential, attribute understanding, and advanced queries.
Human evaluation rubric (from the paper)
• Hallucination: 0 = clear hallucination, 1 = no hallucination
• Completeness: 0 = completely incorrect, 1 = partial coverage, 2 = complete coverage
Table 11: Human evaluation results for the 3D question answering benchmark.
Setting | Model | Hallucination | Completeness | Overall
Zero-shot | 3D-LLM | 33.1 | 25.7 | 29.4
Zero-shot | Chat3d-v2 | 26.7 | 33.1 | 29.7
Zero-shot | LL3DA | 27.2 | 21.3 | 24.3
Zero-shot | LEO | 32.7 | 26.3 | 29.5
Fine-tuned | LL3DA | 67.0 | 63.7 | 65.3
Fine-tuned | LEO | 71.7 | 70.2 | 70.9

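Analysis
Grounded scene captions
• Scene captions that contain object and region IDs can be created to train grounding models effectively
• Using MMScan data improves benchmark performance substantially (up to 7.17% AP)
• Co-training is slightly better than pre-training → human involvement improves data quality, but the effect is smaller than expected
Captions for instruction tuning
• The main obstacle to stronger 3D-LLM performance today is the lack of high-quality multi-modal 3D datasets
• Using MMScan captions for instruction tuning brings large improvements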

Analysis
Scaling law for multi-modal 3D learning
• The EmbodiedScan baseline and LL3DA are used to study how model performance changes with the amount of data
• On both tasks, performance improves substantially as the amount of data grows
  • From 8.7% to 20.6% AP on the VG benchmark and from 15.84% to 44.81% accuracy on the QA benchmark
• Visual grounding performance rises steadily with more data
• QA performance rises sharply at first and then gradually
  • The paper reads this as the VG task still needing more data while the generated QA samples approach saturation
Figure 4: The performance of both tasks grows steadily with the increase of training data.

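Conclusion
• A large-scale dataset containing 1.4M meta-annotations
• Hierarchical annotations that decompose a scene into region and object levels to capture spatial and attribute understanding holistically
• Human annotators correct VLM-initialized annotations, yielding natural, accurate, and comprehensive annotations
• Provides benchmarks for visual grounding and QA
• Limitations of MMScan
  • Building a large-scale dataset still requires human annotators → cost and time, and annotators' subjective judgments affect the dataset
  • May not fully capture real-world diversity
  • Restrictions due to dataset licenses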

References
[Wang+ ’24] T. Wang et al.: EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI, In CVPR, 2024.
[Lyu+ ’24] R. Lyu et al.: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations, In NeurIPS, 2024.
[Liu+ ’24] Y. Liu et al.: Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, arXiv preprint arXiv:2407.06886, 2024.
[Brazil+ ’23] G. Brazil et al.: Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild, In CVPR, 2023.
[Dai+ ’17] A. Dai et al.: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, In CVPR, 2017.
[Cao+ ’22] A-Q. Cao & R. de Charette: MonoScene: Monocular 3D Semantic Scene Completion, In CVPR, 2022.
[Wand+ ’19] J. Wald et al.: RIO: 3D Object Instance Re-Localization in Changing Indoor Environments, In ICCV, 2019.
[Ding+ ’23] R. Ding et al.: PLA: Language-Driven Open-Vocabulary 3D Scene Understanding, In CVPR, 2023.
[Chen+ ’20] D. Z. Chen et al.: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language, In ECCV, 2020.
[Achlioptas+ ’20] P. Achlioptas et al.: ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes, In ECCV, 2020.
[Rukhovich+ ’22] D. Rukhovich et al.: FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection, In ECCV, 2022.
[Wang+ ’23] W. Wang et al.: CogVLM: Visual Expert for Pretrained Language Models, arXiv preprint arXiv:2311.03079, 2023.
[Chen+ ’24] Z. Chen et al.: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites, arXiv preprint arXiv:2404.16821, 2024.


Spatial AI Network
