[RSJ24] Open-Vocabulary Mobile Manipulation Based on Dual Relaxed Contrastive Learning with Dense Labeling

Daichi Yashima, Ryosuke Korekata, Komei Sugiura (Keio University)
A real-world search engine based on a Multimodal LLM and a dual relaxed loss

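Background: multimodal language understanding for domestic service robots
▪ Domestic service robots
▪ Expected to help resolve labor shortages in a society with a declining birthrate and an aging population
▪ Mobile manipulation instructed in natural language → greater convenience
▪ Example instruction: "Could you bring me the remote control next to the towel?" (an open-vocabulary instruction containing a referring expression)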

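Problem setting: fetch-and-carry task based on multimodal retrieval with open-vocabulary instructions
▪ Input: an instruction for carrying a target object to a receptacle (placement destination)
▪ Output: the candidate images, ranked separately for the target object and for the receptacle
▪ Assumption: images of the environment have been collected in advance during pre-exploration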
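The input/output relationship above can be sketched as a ranking over pre-collected image embeddings. This is only a minimal illustration, assuming the instruction phrase and the candidate images have already been embedded into a shared space; the function name `rank_images` and the toy vectors are hypothetical and do not come from the slides.

```python
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Return candidate image indices sorted by descending cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = v @ t  # cosine similarity of each image to the phrase
    return np.argsort(-scores, kind="stable")

# Toy example (hypothetical embeddings): 3 candidate images, 2-D features.
target_phrase_emb = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
ranking = rank_images(target_phrase_emb, candidates)
print(ranking.tolist())
```

In the task described here, such a ranking would be produced twice per instruction: once for the target object and once for the receptacle.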

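Related work: handling similar objects is difficult in multimodal retrieval settings
Example instruction: "Go get the picture hanging on the wall"
▪ MultiRankIt [Kaneda+, RA-L24]: retrieves the top-20 target objects from open-vocabulary instructions, but does not account for similar objects (Unlabeled Positives)
▪ NLMap [Chen+, ICRA23]: performs object manipulation tasks based on images collected during pre-exploration, but focuses only on the top-1 target object and receptacle among the candidates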

▪ InfoNCE [Oord+, 18] treats every image other than the ground-truth image as a Negative during training
▪ Annotating every Unlabeled Positive is impractical. Example: with 6,000 instructions and 7,000 images, manual annotation would take about 188,000 hours
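As a rough sanity check on the annotation estimate: the slide gives only the totals, so the per-pair time below is derived under the assumption that every instruction-image pair would be checked once by hand.

```python
num_instructions = 6_000
num_images = 7_000
total_hours = 188_000  # figure quoted on the slide

pairs = num_instructions * num_images          # instruction-image pairs to check
seconds_per_pair = total_hours * 3600 / pairs  # implied time per pair (assumption)
print(f"{pairs:,} pairs, about {seconds_per_pair:.1f} s per pair")
# prints "42,000,000 pairs, about 16.1 s per pair"
```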

Proposed method: RelaX-Former
▪ Novelty: a Multimodal LLM (MLLM) assigns Unlabeled Positive labels to similar images, enabling contrastive learning that is robust to similar images

① Spatial Overlay Grounding module: extracts detailed visual features for each region by having the MLLM describe SoM (Set-of-Mark) images

② X-Fusion module: uses multiple embedding representations in parallel to exploit their complementary strengths

③ Dense Representation Learning module: relaxes the contrast between Unlabeled Positive and Negative pairs

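Dense Representation Learning module: relaxes the contrast between Unlabeled Positive and Negative pairs
▪ Estimates Unlabeled Positives using an MLLM
▪ Uses a loss function that relaxes the contrast by taking Unlabeled Positives into account
(Figure labels: language features, visual features)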

Dense Labeler
▪ Instruction: "Pick up the wine bottle on the table and place it on the wooden shelf."
▪ Question posed to the MLLM: "Is there a {wine bottle on the table} in this image?"
▪ The 'True' / 'False' answers determine the index set of Unlabeled Positive pairs
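The Dense Labeler step can be sketched as a loop that asks a yes/no question for each phrase-image pair. This is a hedged illustration, not the authors' implementation: `build_question`, `dense_label`, and the `ask_mllm` callable are hypothetical names, the MLLM is replaced by a stub, and the phrase is assumed to have been extracted from the instruction beforehand.

```python
from typing import Callable, Sequence, Set, Tuple

def build_question(phrase: str) -> str:
    # Question template as shown on the slide, with the phrase filled in.
    return f"Is there a {phrase} in this image?"

def dense_label(
    phrases: Sequence[str],
    images: Sequence[object],
    labeled_positives: Set[Tuple[int, int]],
    ask_mllm: Callable[[object, str], str],  # hypothetical MLLM interface
) -> Set[Tuple[int, int]]:
    """Collect (phrase index, image index) pairs the MLLM answers 'True' for."""
    unlabeled_positives: Set[Tuple[int, int]] = set()
    for i, phrase in enumerate(phrases):
        question = build_question(phrase)
        for j, image in enumerate(images):
            if (i, j) in labeled_positives:
                continue  # already annotated as Positive
            if ask_mllm(image, question).strip().lower() == "true":
                unlabeled_positives.add((i, j))
    return unlabeled_positives

# Stub standing in for a real MLLM call: "images" are plain captions here.
def fake_mllm(image: object, question: str) -> str:
    return "True" if "wine bottle" in str(image) else "False"

images = ["wine bottle on a table", "empty sofa", "wine bottle on a counter"]
result = dense_label(["wine bottle on the table"], images, {(0, 0)}, fake_mllm)
print(result)
```

With the stub above, only the pair for the third image ends up in the set, since the first image is already a labeled Positive.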

Dual Relaxed Contrastive (DRC) loss
▪ One term ignores the region where the similarity is at or above a threshold
▪ The other term ignores the region where the similarity is negative
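The slide's symbols did not survive extraction, so the following is only one possible reading of the two bullets, using hinge-style penalties as a stand-in. It assumes the threshold term applies to Unlabeled Positive pairs and the negative-similarity term applies to Negative pairs; the function name `relaxed_terms`, the threshold `alpha`, and the hinge form are all assumptions and may differ from the actual DRC loss.

```python
import numpy as np

def relaxed_terms(sim, unlabeled_pos_mask, negative_mask, alpha=0.5):
    """sim: (N, M) similarity matrix; masks: boolean arrays of the same shape."""
    # Assumed reading 1: Unlabeled Positive pairs are penalized only while
    # their similarity is below alpha; at or above alpha they are ignored.
    up_penalty = np.maximum(0.0, alpha - sim)[unlabeled_pos_mask]
    # Assumed reading 2: Negative pairs are penalized only while their
    # similarity is positive; negative similarity is ignored.
    neg_penalty = np.maximum(0.0, sim)[negative_mask]
    up_loss = float(up_penalty.mean()) if up_penalty.size else 0.0
    neg_loss = float(neg_penalty.mean()) if neg_penalty.size else 0.0
    return up_loss, neg_loss

# One instruction, four images: Positive, Unlabeled Positive, two Negatives.
sim = np.array([[0.9, 0.7, -0.3, 0.2]])
unlabeled_pos = np.array([[False, True, False, False]])
negatives = np.array([[False, False, True, True]])
up_loss, neg_loss = relaxed_terms(sim, unlabeled_pos, negatives)
```

In this toy case the Unlabeled Positive (similarity 0.7) and the Negative with similarity -0.3 contribute nothing; only the Negative with similarity 0.2 is penalized.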

InfoNCE loss [Oord+, 18]: every pair other than the Positive is treated as a negative example → Unlabeled Positives cannot be taken into account
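To make the InfoNCE limitation concrete, here is a minimal NumPy sketch of the standard text-to-image InfoNCE loss. It assumes a square similarity matrix whose diagonal holds the labeled Positive pairs; the temperature value and the toy matrices are illustrative choices, not values from the slides.

```python
import numpy as np

def info_nce(sim: np.ndarray, temperature: float = 0.07) -> float:
    """sim: (N, N) text-image similarities; diagonal entries are the Positives."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Image 1 is visually similar to the Positive of instruction 0 (an Unlabeled
# Positive), so its similarity is high in sim_b but low in sim_a.
sim_a = np.array([[0.9, 0.1], [0.1, 0.9]])
sim_b = np.array([[0.9, 0.8], [0.1, 0.9]])
loss_a = info_nce(sim_a)
loss_b = info_nce(sim_b)
```

Because every off-diagonal entry sits in the softmax denominator, `loss_b` is larger than `loss_a`: the model is penalized for scoring the similar image highly, even though that image would also be an acceptable answer.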

Quantitative results: outperformed existing methods on standard evaluation metrics
▪ Evaluation metric: recall@K (R@K)↑ [%]
Method                           HM3D-FC R@10↑  HM3D-FC R@20↑  MP3D-FC R@10↑  MP3D-FC R@20↑
NLMap (reprod.) [Chen+, ICRA23]  27.9           53.2           27.1           63.8
MultiRankIt [Kaneda+, RA-L24]    48.3±3.4       73.3±2.6       51.7±8.9       72.7±3.3
DM2RM [Korekata+, 24]            67.1±2.4       87.0±1.1       64.1±3.6       78.5±0.5
Proposed method                  76.3±0.9       91.6±0.9       72.4±0.7       82.5±0.8

▪ Gain in R@10 over DM2RM [Korekata+, 24]: +9.2 points on HM3D-FC (76.3 vs. 67.1) and +8.3 points on MP3D-FC (72.4 vs. 64.1)

Qualitative results (success case) (1/2): images desirable as the target object are ranked high
Instruction: "Take the painting near the desk in the work room and put it on the big white sofa in the living room."
Legend: Positive / Unlabeled Positive / Negative

▪ Paintings near the desk are ranked at the top
☺ Captures the relationships between objects

Qualitative results (success case) (2/2): images desirable as the receptacle are ranked high
Instruction: "Pick up the dry flowers in the white vase next to the cross on the wall and put it in the shelf between two doors in the bedroom."
Legend: Positive / Unlabeled Positive / Negative

▪ The shelf between the two doors is ranked at the top
☺ Understands the complex expression "between two doors"

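Physical experiments: retrieve the target object and the receptacle from an open-vocabulary instruction, then grasp and place
▪ "Please carry the utensils on the tall table to the shelf next to the red mug." (video, 16x speed)
▪ "Pick up the long chips can and place it on the table with fruits." (video, 16x speed)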

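Quantitative results (physical robot): 80% task success rate under zero-shot transfer conditions
▪ The full sequence of retrieval, grasping, and placing was performed → grasping and placing were executed only when the ground-truth image was retrieved within the top 10
Method                           SR↑ [%] (successes/trials)
NLMap (reprod.) [Chen+, ICRA23]  70 (14/20)
MultiRankIt [Kaneda+, RA-L24]    50 (10/20)
DM2RM [Korekata+, 24]            75 (15/20)
Proposed method                  80 (16/20)
☺ Outperformed existing methods even under zero-shot transfer conditions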

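Summary
▪ Background
✓ Object manipulation by domestic service robots based on multimodal retrieval
▪ Proposal
✓ Unlabeled Positive labels assigned by the Dense Labeler
✓ Dual Relaxed Contrastive Loss, which relaxes the contrast between Unlabeled Positive and Negative pairs
▪ Results
✓ 80% task success rate under zero-shot transfer in physical experiments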

Appendix

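1/2. Spatial Overlay Grounding (SOG) module: detailed per-region feature extraction from region-segmented images
▪ Text produced by having the MLLM describe the image is used as an image feature
▪ Images overlaid with segmentation masks from SAM [Kirillov+, ICCV23] and SEEM [Zou+, NeurIPS23] are input in parallel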

▪ Placing a mark on each segmented object strengthens GPT-4V's image description ability
(Figure: image segmented with SEEM and annotated with marks)

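2/2. X-Fusion (XF) module: uses multiple embedding representations in parallel to exploit their complementary strengths
▪ Visual encoder (e.g., ViT [Dosovitskiy+, ICLR21]): object color, shape, and similar properties
▪ Multimodal encoder (e.g., CLIP [Radford+, ICML21]): embeddings in which language and vision are aligned
▪ MLLM (e.g., LLaVA [Liu+, NeurIPS23]): structured latent features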
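The idea of using several embeddings in parallel can be sketched as follows. The slides do not say how the features are combined, so concatenation followed by a linear projection is purely an assumption here; the encoder outputs are random stand-ins, and `fuse`, the dimensions, and the weight matrix are hypothetical.

```python
from typing import Sequence

import numpy as np

rng = np.random.default_rng(0)

def fuse(features: Sequence[np.ndarray], weight: np.ndarray) -> np.ndarray:
    """Concatenate per-encoder features and project them to a shared dimension."""
    concat = np.concatenate(list(features), axis=-1)
    return concat @ weight

# Stand-ins for encoder outputs on a batch of 4 images (dimensions are arbitrary).
vit_feat = rng.normal(size=(4, 8))    # visual encoder: color, shape
clip_feat = rng.normal(size=(4, 6))   # multimodal encoder: language-aligned
mllm_feat = rng.normal(size=(4, 10))  # MLLM: structured latent features
projection = rng.normal(size=(8 + 6 + 10, 5))  # would be learned in practice

fused = fuse([vit_feat, clip_feat, mllm_feat], projection)
print(fused.shape)
```

Each image ends up with a single fused vector that draws on all three encoders, which is the property the slide attributes to the module.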

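Experimental setup: a dataset of real images captured in standard indoor environments and instructions containing referring expressions
▪ Uses the LTRRIE-FC dataset [Korekata+, 24]
▪ Images: collected from HM3D [Ramakrishnan+, NeurIPS21] and MP3D [Chang+, 3DV17]
▪ Language: instructions containing referring expressions for carrying a target object to a receptacle
Environments   774
Images         7,148
Annotators     226
Instructions   6,581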

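Evaluation metric: standard in image retrieval settings
▪ Recall@K, defined from the set of top-K retrieved samples, the set of ground-truth samples, and the number of instructions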
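The slide's formula was lost in extraction, so the sketch below uses the common definition of Recall@K: for each instruction, the fraction of ground-truth samples that appear in the top-K, averaged over all instructions. Whether the slide normalizes in exactly this way is an assumption, and `recall_at_k` and the toy data are hypothetical.

```python
from typing import Sequence, Set

def recall_at_k(rankings: Sequence[Sequence[int]],
                ground_truths: Sequence[Set[int]],
                k: int) -> float:
    """Mean over instructions of |top-K ∩ ground truth| / |ground truth|, in percent."""
    total = 0.0
    for ranked, gt in zip(rankings, ground_truths):
        top_k = set(ranked[:k])
        total += len(top_k & gt) / len(gt)
    return 100.0 * total / len(rankings)

# Two instructions, four candidate images each (toy data).
rankings = [[3, 1, 2, 0], [0, 2, 1, 3]]
ground_truths = [{1}, {1, 3}]
r_at_2 = recall_at_k(rankings, ground_truths, k=2)
r_at_4 = recall_at_k(rankings, ground_truths, k=4)
```

Here the first instruction's ground truth is in its top 2 and the second's is not, so R@2 is 50%; at K=4 every ground-truth image is covered and R@4 is 100%.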

Quantitative results: outperformed existing methods on standard evaluation metrics [%]
Method                           HM3D-FC R@5↑  HM3D-FC R@10↑  HM3D-FC R@20↑  MP3D-FC R@5↑  MP3D-FC R@10↑  MP3D-FC R@20↑
NLMap (reprod.) [Chen+, ICRA23]  14.7          27.9           53.2           12.2          27.1           63.8
MultiRankIt [Kaneda+, RA-L24]    28.7±3.4      48.3±3.4       73.3±2.6       35.7±9.9      51.7±8.9       72.7±3.3
DM2RM [Korekata+, 24]            47.8±1.2      67.1±2.4       87.0±1.1       49.6±0.7      64.1±3.6       78.5±0.5
Proposed method                  55.4±0.5      76.3±0.9       91.6±0.9       57.0±1.1      72.4±0.7       82.5±0.8

Qualitative results (success case) (1/2): images desirable as the target object are ranked high
Instruction: "Pick up the dry flowers in the white vase next to the cross on the wall and put it in the shelf between two doors in the bedroom."
Legend: Positive / Unlabeled Positive / Negative

Qualitative results (success case) (2/2): images desirable as the receptacle are ranked high
Instruction: "Take the painting near the desk in the work room and put it on the big white sofa in the living room."
Legend: Positive / Unlabeled Positive / Negative

Qualitative results (failure case): Unlabeled Positives were ranked high for the instruction
Instruction: "Take the cushion on the couch to the shelf in the living room."
(Figure panels: target object, receptacle)

Ablation Study: every new module contributes to the performance gain
☺ The X-Fusion (XF) module is the most effective → integrating diverse visual features improves the ability to capture objects
[%]
Condition               HM3D-FC R@5  HM3D-FC R@10  HM3D-FC R@20  MP3D-FC R@5  MP3D-FC R@10  MP3D-FC R@20
Proposed method (full)  55.4±0.5     76.3±0.9      91.6±0.9      57.0±1.1     72.4±0.7      82.5±0.8
w/o SOG                 52.4±2.1     73.6±0.7      91.1±0.8      54.2±1.7     69.8±0.7      80.8±1.2
w/o XF                  51.9±1.4     71.6±1.4      86.7±1.9      53.2±1.3     69.5±1.3      79.8±1.3
w/o DRL                 52.5±1.4     73.5±1.1      91.4±0.7      55.4±0.7     69.1±1.3      80.9±1.3

Error analysis: ranking an Unlabeled Positive high was the most frequent failure
Failure cause                               Target object  Receptacle
Unrecognizable unlabeled positive error     7              6
Annotation error                            4              6
Phrase selection error                      5              3
Ambiguous instruction error                 2              5
Referring expression comprehension error    2              0
Total                                       20             20

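Experimental setup (physical robot): multimodal retrieval + mobile manipulation
▪ Environment: compliant with the standard environment of the WRS 2020 Partner Robot Challenge/Real Space
▪ Robot: Human Support Robot [Yamamoto+, ROBOMECH J.19]
▪ Objects: a subset of the YCB objects [Calli+, RAM15] (30 types in total)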

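Physical experiments: images of the environment are collected through pre-exploration
▪ A standard environment, everyday objects, and a standard robot are used
(Figure: observed images; video, 20x speed)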

Qualitative results (physical robot): desirable images are ranked high for both the target object and the receptacle
Instruction: "Pick up the pear on the table and place it on the table next to the mustard."

Robot: Human Support Robot [Yamamoto+, ROBOMECH J.19]
https://global.toyota/jp/download/8725215
▪ HSR: a domestic service robot made by Toyota Motor Corporation
▪ Uses the head-mounted Asus Xtion Pro camera
