[IROS23] Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, and Komei Sugiura
Keio University

Motivation: Supporting care recipients with robots that comprehend natural language instructions
▪ Fetch-and-carry tasks
▪ Domestic service robot (DSR)
▪ Expected solution to the scarcity of home caregivers
▪ Interaction through language: “Place the red cup to the kitchen.”

Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
▪ Identifying the target object and the destination from an instruction and images
▪ Carrying the target object to the destination
Example instruction: “Move the bottle on the left side of the plate to the empty chair.”

Challenge: Determining the maximum likelihood pair
▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]) have impractical computational complexity for inference: O(M × N)
  M: Number of target object candidates
  N: Number of destination candidates
▪ Assuming M = N = 100 and a single inference takes 4 × 10⁻³ seconds, the whole computation would take 40 seconds
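
The 40-second figure follows from the numbers on this slide: every (target object, destination) pair needs its own inference. A minimal sketch of that arithmetic is shown here; the function name `exhaustive_inference_time` is a hypothetical helper introduced only for illustration.

```python
def exhaustive_inference_time(m: int, n: int, seconds_per_inference: float) -> float:
    """Time to score every (target object, destination) pair: M x N inferences."""
    return m * n * seconds_per_inference


# Values taken from the slide: M = N = 100, 4e-3 seconds per inference.
total = exhaustive_inference_time(100, 100, 4e-3)
print(f"{total:.1f} s")
```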

Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)
1. Both target objects and destinations can be predicted individually by a single model
2. The computational complexity is not O(M × N) but O(M + N)
[Figure: two-step inference. Step 1: candidates 1, 2, ..., j, ..., M. Step 2: candidates 1, 2, ..., k, ..., N.]
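
The drop from O(M × N) to O(M + N) can be illustrated by scoring the two candidate sets separately instead of scoring every pair. The sketch below is not the SHeFU model: `select_pair_two_step`, the two scoring callables, and the toy scores are all hypothetical stand-ins, and it assumes the two steps can be scored independently of each other.

```python
from typing import Callable, Sequence, Tuple


def select_pair_two_step(
    targets: Sequence[str],
    destinations: Sequence[str],
    score_target: Callable[[str], float],
    score_destination: Callable[[str], float],
) -> Tuple[str, str, int]:
    """Choose a target, then a destination, using M + N scorer calls."""
    # Step 1: one scorer call per target object candidate (M calls).
    best_target = max(targets, key=score_target)
    # Step 2: one scorer call per destination candidate (N calls).
    best_destination = max(destinations, key=score_destination)
    num_calls = len(targets) + len(destinations)
    return best_target, best_destination, num_calls


# Toy scores standing in for model outputs (hypothetical values).
target_scores = {"bottle_left": 0.9, "bottle_right": 0.4, "plate": 0.1}
destination_scores = {"empty_chair": 0.8, "occupied_chair": 0.3}

target, destination, calls = select_pair_two_step(
    list(target_scores),
    list(destination_scores),
    target_scores.__getitem__,
    destination_scores.__getitem__,
)
print(target, destination, calls)
```

With 3 target candidates and 2 destination candidates, this makes 5 scorer calls, whereas scoring every pair would take 6. With the slide's M = N = 100, the same counting gives 200 inferences instead of 10,000.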

Experimental Settings: Simulation and physical experiments
1. Simulation experiments: ALFRED-fc dataset
   ▪ Based on ALFRED [Shridhar+, CVPR20]
2. Physical experiments
   ▪ Standardized DSR, objects [Calli+, RAM15], and environment
[Video, 8x speed: Collecting images of the environment]

Quantitative Results: Outperformed the baseline method
✓ Outperformed the baseline method in simulation and physical experiments
✓ Both the Switching Head and Tail mechanisms are effective

Language comprehension accuracy [%]
Method                              ALFRED-fc      Real
Extended TDU [Ishikawa+, RA-L21]    79.4 ± 2.76    52.0
Ours (W/o Switching Head)           78.4 ± 2.05    -
Ours (W/o Switching Tail)           76.9 ± 2.91    -
Ours (SHeFU)                        83.1 ± 2.00    55.9
Gain over Extended TDU              +3.7           +3.9

Qualitative Results: Successful case
Instruction: “Put the red chips can on the white table with the soccer ball on it.”
[Video, 4x speed. Legend: target object; destination; target object candidate or destination candidate]

Semantic Machine Intelligence Lab., Keio Univ.