[IROS23] Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Semantic Machine Intelligence Lab., Keio Univ.

September 20, 2023

Transcript

1. Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks
   Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, and Komei Sugiura
   Keio University
2. Motivation: Supporting care recipients with robots that comprehend natural language instructions
   Fetch-and-carry
   ▪ Domestic service robot (DSR)
   ▪ Expected solution to the shortage of home caregivers
   ▪ Interaction through language: “Place the red cup to the kitchen.”
3. Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
   ▪ Identifying target object and destination from instruction and images
   “Move the bottle on the left side of the plate to the empty chair.”
4. Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
   ▪ Identifying target object and destination from instruction and images
   + Carrying the target object to the destination
   “Move the bottle on the left side of the plate to the empty chair.”
5. Challenge: Determining the maximum likelihood pair
   ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑶(𝑴 × 𝑵)
6. Challenge: Determining the maximum likelihood pair
   ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑶(𝑴 × 𝑵)
   𝑀: Number of target object candidates
7. Challenge: Determining the maximum likelihood pair
   ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑶(𝑴 × 𝑵)
   𝑀: Number of target object candidates
   𝑁: Number of destination candidates
8. Challenge: Determining the maximum likelihood pair
   ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑶(𝑴 × 𝑵)
   ▪ Assuming 𝑀 = 𝑁 = 100 and a single inference takes 4 × 10⁻³ seconds, the whole computation would take 𝟒𝟎 seconds
   𝑀: Number of target object candidates
   𝑁: Number of destination candidates
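The 40-second figure on this slide follows directly from multiplying the number of candidate pairs by the per-inference time. The short Python sketch below reproduces that arithmetic; the function name `pairwise_inference_seconds` is hypothetical and introduced only for illustration, and the sketch assumes every pair costs the same single-inference time.

```python
def pairwise_inference_seconds(num_targets: int, num_destinations: int,
                               seconds_per_inference: float) -> float:
    """Total time when every (target, destination) pair needs one inference.

    Assumption: each pair costs the same fixed single-inference time.
    """
    num_pairs = num_targets * num_destinations  # O(M x N) pairs
    return num_pairs * seconds_per_inference


# Values taken from the slide: M = N = 100, 4e-3 seconds per inference.
M = 100
N = 100
SECONDS_PER_INFERENCE = 4e-3

total_seconds = pairwise_inference_seconds(M, N, SECONDS_PER_INFERENCE)
print(f"{M * N} pairwise inferences take about {total_seconds:.1f} seconds")
```

Because the cost grows with the product of the two candidate counts, doubling both 𝑀 and 𝑁 quadruples the total time under the same assumption.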
9. Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)
   1. Both target objects and destinations can be predicted individually by a single model
   2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)
10. Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)
   1. Both target objects and destinations can be predicted individually by a single model
   2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)
   [Figure: Step 1 and Step 2, over target object candidates 1, 2, ..., 𝑗, ..., 𝑀 and destination candidates 1, 2, ..., 𝑘, ..., 𝑁]
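To make the difference between 𝑂(𝑀 × 𝑁) and 𝑂(𝑀 + 𝑁) concrete, the following Python sketch counts scoring calls for two selection strategies. It is not the SHeFU model: the scoring functions are hypothetical stand-ins backed by toy score lists, the pairwise score is assumed to be the product of the two individual scores, and the transcript does not say whether Step 2 depends on the result of Step 1, so the two steps are treated as independent here.

```python
from typing import Callable, Tuple


def select_individually(num_targets: int, num_destinations: int,
                        score_target: Callable[[int], float],
                        score_destination: Callable[[int], float]) -> Tuple[int, int, int]:
    """Score each candidate on its own; returns (target, destination, calls)."""
    calls = 0
    best_target, best_target_score = -1, float("-inf")
    for j in range(num_targets):  # Step 1: M scoring calls
        calls += 1
        score = score_target(j)
        if score > best_target_score:
            best_target, best_target_score = j, score
    best_dest, best_dest_score = -1, float("-inf")
    for k in range(num_destinations):  # Step 2: N scoring calls
        calls += 1
        score = score_destination(k)
        if score > best_dest_score:
            best_dest, best_dest_score = k, score
    return best_target, best_dest, calls


def select_pairwise(num_targets: int, num_destinations: int,
                    score_pair: Callable[[int, int], float]) -> Tuple[int, int, int]:
    """Score every (target, destination) pair; returns (target, destination, calls)."""
    calls = 0
    best_pair, best_score = (-1, -1), float("-inf")
    for j in range(num_targets):
        for k in range(num_destinations):  # M x N scoring calls in total
            calls += 1
            score = score_pair(j, k)
            if score > best_score:
                best_pair, best_score = (j, k), score
    return best_pair[0], best_pair[1], calls


# Toy scores (made up for illustration only, not experimental data).
TARGET_SCORES = [0.1, 0.7, 0.2]
DESTINATION_SCORES = [0.3, 0.9]

individual = select_individually(
    len(TARGET_SCORES), len(DESTINATION_SCORES),
    lambda j: TARGET_SCORES[j], lambda k: DESTINATION_SCORES[k])
pairwise = select_pairwise(
    len(TARGET_SCORES), len(DESTINATION_SCORES),
    lambda j, k: TARGET_SCORES[j] * DESTINATION_SCORES[k])

print(individual)  # (1, 1, 5): M + N = 5 scoring calls
print(pairwise)    # (1, 1, 6): M x N = 6 scoring calls
```

With these toy scores both strategies pick the same pair, which holds only because the assumed pairwise score factorizes into the two individual scores. The call counts differ by one here, but with 𝑀 = 𝑁 = 100 the gap is 200 calls versus 10,000.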
11. Experimental Settings: Simulation and physical experiments
   Collecting images of the environment
   1. Simulation experiments: ALFRED-fc dataset
      ▪ Based on ALFRED [Shridhar+, CVPR20]
   2. Physical experiments
      ▪ Standardized DSR, objects [Calli+, RAM15], and environment
12. Quantitative Results: Outperformed the baseline method
   ✓ Outperformed the baseline method in simulation and physical experiments
   ✓ Both the Switching Head and Tail mechanisms are effective

   Language comprehension accuracy [%]
   Method                              ALFRED-fc      Real
   Extended TDU [Ishikawa+, RA-L21]    79.4 ± 2.76    52.0
   Ours (w/o Switching Head)           78.4 ± 2.05    -
   Ours (w/o Switching Tail)           76.9 ± 2.91    -
   Ours (SHeFU)                        83.1 ± 2.00    55.9
   Ours (SHeFU) vs. Extended TDU       +3.7           +3.9
13. Qualitative Results: Successful case
   “Put the red chips can on the white table with the soccer ball on it.”
   Legend: target object; destination; target object candidate or destination candidate