[IROS23] Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Slide 1

Slide 1 text

Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks
Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, and Komei Sugiura
Keio University

Slide 2

Slide 2 text

Motivation: Supporting care recipients with robots that comprehend natural language instructions
■ Domestic service robot (DSR)
■ Expected solution to the scarcity of home caregivers
■ Interaction through language
Fetch-and-carry demonstration (8x speed): “Place the red cup to the kitchen.”

Slide 3

Slide 3 text

Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
■ Identifying the target object and destination from an instruction and images
“Move the bottle on the left side of the plate to the empty chair.”

Slide 4

Slide 4 text

Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
■ Identifying the target object and destination from an instruction and images
+ Carrying the target object to the destination (2x speed)
“Move the bottle on the left side of the plate to the empty chair.”

Slide 5

Slide 5 text

Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
☹ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)

Slide 6

Slide 6 text

Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
☹ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
𝑀: Number of target object candidates

Slide 7

Slide 7 text

Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
☹ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
𝑀: Number of target object candidates
𝑁: Number of destination candidates

Slide 8

Slide 8 text

Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
☹ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
■ Assuming 𝑀 = 𝑁 = 100 and a single inference takes 4 × 10⁻³ seconds, scoring all 100 × 100 = 10⁴ pairs would take 𝟒𝟎 seconds
𝑀: Number of target object candidates
𝑁: Number of destination candidates
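The 40-second figure on this slide can be checked with a short calculation. The sketch below is only an illustration of the arithmetic; the function name and parameter names are hypothetical and do not come from the authors' code.

```python
def exhaustive_inference_seconds(
    num_targets: int,
    num_destinations: int,
    seconds_per_inference: float,
) -> float:
    """Time to score every (target, destination) pair once.

    Assumes one model call per pair, i.e. M x N calls in total,
    and a constant cost per call (an assumption for illustration).
    """
    return num_targets * num_destinations * seconds_per_inference


if __name__ == "__main__":
    # Values taken from the slide: M = N = 100, 4e-3 seconds per inference.
    total = exhaustive_inference_seconds(100, 100, 4e-3)
    print(f"{total:.1f} s")  # 40.0 s
```

Because the cost grows with the product of the two candidate counts, doubling both 𝑀 and 𝑁 quadruples the estimated time.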

Slide 9

Slide 9 text

Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)
1. Both target objects and destinations can be predicted individually by a single model
2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)

Slide 10

Slide 10 text

Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)
1. Both target objects and destinations can be predicted individually by a single model
2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)
[Figure: ☹ all pairs of target object candidates (1, 2, ..., 𝑀) and destination candidates (1, 2, ..., 𝑁); ☺ Step 1: target object 𝑗 among candidates 1, 2, ..., 𝑀; Step 2: destination 𝑘 among candidates 1, 2, ..., 𝑁]
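The two steps on this slide (Step 1 over the 𝑀 target object candidates, Step 2 over the 𝑁 destination candidates) can be sketched as two independent selections. This is a minimal illustration of why the number of model calls becomes 𝑀 + 𝑁, not the authors' implementation: `score_target` and `score_destination` are hypothetical stand-ins for a single model run in two modes, and picking the highest-scoring candidate is an assumption made for the example.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def argmax_with_count(
    candidates: Sequence[T], score: Callable[[T], float]
) -> Tuple[int, int]:
    """Return (index of best candidate, number of scoring calls made).

    Raises ValueError if `candidates` is empty.
    """
    scores = [score(c) for c in candidates]  # one call per candidate
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, len(scores)


def two_step_selection(
    targets: Sequence[T],
    destinations: Sequence[T],
    score_target: Callable[[T], float],        # hypothetical scorer
    score_destination: Callable[[T], float],   # hypothetical scorer
) -> Tuple[int, int, int]:
    """Select target j (Step 1) and destination k (Step 2) separately."""
    j, calls_step1 = argmax_with_count(targets, score_target)
    k, calls_step2 = argmax_with_count(destinations, score_destination)
    return j, k, calls_step1 + calls_step2  # M + N calls in total


if __name__ == "__main__":
    # Toy scores standing in for model outputs (made up for illustration).
    j, k, calls = two_step_selection(
        [0.1, 0.9, 0.3], [0.2, 0.4, 0.8, 0.1], lambda s: s, lambda s: s
    )
    print(j, k, calls)  # 1 2 7
```

With 𝑀 = 3 and 𝑁 = 4, the sketch makes 7 scoring calls, whereas scoring every pair would need 12.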

Slide 11

Slide 11 text

Experimental Settings: Simulation and physical experiments
1. Simulation experiments: ALFRED-fc dataset
■ Based on ALFRED [Shridhar+, CVPR20]
2. Physical experiments
■ Standardized DSR, objects [Calli+, RAM15], and environment
Collecting images of the environment (8x speed)

Slide 12

Slide 12 text

Quantitative Results: Outperformed the baseline method
✓ Outperformed the baseline method in simulation and physical experiments
✓ Both the Switching Head and Tail mechanisms are effective
Language comprehension accuracy [%]
Method                              ALFRED-fc      Real
Extended TDU [Ishikawa+, RA-L21]    79.4 ± 2.76    52.0
Ours (w/o Switching Head)           78.4 ± 2.05    -
Ours (w/o Switching Tail)           76.9 ± 2.91    -
Ours (SHeFU)                        83.1 ± 2.00    55.9
Improvement over Extended TDU       +3.7           +3.9

Slide 13

Slide 13 text

Qualitative Results: Successful case (4x speed)
“Put the red chips can on the white table with the soccer ball on it.”
Legend: target object; destination; target object candidate or destination candidate