# [IROS23] Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

## Semantic Machine Intelligence Lab., Keio Univ.

September 20, 2023

## Transcript

1. Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks
Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, and Komei Sugiura
Keio University

2. Motivation: Supporting care recipients by robots that comprehend natural language instructions
Fetch-and-carry
■ Domestic service robot (DSR)
■ Expected solution to the shortage of home caregivers
■ Interaction through language
“Place the red cup to the kitchen.”

3. Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
■ Identifying target object and destination from instruction and images
“Move the bottle on the left side of the plate to the empty chair.”

4. Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)
■ Identifying target object and destination from instruction and images
+ Carrying the target object to the destination
“Move the bottle on the left side of the plate to the empty chair.”

5. Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
→ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)

6. Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
→ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
𝑀: Number of target object candidates

7. Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
→ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
𝑀: Number of target object candidates
𝑁: Number of destination candidates

8. Challenge: Determining the maximum likelihood pair
■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
→ Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
■ Assuming 𝑀 = 𝑁 = 100 and a single inference takes 4 × 10⁻³ seconds, the whole computation would take 100 × 100 × 4 × 10⁻³ = 40 seconds
𝑀: Number of target object candidates
𝑁: Number of destination candidates
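
The 40-second estimate is the number of candidate pairs multiplied by the time for one inference. A minimal sketch of that arithmetic follows; the function name and its parameters are hypothetical and only restate the numbers given on the slide.

```python
def exhaustive_pair_inference_time(
    num_targets: int,
    num_destinations: int,
    seconds_per_inference: float,
) -> float:
    """Total time when every (target, destination) pair is scored once."""
    num_pairs = num_targets * num_destinations  # M x N pairs
    return num_pairs * seconds_per_inference


# Values from the slide: M = N = 100, 4e-3 seconds per inference.
total_seconds = exhaustive_pair_inference_time(100, 100, 4e-3)
print(f"{total_seconds:.1f} s")
```

With these values the pair count is 10,000, so the per-inference time is multiplied by four orders of magnitude.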

9. Novelty of Proposed Method:
1. Both target objects and destinations can be predicted individually by a single model
2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)

10. Novelty of Proposed Method:
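[Figure: two-step inference. Step 1 runs over target object candidates 1, 2, ..., 𝑗, ..., 𝑀; Step 2 runs over destination candidates 1, 2, ..., 𝑘, ..., 𝑁.]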
1. Both target objects and destinations can be predicted individually by a single model
2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)
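
The difference between the two complexities can be illustrated by counting scoring calls. The sketch below is a toy, not the SHeFU model. The candidate names and scores are invented, the joint score is assumed to be a simple sum so that both strategies agree, and the two steps are treated as independent. The slides do not say whether Step 2 depends on the result of Step 1.

```python
from itertools import product

# Hypothetical toy scores standing in for model outputs (not from the slides).
TARGET_SCORES = {"bottle_left": 0.9, "bottle_right": 0.4, "plate": 0.1}
DEST_SCORES = {"empty_chair": 0.8, "occupied_chair": 0.3}


def select_exhaustive(targets, destinations):
    """Score every (target, destination) pair: M x N scoring calls."""
    calls = 0
    best_pair, best_score = None, float("-inf")
    for target, destination in product(targets, destinations):
        calls += 1
        # Toy joint score; assumed to decompose into a sum.
        score = TARGET_SCORES[target] + DEST_SCORES[destination]
        if score > best_score:
            best_pair, best_score = (target, destination), score
    return best_pair, calls


def select_two_step(targets, destinations):
    """Score targets, then destinations: M + N scoring calls."""
    # Step 1: one scoring call per target object candidate.
    best_target = max(targets, key=TARGET_SCORES.__getitem__)
    # Step 2: one scoring call per destination candidate.
    best_destination = max(destinations, key=DEST_SCORES.__getitem__)
    calls = len(targets) + len(destinations)
    return (best_target, best_destination), calls


targets = list(TARGET_SCORES)
destinations = list(DEST_SCORES)
exhaustive_pair, exhaustive_calls = select_exhaustive(targets, destinations)
two_step_pair, two_step_calls = select_two_step(targets, destinations)
print(exhaustive_pair, exhaustive_calls)
print(two_step_pair, two_step_calls)
```

In this toy, both strategies select the same pair, using 6 calls for the exhaustive search and 5 for the two-step search. The gap widens quickly as 𝑀 and 𝑁 grow: at 𝑀 = 𝑁 = 100, 10,000 calls become 200.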

11. Experimental Settings: Simulation and physical experiments
1. Simulation experiments: ALFRED-fc dataset
■ Based on ALFRED [Shridhar+, CVPR20]
2. Physical experiments
■ Standardized DSR, objects [Calli+, RAM15], and environment
[Video: collecting images of the environment]

12. Quantitative Results: Outperformed the baseline method
✓ Outperformed the baseline method in simulation and physical experiments
✓ Both the Switching Head and Tail mechanisms are effective
Language comprehension accuracy [%]

| Method | ALFRED-fc | Real |
|---|---|---|
| Extended TDU [Ishikawa+, RA-L21] | 79.4 ± 2.76 | 52.0 |
| Ours (w/o Switching Head) | 78.4 ± 2.05 | - |
| Ours (w/o Switching Tail) | 76.9 ± 2.91 | - |
| Ours (SHeFU) | 83.1 ± 2.00 | 55.9 |

Gain of SHeFU over Extended TDU: +3.7 points on ALFRED-fc, +3.9 points on Real

13. Qualitative Results: Successful case
“Put the red chips can on the white table with the soccer ball on it.”
[Figure legend: target object; destination; target object candidate or destination candidate]