[IROS23] Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Transcript

  1. Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

    Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, and Komei Sugiura
    Keio University
  2. Motivation: Supporting care recipients by robots that comprehend natural language instructions

    Fetch-and-carry
    ▪ Domestic service robot (DSR)
    ▪ Expected solution for a scarcity of home caregivers
    ▪ Interaction through language: “Place the red cup to the kitchen.”
  3. Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)

    ▪ Identifying target object and destination from instruction and images
    “Move the bottle on the left side of the plate to the empty chair.”
  4. Problem Statement: Dual Referring Expression Comprehension with fetch-and-carry (DREC-fc)

    ▪ Identifying target object and destination from instruction and images
    ▪ Carrying the target object to the destination
    “Move the bottle on the left side of the plate to the empty chair.”
  5. Challenge: Determining the maximum likelihood pair

    ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑂(𝑀 × 𝑁)
  6. Challenge: Determining the maximum likelihood pair

    ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑂(𝑀 × 𝑁)
    𝑀: Number of target object candidates
  7. Challenge: Determining the maximum likelihood pair

    ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑂(𝑀 × 𝑁)
    𝑀: Number of target object candidates
    𝑁: Number of destination candidates
  8. Challenge: Determining the maximum likelihood pair

    ▪ Most existing methods (e.g., TDU [Ishikawa+, RA-L21]): impractical computational complexity for inference, 𝑂(𝑀 × 𝑁)
    ▪ Assuming 𝑀 = 𝑁 = 100 and a single inference takes 4 × 10⁻³ seconds, the whole computation would take 40 seconds
    𝑀: Number of target object candidates
    𝑁: Number of destination candidates
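The 40-second figure on slide 8 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 𝑂(𝑀 + 𝑁) figure assumes each inference costs the same 4 × 10⁻³ seconds, which the slides do not state.

```python
def pairwise_inference_time(m: int, n: int, seconds_per_inference: float) -> float:
    """Total time when every (target, destination) pair is scored: M * N inferences."""
    return m * n * seconds_per_inference


def separate_inference_time(m: int, n: int, seconds_per_inference: float) -> float:
    """Total time when targets and destinations are scored separately: M + N inferences.

    Assumption: one inference costs the same as in the pairwise case.
    """
    return (m + n) * seconds_per_inference


M = N = 100                    # candidate counts from slide 8
SECONDS_PER_INFERENCE = 4e-3   # per-inference time from slide 8

pairwise = pairwise_inference_time(M, N, SECONDS_PER_INFERENCE)
separate = separate_inference_time(M, N, SECONDS_PER_INFERENCE)

print(f"O(M x N): {M * N} inferences, {pairwise:.1f} s")
print(f"O(M + N): {M + N} inferences, {separate:.1f} s")
```

Under these assumptions the number of inferences drops from 10,000 to 200.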
  9. Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)

    1. Both target objects and destinations can be predicted individually by a single model
    2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑂(𝑀 + 𝑁)
  10. Novelty of Proposed Method: Switching Head-Tail Funnel UNITER (SHeFU)

    1. Both target objects and destinations can be predicted individually by a single model
    2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑂(𝑀 + 𝑁)
    [Figure: two-step inference (Step 1, Step 2) over target object candidates 1, 2, ..., 𝑀 and destination candidates 1, 2, ..., 𝑁, with indices 𝑗 and 𝑘]
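The difference between scoring every pair and scoring each side individually can be sketched as follows. This is not the SHeFU model: the scoring functions are hypothetical stand-ins returning toy values, and the two-step version assumes the target and destination scores can be evaluated independently of each other.

```python
from typing import Callable, Tuple


def select_exhaustive(
    pair_score: Callable[[int, int], float], m: int, n: int
) -> Tuple[Tuple[int, int], int]:
    """Score all M x N (target, destination) pairs; return the best pair and the call count."""
    best_pair, best_score, calls = (0, 0), float("-inf"), 0
    for j in range(m):
        for k in range(n):
            score = pair_score(j, k)
            calls += 1
            if score > best_score:
                best_pair, best_score = (j, k), score
    return best_pair, calls


def select_two_step(
    target_score: Callable[[int], float],
    destination_score: Callable[[int], float],
    m: int,
    n: int,
) -> Tuple[Tuple[int, int], int]:
    """Step 1: pick the best of M targets. Step 2: pick the best of N destinations."""
    best_j = max(range(m), key=target_score)       # M scorer calls
    best_k = max(range(n), key=destination_score)  # N scorer calls
    return (best_j, best_k), m + n


# Toy scores (assumed values, for illustration only).
target_probs = [0.1, 0.7, 0.2]
destination_probs = [0.3, 0.1, 0.05, 0.55]

pair_result, pair_calls = select_exhaustive(
    lambda j, k: target_probs[j] * destination_probs[k],
    len(target_probs),
    len(destination_probs),
)
step_result, step_calls = select_two_step(
    lambda j: target_probs[j],
    lambda k: destination_probs[k],
    len(target_probs),
    len(destination_probs),
)
print(pair_result, pair_calls)
print(step_result, step_calls)
```

With these toy scores both strategies select the same pair, but the two-step version uses 𝑀 + 𝑁 scorer calls instead of 𝑀 × 𝑁.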
  11. Experimental Settings: Simulation and physical experiments

    1. Simulation experiments: ALFRED-fc dataset
       ▪ Based on ALFRED [Shridhar+, CVPR20]
    2. Physical experiments
       ▪ Standardized DSR, objects [Calli+, RAM15], and environment
    [Video: Collecting images of the environment]
  12. Quantitative Results: Outperformed the baseline method

    ✓ Outperformed the baseline method in simulation and physical experiments
    ✓ Both the Switching Head and Tail mechanisms are effective

    Language comprehension accuracy [%]

    Method                            ALFRED-fc     Real
    Extended TDU [Ishikawa+, RA-L21]  79.4 ± 2.76   52.0
    Ours (W/o Switching Head)         78.4 ± 2.05   -
    Ours (W/o Switching Tail)         76.9 ± 2.91   -
    Ours (SHeFU)                      83.1 ± 2.00   55.9

    Ours (SHeFU) vs. Extended TDU: +3.7 (ALFRED-fc), +3.9 (Real)
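The +3.7 and +3.9 annotations on slide 12 are the differences between the SHeFU and Extended TDU accuracies. A minimal check, using only the mean values reported on the slide (the dictionary layout is an illustrative choice):

```python
# Mean language comprehension accuracy [%] as reported on slide 12.
accuracy = {
    "Extended TDU": {"ALFRED-fc": 79.4, "Real": 52.0},
    "Ours (SHeFU)": {"ALFRED-fc": 83.1, "Real": 55.9},
}

# Gain of SHeFU over the baseline, in percentage points, per setting.
gains = {
    setting: round(accuracy["Ours (SHeFU)"][setting] - accuracy["Extended TDU"][setting], 1)
    for setting in ("ALFRED-fc", "Real")
}
print(gains)
```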
  13. Qualitative Results: Successful case

    “Put the red chips can on the white table with the soccer ball on it.”
    Legend: target object; destination; target object candidate or destination candidate