
[IROS23] Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Transcript

  1. Switching Head-Tail Funnel UNITER for
    Dual Referring Expression Comprehension with
    Fetch-and-Carry Tasks
    Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa,
    Yosuke Kawasaki, Masaki Takahashi, and Komei Sugiura
    Keio University

  2. Motivation: Supporting care recipients with robots that
    comprehend natural language instructions
    Fetch-and-carry
    ■ Domestic service robot (DSR)
    ■ Expected solution to the shortage of home caregivers
    ■ Interaction through language
    “Place the red cup to the kitchen.”

  3. Problem Statement: Dual Referring Expression Comprehension
    with fetch-and-carry (DREC-fc)
    ■ Identifying target object and destination from instruction and images
    “Move the bottle on the left side of the plate to the empty chair.”

  4. Problem Statement: Dual Referring Expression Comprehension
    with fetch-and-carry (DREC-fc)
    ■ Identifying target object and destination from instruction and images
    + Carrying the target object to the destination
    “Move the bottle on the left side of the plate to the empty chair.”

  5. Challenge: Determining the maximum likelihood pair
    ■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
     → Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)

  6. Challenge: Determining the maximum likelihood pair
    ■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
     → Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
    𝑀: Number of target object candidates

  7. Challenge: Determining the maximum likelihood pair
    ■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
     → Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
    𝑀: Number of target object candidates
    𝑁: Number of destination candidates

  8. Challenge: Determining the maximum likelihood pair
    ■ Most existing methods (e.g., TDU [Ishikawa+, RA-L21])
     → Impractical computational complexity for inference: 𝑶(𝑴 × 𝑵)
    ■ Assuming 𝑀 = 𝑁 = 100 and a single inference takes 4 × 10⁻³ seconds,
    the whole computation (𝑀 × 𝑁 = 10⁴ inferences) would take 40 seconds
    𝑀: Number of target object candidates
    𝑁: Number of destination candidates
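
The estimate on this slide is plain arithmetic and can be checked with a short sketch. The function and variable names are illustrative, and the second figure rests on an assumption that is not stated in the slides: that a method needing only 𝑀 + 𝑁 inferences would pay the same per-inference cost.

```python
def total_seconds(num_inferences: int, seconds_per_inference: float) -> float:
    """Total time if the inferences run one after another."""
    return num_inferences * seconds_per_inference


M = 100               # number of target object candidates (from the slide)
N = 100               # number of destination candidates (from the slide)
PER_INFERENCE = 4e-3  # seconds per single inference (from the slide)

# Pairwise search: one inference per (target, destination) pair.
pairwise_seconds = total_seconds(M * N, PER_INFERENCE)

# Individual search: one inference per candidate.
# ASSUMPTION (not from the slides): each inference still costs 4e-3 s.
individual_seconds = total_seconds(M + N, PER_INFERENCE)

print(f"O(M x N): {pairwise_seconds:.1f} s")
print(f"O(M + N): {individual_seconds:.1f} s")
```

The first value reproduces the 40 seconds quoted above; the second is only indicative, since the per-inference cost of a different model may differ.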

  9. Novelty of Proposed Method:
    Switching Head-Tail Funnel UNITER (SHeFU)
    1. Both target objects and destinations can be predicted individually
    by a single model
    2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)

  10. Novelty of Proposed Method:
    Switching Head-Tail Funnel UNITER (SHeFU)
    [Figure: candidate indices 1, 2, …, 𝑀 and 1, 2, …, 𝑁. Step 1: candidates 1, 2, …, 𝑗, …, 𝑀. Step 2: candidates 1, 2, …, 𝑘, …, 𝑁.]
    1. Both target objects and destinations can be predicted individually
    by a single model
    2. The computational complexity is not 𝑂(𝑀 × 𝑁) but 𝑶(𝑴 + 𝑵)
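
The difference between the two complexities can be sketched by counting scorer calls. This is not the SHeFU model: the scoring functions below are hypothetical stand-ins backed by toy dictionaries, the candidate names are made up, and the pair score is assumed to be a simple product so that both strategies have a well-defined answer.

```python
from itertools import product


class CallCounter:
    """Wraps a scoring function and counts how often it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def select_pairwise(targets, destinations, score_pair):
    """Score every (target, destination) pair: M x N scorer calls."""
    return max(product(targets, destinations), key=lambda p: score_pair(*p))


def select_individually(targets, destinations, score_target, score_destination):
    """Score targets, then destinations, separately: M + N scorer calls."""
    best_target = max(targets, key=score_target)
    best_destination = max(destinations, key=score_destination)
    return best_target, best_destination


# Toy scores (hypothetical values, for illustration only).
target_scores = {"bottle": 0.9, "cup": 0.3, "plate": 0.1}
destination_scores = {"chair": 0.8, "table": 0.4}
targets = list(target_scores)
destinations = list(destination_scores)

score_pair = CallCounter(lambda t, d: target_scores[t] * destination_scores[d])
score_target = CallCounter(lambda t: target_scores[t])
score_destination = CallCounter(lambda d: destination_scores[d])

pair_choice = select_pairwise(targets, destinations, score_pair)
individual_choice = select_individually(
    targets, destinations, score_target, score_destination
)

print(pair_choice, score_pair.calls)
print(individual_choice, score_target.calls + score_destination.calls)
```

With 3 targets and 2 destinations, the pairwise search makes 6 scorer calls and the individual search makes 5; the gap widens quickly as 𝑀 and 𝑁 grow.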

  11. Experimental Settings: Simulation and physical experiments
    Collecting images of the environment
    1. Simulation experiments: ALFRED-fc dataset
    ■ Based on ALFRED [Shridhar+, CVPR20]
    2. Physical experiments
    ■ Standardized DSR, objects [Calli+, RAM15], and environment

  12. Quantitative Results: Outperformed the baseline method
    ✓ Outperformed the baseline method in simulation and physical experiments
    ✓ Both the Switching Head and Tail mechanisms are effective
    Language comprehension accuracy [%]
    Method                              ALFRED-fc     Real
    Extended TDU [Ishikawa+, RA-L21]    79.4 ± 2.76   52.0
    Ours (w/o Switching Head)           78.4 ± 2.05   -
    Ours (w/o Switching Tail)           76.9 ± 2.91   -
    Ours (SHeFU)                        83.1 ± 2.00   55.9
    Gain over Extended TDU              +3.7          +3.9

  13. Qualitative Results: Successful case
    “Put the red chips can on the white table with the soccer ball on it.”
    [Legend: target object; destination; target object candidate or destination candidate]
