
The Confluence of Vision, Language, and Robotics


Transcript

  1. Komei Sugiura
    Keio University
    The Confluence of Vision, Language, and Robotics

  2. Use cases of Vision x Language x Robotics
    Honda CiKoMa (YouTube)
    PaLM-E [Driess (Robotics at Google)+ 2023]
    https://palm-e.github.io/


  3. Impact of foundation models on robotics
    ■ Foundation models: trained on broad data at scale and adaptable to a wide range of downstream tasks*
    ■ e.g. BERT, GPT-3, CLIP
    ■ Impact of foundation models on robotics: major
    ■ Robustness to unseen situations (zero-shot/few-shot)
    ■ Easily usable by non-experts in NLP/CV
    ■ Impact of robotics on foundation models: minor
    ■ RT-1/2
    ■ (Future) Self-driving cars, automated experiments, …
    *Bommasani et al., “On the Opportunities and Risks of Foundation Models”, 2021.

  4. Foundation models for building communication robots
    Text embedding
    • Standalone: BERT, RoBERTa, DeBERTa, …
    • Cloud: text-embedding-ada-002 (OpenAI)
    Speech recognition
    • Standalone: Whisper (OpenAI)
    • Cloud: smartphone UIs with proprietary cloud APIs
      (most robot developers prefer not to operate speech recognition servers by themselves)
    Rospeex [Sugiura+ IROS15]
    • ROS-based
    • 50k unique users between 2013 and 2018
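    As an illustration of the standalone speech recognition option above, here is a minimal sketch using the openai-whisper Python package (the "base" model size and the file name "command.wav" are assumptions for illustration, not details from the slide):

    # Minimal sketch: local speech recognition with openai-whisper.
    # Assumes `pip install openai-whisper` and a recorded file "command.wav".
    import whisper

    model = whisper.load_model("base")        # small multilingual checkpoint
    result = model.transcribe("command.wav")  # dict containing the decoded text
    print(result["text"])                     # the user's spoken instruction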

  5. Large language models (LLMs) for generating control code
    Code as Policies [Liang+ 22]
    • LLM generates a sequence of atomic actions (sensing & action)
    ChatGPT for Robotics [Vemprala+ 23]
    • Humans give feedback to interactively generate control code
    TidyBot [Wu+ AR-IROS23]
    • Recognizes target objects by CLIP and generates code (including receptacles) by LLM
    Situation information is manually given
    ■ e.g. objects = ["yellow shirt", "black shirt", …]
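    A minimal sketch of the prompting pattern these systems share: an instruction plus a manually given object list goes to an LLM, which emits calls to robot primitives. The query_llm and pick_and_place functions below are hypothetical stubs, not the APIs of the cited systems:

    # Sketch of LLM-generated control code (Code as Policies / TidyBot style).
    # query_llm and pick_and_place are stubs standing in for a real LLM client
    # and a real robot skill library.
    def query_llm(prompt: str) -> str:
        # Placeholder for an actual LLM call.
        return 'pick_and_place("black shirt", "laundry bin")'

    def pick_and_place(obj: str, receptacle: str) -> None:
        print(f"pick {obj} -> place in {receptacle}")  # stand-in for robot skills

    objects = ["yellow shirt", "black shirt"]   # situation info, given manually
    instruction = "Put the dark clothes in the laundry bin."
    prompt = (
        "You control a robot with the primitive pick_and_place(obj, receptacle).\n"
        f"Visible objects: {objects}\nInstruction: {instruction}\n"
        "Respond only with Python code."
    )
    generated = query_llm(prompt)
    exec(generated, {"pick_and_place": pick_and_place})  # run the generated policy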

  6. CLIP [Radford+ 21]: Vision-language foundation model
    ■ Contrastive learning on 400M image-text pairs
    ■ CLIP: InfoNCE loss; OTTER [Wu+ ICLR22]: optimal transport
    ■ Many applications (e.g. DALL·E 2 [Ramesh+ 22])
    [Figure: CLIP's image and text encoders map images and captions (e.g. "a photo of a beer bottle", "satellite imagery of roundabout", "a photo of a marimba", "a meme") into a shared feature space]
    https://vimeo.com/692375454
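    A minimal PyTorch sketch of the symmetric InfoNCE objective used by CLIP (randomly initialized tensors stand in for the outputs of the image and text encoders; the batch size, dimensionality, and temperature are illustrative):

    # Sketch of CLIP's symmetric InfoNCE loss over a batch of N image-text pairs.
    import torch
    import torch.nn.functional as F

    N, d, temperature = 8, 512, 0.07
    image_feat = F.normalize(torch.randn(N, d), dim=-1)   # stand-in image embeddings
    text_feat = F.normalize(torch.randn(N, d), dim=-1)    # stand-in text embeddings

    logits = image_feat @ text_feat.T / temperature       # N x N similarity matrix
    labels = torch.arange(N)                              # matched pairs on the diagonal
    loss = (F.cross_entropy(logits, labels)               # image-to-text direction
            + F.cross_entropy(logits.T, labels)) / 2      # text-to-image direction
    print(loss.item())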

  7. Object search/manipulation using CLIP
    Manipulation
    • CLIPort [Shridhar+ CoRL21] / PerAct [Shridhar+ CoRL22]
      - Introduces a side network to Transporter Networks [Zeng+ CoRL20] to handle CLIP features
      - Predicts 2D positions / 6D poses of the gripper
    • KITE [Sundaresan+ CoRL23]: Predicts which parts of objects to grasp
    Search
    • CLIP-Fields [Shafiullah+ RSS23]: Object search by Detic+BERT+CLIP
    • OpenScene [Peng+ CVPR23]: Open-vocabulary 3D scene understanding

  8. Major approaches to encode images by CLIP
    1D feature vector
    ■ Easy: model.encode_image(image)
    ■ Positional information is lost
    ■ Additional information is required to handle e.g. “A is left of B”
    [Figure: CLIP image/text encoders producing image and text features]

  9. Major approaches to encode images by CLIP
    1D feature vector
    ■ Easy: model.encode_image(image)
    ■ Positional information is lost
    ■ Additional information is required to handle e.g. “A is left of B”
    2D feature map
    ■ Extract intermediate output from the ResNet/ViT in CLIP (e.g. 28 x 28 x 512)
    ■ Typical work: CLIPort [Shridhar+ CoRL21], CRIS [Wang+ CVPR22], SAN [Mengde+ CVPR23]
    [Figure: CLIP image/text encoders producing image and text features]
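    A minimal sketch of the two options above with the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git); the RN50 backbone, the input file "scene.jpg", and the choice of layer4 as the tap point for the 2D map are assumptions for illustration:

    # Sketch: 1D feature vector vs. 2D feature map from CLIP's image encoder.
    import clip
    import torch
    from PIL import Image

    model, preprocess = clip.load("RN50", device="cpu")
    image = preprocess(Image.open("scene.jpg")).unsqueeze(0)

    with torch.no_grad():
        # (1) 1D feature vector: easy, but spatial layout is lost.
        vec = model.encode_image(image)                  # shape (1, 1024) for RN50

        # (2) 2D feature map: tap an intermediate ResNet stage before pooling.
        feats = {}
        hook = model.visual.layer4.register_forward_hook(
            lambda module, inputs, output: feats.update(fmap=output))
        model.encode_image(image)
        hook.remove()
        fmap = feats["fmap"]                             # shape (1, C, H, W) grid

    print(vec.shape, fmap.shape)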

  10. Generating action sequences by LLMs
    ■ Late fusion: PaLM SayCan [Ahn (Google)+ 2022]
    ■ Language score (Say): estimated generation probability of phrases
    ■ Action score (Can): estimated task success probability under the situation (see the sketch below)
    ■ Early fusion: PaLM-E [Driess (Google)+ 2023]
    ■ Decomposes tasks by multimodal prompts (e.g. “Given [image], …”)
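    A minimal sketch of the late-fusion step: the next skill is the one maximizing the product of the LLM's language score ("Say") and the affordance model's success probability ("Can"). The candidate skills and all numbers below are made-up placeholders, not real model outputs:

    # Sketch of SayCan-style late fusion over a few candidate skills.
    say = {"pick up the sponge": 0.60,   # LLM: how relevant the phrase is
           "go to the sink": 0.25,
           "pick up the apple": 0.15}
    can = {"pick up the sponge": 0.30,   # affordance: how likely to succeed here
           "go to the sink": 0.90,
           "pick up the apple": 0.80}

    best = max(say, key=lambda skill: say[skill] * can[skill])
    print(best)   # "go to the sink" wins: relevant enough and clearly feasible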

  11. Attempts to build foundation models for robotics
    RT-1 [Brohan+ 50 authors, 22]
    ■ 13 physical robots x 17 months
    ■ Image-text fusion uses FiLM [Perez+ 17]
    ■ Inference at 3 Hz for base/arm actions
    ■ RT-2 [Brohan+ 23]: multimodal LLMs predict 6D velocity
    Gato [Reed+ JMLR22]
    ■ Single transformer learns games, image captioning, object manipulation, etc.

  12. Benchmarks for vision x language x (physical) robots
    RoboCup@Home (2006-)
    ■ Largest benchmarking test for domestic service robots
    HomeRobot [Yenamandra+ CoRL23]
    ■ Open-vocabulary mobile manipulation
    ■ Competition at NeurIPS 2023

  13. RoboCup@Home [Iocchi+ AI15]
    ■ Includes real “real environments”
    ■ General Purpose Service Robots test (2010-)
    ■ As difficult as the tasks handled by RT-2/PaLM SayCan
    ■ Almost solved by foundation models in 2023
    ■ We won 1st place in 2008 & 2010, and 2nd place in 2009 & 2012
    L. Iocchi et al., "RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots," Artificial Intelligence, Vol. 229, pp. 258-281, 2015.

  14. Challenging examples: Referring expression comprehension
    ■ Google Bard*
    ■ Recognizes this as “white pillow”
    ■ SEEM [Zou+ 23]
    ■ Masks the mirror given “Pick up the plant in front of the mirror”
    *As of July 2023

  15. Multimodal Language Comprehension for Robots

  16. Motivation: Building robots that assist people
    https://www.toyota.com/usa/toyota-effect/romy-robot.html
    Social issues
    • Decrease in the working-age population that supports those who need assistance
    • Training an assistance dog takes two years
    “I need to leave work to support my family…”
    “I can't manage to care for an assistance dog.”
    What challenges arise in building language interfaces?
    If options are plentiful, touch panels may be inconvenient.

  17. Milestones: What should be done and to what extent?
    ■ Analyzed tasks defined by IAADP*
    ■ Out of 108 subtasks, 50 subtasks are doable by HSR
    ■ Metrics
    ■ Task coverage: 80%
    ■ Success rate: 80%
    *International Association of Assistance Dog Partners

  18. Open-vocabulary mobile manipulation
    “Can you fetch the baseball near the red mug cup and place it on the tall table”
    “Take the tomato soup can next to the orange and place it on the tall table”
    x4

  19. MultiRankIt: Robots as physical search engines
    Background:
    ■ Low task success rate of fully autonomous approaches (<30%)
    ■ Closed-vocabulary settings are impractical
    Our approach: human-in-the-loop
    ■ Machine generates search results
    ■ User selects the preferred option
    ■ Multi-level phrase-region matching (an extended CLIP)
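    A minimal sketch of the ranking idea: score each candidate region against the instruction and show the top results to the user. Plain CLIP similarity stands in here for MultiRankIt's multi-level phrase-region matching, and the candidate image files are hypothetical placeholders:

    # Sketch: rank candidate regions for an instruction by CLIP similarity.
    import clip
    import torch
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")
    instruction = "the towel directly across from the sink"
    candidates = ["region_01.jpg", "region_02.jpg", "region_03.jpg"]

    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([instruction]))
        image_feats = model.encode_image(
            torch.stack([preprocess(Image.open(p)) for p in candidates]))
        scores = torch.nn.functional.cosine_similarity(image_feats, text_feat)

    ranking = [candidates[i] for i in scores.argsort(descending=True)]
    print(ranking)   # present the top-ranked regions; the user picks the final one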

  20. Successful examples for complicated referring expressions
    Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
    [Figure: top-6 retrieved candidate regions (Rank 1-6)]
    Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
    [Figure: top-6 retrieved candidate regions (Rank 1-6)]

  21. Vision-and-language navigation simulators / datasets
    Real world
    ■ Room2Room [Anderson+ CVPR18], REVERIE [Qi+ CVPR20]
    ■ Honorable Mention Award @ REVERIE Challenge 2022
    Simulation
    ■ ALFRED [Shridhar+ CVPR20], HomeRobot, VLMbench [Zheng+ NeurIPS22]
    ■ 1st place in the CVPR 2023 DialFRED Challenge [Kaneda+ 23]
    [Figure: VLN-BERT on Matterport3D (90 houses); example instruction: “Go to the bathroom in the bedroom with orange stripe. Make sure the faucet is not leaking.”]

  22. CrossMap Transformer: A masked [language, vision, path] model [Magassouba+ RAL & IROS21]
    ■ Task: Room2Room
    ■ Masks three modalities [text, image, path]
    ■ cf. UniMASK [Carroll+ NeurIPS22]
    ■ Data augmentation by double multimodal back-translation
    Method               SR    OSR
    [Majumdar+ ECCV20]   0.70  0.76
    [Hao+ CVPR20]        0.69  0.78
    Ours                 0.73  0.80
    Human performance    0.86  0.90
    [Figure: back-translation between instruction+image and path, cf. English↔German back-translation]

  23. Vision-and-language navigation for personal mobility
    Task: Talk2Car [Rufus+ IROS21]
    ■ Predict the target region (stuff) based on a landmark (thing)
    Proposed method
    ■ Trimodal architecture
    ■ Introduces a day-night branch to estimate mask quality
                       Mean IoU
    [Rufus+ IROS21]    32.71±4.59
    Ours               37.61±2.73
    Example instruction: “pull in behind the blue van on the left side.”

  24. Semantic segmentation masks degrade at night
    [Figure: RGB images and semantic segmentation masks, day vs. night]

  25. Qualitative result (1)
    Instruction: “pull up behind the guy wearing a white shirt.”
    [Figure: GT / Baseline / Ours masks]

  26. Understanding multiple referring expressions
    ■ Combinatorial explosion in Carry tasks (“carry A from B to C”)
    ■ Potential pairs of (target, receptacle): M × N
    ■ e.g. M = 200, N = 30, inference = 0.005 sec → decision requires 30 sec
    M: # target candidates
    N: # destination candidates
    [Figure: example (target, receptacle) pairs]

  27. Switching Head-tail Funnel UNITER [Korekata+ IROS23]
    Novelty
    ■ Switching head-tail enables the model to predict the target and the receptacle with a single model
    ■ O(M × N) → O(M + N) (see the sketch below)
    ■ Task success: 89%
    “Put the red chips can on the white table with the soccer ball on it.”
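    A minimal sketch of the complexity argument behind the O(M × N) to O(M + N) reduction, using the figures from the previous slide (200 target candidates, 30 destination candidates, 0.005 s per inference):

    # Sketch: pairwise scoring needs M*N inferences; predicting the target and
    # the receptacle with separate passes of a single model needs only M+N.
    M, N, sec_per_inference = 200, 30, 0.005

    pairwise_time = M * N * sec_per_inference      # 200 * 30 * 0.005 = 30.0 s
    decoupled_time = (M + N) * sec_per_inference   # (200 + 30) * 0.005 = 1.15 s
    print(pairwise_time, decoupled_time)           # 30.0 vs. 1.15 seconds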

  28. Multimodal language comprehension for fetching instructions
    Prototypical contrastive transfer learning [Otsuki+ IROS23]
    ■ Sim2Real for language comprehension
    Multimodal diffusion segmentation model [Iioka+ IROS23]
    ■ Instruction understanding and segmentation

  29. Explainable AI for Robots

  30. Collected 10 million labeled images
    Reducing manual labeling time from 8,000 to 7 days

  31. PonNet: Predicting and explaining collisions [Magassouba+ Advanced Robotics 2021]
    Background:
    ■ Predicting collisions in advance is useful to prevent damaging collisions
    Novelty:
    ■ Extends ABN to two streams integrated by a transformer

  32. Related work
    Attention Branch Network [Fukui+ CVPR19], Lambda ABN [Iida+ ACCV22]
    - Explanation generation using branch structures for CNNs
    - ABN for lambda networks
    Attention Rollout [Abnar+ 20]
    - Explanation generation based on QKV attention
    RISE [Petsiuk+ BMVC18]
    - Generic method for explanation generation
    [Figure: ABN, RISE, Attention Rollout, and Lambda ABN explanations; explanation generation for solar flare prediction]
    https://vitalab.github.io/article/2019/07/18/attention-branch-network.html
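    A minimal sketch of the attention rollout computation from [Abnar+ 20]: average each layer's attention map with the identity to account for residual connections, re-normalize, and multiply across layers. Random matrices stand in for a real transformer's head-averaged attention:

    # Sketch of attention rollout for a toy transformer.
    import torch

    num_layers, num_tokens = 4, 10
    attentions = [torch.softmax(torch.randn(num_tokens, num_tokens), dim=-1)
                  for _ in range(num_layers)]       # one attention map per layer

    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout                         # accumulate across layers

    print(rollout[0])  # contribution of each input token to output token 0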

  33. Future captioning for explaining potential risks
    [Ogura+ IEEE RAL & IROS2020] [Kambara+ IEEE RAL & IROS2021] [Kambara+ IEEE ICIP22]
    Motivation: Potential risks should be explained to operators in advance
    Method: Extended relational self-attention [Kim+ NeurIPS21] for future captioning
    Results: Our method outperformed MART [Lei+ ACL2020] on YouCook2 and collision explanation tasks
    “The robot might hit the hourglass. Should we cancel the action?”

  34. Qualitative results (1)
    Appropriate prediction of future events
    Reference: Robot hits the camera from above because robot tried to put the white bottle where it is
    MART [Lei+ ACL20]: Robot hits a black teapot because robot tried to put a round white bottle
    Ours: Robot hits the camera hard because robot tried to put a white jar
    [Figure: frames at t and t+1]

  35. Qualitative results (2)
    Generated appropriate descriptions for future events
    Reference: Rub flour onto the chicken dip it in egg and coat with breadcrumbs
    MART [Lei+ ACL20]: Coat the chicken in the flour
    Ours: Coat the chicken with flour and breadcrumbs (missing information)
    [Figure: frames at t-2, t-1, t, t+1]

  36. Building evaluation metrics for multimodal language generation
    ■ Automatic evaluation is essential for development of image captioning models
    ■ cf. machine translation
    ■ Human evaluation is labor-intensive for daily development
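    For illustration, a minimal sketch of machine-translation-style automatic evaluation using sentence-level BLEU from NLTK; the captions are made up, and BLEU merely stands in for automatic captioning metrics in general (it is not the metric proposed on the next slide):

    # Sketch: automatic caption evaluation with sentence-level BLEU (NLTK).
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    reference = ["a robot places a bottle on the table".split()]
    candidate = "a robot puts a bottle on the table".split()

    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")  # higher = closer n-gram overlap with the reference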

  37. JaSPICE correlates highly with human evaluations
    Background: Most metrics for Japanese do not correlate well with human evaluations
    Novelty: Graph matching based on predicate-argument structures
    Collected 22,350 samples from 100 subjects

  38. Summary

  39. Summary
    1. Foundation models for robots
    2. Multimodal language comprehension for robots
    3. Explainable AI for robots
    Acknowledgements: JSPS, JST CREST, JST Moonshot, NEDO, SCOPE, Toyota, NICT, Honda, Osaka Univ., Chubu Univ., collaborators, students, and staff