
The Confluence of Vision, Language, and Robotics


Transcript

  1. Komei Sugiura
    Keio University
    The Confluence of Vision, Language, and Robotics

  2. Use cases of Vision x Language x Robotics
    Honda CiKoMa (YouTube)
    PaLM-E [Driess (Robotics at Google)+ 2023]
    https://palm-e.github.io/


  3. Impact of foundation models on robotics
    ■ Foundation models: trained on broad data at scale and adaptable to a wide range of downstream tasks*
    ■ e.g. BERT, GPT-3, CLIP
    ■ Impact of foundation models on robotics: major
    ■ Robustness to unseen situations (zero-shot/few-shot)
    ■ Easily usable by non-experts in NLP/CV
    ■ Impact of robotics on foundation models: minor
    ■ RT-1/2
    ■ (Future) Self-driving cars, automated experiments, …
    *Bommasani et al., “On the Opportunities and Risks of Foundation Models”, 2021.

  4. Foundation models for building communication robots
    Text embedding
    • Standalone: BERT, RoBERTa, DeBERTa, …
    • Cloud: text-embedding-ada-002 (OpenAI)
    Speech recognition
    • Standalone: Whisper (OpenAI)
    • Cloud: smartphone UIs with proprietary cloud APIs
      (most robot developers prefer not to operate speech recognition servers by themselves)
    Rospeex [Sugiura+ IROS15]
    • ROS-based
    • 50k unique users between 2013 and 2018
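    As an illustration of the standalone speech recognition option above, here is a minimal sketch using the openai-whisper Python package (the "base" model size and the file name "command.wav" are assumptions for illustration, not details from the slide):

    # Minimal sketch: local speech recognition with openai-whisper.
    # Assumes `pip install openai-whisper` and a recorded file "command.wav".
    import whisper

    model = whisper.load_model("base")        # small multilingual checkpoint
    result = model.transcribe("command.wav")  # dict containing the decoded text
    print(result["text"])                     # the user's spoken instruction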

  5. Large language models (LLMs) for generating control code
    Code as Policies [Liang+ 22]
    • LLM generates a sequence of atomic actions (sensing & action)
    ChatGPT for Robotics [Vemprala+ 23]
    • Humans give feedback to interactively generate control code
    TidyBot [Wu+ AR-IROS23]
    • Recognizes target objects by CLIP and generates code (including receptacles) by LLM
    Situation information is manually given
    ■ e.g. objects = ["yellow shirt", "black shirt", …]
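    A minimal sketch of the prompting pattern these systems share: an instruction plus a manually given object list goes to an LLM, which emits calls to robot primitives. The query_llm and pick_and_place functions below are hypothetical stubs, not the APIs of the cited systems:

    # Sketch of LLM-generated control code (Code as Policies / TidyBot style).
    # query_llm and pick_and_place are stubs standing in for a real LLM client
    # and a real robot skill library.
    def query_llm(prompt: str) -> str:
        # Placeholder for an actual LLM call.
        return 'pick_and_place("black shirt", "laundry bin")'

    def pick_and_place(obj: str, receptacle: str) -> None:
        print(f"pick {obj} -> place in {receptacle}")  # stand-in for robot skills

    objects = ["yellow shirt", "black shirt"]   # situation info, given manually
    instruction = "Put the dark clothes in the laundry bin."
    prompt = (
        "You control a robot with the primitive pick_and_place(obj, receptacle).\n"
        f"Visible objects: {objects}\nInstruction: {instruction}\n"
        "Respond only with Python code."
    )
    generated = query_llm(prompt)
    exec(generated, {"pick_and_place": pick_and_place})  # run the generated policy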

  6. CLIP [Radford+ 21]: Vision-language foundation model
    ■ Contrastive learning on 400M image-text pairs
    ■ CLIP: InfoNCE loss; OTTER [Wu+ ICLR22]: optimal transport
    ■ Many applications (e.g. DALL·E 2 [Ramesh+ 22])
    [Figure: CLIP's image and text encoders map images and captions (e.g. "a photo of a beer bottle", "satellite imagery of roundabout", "a photo of a marimba", "a meme") into a shared feature space]
    https://vimeo.com/692375454
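    A minimal PyTorch sketch of the symmetric InfoNCE objective used by CLIP (randomly initialized tensors stand in for the outputs of the image and text encoders; the batch size, dimensionality, and temperature are illustrative):

    # Sketch of CLIP's symmetric InfoNCE loss over a batch of N image-text pairs.
    import torch
    import torch.nn.functional as F

    N, d, temperature = 8, 512, 0.07
    image_feat = F.normalize(torch.randn(N, d), dim=-1)   # stand-in image embeddings
    text_feat = F.normalize(torch.randn(N, d), dim=-1)    # stand-in text embeddings

    logits = image_feat @ text_feat.T / temperature       # N x N similarity matrix
    labels = torch.arange(N)                              # matched pairs on the diagonal
    loss = (F.cross_entropy(logits, labels)               # image-to-text direction
            + F.cross_entropy(logits.T, labels)) / 2      # text-to-image direction
    print(loss.item())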

  7. Object search/manipulation using CLIP
    Manipulation
    • CLIPort [Shridhar+ CoRL21] / PerAct [Shridhar+ CoRL22]
      - Introduces a side network to Transporter Networks [Zeng+ CoRL20] to handle CLIP features
      - Predicts 2D positions / 6D poses of the gripper
    • KITE [Sundaresan+ CoRL23]: Predicts which parts of objects to grasp
    Search
    • CLIP-Fields [Shafiullah+ RSS23]: Object search by Detic+BERT+CLIP
    • OpenScene [Peng+ CVPR23]: Open-vocabulary 3D scene understanding

  8. Major approaches to encode images by CLIP
    1D feature vector
    ■ Easy: model.encode_image(image)
    ■ Positional information is lost
    ■ Additional information is required to handle e.g. “A is left of B”
    [Figure: CLIP image/text encoders producing image and text features]

  9. Major approaches to encode images by CLIP
    1D feature vector
    ■ Easy: model.encode_image(image)
    ■ Positional information is lost
    ■ Additional information is required to handle e.g. “A is left of B”
    2D feature map
    ■ Extract intermediate output from the ResNet/ViT in CLIP (e.g. 28 x 28 x 512)
    ■ Typical work: CLIPort [Shridhar+ CoRL21], CRIS [Wang+ CVPR22], SAN [Mengde+ CVPR23]
    [Figure: CLIP image/text encoders producing image and text features]
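    A minimal sketch of the two options above with the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git); the RN50 backbone, the input file "scene.jpg", and the choice of layer4 as the tap point for the 2D map are assumptions for illustration:

    # Sketch: 1D feature vector vs. 2D feature map from CLIP's image encoder.
    import clip
    import torch
    from PIL import Image

    model, preprocess = clip.load("RN50", device="cpu")
    image = preprocess(Image.open("scene.jpg")).unsqueeze(0)

    with torch.no_grad():
        # (1) 1D feature vector: easy, but spatial layout is lost.
        vec = model.encode_image(image)                  # shape (1, 1024) for RN50

        # (2) 2D feature map: tap an intermediate ResNet stage before pooling.
        feats = {}
        hook = model.visual.layer4.register_forward_hook(
            lambda module, inputs, output: feats.update(fmap=output))
        model.encode_image(image)
        hook.remove()
        fmap = feats["fmap"]                             # shape (1, C, H, W) grid

    print(vec.shape, fmap.shape)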

  10. Generating action sequences by LLMs
    ■ Late fusion: PaLM SayCan [Ahn (Google)+ 2022]
    ■ Language score (Say): estimated generation probability of phrases
    ■ Action score (Can): estimated task success probability under the situation (see the sketch below)
    ■ Early fusion: PaLM-E [Driess (Google)+ 2023]
    ■ Decomposes tasks by multimodal prompts (e.g. “Given [image], …”)
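    A minimal sketch of the late-fusion step: the next skill is the one maximizing the product of the LLM's language score ("Say") and the affordance model's success probability ("Can"). The candidate skills and all numbers below are made-up placeholders, not real model outputs:

    # Sketch of SayCan-style late fusion over a few candidate skills.
    say = {"pick up the sponge": 0.60,   # LLM: how relevant the phrase is
           "go to the sink": 0.25,
           "pick up the apple": 0.15}
    can = {"pick up the sponge": 0.30,   # affordance: how likely to succeed here
           "go to the sink": 0.90,
           "pick up the apple": 0.80}

    best = max(say, key=lambda skill: say[skill] * can[skill])
    print(best)   # "go to the sink" wins: relevant enough and clearly feasible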

  11. Attempts to build foundation models for robotics
    RT-1 [Brohan+ 50 authors, 22]
    ■ 13 physical robots x 17 months
    ■ Image-text fusion uses FiLM [Perez+ 17]
    ■ Inference at 3 Hz for base/arm actions
    ■ RT-2 [Brohan+ 23]: multimodal LLMs predict 6D velocity
    Gato [Reed+ JMLR22]
    ■ Single transformer learns games, image captioning, object manipulation, etc.

  12. Benchmarks for vision x language x (physical) robots
    RoboCup@Home (2006-)
    ■ Largest benchmarking test for domestic service robots
    HomeRobot [Yenamandra+ CoRL23]
    ■ Open-vocabulary mobile manipulation
    ■ Competition at NeurIPS 2023

  13. RoboCup@Home [Iocchi+ AI15]
    ■ Includes real “real environments”
    ■ General Purpose Service Robots test (2010-)
    ■ As difficult as the tasks handled by RT-2/PaLM SayCan
    ■ Almost solved by foundation models in 2023
    ■ We won 1st place in 2008 & 2010, and 2nd place in 2009 & 2012
    L. Iocchi et al., "RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots," Artificial Intelligence, Vol. 229, pp. 258-281, 2015.

  14. Challenging examples: Referring expression comprehension
    ■ Google Bard*
    ■ Recognizes this as “white pillow”
    ■ SEEM [Zou+ 23]
    ■ Masks the mirror given “Pick up the plant in front of the mirror”
    *As of July 2023

  15. Multimodal Language Comprehension for Robots

  16. Motivation: Building robots that assist people
    https://www.toyota.com/usa/toyota-effect/romy-robot.html
    Social issues
    • Decrease in the working-age population that supports those who need assistance
    • Training an assistance dog takes two years
    “I need to leave work to support my family…”
    “I can't manage to care for an assistance dog.”
    What challenges arise in building language interfaces?
    If options are plentiful, touch panels may be inconvenient.

  17. Milestones: What should be done and to what extent?
    ■ Analyzed tasks defined by IAADP*
    ■ Out of 108 subtasks, 50 subtasks are doable by HSR
    ■ Metrics
    ■ Task coverage: 80%
    ■ Success rate: 80%
    *International Association of Assistance Dog Partners

  18. Open-vocabulary mobile manipulation
    “Can you fetch the baseball near the red mug cup and place it on the tall table”
    “Take the tomato soup can next to the orange and place it on the tall table”
    x4

  19. MultiRankIt: Robots as physical search engines
    Background:
    ■ Low task success rate of fully autonomous approaches (<30%)
    ■ Closed-vocabulary settings are impractical
    Our approach: human-in-the-loop
    ■ Machine generates search results
    ■ User selects the preferred option
    ■ Multi-level phrase-region matching (an extended CLIP)
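    A minimal sketch of the ranking idea: score each candidate region against the instruction and show the top results to the user. Plain CLIP similarity stands in here for MultiRankIt's multi-level phrase-region matching, and the candidate image files are hypothetical placeholders:

    # Sketch: rank candidate regions for an instruction by CLIP similarity.
    import clip
    import torch
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")
    instruction = "the towel directly across from the sink"
    candidates = ["region_01.jpg", "region_02.jpg", "region_03.jpg"]

    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([instruction]))
        image_feats = model.encode_image(
            torch.stack([preprocess(Image.open(p)) for p in candidates]))
        scores = torch.nn.functional.cosine_similarity(image_feats, text_feat)

    ranking = [candidates[i] for i in scores.argsort(descending=True)]
    print(ranking)   # present the top-ranked regions; the user picks the final one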

  20. Successful examples for complicated referring expressions
    Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
    [Figure: top-6 retrieved candidate regions (Rank 1-6)]
    Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
    [Figure: top-6 retrieved candidate regions (Rank 1-6)]

  21. Vision-and-language navigation simulators / datasets
    Real world
    ■ Room2Room [Anderson+ CVPR18], REVERIE [Qi+ CVPR20]
    ■ Honorable Mention Award @ REVERIE Challenge 2022
    Simulation
    ■ ALFRED [Shridhar+ CVPR20], HomeRobot, VLMbench [Zheng+ NeurIPS22]
    ■ 1st place in the CVPR 2023 DialFRED Challenge [Kaneda+ 23]
    [Figure: VLN-BERT on Matterport3D (90 houses); example instruction: “Go to the bathroom in the bedroom with orange stripe. Make sure the faucet is not leaking.”]

  22. CrossMap Transformer: A masked [language, vision, path] model [Magassouba+ RAL & IROS21]
    ■ Task: Room2Room
    ■ Masks three modalities [text, image, path]
    ■ cf. UniMASK [Carroll+ NeurIPS22]
    ■ Data augmentation by double multimodal back-translation
    Method               SR    OSR
    [Majumdar+ ECCV20]   0.70  0.76
    [Hao+ CVPR20]        0.69  0.78
    Ours                 0.73  0.80
    Human performance    0.86  0.90
    [Figure: back-translation between instruction+image and path, cf. English↔German back-translation]

  23. Vision-and-language navigation for personal mobility
    Task: Talk2Car [Rufus+ IROS21]
    ■ Predict the target region (stuff) based on a landmark (thing)
    Proposed method
    ■ Trimodal architecture
    ■ Introduces a day-night branch to estimate mask quality
                       Mean IoU
    [Rufus+ IROS21]    32.71±4.59
    Ours               37.61±2.73
    Example instruction: “pull in behind the blue van on the left side.”

  24. Semantic segmentation masks degrade at night
    [Figure: RGB images and semantic segmentation masks, day vs. night]

  25. Qualitative result (1)
    Instruction: “pull up behind the guy wearing a white shirt.”
    [Figure: GT / Baseline / Ours masks]

  26. Understanding multiple referring expressions
    ■ Combinatorial explosion in Carry tasks (“carry A from B to C”)
    ■ Potential pairs of (target, receptacle): M × N
    ■ e.g. M = 200, N = 30, inference = 0.005 sec → decision requires 30 sec
    M: # target candidates
    N: # destination candidates
    [Figure: example (target, receptacle) pairs]

  27. Switching Head-tail Funnel UNITER [Korekata+ IROS23]
    Novelty
    ■ Switching head-tail enables the model to predict the target and the receptacle with a single model
    ■ O(M × N) → O(M + N) (see the sketch below)
    ■ Task success: 89%
    “Put the red chips can on the white table with the soccer ball on it.”
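    A minimal sketch of the complexity argument behind the O(M × N) to O(M + N) reduction, using the figures from the previous slide (200 target candidates, 30 destination candidates, 0.005 s per inference):

    # Sketch: pairwise scoring needs M*N inferences; predicting the target and
    # the receptacle with separate passes of a single model needs only M+N.
    M, N, sec_per_inference = 200, 30, 0.005

    pairwise_time = M * N * sec_per_inference      # 200 * 30 * 0.005 = 30.0 s
    decoupled_time = (M + N) * sec_per_inference   # (200 + 30) * 0.005 = 1.15 s
    print(pairwise_time, decoupled_time)           # 30.0 vs. 1.15 seconds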

  28. Multimodal language comprehension for fetching instructions
    Prototypical contrastive transfer learning [Otsuki+ IROS23]
    ■ Sim2Real for language comprehension
    Multimodal diffusion segmentation model [Iioka+ IROS23]
    ■ Instruction understanding and segmentation

  29. Explainable AI for Robots

  30. Collected 10 million labeled images
    Reducing manual labeling time from 8,000 to 7 days

  31. PonNet: Predicting and explaining collisions [Magassouba+ Advanced Robotics 2021]
    Background:
    ■ Predicting collisions in advance is useful to prevent damaging collisions
    Novelty:
    ■ Extends ABN to two streams integrated by a transformer

  32. Related work
    Attention Branch Network [Fukui+ CVPR19], Lambda ABN [Iida+ ACCV22]
    - Explanation generation using branch structures for CNNs
    - ABN for lambda networks
    Attention Rollout [Abnar+ 20]
    - Explanation generation based on QKV attention
    RISE [Petsiuk+ BMVC18]
    - Generic method for explanation generation
    [Figure: ABN, RISE, Attention Rollout, and Lambda ABN explanations; explanation generation for solar flare prediction]
    https://vitalab.github.io/article/2019/07/18/attention-branch-network.html
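    A minimal sketch of the attention rollout computation from [Abnar+ 20]: average each layer's attention map with the identity to account for residual connections, re-normalize, and multiply across layers. Random matrices stand in for a real transformer's head-averaged attention:

    # Sketch of attention rollout for a toy transformer.
    import torch

    num_layers, num_tokens = 4, 10
    attentions = [torch.softmax(torch.randn(num_tokens, num_tokens), dim=-1)
                  for _ in range(num_layers)]       # one attention map per layer

    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout                         # accumulate across layers

    print(rollout[0])  # contribution of each input token to output token 0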

  33. Future captioning for explaining potential risks
    [Ogura+ IEEE RAL & IROS2020] [Kambara+ IEEE RAL & IROS2021] [Kambara+ IEEE ICIP22]
    Motivation: Potential risks should be explained to operators in advance
    Method: Extended relational self-attention [Kim+ NeurIPS21] for future captioning
    Results: Our method outperformed MART [Lei+ ACL2020] on YouCook2 and collision explanation tasks
    “The robot might hit the hourglass. Should we cancel the action?”

  34. Qualitative results (1)
    Appropriate prediction of future events
    Reference: Robot hits the camera from above because robot tried to put the white bottle where it is
    MART [Lei+ ACL20]: Robot hits a black teapot because robot tried to put a round white bottle
    Ours: Robot hits the camera hard because robot tried to put a white jar
    [Figure: frames at t and t+1]

  35. Qualitative results (2)
    Generated appropriate descriptions for future events
    Reference: Rub flour onto the chicken dip it in egg and coat with breadcrumbs
    MART [Lei+ ACL20]: Coat the chicken in the flour
    Ours: Coat the chicken with flour and breadcrumbs (missing information)
    [Figure: frames at t-2, t-1, t, t+1]

  36. Building evaluation metrics for multimodal language generation
    ■ Automatic evaluation is essential for development of image captioning models
    ■ cf. machine translation
    ■ Human evaluation is labor-intensive for daily development
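    For illustration, a minimal sketch of machine-translation-style automatic evaluation using sentence-level BLEU from NLTK; the captions are made up, and BLEU merely stands in for automatic captioning metrics in general (it is not the metric proposed on the next slide):

    # Sketch: automatic caption evaluation with sentence-level BLEU (NLTK).
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    reference = ["a robot places a bottle on the table".split()]
    candidate = "a robot puts a bottle on the table".split()

    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")  # higher = closer n-gram overlap with the reference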

  37. JaSPICE correlates highly with human evaluations
    Background: Most metrics for Japanese do not correlate well with human evaluations
    Novelty: Graph matching based on predicate-argument structures
    Collected 22,350 samples from 100 subjects

  38. Summary

  39. Summary
    1. Foundation models for robots
    2. Multimodal language comprehension for robots
    3. Explainable AI for robots
    Acknowledgements: JSPS, JST CREST, JST Moonshot, NEDO, SCOPE, Toyota, NICT, Honda, Osaka Univ., Chubu Univ., collaborators, students, and staff