Slide 1

Slide 1 text

The Confluence of Vision, Language, and Robotics
Komei Sugiura, Keio University

Slide 2

Slide 2 text

Use cases of Vision x Language x Robotics
■ Honda CiKoMa (YouTube)
■ PaLM-E [Driess (Robotics at Google)+ 2023] https://palm-e.github.io/

Slide 3

Slide 3 text

Impact of foundation models on robotics
■ Foundation models: trained on broad data at scale and adaptable to a wide range of downstream tasks*
  ■ e.g. BERT, GPT-3, CLIP
■ Impact of foundation models on robotics: major
  ■ Robustness to unseen situations (zero-shot/few-shot)
  ■ Easily usable by non-experts in NLP/CV
■ Impact of robotics on foundation models: minor
  ■ RT-1/RT-2
  ■ (Future) Self-driving cars, automated experiments, …
*Bommasani et al., "On the Opportunities and Risks of Foundation Models", 2021.

Slide 4

Slide 4 text

Foundation models for building communication robots
Text embedding
  • Standalone: BERT, RoBERTa, DeBERTa, …
  • Cloud: text-embedding-ada-002 (OpenAI)
Speech recognition
  • Standalone: Whisper (OpenAI)
  • Cloud: smartphone UIs with proprietary cloud APIs; most robot developers prefer not to operate speech recognition servers themselves
  • Rospeex [Sugiura+ IROS15]: ROS-based; 50k unique users between 2013 and 2018
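As a rough illustration of how such standalone components can be combined on a robot PC, the sketch below transcribes an utterance with the open-source Whisper package and embeds the transcript with a BERT-style encoder. The audio path and model names are placeholder assumptions, not part of the original slide.

```python
# Minimal sketch: standalone speech recognition + text embedding on the robot side.
# Assumes the openai-whisper and transformers packages; "utterance.wav" is a placeholder path.
import torch
import whisper
from transformers import AutoModel, AutoTokenizer

asr = whisper.load_model("base")                   # standalone ASR, no cloud server needed
text = asr.transcribe("utterance.wav")["text"]     # e.g. "bring me the cup on the table"

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    out = enc(**tok(text, return_tensors="pt"))
embedding = out.last_hidden_state.mean(dim=1)      # one vector per utterance for downstream modules
print(text, embedding.shape)
```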

Slide 5

Slide 5 text

Large language models (LLMs) for generating control code
■ Code as Policies [Liang+ 22]: the LLM generates a sequence of atomic actions (sensing & action)
■ ChatGPT for Robotics [Vemprala+ 23]: humans give feedback to interactively generate control code
■ TidyBot [Wu+ AR-IROS23]: recognizes target objects with CLIP and generates code (including receptacles) with an LLM
  ■ Situation information is given manually, e.g. objects = ["yellow shirt", "black shirt", …]
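A minimal sketch of this prompting pattern follows; it is not the actual Code as Policies or TidyBot implementation. The detected object list is inserted into the prompt and the LLM is asked to emit calls to a small set of atomic robot actions. `call_llm`, `pick`, and `place` are hypothetical placeholders.

```python
# Sketch of the "LLM writes control code" pattern (hypothetical robot API, any LLM backend).
objects = ["yellow shirt", "black shirt", "soda can"]   # situation info, given manually
receptacles = ["laundry basket", "recycling bin"]

PROMPT = f"""# Available atomic actions: pick(obj), place(obj, receptacle)
# objects = {objects}
# receptacles = {receptacles}
# Task: put clothes in the laundry basket and cans in the recycling bin.
# Write Python code using only the atomic actions above.
"""

def call_llm(prompt: str) -> str:
    # Placeholder for any code-generating LLM; returns a canned plan so the sketch runs.
    return 'pick("yellow shirt"); place("yellow shirt", "laundry basket")'

generated = call_llm(PROMPT)
print(generated)   # the generated program would then be executed by the robot's controller
```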

Slide 6

Slide 6 text

CLIP [Radford+ 21]: vision-language foundation model
■ Contrastive learning on 400M image-text pairs
■ CLIP: InfoNCE loss; OTTER [Wu+ ICLR22]: optimal transport
■ Many applications (e.g. DALL·E 2)
(Figure: CLIP text/image encoders matching captions such as "a photo of a beer bottle", "satellite imagery of roundabout", "a photo of a marimba", "a meme")
https://vimeo.com/692375454
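For reference, the symmetric InfoNCE objective used by CLIP can be written in a few lines of PyTorch. This is a generic sketch over a batch of already-encoded image and text features, not the original training code.

```python
# Symmetric InfoNCE (CLIP-style) over a batch of N paired image/text features.
import torch
import torch.nn.functional as F

def clip_infonce(img_feat, txt_feat, temperature=0.07):
    img = F.normalize(img_feat, dim=-1)            # (N, D)
    txt = F.normalize(txt_feat, dim=-1)            # (N, D)
    logits = img @ txt.t() / temperature           # (N, N) similarity matrix
    targets = torch.arange(img.size(0))            # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_infonce(torch.randn(8, 512), torch.randn(8, 512))
```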

Slide 7

Slide 7 text

Object search/manipulation using CLIP
Manipulation
■ CLIPort [Shridhar+ CoRL21] / PerAct [Shridhar+ CoRL22]
  • Introduce a side network into Transporter Networks [Zeng+ CoRL20] to handle CLIP features
  • Predict 2D positions / 6D poses of the gripper
■ KITE [Sundaresan+ CoRL23]: predicts which parts of objects to grasp
Search
■ CLIP-Fields [Shafiullah+ RSS23]: object search with Detic + BERT + CLIP
■ OpenScene [Peng+ CVPR23]: open-vocabulary 3D scene understanding

Slide 8

Slide 8 text

Major approaches to encoding images with CLIP
1D feature vector
■ Easy: model.encode_image(image)
■ Positional information is lost
■ Additional information is required to handle e.g. "A is left of B"
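A minimal sketch of the 1D-vector approach with the open-source openai/CLIP package (model name and image path are placeholders):

```python
# 1D feature vector: one global embedding per image (positional information is lost).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("scene.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a red mug"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)   # (1, 512) global vector
    txt_feat = model.encode_text(text)     # (1, 512)
similarity = torch.cosine_similarity(img_feat, txt_feat)
```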

Slide 9

Slide 9 text

Major approaches to encoding images with CLIP
1D feature vector
■ Easy: model.encode_image(image)
■ Positional information is lost
■ Additional information is required to handle e.g. "A is left of B"
2D feature map
■ Extract an intermediate output from the ResNet/ViT inside CLIP (e.g. 28 x 28 x 512)
■ Typical work: CLIPort [Shridhar+ CoRL21], CRIS [Wang+ CVPR22], SAN [Mengde+ CVPR23]
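One hedged way to obtain a 2D feature map from the same package is to hook an intermediate layer of CLIP's visual backbone and reshape the patch tokens into a grid. The layer path below follows the openai/CLIP ViT implementation and is an assumption that may differ in other builds.

```python
# 2D feature map: keep per-patch features instead of the pooled global vector.
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
feats = {}

def hook(_module, _inp, out):
    feats["tokens"] = out            # shape (seq_len, batch, width) in this implementation

# Last transformer block of the visual encoder (assumed layer path).
model.visual.transformer.resblocks[-1].register_forward_hook(hook)

image = preprocess(Image.open("scene.png")).unsqueeze(0)
with torch.no_grad():
    model.encode_image(image)

tokens = feats["tokens"].permute(1, 0, 2)[:, 1:, :]   # drop the CLS token -> (1, 49, 768)
fmap = tokens.reshape(1, 7, 7, -1)                    # 224/32 = 7x7 spatial grid, positions preserved
```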

Slide 10

Slide 10 text

Generating action sequences with LLMs
■ Late fusion: PaLM-SayCan [Ahn (Google)+ 2022]
  ■ Language score (Say): estimated generation probability of each phrase
  ■ Action score (Can): estimated task success probability under the current situation
■ Early fusion: PaLM-E [Driess (Google)+ 2023]
  ■ Decomposes tasks via multimodal prompts (e.g. "Given [image], …")
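The core of the late-fusion idea can be summarized in a few lines: each candidate skill is ranked by the product of the LLM's language score and a learned affordance score. This is a schematic sketch with placeholder scoring functions, not the PaLM-SayCan implementation.

```python
# SayCan-style skill selection: combine "Say" (language) and "Can" (affordance) scores.
def language_score(instruction: str, skill: str) -> float:
    """Placeholder: LLM likelihood of `skill` as the next step for `instruction`."""
    return 1.0

def affordance_score(skill: str, state) -> float:
    """Placeholder: learned value function estimating success probability in `state`."""
    return 1.0

def select_skill(instruction, skills, state):
    scores = {s: language_score(instruction, s) * affordance_score(s, state) for s in skills}
    return max(scores, key=scores.get)

skills = ["find a sponge", "pick up the sponge", "go to the table"]
print(select_skill("bring me a sponge", skills, state=None))
```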

Slide 11

Slide 11 text

Attempts to build foundation models for robotics
RT-1 [Brohan+ (50 authors), 22]
■ 13 physical robots x 17 months of data collection
■ Image-text fusion uses FiLM [Perez+ 17]
■ Inference at 3 Hz for base/arm actions
■ RT-2 [Brohan+ 23]: a multimodal LLM predicts 6D velocities
Gato [Reed+ JMLR22]
■ A single transformer learns games, image captioning, object manipulation, etc.
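FiLM conditioning, as used for image-text fusion in RT-1, scales and shifts each channel of the visual features with parameters generated from the language embedding. Below is a generic FiLM-layer sketch, not the RT-1 code.

```python
# Generic FiLM layer: text embedding -> per-channel (gamma, beta) applied to image features.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W) visual features, text_emb: (B, text_dim)
        gamma, beta = self.to_gamma_beta(text_emb).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feat + beta[:, :, None, None]

film = FiLM(text_dim=512, num_channels=64)
out = film(torch.randn(2, 64, 28, 28), torch.randn(2, 512))
```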

Slide 12

Slide 12 text

Benchmarks for vision x language x (physical) robots
RoboCup@Home (2006-)
■ Largest benchmark for domestic service robots
HomeRobot [Yenamandra+ CoRL23]
■ Open-vocabulary mobile manipulation
■ Competition at NeurIPS 2023

Slide 13

Slide 13 text

RoboCup@Home [Iocchi+ AI15]
■ Tests are conducted in real environments
■ General Purpose Service Robot test (2010-)
  ■ As difficult as the tasks handled by RT-2 / PaLM-SayCan
  ■ Almost solved by foundation models in 2023
■ We won 1st place (2008 & 2010) and 2nd place (2009 & 2012)
L. Iocchi et al., "RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots," Artificial Intelligence, Vol. 229, pp. 258-281, 2015.

Slide 14

Slide 14 text

Challenging examples: referring expression comprehension
■ Google Bard*
  ■ Recognizes this as a "white pillow"
■ SEEM [Zou+ 23]
  ■ Masks the mirror when given "Pick up the plant in front of the mirror"
*As of July 2023

Slide 15

Slide 15 text

Multimodal Language Comprehension for Robots

Slide 16

Slide 16 text

Motivation: building robots that assist people
https://www.toyota.com/usa/toyota-effect/romy-robot.html
Social issues
• Decrease in the working-age population that supports those who need assistance
• Training an assistance dog takes two years
  • "I need to leave work to support my family…" / "I can't manage to care for an assistance dog."
What challenges arise in building language interfaces?
• If options are plentiful, touch panels may be inconvenient.

Slide 17

Slide 17 text

Milestones: what should be done, and to what extent?
■ Analyzed the tasks defined by IAADP*
  ■ Out of 108 subtasks, 50 are doable by HSR
■ Metrics
  ■ Task coverage: 80%
  ■ Success rate: 80%
*International Association of Assistance Dog Partners

Slide 18

Slide 18 text

Open-vocabulary mobile manipulation
■ "Can you fetch the baseball near the red mug cup and place it on the tall table?"
■ "Take the tomato soup can next to the orange and place it on the tall table." (video at x4 speed)

Slide 19

Slide 19 text

MultiRankIt: robots as physical search engines
Background:
■ Low task success rate with fully autonomous approaches (<30%)
■ Closed-vocabulary settings are impractical
Our approach: human-in-the-loop
■ The machine generates search results
■ The user selects the preferable option
■ Multi-level phrase-region matching (extends CLIP)
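As a simplified stand-in for the phrase-region matching step (not the actual MultiRankIt model), candidate regions can be ranked by CLIP similarity to the instruction phrase and the top results shown to the user. The crop paths below are placeholders.

```python
# Simplified ranking of pre-cropped candidate regions against an instruction phrase.
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
phrase = "the towel directly across from the sink"
crops = [Image.open(p) for p in ["cand_0.png", "cand_1.png", "cand_2.png"]]  # placeholder paths

with torch.no_grad():
    txt = model.encode_text(clip.tokenize([phrase]))
    imgs = model.encode_image(torch.stack([preprocess(c) for c in crops]))
    sims = torch.cosine_similarity(imgs, txt)        # one score per candidate region

ranking = sims.argsort(descending=True).tolist()      # shown to the user, who picks the final answer
```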

Slide 20

Slide 20 text

Successful examples for complicated referring expressions
■ Instruction: "Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink"
■ Instruction: "Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door"
(Figures: top-6 ranked candidate regions for each instruction)

Slide 21

Slide 21 text

Vision-and-language navigation simulators / datasets
Real world
■ Room2Room [Anderson+ CVPR18], REVERIE [Qi+ CVPR20]
■ Honorable Mention Award @ REVERIE Challenge 2022
Simulation
■ ALFRED [Shridhar+ CVPR20], HomeRobot, VLMbench [Zheng+ NeurIPS22]
■ 1st place in the CVPR 2023 DialFRED Challenge [Kaneda+ 23]
(Figure: VLN-BERT on Matterport3D (90 houses); example instruction: "Go to the bathroom in the bedroom with orange stripe. Make sure the faucet is not leaking.")

Slide 22

Slide 22 text

CrossMap Transformer: a masked [language, vision, path] model [Magassouba+ RAL & IROS21]
■ Task: Room2Room
■ Masks three modalities [text, image, path]
  ■ cf. UniMASK [Carroll+ NeurIPS22]
■ Data augmentation by double multimodal back-translation
Method               SR    OSR
[Majumdar+ ECCV20]   0.70  0.76
[Hao+ CVPR20]        0.69  0.78
Ours                 0.73  0.80
Human performance    0.86  0.90
(Figure: back-translation between instruction+image and path, analogous to English ↔ German back-translation)

Slide 23

Slide 23 text

Vision-and-language navigation for personal mobility
Task: Talk2Car [Rufus+ IROS21]
■ Predict the target region (stuff) based on a landmark (thing)
■ Example instruction: "pull in behind the blue van on the left side."
Proposed method
■ Trimodal architecture
■ Introduces a day-night branch to estimate mask quality
Method            Mean IoU
[Rufus+ IROS21]   32.71 ± 4.59
Ours              37.61 ± 2.73

Slide 24

Slide 24 text

Semantic segmentation masks degrade at night
(Figure: RGB images and semantic segmentation masks, day vs. night)

Slide 25

Slide 25 text

Qualitative result (1)
■ Instruction: "pull up behind the guy wearing a white shirt."
(Figure: predicted masks for GT, baseline, and ours)

Slide 26

Slide 26 text

Understanding multiple referring expressions
■ Combinatorial explosion in Carry tasks ("carry A from B to C")
  ■ Potential (target, receptacle) pairs: M × N, where M = # target candidates and N = # destination candidates
  ■ e.g. M = 200, N = 30, inference = 0.005 sec per pair → the decision requires 30 sec

Slide 27

Slide 27 text

Switching Head-tail Funnel UNITER [Korekata+ IROS23]
Novelty
■ Switching head and tail enables a single model to predict both the target and the receptacle
  ■ O(M × N) → O(M + N)
■ Task success rate: 89%
Example instruction: "Put the red chips can on the white table with the soccer ball on it."
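The complexity reduction can be illustrated schematically: instead of scoring every (target, receptacle) pair jointly, the target phrase and the receptacle phrase are each matched against their own candidate set, so the number of model calls grows as M + N. This is a sketch of the counting argument with a placeholder scorer, not the Switching Head-tail Funnel UNITER itself.

```python
# Counting model calls: pairwise scoring vs. independent target/receptacle scoring.
def score(phrase: str, candidate: str) -> float:
    """Placeholder for a multimodal matching model (one inference call)."""
    return 0.0

def pairwise(targets, receptacles, instruction):        # O(M * N) calls
    return max(((t, r) for t in targets for r in receptacles),
               key=lambda pair: score(instruction, f"{pair[0]} -> {pair[1]}"))

def head_tail(targets, receptacles, target_phrase, receptacle_phrase):   # O(M + N) calls
    best_t = max(targets, key=lambda t: score(target_phrase, t))
    best_r = max(receptacles, key=lambda r: score(receptacle_phrase, r))
    return best_t, best_r

# With M = 200, N = 30 and 0.005 s per call:
# pairwise: 200 * 30 * 0.005 = 30 s; head_tail: (200 + 30) * 0.005 ≈ 1.2 s.
```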

Slide 28

Slide 28 text

Multimodal language comprehension for fetching instructions
Prototypical contrastive transfer learning [Otsuki+ IROS23]
■ Sim2Real transfer for language comprehension
Multimodal diffusion segmentation model [Iioka+ IROS23]
■ Instruction understanding and segmentation

Slide 29

Slide 29 text

Explainable AI for Robots

Slide 30

Slide 30 text

Collected 10 million labeled images
■ Reduced manual labeling time from 8,000 days to 7 days

Slide 31

Slide 31 text

PonNet: predicting and explaining collisions [Magassouba+ Advanced Robotics 2021]
Background:
■ Predicting collisions in advance is useful for preventing damage
Novelty:
■ Extends the Attention Branch Network (ABN) to two streams integrated by a transformer
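For readers unfamiliar with ABN, the sketch below shows the general attention-branch idea only: an attention branch produces a spatial attention map that serves both as the explanation and as a gate on the features used for prediction. PonNet's actual two-stream transformer integration is more involved; all layer sizes here are placeholder assumptions.

```python
# Minimal attention-branch sketch: the attention map is both explanation and feature gate.
import torch
import torch.nn as nn

class AttentionBranchHead(nn.Module):
    def __init__(self, channels: int, num_classes: int = 2):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.cls = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(channels, num_classes))

    def forward(self, feat):
        a = self.att(feat)            # (B, 1, H, W) attention map, visualized as the explanation
        logits = self.cls(feat * a)   # e.g. collision / no-collision prediction from attended features
        return logits, a

head = AttentionBranchHead(channels=256)
logits, attention_map = head(torch.randn(1, 256, 14, 14))
```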

Slide 32

Slide 32 text

Related work
■ Attention Branch Network [Fukui+ CVPR19], Lambda ABN [Iida+ ACCV22]
  - Explanation generation using branch structures for CNNs; ABN for lambda networks
■ Attention Rollout [Abnar+ 20]
  - Explanation generation based on QKV attention
■ RISE [Petsiuk+ BMVC18]
  - Generic method for explanation generation
https://vitalab.github.io/article/2019/07/18/attention-branch-network.html
(Figures: RISE, attention rollout, and Lambda ABN explanations for solar flare prediction)
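Attention rollout itself is easy to state in code: average the attention heads, add the identity matrix to account for residual connections, renormalize, and multiply the matrices across layers. A generic sketch:

```python
# Attention rollout [Abnar+ 20]: propagate attention through layers, including residual connections.
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors with shape (heads, tokens, tokens)."""
    tokens = attentions[0].size(-1)
    rollout = torch.eye(tokens)
    for attn in attentions:
        a = attn.mean(dim=0)                     # average over heads
        a = a + torch.eye(tokens)                # residual connection
        a = a / a.sum(dim=-1, keepdim=True)      # renormalize rows
        rollout = a @ rollout                    # accumulate layer by layer
    return rollout                               # row 0 ~ CLS-token relevance to each input token

rollout = attention_rollout([torch.rand(8, 50, 50) for _ in range(12)])
```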

Slide 33

Slide 33 text

Future captioning for explaining potential risks [Ogura+ IEEE RAL & IROS20] [Kambara+ IEEE RAL & IROS21] [Kambara+ IEEE ICIP22]
■ Motivation: potential risks should be explained to operators in advance
■ Method: extended relational self-attention [Kim+ NeurIPS21] for future captioning
■ Results: our method outperformed MART [Lei+ ACL20] on YouCook2 and on collision explanation tasks
Example output: "The robot might hit the hourglass. Should we cancel the action?"

Slide 34

Slide 34 text

Qualitative results (1): appropriate prediction of future events
Reference: "Robot hits the camera from above because robot tried to put the white bottle where it is"
MART [Lei+ ACL20]: "Robot hits a black teapot because robot tried to put a round white bottle"
Ours: "Robot hits the camera hard because robot tried to put a white jar"
(Figure: frames at t and t+1)

Slide 35

Slide 35 text

Qualitative results (2): generated appropriate descriptions of future events
Reference: "Rub flour onto the chicken, dip it in egg, and coat with breadcrumbs"
MART [Lei+ ACL20]: "Coat the chicken in the flour" (missing information)
Ours: "Coat the chicken with flour and breadcrumbs"
(Figure: frames at t-2, t-1, t, t+1)

Slide 36

Slide 36 text

Building evaluation metrics for multimodal language generation
■ Automatic evaluation is essential for developing image captioning models (cf. machine translation)
■ Human evaluation is too labor-intensive for daily development

Slide 37

Slide 37 text

JaSPICE: a metric that correlates highly with human evaluations
■ Background: most metrics for Japanese do not correlate well with human evaluations
■ Novelty: graph matching based on predicate-argument structures
■ Collected 22,350 samples from 100 subjects
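JaSPICE's graph matching follows the SPICE family of metrics; the hedged sketch below shows only the final step, an F-score over matched tuples extracted from the candidate and reference graphs. The predicate-argument parsing itself is omitted and the example tuples are illustrative.

```python
# F-score over scene-graph tuples, as in SPICE-style metrics (parsing step omitted).
def tuple_f1(candidate_tuples, reference_tuples):
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

cand = {("dog", "run", None), ("dog", "color", "white")}   # tuples from the generated caption
ref = {("dog", "run", None), ("dog", "color", "brown")}    # tuples from the reference captions
print(tuple_f1(cand, ref))   # 0.5
```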

Slide 38

Slide 38 text

Summary

Slide 39

Slide 39 text

Summary
1. Foundation models for robots
2. Multimodal language comprehension for robots
3. Explainable AI for robots
Acknowledgements: JSPS, JST CREST, JST Moonshot, NEDO, SCOPE, Toyota, NICT, Honda, Osaka Univ., Chubu Univ., collaborators, students, and staff