The Confluence of Vision, Language, and Robotics

Semantic Machine Intelligence Lab., Keio Univ.


  1. Use cases of Vision x Language x Robotics

    ▪ Honda CiKoMa (YouTube)
    ▪ PaLM-E [Driess (Robotics at Google)+ 2023] https://palm-e.github.io/
  2. Impact of foundation models on robotics

    ▪ Foundation models: Trained on broad data at scale and adaptable to a wide range of downstream tasks*
      ▪ e.g. BERT, GPT-3, CLIP
    ▪ Impact of foundation models on robotics: Major
      ▪ Robustness to unseen situations (zero-shot/few-shot)
      ▪ Easily usable by non-experts in NLP/CV
    ▪ Impact of robotics on foundation models: Minor
      ▪ RT-1/2
      ▪ (Future) Self-driving cars, automated experiments, …
    *Bommasani et al., “On the Opportunities and Risks of Foundation Models”, 2021.
  3. Foundation models for building communication robots

    ▪ Text embedding
      ▪ Standalone: BERT, RoBERTa, DeBERTa, …
      ▪ Cloud: text-embedding-ada-002 (OpenAI)
    ▪ Speech recognition
      ▪ Standalone: Whisper (OpenAI)
      ▪ Cloud
        • Smartphone UI with proprietary cloud APIs
        • Most robot developers prefer not to operate speech recognition servers themselves
      ▪ Rospeex [Sugiura+ IROS15]
        • ROS-based
        • 50k unique users between 2013 and 2018
  4. Large language models (LLMs) for generating control code

    ▪ Code as Policies [Liang+ 22]: LLM generates a sequence of atomic actions (sensing & action)
    ▪ ChatGPT for Robotics [Vemprala+ 23]: Humans give feedback to interactively generate control code
    ▪ TidyBot [Wu+ AR-IROS23]: Recognizes target objects with CLIP and generates code (including receptacles) with an LLM
    ▪ Situation information is given manually
      ▪ e.g. objects = ["yellow shirt", "black shirt", ..]
  5. CLIP [Radford+ 21]: Vision-language foundation model

    ▪ Contrastive learning on 400M image-text pairs
    ▪ CLIP: InfoNCE loss; OTTER [Wu+ ICLR22]: optimal transport
    ▪ Many applications (e.g. DALL·E 2 [Ramesh+ 22])
    [Figure: example prompts (“a photo of a beer bottle”, “satellite imagery of roundabout”, “a photo of a marimba”, “a meme”); text and image are encoded into text feat. and image feat.]
    https://vimeo.com/692375454
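The contrastive objective named on this slide can be sketched as a symmetric InfoNCE loss over a batch of matched image-text pairs. This is a minimal illustration in plain Python, not CLIP's training code: the helper names are hypothetical, and the temperature of 0.07 is an assumed fixed value (CLIP learns its temperature).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def symmetric_info_nce(image_feats, text_feats, temperature=0.07):
    """Pair k of image_feats and text_feats is the positive; all others are negatives."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    n = len(imgs)
    # Cosine-similarity logits, scaled by the (assumed) temperature.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: matched pairs give a much lower loss than mismatched pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image feature is closest to its own caption's feature, which is what makes zero-shot prompts such as “a photo of a marimba” usable as classifiers.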
  6. Object search/manipulation using CLIP

    ▪ Manipulation
      ▪ CLIPort [Shridhar+ CoRL21] / PerAct [Shridhar+ CoRL22]
        • Introduces a side network to Transporter Networks [Zeng+ CoRL20] to handle CLIP features
        • Predicts 2D positions/6D poses of the gripper
      ▪ KITE [Sundaresan+ CoRL23]: Predicts which parts of objects to grasp
    ▪ Search
      ▪ CLIP-Fields [Shafiullah+ RSS23]: Object search by Detic+BERT+CLIP
      ▪ OpenScene [Peng+ CVPR23]: Open-vocabulary 3D scene understanding
  7. Major approaches to encode images by CLIP

    ▪ 1D feature vector
      ▪ Easy: model.encode_image(image)
      ▪ Positional information is lost
      ▪ Additional information is required to handle e.g. “A is left of B”
    [Figure: text and image encoded into text feat. and image feat.]
  8. Major approaches to encode images by CLIP

    ▪ 1D feature vector
      ▪ Easy: model.encode_image(image)
      ▪ Positional information is lost
      ▪ Additional information is required to handle e.g. “A is left of B”
    ▪ 2D feature map
      ▪ Extract intermediate output from the ResNet/ViT in CLIP
      ▪ e.g. 28 x 28 x 512
      ▪ Typical work: CLIPort [Shridhar+ CoRL21], CRIS [Wang+ CVPR22], SAN [Xu+ CVPR23]
    [Figure: text and image encoded into text feat. and image feat.]
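A toy illustration of why a single pooled vector cannot express “A is left of B” while a 2D feature map can. Mean pooling is used here only as an assumed stand-in (CLIP itself pools with attention pooling or a class token), and the two-channel “features” are hypothetical.

```python
def global_pool(feature_map):
    """Average an H x W x C feature map into one C-dimensional vector."""
    cells = [cell for row in feature_map for cell in row]
    channels = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(channels)]

# Hypothetical 1 x 3 x 2 feature maps: channel 0 fires on object A, channel 1 on object B.
A, B, EMPTY = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
a_left_of_b = [[A, EMPTY, B]]
b_left_of_a = [[B, EMPTY, A]]

pooled_1 = global_pool(a_left_of_b)
pooled_2 = global_pool(b_left_of_a)
# The pooled vectors are identical, although the 2D maps differ.
```

Both layouts pool to the same vector, so any spatial relation has to be recovered either from the 2D map or from additional inputs.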
  9. Generating action sequences by LLMs

    ▪ Late fusion: PaLM SayCan [Ahn (Google)+ 2022]
      ▪ Language score (Say): estimated generation prob. of phrases
      ▪ Action score (Can): estimated task success prob. under the situation
    ▪ Early fusion: PaLM-E [Driess (Google)+ 2023]
      ▪ Decomposes tasks by multimodal prompts (e.g. Given [image], …)
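The late-fusion idea can be sketched as multiplying the two scores per candidate skill and picking the best one. The skill names and score values below are made up for illustration; in the real system the Say score would come from an LLM and the Can score from a learned value function.

```python
def select_skill(say_scores, can_scores):
    """Combine language (Say) and affordance (Can) scores by multiplication."""
    combined = {skill: say_scores[skill] * can_scores[skill] for skill in say_scores}
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical scores for the instruction "bring me an apple" when no apple is in view.
say = {"pick up the apple": 0.6, "go to the table": 0.3, "pick up the sponge": 0.1}
can = {"pick up the apple": 0.1, "go to the table": 0.9, "pick up the sponge": 0.8}

best_skill, scores = select_skill(say, can)
```

The phrase the language model prefers (“pick up the apple”) loses to a skill that is actually feasible in the current situation, which is the point of fusing the two scores late.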
  10. Attempts to build foundation models for robotics

    ▪ RT-1 [Brohan+ 50 authors, 22]
      ▪ 13 physical robots x 17 months
      ▪ Image-text fusion uses FiLM [Perez+ 17]
      ▪ Inference@3Hz for base/arm actions
    ▪ RT-2 [Brohan+ 23]: Multimodal LLMs predict 6D velocity
    ▪ Gato [Reed+ TMLR22]
      ▪ Single transformer that learns games, image captioning, object manipulation, etc.
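FiLM, mentioned above as the image-text fusion mechanism, applies a per-channel scale and shift to image features, with both predicted from the conditioning (text) embedding. The sketch below shows only that operation; the weights, embedding, and function names are hypothetical, and it is not RT-1's implementation.

```python
def linear(weights, bias, x):
    """Plain affine map: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(feature_map, gamma, beta):
    """Feature-wise linear modulation: gamma[c] * x[c] + beta[c] at every spatial cell."""
    return [[[g * v + b for v, g, b in zip(cell, gamma, beta)]
             for cell in row]
            for row in feature_map]

# Hypothetical text embedding and projection weights that produce gamma and beta.
text_embedding = [1.0, -0.5]
gamma = linear([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], text_embedding)  # [2.0, 0.5]
beta = linear([[0.0, 0.0], [1.0, 0.0]], [0.0, 0.0], text_embedding)   # [0.0, 1.0]

image_features = [[[1.0, 2.0]]]  # a 1 x 1 x 2 feature map
modulated = film(image_features, gamma, beta)
```

Because the text only rescales and shifts channels, the spatial layout of the image features is preserved.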
  11. Benchmarks for vision x language x (physical) robots

    ▪ RoboCup@Home (2006-)
      ▪ Largest benchmarking test for domestic service robots
    ▪ HomeRobot [Yenamandra+ CoRL23]
      ▪ Open-vocabulary mobile manipulation
      ▪ Competition at NeurIPS23
  12. RoboCup@Home [Iocchi+ AI15]

    ▪ Includes real “real environments”
    ▪ General Purpose Service Robots test (2010-)
      ▪ As difficult as tasks handled by RT-2/PaLM SayCan
      ▪ Almost solved by foundation models in 2023
    ▪ We won first place (2008 & 2010) and second place (2009 & 2012)
    L. Iocchi et al., "RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots," Artificial Intelligence, Vol. 229, pp. 258-281, 2015.
  13. Challenging examples: Referring expression comprehension

    ▪ Google Bard*
      ▪ Recognizes this as “white pillow”
    ▪ SEEM [Zou+ 23]
      ▪ Masks the mirror given “Pick up the plant in front of the mirror”
    *As of July 2023
  14. Motivation: Building robots that assist people

    https://www.toyota.com/usa/toyota-effect/romy-robot.html
    Social issues
    • Decrease in the working-age population that supports those who need assistance
    • Training an assistance dog takes two years
    “I need to leave work to support my family…”
    “I can't manage to care for an assistance dog.”
    What challenges arise in building language interfaces?
    If options are plentiful, touch panels may be inconvenient.
  15. Milestones: What should be done and to what extent?

    ▪ Analyzed tasks defined by IAADP*
      ▪ Out of 108 subtasks, 50 subtasks are doable by HSR
    ▪ Metrics
      ▪ Task coverage: 80%
      ▪ Success rate: 80%
    *International Association of Assistance Dog Partners
  16. Open-vocabulary mobile manipulation

    ▪ “Can you fetch the baseball near the red mug cup and place it on the tall table”
    ▪ “Take the tomato soup can next to the orange and place it on the tall table”
    x4
  17. MultiRankIt: Robots as physical search engines

    Background:
    ▪ Low task success rate by fully-autonomous approaches (<30%)
    ▪ Closed vocabulary settings are impractical
    Our approach: Human-in-the-loop
    ▪ Machine generates search results
    ▪ User selects preferable option
    ▪ Multi-level phrase-region matching (CLIP extended)
  18. Successful examples for complicated referring expressions

    ▪ Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
    ▪ Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
    [Figure: for each instruction, candidate images shown as Rank: 1 to Rank: 6 …]
  19. Vision-and-language navigation simulators / datasets

    Real world
    ▪ Room2Room [Anderson+ CVPR18], REVERIE [Qi+ CVPR20]
    ▪ Honorable Mention Award@REVERIE Challenge 2022
    Simulation
    ▪ ALFRED [Shridhar+ CVPR20], HomeRobot, VLMbench [Zheng+ NeurIPS22]
    ▪ 1st place in CVPR 2023 DialFRED Challenge [Kaneda+ 23]
    [Figure labels: VLN-BERT; Matterport3D (90 houses); “Go to the bathroom in the bedroom with orange stripe. Make sure the faucet is not leaking.”]
  20. CrossMap Transformer: A masked [language, vision, path] model [Magassouba+ RAL & IROS21]

    ▪ Task: Room2Room
    ▪ Masks three modalities [text, image, path]
      ▪ cf. UniMASK [Carroll+ NeurIPS22]
    ▪ Data augmentation by double multimodal back-translation

    Method               SR    OSR
    [Majumdar+ ECCV20]   0.70  0.76
    [Hao+ CVPR20]        0.69  0.78
    Ours                 0.73  0.80
    Human performance    0.86  0.90

    [Figure labels: Instruction+image, Path, Back-translation, English, German]
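For readers unfamiliar with the SR and OSR columns, the sketch below computes success rate (the agent stops near the goal) and oracle success rate (the agent passes near the goal at any point). It assumes straight-line distance and a 3 m threshold for simplicity; benchmark implementations typically measure distance along the navigation graph, and the episodes here are made up.

```python
import math

def success_rates(episodes, threshold=3.0):
    """episodes: list of (path, goal); path is a list of (x, y) points, goal is one (x, y)."""
    successes = oracle_successes = 0
    for path, goal in episodes:
        distances = [math.dist(point, goal) for point in path]
        successes += distances[-1] <= threshold        # stopped close to the goal
        oracle_successes += min(distances) <= threshold  # came close at some point
    n = len(episodes)
    return successes / n, oracle_successes / n

# Hypothetical episodes: one success, one overshoot, one failure.
episodes = [
    ([(0, 0), (5, 0)], (5, 0)),
    ([(0, 0), (5, 0), (10, 0)], (5, 0)),
    ([(0, 0), (1, 0)], (10, 0)),
]
sr, osr = success_rates(episodes)
```

OSR is always at least as high as SR, which matches the ordering of the two columns in the table above.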
  21. Vision-and-language navigation for personal mobility

    Task: Talk2Car [Rufus+ IROS21]
    ▪ Predict target region (stuff) based on landmark (thing)
    Proposed method
    ▪ Trimodal architecture
    ▪ Introduces day-night branch to estimate mask quality

    Method           Mean IoU
    [Rufus+ IROS21]  32.71±4.59
    Ours             37.61±2.73

    Example instruction: “pull in behind the blue van on the left side.”
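The Mean IoU metric reported above can be sketched for binary region masks as follows. This is a generic illustration with made-up masks, not the evaluation script used for the numbers on the slide (which are percentages).

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    pairs = [(p, t) for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

def mean_iou(mask_pairs):
    return sum(iou(pred, target) for pred, target in mask_pairs) / len(mask_pairs)

# Hypothetical 2 x 2 masks.
pred_a, target_a = [[1, 1], [0, 0]], [[1, 0], [1, 0]]   # IoU = 1/3
pred_b, target_b = [[0, 1], [0, 1]], [[0, 1], [0, 1]]   # IoU = 1
score = mean_iou([(pred_a, target_a), (pred_b, target_b)])
```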
  22. Understanding multiple referring expressions

    ▪ Combinatorial explosion in Carry tasks (“carry A from B to C”)
    ▪ Potential pairs of (target, receptacle): $M \times N$
      ▪ e.g. $M = 200$, $N = 30$, inference = 0.005 sec → decision requires 30 sec
    $M$: # target candidates
    $N$: # destination candidates
  23. Switching Head-tail Funnel UNITER [Korekata+ IROS23]

    Novelty
    ▪ Switching head-tail enables a single model to predict both target and receptacle
    ▪ $O(M \times N) \rightarrow O(M + N)$
    ▪ Task success: 89%
    Example instruction: “Put the red chips can on the white table with the soccer ball on it.”
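A quick check of the arithmetic behind the complexity reduction, using the candidate counts from the earlier example (200 targets, 30 destinations, 0.005 sec per inference). The assumption that each inference costs the same in both schemes is ours, made only to illustrate the scaling.

```python
def decision_time_pairwise(num_targets, num_destinations, sec_per_inference):
    """Score every (target, receptacle) pair: M x N inferences."""
    return num_targets * num_destinations * sec_per_inference

def decision_time_separate(num_targets, num_destinations, sec_per_inference):
    """Score targets and receptacles independently: M + N inferences."""
    return (num_targets + num_destinations) * sec_per_inference

pairwise = decision_time_pairwise(200, 30, 0.005)   # about 30 sec
separate = decision_time_separate(200, 30, 0.005)   # about 1.15 sec
```

Under this assumption the decision time drops from roughly 30 seconds to roughly 1 second, which is the difference between an unusable and a usable interactive system.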
  24. Multimodal language comprehension for fetching instructions

    ▪ Prototypical contrastive transfer learning [Otsuki+ IROS23]
      ▪ Sim2Real for language comprehension
    ▪ Multimodal diffusion segmentation model [Iioka+ IROS23]
      ▪ Instruction understanding and segmentation
  25. PonNet: Predicting and explaining collisions [Magassouba+ Advanced Robotics 2021]

    Background:
    ▪ Predicting collisions in advance helps prevent damage
    Novelty:
    ▪ Extends ABN to two streams integrated by a transformer
  26. Related Works

    ▪ Attention Branch Network [Fukui+, CVPR19], Lambda ABN [Iida+ ACCV22]
      - Explanation generation using branch structures for CNNs
      - ABN for lambda networks
    ▪ Attention Rollout [Abnar+, 20]
      - Explanation generation method based on QKV attention
    ▪ RISE [Petsiuk+, BMVC18]
      - Generic method for explanation generation
    https://vitalab.github.io/article/2019/07/18/attention-branch-network.html
    [Figure panels: RISE, Attention rollout, Lambda ABN, explanation generation for solar flare prediction]
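Attention rollout, listed above, can be sketched as follows: each layer's (head-averaged) attention matrix is mixed with the identity to account for residual connections, and the results are multiplied across layers. The 2-token attention matrices below are made up, and this is a minimal sketch rather than a reference implementation.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def attention_rollout(layers):
    """layers: attention matrices ordered from the first to the last layer,
    each row-stochastic and already averaged over heads."""
    n = len(layers[0])
    rollout = [[float(i == j) for j in range(n)] for i in range(n)]
    for attention in layers:
        # 0.5 * A + 0.5 * I models the residual connection and keeps rows summing to 1.
        mixed = [[0.5 * attention[i][j] + 0.5 * (i == j) for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout

# Hypothetical two-layer, two-token example.
layers = [
    [[0.5, 0.5], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
rollout = attention_rollout(layers)
```

Row i of the result estimates how much each input token contributes to output token i after all layers, which is what gets visualized as an explanation map.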
  27. Future captioning for explaining potential risks [Ogura+ IEEE RAL & IROS2020] [Kambara+ IEEE RAL & IROS2021] [Kambara+ IEEE ICIP22]

    Motivation: Potential risks should be explained to operators in advance
    Method: Extended relational self attention [Kim+ NeurIPS21] for future captioning
    Results: Our method outperformed MART [Lei+ ACL2020] on YouCook2 and collision explanation tasks
    Example: "The robot might hit the hourglass. Should we cancel the action?"
  28. Qualitative results (1) Appropriate prediction of future events

    ▪ Reference: Robot hits the camera from above because robot tried to put the white bottle where it is
    ▪ MART [Lei, ACL20]: Robot hits a black teapot because robot tried to put a round white bottle
    ▪ Ours: Robot hits the camera hard because robot tried to put a white jar
    [Figure: frames at t and t+1]
  29. Qualitative results (2) Generated appropriate descriptions for future events

    ▪ Reference: Rub flour onto the chicken dip it in egg and coat with breadcrumbs
    ▪ MART [Lei, ACL20]: Coat the chicken in the flour
    ▪ Ours: Coat the chicken with flour and breadcrumbs
    [Figure: frames at t-2, t-1, t, t+1; annotation: “Missing information”]
  30. Building evaluation metrics for multimodal language generation

    ▪ Automatic evaluation is essential for development of image captioning models
      ▪ cf. machine translation
    ▪ Human evaluation is labor-intensive for daily development
  31. JaSPICE correlates highly with human evaluations

    Background: Most metrics for the Japanese language do not correlate well with human evaluations
    Novelty: Graph matching based on predicate-argument structure
    Collected 22,350 samples from 100 subjects
  32. Summary

    1. Foundation models for robots
    2. Multimodal language comprehension for robots
    3. Explainable AI for robots
    Acknowledgements: JSPS, JST CREST, JST Moonshot, NEDO, SCOPE, Toyota, NICT, Honda, Osaka Univ., Chubu Univ., Collaborators, Students, Staff