Semantic Machine Intelligence for Domestic Service Robots

Presented at SNL2022

Transcript

1. Motivation: Building domestic service robots (DSRs) that assist people
   https://www.toyota.com/usa/toyota-effect/romy-robot.html
   What would be the difficulties in building a speech interface for DSRs? If there are many candidates, a touch panel is inconvenient.
   Social issues
   • Decrease in the working-age population that supports those who need assistance
   • Training an assistance dog takes two years ("I need to quit my job to take care of my family…", "I cannot take care of an assistance dog")
2. Typical frustrating dialog
   User intent: "Put the coffee cup on the right side of the third tier of the largest shelf in the kitchen."
   User: "Please put the cup away."
   Robot: "Which cup? To where? Which shelf in the kitchen? Which tier of the shelf? ..."
   User: (Too many questions…)
3. Q. What would be the metrics for evaluating DSRs?
   A. We can evaluate DSRs by using assistance dog tasks*
   • Assistance dog tasks are clearly defined: out of 108 tasks, 50 are physically doable for DSRs
   • 80% of the tasks can be covered by retrieve, carry, open/close, and follow
   *International Association of Assistance Dog Partners
4. Related work: Multimodal language understanding in robotics
   • Bayesian (e.g. [Kollar+ HRI2010 Best Paper] [Sugiura+ Interspeech2009])
   • DNN (e.g. [Anderson+ CoRL2020] [Hatori+ ICRA2018 Best Paper] [Embodied AI challenge])
5. MTCM with Attention Branches [Magassouba+ IEEE RAL & IROS2019] [Magassouba+ IEEE RAL & ICRA2020]
   Task
   • Multimodal instruction understanding for fetching tasks
   Methods
   • Extend the Attention Branch Network (ABN) [Fukui+ CVPR2019]
   Results
   • Our method outperformed [Hatori+ ICRA2018 Best Paper]
   • Accuracy comparable to humans

   Method                       Accuracy [%]
   [Hatori et al., ICRA18]      88.0
   MTCM [Magassouba+ IROS19]    88.8±0.43
   Ours (LAB only)              89.2±0.43
   Ours (LAB+TAB only)          89.6±0.28
   Ours (full)                  90.1±0.47
   Human                        90.3±2.01
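   To make the ABN idea concrete, here is a minimal sketch (not the authors' MTCM code; module names, sizes, and the residual weighting are illustrative assumptions): an attention branch predicts both a class score and an attention map, and the perception branch reuses that map to weight its features, so the map can also be visualized as an explanation.

```python
import torch
import torch.nn as nn

class AttentionBranchHead(nn.Module):
    """Minimal ABN-style head (sketch): the attention branch predicts class
    scores and a 1-channel attention map; the perception branch multiplies
    its features by that map before making its own prediction."""
    def __init__(self, in_ch: int = 512, num_classes: int = 10):
        super().__init__()
        # Attention branch: class logits (via global average pooling) and attention map
        self.att_conv = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        self.att_map_conv = nn.Conv2d(num_classes, 1, kernel_size=1)
        # Perception branch: classifier on attention-weighted features
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, feat):                                     # feat: (B, C, H, W)
        att_logits = self.att_conv(feat)                         # (B, num_classes, H, W)
        att_out = att_logits.mean(dim=(2, 3))                    # attention-branch prediction
        att_map = torch.sigmoid(self.att_map_conv(att_logits))   # (B, 1, H, W)
        weighted = feat * att_map + feat                         # residual attention weighting
        percep_out = self.fc(weighted.mean(dim=(2, 3)))          # perception-branch prediction
        # Both outputs are trained with the task loss; att_map is visualizable.
        return percep_out, att_out, att_map
```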
6. Qualitative results on the PFN-PIC dataset
   • "Take the green ball from the lower left box to the lower right box" (but there is no green ball)
   • "Pick the white plastic bottle and put it in the right box" (but there are two bottles)
7. Target-dependent UNITER: Instruction understanding for domestic service robots [Ishikawa+ RAL & IROS 2021]
   Task
   • Understanding instructions such as "Go get an empty plastic bottle from the kitchen shelf"
   Approach
   • Extend a UNITER [Chen+ ECCV20]-based Transformer by introducing a target embedder
   Results
   • Our method outperformed [Magassouba+ IROS2020]
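   A rough sketch of what a "target embedder" can look like in a UNITER-style encoder (hypothetical class and parameter names, not the paper's implementation): the candidate target region gets its own embedding and is prepended to the text and context-region tokens before the shared Transformer.

```python
import torch
import torch.nn as nn

class TargetDependentEncoder(nn.Module):
    """Sketch of a UNITER-style multimodal encoder with a dedicated target
    embedder: text tokens, context-region features, and the candidate target
    region are projected into one token sequence for a shared Transformer."""
    def __init__(self, d_model=768, vocab_size=30522, region_dim=2048):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)   # context regions
        self.target_proj = nn.Linear(region_dim, d_model)   # target embedder
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.score = nn.Linear(d_model, 1)  # is this region the referred target?

    def forward(self, token_ids, region_feats, target_feat):
        # token_ids: (B, L), region_feats: (B, N, 2048), target_feat: (B, 2048)
        text = self.text_emb(token_ids)
        regions = self.region_proj(region_feats)
        target = self.target_proj(target_feat).unsqueeze(1)   # (B, 1, d_model)
        tokens = torch.cat([target, text, regions], dim=1)
        hidden = self.encoder(tokens)
        return self.score(hidden[:, 0])   # score read from the target token
```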
8. Vision-and-language navigation tasks
   ① Room-to-Room [Anderson+ CVPR18]
   • Difficulty: long sentences
   • E.g. "Walk through the bedroom and out of the door into the hallway. Walk down the hall along the banister rail … bedroom with a round mirror on the wall and butterfly sculpture."
   ② ALFRED [Shridhar+ CVPR20]
   • Difficulty: subgoals are not specified
   • E.g. "Put a clean apple on a wooden table" = Pick up an apple + Wash the apple in the sink + Put the apple on a wooden table
9. Understanding longer sentences by combining language understanding and generation [Magassouba+ RAL & IROS 2021]
   Task: Room-to-Room [Anderson+ CVPR18]
   Approach: CrossMap Transformer
   • Data augmentation by multimodal double back-translation (see the sketch below)
   • Crossmodal Masked Path Transformer
   Results
   • Our method outperformed [Hao+ CVPR20] and VLN-BERT [Majumdar+ (Facebook) ECCV20]
   (Figure: back-translation between text+image instructions and paths)
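   As a simplified illustration of back-translation-style data augmentation (the functions text_to_path and path_to_text are hypothetical stand-ins for a follower and a speaker model; the paper's multimodal double back-translation and its filtering are more involved):

```python
from typing import Callable, List, Tuple

def double_back_translate(
    pairs: List[Tuple[str, list]],
    text_to_path: Callable[[str], list],   # follower model: instruction -> path (assumed)
    path_to_text: Callable[[list], str],   # speaker model: path -> instruction (assumed)
) -> List[Tuple[str, list]]:
    """Augment (instruction, path) pairs in both directions:
    a pseudo path for each original instruction, and a pseudo
    instruction for each original path. Keep only non-empty outputs."""
    augmented = []
    for instruction, path in pairs:
        new_path = text_to_path(instruction)   # instruction -> pseudo path
        new_text = path_to_text(path)          # path -> pseudo instruction
        if new_path:
            augmented.append((instruction, new_path))
        if new_text:
            augmented.append((new_text, path))
    return pairs + augmented
```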
10. Understanding mobile manipulation instructions by HLSM-MAT [Ishikawa+ IEEE ICPR22]
    Approach
    • Introduced Moment-based Adversarial Training applied to (text, action history, environmental states)
    Results
    • Our method outperformed [Zhang+ ACL21] and [Blukis+ (NVIDIA) CoRL21]

    Method              SR (unknown env.)   SR (known env.)
    [Zhang+ ACL21]      11.12               13.63
    [Blukis+ CoRL21]    20.27               29.94
    Ours                21.84               33.01
    Human performance   91.00               -

    • Our method successfully predicted the unspecified subgoal "put down the knife you used" for the instruction "Place a cooked potato slice in the fridge," whereas the existing method failed
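    For intuition, here is a generic embedding-space adversarial-training step (a plain FGSM-style perturbation, not the paper's moment-based formulation; all names are hypothetical):

```python
import torch

def adversarial_step(model, embeddings, targets, loss_fn, epsilon=1e-2):
    """One generic adversarial-training step on multimodal embeddings (sketch):
    perturb the fused (text, action history, state) embedding in the direction
    that increases the loss, then train on clean + perturbed inputs."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), targets)
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    # FGSM-style perturbation; the paper's moment-based variant instead shapes
    # the perturbation using feature statistics rather than a fixed epsilon.
    perturbed = embeddings + epsilon * grad.sign()
    adv_loss = loss_fn(model(perturbed.detach()), targets)
    return clean_loss + adv_loss
    # Usage: loss = adversarial_step(...); loss.backward(); optimizer.step()
```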
11. Qualitative results
    • "Place the two pillows on the sofa" → predicted subgoals: PickUp, Put, PickUp, Put
    • "Examine a cup under a lamp" → predicted subgoals: PickUp, ToggleOn
    Appropriate subgoals were predicted
12. Using photorealistic simulation to collect training data
    • Many robotics studies use small datasets, so the benefit of refining DNN structures is unclear
    • "I don't have many robots. What should I do?" → Use simulation (e.g. Isaac Sim by NVIDIA)
    • Other examples: DeepDrive in Universe [OpenAI, 2017], Neuromation
13. Semi-photorealistic simulator: we can collect 10 million annotated images in one week
    • Examples: automatic segmentation, object detection
    • Automatic annotation: 7 days vs. manual annotation: 8,000 days
    • Data collection under various conditions
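    A toy example of why simulation makes annotation essentially free (generic NumPy code, not tied to any particular simulator API): given a rendered instance-segmentation mask, bounding-box labels for object detection can be derived with no human effort.

```python
import numpy as np

def boxes_from_instance_mask(mask: np.ndarray) -> dict:
    """Derive 2D bounding-box annotations from a simulator's instance mask
    (H x W array of integer instance IDs; 0 = background). Each box is
    (x_min, y_min, x_max, y_max) in pixel coordinates."""
    boxes = {}
    for instance_id in np.unique(mask):
        if instance_id == 0:
            continue
        ys, xs = np.nonzero(mask == instance_id)
        boxes[int(instance_id)] = (int(xs.min()), int(ys.min()),
                                   int(xs.max()), int(ys.max()))
    return boxes
```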
14. PonNet: Predicting and explaining collisions [Magassouba+ Advanced Robotics 2021]
    Background
    • Predicting/explaining the consequences of actions in advance is useful to prevent damaging collisions
    Technical points
    • Attention Branch Network (ABN) [Fukui+ CVPR19] is extended to RGB and depth
    • Semi-photorealistic simulation
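    A minimal sketch of the two-stream idea (layer sizes and the fusion scheme are placeholder assumptions, not PonNet's actual architecture): RGB and depth are encoded separately, each stream produces its own visualizable attention map, and the fused features predict collision.

```python
import torch
import torch.nn as nn

class TwoStreamCollisionNet(nn.Module):
    """Sketch of an RGB + depth collision predictor with per-stream
    ABN-style attention maps. Layer sizes are placeholders."""
    def __init__(self, ch=32):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU())
        self.depth_enc = nn.Sequential(nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_att = nn.Conv2d(ch, 1, 1)     # attention map over RGB features
        self.depth_att = nn.Conv2d(ch, 1, 1)   # attention map over depth features
        self.head = nn.Linear(2 * ch, 2)       # collision / no-collision logits

    def forward(self, rgb, depth):
        fr, fd = self.rgb_enc(rgb), self.depth_enc(depth)
        ar = torch.sigmoid(self.rgb_att(fr))     # visualizable attention (RGB)
        ad = torch.sigmoid(self.depth_att(fd))   # visualizable attention (depth)
        fr = fr * ar + fr                        # residual attention weighting
        fd = fd * ad + fd
        pooled = torch.cat([fr.mean(dim=(2, 3)), fd.mean(dim=(2, 3))], dim=1)
        return self.head(pooled), ar, ad
```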
15. Quantitative results: collision prediction

    Method               Simulation    Physical robot
    Plane detection      82.50         83.00
    PonNet               90.94±0.22    78.30±6.10
    Transformer PonNet   91.19±0.35    83.60±1.20

    Results
    • A rule-based approach to collision prediction is no longer necessary
    • Collision-related regions and safe regions can be visualized
16. Benchmarking tests for domestic service robots (DSRs)
    RoboCup@Home
    • Largest competition for DSRs, with 20-30 participating teams
    • Won 1st place (2008, 2010) and 2nd place (2009, 2011)
    World Robot Summit (WRS 2018) Partner Robot Challenge Virtual Space
    • Randomly generated conditions (start, goal, object placement, etc.)
    • Won 1st place (prize money: 74k USD)
    Limitation: instructions are manually given and fixed
17. Crossmodal instruction generation toward automatic task generation [Ogura+ IEEE RAL & IROS2020] [Kambara+ IEEE RAL & IROS2021]
    Motivation
    • On-the-fly instruction generation according to randomly generated situations
    Approach
    • Extended the Object Relation Transformer [Herdade+ NeurIPS19]
    Example generated instruction: "Bring me the small item on the right-sided armchair"
18. Summary
    1. Robot and language
    2. Multimodal language understanding
    3. Semi-photorealistic robot simulation
    4. Crossmodal language generation

    Acknowledgment: Chubu Univ., Osaka Univ., Toyota Motor Corporation, JSPS, JST, NEDO