via Foundation Models.” CoRL, 2023.
[4] A. Oyama et al. “ECRAP: Exophora Resolution and Classifying User Commands for Robot Action Planning by Large Language Models.” IEEE IRC, pp. 1–8, 2024.
[5] A. Oyama et al. “Exophora Resolution of Linguistic Instructions with a Demonstrative based on Real-World Multimodal Information.” IEEE RO-MAN, pp. 2617–2623, 2023.

GIRAF: Action planning from images containing gestures, based on queries with demonstratives.
Problem: Both the user's gestures and the target objects must be contained within the image.

ECRAP [4]: Combines exophora resolution [5], task classification, and LLM-based action planning from various queries.
Problems:
• Does not handle the case where the user is not visible from the robot
• Does not handle visual attributes in the query (color or shape)
• Does not handle the case where the query lacks the target class
To achieve robust real-world exophora resolution that is resilient to incomplete observational data.

[Figure: (c) Interactive Questioning. Query: “Take that for me.” Candidate objects: Cup, Cup, Cup (Target), Bottle, Doll. “Which object is ‘that’?” Question: “What color is the object?” Answer: “Brown.” Task complete.]
Multimodal Interactive Exophora resolution with user Localization (MIEL)

[Figure: MIEL system overview. Example query: “Take that cup to kitchen” (demonstrative: “that”). Labeled components: SentenceBERT, CLIP's Text Encoder, RGB-D Image, Skeleton Detection, Objects Position, User Body Coordinates, User's Speech, Sound Source Localization, User Direction, Demonstrative Region-based Estimator, Pointing Direction-based Estimator, Linguistic Query-based Estimator, GPT-4o Interactive Questioning. Output: Target Object Probability [0.14, 0.41, …, 0.01, 0.02]; Identify Target Object (ID: 2, Class: Cup, Probability: 0.41). User's Answer: “The color of the cup is brown.”]
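The overview lists three estimators that each score the candidate objects, a single target object probability vector, and an interactive questioning step. The sketch below shows one way such scores could be combined and when a clarifying question might be triggered. It is a minimal illustration under stated assumptions, not the authors' method: the element-wise product fusion, the function names `fuse_estimates` and `needs_question`, and the `margin` threshold are all hypothetical, since the poster text does not state the fusion rule or the questioning criterion.

```python
import numpy as np


def fuse_estimates(region, pointing, language):
    """Combine three per-object probability vectors into one distribution.

    Assumption (not from the poster): scores are fused by element-wise
    product and then renormalised to sum to 1.
    """
    stacked = np.vstack([region, pointing, language]).astype(float)
    fused = stacked.prod(axis=0)
    total = fused.sum()
    if total == 0.0:
        # Every candidate was ruled out by some estimator: fall back to uniform.
        return np.full(fused.shape, 1.0 / fused.size)
    return fused / total


def needs_question(probs, margin=0.2):
    """Hypothetical rule: ask the user when the top two candidates are close."""
    probs = np.asarray(probs, dtype=float)
    if probs.size < 2:
        return False
    second, first = np.sort(probs)[-2:]
    return bool(first - second < margin)


if __name__ == "__main__":
    fused = fuse_estimates([0.5, 0.5], [0.8, 0.2], [0.5, 0.5])
    print(fused, needs_question(fused))
```

A product rule lets any single estimator veto a candidate, which is why the zero-sum fallback is included; a weighted sum would be a softer alternative.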
Environment Settings

… classes in images

[8] B. Chen et al. “Open-Vocabulary Queryable Scene Representations for Real World Planning.” IEEE ICRA, pp. 11509–11522, 2023.
Comparison Methods

• Human, Human (w/o Interaction)
• If the speaker is not visible, it is possible to face the direction of the speaker
• If the target cannot be identified, it is possible to ask the speaker questions (up to one time)
• Human (w/o Interaction) shows the results when predicting the target object without asking questions

[11] J. Hu et al. “VGPN: Voice-Guided Pointing Robot Navigation for Humans.” IEEE ROBIO, pp. 1107–1112, 2018.
[6] A. Oyama et al. “ECRAP: Exophora Resolution and Classifying User Commands for Robot Action Planning by Large Language Models.” IEEE IRC, pp. 1–8, 2024.
Results: Case where the user is visible from the robot's position

[Table: columns Level …, Level …, Total; rows VGPN [11], ECRAP [6], MIEL (ours), Human, Human (w/o Interaction). Annotations: ×1.2, ×0.54.]

• MIEL showed … times higher performance than ECRAP in SR (Top-…)
• In Level … queries, MIEL could supplement information lacking in the language queries by using HRI
• Compared to Human and Human (w/o Interaction), MIEL achieved approximately … SR (Top-…)

SR = (1 / Trials) × Σ (Success or False)
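The success rate (SR) reported in the results is the fraction of trials that ended in success. As a minimal sketch, assuming each trial is recorded as a boolean (the function name `success_rate` and the boolean encoding are illustrative choices, not taken from the poster), it can be computed as follows:

```python
def success_rate(outcomes):
    """Return the fraction of successful trials.

    `outcomes` is an iterable of truthy/falsy values, one per trial
    (an assumed encoding: True for success, False for failure).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one trial is required")
    successes = sum(1 for ok in outcomes if ok)
    return successes / len(outcomes)


if __name__ == "__main__":
    # Four hypothetical trials, three of them successful.
    print(success_rate([True, False, True, True]))
```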
Software Development Environment [1]

[1] L. El Hafi et al. “Software Development Environment for Collaborative Research Workflow in Robotic System Integration.” Advanced Robotics, 2022.