Slide 1

Slide 1 text

Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots
Hikaru Nakagawa1, Shoichi Hasegawa1,*, Yoshinobu Hagiwara2,1, Akira Taniguchi1, Tadahiro Taniguchi3,1
1 Ritsumeikan Univ., 2 Soka Univ., 3 Kyoto Univ. (* corresponding author)
IEEE International Conference on Systems, Man, and Cybernetics (SMC2024)
Session: Cyber-Physical Systems and Robotics 2 (10/09: 12:20-12:40), ID: 1408

Slide 2

Slide 2 text

Research Background 2
People often give instructions with pointing. Pointing provides important information for identifying the referred object [1], but it is difficult to know when the pointing is given. Therefore, the robot needs to capture the moment when the person points (the pointing frame).
Example instruction with a pointing frame: "Put this stuffed shark away."
[1] N. Kotani et al. "Point Anywhere: Directed Object Estimation from Omnidirectional Images." ACM SIGGRAPH, 2023.

Slide 3

Slide 3 text

Previous Research: Estimating Pointing Frames 3
Estimates the rectangular area of a reference object and a canonical frame (an image that contains a user-directed reference object and in which the reference object is uniquely determined from the image and text) [2].
• Built a paired image-text dataset for training and evaluation
• Used a Transformer [3] for temporal optimization
Gestures unrelated to instructions are not addressed, which may reduce estimation accuracy. We address this issue by improving the pointing frame estimator of the previous work.
[2] Y. Chen et al. "YouRefIt: Embodied Reference Understanding with Language and Gesture." IEEE/CVF ICCV, 2021.
[3] A. Vaswani et al. "Attention Is All You Need." NeurIPS, 2017.

Slide 4

Slide 4 text

Research Idea 4
Use human speech to estimate the moment when a person points, even when gestures unrelated to the instruction are made.
• For example, suppose a person stretches. Without speech information, the robot may misrecognize the stretch as pointing.
• With speech information, however, the robot can infer that the person is not pointing when no accompanying utterance such as "Ah… get me some tea." is heard.
(Figure: over time t, the pointing frame estimator outputs "no pointing frame" for the stretch and "pointing frame" for the gesture accompanied by "Ah… get me some tea.")

Slide 5

Slide 5 text

Research Purpose 5
To verify to what extent the performance of pointing frame estimation can be enhanced by integrating speech data during human-robot interactions involving gestures and language.
(Figure: from observations of speech, images, and recognized text such as "Ah, I'm tired. I'll take a break. Please get me some tea from the kitchen.", the robot estimates the pointing frame.)

Slide 6

Slide 6 text

Proposed Model 6

Slide 7

Slide 7 text

Proposed Model 7

Slide 8

Slide 8 text

Feature Fusion Module 8
Example utterance: "Ah, I'm tired. I'll take a break. Please get me some tea from the kitchen."
• MSI-Net [4]: a neural network that highlights and visualizes where people point (saliency heatmap).
• OpenPose [5]: a neural network for human skeleton detection from images.
• Visual Encoder: Darknet-53 [7] trained on MS-COCO [6].
• Textual Encoder: the Japanese pre-trained BERT model [8].
By combining the skeleton detection results, speech, and image features, information about user instructions can be represented appropriately (see the sketch below).
[4] A. Kroner et al. "Contextual Encoder–Decoder Network for Visual Saliency Prediction." Neural Networks, 2020.
[5] Z. Cao et al. "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." IEEE/CVF CVPR, 2017.
[6] T.-Y. Lin et al. "Microsoft COCO: Common Objects in Context." ECCV, 2014.
[7] J. Redmon et al. "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767, 2018.
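As a rough illustration of the fusion step (a sketch under assumed feature dimensions, not the authors' exact implementation), the snippet below concatenates per-frame saliency, skeleton, visual, and text features into a single instruction representation:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Minimal sketch of a feature-fusion module (dimensions are assumptions, not from the paper)."""
    def __init__(self, d_saliency=256, d_skeleton=75, d_visual=1024, d_text=768, d_out=512):
        super().__init__()
        d_in = d_saliency + d_skeleton + d_visual + d_text
        self.proj = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())

    def forward(self, saliency, skeleton, visual, text):
        # Each input: (batch, d_*) features extracted for one video frame.
        fused = torch.cat([saliency, skeleton, visual, text], dim=-1)
        return self.proj(fused)  # (batch, d_out) fused instruction representation
```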

Slide 9

Slide 9 text

Proposed Model 9

Slide 10

Slide 10 text

Pointing Frame Estimator 10
We compare two methods capable of integrating time series data within the pointing frame estimator: a Transformer [3] and a Bidirectional LSTM (Bi-LSTM) [9].
• The Transformer has a large number of parameters and requires a large training dataset, so it is fine-tuned with data collected in the real environment.
• The Bi-LSTM does not require a large dataset and can be trained from scratch.
• Our study inputs the concatenation of the features obtained from the rectangular estimator and the audio features (see the sketch below).
[3] A. Vaswani et al. "Attention Is All You Need." NeurIPS, 2017.
[9] A. Graves et al. "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures." Neural Networks, 2005.
https://medium.com/@souro400.nath/why-is-bilstm-better-than-lstm-a7eb0090c1e4
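A minimal sketch of a Bi-LSTM frame-level classifier over the concatenated features (the hyperparameters and dimensions are assumptions, not the paper's settings):

```python
import torch
import torch.nn as nn

class BiLSTMPointingEstimator(nn.Module):
    """Sketch: classify each frame of a feature sequence as pointing / not pointing."""
    def __init__(self, d_in=576, d_hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d_hidden, 1)  # per-frame binary logit

    def forward(self, x):
        # x: (batch, time, d_in) = rectangle-estimator features concatenated with audio features
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)  # (batch, time) logits; sigmoid > 0.5 => pointing frame
```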

Slide 11

Slide 11 text

Experiment 11
• Purpose: to verify to what extent the model with additional speech information improves pointing frame estimation in a scenario assuming actual communication between a person and a robot.
• Environment (shown in the figure)

Slide 12

Slide 12 text

Conditions of User Instructions 12
1. Speech containing content other than referential expressions
  • "I'm thirsty. Go get the plastic bottle on the desk."
2. Only upright-and-pointing movements, and movements that are non-upright or include actions other than pointing
3. Reference object inside or outside the image
4. How the object is named
  • Demonstrative and object name ("this remote controller", "that bottle", "that book")
  • Only object name ("cup", "stuffed doll")
  • Only demonstrative (co-series (this), a-series (that), so-series (the))

Table. The number of data for each condition
Placement of reference object | Only upright and pointing | Includes actions other than pointing
Inside of image               | 100                       | 100
Outside of image              | 100                       | 100
For each condition: Train 70, Validation 10, Test 20.

Slide 13

Slide 13 text

Examples of Data Collected in Experiment 13
Japanese speech was used in the experiment. Example utterances (translated), with whether the reference object is in the image and whether the user only stands upright and points:
• "I think I'll read. Take that book." (reference object in the image; only upright and pointing)
• "Oh, I'm almost done reading this book. Take the book there." (reference object in the image; includes actions other than upright and pointing)
• "It was delicious. Put this away." (reference object not in the image; only upright and pointing)
• "Thanks for the food. It was delicious. Please put this plate away." (reference object not in the image; includes actions other than upright and pointing)

Slide 14

Slide 14 text

Comparison Methods 14
Baseline (Chen et al. [2])
• Dataset collected in the real environment
• Without speech information
• Transformer
Our model
• We compare a total of six models combining two conditions:
  • Speech information: Speech Segments, Mel-Frequency Cepstral Coefficients (MFCCs), or Mel-Spectrogram
  • Pointing frame estimator: Bi-LSTM or Transformer
Speech Segments
• A threshold is set on the speech waveform
• Each frame is labeled as a speech segment (1) or not (0) depending on whether the threshold is exceeded (see the sketch below)
[2] Y. Chen et al. "YouRefIt: Embodied Reference Understanding with Language and Gesture." IEEE/CVF ICCV, 2021.
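A minimal sketch of the threshold-based speech-segment feature described above (the 5 Hz frame rate matches the processing pipeline; the threshold value is an assumption):

```python
import numpy as np

def speech_segment_flags(waveform, sr, frame_rate=5, threshold=0.02):
    """Label each video frame 1 if its 0.2 s audio window exceeds an amplitude threshold.

    waveform: 1-D float array of audio samples; sr: sampling rate in Hz.
    frame_rate=5 gives 0.2 s windows; threshold is an assumed amplitude level.
    """
    hop = int(sr / frame_rate)                      # samples per 0.2 s window
    n_frames = len(waveform) // hop
    flags = np.zeros(n_frames, dtype=np.int64)
    for i in range(n_frames):
        window = waveform[i * hop:(i + 1) * hop]
        flags[i] = int(np.max(np.abs(window)) > threshold)
    return flags                                    # e.g., [0, 1, 1, ...]
```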

Slide 15

Slide 15 text

Result 15
Evaluation item: F1 score = 2 · Precision · Recall / (Precision + Recall)
Result
• Since the speech data was complex, higher-order speech features such as the Mel-Spectrogram and MFCCs may have been more effective.
• The experiment was designed around real human-robot communication and included situations where the user's posture and movements were complex and the gestures and utterances were varied.
→ The Mel-Spectrogram is therefore believed to have captured subtle speech features better in these complex situations.
(Baseline: Chen et al. [2])

Slide 16

Slide 16 text

Conclusion 16
• We introduced a pointing frame estimator that leverages human speech information across various scenarios where individuals provide instructions to a robot using gestures and language.
• Experiment results
  • Models using the Mel-Spectrogram and a Transformer as the time series model perform well.
  • Since the speech data was complex, higher-order speech features such as the Mel-Spectrogram and MFCCs may have been more effective.
Future Works
• To utilize the estimated pointing frames to localize reference objects within a 3D environment [10].
Acknowledgements
This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975 and JP22K12212) and the JST Moonshot Research & Development Program.
[10] A. Oyama et al. "Exophora Resolution of Linguistic Instructions with a Demonstrative based on Real-World Multimodal Information." IEEE RO-MAN, 2023.

Slide 17

Slide 17 text

Appendix 17

Slide 18

Slide 18 text

Contribution 18
• We show that the performance of pointing frame estimation in robots can be enhanced, when a user provides instructions using gestures and language, by integrating the speech information articulated by the user.
• We contrast pointing frame estimators leveraging various types of audio features, showing which audio features are optimal for pointing frame estimation.

Slide 19

Slide 19 text

Embodied Reference Understanding Framework 19
Chen et al. created a comprehensive dataset (YouRefIt) wherein users denote objects using language and gestures, and developed an estimator capable of interpreting user directives [2].
[2] Y. Chen et al. "YouRefIt: Embodied Reference Understanding with Language and Gesture." IEEE/CVF ICCV, 2021.

Slide 20

Slide 20 text

Speech and Video Processing 20
• Videos with audio are sampled at 5 Hz, yielding one image frame and one 0.2 second audio window per step.
• The speech information is put into the feature extractor to obtain audio features.
• Speech recognition is run on the videos' audio with Whisper [1], e.g., "Ah, I'm tired. I'll take a break. Please get me some tea from the kitchen." (see the sketch below)
[1] A. Radford et al. "Robust Speech Recognition via Large-Scale Weak Supervision." ICML, 2023.
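A minimal sketch of this preprocessing, not the authors' exact pipeline: frames are kept at 5 Hz with OpenCV, and Whisper transcribes the video's audio track (the file name and model size are hypothetical; Whisper needs ffmpeg available):

```python
import cv2       # pip install opencv-python
import whisper   # pip install openai-whisper

def sample_frames_at_5hz(video_path):
    """Collect image frames from the video at 5 Hz (one frame per 0.2 s)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / 5)), 1)      # keep every `step`-th frame
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

# Speech recognition on the same video's audio track with Whisper.
model = whisper.load_model("base")                   # model size is an arbitrary choice here
result = model.transcribe("instruction_video.mp4")   # hypothetical file name
print(result["text"])
```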

Slide 21

Slide 21 text

Object Rectangle Estimator [1] 21
[1] Z. Yang et al. "A Fast and Accurate One-Stage Approach to Visual Grounding." IEEE/CVF ICCV, 2019.

Slide 22

Slide 22 text

Pointing Frame Estimator: Transformer 22
Transformer [1]
• The Transformer has a large number of parameters and requires a large training dataset.
• It is fine-tuned with data collected in the environment.
• A linear layer uses a weight matrix W for dimensionality compression: W has 1 on the diagonal components and Gaussian noise on the remaining components.
Dimensionality compression
1. CNNs reduce the dimensionality of the features obtained from the rectangular region estimator.
2. The weight matrix W is applied to the speech and compressed features, where the diagonal components are set to 1 and Gaussian noise is introduced to the remaining components (see the sketch below).
This adjustment aligns the dimensionality of the input data during the fine-tuning process of the Transformer.
[1] A. Vaswani et al. "Attention Is All You Need." NeurIPS, 2017.
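A minimal sketch of constructing such a compression matrix (the input/output dimensions and the noise scale are assumptions):

```python
import torch

def make_compression_matrix(d_in, d_out, noise_std=0.01):
    """Weight matrix with ones on the diagonal and small Gaussian noise elsewhere.

    Used as the weight of a linear layer that maps d_in-dimensional concatenated
    features down to d_out dimensions (d_out <= d_in assumed).
    """
    w = torch.randn(d_out, d_in) * noise_std   # Gaussian noise on all components
    idx = torch.arange(min(d_out, d_in))
    w[idx, idx] = 1.0                          # set the diagonal components to 1
    return w

# Example: a dimensionality-compression layer with this weight (sizes are assumed).
linear = torch.nn.Linear(in_features=1024, out_features=512, bias=False)
with torch.no_grad():
    linear.weight.copy_(make_compression_matrix(1024, 512))
```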

Slide 23

Slide 23 text

Experiment 1 23
• Purpose: to verify how adding speech to a pointing frame estimator improves its performance in gesture- and language-based interactions.
• Environment (shown in the figure)

Slide 24

Slide 24 text

Each Speech Feature 24
• Speech Segments
  • A threshold is set on the speech waveform
  • Each frame is labeled as a speech segment or not depending on whether the threshold is exceeded
• Mel-Spectrogram
  • A spectrogram whose frequency axis uses the Mel scale
  • The Mel scale is an acoustic scale based on the human ear's perception of frequency
• Mel-Frequency Cepstral Coefficients (MFCCs)
  • The coefficients obtained by a discrete cosine transform of the Mel-spectrogram
  • Compressed to a lower dimension
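A minimal sketch of extracting the latter two features from one 0.2 s audio window with librosa (the sampling rate, band counts, and time-averaging are assumptions, not the paper's settings):

```python
import numpy as np
import librosa

def mel_and_mfcc(window, sr=16000, n_mels=64, n_mfcc=13):
    """Compute a log Mel-spectrogram and MFCCs for one 0.2 s audio window."""
    mel = librosa.feature.melspectrogram(y=window, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                            # log-scaled Mel-spectrogram
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=n_mfcc)  # DCT of the log-Mel bands
    # Average over time to get one fixed-size vector per window.
    return log_mel.mean(axis=1), mfcc.mean(axis=1)

window = np.random.randn(3200).astype(np.float32)   # dummy 0.2 s of audio at 16 kHz
mel_vec, mfcc_vec = mel_and_mfcc(window)            # shapes: (64,), (13,)
```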

Slide 25

Slide 25 text

Test dataset: non-upright or includes other movements 25

Slide 26

Slide 26 text

Test dataset upright 26

Slide 27

Slide 27 text

Examples of Data Collected in Experiment 1 27
Dataset contents: the reference object is included in the image, and the person stands upright and points at the object. (Speech is in Japanese.)
• Total data: 100
• Training data: 70
• Validation data: 10
• Test data: 20
Ex. "I think I'll read. Take that book."

Slide 28

Slide 28 text

Result in Experiment 1 28
Evaluation item: F1 score = 2 · Precision · Recall / (Precision + Recall)
Result
• Since the speech data was simple, it is possible that higher-order speech features such as the Mel-Spectrogram and MFCCs were excessive and instead acted as noise.
• A model based on the simple Speech Segment feature, given the limited amount of data, could have avoided over- or under-fitting and produced more stable results.

Slide 29

Slide 29 text

Reasons for Adopting the F1 Score 29
In the test dataset used in this study:
• Total frames: 12207
• Pointing frames: 2598
• Non-pointing frames: 9249
If accuracy were used as the evaluation item, a model that labels every frame as "not pointing" would already reach about 75.7% accuracy while detecting nothing. Therefore, we use the F1 score in this experiment (see the check below).
• F1 score: the harmonic mean of Precision and Recall, which are in a trade-off relationship in the binary classification task
  F1 score = 2 · Precision · Recall / (Precision + Recall)
• Precision: emphasizes the correctness of the positive (pointing) predictions
• Recall: emphasizes how many of the actual pointing frames are detected
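A quick check of the class-imbalance argument above, using the frame counts from this slide (a sketch, not the paper's evaluation code):

```python
# Frame counts reported for the test set.
total_frames = 12207
non_pointing = 9249

# A trivial model that labels every frame "not pointing":
accuracy = non_pointing / total_frames
print(f"all-negative accuracy: {accuracy:.3f}")   # ~0.758 -> high despite detecting nothing

# Its recall for the pointing class is 0 (it never predicts a positive), so its F1 is 0,
# which is why the F1 score is the more informative evaluation item here.
precision, recall = 0.0, 0.0
f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
print(f"all-negative F1: {f1}")
```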

Slide 30

Slide 30 text

Failure Cases Where There Is No Object in the Robot's View 30
The previous research [1] uses a saliency map for gesture features, because its dataset assumes that the reference object is in the image. In this study, the reference object may not be in the image, in which case the saliency map cannot be computed well. Therefore, the results are expected to improve by not using the saliency map or by acquiring alternative gesture features.
Successful cases (right: RGB image, left: saliency map highlighting likely object locations) and a failure case are shown in the figure.
[1] Y. Chen et al. "YouRefIt: Embodied Reference Understanding with Language and Gesture." IEEE/CVF ICCV, 2021.

Slide 31

Slide 31 text

Specification of PC 31
PC environment
• CPU: dual AMD EPYC Rome 7302 processors, 16 cores / 32 threads, 3.0 GHz, 128 MB cache, 115 W
• GPU: NVIDIA A100 40 GB, CoWoS HBM2, PCIe 4.0, passive cooling
• Software development environment [1]
Inference time per video
• Transformer: 0.21 seconds on average
• Bi-LSTM: 0.20 seconds on average
[1] L. El Hafi et al. "Software Development Environment for Collaborative Research Workflow in Robotic System Integration." Advanced Robotics, 2022.

Slide 32

Slide 32 text

Conditions of Learning 32
• Fine-tuned Transformer
  • Number of epochs: 50
  • Optimizer: RMSprop
  • Number of parameters: 34,225,922
• Bi-LSTM
  • Number of epochs: 100
  • Optimizer: RMSprop
  • Number of parameters: 674,906

Slide 33

Slide 33 text

Exophora Resolution Framework 33
Identifying the referent of a demonstrative (demonstratives, pronouns, etc.) from environmental information.
[10] A. Oyama et al. "Exophora Resolution of Linguistic Instructions with a Demonstrative based on Real-World Multimodal Information." IEEE RO-MAN, 2023.

Slide 34

Slide 34 text

Demonstrative Region-based Estimator 34
• Demonstrative regions are formed as 3D Gaussian distributions using the different characteristics of each demonstrative series (see the sketch below)
• The robot obtains eye and wrist coordinates with MediaPipe (a skeleton detector) [5]
• The eye and wrist coordinates are used as parameters of the 3D Gaussian distribution
Japanese demonstrative series and their characteristics:
• co-series: refers to something near the speaker (e.g., the human)
• so-series: refers to something near the listener (e.g., the robot)
• a-series: refers to a location far from both
[5] C. Lugaresi et al. "MediaPipe: A Framework for Building Perception Pipelines." arXiv preprint arXiv:1906.08172, 2019.
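A minimal sketch of scoring a candidate object under such a demonstrative region. The placement of the Gaussian mean and the covariance are assumptions for illustration; the actual model parameterizes each demonstrative series differently:

```python
import numpy as np
from scipy.stats import multivariate_normal

def demonstrative_region_density(obj_xyz, eye_xyz, wrist_xyz, spread=0.5):
    """Density of a 3D Gaussian 'demonstrative region' evaluated at an object position.

    Here the mean is placed a short distance beyond the wrist along the eye->wrist
    direction (an assumed choice); spread sets an isotropic covariance (also assumed).
    """
    direction = wrist_xyz - eye_xyz
    direction = direction / np.linalg.norm(direction)
    mean = wrist_xyz + 0.3 * direction          # assumed offset in metres
    cov = (spread ** 2) * np.eye(3)             # isotropic covariance (assumption)
    return multivariate_normal(mean=mean, cov=cov).pdf(obj_xyz)

eye = np.array([0.0, 0.0, 1.5])
wrist = np.array([0.3, 0.0, 1.2])
cup = np.array([0.6, 0.1, 0.9])
print(demonstrative_region_density(cup, eye, wrist))
```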

Slide 35

Slide 35 text

Object Category-based Estimator 35
• Confidence scores for candidate objects are obtained in advance with object detection by YOLOv5 [6] (e.g., Cup: 0.90, Jug: 0.03 for one detection; Stuffed Toy: 0.61, Sheep: 0.20 for another).
• Confidence score: Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth)
• The confidence scores for the referred category (e.g., "Cup") are normalized into a probability over the candidate objects (see the sketch below).
[6] J. Redmon et al. "You Only Look Once: Unified, Real-Time Object Detection." IEEE/CVF CVPR, 2016.
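A minimal sketch of normalizing per-object confidence scores for a queried category (the scores loosely follow the example values on the slide; the function and object names are hypothetical):

```python
def normalized_category_probabilities(confidences, category):
    """Turn per-object detector confidences for one category into a probability distribution."""
    scores = {obj: conf.get(category, 0.0) for obj, conf in confidences.items()}
    total = sum(scores.values())
    return {obj: (s / total if total > 0 else 0.0) for obj, s in scores.items()}

# Example confidences from a detector such as YOLOv5.
confidences = {
    "object_1": {"Cup": 0.90, "Jug": 0.03},
    "object_2": {"Stuffed Toy": 0.61, "Sheep": 0.20},
}
print(normalized_category_probabilities(confidences, "Cup"))
# {'object_1': 1.0, 'object_2': 0.0} -- only object_1 has any "Cup" confidence
```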

Slide 36

Slide 36 text

Pointing Direction-based Estimator 36
• Calculate the angle between two vectors:
  • Pointing vector (blue): start point = eye, end point = wrist
  • Object direction vector (red): start point = eye, end point = candidate object
• Output the probability density of a 2D von Mises distribution with the obtained angle as the random variable (see the sketch below)
von Mises probability density function: f(θ) = exp{β cos(θ − μ)} / (2π I₀(β)), with random variable θ and parameters β, μ
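A minimal sketch of this scoring step; the concentration value (β in the slide, kappa below) and μ = 0 are assumptions:

```python
import numpy as np
from scipy.stats import vonmises

def pointing_direction_score(eye, wrist, obj, kappa=4.0):
    """Score a candidate object by the von Mises density of the angle between
    the pointing vector (eye->wrist) and the object direction vector (eye->object)."""
    v_point = (wrist - eye) / np.linalg.norm(wrist - eye)
    v_obj = (obj - eye) / np.linalg.norm(obj - eye)
    cos_theta = np.clip(np.dot(v_point, v_obj), -1.0, 1.0)
    theta = np.arccos(cos_theta)                 # angle between the two vectors
    return vonmises.pdf(theta, kappa, loc=0.0)   # exp{kappa*cos(theta - 0)} / (2*pi*I0(kappa))

eye = np.array([0.0, 0.0, 1.5])
wrist = np.array([0.3, 0.0, 1.2])
book = np.array([1.2, 0.1, 0.4])
print(pointing_direction_score(eye, wrist, book))
```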

Slide 37

Slide 37 text

Exophora Resolution Framework: Demonstration Video 37

Slide 38

Slide 38 text

LLM-based Robot Action Planning: System [1] 38 avigation b ect etection ick lace Control Command uccess or ailed lex E ring a cup to the kitchen. b ect Image , ord osition , , b ect name lace name b ect placement T rompt Spatial Concept-based Prompts and Sequential Feedback for LLM-based Robot Action Planning(SpCoRAP) [1] S. Hasegawa, et al. "Leveraging a Large Language Model and a Spatial ConceptModel for Action Planning of a Daily Life Support Robot." RSJ, 2023.

Slide 39

Slide 39 text

Prompts of On-site Knowledge 39
Spatial concept model: infer a word w_t about a place in each spatial region i_t
  P(w_t | i_t, W, φ, π) ∝ Σ_{C_t} P(w_t | W_{C_t}) P(i_t | φ_{C_t}) P(C_t | π)
Prompt about place words:
  "In each region, words related to the following locations are likely to be observed.
   place1: [living_room, sofa, desk, chair, tv]
   place2: [sink, refrigerator, desk, chair, kitchen]
   place3: [toy, shelf, toy_room, box, bed]"
[1] S. Hasegawa et al. "Inferring Place-Object Relationships by Integrating Probabilistic Logic and Multimodal Spatial Concepts." SII, 2023.

Slide 40

Slide 40 text

Prompts of On-site Knowledge 40
Spatial concept model: infer object labels o_t in each spatial region i_t
  P(i_t | o_t, ξ, φ, π) ∝ Σ_{C_t} P(i_t | φ_{C_t}) P(o_t | ξ_{C_t}) P(C_t | π)
Prompt about object placement:
  "List of probabilities that an object exists at [place1, place2, place3]:
   bottle: [0.8, 0.15, 0.05]
   cup: [0.7, 0.2, 0.1]
   stuffed_toy: [0.1, 0.05, 0.85]"
These prompts represent the parameters of the probability distributions in the spatial concept model (see the sketch below).
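A minimal sketch of the marginalization over spatial concepts used in the two formulas above, with made-up toy parameters (not the model's learned values):

```python
import numpy as np

# Toy parameters for 3 spatial concepts C, 3 regions i, and 3 object labels o.
p_C           = np.array([0.4, 0.35, 0.25])          # P(C | pi)
p_region_by_C = np.array([[0.8, 0.1, 0.1],           # P(i | phi_C); rows: concepts, cols: regions
                          [0.1, 0.8, 0.1],
                          [0.1, 0.1, 0.8]])
p_object_by_C = np.array([[0.7, 0.2, 0.1],           # P(o | xi_C); cols: bottle, cup, stuffed_toy
                          [0.2, 0.7, 0.1],
                          [0.1, 0.1, 0.8]])

def region_posterior(obj_idx):
    """P(i | o) up to a constant: sum_C P(i | phi_C) P(o | xi_C) P(C | pi), then normalize."""
    unnorm = (p_region_by_C * p_object_by_C[:, obj_idx, None] * p_C[:, None]).sum(axis=0)
    return unnorm / unnorm.sum()

print(region_posterior(obj_idx=1))   # where a "cup" is likely to be, over the 3 regions
```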

Slide 41

Slide 41 text

Action Planning by GPT-4 41
Prompts given to GPT-4 [1]:
• On-site knowledge
  • Location words: place1: [living_room, sofa, desk, chair, tv] …
  • Object placements: bottle: [0.8, 0.15, 0.05] …
• Skill set
  1. navigation (location_name)
  2. object_detection (object_name)
  3. pick (object_name)
  4. place (location_name)
  These behaviors return "succeeded" or "failed". If "failed" is returned, try the same or another behavior again.
• Linguistic instruction: "Could you please find a snack box I'm looking for?"
The full prompt is available here: https://github.com/Shoichi-Hasegawa0628/rsj2023_prompt
[1] OpenAI. "GPT-4 Technical Report." arXiv preprint arXiv:2303.08774, 2023.

Slide 42

Slide 42 text

Behavior Engine by FlexBE [1] 42
FlexBE sequentially executes the actions output by GPT-4 as control commands in the environment (navigation, object detection, pick, place), and each action returns "succeeded" or "failed".
• navigation fails when a path cannot be planned.
• pick & place fail when the inverse kinematics cannot be calculated.
• object detection fails when the target label is not detected.
[1] P. Schillinger et al. "Human-Robot Collaborative High-Level Control with Application to Rescue Robotics." ICRA, 2016.