

Pointing Frame Estimation with Audio-Visual Time Series Data for Daily Life Service Robots

Daily life support robots in home environments can accomplish more instructions
by interpreting the user's pointing gestures and understanding the accompanying
verbal instructions. This study aims to improve the performance of pointing
frame estimation by using speech information when a person gives pointing or
verbal instructions to a robot. Estimating the pointing frame, i.e., the moment
when the user points, helps the robot understand the user's instructions.
We therefore perform pointing frame estimation with a time-series model that
uses the user's speech, images, and speech-recognized text observed by the
robot. In our experiments, we set up realistic communication conditions, such
as speech containing everyday conversation, non-upright postures, actions other
than pointing, and reference objects outside the robot's field of view. The
results show that adding speech information improves estimation performance,
especially for the Transformer model with the Mel-spectrogram as a feature.
In the future, this work can be applied to object localization and action
planning by robots in 3D environments. The project website
is https://emergentsystemlabstudent.github.io/PointingImgEst/.

Shoichi Hasegawa

October 08, 2024

Transcript

  1. Pointing Frame Estimation with Audio-Visual Time Series Data for Daily

    Life Service Robots Hikaru Nakagawa1, Shoichi Hasegawa1,*, Yoshinobu Hagiwara2,1, Akira Taniguchi1, Tadahiro Taniguchi3,1 Ritsumeikan Univ.1 Soka Univ.2 Kyoto Univ.3 (* is a corresponding author) IEEE International Conference on Systems, Man, and Cybernetics (SMC2024) Session: Cyber-Physical Systems and Robotics 2 (10/09: 12:20-12:40) ID: 1408
  2. Research Background 2
    People often give instructions with pointing. Pointing is important information for identifying objects [1], but it is difficult to know when the pointing occurs. Therefore, the robot needs to capture the moment when the person points (the pointing frame). (Example instruction: "Put this stuffed shark away.")
    [1] N. Kotani et al. "Point Anywhere: Directed Object Estimation from Omnidirectional Images." ACM SIGGRAPH, 2023.
  3. Previous Research: Estimating Pointing Frames 3
    Chen et al. estimate the rectangular area of a reference object and the canonical frames (image frames that contain the user-directed reference object and in which the reference object is uniquely determined from the image and text) [2].
    • Building a paired image-text dataset for training and evaluation
    • Using a Transformer [3] for temporal optimization
    Gestures unrelated to instructions are not addressed, which may reduce estimation accuracy. We address this issue by improving the pointing frame estimator of this previous work.
    [2] Y. Chen et al. “YouRefIt: Embodied Reference Understanding with Language and Gesture.” IEEE/CVF ICCV, 2021.
    [3] A. Vaswani et al. “Attention Is All You Need.” NeurIPS, 2017.
  4. Research Idea 4
    We use human speech to predict the moment when a person points, even while the person makes gestures that are not related to the instruction.
    • For example, suppose a person stretches. Without speech information, the robot may misrecognize the stretching as pointing.
    • With speech information, however, the robot can infer that the person is not pointing when there is no utterance such as “Ah… get me some tea.” in the speech.
    (Figure: over time t, the pointing frame estimator outputs “pointing frame” or “no pointing frame” for the utterance “Ah… get me some tea.”)
  5. Research Purpose 5
    To verify to what extent the performance of pointing frame estimation can be enhanced by integrating speech data during human-robot interactions involving gestures and language.
    (Figure: from the observation “Ah, I'm tired. I'll take a break. Please get me some tea from the kitchen.”, the robot uses speech, images, and recognized text to estimate the pointing frame.)
  6. Feature Fusion Module 8
    Example utterance: “Ah, I'm tired. I'll take a break. Please get me some tea from the kitchen.”
    • MSI-Net [4]: a neural network that highlights and visualizes where people point (saliency heatmap).
    • OpenPose [5]: a neural network for human skeleton detection from images.
    • Visual Encoder: Darknet-53 [7] trained on MS-COCO [6].
    • Textual Encoder: a Japanese pre-trained BERT model [8].
    By combining the skeleton detection results, speech, and image features, information about the user's instruction can be appropriately represented (a fusion sketch follows this slide).
    [4] A. Kroner et al. “Contextual Encoder-Decoder Network for Visual Saliency Prediction.” Neural Networks, 2020.
    [5] Z. Cao et al. “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.” IEEE/CVF CVPR, 2017.
    [6] T.-Y. Lin et al. “Microsoft COCO: Common Objects in Context.” ECCV, 2014.
    [7] J. Redmon et al. “YOLOv3: An Incremental Improvement.” arXiv preprint arXiv:1804.02767, 2018.
    [8] J. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL-HLT, 2019.
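As a rough illustration of the fusion step above (not the authors' exact implementation), the per-frame outputs of the saliency model, skeleton detector, visual encoder, and textual encoder could simply be flattened and concatenated into one vector; all array shapes below are hypothetical.

```python
import numpy as np

# Hypothetical per-frame features; real dimensionalities depend on the encoders
# used in the paper (MSI-Net, OpenPose, Darknet-53, BERT).
saliency_map = np.random.rand(32, 32)   # MSI-Net saliency heatmap (downsampled)
skeleton = np.random.rand(18, 2)        # OpenPose keypoints (x, y)
visual_feat = np.random.rand(1024)      # Darknet-53 image feature
text_feat = np.random.rand(768)         # BERT embedding of the recognized text

def fuse_features(saliency_map, skeleton, visual_feat, text_feat):
    """Flatten and concatenate all modalities into one per-frame vector."""
    return np.concatenate([
        saliency_map.ravel(),
        skeleton.ravel(),
        visual_feat,
        text_feat,
    ])

fused = fuse_features(saliency_map, skeleton, visual_feat, text_feat)
print(fused.shape)  # one fused feature vector per video frame
```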
  7. Pointing Frame Estimator 10
    We compare two methods capable of integrating time-series data within the pointing frame estimator: Transformer [3] and Bidirectional LSTM (Bi-LSTM) [9].
    • The Transformer has a large number of parameters and requires a large training dataset, so it is fine-tuned with data collected in the real environment.
    • The Bi-LSTM does not require a large dataset and can be trained from scratch.
    • In our study, the input is the concatenation of the features obtained from the rectangle estimator and the audio features (a minimal Bi-LSTM sketch follows this slide).
    [3] A. Vaswani et al. “Attention Is All You Need.” NeurIPS, 2017.
    [9] A. Graves et al. “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures.” Neural Networks, 2005.
    https://medium.com/@souro400.nath/why-is-bilstm-better-than-lstm-a7eb0090c1e4
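A minimal sketch of the Bi-LSTM variant of the estimator, assuming a per-frame binary label (pointing / not pointing) and a hypothetical fused feature size; this is an illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class BiLSTMPointingEstimator(nn.Module):
    """Per-frame binary classifier over a sequence of fused audio-visual features."""

    def __init__(self, feature_dim=2048, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # one logit per frame

    def forward(self, x):                # x: (batch, frames, feature_dim)
        h, _ = self.lstm(x)              # h: (batch, frames, 2 * hidden_dim)
        return self.head(h).squeeze(-1)  # logits: (batch, frames)

# Toy usage: 2 clips, 25 frames each (5 s at 5 Hz), hypothetical 2048-d features.
model = BiLSTMPointingEstimator()
logits = model(torch.randn(2, 25, 2048))
probs = torch.sigmoid(logits)            # probability that each frame is a pointing frame
```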
  8. Experiment 11
    • Purpose: To verify to what extent a model with additional speech information improves pointing frame estimation in a scenario assuming actual communication between a person and a robot.
    • Environment: shown in the figure.
  9. Conditions of User Instructions 12
    1. Speech containing content other than referential expressions
       • “I'm thirsty. Go get the plastic bottle on the desk.”
    2. Both only-upright-and-pointing movements and non-upright or non-pointing movements
    3. Reference objects in the image or not in the image
    4. How the object is named
       • Demonstrative and object name (“this remote controller”, “that bottle”, “that book”)
       • Only object name (“cup”, “stuffed doll”)
       • Only demonstrative (co-series (this), a-series (that), so-series (the))
    Table. The number of data for each item (each item: Train 70 / Validate 10 / Test 20)
                                        Only upright and pointing   Non-upright or actions other than pointing
    Reference object inside the image              100                             100
    Reference object outside the image             100                             100
  10. Examples of Data Collected in the Experiment 13
    Japanese speech was used in the experiment. Example utterances (translated): “I think I'll read. Take that book.” / “Oh, I'm almost done reading this book. Take the book there.” / “It was delicious. Put this away.” / “Thanks for the food. It was delicious. Please put this plate away.”
    The four examples cover the combinations of two conditions: whether the reference object is in the image (〇/×) and whether the user is only upright and pointing (〇/×).
  11. Comparison Methods 14
    Baseline (Chen et al. [2])
    • Dataset collected in the real environment
    • Without speech information
    • Transformer
    Our models
    • We compare a total of six models combining two conditions:
      • Speech information: Speech Segments, Mel-Frequency Cepstral Coefficients (MFCCs), or Mel-spectrogram
      • Pointing frame estimator: Bi-LSTM or Transformer
    Speech Segments
    • A threshold is set on the speech waveform; each frame is labeled as a speech segment (1) or not (0) depending on whether the threshold is exceeded (a thresholding sketch follows this slide).
    [2] Y. Chen et al. “YouRefIt: Embodied Reference Understanding with Language and Gesture.” IEEE/CVF ICCV, 2021.
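A minimal sketch of the speech-segment feature described above, assuming a mono waveform and a hand-picked amplitude threshold (both hypothetical); per 0.2 s frame it outputs 1 if the threshold is exceeded and 0 otherwise.

```python
import numpy as np

def speech_segments(waveform, sample_rate, frame_sec=0.2, threshold=0.02):
    """Label each frame 1 if any sample's absolute amplitude exceeds the threshold."""
    frame_len = int(sample_rate * frame_sec)
    labels = []
    for start in range(0, len(waveform), frame_len):
        frame = waveform[start:start + frame_len]
        labels.append(int(np.max(np.abs(frame)) > threshold))
    return np.array(labels)

# Toy usage: 2 s of silence followed by 1 s of "speech" at 16 kHz.
sr = 16000
wav = np.concatenate([np.zeros(2 * sr), 0.1 * np.random.randn(sr)])
print(speech_segments(wav, sr))  # 0s for the silent frames, 1s for the loud frames
```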
  12. Result 15
    Evaluation item: F1 score = 2 * Precision * Recall / (Precision + Recall)
    Result
    • Since the speech data was complex, higher-order speech features such as the Mel-spectrogram and MFCCs may have been more effective.
    • The experiment was designed around real human-robot communication and included situations where the user's posture and movements were complex and gestures and utterances were varied.
    → The Mel-spectrogram is therefore believed to have captured subtle speech features better in these complex situations.
    (Results table: comparison of our models against the baseline method.)
  13. Conclusion 16
    • We introduced a pointing frame estimator that leverages human speech information across various scenarios in which individuals provide instructions to a robot using gestures and language.
    • Experimental results
      • Models using the Mel-spectrogram and a Transformer as the time-series model perform well.
      • Since the speech data was complex, higher-order speech features such as the Mel-spectrogram and MFCCs may have been more effective.
    Future Works
    • Utilize the estimated pointing frames to localize reference objects within a 3D environment [10].
    Acknowledgements
    This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research (Grant Numbers JP23K16975 and JP22K12212) and the JST Moonshot Research & Development Program.
    [10] A. Oyama et al. "Exophora Resolution of Linguistic Instructions with a Demonstrative based on Real-World Multimodal Information." IEEE RO-MAN, 2023.
  14. Contribution 18
    • We show that the performance of pointing frame estimation in robots can be enhanced, when a user provides instructions using gestures and language, by integrating the speech information articulated by the user.
    • We compare pointing frame estimators leveraging various types of audio features, showing which audio features are optimal for pointing frame estimation.
  15. Embodied Reference Understanding Framework 19
    Chen et al. created a comprehensive dataset (YouRefIt) in which users denote objects using language and gestures, and developed an estimator capable of interpreting user directives [2].
    [2] Y. Chen et al. “YouRefIt: Embodied Reference Understanding with Language and Gesture.” IEEE/CVF ICCV, 2021.
  16. Speech and Video Processing 20
    • We sample videos with audio at 5 Hz, obtaining an image and the corresponding 0.2 seconds of audio per frame.
    • The audio of each frame is fed into the feature extractor to obtain speech features.
    • Speech recognition is performed on the audio of the videos with Whisper [1] (e.g., “Ah, I'm tired. I'll take a break. Please get me some tea from the kitchen.”). A chunking sketch follows this slide.
    [1] A. Radford et al. "Robust Speech Recognition via Large-Scale Weak Supervision." ICML, 2023.
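A minimal sketch of this preprocessing step, assuming the audio track has already been extracted to a WAV file (the file name and model size are hypothetical); it splits the waveform into 0.2 s chunks aligned with the 5 Hz video frames and transcribes the full clip with the openai-whisper package.

```python
import numpy as np
import whisper  # pip install openai-whisper

# Transcribe the whole clip once (Whisper resamples internally).
model = whisper.load_model("base")
text = model.transcribe("instruction_clip.wav")["text"]  # hypothetical file name

# Split the waveform into 0.2 s chunks, one per 5 Hz video frame.
audio = whisper.load_audio("instruction_clip.wav")  # mono float32 at 16 kHz
sr = 16000
chunk_len = int(0.2 * sr)
n_chunks = len(audio) // chunk_len
chunks = np.split(audio[:n_chunks * chunk_len], n_chunks)

print(text)
print(len(chunks), "audio chunks of", chunk_len, "samples each")
```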
  17. Object Rectangle Estimator [1] 21
    [1] Z. Yang et al. “A Fast and Accurate One-Stage Approach to Visual Grounding.” IEEE/CVF ICCV, 2019.
  18. Pointing Frame Estimator: Transformer 22
    Transformer [1]
    • The Transformer has a large number of parameters and requires a large training dataset, so it is fine-tuned with data collected in the environment.
    Dimensionality compression
    1. CNNs reduce the dimensionality of the features obtained from the rectangular region estimator.
    2. A linear layer with weight matrix W is applied to the speech and compressed features, where the diagonal components of W are set to 1 and Gaussian noise is added to the remaining components.
    This adjustment aligns the dimensionality of the input data during the fine-tuning of the Transformer (a construction sketch of W follows this slide).
    [1] A. Vaswani et al. “Attention Is All You Need.” NeurIPS, 2017.
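A minimal sketch of the weight matrix described above (the dimensions are hypothetical): the diagonal is fixed to 1 and small Gaussian noise fills the remaining entries, so the linear layer roughly passes features through while matching the input size expected by the fine-tuned Transformer.

```python
import numpy as np

def make_compression_matrix(in_dim, out_dim, noise_std=0.01, seed=0):
    """Weight matrix with ones on the diagonal and Gaussian noise elsewhere."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, noise_std, size=(in_dim, out_dim))
    np.fill_diagonal(W, 1.0)  # fills min(in_dim, out_dim) diagonal entries
    return W

# Toy usage: map a hypothetical 2048-d fused feature to a 512-d Transformer input.
W = make_compression_matrix(2048, 512)
feature = np.random.rand(2048)
compressed = feature @ W      # shape (512,)
print(compressed.shape)
```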
  19. Experiment 1 23
    • Purpose: To verify how adding speech to a pointing frame estimator improves its performance in gesture- and language-based interactions.
    • Environment: shown in the figure.
  20. Each Speech Feature 24
    • Speech Segments
      • A threshold is set on the speech waveform; each frame is labeled as a speech segment or not depending on whether the threshold is exceeded.
    • Mel-Frequency Cepstral Coefficients (MFCCs)
      • Coefficients obtained by a discrete cosine transform of the Mel-spectrogram, compressed to a lower dimension.
    • Mel-Spectrogram
      • A spectrogram whose frequency axis uses the Mel scale, an acoustic scale based on the human ear's perception of frequency.
    (An extraction sketch of the Mel-spectrogram and MFCCs follows this slide.)
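A minimal sketch of extracting the two higher-order speech features with librosa, applied to one hypothetical 0.2 s audio chunk; the FFT and Mel-band settings are illustrative, not the paper's values.

```python
import numpy as np
import librosa  # pip install librosa

sr = 16000
chunk = 0.05 * np.random.randn(int(0.2 * sr))  # hypothetical 0.2 s audio chunk

# Mel-spectrogram: power spectrogram with the frequency axis on the Mel scale.
mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)

# MFCCs: discrete cosine transform of the log-Mel spectrogram, kept low-dimensional.
mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```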
  21. Examples of Data Collected in Experiment 1 27 Dataset Contents:

    Reference object is included in the image, and a person stands upright and points at the object. (Speech is in Japanese) • The number of total data: 100 • The number of training data: 70 • The number of validation data: 10 • The number of test data: 20 Ex. I think I’ll read. Take that book.
  22. Result in Experiment 1 28
    Evaluation item: F1 score = 2 * Precision * Recall / (Precision + Recall)
    Result
    • Since the speech data was simple, higher-order speech features such as the Mel-spectrogram and MFCCs may have been excessive and acted as noise.
    • A model based on the simple Speech Segments, with the limited amount of data, could have avoided over- or under-fitting and produced more stable results.
  23. Reasons for Adopting the F1 Score 29
    The test set used in this study is imbalanced:
    • Total frames: 12,207
    • Pointing frames: 2,598
    • Non-pointing frames: 9,249
    If accuracy were used as the evaluation item, a model that labels every frame as non-pointing would still reach 75.7% accuracy, which is misleading. Therefore, we use the F1 score in this experiment.
    • F1 score: the harmonic mean of Precision and Recall, capturing the trade-off between them in the binary classification task.
    • Precision: emphasizes the accuracy of the positive predictions.
    • Recall: emphasizes how many of the actual pointing frames are detected.
    F1 score = 2 * Precision * Recall / (Precision + Recall)
    (A small numerical check follows this slide.)
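A small numerical check of the argument above, using the frame counts reported on the slide: an all-negative predictor reaches about 75.7% accuracy but gets an F1 of 0 for the pointing class.

```python
# Frame counts from the test set (values reported on the slide).
total_frames = 12207
pointing = 2598
non_pointing = 9249

# A trivial predictor that labels every frame as "not pointing".
accuracy = non_pointing / total_frames
print(f"accuracy = {accuracy:.4f}")  # 0.7577, i.e. about 75.7%

# For the pointing class there are no true positives, so precision and recall
# are both 0, and F1 = 2 * P * R / (P + R) is taken as 0 in this degenerate case.
precision, recall = 0.0, 0.0
f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```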
  24. Failure Cases Where There Is No Object in the Robot's View 30
    The previous research [1] uses a saliency map for gesture features, because its dataset assumes that the reference object is in the image. In this study, the reference object may not be in the image, in which case the saliency map cannot be computed well. Estimation is therefore expected to improve by not using the saliency map or by acquiring alternative gesture features.
    (Figure: successful cases (right: RGB image, left: saliency map highlighting likely object locations) and a failure case.)
    [1] Y. Chen et al. “YouRefIt: Embodied Reference Understanding with Language and Gesture.” IEEE/CVF ICCV, 2021.
  25. Specification of PC 31
    PC environment
    • CPU: dual AMD EPYC Rome 7302 processors, each with 16 cores and 32 threads, running at 3.0 GHz with 128 MB cache and 115 W power consumption
    • GPU: NVIDIA A100 40 GB (CoWoS HBM2, PCIe 4.0, passive cooling)
    • Software development environment [1]
    Inference time per video
    • Transformer: 0.21 seconds on average
    • Bi-LSTM: 0.20 seconds on average
    [1] L. El Hafi et al. “Software Development Environment for Collaborative Research Workflow in Robotic System Integration.” Advanced Robotics, 2022.
  26. Conditions of Learning 32
    • Fine-tuning the Transformer
      • Number of epochs: 50
      • Optimizer: RMSprop
      • Number of parameters: 34,225,922
    • Bi-LSTM
      • Number of epochs: 100
      • Optimizer: RMSprop
      • Number of parameters: 674,906
    (A hedged training-loop sketch follows this slide.)
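A minimal training-loop sketch matching the conditions above (RMSprop, per-frame binary cross-entropy); the learning rate, feature size, and dummy data are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Stand-in estimator: a Bi-LSTM over fused 2048-d per-frame features (hypothetical size),
# matching the structure sketched earlier in the transcript.
lstm = nn.LSTM(2048, 256, batch_first=True, bidirectional=True)
head = nn.Linear(512, 1)
params = list(lstm.parameters()) + list(head.parameters())

optimizer = torch.optim.RMSprop(params, lr=1e-4)  # RMSprop as on the slide; lr is hypothetical
criterion = nn.BCEWithLogitsLoss()                # per-frame pointing / not-pointing

# Dummy stand-ins for the real clips: 8 clips, 25 frames each.
features = torch.randn(8, 25, 2048)
labels = torch.randint(0, 2, (8, 25)).float()

for epoch in range(100):  # 100 epochs for the Bi-LSTM (50 for the fine-tuned Transformer)
    optimizer.zero_grad()
    hidden, _ = lstm(features)
    loss = criterion(head(hidden).squeeze(-1), labels)
    loss.backward()
    optimizer.step()
```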
  27. Exophora Resolution Framework 33
    Identifying the referent of a demonstrative (demonstratives, pronouns, etc.) from environmental information [10].
    [10] A. Oyama et al. "Exophora Resolution of Linguistic Instructions with a Demonstrative based on Real-World Multimodal Information." IEEE RO-MAN, 2023.
  28. Demonstrative Region-based Estimator 34
    • Demonstrative regions are formed as 3D Gaussian distributions using the different characteristics of the demonstrative series.
    • The robot obtains eye and wrist coordinates with MediaPipe (a skeleton detector) [5].
    • The eye and wrist coordinates are used as parameters of the 3D Gaussian distribution (a sketch follows this slide).
    Japanese demonstratives and their characteristics:
    • co-series: refers to something near the speaker (e.g., the human)
    • so-series: refers to something near the listener (e.g., the robot)
    • a-series: refers to a location far from both
    [5] C. Lugaresi et al. “MediaPipe: A Framework for Building Perception Pipelines.” arXiv preprint arXiv:1906.08172, 2019.
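A minimal sketch, under simplifying assumptions not taken from the paper, of scoring candidate object positions with a 3D Gaussian demonstrative region: here the co-series ("this") region is centered near the speaker's wrist with a hand-picked covariance.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 3D coordinates (in meters) from a skeleton detector such as MediaPipe.
eye = np.array([0.0, 0.0, 1.5])
wrist = np.array([0.4, 0.1, 1.0])

# Co-series ("this") region: a 3D Gaussian centered near the speaker's wrist.
# The covariance is a hand-picked illustration, not a value from the paper.
co_region = multivariate_normal(mean=wrist, cov=np.diag([0.3, 0.3, 0.3]))

# Score candidate object positions by their density under the region.
candidates = {
    "cup_on_table": np.array([0.5, 0.2, 0.9]),
    "book_on_shelf": np.array([2.5, -1.0, 1.2]),
}
for name, pos in candidates.items():
    print(name, co_region.pdf(pos))
```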
  29. Object Category-based Estimator 35
    Confidence scores for objects are obtained in advance using YOLOv5 [6], following the YOLO confidence definition
    Pr(Class_i | Object) * Pr(Object) * IOU(pred, truth),
    and the scores for the target category (e.g., “cup”: 0.95 vs. 0.01) are normalized into a probability over the candidate objects (a normalization sketch follows this slide).
    [6] J. Redmon et al. "You Only Look Once: Unified, Real-Time Object Detection." IEEE/CVF CVPR, 2016.
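A minimal sketch of the normalization step described above, using the detector confidence values visible on the slide (0.95 and 0.01 for “cup”); everything else is illustrative.

```python
import numpy as np

# YOLO confidence scores for the target category "cup" on two candidate objects,
# as shown on the slide.
cup_scores = np.array([0.95, 0.01])

# Normalize into a probability distribution over the candidate objects.
prob_cup = cup_scores / cup_scores.sum()
print(prob_cup)  # approximately [0.99, 0.01]
```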
  30. Pointing Direction-based Estimator 36
    • Calculate the angle between two vectors:
      • Pointing vector: from the eye to the wrist
      • Object direction vector: from the eye to the candidate object
    • Output the probability density of a 2D von Mises distribution with the obtained angle as the random variable:
      f(theta) = exp{beta * cos(theta - mu)} / (2 * pi * I_0(beta))
      (random variable: theta; parameters: beta, mu)
    (An evaluation sketch follows this slide.)
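A minimal sketch of this scoring, assuming hypothetical 3D coordinates and a hand-picked concentration parameter beta; it computes the angle between the eye-to-wrist and eye-to-object vectors and evaluates the von Mises density given above.

```python
import numpy as np

def von_mises_pdf(theta, beta, mu=0.0):
    """f(theta) = exp(beta * cos(theta - mu)) / (2 * pi * I_0(beta))."""
    return np.exp(beta * np.cos(theta - mu)) / (2 * np.pi * np.i0(beta))

def angle_between(u, v):
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical 3D coordinates (meters).
eye = np.array([0.0, 0.0, 1.5])
wrist = np.array([0.4, 0.1, 1.0])
obj = np.array([1.2, 0.3, 0.2])

pointing_vec = wrist - eye   # eye -> wrist
object_vec = obj - eye       # eye -> candidate object
theta = angle_between(pointing_vec, object_vec)

beta = 4.0                   # concentration parameter; hand-picked for illustration
print(theta, von_mises_pdf(theta, beta))
```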
  31. LLM-based Robot Action Planning: System [1] 38
    Spatial Concept-based Prompts and Sequential Feedback for LLM-based Robot Action Planning (SpCoRAP).
    (System diagram: given an instruction such as "Bring a cup to the kitchen.", prompts built from on-site knowledge (object images, words, positions, object names, place names, and object placements) are passed to the LLM, which outputs skills (navigation, object detection, pick, place) that are executed via FlexBE as control commands with success/failed feedback.)
    [1] S. Hasegawa et al. "Leveraging a Large Language Model and a Spatial Concept Model for Action Planning of a Daily Life Support Robot." RSJ, 2023.
  32. Prompts of On-site Knowledge 39
    Spatial Concept Model [12]: infer a word w_t about a place for each spatial region i_t:
    P(w_t | i_t, W, phi, pi) ∝ Σ_{C_t} P(w_t | W_{C_t}) P(i_t | phi_{C_t}) P(C_t | pi)
    In each region, words related to the location are likely to be observed, e.g.:
    place1: [living_room, sofa, desk, chair, tv]
    place2: [sink, refrigerator, desk, chair, kitchen]
    place3: [toy, shelf, toy_room, box, bed]
    These lists are used as prompts about place words.
    [1] S. Hasegawa et al. "Inferring Place-Object Relationships by Integrating Probabilistic Logic and Multimodal Spatial Concepts." SII, 2023.
  33. Prompts of On-site Knowledge 40
    Spatial Concept Model: infer object labels o_t for each spatial region i_t:
    P(i_t | o_t, xi, phi, pi) ∝ Σ_{C_t} P(i_t | phi_{C_t}) P(o_t | xi_{C_t}) P(C_t | pi)
    Lists of probabilities that an object exists at [place1, place2, place3], e.g.:
    bottle: [0.8, 0.15, 0.05]
    cup: [0.7, 0.2, 0.1]
    stuffed_toy: [0.1, 0.05, 0.85]
    These prompts about object placement represent the parameters of the probability distributions in the spatial concept model (a mixture-computation sketch follows this slide).
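A minimal numerical sketch of the mixture in the formula above, with made-up parameters for two spatial concepts C_t and three regions; it computes an (unnormalized) distribution over regions i_t for the object label "cup" and then normalizes it.

```python
import numpy as np

# Made-up spatial-concept parameters: 2 concepts C_t, 3 regions, 2 object labels.
pi = np.array([0.6, 0.4])          # P(C_t | pi)
phi = np.array([[0.7, 0.2, 0.1],   # P(i_t | phi_{C_t}) for each concept
                [0.1, 0.3, 0.6]])
objects = ["cup", "stuffed_toy"]
xi = np.array([[0.8, 0.2],         # P(o_t | xi_{C_t}) for each concept
               [0.3, 0.7]])

def region_posterior(obj_name):
    """P(i_t | o_t) ∝ Σ_C P(i_t | phi_C) P(o_t | xi_C) P(C | pi)."""
    o = objects.index(obj_name)
    unnorm = sum(pi[c] * xi[c, o] * phi[c] for c in range(len(pi)))
    return unnorm / unnorm.sum()

print(region_posterior("cup"))  # probability of each region given the object "cup"
```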
  34. Action Planning by GPT-4 41
    Action plan generated by GPT-4 [1] from prompts containing on-site knowledge and a skill set.
    On-site knowledge
    • Location words: place1: [living_room, sofa, desk, chair, tv] …
    • Object placements: bottle: [0.8, 0.15, 0.05] …
    Skill set
    1. navigation (location_name)
    2. object_detection (object_name)
    3. pick (object_name)
    4. place (location_name)
    These behaviors return "succeeded" or "failed". If "failed" is returned, try the same or another behavior again.
    Linguistic instruction: “Could you please find a snack box I'm looking for?”
    The full prompt is available at https://github.com/Shoichi-Hasegawa0628/rsj2023_prompt (a prompt-assembly sketch follows this slide).
    [1] OpenAI, “GPT-4 Technical Report.” arXiv preprint arXiv:2303.08774, 2023.
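A minimal sketch of how such a prompt could be assembled as a string; the exact wording of the authors' prompt is in the linked repository, so this template is only illustrative.

```python
location_words = {
    "place1": ["living_room", "sofa", "desk", "chair", "tv"],
    "place2": ["sink", "refrigerator", "desk", "chair", "kitchen"],
    "place3": ["toy", "shelf", "toy_room", "box", "bed"],
}
object_placements = {
    "bottle": [0.8, 0.15, 0.05],
    "cup": [0.7, 0.2, 0.1],
    "stuffed_toy": [0.1, 0.05, 0.85],
}
skills = ["navigation(location_name)", "object_detection(object_name)",
          "pick(object_name)", "place(location_name)"]
instruction = "Could you please find a snack box I'm looking for?"

prompt = (
    "In each region, words related to the following locations are likely to be observed.\n"
    + "\n".join(f"{p}: {w}" for p, w in location_words.items())
    + "\n\nList of probabilities that an object exists at [place1, place2, place3]:\n"
    + "\n".join(f"{o}: {p}" for o, p in object_placements.items())
    + "\n\nAvailable skills (they return 'succeeded' or 'failed'; on 'failed', retry or try another skill):\n"
    + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(skills))
    + f"\n\nInstruction: {instruction}\nOutput the sequence of skills to execute."
)
print(prompt)
```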
  35. Behavior Engine by FlexBE [1] 42
    FlexBE sequentially generates and executes the actions output by GPT-4 as control commands in the environment.
    Failure conditions
    • navigation fails when a path cannot be planned.
    • pick & place fail when the inverse kinematics cannot be solved.
    • object detection fails when the target label is not detected.
    [1] P. Schillinger et al. “Human-Robot Collaborative High-Level Control with Application to Rescue Robotics.” ICRA, 2016.