Daily-life support robots in home environments must interpret the user's pointing gestures
to understand instructions, thereby increasing the number of instructions
they can accomplish.
This study aims to improve the performance of
pointing frame estimation by using speech information when a person
gives the robot pointing gestures or verbal instructions. Estimating
the pointing frame, the moment at which the
user points, helps the robot understand the instructions.
We therefore perform pointing frame estimation using a time-series
model that leverages the user's speech, images, and speech-recognized
text observed by the robot. In our experiments,
we set up realistic communication conditions, such as speech
containing everyday conversation, non-upright posture, actions
other than pointing, and reference objects outside the robot’s
field of view. The results showed that adding speech information
improved estimation performance, particularly for the Transformer
model with Mel-spectrogram features. In the future, this study
can be applied to object localization and action planning
by robots in 3D environments. The project website
is https://emergentsystemlabstudent.github.io/PointingImgEst/.
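As a concrete illustration of the Mel-spectrogram-plus-Transformer pipeline named above, the following is a minimal sketch, not the authors' implementation: the audio file name, feature dimensions, layer counts, and per-frame classification head are illustrative assumptions.

import librosa
import numpy as np
import torch
import torch.nn as nn

# Load audio and compute a log-Mel-spectrogram feature sequence.
# "user_utterance.wav" and the 80 mel bins are hypothetical choices.
audio, sr = librosa.load("user_utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max).T  # shape: (frames, 80)

# Transformer encoder over the time axis, with a per-frame binary head:
# is each frame a pointing frame or not?
encoder_layer = nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(80, 2)  # pointing vs. not pointing

x = torch.from_numpy(log_mel).float().unsqueeze(0)  # (1, frames, 80)
logits = head(encoder(x))                           # (1, frames, 2)
pointing_frames = logits.argmax(dim=-1)             # per-frame prediction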