Daily life support robots in the home environment can accomplish more instructions by interpreting the user's pointing gestures and understanding the accompanying instructions. This study aims to improve the performance of pointing frame estimation by using speech information when a person gives pointing or verbal instructions to the robot. Estimating the pointing frame, which represents the moment when the user points, can help the robot understand the instructions. We therefore perform pointing frame estimation with a time-series model, using the user's speech, images, and speech-recognized text observed by the robot. In our experiments, we set up realistic communication conditions, such as speech containing everyday conversation, non-upright posture, actions other than pointing, and reference objects outside the robot's field of view. The results showed that adding speech information improved the estimation performance, with the largest gain for the Transformer model using Mel-spectrogram features. In the future, this work can be applied to object localization and action planning in 3D environments by robots. The project website is PointingImgEst.
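To make the pipeline concrete, below is a minimal sketch of the speech branch only: a Mel-spectrogram is extracted from the waveform and per-frame features are fed to a Transformer encoder that classifies each time step as a pointing frame or not. This is an illustrative assumption, not the authors' exact architecture; all layer sizes, the sampling rate, and the class name are hypothetical, and the paper additionally fuses image and speech-recognized text features, which are omitted here.

```python
# Minimal sketch (assumptions: 16 kHz mono audio, per-frame binary labels,
# hypothetical layer sizes; speech branch only, not the authors' full model).
import torch
import torch.nn as nn
import torchaudio

class PointingFrameEstimator(nn.Module):
    def __init__(self, n_mels=64, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # Audio front end: waveform -> Mel-spectrogram frames.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels)
        self.proj = nn.Linear(n_mels, d_model)
        # Time-series model over frames.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Per-frame binary head: pointing frame vs. not.
        self.head = nn.Linear(d_model, 1)

    def forward(self, waveform):              # waveform: (batch, samples)
        mel = self.melspec(waveform)          # (batch, n_mels, time)
        mel = mel.transpose(1, 2)             # (batch, time, n_mels)
        x = self.proj(torch.log1p(mel))       # log-compress, project to d_model
        x = self.encoder(x)                   # (batch, time, d_model)
        return self.head(x).squeeze(-1)       # per-frame pointing logits

# Usage example with dummy audio (two clips of 1 second at 16 kHz).
model = PointingFrameEstimator()
logits = model(torch.randn(2, 16000))
print(logits.shape)  # (2, num_mel_frames)
```

In this sketch, thresholding the per-frame logits yields the estimated pointing frames; in the multimodal setting described above, the image and text features would be fused with the audio features before or inside the encoder.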