Multiple users sit around a table
→ Simplifies the problem of determining the positions of users
❶ Use only the sensors mounted on a robot
❷ Use two robots
Speaker Identification
Sound source localization results are used to identify a speaker
This behavior enables users
→ to understand the role of addressee
→ to feel involved in the conversation
[Mutlu, 2009] [Bennewitz, 2005]
Speaker Identification
- Detecting lip movements [Faish, 2012]
- Recognizing gestures [Bohus, 2009]
→ It is difficult to identify speakers when they are outside the field of view of the system's camera
In our situation
- It is difficult to keep track of users in the robot's field of view (the angle of view of the robot's camera is narrow)
- The robot cannot always look around (the robot is a participant in the conversation)
→ Using localization results enables us to identify speakers who are outside the field of view of the robot's camera
Problems of Sound Source Localization
1. Some user positions are difficult to localize
2. Environmental noise may cause incorrect localization
→ Localization results do not always indicate the direction of the speaker
Solutions Overview
Multiple users sit around a table with Robot A and Robot B
1. Place the robots on the table opposite each other so that they complement each other's capabilities
[Figure: for one robot the angular difference between two users is small; for the other robot it is large (easy)]
2. Integrate the sound source localization results from the robots
[Figure: Robot A and Robot B each have to decide whether a detected sound is an utterance or environmental noise]
Power = a confidence measure of localization results
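The geometric idea behind placing the robots opposite each other can be sketched as follows. This is only an illustration: the coordinates (in cm) and the function names are hypothetical and do not come from the slides.

```python
import math

def bearing_deg(robot_xy, user_xy):
    """Direction of a user as seen from a robot, in degrees."""
    dx = user_xy[0] - robot_xy[0]
    dy = user_xy[1] - robot_xy[1]
    return math.degrees(math.atan2(dy, dx))

def angular_difference(robot_xy, user1_xy, user2_xy):
    """Smallest angle between two users as seen from one robot."""
    d = abs(bearing_deg(robot_xy, user1_xy) - bearing_deg(robot_xy, user2_xy)) % 360.0
    return min(d, 360.0 - d)

# Hypothetical layout: two robots facing each other on a table,
# two users sitting close together on Robot B's side.
robot_a = (-30.0, 0.0)
robot_b = (30.0, 0.0)
user_1 = (60.0, 15.0)
user_2 = (60.0, -15.0)

diff_a = angular_difference(robot_a, user_1, user_2)  # small: hard to separate
diff_b = angular_difference(robot_b, user_1, user_2)  # large: easier to separate
```

In this made-up layout the two users are much closer together in angle for Robot A than for Robot B, which is the situation where a second robot helps.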
Settings
Sound source localization
- Robot audition software developed at Kyoto Univ., based on the MUltiple SIgnal Classification (MUSIC) method
- Outputs every frame (= 0.01 second): localization result [deg] and power [dB]
- Angular resolution = 10 [deg]
- Impulse responses for calculating the transfer function were recorded at 36 points, at intervals of 10 [deg]
Microphones
- Four microphones in the head
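Integration of Multiple Localization Results
($\theta_r$: localization result of robot $r$, $p_r$: its power)
1. Define a probability density function from $\theta_r$
Assumption: the ambiguity of localization results follows a normal distribution
$f_r(\theta) = \frac{1}{\sqrt{2\pi\sigma_r^2}} \exp\left(-\frac{(\theta - \theta_r)^2}{2\sigma_r^2}\right)$
2. Define the maximum probability to be proportional to $p_r$
Assumption: the power of localization results caused by noise is low
$f_r(\theta_r) = \frac{1}{\sqrt{2\pi\sigma_r^2}} = \frac{1}{C}\, p_r$
$C$ is a constant determined empirically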
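Integration of Multiple Localization Results
$\theta_{mix} = \arg\max_{\theta} \left( f_A(\theta) + f_B(\theta) \right)$
[Figure: probability versus angle [deg] for Robot A, Robot B, and the sum $f_A(\theta) + f_B(\theta)$; peaks labeled $p_A / C$, $p_B / C$, and $p_{mix} / C$ at $\theta_{mix}$; a horizontal threshold line]
Threshold: reduces the number of incorrect localization results due to noise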
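The integration rule can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the presenters' implementation: it assumes both robots' angles have already been converted to one common frame, that power values are positive, and that the arg max is found by a simple 1-degree grid search without angle wraparound. The function names are hypothetical; C = 800 and the threshold 25.5 / 800 are the values given with the integration results.

```python
import math

C = 800.0          # empirical constant reported on the slides
THRESH = 25.5 / C  # threshold on the summed density, as read from the slides

def sigma_from_power(power_db, c=C):
    """Choose sigma so that the peak of the normal density equals power / c."""
    if power_db <= 0:
        raise ValueError("this sketch assumes positive power values")
    return c / (power_db * math.sqrt(2.0 * math.pi))

def density(theta, theta_r, power_db, c=C):
    """Normal density centred on the localization result theta_r."""
    sigma = sigma_from_power(power_db, c)
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return norm * math.exp(-((theta - theta_r) ** 2) / (2.0 * sigma ** 2))

def integrate(results, step=1.0, lo=-180.0, hi=180.0, c=C, thresh=THRESH):
    """results: list of (theta_r [deg], power [dB]) pairs in a common frame.

    Returns (theta_mix, p_mix), or None when the peak is below the threshold.
    """
    n_steps = int(round((hi - lo) / step))
    best_theta, best_value = None, float("-inf")
    for i in range(n_steps + 1):
        theta = lo + i * step
        value = sum(density(theta, t, p, c) for t, p in results)
        if value > best_value:
            best_theta, best_value = theta, value
    if best_value < thresh:
        return None  # treated as noise
    return best_theta, best_value * c  # p_mix = C * peak density

# Example: two robots report nearby directions with moderate power.
theta_mix, p_mix = integrate([(40.0, 30.0), (50.0, 28.0)])
```

Under this model a single result below 25.5 dB is rejected, while two weaker results that agree on the direction can add up to a peak above the threshold.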
Evaluation Experiments
Evaluated whether using two robots improved speaker identification
Settings
[Figure: layout of the robots and loudspeakers; labeled distances include 30 [cm] and 75 [cm]; labeled angles 20°, 25°, 33°, 46°, 72°, 107°; the range of a loudspeaker as seen from Robot B, with regions marked Correct and Incorrect]
Results of identifying loudspeakers: using only one robot
Evaluated whether using two robots improved speaker identification
[Figure: precision, recall, and F for loudspeakers (SPK) A, B, C, D, E, and ALL; one panel for Robot A and one for Robot B]
- It is difficult to identify loudspeakers that are far from the robots.
- Performance differed between the two robots.
Results of identifying loudspeakers: integration
Evaluated whether using two robots improved speaker identification
C = 800, thresh = 25.5 / 800
[Figure: precision, recall, and F for loudspeakers (SPK) A, B, C, D, E, and ALL; panels for Robot A, Robot B, and Integration]
- The system can identify loudspeakers in areas that a single robot cannot.
- In particular, the loudspeaker at C is correctly identified, which neither robot can do on its own.
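The slides do not say how precision, recall, and F were computed, so the following is only a sketch using the standard definitions, applied per loudspeaker position. The function name and the sample labels are made up for illustration; `None` stands for a frame with no active loudspeaker or no identification.

```python
def precision_recall_f(truth, predicted, label):
    """Standard precision, recall, and F-measure for one loudspeaker label."""
    pairs = list(zip(truth, predicted))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    n_pred = sum(1 for _, p in pairs if p == label)
    n_true = sum(1 for t, _ in pairs if t == label)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_true if n_true else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Made-up example labels, one entry per evaluated frame.
truth = ["A", "A", "B", "C", "C", None]
predicted = ["A", None, "B", "B", "C", "C"]

scores_a = precision_recall_f(truth, predicted, "A")
scores_c = precision_recall_f(truth, predicted, "C")
```

Here loudspeaker A is never confused with another position but is missed once, so its precision is higher than its recall.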
Localization Results by Power
Evaluated whether integrated power was valid as a confidence measure
- When power is high, average errors are small
- When power is low, error rates are high and average errors are large
→ Power can distinguish correct localization results from incorrect ones
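One way to run this kind of check is to group localization results by power and compute an error rate and an average error per group. The sketch below assumes a bin width of 2 dB and counts a result as an error when it is off by more than 10 degrees; both values, the function name, and the sample data are assumptions for illustration, not figures from the evaluation.

```python
import math
from collections import defaultdict

def summarize_by_power(samples, bin_width=2.0, tolerance_deg=10.0):
    """samples: iterable of (power [dB], angular error [deg]) pairs.

    Returns {bin lower edge: (error rate, average absolute error)}.
    """
    bins = defaultdict(list)
    for power, error in samples:
        lower_edge = math.floor(power / bin_width) * bin_width
        bins[lower_edge].append(abs(error))
    summary = {}
    for lower_edge in sorted(bins):
        errors = bins[lower_edge]
        error_rate = sum(1 for e in errors if e > tolerance_deg) / len(errors)
        summary[lower_edge] = (error_rate, sum(errors) / len(errors))
    return summary

# Made-up samples: low-power results with large errors, high-power ones with small errors.
samples = [(24.1, 40.0), (25.0, 30.0), (30.2, 0.0), (31.9, 10.0), (31.0, 20.0)]
summary = summarize_by_power(samples)
```

If power works as a confidence measure, the error rate and average error should fall as the bin's power rises, as they do in this toy data.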
- Implement a demo system
→ identifying a speaker and turning toward the speaker to answer
→ executing face detection to check whether a speaker exists, on the basis of power
- Use other evidence of the speaker's existence, e.g. image processing
→ improve performance of speaker identification
→ improve performance compared with using only one robot